The data consists of a list of bid events (auction ID, user ID, time, IP, location) and a table, X, with the bidder IDs, the hashed contact and payment addresses, and a label indicating whether the bidder is a robot or a human. The interesting part of this problem is finding ways to characterize bidding behaviour - generating features that can be inserted into X, so that X can be used to train a classification algorithm to distinguish between bots and humans. The most useful features I identified were:

- the median time between a user's bid and that user's previous bid,
- the mean number of bids a user made per auction,
- the entropy of the number of bids a user placed on each day of the week,
- the maximum number of bids in a 20-minute span,
- the total number of bids placed by the user,
- the average number of bids a user placed per referring URL,
- the number of bids placed by the user on each of the three weekdays in the data, and
- the minimum and median times between a user's bid and the previous bid by another user in the same auction.

Here's the script I used to process the data and generate predictions.
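That script isn't reproduced here, but as a minimal sketch of how a few of these features might be computed - assuming the bid events are loaded into a pandas DataFrame `bids` with (hypothetical) columns `bidder_id`, `auction`, and `time`:

```python
import pandas as pd

# A sketch, not the actual script: the column names are assumptions.
bids = bids.sort_values('time')

# median time between a user's bid and that user's previous bid
dt_own = bids.groupby('bidder_id')['time'].diff()
median_dt = dt_own.groupby(bids['bidder_id']).median()

# mean number of bids a user made per auction
bids_per_auction = (bids.groupby(['bidder_id', 'auction']).size()
                        .groupby('bidder_id').mean())

# minimum and median time between a user's bid and the previous bid
# placed by a *different* user in the same auction
prev_bidder = bids.groupby('auction')['bidder_id'].shift()
prev_time = bids.groupby('auction')['time'].shift()
dt_other = (bids['time'] - prev_time).where(prev_bidder != bids['bidder_id'])
min_dt_other = dt_other.groupby(bids['bidder_id']).min()
median_dt_other = dt_other.groupby(bids['bidder_id']).median()

features = pd.DataFrame({'median_dt': median_dt,
                         'bids_per_auction': bids_per_auction,
                         'min_dt_other': min_dt_other,
                         'median_dt_other': median_dt_other})
```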
Another useful observation is that users who place bids from multiple countries often place bids from only one country in any given auction - this suggests that multiple people may be placing bids from the same account, but that only one individual on the account usually follows any one auction.
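A sketch of how to check this, assuming the same hypothetical `bids` DataFrame with a `country` column:

```python
# countries a user bid from overall, vs. per auction
countries_overall = bids.groupby('bidder_id')['country'].nunique()
countries_per_auction = (bids.groupby(['bidder_id', 'auction'])['country']
                             .nunique()
                             .groupby('bidder_id').mean())

# users bidding from several countries overall, but (nearly) one country
# per auction, match the shared-account pattern; 1.1 is an illustrative cutoff
shared_accounts = countries_overall[(countries_overall > 1) &
                                    (countries_per_auction < 1.1)]
```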
In the bids/unit time histogram, each of the three chunks contains about three periods' worth of bidding activity, which suggests that each chunk of data covers three days. The durations of the chunks are the same (+/- 1 mystery unit) and the durations of the gaps between the chunks are also the same, which made it very easy to calculate the length of a day. The duration of the entire data set is 31 times one_day. Had the data chunks not all been so suggestively the same duration, the positions of the peaks in the autocorrelation function of the bid histogram could have been used to calculate the length of a day.
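That fallback would look something like the sketch below, assuming `bid_times` is a 1-D array of the anonymized bid timestamps (the bin count and the peak-picking offset are hand-tuned assumptions):

```python
import numpy as np

# histogram the bids over time, then autocorrelate the mean-subtracted counts
counts, _ = np.histogram(bid_times, bins=20000)
counts = counts - counts.mean()
acf = np.correlate(counts, counts, mode='full')[counts.size - 1:]

# the lag of the first strong peak after lag 0 is one day, measured in
# bin widths; skipping the shoulder of the central peak is a crude hack
day_in_bins = np.argmax(acf[100:]) + 100
```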
One might expect to need to shift the bids/time histogram along the time axis depending on the bidder's time zone. This is tricky, though, since IP is not a reliable indicator of location, and the daily spike in human bidding activity suggested that shifting the bids/time histogram might not actually be very instructive.
Because the data is in three chunks, it's hard to tell whether there are any complete auctions - if the first bid in an auction falls sufficiently far from t_start, then it's reasonable to assume that this was the first bid in the auction (and likewise for the last bid, where it falls reasonably far from the end of the data set). I would have liked to have created a description of the bidding strategy which included where in the auction a user was bidding - did the user bid only at the last minute? Or only in the beginning of an auction? These behaviours correspond to two different bot strategies: the first bot exists to drive up the price, while the second bot is designed to rapidly make a lot of bids to finish the auction. However, there were no clues in the bidding data to suggest that any of the auctions were complete; a sketch of the check appears below.
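A minimal sketch of that completeness check - `chunk_starts`, `chunk_ends` (sorted arrays of the three chunk boundaries), and `margin` are assumptions, not quantities from the original script:

```python
import numpy as np

# first and last observed bid in each auction
first = bids.groupby('auction')['time'].min()
last = bids.groupby('auction')['time'].max()

# find the chunk containing each bid (chunk_starts/chunk_ends are np arrays)
start_of_chunk = chunk_starts[np.searchsorted(chunk_ends, first.values)]
end_of_chunk = chunk_ends[np.searchsorted(chunk_ends, last.values)]

# an auction plausibly starts/ends inside the data if its first/last bid
# sits at least `margin` time units away from the chunk boundary
maybe_started = first.values - start_of_chunk > margin
maybe_finished = end_of_chunk - last.values > margin
complete = maybe_started & maybe_finished
```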
One way to characterize the variation in how bids were placed is the entropy. A concrete example is the IP entropy: N!/(N_IP1! N_IP2! ... N_IPn!), where N is the total number of bids and N_IPn is the number of bids placed from the nth IP. The entropy is a combined measure of how randomly the bids are distributed, how many IPs there are, and how many bids there are. A bidder who places all their bids from the same IP has an entropy of N!/N! = 1. Because the entropy can get very large, it is smarter to calculate the log of the entropy.
Entropy also turned out to be a useful way to characterize how a user's bidding activity is distributed over each of the three days of the week in the data, as well as over the referring URLs.
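As a minimal sketch of this calculation - assuming the same hypothetical `bids` DataFrame, with an `ip` column - the logarithm can be taken via the log-gamma function so the factorials are never formed explicitly:

```python
import numpy as np
from scipy.special import gammaln

def log_entropy(counts):
    """log(N! / (N_1! N_2! ... N_n!)), computed with log-gamma so the
    factorials never overflow. All bids from one source gives log(1) = 0."""
    counts = np.asarray(counts, dtype=float)
    return gammaln(counts.sum() + 1) - gammaln(counts + 1).sum()

# per-user IP entropy: count bids per (user, IP), then reduce per user
ip_entropy = (bids.groupby(['bidder_id', 'ip']).size()
                  .groupby('bidder_id').agg(log_entropy))
```

The same function applies unchanged to the day-of-week and referring-URL counts.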
The model got additional mileage out of the fraction of bids a user placed in each of the 199 countries, as well as the fraction of the IPs from which a user placed a bid that were also used by another user labelled as a bot.
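A sketch of that last feature, assuming `bots` is the set of bidder_ids labelled as robots in the training portion of X:

```python
import numpy as np

# IPs ever used by a labelled bot
bot_ips = set(bids.loc[bids['bidder_id'].isin(bots), 'ip'])

# fraction of each user's distinct IPs that appear in bot_ips
# (for a labelled bot this trivially includes its own IPs; one might
# want to exclude the user's own bids when building bot_ips)
frac_bot_ips = (bids.groupby('bidder_id')['ip']
                    .agg(lambda ips: np.mean([ip in bot_ips
                                              for ip in set(ips)])))
```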
I was really stumped about why there were hardly any bids placed by robots between 11 and 14 days before the end of the auction. I wondered if this behaviour might have helped to explain why we only got to see three days out of every two weeks in the data...
Many of the payment_account and address hashes start with "a3d2de7675556553a5f08e4c88d2c228". I stripped off the last five characters of the payment_account and address hashes and looked for duplicates, but using a pivot table with information about repeated values of address[0:-5] and payment_account[0:-5] didn't improve the CV score.
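A sketch of that prefix grouping, assuming X holds the hash strings as columns:

```python
# count how many rows share each truncated hash; a count > 1 marks a duplicate
for col in ['payment_account', 'address']:
    prefix = X[col].str[:-5]          # drop the last five characters
    X[col + '_prefix_count'] = prefix.map(prefix.value_counts())
```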