Twitter Spam

Twitter has become one of the most popular social networking sites in the world. This popularity makes it an attractive platform for spammers. Twitter spam refers to unsolicited tweets containing malicious links that direct victims to external sites hosting malware downloads, phishing, drug sales, or scams. It has become a severe problem on Twitter. To help make Twitter a spam-free platform, we have collected a large number of tweets and investigated the characteristics of Twitter spam, and we are working on novel detection mechanisms.

To help researchers study Twitter spam, we make some of our labelled ground truth available here. The ground truth was labelled by our research partner, Trend Micro's Web Reputation Technology. The files are in ARFF format, which can be opened directly in Weka. Each data line represents a tweet from our collection. The last column is the tweet class (spammer or non-spammer), and the remaining columns are the feature values. The features are listed in the table below.

Feature            Description
account_age        The age (in days) of the account, from its creation to the time of its most recent tweet
no_follower        The number of followers of this Twitter user
no_following       The number of followings/friends of this Twitter user
no_userfavourites  The number of favourites this Twitter user received
no_lists           The number of lists this Twitter user added
no_tweets          The number of tweets this Twitter user sent
no_retweets        The number of retweets of this tweet
no_hashtag         The number of hashtags included in this tweet
no_usermention     The number of user mentions included in this tweet
no_urls            The number of URLs included in this tweet
no_char            The number of characters in this tweet
no_digits          The number of digits in this tweet
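
For readers who want to inspect the files outside Weka, the following is a minimal Python sketch of reading such a file. It rests on several assumptions: the files are dense ARFF with the twelve features above declared as numeric attributes in that order, the final attribute is named "class", and values contain no quoted commas. The helper read_arff and the two sample rows are hypothetical and made up for illustration; they are not taken from the dataset.

```python
import io
from collections import Counter

FEATURES = [
    "account_age", "no_follower", "no_following", "no_userfavourites",
    "no_lists", "no_tweets", "no_retweets", "no_hashtag",
    "no_usermention", "no_urls", "no_char", "no_digits",
]

def read_arff(stream):
    """Parse a dense ARFF stream into (attribute_names, rows).

    Assumes every column except the last is numeric and the last
    column holds the class label.
    """
    names, rows, in_data = [], [], False
    for raw in stream:
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        if not in_data:
            lowered = line.lower()
            if lowered.startswith("@attribute"):
                # "@attribute <name> <type>": keep only the name
                names.append(line.split(None, 2)[1])
            elif lowered.startswith("@data"):
                in_data = True
            continue
        values = [v.strip() for v in line.split(",")]
        rows.append([float(v) for v in values[:-1]] + [values[-1]])
    return names, rows

# Tiny in-memory stand-in for a real file; the values are invented.
header = "\n".join(f"@attribute {name} numeric" for name in FEATURES)
SAMPLE = (
    "% made-up sample, not real data\n"
    "@relation twitter_spam\n"
    f"{header}\n"
    "@attribute class {spammer,non-spammer}\n"
    "@data\n"
    "1200,350,410,25,3,5400,0,1,0,0,88,2,non-spammer\n"
    "15,4,980,0,0,120,0,3,2,1,134,6,spammer\n"
)

names, rows = read_arff(io.StringIO(SAMPLE))
class_counts = Counter(row[-1] for row in rows)
print(names)
print(class_counts)
```

To read a downloaded file instead, pass an open file handle to read_arff in place of the io.StringIO object. For anything beyond a quick look, an established ARFF reader such as Weka itself or scipy.io.arff.loadarff is likely a better choice than this hand-written parser.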

The dataset used in our ICC paper "6 Million Spam Tweets: A Large Ground Truth for Timely Twitter Spam Detection" can be downloaded below:

Download ICC Dataset