Twitter has become one of the most popular social networking sites in the world. This popularity makes it an attractive platform for spammers. Twitter spam refers to unsolicited tweets containing malicious links that direct victims to external sites hosting malware downloads, phishing, drug sales, or scams. It has become a severe problem on Twitter. To help make Twitter a spam-free platform, we have collected a large number of tweets and investigated the characteristics of Twitter spam, and we are working on novel detection mechanisms.
To help researchers study Twitter spam, we make some of our labelled ground truth available here. The ground truth is labelled by our research partner, Trend Micro's Web Reputation Technology. The files are in ARFF format, which can be opened directly in Weka. Each line represents a tweet from our collection. The last column is the tweet class (spammer or non-spammer) and the remaining columns are the feature values. The features are listed in the table below.
account_age | The age of the account in days, from its creation to the time of its most recent tweet |
no_follower | The number of followers of this Twitter user |
no_following | The number of followings/friends of this Twitter user |
no_userfavourites | The number of favourites this Twitter user received |
no_lists | The number of lists this Twitter user added |
no_tweets | The number of tweets this Twitter user sent |
no_retweets | The number of retweets of this tweet |
no_hashtag | The number of hashtags included in this tweet |
no_usermention | The number of user mentions included in this tweet |
no_urls | The number of URLs included in this tweet |
no_char | The number of characters in this tweet |
no_digits | The number of digits in this tweet |
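
As a rough illustration of the layout described above (feature columns followed by a class column), the sketch below parses a small ARFF document with the Python standard library and separates the features from the label. The sample rows are made up and are not taken from the dataset. The attribute name `class`, the label spellings `spammer` and `non-spammer`, and the column order are assumptions based on the description and table above; the actual files may differ. The parser only handles simple dense ARFF with unquoted attribute names.

```python
import csv


def parse_arff(text):
    """Parse a simple dense ARFF document into (attribute_names, rows).

    Assumes unquoted attribute names and comma-separated data rows;
    sparse ARFF and quoted names are not handled by this sketch.
    """
    names = []
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and ARFF comments
            continue
        lower = line.lower()
        if not in_data:
            if lower.startswith("@attribute"):
                # "@attribute <name> <type>": keep only the name
                names.append(line.split(None, 2)[1])
            elif lower.startswith("@data"):
                in_data = True
            continue
        values = next(csv.reader([line]))
        rows.append([v.strip() for v in values])
    return names, rows


def split_features_and_label(rows):
    """Treat the last column as the class and the rest as numeric features."""
    features = [[float(v) for v in row[:-1]] for row in rows]
    labels = [row[-1] for row in rows]
    return features, labels


# Hypothetical two-row sample; the values are invented for illustration.
SAMPLE = """\
% Made-up sample, not real data
@relation tweets
@attribute account_age numeric
@attribute no_follower numeric
@attribute no_following numeric
@attribute no_userfavourites numeric
@attribute no_lists numeric
@attribute no_tweets numeric
@attribute no_retweets numeric
@attribute no_hashtag numeric
@attribute no_usermention numeric
@attribute no_urls numeric
@attribute no_char numeric
@attribute no_digits numeric
@attribute class {spammer,non-spammer}
@data
120,35,400,0,1,980,0,2,0,1,88,4,spammer
1500,820,310,45,12,7600,3,0,1,0,61,0,non-spammer
"""

names, rows = parse_arff(SAMPLE)
X, y = split_features_and_label(rows)
print(names[:-1])  # feature names
print(X[0], y[0])  # first feature vector and its label
```

To try this on a downloaded file, read its contents into a string and pass it to `parse_arff`. Existing readers such as SciPy's `scipy.io.arff.loadarff`, or Weka itself, are likely a better choice for real work.
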
The dataset used in our ICC paper "6 Million Spam Tweets: A Large Ground Truth for Timely Twitter Spam Detection" can be downloaded below: