Welcome to NSCLab

Twitter Spam

Twitter has become one of the most popular social networking sites in the world. This popularity makes it an attractive platform for spammers. Twitter spam refers to unsolicited tweets containing malicious links that direct victims to external sites hosting malware downloads, phishing, drug sales, or scams. It has become a severe problem on Twitter. To help make Twitter a spam-free platform, we have collected a large number of tweets and investigated the characteristics of Twitter spam, and we are working on novel detection mechanisms.

To help researchers study Twitter spam, we make some of our labelled ground truth available here. The ground truth was labelled by our research partner, Trend Micro's Web Reputation Technology. The files are in ARFF format, which can be opened directly in Weka. Each data line represents one tweet from our collection. The last column is the tweet class (spammer or non-spammer) and the remaining columns are the feature values. The features are listed in the table below.

account_age	The age of the account in days, from its creation to the time of its most recent tweet
no_follower	The number of followers of this Twitter user
no_following	The number of followings/friends of this Twitter user
no_userfavourites	The number of favourites this Twitter user received
no_lists	The number of lists this Twitter user added
no_tweets	The number of tweets this Twitter user sent
no_retweets	The number of times this tweet has been retweeted
no_hashtag	The number of hashtags included in this tweet
no_usermention	The number of user mentions included in this tweet
no_urls	The number of URLs included in this tweet
no_char	The number of characters in this tweet
no_digits	The number of digits in this tweet

The dataset used in our ICC paper "6 Million Spam Tweets: A Large Ground Truth for Timely Twitter Spam Detection" can be downloaded below:

Download ICC Dataset