MC-Fake dataset
Introduction
The popularity of social media in recent years has promoted the spread of fake news. Detecting fake news on social media is challenging, as pieces of fake news pieces are intentionally written to mislead consumers, which means that it is often not possible to spot fake news from news content itself. For this reason, social context based detection methods, which attempt to model the spreading patterns of fake news by utilising the collective wisdom from users on social media, have been attracting increasing attention. In response to this, the MC-Fake dataset has been created to facilitate the detection of fake news using such methods.
Dataset description
The MC-Fake fake news dataset contains 28334 news events on multiple topics (Politics, Entertainment, Health, Covid-19, Syria War) and corresponding social context (tweets, retweets, replies, users, retweet_relations, replying relations, user-follows-user relations) collected from Twitter.
Dataset format
The majority of information about the news articles and their corresponding social context is provided in the form of a csv file. The user-follows-user relations are provided in a separate social network file. The format of both types of files is described below.csv file
The news dataset and corresponding social context (apart from user-follows-user relations, which are in the separate social network file, see below) are provided in the form of csv files, consisting of 16 columns:
- news_id: the id of the news event
- title: title of the news
- url: source url of the news
- publish_date: publish date
- source: news source
- text: text content of the news
- labels: veracity label of the news
- n_tweets: tweet counts
- n_retweets: retweet counts
- n_replies: reply counts
- n_users: user counts
- tweet_ids: IDs of the relevant tweets, separated by commas
- retweet_ids: IDs of the relevant retweets, separated by commas
- reply_ids: IDs of the relevant replies, separated by commas
- user_ids: IDs of the relevant users, separated by commas
- retweet_relations: retweet relations indicated by a list of tokens {tweet_ID_A}-{tweet_ID_B}-{user_ID of tweet A}-{user_ID of tweet B} denoting A retweets
- reply_relations: reply relations indicated by a list of tokens {tweet_ID_A}-{tweet_ID_B}-{user_ID of tweet A}-{user_ID of tweet B} denoting A replies B
- data_name: news category
Social network file
The user-follows-user relations are available in a large social network file.
Each line in the file is in the format of "{userA_ID},{userB_ID}", denoting a "follow" relationship from user A to user B.
Availability
The csv file is available for download according to the terms of the licence below.
The large user social network file is available in four separate parts:
http://www.nactem.ac.uk/data/edges_all.txt.gz.aa
http://www.nactem.ac.uk/data/edges_all.txt.gz.ab
http://www.nactem.ac.uk/data/edges_all.txt.gz.ac
http://www.nactem.ac.uk/data/edges_all.txt.gz.ad
Please use the following command to merge all files and then uncompress
cat edges_all.tar.gz.* > edges_all.tar.gz
Related Publication
Min, E., Rong, Y, Xu, T., Bian, Y., Zhao, P., Huang, J. and Ananiadou, S. (2022). Divide-and-Conquer: Post-User Interaction Network for Fake News Detection on Social Media. In: Proceedings of The Web Conference 2022, pp. 1148-1158.Licence
The dataset was constructed at the National Centre for Text Mining (NaCTeM), School of Computer Science, University of Manchester, UK. It is licensed under a Creative Commons Attribution 4.0 International License. Please attribute NaCTeM when using the dataset, and please cite the following article:
Min, E., Rong, Y, Xu, T., Bian, Y., Zhao, P., Huang, J. and Ananiadou, S. (2022). Divide-and-Conquer: Post-User Interaction Network for Fake News Detection on Social Media. In: Proceedings of The Web Conference 2022, pp. 1148-1158.
Featured News
- Prof. Sophia Ananiadou accepted as an ELLIS fellow
- Call for papers: CL4Health @ NAACL 2025
- Invited talk at the 15th Marbach Castle Drug-Drug Interaction Workshop
- BioNLP 2025 and Shared Tasks accepted for co-location at ACL 2025
- Prof. Junichi Tsujii honoured as Person of Cultural Merit in Japan
- Participation in panel at Cyber Greece 2024 Conference, Athens
- Shared Task on Financial Misinformation Detection at FinNLP-FNP-LLMFinLegal
- New Named Entity Corpus for Occupational Substance Exposure Assessment
- FinNLP-FNP-LLMFinLegal @ COLING-2025 - Call for papers
Other News & Events
- Keynote talk at Manchester Law and Technology Conference
- Keynote talk at ACM Summer School on Data Science, Athens
- Invited talk at the 8th Annual Women in Data Science Event at the American University of Beirut
- Invited talk at the 2nd Symposium on NLP for Social Good (NSG), University of Liverpool
- Invited talk at Annual Meeting of the Danish Society of Occupational and Environmental Medicine