Datasets for Social Network Analysis

Table columns: SN, Name, Node, Edge, Behavior/Content, Description

Dataset categories:

Microblogging networks

Patent data set from Patentminer.org

Other online social networks

Knowledge linking data set

Mobile data set

Twitter-Dynamic-Net:

Overview:

We crawled a Twitter data set. To begin the collection process, we selected the most popular user on Twitter, i.e., “Lady Gaga”, and randomly collected 10,000 of her followers. Taking these users as seeds, a crawler collected all of their followers by traversing “following” relationships; the resulting user list contains 112,044 users. The crawler then monitored changes in the network structure among these 112,044 users from 10/12/2010 to 12/23/2010 and obtained 443,399 dynamic “following” relationships between them.

Tweets were crawled for those users from Jan 1, 2010 to Oct, 2010 and from Oct 1, 2010 to Jan 15, 2010.

Description:

1. Network data

There are 443,399 follow relationships in total.

Data format:
1) graph_cb.txt

First column: person id 1

Second column: person id 2

Third column: the timestamp at which person 1 followed person 2


Sample lines:

0 11 1

0 4893 1

2) user_list.txt

The number k on line i (counting from 0) means that the original user id k is mapped to the new id i.

3) user_map.txt

format:

original_user_id username
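As a rough illustration of the network file formats above, the following Python sketch parses lines of graph_cb.txt and builds the id mapping from user_list.txt. The function names are hypothetical, and the sketch assumes whitespace-separated columns and that blank lines carry no meaning; neither is confirmed by the data set documentation.

```python
from typing import Dict, List, Tuple


def parse_follow_edges(lines: List[str]) -> List[Tuple[int, int, int]]:
    """Parse 'person1 person2 timestamp' lines into integer triples."""
    edges = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines (assumption)
        src, dst, ts = (int(p) for p in parts)
        edges.append((src, dst, ts))
    return edges


def load_user_list(lines: List[str]) -> Dict[int, int]:
    """Map each original user id k (found on line i, 0-based) to the new id i."""
    ids = [line.strip() for line in lines if line.strip()]
    return {int(k): i for i, k in enumerate(ids)}
```

In practice the `lines` argument would come from reading each file, e.g. `open("graph_cb.txt").read().splitlines()`.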

2. Tweet data

The tweet data covers 156,487 users and 99,696,204 tweets, of which 28,699,842 are replies and 11,408,918 are retweets.

Data format:

User_Name

Tweet_ID

Time

Via // the type of the web app

retweet_from

reply_to_user reply_to_tweet (if not a reply, just "-1")

content // note that the original content has already been converted to word indices

Number_of_link_in_tweet // links include URLs and @user mentions

type_of_link1 link1

type_of_link2 link2

type_of_link3 link3

...

Sample data:

Joes_face

8977943263

Thu Feb 11 21:13:40 +0000 2010

via TwitPic

-1

-1

21589626 20760882

3

tweet-url web http://twitpic.com/12ndbp

tweet-url username /joejonas

tweet-url username /nickjonas
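The sample record above can be read with a small parser such as the sketch below. It assumes one field per line in the order given under "Data format", that every field is present and non-empty, and that the link itself is the last whitespace-separated token on each link line. The function name and the dictionary keys are illustrative only.

```python
from datetime import datetime
from typing import Dict, List


def parse_tweet_record(lines: List[str]) -> Dict[str, object]:
    """Parse one tweet record given as a list of lines (blank lines ignored)."""
    rows = [ln.strip() for ln in lines if ln.strip()]
    n_links = int(rows[7])
    links = []
    for row in rows[8:8 + n_links]:
        # Assumed layout: '<type of link> <link>', with the link as last token.
        link_type, _, link = row.rpartition(" ")
        links.append((link_type, link))
    return {
        "user_name": rows[0],
        "tweet_id": int(rows[1]),
        # Example time string: 'Thu Feb 11 21:13:40 +0000 2010'
        "time": datetime.strptime(rows[2], "%a %b %d %H:%M:%S %z %Y"),
        "via": rows[3],
        "retweet_from": rows[4],
        "reply_to": rows[5].split(),  # ['-1'] when the tweet is not a reply
        "content": [int(w) for w in rows[6].split()],  # word indices
        "links": links,
    }
```

The word indices in `content` can then be looked up in the word table distributed with the data set.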

Download:

http://arnetminer.org/lab-datasets/tweet/twitter_network.rar

http://arnetminer.org/lab-datasets/tweet/Tweets-withoutwords.rar (Please note that the content has already been converted to word indices)

https://static.aminer.org/lab-datasets/tweet/WordTable.tgz


Twitter-Dynamic-Action:

Overview:

The data set was crawled from Twitter starting from the user “Carel Pedre (carelpedre)”, one of Haiti's most popular radio DJs, who used Twitter to inform the world about the earthquake that ravaged his country. We extracted all followers (> 11,704) of “carelpedre” and the users he follows, and repeated the process for each extracted Twitter user. We further crawled all tweets posted by the users as attributes. The final data set, used for action prediction, consists of 7,521 users, 304,275 time-varying following and followed relationships, and 730,568 tweets (blogs) posted by the users.

Download:

At Twitter's request, we cannot publish the tweet content.

http://arnetminer.org/lab-datasets/stnt/data/allstate.rar

http://arnetminer.org/lab-datasets/stnt/data/twitter.rar


Twitter-Competitor:

Overview:

This Twitter data set is about companies. We collected all the patents (3,770,411 patents) from USPTO, from which we extracted 195,263 companies and 2,430,375 inventors. We used each company as a query to search Twitter and retrieved the top returned tweets, from which we further extracted user information. So far, we have collected 1,033,750 tweets written by 87,603 Twitter users, covering 1,393 major companies.

Download:

At Twitter's request, we cannot publish the tweet content.

http://arnetminer.org/lab-datasets/competitor/twitter_content.rar


Weibo-Net-Tweet:

Overview:

The data set was crawled as follows. To begin with, 100 random users were selected as seed users, and then their followees and their followees' followees were collected. The crawling process produced 1.7 million users in total and 4 billion following relationships among them, with an average of 200 followees per user. For each user, the crawler collected her 1,000 most recent microblogs (including tweets and retweets), yielding 1 billion microblogs in total. We also crawled all the users' profiles, which contain name, gender, verification status, #bi-following, #followers, #followees, and #microblogs. Because we focus on retweet behavior in the microblogging network, we selected 300,000 popular microblog diffusion episodes from the data set. Each diffusion episode contains the original microblog and all its retweets. On average, each microblog has been retweeted about 80 times.

Description:

Please refer to https://aminer.org/Influencelocality for details.

Download:

https://aminer.org/Influencelocality


Patent

Overview:

We have collected 4,179,629 patents in total from USPTO.

Description:

Data format:

#* --- patentTitle

#@ --- inventor (split by #)

#year --- year

#assignee --- assignee (split by #)

#region --- region (split by #)

#index --- index id in patminer database

#pn --- id of the patent

#% --- the id of a reference of this patent (there are multiple lines, each indicating one reference)

#! --- terms
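A minimal sketch of reading one patent record in this prefix-tagged format is shown below. It assumes each line starts with one of the listed prefixes followed directly by its value, and that multi-valued fields such as inventors are joined with "#". The function name `parse_patent` and the choice to key the result by prefix are illustrative, not part of the data set.

```python
from typing import Dict, List

# Field prefixes as listed in the data format description.
PREFIXES = ["#assignee", "#region", "#index", "#year", "#pn",
            "#*", "#@", "#%", "#!"]


def parse_patent(lines: List[str]) -> Dict[str, List[str]]:
    """Collect the values of one patent record, keyed by field prefix."""
    record: Dict[str, List[str]] = {}
    for line in lines:
        line = line.strip()
        for prefix in PREFIXES:
            if line.startswith(prefix):
                value = line[len(prefix):].strip()
                # '#%' may occur several times, so every field is a list.
                record.setdefault(prefix, []).append(value)
                break
    return record
```

For example, the inventors of a parsed record could be recovered with `record["#@"][0].split("#")`.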

Download:

http://arnetminer.org/lab-datasets/pminer/patent-data.rar


Slashdot-large:

Overview

Slashdot is a network of friends. Slashdot is a site for sharing technology-related news. In 2002, Slashdot introduced the Slashdot Zoo, which allows users to tag each other as "friends" (like) or "foes" (dislike). The data set contains two parts: a friend/foe network and a news-comment diffusion data set. The network contains 93,139 users and 577,025 relationships between users. The news-comment diffusion data set contains 35,065 news items and 3,505,736 comments.

Description:

Network:

slashdot_nwdata.txt

This file describes the Slashdot network.

The first line consists of two integers, representing the number of users N and the number of follow relationships M respectively.

In the following M lines, each line starts with an integer v1_id, representing the user_id of user v1, followed by another integer k. The following 2k numbers describe the users "FOLLOWED" by v1, each represented by a user id v2_id and a number indicating the type of relationship. In the Slashdot data set, "1" indicates that user v1 is a "fan" of v2, while "0" indicates that v1 considers v2 her "foe".
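The adjacency-list layout described above could be expanded into individual edges roughly as follows. This is a sketch that assumes whitespace-separated integers and simply reads every line after the header; the function name is hypothetical.

```python
from typing import List, Tuple


def parse_slashdot_network(lines: List[str]) -> Tuple[int, int, List[Tuple[int, int, int]]]:
    """Return (N, M, edges), where each edge is (v1_id, v2_id, relation_type)."""
    rows = [ln.split() for ln in lines if ln.strip()]
    n_users, n_relationships = int(rows[0][0]), int(rows[0][1])
    edges = []
    for row in rows[1:]:
        v1, k = int(row[0]), int(row[1])
        nums = [int(x) for x in row[2:2 + 2 * k]]
        # The 2k numbers are k consecutive (v2_id, relation_type) pairs.
        for j in range(0, 2 * k, 2):
            edges.append((v1, nums[j], nums[j + 1]))
    return n_users, n_relationships, edges
```

Here a relation type of 1 marks v1 as a "fan" of v2 and 0 marks v2 as a "foe" of v1, as stated in the description.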

Slashdot Comment Data

1. slashdot_comment_diffusion.txt

This file consists of comment logs on 35,065 news items. The comment log of each news item spans several lines. The first line contains two integers A and B, separated by a [TAB], giving the news_id and the number of comments respectively. Each of the following B lines describes one comment by two integers T and U, representing the time stamp when the comment was posted and the user_id of the user who commented on this news item, respectively. Time stamp T is encoded as the number of seconds elapsed since 1970-01-01-00:00:00.

---------------------------------

Example:

0 111

[news_id] [number_of_comments]

1165115580 24966

[time_stamp] [user_id]

1165115880 2520

[time_stamp] [user_id]

...

---------------------------------

Note that the number of comments is not the number of times the news item has been commented on in the entire Slashdot network, but the number of unique users in our user set who have commented on it. Please also note that if a user commented on the same post multiple times, we kept only the first comment log and removed the others. Thus each user appears at most once in the comment log of a news item.
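The comment log layout above (a header line followed by B comment lines) can be read with a loop like the one sketched here. It assumes the bracketed annotation lines in the example are explanatory and do not occur in the file, and it interprets the time stamps as UTC, which the description does not state. The function name is illustrative.

```python
from datetime import datetime, timezone
from typing import Dict, List, Tuple


def parse_comment_logs(lines: List[str]) -> Dict[int, List[Tuple[datetime, int]]]:
    """Map each hashed news_id to its list of (comment time, user_id) pairs."""
    rows = [ln.split() for ln in lines if ln.strip()]
    logs: Dict[int, List[Tuple[datetime, int]]] = {}
    i = 0
    while i < len(rows):
        news_id, n_comments = int(rows[i][0]), int(rows[i][1])
        entries = rows[i + 1:i + 1 + n_comments]
        logs[news_id] = [
            # Seconds since 1970-01-01, read here as UTC (assumption).
            (datetime.fromtimestamp(int(t), tz=timezone.utc), int(u))
            for t, u in entries
        ]
        i += 1 + n_comments
    return logs
```

Sorting each list by time then gives the diffusion order of a news item among the users in the data set.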

2. slashdot_news_idlist.txt

Each line is the "original news_id" for each news, and the corresponding line number (starts from 0) is the hashed news_id in "slashdot_comment_diffusion.txt".

3. slashdot_uidlist.txt

Each line is the "original slashdot uid" for each user, and the corresponding line number (starts from 0) is the hashed user_id in "slashdot_comment_diffusion.txt".

Download:

http://arnetminer.org/lab-datasets/slashdot/slashdot.rar

Slashdot-small

Overview:

Slashdot is a network of friends. Slashdot is a site for sharing technology-related news. In 2002, Slashdot introduced the Slashdot Zoo, which allows users to tag each other as "friends" (like) or "foes" (dislike). The data set comprises 77,357 users and 516,575 relationships, of which 76.7% are "friend" relationships. The data set can be used to infer the "friend" relationships between users and to study positive and negative influence.

Description:

First column: person id

Second column: person id

Third column: relationship type, 1 means friend relationship, -1 means foe relationship

Sample lines:

0 1 1

0 22 -1
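As a quick sanity check on a signed edge list in this three-column format, the share of "friend" relationships can be computed as in the sketch below. It assumes whitespace-separated columns; the function name is hypothetical.

```python
from typing import List


def friend_fraction(lines: List[str]) -> float:
    """Fraction of edges whose third column is 1 (friend relationship)."""
    signs = [int(ln.split()[2]) for ln in lines if ln.strip()]
    return sum(1 for s in signs if s == 1) / len(signs)
```

Applied to the full file, the result can be compared against the 76.7% figure quoted in the overview.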

Download:

Slashdot network:

http://arnetminer.org/lab-datasets/infer_social_tie_across_heter/Data/slashdot.zip


Epinions-1:

Overview:

Epinions is a network of product reviewers. Each user on the site can post a review for any product, and other users rate the review with trust or distrust. From this data, we created a network of reviewers connected by trust and distrust relationships. The data set consists of 131,828 users and 841,372 relationships, of which about 85.0% are trust relationships. 80,668 users received at least one trust or distrust relationship. The data set can be used to infer the trust relationships between users.

Description:

First column: person id

Second column: person id

Third column: relationship type, 1 means trust relationship, 0 means distrust relationship

Sample lines:

0 1 -1

4 2282 1

Download:

Epinions network:

http://arnetminer.org/lab-datasets/infer_social_tie_across_heter/Data/epinions.zip


Epinions-2:

Overview:

Epinions is a network of product reviewers. Each user on the site can post a review for any product, and other users rate the review with trust or distrust. In this data set, for each user, we have the user's profile, ratings, and trust relations. For each rating, we have the product name and its category, the rating score, the time point when the rating was created, and the helpfulness of the rating.

Description:

The first file is rating.mat, which includes the rating information. There are five columns: userid, productid, categoryid, rating, and helpfulness, respectively.

For example, for one row

(1,2,3,4,5)

It means that user 1 gives a rating of 4 to the product 2 from the category 3. The helpfulness of this rating is 5.

The second file is trustnetwork.mat, including the trust relations between users. There are two columns and both of them are userid.

For example, for one row, (1,2)

It means that user 1 trusts user 2.

Please refer to http://www.jiliang.xyz/trust.html for details.

Enron:

Overview:

Two types of relationships, i.e., manager-subordinate and colleague, were annotated between these employees. There are 3,572 relationships in total, of which 133 are manager-subordinate relationships.

Download:

http://arnetminer.org/lab-datasets/infer_social_tie_across_heter/Data/enron.zip


Flickr-large:

Overview:

The dataset contains friend relationships, user-to-group relationships, images, and user comments on images. The number of users is 2,037,538 and the number of relationships is 219,098,660. The number of groups is 655,917. The number of images is 1,262,978. The number of comments on images is 14,913,164.

Detail:

1. name2id.txt: map from user_name to user_id, format:

user1_name user1_id

2. user2group: relationships between users and groups, format:

user1_id group1_id group2_id......

3. user2user: relationships between users, format:

user1_id user2_id (user2 is in user1's contactList)

4. images_comts.txt: comments on images, format:

image1_id owner_id user1_id comts1 user2_id comts2...
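The user2group and user2user formats listed above could be loaded as in the following sketch. It assumes whitespace-separated fields and keeps ids as strings, since the documentation does not say whether ids are numeric. Both function names are hypothetical.

```python
from typing import Dict, List, Tuple


def parse_user2group(lines: List[str]) -> Dict[str, List[str]]:
    """Map each user id to the list of group ids on the same line."""
    memberships: Dict[str, List[str]] = {}
    for ln in lines:
        parts = ln.split()
        if parts:
            memberships[parts[0]] = parts[1:]
    return memberships


def parse_user2user(lines: List[str]) -> List[Tuple[str, str]]:
    """Return (user1_id, user2_id) pairs; user2 is in user1's contact list."""
    pairs = []
    for ln in lines:
        parts = ln.split()
        if len(parts) >= 2:
            pairs.append((parts[0], parts[1]))
    return pairs
```

The images_comts.txt file is left out here because the description does not make clear how comment text is delimited.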

Download:

http://arnetminer.org/lab-datasets/flickr/flickr.rar

Flickr-medium:

Overview

Flickr is a popular photo-sharing network where users upload and share photos. The Flickr dataset was crawled in early 2014 and consists of 215,495 individual users and 9,114,557 links. As in LiveJournal, the links here reflect friend relationships among users.

Download

http://arnetminer.org/lab-datasets/multi-sns/flickr.tar.gz


Flickr-small:

Overview

This Flickr data set is related to the data available at http://socialnetworks.mpi-sws.org/. You can email the original author to obtain the raw data.

Download

A subset can be downloaded from here:

http://arnetminer.org/lab-datasets/stnt/data/flickr.rar


LiveJournal:

Overview

LiveJournal is a free online social network where users can keep a blog, journal, or diary. Our dataset was crawled from its website in late 2013 and contains 3,017,286 users and 87,037,567 links. Here a link is defined as a friend relationship, that is, two users are linked if one appears in the other's friends list.

Download

http://arnetminer.org/lab-datasets/multi-sns/livejournal.tar.gz


Last.fm:

Overview

Last.fm provides a streaming radio service where users can search for music and get personalized recommendations. Last.fm builds a detailed profile of each user's musical taste and preferences, which is the foundation for its music recommendation. We crawled Last.fm in late 2013 and obtained a network that contains 136,420 users and 1,685,524 following links among them.

Download

http://arnetminer.org/lab-datasets/multi-sns/lastfm.tar.gz


MySpace:

Overview

MySpace is a social networking website with a strong music emphasis. The dataset we obtained from MySpace contains 854,498 user profiles as well as 6,489,736 directed connections among users. To reconstruct the network of these users, we treat connections as undirected and combine parallel links into one.
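The reconstruction step described above (ignoring direction and merging parallel links) can be sketched as follows. The function name and the in-memory edge representation are assumptions made for illustration; the actual preprocessing code is not part of this page.

```python
from typing import Hashable, Iterable, Set, Tuple


def to_undirected(directed_edges: Iterable[Tuple[Hashable, Hashable]]) -> Set[Tuple[Hashable, Hashable]]:
    """Collapse directed (u, v) pairs into a set of undirected edges."""
    undirected = set()
    for u, v in directed_edges:
        # Store each edge with its endpoints in a fixed order so that
        # (u, v) and (v, u) become the same undirected edge.
        undirected.add((min(u, v), max(u, v)))
    return undirected
```

Because the result is a set, repeated and reciprocal connections between the same two users count once.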

Download

http://arnetminer.org/lab-datasets/multi-sns/myspace.tar.gz


LinkedIn:

Overview

LinkedIn is a professional network, where users can maintain their profiles and social connections. We collected public profiles from LinkedIn. As we cannot crawl user connections on LinkedIn, we pursued another method to construct the network. We consider two profiles to be linked if they were viewed (“co-viewed”) by the same user. In this way, we obtained a network of 2,985,414 user profiles and 25,965,384 relationships.
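The co-view construction described above can be illustrated with the sketch below. The input structure (a mapping from a viewer to the profiles that viewer looked at) is purely hypothetical, since the page does not describe how view information was obtained or stored.

```python
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def coview_edges(views_by_user: Dict[str, Iterable[str]]) -> Set[Tuple[str, str]]:
    """Link two profiles whenever the same user viewed both of them."""
    edges: Set[Tuple[str, str]] = set()
    for profiles in views_by_user.values():
        # Every unordered pair of distinct profiles seen by one viewer
        # becomes a single undirected co-view edge.
        for a, b in combinations(sorted(set(profiles)), 2):
            edges.add((a, b))
    return edges
```

For instance, a viewer who looked at profiles p1, p2, and p3 contributes the three edges (p1, p2), (p1, p3), and (p2, p3).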

Download

http://arnetminer.org/lab-datasets/multi-sns/linkedin.tar.gz

Complete dataset (with profile contents and relationships):

link: http://pan.baidu.com/s/1eQxwLpw password: fbjn


Movie:

Overview

The data set consists of movies, actors, directors, writers, and various relationships between them, crawled from http://en.wikipedia.org/wiki/Category:English-language_films. The file newmovies.rar contains a heterogeneous network covering 10 topics: American film actors, American television actors, Black and white films, Drama films, Comedy films, British films, American film directors, Independent films, American screenwriters, American stage actors.

Description:

The dataset consists of a star-director-film-writer network. Each data file consists of two sections: *Vertices and *Edges. “*Vertices 348” indicates that there are 348 heterogeneous nodes in the network.

The lines following “*Vertices 348”, e.g., “0 "Ann Blyth" 6035 starring 1928 births;Living people;American film actors;American musical theatre actors;American child actors;People from Westchester County, New York;”, each represent the attributes of a node, with multiple columns: node id, node name, node weight, node type (e.g., star, or writer), and multiple categories (topics) separated by semicolons.

The weight is simply the number of words introducing the node on Wikipedia. Type and categories are extracted from Wikipedia pages.

The lines following “*Edges”, e.g., “233 234 1”, each represent an edge between nodes, with three columns: node id 1, node id 2, and a value that is always 1. The edge indicates that the two node names appear on the same Wikipedia page.
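A file with the two sections described above could be parsed along the lines of this sketch. It assumes node names are enclosed in double quotes as in the example line and that columns are separated by whitespace; the function name and the dictionary keys are illustrative.

```python
import re
from typing import Dict, List, Tuple

# node id, quoted name, weight, type, then the semicolon-separated categories
VERTEX_RE = re.compile(r'^(\d+)\s+"([^"]*)"\s+(\d+)\s+(\S+)\s*(.*)$')


def parse_network(lines: List[str]) -> Tuple[Dict[int, dict], List[Tuple[int, int, int]]]:
    """Parse the *Vertices and *Edges sections of one data file."""
    vertices: Dict[int, dict] = {}
    edges: List[Tuple[int, int, int]] = []
    section = None
    for ln in lines:
        ln = ln.strip()
        if not ln:
            continue
        if ln.startswith("*Vertices"):
            section = "vertices"
        elif ln.startswith("*Edges"):
            section = "edges"
        elif section == "vertices":
            m = VERTEX_RE.match(ln)
            if m:
                node_id, name, weight, node_type, cats = m.groups()
                vertices[int(node_id)] = {
                    "name": name,
                    "weight": int(weight),
                    "type": node_type,
                    "categories": [c.strip() for c in cats.split(";") if c.strip()],
                }
        elif section == "edges":
            a, b, w = ln.split()[:3]
            edges.append((int(a), int(b), int(w)))
    return vertices, edges
```

Category names may themselves contain commas and spaces, which is why only the semicolon is used as a separator.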

Download

http://arnetminer.org/lab-datasets/soinf/newmovies.rar


Name disambiguation:

Overview

This data set is used for studying name disambiguation in digital libraries. It contains 110 author names and their disambiguation results (ground truth). Each author name corresponds to a raw file in the "raw-data" folder and an answer file (ground truth) in the "Answer" folder. (The simple version does not contain "citation", "co-affiliation-occur", "homepage". Refer to our ICDM 2011 paper for the definition of these features.)

Description and Download:

http://arnetminer.org/disambiguation


Wikipedia-patent:

Overview

Given an entity in a source domain, finding its matched entities from another (target) domain is an important task in many applications. For example, a patent expert may be interested in finding related patents for a product. The dataset is used to match patents with the corresponding products in Wikipedia.

Description:

http://arnetminer.org/document-match

Download:

http://arnetminer.org/upload/files/1355335509673572.zip

Wikipedia-Baidu Baike:

Overview

Given an entity in a source domain, finding its matched entities from another (target) domain is an important task in many applications. For example, a patent expert may be interested in finding related patents for a product. The dataset is used to match pages in Wikipedia to corresponding pages in Baidu Baike.

Description:

http://arnetminer.org/document-match

Download:

http://arnetminer.org/lab-datasets/docmatch/data/

Mobile-1:

Overview

It consists of call logs, Bluetooth scanning data, and cell tower IDs of 107 users over about ten months. If two users communicated with each other (by making a call or sending a text message) or co-occurred in the same place, we create a relationship between them.

Download:

Mobile network:

http://arnetminer.org/lab-datasets/infer_social_tie_across_heter/Data/mobileu.dat


Mobile-2:

Overview

Nodes are employees in a company, and relationships are formed by calls and short messages sent between them over a few months. In this mobile network, each user is labeled with her/his position (such as manager or ordinary employee) in the company. In total, there are 232 users (50 managers and 182 ordinary employees) and 3,567 relationships (including calls and text messages) between the users.

Download:

http://arnetminer.org/lab-datasets/infer_social_tie_across_heter/Data/mobiled.dat
