| SN | Name | Node | Edge | Behavior/Content | Description |
|---|---|---|---|---|---|
| Microblogging networks | |||||
| {{mecroblogging.SN}} | {{mecroblogging.name}} | {{mecroblogging.node}} | {{mecroblogging.edge}} | {{mecroblogging.behavior}} | {{mecroblogging.desc}} |
| Patent data set from Patentminer.org | |||||
| {{patentDate.SN}} | {{patentDate.name}} | {{patentDate.node}} | {{patentDate.edge}} | {{patentDate.behavior}} | {{patentDate.desc}} |
| Other online social networks | |||||
| {{social.SN}} | {{social.name}} | {{social.node}} | {{social.edge}} | {{social.behavior}} | {{social.desc}} |
| Knowledge linking data set | |||||
| {{knowledge.SN}} | {{knowledge.name}} | {{knowledge.node}} | {{knowledge.edge}} | {{knowledge.behavior}} | {{knowledge.desc}} |
| Mobile data set | |||||
| {{mobileData.SN}} | {{mobileData.name}} | {{mobileData.node}} | {{mobileData.edge}} | {{mobileData.behavior}} | {{mobileData.desc}} |
We crawled a Twitter dataset. To begin the collection process, we selected the most popular user on Twitter, i.e., “Lady Gaga”, and randomly collected 10,000 of her followers. Taking these users as seed users, we used a crawler to collect all of their followers by traversing “following” relationships; the resulting user list contains 112,044 users. The crawler monitored changes in the network structure among the 112,044 users from 10/12/2010 to 12/23/2010 and finally obtained 443,399 dynamic “following” relationships between them.
Tweets were crawled for those users from Jan 1, 2010 to Oct, 2010 and from Oct 1, 2010 to Jan 15, 2010.
There are 443,399 follow relationships in total.
First column: person id 1
Second column: person id 2
Third column: the timestamp when the person 1 follows person 2
Sample lines:
0 11 1
0 4893 1
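The three-column layout above can be read with a few lines of Python. This is a minimal sketch, assuming the columns are whitespace-separated and the timestamp is an integer (its unit is not stated here); `FollowEdge` and `parse_follow_edges` are hypothetical names chosen for illustration.

```python
from typing import Iterable, List, NamedTuple


class FollowEdge(NamedTuple):
    follower: int   # first column: person id 1
    followee: int   # second column: person id 2
    timestamp: int  # third column: when person 1 follows person 2


def parse_follow_edges(lines: Iterable[str]) -> List[FollowEdge]:
    """Parse 'id1 id2 timestamp' lines (assumed whitespace-separated)."""
    edges = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        edges.append(FollowEdge(int(parts[0]), int(parts[1]), int(parts[2])))
    return edges


# The two sample lines shown above.
edges = parse_follow_edges(["0 11 1", "0 4893 1"])
```

With the sample lines, `edges[1].followee` is `4893`.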
Line i (counting from 0) holds the original user id k, meaning that the original user id k is mapped to the new id i.
Format:
original_user_id username
The tweet data involves 156,487 users and 99,696,204 tweets, of which 28,699,842 are replies and 11,408,918 are retweets.
Data format:
User_Name
Tweet_ID
Time
Via // the type of the web app
retweet_from
reply_to_user reply_to_tweet (if not a reply, just "-1")
content // please note that the original content has already been converted into word indices
Number_of_link_in_tweet // the links include URLs and @user mentions
type_of_link1 link1
type_of_link2 link2
type_of_link3 link3
...
Sample data:
Joes_face
8977943263
Thu Feb 11 21:13:40 +0000 2010
via TwitPic
-1
-1
21589626 20760882
3
tweet-url web http://twitpic.com/12ndbp
tweet-url username /joejonas
tweet-url username /nickjonas
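A single record in the layout above could be parsed as sketched below. This is only an illustration: it assumes the caller has already isolated the lines of one record (how records are separated in the file is not stated here), that the link itself contains no spaces, and that the default C/English locale is in effect for the day and month names. `Tweet` and `parse_tweet` are hypothetical names.

```python
from datetime import datetime
from typing import List, NamedTuple, Tuple


class Tweet(NamedTuple):
    user_name: str
    tweet_id: str
    time: datetime
    via: str
    retweet_from: str
    reply_to: List[str]           # ["-1"] when the tweet is not a reply
    word_ids: List[int]           # content, already converted to word indices
    links: List[Tuple[str, str]]  # (type_of_link, link) pairs


def parse_tweet(lines: List[str]) -> Tweet:
    """Parse one record given as a list of lines in the documented field order."""
    n_links = int(lines[7])
    links = []
    for raw in lines[8:8 + n_links]:
        # Assumption: the link has no spaces, so the last token is the link
        # and everything before it is the link type.
        link_type, _, link = raw.strip().rpartition(" ")
        links.append((link_type, link))
    return Tweet(
        user_name=lines[0].strip(),
        tweet_id=lines[1].strip(),
        time=datetime.strptime(lines[2].strip(), "%a %b %d %H:%M:%S %z %Y"),
        via=lines[3].strip(),
        retweet_from=lines[4].strip(),
        reply_to=lines[5].split(),
        word_ids=[int(w) for w in lines[6].split()],
        links=links,
    )


# The sample record shown above.
sample = [
    "Joes_face",
    "8977943263",
    "Thu Feb 11 21:13:40 +0000 2010",
    "via TwitPic",
    "-1",
    "-1",
    "21589626 20760882",
    "3",
    "tweet-url web http://twitpic.com/12ndbp",
    "tweet-url username /joejonas",
    "tweet-url username /nickjonas",
]
tweet = parse_tweet(sample)
```

The tweet id is kept as a string so that it round-trips unchanged; the word indices can be mapped back to words with the word table linked below.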
http://arnetminer.org/lab-datasets/tweet/twitter_network.rar
http://arnetminer.org/lab-datasets/tweet/Tweets-withoutwords.rar (Please note that the content has already been converted into word indices)
https://static.aminer.org/lab-datasets/tweet/WordTable.tgz
Twitter. The data set was crawled from Twitter starting from the user “Carel Pedre (carelpedre)”, one of Haiti's most popular radio DJs, who used Twitter to inform the world about the earthquake that ravaged his country. We extracted all followers (> 11,704) of “carelpedre” and the users he is following, and continued the process for each extracted Twitter user. We further crawled all tweets posted by the users as attributes. The final data set used for action prediction consists of 7,521 users, 304,275 time-varying following and followed relationships, and 730,568 tweets (blogs) posted by the users.
At Twitter's request, we cannot publish the tweet content.
http://arnetminer.org/lab-datasets/stnt/data/allstate.rar
http://arnetminer.org/lab-datasets/stnt/data/twitter.rar
This Twitter data set is about companies. We collected all the patents (3,770,411 patents) from USPTO, from which we extracted 195,263 companies and 2,430,375 inventors. We used each company as a query to search Twitter and retrieved the top returned tweets, from which we further extracted user information. So far, we have collected 1,033,750 tweets written by 87,603 Twitter users, covering 1,393 major companies.
At Twitter's request, we cannot publish the tweet content.
http://arnetminer.org/lab-datasets/competitor/twitter_content.rar
The data set was crawled as follows. To begin with, 100 random users were selected as seed users, and then their followees and their followees’ followees were collected. The crawling process produced in total 1.7 million users and 0.4 billion following relationships among them, with an average of 200 followees per user. For each user, the crawler collected her 1,000 most recent microblogs (including tweets and retweets). The process resulted in a total of 1 billion microblogs. We also crawled all the users’ profiles, which contain name, gender, verification status, #bi-following, #followers, #followees, and #microblogs. We focus on retweet behaviors in the microblogging network, so we selected 300,000 popular microblog diffusion episodes from the data set. Each diffusion episode contains the original microblog and all its retweets. On average, each microblog has been retweeted about 80 times.
Please refer to https://aminer.org/Influencelocality for details.
https://aminer.org/Influencelocality
We have collected a total of 4,179,629 patents from USPTO.
#* --- patentTitle
#@ --- inventor(split by #)
#year ---- Year
#assignee --- assignee(split by #)
#region --- region(split by #)
#index ---- index id in patminer database
#pn ---- id of the patent
#% ---- the id of references of this patent (there are multiple lines, with each indicating a reference)
#! ---terms
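The tagged layout above could be read roughly as follows. This is a sketch under several assumptions that are not confirmed here: each field sits on its own line, the value follows the tag directly (optionally after whitespace), and a new record starts at each `#*` line. The function name, the output field names, and the sample values are all invented for illustration.

```python
from typing import Dict, Iterable, List

# Tags whose value is a '#'-separated list, and tags with a single text value.
LIST_TAGS = {"#@": "inventors", "#assignee": "assignees", "#region": "regions"}
TEXT_TAGS = {"#*": "title", "#year": "year", "#index": "index",
             "#pn": "pn", "#!": "terms"}


def parse_patents(lines: Iterable[str]) -> List[Dict]:
    """Group tagged lines into one dict per patent (assumes '#*' starts a record)."""
    records: List[Dict] = []
    current = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("#*"):
            current = {"references": []}
            records.append(current)
        if current is None:
            continue  # ignore anything before the first title line
        if line.startswith("#%"):
            # One reference id per '#%' line.
            current["references"].append(line[2:].strip())
            continue
        for tag, field in LIST_TAGS.items():
            if line.startswith(tag):
                values = line[len(tag):].strip().split("#")
                current[field] = [v.strip() for v in values if v.strip()]
        for tag, field in TEXT_TAGS.items():
            if line.startswith(tag):
                current[field] = line[len(tag):].strip()
    return records


# Invented placeholder lines, NOT taken from the actual data file.
demo = [
    "#* Example title",
    "#@ Inventor A#Inventor B",
    "#year 2000",
    "#pn P1",
    "#% P0",
    "#* Another title",
    "#pn P2",
]
patents = parse_patents(demo)
```

Keeping the references as a list mirrors the note that `#%` can appear on multiple lines within one patent.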
http://arnetminer.org/lab-datasets/pminer/patent-data.rar
Slashdot is a network of friends. Slashdot is a site for sharing technology-related news. In 2002, Slashdot introduced the Slashdot Zoo, which allows users to tag each other as “friends” (like) or “foes” (dislike). The data set contains two parts: a friend/foe network and a news-comment diffusion dataset. The network contains 93,139 users and 577,025 relationships between users. The news-comment diffusion dataset contains 35,065 news items and 3,505,736 comments.
slashdot_nwdata.txt
This file describes the Slashdot network.
First line consists of two integers, representing the number of users N and number of follow relationships M respectively.
In the following M lines, each line starts with an integer v1_id, representing the user_id of user v1, followed by another integer k. The following 2k numbers describe the users "FOLLOWED" by v1, each represented by a user id v2_id and a number indicating the type of relationship. In the Slashdot data set, "1" indicates that user v1 is a "fan" of v2, while "0" indicates that v1 considers v2 as her "foe".
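The adjacency layout described above can be unpacked into (v1, v2, type) triples. The sketch below assumes whitespace-separated integers and simply reads every line after the header rather than counting lines; `parse_slashdot_network` is a hypothetical name and the demo lines are invented, not taken from slashdot_nwdata.txt.

```python
from typing import Iterable, List, Tuple


def parse_slashdot_network(
    lines: Iterable[str],
) -> Tuple[int, int, List[Tuple[int, int, int]]]:
    """Return (N, M, edges); each edge is (v1_id, v2_id, type), type 1=fan, 0=foe."""
    rows = (ln.split() for ln in lines if ln.strip())
    header = next(rows)
    n_users, n_relations = int(header[0]), int(header[1])
    edges = []
    for tokens in rows:
        v1, k = int(tokens[0]), int(tokens[1])
        pairs = tokens[2:2 + 2 * k]  # k (v2_id, type) pairs
        for j in range(0, len(pairs), 2):
            edges.append((v1, int(pairs[j]), int(pairs[j + 1])))
    return n_users, n_relations, edges


# Invented example: user 0 is a fan of 1 and a foe of 2; user 1 is a fan of 2.
demo = ["3 3", "0 2 1 1 2 0", "1 1 2 1"]
n_users, n_relations, edges = parse_slashdot_network(demo)
```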
Slashdot Comment Data
This file consists of comment logs on 35,065 news items. The comment log of each news item is represented by several lines. The first line contains two integers A and B, separated by a [TAB], giving the news_id and the number of comments respectively. Each of the following B lines describes one comment by two integers T and U: the time stamp at which the comment was posted and the user_id of the user who commented on this news item, respectively. Time stamp T is encoded as the number of seconds elapsed since 1970-01-01 00:00:00.
---------------------------------
Example:
0 111
[news_id] [number_of_comments]
1165115580 24966
[time_stamp] [user_id]
1165115880 2520
[time_stamp] [user_id]
...
---------------------------------
Note that the number of comments is not the number of times the news item has been commented on in the entire Slashdot network, but the number of unique users in our user set who have commented on it. Please also note that if a user commented on the same post multiple times, only the first comment log was kept and the others were removed. Thus each user appears at most once in the comment log of a news item.
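The comment-log layout described above can be read block by block, as in the following sketch. It assumes whitespace (tab or space) between the two integers on each line and interprets T as seconds since the Unix epoch in UTC; `parse_comment_logs` is a hypothetical name. The demo reuses the two comment lines from the example, with the comment count shortened to 2 because the example is truncated.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, List, Tuple


def parse_comment_logs(
    lines: Iterable[str],
) -> Dict[int, List[Tuple[datetime, int]]]:
    """Map each news_id to its list of (comment time in UTC, user_id)."""
    logs: Dict[int, List[Tuple[datetime, int]]] = {}
    rows = (ln for ln in lines if ln.strip())
    for header in rows:
        news_id, n_comments = (int(x) for x in header.split())
        comments = []
        for _ in range(n_comments):
            t, user_id = (int(x) for x in next(rows).split())
            comments.append((datetime.fromtimestamp(t, tz=timezone.utc), user_id))
        logs[news_id] = comments
    return logs


# Header count set to 2 (instead of 111) to match the two lines available.
demo = ["0\t2", "1165115580 24966", "1165115880 2520"]
logs = parse_comment_logs(demo)
```

Because each user appears at most once per news item, the list for a news item can also be turned into a `{user_id: time}` dictionary without losing information.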
Each line is the "original news_id" for each news, and the corresponding line number (starts from 0) is the hashed news_id in "slashdot_comment_diffusion.txt".
Each line is the "original slashdot uid" for each user, and the corresponding line number (starts from 0) is the hashed user_id in "slashdot_comment_diffusion.txt".
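Since the hashed id is simply the 0-based line number, both mapping files can be loaded the same way. A minimal sketch, with `load_id_map` as a hypothetical name and invented placeholder ids:

```python
from typing import Dict, Iterable


def load_id_map(lines: Iterable[str]) -> Dict[int, str]:
    """Return {hashed_id: original_id}; hashed_id is the 0-based line number."""
    # Blank lines are NOT skipped, because the line position itself is the id.
    return {i: line.strip() for i, line in enumerate(lines)}


# Invented placeholder ids, not taken from the actual mapping files.
hashed_to_original = load_id_map(["orig_a\n", "orig_b\n", "orig_c\n"])
original_to_hashed = {orig: h for h, orig in hashed_to_original.items()}
```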
Slashdot is a network of friends. Slashdot is a site for sharing technology-related news. In 2002, Slashdot introduced the Slashdot Zoo, which allows users to tag each other as “friends” (like) or “foes” (dislike). The data set comprises 77,357 users and 516,575 relationships, of which 76.7% are “friend” relationships. The data set can be used to infer the “friend” relationships between users, and to study positive and negative influence.
First column: person id
Second column: person id
Third column: relationship type, 1 means friend relationship, -1 means foe relationship
Sample lines:
0 1 1
0 22 -1
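As a quick sanity check on a signed edge list in this layout, the share of friend relationships can be computed as sketched below. It assumes whitespace-separated columns; `friend_fraction` is a hypothetical name.

```python
from typing import Iterable


def friend_fraction(lines: Iterable[str]) -> float:
    """Fraction of edges whose third column is 1 (friend); 0.0 if no edges."""
    friends = 0
    total = 0
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        total += 1
        if int(parts[2]) == 1:
            friends += 1
    return friends / total if total else 0.0


# The two sample lines shown above: one friend edge, one foe edge.
fraction = friend_fraction(["0 1 1", "0 22 -1"])
```

Run over the full file, the result should be close to the 76.7% stated in the description.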
http://arnetminer.org/lab-datasets/infer_social_tie_across_heter/Data/slashdot.zip
Epinions is a network of product reviewers. Each user on the site can post a review for any product, and other users rate the review with trust or distrust. In this data, we created a network of reviewers connected by trust and distrust relationships. The data set consists of 131,828 users and 841,372 relationships, of which about 85.0% are trust relationships. 80,668 users received at least one trust or distrust relationship. The data set can be used to infer the trust relationships between users.
First column: person id
Second column: person id
Third column: relationship type, 1 means trust relationship, -1 means distrust relationship
Sample lines:
0 1 -1
4 2282 1
http://arnetminer.org/lab-datasets/infer_social_tie_across_heter/Data/epinions.zip
Epinions is a network of product reviewers. Each user on the site can post a review for any product, and other users rate the review with trust or distrust. In this data, for each user, we have his profile, his ratings and his trust relations. For each rating, we have the product name and its category, the rating score, the time point when the rating was created, and the helpfulness of the rating.
The first file is rating.mat, which contains the rating information. There are five columns: userid, productid, categoryid, rating, and helpfulness, respectively.
For example, for one row
(1,2,3,4,5)
It means that user 1 gives a rating of 4 to the product 2 from the category 3. The helpfulness of this rating is 5.
The second file is trustnetwork.mat, including the trust relations between users. There are two columns and both of them are userid.
For example, for one row, (1,2)
It means that user 1 trusts user 2.
Please refer to http://www.jiliang.xyz/trust.html for details.
Two types of relationships, i.e., manager-subordinate and colleague, were annotated between these employees. There are in total 3,572 relationships, of which 133 are manager-subordinate relationships.
http://arnetminer.org/lab-datasets/infer_social_tie_across_heter/Data/enron.zip
The dataset contains friend relationships, user-to-group relationships, images, and users' comments on images. There are 2,037,538 users, 219,098,660 relationships, 655,917 groups, 1,262,978 images, and 14,913,164 comments on images.
1. name2id.txt: map from user_name to user_id, format:
user1_name user1_id
2. user2group: relationships between users and groups, format:
user1_id group1_id group2_id ...
3. user2user: relationships between users, format:
user1_id user2_id (user2 is in user1's contactList)
4. images_comts.txt: comments on images, format:
image1_id owner_id user1_id comts1 user2_id comts2 ...
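Of the four files, the user2group layout is the simplest to illustrate. The sketch below assumes whitespace-separated ids and keeps them as strings, since the id format is not specified here; `parse_user2group` is a hypothetical name and the demo ids are invented.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def parse_user2group(
    lines: Iterable[str],
) -> Tuple[Dict[str, List[str]], Dict[str, Set[str]]]:
    """Return (groups per user, members per group) from 'user group group ...' lines."""
    groups_of: Dict[str, List[str]] = {}
    members_of: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        user, groups = tokens[0], tokens[1:]
        groups_of[user] = groups
        for group in groups:
            members_of[group].add(user)
    return groups_of, dict(members_of)


# Invented ids for illustration only.
groups_of, members_of = parse_user2group(["u1 g1 g2", "u2 g2"])
```

The inverted `members_of` index is convenient when the analysis starts from groups rather than users.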
Flickr is a popular photo-sharing network where users upload and share photos. The Flickr dataset was crawled in early 2014 and consists of 215,495 individual users and 9,114,557 links. Like LiveJournal, the links here reflect friend relationships among users.
http://arnetminer.org/lab-datasets/multi-sns/flickr.tar.gz
This Flickr data set is related to the data from http://socialnetworks.mpi-sws.org/. You can email the original authors to get the coarse data.
http://arnetminer.org/lab-datasets/stnt/data/flickr.rar
LiveJournal is a free online social network where users can keep a blog, journal or diary. Our dataset was crawled from its website in late 2013 and contains 3,017,286 users and 87,037,567 links. A link is defined as a friend relationship, that is, two users are linked if one appears in the other's friends list.
http://arnetminer.org/lab-datasets/multi-sns/livejournal.tar.gz
Last.fm provides a streaming radio service, where users can search for music and get personalized recommendations. Last.fm builds a detailed profile of each user's musical taste and preferences, which is the foundation for music recommendation. We crawled Last.fm in late 2013 and obtained a network that contains 136,420 users and 1,685,524 following links among them.
http://arnetminer.org/lab-datasets/multi-sns/lastfm.tar.gz
MySpace is a social networking website which also has a strong music emphasis. The dataset we obtained from MySpace contains 854,498 user profiles as well as 6,489,736 directed connections among users. To reconstruct the network of these users, we treat connections as undirected and combine parallel links into one.
http://arnetminer.org/lab-datasets/multi-sns/myspace.tar.gz
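The reconstruction step described for MySpace (treat connections as undirected, merge parallel links) can be sketched as below. The function name and the demo edges are hypothetical, and this is not necessarily how the released file was produced.

```python
from typing import Iterable, Set, Tuple


def to_undirected(directed_edges: Iterable[Tuple[int, int]]) -> Set[Tuple[int, int]]:
    """Drop direction and merge parallel links by storing each pair in sorted order."""
    undirected = set()
    for u, v in directed_edges:
        undirected.add((u, v) if u <= v else (v, u))
    return undirected


# Invented example: 1->2, 2->1 and a repeated 1->2 collapse into one link.
links = to_undirected([(1, 2), (2, 1), (1, 2), (2, 3)])
```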
LinkedIn is a professional network, where users can maintain their profiles and social connections. We collected public profiles from LinkedIn. As we cannot crawl user connections on LinkedIn, we pursued another method to construct the network. We consider two profiles to be linked if they were viewed (“co-viewed”) by the same user. In this way, we obtained a network of 2,985,414 user profiles and 25,965,384 relationships.
http://arnetminer.org/lab-datasets/multi-sns/linkedin.tar.gz
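The co-view rule described for LinkedIn (two profiles are linked if the same user viewed both) can be illustrated as follows. The input structure, a mapping from a viewer to the profiles that viewer looked at, is purely an assumption made for this sketch; how the co-view information was actually obtained is not described here. `coview_edges` and the demo ids are hypothetical.

```python
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def coview_edges(views: Dict[str, Iterable[str]]) -> Set[Tuple[str, str]]:
    """Link every pair of distinct profiles that share at least one viewer."""
    edges = set()
    for profiles in views.values():
        # sorted(set(...)) removes repeat views and fixes the pair orientation.
        for a, b in combinations(sorted(set(profiles)), 2):
            edges.add((a, b))
    return edges


# Invented example: viewer v1 saw p1, p2, p3; viewer v2 saw p2 and p3.
edges = coview_edges({"v1": ["p1", "p2", "p3"], "v2": ["p3", "p2"]})
```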
link:http://pan.baidu.com/s/1eQxwLpw password: fbjn
A data set consisting of movies, actors, directors, writers, and various relationships between them, crawled from http://en.wikipedia.org/wiki/Category:English-language_films. newmovies.rar: a heterogeneous network. It contains 10 topics: American film actors, American television actors, Black and white films, Drama films, Comedy films, British films, American film directors, Independent films, American screenwriters, American stage actors.
The dataset consists of a star-director-film-writer network. Each data file consists of two sections: *Vertices and *Edges. “*Vertices 348” indicates that there are 348 heterogeneous nodes in the network.
The lines following “*Vertices 348”, e.g., “0 "Ann Blyth" 6035 starring 1928 births;Living people;American film actors;American musical theatre actors;American child actors;People from Westchester County, New York;”, each represent the attributes of a node, with multiple columns: node id, node name, node weight, node type (e.g., star or writer), and multiple categories (topics) separated by semicolons.
The weight is simply the number of words introducing the node on Wikipedia. Type and categories are extracted from Wikipedia pages.
The lines following “*Edges”, e.g., “233 234 1”, each represent an edge between nodes, with three columns: node id 1, node id 2, and a third value that is always 1. The edge indicates that the two node names appear on the same Wikipedia page.
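The two-section file layout described above can be parsed with a small state machine. This sketch assumes the node name is always enclosed in double quotes and the node type is a single token, as in the example line; `parse_network` and `VERTEX_RE` are hypothetical names.

```python
import re
from typing import Dict, Iterable, List, Tuple

# node id, quoted name, weight, type, then ';'-separated categories
VERTEX_RE = re.compile(r'^(\d+)\s+"([^"]*)"\s+(\d+)\s+(\S+)\s*(.*)$')


def parse_network(lines: Iterable[str]) -> Tuple[Dict[int, dict], List[Tuple[int, int]]]:
    """Return (vertices keyed by node id, list of (node id 1, node id 2) edges)."""
    vertices: Dict[int, dict] = {}
    edges: List[Tuple[int, int]] = []
    section = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("*"):
            section = line.split()[0].lower()  # "*vertices" or "*edges"
            continue
        if section == "*vertices":
            match = VERTEX_RE.match(line)
            if match:
                node_id, name, weight, node_type, cats = match.groups()
                vertices[int(node_id)] = {
                    "name": name,
                    "weight": int(weight),
                    "type": node_type,
                    "categories": [c.strip() for c in cats.split(";") if c.strip()],
                }
        elif section == "*edges":
            parts = line.split()
            edges.append((int(parts[0]), int(parts[1])))  # third column is always 1
    return vertices, edges


# The example lines quoted above.
demo = [
    "*Vertices 348",
    '0 "Ann Blyth" 6035 starring 1928 births;Living people;American film actors;'
    "American musical theatre actors;American child actors;"
    "People from Westchester County, New York;",
    "*Edges",
    "233 234 1",
]
vertices, edges = parse_network(demo)
```

Splitting categories on semicolons rather than commas matters here, because a category such as "People from Westchester County, New York" contains a comma.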
http://arnetminer.org/lab-datasets/soinf/newmovies.rar
This data set is used for studying name disambiguation in digital libraries. It contains 110 author names and their disambiguation results (ground truth). Each author name corresponds to a raw file in the "raw-data" folder and an answer file (ground truth) in the "Answer" folder. (The simple version does not contain "citation", "co-affiliation-occur", "homepage". Refer to our ICDM 2011 paper for the definition of these features.)
http://arnetminer.org/disambiguation
Given an entity in a source domain, finding its matched entities from another (target) domain is an important task in many applications. For example, a patent expert may be interested in finding related patents for a product. The dataset is used to match patents with the corresponding products in Wikipedia.
http://arnetminer.org/document-match
Given an entity in a source domain, finding its matched entities from another (target) domain is an important task in many applications. For example, a patent expert may be interested in finding related patents for a product. The dataset is used to match pages in Wikipedia to corresponding pages in Baidu Baike.
http://arnetminer.org/document-match
It consists of the call logs, Bluetooth scanning data and cell tower IDs of 107 users over about ten months. If two users communicated with each other (by making a call or sending a text message) or co-occurred in the same place, we create a relationship between them.
http://arnetminer.org/lab-datasets/infer_social_tie_across_heter/Data/mobileu.dat
Nodes are employees in a company, and relationships are formed by calls and short messages sent between them over a few months. In this mobile network, each user is labeled with her/his position (such as manager or ordinary employee) in the company. In total, there are 232 users (50 managers and 182 ordinary employees) and 3,567 relationships (including calls and text messages) between the users.
http://arnetminer.org/lab-datasets/infer_social_tie_across_heter/Data/mobiled.dat