The data set is designed for research purposes only. The citation data is extracted from DBLP, ACM, and other sources. The first version contains 629,814 papers and 632,752 citations. Each paper is associated with an abstract, authors, year, venue, and title.
The data set can be used for clustering with network and side information, studying influence in the citation network, finding the most influential papers, topic modeling analysis, etc.
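As a rough illustration of the "most influential papers" use case, the sketch below ranks papers by how often they are cited. Citation count is only one simple proxy for influence, and the `(citing, cited)` pair representation and the paper ids are assumptions made for this example, not the format of the released files.

```python
from collections import Counter

def most_cited(citations, k=3):
    """Return the k most-cited papers as (paper_id, citation_count) pairs.

    `citations` is an iterable of (citing_id, cited_id) pairs; this edge-list
    layout is assumed for illustration only.
    """
    counts = Counter(cited for _citing, cited in citations)
    return counts.most_common(k)

# Hypothetical toy citation network.
edges = [
    ("p1", "p3"), ("p2", "p3"), ("p4", "p3"),
    ("p1", "p2"), ("p4", "p2"),
    ("p3", "p5"),
]
top = most_cited(edges, k=2)
print(top)
```

More refined measures (for example, PageRank-style scores) could be substituted for the raw count without changing the surrounding code.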
A larger version will be released soon.
This data includes paper information, paper citations, author information, and author collaborations. 2,092,356 papers and 8,024,869 citations between them are saved in the file AMiner-Paper.rar; 1,712,433 authors are saved in the file AMiner-Author.zip; and 4,258,615 collaboration relationships are saved in the file AMiner-Coauthor.zip.
This data set contains 6 different networks: Epinions, Slashdot, MobileU, MobileD, Coauthor, and Enron.
The citation network consists of papers and citation relationships chosen from ArnetMiner. The raw citation data consists of 2555 papers and 6101 citation relationships. The papers are mainly from 10 research fields:
Topic 16: Data Mining / Association Rules
Topic 107: Web Services
Topic 131: Bayesian Networks / Belief function
Topic 144: Web Mining / Information Fusion
Topic 145: Semantic Web / Description Logics
Topic 162: Machine Learning
Topic 24: Database Systems / XML Data
Topic 75: Information Retrieval
Topic 182: Pattern recognition / Image analysis
Topic 199: Natural Language System / Statistical Machine Translation.
This data set includes three different real-world social networks:
We are developing extraction tools in ArnetMiner, a researcher social network system. The tools will be used to extract researcher profiles from Web pages and output the extracted information into a researcher database.
The data set and related documents are used for researcher profile extraction.
The work intends to study how to quantify link semantics. Specifically, an ideal output of link semantics analysis is to provide users with the following information: (1) multiple topics discussed in each page; (2) semantics of a link between two pages; and (3) the influential strength of each link. With such an analysis, a user could easily trace the origins of an idea/technique, analyze the evolution and impact of a topic, filter the pages by certain categories of links, as well as zoom in and zoom out the linkage tracing graph with the degree of influence.
This data set consists of published papers chosen from ArnetMiner. original_data.rar contains the original papers, some with the whole content and others with only the abstract; annotate_data.txt is the output of the annotation tool.
We have collected topics and their related people lists from as many sources as possible. We randomly chose 13 topics and created 13 people lists. The data sets were used as the “golden metric” for expert finding. They were also used to create the test sets for association search. The following table shows the 13 topics and statistics of people we have collected. In the 13 topics, OA and SW are from PC members of the related conferences or workshops. DM is from a list of data mining people organized by kmining.com. IE is from a list of information extraction researchers that were collected by Muslea. BS and SVM are from their official web sites, respectively. PL, IA, ML, and NLP are from a page organized by Russell and Norvig, which links to 849 pages around the web with information on Artificial Intelligence.
To evaluate the effectiveness of our proposed association search approach, we created 9 test sets. Each person pair contains a source person (including his name and id) and a target person (including his name and id). The test sets were created as follows. We randomly selected 1,000 person pairs from the researcher network to create the first test set.
We used the above people lists to create the other 8 test sets. We created three test sets by randomly selecting person pairs from SW, DM, and IE respectively. With these three test sets, we aim to test association search between persons from the same research community. We created the other five test sets by selecting persons from different research fields.
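The random selection of person pairs described above could be sketched as follows. This is a minimal illustration, not the procedure actually used: the function name, the fixed seed, the choice to treat pairs as ordered (source, target), and the rule that no pair is repeated are all assumptions.

```python
import random

def sample_person_pairs(people, n_pairs, seed=0):
    """Draw n_pairs distinct (source, target) pairs from a list of people.

    Each person is an (id, name) tuple. Source and target always differ.
    The seed, ordering, and no-duplicates rule are illustrative assumptions.
    """
    max_pairs = len(people) * (len(people) - 1)
    if n_pairs > max_pairs:
        raise ValueError("not enough people to form that many distinct pairs")
    rng = random.Random(seed)
    pairs = set()
    while len(pairs) < n_pairs:
        source, target = rng.sample(people, 2)  # two different people
        pairs.add((source, target))
    return sorted(pairs)

# Hypothetical people list; real ids and names come from the data set.
people = [(1, "A. Author"), (2, "B. Author"), (3, "C. Author"), (4, "D. Author")]
pairs = sample_person_pairs(people, n_pairs=5)
for source, target in pairs:
    print(source, "->", target)
```

For a same-community test set, `people` would be one topic's people list; for a cross-field test set, the source and target would instead be drawn from two different lists.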
Aminer Author Name and ID
It consists of a mapping between the names and ids of authors in ArnetMiner. The data is formatted as a two-column list. The first column is the ArnetMiner id and the second column is the author name.
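A minimal sketch of reading such a two-column list is shown below. The column separator is not stated in the description, so the code assumes the id is followed by whitespace (a tab or space) and that the rest of the line is the name; the sample lines and function names are hypothetical.

```python
def parse_author_line(line):
    """Split one line into (author_id, author_name).

    Splits on the first run of whitespace only, so names that contain
    spaces stay intact. The separator is an assumption.
    """
    author_id, name = line.strip().split(None, 1)
    return author_id, name

def load_author_map(lines):
    """Build an id -> name dictionary from an iterable of lines."""
    mapping = {}
    for line in lines:
        if line.strip():  # skip blank lines
            author_id, name = parse_author_line(line)
            mapping[author_id] = name
    return mapping

# Hypothetical sample lines standing in for the real file.
sample = ["1\tJane Doe", "2\tJohn Q. Public", ""]
authors = load_author_map(sample)
print(authors)
```

To read the actual file, an open file object can be passed in place of `sample`, since it also yields one line at a time.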
Aminer Topic Top 5000 Publications and Authors
It consists of the top 5000 publications of each topic in ArnetMiner. The data is provided as 3 XML files, containing data on topics, publications, and authors respectively.
ACTMaps Author Topic
It consists of the topic distribution for each author. The data is organized into 733602 rows, one per author. Each row consists of columns separated by a blank space, and each column is a topic id and weight separated by a ":".
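The row layout described above can be parsed with a few lines of code. The sketch below assumes every column in a row is a `topic_id:weight` pair, that topic ids are integers, and that weights are decimal numbers; the sample row is made up for illustration.

```python
def parse_topic_row(row):
    """Parse one author row into a {topic_id: weight} dictionary.

    Columns are separated by blank space; each column is "topic_id:weight".
    Integer topic ids and float weights are assumptions.
    """
    distribution = {}
    for column in row.split():
        topic_id, weight = column.split(":")
        distribution[int(topic_id)] = float(weight)
    return distribution

# Hypothetical row; real rows come from the ACTMaps file.
row = "16:0.5 107:0.25 162:0.25"
dist = parse_topic_row(row)
top_topic = max(dist, key=dist.get)  # topic with the largest weight
print(dist, top_topic)
```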
Aminer FOAF Data Set
It consists of the FOAF data of authors in arnetminer.org. The data is organized in standard FOAF format.
This data set is used for studying name disambiguation in digital libraries. It contains 110 author names and their disambiguation results (ground truth). Each author name corresponds to a raw file in the "raw-data" folder and an answer file (ground truth) in the "Answer" folder. (The simple version does not contain "citation", "co-affiliation-occur", or "homepage". Refer to our ICDM 2011 paper for the definition of these features.)
Credit to the team led by Professor Jibing Gong and Haopeng Zhang from YSU (Yanshan University) for labeling some of the data.
1. Email
For Email extraction, we labeled a dataset of around 2000 people for training and testing. The name list was selected randomly from AMiner. For each person in this name list, we used Google to search for and extract candidate email addresses. We used contact information in the AMiner system for most of the ground truth, and had human experts (without knowledge of our classification model) label and double-check the data.
2. Gender
For Gender inference, we offer a labeled xlsx file of around 2400 people from the AMiner system, with fields including name, organization, position and homepage.
We release the AMiner dataset for interested researchers. The dataset includes 57037 persons and 42230 affiliations harvested from AMiner. We have made an effort to disambiguate persons with the same name and to eliminate multiple spellings of the same address (some noise may remain). We also collected 722 curricula vitae from the Internet, which can be treated as real-world ground truth.
We have collected data from different social networking sites. The dataset consists of two collections of social networks, where the networks within a collection overlap with each other (i.e., have users corresponding to the same real-world person).
SNS network collection
The SNS data collection consists of five popular online social networking sites: Twitter, LiveJournal, Flickr, Last.fm, and MySpace.
The ground truth mapping of the SNS network collection was originally collected by Perito et al. through the Google Profiles service. Please contact the original owner to obtain the data. Here, we provide a subset of the data for evaluation.
Twitter - Livejournal
Twitter - Flickr
Twitter - Lastfm
Twitter - MySpace
Livejournal - Flickr
Livejournal - Lastfm
Livejournal - MySpace
Flickr - Lastfm
Flickr - MySpace
Lastfm - MySpace
Academia network collection
The Academia data collection consists of three academic or professional social networks: ArnetMiner (AM), Linkedin and Videolectures.
The ground truth for the Academia dataset is obtained through a crowdsourcing service on ArnetMiner. On each researcher's ArnetMiner profile, users can fill in URLs linking to external accounts. This service has been running online for more than one year, and more than 10,000 interlink records have been collected. Here, we provide a subset of the data for evaluation.
AMiner-Linkedin
This data set is generated by linking two large academic graphs: Microsoft Academic Graph (MAG) and AMiner.
The data set is for research purposes only. This version includes 166,192,182 papers from MAG and 154,771,162 papers from AMiner. We generated 64,639,608 linking (matching) relations between the two graphs. In the future, more linking results, such as authors, will be published. The data set can be used as a unified large academic graph for studying citation networks, paper content, and other topics, and can also be used to study the integration of multiple academic graphs.
Name ambiguity has long been viewed as a challenging problem in many applications, such as scientific literature management, people search, and social network analysis. When we search a person name in these systems, many documents (e.g., papers, webpages) containing that person’s name may be returned. Which documents are about the person we care about? Although much research has been conducted, the problem remains largely unsolved, especially with the rapid growth of the people information available on the Web.
SciKG is a rich knowledge graph designed for scientific purposes (currently covering computer science (CS)), consisting of concepts, experts, and papers. The concepts and their relationships are extracted from the ACM Computing Classification System, supplemented with a definition of each concept from sources such as Wikipedia. We further use AMiner to associate top-ranked experts and the most relevant papers with each concept. Each expert has a position, affiliation, research interests, and a link to AMiner (for further information if necessary), and each paper contains meta information such as title, authors, abstract, publication venue, and year.
130,750 scholars, 343,746 scholarly articles, 229,937 specialties from 103 conferences
AMiner Knowledge Graph is a structured entity network extracted from AMiner. It is comprised of over 500,00 entities and about 290,000,000 links among them. The knowledge graph can be used as a benchmark to study knowledge graph construction and also used as an external resource for search/recommendation.