Term Filtering with Bounded Error
In this paper, we consider a novel problem, term filtering with bounded error, which reduces the term (feature) space by eliminating terms with no information loss or with bounded information loss. Unlike existing work, the resulting term space provides a complete view of the original term space. It also lets us answer several important questions, such as: 1) how different terms interact with each other and 2) how the filtered terms can be represented by the remaining terms.
We perform a theoretical investigation of the term filtering problem and link it to the Geometric Covering By Discs problem, and prove its NP-hardness. We present two novel approaches for both lossless and lossy term filtering with bounds on the introduced error. Experimental results on multiple text mining tasks validate the effectiveness of the proposed approaches.
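To make the lossless case concrete, the following is a minimal Java sketch, not the implementation shipped in termfilt.jar. It assumes that two terms can be merged without loss when they occur in exactly the same set of documents (binary occurrence, ignoring frequency). The class name LosslessFilterSketch and the methods groupIdenticalTerms and keptTerms are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Illustrative sketch only: merges terms with identical document-occurrence sets. */
class LosslessFilterSketch {

    /**
     * Groups terms by the set of document ids in which they occur.
     * Assumption: binary occurrence is the notion of "identical" used here.
     */
    static Map<TreeSet<Integer>, List<String>> groupIdenticalTerms(List<List<String>> docs) {
        // term -> ids of the documents that contain it
        Map<String, TreeSet<Integer>> occurrences = new LinkedHashMap<>();
        for (int d = 0; d < docs.size(); d++) {
            for (String term : docs.get(d)) {
                occurrences.computeIfAbsent(term, k -> new TreeSet<>()).add(d);
            }
        }
        // occurrence set -> all terms sharing that exact set
        Map<TreeSet<Integer>, List<String>> groups = new LinkedHashMap<>();
        for (Map.Entry<String, TreeSet<Integer>> e : occurrences.entrySet()) {
            groups.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        return groups;
    }

    /** Keeps the first term of each group as its representative; the rest are filtered. */
    static List<String> keptTerms(List<List<String>> docs) {
        List<String> kept = new ArrayList<>();
        for (List<String> group : groupIdenticalTerms(docs).values()) {
            kept.add(group.get(0));
        }
        return kept;
    }
}
```

In this sketch every filtered term is represented exactly by the kept term of its group, so the original term-document occurrence matrix can be rebuilt from the groups.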
Version 0.1, May 1st, 2010.
(1) Usage for Lossless term filter
java -jar termfilt.jar lf in_file_path in_file_charset out_file_path out_file_charset
e.g., java -jar termfilt.jar lf pubs.txt utf-8 pubs-lf.txt utf-8
(2) Usage for Greedy lossy filter
java -jar termfilt.jar glf in_file_path in_file_charset out_file_path out_file_charset error_V error_D
e.g., java -jar termfilt.jar glf pubs.txt utf-8 pubs-glf.txt utf-8 1.0 1.0
(3) Usage for Multithreaded lossy filter
java -jar termfilt.jar mlf in_file_path in_file_charset out_file_path out_file_charset error_V error_D num_threads
e.g., java -jar termfilt.jar mlf pubs.txt utf-8 pubs-mlf.txt utf-8 1.0 1.0 8
(4) Usage for LSH-based lossy filter
java -jar termfilt.jar llf in_file_path in_file_charset out_file_path out_file_charset error_V error_D num_lines num_buckets
e.g., java -jar termfilt.jar llf pubs.txt utf-8 pubs-llf.txt utf-8 1.5 1.5 10 100
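For intuition about the lossy filters, here is a rough Java sketch of a greedy covering heuristic in the spirit of the Geometric Covering By Discs formulation. It is a sketch under stated assumptions, not the algorithm used by the glf, mlf, or llf modes: terms are assumed to be numeric vectors, the distance is assumed to be Euclidean, and a single radius epsilon stands in for the error bounds. How error_V and error_D map onto such a radius is not specified in this file. The names GreedyCoverSketch, distance, and greedyCover are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only: greedy covering of term vectors by discs of radius epsilon. */
class GreedyCoverSketch {

    /** Euclidean distance between two vectors of equal length (an assumed metric). */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /**
     * Repeatedly picks the uncovered term whose disc of radius epsilon contains
     * the most uncovered terms, and lets it represent all of them.
     *
     * @return kept term index -> indices of the terms it represents (itself included)
     */
    static Map<Integer, List<Integer>> greedyCover(double[][] vectors, double epsilon) {
        int n = vectors.length;
        boolean[] covered = new boolean[n];
        int remaining = n;
        Map<Integer, List<Integer>> result = new LinkedHashMap<>();
        while (remaining > 0) {
            int best = -1;
            List<Integer> bestCover = new ArrayList<>();
            for (int c = 0; c < n; c++) {
                if (covered[c]) continue;
                List<Integer> cover = new ArrayList<>();
                for (int t = 0; t < n; t++) {
                    if (!covered[t] && distance(vectors[c], vectors[t]) <= epsilon) {
                        cover.add(t);
                    }
                }
                // An uncovered candidate always covers itself, so a best candidate exists.
                if (cover.size() > bestCover.size()) {
                    best = c;
                    bestCover = cover;
                }
            }
            for (int t : bestCover) covered[t] = true;
            remaining -= bestCover.size();
            result.put(best, bestCover);
        }
        return result;
    }
}
```

With epsilon set to 0, only terms with identical vectors are merged, which corresponds to the lossless case in this sketch. The candidate scan is quadratic in the number of terms per round, which hints at why multithreaded and LSH-based variants are useful for large vocabularies.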
We conduct experiments on two real-world data sets: ArnetMiner and 20-Newsgroups. ArnetMiner is an academic publication collection that contains 10,768 papers and has a vocabulary of 8,212 terms. 20-Newsgroups is a widely used data collection that consists of 18,774 postings from 20 Usenet newsgroups and has a vocabulary of 61,188 terms. Preprocessing includes stopword filtering and word stemming.
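The preprocessing step can be sketched in Java as follows. This is only an illustration: the stopword list is a tiny made-up sample, and naiveStem is a crude suffix stripper standing in for a real stemmer (the stemmer actually used in the experiments is not named here). The names PreprocessSketch, STOPWORDS, naiveStem, and preprocess are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative sketch only: stopword filtering followed by naive stemming. */
class PreprocessSketch {

    // Tiny sample list for illustration; a real list would be much longer.
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
            "a", "an", "the", "of", "and", "in", "on", "for", "is", "are", "to", "with"));

    /** Crude suffix stripping; a stand-in for a proper stemming algorithm. */
    static String naiveStem(String word) {
        if (word.length() > 5 && word.endsWith("ing")) {
            return word.substring(0, word.length() - 3);
        }
        if (word.length() > 4 && word.endsWith("ed")) {
            return word.substring(0, word.length() - 2);
        }
        if (word.length() > 3 && word.endsWith("s") && !word.endsWith("ss")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    /** Lowercases, splits on non-alphanumeric characters, drops stopwords, stems the rest. */
    static List<String> preprocess(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty() || STOPWORDS.contains(token)) continue;
            terms.add(naiveStem(token));
        }
        return terms;
    }
}
```

For example, preprocess("Filtering the terms of documents") yields the terms filter, term, and document.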