In this system, we leverage three major approaches for gender inference and present a voting model that combines their results into the final decision.
Face Recognition (FR) uses the name and affiliation as query terms in Google Images and takes the first returned picture as the user's portrait. Using the APIs provided by Face++ for human face recognition, we can then obtain the gender of the person in the portrait. When no face is detected in the portrait, FR returns "UNKNOWN" and randomly assigns a gender value. FR corresponds to the "Google Image" item in the interface.
Facebook Generated Name List (FGNL) was proposed in [Tang, 2011] and is used as one of the baselines in Section 3.c of [Gu, 2016]. It collects a list of common names and their corresponding gender values from Facebook. If the user's name matches an entry in the list, FGNL returns that entry's gender value; otherwise, it returns "UNKNOWN". FGNL corresponds to the "Common Name" item in the interface.
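The lookup described above can be sketched as follows. This is a minimal illustration, not the actual FGNL code: the two list entries, the function name `fgnl_predict`, and the choice to match case-insensitively on the first token of the full name are all assumptions made for the example.

```python
UNKNOWN = "UNKNOWN"

# Hypothetical entries; the real list is collected from Facebook [Tang, 2011].
NAME_TO_GENDER = {
    "nancy": "female",
    "john": "male",
}


def fgnl_predict(full_name, name_list=NAME_TO_GENDER):
    """Return the listed gender for a name, or UNKNOWN if it is not covered."""
    parts = full_name.strip().lower().split()
    if not parts:
        return UNKNOWN
    # Assumption: the first token is the given name and is matched exactly.
    return name_list.get(parts[0], UNKNOWN)


print(fgnl_predict("Nancy Smith"))
print(fgnl_predict("Xiaotao Gu"))
```

A name that is missing from the list falls through to "UNKNOWN", which is what limits the recall of this method.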
Web Based Gender Predictor (WebGP) is our supervised extraction framework, presented in the "Approach" section of [Gu, 2016]. In brief, we automatically construct effective queries for search engines such as Google to fetch relevant snippets that may contain gender information about the target user. The framework can be paired with any classification model for gender inference, and outperforms the state of the art even with the simplest model. In this system, we use the SVM implementation from the sklearn package. All labeled data used to train the WebGP model can be found here. WebGP corresponds to the "Google Snippet" item in the interface.
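As a rough illustration of pairing snippets with an SVM from sklearn, the sketch below trains a classifier on a handful of toy snippets. The source only states that an sklearn SVM is used; the TF-IDF features, the linear kernel, and the example snippets and labels are assumptions made for this sketch and are not the features or data from [Gu, 2016].

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy snippets and labels, invented for illustration only.
snippets = [
    "She is a professor and her research covers data mining",
    "He is a professor and his research covers databases",
    "Her homepage lists her students",
    "His homepage lists his students",
]
labels = ["female", "male", "female", "male"]

# Assumption: bag-of-words TF-IDF features fed to a linear-kernel SVM.
model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
model.fit(snippets, labels)

# Classify a new snippet fetched for a target user.
prediction = model.predict(["She received her PhD in 2010"])[0]
print(prediction)
```

Because the pipeline only needs `fit` and `predict`, the SVM could be replaced by any other sklearn classifier, which matches the claim that the framework is model-agnostic.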
The Vote model (Final) integrates the inference results from all three methods following the principle of "one man, one vote", and selects the gender value with the most votes as the final result. The intuition is natural: each approach is good at predicting users with certain features but also has limitations. For example, FGNL includes the most common names in Western countries that have a strong gender association (e.g. "Nancy" is usually a girl's name), and is thus very precise for listed names. However, its recall is limited by the coverage of the list, and it can hardly match names from countries such as Korea and Japan. The straightforward solution would be to train another classifier that takes the prediction from each method, learns a "weight" or "credibility" for it, and gives a "weighted" prediction. Here, we simplify this to the voting model, which means we trust each method equally. Experiments show that this voting model works well in improving overall performance.
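A minimal sketch of such an equal-weight vote is shown below. The function name `vote` and the dictionary layout are hypothetical, and two behaviours are assumptions that the description above does not specify: "UNKNOWN" predictions abstain, and a tie yields "UNKNOWN".

```python
from collections import Counter

UNKNOWN = "UNKNOWN"


def vote(predictions):
    """Return the gender value with the most votes among the methods."""
    # Assumption: a method that returned UNKNOWN does not cast a vote.
    counts = Counter(p for p in predictions.values() if p != UNKNOWN)
    if not counts:
        return UNKNOWN
    ranked = counts.most_common(2)
    # Assumption: a tie between the top two values is reported as UNKNOWN.
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return UNKNOWN
    return ranked[0][0]


final = vote({"FR": "female", "FGNL": "UNKNOWN", "WebGP": "female"})
print(final)
```

The weighted alternative mentioned above would replace the unit vote of each method with a learned weight; the equal-weight version is the special case in which every weight is 1.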
| [Tang, 2011] | Tang, Cong, Keith Ross, Nitesh Saxena, and Ruichuan Chen. "What’s in a name: a study of names, gender inference, and gender behavior in facebook." In International Conference on Database Systems for Advanced Applications, pp. 344-356. Springer Berlin Heidelberg, 2011. |
| [Gu, 2016] | Gu, Xiaotao, Hong Yang, Jie Tang, and Jing Zhang. "Web user profiling using data redundancy." In Advances in Social Networks Analysis and Mining (ASONAM), 2016 IEEE/ACM International Conference on, pp. 358-365. IEEE, 2016. |