2013
Online Social Networks (OSNs) generate a huge volume of user-originated text. Gender classification can serve multiple purposes: commercial organizations can use it for advertising, law enforcement may use it as part of legal investigations, and others may use gender information for social reasons. Here we explore language-independent gender classification. Our approach predicts gender using five color-based features extracted from Twitter profiles (e.g., the background color in a user's profile page). Most other methods for gender prediction are language dependent: they use high-dimensional spaces consisting of unique words extracted from text fields such as postings, user names, and profile descriptions. Our approach is independent of the user's language, efficient, and scalable, while attaining a good level of accuracy. We validate our approach by examining different classifiers over a large dataset of Twitter profiles.
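As a rough illustration of how color-based profile features could be turned into classifier input, the following sketch converts five hex color strings into a fixed-length numeric vector. The abstract does not name the five features or describe their encoding, so the field names in `COLOR_FIELDS`, the sample profile, and the per-channel scaling are all assumptions made for this example.

```python
# Hypothetical field names: the abstract only mentions "five color-based
# features" and gives the background color as one example.
COLOR_FIELDS = ("background", "text", "link", "sidebar_fill", "sidebar_border")


def hex_to_rgb(hex_color):
    """Convert a 6-digit hex color such as 'C0DEED' or '#C0DEED' to (r, g, b)."""
    value = hex_color.lstrip("#")
    if len(value) != 6:
        raise ValueError(f"expected 6 hex digits, got {hex_color!r}")
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


def color_features(profile):
    """Flatten the five color fields into 15 numbers scaled to [0, 1]."""
    features = []
    for field in COLOR_FIELDS:
        r, g, b = hex_to_rgb(profile[field])
        features.extend(channel / 255 for channel in (r, g, b))
    return features


# Made-up profile used only to exercise the functions.
sample_profile = {
    "background": "C0DEED",
    "text": "333333",
    "link": "0084B4",
    "sidebar_fill": "DDEEF6",
    "sidebar_border": "FFFFFF",
}
vector = color_features(sample_profile)
print(len(vector))
```

Because the vector length is fixed and does not depend on any text the user wrote, a representation like this stays small and language independent, which is the property the abstract emphasizes.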
2013 12th International Conference on Machine Learning and Applications, 2013
Online Social Networks (OSNs) provide reliable communication among users from different countries. The volume of text generated by OSNs is huge and highly informative. Gender classification can serve commercial organizations for advertising, law enforcement for legal investigation, and others for social reasons. Here we explore profile characteristics for gender classification on Twitter. Unlike existing approaches to gender classification, which depend heavily on posted text such as tweets, we study the relative strengths of different characteristics extracted from Twitter profiles (e.g., first name and background color in a user's profile page). Our goal is to evaluate profile characteristics with respect to their predictive accuracy and computational complexity. In addition, we provide a novel technique to reduce the number of features of text-based profile characteristics from the order of millions to a few thousand and, in some cases, to only 40 features. We validate our approach by examining different classifiers over a large dataset of Twitter profiles.
The rapid growth of social networks has produced an unprecedented amount of user-generated data, which provides an excellent opportunity for text mining. Authorship analysis, an important part of text mining, attempts to learn about the author of a text through subtle variations in writing style that occur between gender, age, and social groups. Such information has a variety of applications, including advertising and law enforcement. One of the most accessible sources of user-generated data is Twitter, which makes the majority of its user data freely available through its data access API. In this study we seek to identify the gender of users on Twitter using Perceptron and Naïve Bayes with selected 1- through 5-gram features from tweet text. Stream applications of these algorithms were employed for gender prediction to handle the speed and volume of informal tweet traffic. Because informal text, such as tweets, cannot be easily evaluated using traditional dictionary methods, n-gram features were implemented in this study to represent streaming tweets. The large number of 1- through 5-grams requires that only a subset of them be used in gender classification; for this reason, informative n-gram features were chosen using multiple selection algorithms. In the best case, the Naïve Bayes and Perceptron algorithms produced accuracy, balanced accuracy, and F-measure above 99%.
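To make the 1- through 5-gram representation concrete, here is a minimal sketch that enumerates n-grams from a single tweet. The abstract does not say whether the n-grams are over words or characters, nor how tweets were tokenized, so whitespace-split word n-grams and the function name `word_ngrams` are assumptions for illustration only.

```python
from collections import Counter


def word_ngrams(text, n_min=1, n_max=5):
    """Return all word n-grams of length n_min..n_max from a text.

    Tokenization is a plain lowercase whitespace split (an assumption).
    """
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


# Made-up tweet text used only as an example.
grams = word_ngrams("so excited for the game tonight")
counts = Counter(grams)
print(len(grams))
```

Even one six-word tweet yields 20 n-grams here, which hints at why the full feature space grows so quickly and why the study selects only a subset of informative n-grams.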
2018
This paper describes our participation in the PAN 2018 Author Profiling shared task. Given texts and images from Twitter authors, the goal is to estimate their gender. We considered all the languages (Arabic, English and Spanish) and all the prediction types (only from texts, only from images and combined). The final submitted system is a stacked classifier composed of two main parts. The first one, based on previous PAN Author Profiling editions, concerns gender prediction from texts. It consists of a pipeline of preprocessing, word n-grams from 1 to 2, TF-IDF with sublinear weighting, Linear Support Vector classification and probability calibration. The second part is formed by different layers of classifiers used for gender estimation from images: four base classifiers (object detection, face recognition, colour histograms, local binary patterns) in the first layer, a meta classifier in the second layer and an aggregation classifier as third layer. Finally, the two ge...
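The "TF-IDF with sublinear weighting" step in the text pipeline described above can be sketched in plain Python. Sublinear weighting replaces a raw term count tf with 1 + ln(tf), so repeated terms count for more, but with diminishing returns. The inverse document frequency variant used here, ln(N / df) + 1, is an assumption; the abstract does not specify the formula, the preprocessing, or the library used.

```python
import math
from collections import Counter


def sublinear_tfidf(docs):
    """Weight terms as (1 + ln(tf)) * (ln(N / df) + 1).

    docs is a list of token lists; returns one {term: weight} dict per
    document. The idf form is an assumption made for this sketch.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({
            term: (1 + math.log(count)) * (math.log(n_docs / df[term]) + 1)
            for term, count in tf.items()
        })
    return weighted


# Toy corpus of two tokenized documents, made up for the example.
toy_docs = [["a", "a", "b"], ["b", "c"]]
weights = sublinear_tfidf(toy_docs)
print(weights[0])
```

In the toy corpus, the term "b" occurs in every document, so its idf factor is 1 and it keeps a weight of 1.0, while "a" is boosted both for occurring twice and for being specific to one document.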
VAWKUM Transactions on Computer Sciences, 2018
This paper describes the accuracy of various algorithms for classifying text by author gender. We examined the knowledge that can be extracted from a corpus drawn from Twitter in terms of gender identity. By comparing algorithms on different feature sets, we established a feature set of 20 distinct arguments that increases the correctness of gender identification across Twitter. We report the accuracies of three algorithms, obtained using two approaches applied to two gender classes (male and female), including a model in which many features are reduced using a powerset transformation.
ArXiv, 2019
Author profiling is the characterization of an author through some key attributes such as gender, age, and language. In this paper, an RNN model with attention (RNNwA) is proposed to predict the gender of a Twitter user from their tweets. Both word-level and tweet-level attention are used to learn "where to look". This model is improved by concatenating LSA-reduced n-gram features with the learned neural representation of a user. Both models are tested on three languages: English, Spanish, and Arabic. The improved version of the proposed model (RNNwA + n-gram) achieves state-of-the-art performance on English and has competitive results on Spanish and Arabic.
Unstructured textual data from online profiles is often used in conjunction with other user metadata to mine, in a supervised fashion, the latent demographic attributes of social media users (e.g. age, gender, occupation). Supervised methods, however, require labeled training data, which are often expensive to generate, and thus it would be attractive to re-use models across different domains and groups, i.e. training on a labeled dataset in order to mine the same latent attributes in those datasets for which training labels are missing. However, online conversations are often influenced by a myriad of topics and other factors, such as external events, and thus not all the features generated from this kind of data may perform well in a cross-domain setting. Here we study which of the features commonly found in public user profiles are portable across domains. As a benchmark we focus on the very common task of detecting the gender of Twitter users from their public profile information: tweets, screen name, and profile picture. Our approach, based on a boosted stacked classifier, outperforms the state of the art in the task. Using data from two very different samples of Twitter users (one drawn from the public random stream and one about a recent social movement), we show that screen name and profile picture generalize across domains well, while text does not. Social media platforms have become attractive sources of data for computational approaches to social modeling, mainly due to their rapid growth and their surprising ability to offer insight into real-world phenomena. Cross-domain user mining methods can help computational social science research by providing a richer and more accurate context to social phenomena.
Frontiers in Artificial Intelligence and Applications
Social media offers an invaluable wealth of data for understanding what is taking place in our society. However, using social media data to understand phenomena occurring in populations is difficult because the data we obtain is not representative, and the tools we use to analyze it introduce hidden biases on characteristics such as gender or age. For instance, in France in 2021 women represent 51.6% of the population [1], whereas on Twitter they represent only 33.5% of French users [2]. With such a difference between social network user demographics and the real population, detecting gender or age before going into a deeper analysis becomes a priority. In this paper we present the results of ongoing work on a comparative study of three different methods to estimate gender. Based on the results of the comparative study, we evaluate avenues for future work.
This paper addresses the task of user gender classification in social media, with an application to Twitter. The approach automatically predicts gender by leveraging observable information such as tweet behavior, the linguistic content of the user's Twitter feed, and the celebrities followed by the user. The paper first evaluates linguistic content-based features using the LIWC dictionary and popular neighborhood features using Wikipedia and Freebase. It then combines both feature types, which yields a significant increase in the accuracy of gender prediction. Results show that rich linguistic features combined with popular neighborhood features prove valuable and promising for additional user classification needs.
With the rapid growth of web-based social networking technologies in recent years, author identification and analysis have proven increasingly useful. Authorship analysis provides information about a document's author, often including the author's gender. Men and women are known to write in distinctly different ways, and these differences can be successfully used to make a gender prediction. Making use of these distinctions between male and female authors, this study demonstrates the use of a simple stream-based neural network to automatically discriminate gender on manually labeled tweets from the Twitter social network. This neural network, the Modified Balanced Winnow, was employed in two ways: the effectiveness of data stream mining was initially examined with an extensive list of n-gram features, and feature selection techniques were then evaluated by drastically reducing the feature list using WEKA's attribute selection algorithms. This study demonstrates the effectiveness of the stream mining approach, achieving an accuracy of 82.48%, a 20.81% increase above the baseline prediction. Using feature selection methods improved the results by an additional 16.03%, to an accuracy of 98.51%.
[...] characters, there is less content available to predict an author's gender. On the other hand, the character limit for tweets means that users must fit whatever they want to say into a smaller space. This has the effect of concentrating the user's writing style, increasing the necessity of using the characteristic text styles prevalent in social media. Studies have shown that women tend to make more frequent use of emotionally charged language, adjectives, adverbs, and apologetic language when compared with men, and that men tend to use more aggressive, authoritative language.
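The accuracy figures quoted in the abstract above can be checked with a short calculation. This sketch assumes that the reported "increases" are absolute percentage-point differences, not relative gains; the abstract does not state this, although the second figure is consistent with it. The implied baseline is derived here and is not reported in the source.

```python
from decimal import Decimal

# Figures quoted in the abstract, in percent.
stream_accuracy = Decimal("82.48")
gain_over_baseline = Decimal("20.81")
gain_from_selection = Decimal("16.03")

# Assumption: both gains are percentage-point differences.
implied_baseline = stream_accuracy - gain_over_baseline
accuracy_with_selection = stream_accuracy + gain_from_selection

print(implied_baseline)
print(accuracy_with_selection)
```

Under that reading, the sum of the stream-mining accuracy and the feature-selection gain matches the reported 98.51%, and the baseline prediction would sit at 61.67%.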