2010, Pattern Recognition Letters
Text classification is enhanced using a modified bag-of-words approach that incorporates lexical dependency patterns and a pruning strategy. By adding grammatical relations between words as features and removing less informative ones, the proposed method significantly outperforms traditional text classification techniques on multiple datasets. Experimental results demonstrate the effectiveness of using both word pruning and dependency features, paving the way for more accurate document categorization.
Pattern Analysis and Applications, 2010
In this study, a comprehensive analysis of the lexical dependency and pruning concepts for the text classification problem is presented. Dependencies are included in the feature vector as an extension to the standard bag-of-words approach. The pruning process filters features with low frequencies so that fewer but more informative features remain in the solution vector. The pruning levels for words, dependencies, and dependency combinations for different datasets are analyzed in detail. The main motivation in this work is to make use of dependencies and pruning efficiently in text classification and to achieve more successful results using much smaller feature vector sizes. Three different datasets were used in the experiments and statistically significant improvements for most of the proposed approaches were obtained.
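As a rough, non-authoritative sketch of the idea (the paper's exact feature encoding and pruning thresholds are not specified here), dependency features can be appended to the bag-of-words and pruned with a simple frequency cutoff; spaCy and scikit-learn are assumed:

```python
# Hedged sketch: extend bag-of-words with head_relation_dependent triples,
# then prune rare features via a frequency cutoff (min_df).
# Assumes spaCy with the "en_core_web_sm" model; feature naming is illustrative.
import spacy
from sklearn.feature_extraction.text import CountVectorizer

nlp = spacy.load("en_core_web_sm")

def words_plus_dependencies(text):
    doc = nlp(text)
    tokens = [t.lower_ for t in doc if t.is_alpha]
    deps = [f"{t.head.lower_}_{t.dep_}_{t.lower_}" for t in doc if t.dep_ != "ROOT"]
    return tokens + deps

# min_df acts as the pruning level: features occurring in fewer than
# 3 documents are dropped from the feature vector.
vectorizer = CountVectorizer(analyzer=words_plus_dependencies, min_df=3)
```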
We present new methods for pruning and enhancing itemsets for text classification via association rule mining. Pruning methods are based on dependency syntax, and enhancing methods are based on replacing words with their hypernyms of various orders. We discuss the impact of these methods compared to pruning based on the tf-idf rank of words.
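A minimal sketch of the enhancing step, assuming WordNet via NLTK and always following the first sense and first hypernym link (a simplification of "hypernyms of various orders"):

```python
# Hedged sketch of hypernym enhancement: replace each noun with its k-th order
# WordNet hypernym before mining itemsets. Following only the first sense and
# first hypernym link is a simplification for illustration.
from nltk.corpus import wordnet as wn

def hypernym_of(word, order=1):
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        return word                       # keep words WordNet does not know
    synset = synsets[0]
    for _ in range(order):
        hypernyms = synset.hypernyms()
        if not hypernyms:
            break                         # reached the top of the hierarchy
        synset = hypernyms[0]
    return synset.lemma_names()[0]

print(hypernym_of("dog", order=1))        # e.g. 'canine'
```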
2009
Traditional machine learning methods only consider relationships between feature values within individual data instances while disregarding the dependencies that link features across instances. In this work, we develop a general approach to supervised learning by leveraging higher-order dependencies between features. We introduce a novel Bayesian framework for classification named Higher Order Naive Bayes (HONB). Unlike approaches that assume data instances are independent, HONB leverages co-occurrence relations between feature values across different instances. Additionally, we generalize our framework by developing a novel data-driven space transformation that allows any classifier operating in vector spaces to take advantage of these higher-order co-occurrence relations. Results obtained on several benchmark text corpora demonstrate that higher-order approaches achieve significant improvements in classification accuracy over the baseline (first-order) methods. Key words:...
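As an illustration of the higher-order structure (not HONB's actual statistics), paths of length two in the feature co-occurrence graph can be read off a binary document-term matrix:

```python
# Hedged sketch of higher-order co-occurrence: features linked through shared
# neighbors in the co-occurrence graph built from a binary doc-term matrix X.
import numpy as np

X = np.array([[1, 1, 0],       # doc 1 contains features 0 and 1
              [0, 1, 1],       # doc 2 contains features 1 and 2
              [1, 0, 1]])      # doc 3 contains features 0 and 2

A = (X.T @ X > 0).astype(int)  # first-order: features co-occurring in a doc
np.fill_diagonal(A, 0)

A2 = ((A @ A) > 0).astype(int) # second-order: linked via a shared neighbor
np.fill_diagonal(A2, 0)
print(A2)                      # e.g. features 0 and 1 also connect through 2
```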
arXiv, 2017
A substantial amount of research has been carried out on machine learning algorithms that account for term dependence in text classification. These algorithms offer acceptable performance in most cases, but they come at a substantial cost: they require significantly greater resources to operate. This paper argues that the higher costs of these algorithms are not justified by their performance on text classification problems. To support this conjecture, the performance of one of the best dependence models is compared to several well-established algorithms in text classification. A collection of datasets has been designed specifically to reflect the diversity of text data found in real-world applications. The results show that even one of the best term dependence models performs only decently at best when compared to independence models. Coupled with their substantially greater hardware requirements...
International Journal of Computer Applications, 2015
Text categorization is the task of assigning text or documents to pre-specified classes or categories. To classify documents well, text-based learning needs to understand context: just as humans judge the relevance of a text through the context associated with it, machine learning benefits from incorporating context information with the text for better classification accuracy. This can be achieved by using semantic information, such as part-of-speech tags, associated with the text. The aim of this experimentation is therefore to utilize this semantic information to select features that may provide better classification results. Different datasets are constructed, each with a different collection of features, to gain an understanding of the best representation for text data depending on the type of classifier.
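A minimal sketch of POS-augmented features, assuming NLTK's tokenizer and tagger are installed; the feature naming is illustrative:

```python
# Hedged sketch: augment each token with its part-of-speech tag so the learner
# can separate, e.g., "book" as a noun from "book" as a verb. Assumes the NLTK
# punkt and averaged_perceptron_tagger resources have been downloaded.
import nltk

def pos_tagged_tokens(text):
    tokens = nltk.word_tokenize(text)
    return [f"{w.lower()}_{tag}" for w, tag in nltk.pos_tag(tokens)]

print(pos_tagged_tokens("Book a flight and read a book"))
# e.g. ['book_VB', 'a_DT', 'flight_NN', 'and_CC', 'read_VB', 'a_DT', 'book_NN']
```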
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - ACL-IJCNLP '09, 2009
In text categorization, feature selection (FS) is a strategy that aims at making text classifiers more efficient and accurate. However, when dealing with a new task, it is still difficult to quickly select a suitable method from the many FS methods proposed in previous studies. In this paper, we propose a theoretical framework of FS methods based on two basic measurements: frequency measurement and ratio measurement. Six popular FS methods are then discussed in detail under this framework. Moreover, guided by our theoretical analysis, we propose a novel method called weighed frequency and odds (WFO) that combines the two measurements with trained weights. Experimental results on datasets from both topic-based and sentiment classification tasks show that this new method is robust across different tasks and numbers of selected features.
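One common reading of the WFO combination (with lam the trained weight; the paper's exact probability estimator may differ) can be sketched as:

```python
# Hedged sketch of WFO-style scoring: the frequency measurement P(t|c)
# combined with the odds-style ratio measurement log(P(t|c)/P(t|~c)),
# weighted by lam in [0, 1].
import math

def wfo_score(p_t_c, p_t_not_c, lam=0.5, eps=1e-9):
    if p_t_c <= p_t_not_c:
        return 0.0  # terms more typical of other classes get no credit
    ratio = math.log((p_t_c + eps) / (p_t_not_c + eps))
    return (p_t_c ** lam) * (ratio ** (1.0 - lam))

print(wfo_score(p_t_c=0.30, p_t_not_c=0.05, lam=0.5))
```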
Book Publisher International (a part of SCIENCEDOMAIN International), 2021
In this paper we present automated text classification in text mining, which is gaining greater relevance in various fields every day. Text mining primarily focuses on developing text classification systems able to automatically classify huge volumes of documents comprising unstructured and semi-structured data. The process of retrieval, classification, and summarization simplifies the extraction of information by the user. The search for the ideal text classifier, feature generator, and dominant feature selection technique has received attention from researchers in areas as diverse as information retrieval, machine learning, and the theory of algorithms. To automatically classify and discover patterns in different types of documents [1], techniques like Machine Learning, Natural Language Processing (NLP), and Data Mining are applied together. In this paper we review some effective feature selection research and present the results in table form.
2010
Current text classification methods are mostly based on a supervised approach, which requires a large number of examples to build accurate models. Unfortunately, in several tasks training sets are extremely small and their generation is very expensive. To tackle this problem, in this paper we propose a new text classification method that takes advantage of the information embedded in the test set itself. This method rests on the idea that similar documents should belong to the same category. In particular, it classifies documents by considering not only their own content but also the categories assigned to other similar documents from the same test set. Experimental results on four datasets of different sizes are encouraging. They indicate that the proposed method is appropriate for small training sets, where it can significantly outperform traditional approaches such as Naive Bayes and Support Vector Machines.
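A hedged sketch of the underlying idea, blending a base classifier's class probabilities with the labels currently assigned to the most similar test documents; all names and the weight alpha are illustrative, not the paper's exact procedure:

```python
# Hedged sketch: iteratively mix base probabilities P with the average class
# distribution of each document's k nearest neighbors in the same test set.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def refine_predictions(P, X_test, alpha=0.5, k=5, iterations=3):
    """P: (n_docs, n_classes) base probabilities; X_test: test feature matrix."""
    S = cosine_similarity(X_test)
    np.fill_diagonal(S, 0.0)                       # ignore self-similarity
    neighbors = np.argsort(-S, axis=1)[:, :k]      # k most similar documents
    Q = P.copy()
    for _ in range(iterations):
        neighbor_votes = Q[neighbors].mean(axis=1) # average neighbor distribution
        Q = alpha * P + (1 - alpha) * neighbor_votes
    return Q.argmax(axis=1)
```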
International Journal of Advance Research, Ideas and Innovations in Technology, 2021
Cyberspace has elevated business insights and created a virtual space to store all forms of information online. Due to the rapid development of the online world, the usage of digital documents has increased, because it is convenient for users to share, update, or keep track of records in one place without losing data. However, maintaining massive amounts of data is poorly suited to optimal decision-making and is extremely expensive in storage, processing, and collection. There is a significant chance that human annotators make errors while classifying data because of distraction, monotony, fatigue, or failure to meet the requirements. When text classification uses machine learning approaches, the process executes with fewer mistakes and greater accuracy. The main goal of this review paper is to highlight and explain the role of different machine learning methodologies in text classification. Concurrently, this paper describes the challenges faced by different machine learning techniques and text representations. Furthermore, this review provides an extensive survey of how various machine learning techniques, such as Neural Networks, Naive Bayes, Logistic Regression, Random Forest, Decision Trees, and Support Vector Machines (SVM), are implemented in text classification.
International Journal of Science and Research (IJSR)
The rapid growth of the World Wide Web has led to an explosive growth of information. As most information is stored in the form of text, text mining has gained paramount importance. With the high availability of information from diverse sources, automatic categorization of documents has become a vital method for managing and organizing vast amounts of information and for knowledge discovery. Text classification is the task of assigning predefined categories to documents. Its major challenges are classifier accuracy and the high dimensionality of the feature space. These problems can be addressed using feature selection (FS): the process of identifying a subset of the most useful features from the original set. Feature selection makes text document classifiers more efficient and accurate, and FS methods offer a way of reducing computation time, improving prediction performance, and better understanding the data. This paper surveys text classification, several approaches to it, feature selection methods, and applications of text classification.
In recent years, there has been exponential growth in the number of complex documents and texts that require a deeper understanding of machine learning methods to accurately classify text in many applications. Many machine learning approaches have achieved strong results in natural language processing. The success of these learning algorithms relies on their capacity to capture complex models and non-linear relationships within data. However, finding suitable structures, architectures, and techniques for text classification remains a challenge for researchers. In this paper, a brief overview of text classification algorithms is given. The overview covers different text feature extraction methods, dimensionality reduction methods, existing algorithms and techniques, and evaluation methods. Finally, the limitations of each technique and their application to real-world problems are discussed.
2018
Textual data is high-dimensional: the number of features exceeds the number of samples, which increases the amount of noise and the number of irrelevant features. At this point, dimensionality reduction becomes necessary. Feature selection is one class of dimensionality reduction techniques and has been an indispensable component of classification. In this paper we present the three feature selection approaches: filter, wrapper, and embedded. Their aims, advantages, and disadvantages are briefly explained. This study also reviews several significant studies of each feature selection approach for text classification. Based on these studies, we found that the wrapper approach is less used by researchers, since it is prone to over-fitting and local optima in text classification, while the filter and embedded approaches have attracted a substantial amount of research. However, in the filter approach, classification accuracy cannot be guaranteed because it does not incor...
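In scikit-learn terms, the three approaches can be sketched as follows; the choices of estimators and of k are illustrative, not prescriptions from the paper:

```python
# Hedged sketch contrasting the three feature selection approaches:
# filter (score features independently of any model), wrapper (search guided
# by a model's performance), embedded (selection as a by-product of training).
from sklearn.feature_selection import SelectKBest, chi2, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

filter_fs   = SelectKBest(chi2, k=1000)                      # filter
wrapper_fs  = RFE(LogisticRegression(max_iter=1000),
                  n_features_to_select=1000)                 # wrapper
embedded_fs = SelectFromModel(LinearSVC(C=0.5, penalty="l1",
                                        dual=False))         # embedded (L1)
```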
KSII Transactions on Internet and Information Systems, 2019
The performance of text classification is highly dependent on the feature selection method. Usually, two tasks are performed when a feature selection method is applied to construct a feature set: 1) assign a score to each feature, and 2) select the top-N features. In existing filter-based feature selection methods, the selection of the top-N features is biased by their discriminative power and by the empirical process used to determine the value of N. In order to improve text classification performance by producing a more informative feature set, we present an approach based on a potent representation learning technique, namely the DBN (Deep Belief Network). This algorithm learns a semantic representation of documents and formulates feature vectors from it. The number of nodes, iterations, and hidden layers are the main DBN parameters, which can be tuned to improve the classifier's performance. The experimental results indicate the effectiveness of the proposed method in increasing classification performance and aiding developers in making effective decisions in certain domains.
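scikit-learn ships no DBN, but the flavor of the pipeline (layer-wise feature learning feeding a classifier) can be approximated with stacked RBMs; layer sizes and learning rates below are illustrative:

```python
# Hedged sketch: stacked RBMs as a stand-in for DBN pretraining, feeding a
# logistic-regression classifier. Real DBN training (with fine-tuning) differs.
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

model = Pipeline([
    ("rbm1", BernoulliRBM(n_components=512, learning_rate=0.05, n_iter=20)),
    ("rbm2", BernoulliRBM(n_components=128, learning_rate=0.05, n_iter=20)),
    ("clf",  LogisticRegression(max_iter=1000)),
])
# model.fit(X_train, y_train) expects binary or [0, 1]-scaled document vectors.
```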
Web Intelligence, 2020
Text classification (a.k.a. text categorization) is an effective and efficient technology for information organization and management. As the explosion of information resources on the Web and corporate intranets continues, it has become more and more important and has attracted wide attention from many different research fields. In the literature, many feature selection methods and classification algorithms have been proposed, and the technology has important applications in the real world. However, the dramatic increase in the availability of massive text data from various sources creates a number of issues and challenges for text classification, such as scalability. The purpose of this report is to give an overview of existing text classification technologies for building more reliable text classification applications, and to propose a research direction for addressing the challenging problems in text mining.
This paper proposes a method for feature selection in text categorization. The task is performed in two steps: first a relevance analysis, then a redundancy analysis. For this purpose, a range of similarity measures is adopted and converted into symmetric measures using several aggregation operators, which ensures that the similarity between two words is independent of the order in which they are considered. Several experiments over four corpora lead to the conclusion that this method achieves good results.
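A minimal sketch of the symmetrization step; the containment measure below is purely illustrative, not one of the paper's similarity measures:

```python
# Hedged sketch: make an asymmetric word-similarity measure symmetric with
# different aggregation operators, so sim(a, b) == sim(b, a) by construction.
def symmetrize(sim, a, b, op=min):
    return op(sim(a, b), sim(b, a))

def containment(a, b):             # fraction of a's characters found in b
    return sum(ch in b for ch in a) / len(a)

# min, max, and the arithmetic mean are three possible aggregation operators.
for op in (min, max, lambda x, y: (x + y) / 2):
    print(symmetrize(containment, "word", "words", op=op))
```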
International Journal of Control and Automation, 2019
Text classification data preprocessing regularly uses bag-of-words (BoW) representations. In a large dataset, BoW produces many very large, high-dimensional vectors. The authors introduce a new data preprocessing method for feature reduction in short-text classification, namely NDTMD. It reduces the features of the dataset using BoW and word embeddings (WE), and can address the weaknesses of BoW. The experiment consisted of four steps: 1) 5 datasets were selected from the data science community website Kaggle; 2) the new method was compared with 5 commonly used data preprocessing methods, 4 of which used the state of the art as their baseline while the other used BoW; the new preprocessing method used feature reduction of BoW to produce a new document term matrix dataset (NDTMD); 3) the authors generated classification models with 3 classifiers: support vector machine, logistic regression, and convolutional neural network; and 4) the classifiers were applied to each preprocessed dataset and evaluated using feature reduction rate (FRR), accuracy, kappa, and running time. The results showed that classification models performed best when using NDTMD; in particular, the classifiers achieved the highest accuracy and kappa with the lowest running time. The new preprocessing method can be used for short-text classification and can also be applied to real social-media data.
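The abstract does not spell out the NDTMD construction; one plausible sketch of reducing BoW dimensionality with word embeddings is to cluster the vocabulary by embedding similarity and merge BoW columns cluster-wise:

```python
# Hedged sketch (not the paper's method): cluster word embeddings and sum the
# BoW counts of words in each cluster, shrinking the feature dimension.
import numpy as np
from sklearn.cluster import KMeans

def reduce_bow(bow, embeddings, n_clusters=300):
    """bow: (n_docs, vocab) count matrix; embeddings: (vocab, dim) vectors."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
    reduced = np.zeros((bow.shape[0], n_clusters))
    for j, c in enumerate(labels):
        reduced[:, c] += bow[:, j]   # merge counts of embedding-similar words
    return reduced
```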
arXiv (Cornell University), 2021
Text classification is a fundamental problem in natural language processing. It mainly focuses on giving more weight to the relevant features that help classify textual data. Beyond these, text can contain redundant or highly correlated features, which increase the complexity of the classification algorithm. Thus, many dimensionality reduction methods have been proposed for traditional machine learning classifiers, and their use has achieved good results. In this paper, we propose a hybrid feature selection method for obtaining relevant features by combining various filter-based feature selection methods with the fastText classifier. We then present three ways of implementing a feature selection and neural network pipeline. We observed a reduction in training time when feature selection methods are used along with neural networks, as well as a slight increase in accuracy on some datasets.
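One plausible reading of "hybrid" (rank intersection across several filter scores; not necessarily the paper's combination rule) can be sketched as:

```python
# Hedged sketch of a hybrid filter: rank features under multiple filter
# scores and keep only those ranked in the top k by every score.
import numpy as np
from sklearn.feature_selection import chi2, mutual_info_classif

def hybrid_select(X, y, k=2000):
    scores = [chi2(X, y)[0], mutual_info_classif(X, y)]
    top_sets = [set(np.argsort(-s)[:k]) for s in scores]
    return sorted(set.intersection(*top_sets))  # features all filters agree on

# selected = hybrid_select(X_train, y_train); then train on X_train[:, selected]
```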
1999
Most research in text classification has used the "bag of words" representation of text. This paper examines some alternative ways to represent text based on syntactic and semantic relationships between words (phrases, synonyms, and hypernyms). We describe the new representations and explain why we expected them to improve the performance of a rule-based learner. The representations are evaluated using the RIPPER rule-based learner on the Reuters-21578 and DigiTrad test corpora, but on their own the new representations are not found to produce a significant performance improvement. Finally, we try combining classifiers based on different representations using a majority voting technique. This step does produce some performance improvement on both test collections. In general, our work supports the emerging consensus in the information retrieval community that more sophisticated natural language processing techniques need to be developed before better text representations can be produced. We conclude that, for now, research into new learning algorithms and methods for combining existing learners holds the most promise.
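A sketch of the combination step only; RIPPER has no scikit-learn port, so the vote is shown over arbitrary per-representation predictions:

```python
# Hedged sketch: majority voting over classifiers trained on different
# representations (word, phrase, hypernym views) of the same documents.
import numpy as np
from scipy.stats import mode

def majority_vote(predictions):
    """predictions: list of (n_docs,) label arrays, one per representation."""
    stacked = np.vstack(predictions)              # (n_views, n_docs)
    return mode(stacked, axis=0, keepdims=False).mode

# preds = [clf.fit(Xv, y).predict(Xv_test) for clf, Xv, Xv_test in views]
# final = majority_vote(preds)
```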
2002
Recently, an original extension of the well-known Rocchio model, the Generalized Rocchio Formula (GRF), has been presented as a feature weighting method for text classification. The assessment of such a model requires a statistically motivated parameter estimation method and wider empirical evidence. In this paper, three different corpora in two languages have been adopted. Results suggest that GRF, integrating linguistic information, is a viable and more efficient alternative to state-of-the-art systems. Text classification assigns documents to a set of categories C = {c_1, ..., c_n} representing topics (e.g. "Politics", "Entertainment"). An extensive collection of texts already classified, often called the training set, induces the classification function. Profile-based (or linear) classifiers are characterized by a function based on a similarity measure between the representation of incoming documents and each class c_i. Both representations are vectors, and similarity is traditionally estimated as the cosine of the angle between the two vectors. The description
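For reference, the classical Rocchio profile weight that the GRF generalizes can be written as follows (beta and gamma are the usual positive/negative weighting parameters; the GRF's parameter estimation goes beyond this):

```latex
% Classical Rocchio profile weight for feature f in class c:
% average weight over positive examples minus average over negatives.
\omega_f^c = \frac{\beta}{|c|} \sum_{d \in c} \omega_f^d
           - \frac{\gamma}{|\bar{c}|} \sum_{d \in \bar{c}} \omega_f^d
```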
A simple and efficient baseline for text classification is to represent sentences as bag-of-words (BoW) and train a linear classifier. The bag-of-words model is simple to implement and offers flexibility through different scoring schemes for user-specific text data. In many problem domains, these linear classifiers are preferred over more complex models like CNNs and LSTMs because of their efficiency, robustness, and interpretability. However, a large vocabulary can cause extremely sparse representations which are harder to model; the challenge is for models to harness very little information in such a large representational space. These classification problems are also characterized by a large number of classes and a highly imbalanced distribution of data across them. In such cases, a traditional linear classifier treats each word separately and assigns it a coefficient based on its frequency in the training set. This lowers test accuracy on instances where a word that occurred rarely in the training set occurs often in the test set. Our thesis aims to solve this problem by constraining the weights of rare features toward those of similar, more frequent ones, using semantic similarity. This enforces similar words to have similar weights, thereby improving model performance. Thus, based on how similar two features are, our proposed model can improve the importance of a sparse word by increasing its regression coefficient, thereby improving test accuracy in the scenario mentioned above.
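A hedged sketch of the regularizer implied here: a graph-smoothness penalty over a word-similarity matrix S that pulls similar words' weights together (lam and all names are illustrative):

```python
# Hedged sketch: penalty pulling weights of semantically similar words
# together, so rare words inherit strength from frequent neighbors.
# S: symmetric (vocab, vocab) word-similarity matrix; w: weight vector.
import numpy as np

def similarity_penalty(w, S, lam=0.1):
    """Graph-smoothness term over the word-similarity graph."""
    D = np.diag(S.sum(axis=1))
    L = D - S                  # graph Laplacian of the similarity matrix
    # For symmetric S, w @ L @ w == 0.5 * sum_ij S[i,j] * (w[i] - w[j])**2
    return lam * (w @ L @ w)

# Total objective: classification loss(w) + similarity_penalty(w, S)
```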