A wide variety of information is available online in the form of documents. Finding such documents and filing them by category has never been more amenable to automation. This paper addresses the problem of classifying the genre of English novels using Natural Language Processing and Machine Learning methods. Novels are collected and divided into a training dataset and a test dataset. Initially, the classification uses three distinct fiction genres: Romantic, Fairy Tales and Thriller. These are among the most widely read book genres across age groups. Linguistic features are used to obtain representative features for each genre. The training module uses the feature datasets to provide the basis for classification.
mantech publications, 2023
Rapid progress in digital data acquisition techniques has led to huge volumes of data. More than 80 percent of today's data is unstructured or semi-structured. Recovering similar patterns and trends from such large volumes of text is a major challenge. Text mining is the process of extracting interesting and nontrivial patterns from large collections of text documents. Many techniques and tools exist to mine text documents and discover information for future use and for decision making. Choosing the right text mining technique improves speed and reduces the time and effort required to obtain valuable information. This paper briefly discusses and analyzes text mining techniques and their applications. With the advancement of technology, more and more data is available in digital form, most of it (approx. 85%) in unstructured textual form. Thus, it has become essential to build better techniques and algorithms to extract useful and interesting information from large amounts of textual data. Hence, information extraction and text mining have become popular areas of research for obtaining interesting and useful information.
International Journal of Computer Science and Mobile Computing, 2021
Text classification plays a vital role in the current era. Demand for it grows daily as the volume of text data increases with the rapidly growing number of digital users. As a result, machine learning algorithms are used to classify text data, resulting in better predictions and accuracy. By constructing a properly structured dataset, the genre of a book is predicted from its title and abstract. The dataset consists of books translated into English from Gujarati or Hindi originals. In this paper, some weaknesses in text classification techniques are analysed and addressed to improve accuracy on structured data. The main focus is to classify a book by genre using machine learning algorithms.
Proceedings of the Second Workshop on Storytelling, 2019
In this work, we deploy a logistic regression classifier to ascertain whether a given document belongs to the fiction or non-fiction genre. For genre identification, previous work had proposed three classes of features, viz., low-level (character-level and token counts), high-level (lexical and syntactic information) and derived features (type-token ratio, average word length or average sentence length). Using the Recursive feature elimination with cross-validation (RFECV) algorithm, we perform feature selection experiments on an exhaustive set of nineteen features (belonging to all the classes mentioned above) extracted from Brown corpus text. As a result, two simple features viz., the ratio of the number of adverbs to adjectives and the number of adjectives to pronouns turn out to be the most significant. Subsequently, our classification experiments aimed towards genre identification of documents from the Brown and Baby BNC corpora demonstrate that the performance of a classifier containing just the two aforementioned features is at par with that of a classifier containing the exhaustive feature set.
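The two ratio features named in the abstract above can be illustrated with a minimal sketch. It assumes the document has already been part-of-speech tagged; the function name `pos_ratio_features` and the coarse tag labels "ADV", "ADJ" and "PRON" are hypothetical choices for this example, not the authors' implementation or tag set.

```python
from collections import Counter

def pos_ratio_features(tagged_tokens):
    """Return (adverb/adjective, adjective/pronoun) ratios for one document.

    tagged_tokens is a list of (word, tag) pairs. The tag names used here
    ("ADV", "ADJ", "PRON") are an assumption made for this sketch.
    """
    counts = Counter(tag for _, tag in tagged_tokens)

    def ratio(numerator, denominator):
        # Fall back to 0.0 when the denominator tag never occurs.
        return counts[numerator] / counts[denominator] if counts[denominator] else 0.0

    return ratio("ADV", "ADJ"), ratio("ADJ", "PRON")

# A tiny hand-tagged document: 1 adverb, 2 adjectives, 1 pronoun.
doc = [
    ("she", "PRON"), ("quickly", "ADV"), ("opened", "VERB"),
    ("the", "DET"), ("old", "ADJ"), ("heavy", "ADJ"), ("door", "NOUN"),
]
features = pos_ratio_features(doc)
```

The resulting pair of numbers would be the input vector for a two-feature classifier such as the logistic regression model the abstract mentions.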
Title-based Book Classification System, 2019
The title-based book classification system is a web-based system that can classify books into thirty-two (32) different genres by their title. Most libraries have an e-library section where information or documents can be obtained. However, the large amount of books and digital resources available has made document retrieval in libraries difficult. Hence the aim of this project is to classify books into genres. With an ever-increasing number of books available, it has become difficult and cumbersome to categorize books manually by type. Much research has been done on the classification of books; however, it is important to extend existing work by adopting another approach, classifying books into genres using their titles. The system was developed with a Recurrent Neural Network (RNN) using Long Short-Term Memory (LSTM) units, which allow it to handle long-term dependencies in a sentence by letting the model learn what to memorize, what to discard and what to add.
Journal of advances in information technology, 2024
Digital books and internet retailers are growing in popularity daily. Different individuals prefer various genres of literature. Categorizing genres facilitates the discovery of books that match a reader's tastes. Assortment is the process of classifying a book by genre. In this paper, we categorize books by genre using a variety of traditional machine learning and deep learning models based on book titles and snippets. Such work exists for books in other languages but has not yet been completed for Bengali novels. We have developed two types of datasets as a result of data collection for this research. One dataset includes the titles of Bengali novels across nine genres, while the other includes book snippets from three genres. For classification, we have employed logistic regression, Support Vector Machines (SVM), random forest classifiers, decision trees, Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), Convolutional Neural Networks (CNN), and Bidirectional Encoder Representations from Transformers (BERT). Among all the models, BERT has the highest performance for both datasets, with 90% accuracy for the book excerpt dataset and 77% accuracy for the book title dataset. With the exception of BERT, traditional machine learning models performed better on the snippets dataset, whereas deep learning models performed better on the titles dataset. Performance varied with the quantity of data and the number of words present in each dataset.
2007
Genres are textual categories that organise and structure communication. By definition, each genre brings with it a set of conventions that can be conceptualised as expectations regarding a textual instance of a specific genre. For example, the conventions of the blog genre (a genre that is only instantiated on the World Wide Web) comprise a sequence of more or less daily postings that contain narratives and opinions of the respective blogger, an individual who wants to participate in a discussion on a certain subject. Blog postings are publicly accessible, so that other people can comment on them. These conventions are different from those underlying the editorial genre: a single author presents an argumentative statement of views that are considered to be representative of a newspaper as a whole. In brief, genres convey a large amount of communicative context. This context is essential for determining the relevance of a specific text in and for a given situation. Genres have great potential for Information Retrieval (IR) applications, such as the integration of knowledge about genres into search engines, enabling the user, for example, to augment keyword-based search with a specific set of genres the documents to be returned by the engine should belong to. This application treats genres as a filter that can be employed to narrow down the document result set. The distinction between topical and non-topical textual dimensions is crucial when it comes to the features on which genre identification algorithms operate. Topics rely on features based on content words, while genre classes appear to be more easily identified using grammatical features. As Natural Language Processing (NLP) provides methods to retrieve grammatical features, the investigation of the influence of NLP on genre identification is of primary importance.
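The genre-as-filter application described above can be sketched in a few lines. This is an illustration only: the `search` function, the document dictionaries and their "genre" labels are hypothetical, and the sketch assumes each document already carries a genre label, which in practice would come from an automatic genre classifier.

```python
def search(documents, keyword, genres=None):
    """Keyword search, optionally narrowed to a set of genres.

    documents is a list of dicts with "text" and "genre" keys
    (a made-up structure for this sketch).
    """
    keyword = keyword.lower()
    # Step 1: plain keyword match on the document text.
    hits = [d for d in documents if keyword in d["text"].lower()]
    # Step 2: use the genre set as a filter on the result set.
    if genres is not None:
        hits = [d for d in hits if d["genre"] in genres]
    return hits

documents = [
    {"id": 1, "genre": "blog", "text": "My daily notes on the election"},
    {"id": 2, "genre": "editorial", "text": "Our view on the election result"},
    {"id": 3, "genre": "blog", "text": "A recipe for bread"},
]
all_hits = search(documents, "election")
editorial_hits = search(documents, "election", genres={"editorial"})
```

Here the keyword alone matches documents 1 and 2, while adding the genre set narrows the result to document 2.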
Among the topics addressed by the papers compiled in this volume are: experiments in genre identification and classification, genres and traditional IR architectures, the composition of feature sets, machine learning approaches, descriptive genre analyses and distinctive genre features as well as text and hypertext structures. Despite the promising results reported in the contributions, it becomes evident that genres, especially web genres, are notoriously difficult to identify using fully automatic methods. The IR subfield Automatic Genre Identification (AGI) is still in its infancy. We hope that the present collection helps in establishing this subfield as well as its community further and that it whets the appetite of researchers who share our view that AGI is an interesting and promising extension of traditional IR. After the colloquium "Towards a Reference Corpus of Web Genres" (organised by Marina Santini and Serge Sharoff), held in conjunction with Corpus Linguistics 2007 in Birmingham, UK, this workshop aims to act as a follow-up meeting that gives researchers working on genre identification a platform for presentations and a forum for publishing and promoting research results, establishing research networks and exploring possibilities for cooperation. We would like to thank the organisers of the Sixth International Conference on Recent Advances in Natural Language Processing for agreeing to host this workshop. Furthermore, we would like to thank the authors and the members of the Programme Committee for their help with the reviewing process.
2010
Text classification is the process of classifying documents into predefined categories based on their content. It is the automated assignment of natural language texts to predefined categories. Text classification is the primary requirement of text retrieval systems, which retrieve texts in response to a user query, and text understanding systems, which transform text in some way such as producing summaries, answering questions or extracting data. Existing supervised learning algorithms for classifying text need sufficient documents to learn accurately. This paper presents a new algorithm for text classification using an artificial intelligence technique that requires fewer documents for training. Instead of using words, word relations, i.e. association rules derived from these words, are used to derive the feature set from pre-classified text documents. The concept of the naïve Bayes classifier is then applied to the derived features, and finally a single concept from genetic algorithms is added for the final classification. A system based on the proposed algorithm has been implemented and tested. The experimental results show that the proposed system works as a successful text classifier.
2006
In this paper we investigate the possibility of using phrases of flexible length in genre classification of textual documents, as an extension to the classic bag-of-words document representation in which documents are represented using single words as features. The investigation is conducted on a collection of articles drawn from a document database collected from three different sources representing different genres: newspaper reports, abstracts of scientific articles and legal documents. The investigation compares classification results obtained using the classic bag-of-words representation with results obtained using a bag of words extended by flexible-length phrases.
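A minimal sketch of a bag of words extended with phrases follows. The abstract does not say how the flexible-length phrases are selected, so this example simply assumes every contiguous word sequence up to `max_len` words counts as a phrase; the function name `bag_with_phrases` and the whitespace tokenisation are likewise assumptions made for illustration.

```python
from collections import Counter

def bag_with_phrases(text, max_len=3):
    """Count single words plus all contiguous phrases of 2..max_len words.

    With max_len=1 this reduces to the classic bag-of-words representation.
    """
    tokens = text.lower().split()
    bag = Counter()
    for n in range(1, max_len + 1):
        # Slide a window of n tokens across the document.
        for i in range(len(tokens) - n + 1):
            bag[" ".join(tokens[i:i + n])] += 1
    return bag

bag = bag_with_phrases("the court ruled the court adjourned", max_len=2)
```

In this toy document the phrase "the court" is counted twice as its own feature, alongside the single-word counts for "the" and "court".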
Proc. of the JADT, 2006
In this paper, two experiments in automatic genre classification of web pages are presented. These two experiments are designed to highlight three important issues related to genre classification: corpus composition and genre palettes, feature representativeness, and exportability of classification models. Results show the influence of corpus composition and genre palette on classification rates. They also show how well and to what extent feature sets represent genres in a palette, and give an idea of the limitations of the classification models when exported and used for predictive tasks.
Automatic categorization of texts into genres, rather than subject categories, is typically quite difficult. We have run a series of experiments on an annotated Swedish text corpus to determine whether linguistic metadata (in this case, parts of speech) can improve the performance of such categorizers. Compared to the traditional approach of using word frequencies, we consistently achieved better results and reduced the error rate by 8.6%.
Proceedings of the 25th annual international …, 2002
IJRASET, 2021
The present work aims to classify the genre of books automatically using the Python programming language. A genre is a subset of art, literature, or music that has a distinct form, substance, and style. In many instances, a book can be classified as belonging to more than one genre. It is difficult to categorize a book or piece of literature as belonging to one genre over another. Many novels end up badly categorized or pushed under the super-genre umbrella of fiction since there is no clear criterion to determine how much of a book belongs to a given genre. It is therefore critical to develop a system for categorizing books and determining their relevance to a particular genre. The current study tries to solve this challenge by combining various text categorization approaches and models to come up with the best solution.
Genre Classification on Text-Internal Features: a Corpus Study, 2019
This paper is part of a larger study examining genre features that are extracted directly from the text and are suitable both for classification tasks and for adapting models of automatic morphological and syntactic tagging to data from various genres. The purpose of this work is to identify and describe the significant features of genre obtained from big data exclusively from the texts themselves (without the use of metadata, information about the author, date, literary style and method, etc.). Based on the selected features, we explore how texts can be delimited according to one of the most widely used philological classifications, and we test the hypothesis that texts can be grouped by genre successfully when relying solely on text-internal features.
Proceedings of the 37th Annual Hawaii International …, 2004
Computational Linguistics, 2000
The two main factors that characterize a text are its content and its style, and both can be used as a means of categorization. In this paper we present an approach to text categorization in terms of genre and author for Modern Greek. In contrast to previous stylometric approaches, we attempt to take full advantage of existing natural language processing (NLP) tools. To this end, we propose a set of style markers including analysis-level measures that represent the way in which the input text has been analyzed and capture useful stylistic information without additional cost. We present a set of small-scale but reasonable experiments in text genre detection, author identification, and author verification tasks and show that the proposed method performs better than the most popular distributional lexical measures, i.e., functions of vocabulary richness and frequencies of occurrence of the most frequent words. All the presented experiments are based on unrestricted text downloaded from the World Wide Web without any manual text preprocessing or text sampling. Various performance issues regarding the training set size and the significance of the proposed style markers are discussed. Our system can be used in any application that requires fast and easily adaptable text categorization in terms of stylistically homogeneous categories. Moreover, the procedure of defining analysis-level markers can be followed in order to extract useful stylistic information using existing text processing tools.
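The distributional lexical measures that the abstract above uses as a point of comparison can be sketched as follows. This is an assumption-laden illustration: the type-token ratio stands in for "functions of vocabulary richness" (one common choice among several), the most frequent words are taken from the document itself rather than from a reference corpus, and the name `lexical_baseline` is hypothetical.

```python
from collections import Counter

def lexical_baseline(tokens, top_k=3):
    """Return (type-token ratio, relative frequencies of the top_k words).

    The type-token ratio is the number of distinct words divided by the
    total number of words; it is used here as a simple richness measure.
    """
    if not tokens:
        return 0.0, []
    counts = Counter(tokens)
    total = len(tokens)
    ttr = len(counts) / total
    top = [(word, count / total) for word, count in counts.most_common(top_k)]
    return ttr, top

tokens = "the cat saw the dog and the bird".split()
ttr, top = lexical_baseline(tokens, top_k=1)
```

For this eight-word sentence there are six distinct words, and "the" accounts for three of the eight tokens.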
To aid literary fiction enthusiasts, the research aims to classify Filipino short stories by genre using the K-Means algorithm and Artificial Neural Networks (ANN). The study limits the input to at least one specified genre among fantasy, horror, and romance. Two (2) sets of data are used for testing the system. The study concluded that the K-Means algorithm provides better accuracy for classifying Filipino fiction by genre if two outputs are used, with respect to both the genre given by the source and the genre assigned by the professional. Otherwise, ANN classifies the story more accurately.
IEEE Access, 2021
The analysis of discourse and the study of what characterizes it in terms of communicative objectives is essential to most tasks of Natural Language Processing. Consequently, research on textual genres as expressions of such objectives presents an opportunity to enhance both automatic techniques and resources. To conduct an investigation of this kind, it is necessary to have a good understanding of what defines and distinguishes each textual genre. This research presents a data-driven approach to discover and analyze patterns in several textual genres with the aim of identifying and quantifying the differences between them, considering how language is employed and meaning expressed in each particular case. To identify and analyze patterns within genres, a set of linguistic features is first defined, extracted and computed by using several Natural Language Processing tools. Specifically, the analysis is performed over a corpus of documents (containing news, tales and reviews) gathered...
Texto Livre: Linguagem e Tecnologia, 2020
Classifying literary genres has always been methodologically confined to philological methods and what is commonly known as Vector Space Clustering (VSC). The problem has been exacerbated by the widening gap between computational theory and traditional analysis of literary texts. Towards finding a solution to this problem, the current study utilizes a synergetic approach that brings together two established methods. First, a computational model of genre classification is drawn upon for identifying concept-based, rather than word-bound, topics, where the representation of texts is secured via the ‘bag of concepts’ (BOC) model as well as the sense-restricted knowledge and meaningful links holding between and among concepts; relatedly, the two model strands of explicit semantic analysis (ESA) and ConceptNet have enacted text classification. Second, a contextual lexical semantic approach (CRUSE, 1986, 2000) is employed so that the contextual variability of word meanings and concepts c...
1999
The central questions are: How useful is information about part-of-speech frequency for text categorisation? Is it feasible to limit word features to content words for text classifications? This is examined for 5 domain and 4 genre classification tasks using LIMAS, the German equivalent of the Brown corpus. Because LIMAS is too heterogeneous, neither question can be answered reliably for any of the tasks.
1997
As the text databases available to users become larger and more heterogeneous, genre becomes increasingly important for computational linguistics as a complement to topical and structural principles of classification. We propose a theory of genres as bundles of facets, which correlate with various surface cues, and argue that genre detection based on surface cues is as successful as detection based on deeper structural properties.