2014
Abstract—The web is a huge repository of information, and web documents need to be categorized to facilitate search and retrieval. Web document classification plays an important role in information organization and retrieval. This paper presents a fuzzy-set-based approach for automatically classifying a web document into one of the classes represented by a set of training documents. Using the same word for more than one meaning, and many words for one meaning, leads to ambiguity, especially in the web environment, where the number of users is very large. This problem is tackled using fuzzy association, in which each pair of words has an associated value. This value distinguishes the pair from other word pairs and thus helps resolve ambiguities. The approach presented in this paper does not require any user-supplied parameter and hence is independent of any bias that user input may introduce. It re...
International Computer Software and Applications Conference, 2002
In this paper, a method of automatically classifying Web documents into a set of categories using the fuzzy association concept is proposed. Using the same word or vocabulary to describe different entities creates ambiguity, especially in the Web environment where the user population is large. To solve this problem, fuzzy association is used to capture the relationships among different index
2013
The web is a huge repository of information, and web documents need to be categorized to facilitate search and retrieval. Web document classification plays an important role in information organization and retrieval. This paper presents a fuzzy-set-based approach for automatically classifying a web document into one of the classes represented by a set of training documents. Using the same word for more than one meaning, and many words for one meaning, leads to ambiguity, especially in the web environment, where the number of users is very large. This problem is tackled using fuzzy association, in which each pair of words has an associated value. This value distinguishes the pair from other word pairs and thus helps resolve ambiguities. The approach presented in this paper does not require any user-supplied parameter and hence is independent of any bias that user input may introduce. It requires a ...
Computational Science and Its …, 2004
At present, information retrieval systems rely on combinations of keyword and phrase searches, using direct keyword matching to find the information users need. However, web document retrieval systems return too many documents because of term ambiguity. Words with several meanings also often occur in a document in a context quite different from the one the querying person expects, so the user needs extra time and effort to find the relevant documents. To overcome these problems, we propose a content-based information retrieval system that connects documents according to their degree of semantic linkage, expressed as a fuzzy value by a fuzzy function. We also propose an algorithm that produces a hierarchical structure using the degree of concepts and contents among documents. As a result, we are able to select and provide documents of interest to the user.
Artificial Intelligence and …, 2007
We propose a categorical data fuzzy clustering algorithm to classify web documents. We extract a number of words for each thematic area (category) and then treat each word as a multidimensional categorical data vector. For each category, we use the algorithm to partition the available words into a number of clusters, where the center of each cluster corresponds to a word. To calculate the dissimilarity between two words, we use the Hamming distance. The classification of a new document is then accomplished in two steps. First, we estimate the minimum distance between this document and all the cluster centers of each category. Second, we select the smallest of these minimum distances and classify the document in the category that corresponds to it.
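The two-step classification described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the cluster centers have already been computed and that a document is encoded as a categorical vector of the same length as the centers. The names `hamming`, `classify`, and the toy data are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length categorical vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have equal length")
    return sum(x != y for x, y in zip(a, b))


def classify(doc_vector, centers_by_category):
    """Pick the category whose nearest cluster center is closest to the document.

    Step 1: for each category, take the minimum distance to its centers.
    Step 2: return the category with the smallest of those minima.
    """
    best_category, best_distance = None, None
    for category, centers in centers_by_category.items():
        d = min(hamming(doc_vector, c) for c in centers)
        if best_distance is None or d < best_distance:
            best_category, best_distance = category, d
    return best_category, best_distance


# Hypothetical, precomputed cluster centers (assumption for illustration only).
centers = {
    "sports": [("ball", "team", "goal"), ("race", "team", "lap")],
    "finance": [("bank", "rate", "loan"), ("stock", "rate", "fund")],
}
result = classify(("ball", "team", "lap"), centers)
print(result)
```

How the fuzzy clustering step produces the centers is not specified in the abstract and is left out of this sketch.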
2014
The exponential growth of the web has increased the importance of web document classification and data mining. Obtaining exact information about which classes a web document belongs to is expensive. Automatic classification of web documents is of great use to search engines, as it provides this information at low cost. In this paper, we propose an approach for classifying web documents using the frequent word itemsets generated by Frequent Pattern (FP) Growth, an association analysis technique from data mining. These sets of associated words act as the feature set. The final classification is obtained by applying a Naive Bayes classifier to the feature set. For the experimental work, we use the Gensim package, as it is simple and robust. Results show that our approach can classify web documents effectively. Web document classification is the process of classifying documents into predefined categories based on their content. The classifiers used for this purpose sh...
mirlabs.org
In this work, we present a method for generating, from text documents, fuzzy rules that are used to classify documents and to improve information retrieval. With this method, we address the issue of dimensionality in text documents for information retrieval. We also present a comparative analysis of the proposed method and well-known machine learning methods for classification. The aim of our work is to develop a mechanism to reduce the high dimensionality of the attribute-value matrix obtained from the documents and, consequently, scale up the proposed classifier. Experiments were run on different domains to validate the proposed approach and to compare its results with those obtained with the OneR, K-Nearest Neighbor, C4.5, Multi-variable Naive Bayes, and SVM methods. The experiments and results show that this is a promising approach to the dimensionality problem of documents in information retrieval.
Information Technology, e-Business and Applications, 2004
Applying text classification techniques to web documents raises many potential problems because of the huge volume and unstructured nature of these documents. In this research, the Association Rules Classifier (ARC) is proposed as a novel classification framework that captures different hypertext feature sources, namely text, anchor, title and metadata, and uses them to build a comprehensive knowledge base composed of association rules expressing the feature dependencies. The ARC performance is compared with three other well-known full-text classifiers: Bernoulli Bayes, Multinomial Bayes and KNN. The ARC has shown an accuracy improvement reaching 65% for datasets with a large vocabulary. For datasets with a small vocabulary, the ARC performance was similar to that of the best of the three classifiers. When compared to other web classifiers that exploit anchor, title and metadata, the ARC enhanced the accuracy by about 22%.
1998
In this paper, we present a system that extracts and generalizes terms from Internet documents to represent the classification knowledge of a given class hierarchy. We propose a measure, called support, to evaluate the importance of a term with respect to a class in the class hierarchy. Given a threshold, terms with high support are selected as keywords of a class, and terms with low support are filtered out. To further enhance the recall of this approach, an association rule mining technique is applied to mine the associations between terms. An inference model is composed of these association relations and the previously computed supports of the terms in the class. To increase the recall rate of the keyword selection process, we then present a polynomial-time inference algorithm that promotes a term strongly associated with a known keyword to a keyword. Experimental results on Internet documents collected from the Yam search engine show that the proposed methods help refine the classification knowledge and increase the recall of keyword selection.
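The threshold-based keyword selection in this abstract can be sketched as follows. The abstract does not define its support measure, so this sketch assumes a simple stand-in: the fraction of a class's documents that contain the term. The function names and the toy documents are hypothetical, and the association-based promotion step is not shown.

```python
def term_support(term, class_docs):
    """Assumed support: fraction of the class's documents that contain the term."""
    return sum(term in doc for doc in class_docs) / len(class_docs)


def select_keywords(class_docs, threshold):
    """Keep terms whose support reaches the threshold; filter out the rest."""
    vocabulary = set().union(*class_docs)
    return {t for t in vocabulary if term_support(t, class_docs) >= threshold}


# Each document is modelled as a set of terms (toy data, for illustration only).
sports_docs = [
    {"goal", "team"},
    {"goal", "match"},
    {"team", "goal", "ticket"},
    {"match"},
]
keywords = select_keywords(sports_docs, threshold=0.5)
print(sorted(keywords))
```

With a threshold of 0.5, "ticket" (support 0.25) is filtered out while "goal", "team", and "match" are kept.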
International Journal of Data Mining & Knowledge Management Process, 2012
In the current era of rapid technological advancement, efficient and effective classification of text documents into mutually exclusive categories is a challenging and much-needed capability. Fuzzy similarity provides a way to measure the similarity of features among documents. This paper gives a technical review of various fuzzy-similarity-based models. The models are discussed and compared to outline their use and necessity. A tour of different methodologies based on fuzzy similarity is provided, showing how text and web documents are categorized efficiently into different categories. Experimental results of these models are also discussed. The technical comparisons among the models' parameters are shown in the form of a 3-D chart. This study and technical review provide a strong basis for research on fuzzy-similarity-based text document categorization.
The World Wide Web provides a large amount of information in many thematic areas. Because of its dynamic nature, the information the WWW offers grows rapidly on a daily basis. Consequently, categorizing web content is very important. The problem of web page classification is well known in the information retrieval and machine learning communities, and it raises important questions. While studies in the field of subject-based classification are highly developed, we note the lack of use of semantic information as a categorization criterion. In our study, we attempt to categorize web pages into predefined thematic categories, based on processing the anchor texts of the web pages. To this end, we exploit the linguistic information that semantic networks provide for English and Greek. We also note the importance of categorization features evaluated by semantic disambiguation methods.
2003
The Internet makes it possible to share and manipulate a vast quantity of information efficiently and effectively, but the rapid and chaotic growth experienced by the Net has generated a poorly organized environment that hinders the sharing and mining of useful data. The need for meaningful web-page classification techniques is therefore becoming an urgent issue. This paper describes a novel approach to web-page classification based on a fuzzy representation of web pages. A doublet representation associates a weight with each of the most representative words of the web document, characterizing the word's relevance in the document. This weight is derived by taking advantage of the characteristics of the HTML language. A fuzzy-rule-based classifier is then generated by a supervised learning process that uses a genetic algorithm to search for the minimum fuzzy-rule set that best covers the training examples. The proposed system has been demonstrated with two significantly different classes of web pages.
Fuzzy Sets and Systems, 2004
In this paper, we present an application of association rules to query refinement. Starting from an initial set of documents retrieved from the web, text transactions are constructed and association rules are extracted. A fuzzy extension of text transactions and association rules is employed, where the presence of the terms (items) in the documents (transactions) is determined with a value between 0 and 1. The obtained rules offer the user additional terms to be added to the query with the purpose of guiding the search and improving the retrieval.
Journal of Computer Science, 2016
The ever-increasing amount of information on the Web is organized as structured, semi-structured and unstructured data. Text classification systems capable of handling these different structures can facilitate important tasks such as indexing and information retrieval in search engines. The objective of this research is to develop a method for classifying documents into multiple categories with fuzzy logic. The method was built from a pattern recognition process and uses two variables, called similarity and accuracy, that express the ability to analyze a document against a database of terms. The database of terms is generated from a collection of documents pre-classified into categories of interest. The documents processed according to their similarity and accuracy with respect to the database of terms compose a training set, also called the knowledge base. From this knowledge base, a knowledge discovery process, which involves data mining of the knowledge base, identifies a pattern that specifies a set of rules. This made it possible to define a general model that is used to create the rules and membership functions of the fuzzy model for classifying documents into multiple categories. The general model of the rules identified in the data mining process and implemented in the fuzzy model considers the most significant variables and also contributes to the specification of the membership functions, such as the definition of the linguistic terms of the fuzzy sets. It was thus possible to implement a more deterministic approach to the inputs, membership functions and inference rules of the fuzzy model. The results of the proposed method for classification of documents are relevant because they have a satisfactory accuracy rate.
Studies in Fuzziness and Soft Computing, 2008
In this paper, we present an application of association rules to query refinement. Starting from an initial set of documents retrieved from the web, text transactions are constructed and association rules are extracted. A fuzzy extension of text transactions and association rules is employed, where the presence of the terms (items) in the documents (transactions) is determined with a value between 0 and 1. The obtained rules offer the user additional terms to be added to the query with the purpose of guiding the search and improving the retrieval.
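The idea of fuzzy text transactions, where each term is present in a document to a degree between 0 and 1, can be sketched as follows. The abstract does not state how rule quality is measured, so this sketch assumes one common formulation: the membership of an itemset in a transaction is the minimum of its term degrees, support is the average membership, and confidence is the ratio of supports. All names and the toy data are hypothetical.

```python
def fuzzy_support(transactions, itemset):
    """Average, over all transactions, of the minimum membership of the itemset's terms."""
    total = sum(min(t.get(term, 0.0) for term in itemset) for t in transactions)
    return total / len(transactions)


def fuzzy_confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent as a ratio of fuzzy supports."""
    supp_a = fuzzy_support(transactions, antecedent)
    if supp_a == 0:
        return 0.0
    return fuzzy_support(transactions, antecedent | consequent) / supp_a


# Each document is a fuzzy transaction: term -> degree of presence in [0, 1].
docs = [
    {"fuzzy": 1.0, "logic": 0.5},
    {"fuzzy": 0.5, "logic": 0.5, "web": 1.0},
    {"web": 0.5},
    {"fuzzy": 0.5},
]
supp = fuzzy_support(docs, {"fuzzy"})
conf = fuzzy_confidence(docs, {"fuzzy"}, {"logic"})
print(supp, conf)
```

Under these assumptions, a rule such as "fuzzy" -> "logic" with sufficient confidence would suggest "logic" as an additional query term for a user who searched for "fuzzy".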
International Journal of …, 2009
In this study, a fuzzy similarity approach for Arabic web page classification is presented. The approach uses a fuzzy term-category relation by manipulating the membership degree for the training data and the degree value for a test web page. Six measures are used and compared in this study: the Einstein, Algebraic, Hamacher, MinMax, Special case fuzzy and Bounded Difference approaches. These measures are applied and compared using 50 different Arabic web pages. The Einstein measure gave the best performance among the measures. An analysis of these measures and concluding remarks are given in this study.
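Several of the measures named in this abstract share their names with textbook fuzzy conjunction operators (t-norms). The sketch below gives those textbook definitions for two membership degrees in [0, 1]. Whether the study uses exactly these forms, and how it aggregates them into a similarity score, is not stated in the abstract and is an assumption here; the "Special case fuzzy" measure is omitted because its definition is not given.

```python
def einstein_product(a, b):
    """Einstein product: ab / (2 - (a + b - ab))."""
    return (a * b) / (2.0 - (a + b - a * b))


def algebraic_product(a, b):
    """Algebraic product: ab."""
    return a * b


def hamacher_product(a, b):
    """Hamacher product: ab / (a + b - ab), defined as 0 when both inputs are 0."""
    if a == 0 and b == 0:
        return 0.0
    return (a * b) / (a + b - a * b)


def minimum(a, b):
    """Minimum t-norm, the conjunction half of a min-max pair."""
    return min(a, b)


def bounded_difference(a, b):
    """Bounded difference: max(0, a + b - 1)."""
    return max(0.0, a + b - 1.0)


# Compare the operators on one pair of membership degrees.
a, b = 0.5, 0.5
for op in (einstein_product, algebraic_product, hamacher_product,
           minimum, bounded_difference):
    print(op.__name__, op(a, b))
```

For any pair of degrees, these operators satisfy bounded difference ≤ Einstein ≤ algebraic ≤ Hamacher ≤ minimum, which is one reason they can rank pages differently.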
IEEE Transactions on Fuzzy Systems, 2015
Web documents are heterogeneous and complex. There exist complicated associations within one web document and linking to the others. The high interactions between terms in documents demonstrate vague and ambiguous meanings. Efficient and effective clustering methods to discover latent and coherent meanings in context are necessary. This paper presents a fuzzy linguistic topological space along with a fuzzy clustering algorithm to discover the contextual meaning in the web documents. The proposed algorithm extracts features from the web documents using conditional random field methods and builds a fuzzy linguistic topological space based on the associations of features. The associations of cooccurring features organize a hierarchy of connected semantic complexes called "CONCEPTS," wherein a fuzzy linguistic measure is applied on each complex to evaluate 1) the relevance of a document belonging to a topic, and 2) the difference between the other topics. Web contents are able to be clustered into topics in the hierarchy depending on their fuzzy linguistic measures; web users can further explore the CONCEPTS of web contents accordingly. Besides the algorithm's applicability in web text domains, it can be extended to other applications, such as data mining, bioinformatics, content-based, or collaborative information filtering, etc.
2010 10th International Conference on Hybrid Intelligent Systems, 2010
This work presents the integration of a fuzzy method and text mining to obtain an approach that brings text document classification closer to user needs. The aim of this work is to develop a mechanism to reduce the high dimensionality of the attribute-value matrix obtained from the documents and, with this, to manage imprecision and uncertainty using fuzzy rules to classify text documents. Experiments were run on different domains to validate the proposed approach and to compare its results with those obtained with the IBk, J48, Naive Bayes and OneR classification methods. The advantages of the method, the experiments and the results obtained are discussed.
Lecture Notes in Computer Science, 2006
In this paper we develop the general framework for text representation based on fuzzy set theory.
V International Enformatika Conference(IEC …, 2005
Abstract In this study, a fuzzy similarity approach for Arabic web pages classification is presented. The approach uses a fuzzy term-category relation by manipulating membership degree for the training data and the degree value for a test web page. Six measures are used and ...
This paper addresses the issue of an adequate representation of a web page for subsequent classification and data mining. The approach focuses on the textual part of web pages, which is represented by a two-dimensional vector. The vector components are sorted by the relevance of each word in the text. Two approaches, analytical and fuzzy, that take advantage of characteristics of the HTML language are presented to compute the word relevance. Both models are compared in learning and classification tasks to evaluate the suitability of each approach. The experiments show a clear improvement of the fuzzy method over the analytical one. The analytical and fuzzy approaches presented here are general, in the sense that every characteristic of the web pages could be easily integrated without additional cost.