2004, Proceedings of the 7th International Workshop on the Web and Databases colocated with ACM SIGMOD/PODS 2004 - WebDB '04
The increasing importance of search engines to commercial web sites has given rise to a phenomenon we call "web spam", that is, web pages that exist only to mislead search engines into (mis)leading users to certain web sites. Web spam is a nuisance to users as well as search engines: users have a harder time finding the information they need, and search engines have to cope with an inflated corpus, which in turn causes their cost per query to increase. Therefore, search engines have a strong incentive to weed out spam web pages from their index. We propose that some spam web pages can be identified through statistical analysis: Certain classes of spam pages, in particular those that are machine-generated, diverge in some of their properties from the properties of web pages at large. We have examined a variety of such properties, including linkage structure, page content, and page evolution, and have found that outliers in the statistical distribution of these properties are highly likely to be caused by web spam. This paper describes the properties we have examined, gives the statistical distributions we have observed, and shows which kinds of outliers are highly correlated with web spam.
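The core idea — flagging pages whose statistical properties are outliers relative to the web at large — can be sketched as follows. This is an illustrative toy, not the paper's actual pipeline; the feature (out-degree) and the z-score threshold are assumptions for the example.

```python
# Hypothetical sketch of outlier-based spam flagging, assuming we already
# have a per-page feature value (here: out-degree) for a small crawl.
from statistics import mean, stdev

def outlier_pages(features, threshold=3.0):
    """Flag pages whose feature value lies more than `threshold`
    standard deviations from the mean of the whole collection."""
    values = list(features.values())
    mu, sigma = mean(values), stdev(values)
    return {url for url, v in features.items()
            if sigma > 0 and abs(v - mu) / sigma > threshold}

# Toy data: one machine-generated page with an extreme out-degree.
# With only five pages the attainable z-scores are small, so we lower
# the threshold; on a realistic crawl a threshold near 3 is more usual.
out_degree = {"a.example": 12, "b.example": 9, "c.example": 11,
              "d.example": 10, "spam.example": 480}
print(outlier_pages(out_degree, threshold=1.5))  # {'spam.example'}
```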
Proceedings of the 8th IEEE International Conference on Collaborative Computing: Networking, Applications and Worksharing, 2012
With over 2.5 hours a day spent browsing websites online [1] and with over a billion pages [2], identifying and detecting web spam is an important problem. Although large corpora of legitimate web pages are available to researchers, the same cannot be said about web spam or spam web pages. We introduce the Webb Spam Corpus 2011 - a corpus of approximately 330,000 spam web pages - which we make available to researchers in the fight against spam. By having a standard corpus available, researchers can collaborate better on developing and reporting results of spam filtering techniques. The corpus contains web pages crawled from links found in over 6.3 million spam emails. We analyze multiple aspects of this corpus, including redirection, HTTP headers and web page content. We also provide insights into changes in web spam since the last Webb Spam Corpus was released in 2006. These insights include: 1) spammers manipulate social media to spread spam; 2) HTTP headers also change over time (e.g. hosting IP addresses of web spam appear in more IP ranges); 3) web spam content has evolved, but the majority of it still consists of scams.
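One of the header-level trends noted above — spam hosting spreading across more IP ranges — can be measured with a simple grouping of hosts by address prefix. The host names and addresses below are hypothetical placeholders, not data from the corpus.

```python
# Illustrative sketch (not the corpus's actual analysis code): count spam
# hosts per /16 IPv4 prefix to gauge how widely spread hosting is.
from collections import Counter

def ip_range_spread(host_ips):
    """Given {host: dotted-quad IP}, count hosts per /16 prefix."""
    return Counter(".".join(ip.split(".")[:2]) for ip in host_ips.values())

# Hypothetical host -> IP mapping, as might be parsed from HTTP headers.
hosts = {"spam1.example": "203.0.113.7",
         "spam2.example": "203.0.113.99",
         "spam3.example": "198.51.100.4"}
print(ip_range_spread(hosts))  # Counter({'203.0': 2, '198.51': 1})
```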
International Journal of Cooperative Information Systems, 2014
Identifying and detecting web spam is an ongoing battle between spam researchers and spammers that has been going on since search engines first allowed searching of web pages, up to the modern sharing of web links via social networks. A common challenge faced by spam researchers is that new techniques require a corpus of legitimate and spam web pages. Although large corpora of legitimate web pages are available to researchers, the same cannot be said about web spam or spam web pages. In this paper, we introduce the Webb Spam Corpus 2011 — a corpus of approximately 330,000 spam web pages — which we make available to researchers in the fight against spam. By having a standard corpus available, researchers can collaborate better on developing and reporting results of spam filtering techniques. The corpus contains web pages crawled from links found in over 6.3 million spam emails. We analyze multiple aspects of this corpus, including redirection, HTTP headers, and web page content.
Sigir Forum, 2006
We describe the WEBSPAM-UK2006 collection, a large set of Web pages that have been manually annotated with labels indicating whether or not the hosts exhibit Web spam. This is the first publicly available Web spam collection that includes page contents and links, and that has been labelled by a large and diverse set of judges.
Web spam is a major challenge to the quality of search engine results, so it is very important for search engines to detect web spam accurately. In this paper we present 32 low-cost quality features for classifying spam and ham pages in real time. These features can be divided into three categories: (i) URL features, (ii) content features, and (iii) link features. We developed a classifier using the resilient backpropagation learning algorithm for neural networks and obtained good accuracy. This classifier can be applied to search engine results in real time because computing these features requires very little CPU time.
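A minimal sketch of the kind of low-cost features described above. The paper's actual 32 features are not reproduced here; the four below (URL length, digit ratio in the URL, out-link count, word count) are illustrative stand-ins, one or two from each category.

```python
# Hypothetical cheap feature extraction for a (url, html) pair; all
# feature names and formulas here are examples, not the paper's set.
import re

def cheap_features(url, html):
    words = re.findall(r"\w+", html)
    return {
        "url_length": len(url),                                    # URL
        "digit_ratio": sum(c.isdigit() for c in url) / max(len(url), 1),
        "out_links": html.lower().count("<a "),                    # link
        "word_count": len(words),                                  # content
    }

feats = cheap_features("http://buy-cheap-meds123.example/",
                       "<html><a href='x'>win</a> free free free</html>")
print(feats["out_links"], feats["word_count"])  # prints: 1 10
```

Features like these are cheap because they need only the URL string and a single pass over the page text, with no link-graph computation at query time.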
Proceedings of the 15th international conference on World Wide Web - WWW '06, 2006
In this paper, we continue our investigations of "web spam": the injection of artificially created pages into the web in order to influence search engine results and drive traffic to certain pages for fun or profit. This paper considers some previously undescribed techniques for automatically detecting spam pages, and examines the effectiveness of these techniques both in isolation and when aggregated using classification algorithms. When combined, our heuristics correctly identify 2,037 (86.2%) of the 2,364 spam pages (13.8%) in our judged collection of 17,168 pages, while misidentifying 526 spam and non-spam pages (3.1%).
Adversarial Information Retrieval on the Web
In general, a page with a higher rank has a greater chance of being accessed, since people usually read only the few top-ranked pages returned by search engines. Driven by commercial motivations, some website or page owners attempt to deceive search engines into ranking their sites or pages higher than they deserve. This is called web spamming.
ACM Transactions on The Web, 2008
We propose link-based techniques for automating the detection of Web spam, a term referring to pages which use deceptive techniques to obtain undeservedly high scores in search engines. The issue of Web spam is widespread and difficult to solve, mostly due to the large size of the Web which means that, in practice, many algorithms are infeasible.
Proceedings of the First …, 2005
Spammers intend to increase the PageRank of certain spam pages by creating a large number of links pointing to them. We propose a novel method based on the concept of personalized PageRank that detects pages with an undeservedly high PageRank value without the need for any kind of whitelist, blacklist, or other means of human intervention. We assume that spammed pages have a biased distribution of pages that contribute to their undeservedly high PageRank value. We define SpamRank by penalizing pages that originate a suspicious PageRank share and personalizing PageRank on the penalties. Our method is tested on a 31 M page crawl of the .de domain with a manually classified 1000-page stratified random sample biased towards large PageRank values.
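Personalized PageRank, the building block this abstract relies on, replaces the uniform teleport vector of ordinary PageRank with an arbitrary distribution. A minimal power-iteration sketch over a toy graph is shown below; the penalty-assignment step that turns this into SpamRank is omitted, and the graph and node names are invented for illustration.

```python
# Sketch of personalized PageRank via power iteration. The teleport
# (restart) probability mass follows `personalization` instead of being
# uniform; SpamRank-style methods personalize on penalty scores.
def personalized_pagerank(graph, personalization, d=0.85, iters=50):
    nodes = list(graph)
    pr = {n: personalization.get(n, 0.0) for n in nodes}
    for _ in range(iters):
        nxt = {n: (1 - d) * personalization.get(n, 0.0) for n in nodes}
        for u in nodes:
            out = graph[u]
            if out:
                share = d * pr[u] / len(out)
                for v in out:
                    nxt[v] += share
            else:
                # Dangling node: redistribute its mass by the teleport vector.
                for v in nodes:
                    nxt[v] += d * pr[u] * personalization.get(v, 0.0)
        pr = nxt
    return pr

# Toy graph; all teleport mass restarts at node "a".
graph = {"a": ["b"], "b": ["a", "c"], "c": []}
pr = personalized_pagerank(graph, {"a": 1.0})
print(max(pr, key=pr.get))  # 'a' accumulates the most rank
```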
2006
We perform a statistical analysis of a large collection of Web pages, focusing on spam detection. We study several metrics such as degree correlations, number of neighbors, rank propagation through links, TrustRank and others to build several automatic web spam classifiers. This paper presents a study of the performance of each of these classifiers alone, as well as their combined performance. Using this approach we are able to detect 80.4% of the Web spam in our sample, with only 1.1% of false positives.
International Journal of Computer Applications Technology and Research, 2014
The Internet is a global information system. Because of the high volume of information in the virtual world, most users rely on search engines to reach the information they need. They often look only at the first page of results; if they cannot find the desired result, they reformulate their query. Search engines therefore try to place the best results for the user's query among the top links. Web spam is an illegitimate and unethical method of increasing the rank of internet pages by deceiving the algorithms of search engines, and it has commercial, political, and economic applications. In this paper, we first present some definitions of web spam. Then we explain different kinds of web spam and describe some methods used to combat this problem.
Proceedings of the 4th …, 2008
2011
In this paper, we investigate variations of Spam Mass for filtering web spam. First, we propose two strategies for designing new variations of the Spam Mass algorithm. Then, we perform experiments comparing different versions of Spam Mass on the WEBSPAM-UK2006 data set. Finally, we show improvements from the proposed strategies of up to 1.33 times in recall and 1.02 times in precision over the original version of Spam Mass.
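The underlying quantity can be sketched concretely. Relative spam mass compares ordinary PageRank PR(p) with PageRank PR'(p) personalized on a trusted seed set: pages that receive most of their rank from untrusted sources score near 1. The graph, seed set, and node names below are invented for illustration and do not come from the paper.

```python
# Sketch of relative spam mass: (PR - PR') / PR, where PR' teleports
# only to trusted seeds. A disconnected "link farm" component gets
# spam mass 1 because none of its rank is backed by trusted pages.
def pagerank(graph, personalization=None, d=0.85, iters=50):
    nodes = list(graph)
    base = personalization or {n: 1 / len(nodes) for n in nodes}
    pr = {n: base.get(n, 0.0) for n in nodes}
    for _ in range(iters):
        nxt = {n: (1 - d) * base.get(n, 0.0) for n in nodes}
        for u in nodes:
            out = graph[u] or nodes  # treat dangling as linking everywhere
            for v in out:
                nxt[v] += d * pr[u] / len(out)
        pr = nxt
    return pr

graph = {
    "trusted": ["good"], "good": ["trusted"],           # trusted component
    "farm1": ["target"], "farm2": ["target"],           # link farm
    "target": ["farm1", "farm2"],
}
pr  = pagerank(graph)                                        # ordinary PR
prt = pagerank(graph, {"trusted": 0.5, "good": 0.5})         # trusted PR'
spam_mass = {n: (pr[n] - prt[n]) / pr[n] for n in pr}
print(spam_mass["target"])  # 1.0: no rank reaches it from trusted seeds
```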
The European Integrated Project Dynamically Evolving, Large Scale Information Systems (DELIS): proceedings of the final workshop, 2009
The Web is both an excellent medium for sharing information and an attractive platform for delivering products and services. This platform is, to some extent, mediated by search engines in order to meet the needs of users seeking information. Search engines are the "dragons" that keep a valuable treasure: information [13]. Given the vast amount of information available on the Web, it is customary to answer queries with only a small set of results (typically 10 or 20 pages at most). Search engines must then rank Web ...