Skip to main content

Ricardo Baeza-Yates

Followers

0

Following

0

Public Views

Universidad Central de Venezuela

María Ruiz Ortiz

Universidad Internacional de Valencia

Universidade de Évora

SFS Qatar (Georgetown University)

Mikhail (Mykhailo) Minakov

Woodrow Wilson International Center for Scholars

Eduardo Pellejero

Universidade Federal do Rio Grande do Norte

Adolfo Vasquez Rocca

Universidad Complutense de Madrid

National Scientific and Technical Research Council

The Hebrew University of Jerusalem

Lorena Córdoba

National Scientific and Technical Research Council

Interests

Uploads

Papers by Ricardo Baeza-Yates

On measuring the lexical quality of the web

Proceedings of the 2nd Joint WICOW/AIRWeb Workshop on Web Quality, 2012

In this paper we propose a measure for estimating the lexical quality of the Web, that is, the re... more In this paper we propose a measure for estimating the lexical quality of the Web, that is, the representational aspect of the textual web content. Our lexical quality measure is based in a small corpus of spelling errors and we apply it to English and Spanish. We first compute the correlation of our measure with web popularity measures to show that gives independent information and then we apply it to different web segments, including social media. Our results shed a light on the lexical quality of the Web and show that authoritative websites have several orders of magnitude less misspellings than the overall Web. We also present an analysis of the geographical distribution of lexical quality throughout English and Spanish speaking countries as well as how this measure changes in about one year.

Proceedings of the 14th international ACM SIGACCESS conference on Computers and accessibility, 2012

We present an ebook reader for Android which displays ebooks in a more accessible manner for user... more We present an ebook reader for Android which displays ebooks in a more accessible manner for users with dyslexia. The ebook reader combines features that other related tools already have, such as text-to-speech technology, and new features, such as displaying the text with an adapted text layout based on the results of a user study with partici pants with dyslexia. Since there is no universal profile of a user with dyslexia, the layout settings are customizable and users can override the special layout setting according to their reading preferences.

Generalizing PageRank

Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, 2006

This paper introduces a family of link-based ranking algorithms that propagate page importance th... more This paper introduces a family of link-based ranking algorithms that propagate page importance through links. In these algorithms there is a damping function that decreases with distance, so a direct link implies more endorsement than a link through a long path. PageRank is the most widely known ranking function of this family. The main objective of this paper is to determine whether this family of ranking techniques has some interest per se, and how different choices for the damping function impact on rank quality and on convergence speed. Even though our results suggest that Page-Rank can be approximated with other simpler forms of rankings that may be computed more efficiently, our focus is of more speculative nature, in that it aims at separating the kernel of PageRank, that is, link-based importance propagation, from the way propagation decays over paths. We focus on three damping functions, having linear, exponential, and hyperbolic decay on the lengths of the paths. The exponential decay corresponds to PageRank, and the other functions are new. Our presentation includes algorithms, analysis, comparisons and experiments that study their behavior under different parameters in real Web graph data. Among other results, we show how to calculate a linear approximation that induces a page ordering that is almost identical to Page-Rank's using a fixed small number of iterations; comparisons were performed using Kendall's τ on large domain datasets.

Web Retrieval and Mining

Management, Types, and Standards, 2011

The advent of the Web in the mid-1990s followed by its fast adoption in a relatively short time, ... more The advent of the Web in the mid-1990s followed by its fast adoption in a relatively short time, posed significant challenges to classical information retrieval methods developed in the 1970s and the 1980s. The major challenges include that the Web is massive, dynamic, and distributed. The two main types of tasks that are carried on the Web are searching and mining. Searching is locating information given an information need, and mining is extracting information and/or knowledge from a corpus. The metrics for success when carrying these tasks on the Web include precision, recall (completeness), freshness, and efficiency

The impact of lexical simplification by verbal paraphrases for people with and without dyslexia

Text simplification is the process of transforming a text into an equivalent which is easier to r... more Text simplification is the process of transforming a text into an equivalent which is easier to read and to understand, preserving its meaning for a target population. One such population who could benefit from text simplification are people with dyslexia. One of the alternatives for text simplification is the use of verbal paraphrases. One of the more common verbal paraphrase pairs are the one composed by a lexical verb (to hug) and by a support verb plus a noun collocation (to give a hug). This paper explores how Spanish verbal paraphrases impact the readability and the comprehension of people with and without dyslexia dyslexia. For the selection of pairs of verbal paraphrases we have used the Badele.3000 database, a linguistic resource composed of more than 3,600 verbal paraphrases. To measure the impact in reading performance and understandability, we performed an eye-tracking study including comprehension questionnaires. The study is based on a group of 46 participants, 23 with confirmed dyslexia and 23 control group. We did not find significant effects, thus tools that can perform this kind of paraphrases automatically might not have a large effect on people with dyslexia. Therefore, other kinds of text simplification might be needed to benefit readability and understandability of people with dyslexia.

Generic Damping Functions for Propagating Importance in Link-Based Ranking

Internet Mathematics, 2006

This paper introduces a family of link-based ranking algorithms that propagate page importance th... more This paper introduces a family of link-based ranking algorithms that propagate page importance through links. The algorithms include a damping function which decreases with distance, thus a direct link implies greater endorsement that a link via a longer path. PageRank is the most widely known ranking function of this family. The main objective of this paper is to determine whether this family of ranking techniques is of some interest per se, and how different choices for the damping function affect rank quality and convergence speed. Even though our results suggest that PageRank can be approximated with other more simple forms of rankings that may be computed more efficiently, our focus is more speculative in nature, given that it aims at separating the kernel of PageRank, that is, link-based importance propagation, from the way propagation decays over paths. We focus on three damping functions that have linear, exponential, and hyperbolic decay on the lengths of the paths. The exponential decay corresponds to PageRank, and the other functions are new. The work we carry includes algorithms, analysis, comparisons and experiments that study their behavior under different parameters in real Web graph data. Amongst other results, we show how to calculate a linear approximation that induces a page ordering that is almost identical to PageRank's using a fixed number of iterations. Comparisons were made using Kendall's τ on large domain datasets.

The presence of English and Spanish dyslexia in the Web

New Review of Hypermedia and Multimedia, 2012

This article may be used for research, teaching, and private study purposes. Any substantial or s... more This article may be used for research, teaching, and private study purposes. Any substantial or systematic reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any form to anyone is expressly forbidden. The publisher does not give any warranty express or implied or make any representation that the contents will be complete or accurate or up to date. The accuracy of any instructions, formulae, and drug doses should be independently verified with primary sources. The publisher shall not be liable for any loss, actions, claims, proceedings, demand, or costs or damages whatsoever or howsoever caused arising directly or indirectly in connection with or arising out of the use of this material.

XQL and proximal nodes

Journal of the American Society for Information Science and Technology, 2002

Despite the fact that several models to structure text documents and to query on this structure h... more Despite the fact that several models to structure text documents and to query on this structure have been proposed in the past, a standard has emerged only relatively recently with the introduction of XML and its proposed query language XQL, on which we focus in this article. Although there exist some implementations of XQL, efficiency of the query engine is still a problem. We show in this article that an already existing model, Proximal Nodes, which was defined with the goal of efficiency in mind, can be used as an efficient query engine behind an XQL front‐end.

Block addressing indices for approximate text retrieval

Journal of the American Society for Information Science, 2000

The issue of reducing the space overhead when indexing large text databases is becoming more and ... more The issue of reducing the space overhead when indexing large text databases is becoming more and more important, as the text collections grow in size. Another subject, which is gaining importance as text databases grow and get more heterogeneous and error prone, is that of exible string matching. One of the best tools to make the search more exible is to allow a limited number of di erences between the words found and those sought. This is called \approximate text searching", which is becoming more and more popular. In recent years some indexing schemes with very low space overhead have appeared, some of them dealing with approximate searching. These low overhead indices (whose most notorious exponent is Glimpse) are modi ed inverted les, where space is saved by making the lists of occurrences point to text blocks instead of exact word positions. Despite their existence, little is known about the expected behavior of these \block addressing" indices, and even less is known when it comes to cope with approximate search. Our main contribution is an analytical study of the space-time trade-o s for indexed text searching. We study the space overhead and retrieval times as functions of the block size. We nd that, under reasonable assumptions, it is possible to build an index which is simultaneously sublinear in space overhead and in query time. This surprising analytical conclusion is validated with extensive experiments, obtaining typical performance gures. These results are valid for classical exact queries as well as for approximate searching. We apply our analysis to the Web, using recent statistics on the distribution of the document sizes. We show that pointing to documents instead of to xed size blocks reduces space requirements but increases search times.

Very fast and simple approximate string matching

Information Processing Letters, 1999

We improve the fastest known algorithm for approximate string matching. This algorithm can only b... more We improve the fastest known algorithm for approximate string matching. This algorithm can only be used for low error levels. By using a new algorithm to verify potential matches and a new optimization technique for biased texts (such as English), the algorithm becomes the fastest one for medium error levels too. This includes most of the interesting cases in this area.

Compression: a key for next-generation text retrieval systems

Computer, 2000

Improved bounds for the expected behaviour of AVL trees

BIT, 1992

In this paper we improve previous bounds on expected measures of AVL trees by usingfringe analysi... more In this paper we improve previous bounds on expected measures of AVL trees by usingfringe analysis. A new way of handling larger tree collections that are not closed is presented.An inherent difficulty posed by the transformations necessary to keep the AVL tree balancedmakes its analysis difficult when using fringe analysis methods. We derive a technique to copewith this difficulty obtaining

Visualizing the European Trade Graph

Univerita di Roma, Rome, 2003

In this paper we consider a graph representing commercial trade. In the trade graph, each node is... more In this paper we consider a graph representing commercial trade. In the trade graph, each node is a country and each (weighted) arc represents the amount of commercial trade between two countries. We consider an undirected version, in which we sum both exports and imports. The data we analyze was obtained from the United Nations Statistics Division (Commodity Trade Database COMTRADE, http://unstats.un.org/unsd/comtrade/). We use a subset of the global trade graph which corresponds to the trade between European countries. In Table 1 we show ...

Comparing the characteristics of the korean and the chilean web

Korea-Chile IT Cooperation Center ITCC, Technical report, Dec 1, 2004

Executive Summary This report summarizes the results of a comparison between the characteristics ... more Executive Summary This report summarizes the results of a comparison between the characteristics of two public Web spaces: the pages under the. KR (South Korea) domain, and the pages under the. CL (Chile) domain. We show several similarities that contribute to validate more general models for the characteristics of the Web, specially in terms of link structure.

Web structure, age and page quality

2nd International Workshop on Web Dynamics (WebDyn 2002), 2002

This paper is aimed at the study of quantitative measures of the relation between Web structure, ... more This paper is aimed at the study of quantitative measures of the relation between Web structure, age, and quality of Web pages. Quality is studied from di erent link-based metrics and their relationship with the structure of the Web and the last modification time of a page. We show that, as expected, Pagerank is biased against new pages. As a subproduct we propose a Pagerank variant that includes age into account and we obtain information on how the rate of change is related with Web structure.

Caracterizando la Web chilena

Encuentro chileno deficiencias de la computación, 2000

ABSTRACT This article presents a characterization of the web space from Chile in 2007. The charac... more ABSTRACT This article presents a characterization of the web space from Chile in 2007. The characterization shows distributions of sites and domains, analysis of document content and server configuration. In addition, the network structure of the chilean Web is analyzed, determining components based on hyperlink structure at the document and site levels. Original Abstract: En este art\&#39;iculo se muestra una caracterizaci\&#39;on del espacio web de Chile para el a\~no 2007. Se muestran distribuciones de sitios y dominios, caracterizaci\&#39;on del contenido en base a tipos de documento, asi como configuraci\&#39;on de los servidores. Se estudia la estructura de la red creada mediante hiperv\&#39;inculos en los documentos y c\&#39;omo las diferentes componentes de esta estructura var\&#39;ian cuando los hiperv\&#39;inculos son agregados a nivel de sitios.

Information Retrieval and the Web

In this paper we briefly explore the challenges to expand,information,retrieval (IR) on the Web, ... more In this paper we briefly explore the challenges to expand,information,retrieval (IR) on the Web, in particular other types of data, Web mining and issues related to crawling. We also mention,the main relations of IR and soft computing,and how,these techniques address these challenges. � 2003Elsevier Inc. All rights reserved.

On measuring the lexical quality of the web

Proceedings of the 2nd Joint WICOW/AIRWeb Workshop on Web Quality, 2012

In this paper we propose a measure for estimating the lexical quality of the Web, that is, the re... more In this paper we propose a measure for estimating the lexical quality of the Web, that is, the representational aspect of the textual web content. Our lexical quality measure is based in a small corpus of spelling errors and we apply it to English and Spanish. We first compute the correlation of our measure with web popularity measures to show that gives independent information and then we apply it to different web segments, including social media. Our results shed a light on the lexical quality of the Web and show that authoritative websites have several orders of magnitude less misspellings than the overall Web. We also present an analysis of the geographical distribution of lexical quality throughout English and Spanish speaking countries as well as how this measure changes in about one year.

Proceedings of the 14th international ACM SIGACCESS conference on Computers and accessibility, 2012

We present an ebook reader for Android which displays ebooks in a more accessible manner for user... more We present an ebook reader for Android which displays ebooks in a more accessible manner for users with dyslexia. The ebook reader combines features that other related tools already have, such as text-to-speech technology, and new features, such as displaying the text with an adapted text layout based on the results of a user study with partici pants with dyslexia. Since there is no universal profile of a user with dyslexia, the layout settings are customizable and users can override the special layout setting according to their reading preferences.

Generalizing PageRank

Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, 2006

This paper introduces a family of link-based ranking algorithms that propagate page importance th... more This paper introduces a family of link-based ranking algorithms that propagate page importance through links. In these algorithms there is a damping function that decreases with distance, so a direct link implies more endorsement than a link through a long path. PageRank is the most widely known ranking function of this family. The main objective of this paper is to determine whether this family of ranking techniques has some interest per se, and how different choices for the damping function impact on rank quality and on convergence speed. Even though our results suggest that Page-Rank can be approximated with other simpler forms of rankings that may be computed more efficiently, our focus is of more speculative nature, in that it aims at separating the kernel of PageRank, that is, link-based importance propagation, from the way propagation decays over paths. We focus on three damping functions, having linear, exponential, and hyperbolic decay on the lengths of the paths. The exponential decay corresponds to PageRank, and the other functions are new. Our presentation includes algorithms, analysis, comparisons and experiments that study their behavior under different parameters in real Web graph data. Among other results, we show how to calculate a linear approximation that induces a page ordering that is almost identical to Page-Rank's using a fixed small number of iterations; comparisons were performed using Kendall's τ on large domain datasets.

Web Retrieval and Mining

Management, Types, and Standards, 2011

The advent of the Web in the mid-1990s followed by its fast adoption in a relatively short time, ... more The advent of the Web in the mid-1990s followed by its fast adoption in a relatively short time, posed significant challenges to classical information retrieval methods developed in the 1970s and the 1980s. The major challenges include that the Web is massive, dynamic, and distributed. The two main types of tasks that are carried on the Web are searching and mining. Searching is locating information given an information need, and mining is extracting information and/or knowledge from a corpus. The metrics for success when carrying these tasks on the Web include precision, recall (completeness), freshness, and efficiency

The impact of lexical simplification by verbal paraphrases for people with and without dyslexia

Text simplification is the process of transforming a text into an equivalent which is easier to r... more Text simplification is the process of transforming a text into an equivalent which is easier to read and to understand, preserving its meaning for a target population. One such population who could benefit from text simplification are people with dyslexia. One of the alternatives for text simplification is the use of verbal paraphrases. One of the more common verbal paraphrase pairs are the one composed by a lexical verb (to hug) and by a support verb plus a noun collocation (to give a hug). This paper explores how Spanish verbal paraphrases impact the readability and the comprehension of people with and without dyslexia dyslexia. For the selection of pairs of verbal paraphrases we have used the Badele.3000 database, a linguistic resource composed of more than 3,600 verbal paraphrases. To measure the impact in reading performance and understandability, we performed an eye-tracking study including comprehension questionnaires. The study is based on a group of 46 participants, 23 with confirmed dyslexia and 23 control group. We did not find significant effects, thus tools that can perform this kind of paraphrases automatically might not have a large effect on people with dyslexia. Therefore, other kinds of text simplification might be needed to benefit readability and understandability of people with dyslexia.

Generic Damping Functions for Propagating Importance in Link-Based Ranking

Internet Mathematics, 2006

This paper introduces a family of link-based ranking algorithms that propagate page importance th... more This paper introduces a family of link-based ranking algorithms that propagate page importance through links. The algorithms include a damping function which decreases with distance, thus a direct link implies greater endorsement that a link via a longer path. PageRank is the most widely known ranking function of this family. The main objective of this paper is to determine whether this family of ranking techniques is of some interest per se, and how different choices for the damping function affect rank quality and convergence speed. Even though our results suggest that PageRank can be approximated with other more simple forms of rankings that may be computed more efficiently, our focus is more speculative in nature, given that it aims at separating the kernel of PageRank, that is, link-based importance propagation, from the way propagation decays over paths. We focus on three damping functions that have linear, exponential, and hyperbolic decay on the lengths of the paths. The exponential decay corresponds to PageRank, and the other functions are new. The work we carry includes algorithms, analysis, comparisons and experiments that study their behavior under different parameters in real Web graph data. Amongst other results, we show how to calculate a linear approximation that induces a page ordering that is almost identical to PageRank's using a fixed number of iterations. Comparisons were made using Kendall's τ on large domain datasets.

The presence of English and Spanish dyslexia in the Web

New Review of Hypermedia and Multimedia, 2012

This article may be used for research, teaching, and private study purposes. Any substantial or s... more This article may be used for research, teaching, and private study purposes. Any substantial or systematic reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any form to anyone is expressly forbidden. The publisher does not give any warranty express or implied or make any representation that the contents will be complete or accurate or up to date. The accuracy of any instructions, formulae, and drug doses should be independently verified with primary sources. The publisher shall not be liable for any loss, actions, claims, proceedings, demand, or costs or damages whatsoever or howsoever caused arising directly or indirectly in connection with or arising out of the use of this material.

XQL and proximal nodes

Journal of the American Society for Information Science and Technology, 2002

Despite the fact that several models to structure text documents and to query on this structure h... more Despite the fact that several models to structure text documents and to query on this structure have been proposed in the past, a standard has emerged only relatively recently with the introduction of XML and its proposed query language XQL, on which we focus in this article. Although there exist some implementations of XQL, efficiency of the query engine is still a problem. We show in this article that an already existing model, Proximal Nodes, which was defined with the goal of efficiency in mind, can be used as an efficient query engine behind an XQL front‐end.

Block addressing indices for approximate text retrieval

Journal of the American Society for Information Science, 2000

The issue of reducing the space overhead when indexing large text databases is becoming more and ... more The issue of reducing the space overhead when indexing large text databases is becoming more and more important, as the text collections grow in size. Another subject, which is gaining importance as text databases grow and get more heterogeneous and error prone, is that of exible string matching. One of the best tools to make the search more exible is to allow a limited number of di erences between the words found and those sought. This is called \approximate text searching", which is becoming more and more popular. In recent years some indexing schemes with very low space overhead have appeared, some of them dealing with approximate searching. These low overhead indices (whose most notorious exponent is Glimpse) are modi ed inverted les, where space is saved by making the lists of occurrences point to text blocks instead of exact word positions. Despite their existence, little is known about the expected behavior of these \block addressing" indices, and even less is known when it comes to cope with approximate search. Our main contribution is an analytical study of the space-time trade-o s for indexed text searching. We study the space overhead and retrieval times as functions of the block size. We nd that, under reasonable assumptions, it is possible to build an index which is simultaneously sublinear in space overhead and in query time. This surprising analytical conclusion is validated with extensive experiments, obtaining typical performance gures. These results are valid for classical exact queries as well as for approximate searching. We apply our analysis to the Web, using recent statistics on the distribution of the document sizes. We show that pointing to documents instead of to xed size blocks reduces space requirements but increases search times.

Very fast and simple approximate string matching

Information Processing Letters, 1999

We improve the fastest known algorithm for approximate string matching. This algorithm can only b... more We improve the fastest known algorithm for approximate string matching. This algorithm can only be used for low error levels. By using a new algorithm to verify potential matches and a new optimization technique for biased texts (such as English), the algorithm becomes the fastest one for medium error levels too. This includes most of the interesting cases in this area.

Compression: a key for next-generation text retrieval systems

Computer, 2000

Improved bounds for the expected behaviour of AVL trees

BIT, 1992

In this paper we improve previous bounds on expected measures of AVL trees by usingfringe analysi... more In this paper we improve previous bounds on expected measures of AVL trees by usingfringe analysis. A new way of handling larger tree collections that are not closed is presented.An inherent difficulty posed by the transformations necessary to keep the AVL tree balancedmakes its analysis difficult when using fringe analysis methods. We derive a technique to copewith this difficulty obtaining

Visualizing the European Trade Graph

Univerita di Roma, Rome, 2003

In this paper we consider a graph representing commercial trade. In the trade graph, each node is... more In this paper we consider a graph representing commercial trade. In the trade graph, each node is a country and each (weighted) arc represents the amount of commercial trade between two countries. We consider an undirected version, in which we sum both exports and imports. The data we analyze was obtained from the United Nations Statistics Division (Commodity Trade Database COMTRADE, http://unstats.un.org/unsd/comtrade/). We use a subset of the global trade graph which corresponds to the trade between European countries. In Table 1 we show ...

Comparing the characteristics of the korean and the chilean web

Korea-Chile IT Cooperation Center ITCC, Technical report, Dec 1, 2004

Executive Summary This report summarizes the results of a comparison between the characteristics ... more Executive Summary This report summarizes the results of a comparison between the characteristics of two public Web spaces: the pages under the. KR (South Korea) domain, and the pages under the. CL (Chile) domain. We show several similarities that contribute to validate more general models for the characteristics of the Web, specially in terms of link structure.

Web structure, age and page quality

2nd International Workshop on Web Dynamics (WebDyn 2002), 2002

This paper is aimed at the study of quantitative measures of the relation between Web structure, ... more This paper is aimed at the study of quantitative measures of the relation between Web structure, age, and quality of Web pages. Quality is studied from di erent link-based metrics and their relationship with the structure of the Web and the last modification time of a page. We show that, as expected, Pagerank is biased against new pages. As a subproduct we propose a Pagerank variant that includes age into account and we obtain information on how the rate of change is related with Web structure.

Caracterizando la Web chilena

Encuentro chileno deficiencias de la computación, 2000

ABSTRACT This article presents a characterization of the web space from Chile in 2007. The charac... more ABSTRACT This article presents a characterization of the web space from Chile in 2007. The characterization shows distributions of sites and domains, analysis of document content and server configuration. In addition, the network structure of the chilean Web is analyzed, determining components based on hyperlink structure at the document and site levels. Original Abstract: En este art\&#39;iculo se muestra una caracterizaci\&#39;on del espacio web de Chile para el a\~no 2007. Se muestran distribuciones de sitios y dominios, caracterizaci\&#39;on del contenido en base a tipos de documento, asi como configuraci\&#39;on de los servidores. Se estudia la estructura de la red creada mediante hiperv\&#39;inculos en los documentos y c\&#39;omo las diferentes componentes de esta estructura var\&#39;ian cuando los hiperv\&#39;inculos son agregados a nivel de sitios.

Information Retrieval and the Web

In this paper we briefly explore the challenges to expand,information,retrieval (IR) on the Web, ... more In this paper we briefly explore the challenges to expand,information,retrieval (IR) on the Web, in particular other types of data, Web mining and issues related to crawling. We also mention,the main relations of IR and soft computing,and how,these techniques address these challenges. � 2003Elsevier Inc. All rights reserved.