8 Sorting and Searching
If you keep proving stuff that others have done, getting confidence, increasing the
complexities of your solutions—for the fun of it—then one day you’ll turn
around and discover that nobody actually did that one! And that’s the way to
become a computer scientist.
Richard Feynman, Lectures on Computation
This chapter presents two extended examples that use the programming tech-
niques from Chapters 2–5 and analysis ideas from Chapters 6–7 to solve some
interesting and important problems. First, we consider the problem of arrang-
ing a list in order. Next, we consider the problem of finding an item that satisfies
some property. These examples involve some quite challenging problems and
incorporate many of the ideas we have seen up to this point in the book. Readers
who can understand them well are well on their way to thinking like computer
scientists!
8.1 Sorting
The sorting problem takes two inputs: a list of elements and a comparison pro-
cedure. It outputs a list containing the same elements as the input list ordered ac-
cording to the comparison procedure. For example, if we sort a list of numbers
using < as the comparison procedure, the output is the list of numbers sorted
in order from least to greatest.
Sorting is one of the most widely studied problems in computing, and many dif-
ferent sorting algorithms have been proposed. Curious readers should attempt
to develop their own sorting procedures before continuing further. It may be il-
luminating to try sorting some items by hand and think carefully about how you
do it and how much work it is. For example, take a shuffled deck of cards and
arrange them in sorted order by ranks. Or, try arranging all the students in your
class in order by birthday. Next, we present and analyze three different sorting
procedures.
Sorting only makes sense if the comparison function is transitive. This means it has the prop-
erty that for any inputs a, b, and c, if (cf a b) and (cf b c) are both true, the result
of (cf a c) must be true. The < function is transitive: a < b and b < c implies
a < c for all numbers a, b, and c. If the comparison function does not have this
property, there may be no way to arrange the elements in a single sorted list. All
of our sorting procedures require that the procedure passed as the comparison
function is transitive.
Once we can find the best element in a given list, we can sort the whole list by
repeatedly finding the best element of the remaining elements until no more
elements remain. To define our best-first sorting procedure, we first define a
procedure for finding the best element in the list, and then define a procedure
for removing an element from a list.
Finding the Best. The best element in the list is either the first element, or the
best element from the rest of the list. Hence, we define list-find-best recursively.
An empty list has no best element, so the base case is for a list that has one
element. When the input list has only one element, that element is the best
element. If the list has more than one element, the best element is the better of
the first element in the list and the best element of the rest of the list.
To pick the better element from two elements, we define the pick-better proce-
dure that takes three inputs: a comparison function and two values.
(define (pick-better cf p1 p2) (if (cf p1 p2) p1 p2))
Assuming the procedure passed as cf has constant running time, the running
time of pick-better is constant. For most of our examples, we use the < proce-
dure as the comparison function. For arbitrary inputs, the running time of < is
not constant since in the worst case performing the comparison requires exam-
ining every digit in the input numbers. But, if the maximum value of a number
in the input list is limited, then we can consider < a constant time procedure
since all of the inputs passed to it are below some fixed size.
We use pick-better to define list-find-best:
(define (list-find-best cf p)
  (if (null? (cdr p)) (car p)
      (pick-better cf (car p) (list-find-best cf (cdr p)))))
We use n to represent the number of elements in the input list p. An applica-
tion of list-find-best involves n − 1 recursive applications since each one passes
in (cdr p) as the new p operand and the base case stops when the list has one
element left. The running time for each application (excluding the recursive
application) is constant since it involves only applications of the constant time
procedures null?, cdr, and pick-better. So, the total running time for list-find-
best is in Θ(n); it scales linearly with the length of the input list.
Deleting an Element. To implement best first sorting, we need to produce a
list that contains all the elements of the original list except for the best element,
which will be placed at the front of the output list. We define a procedure, list-
delete, that takes as inputs a List and a Value, and produces a List that contains
all the elements of the input list in the original order except for the first element
that is equal to the input value.
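One way to define list-delete (a sketch using the built-in equal? procedure so it
works for elements of any type), together with a best-first sort that combines it
with list-find-best, is shown below; these are illustrative definitions, not
necessarily the chapter's exact ones:
(define (list-delete p el)
  (if (null? p) null
      (if (equal? (car p) el)
          (cdr p) ; found the element: keep the rest of the list without it
          (cons (car p) (list-delete (cdr p) el)))))
(define (list-sort-best-first cf p)
  (if (null? p) null
      (cons (list-find-best cf p)
            (list-sort-best-first cf (list-delete p (list-find-best cf p))))))
Note that this sketch of list-sort-best-first evaluates (list-find-best cf p) twice in
each application; the procedure below avoids that duplicated work by using a let
expression.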
(define (list-sort-best-first-let cf p)
  (if (null? p) null
      (let ((best (list-find-best cf p)))
        (cons best (list-sort-best-first-let cf (list-delete p best))))))
This runs faster than list-sort-best-first since it avoids the duplicate evaluations,
but the asymptotic running time is still in Θ(n²): there are n recur-
sive applications of list-sort-best-first-let and each application involves linear
time applications of list-find-best and list-delete. Using the let expression im-
proves the actual running time by avoiding the duplicate work, but does not im-
pact the asymptotic growth rate since the duplicate work is hidden in the con-
stant factor.
Exercise 8.1. What is the best case input for list-sort-best-first? What is its
asymptotic running time on the best case input?
Exercise 8.2. Use the time special form (Section 7.1) to experimentally mea-
sure the evaluation times for the list-sort-best-first-let procedure. Do the results
match the expected running times based on the Θ(n²) asymptotic running time?
You may find it helpful to define a procedure that constructs a list containing
n random elements. To generate the random elements use the built-in proce-
dure random that takes one number as input and evaluates to a pseudorandom
number between 0 and one less than the value of the input number. Be careful
in your time measurements that you do not include the time required to gener-
ate the input list.
Exercise 8.3. Define the list-find-best procedure using the list-accumulate pro-
cedure from Section 5.4.2 and evaluate its asymptotic running time.
Exercise 8.4. [?] Define and analyze a list-sort-worst-last procedure that sorts
by finding the worst element first and putting it at the end of the list.
Exercise 8.5. We analyzed the worst case running time of list-sort-insert above.
Analyze the best case running time. Your analysis should identify the inputs for
which list-sort-insert runs fastest, and describe the asymptotic running time for
the best case input.
The problem with our insertion sort is that it divides the work unevenly into
inserting one element and sorting the rest of the list. This is a very unequal
division. Any sorting procedure that works by considering one element at a time
and putting it in the sorted position as is done by list-sort-best-first and list-
sort-insert has a running time in Ω(n²). We cannot do better than this with this
strategy since there are n elements, and the time required to figure out where
each element goes is in Ω(n).
To do better, we need to either reduce the number of recursive applications
needed (this would mean each recursive call results in more than one element
being sorted), or reduce the time required for each application. The approach
we take is to use each recursive application to divide the list into two approx-
imately equal-sized parts, but to do the division in such a way that the results
of sorting the two parts can be combined directly to form the result. We par-
tition the elements in the list so that all elements in the first part are less than
(according to the comparison function) all elements in the second part.
Our first attempt is to modify insert-one to partition the list into two parts. This
approach does not produce a better-than-quadratic time sorting procedure be-
cause of the inefficiency of accessing list elements; however, it leads to insights
for producing a quicker sorting procedure.
First, we define a list-extract procedure that takes as inputs a list and two num-
bers indicating the start and end positions, and outputs a list containing the
elements of the input list between the start and end positions:
(define (list-extract p start end)
  (if (= start 0)
      (if (= end 0) null
          (cons (car p) (list-extract (cdr p) start (- end 1))))
      (list-extract (cdr p) (- start 1) (- end 1))))
The running time of the list-extract procedure is in Θ(n) where n is the number
of elements in the input list. The worst case input is when the value of end is the
length of the input list, which means there will be n recursive applications, each
involving a constant amount of work.
We use list-extract to define procedures for obtaining first and second halves of
a list (when the list has an odd number of elements, we put the middle element
in the second half of the list):
(define (list-first-half p)
(list-extract p 0 (floor (/ (list-length p) 2))))
(define (list-second-half p)
(list-extract p (floor (/ (list-length p) 2)) (list-length p)))
The list-first-half and list-second-half procedures use list-extract so their run-
ning times are linear in the number of elements in the input list.
The list-insert-one-split procedure inserts an element in sorted order by first
splitting the list in halves and then recursively inserting the new element in the
appropriate half of the list:
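One way to write it (a sketch that relies on the list-first-half, list-second-half,
and list-append procedures; it assumes the input list p is already sorted by cf):
(define (list-insert-one-split cf el p) ; requires: p is sorted by cf
  (if (null? p) (list el)
      (if (null? (cdr p))
          (if (cf el (car p)) (cons el p) (list (car p) el))
          (let ((front (list-first-half p)) (back (list-second-half p)))
            (if (cf el (car back))
                (list-append (list-insert-one-split cf el front) back)
                (list-append front (list-insert-one-split cf el back)))))))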
half of the list. In either case, the length of the list is approximately n/2. The run-
ning time of list-append is in Θ(m) where m is the length of the first input list.
So, the time required for each list-insert-one-split application is in Θ(n) where
n is the length of the input list to list-insert-one-split.
The lengths of the input lists to list-insert-one-split in the recursive calls are ap-
proximately n/2, n/4, n/8, . . ., 1, since the length of the list halves with each call.
The summation has log2 n terms, and the sum of the list is n, so the average
length input is n/log2 n. Hence, the total running time for the list-append
applications in each application of list-insert-one-split is in
Θ(log2 n × n/log2 n) = Θ(n).
A List is either (1) null or (2) a Pair whose second cell is a List.
A Tree is either (1) null or (2) a triple whose first and third parts are both
Trees.
Symbolically:
Tree ::⇒ null
Tree ::⇒ (make-tree Tree Element Tree)
The make-tree procedure can be defined using cons to package the three inputs
into a tree:
(define (make-tree left element right)
(cons element (cons left right)))
We define selector procedures for extracting the parts of a non-null tree:
(define (tree-element tree) (car tree))
(define (tree-left tree) (car (cdr tree)))
(define (tree-right tree) (cdr (cdr tree)))
The tree-left and tree-right procedures are constant time procedures that evalu-
ate to the left or right subtrees respectively of a tree.
In a sorted tree, the elements are maintained in a sorted structure. All elements
in the left subtree of a tree are less than (according to the comparison function)
the value of the root element of the tree; all elements in the right subtree of a tree
are greater than or equal to the value of the root element of the tree (the result
of comparing them with the root element is false). For example, here is a sorted
binary tree containing 6 elements using < as the comparison function:
[Figure: a sorted binary tree with 7 at the root; its children are 5 (with children 1 and 6) and 12 (with right child 17).]
The top node has element value 7, and its left subtree is a tree containing the
three elements whose values are less than 7. The null subtrees are not shown. For
example, the left subtree of the element whose value is 12 is null. Although there
are six elements in the tree, we can reach any element from the top by following
at most two branches. By contrast, with a list of six elements, we need five cdr
operations to reach the last element.
The depth of a tree is the largest number of steps needed to reach any node in
the tree starting from the root. The example tree has depth 2, since we can reach
every node starting from the root of the tree in two or fewer steps. A tree of depth
d can contain up to 2ᵈ⁺¹ − 1 elements. One way to see this is from this recursive
definition for the maximum number of nodes in a tree:
    TreeNodes(d) = 1                                              if d = 0
    TreeNodes(d) = TreeNodes(d − 1) + 2 × TreeLeaves(d − 1)       if d > 0
A tree of depth zero has one node. Increasing the depth of a tree by one means
we can add two nodes for each leaf node in the tree, so the total number of nodes
in the new tree is the sum of the number of nodes in the original tree and twice
the number of leaves in the original tree. The maximum number of leaves in a
tree of depth d is 2ᵈ since each level doubles the number of leaves. Hence, the
second equation simplifies to
    TreeNodes(d − 1) + 2 × 2ᵈ⁻¹ = TreeNodes(d − 1) + 2ᵈ.
The value of TreeNodes(d − 1) is 2ᵈ⁻¹ + 2ᵈ⁻² + . . . + 1 = 2ᵈ − 1. Adding 2ᵈ and
2ᵈ − 1 gives 2ᵈ⁺¹ − 1 as the maximum number of nodes in a tree of depth d.
Hence, a well-balanced tree containing n nodes has depth approximately log2 n.
A tree is well-balanced if the left and right subtrees of all nodes in the tree
contain nearly the same number of elements.
Procedures that are analogous to the list-first-half , list-second-half , and list-
append procedures that had linear running times for the standard list represen-
tation can all be implemented with constant running times for the tree repre-
sentation. For example, tree-left is analogous to list-first-half and make-tree is
analogous to list-append.
The tree-insert-one procedure inserts an element in a sorted binary tree:
(define (tree-insert-one cf el tree)
  (if (null? tree) (make-tree null el null)
      (if (cf el (tree-element tree))
          (make-tree (tree-insert-one cf el (tree-left tree))
                     (tree-element tree)
                     (tree-right tree))
          (make-tree (tree-left tree)
                     (tree-element tree)
                     (tree-insert-one cf el (tree-right tree))))))
When the input tree is null, the new element is the top element of a new tree
whose left and right subtrees are null. Otherwise, the procedure compares the
element to insert with the element at the top node of the tree. If the comparison
evaluates to true, the new element belongs in the left subtree. The result is a tree
where the left tree is the result of inserting this element in the old left subtree,
and the element and right subtree are the same as they were in the original tree.
For the alternate case, the element is inserted in the right subtree, and the left
subtree is unchanged.
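For example, inserting the six elements of the earlier example tree one at a time
produces that tree (this particular insertion order is just one possibility):
(tree-insert-one < 6
  (tree-insert-one < 1
    (tree-insert-one < 17
      (tree-insert-one < 12
        (tree-insert-one < 5
          (tree-insert-one < 7 null))))))
The result is a tree with 7 at the root, 5 and 12 as its children, and 1, 6, and 17 at
the leaves.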
In addition to the recursive call, tree-insert-one only applies constant time pro-
cedures. If the tree is well-balanced, each recursive application halves the size
of the input tree so there are approximately log2 n recursive calls. Hence, the
running time to insert an element in a well-balanced tree using tree-insert-one
is in Θ(log n).
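The remaining pieces of the tree-based sort are list-to-sorted-tree, which inserts
every element of a list into a sorted binary tree; tree-extract-elements, which
traverses a tree to produce the sorted list of its elements; and list-sort-tree-insert,
which combines them. Sketches of how they might be defined (relying on
list-append; these are illustrative, not necessarily the chapter's exact definitions):
(define (list-to-sorted-tree cf p)
  (if (null? p) null
      (tree-insert-one cf (car p) (list-to-sorted-tree cf (cdr p)))))
(define (tree-extract-elements tree)
  (if (null? tree) null
      (list-append (tree-extract-elements (tree-left tree))
                   (cons (tree-element tree)
                         (tree-extract-elements (tree-right tree))))))
(define (list-sort-tree-insert cf p)
  (tree-extract-elements (list-to-sorted-tree cf p)))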
ments in its input list (which is the result of the list-to-sorted-tree application, a
list containing n elements where n is the number of elements in p).
Only the fastest-growing term contributes to the total asymptotic running time,
so the expected total running time for an application of list-sort-tree-insert to a
list containing n elements is in Θ(n log n). This is substantially better than the
previous sorting algorithms which had running times in Θ(n²) since logarithms
grow far slower than their input. For example, if n is one million, n² is over 50,000
times bigger than n log2 n; if n is one billion, n² is over 33 million times bigger
than n log2 n since log2 1000000000 is just under 30.
There is no general sorting procedure that has expected running time better
than Θ(n log n), so there is no algorithm that is asymptotically faster than list-
sort-tree (in fact, it can be proven that no asymptotically faster sorting procedure
exists). There are, however, sorting procedures that may have advantages such
as how they use memory which may provide better absolute performance in
some situations.
Unbalanced Trees. Our analysis assumes the left and right halves of the tree
passed to tree-insert-one have approximately the same number of elements.
If the input list is in random order, this assumption is likely to be valid: each
element we insert is equally likely to go into the left or right half, so the halves
contain approximately the same number of elements all the way down the tree.
But, if the input list is not in random order this may not be the case.
For example, suppose the input list is already in sorted order. Then, each ele-
ment that is inserted will be the rightmost node in the tree when it is inserted.
For the previous example, this produces the unbalanced tree shown in Figure 8.1.
This tree contains the same six elements as the earlier example, but because it is
not well-balanced the number of branches that must be traversed to reach the
deepest element is 5 instead of 2. Similarly, if the input list is in reverse sorted
order, we will have an unbalanced tree where only the left branches are used.
In these pathological situations, the tree effectively becomes a list. The num-
ber of recursive applications of tree-insert-one needed to insert a new element
will not be in Θ(log n), but rather will be in Θ(n). Hence, the worst case run-
ning time for list-sort-tree-insert is in Θ(n²) since the worst case time for tree-
insert-one is in Θ(n) and there are Θ(n) applications of tree-insert-one. The list-
sort-tree-insert procedure has expected running time in Θ(n log n) for randomly
distributed inputs, but has worst case running time in Θ(n²).
Exercise 8.8. [?] Define a procedure binary-tree-depth that takes as input a bi-
nary tree and outputs the depth of the tree. The running time of your procedure
should not grow faster than linearly with the number of nodes in the tree.
Figure 8.1. Unbalanced trees.
The depth of the output tree should be no higher than log2 n + 1 where n is the
number of elements in the input tree.
8.2 Searching
In a broad sense, nearly all problems can be thought of as search problems. If
we can define the space of possible solutions, we can search that space to find
a correct solution. For example, to solve the pegboard puzzle (Exploration 5.2)
we enumerate all possible sequences of moves and search that space to find a
winning sequence. For most interesting problems, however, the search space is
far too large to search through all possible solutions.
This section explores a few specific types of search problems. First, we consider
the simple problem of finding an element in a list that satisfies some property.
Then, we consider searching for an item in sorted data. Finally, we consider
the more specific problem of efficiently searching for documents (such as web
pages) that contain some target word.
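A minimal sketch of such a list-search procedure: it takes a matching function ef
and a list p, and outputs the first element that satisfies ef, or false if there is none:
(define (list-search ef p)
  (if (null? p) false ; no element satisfies ef
      (if (ef (car p)) (car p)
          (list-search ef (cdr p)))))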
Assuming the matching function has constant running time, the worst case run-
ning time of list-search is linear in the size of the input list. The worst case is
when there is no satisfying element in the list. If the input list has length n, there
are n recursive calls to list-search, each of which involves only constant time
procedures.
Without imposing more structure on the input and comparison function, there
is no more efficient search procedure. In the worst case, we always need to test
every element in the input list before concluding that there is no element that
satisfies the matching function.
¹ If the input list contains false as an element, we cannot tell whether a list-search result of false
means no element satisfies the property or the satisfying element is itself false. An alternative
would be to produce an error if no satisfying element is found, but this is more awkward when
list-search is used by other procedures.
Strings. We use the built-in String datatype to represent documents and target
words. A String is similar to a List, but specialized for representing sequences
of characters. A convenient way to make a String is to just use double quotes
around a sequence of characters. For example, "abcd" evaluates to a String con-
taining four characters.
The String datatype provides procedures for matching, ordering, and converting
between Strings and Lists of characters:
string=?: String × String → Boolean
Outputs true if the input Strings have exactly the same sequence of charac-
ters, otherwise false.
string<?: String × String → Boolean
Outputs true if the first input String is lexicographically before the second
input String, otherwise false.
string->list: String → List
Outputs a List containing the characters in the input String.
list->string : List → String
Outputs a String containing the characters in the input List.
One advantage of using Strings instead of Lists of characters is the built-in pro-
cedures for comparing Strings; we could write similar procedures for Lists of
characters, but lexicographic ordering is somewhat tricky to get right, so it is
better to use the built-in procedures.
Building the index. The entries in the index are Pairs of a word represented
as a string, and a list of locations where that word appears. Each location is a
Pair consisting of a document identifier (for web documents, this is the Uniform
Resource Locator (URL) that is the address of the web page represented as a
string) and a Number identifying the position within the document where the
word appears (we label positions as the number of characters in the document
before this location).
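For example, an entry for one word might look like the following (the document
identifier and position here are made up purely for illustration):
("ides" . (("http://example.org/julius_caesar.html" . 3281)))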
To build the index, we split each document into words and record the position
of each word in the document. The first step is to define a procedure that takes
as input a string representing an entire document, and produces a list of (word
. position) pairs containing one element for each word in the document. We
define a word as a sequence of alphabetic characters; non-alphabetic characters
including spaces, numbers, and punctuation marks separate words and are not
included in the index.
The text-to-word-positions procedure takes a string as input and outputs a list
of word-position pairs corresponding to each word in the input. The inner pro-
cedure, text-to-word-positions-iter, takes three inputs: a list of the characters in
the document, a list of the characters in the current word, and a number repre-
senting the position in the string where the current word starts; it outputs the list
of (word . position) pairs. The value passed in as w can be null, meaning there
is no current word. Otherwise, it is a list of the characters in the current word.
A word starts when the first alphabetic character is found, and continues until
either the first non-alphabetic character or the end of the document. We use the
built-in char-downcase procedure to convert all letters to their lowercase form,
so KING, King, and king all correspond to the same word.
(define (text-to-word-positions s)
  (define (text-to-word-positions-iter p w pos)
    (if (null? p)
        (if (null? w) null (list (cons (list->string w) pos)))
        (if (not (char-alphabetic? (car p))) ; finished word
            (if (null? w) ; no current word
                (text-to-word-positions-iter (cdr p) null (+ pos 1))
                (cons (cons (list->string w) pos)
                      (text-to-word-positions-iter (cdr p) null
                                                   (+ pos (list-length w) 1))))
            (text-to-word-positions-iter (cdr p)
                                         (list-append w (list (char-downcase (car p))))
                                         pos))))
  (text-to-word-positions-iter (string->list s) null 0))
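For example,
(text-to-word-positions "The quick, brown fox")
evaluates to (("the" . 0) ("quick" . 4) ("brown" . 11) ("fox" . 17)).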
The next step is to build an index from the list of word-position pairs. To enable
fast searching, we store the index in a binary tree sorted by the target word. The
insert-into-index procedure takes as input an index and a word-position pair
and outputs an index consisting of the input index with the input word-position
pair added.
The index is represented as a sorted binary tree where each element is a pair of a
word and a list of the positions where that word appears. Each word should ap-
pear in the tree only once, so if the word-position pair to be added corresponds
to a word that is already in the index, the position is added to the corresponding
list of positions. Otherwise, a new entry is added to the index for the word with
a list of positions containing the position as its only element.
(define (insert-into-index index wp)
  (if (null? index)
      (make-tree null (cons (car wp) (list (cdr wp))) null)
      (if (string=? (car wp) (car (tree-element index)))
          (make-tree (tree-left index)
                     (cons (car (tree-element index))
                           (list-append (cdr (tree-element index))
                                        (list (cdr wp))))
                     (tree-right index))
          (if (string<? (car wp) (car (tree-element index)))
              (make-tree (insert-into-index (tree-left index) wp)
                         (tree-element index)
                         (tree-right index))
              (make-tree (tree-left index)
                         (tree-element index)
                         (insert-into-index (tree-right index) wp))))))
To insert all the (word . position) pairs in a list into the index, we use insert-into-
index to add each pair, passing the resulting index into the next recursive call:
(define (insert-all-wps index wps)
(if (null? wps) index
(insert-all-wps (insert-into-index index (car wps)) (cdr wps))))
To add all the words in a document to the index we use text-to-word-positions
to obtain the list of word-position pairs. Since we want to include the document
identity in the positions, we use list-map to add the url (a string that identifies
the document location) to the position of each word. Then, we use insert-all-
wps to add all the word-position pairs in this document to the index. The index-
document procedure takes a document identifier and its text as a string, and
produces an index of all words in the document.
(define (index-document url text)
(insert-all-wps
null
(list-map (lambda (wp) (cons (car wp) (cons url (cdr wp))))
(text-to-word-positions text))))
We leave analyzing the running time of index-document as an exercise. The im-
portant point, though, is that it only has to be done once for a given set of doc-
uments. Once the index is built, we can use it to answer any number of search
queries without needing to reconstruct the index.
Merging indexes. Our goal is to produce an index for a set of documents, not
just a single document. So, we need a way to take two indexes produced by
index-document and combine them into a single index. We use this repeat-
edly to create an index of any number of documents. To merge two indexes,
we combine their word occurrences. If a word occurs in both documents, the
word should appear in the merged index with a position list that includes all the
positions in both indexes. If the word occurs in only one of the documents, that
word and its position list should be included in the merged index.
(define (merge-indexes d1 d2)
  (define (merge-elements p1 p2)
    (if (null? p1) p2
        (if (null? p2) p1
            (if (string=? (car (car p1)) (car (car p2)))
                (cons (cons (car (car p1))
                            (list-append (cdr (car p1)) (cdr (car p2))))
                      (merge-elements (cdr p1) (cdr p2)))
                (if (string<? (car (car p1)) (car (car p2)))
                    (cons (car p1) (merge-elements (cdr p1) p2))
                    (cons (car p2) (merge-elements p1 (cdr p2))))))))
  (list-to-sorted-tree
   (lambda (e1 e2) (string<? (car e1) (car e2)))
   (merge-elements (tree-extract-elements d1)
                   (tree-extract-elements d2))))
To merge the indexes, we first use tree-extract-elements to convert the tree rep-
resentations to lists. The inner merge-elements procedure takes the two lists of
word-position pairs and outputs a single list.
Since the lists are sorted by the target word, we can perform the merge effi-
ciently. If the first words in both lists are the same, we produce a word-position
pair that appends the position lists for the two entries. If they are different, we
use string<? to determine which of the words belongs first, and include that el-
ement in the merged list. This way, the two lists are kept synchronized, so there
is no need to search the lists to see if the same word appears in both lists.
Obtaining documents. To build a useful index for searching, we need some documents to index.
The read-all-chars procedure takes a port as its input, and produces a list con-
taining all the characters in the document associated with the port:
(define (read-all-chars port)
(let ((c (read-char port)))
(if (eof-object? c) null
(cons c (read-all-chars port)))))
Using these procedures, we define web-get, a procedure that takes as input a
string that represents the URL of some web page, and outputs a string repre-
senting the contents of that page.
(define (web-get url)
(list->string (read-all-chars (get-pure-port (string->url url)))))
² We use Name to represent sequences of characters in the domain and path names, although the
actual rules for valid names for each of these are different.
To make it easy to build an index of a set of web pages, we define the index-
pages procedure that takes as input a list of web pages and outputs an index of
the words in those pages. It recurses through the list of pages, indexing each
document, and merging that index with the result of indexing the rest of the
pages in the list.
(define (index-pages p)
(if (null? p) null
(merge-indexes (index-document (car p) (web-get (car p)))
(index-pages (cdr p)))))
We can use this to create an index of any set of web pages. For example, here
we use Jeremy Hylton’s collection of the complete works of William Shakespeare
([Link]) to define shakespeare-index as an index of the words
used in all of Shakespeare’s plays.
(define shakespeare-index
  (index-pages
   (list-map
    (lambda (play)
      (string-append "[Link]" play "/[Link]"))
    ;; List of plays following the site’s naming conventions.
    (list "allswell" "asyoulikeit" "comedy_errors" "cymbeline" "lll"
          "measure" "merry_wives" "merchant" "midsummer" "much_ado"
          "pericles" "taming_shrew" "tempest" "troilus_cressida" "twelfth_night"
          "two_gentlemen" "winters_tale" "1henryiv" "2henryiv" "henryv"
          "1henryvi" "2henryvi" "3henryvi" "henryviii" "john" "richardii"
          "richardiii" "cleopatra" "coriolanus" "hamlet" "julius_caesar" "lear"
          "macbeth" "othello" "romeo_juliet" "timon" "titus"))))
Building the index takes about two and a half hours on my laptop. It contains
22949 distinct words and over 1.6 million word occurrences. Much of the time
spent building the index is in constructing new lists and trees for every change,
which can be avoided by using the mutable data types we cover in the next chap-
ter. The key idea, though, is that the index only needs to be built once. Once the
documents have been indexed, we can use the index to quickly perform any
search.
Searching. Using an index, searching for pages that use a given word is easy
and efficient. Since the index is a sorted binary tree, we use binary-tree-search
to search for a word in the index:
(define (search-in-index index word)
(binary-tree-search
(lambda (el) (string=? word (car el))) ; first element of (word . position)
(lambda (el) (string<? word (car el)))
index))
As analyzed in the previous section, the expected running time of binary-tree-
search is in Θ(log n) where n is the number of nodes in the input tree.³ The
body of search-in-index applies binary-tree-search to the index. The number of
nodes in the index is the number of distinct words in the indexed documents.
So, the running time of search-in-index scales logarithmically with the number
of distinct words in the indexed documents. Note that the number and size of
the documents do not matter! This is why a search engine such as Google
can respond to a query quickly even though its index contains many billions of
documents.
³ Because of the way merge-indexes is defined, we do not actually get this expected running time.
One issue we should be concerned with is the running time of the procedures
passed into binary-tree-search. Our analysis of binary-tree-search assumes the
equality and comparison functions are constant time procedures. Here, the pro-
cedures are string=? and string<?, which both have worst case running times
that are linear in the length of the input string. As used here, one of the inputs
is the target word. So, the amount of work for each binary-tree-search recursive
call is in Θ(w) where w is the length of word. Thus, the overall running time of
search-in-index is in Θ(w log d) where w is the length of word and d is the num-
ber of words in the index. If we assume all words are of some maximum length,
though, the w term disappears as a constant factor (that is, we are assuming
w < C for some constant C). Thus, the overall running time is in Θ(log d).
Here are some examples:
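For instance, a query has this form (the word chosen is arbitrary); assuming
shakespeare-index has been built as above, the result would be the index entry for
the word, a pair of the word and its list of (location . position) entries, or false
if the word does not appear:
(search-in-index shakespeare-index "death")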
Exercise 8.14. Define a procedure for finding the longest word in a document.
Analyze the running time of your procedure.
Exercise 8.15. Produce a list of the words in Shakespeare’s plays sorted by their
length.
Exercise 8.16. [?] Analyze the running time required to build the index.
a. Analyze the running time of the text-to-word-positions procedure. Use n to
represent the number of characters in the input string, and w to represent
the number of distinct words. Be careful to clearly state all assumptions on
which your analysis relies.
b. Analyze the running time of the insert-into-index procedure.
c. Analyze the running time of the index-document procedure.
d. Analyze the running time of the merge-indexes procedure.
e. Analyze the overall running time of the index-pages procedure. Your result
should describe how the running time is impacted by the number of docu-
ments to index, the size of each document, and the number of distinct words.
Exercise 8.17. [?] The search-in-index procedure does not actually have the ex-
pected running time in Θ(log w) (where w is the number of distinct words in
the index) for the Shakespeare index because of the way it is built using merge-
indexes. The problem has to do with the running time of the binary tree on
pathological inputs. Explain why the input to list-to-sorted-tree in the merge-
indexes procedure leads to a binary tree where the running time for searching is
in Θ(w). Modify the merge-indexes definition to avoid this problem and ensure
that searches on the resulting index run in Θ(log w).
In addition to fast indexed search, web search engines have to solve two prob-
lems: (1) find documents to index, and (2) identify the most important docu-
ments that contain a particular search term.
For our Shakespeare index example, we manually found a list of interesting doc-
uments to index. This approach does not scale well to indexing the World Wide
Web where there are trillions of documents and new ones are created all the
time. For this, we need a web crawler.
A web crawler finds documents on the web. Typical web crawlers start with a set
of seed URLs, and then find more documents to index by following the links on
those pages. This proceeds recursively: the links on each newly discovered page
are added to the set of URLs for the crawler to index. To develop a web crawler,
we need a way to extract the links on a given web page, and to manage the set of
pages to index.
a. [?] Define a procedure extract-links that takes as input a string represent-
ing the text of a web page and outputs a list of all the pages linked to from
this page. Linked pages can be found by searching for anchor tags on the
web page. An anchor tag has the form:⁴ <a href=target>. (The text-to-word-
positions procedure may be a helpful starting point for defining extract-links.)
b. Define a procedure crawl-page that takes as input an index and a string rep-
resenting a URL. As output, it produces a pair consisting of an index (that is
the result of adding all words from the page at the input URL to the input
index) and a list of URLs representing all pages linked to by the crawled page.
c. [??] Define a procedure crawl-web that takes as input a list of seed URLs and
a Number indicating the maximum depth of the crawl. It should output an
index of all the words on the web pages at the locations given by the seed
URLs and any page that can be reached from these seed URLs by following
no more than the maximum depth number of links.
For a web search engine to be useful, we don’t want to just get a list of all the
pages that contain some target word, we want the list to be sorted according to
which of those pages are most likely to be interesting. Selecting the best pages
for a given query is a challenging and important problem, and the ability to do
this well is one of the main things that distinguishes web search engines. Many
factors are used to rank pages including an analysis of the text on the page itself,
whether the target word is part of a title, and how recently the page was updated.
    The rank assigned to a document is calculated from the ranks of documents
    citing it. In addition, the rank of a document is calculated from a constant
    representing the probability that a browser through the database will randomly
    jump to the document. The method is particularly useful in enhancing the
    performance of search engine results for hypermedia databases, such as the
    world wide web, whose documents have a large variation in quality.
    United States Patent #6,285,999, September 2001. (Inventor: Lawrence Page,
    Assignee: Stanford University)
The best ways of ranking pages also consider the pages that link to the ranked
The best ways of ranking pages also consider the pages that link to the ranked
page. If many pages link to a given page, it is more likely that the given page is
useful. This property can also be defined recursively: a page is highly ranked if
there are many highly-ranked pages that link to it.
The ranking system used by Google is based on this formula:
    R(u) = ∑ (v ∈ Lu) R(v) / L(v)
⁴ Not all links match this structure exactly, so this may miss some of the links on a page.
where Lu is the set of web pages that contain links to the target page u and L(v)
is the number of links on the page v (thus, the value of a link from a page con-
taining many links is less than the value of a link from a page containing only a
few links). The value R(u) gives a measure of the ranking of the page identified
by u, where higher values indicate more valuable pages.
The problem with this formula is that it is circular: there is no base case, and no
way to order the web pages to compute the correct rank of each page in turn,
since the rank of each page depends on the rank of the pages that link to it.
One way to approximate equations like this one is to use relaxation. Relaxation
obtains an approximate solution to some systems of equations by repeatedly
evaluating the equations. To estimate the page ranks for a set of web pages, we
initially assume every page has rank 1 and evaluate R(u) for all the pages (using
the value of 1 as the rank for every other page). Then, re-evaluate the R(u) values
using the resulting ranks. A relaxation keeps repeating until the values change
by less than some threshold amount, but there is no guarantee how quickly this
will happen. For the page ranking evaluation, it may be enough to decide on
some fixed number of iterations.
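To make the relaxation step concrete, here is a rough sketch of one iteration. It
assumes the link structure is represented as described in part d below (a list of
pairs of a URL and the list of URLs that link to it), and that the current rank
estimates and the number of links on each page are kept in association lists; the
names lookup, relax-step, and sum-ratios are illustrative helpers, not procedures
defined in this chapter:
(define (lookup key pairs default) ; value paired with key, or default if absent
  (if (null? pairs) default
      (if (string=? key (car (car pairs))) (cdr (car pairs))
          (lookup key (cdr pairs) default))))
(define (sum-ratios linkers ranks outcounts) ; adds up R(v) / L(v) for each v
  (if (null? linkers) 0
      (+ (/ (lookup (car linkers) ranks 1)      ; R(v), initially assumed to be 1
            (lookup (car linkers) outcounts 1)) ; L(v); default 1 guards the division
         (sum-ratios (cdr linkers) ranks outcounts))))
(define (relax-step graph ranks outcounts)
  ; graph: list of (url . urls-that-link-to-it) pairs; ranks: (url . rank) pairs;
  ; outcounts: (url . number-of-links-on-that-page) pairs
  (list-map (lambda (entry)
              (cons (car entry) (sum-ratios (cdr entry) ranks outcounts)))
            graph))
Feeding each output of relax-step back in as the new ranks repeats the evaluation;
the iteration can stop after a fixed number of steps or once the ranks stop
changing by much.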
d. [??] Define a procedure, web-link-graph, that takes as input a set of URLs
and produces as output a graph of the linking structure of those documents.
The linking structure can be represented as a List where each element of the
List is a pair of a URL and a list of all the URLs that include a link to that URL.
The extract-links procedure from the previous exploration will be useful for
determining the link targets of a given URL.
e. [?] Define a procedure that takes as input the output of web-link-graph and
outputs a preliminary ranking of each page that measures the number of
other pages that link to that page.
f. [??] Refine your page ranking procedure to weight links from highly-ranked
pages more heavily in a page’s rank by using a relaxation algorithm.
g. [? ? ??] Come up with a cool name, set up your search engine as a web ser-
vice, and attract more than 0.001% of all web searches to your site.
8.3 Summary
The focus of Part II has been on predicting properties of procedures, in particu-
lar how their running time scales with the size of their input. This involved many
encounters with the three powerful ideas introduced in Section 1.4: recursive
definitions, universality, and abstraction. The simple Turing Machine model is
a useful abstraction for modeling nearly all conceivable computing machines,
and the few simple operations it defines are enough for a universal computer.
Actual machines use the digital abstraction to allow a continuous range of volt-
ages to represent just two values. The asymptotic operators used to describe
running times are also a kind of abstraction—they allow us to represent the set
of infinitely many different functions with a compact notation.
In Part III, we will see many more recursive definitions, and extend the notion of
recursive definitions to the language interpreter itself. We change the language
evaluation rules themselves, and see how different evaluation rules enable dif-
ferent ways of expressing computations.