
Fundamental Data Structures

PDF generated using the open source mwlib toolkit. See http://code.pediapress.com/ for more information. PDF generated at: Wed, 29 Aug 2012 18:40:03 UTC

Contents

Articles

Introduction
    Abstract data type
    Data structure
    Analysis of algorithms
    Amortized analysis
    Accounting method
    Potential method

Sequences
    Array data type
    Array data structure
    Dynamic array
    Linked list
    Doubly linked list
    Stack (abstract data type)
    Queue (abstract data type)
    Double-ended queue
    Circular buffer

Dictionaries
    Associative array
    Association list
    Hash table
    Linear probing
    Quadratic probing
    Double hashing
    Cuckoo hashing
    Hopscotch hashing
    Hash function
    Perfect hash function
    Universal hashing
    K-independent hashing
    Tabulation hashing
    Cryptographic hash function

Sets
    Set (abstract data type)
    Bit array
    Bloom filter
    MinHash
    Disjoint-set data structure
    Partition refinement

Priority queues
    Priority queue
    Heap (data structure)
    Binary heap
    d-ary heap
    Binomial heap
    Fibonacci heap
    Pairing heap
    Double-ended priority queue
    Soft heap

Successors and neighbors
    Binary search algorithm
    Binary search tree
    Random binary tree
    Tree rotation
    Self-balancing binary search tree
    Treap
    AVL tree
    Red–black tree
    Scapegoat tree
    Splay tree
    Tango tree
    Skip list
    B-tree
    B+ tree

Integer and string searching
    Trie
    Radix tree
    Directed acyclic word graph
    Suffix tree
    Suffix array
    Van Emde Boas tree
    Fusion tree

References
    Article Sources and Contributors
    Image Sources, Licenses and Contributors

Article Licenses
    License

Introduction
Abstract data type
In computer science, an abstract data type (ADT) is a mathematical model for a certain class of data structures that have similar behavior, or for certain data types of one or more programming languages that have similar semantics. An abstract data type is defined indirectly, only by the operations that may be performed on it and by mathematical constraints on the effects (and possibly cost) of those operations.[1]

For example, an abstract stack data structure could be defined by three operations: push, that inserts some data item onto the structure; pop, that extracts an item from it (with the constraint that each pop always returns the most recently pushed item that has not been popped yet); and peek, that allows data on top of the structure to be examined without removal. When analyzing the efficiency of algorithms that use stacks, one may also specify that all operations take the same time no matter how many items have been pushed into the stack, and that the stack uses a constant amount of storage for each element.

Abstract data types are purely theoretical entities, used (among other things) to simplify the description of abstract algorithms, to classify and evaluate data structures, and to formally describe the type systems of programming languages. However, an ADT may be implemented by specific data types or data structures, in many ways and in many programming languages, or described in a formal specification language. ADTs are often implemented as modules: the module's interface declares procedures that correspond to the ADT operations, sometimes with comments that describe the constraints. This information-hiding strategy allows the implementation of the module to be changed without disturbing the client programs.

The term abstract data type can also be regarded as a generalised approach to a number of algebraic structures, such as lattices, groups, and rings.[2] This can be treated as part of the subject area of artificial intelligence. The notion of abstract data types is related to the concept of data abstraction, important in object-oriented programming and design by contract methodologies for software development.

Defining an abstract data type (ADT)


An abstract data type is defined as a mathematical model of the data objects that make up a data type as well as the functions that operate on these objects. There are no standard conventions for defining them. A broad division may be drawn between "imperative" and "functional" definition styles.

Imperative abstract data type definitions


In the "imperative" view, which is closer to the philosophy of imperative programming languages, an abstract data structure is conceived as an entity that is mutable meaning that it may be in different states at different times. Some operations may change the state of the ADT; therefore, the order in which operations are evaluated is important, and the same operation on the same entities may have different effects if executed at different times just like the instructions of a computer, or the commands and procedures of an imperative language. To underscore this view, it is customary to say that the operations are executed or applied, rather than evaluated. The imperative style is often used when describing abstract algorithms. This is described by Donald E. Knuth and can be referenced from here The Art of Computer Programming.

Abstract variable

Imperative ADT definitions often depend on the concept of an abstract variable, which may be regarded as the simplest non-trivial ADT. An abstract variable V is a mutable entity that admits two operations: store(V,x), where x is a value of unspecified nature, and fetch(V), which yields a value, with the constraint that fetch(V) always returns the value x used in the most recent store(V,x) operation on the same variable V.

As in so many programming languages, the operation store(V,x) is often written V ← x (or some similar notation), and fetch(V) is implied whenever a variable V is used in a context where a value is required. Thus, for example, V ← V + 1 is commonly understood to be a shorthand for store(V, fetch(V) + 1).

In this definition, it is implicitly assumed that storing a value into a variable U has no effect on the state of a distinct variable V. To make this assumption explicit, one could add the constraint that if U and V are distinct variables, the sequence { store(U,x); store(V,y) } is equivalent to { store(V,y); store(U,x) }. More generally, ADT definitions often assume that any operation that changes the state of one ADT instance has no effect on the state of any other instance (including other instances of the same ADT), unless the ADT axioms imply that the two instances are connected (aliased) in that sense. For example, when extending the definition of abstract variable to include abstract records, the operation that selects a field from a record variable R must yield a variable V that is aliased to that part of R.

The definition of an abstract variable V may also restrict the stored values x to members of a specific set X, called the range or type of V. As in programming languages, such restrictions may simplify the description and analysis of algorithms, and improve their readability.

Note that this definition does not imply anything about the result of evaluating fetch(V) when V is un-initialized, that is, before performing any store operation on V. An algorithm that does so is usually considered invalid, because its effect is not defined. (However, there are some important algorithms whose efficiency strongly depends on the assumption that such a fetch is legal, and returns some arbitrary value in the variable's range.)

Instance creation

Some algorithms need to create new instances of some ADT (such as new variables, or new stacks). To describe such algorithms, one usually includes in the ADT definition a create() operation that yields an instance of the ADT, usually with an axiom equivalent to "the result of create() is distinct from any instance S in use by the algorithm." This axiom may be strengthened to exclude also partial aliasing with other instances. On the other hand, this axiom still allows implementations of create() to yield a previously created instance that has become inaccessible to the program.
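For concreteness, here is a minimal sketch in C of how an abstract variable might be exposed behind an opaque handle. The names (var_T, var_create, var_store, var_fetch) are illustrative only and not part of the definition above.

#include <stdio.h>
#include <stdlib.h>

typedef struct var_Rep { int value; } var_Rep;  /* hidden representation */
typedef var_Rep *var_T;                         /* opaque handle to an abstract variable */

var_T var_create(void) {                        /* create(): yields a fresh, distinct instance */
    return malloc(sizeof(var_Rep));
}
void var_store(var_T v, int x) { v->value = x; }   /* store(V,x) */
int  var_fetch(var_T v)        { return v->value; }/* fetch(V): value of the most recent store */

int main(void) {
    var_T v = var_create();
    var_store(v, 41);
    var_store(v, var_fetch(v) + 1);             /* V <- V + 1, i.e. store(V, fetch(V) + 1) */
    printf("%d\n", var_fetch(v));               /* prints 42 */
    free(v);
    return 0;
}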

Preconditions, postconditions, and invariants

In imperative-style definitions, the axioms are often expressed by preconditions, that specify when an operation may be executed; postconditions, that relate the states of the ADT before and after the execution of each operation; and invariants, that specify properties of the ADT that are not changed by the operations.

Example: abstract stack (imperative)

As another example, an imperative definition of an abstract stack could specify that the state of a stack S can be modified only by the operations push(S,x), where x is some value of unspecified nature, and pop(S), that yields a value as a result, with the constraint that

For any value x and any abstract variable V, the sequence of operations { push(S,x); V ← pop(S) } is equivalent to { V ← x }.

Since the assignment { V ← x }, by definition, cannot change the state of S, this condition implies that { V ← pop(S) } restores S to the state it had before the { push(S,x) }. From this condition and from the properties of abstract variables, it follows, for example, that the sequence

{ push(S,x); push(S,y); U ← pop(S); push(S,z); V ← pop(S); W ← pop(S); }

where x, y, and z are any values, and U, V, W are pairwise distinct variables, is equivalent to

{ U ← y; V ← z; W ← x }

Here it is implicitly assumed that operations on a stack instance do not modify the state of any other ADT instance, including other stacks; that is:

For any values x, y, and any distinct stacks S and T, the sequence { push(S,x); push(T,y) } is equivalent to { push(T,y); push(S,x) }.

A stack ADT definition usually includes also a Boolean-valued function empty(S) and a create() operation that returns a stack instance, with axioms equivalent to:
- create() ≠ S for any stack S (a newly created stack is distinct from all previous stacks);
- empty(create()) (a newly created stack is empty);
- not empty(push(S,x)) (pushing something into a stack makes it non-empty).

Single-instance style

Sometimes an ADT is defined as if only one instance of it existed during the execution of the algorithm, and all operations were applied to that instance, which is not explicitly notated. For example, the abstract stack above could have been defined with operations push(x) and pop(), that operate on "the" only existing stack. ADT definitions in this style can be easily rewritten to admit multiple coexisting instances of the ADT, by adding an explicit instance parameter (like S in the previous example) to every operation that uses or modifies the implicit instance.

On the other hand, some ADTs cannot be meaningfully defined without assuming multiple instances. This is the case when a single operation takes two distinct instances of the ADT as parameters. For an example, consider augmenting the definition of the stack ADT with an operation compare(S,T) that checks whether the stacks S and T contain the same items in the same order.


Functional ADT definitions


Another way to define an ADT, closer to the spirit of functional programming, is to consider each state of the structure as a separate entity. In this view, any operation that modifies the ADT is modeled as a mathematical function that takes the old state as an argument, and returns the new state as part of the result. Unlike the "imperative" operations, these functions have no side effects. Therefore, the order in which they are evaluated is immaterial, and the same operation applied to the same arguments (including the same input states) will always return the same results (and output states).

In the functional view, in particular, there is no way (or need) to define an "abstract variable" with the semantics of imperative variables (namely, with fetch and store operations). Instead of storing values into variables, one passes them as arguments to functions.

Example: abstract stack (functional)

For example, a complete functional-style definition of a stack ADT could use the three operations:
- push: takes a stack state and an arbitrary value, returns a stack state;
- top: takes a stack state, returns a value;
- pop: takes a stack state, returns a stack state;
with the following axioms:
- top(push(s,x)) = x (pushing an item onto a stack leaves it at the top)
- pop(push(s,x)) = s (pop undoes the effect of push)

In a functional-style definition there is no need for a create operation. Indeed, there is no notion of "stack instance". The stack states can be thought of as being potential states of a single stack structure, and two stack states that contain the same values in the same order are considered to be identical states. This view actually mirrors the behavior of some concrete implementations, such as linked lists with hash cons.

Instead of create(), a functional definition of a stack ADT may assume the existence of a special stack state, the empty stack, designated by a special symbol like Λ or "()", or define a bottom() operation that takes no arguments and returns this special stack state. Note that the axioms imply that push(Λ,x) ≠ Λ. In a functional definition of a stack one does not need an empty predicate: instead, one can test whether a stack is empty by testing whether it is equal to Λ.

Note that these axioms do not define the effect of top(s) or pop(s), unless s is a stack state returned by a push. Since push leaves the stack non-empty, those two operations are undefined (hence invalid) when s = Λ. On the other hand, the axioms (and the lack of side effects) imply that push(s,x) = push(t,y) if and only if x = y and s = t.

As in some other branches of mathematics, it is customary to assume also that the stack states are only those whose existence can be proved from the axioms in a finite number of steps. In the stack ADT example above, this rule means that every stack is a finite sequence of values, that becomes the empty stack (Λ) after a finite number of pops. By themselves, the axioms above do not exclude the existence of infinite stacks (that can be popped forever, each time yielding a different state) or circular stacks (that return to the same state after a finite number of pops). In particular, they do not exclude states s such that pop(s) = s or push(s,x) = s for some x. However, since one cannot obtain such stack states with the given operations, they are assumed "not to exist".
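A minimal sketch of how these functional-style axioms can be realized in C is an immutable linked representation in which push allocates a new state and shares the old one. All names here are illustrative, and (as discussed later in this article) discarded states are simply never freed.

#include <assert.h>
#include <stdlib.h>

typedef struct node { int item; const struct node *rest; } node;
typedef const node *stack;            /* a stack state; NULL plays the role of the empty stack */

stack push(stack s, int x) {          /* returns a new state; the old state s is unchanged */
    node *n = malloc(sizeof *n);
    n->item = x;
    n->rest = s;
    return n;
}
int   top(stack s) { return s->item; }   /* undefined (here: crashes) on the empty stack */
stack pop(stack s) { return s->rest; }   /* pop(push(s,x)) = s */

int main(void) {
    stack empty = NULL;
    stack s1 = push(empty, 1);
    stack s2 = push(s1, 2);
    assert(top(s2) == 2);             /* top(push(s,x)) = x */
    assert(pop(s2) == s1);            /* pop(push(s,x)) = s (here literally the same state) */
    return 0;
}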


Advantages of abstract data typing


Encapsulation

Abstraction provides a promise that any implementation of the ADT has certain properties and abilities; knowing these is all that is required to make use of an ADT object. The user does not need any technical knowledge of how the implementation works to use the ADT. In this way, the implementation may be complex but will be encapsulated in a simple interface when it is actually used.

Localization of change

Code that uses an ADT object will not need to be edited if the implementation of the ADT is changed. Since any changes to the implementation must still comply with the interface, and since code using an ADT may only refer to properties and abilities specified in the interface, changes may be made to the implementation without requiring any changes in code where the ADT is used.

Flexibility

Different implementations of an ADT, having all the same properties and abilities, are equivalent and may be used somewhat interchangeably in code that uses the ADT. This gives a great deal of flexibility when using ADT objects in different situations. For example, different implementations of an ADT may be more efficient in different situations; it is possible to use each in the situation where they are preferable, thus increasing overall efficiency.

Typical operations
Some operations that are often specified for ADTs (possibly under other names) are:
- compare(s,t), that tests whether two structures are equivalent in some sense;
- hash(s), that computes some standard hash function from the instance's state;
- print(s) or show(s), that produces a human-readable representation of the structure's state.

In imperative-style ADT definitions, one often finds also:
- create(), that yields a new instance of the ADT;
- initialize(s), that prepares a newly created instance s for further operations, or resets it to some "initial state";
- copy(s,t), that puts instance s in a state equivalent to that of t;
- clone(t), that performs s ← new(), copy(s,t), and returns s;
- free(s) or destroy(s), that reclaims the memory and other resources used by s.

The free operation is not normally relevant or meaningful, since ADTs are theoretical entities that do not "use memory". However, it may be necessary when one needs to analyze the storage used by an algorithm that uses the ADT. In that case one needs additional axioms that specify how much memory each ADT instance uses, as a function of its state, and how much of it is returned to the pool by free.

Examples
Some common ADTs, which have proved useful in a great variety of applications, are:
- Container
- Deque
- List
- Map
- Multimap
- Multiset
- Priority queue
- Queue
- Set
- Stack
- String
- Tree

Each of these ADTs may be defined in many ways and variants, not necessarily equivalent. For example, a stack ADT may or may not have a count operation that tells how many items have been pushed and not yet popped. This choice makes a difference not only for its clients but also for the implementation.

Implementation
Implementing an ADT means providing one procedure or function for each abstract operation. The ADT instances are represented by some concrete data structure that is manipulated by those procedures, according to the ADT's specifications. Usually there are many ways to implement the same ADT, using several different concrete data structures. Thus, for example, an abstract stack can be implemented by a linked list or by an array.

An ADT implementation is often packaged as one or more modules, whose interface contains only the signature (number and types of the parameters and results) of the operations. The implementation of the module (namely, the bodies of the procedures and the concrete data structure used) can then be hidden from most clients of the module. This makes it possible to change the implementation without affecting the clients. When implementing an ADT, each instance (in imperative-style definitions) or each state (in functional-style definitions) is usually represented by a handle of some sort.[3]

Modern object-oriented languages, such as C++ and Java, support a form of abstract data types. When a class is used as a type, it is an abstract type that refers to a hidden representation. In this model an ADT is typically implemented as a class, and each instance of the ADT is an object of that class. The module's interface typically declares the constructors as ordinary procedures, and most of the other ADT operations as methods of that class. However, such an approach does not easily encapsulate multiple representational variants found in an ADT. It also can undermine the extensibility of object-oriented programs. In a pure object-oriented program that uses interfaces as types, types refer to behaviors, not representations.

Example: implementation of the stack ADT


As an example, here is an implementation of the stack ADT above in the C programming language.

Imperative-style interface

An imperative-style interface might be:

typedef struct stack_Rep stack_Rep;           /* Type: instance representation (an opaque record). */
typedef stack_Rep *stack_T;                   /* Type: handle to a stack instance (an opaque pointer). */
typedef void *stack_Item;                     /* Type: value that can be stored in stack (arbitrary address). */

stack_T stack_create(void);                   /* Create new stack instance, initially empty. */
void stack_push(stack_T s, stack_Item e);     /* Add an item at the top of the stack. */
stack_Item stack_pop(stack_T s);              /* Remove the top item from the stack and return it. */
int stack_empty(stack_T ts);                  /* Check whether stack is empty. */

This implementation could be used in the following manner:

#include <stack.h>                            /* Include the stack interface. */

stack_T t = stack_create();                   /* Create a stack instance. */
int foo = 17;                                 /* An arbitrary datum. */
stack_push(t, &foo);                          /* Push the address of 'foo' onto the stack. */
void *e = stack_pop(t);                       /* Get the top item and delete it from the stack. */
if (stack_empty(t)) { }                       /* Do something if stack is empty. */

This interface can be implemented in many ways. The implementation may be arbitrarily inefficient, since the formal definition of the ADT, above, does not specify how much space the stack may use, nor how long each operation should take. It also does not specify whether the stack state t continues to exist after a call s ← pop(t).

In practice the formal definition should specify that the space is proportional to the number of items pushed and not yet popped, and that every one of the operations above must finish in a constant amount of time, independently of that number. To comply with these additional specifications, the implementation could use a linked list, or an array (with dynamic resizing) together with two integers (an item count and the array size).

Functional-style interface

Functional-style ADT definitions are more appropriate for functional programming languages, and vice versa. However, one can provide a functional-style interface even in an imperative language like C. For example:

typedef struct stack_Rep stack_Rep;           /* Type: stack state representation (an opaque record). */
typedef stack_Rep *stack_T;                   /* Type: handle to a stack state (an opaque pointer). */
typedef void *stack_Item;                     /* Type: item (arbitrary address). */

stack_T stack_empty(void);                    /* Returns the empty stack state. */
stack_T stack_push(stack_T s, stack_Item x);  /* Adds x at the top of s, returns the resulting state. */
stack_Item stack_top(stack_T s);              /* Returns the item currently at the top of s. */
stack_T stack_pop(stack_T s);                 /* Removes the top item from s, returns the resulting state. */

The main problem is that C lacks garbage collection, and this makes this style of programming impractical; moreover, memory allocation routines in C are slower than allocation in a typical garbage collector, thus the performance impact of so many allocations is even greater.
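One possible (and deliberately naive) implementation of the functional-style interface above is a linked structure that shares tails between states. This sketch is not from the original text; it repeats the typedefs for self-containment and, as noted above, it leaks every discarded state because C has no garbage collector.

#include <stdlib.h>

typedef struct stack_Rep stack_Rep;           /* repeated from the interface above */
typedef stack_Rep *stack_T;
typedef void *stack_Item;

struct stack_Rep {                            /* the hidden representation of one state */
    stack_Item head;                          /* item at the top of this state */
    stack_T    tail;                          /* the state below it (NULL for the empty stack) */
};

stack_T stack_empty(void) { return NULL; }    /* the empty stack state */

stack_T stack_push(stack_T s, stack_Item x) {
    stack_T t = malloc(sizeof *t);            /* every push allocates a new state */
    t->head = x;
    t->tail = s;                              /* the old state is shared, not copied */
    return t;
}

stack_Item stack_top(stack_T s) { return s->head; }  /* undefined for the empty state */
stack_T    stack_pop(stack_T s) { return s->tail; }  /* returns the previous state unchanged */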


ADT libraries
Many modern programming languages, such as C++ and Java, come with standard libraries that implement several common ADTs, such as those listed above.

Built-in abstract data types


The specification of some programming languages is intentionally vague about the representation of certain built-in data types, defining only the operations that can be done on them. Therefore, those types can be viewed as "built-in ADTs". Examples are the arrays in many scripting languages, such as Awk, Lua, and Perl, which can be regarded as an implementation of the Map or Table ADT.

References
[1] Barbara Liskov, "Programming with Abstract Data Types", in Proceedings of the ACM SIGPLAN Symposium on Very High Level Languages, pp. 50-59, 1974, Santa Monica, California.
[2] Rudolf Lidl (2004). Abstract Algebra. Springer. ISBN 81-8128-149-7. Chapter 7, section 40.
[3] Robert Sedgewick (1998). Algorithms in C. Addison-Wesley. ISBN 0-201-31452-5. Definition 4.4.

Further reading
Mitchell, John C.; Plotkin, Gordon (July 1988). "Abstract Types Have Existential Type" (http://theory.stanford.edu/~jcm/papers/mitch-plotkin-88.pdf). ACM Transactions on Programming Languages and Systems 10 (3).

External links
Abstract data type (http://www.nist.gov/dads/HTML/abstractDataType.html) in NIST Dictionary of Algorithms and Data Structures


Data structure
In computer science, a data structure is a particular way of storing and organizing data in a computer so that it can be used efficiently.[1][2]

Different kinds of data structures are suited to different kinds of applications, and some are highly specialized to specific tasks. For example, B-trees are particularly well-suited for the implementation of databases, while compiler implementations usually use hash tables to look up identifiers.

[Figure: a hash table]

Data structures provide a means to manage huge amounts of data efficiently, such as large databases and internet indexing services. Usually, efficient data structures are a key to designing efficient algorithms. Some formal design methods and programming languages emphasize data structures, rather than algorithms, as the key organizing factor in software design.

Overview
- An array data structure stores a number of elements of the same type in a specific order. They are accessed using an integer to specify which element is required (although the elements may be of almost any type). Arrays may be fixed-length or expandable.
- A record (also called a tuple or struct) is among the simplest data structures. A record is a value that contains other values, typically in fixed number and sequence and typically indexed by names. The elements of records are usually called fields or members.
- A hash or dictionary or map is a more flexible variation on a record, in which name-value pairs can be added and deleted freely.
- A union type definition will specify which of a number of permitted primitive types may be stored in its instances, e.g. "float or long integer". Contrast this with a record, which could be defined to contain a float and an integer; in a union, there is only one value at a time.
- A tagged union (also called a variant, variant record, discriminated union, or disjoint union) contains an additional field indicating its current type, for enhanced type safety (a C sketch of the last three kinds appears after this list).
- A set is an abstract data structure that can store specific values, without any particular order, and with no repeated values. Values themselves are not retrieved from sets; rather, one tests a value for membership to obtain a boolean "in" or "not in".
- An object contains a number of data fields, like a record, and also a number of program code fragments for accessing or modifying them. Data structures not containing code, like those above, are called plain old data structures.

Many others are possible, but they tend to be further variations and compounds of the above.
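To make the distinction between a record, a union, and a tagged union concrete, here is a small C sketch; the type and field names are invented for illustration.

#include <stdio.h>

struct point { double x, y; };          /* record: two fields, both present at once */

union number {                          /* union: a float OR a long integer, one at a time */
    double f;
    long   i;
};

struct tagged_number {                  /* tagged union: the tag records which member is live */
    enum { AS_FLOAT, AS_INT } tag;
    union { double f; long i; } value;
};

int main(void) {
    struct point p = { 1.0, 2.0 };
    struct tagged_number n = { .tag = AS_INT, .value.i = 42 };
    if (n.tag == AS_INT)                /* the tag tells us which member it is safe to read */
        printf("(%g, %g) and %ld\n", p.x, p.y, n.value.i);
    return 0;
}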


Basic principles
Data structures are generally based on the ability of a computer to fetch and store data at any place in its memory, specified by an address, a bit string that can itself be stored in memory and manipulated by the program. Thus the record and array data structures are based on computing the addresses of data items with arithmetic operations, while the linked data structures are based on storing addresses of data items within the structure itself. Many data structures use both principles, sometimes combined in non-trivial ways (as in XOR linking).

The implementation of a data structure usually requires writing a set of procedures that create and manipulate instances of that structure. The efficiency of a data structure cannot be analyzed separately from those operations. This observation motivates the theoretical concept of an abstract data type, a data structure that is defined indirectly by the operations that may be performed on it, and the mathematical properties of those operations (including their space and time cost).
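Both addressing principles can be seen directly in C, where array indexing is address arithmetic and a list node stores the address of its successor. This small sketch is only illustrative:

#include <stdio.h>

struct node { int value; struct node *next; };   /* linked: the address of the next item
                                                    is stored inside the structure itself */
int main(void) {
    int a[4] = { 10, 20, 30, 40 };
    /* Array: element i lives at the base address plus i * sizeof(int); the
       compiler performs exactly this arithmetic for a[i]. */
    printf("%d\n", *(a + 2));                    /* same as a[2], prints 30 */

    struct node n2 = { 20, NULL }, n1 = { 10, &n2 };
    /* Linked list: reaching the second item means following the address stored in n1. */
    printf("%d\n", n1.next->value);              /* prints 20 */
    return 0;
}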

Language support
Most assembly languages and some low-level languages, such as BCPL (Basic Combined Programming Language), lack support for data structures. Many high-level programming languages, and some higher-level assembly languages such as MASM, on the other hand, have special syntax or other built-in support for certain data structures, such as vectors (one-dimensional arrays) in the C language or multi-dimensional arrays in Pascal.

Most programming languages feature some sort of library mechanism that allows data structure implementations to be reused by different programs. Modern languages usually come with standard libraries that implement the most common data structures. Examples are the C++ Standard Template Library, the Java Collections Framework, and Microsoft's .NET Framework.

Modern languages also generally support modular programming, the separation between the interface of a library module and its implementation. Some provide opaque data types that allow clients to hide implementation details. Object-oriented programming languages, such as C++, Java, and the .NET Framework, may use classes for this purpose.

Many known data structures have concurrent versions that allow multiple computing threads to access the data structure simultaneously.

References
[1] Paul E. Black (ed.), entry for data structure in Dictionary of Algorithms and Data Structures. U.S. National Institute of Standards and Technology. 15 December 2004. Online version (http://www.itl.nist.gov/div897/sqg/dads/HTML/datastructur.html). Accessed May 21, 2009.
[2] Entry "data structure" in the Encyclopædia Britannica (2009). Online entry (http://www.britannica.com/EBchecked/topic/152190/data-structure), accessed on May 21, 2009.

Further reading
Peter Brass, Advanced Data Structures, Cambridge University Press, 2008.
Donald Knuth, The Art of Computer Programming, vol. 1. Addison-Wesley, 3rd edition, 1997.
Dinesh Mehta and Sartaj Sahni, Handbook of Data Structures and Applications, Chapman and Hall/CRC Press, 2007.
Niklaus Wirth, Algorithms and Data Structures, Prentice Hall, 1985.


External links
UC Berkeley video course on data structures (http://academicearth.org/courses/data-structures)
Descriptions (http://nist.gov/dads/) from the Dictionary of Algorithms and Data Structures
CSE.unr.edu (http://www.cse.unr.edu/~bebis/CS308/)
Data structures course with animations (http://www.cs.auckland.ac.nz/software/AlgAnim/ds_ToC.html)
Data structure tutorials with animations (http://courses.cs.vt.edu/~csonline/DataStructures/Lessons/index.html)
An Examination of Data Structures from a .NET perspective (http://msdn.microsoft.com/en-us/library/aa289148(VS.71).aspx)
Schaffer, C. Data Structures and Algorithm Analysis (http://people.cs.vt.edu/~shaffer/Book/C++3e20110915.pdf)

Analysis of algorithms
In computer science, the analysis of algorithms is the determination of the amount of resources (such as time and storage) necessary to execute them. Most algorithms are designed to work with inputs of arbitrary length. Usually, the efficiency or running time of an algorithm is stated as a function relating the input length to the number of steps (time complexity) or storage locations (space complexity).

Algorithm analysis is an important part of a broader computational complexity theory, which provides theoretical estimates for the resources needed by any algorithm which solves a given computational problem. These estimates provide an insight into reasonable directions of search for efficient algorithms.

In theoretical analysis of algorithms it is common to estimate their complexity in the asymptotic sense, i.e., to estimate the complexity function for arbitrarily large input. Big O notation, omega notation and theta notation are used to this end. For instance, binary search is said to run in a number of steps proportional to the logarithm of the length of the list being searched, or in O(log n), colloquially "in logarithmic time". Usually asymptotic estimates are used because different implementations of the same algorithm may differ in efficiency. However, the efficiencies of any two "reasonable" implementations of a given algorithm are related by a constant multiplicative factor called a hidden constant.

Exact (not asymptotic) measures of efficiency can sometimes be computed, but they usually require certain assumptions concerning the particular implementation of the algorithm, called a model of computation. A model of computation may be defined in terms of an abstract computer, e.g., a Turing machine, and/or by postulating that certain operations are executed in unit time. For example, if the sorted list to which we apply binary search has n elements, and we can guarantee that each lookup of an element in the list can be done in unit time, then at most log2(n) + 1 time units are needed to return an answer.
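As an illustration of the log2(n) + 1 bound just mentioned, a conventional binary search over a sorted array might look as follows in C; this is a sketch, not taken from the article.

#include <stddef.h>

/* Returns the index of key in a sorted array of length n, or -1 if absent.
   Each iteration halves the remaining range, so at most about log2(n) + 1
   comparisons are made. */
long binary_search(const int *a, size_t n, int key) {
    size_t lo = 0, hi = n;            /* search the half-open range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] == key)
            return (long)mid;
        else if (a[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return -1;
}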

Cost models
Time efficiency estimates depend on what we define to be a step. For the analysis to correspond usefully to the actual execution time, the time required to perform a step must be guaranteed to be bounded above by a constant. One must be careful here; for instance, some analyses count an addition of two numbers as one step. This assumption may not be warranted in certain contexts. For example, if the numbers involved in a computation may be arbitrarily large, the time required by a single addition can no longer be assumed to be constant.

Two cost models are generally used:[1][2][3][4][5]
- the uniform cost model, also called uniform-cost measurement (and similar variations), assigns a constant cost to every machine operation, regardless of the size of the numbers involved;
- the logarithmic cost model, also called logarithmic-cost measurement (and variations thereof), assigns a cost to every machine operation proportional to the number of bits involved.

The latter is more cumbersome to use, so it is only employed when necessary, for example in the analysis of arbitrary-precision arithmetic algorithms, like those used in cryptography.

A key point which is often overlooked is that published lower bounds for problems are often given for a model of computation that is more restricted than the set of operations that one could use in practice, and therefore there are algorithms that are faster than what would naively be thought possible.[6]


Run-time analysis
Run-time analysis is a theoretical classification that estimates and anticipates the increase in running time (or run-time) of an algorithm as its input size (usually denoted as n) increases. Run-time efficiency is a topic of great interest in computer science: A program can take seconds, hours or even years to finish executing, depending on which algorithm it implements (see also performance analysis, which is the analysis of an algorithm's run-time in practice).

Shortcomings of empirical metrics


Since algorithms are platform-independent (i.e. a given algorithm can be implemented in an arbitrary programming language on an arbitrary computer running an arbitrary operating system), there are significant drawbacks to using an empirical approach to gauge the comparative performance of a given set of algorithms. Take as an example a program that looks up a specific entry in a sorted list of size n. Suppose this program were implemented on Computer A, a state-of-the-art machine, using a linear search algorithm, and on Computer B, a much slower machine, using a binary search algorithm. Benchmark testing on the two computers running their respective programs might look something like the following:
n (list size)    Computer A run-time (in nanoseconds)    Computer B run-time (in nanoseconds)
15               7                                       100,000
65               32                                      150,000
250              125                                     200,000
1,000            500                                     250,000

Based on these metrics, it would be easy to jump to the conclusion that Computer A is running an algorithm that is far superior in efficiency to that of Computer B. However, if the size of the input-list is increased to a sufficient number, that conclusion is dramatically demonstrated to be in error:


n (list size)      Computer A run-time (in nanoseconds)    Computer B run-time (in nanoseconds)
15                 7                                       100,000
65                 32                                      150,000
250                125                                     200,000
1,000              500                                     250,000
...                ...                                     ...
1,000,000          500,000                                 500,000
4,000,000          2,000,000                               550,000
16,000,000         8,000,000                               600,000
...                ...                                     ...
63,072 × 10^12     31,536 × 10^12 ns, or 1 year            1,375,000 ns, or 1.375 milliseconds

Computer A, running the linear search program, exhibits a linear growth rate. The program's run-time is directly proportional to its input size. Doubling the input size doubles the run time, quadrupling the input size quadruples the run-time, and so forth. On the other hand, Computer B, running the binary search program, exhibits a logarithmic growth rate. Doubling the input size only increases the run time by a constant amount (in this example, 25,000 ns). Even though Computer A is ostensibly a faster machine, Computer B will inevitably surpass Computer A in run-time because it's running an algorithm with a much slower growth rate.

Orders of growth
Informally, an algorithm can be said to exhibit a growth rate on the order of a mathematical function if beyond a certain input size n, the function f(n) times a positive constant provides an upper bound or limit for the run-time of that algorithm. In other words, for a given input size n greater than some n0 and a constant c, the running time of that algorithm will never be larger than c × f(n). This concept is frequently expressed using Big O notation. For example, since the run-time of insertion sort grows quadratically as its input size increases, insertion sort can be said to be of order O(n²). Big O notation is a convenient way to express the worst-case scenario for a given algorithm, although it can also be used to express the average case; for example, the worst-case scenario for quicksort is O(n²), but the average-case run-time is O(n log n).[7]

Empirical orders of growth


Assuming the execution time follows a power rule, t ≈ k n^a, the coefficient a can be found[8] by taking empirical measurements of run time {t1, t2} at some problem-size points {n1, n2}, and calculating t2/t1 = (n2/n1)^a, so that a = log(t2/t1) / log(n2/n1). If the order of growth indeed follows the power rule, the empirical value of a will stay constant at different ranges, and if not, it will change; but it could still serve for comparison of any two given algorithms as to their empirical local orders of growth behaviour. Applied to the above table:


n (list size)   Computer A run-time (ns)   Local order of growth (n^_)   Computer B run-time (ns)   Local order of growth (n^_)
15              7                                                        100,000
65              32                         1.04                          150,000                    0.28
250             125                        1.01                          200,000                    0.21
1,000           500                        1.00                          250,000                    0.16
...             ...                        ...                           ...                        ...
1,000,000       500,000                                                  500,000
4,000,000       2,000,000                  1.00                          550,000                    0.07
16,000,000      8,000,000                  1.00                          600,000                    0.06

It is clearly seen that the first algorithm exhibits a linear order of growth, indeed following the power rule. The empirical values for the second one are diminishing rapidly, suggesting it follows another rule of growth and, in any case, has much lower local orders of growth (and is improving further still), empirically, than the first one.
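The calculation behind these columns can be made explicit. The following sketch (illustrative only, not part of the original text) computes the local order of growth a = log(t2/t1) / log(n2/n1) for the first pair of rows of each column:

#include <math.h>
#include <stdio.h>

/* Estimate the exponent a in t ~ k * n^a from two measurements (n1, t1) and (n2, t2). */
double local_order(double n1, double t1, double n2, double t2) {
    return log(t2 / t1) / log(n2 / n1);
}

int main(void) {
    printf("Computer A: %.2f\n", local_order(15, 7, 65, 32));          /* about 1.04 */
    printf("Computer B: %.2f\n", local_order(15, 100000, 65, 150000)); /* about 0.28 */
    return 0;
}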

Evaluating run-time complexity


The run-time complexity for the worst-case scenario of a given algorithm can sometimes be evaluated by examining the structure of the algorithm and making some simplifying assumptions. Consider the following pseudocode: 1 2 3 4 5 6 7 get a positive integer from input if n > 10 print "This might take a while..." for i = 1 to n for j = 1 to i print i * j print "Done!"

A given computer will take a discrete amount of time to execute each of the instructions involved with carrying out this algorithm. The specific amount of time to carry out a given instruction will vary depending on which instruction is being executed and which computer is executing it, but on a conventional computer, this amount will be deterministic.[9] Say that the actions carried out in step 1 are considered to consume time T1, step 2 uses time T2, and so forth.

In the algorithm above, steps 1, 2 and 7 will only be run once. For a worst-case evaluation, it should be assumed that step 3 will be run as well. Thus the total amount of time to run steps 1-3 and step 7 is:

    T1 + T2 + T3 + T7

The loops in steps 4, 5 and 6 are trickier to evaluate. The outer loop test in step 4 will execute (n + 1) times (note that an extra step is required to terminate the for loop, hence n + 1 and not n executions), which will consume T4(n + 1) time. The inner loop, on the other hand, is governed by the value of i, which iterates from 1 to n. On the first pass through the outer loop, j iterates from 1 to 1: the inner loop makes one pass, so running the inner loop body (step 6) consumes T6 time, and the inner loop test (step 5) consumes 2T5 time. During the next pass through the outer loop, j iterates from 1 to 2: the inner loop makes two passes, so running the inner loop body (step 6) consumes 2T6 time, and the inner loop test (step 5) consumes 3T5 time.

Altogether, the total time required to run the inner loop body can be expressed as an arithmetic progression:

    T6 + 2T6 + 3T6 + ... + (n - 1)T6 + nT6


which can be factored[10] as

    [1/2 (n² + n)] T6

The total time required to run the inner loop test can be evaluated similarly:

    2T5 + 3T5 + 4T5 + ... + (n + 1)T5

which can be factored as

    [1/2 (n² + 3n)] T5

Therefore, the total running time for this algorithm is:

    f(n) = T1 + T2 + T3 + T7 + (n + 1)T4 + [1/2 (n² + n)] T6 + [1/2 (n² + 3n)] T5

which reduces to

    f(n) = [1/2 (n² + n)] T6 + [1/2 (n² + 3n)] T5 + (n + 1)T4 + T1 + T2 + T3 + T7

As a rule-of-thumb, one can assume that the highest-order term in any given function dominates its rate of growth and thus defines its run-time order. In this example, n² is the highest-order term, so one can conclude that f(n) = O(n²). Formally this can be proven as follows:

Prove that

    [1/2 (n² + n)] T6 + [1/2 (n² + 3n)] T5 + (n + 1)T4 + T1 + T2 + T3 + T7 ≤ cn²   (for some constant c and all n ≥ n0)

First,

    [1/2 (n² + n)] T6 + [1/2 (n² + 3n)] T5 + (n + 1)T4 + T1 + T2 + T3 + T7
        ≤ (n² + n)T6 + (n² + 3n)T5 + (n + 1)T4 + T1 + T2 + T3 + T7   (for n ≥ 0)

Let k be a constant greater than or equal to [T1..T7]. Then

    (n² + n)T6 + (n² + 3n)T5 + (n + 1)T4 + T1 + T2 + T3 + T7
        ≤ k(n² + n) + k(n² + 3n) + k(n + 1) + 4k
        = 2kn² + 5kn + 5k
        ≤ 2kn² + 5kn² + 5kn² = 12kn²   (for n ≥ 1)

Therefore f(n) ≤ 12kn² for all n ≥ 1, so f(n) = O(n²) (with c = 12k and n0 = 1).

A more elegant approach to analyzing this algorithm would be to declare that [T1..T7] are all equal to one unit of time, in a system of units chosen so that one unit is greater than or equal to the actual times for these steps. This would mean that the algorithm's running time breaks down as follows:[11]

    f(n) ≈ 4 + n + 2(1 + 2 + ... + n) = 4 + n + n(n + 1) = n² + 2n + 4 ≤ 7n²   (for n ≥ 1)

so again f(n) = O(n²).
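Instead of timing the pseudocode, one can simply count its steps. The following sketch (not part of the original analysis) verifies the counts used above for the inner loop body and the inner loop test:

#include <assert.h>
#include <stdio.h>

int main(void) {
    long n = 1000, body = 0, test = 0;
    for (long i = 1; i <= n; i++) {         /* step 4 */
        long j;
        for (j = 1; j <= i; j++)            /* step 5: the test runs i + 1 times per pass */
            body++;                         /* step 6: the body runs i times per pass */
        test += j;                          /* j ended at i + 1 */
    }
    assert(body == n * (n + 1) / 2);        /* matches [1/2 (n² + n)] */
    assert(test == n * (n + 1) / 2 + n);    /* matches [1/2 (n² + 3n)] */
    printf("body = %ld, test = %ld\n", body, test);
    return 0;
}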

Growth rate analysis of other resources


The methodology of run-time analysis can also be utilized for predicting other growth rates, such as consumption of memory space. As an example, consider the following pseudocode, which manages and reallocates memory usage by a program based on the size of a file which that program manages:

    while (file still open)
        let n = size of file
        for every 100,000 kilobytes of increase in file size
            double the amount of memory reserved

In this instance, as the file size n increases, memory will be consumed at an exponential growth rate, which is order O(2^n).[12]


Relevance
Algorithm analysis is important in practice because the accidental or unintentional use of an inefficient algorithm can significantly impact system performance. In time-sensitive applications, an algorithm taking too long to run can render its results outdated or useless. An inefficient algorithm can also end up requiring an uneconomical amount of computing power or storage in order to run, again rendering it practically useless.

Notes
[1] Alfred V. Aho; John E. Hopcroft; Jeffrey D. Ullman (1974). The Design and Analysis of Computer Algorithms. Addison-Wesley, section 1.3.
[2] Juraj Hromkovič (2004). Theoretical Computer Science: Introduction to Automata, Computability, Complexity, Algorithmics, Randomization, Communication, and Cryptography (http://books.google.com/books?id=KpNet-n262QC&pg=PA177). Springer. pp. 177-178. ISBN 978-3-540-14015-3.
[3] Giorgio Ausiello (1999). Complexity and Approximation: Combinatorial Optimization Problems and Their Approximability Properties (http://books.google.com/books?id=Yxxw90d9AuMC&pg=PA3). Springer. pp. 3-8. ISBN 978-3-540-65431-5.
[4] Wegener, Ingo (2005). Complexity Theory: Exploring the Limits of Efficient Algorithms (http://books.google.com/books?id=u7DZSDSUYlQC&pg=PA20). Berlin, New York: Springer-Verlag, p. 20. ISBN 978-3-540-21045-0.
[5] Robert Endre Tarjan (1983). Data Structures and Network Algorithms (http://books.google.com/books?id=JiC7mIqg-X4C&pg=PA3). SIAM. pp. 3-7. ISBN 978-0-89871-187-5.
[6] Examples of the price of abstraction? (http://cstheory.stackexchange.com/questions/608/examples-of-the-price-of-abstraction), cstheory.stackexchange.com
[7] The term lg is often used as shorthand for log2.
[8] How To Avoid O-Abuse and Bribes (http://rjlipton.wordpress.com/2009/07/24/how-to-avoid-o-abuse-and-bribes/), at the blog "Gödel's Lost Letter and P=NP" by R. J. Lipton, professor of Computer Science at Georgia Tech, recounting an idea by Robert Sedgewick.
[9] However, this is not the case with a quantum computer.
[10] It can be proven by induction that 1 + 2 + 3 + ... + n = n(n + 1)/2.
[11] This approach, unlike the above approach, neglects the constant time consumed by the loop tests which terminate their respective loops, but it is trivial to prove that such omission does not affect the final result.
[12] Note that this is an extremely rapid and most likely unmanageable growth rate for consumption of memory resources.

References
Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001). Introduction to Algorithms, Chapter 1: Foundations (Second ed.). Cambridge, MA: MIT Press and McGraw-Hill. pp. 3-122. ISBN 0-262-03293-7.
Sedgewick, Robert (1998). Algorithms in C, Parts 1-4: Fundamentals, Data Structures, Sorting, Searching (3rd ed.). Reading, MA: Addison-Wesley Professional. ISBN 978-0-201-31452-6.
Knuth, Donald. The Art of Computer Programming. Addison-Wesley.
Greene, Daniel A.; Knuth, Donald E. (1982). Mathematics for the Analysis of Algorithms (Second ed.). Birkhäuser. ISBN 3-7463-3102-X.
Goldreich, Oded (2010). Computational Complexity: A Conceptual Perspective. Cambridge University Press. ISBN 978-0-521-88473-0.


Amortized analysis
In computer science, amortized analysis is a method of analyzing algorithms that considers the entire sequence of operations of the program. It allows for the establishment of a worst-case bound for the performance of an algorithm irrespective of the inputs by looking at all of the operations. At the heart of the method is the idea that while certain operations may be extremely costly in resources, they cannot occur at a high-enough frequency to weigh down the entire program because the number of less costly operations will far outnumber the costly ones in the long run, "paying back" the program over a number of iterations.[1] It is particularly useful because it guarantees worst-case performance rather than making assumptions about the state of the program.

History
Amortized analysis initially emerged from a method called aggregate analysis, which is now subsumed by amortized analysis. However, the technique was first formally introduced by Robert Tarjan in his paper Amortized Computational Complexity, which addressed the need for a more useful form of analysis than the common probabilistic methods used. Amortization was initially used for very specific types of algorithms, particularly those involving binary trees and union operations. However, it is now ubiquitous and comes into play when analyzing many other algorithms as well.[1]

Method
The method requires knowledge of which series of operations are possible. This is most commonly the case with data structures, which have state that persists between operations. The basic idea is that a worst-case operation can alter the state in such a way that the worst case cannot occur again for a long time, thus "amortizing" its cost.

There are generally three methods for performing amortized analysis: the aggregate method, the accounting method, and the potential method. All of these give the same answers, and their usage difference is primarily circumstantial and due to individual preference.[2]
- Aggregate analysis determines the upper bound T(n) on the total cost of a sequence of n operations, then calculates the average cost to be T(n) / n.[2]
- The accounting method determines the individual cost of each operation, combining its immediate execution time and its influence on the running time of future operations. Usually, many short-running operations accumulate a "debt" of unfavorable state in small increments, while rare long-running operations decrease it drastically.[2]
- The potential method is like the accounting method, but overcharges operations early to compensate for undercharges later.[2]

Common use
In common usage, an "amortized algorithm" is one that an amortized analysis has shown to perform well. Online algorithms commonly use amortized analysis.

References
Allan Borodin and Ran El-Yaniv (1998). Online Computation and Competitive Analysis [3]. Cambridge University Press. pp. 20, 141.
[1] Rebecca Fiebrink (2007), Amortized Analysis Explained (http://www.cs.princeton.edu/~fiebrink/423/AmortizedAnalysisExplained_Fiebrink.pdf), retrieved 2011-05-03.
[2] Vijaya Ramachandran (2006), CS357 Lecture 16: Amortized Analysis (http://www.cs.utexas.edu/~vlr/s06.357/notes/lec16.pdf), retrieved 2011-05-03.
[3] http://www.cs.technion.ac.il/~rani/book.html


Accounting method
In the field of analysis of algorithms in computer science, the accounting method is a method of amortized analysis based on accounting. The accounting method often gives a more intuitive account of the amortized cost of an operation than either aggregate analysis or the potential method. Note, however, that this does not guarantee such analysis will be immediately obvious; often, choosing the correct parameters for the accounting method requires as much knowledge of the problem and the complexity bounds one is attempting to prove as the other two methods. The accounting method is most naturally suited for proving an O(1) bound on time. The method as explained here is for proving such a bound.

The method
Preliminarily, we choose a set of elementary operations which will be used in the algorithm, and arbitrarily set their cost to 1. The fact that the costs of these operations may in reality differ presents no difficulty in principle. What is important is that each elementary operation has a constant cost.

Each aggregate operation is assigned a "payment". The payment is intended to cover the cost of elementary operations needed to complete this particular operation, with some of the payment left over, placed in a pool to be used later.

The difficulty with problems that require amortized analysis is that, in general, some of the operations will require greater than constant cost. This means that no constant payment will be enough to cover the worst-case cost of an operation, in and of itself. With proper selection of payment, however, this is no longer a difficulty; the expensive operations will only occur when there is sufficient payment in the pool to cover their costs.

Examples
A few examples will help to illustrate the use of the accounting method.

Table expansion
It is often necessary to create a table before it is known how much space is needed. One possible strategy is to double the size of the table when it is full. Here we will use the accounting method to show that the amortized cost of an insertion operation in such a table is O(1).

Before looking at the procedure in detail, we need some definitions. Let T be a table, E an element to insert, num(T) the number of elements in T, and size(T) the allocated size of T. We assume the existence of operations create_table(n), which creates an empty table of size n, for now assumed to be free, and elementary_insert(T,E), which inserts element E into a table T that already has space allocated, with a cost of 1.

The following pseudocode illustrates the table insertion procedure:

    function table_insert(T,E)
        if num(T) = size(T)
            U := create_table(2 × size(T))
            for each F in T
                elementary_insert(U,F)
            T := U
        elementary_insert(T,E)

Without amortized analysis, the best bound we can show for n insert operations is O(n²); this is due to the loop at line 4 that performs num(T) elementary insertions.

For analysis using the accounting method, we assign a payment of 3 to each table insertion. Although the reason for this is not clear now, it will become clear during the course of the analysis.

Assume that initially the table is empty with size(T) = m. The first m insertions therefore do not require reallocation and only have cost 1 (for the elementary insert). Therefore, when num(T) = m, the pool has (3 − 1)m = 2m.

Inserting element m + 1 requires reallocation of the table. Creating the new table on line 3 is free (for now). The loop on line 4 requires m elementary insertions, for a cost of m. Including the insertion on the last line, the total cost for this operation is m + 1. After this operation, the pool therefore has 2m + 3 − (m + 1) = m + 2.

Next, we add another m − 1 elements to the table. At this point the pool has m + 2 + 2(m − 1) = 3m. Inserting an additional element (that is, element 2m + 1) can be seen to have cost 2m + 1 and a payment of 3. After this operation, the pool has 3m + 3 − (2m + 1) = m + 2. Note that this is the same amount as after inserting element m + 1. In fact, we can show that this will be the case for any number of reallocations.

It can now be made clear why the payment for an insertion is 3: 1 pays for inserting the element the first time it is added to the table, 1 pays for moving it the next time the table is expanded, and 1 pays for moving one of the elements that was already in the table the next time the table is expanded.

We initially assumed that creating a table was free. In reality, creating a table of size n may be as expensive as O(n). Let us say that the cost of creating a table of size n is n. Does this new cost present a difficulty? Not really; it turns out we use the same method to show the amortized O(1) bounds. All we have to do is change the payment. When a new table is created, there is an old table with m entries. The new table will be of size 2m. As long as the entries currently in the table have added enough to the pool to pay for creating the new table, we will be all right. We cannot expect the first m/2 entries to help pay for the new table; those entries already paid for the current table. We must then rely on the last m/2 entries to pay the cost of 2m. This means we must add 4 to the payment for each entry, for a total payment of 3 + 4 = 7.
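As an illustrative sketch (not part of the original analysis, and with a made-up initial size), the following Python simulation tracks the credit pool for a sequence of insertions with a payment of 7 per insertion, charging 1 per elementary insert and n for creating a table of size n. The pool never goes negative, which is the invariant the accounting argument relies on; with table creation assumed free, a payment of 3 would suffice instead.

    # Illustrative simulation of the accounting-method analysis sketched above.
    # Payment per insertion is 7; elementary inserts cost 1 each and creating a
    # table of size n costs n, as assumed in the text.

    def simulate(num_insertions, initial_size=4, payment=7):
        size = initial_size          # allocated size of the table
        count = 0                    # number of elements currently stored
        pool = 0                     # accumulated credit
        for _ in range(num_insertions):
            pool += payment
            cost = 0
            if count == size:        # table is full: reallocate and copy
                size *= 2
                cost += size         # cost of create_table(2 * old size)
                cost += count        # cost of copying the old elements
            cost += 1                # elementary insert of the new element
            pool -= cost
            assert pool >= 0, "credit pool went negative"
            count += 1
        return pool

    if __name__ == "__main__":
        print(simulate(1000))        # remaining credit after 1000 insertions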

References
Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Second Edition. MIT Press and McGraw-Hill, 2001. ISBN 0-262-03293-7. Section 17.2: The accounting method, pp. 410–412.


Potential method
In computational complexity theory, the potential method is a method used to analyze the amortized time and space complexity of a data structure, a measure of its performance over sequences of operations that smooths out the cost of infrequent but expensive operations.[1][2]

Definition of amortized time


In the potential method, a function Φ is chosen that maps states of the data structure to non-negative numbers. If S is a state of the data structure, Φ(S) may be thought of intuitively as an amount of potential energy stored in that state;[1][2] alternatively, Φ(S) may be thought of as representing the amount of disorder in state S or its distance from an ideal state. The potential value prior to the operation of initializing a data structure is defined to be zero. Let o be any individual operation within a sequence of operations on some data structure, with S_before denoting the state of the data structure prior to operation o and S_after denoting its state after operation o has completed. Then, once Φ has been chosen, the amortized time for operation o is defined to be

    T_amortized(o) = T_actual(o) + C·(Φ(S_after) − Φ(S_before)),

where C is a non-negative constant of proportionality (in units of time) that must remain fixed throughout the analysis. That is, the amortized time is defined to be the actual time taken by the operation plus C times the difference in potential caused by the operation.[1][2]

Relation between amortized and actual time


Despite its artificial appearance, the total amortized time of a sequence of operations provides a valid upper bound on the actual time for the same sequence of operations. That is, for any sequence of operations o_1, o_2, …, o_n, the total amortized time Σ_i T_amortized(o_i) is always at least as large as the total actual time Σ_i T_actual(o_i). In more detail,

    Σ_i T_amortized(o_i) = Σ_i [ T_actual(o_i) + C·(Φ(S_i) − Φ(S_{i−1})) ]
                         = Σ_i T_actual(o_i) + C·(Φ(S_n) − Φ(S_0))
                         ≥ Σ_i T_actual(o_i),

where S_i denotes the state of the data structure after operation o_i and S_0 its initial state. The sequence of potential function values forms a telescoping series in which all terms other than the initial and final potential function values cancel in pairs, and the final inequality arises from the assumptions that Φ(S_n) ≥ 0 and Φ(S_0) = 0. Therefore, amortized time can be used to provide accurate predictions about the actual time of sequences of operations, even though the amortized time for an individual operation may vary widely from its actual time.

Amortized analysis of worst-case inputs


Typically, amortized analysis is used in combination with a worst case assumption about the input sequence. With this assumption, if X is a type of operation that may be performed by the data structure, and n is an integer defining the size of the given data structure (for instance, the number of items that it contains), then the amortized time for operations of type X is defined to be the maximum, among all possible sequences of operations on data structures of size n and all operations oi of type X within the sequence, of the amortized time for operation oi. With this definition, the time to perform a sequence of operations may be estimated by multiplying the amortized time for each type of operation in the sequence by the number of operations of that type.


Example
A dynamic array is a data structure for maintaining an array of items, allowing both random access to positions within the array and the ability to increase the array size by one. It is available in Java as the "ArrayList" type and in Python as the "list" type. A dynamic array may be implemented by a data structure consisting of an array A of items, of some length N, together with a number n ≤ N representing the positions within the array that have been used so far. With this structure, random accesses to the dynamic array may be implemented by accessing the same cell of the internal array A, and when n < N an operation that increases the dynamic array size may be implemented simply by incrementing n. However, when n = N, it is necessary to resize A, and a common strategy for doing so is to double its size, replacing A by a new array of length 2n.[3]

This structure may be analyzed using the potential function Φ = 2n − N. Since the resizing strategy always causes A to be at least half-full, this potential function is always non-negative, as desired. When an increase-size operation does not lead to a resize operation, Φ increases by 2, a constant. Therefore, the constant actual time of the operation and the constant increase in potential combine to give a constant amortized time for an operation of this type. However, when an increase-size operation causes a resize, n equals N beforehand, so the potential decreases from Φ = 2n − N = n prior to the resize to zero after the resize (when the new array has length 2n). Allocating a new internal array A and copying all of the values from the old internal array to the new one takes O(n) actual time, but (with an appropriate choice of the constant of proportionality C) this is entirely cancelled by the decrease in the potential function, leaving again a constant total amortized time for the operation. The other operations of the data structure (reading and writing array cells without changing the array size) do not cause the potential function to change and have the same constant amortized time as their actual time.[2]

Therefore, with this choice of resizing strategy and potential function, the potential method shows that all dynamic array operations take constant amortized time. Combining this with the inequality relating amortized time and actual time over sequences of operations, this shows that any sequence of n dynamic array operations takes O(n) actual time in the worst case, despite the fact that some of the individual operations may themselves take a linear amount of time.[2]
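The following Python sketch (an illustration, not taken from the cited sources; class and method names are made up) implements the doubling strategy and tracks the potential Φ = 2n − N, showing that actual cost plus change in potential stays bounded by a constant for every append, with C = 1.

    # Illustrative sketch: dynamic array with doubling, analyzed with Phi = 2n - N.
    class DynamicArray:
        def __init__(self):
            self.capacity = 0              # N: allocated length of the internal array
            self.size = 0                  # n: cells used so far
            self.data = []

        def append(self, value):
            """Append one item and return the 'actual cost' in elementary writes."""
            cost = 0
            if self.size == self.capacity:             # resize: double the capacity
                new_capacity = max(1, 2 * self.capacity)
                new_data = [None] * new_capacity
                for i in range(self.size):             # copy the old elements
                    new_data[i] = self.data[i]
                    cost += 1
                self.data = new_data
                self.capacity = new_capacity
            self.data[self.size] = value
            self.size += 1
            cost += 1                                  # the write of the new element
            return cost

        def potential(self):
            return 2 * self.size - self.capacity       # Phi = 2n - N, never negative

    if __name__ == "__main__":
        arr = DynamicArray()
        phi = arr.potential()
        for i in range(1000):
            cost = arr.append(i)
            new_phi = arr.potential()
            amortized = cost + (new_phi - phi)         # C = 1
            assert amortized <= 3                      # constant amortized time
            phi = new_phi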

Applications
The potential function method is commonly used to analyze Fibonacci heaps, a form of priority queue in which removing an item takes logarithmic amortized time, and all other operations take constant amortized time.[4] It may also be used to analyze splay trees, a self-adjusting form of binary search tree with logarithmic amortized time per operation.[5]

References
[1] Goodrich, Michael T.; Tamassia, Roberto (2002), "1.5.1 Amortization Techniques", Algorithm Design: Foundations, Analysis and Internet Examples, Wiley, pp. 36–38.
[2] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001) [1990]. "17.3 The potential method". Introduction to Algorithms (2nd ed.). MIT Press and McGraw-Hill. pp. 412–416. ISBN 0-262-03293-7.
[3] Goodrich and Tamassia, 1.5.2 Analyzing an Extendable Array Implementation, pp. 139–141; Cormen et al., 17.4 Dynamic tables, pp. 416–424.
[4] Cormen et al., Chapter 20, "Fibonacci Heaps", pp. 476–497.
[5] Goodrich and Tamassia, Section 3.4, "Splay Trees", pp. 185–194.


Sequences
Array data type
In computer science, an array type is a data type that is meant to describe a collection of elements (values or variables), each selected by one or more indices (identifying keys) that can be computed at run time by the program. Such a collection is usually called an array variable, array value, or simply array.[1] By analogy with the mathematical concepts of vector and matrix, array types with one and two indices are often called vector type and matrix type, respectively. Language support for array types may include certain built-in array data types, some syntactic constructions (array type constructors) that the programmer may use to define such types and declare array variables, and special notation for indexing array elements.[1] For example, in the Pascal programming language, the declaration type MyTable = array [1..4,1..2] of integer defines a new array data type called MyTable. The declaration var A: MyTable then defines a variable A of that type, which is an aggregate of eight elements, each being an integer variable identified by two indices. In the Pascal program, those elements are denoted A[1,1], A[1,2], A[2,1], …, A[4,2].[2] Special array types are often defined by the language's standard libraries. Array types are distinguished from record types mainly because they allow the element indices to be computed at run time, as in the Pascal assignment A[I,J] := A[N-I,2*J]. Among other things, this feature allows a single iterative statement to process arbitrarily many elements of an array variable. In more theoretical contexts, especially in type theory and in the description of abstract algorithms, the terms "array" and "array type" sometimes refer to an abstract data type (ADT), also called an abstract array, or may refer to an associative array: a mathematical model with the basic operations and behavior of a typical array type in most languages, basically a collection of elements that are selected by indices computed at run time. Depending on the language, array types may overlap (or be identified with) other data types that describe aggregates of values, such as lists and strings. Array types are often implemented by array data structures, but sometimes by other means, such as hash tables, linked lists, or search trees.

History
Assembly languages and low-level languages like BCPL[3] generally have no syntactic support for arrays. Because of the importance of array structures for efficient computation, the earliest high-level programming languages, including FORTRAN (1957), COBOL (1960), and Algol 60 (1960), provided support for multi-dimensional arrays.

Abstract arrays
An array data structure can be mathematically modeled as an abstract data structure (an abstract array) with two operations:

    get(A, I): the data stored in the element of the array A whose indices are the integer tuple I.
    set(A, I, V): the array that results by setting the value of that element to V.

These operations are required to satisfy the axioms[4]

    get(set(A, I, V), I) = V
    get(set(A, I, V), J) = get(A, J)   if I ≠ J

for any array state A, any value V, and any tuples I, J for which the operations are defined. The first axiom means that each element behaves like a variable. The second axiom means that elements with distinct indices behave as disjoint variables, so that storing a value in one element does not affect the value of any other element. These axioms do not place any constraints on the set of valid index tuples I; therefore this abstract model can be used for triangular matrices and other oddly-shaped arrays.
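As a purely illustrative sketch (the function names below are not from the text), these axioms can be modeled with a mapping from index tuples to values, with set producing a new array state:

    # Illustrative model of the abstract-array axioms using a mapping.
    def array_set(a, index, value):
        """Return a new abstract array equal to a except at the given index tuple."""
        new_a = dict(a)
        new_a[index] = value
        return new_a

    def array_get(a, index):
        return a[index]

    if __name__ == "__main__":
        a = {(i, j): 0 for i in range(4) for j in range(2)}
        b = array_set(a, (2, 1), 99)
        assert array_get(b, (2, 1)) == 99                       # get(set(A,I,V), I) = V
        assert array_get(b, (1, 1)) == array_get(a, (1, 1))     # unaffected when I != J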


Implementations
In order to effectively implement variables of such types as array structures (with indexing done by pointer arithmetic), many languages restrict the indices to integer data types (or other types that can be interpreted as integers, such as bytes and enumerated types), and require that all elements have the same data type and storage size. Most of those languages also restrict each index to a finite interval of integers that remains fixed throughout the lifetime of the array variable. In some compiled languages, in fact, the index ranges may have to be known at compile time. On the other hand, some programming languages provide more liberal array types that allow indexing by arbitrary values, such as floating-point numbers, strings, objects, references, etc. Such index values cannot be restricted to an interval, much less a fixed interval. So, these languages usually allow arbitrary new elements to be created at any time. This choice precludes the implementation of array types as array data structures. That is, those languages use array-like syntax to implement a more general associative array semantics, and must therefore be implemented by a hash table or some other search data structure.

Language support
Multi-dimensional arrays
The number of indices needed to specify an element is called the dimension, dimensionality, or rank of the array type. (This nomenclature conflicts with the concept of dimension in linear algebra,[5] where it is the number of elements. Thus, an array of numbers with 5 rows and 4 columns, hence 20 elements, is said to have dimension 2 in computing contexts, but represents a matrix with dimension 5-by-4 or 20 in mathematics. Also, the computer science meaning of "rank" is similar to its meaning in tensor algebra but not to the linear algebra concept of rank of a matrix.) Many languages support only one-dimensional arrays. In those languages, a multi-dimensional array is typically represented by an Iliffe vector, a one-dimensional array of references to arrays of one dimension less. A two-dimensional array, in particular, would be implemented as a vector of pointers to its rows. Thus an element in row i and column j of an array A would be accessed by double indexing (A[i][j] in typical notation). This way of emulating multi-dimensional arrays allows the creation of ragged or jagged arrays, where each row may have a different size or, in general, where the valid range of each index depends on the values of all preceding indices. This representation for multi-dimensional arrays is quite prevalent in C and C++ software. However, C and C++ will use a linear indexing formula for multi-dimensional arrays that are declared as such, e.g. by int A[10][20] or int A[m][n], instead of the traditional int **A.[6]:p.81
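A brief illustrative sketch of the Iliffe-vector idea in Python (lists of lists; the values are made up): each row is an independent one-dimensional array, so rows may have different lengths, and elements are reached by double indexing.

    # Illustrative jagged ("ragged") array built as a vector of rows.
    jagged = [
        [1, 2, 3],        # row 0 has 3 elements
        [4, 5],           # row 1 has 2 elements
        [6, 7, 8, 9],     # row 2 has 4 elements
    ]

    # Double indexing: first select the row vector, then index within it.
    assert jagged[2][1] == 7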


Indexing notation
Most programming languages that support arrays support the store and select operations, and have special syntax for indexing. Early languages used parentheses, e.g. A(i,j), as in FORTRAN; others choose square brackets, e.g. A[i,j] or A[i][j], as in Algol 60 and Pascal.

Index types
Array data types are most often implemented as array structures: with the indices restricted to integer (or totally ordered) values, index ranges fixed at array creation time, and multilinear element addressing. This was the case in most "third generation" languages, and is still the case of most systems programming languages such as Ada, C, and C++. In some languages, however, array data types have the semantics of associative arrays, with indices of arbitrary type and dynamic element creation. This is the case in some scripting languages such as Awk and Lua, and of some array types provided by standard C++ libraries.

Bounds checking
Some languages (like Pascal and Modula) perform bounds checking on every access, raising an exception or aborting the program when any index is out of its valid range. Compilers may allow these checks to be turned off to trade safety for speed. Other languages (like FORTRAN and C) trust the programmer and perform no checks. Good compilers may also analyze the program to determine the range of possible values that the index may have, and this analysis may lead to bounds-checking elimination.

Index origin
Some languages, such as C, provide only zero-based array types, for which the minimum valid value for any index is 0. This choice is convenient for array implementation and address computations. With a language such as C, a pointer to the interior of any array can be defined that will symbolically act as a pseudo-array that accommodates negative indices. This works only because C does not check an index against bounds when used. Other languages provide only one-based array types, where each index starts at 1; this is the traditional convention in mathematics for matrices and mathematical sequences. A few languages, such as Pascal, support n-based array types, whose minimum legal indices are chosen by the programmer. The relative merits of each choice have been the subject of heated debate. Neither zero-based nor one-based indexing has any natural advantage in avoiding off-by-one or fencepost errors. See comparison of programming languages (array) for the base indices used by various languages. The 0-based/1-based debate is not limited to just programming languages. For example, the elevator button for the ground-floor of a building is labeled "0" in France and many other countries, but "1" in the USA.

Highest index
The relation between numbers appearing in an array declaration and the index of that array's last element also varies by language. In many languages (such as C), one should specify the number of elements contained in the array; whereas in others (such as Pascal and Visual Basic .NET) one should specify the numeric value of the index of the last element. Needless to say, this distinction is immaterial in languages where the indices start at 1.

Array algebra
Some programming languages (including APL, Matlab, and newer versions of Fortran) directly support array programming, where operations and functions defined for certain data types are implicitly extended to arrays of elements of those types. Thus one can write A+B to add corresponding elements of two arrays A and B. The multiplication operation may be merely distributed over corresponding elements of the operands (APL) or may be interpreted as the matrix product of linear algebra (Matlab).
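A small sketch of the distinction (plain Python, no particular array library assumed): element-wise product versus matrix product for two 2-by-2 matrices.

    # Illustrative comparison of element-wise and matrix multiplication.
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]

    elementwise = [[A[i][j] * B[i][j] for j in range(2)] for i in range(2)]

    matrix_product = [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
                      for i in range(2)]

    assert elementwise == [[5, 12], [21, 32]]
    assert matrix_product == [[19, 22], [43, 50]]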


String types and arrays


Many languages provide a built-in string data type, with specialized notation ("string literals") to build values of that type. In some languages (such as C), a string is just an array of characters, or is handled in much the same way. Other languages, like Pascal, may provide vastly different operations for strings and arrays.

Array index range queries


Some programming languages provide operations that return the size (number of elements) of a vector, or, more generally, the range of each index of an array. In C and C++ arrays do not support the size function, so programmers often have to declare a separate variable to hold the size, and pass it to procedures as a separate parameter. Elements of a newly created array may have undefined values (as in C), or may be defined to have a specific "default" value such as 0 or a null pointer (as in Java). In C++ a std::vector object supports the store, select, and append operations with the performance characteristics discussed above. Vectors can be queried for their size and can be resized. Slower operations like inserting an element in the middle are also supported.

Slicing
An array slicing operation takes a subset of the elements of an array-typed entity (value or variable) and then assembles them as another array-typed entity, possibly with other indices. If array types are implemented as array structures, many useful slicing operations (such as selecting a sub-array, swapping indices, or reversing the direction of the indices) can be performed very efficiently by manipulating the dope vector of the structure. The possible slicings depend on the implementation details: for example, FORTRAN allows slicing off one column of a matrix variable, but not a row, and treating it as a vector; whereas C allows slicing off a row from a matrix, but not a column. On the other hand, other slicing operations are possible when array types are implemented in other ways.

Resizing
Some languages allow dynamic arrays (also called resizable, growable, or extensible): array variables whose index ranges may be expanded at any time after creation, without changing the values of the current elements. For one-dimensional arrays, this facility may be provided as an operation "append(A,x)" that increases the size of the array A by one and then sets the value of the last element to x. Other array types (such as Pascal strings) provide a concatenation operator, which can be used together with slicing to achieve that effect and more. In some languages, assigning a value to an element of an array automatically extends the array, if necessary, to include that element. In other array types, a slice can be replaced by an array of different size, with subsequent elements being renumbered accordingly, as in Python's list assignment "A[5:5] = [10,20,30]", which inserts three new elements (10, 20, and 30) before element "A[5]"; a short demonstration follows. Resizable arrays are conceptually similar to lists, and the two concepts are synonymous in some languages. An extensible array can be implemented as a fixed-size array with a counter that records how many elements are actually in use. The append operation merely increments the counter until the whole array is used, at which point the append operation may be defined to fail. This is an implementation of a dynamic array with a fixed capacity, as in the string type of Pascal. Alternatively, the append operation may re-allocate the underlying array with a larger size, and copy the old elements to the new area.
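For instance, using Python's built-in list (mentioned above):

    # Python lists are resizable arrays: append extends by one, and assigning to
    # an empty slice inserts elements, renumbering the ones that follow.
    A = [0, 1, 2, 3, 4, 5, 6, 7]
    A.append(8)                 # grow by one at the end
    A[5:5] = [10, 20, 30]       # insert three elements before A[5]
    assert A == [0, 1, 2, 3, 4, 10, 20, 30, 5, 6, 7, 8]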


References
[1] Robert W. Sebesta (2001) Concepts of Programming Languages. Addison-Wesley. 4th edition (1998), 5th edition (2001), ISBN 0-201-38596-1, ISBN-13 978-0201385960.
[2] K. Jensen and Niklaus Wirth, PASCAL User Manual and Report. Springer. Paperback edition (2007), 184 pages, ISBN 3-540-06950-X, ISBN 978-3540069508.
[3] John Mitchell, Concepts of Programming Languages. Cambridge University Press.
[4] Lukham, Suzuki (1979), "Verification of array, record, and pointer operations in Pascal". ACM Transactions on Programming Languages and Systems 1(2), 226–244.
[5] See the definition of a matrix.
[6] Brian W. Kernighan and Dennis M. Ritchie (1988), The C Programming Language. Prentice-Hall, 205 pages.

External links
NIST's Dictionary of Algorithms and Data Structures: Array (http://www.nist.gov/dads/HTML/array.html)

Array data structure


In computer science, an array data structure or simply an array is a data structure consisting of a collection of elements (values or variables), each identified by at least one array index or key. An array is stored so that the position of each element can be computed from its index tuple by a mathematical formula.[1][2][3] For example, an array of 10 integer variables, with indices 0 through 9, may be stored as 10 words at memory addresses 2000, 2004, 2008, …, 2036, so that the element with index i has the address 2000 + 4 × i.[4] Arrays are analogous to the mathematical concepts of vectors, matrices, and tensors. Indeed, arrays with one or two indices are often called vectors or matrices, respectively. Arrays are often used to implement tables, especially lookup tables; the word table is sometimes used as a synonym of array. Arrays are among the oldest and most important data structures, and are used by almost every program. They are also used to implement many other data structures, such as lists and strings. They effectively exploit the addressing logic of computers. In most modern computers and many external storage devices, the memory is a one-dimensional array of words, whose indices are their addresses. Processors, especially vector processors, are often optimized for array operations. Arrays are useful mostly because the element indices can be computed at run time. Among other things, this feature allows a single iterative statement to process arbitrarily many elements of an array. For that reason, the elements of an array data structure are required to have the same size and should use the same data representation. The set of valid index tuples and the addresses of the elements (and hence the element addressing formula) are usually,[3][5] but not always,[2] fixed while the array is in use. The term array is often used to mean array data type, a kind of data type provided by most high-level programming languages that consists of a collection of values or variables that can be selected by one or more indices computed at run-time. Array types are often implemented by array structures; however, in some languages they may be implemented by hash tables, linked lists, search trees, or other data structures. The term is also used, especially in the description of algorithms, to mean associative array or "abstract array", a theoretical computer science model (an abstract data type or ADT) intended to capture the essential properties of arrays.


History
The first digital computers used machine-language programming to set up and access array structures for data tables, vector and matrix computations, and for many other purposes. Von Neumann wrote the first array-sorting program (merge sort) in 1945, during the building of the first stored-program computer.[6]p.159 Array indexing was originally done by self-modifying code, and later using index registers and indirect addressing. Some mainframes designed in the 1960s, such as the Burroughs B5000 and its successors, had special instructions for array indexing that included index-bounds checking. Assembly languages generally have no special support for arrays, other than what the machine itself provides. The earliest high-level programming languages, including FORTRAN (1957), COBOL (1960), and ALGOL 60 (1960), had support for multi-dimensional arrays, and so has C (1972). In C++ (1983), class templates exist for multi-dimensional arrays whose dimension is fixed at runtime[3][5] as well as for runtime-flexible arrays.[2]

Applications
Arrays are used to implement mathematical vectors and matrices, as well as other kinds of rectangular tables. Many databases, small and large, consist of (or include) one-dimensional arrays whose elements are records. Arrays are used to implement other data structures, such as heaps, hash tables, deques, queues, stacks, strings, and VLists. One or more large arrays are sometimes used to emulate in-program dynamic memory allocation, particularly memory pool allocation. Historically, this has sometimes been the only way to allocate "dynamic memory" portably. Arrays can be used to determine partial or complete control flow in programs, as a compact alternative to (otherwise repetitive), multiple IF statements. They are known in this context as control tables and are used in conjunction with a purpose built interpreter whose control flow is altered according to values contained in the array. The array may contain subroutine pointers (or relative subroutine numbers that can be acted upon by SWITCH statements) that direct the path of the execution.

Array element identifier and addressing formulas


When data objects are stored in an array, individual objects are selected by an index that is usually a non-negative scalar integer. Indices are also called subscripts. An index maps the array value to a stored object. There are three ways in which the elements of an array can be indexed:
    0 (zero-based indexing): The first element of the array is indexed by subscript of 0.[7]
    1 (one-based indexing): The first element of the array is indexed by subscript of 1.[8]
    n (n-based indexing): The base index of an array can be freely chosen. Usually, programming languages allowing n-based indexing also allow negative index values, and other scalar data types like enumerations or characters may be used as an array index.
Arrays can have multiple dimensions, thus it is not uncommon to access an array using multiple indices. For example, a two-dimensional array A with three rows and four columns might provide access to the element at the 2nd row and 4th column by the expression A[1, 3] (in a row-major language) and A[3, 1] (in a column-major language) in the case of a zero-based indexing system. Thus two indices are used for a two-dimensional array, three for a three-dimensional array, and n for an n-dimensional array. The number of indices needed to specify an element is called the dimension, dimensionality, or rank of the array. In standard arrays, each index is restricted to a certain range of consecutive integers (or consecutive values of some enumerated type), and the address of an element is computed by a "linear" formula on the indices.


One-dimensional arrays
A one-dimensional array (or single dimension array) is a type of linear array. Accessing its elements involves a single subscript which can either represent a row or column index. As an example consider the C declaration int anArrayName[10]; the general syntax is datatype anArrayName[sizeOfArray];. In the given example the array can contain 10 elements of any value available to the int type. In C, the array element indices are 0-9 inclusive in this case. For example, the expressions anArrayName[0] and anArrayName[9] are the first and last elements respectively. For a vector with linear addressing, the element with index i is located at the address B + c × i, where B is a fixed base address and c a fixed constant, sometimes called the address increment or stride. If the valid element indices begin at 0, the constant B is simply the address of the first element of the array. For this reason, the C programming language specifies that array indices always begin at 0; and many programmers will call that element "zeroth" rather than "first". However, one can choose the index of the first element by an appropriate choice of the base address B. For example, if the array has five elements, indexed 1 through 5, and the base address B is replaced by B − 30c, then the indices of those same elements will be 31 to 35. If the numbering does not start at 0, the constant B may not be the address of any element.
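A small sketch of linear addressing (illustrative only; the base address and element size are made-up values matching the earlier 2000 + 4 × i example):

    # Linear addressing for a one-dimensional array: address = B + c * i.
    B = 2000      # base address (address of the element with index 0)
    c = 4         # address increment: size of one element in bytes

    def address(i):
        return B + c * i

    assert address(0) == 2000
    assert address(9) == 2036     # the 10th element of a 10-element int array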

Multidimensional arrays
For a two-dimensional array, the element with indices i,j would have address B + c × i + d × j, where the coefficients c and d are the row and column address increments, respectively. More generally, in a k-dimensional array, the address of an element with indices i1, i2, …, ik is B + c1 × i1 + c2 × i2 + … + ck × ik. For example: int a[3][2]; This means that array a has 3 rows and 2 columns, and the array is of integer type. Here we can store 6 elements; they are stored linearly, starting with the first row and continuing with the second and third, so the above array will be stored as a11, a12, a21, a22, a31, a32. This formula requires only k multiplications and k additions, for any array that can fit in memory. Moreover, if any coefficient is a fixed power of 2, the multiplication can be replaced by bit shifting. The coefficients ck must be chosen so that every valid index tuple maps to the address of a distinct element. If the minimum legal value for every index is 0, then B is the address of the element whose indices are all zero. As in the one-dimensional case, the element indices may be changed by changing the base address B. Thus, if a two-dimensional array has rows and columns indexed from 1 to 10 and 1 to 20, respectively, then replacing B by B + c1 − 3 c2 will cause them to be renumbered from 0 through 9 and 4 through 23, respectively. Taking advantage of this feature, some languages (like FORTRAN 77) specify that array indices begin at 1, as in mathematical tradition; while other languages (like Fortran 90, Pascal and Algol) let the user choose the minimum value for each index.
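A sketch of the row-major addressing formula for the int a[3][2] example (illustrative Python; the flat list stands in for the linear memory):

    # Row-major addressing for a 3-row, 2-column array: offset = i * cols + j.
    rows, cols = 3, 2
    flat = [11, 12, 21, 22, 31, 32]      # a11, a12, a21, a22, a31, a32

    def element(i, j):
        # corresponds to address = B + c1*i + c2*j with c1 = cols, c2 = 1, B = 0
        return flat[i * cols + j]

    assert element(0, 1) == 12
    assert element(2, 0) == 31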

Dope vectors
The addressing formula is completely defined by the dimension d, the base address B, and the increments c1, c2, …, ck. It is often useful to pack these parameters into a record called the array's descriptor or stride vector or dope vector.[2][3] The size of each element, and the minimum and maximum values allowed for each index may also be included in the dope vector. The dope vector is a complete handle for the array, and is a convenient way to pass arrays as arguments to procedures. Many useful array slicing operations (such as selecting a sub-array, swapping indices, or reversing the direction of the indices) can be performed very efficiently by manipulating the dope vector.[2]
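A minimal illustrative sketch of a dope vector in Python (the class and field names are made up; 'offset' plays the role of the base address): operations such as transposing only rewrite the descriptor, never the underlying storage.

    # Illustrative dope vector: (offset, shape, strides) over a flat storage list.
    class DopeVector:
        def __init__(self, storage, offset, shape, strides):
            self.storage, self.offset = storage, offset
            self.shape, self.strides = shape, strides

        def get(self, *indices):
            addr = self.offset + sum(i * s for i, s in zip(indices, self.strides))
            return self.storage[addr]

        def transposed(self):
            # Swapping indices = swapping shape and strides; no data is copied.
            return DopeVector(self.storage, self.offset,
                              self.shape[::-1], self.strides[::-1])

    storage = [11, 12, 21, 22, 31, 32]              # 3x2 matrix in row-major order
    a = DopeVector(storage, offset=0, shape=(3, 2), strides=(2, 1))
    assert a.get(2, 1) == 32
    assert a.transposed().get(1, 2) == 32           # same element via swapped indices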


Compact layouts
Often the coefficients are chosen so that the elements occupy a contiguous area of memory. However, that is not necessary. Even if arrays are always created with contiguous elements, some array slicing operations may create non-contiguous sub-arrays from them. There are two systematic compact layouts for a two-dimensional array. For example, consider the matrix

    1 2 3
    4 5 6
    7 8 9

In the row-major order layout (adopted by C for statically declared arrays), the elements in each row are stored in consecutive positions and all of the elements of a row have a lower address than any of the elements of a consecutive row:

    1 2 3 4 5 6 7 8 9

In column-major order (traditionally used by Fortran), the elements in each column are consecutive in memory and all of the elements of a column have a lower address than any of the elements of a consecutive column:

    1 4 7 2 5 8 3 6 9

For arrays with three or more indices, "row major order" puts in consecutive positions any two elements whose index tuples differ only by one in the last index. "Column major order" is analogous with respect to the first index. In systems which use processor cache or virtual memory, scanning an array is much faster if successive elements are stored in consecutive positions in memory, rather than sparsely scattered. Many algorithms that use multidimensional arrays will scan them in a predictable order. A programmer (or a sophisticated compiler) may use this information to choose between row- or column-major layout for each array. For example, when computing the product AB of two matrices, it would be best to have A stored in row-major order, and B in column-major order.
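A short illustrative snippet producing both layouts from the matrix above:

    # Row-major and column-major linearizations of the same 3x3 matrix.
    M = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

    row_major = [M[i][j] for i in range(3) for j in range(3)]
    col_major = [M[i][j] for j in range(3) for i in range(3)]

    assert row_major == [1, 2, 3, 4, 5, 6, 7, 8, 9]
    assert col_major == [1, 4, 7, 2, 5, 8, 3, 6, 9]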

Array resizing
Static arrays have a size that is fixed when they are created and consequently do not allow elements to be inserted or removed. However, by allocating a new array and copying the contents of the old array to it, it is possible to effectively implement a dynamic version of an array; see dynamic array. If this operation is done infrequently, insertions at the end of the array require only amortized constant time. Some array data structures do not reallocate storage, but do store a count of the number of elements of the array in use, called the count or size. This effectively makes the array a dynamic array with a fixed maximum size or capacity; Pascal strings are examples of this.


Non-linear formulas
More complicated (non-linear) formulas are occasionally used. For a compact two-dimensional triangular array, for instance, the addressing formula is a polynomial of degree 2.
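As an illustration (one common packing scheme, not necessarily the one the text has in mind), a lower-triangular matrix can be packed row by row, giving a degree-2 addressing polynomial:

    # Packed lower-triangular array: row i holds i+1 elements (j <= i), stored
    # consecutively, so element (i, j) lives at offset i*(i+1)/2 + j.
    def tri_index(i, j):
        assert 0 <= j <= i
        return i * (i + 1) // 2 + j

    # Rows 0..3 occupy offsets 0 | 1 2 | 3 4 5 | 6 7 8 9.
    assert tri_index(0, 0) == 0
    assert tri_index(2, 1) == 4
    assert tri_index(3, 3) == 9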

Efficiency
Both store and select take (deterministic worst case) constant time. Arrays take linear (O(n)) space in the number of elements n that they hold. In an array with element size k and on a machine with a cache line size of B bytes, iterating through an array of n elements requires the minimum of ceiling(nk/B) cache misses, because its elements occupy contiguous memory locations. This is roughly a factor of B/k better than the number of cache misses needed to access n elements at random memory locations. As a consequence, sequential iteration over an array is noticeably faster in practice than iteration over many other data structures, a property called locality of reference (this does not mean however, that using a perfect hash or trivial hash within the same (local) array, will not be even faster - and achievable in constant time). Libraries provide low-level optimized facilities for copying ranges of memory (such as memcpy) which can be used to move contiguous blocks of array elements significantly faster than can be achieved through individual element access. The speedup of such optimized routines varies by array element size, architecture, and implementation. Memory-wise, arrays are compact data structures with no per-element overhead. There may be a per-array overhead, e.g. to store index bounds, but this is language-dependent. It can also happen that elements stored in an array require less memory than the same elements stored in individual variables, because several array elements can be stored in a single word; such arrays are often called packed arrays. An extreme (but commonly used) case is the bit array, where every bit represents a single element. A single octet can thus hold up to 256 different combinations of up to 8 different conditions, in the most compact form. Array accesses with statically predictable access patterns are a major source of data parallelism.

Efficiency comparison with other data structures


                                 Linked list                     Array   Dynamic array     Balanced tree   Random access list
    Indexing                     Θ(n)                            Θ(1)    Θ(1)              Θ(log n)        Θ(log n)
    Insert/delete at beginning   Θ(1)                            N/A     Θ(n)              Θ(log n)        Θ(1)
    Insert/delete at end         Θ(1)                            N/A     Θ(1) amortized    Θ(log n)        Θ(log n) updating
    Insert/delete in middle      search time + Θ(1)[9][10][11]   N/A     Θ(n)              Θ(log n)        Θ(log n) updating
    Wasted space (average)       Θ(n)                            0       Θ(n)[12]          Θ(n)            Θ(n)

Growable arrays are similar to arrays but add the ability to insert and delete elements; adding and deleting at the end is particularly efficient. However, they reserve linear (Θ(n)) additional storage, whereas arrays do not reserve additional storage. Associative arrays provide a mechanism for array-like functionality without huge storage overheads when the index values are sparse. For example, an array that contains values only at indexes 1 and 2 billion may benefit from using such a structure. Specialized associative arrays with integer keys include Patricia tries, Judy arrays, and van Emde Boas trees. Balanced trees require O(log n) time for indexed access, but also permit inserting or deleting elements in O(log n) time,[13] whereas growable arrays require linear (Θ(n)) time to insert or delete elements at an arbitrary position.

Linked lists allow constant time removal and insertion in the middle but take linear time for indexed access. Their memory use is typically worse than arrays, but is still linear. An Iliffe vector is an alternative to a multidimensional array structure. It uses a one-dimensional array of references to arrays of one dimension less. For two dimensions, in particular, this alternative structure would be a vector of pointers to vectors, one for each row. Thus an element in row i and column j of an array A would be accessed by double indexing (A[i][j] in typical notation). This alternative structure allows ragged or jagged arrays, where each row may have a different size or, in general, where the valid range of each index depends on the values of all preceding indices. It also saves one multiplication (by the column address increment), replacing it by a bit shift (to index the vector of row pointers) and one extra memory access (fetching the row address), which may be worthwhile in some architectures.


Meaning of dimension
The dimension of an array is the number of indices needed to select an element. Thus, if the array is seen as a function on a set of possible index combinations, it is the dimension of the space of which its domain is a discrete subset. Thus a one-dimensional array is a list of data, a two-dimensional array a rectangle of data, a three-dimensional array a block of data, etc. This should not be confused with the dimension of the set of all matrices with a given domain, that is, the number of elements in the array. For example, an array with 5 rows and 4 columns is two-dimensional, but such matrices form a 20-dimensional space. Similarly, a three-dimensional vector can be represented by a one-dimensional array of size three.

References
[1] Black, Paul E. (13 November 2008). "array" (http://www.nist.gov/dads/HTML/array.html). Dictionary of Algorithms and Data Structures. National Institute of Standards and Technology. Retrieved 2010-08-22.
[2] Bjoern Andres; Ullrich Koethe; Thorben Kroeger; Hamprecht (2010). "Runtime-Flexible Multi-dimensional Arrays and Views for C++98 and C++0x". arXiv:1008.2909 [cs.DS].
[3] Garcia, Ronald; Lumsdaine, Andrew (2005). "MultiArray: a C++ library for generic programming with arrays". Software: Practice and Experience 35 (2): 159–188. doi:10.1002/spe.630. ISSN 0038-0644.
[4] David R. Richardson (2002), The Book on Data Structures. iUniverse, 112 pages. ISBN 0-595-24039-9, ISBN 978-0-595-24039-5.
[5] T. Veldhuizen. Arrays in Blitz++. In Proc. of the 2nd Int. Conf. on Scientific Computing in Object-Oriented Parallel Environments (ISCOPE), LNCS 1505. Springer, 1998.
[6] Donald Knuth, The Art of Computer Programming, vol. 3. Addison-Wesley.
[7] "Array Code Examples - PHP Array Functions - PHP code" (http://www.configure-all.com/arrays.php). Retrieved 2011-04-08. "In most computer languages array index (counting) starts from 0, not from 1. Index of the first element of the array is 0, index of the second element of the array is 1, and so on. In array of names below you can see indexes and values."
[8] "Chapter 6 - Arrays, Types, and Constants" (http://www.modula2.org/tutor/chapter6.php). Modula-2 Tutorial. Retrieved 2011-04-08. "The names of the twelve variables are given by Automobiles[1], Automobiles[2], ... Automobiles[12]. The variable name is "Automobiles" and the array subscripts are the numbers 1 through 12. [i.e. in Modula-2, the index starts by one!]"
[9] Gerald Kruse. CS 240 Lecture Notes (http://www.juniata.edu/faculty/kruse/cs240/syllabus.htm): Linked Lists Plus: Complexity Trade-offs (http://www.juniata.edu/faculty/kruse/cs240/linkedlist2.htm). Juniata College. Spring 2008.
[10] Day 1 Keynote - Bjarne Stroustrup: C++11 Style (http://channel9.msdn.com/Events/GoingNative/GoingNative-2012/Keynote-Bjarne-Stroustrup-Cpp11-Style) at GoingNative 2012 on channel9.msdn.com, from minute 45 or foil 44.
[11] Number crunching: Why you should never, ever, EVER use linked-list in your code again (http://kjellkod.wordpress.com/2012/02/25/why-you-should-never-ever-ever-use-linked-list-in-your-code-again/) at kjellkod.wordpress.com.
[12] Brodnik, Andrej; Carlsson, Svante; Sedgewick, Robert; Munro, JI; Demaine, ED (Technical Report CS-99-09), Resizable Arrays in Optimal Time and Space (http://www.cs.uwaterloo.ca/research/tr/1999/09/CS-99-09.pdf), Department of Computer Science, University of Waterloo.



[13] Counted B-Tree (http://www.chiark.greenend.org.uk/~sgtatham/algorithms/cbtree.html)


Dynamic array
In computer science, a dynamic array, growable array, resizable array, dynamic table, mutable array, or array list is a random access, variable-size list data structure that allows elements to be added or removed. It is supplied with standard libraries in many modern mainstream programming languages. A dynamic array is not the same thing as a dynamically allocated array, which is a fixed-size array whose size is fixed when the array is allocated, although a dynamic array may use such a fixed-size array as a back end.[1]

Bounded-size dynamic arrays and capacity


The simplest dynamic array is constructed by allocating a fixed-size array and then dividing it into two parts: the first stores the elements of the dynamic array and the second is reserved, or unused. We can then add or remove elements at the end of the dynamic array in constant time by using the reserved space, until this space is completely consumed. The number of elements used by the dynamic array contents is its logical size or size, while the size of the underlying array is called the dynamic array's capacity, which is the maximum possible size without relocating data.
Several values are inserted at the end of a dynamic array using geometric expansion. Grey cells indicate space reserved for expansion. Most insertions are fast (constant time), while some are slow due to the need for reallocation ((n) time, labelled with turtles). The logical size and capacity of the final array are shown.

In applications where the logical size is bounded, a fixed-size data structure suffices. This can be short-sighted, however, since problems with the array filling up may turn up later. It is better to give every array the ability to resize itself in response to new conditions; choosing an initial capacity then becomes an optimization rather than a requirement for getting the program to run at all. Resizing the underlying array is an expensive operation, typically involving copying the entire contents of the array.

Geometric expansion and amortized cost


To avoid incurring the cost of resizing many times, dynamic arrays resize by a large amount, such as doubling in size, and use the reserved space for future expansion. The operation of adding an element to the end might work as follows:

    function insertEnd(dynarray a, element e)
        if (a.size = a.capacity)
            // resize a to twice its current capacity:
            a.capacity ← a.capacity * 2
            // (copy the contents to the new memory location here)
        a[a.size] ← e
        a.size ← a.size + 1

As n elements are inserted, the capacities form a geometric progression. Expanding the array by any constant proportion ensures that inserting n elements takes O(n) time overall, meaning that each insertion takes amortized constant time. The value of this proportion a leads to a time-space tradeoff: the average time per insertion operation is about a/(a − 1), while the number of wasted cells is bounded above by (a − 1)n. The choice of a depends on the library or application: some textbooks use a = 2,[2][3] but Java's ArrayList implementation uses a = 3/2[1] and the C implementation of Python's list data structure uses a = 9/8.[4] Many dynamic arrays also deallocate some of the underlying storage if its size drops below a certain threshold, such as 30% of the capacity. This threshold must be strictly smaller than 1/a in order to support mixed sequences of insertions and removals with amortized constant cost. Dynamic arrays are a common example when teaching amortized analysis.[2][3]
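An illustrative sketch of the time-space tradeoff (the function name and the chosen growth factors are made up for the experiment): the total copying work stays linear in n for any fixed growth factor a > 1, but a smaller a means more copying and less wasted space.

    # Count the elements copied while appending n items to an array that grows
    # by factor a whenever it is full (starting capacity 1).
    def total_copy_cost(n, a):
        capacity, size, copies = 1, 0, 0
        for _ in range(n):
            if size == capacity:
                copies += size                         # copy existing elements
                capacity = max(capacity + 1, int(capacity * a))
            size += 1
        return copies

    if __name__ == "__main__":
        n = 1_000_000
        for a in (2.0, 1.5, 1.125):
            c = total_copy_cost(n, a)
            # roughly n/(a-1) copies overall, i.e. about a/(a-1) work per append
            # once the write of the new element itself is included
            print(f"a={a}: {c} copies, {c / n:.2f} per append")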


Performance
                                 Linked list                  Array   Dynamic array     Balanced tree   Random access list
    Indexing                     Θ(n)                         Θ(1)    Θ(1)              Θ(log n)        Θ(log n)
    Insert/delete at beginning   Θ(1)                         N/A     Θ(n)              Θ(log n)        Θ(1)
    Insert/delete at end         Θ(1)                         N/A     Θ(1) amortized    Θ(log n)        Θ(log n) updating
    Insert/delete in middle      search time + Θ(1)[5][6][7]  N/A     Θ(n)              Θ(log n)        Θ(log n) updating
    Wasted space (average)       Θ(n)                         0       Θ(n)[8]           Θ(n)            Θ(n)

The dynamic array has performance similar to an array, with the addition of new operations to add and remove elements from the end:
    Getting or setting the value at a particular index (constant time)
    Iterating over the elements in order (linear time, good cache performance)
    Inserting or deleting an element in the middle of the array (linear time)
    Inserting or deleting an element at the end of the array (constant amortized time)

Dynamic arrays benefit from many of the advantages of arrays, including good locality of reference and data cache utilization, compactness (low memory use), and random access. They usually have only a small fixed additional overhead for storing information about the size and capacity. This makes dynamic arrays an attractive tool for building cache-friendly data structures. Compared to linked lists, dynamic arrays have faster indexing (constant time versus linear time) and typically faster iteration due to improved locality of reference; however, dynamic arrays require linear time to insert or delete at an arbitrary location, since all following elements must be moved, while linked lists can do this in constant time. This disadvantage is mitigated by the gap buffer and tiered vector variants discussed under Variants below. Also, in a highly fragmented memory region, it may be expensive or impossible to find contiguous space for a large dynamic array, whereas linked lists do not require the whole data structure to be stored contiguously. A balanced tree can store a list while providing all operations of both dynamic arrays and linked lists reasonably efficiently, but both insertion at the end and iteration over the list are slower than for a dynamic array, in theory and in practice, due to non-contiguous storage and tree traversal/manipulation overhead.


Variants
Gap buffers are similar to dynamic arrays but allow efficient insertion and deletion operations clustered near the same arbitrary location. Some deque implementations use array deques, which allow amortized constant time insertion/removal at both ends, instead of just one end. Goodrich[9] presented a dynamic array algorithm called Tiered Vectors that provided O(n^(1/2)) performance for order preserving insertions or deletions from the middle of the array. Hashed Array Tree (HAT) is a dynamic array algorithm published by Sitarski in 1996.[10] Hashed Array Tree wastes order n^(1/2) amount of storage space, where n is the number of elements in the array. The algorithm has O(1) amortized performance when appending a series of objects to the end of a Hashed Array Tree. In a 1999 paper,[8] Brodnik et al. describe a tiered dynamic array data structure, which wastes only n^(1/2) space for n elements at any point in time, and they prove a lower bound showing that any dynamic array must waste this much space if the operations are to remain amortized constant time. Additionally, they present a variant where growing and shrinking the buffer has not only amortized but worst-case constant time. Bagwell (2002)[11] presented the VList algorithm, which can be adapted to implement a dynamic array.

Language support
C++'s std::vector is an implementation of dynamic arrays, as are the ArrayList[12] classes supplied with the Java API and the .NET Framework. The generic List<> class supplied with version 2.0 of the .NET Framework is also implemented with dynamic arrays. Smalltalk's OrderedCollection is a dynamic array with dynamic start and end-index, making the removal of the first element also O(1). Python's list datatype implementation is a dynamic array. Delphi and D implement dynamic arrays at the language's core. Ada's Ada.Containers.Vectors generic package provides dynamic array implementation for a given subtype. Many scripting languages such as Perl and Ruby offer dynamic arrays as a built-in primitive data type. Several cross-platform frameworks provide dynamic array implementations for C: CFArray and CFMutableArray in Core Foundation; GArray and GPtrArray in GLib.

References
[1] See, for example, the source code of java.util.ArrayList class from OpenJDK 6 (http://hg.openjdk.java.net/jdk6/jdk6/jdk/file/e0e25ac28560/src/share/classes/java/util/ArrayList.java).
[2] Goodrich, Michael T.; Tamassia, Roberto (2002), "1.5.2 Analyzing an Extendable Array Implementation", Algorithm Design: Foundations, Analysis and Internet Examples, Wiley, pp. 39–41.
[3] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001) [1990]. "17.4 Dynamic tables". Introduction to Algorithms (2nd ed.). MIT Press and McGraw-Hill. pp. 416–424. ISBN 0-262-03293-7.
[4] List object implementation (http://svn.python.org/projects/python/trunk/Objects/listobject.c) from python.org, retrieved 2011-09-27.
[5] Gerald Kruse. CS 240 Lecture Notes (http://www.juniata.edu/faculty/kruse/cs240/syllabus.htm): Linked Lists Plus: Complexity Trade-offs (http://www.juniata.edu/faculty/kruse/cs240/linkedlist2.htm). Juniata College. Spring 2008.
[6] Day 1 Keynote - Bjarne Stroustrup: C++11 Style (http://channel9.msdn.com/Events/GoingNative/GoingNative-2012/Keynote-Bjarne-Stroustrup-Cpp11-Style) at GoingNative 2012 on channel9.msdn.com, from minute 45 or foil 44.
[7] Number crunching: Why you should never, ever, EVER use linked-list in your code again (http://kjellkod.wordpress.com/2012/02/25/why-you-should-never-ever-ever-use-linked-list-in-your-code-again/) at kjellkod.wordpress.com.
[8] Brodnik, Andrej; Carlsson, Svante; Sedgewick, Robert; Munro, JI; Demaine, ED (Technical Report CS-99-09), Resizable Arrays in Optimal Time and Space (http://www.cs.uwaterloo.ca/research/tr/1999/09/CS-99-09.pdf), Department of Computer Science, University of Waterloo.
[9] Goodrich, Michael T.; Kloss II, John G. (1999), "Tiered Vectors: Efficient Dynamic Arrays for Rank-Based Sequences" (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.17.7503), Workshop on Algorithms and Data Structures 1663: 205–216, doi:10.1007/3-540-48447-7_21.
[10] Sitarski, Edward (September 1996), Algorithm Alley (http://www.ddj.com/architect/184409965?pgno=5), "HATs: Hashed array trees", Dr. Dobb's Journal 21 (11).
[11] Bagwell, Phil (2002), Fast Functional Lists, Hash-Lists, Deques and Variable Length Arrays (http://citeseer.ist.psu.edu/bagwell02fast.html), EPFL.

[12] Javadoc on ArrayList


External links
NIST Dictionary of Algorithms and Data Structures: Dynamic array (http://www.nist.gov/dads/HTML/dynamicarray.html)
VPOOL (http://www.bsdua.org/libbsdua.html#vpool) - C language implementation of dynamic array.
CollectionSpy (http://www.collectionspy.com) - A Java profiler with explicit support for debugging ArrayList- and Vector-related issues.
Open Data Structures - Chapter 2 - Array-Based Lists (http://opendatastructures.org/versions/edition-0.1e/ods-java/2_Array_Based_Lists.html)

Linked list
In computer science, a linked list is a data structure consisting of a group of nodes which together represent a sequence. Under the simplest form, each node is composed of a datum and a reference (in other words, a link) to the next node in the sequence; more complex variants add additional links. This structure allows for efficient insertion or removal of elements from any position in the sequence.

A linked list whose nodes contain two fields: an integer value and a link to the next node. The last node is linked to a terminator used to signify the end of the list.

Linked lists are among the simplest and most common data structures. They can be used to implement several other common abstract data types, including stacks, queues, associative arrays, and symbolic expressions, though it is not uncommon to implement the other data structures directly without using a list as the basis of implementation. The principal benefit of a linked list over a conventional array is that the list elements can easily be inserted or removed without reallocation or reorganization of the entire structure because the data items need not be stored contiguously in memory or on disk. Linked lists allow insertion and removal of nodes at any point in the list, and can do so with a constant number of operations if the link previous to the link being added or removed is maintained during list traversal. On the other hand, simple linked lists by themselves do not allow random access to the data, or any form of efficient indexing. Thus, many basic operations such as obtaining the last node of the list (assuming that the last node is not maintained as separate node reference in the list structure), or finding a node that contains a given datum, or locating the place where a new node should be inserted may require scanning most or all of the list elements.

History
Linked lists were developed in 1955-56 by Allen Newell, Cliff Shaw and Herbert A. Simon at RAND Corporation as the primary data structure for their Information Processing Language. IPL was used by the authors to develop several early artificial intelligence programs, including the Logic Theory Machine, the General Problem Solver, and a computer chess program. Reports on their work appeared in IRE Transactions on Information Theory in 1956, and several conference proceedings from 1957 to 1959, including Proceedings of the Western Joint Computer Conference in 1957 and 1958, and Information Processing (Proceedings of the first UNESCO International Conference on Information Processing) in 1959. The now-classic diagram consisting of blocks representing list nodes with arrows pointing to successive list nodes appears in "Programming the Logic Theory Machine" by Newell and Shaw in Proc. WJCC, February 1957. Newell and Simon were recognized with the ACM Turing Award in 1975

for having "made basic contributions to artificial intelligence, the psychology of human cognition, and list processing". The problem of machine translation for natural language processing led Victor Yngve at Massachusetts Institute of Technology (MIT) to use linked lists as data structures in his COMIT programming language for computer research in the field of linguistics. A report on this language entitled "A programming language for mechanical translation" appeared in Mechanical Translation in 1958. LISP, standing for list processor, was created by John McCarthy in 1958 while he was at MIT and in 1960 he published its design in a paper in the Communications of the ACM, entitled "Recursive Functions of Symbolic Expressions and Their Computation by Machine, Part I". One of LISP's major data structures is the linked list. By the early 1960s, the utility of both linked lists and languages which use these structures as their primary data representation was well established. Bert Green of the MIT Lincoln Laboratory published a review article entitled "Computer languages for symbol manipulation" in IRE Transactions on Human Factors in Electronics in March 1961 which summarized the advantages of the linked list approach. A later review article, "A Comparison of list-processing computer languages" by Bobrow and Raphael, appeared in Communications of the ACM in April 1964. Several operating systems developed by Technical Systems Consultants (originally of West Lafayette, Indiana, and later of Chapel Hill, North Carolina) used singly linked lists as file structures. A directory entry pointed to the first sector of a file, and succeeding portions of the file were located by traversing pointers. Systems using this technique included Flex (for the Motorola 6800 CPU), mini-Flex (same CPU), and Flex9 (for the Motorola 6809 CPU). A variant developed by TSC for and marketed by Smoke Signal Broadcasting in California used doubly linked lists in the same manner. The TSS/360 operating system, developed by IBM for the System 360/370 machines, used a double linked list for their file system catalog. The directory structure was similar to Unix, where a directory could contain files and/or other directories and extend to any depth. A utility flea was created to fix file system problems after a crash, since modified portions of the file catalog were sometimes in memory when a crash occurred. Problems were detected by comparing the forward and backward links for consistency. If a forward link was corrupt, then if a backward link to the infected node was found, the forward link was set to the node with the backward link. A humorous comment in the source code where this utility was invoked stated "Everyone knows a flea collar gets rid of bugs in cats".

36

Basic concepts and nomenclature


Each record of a linked list is often called an element or node. The field of each node that contains the address of the next node is usually called the next link or next pointer. The remaining fields are known as the data, information, value, cargo, or payload fields. The head of a list is its first node. The tail of a list may refer either to the rest of the list after the head, or to the last node in the list. In Lisp and some derived languages, the next node may be called the cdr (pronounced could-er) of the list, while the payload of the head node may be called the car.
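To make the nomenclature concrete, here is a minimal C sketch (the names are chosen for this illustration, not taken from the text) of a node with one payload field and a next pointer:

 struct node {
     int data;            /* payload (value, cargo) field        */
     struct node *next;   /* next link: address of the next node */
 };

 /* A list is identified by a pointer to its head; NULL means empty. */
 struct node *head = NULL;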


Post office box analogy


The concept of a linked list can be explained by a simple analogy to real-world post office boxes. Suppose Alice is a spy who wishes to give a codebook to Bob by putting it in a post office box and then giving him the key. However, the book is too thick to fit in a single post office box, so instead she divides the book into two halves and purchases two post office boxes. In the first box, she puts the first half of the book and a key to the second box, and in the second box she puts the second half of the book. She then gives Bob a key to the first box. No matter how large the book is, this scheme can be extended to any number of boxes by always putting the key to the next box in the previous box. In this analogy, the boxes correspond to elements or nodes, the keys correspond to pointers, and the book itself is the data. The key given to Bob is the head pointer, while those stored in the boxes are next pointers. The scheme as described above is a singly linked list (see below).

Linear and circular lists

In the last node of a list, the link field often contains a null reference, a special value used to indicate the lack of further nodes. A less common convention is to make it point to the first node of the list; in that case the list is said to be circular or circularly linked; otherwise it is said to be open or linear.

Bob (bottom) has the key to box 201, which contains the first half of the book and a key to box 102, which contains the rest of the book.

A circular linked list

Singly, doubly, and multiply linked lists


Singly linked lists contain nodes which have a data field as well as a next field, which points to the next node in the linked list.

A singly linked list whose nodes contain two fields: an integer value and a link to the next node

In a doubly linked list, each node contains, besides the next-node link, a second link field pointing to the previous node in the sequence. The two links may be called forward(s) and backwards, or next and prev(ious).

A doubly linked list whose nodes contain three fields: an integer value, the link forward to the next node, and the link backward to the previous node

A technique known as XOR-linking allows a doubly linked list to be implemented using a single link field in each node. However, this technique requires the ability to do bit operations on addresses, and therefore may not be available in some high-level languages.
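As a rough sketch of XOR-linking (names and layout assumed for this example, not taken from the article), each node stores the bitwise XOR of the addresses of its two neighbours; given a pointer to the previous node, the other neighbour can be recovered during traversal:

 #include <stdint.h>
 #include <stddef.h>

 struct xnode {
     int value;
     uintptr_t link;            /* (address of prev) XOR (address of next) */
 };

 /* Recover the neighbour on the far side of cur from prev. */
 static struct xnode *xor_step(struct xnode *prev, struct xnode *cur)
 {
     return (struct xnode *)(cur->link ^ (uintptr_t)prev);
 }

 /* Traverse forward from the head, whose missing neighbour is NULL. */
 void traverse(struct xnode *head)
 {
     struct xnode *prev = NULL, *cur = head;
     while (cur != NULL) {
         /* visit cur->value here */
         struct xnode *next = xor_step(prev, cur);
         prev = cur;
         cur  = next;
     }
 }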

In a multiply linked list, each node contains two or more link fields, each field being used to connect the same set of data records in a different order (e.g., by name, by department, by date of birth, etc.). (While doubly linked lists can be seen as special cases of multiply linked lists, the fact that the two orders are opposite to each other leads to simpler and more efficient algorithms, so they are usually treated as a separate case.) In the case of a circular doubly linked list, the only change is that the end, or "tail", of the list is linked back to the front, or "head", of the list and vice versa.


Sentinel nodes
In some implementations, an extra sentinel or dummy node may be added before the first data record and/or after the last one. This convention simplifies and accelerates some list-handling algorithms, by ensuring that all links can be safely dereferenced and that every list (even one that contains no data elements) always has a "first" and "last" node.

Empty lists
An empty list is a list that contains no data records. This is usually the same as saying that it has zero nodes. If sentinel nodes are being used, the list is usually said to be empty when it has only sentinel nodes.

Hash linking
The link fields need not be physically part of the nodes. If the data records are stored in an array and referenced by their indices, the link field may be stored in a separate array with the same indices as the data records.

List handles
Since a reference to the first node gives access to the whole list, that reference is often called the address, pointer, or handle of the list. Algorithms that manipulate linked lists usually get such handles to the input lists and return the handles to the resulting lists. In fact, in the context of such algorithms, the word "list" often means "list handle". In some situations, however, it may be convenient to refer to a list by a handle that consists of two links, pointing to its first and last nodes.

Combining alternatives
The alternatives listed above may be arbitrarily combined in almost every way, so one may have circular doubly linked lists without sentinels, circular singly linked lists with sentinels, etc.

Tradeoffs
As with most choices in computer programming and design, no method is well suited to all circumstances. A linked list data structure might work well in one case, but cause problems in another. This is a list of some of the common tradeoffs involving linked list structures.

Linked lists vs. dynamic arrays


                              Linked list          Array   Dynamic array     Balanced tree   Random access list
 Indexing                     Θ(n)                 Θ(1)    Θ(1)              Θ(log n)        Θ(log n)
 Insert/delete at beginning   Θ(1)                 N/A     Θ(n)              Θ(log n)        Θ(1)
 Insert/delete at end         Θ(1)                 N/A     Θ(1) amortized    Θ(log n)        Θ(log n) updating
 Insert/delete in middle      search time + Θ(1)   N/A     Θ(n)              Θ(log n)        Θ(log n) updating
                              [1][2][3]
 Wasted space (average)       Θ(n)                 0       Θ(n) [4]          Θ(n)            Θ(n)

A dynamic array is a data structure that allocates all elements contiguously in memory, and keeps a count of the current number of elements. If the space reserved for the dynamic array is exceeded, it is reallocated and (possibly) copied, which is an expensive operation.

Linked lists have several advantages over dynamic arrays. Insertion or deletion of an element at a specific point of a list, assuming that we already have a pointer to the appropriate node (the one before the node to be removed, or before the insertion point), is a constant-time operation, whereas insertion in a dynamic array at a random location requires moving half of the elements on average, and all the elements in the worst case. While one can "delete" an element from an array in constant time by somehow marking its slot as "vacant", this causes fragmentation that impedes the performance of iteration. Moreover, arbitrarily many elements may be inserted into a linked list, limited only by the total memory available; a dynamic array, by contrast, will eventually fill up its underlying array data structure and have to reallocate, an expensive operation that may not even be possible if memory is fragmented. The cost of reallocation can be averaged over insertions, so the cost of an insertion due to reallocation is still amortized O(1); this helps with appending elements at the array's end, but inserting into (or removing from) middle positions still carries prohibitive costs due to the data movement needed to maintain contiguity. An array from which many elements are removed may also have to be resized in order to avoid wasting too much space.

On the other hand, dynamic arrays (as well as fixed-size array data structures) allow constant-time random access, while linked lists allow only sequential access to elements. Singly linked lists, in fact, can be traversed in only one direction. This makes linked lists unsuitable for applications where it is useful to look up an element by its index quickly, such as heapsort. Sequential access on arrays and dynamic arrays is also faster than on linked lists on many machines, because arrays have optimal locality of reference and thus make good use of data caching.

Another disadvantage of linked lists is the extra storage needed for references, which often makes them impractical for lists of small data items such as characters or boolean values, because the storage overhead for the links may exceed the size of the data by a factor of two or more. In contrast, a dynamic array requires only the space for the data itself (and a very small amount of control data).[5] It can also be slow, and with a naïve allocator wasteful, to allocate memory separately for each new element, a problem generally solved using memory pools.

Some hybrid solutions try to combine the advantages of the two representations. Unrolled linked lists store several elements in each list node, increasing cache performance while decreasing memory overhead for references. CDR coding does both these as well, by replacing references with the actual data referenced, which extends off the end of the referencing record.

A good example that highlights the pros and cons of using dynamic arrays vs. linked lists is the implementation of a program that solves the Josephus problem. The Josephus problem is an election method that works by having a group of people stand in a circle. Starting at a predetermined person, you count around the circle n times. Once you reach the nth person, take them out of the circle and have the members close the circle.
Then count around the circle the same n times and repeat the process, until only one person is left. That person wins the election. This shows the strengths and weaknesses of a linked list vs. a dynamic array: if you view the people as connected nodes in a circular linked list, then it shows how easily the linked list is able to delete nodes (as it only has to rearrange the links between the different nodes). However, the linked list will be poor at finding the next person to remove and will need to traverse the list until it finds that person. A dynamic array, on the other hand, will be poor at deleting nodes (or elements) as it cannot remove one node without individually shifting all the following elements up the list by one. However, it is exceptionally easy to find the nth person in the circle by directly referencing them by their position in the array. A sketch of the linked-list approach is given below.

The list ranking problem concerns the efficient conversion of a linked list representation into an array. Although trivial for a conventional computer, solving this problem by a parallel algorithm is complicated and has been the subject of much research.

A balanced tree has similar memory access patterns and space overhead to a linked list while permitting much more efficient indexing, taking O(log n) time instead of O(n) for a random access. However, insertion and deletion operations are more expensive due to the overhead of tree manipulations to maintain balance. Efficient schemes exist for trees to automatically maintain themselves in an almost-balanced state, such as AVL trees or red–black trees.
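To make the comparison concrete, here is a minimal C sketch of the Josephus election using a circular singly linked list (the struct and function names are assumptions made for this example, not taken from the text): removal is a single pointer update once the victim is reached, but reaching the victim requires walking around the circle.

 #include <stdlib.h>

 struct person {
     int id;
     struct person *next;
 };

 /* n people stand in a circle; repeatedly count k places and remove the
    person reached, until one remains. Returns the survivor's number.
    Assumes n >= 1 and k >= 1. */
 int josephus(int n, int k)
 {
     struct person *first = NULL, *last = NULL;
     for (int i = 1; i <= n; i++) {           /* build the circle 1..n */
         struct person *p = malloc(sizeof *p);
         p->id = i;
         p->next = NULL;
         if (last) last->next = p; else first = p;
         last = p;
     }
     last->next = first;                      /* close the circle */

     struct person *prev = last, *cur = first;
     while (cur->next != cur) {               /* more than one person left */
         for (int i = 1; i < k; i++) {        /* walk k-1 links: the slow part */
             prev = cur;
             cur = cur->next;
         }
         prev->next = cur->next;              /* unlink the victim in O(1) */
         free(cur);
         cur = prev->next;
     }
     int winner = cur->id;
     free(cur);
     return winner;
 }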

Singly linked linear lists vs. other lists


While doubly linked and/or circularly linked lists have advantages over singly linked linear lists, linear lists offer some advantages that make them preferable in some situations. For one thing, a singly linked linear list is a recursive data structure, because it contains a pointer to a smaller object of the same type. For that reason, many operations on singly linked linear lists (such as merging two lists, or enumerating the elements in reverse order) often have very simple recursive algorithms, much simpler than any solution using iterative commands. While one can adapt those recursive solutions for doubly linked and circularly linked lists, the procedures generally need extra arguments and more complicated base cases. Linear singly linked lists also allow tail-sharing, the use of a common final portion of sub-list as the terminal portion of two different lists. In particular, if a new node is added at the beginning of a list, the former list remains available as the tail of the new one, a simple example of a persistent data structure. Again, this is not true with the other variants: a node may never belong to two different circular or doubly linked lists. In particular, end-sentinel nodes can be shared among singly linked non-circular lists. One may even use the same end-sentinel node for every such list. In Lisp, for example, every proper list ends with a link to a special node, denoted by nil or (), whose CAR and CDR links point to itself. Thus a Lisp procedure can safely take the CAR or CDR of any list. The advantages of the fancier variants are often limited to simplifying the algorithms, not improving their efficiency. A circular list, in particular, can usually be emulated by a linear list together with two variables that point to the first and last nodes, at no extra cost.
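A minimal C sketch of tail-sharing (the names here are illustrative assumptions): prepending with a cons-style helper leaves the original list untouched, so several lists can share it as a common tail.

 #include <stdlib.h>

 struct node { int value; struct node *next; };

 /* Prepend a value without modifying the existing list; the old list
    becomes the shared tail of the new one. */
 struct node *cons(int value, struct node *tail)
 {
     struct node *n = malloc(sizeof *n);
     n->value = value;
     n->next  = tail;
     return n;
 }

 void example(void)
 {
     struct node *listA = cons(2, cons(3, NULL)); /* (2 3)   */
     struct node *listB = cons(1, listA);         /* (1 2 3) */
     struct node *listC = cons(9, listA);         /* (9 2 3), sharing (2 3) with listB */
     (void)listB; (void)listC;
 }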

Doubly linked vs. singly linked


Doubly linked lists require more space per node (unless one uses XOR-linking), and their elementary operations are more expensive; but they are often easier to manipulate because they allow sequential access to the list in both directions. In a doubly linked list, one can insert or delete a node in a constant number of operations given only that node's address. To do the same in a singly linked list, one must have the address of the pointer to that node, which is either the handle for the whole list (in the case of the first node) or the link field in the previous node. Some algorithms require access in both directions. On the other hand, doubly linked lists do not allow tail-sharing and cannot be used as persistent data structures.


Circularly linked vs. linearly linked


A circularly linked list may be a natural option to represent arrays that are naturally circular, e.g. the corners of a polygon, a pool of buffers that are used and released in FIFO order, or a set of processes that should be time-shared in round-robin order. In these applications, a pointer to any node serves as a handle to the whole list. With a circular list, a pointer to the last node gives easy access also to the first node, by following one link. Thus, in applications that require access to both ends of the list (e.g., in the implementation of a queue), a circular structure allows one to handle the structure by a single pointer, instead of two. A circular list can be split into two circular lists, in constant time, by giving the addresses of the last node of each piece. The operation consists in swapping the contents of the link fields of those two nodes. Applying the same operation to any two nodes in two distinct lists joins the two lists into one. This property greatly simplifies some algorithms and data structures, such as the quad-edge and face-edge. The simplest representation for an empty circular list (when such a thing makes sense) is a null pointer, indicating that the list has no nodes. Without this choice, many algorithms have to test for this special case, and handle it separately. By contrast, the use of null to denote an empty linear list is more natural and often creates fewer special cases.
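A small C sketch (illustrative names) of the constant-time split/join described above: exchanging the next fields of two nodes splits one ring into two if the nodes lie on the same ring, and joins two rings into one if they do not.

 struct cnode { int value; struct cnode *next; };

 /* Swap the successors of a and b. On circular singly linked lists this
    either splits one ring into two or merges two rings into one. */
 void splice(struct cnode *a, struct cnode *b)
 {
     struct cnode *tmp = a->next;
     a->next = b->next;
     b->next = tmp;
 }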

Using sentinel nodes


Sentinel nodes may simplify certain list operations, by ensuring that the next and/or previous nodes exist for every element, and that even empty lists have at least one node. One may also use a sentinel node at the end of the list, with an appropriate data field, to eliminate some end-of-list tests. For example, when scanning the list looking for a node with a given value x, setting the sentinel's data field to x makes it unnecessary to test for end-of-list inside the loop. Another example is merging two sorted lists: if their sentinels have data fields set to +∞, the choice of the next output node does not need special handling for empty lists. However, sentinel nodes use up extra space (especially in applications that use many short lists), and they may complicate other operations (such as the creation of a new empty list). If a circular list is used merely to simulate a linear list, though, one may avoid some of this complexity by adding a single sentinel node to every list, between the last and the first data nodes. With this convention, an empty list consists of the sentinel node alone, pointing to itself via the next-node link. The list handle should then be a pointer to the last data node, before the sentinel, if the list is not empty, or to the sentinel itself, if the list is empty. The same trick can be used to simplify the handling of a doubly linked linear list, by turning it into a circular doubly linked list with a single sentinel node. In this case, however, the handle should be a single pointer to the dummy node itself.[6]
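A minimal C sketch of the sentinel-search idea (assumed names, assuming the circular-with-sentinel layout just described): planting the key in the sentinel guarantees the scan stops, so the loop needs no separate end-of-list test.

 struct node { int value; struct node *next; };

 /* The list is circular with one sentinel node; sentinel->next is the
    first data node. Returns the matching node, or NULL if not found. */
 struct node *find(struct node *sentinel, int x)
 {
     sentinel->value = x;               /* ensure the scan terminates */
     struct node *cur = sentinel->next;
     while (cur->value != x)            /* no end-of-list check needed */
         cur = cur->next;
     return (cur == sentinel) ? NULL : cur;
 }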

Linked list operations


When manipulating linked lists in-place, care must be taken to not use values that you have invalidated in previous assignments. This makes algorithms for inserting or deleting linked list nodes somewhat subtle. This section gives pseudocode for adding or removing nodes from singly, doubly, and circularly linked lists in-place. Throughout we will use null to refer to an end-of-list marker or sentinel, which may be implemented in a number of ways.

Linearly linked lists


Singly linked lists

Our node data structure will have two fields. We also keep a variable firstNode which always points to the first node in the list, or is null for an empty list.

 record Node {
     data; // The data being stored in the node
     Node next // A reference to the next node, null for last node
 }

 record List {
     Node firstNode // points to first node of list; null for empty list
 }

Traversal of a singly linked list is simple, beginning at the first node and following each next link until we come to the end:

 node := list.firstNode
 while node not null
     (do something with node.data)
     node := node.next

The following code inserts a node after an existing node in a singly linked list. The diagram shows how it works. Inserting a node before an existing one cannot be done directly; instead, one must keep track of the previous node and insert a node after it.


 function insertAfter(Node node, Node newNode) // insert newNode after node
     newNode.next := node.next
     node.next := newNode

Inserting at the beginning of the list requires a separate function, because it must update firstNode:

 function insertBeginning(List list, Node newNode) // insert node before current first node
     newNode.next := list.firstNode
     list.firstNode := newNode

Similarly, we have functions for removing the node after a given node, and for removing a node from the beginning of the list. The diagram demonstrates the former. To find and remove a particular node, one must again keep track of the previous element.


 function removeAfter(Node node) // remove node past this one
     obsoleteNode := node.next
     node.next := node.next.next
     destroy obsoleteNode

 function removeBeginning(List list) // remove first node
     obsoleteNode := list.firstNode
     list.firstNode := list.firstNode.next // point past deleted node
     destroy obsoleteNode

Notice that removeBeginning() sets list.firstNode to null when removing the last node in the list.

Since we can't iterate backwards, efficient insertBefore or removeBefore operations are not possible. Appending one linked list to another can be inefficient unless a reference to the tail is kept as part of the List structure, because we must traverse the entire first list in order to find the tail, and then append the second list to this. Thus, if two linearly linked lists are each of length n, list appending has asymptotic time complexity of O(n). In the Lisp family of languages, list appending is provided by the append procedure.

Many of the special cases of linked list operations can be eliminated by including a dummy element at the front of the list. This ensures that there are no special cases for the beginning of the list and renders both insertBeginning() and removeBeginning() unnecessary. In this case, the first useful data in the list will be found at list.firstNode.next.


Circularly linked list


In a circularly linked list, all nodes are linked in a continuous circle, without using null. For lists with a front and a back (such as a queue), one stores a reference to the last node in the list. The next node after the last node is the first node. Elements can be added to the back of the list and removed from the front in constant time.

Circularly linked lists can be either singly or doubly linked. Both types of circularly linked lists benefit from the ability to traverse the full list beginning at any given node. This often allows us to avoid storing firstNode and lastNode, although if the list may be empty we need a special representation for the empty list, such as a lastNode variable which points to some node in the list or is null if it's empty; we use such a lastNode here. This representation significantly simplifies adding and removing nodes with a non-empty list, but empty lists are then a special case.

Algorithms

Assuming that someNode is some node in a non-empty circular singly linked list, this code iterates through that list starting with someNode:

 function iterate(someNode)
     if someNode ≠ null
         node := someNode
         do
             do something with node.value
             node := node.next
         while node ≠ someNode

Notice that the test "while node ≠ someNode" must be at the end of the loop. If the test were moved to the beginning of the loop, the procedure would fail whenever the list had only one node.

This function inserts a node "newNode" into a circular linked list after a given node "node". If "node" is null, it assumes that the list is empty.

 function insertAfter(Node node, Node newNode)
     if node = null
         newNode.next := newNode
     else
         newNode.next := node.next
         node.next := newNode

Suppose that "L" is a variable pointing to the last node of a circular linked list (or null if the list is empty). To append "newNode" to the end of the list, one may do

 insertAfter(L, newNode)
 L := newNode

To insert "newNode" at the beginning of the list, one may do

 insertAfter(L, newNode)
 if L = null
     L := newNode


Linked lists using arrays of nodes


Languages that do not support any type of reference can still create links by replacing pointers with array indices. The approach is to keep an array of records, where each record has integer fields indicating the index of the next (and possibly previous) node in the array. Not all nodes in the array need be used. If records are also not supported, parallel arrays can often be used instead.

As an example, consider the following linked list record that uses arrays instead of pointers:

 record Entry {
     integer next; // index of next entry in array
     integer prev; // previous entry (if doubly linked)
     string name;
     real balance;
 }

By creating an array of these structures, and an integer variable to store the index of the first element, a linked list can be built:

 integer listHead
 Entry Records[1000]

Links between elements are formed by placing the array index of the next (or previous) cell into the Next or Prev field within a given element. For example:
 Index          Next   Prev   Name               Balance
 0              1      4      Jones, John        123.45
 1              -1     0      Smith, Joseph      234.56
 2 (listHead)   4      -1     Adams, Adam        0.00
 3                            Ignore, Ignatius   999.99
 4              0      2      Another, Anita     876.54
 5
 6
 7

In the above example, listHead would be set to 2, the location of the first entry in the list. Notice that entries 3 and 5 through 7 are not part of the list. These cells are available for any additions to the list. By creating a listFree integer variable, a free list could be created to keep track of which cells are available. If all entries are in use, the size of the array would have to be increased or some elements would have to be deleted before new entries could be stored in the list.

The following code would traverse the list and display names and account balances:

 i := listHead
 while i ≥ 0 // loop through the list
     print i, Records[i].name, Records[i].balance // print entry
     i := Records[i].next

When faced with a choice, the advantages of this approach include:

The linked list is relocatable, meaning it can be moved about in memory at will, and it can also be quickly and directly serialized for storage on disk or transfer over a network.

Especially for a small list, array indexes can occupy significantly less space than a full pointer on many architectures.

Locality of reference can be improved by keeping the nodes together in memory and by periodically rearranging them, although this can also be done in a general store.

Naïve dynamic memory allocators can produce an excessive amount of overhead storage for each node allocated; almost no allocation overhead is incurred per node in this approach.

Seizing an entry from a pre-allocated array is faster than using dynamic memory allocation for each node, since dynamic memory allocation typically requires a search for a free memory block of the desired size.

This approach has one main disadvantage, however: it creates and manages a private memory space for its nodes. This leads to the following issues:

It increases the complexity of the implementation.

Growing a large array when it is full may be difficult or impossible, whereas finding space for a new linked list node in a large, general memory pool may be easier.

Adding elements to a dynamic array will occasionally (when it is full) unexpectedly take linear (O(n)) instead of constant time (although it is still an amortized constant).

Using a general memory pool leaves more memory for other data if the list is smaller than expected or if many nodes are freed.

For these reasons, this approach is mainly used for languages that do not support dynamic memory allocation. These disadvantages are also mitigated if the maximum size of the list is known at the time the array is created. A sketch of the free-list idea for this representation follows.
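Here is a rough C sketch of that free-list idea (field and function names are assumptions made for this example): unused cells are chained together through the same next field used by the live list.

 #include <stdio.h>

 #define MAXENTRIES 1000

 struct entry {
     int    next;            /* index of next entry; -1 marks end of a chain */
     char   name[32];
     double balance;
 };

 static struct entry records[MAXENTRIES];
 static int listHead = -1;    /* -1: the data list is empty        */
 static int listFree = -1;    /* head of the chain of unused cells */

 /* Chain every cell into the free list once at start-up. */
 void initFreeList(void)
 {
     for (int i = 0; i < MAXENTRIES - 1; i++)
         records[i].next = i + 1;
     records[MAXENTRIES - 1].next = -1;
     listFree = 0;
 }

 /* Take a cell from the free list and prepend it to the data list.
    Returns the index used, or -1 if every cell is already in use. */
 int allocEntry(const char *name, double balance)
 {
     if (listFree == -1)
         return -1;
     int i = listFree;
     listFree = records[i].next;
     snprintf(records[i].name, sizeof records[i].name, "%s", name);
     records[i].balance = balance;
     records[i].next = listHead;
     listHead = i;
     return i;
 }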


Language support
Many programming languages such as Lisp and Scheme have singly linked lists built in. In many functional languages, these lists are constructed from nodes, each called a cons or cons cell. The cons has two fields: the car, a reference to the data for that node, and the cdr, a reference to the next node. Although cons cells can be used to build other data structures, this is their primary purpose. In languages that support abstract data types or templates, linked list ADTs or templates are available for building linked lists. In other languages, linked lists are typically built using references together with records.

Internal and external storage


When constructing a linked list, one is faced with the choice of whether to store the data of the list directly in the linked list nodes, called internal storage, or merely to store a reference to the data, called external storage. Internal storage has the advantage of making access to the data more efficient, requiring less storage overall, having better locality of reference, and simplifying memory management for the list (its data is allocated and deallocated at the same time as the list nodes). External storage, on the other hand, has the advantage of being more generic, in that the same data structure and machine code can be used for a linked list no matter what the size of the data is. It also makes it easy to place the same data in multiple linked lists. Although with internal storage the same data can be placed in multiple lists by including multiple next references in the node data structure, it would then be necessary to create separate routines to add or delete cells based on each field. It is possible to create additional linked lists of elements that use internal storage by using external storage, and having the cells of the additional linked lists store references to the nodes of the linked list containing the data. In general, if a set of data structures needs to be included in multiple linked lists, external storage is the best approach. If a set of data structures need to be included in only one linked list, then internal storage is slightly better, unless a generic linked list package using external storage is available. Likewise, if different sets of data that can be stored in the same data structure are to be included in a single linked list, then internal storage would be fine.

Another approach that can be used with some languages involves having different data structures that all share the same initial fields, including the next (and prev, if doubly linked) references in the same location. After defining separate structures for each type of data, a generic structure can be defined that contains the minimum amount of data shared by all the other structures, placed at the top (beginning) of the structures. Then generic routines can be created that use the minimal structure to perform linked list type operations, while separate routines handle the specific data. This approach is often used in message parsing routines, where several types of messages are received, but all start with the same set of fields, usually including a field for the message type. The generic routines are used to add new messages to a queue when they are received and to remove them from the queue in order to process them. The message type field is then used to call the correct routine to process the specific type of message. A sketch of this pattern appears below.
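A brief C sketch of the shared-initial-fields approach (type and function names are made up for this illustration): every message type embeds the same header as its first member, so generic queue code can link any of them through that header.

 #include <stddef.h>

 /* Generic header shared by every message type; must be the first field. */
 struct msg_head {
     struct msg_head *next;
     int type;                     /* discriminator examined after dequeueing */
 };

 struct login_msg {
     struct msg_head head;
     char user[32];
 };

 struct data_msg {
     struct msg_head head;
     int payload;
 };

 /* Generic enqueue that only touches the header; works for any message. */
 void enqueue(struct msg_head **queue, struct msg_head *m)
 {
     m->next = NULL;
     while (*queue != NULL)        /* walk to the end of the singly linked queue */
         queue = &(*queue)->next;
     *queue = m;
 }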


Example of internal and external storage


Suppose you wanted to create a linked list of families and their members. Using internal storage, the structure might look like the following:

 record member { // member of a family
     member next;
     string firstName;
     integer age;
 }
 record family { // the family itself
     family next;
     string lastName;
     string address;
     member members // head of list of members of this family
 }

To print a complete list of families and their members using internal storage, we could write:

 aFamily := Families // start at head of families list
 while aFamily ≠ null // loop through list of families
     print information about family
     aMember := aFamily.members // get head of list of this family's members
     while aMember ≠ null // loop through list of members
         print information about member
         aMember := aMember.next
     aFamily := aFamily.next

Using external storage, we would create the following structures:

 record node { // generic link structure
     node next;
     pointer data // generic pointer for data at node
 }
 record member { // structure for family member
     string firstName;
     integer age
 }
 record family { // structure for family
     string lastName;
     string address;
     node members // head of list of members of this family
 }

To print a complete list of families and their members using external storage, we could write:

 famNode := Families // start at head of families list
 while famNode ≠ null // loop through list of families
     aFamily := (family) famNode.data // extract family from node
     print information about family
     memNode := aFamily.members // get list of family members
     while memNode ≠ null // loop through list of members
         aMember := (member) memNode.data // extract member from node
         print information about member
         memNode := memNode.next
     famNode := famNode.next

Notice that when using external storage, an extra step is needed to extract the record from the node and cast it into the proper data type. This is because both the list of families and the list of members within the family are stored in two linked lists using the same data structure (node), and this language does not have parametric types.

As long as the number of families that a member can belong to is known at compile time, internal storage works fine. If, however, a member needed to be included in an arbitrary number of families, with the specific number known only at run time, external storage would be necessary.


Speeding up search
Finding a specific element in a linked list, even if it is sorted, normally requires O(n) time (linear search). This is one of the primary disadvantages of linked lists over other data structures. In addition to the variants discussed above, below are two simple ways to improve search time. In an unordered list, one simple heuristic for decreasing average search time is the move-to-front heuristic, which simply moves an element to the beginning of the list once it is found. This scheme, handy for creating simple caches, ensures that the most recently used items are also the quickest to find again. Another common approach is to "index" a linked list using a more efficient external data structure. For example, one can build a red-black tree or hash table whose elements are references to the linked list nodes. Multiple such indexes can be built on a single list. The disadvantage is that these indexes may need to be updated each time a node is added or removed (or at least, before that index is used again).
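A minimal C sketch of the move-to-front heuristic (names invented for this example): a successful search unlinks the node and relinks it at the head, so recently found items are found faster on later searches.

 struct node { int key; struct node *next; };

 /* Search for key in the list whose head pointer is *headp.
    If found (and not already first), move the node to the front.
    Returns the matching node, or NULL if the key is absent. */
 struct node *find_mtf(struct node **headp, int key)
 {
     struct node *prev = NULL, *cur = *headp;
     while (cur != NULL && cur->key != key) {
         prev = cur;
         cur = cur->next;
     }
     if (cur != NULL && prev != NULL) {
         prev->next = cur->next;   /* unlink from its current position */
         cur->next = *headp;       /* relink at the head */
         *headp = cur;
     }
     return cur;
 }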

Random access lists


A random access list is a list with support for fast random access to read or modify any element in the list.[7] One possible implementation is a skew binary random access list using the skew binary number system, which involves a list of trees with special properties; this allows worst-case constant time head/cons operations, and worst-case logarithmic time random access to an element by index.[7] Random access lists can be implemented as persistent data structures.[7] Random access lists can be viewed as immutable linked lists in that they likewise support the same O(1) head and tail operations.[7]

A simple extension to random access lists is the min-list, which provides an additional operation that yields the minimum element in the entire list in constant time (without mutation complexities).[7]


Related data structures


Both stacks and queues are often implemented using linked lists, and simply restrict the type of operations which are supported. The skip list is a linked list augmented with layers of pointers for quickly jumping over large numbers of elements, and then descending to the next layer. This process continues down to the bottom layer, which is the actual list. A binary tree can be seen as a type of linked list where the elements are themselves linked lists of the same nature. The result is that each node may include a reference to the first node of one or two other linked lists, which, together with their contents, form the subtrees below that node. An unrolled linked list is a linked list in which each node contains an array of data values. This leads to improved cache performance, since more list elements are contiguous in memory, and reduced memory overhead, because less metadata needs to be stored for each element of the list. A hash table may use linked lists to store the chains of items that hash to the same position in the hash table. A heap shares some of the ordering properties of a linked list, but is almost always implemented using an array. Instead of references from node to node, the next and previous data indexes are calculated using the current data's index. A self-organizing list rearranges its nodes based on some heuristic which reduces search times for data retrieval by keeping commonly accessed nodes at the head of the list.

Notes
[1] Gerald Kruse. CS 240 Lecture Notes (http://www.juniata.edu/faculty/kruse/cs240/syllabus.htm): Linked Lists Plus: Complexity Trade-offs (http://www.juniata.edu/faculty/kruse/cs240/linkedlist2.htm). Juniata College. Spring 2008.
[2] Day 1 Keynote - Bjarne Stroustrup: C++11 Style (http://channel9.msdn.com/Events/GoingNative/GoingNative-2012/Keynote-Bjarne-Stroustrup-Cpp11-Style) at GoingNative 2012 on channel9.msdn.com, from minute 45 or foil 44.
[3] Number crunching: Why you should never, ever, EVER use linked-list in your code again (http://kjellkod.wordpress.com/2012/02/25/why-you-should-never-ever-ever-use-linked-list-in-your-code-again/) at kjellkod.wordpress.com.
[4] Brodnik, Andrej; Carlsson, Svante; Sedgewick, Robert; Munro, JI; Demaine, ED. Resizable Arrays in Optimal Time and Space (Technical Report CS-99-09) (http://www.cs.uwaterloo.ca/research/tr/1999/09/CS-99-09.pdf). Department of Computer Science, University of Waterloo.
[5] The amount of control data required for a dynamic array is usually of the form K + B·n, where K is a per-array constant, B is a per-dimension constant, and n is the number of dimensions; K and B are typically on the order of 10 bytes.
[6] Ford, William and Topp, William. Data Structures with C++ using STL, Second Edition (2002). Prentice-Hall. ISBN 0-13-085850-1, pp. 466-467.
[7] C. Okasaki, "Purely Functional Random-Access Lists" (http://cs.oberlin.edu/~jwalker/refs/fpca95.ps).


References

Juan, Angel (2006). "Ch20 Data Structures; ID06 - PROGRAMMING with JAVA (slide part of the book "Big Java" by Cay S. Horstmann)" (http://www.uoc.edu/in3/emath/docs/java/ch20.pdf) (PDF). p. 3.
"Definition of a linked list" (http://nist.gov/dads/HTML/linkedList.html). National Institute of Standards and Technology. 2004-08-16. Retrieved 2004-12-14.
Antonakos, James L.; Mansfield, Kenneth C., Jr. (1999). Practical Data Structures Using C/C++. Prentice-Hall. pp. 165-190. ISBN 0-13-280843-9.
Collins, William J. (2005) [2002]. Data Structures and the Java Collections Framework. New York: McGraw Hill. pp. 239-303. ISBN 0-07-282379-8.
Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2003). Introduction to Algorithms. MIT Press. pp. 205-213 & 501-505. ISBN 0-262-03293-7.
Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001). "10.2: Linked lists". Introduction to Algorithms (2nd ed.). MIT Press. pp. 204-209. ISBN 0-262-03293-7.
Green, Bert F. Jr. (1961). "Computer Languages for Symbol Manipulation". IRE Transactions on Human Factors in Electronics (2): 3-8.
McCarthy, John (1960). "Recursive Functions of Symbolic Expressions and Their Computation by Machine, Part I" (http://www-formal.stanford.edu/jmc/recursive.html). Communications of the ACM.
Knuth, Donald (1997). "2.2.3-2.2.5". Fundamental Algorithms (3rd ed.). Addison-Wesley. pp. 254-298. ISBN 0-201-89683-4.
Newell, Allen; Shaw, F. C. (1957). "Programming the Logic Theory Machine". Proceedings of the Western Joint Computer Conference: 230-240.
Parlante, Nick (2001). "Linked list basics" (http://cslibrary.stanford.edu/103/LinkedListBasics.pdf). Stanford University. Retrieved 2009-09-21.
Sedgewick, Robert (1998). Algorithms in C. Addison Wesley. pp. 90-109. ISBN 0-201-31452-5.
Shaffer, Clifford A. (1998). A Practical Introduction to Data Structures and Algorithm Analysis. New Jersey: Prentice Hall. pp. 77-102. ISBN 0-13-660911-2.
Wilkes, Maurice Vincent (1964). "An Experiment with a Self-compiling Compiler for a Simple List-Processing Language". Annual Review in Automatic Programming (Pergamon Press) 4 (1).
Wilkes, Maurice Vincent (1964). "Lists and Why They are Useful". Proceedings of the ACM National Conference, Philadelphia 1964 (ACM) (P-64): F1-1.
Shanmugasundaram, Kulesh (2005-04-04). "Linux Kernel Linked List Explained" (http://isis.poly.edu/kulesh/stuff/src/klist/). Retrieved 2009-09-21.

External links
Description (http://nist.gov/dads/HTML/linkedList.html) from the Dictionary of Algorithms and Data Structures
Introduction to Linked Lists (http://cslibrary.stanford.edu/103/), Stanford University Computer Science Library
Linked List Problems (http://cslibrary.stanford.edu/105/), Stanford University Computer Science Library
Open Data Structures - Chapter 3 - Linked Lists (http://opendatastructures.org/versions/edition-0.1e/ods-java/3_Linked_Lists.html)
Patent for the idea of having nodes which are in several linked lists simultaneously (http://www.google.com/patents?vid=USPAT7028023) (note that this technique was widely used for many decades before the patent was granted)


Doubly linked list


In computer science, a doubly linked list is a linked data structure that consists of a set of sequentially linked records called nodes. Each node contains two fields, called links, that are references to the previous and to the next node in the sequence of nodes. The beginning and ending nodes' previous and next links, respectively, point to some kind of terminator, typically a sentinel node or null, to facilitate traversal of the list. If there is only one sentinel node, then the list is circularly linked via the sentinel node. It can be conceptualized as two singly linked lists formed from the same data items, but in opposite sequential orders.

A doubly linked list whose nodes contain three fields: an integer value, the link to the next node, and the link to the previous node.

The two node links allow traversal of the list in either direction. While adding or removing a node in a doubly linked list requires changing more links than the same operations on a singly linked list, the operations are simpler and potentially more efficient (for nodes other than the first node) because there is no need to keep track of the previous node during traversal, and no need to traverse the list to find the previous node, so that its link can be modified.

Nomenclature and implementation


The first and last nodes of a doubly linked list are immediately accessible (i.e., accessible without traversal, and usually called head and tail) and therefore allow traversal of the list from the beginning or end of the list, respectively: e.g., traversing the list from beginning to end, or from end to beginning, in a search of the list for a node with specific data value. Any node of a doubly linked list, once gotten, can be used to begin a new traversal of the list, in either direction (towards beginning or end), from the given node. The link fields of a doubly linked list node are often called next and previous or forward and backward. The references stored in the link fields are usually implemented as pointers, but (as in any linked data structure) they may also be address offsets or indices into an array where the nodes live.

Basic algorithms
Open doubly linked lists
Data type declarations

 record DoublyLinkedNode {
     prev // A reference to the previous node
     next // A reference to the next node
     data // Data or a reference to data
 }

 record DoublyLinkedList {
     DoublyLinkedNode firstNode // points to first node of list
     DoublyLinkedNode lastNode // points to last node of list
 }

Traversing the list

Traversal of a doubly linked list can be in either direction. In fact, the direction of traversal can change many times, if desired. Traversal is often called iteration, but that choice of terminology is unfortunate, for iteration has well-defined semantics (e.g., in mathematics) which are not analogous to traversal.

Forwards

 node := list.firstNode
 while node ≠ null
     <do something with node.data>
     node := node.next

Backwards

 node := list.lastNode
 while node ≠ null
     <do something with node.data>
     node := node.prev

Inserting a node

These symmetric functions insert a node either after or before a given node, with the diagram demonstrating insertAfter:


 function insertAfter(List list, Node node, Node newNode)
     newNode.prev := node
     newNode.next := node.next
     if node.next == null
         list.lastNode := newNode
     else
         node.next.prev := newNode
     node.next := newNode

 function insertBefore(List list, Node node, Node newNode)
     newNode.prev := node.prev
     newNode.next := node
     if node.prev == null
         list.firstNode := newNode
     else
         node.prev.next := newNode
     node.prev := newNode

We also need a function to insert a node at the beginning of a possibly empty list:

 function insertBeginning(List list, Node newNode)
     if list.firstNode == null
         list.firstNode := newNode
         list.lastNode := newNode
         newNode.prev := null
         newNode.next := null
     else
         insertBefore(list, list.firstNode, newNode)

A symmetric function inserts at the end:

 function insertEnd(List list, Node newNode)
     if list.lastNode == null
         insertBeginning(list, newNode)
     else
         insertAfter(list, list.lastNode, newNode)

Removing a node

Removal of a node is easier than insertion, but requires special handling if the node to be removed is the firstNode or lastNode:

 function remove(List list, Node node)
     if node.prev == null
         list.firstNode := node.next
     else
         node.prev.next := node.next
     if node.next == null
         list.lastNode := node.prev
     else
         node.next.prev := node.prev
     destroy node

One subtle consequence of the above procedure is that deleting the last node of a list sets both firstNode and lastNode to null, and so it handles removing the last node from a one-element list correctly. Notice that we also don't need separate "removeBefore" or "removeAfter" methods, because in a doubly linked list we can just use "remove(node.prev)" or "remove(node.next)" where these are valid. This also assumes that the node being removed is guaranteed to exist. If the node does not exist in this list, then some error handling would be required.


Circular doubly linked lists


Traversing the list

Assuming that someNode is some node in a non-empty list, this code traverses through that list starting with someNode (any node will do):

Forwards

 node := someNode
 do
     do something with node.value
     node := node.next
 while node ≠ someNode

Backwards

 node := someNode
 do
     do something with node.value
     node := node.prev
 while node ≠ someNode

Notice the postponing of the test to the end of the loop. This is important for the case where the list contains only the single node someNode.

Inserting a node

This simple function inserts a node into a doubly linked circularly linked list after a given element:

 function insertAfter(Node node, Node newNode)
     newNode.next := node.next
     newNode.prev := node
     node.next.prev := newNode
     node.next := newNode

To do an "insertBefore", we can simply "insertAfter(node.prev, newNode)". Inserting an element in a possibly empty list requires a special function:

 function insertEnd(List list, Node node)
     if list.lastNode == null
         node.prev := node
         node.next := node
     else
         insertAfter(list.lastNode, node)
     list.lastNode := node

To insert at the beginning we simply "insertAfter(list.lastNode, node)". Finally, removing a node must deal with the case where the list empties:

 function remove(List list, Node node)
     if node.next == node
         list.lastNode := null
     else
         node.next.prev := node.prev
         node.prev.next := node.next
         if node == list.lastNode
             list.lastNode := node.prev
     destroy node


References


Stack (abstract data type)


In computer science, a stack is a last in, first out (LIFO) abstract data type and linear data structure. A stack can have any abstract data type as an element, but is characterized by two fundamental operations, called push and pop (or pull). The push operation adds a new item to the top of the stack, or initializes the stack if it is empty. If the stack is full and does not contain enough space to accept the given item, the stack is considered to be in an overflow state. The pop operation removes an item from the top of the stack. A pop either reveals previously concealed items or results in an empty stack; if the stack is already empty, it goes into an underflow state (meaning no items are present in the stack to be removed).

Simple representation of a stack

A stack pointer is a register that holds the address of the top of the stack; it always points to the top value of the stack. A stack is a restricted data structure, because only a small number of operations are performed on it. The nature of the pop and push operations also means that stack elements have a natural order. Elements are removed from the stack in the reverse order to the order of their addition: therefore, the lower elements are those that have been on the stack the longest.[1]

History
The stack was first proposed in 1946, in the computer design of Alan M. Turing (who used the terms "bury" and "unbury") as a means of calling and returning from subroutines. In 1957, the Germans Klaus Samelson and Friedrich L. Bauer patented the idea.[2] The same concept was developed, independently, by the Australian Charles Leonard Hamblin in the first half of 1957.[3]

Abstract definition
A stack is a basic computer science data structure and can be defined in an abstract, implementation-free manner, or it can be defined generally as a linear list of items in which all additions and deletions are restricted to one end, the top.

This is a VDM (Vienna Development Method) description of a stack:[4]

Function signatures:

 init: -> Stack
 push: N x Stack -> Stack
 top: Stack -> (N U ERROR)
 remove: Stack -> Stack
 isempty: Stack -> Boolean

(where N indicates an element (natural numbers in this case), and U indicates set union)

Semantics:

 top(init()) = ERROR
 top(push(i, s)) = i
 remove(init()) = init()
 remove(push(i, s)) = s
 isempty(init()) = true
 isempty(push(i, s)) = false
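As a quick worked check of these axioms (an example added here, not part of the VDM description), consider the stack built by push(3, push(5, init())):

 top(push(3, push(5, init()))) = 3
 remove(push(3, push(5, init()))) = push(5, init())
 top(remove(push(3, push(5, init())))) = top(push(5, init())) = 5
 isempty(push(3, push(5, init()))) = false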

Inessential operations
In many implementations, a stack has more operations than "push" and "pop". An example is "top of stack", or "peek", which observes the top-most element without removing it from the stack.[5] Since this can be done with a "pop" and a "push" with the same data, it is not essential. An underflow condition can occur in the "stack top" operation if the stack is empty, the same as "pop". Often implementations have a function which just returns whether the stack is empty.

Software stacks
Implementation
In most high level languages, a stack can be easily implemented either through an array or a linked list. What identifies the data structure as a stack in either case is not the implementation but the interface: the user is only allowed to pop or push items onto the array or linked list, with few other helper operations. The following will demonstrate both implementations, using C.

Array

The array implementation aims to create an array where the first element (usually at the zero-offset) is the bottom. That is, array[0] is the first element pushed onto the stack and the last element popped off. The program must keep track of the size, or length, of the stack. The stack itself can therefore be effectively implemented as a two-element structure in C (STACKSIZE is a compile-time constant giving the maximum capacity):

 typedef struct {
     size_t size;
     int items[STACKSIZE];
 } STACK;

The push() operation is used both to initialize the stack, and to store values to it. It is responsible for inserting (copying) the value into the ps->items[] array and for incrementing the element counter (ps->size). In a responsible C implementation, it is also necessary to check whether the array is already full to prevent an overrun.

 void push(STACK *ps, int x)
 {
     if (ps->size == STACKSIZE) {
         fputs("Error: stack overflow\n", stderr);
         abort();
     } else
         ps->items[ps->size++] = x;
 }

The pop() operation is responsible for removing a value from the stack, and for decrementing the value of ps->size. A responsible C implementation will also need to check that the array is not already empty.

 int pop(STACK *ps)
 {
     if (ps->size == 0) {
         fputs("Error: stack underflow\n", stderr);
         abort();
     } else
         return ps->items[--ps->size];
 }

If we use a dynamic array, then we can implement a stack that can grow or shrink as much as needed. The size of the stack is simply the size of the dynamic array. A dynamic array is a very efficient implementation of a stack, since adding items to or removing items from the end of a dynamic array is amortized O(1) time.

Linked list

The linked-list implementation is equally simple and straightforward. In fact, a simple singly linked list is sufficient to implement a stack: it only requires that the head node or element can be removed, or popped, and a node can only be inserted by becoming the new head node. Unlike the array implementation, our structure typedef corresponds not to the entire stack structure, but to a single node:

 typedef struct stack {
     int data;
     struct stack *next;
 } STACK;

Such a node is identical to a typical singly linked list node, at least to those that are implemented in C. The push() operation both initializes an empty stack, and adds a new node to a non-empty one. It works by receiving a data value to push onto the stack, along with a target stack, creating a new node by allocating memory for it, and then inserting it into a linked list as the new head (empty() is assumed here to be a helper that reports whether the stack pointer is null):

 void push(STACK **head, int value)
 {
     STACK *node = malloc(sizeof(STACK));  /* create a new node */

     if (node == NULL) {
         fputs("Error: no space available for node\n", stderr);
         abort();
     } else {                              /* initialize node */
         node->data = value;
         node->next = empty(*head) ? NULL : *head; /* insert new head if any */
         *head = node;
     }
 }

A pop() operation removes the head from the linked list, and assigns the pointer to the head to the previous second node. It checks whether the list is empty before popping from it:

 int pop(STACK **head)
 {
     if (empty(*head)) {                   /* stack is empty */
         fputs("Error: stack underflow\n", stderr);
         abort();
     } else {                              /* pop a node */
         STACK *top = *head;
         int value = top->data;
         *head = top->next;
         free(top);
         return value;
     }
 }

Stacks and programming languages


Some languages, like LISP and Python, do not call for stack implementations, since push and pop functions are available for any list. All Forth-like languages (such as Adobe PostScript) are also designed around language-defined stacks that are directly visible to and manipulated by the programmer. Examples from Common Lisp:

 (setf list (list 'a 'b 'c))
 ;; (A B C)
 (pop list)
 ;; A
 list
 ;; (B C)
 (push 'new list)
 ;; (NEW B C)

C++'s Standard Template Library provides a "stack" templated class which is restricted to only push/pop operations. Java's library contains a Stack class that is a specialization of Vector; this could be considered a design flaw, since the inherited get() method from Vector ignores the LIFO constraint of the Stack. PHP has an SplStack [6] class.


Hardware stacks
A common use of stacks at the architecture level is as a means of allocating and accessing memory.

Basic architecture of a stack


A typical stack is an area of computer memory with a fixed origin and a variable size. Initially the size of the stack is zero. A stack pointer, usually in the form of a hardware register, points to the most recently referenced location on the stack; when the stack has a size of zero, the stack pointer points to the origin of the stack. The two operations applicable to all stacks are: a push operation, in which a data item is placed at the location pointed to by the stack pointer, and the address in the stack pointer is adjusted by the size of the data item; a pop or pull operation: a data item at the current location pointed to by the stack pointer is removed, and the stack pointer is adjusted by the size of the data item. There are many variations on the basic principle of stack operations. Every stack has a fixed location in memory at which it begins. As data items are added to the stack, the stack pointer is displaced to indicate the current extent of the stack, which expands away from the origin.
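As an illustrative sketch only (a toy C model invented for this section, not how any particular CPU defines its stack), a descending stack can be modelled with an array and an index acting as the stack pointer: push adjusts the pointer and then stores, pop loads and then adjusts the pointer the other way. Overflow and underflow checks are omitted here.

 #define STACK_WORDS 1024

 static int memory[STACK_WORDS];    /* the stack area                        */
 static int sp = STACK_WORDS;       /* stack pointer; empty stack: at origin */

 /* Push: move the pointer down by one item, then place the data there. */
 void push(int x) { memory[--sp] = x; }

 /* Pop: take the data at the current top, then move the pointer back up. */
 int pop(void) { return memory[sp++]; }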

Stack pointers may point to the origin of a stack or to a limited range of addresses either above or below the origin (depending on the direction in which the stack grows); however, the stack pointer cannot cross the origin of the stack. In other words, if the origin of the stack is at address 1000 and the stack grows downwards (towards addresses 999, 998, and so on), the stack pointer must never be incremented beyond 1000 (to 1001, 1002, etc.). If a pop operation on the stack causes the stack pointer to move past the origin of the stack, a stack underflow occurs. If a push operation causes the stack pointer to increment or decrement beyond the maximum extent of the stack, a stack overflow occurs.

Some environments that rely heavily on stacks may provide additional operations, for example:

Duplicate: the top item is popped, and then pushed again (twice), so that an additional copy of the former top item is now on top, with the original below it.

Peek: the topmost item is inspected (or returned), but the stack pointer is not changed, and the stack size does not change (meaning that the item remains on the stack). This is also called the top operation in many articles.

A typical stack, storing local data and call information for nested procedure calls (not necessarily nested procedures!). This stack grows downward from its origin. The stack pointer points to the current topmost datum on the stack. A push operation decrements the pointer and copies the data to the stack; a pop operation copies data from the stack and then increments the pointer. Each procedure called in the program stores procedure return information (in yellow) and local data (in other colors) by pushing them onto the stack. This type of stack implementation is extremely common, but it is vulnerable to buffer overflow attacks (see the text).

* Swap or exchange: the two topmost items on the stack exchange places.
* Rotate (or Roll): the n topmost items are moved on the stack in a rotating fashion. For example, if n = 3, items 1, 2, and 3 on the stack are moved to positions 2, 3, and 1 on the stack, respectively. Many variants of this operation are possible, with the most common being called left rotate and right rotate.

Stacks are either visualized growing from the bottom up (like real-world stacks), or with the top of the stack in a fixed position (see image; note that in the image the top (28) is the stack 'bottom', since the stack 'top' is where items are pushed or popped from), like a coin holder or a Pez dispenser, or growing from left to right, so that "topmost" becomes "rightmost". This visualization may be independent of the actual structure of the stack in memory. This means that a right rotate will move the first element to the third position, the second to the first and the third to the second. Here are two equivalent visualizations of this process:

apple                          banana
banana    ===right rotate==>   cucumber
cucumber                       apple

cucumber                       apple
banana    ===left rotate==>    cucumber
apple                          banana

A stack is usually represented in computers by a block of memory cells, with the "bottom" at a fixed location, and the stack pointer holding the address of the current "top" cell in the stack. The top and bottom terminology is used irrespective of whether the stack actually grows towards lower memory addresses or towards higher memory addresses. Pushing an item onto the stack adjusts the stack pointer by the size of the item (either decrementing or incrementing, depending on the direction in which the stack grows in memory), pointing it to the next cell, and copies the new top item to the stack area. Depending again on the exact implementation, at the end of a push operation the stack pointer may point to the next unused location in the stack, or it may point to the topmost item in the stack. If the stack pointer points to the current topmost item, it will be updated before a new item is pushed onto the stack; if it points to the next available location in the stack, it will be updated after the new item is pushed onto the stack. Popping the stack is simply the inverse of pushing: the topmost item in the stack is removed and the stack pointer is updated, in the opposite order of that used in the push operation.

Hardware support
Stack in main memory

Most CPUs have registers that can be used as stack pointers. Processor families like the x86, Z80, 6502, and many others have special instructions that implicitly use a dedicated (hardware) stack pointer to conserve opcode space. Some processors, like the PDP-11 and the 68000, also have special addressing modes for implementation of stacks, typically with a semi-dedicated stack pointer as well (such as A7 in the 68000). However, in most processors, several different registers may be used as additional stack pointers as needed (whether updated via addressing modes or via add/sub instructions).

Stack in registers or dedicated memory

The x87 floating point architecture is an example of a set of registers organised as a stack where direct access to individual registers (relative to the current top) is also possible. As with stack-based machines in general, having the top-of-stack as an implicit argument allows for a small machine code footprint with good usage of bus bandwidth and code caches, but it also prevents some types of optimizations possible on processors permitting random access to the register file for all (two or three) operands. A stack structure also makes superscalar implementations with register renaming (for speculative execution) somewhat more complex to implement, although it is still feasible, as exemplified by modern x87 implementations.

Sun SPARC, AMD Am29000, and Intel i960 are all examples of architectures using register windows within a register-stack as another strategy to avoid the use of slow main memory for function arguments and return values. There are also a number of small microprocessors that implement a stack directly in hardware, and some microcontrollers have a fixed-depth stack that is not directly accessible. Examples are the PIC microcontrollers, the Computer Cowboys MuP21, the Harris RTX line, and the Novix NC4016. Many stack-based microprocessors were used to implement the programming language Forth at the microcode level. Stacks were also used as a basis of a number of mainframes and minicomputers. Such machines were called stack machines, the most famous being the Burroughs B5000.

Applications
Stacks have numerous applications. We see stacks in everyday life, from the books in our library, to the sheaf of papers that we keep in our printer tray. All of them follow the Last In First Out (LIFO) logic, that is when we add a book to a pile of books, we add it to the top of the pile, whereas when we remove a book from the pile, we generally remove it from the top of the pile. Given below are a few applications of stacks in the world of computers:

Converting a decimal number into a binary number


The logic for transforming a decimal number into a binary number is as follows:
1. Read a number
2. Iterate while the number is greater than zero
   1. Find the remainder after dividing the number by 2
   2. Print the remainder
   3. Divide the number by 2
3. End the iteration

Decimal to binary conversion of 23

However, there is a problem with this logic. Suppose the number whose binary form we want to find is 23. Using this logic, we get the result as 11101, instead of 10111. To solve this problem, we use a stack.[7] We make use of the LIFO property of the stack. Initially we push each binary digit formed onto the stack, instead of printing it directly. After the entire number has been converted into binary form, we pop one digit at a time from the stack and print it. Therefore we get the decimal number converted into its proper binary form.

Algorithm:

function outputInBinary(Integer n)
    Stack s = new Stack
    while n > 0 do
        Integer bit = n modulo 2
        s.push(bit)
        if s is full then
            return error
        end if
        n = floor(n / 2)
    end while
    while s is not empty do
        output(s.pop())
    end while
end function
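As an illustration, a minimal C version of this algorithm might look as follows; the function name, the fixed capacity and the use of a plain array as the stack are assumptions made only for the example.

#include <stdio.h>

#define MAX_BITS 64                  /* enough digits for an unsigned long long */

void output_in_binary(unsigned long long n)
{
    int stack[MAX_BITS];             /* holds the remainders (binary digits) */
    int top = 0;                     /* number of digits currently pushed    */

    if (n == 0) {                    /* special case: zero prints as "0"     */
        putchar('0');
        return;
    }
    while (n > 0) {                  /* push digits, least significant first */
        stack[top++] = (int)(n % 2);
        n /= 2;
    }
    while (top > 0)                  /* pop and print, most significant first */
        putchar('0' + stack[--top]);
}

int main(void)
{
    output_in_binary(23);            /* prints 10111 */
    putchar('\n');
    return 0;
}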

Towers of Hanoi
One of the most interesting applications of stacks can be found in solving a puzzle called Tower of Hanoi. According to an old Brahmin story, the existence of the universe is calculated in terms of the time taken by a number of monks, who are working all the time, to move 64 disks from one pole to another. But there are some rules about how this should be done, which are:

Towers of Hanoi

1. Move only one disk at a time.
2. For temporary storage, a third pole may be used.
3. A disk of larger diameter may not be placed on a disk of smaller diameter.[8]
For the algorithm of this puzzle, see Tower of Hanoi. Assume that A is the first tower, B the second tower and C the third tower.


Towers of Hanoi step 1

Towers of Hanoi step 2

Towers of Hanoi step 3


Towers of Hanoi step 4

Output: (when there are 3 disks) Let 1 be the smallest disk, 2 be the disk of medium size and 3 be the largest disk.

Tower of Hanoi


Move disk | From peg | To peg
1         | A        | C
2         | A        | B
1         | C        | B
3         | A        | C
1         | B        | A
2         | B        | C
1         | A        | C

The C++ code for this solution can be implemented in two ways:

First implementation (using stacks implicitly by recursion)

void TowersofHanoi(int n, int a, int b, int c)
{
    if (n > 0)
    {
        TowersofHanoi(n-1, a, c, b);   // recursion
        cout << " Move top disk from tower " << a << " to tower " << b << endl;
        TowersofHanoi(n-1, c, b, a);   // recursion
    }
}
[9]

Second implementation (using stacks explicitly)

// Global variable, tower[1:3] are three towers
arrayStack<int> tower[4];

void TowerofHanoi(int n)
{
    // Preprocessor for moveAndShow.
    for (int d = n; d > 0; d--)      // initialize
        tower[1].push(d);            // add disk d to tower 1
    moveAndShow(n, 1, 2, 3);         /* move n disks from tower 1 to tower 2,
                                        using tower 3 as the intermediate tower */
}

void moveAndShow(int n, int a, int b, int c)
{
    // Move the top n disks from tower a to tower b showing states.
    // Use tower c for intermediate storage.
    if (n > 0)
    {
        moveAndShow(n-1, a, c, b);   // recursion
        int d = tower[a].top();      // move a disk from the top of tower a
        tower[a].pop();              // to the top of tower b
        tower[b].push(d);
        showState();                 // show state of 3 towers
        moveAndShow(n-1, c, b, a);   // recursion
    }
}


However, the complexity of both implementations above is O(2^n), since moving n disks requires 2^n - 1 moves. So it is obvious that the problem can only be solved for small values of n (generally n <= 30).

In the case of the monks, the number of moves needed to transfer 64 disks, following the above rules, will be 2^64 - 1 = 18,446,744,073,709,551,615, which will surely take a lot of time![8][9]

Expression evaluation and syntax parsing


Calculators employing reverse Polish notation use a stack structure to hold values. Expressions can be represented in prefix, postfix or infix notations, and conversion from one form to another may be accomplished using a stack. Many compilers use a stack for parsing the syntax of expressions, program blocks etc. before translating into low-level code. Most programming languages are context-free languages, allowing them to be parsed with stack-based machines.

Evaluation of an infix expression that is fully parenthesized

Input: (((2 * 5) - (1 * 2)) / (11 - 9))
Output: 4

Analysis: Five types of input characters
1. Opening bracket
2. Numbers
3. Operators
4. Closing bracket
5. New line character

Data structure requirement: A character stack

Algorithm:
1. Read one input character
2. Actions at end of each input
   Opening bracket    (2.1) Push into stack and then Go to step (1)
   Number             (2.2) Push into stack and then Go to step (1)
   Operator           (2.3) Push into stack and then Go to step (1)
   Closing bracket    (2.4) Pop from character stack
        (2.4.1) If it is an opening bracket, then discard it, Go to step (1)
        (2.4.2) Pop is used four times
                The first popped element is assigned to op2
                The second popped element is assigned to op
                The third popped element is assigned to op1
                The fourth popped element is the remaining opening bracket, which can be discarded
                Evaluate op1 op op2
                Convert the result into a character and push it into the stack
                Go to step (2.4)
   New line character (2.5) Pop from stack and print the answer
                            STOP

Result: The evaluation of the fully parenthesized infix expression is printed as follows: Input String: (((2 * 5) - (1 * 2)) / (11 - 9))
Input Symbol | Stack (from bottom to top) | Operation
(            | (                          |
(            | ((                         |
(            | (((                        |
2            | ((( 2                      |
*            | ((( 2 *                    |
5            | ((( 2 * 5                  |
)            | (( 10                      | 2 * 5 = 10 and push
-            | (( 10 -                    |
(            | (( 10 - (                  |
1            | (( 10 - ( 1                |
*            | (( 10 - ( 1 *              |
2            | (( 10 - ( 1 * 2            |
)            | (( 10 - 2                  | 1 * 2 = 2 & Push
)            | ( 8                        | 10 - 2 = 8 & Push
/            | ( 8 /                      |
(            | ( 8 / (                    |
11           | ( 8 / ( 11                 |
-            | ( 8 / ( 11 -               |
9            | ( 8 / ( 11 - 9             |
)            | ( 8 / 2                    | 11 - 9 = 2 & Push
)            | 4                          | 8 / 2 = 4 & Push
New line     | Empty                      | Pop & Print
[10]

Evaluation of an infix expression that is not fully parenthesized

Input: (2 * 5 - 1 * 2) / (11 - 9)
Output: 4

Analysis: There are five types of input characters, which are:
1. Opening brackets
2. Numbers
3. Operators
4. Closing brackets
5. New line character (\n)

We do not know what to do if an operator is read as an input character. By implementing a priority rule for operators, we have a solution to this problem.

The priority rule: when an operator is read, we perform a comparative priority check before pushing it. If the top of the stack contains an operator of priority higher than or equal to the priority of the input operator, we pop it and apply it. We keep performing the priority check until the top of the stack either contains an operator of lower priority or does not contain an operator (a short code sketch of this check follows the algorithm below).

Data structure requirement for this problem: a character stack and an integer stack

Algorithm:
1. Read an input character
2. Actions that will be performed at the end of each input
   Opening brackets   (2.1) Push it into the character stack and then Go to step (1)
   Number             (2.2) Push into the integer stack, Go to step (1)
   Operator           (2.3) Do the comparative priority check
        (2.3.1) If the character stack's top contains an operator with equal or higher priority, then pop it into op
                Pop a number from the integer stack into op2
                Pop another number from the integer stack into op1
                Calculate op1 op op2 and push the result into the integer stack
   Closing brackets   (2.4) Pop from the character stack
        (2.4.1) If it is an opening bracket, then discard it and Go to step (1)
        (2.4.2) Assign the popped element to op
                Pop a number from the integer stack and assign it to op2
                Pop another number from the integer stack and assign it to op1
                Calculate op1 op op2 and push the result into the integer stack
                Go to step (2.4)
   New line character (2.5) Print the result after popping from the stack
                            STOP
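To make the priority rule concrete, the following C fragment sketches how step (2.3) might be coded; the two array stacks and the names prec(), apply() and handle_operator() are hypothetical, introduced only for this illustration.

/* Hypothetical evaluator state: a character stack for operators and brackets,
   and an integer stack for operands, as the algorithm above requires. */
static char ops[100];  static int otop = 0;
static int  vals[100]; static int vtop = 0;

static int prec(char op)      /* operator priority; '(' gets the lowest value */
{
    return (op == '*' || op == '/') ? 2 : (op == '+' || op == '-') ? 1 : 0;
}

static int apply(int a, char op, int b)
{
    switch (op) {
    case '+': return a + b;
    case '-': return a - b;
    case '*': return a * b;
    default:  return a / b;
    }
}

/* Step (2.3): while the operator on top of the character stack has equal or
   higher priority than the incoming operator c, pop it and apply it to the two
   topmost numbers on the integer stack; finally push the incoming operator. */
static void handle_operator(char c)
{
    while (otop > 0 && prec(ops[otop - 1]) >= prec(c)) {
        char op  = ops[--otop];
        int  op2 = vals[--vtop];
        int  op1 = vals[--vtop];
        vals[vtop++] = apply(op1, op, op2);
    }
    ops[otop++] = c;
}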

Result: The evaluation of an infix expression that is not fully parenthesized is printed as follows: Input String: (2 * 5 - 1 * 2) / (11 - 9)


Input Symbol | Character Stack (from bottom to top) | Integer Stack (from bottom to top) | Operation performed
(        | (     |         |
2        | (     | 2       |
*        | ( *   | 2       | Push * as it has higher priority
5        | ( *   | 2 5     |
-        | ( -   | 10      | Since '-' has less priority, we do 2 * 5 = 10; we push 10 and then push '-'
1        | ( -   | 10 1    |
*        | ( - * | 10 1    | Push * as it has higher priority
2        | ( - * | 10 1 2  |
)        |       | 8       | Perform 1 * 2 = 2 and push it; pop '-' and do 10 - 2 = 8 and push it; pop '('
/        | /     | 8       |
(        | / (   | 8       |
11       | / (   | 8 11    |
-        | / ( - | 8 11    |
9        | / ( - | 8 11 9  |
)        | /     | 8 2     | Perform 11 - 9 = 2 and push it
New line |       | 4       | Perform 8 / 2 = 4 and push it; print the output, which is 4
[10]

Evaluation of a prefix expression

Input: / - * 2 5 * 1 2 - 11 9
Output: 4

Analysis: There are three types of input characters
1. Numbers
2. Operators
3. New line character (\n)

Data structure requirement: a character stack and an integer stack

Algorithm:
1. Read one character of input at a time and keep pushing it into the character stack until the new line character is reached
2. Perform pop from the character stack. If the stack is empty, go to step (3)
   Number   (2.1) Push it into the integer stack and then go to step (2)
   Operator (2.2) Assign the operator to op
                  Pop a number from the integer stack and assign it to op1
                  Pop another number from the integer stack and assign it to op2
                  Calculate op1 op op2 and push the result into the integer stack
                  Go to step (2)
3. Pop the result from the integer stack and display the result

Result: the evaluation of prefix expression is printed as follows: Input String: / - * 2 5 * 1 2 - 11 9


Input Symbol | Character Stack (from bottom to top) | Integer Stack (from bottom to top) | Operation performed
/    | /                       |         |
-    | / -                     |         |
*    | / - *                   |         |
2    | / - * 2                 |         |
5    | / - * 2 5               |         |
*    | / - * 2 5 *             |         |
1    | / - * 2 5 * 1           |         |
2    | / - * 2 5 * 1 2         |         |
-    | / - * 2 5 * 1 2 -       |         |
11   | / - * 2 5 * 1 2 - 11    |         |
9    | / - * 2 5 * 1 2 - 11 9  |         |
\n   | / - * 2 5 * 1 2 - 11    | 9       |
     | / - * 2 5 * 1 2 -       | 9 11    |
     | / - * 2 5 * 1 2         | 2       | 11 - 9 = 2
     | / - * 2 5 * 1           | 2 2     |
     | / - * 2 5 *             | 2 2 1   |
     | / - * 2 5               | 2 2     | 1 * 2 = 2
     | / - * 2                 | 2 2 5   |
     | / - *                   | 2 2 5 2 |
     | / -                     | 2 2 10  | 5 * 2 = 10
     | /                       | 2 8     | 10 - 2 = 8
     | Stack is empty          | 4       | 8 / 2 = 4
     |                         | 4       | Print 4
[10]

Evaluation of a postfix expression

The calculation 1 + 2 * 4 + 3 can be written down like this in postfix notation, with the advantage that no precedence rules and no parentheses are needed:

1 2 4 * + 3 +

The expression is evaluated from left to right using a stack:
1. when encountering an operand: push it
2. when encountering an operator: pop two operands, evaluate the result and push it.

The evaluation proceeds as follows (the stack is displayed after the operation has taken place):


Input | Operation    | Stack (after op)
1     | Push operand | 1
2     | Push operand | 2, 1
4     | Push operand | 4, 2, 1
*     | Multiply     | 8, 1
+     | Add          | 9
3     | Push operand | 3, 9
+     | Add          | 12

The final result, 12, lies on the top of the stack at the end of the calculation.

Example in C

#include <stdio.h>

int main()
{
    int a[100], i;
    printf("To pop enter -1\n");
    for (i = 0;;) {
        printf("Push ");
        scanf("%d", &a[i]);
        if (a[i] == -1) {
            if (i == 0) {
                printf("Underflow\n");
            } else {
                printf("pop = %d\n", a[--i]);
            }
        } else {
            i++;
        }
    }
}
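The C program above only demonstrates push and pop on an array; as a further sketch, a hypothetical postfix evaluator for single-digit operands, following the two rules above, could look like this (the function name and fixed stack size are invented for the example):

#include <stdio.h>
#include <ctype.h>

int eval_postfix(const char *expr)
{
    int stack[100];
    int top = 0;                        /* number of values on the stack */
    for (; *expr != '\0'; expr++) {
        char c = *expr;
        if (isdigit((unsigned char)c)) {
            stack[top++] = c - '0';     /* operand: push it */
        } else if (c == '+' || c == '-' || c == '*' || c == '/') {
            int op2 = stack[--top];     /* operator: pop two operands */
            int op1 = stack[--top];
            int result;
            switch (c) {
            case '+': result = op1 + op2; break;
            case '-': result = op1 - op2; break;
            case '*': result = op1 * op2; break;
            default:  result = op1 / op2; break;
            }
            stack[top++] = result;      /* push the result back */
        }                               /* anything else (spaces) is skipped */
    }
    return stack[top - 1];              /* final result lies on top of the stack */
}

int main(void)
{
    printf("%d\n", eval_postfix("1 2 4 * + 3 +"));   /* prints 12 */
    return 0;
}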

Evaluation of a postfix expression (Pascal)

This is an implementation in Pascal, using a marked sequential file as data archive.

{
  programmer : clx321
  file : stack.pas
  unit : Pstack.tpu
}
program TestStack;
{this program uses the ADT of Stack; I will assume that
 the unit of the ADT of Stack already exists}

uses
  PStack;      {ADT of STACK}

{dictionary}
const
  mark = '.';

var
  data : stack;
  f : text;
  cc : char;
  ccInt, cc1, cc2 : integer;

{functions}
IsOperand (cc : char) : boolean;          {JUST Prototype}
  {return TRUE if cc is operand}
ChrToInt (cc : char) : integer;           {JUST Prototype}
  {change char to integer}
Operator (cc1, cc2 : integer) : integer;  {JUST Prototype}
  {operate two operands}

{algorithms}
begin
  assign (f, cc);
  reset (f);
  read (f, cc);                 {first elmt}
  if (cc = mark) then
    begin
      writeln ('empty archives !');
    end
  else
    begin
      repeat
        if (IsOperand (cc)) then
          begin
            ccInt := ChrToInt (cc);
            push (ccInt, data);
          end
        else
          begin
            pop (cc1, data);
            pop (cc2, data);
            push (data, Operator (cc2, cc1));
          end;
        read (f, cc);           {next elmt}
      until (cc = mark);
    end;
  close (f);
end.

Conversion of an Infix expression that is fully parenthesized into a Postfix expression


Input: (((8 + 1) - (7 - 4)) / (11 - 9))
Output: 8 1 + 7 4 - - 11 9 - /

Analysis: There are five types of input characters, which are:
* Opening brackets
* Numbers
* Operators
* Closing brackets
* New line character (\n)

Requirement: A character stack

Algorithm:


1. Read a character of input
2. Actions to be performed at the end of each input
   Opening brackets   (2.1) Push into stack and then Go to step (1)
   Number             (2.2) Print and then Go to step (1)
   Operator           (2.3) Push into stack and then Go to step (1)
   Closing brackets   (2.4) Pop it from the stack
        (2.4.1) If it is an operator, print it, Go to step (2.4)
        (2.4.2) If the popped element is an opening bracket, discard it and go to step (1)
   New line character (2.5) STOP
(A short C sketch of this conversion appears after the trace table below.)

Therefore, the final output after conversion of an infix expression to a postfix expression is as follows:

Input    | Operation                                                                          | Stack (after op) | Output on monitor
(        | (2.1) Push into stack                                                              | (              |
(        | (2.1) Push into stack                                                              | ((             |
(        | (2.1) Push into stack                                                              | (((            |
8        | (2.2) Print it                                                                     | (((            | 8
+        | (2.3) Push operator into stack                                                     | (((+           | 8
1        | (2.2) Print it                                                                     | (((+           | 8 1
)        | (2.4) Pop from the stack: since the popped element is '+', print it                | (((            | 8 1 +
         | (2.4) Pop from the stack: since the popped element is '(', we ignore it and read the next character | ((  | 8 1 +
-        | (2.3) Push operator into stack                                                     | ((-            | 8 1 +
(        | (2.1) Push into stack                                                              | ((-(           | 8 1 +
7        | (2.2) Print it                                                                     | ((-(           | 8 1 + 7
-        | (2.3) Push operator into stack                                                     | ((-(-          | 8 1 + 7
4        | (2.2) Print it                                                                     | ((-(-          | 8 1 + 7 4
)        | (2.4) Pop from the stack: since the popped element is '-', print it                | ((-(           | 8 1 + 7 4 -
         | (2.4) Pop from the stack: since the popped element is '(', we ignore it and read the next character | ((- | 8 1 + 7 4 -
)        | (2.4) Pop from the stack: since the popped element is '-', print it                | ((             | 8 1 + 7 4 - -
         | (2.4) Pop from the stack: since the popped element is '(', we ignore it and read the next character | (   | 8 1 + 7 4 - -
/        | (2.3) Push operator into stack                                                     | (/             | 8 1 + 7 4 - -
(        | (2.1) Push into stack                                                              | (/(            | 8 1 + 7 4 - -
11       | (2.2) Print it                                                                     | (/(            | 8 1 + 7 4 - - 11
-        | (2.3) Push operator into stack                                                     | (/(-           | 8 1 + 7 4 - - 11
9        | (2.2) Print it                                                                     | (/(-           | 8 1 + 7 4 - - 11 9
)        | (2.4) Pop from the stack: since the popped element is '-', print it                | (/(            | 8 1 + 7 4 - - 11 9 -
         | (2.4) Pop from the stack: since the popped element is '(', we ignore it and read the next character | (/  | 8 1 + 7 4 - - 11 9 -
)        | (2.4) Pop from the stack: since the popped element is '/', print it                | (              | 8 1 + 7 4 - - 11 9 - /
         | (2.4) Pop from the stack: since the popped element is '(', we ignore it and read the next character | Stack is empty | 8 1 + 7 4 - - 11 9 - /
New line | (2.5) STOP                                                                         |                |
[10]
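For illustration, a small C sketch of this conversion is given below; it assumes single-character operators, allows multi-digit numbers, and the function name and fixed stack size are invented for the example.

#include <stdio.h>

void infix_to_postfix(const char *in)
{
    char stack[100];
    int top = 0;
    for (; *in != '\0'; in++) {
        char c = *in;
        if (c == '(') {
            stack[top++] = c;                     /* (2.1) push opening bracket   */
        } else if (c >= '0' && c <= '9') {
            putchar(c);                           /* (2.2) print the number       */
            if (!(in[1] >= '0' && in[1] <= '9'))
                putchar(' ');                     /* end of a multi-digit number  */
        } else if (c == '+' || c == '-' || c == '*' || c == '/') {
            stack[top++] = c;                     /* (2.3) push the operator      */
        } else if (c == ')') {
            while (top > 0 && stack[top - 1] != '(') {
                putchar(stack[--top]);            /* (2.4.1) print operators      */
                putchar(' ');
            }
            if (top > 0)
                top--;                            /* (2.4.2) discard the '('      */
        }                                         /* spaces are ignored           */
    }
    putchar('\n');                                /* (2.5) new line: stop         */
}

int main(void)
{
    infix_to_postfix("(((8 + 1) - (7 - 4)) / (11 - 9))");
    /* prints: 8 1 + 7 4 - - 11 9 - / */
    return 0;
}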


Rearranging railroad cars


Problem Description

This is one useful application of stacks. Consider a freight train that has n railroad cars, each to be left at a different station. They are numbered 1 through n, and the freight train visits these stations in the order n through 1. The railroad cars are labeled by their destination. To facilitate removal of the cars from the train, we must rearrange them in ascending order of their number (i.e. 1 through n). When the cars are in this order, they can be detached at each station. We rearrange the cars at a shunting yard that has an input track, an output track and k holding tracks between the input and output tracks.

Solution Strategy

To rearrange the cars, we examine the cars on the input track from front to back. If the car being examined is the next one in the output arrangement, we move it directly to the output track. If not, we move it to a holding track and leave it there until it is time to place it on the output track. The holding tracks operate in a LIFO manner, as the cars enter and leave these tracks from the top. When rearranging cars, only the following moves are permitted:
* A car may be moved from the front (i.e. right end) of the input track to the top of one of the holding tracks or to the left end of the output track.
* A car may be moved from the top of a holding track to the left end of the output track.
The figure shows a shunting yard with k = 3 holding tracks H1, H2 and H3, and n = 9. The n cars of the freight train begin on the input track and are to end up on the output track in order 1 through n from right to left. The cars are initially in the order 5, 8, 1, 7, 4, 2, 9, 6, 3 from back to front. Later the cars are rearranged in the desired order.

A Three Track Example

Consider the input arrangement from the figure. Here we note that car 3 is at the front, so it cannot be output yet, as it must be preceded by cars 1 and 2. So car 3 is detached and moved to holding track H1. The next car, 6, cannot be output either, and it is moved to holding track H2, because we have to output car 3 before car 6 and this would not be possible if we moved car 6 to holding track H1. Now it is obvious that we move car 9 to H3.

The requirement for rearranging cars on any holding track is that the cars should be arranged in ascending order from top to bottom. So car 2 is now moved to holding track H1 so that it satisfies the previous statement. If we moved car 2 to H2 or H3, then we would have no place to move cars 4, 5, 7, 8. The least restriction on future car placement arises when the new car c is moved to the holding track whose top car has the smallest label u such that c < u. We may call this an assignment rule for deciding to which holding track a particular car belongs.

When car 4 is considered, there are three places to which it could be moved: H1, H2, H3. The tops of these tracks are 2, 6 and 9, so using the above assignment rule we move car 4 to H2. Car 7 is moved to H3. The next car, 1, has the least label, so it is moved to the output track. Now it is time for cars 2 and 3 to be output, which come from H1 (in short, all the cars from H1 are appended to car 1 on the output track). Car 4 is moved to the output track. No other cars can be moved to the output track at this time. The next car, 8, is moved to holding track H1. Car 5 is output from the input track. Car 6 is moved to the output track from H2, then 7 from H3, 8 from H1 and 9 from H3.
[9]


Railroad cars example

Backtracking
Another important application of stacks is backtracking. Consider a simple example of finding the correct path in a maze. There are a series of points, from the starting point to the destination. We start from one point. To reach the final destination, there are several paths. Suppose we choose a random path. After following a certain path, we realise that the path we have chosen is wrong. So we need a way to return to the beginning of that path. This can be done with the use of stacks. With the help of a stack, we remember each point we have reached, by pushing that point onto the stack. If we end up on the wrong path, we can pop the last point from the stack, return to that point, and continue our quest to find the right path. This is called backtracking.
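As a small illustration of this idea, the sketch below explores a graph depth-first with an explicit stack; the adjacency matrix and all names are invented for the example, and popping a vertex corresponds to backtracking to the most recently remembered branching point.

#include <stdio.h>

#define N 5                          /* number of points, chosen for the example */

/* Hypothetical map of which points connect to which. */
static const int adj[N][N] = {
    {0,1,1,0,0},
    {1,0,0,1,0},
    {1,0,0,0,0},
    {0,1,0,0,1},
    {0,0,0,1,0},
};

int main(void)
{
    int stack[N * N], top = 0;       /* stack of points still to be explored */
    int visited[N] = {0};

    stack[top++] = 0;                /* remember the starting point by pushing it */
    while (top > 0) {
        int v = stack[--top];        /* pop: backtrack to the last remembered point */
        if (visited[v]) continue;
        visited[v] = 1;
        printf("visiting %d\n", v);
        for (int w = N - 1; w >= 0; w--)
            if (adj[v][w] && !visited[w])
                stack[top++] = w;    /* remember each branch we could still explore */
    }
    return 0;
}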


Quicksort
Sorting means arranging the list of elements in a particular order. In case of numbers, it could be in ascending order, or in the case of letters, alphabetic order. Quicksort is an algorithm of the divide and conquer type. In this method, to sort a set of numbers, we reduce it to two smaller sets, and then sort these smaller sets. This can be explained with the help of the following example: Suppose A is a list of the following numbers:

In the reduction step, we find the final position of one of the numbers. In this case, let us assume that we have to find the final position of 48, which is the first number in the list. To accomplish this, we adopt the following method. Begin with the last number, and move from right to left. Compare each number with 48. If the number is smaller than 48, we stop at that number and swap it with 48. In our case, the number is 24. Hence, we swap 24 and 48.

The numbers 96 and 72 to the right of 48, are greater than 48. Now beginning with 24, scan the numbers in the opposite direction, that is from left to right. Compare every number with 48 until you find a number that is greater than 48. In this case, it is 60. Therefore we swap 48 and 60.

Note that the numbers 12, 24 and 36 to the left of 48 are all smaller than 48. Now, start scanning numbers from 60, in the right to left direction. As soon as you find lesser number, swap it with 48. In this case, it is 44. Swap it with 48. The final result is:


Now, beginning with 44, scan the list from left to right, until you find a number greater than 48. Such a number is 84. Swap it with 48. The final result is:

Now, beginning with 84, traverse the list from right to left, until you reach a number lesser than 48. We do not find such a number before reaching 48. This means that all the numbers in the list have been scanned and compared with 48. Also, we notice that all numbers less than 48 are to the left of it, and all numbers greater than 48, are to its right. The final partitions look as follows:

Therefore, 48 has been placed in its proper position and now our task is reduced to sorting the two partitions. This above step of creating partitions can be repeated with every partition containing 2 or more elements. As we can process only a single partition at a time, we should be able to keep track of the other partitions, for future processing. This is done by using two stacks called LOWERBOUND and UPPERBOUND, to temporarily store these partitions. The addresses of the first and last elements of the partitions are pushed into the LOWERBOUND and UPPERBOUND stacks respectively. Now, the above reduction step is applied to the partitions only after its boundary values are popped from the stack. We can understand this from the following example: Take the above list A with 12 elements. The algorithm starts by pushing the boundary values of A, that is 1 and 12 into the LOWERBOUND and UPPERBOUND stacks respectively. Therefore the stacks look as follows: LOWERBOUND: 1 UPPERBOUND: 12

To perform the reduction step, the values of the stack top are popped from the stack. Therefore, both the stacks become empty. LOWERBOUND: {empty} UPPERBOUND: {empty}

Now, the reduction step causes 48 to be fixed to the 5th position and creates two partitions, one from position 1 to 4 and the other from position 6 to 12. Hence, the values 1 and 6 are pushed into the LOWERBOUND stack and 4 and

12 are pushed into the UPPERBOUND stack. LOWERBOUND: 1, 6 UPPERBOUND: 4, 12


For applying the reduction step again, the values at the stack top are popped. Therefore, the values 6 and 12 are popped. Therefore the stacks look like: LOWERBOUND: 1 UPPERBOUND: 4

The reduction step is now applied to the second partition, that is from the 6th to 12th element.

After the reduction step, 98 is fixed in the 11th position. So, the second partition has only one element. Therefore, we push the upper and lower boundary values of the first partition onto the stack. So, the stacks are as follows: LOWERBOUND: 1, 6 UPPERBOUND: 4, 10

The processing proceeds in the following way and ends when the stacks do not contain any upper and lower bounds of the partition to be processed, and the list gets sorted.
[11]
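A compact C sketch of this scheme follows. It uses the first element of each partition as the pivot, as in the walkthrough, and two plain arrays standing in for the LOWERBOUND and UPPERBOUND stacks; the names, the fixed stack capacity and the sample data are assumptions made only for the example.

#include <stdio.h>

/* Partition a[lo..hi] around a[lo]; return the pivot's final position. */
static int partition(int a[], int lo, int hi)
{
    int pivot = a[lo];
    int i = lo, j = hi;
    while (i < j) {
        while (i < j && a[j] >= pivot) j--;   /* scan right to left */
        a[i] = a[j];
        while (i < j && a[i] <= pivot) i++;   /* scan left to right */
        a[j] = a[i];
    }
    a[i] = pivot;
    return i;
}

void quicksort_iterative(int a[], int n)
{
    int lowerbound[64], upperbound[64];       /* explicit stacks of partition bounds */
    int top = 0;
    lowerbound[top] = 0; upperbound[top] = n - 1; top++;
    while (top > 0) {
        top--;                                /* pop one partition to process */
        int lo = lowerbound[top], hi = upperbound[top];
        if (lo >= hi) continue;               /* partitions of size 0 or 1 are sorted */
        int p = partition(a, lo, hi);
        lowerbound[top] = lo;    upperbound[top] = p - 1; top++;  /* push left part  */
        lowerbound[top] = p + 1; upperbound[top] = hi;    top++;  /* push right part */
    }
}

int main(void)
{
    int a[] = {48, 12, 36, 24, 60, 84, 96, 72, 44, 98, 17, 25};   /* example data */
    int n = sizeof a / sizeof a[0];
    quicksort_iterative(a, n);
    for (int i = 0; i < n; i++) printf("%d ", a[i]);
    putchar('\n');
    return 0;
}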

The Stock Span Problem


In the stock span problem, we solve a financial problem with the help of stacks. Suppose that, for a stock, we have a series of n daily price quotes. The span of the stock's price on a particular day is defined as the maximum number of consecutive days (up to and including that day) for which the price of the stock was less than or equal to its price on that day.
The Stockspan Problem

An algorithm which has quadratic time complexity

Input: An array P with n elements
Output: An array S of n elements such that S[i] is the largest integer k such that k <= i + 1 and P[j] <= P[i] for j = i - k + 1, ..., i

Algorithm:
1. Initialize an array P which contains the daily prices of the stocks
2. Initialize an array S which will store the span of the stock
3. for i = 0 to i = n - 1
   3.1 Initialize k to zero
   3.2 Done with a false condition
   3.3 repeat
       3.3.1 if ( P[i - k] <= P[i] ) then increment k by 1
       3.3.2 else Done with true condition
   3.4 Till (k > i) or done with processing
       Assign value of k to S[i] to get the span of the stock
4. Return array S

Now, analyzing this algorithm for running time, we observe: We have initialized the array S at the beginning and returned it at the end; this takes O(n) time. The repeat loop is nested within the for loop. The for loop, whose counter is i, is executed n times. The statements which are not in the repeat loop, but are in the for loop, are executed n times. Therefore these statements and the incrementing and condition testing of i take O(n) time. In iteration i of the outer for loop, the body of the inner repeat loop is executed at most i + 1 times. In the worst case, element S[i] is greater than all the previous elements, so testing the if condition, the statement after that, as well as testing the until condition, will be performed i + 1 times during iteration i of the outer for loop. Hence, the total time taken by the inner loop is O(n(n + 1)/2), which is O(n^2).

The running time of all these steps is calculated by adding the time taken by all three of them. The first two terms are O(n), while the last term is O(n^2). Therefore the total running time of the algorithm is O(n^2).

An algorithm which has linear time complexity

In order to calculate the span more efficiently, we see that the span on a particular day can be easily calculated if we know the closest day before i on which the price of the stock was higher than the price on the present day. If such a day exists, we represent it by h(i); otherwise we define h(i) to be -1. The span of a particular day is then given by the formula s = i - h(i). To implement this logic, we use a stack as an abstract data type to store the days i, h(i), h(h(i)) and so on. When we go from day i - 1 to i, we pop the days on which the price of the stock was less than or equal to P[i] and then push the value of day i onto the stack. Here, we assume that the stack is implemented by operations that take O(1), that is constant, time. The algorithm is as follows:

Input: An array P with n elements and an empty stack N
Output: An array S of n elements such that S[i] is the largest integer k such that k <= i + 1 and P[j] <= P[i] for j = i - k + 1, ..., i


Algorithm:
1. Initialize an array P which contains the daily prices of the stocks
2. Initialize an array S which will store the span of the stock
3. for i = 0 to i = n - 1
   3.1 Initialize k to zero
   3.2 Done with a false condition
   3.3 while not (Stack N is empty or done with processing)
       3.3.1 if ( P[i] >= P[N.top()] ) then pop a value from stack N
       3.3.2 else Done with true condition
   3.4 if Stack N is empty
       3.4.1 Initialize h to -1
   3.5 else
       3.5.1 Initialize h to the stack top
   3.6 Put the value of i - h in S[i]
   3.7 Push the value of i in N
4. Return array S

Now, analyzing this algorithm for running time, we observe: We have initialized the array S at the beginning and returned it at the end; this takes O(n) time. The while loop is nested within the for loop. The for loop, whose counter is i, is executed n times. The statements which are not in the while loop, but are in the for loop, are executed n times. Therefore these statements and the incrementing and condition testing of i take O(n) time.

Now, observe the inner while loop during the n iterations of the for loop. The statement "Done with a true condition" is executed at most once per iteration, since it causes an exit from the loop. Let t(i) be the number of times the statement "pop a value from stack N" is executed during iteration i. It follows that the condition of the while loop is tested at most t(i) + 1 times during iteration i, so the total work done by the while loop over all iterations is proportional to t(0) + t(1) + ... + t(n - 1) + n. An element once popped from the stack N is never pushed back into it; therefore t(0) + t(1) + ... + t(n - 1) <= n. So, the running time of all the statements in the while loop is O(n).

The running time of all the steps in the algorithm is calculated by adding the time taken by all these steps. The run time of each step is O(n). Hence the running time complexity of this algorithm is O(n).
[12]
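A direct C rendering of the linear-time algorithm might look like the following sketch; the function name, array sizes and sample prices are made up for the example.

#include <stdio.h>

#define MAXN 100

/* Compute the span S[i] of each price P[i] in one pass, using a stack of
   day indices as in the linear-time algorithm above. */
void stock_span(const int P[], int S[], int n)
{
    int stack[MAXN];                 /* stack N of day indices */
    int top = 0;

    for (int i = 0; i < n; i++) {
        /* pop the days whose price is less than or equal to today's price */
        while (top > 0 && P[stack[top - 1]] <= P[i])
            top--;
        int h = (top == 0) ? -1 : stack[top - 1];   /* closest higher-price day */
        S[i] = i - h;                /* span formula s = i - h(i) */
        stack[top++] = i;            /* push today */
    }
}

int main(void)
{
    int P[] = {100, 80, 60, 70, 60, 75, 85};        /* example prices */
    int n = sizeof P / sizeof P[0];
    int S[MAXN];
    stock_span(P, S, n);
    for (int i = 0; i < n; i++) printf("%d ", S[i]); /* prints 1 1 1 2 1 4 6 */
    putchar('\n');
    return 0;
}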


Runtime memory management


A number of programming languages are stack-oriented, meaning they define most basic operations (adding two numbers, printing a character) as taking their arguments from the stack, and placing any return values back on the stack. For example, PostScript has a return stack and an operand stack, and also has a graphics state stack and a dictionary stack. Forth uses two stacks, one for argument passing and one for subroutine return addresses. The use of a return stack is extremely commonplace, but the somewhat unusual use of an argument stack for a human-readable programming language is the reason Forth is referred to as a stack-based language. Many virtual machines are also stack-oriented, including the p-code machine and the Java Virtual Machine.

Almost all computer runtime memory environments use a special stack (the "call stack") to hold information about procedure/function calling and nesting, in order to switch to the context of the called function and restore to the caller function when the call finishes. The functions follow a runtime protocol between caller and callee to save arguments and return values on the stack. Stacks are an important way of supporting nested or recursive function calls. This type of stack is used implicitly by the compiler to support CALL and RETURN statements (or their equivalents) and is not manipulated directly by the programmer.

Some programming languages use the stack to store data that is local to a procedure. Space for local data items is allocated from the stack when the procedure is entered, and is deallocated when the procedure exits. The C programming language is typically implemented in this way. Using the same stack for both data and procedure calls has important security implications (see below) of which a programmer must be aware in order to avoid introducing serious security bugs into a program.

Security
Some computing environments use stacks in ways that may make them vulnerable to security breaches and attacks. Programmers working in such environments must take special care to avoid the pitfalls of these implementations. For example, some programming languages use a common stack to store both data local to a called procedure and the linking information that allows the procedure to return to its caller. This means that the program moves data into and out of the same stack that contains critical return addresses for the procedure calls. If data is moved to the wrong location on the stack, or an oversized data item is moved to a stack location that is not large enough to contain it, return information for procedure calls may be corrupted, causing the program to fail. Malicious parties may attempt a stack smashing attack that takes advantage of this type of implementation by providing oversized data input to a program that does not check the length of input. Such a program may copy the data in its entirety to a location on the stack, and in so doing it may change the return addresses for procedures that have called it. An attacker can experiment to find a specific type of data that can be provided to such a program such that the return address of the current procedure is reset to point to an area within the stack itself (and within the data provided by the attacker), which in turn contains instructions that carry out unauthorized operations. This type of attack is a variation on the buffer overflow attack and is an extremely frequent source of security breaches in software, mainly because some of the most popular programming languages (such as C) use a shared stack for both data and procedure calls, and do not verify the length of data items. Frequently programmers do not write code to verify the size of data items, either, and when an oversized or undersized data item is copied to the stack, a security breach may occur.


References
[1] cprogramming.com: http://www.cprogramming.com/tutorial/computersciencetheory/stack.html
[2] Dr. Friedrich Ludwig Bauer and Dr. Klaus Samelson (30 March 1957) (in German). Verfahren zur automatischen Verarbeitung von kodierten Daten und Rechenmaschine zur Ausübung des Verfahrens (http://v3.espacenet.com/origdoc?DB=EPODOC&IDX=DE1094019&F=0&QPN=DE1094019). Deutsches Patentamt. Retrieved 2010-10-01.
[3] C. L. Hamblin, "An Addressless Coding Scheme based on Mathematical Notation", N.S.W University of Technology, May 1957 (typescript)
[4] Jones: "Systematic Software Development Using VDM"
[5] Horowitz, Ellis: "Fundamentals of Data Structures in Pascal", page 67. Computer Science Press, 1984
[6] http://www.php.net/manual/en/class.splstack.php
[7] Richard F. Gilberg; Behrouz A. Forouzan. Data Structures: A Pseudocode Approach with C++. Thomson Brooks/Cole.
[8] Dromey, R.G. How to Solve it by Computer. Prentice Hall of India.
[9] Sartaj Sahni. Data Structures, Algorithms and Applications in C++.
[10] Gopal, Arpita. Magnifying Data Structures. PHI.
[11] Lipschutz, Seymour. Theory and Problems of Data Structures. Tata McGraw Hill.
[12] Goodrich, Michael; Tamassia, Roberto; Mount, David. Data Structures and Algorithms in C++. Wiley-India.

Stack implementation on goodsoft.org.ua (http://goodsoft.org.ua/en/data_struct/stack.html)

Further reading
Donald Knuth. The Art of Computer Programming, Volume 1: Fundamental Algorithms, Third Edition. Addison-Wesley, 1997. ISBN 0-201-89683-4. Section 2.2.1: Stacks, Queues, and Deques, pp. 238-243.
Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Second Edition. MIT Press and McGraw-Hill, 2001. ISBN 0-262-03293-7. Section 10.1: Stacks and queues, pp. 200-204.

External links
Stack Machines - the new wave (http://www.ece.cmu.edu/~koopman/stack_computers/index.html) Bounding stack depth (http://www.cs.utah.edu/~regehr/stacktool) Libsafe - Protecting Critical Elements of Stacks (http://research.avayalabs.com/project/libsafe/) VBScript implementation of stack, queue, deque, and Red-Black Tree (http://www.ludvikjerabek.com/ downloads.html) Stack Size Analysis for Interrupt-driven Programs (http://www.cs.ucla.edu/~palsberg/paper/sas03.pdf) (322 KB) Paul E. Black, Bounded stack (http://www.nist.gov/dads/HTML/boundedstack.html) at the NIST Dictionary of Algorithms and Data Structures.


Queue (abstract data type)


In computer science, a queue (pronounced /kjuː/ KEW) is a particular kind of abstract data type or collection in which the entities in the collection are kept in order and the principal (or only) operations on the collection are the addition of entities to the rear terminal position and removal of entities from the front terminal position. This makes the queue a First-In-First-Out (FIFO) data structure. In a FIFO data structure, the first element added to the queue will be the first one to be removed. This is equivalent to the requirement that once an element is added, all elements that were added before have to be removed before the new element can be removed. A queue is an example of a linear data structure.

[Figure: Representation of a Queue with FIFO (First In First Out) property]

Queues provide services in computer science, transport, and operations research where various entities such as data, objects, persons, or events are stored and held to be processed later. In these contexts, the queue performs the function of a buffer. Queues are common in computer programs, where they are implemented as data structures coupled with access routines, as an abstract data structure or, in object-oriented languages, as classes. Common implementations are circular buffers and linked lists.

Representing a queue
In each of the cases, the customer or object at the front of the line was the first one to enter, while at the end of the line is the last to have entered. Every time a customer finishes paying for their items (or a person steps off the escalator, or the machine part is removed from the assembly line, etc.) that object leaves the queue from the front. This represents the queue dequeue function. Every time another object or customer enters the line to wait, they join the end of the line and represent the enqueue function. The queue size function would return the length of the line, and the empty function would return true only if there was nothing in the line.

Queue implementation
Theoretically, one characteristic of a queue is that it does not have a specific capacity. Regardless of how many elements are already contained, a new element can always be added. It can also be empty, at which point removing an element will be impossible until a new element has been added again. Fixed-length arrays are limited in capacity, and inefficient because items need to be copied towards the head of the queue. However, conceptually they are simple and work with early languages such as FORTRAN and BASIC, which did not have pointers or objects. Most modern languages with objects or pointers can implement or come with libraries for dynamic lists. Such data structures may have no specified fixed capacity limit besides memory constraints. Queue overflow results from trying to add an element onto a full queue, and queue underflow happens when trying to remove an element from an empty queue. A bounded queue is a queue limited to a fixed number of items.

There are several efficient implementations of FIFO queues. An efficient implementation is one that can perform the operations (enqueuing and dequeuing) in O(1) time.
* Linked list: A doubly linked list has O(1) insertion and deletion at both ends, so it is a natural choice for queues. A regular singly linked list only has efficient insertion and deletion at one end. However, a small modification (keeping a pointer to the last node in addition to the first one) will enable it to implement an efficient queue; a sketch of this follows the list.
* A deque implemented using a modified dynamic array.
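The following C sketch shows the singly linked list variant with an extra pointer to the last node; the type and function names are invented for this illustration, and dequeue assumes the queue is not empty.

#include <stdio.h>
#include <stdlib.h>

typedef struct qnode {
    int value;
    struct qnode *next;
} qnode;

typedef struct {
    qnode *front;    /* dequeue happens here */
    qnode *rear;     /* enqueue happens here (pointer to the last node) */
} queue;

void enqueue(queue *q, int value)            /* O(1): append at the rear */
{
    qnode *n = malloc(sizeof *n);
    n->value = value;
    n->next = NULL;
    if (q->rear == NULL)
        q->front = n;                        /* queue was empty */
    else
        q->rear->next = n;
    q->rear = n;
}

int dequeue(queue *q)                        /* O(1): remove from the front */
{
    qnode *n = q->front;
    int value = n->value;
    q->front = n->next;
    if (q->front == NULL)
        q->rear = NULL;                      /* queue became empty */
    free(n);
    return value;
}

int main(void)
{
    queue q = {NULL, NULL};
    enqueue(&q, 1); enqueue(&q, 2); enqueue(&q, 3);
    printf("%d ", dequeue(&q));
    printf("%d ", dequeue(&q));
    printf("%d\n", dequeue(&q));             /* prints 1 2 3 */
    return 0;
}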


Queues and programming languages


Some languages, like Perl and Ruby, already have operations for pushing and popping an array from both ends, so one can use push and shift functions to enqueue and dequeue a list (or, in reverse, one can use unshift and pop), although in some cases these operations are not efficient. C++'s Standard Template Library provides a "queue" templated class which is restricted to only push/pop operations. Since J2SE5.0, Java's library contains a Queue interface that specifies queue operations; implementing classes include LinkedList and (since J2SE 1.6) ArrayDeque. PHP has an SplQueue [1] class and third party libraries like beanstalk'd and Gearman.

References
General
* Donald Knuth. The Art of Computer Programming, Volume 1: Fundamental Algorithms, Third Edition. Addison-Wesley, 1997. ISBN 0-201-89683-4. Section 2.2.1: Stacks, Queues, and Deques, pp. 238-243.
* Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Second Edition. MIT Press and McGraw-Hill, 2001. ISBN 0-262-03293-7. Section 10.1: Stacks and queues, pp. 200-204.
* William Ford, William Topp. Data Structures with C++ and STL, Second Edition. Prentice Hall, 2002. ISBN 0-13-085850-1. Chapter 8: Queues and Priority Queues, pp. 386-390.
* Adam Drozdek. Data Structures and Algorithms in C++, Third Edition. Thomson Course Technology, 2005. ISBN 0-534-49182-0. Chapter 4: Stacks and Queues, pp. 137-169.
Citations
[1] http://www.php.net/manual/en/class.splqueue.php

External links
STL Quick Reference (http://www.halpernwightsoftware.com/stdlib-scratch/quickref.html#containers14)
VBScript implementation of stack, queue, deque, and Red-Black Tree (http://www.ludvikjerabek.com/downloads.html)
Paul E. Black, Bounded queue (http://www.nist.gov/dads/HTML/boundedqueue.html) at the NIST Dictionary of Algorithms and Data Structures.


Double-ended queue
In computer science, a double-ended queue (dequeue, often abbreviated to deque, pronounced deck) is an abstract data type that implements a queue for which elements can only be added to or removed from the front (head) or back (tail).[1] It is also often called a head-tail linked list.

Naming conventions
Deque is sometimes written dequeue, but this use is generally deprecated in technical literature or technical writing because dequeue is also a verb meaning "to remove from a queue". Nevertheless, several libraries and some writers, such as Aho, Hopcroft, and Ullman in their textbook Data Structures and Algorithms, spell it dequeue. John Mitchell, author of Concepts in Programming Languages, also uses this terminology.

Distinctions and sub-types


This differs from the queue abstract data type or First-In-First-Out List (FIFO), where elements can only be added to one end and removed from the other. This general data class has some possible sub-types:
* An input-restricted deque is one where deletion can be made from both ends, but insertion can only be made at one end.
* An output-restricted deque is one where insertion can be made at both ends, but deletion can be made from one end only.
Both of the basic and most common list types in computing, queues and stacks, can be considered specializations of deques, and can be implemented using deques.

Operations
The following operations are possible on a deque:
operation               | Ada           | C++        | Java       | Perl       | PHP           | Python     | Ruby    | JavaScript
insert element at back  | Append        | push_back  | offerLast  | push       | array_push    | append     | push    | push
insert element at front | Prepend       | push_front | offerFirst | unshift    | array_unshift | appendleft | unshift | unshift
remove last element     | Delete_Last   | pop_back   | pollLast   | pop        | array_pop     | pop        | pop     | pop
remove first element    | Delete_First  | pop_front  | pollFirst  | shift      | array_shift   | popleft    | shift   | shift
examine last element    | Last_Element  | back       | peekLast   | $array[-1] | end           | <obj>[-1]  | last    | <obj>[<obj>.length - 1]
examine first element   | First_Element | front      | peekFirst  | $array[0]  | reset         | <obj>[0]   | first   | <obj>[0]


Implementations
There are at least two common ways to efficiently implement a deque: with a modified dynamic array or with a doubly linked list. The dynamic array approach uses a variant of a dynamic array that can grow from both ends, sometimes called an array deque. These array deques have all the properties of a dynamic array, such as constant-time random access, good locality of reference, and inefficient insertion/removal in the middle, with the addition of amortized constant-time insertion/removal at both ends, instead of just one end. Three common implementations include:
* Storing deque contents in a circular buffer, and only resizing when the buffer becomes full. This decreases the frequency of resizings (a sketch of this variant follows the list).
* Allocating deque contents from the center of the underlying array, and resizing the underlying array when either end is reached. This approach may require more frequent resizings and waste more space, particularly when elements are only inserted at one end.
* Storing contents in multiple smaller arrays, allocating additional arrays at the beginning or end as needed. Indexing is implemented by keeping a dynamic array containing pointers to each of the smaller arrays.
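As a sketch of the circular-buffer variant mentioned above, the following C fragment stores the deque in a fixed-size array and wraps the head index with modular arithmetic; the names and the capacity are assumptions for the example, and a real array deque would also resize when full and check for emptiness.

#define CAP 8                          /* fixed capacity, for illustration only */

typedef struct {
    int items[CAP];
    int head;                          /* index of the first element */
    int count;                         /* number of elements stored  */
} array_deque;

/* Insert at the front: move head one slot back, wrapping around. */
void push_front(array_deque *d, int x)
{
    d->head = (d->head + CAP - 1) % CAP;
    d->items[d->head] = x;
    d->count++;
}

/* Insert at the back: write just past the last element, wrapping around. */
void push_back(array_deque *d, int x)
{
    d->items[(d->head + d->count) % CAP] = x;
    d->count++;
}

/* Remove from the front. */
int pop_front(array_deque *d)
{
    int x = d->items[d->head];
    d->head = (d->head + 1) % CAP;
    d->count--;
    return x;
}

/* Remove from the back. */
int pop_back(array_deque *d)
{
    d->count--;
    return d->items[(d->head + d->count) % CAP];
}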

Language support
Ada's containers provides the generic packages Ada.Containers.Vectors and Ada.Containers.Doubly_Linked_Lists, for the dynamic array and linked list implementations, respectively. C++'s Standard Template Library provides the class templates std::deque and std::list, for the multiple array and linked list implementations, respectively. As of Java 6, Java's Collections Framework provides a new Deque interface that provides the functionality of insertion and removal at both ends. It is implemented by classes such as ArrayDeque (also new in Java 6) and LinkedList, providing the dynamic array and linked list implementations, respectively. However, the ArrayDeque, contrary to its name, does not support random access. Python 2.4 introduced the collections module with support for deque objects. As of PHP 5.3, PHP's SPL extension contains the 'SplDoublyLinkedList' class that can be used to implement Deque datastructures. Previously to make a Deque structure the array functions array_shift/unshift/pop/push had to be used instead. GHC's Data.Sequence [2] module implements an efficient, functional deque structure in Haskell. The implementation uses 2-3 finger trees annotated with sizes. There are other (fast) possibilities to implement purely functional (thus also persistent) double queues (most using heavily lazy evaluation), see references,[3][4],.[5]

Complexity
In a doubly linked list implementation and assuming no allocation/deallocation overhead, the time complexity of all deque operations is O(1). Additionally, the time complexity of insertion or deletion in the middle, given an iterator, is O(1); however, the time complexity of random access by index is O(n). In a growing array, the amortized time complexity of all deque operations is O(1). Additionally, the time complexity of random access by index is O(1); but the time complexity of insertion or deletion in the middle is O(n).


Applications
One example where a deque can be used is the A-Steal job scheduling algorithm.[6] This algorithm implements task scheduling for several processors. A separate deque with threads to be executed is maintained for each processor. To execute the next thread, the processor gets the first element from the deque (using the "remove first element" deque operation). If the current thread forks, it is put back to the front of the deque ("insert element at front") and a new thread is executed. When one of the processors finishes execution of its own threads (i.e. its deque is empty), it can "steal" a thread from another processor: it gets the last element from the deque of another processor ("remove last element") and executes it.

References
[1] Donald Knuth. The Art of Computer Programming, Volume 1: Fundamental Algorithms, Third Edition. Addison-Wesley, 1997. ISBN 0-201-89683-4. Section 2.2.1: Stacks, Queues, and Deques, pp. 238-243.
[2] http://www.haskell.org/ghc/docs/latest/html/libraries/containers/Data-Sequence.html
[3] C. Okasaki, "Purely Functional Data Structures", September 1996. www.cs.cmu.edu/~rwh/theses/okasaki.pdf
[4] Adam L. Buchsbaum and Robert E. Tarjan. Confluently persistent deques via data structural bootstrapping. Journal of Algorithms, 18(3):513-547, May 1995. (pp. 58, 101, 125)
[5] Haim Kaplan and Robert E. Tarjan. Purely functional representations of catenable sorted lists. In ACM Symposium on Theory of Computing, pages 202-211, May 1996. (pp. 4, 82, 84, 124)
[6] Eitan Frachtenberg, Uwe Schwiegelshohn (2007). Job Scheduling Strategies for Parallel Processing: 12th International Workshop, JSSPP 2006. Springer. ISBN 3-540-71034-5. See p. 22.

External links
SGI STL Documentation: deque<T, Alloc> (http://www.sgi.com/tech/stl/Deque.html) Code Project: An In-Depth Study of the STL Deque Container (http://www.codeproject.com/KB/stl/ vector_vs_deque.aspx) Diagram of a typical STL deque implementation (http://pages.cpsc.ucalgary.ca/~kremer/STL/1024x768/ deque.html) Deque implementation in C (http://www.martinbroadhurst.com/articles/deque.html) VBScript implementation of stack, queue, deque, and Red-Black Tree (http://www.ludvikjerabek.com/ downloads.html)


Circular buffer
A circular buffer, cyclic buffer or ring buffer is a data structure that uses a single, fixed-size buffer as if it were connected end-to-end. This structure lends itself easily to buffering data streams.

Uses
An example that could possibly use an overwriting circular buffer is with multimedia. If the buffer is used as the bounded buffer in the producer-consumer problem then it is probably desired for the producer (e.g., an audio generator) to overwrite old data if the consumer (e.g., the sound card) is unable to momentarily keep up. Another example is the digital waveguide synthesis method which uses circular buffers to efficiently simulate the sound of vibrating strings or wind instruments.

The "prized" attribute of a circular buffer is that it does not need to have its elements shuffled around when one is consumed. (If a non-circular buffer were used then it would be necessary to shift all elements when one is consumed.) In other words, the circular buffer is well suited as a FIFO buffer while a standard, non-circular buffer is well suited as a LIFO buffer. Circular buffering makes a good implementation strategy for a queue that has fixed maximum size. Should a maximum size be adopted for a queue, then a circular buffer is a completely ideal implementation; all queue operations are constant time. However, expanding a circular buffer requires shifting memory, which is comparatively costly. For arbitrarily expanding queues, a Linked list approach may be preferred instead.

A ring showing, conceptually, a circular buffer. This visually shows that the buffer has no real end and it can loop around the buffer. However, since memory is never physically created as a ring, a linear representation is generally used as is done below.

How it works
A circular buffer starts out empty and has some predefined length. For example, this is a 7-element buffer:

Assume that a 1 is written into the middle of the buffer (exact starting location does not matter in a circular buffer):

Then assume that two more elements, 2 & 3, are added and get appended after the 1:

If two elements are then removed from the buffer, the oldest values inside the buffer are removed. The two elements removed in this case are 1 & 2, leaving the buffer with just a 3:

If the buffer has 7 elements then it is completely full:

A consequence of using a circular buffer is that when it is full and a subsequent write is performed, it starts overwriting the oldest data. In this case, two more elements A & B are added and they overwrite the 3 & 4:

Alternatively, the routines that manage the buffer could prevent overwriting the data and return an error or raise an exception. Whether or not data is overwritten is up to the semantics of the buffer routines or the application using the circular buffer. Finally, if two elements are now removed, then what is returned is not 3 & 4 but 5 & 6, because A & B overwrote the 3 & the 4, yielding the buffer with:

Circular buffer mechanics


What is not shown in the example above are the mechanics of how the circular buffer is managed.

Start / End Pointers


Generally, a circular buffer requires four pointers:
- one to the actual buffer in memory
- one to the buffer end in memory (or alternately: the size of the buffer)
- one to point to the start of valid data (or alternately: the amount of data written to the buffer)
- one to point to the end of valid data (or alternately: the amount of data read from the buffer)

Alternatively, in languages that do not have pointers, a fixed-length buffer with two integers to keep track of indices can be used. Consider a couple of examples from above. (While there are numerous ways to label the pointers and exact semantics can vary, this is one way to do it.) This image shows a partially full buffer:

This image shows a full buffer with two elements having been overwritten:

What to note about the second one is that after each element is overwritten, the start pointer is incremented as well.

Difficulties
Full / Empty Buffer Distinction
A small disadvantage of relying on pointers or relative indices for the start and end of data is that, when the buffer is entirely full, both pointers point to the same element:

This is exactly the same situation as when the buffer is empty:

To resolve this ambiguity there are a number of solutions:
- Always keep one slot open.
- Use a fill count to distinguish the two cases.
- Use read and write counts from which the fill count can be derived.
- Record whether the last operation was a read or a write.
- Use absolute indices.

Always Keep One Slot Open
This design always keeps one slot unallocated. A full buffer has at most size − 1 slots. If both pointers refer to the same slot, the buffer is empty. If the end (write) pointer refers to the slot preceding the one referred to by the start (read) pointer, the buffer is full. This is a simple, robust approach that requires only two pointers, at the expense of one buffer slot.

Example implementation, C language:

/* Circular buffer example, keeps one slot open */
#include <stdio.h>
#include <stdlib.h>

/* Opaque buffer element type. This would be defined by the application. */
typedef struct { int value; } ElemType;

/* Circular buffer object */
typedef struct {
    int size;         /* maximum number of elements          */
    int start;        /* index of oldest element             */
    int end;          /* index at which to write new element */
    ElemType *elems;  /* vector of elements                  */
} CircularBuffer;

void cbInit(CircularBuffer *cb, int size) {
    cb->size  = size + 1; /* include empty elem */
    cb->start = 0;
    cb->end   = 0;
    cb->elems = (ElemType *)calloc(cb->size, sizeof(ElemType));
}

void cbFree(CircularBuffer *cb) {
    free(cb->elems); /* OK if null */
}

int cbIsFull(CircularBuffer *cb) {
    return (cb->end + 1) % cb->size == cb->start;
}

int cbIsEmpty(CircularBuffer *cb) {
    return cb->end == cb->start;
}

/* Write an element, overwriting oldest element if buffer is full.
   App can choose to avoid the overwrite by checking cbIsFull(). */
void cbWrite(CircularBuffer *cb, ElemType *elem) {
    cb->elems[cb->end] = *elem;
    cb->end = (cb->end + 1) % cb->size;
    if (cb->end == cb->start)
        cb->start = (cb->start + 1) % cb->size; /* full, overwrite */
}

/* Read oldest element. App must ensure !cbIsEmpty() first. */
void cbRead(CircularBuffer *cb, ElemType *elem) {
    *elem = cb->elems[cb->start];
    cb->start = (cb->start + 1) % cb->size;
}

int main(int argc, char **argv) {
    CircularBuffer cb;
    ElemType elem = {0};

    int testBufferSize = 10; /* arbitrary size */
    cbInit(&cb, testBufferSize);

    /* Fill buffer with test elements 3 times */
    for (elem.value = 0; elem.value < 3 * testBufferSize; ++elem.value)
        cbWrite(&cb, &elem);

    /* Remove and print all elements */
    while (!cbIsEmpty(&cb)) {
        cbRead(&cb, &elem);
        printf("%d\n", elem.value);
    }

    cbFree(&cb);
    return 0;
}

Use a Fill Count
This approach replaces the end pointer with a counter that tracks the number of readable items in the buffer. This unambiguously indicates when the buffer is empty or full and allows use of all buffer slots. The performance impact should be negligible, since this approach adds the costs of maintaining the counter and computing the tail slot on writes, but eliminates the need to maintain the end pointer and simplifies the fullness test.
Note: when using semaphores in a producer-consumer model, the semaphores act as a fill count.

Differences from the previous example:

/* This approach replaces the CircularBuffer 'end' field with
   the 'count' field and changes these functions: */
void cbInit(CircularBuffer *cb, int size) {
    cb->size  = size;
    cb->start = 0;
    cb->count = 0;
    cb->elems = (ElemType *)calloc(cb->size, sizeof(ElemType));
}

int cbIsFull(CircularBuffer *cb) {
    return cb->count == cb->size;
}

int cbIsEmpty(CircularBuffer *cb) {
    return cb->count == 0;
}

void cbWrite(CircularBuffer *cb, ElemType *elem) {
    int end = (cb->start + cb->count) % cb->size;
    cb->elems[end] = *elem;
    if (cb->count == cb->size)
        cb->start = (cb->start + 1) % cb->size; /* full, overwrite */
    else
        ++cb->count;
}

void cbRead(CircularBuffer *cb, ElemType *elem) {
    *elem = cb->elems[cb->start];
    cb->start = (cb->start + 1) % cb->size;
    --cb->count;
}

Read / Write Counts
Another solution is to keep counts of the number of items written to and read from the circular buffer. Both counts are stored in integer variables with numerical limits larger than the number of items that can be stored, and they are allowed to wrap freely. The unsigned difference (write_count - read_count) always yields the number of items placed in the buffer and not yet retrieved. This can indicate that the buffer is empty, partially full, completely full (without waste of a storage location) or in a state of overrun.
The advantage is:
- The source and sink of data can implement independent policies for dealing with a full buffer and overrun, while adhering to the rule that only the source of data modifies the write count and only the sink of data modifies the read count. This can result in elegant and robust circular buffer implementations even in multi-threaded environments.
The disadvantage is:
- You need two additional variables.

Record last operation
Another solution is to keep a flag indicating whether the most recent operation was a read or a write. If the two pointers are equal, then the flag shows whether the buffer is full or empty: if the most recent operation was a write, the buffer must be full, and conversely, if it was a read, it must be empty.
The advantages are:
- Only a single bit needs to be stored (which may be particularly useful if the algorithm is implemented in hardware).
- The test for full/empty is simple.
The disadvantage is:
- You need an extra variable.

Absolute indices
If indices are used instead of pointers, the indices can store read/write counts instead of offsets from the start of the buffer. This is similar to the above solution, except that there are no separate variables, and relative indices are obtained on the fly by division modulo the buffer's length.
The advantage is:
- No extra variables are needed.
The disadvantages are:
- Every access needs an additional modulo operation.
- If counter wrap is possible, complex logic can be needed if the buffer's length is not a divisor of the counter's capacity.
On binary computers, both of these disadvantages disappear if the buffer's length is a power of two, at the cost of a constraint on possible buffer lengths.
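As an illustration, the read / write count approach can be sketched in C as follows. This is a minimal sketch, not part of the original text: it assumes the buffer length is a power of two (so that unsigned counter wrap-around is harmless and the modulo reduction becomes a bit mask), and the names RBuf, rbuf_put and rbuf_get are placeholders chosen for this example.

#include <stdio.h>

#define RBUF_SIZE 8              /* must be a power of two */
#define RBUF_MASK (RBUF_SIZE - 1)

typedef struct {
    int data[RBUF_SIZE];
    unsigned int write_count;    /* modified only by the producer */
    unsigned int read_count;     /* modified only by the consumer */
} RBuf;

/* Number of items currently stored; wrap-around of the unsigned
   counters cancels out in the subtraction. */
static unsigned int rbuf_used(const RBuf *b) { return b->write_count - b->read_count; }
static int rbuf_is_empty(const RBuf *b)      { return rbuf_used(b) == 0; }
static int rbuf_is_full(const RBuf *b)       { return rbuf_used(b) == RBUF_SIZE; }

/* Store one item; returns 0 on success, -1 if the buffer is full. */
static int rbuf_put(RBuf *b, int value)
{
    if (rbuf_is_full(b))
        return -1;
    b->data[b->write_count & RBUF_MASK] = value;
    b->write_count++;
    return 0;
}

/* Retrieve the oldest item; returns 0 on success, -1 if empty. */
static int rbuf_get(RBuf *b, int *value)
{
    if (rbuf_is_empty(b))
        return -1;
    *value = b->data[b->read_count & RBUF_MASK];
    b->read_count++;
    return 0;
}

int main(void)
{
    RBuf b = { {0}, 0, 0 };
    int v, i;
    for (i = 0; i < 5; i++)
        rbuf_put(&b, i);
    while (rbuf_get(&b, &v) == 0)
        printf("%d\n", v);      /* prints 0..4 in order */
    return 0;
}

Because the buffer length divides the counters' capacity, the counters may wrap without any special handling, which is exactly the power-of-two simplification mentioned above.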

Multiple Read Pointers


A little more complex is the use of multiple read pointers on the same circular buffer. This is useful when there are n threads reading from the same buffer, but only one thread writing to it.
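A minimal single-threaded sketch of this idea follows. It is not from the original text and omits any locking; each reader keeps its own read counter, and an element may only be overwritten once every reader has consumed it. The names MultiReaderBuf, mrb_put and mrb_get are placeholders.

#include <stdio.h>

#define MRB_SIZE    8            /* power of two */
#define MRB_MASK    (MRB_SIZE - 1)
#define MRB_READERS 2

typedef struct {
    int data[MRB_SIZE];
    unsigned int write_count;
    unsigned int read_count[MRB_READERS];  /* one read position per reader */
} MultiReaderBuf;

/* The buffer is full when the slowest reader is MRB_SIZE items behind. */
static unsigned int mrb_lag_of_slowest(const MultiReaderBuf *b)
{
    unsigned int worst = 0, r;
    for (r = 0; r < MRB_READERS; r++) {
        unsigned int lag = b->write_count - b->read_count[r];
        if (lag > worst)
            worst = lag;
    }
    return worst;
}

static int mrb_put(MultiReaderBuf *b, int value)
{
    if (mrb_lag_of_slowest(b) == MRB_SIZE)
        return -1;                           /* would overwrite unread data */
    b->data[b->write_count & MRB_MASK] = value;
    b->write_count++;
    return 0;
}

static int mrb_get(MultiReaderBuf *b, int reader, int *value)
{
    if (b->read_count[reader] == b->write_count)
        return -1;                           /* this reader has no new data */
    *value = b->data[b->read_count[reader] & MRB_MASK];
    b->read_count[reader]++;
    return 0;
}

int main(void)
{
    MultiReaderBuf b = { {0}, 0, {0, 0} };
    int v;
    mrb_put(&b, 42);
    mrb_put(&b, 43);
    if (mrb_get(&b, 0, &v) == 0) printf("reader 0: %d\n", v);
    if (mrb_get(&b, 1, &v) == 0) printf("reader 1: %d\n", v);
    return 0;
}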

Chunked Buffer
Much more complex are different chunks of data in the same circular buffer. The writer not only writes elements to the buffer, it also assigns these elements to chunks. The reader should not only be able to read from the buffer, it should also be informed about the chunk boundaries.
Example: the writer reads data from small files and writes it into the same circular buffer. The reader reads the data, but needs to know when and which file starts at a given position.

Optimization
A circular-buffer implementation may be optimized by mapping the underlying buffer to two contiguous regions of virtual memory. (Naturally, the underlying buffer's length must then equal some multiple of the system's page size.) Reading from and writing to the circular buffer may then be carried out with greater efficiency by means of direct memory access; those accesses which fall beyond the end of the first virtual-memory region will automatically wrap around to the beginning of the underlying buffer. When the read offset is advanced into the second virtual-memory region, both offsets, read and write, are decremented by the length of the underlying buffer.

Optimized POSIX Implementation


#include <sys/mman.h>
#include <stdlib.h>
#include <unistd.h>

#define report_exceptional_condition() abort ()

struct ring_buffer
{
  void *address;
  unsigned long count_bytes;
  unsigned long write_offset_bytes;
  unsigned long read_offset_bytes;
};

/* Warning: order should be at least 12 for Linux */
void
ring_buffer_create (struct ring_buffer *buffer, unsigned long order)
{
  char path[] = "/dev/shm/ring-buffer-XXXXXX";
  int file_descriptor;
  void *address;
  int status;

  file_descriptor = mkstemp (path);
  if (file_descriptor < 0)
    report_exceptional_condition ();

  status = unlink (path);
  if (status)
    report_exceptional_condition ();

  buffer->count_bytes = 1UL << order;
  buffer->write_offset_bytes = 0;
  buffer->read_offset_bytes = 0;

  status = ftruncate (file_descriptor, buffer->count_bytes);
  if (status)
    report_exceptional_condition ();

  buffer->address = mmap (NULL, buffer->count_bytes << 1, PROT_NONE,
                          MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
  if (buffer->address == MAP_FAILED)
    report_exceptional_condition ();

  address = mmap (buffer->address, buffer->count_bytes,
                  PROT_READ | PROT_WRITE, MAP_FIXED | MAP_SHARED,
                  file_descriptor, 0);
  if (address != buffer->address)
    report_exceptional_condition ();

  address = mmap (buffer->address + buffer->count_bytes,
                  buffer->count_bytes, PROT_READ | PROT_WRITE,
                  MAP_FIXED | MAP_SHARED, file_descriptor, 0);
  if (address != buffer->address + buffer->count_bytes)
    report_exceptional_condition ();

  status = close (file_descriptor);
  if (status)
    report_exceptional_condition ();
}

void
ring_buffer_free (struct ring_buffer *buffer)
{
  int status;

  status = munmap (buffer->address, buffer->count_bytes << 1);
  if (status)
    report_exceptional_condition ();
}

void *
ring_buffer_write_address (struct ring_buffer *buffer)
{
  /*** void pointer arithmetic is a constraint violation. ***/
  return buffer->address + buffer->write_offset_bytes;
}

void
ring_buffer_write_advance (struct ring_buffer *buffer,
                           unsigned long count_bytes)
{
  buffer->write_offset_bytes += count_bytes;
}

void *
ring_buffer_read_address (struct ring_buffer *buffer)
{
  return buffer->address + buffer->read_offset_bytes;
}

void
ring_buffer_read_advance (struct ring_buffer *buffer,
                          unsigned long count_bytes)
{
  buffer->read_offset_bytes += count_bytes;

  if (buffer->read_offset_bytes >= buffer->count_bytes)
    {
      buffer->read_offset_bytes -= buffer->count_bytes;
      buffer->write_offset_bytes -= buffer->count_bytes;
    }
}

unsigned long
ring_buffer_count_bytes (struct ring_buffer *buffer)
{
  return buffer->write_offset_bytes - buffer->read_offset_bytes;
}

unsigned long
ring_buffer_count_free_bytes (struct ring_buffer *buffer)
{
  return buffer->count_bytes - ring_buffer_count_bytes (buffer);
}

void
ring_buffer_clear (struct ring_buffer *buffer)
{
  buffer->write_offset_bytes = 0;
  buffer->read_offset_bytes = 0;
}

/* Note that the initial anonymous mmap() can be avoided: after the initial
   mmap() for descriptor fd, you can try mmap() with a hinted address of
   (buffer->address + buffer->count_bytes) and, if that fails, another one
   with a hinted address of (buffer->address - buffer->count_bytes). Make
   sure MAP_FIXED is not used in that case, as under certain situations it
   could end with a segfault. The advantage of this approach is that it
   avoids the requirement to map twice the amount you need initially
   (especially useful e.g. if you want to use hugetlbfs and the allowed
   amount is limited) and, in the context of gcc/glibc, you can avoid
   certain feature macros (MAP_ANONYMOUS usually requires one of
   _BSD_SOURCE, _SVID_SOURCE or _GNU_SOURCE). */

External links
CircularBuffer at the Portland Pattern Repository
Boost: Templated Circular Buffer Container [1]
http://www.dspguide.com/ch28/2.htm

References
[1] http://www.boost.org/doc/libs/1_39_0/libs/circular_buffer/doc/circular_buffer.html


Dictionaries
Associative array
In computer science, an associative array, map, or dictionary is an abstract data type composed of a collection of (key, value) pairs, such that each possible key appears at most once in the collection. Operations associated with this data type allow:[1][2]
- the addition of pairs to the collection
- the removal of pairs from the collection
- the modification of the values of existing pairs
- the lookup of the value associated with a particular key

The dictionary problem is the task of designing a data structure that implements an associative array. A standard solution to the dictionary problem is a hash table; in some cases it is also possible to solve the problem using directly addressed arrays, binary search trees, or other more specialized structures.[1][2][3] Many programming languages include associative arrays as primitive data types, and they are available in software libraries for many others. Content-addressable memory is a form of direct hardware-level support for associative arrays. Associative arrays have many applications including such fundamental programming patterns as memoization and the decorator pattern.[4]

Operations
In an associative array, the association between a key and a value is often known as a "binding", and the same word "binding" may also be used to refer to the process of creating a new association. The operations that are usually defined for an associative array are:[1][2]
- Add or insert: add a new (key, value) pair to the collection, binding the new key to its new value. The arguments to this operation are the key and the value.
- Reassign: replace the value in one of the (key, value) pairs that are already in the collection, binding an old key to a new value. As with an insertion, the arguments to this operation are the key and the value.
- Remove or delete: remove a (key, value) pair from the collection, unbinding a given key from its value. The argument to this operation is the key.
- Lookup: find the value (if any) that is bound to a given key. The argument to this operation is the key, and the value is returned from the operation. If no value is found, some associative array implementations raise an exception.
In addition, associative arrays may also include other operations such as determining the number of bindings or constructing an iterator to loop over all the bindings. Usually, for such an operation, the order in which the bindings are returned may be arbitrary.
A multimap generalizes an associative array by allowing multiple values to be associated with a single key.[5] A bidirectional map is a related abstract data type in which the bindings operate in both directions: each value must be associated with a unique key, and a second lookup operation takes a value as argument and looks up the key associated with that value.

Example
Suppose that the set of loans made by a library is to be represented in a data structure. Each book in a library may be checked out only by a single library patron at a time. However, a single patron may be able to check out multiple books. Therefore, the information about which books are checked out to which patrons may be represented by an associative array, in which the books are the keys and the patrons are the values. For instance (using the notation from Python in which a binding is represented by placing a colon between the key and the value), the current checkouts may be represented by an associative array { "Great Expectations": "John", "Pride and Prejudice": "Alice", "Wuthering Heights": "Alice" } A lookup operation with the key "Great Expectations" in this array would return the name of the person who checked out that book, John. If John returns his book, that would cause a deletion operation in the associative array, and if Pat checks out another book, that would cause an insertion operation, leading to a different state: { "Pride and Prejudice": "Alice", "The Brothers Karamazov": "Pat", "Wuthering Heights": "Alice" } In this new state, the same lookup as before, with the key "Great Expectations", would raise an exception, because this key is no longer present in the array.

Implementation
For dictionaries with very small numbers of bindings, it may make sense to implement the dictionary using an association list, a linked list of bindings. With this implementation, the time to perform the basic dictionary operations is linear in the total number of bindings; however, it is easy to implement and the constant factors in its running time are small.[1][6]
Another very simple implementation technique, usable when the keys are restricted to a narrow range of integers, is direct addressing into an array: the value for a given key k is stored at the array cell A[k], or if there is no binding for k then the cell stores a special sentinel value that indicates the absence of a binding. As well as being simple, this technique is fast: each dictionary operation takes constant time. However, the space requirement for this structure is the size of the entire keyspace, making it impractical unless the keyspace is small.[3]
The most frequently used general purpose implementation of an associative array is with a hash table: an array of bindings, together with a hash function that maps each possible key into an array index. The basic idea of a hash table is that the binding for a given key is stored at the position given by applying the hash function to that key, and that lookup operations are performed by looking at that cell of the array and using the binding found there. However, hash table based dictionaries must be prepared to handle collisions that occur when two keys are mapped by the hash function to the same index, and many different collision resolution strategies have been developed for dealing with this situation, often based either on open addressing (looking at a sequence of hash table indices instead of a single index, until finding either the given key or an empty cell) or on hash chaining (storing a small association list instead of a single binding in each hash table cell).[1][2][3]
Dictionaries may also be stored in binary search trees or in data structures specialized to a particular type of keys such as radix trees, tries, Judy arrays, or van Emde Boas trees, but these implementation methods are less efficient than hash tables as well as placing greater restrictions on the types of data that they can handle. The advantages of these alternative structures come from their ability to handle operations beyond the basic ones of an associative array, such as finding the binding whose key is the closest to a queried key, when the query is not itself present in the set of bindings.
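As a rough illustration of the direct-addressing technique described above, the following is a minimal C sketch (not part of the original text). It assumes keys are integers in the range 0..99 and uses a sentinel value to mark absent bindings; the names DirectTable and dt_* are placeholders.

#include <stdio.h>

#define KEYSPACE 100
#define ABSENT   (-1)      /* sentinel: no binding for this key */

typedef struct {
    int value[KEYSPACE];   /* value[k] holds the binding for key k */
} DirectTable;

static void dt_init(DirectTable *t)
{
    int k;
    for (k = 0; k < KEYSPACE; k++)
        t->value[k] = ABSENT;
}

static void dt_insert(DirectTable *t, int key, int value) { t->value[key] = value; }
static void dt_remove(DirectTable *t, int key)            { t->value[key] = ABSENT; }
static int  dt_lookup(const DirectTable *t, int key)      { return t->value[key]; }

int main(void)
{
    DirectTable t;
    dt_init(&t);
    dt_insert(&t, 7, 1234);
    printf("key 7 -> %d\n", dt_lookup(&t, 7));   /* constant-time lookup */
    dt_remove(&t, 7);
    printf("key 7 -> %d (absent)\n", dt_lookup(&t, 7));
    return 0;
}

Every operation is a single array access, but the table occupies space proportional to the whole keyspace, which is why the technique only pays off when the keyspace is small.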


Language support
Associative arrays can be implemented in any programming language as a package and many language systems provide them as part of their standard library. In some languages, they are not only built into the standard system, but have special syntax, often using array-like subscripting. Built-in syntactic support for associative arrays was introduced by SNOBOL4, under the name "table". MUMPS made multi-dimensional associative arrays, optionally persistent, its key data structure. SETL supported them as one possible implementation of sets and maps. Most modern scripting languages, starting with AWK and including Perl, Tcl, JavaScript, Python, Ruby, and Lua, support associative arrays as a primary container type. In many more languages, they are available as library functions without special syntax. In Smalltalk, Objective-C, .NET[7], Python, and REALbasic they are called dictionaries; in Perl and Ruby they are called hashes; in C++, Java, and Go they are called maps (see map (C++), unordered_map (C++), and Map); in Common Lisp and Windows PowerShell, they are called hash tables (since both typically use this implementation). In PHP, all arrays can be associative, except that the keys are limited to integers and strings. In JavaScript (see also JSON), all objects behave as associative arrays. In Lua, they are called tables, and are used as the primitive building block for all data structures. In Visual FoxPro, they are called Collections.

References
[1] Goodrich, Michael T.; Tamassia, Roberto (2006), "9.1 The Map Abstract Data Type", Data Structures & Algorithms in Java (4th ed.), Wiley, pp. 368–371.
[2] Mehlhorn, Kurt; Sanders, Peter (2008), "4 Hash Tables and Associative Arrays", Algorithms and Data Structures: The Basic Toolbox, Springer, pp. 81–98.
[3] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001), "11 Hash Tables", Introduction to Algorithms (2nd ed.), MIT Press and McGraw-Hill, pp. 221–252, ISBN 0-262-03293-7.
[4] Goodrich & Tamassia (2006), pp. 597–599.
[5] Goodrich & Tamassia (2006), pp. 389–397.
[6] "When should I use a hash table instead of an association list?" (http://www.faqs.org/faqs/lisp-faq/part2/section-2.html). lisp-faq/part2. 1996-02-20.
[7] "Dictionary<TKey, TValue> Class" (http://msdn.microsoft.com/en-us/library/xfhwa508.aspx). MSDN.

External links
NIST's Dictionary of Algorithms and Data Structures: Associative Array (http://www.nist.gov/dads/HTML/assocarray.html)

Association list
In computer programming and particularly in Lisp, an association list, often referred to as an alist, is a linked list in which each list element (or node) comprises a key and a value. The association list is said to associate the value with the key. In order to find the value associated with a given key, each element of the list is searched in turn, starting at the head, until the key is found. Duplicate keys that appear later in the list are ignored. It is a simple way of implementing an associative array.
The disadvantage of association lists is that the time to search is O(n), where n is the length of the list. Unless the list is regularly pruned to remove elements with duplicate keys, multiple values associated with the same key will increase the size of the list, and thus the time to search, without providing any compensatory advantage. One advantage is that a new element can be added to the list at its head, which can be done in constant time. For quite small values of n it is more efficient in terms of time and space than more sophisticated strategies such as hash tables and trees.
In the early development of Lisp, association lists were used to resolve references to free variables in procedures.[1] Many programming languages, including Lisp, Scheme, OCaml, and Haskell, have functions for handling association lists in their standard library.
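The following minimal C sketch (not part of the original text) illustrates the idea: a singly linked list of key/value nodes, with constant-time insertion at the head and linear-time lookup that returns the first, i.e. most recent, binding for a key. The names AssocNode, alist_add and alist_lookup are placeholders, and error checking and memory release are omitted for brevity.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct AssocNode {
    const char *key;
    int value;
    struct AssocNode *next;
} AssocNode;

/* Prepend a new binding; newer bindings shadow older ones with the same key. */
static AssocNode *alist_add(AssocNode *head, const char *key, int value)
{
    AssocNode *node = malloc(sizeof *node);
    node->key = key;
    node->value = value;
    node->next = head;
    return node;
}

/* Linear search from the head; returns 1 and sets *value if found. */
static int alist_lookup(const AssocNode *head, const char *key, int *value)
{
    for (; head != NULL; head = head->next)
        if (strcmp(head->key, key) == 0) {
            *value = head->value;
            return 1;
        }
    return 0;
}

int main(void)
{
    AssocNode *alist = NULL;
    int v;
    alist = alist_add(alist, "apples", 3);
    alist = alist_add(alist, "pears", 5);
    alist = alist_add(alist, "apples", 7);   /* shadows the earlier binding */
    if (alist_lookup(alist, "apples", &v))
        printf("apples -> %d\n", v);         /* prints 7 */
    return 0;
}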

References
[1] McCarthy, John; Abrahams, Paul W.; Edwards, Daniel J.; Hart, Timothy P.; Levin, Michael I. (1985). LISP 1.5 Programmer's Manual (http://www.softwarepreservation.org/projects/LISP/book/LISP 1.5 Programmers Manual.pdf). MIT Press. ISBN 0-262-13011-4.

Hash table
Type: Unsorted associative array
Invented: 1953
Time complexity in big O notation:
          Average        Worst case
Space     O(n) [1]       O(n)
Search    O(1 + n/k)     O(n)
Insert    O(1)           O(1)
Delete    O(1 + n/k)     O(n)

In computer science, a hash table or hash map is a data structure that uses a hash function to map identifying values, known as keys (e.g., a person's name), to their associated values (e.g., their telephone number). Thus, a hash table implements an associative array. The hash function is used to transform the key into the index (the hash) of an array element (the slot or bucket) where the corresponding value is to be sought.
A small phone book as a hash table.
Ideally, the hash function should map each possible key to a unique slot index, but this ideal is rarely achievable in practice (unless the hash keys are fixed; i.e. new entries are never added to the table after it is created). Instead, most hash table designs assume that hash collisions (different keys that map to the same hash value) will occur and must be accommodated in some way. In a well-dimensioned hash table, the average cost (number of instructions) for each lookup is independent of the number of elements stored in the table. Many hash table designs also allow arbitrary insertions and deletions of key-value pairs, at constant average (indeed, amortized[2]) cost per operation.[3][4]
In many situations, hash tables turn out to be more efficient than search trees or any other table lookup structure. For this reason, they are widely used in many kinds of computer software, particularly for associative arrays, database indexing, caches, and sets.

Hash function
At the heart of the hash table algorithm is an array of items; this array is often simply called the hash table. Hash table algorithms calculate an index based on the data item's key and the length of the array. The index is used to find or insert the data into the array. The implementation of this calculation is the hash function, f:
index = f(key, arrayLength)
The hash function calculates an index into the array from the data key and arrayLength (the size of the array). For assembly language or other low-level programs, a trivial hash function can often create an index with just one or two inline machine instructions.
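As a simple illustration of this calculation (not from the original text), the following C sketch hashes a string key and reduces the result modulo the array length to obtain a slot index. The multiplier 31 is just one common choice, and hash_index is a placeholder name.

#include <stdio.h>

/* A very simple string hash: index = f(key, arrayLength). */
static unsigned int hash_index(const char *key, unsigned int arrayLength)
{
    unsigned int h = 0;
    while (*key)
        h = h * 31 + (unsigned char)*key++;   /* mix in each byte of the key */
    return h % arrayLength;                   /* map the hash to a slot index */
}

int main(void)
{
    printf("%u\n", hash_index("John Smith", 16));
    printf("%u\n", hash_index("Lisa Smith", 16));
    return 0;
}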

Choosing a good hash function


A good hash function and implementation algorithm are essential for good hash table performance, but may be difficult to achieve. A basic requirement is that the function should provide a uniform distribution of hash values. A non-uniform distribution increases the number of collisions and the cost of resolving them. Uniformity is sometimes difficult to ensure by design, but may be evaluated empirically using statistical tests, e.g., a Pearson's chi-squared test for discrete uniform distributions.[5][6]
The distribution needs to be uniform only for table sizes s that occur in the application. In particular, if one uses dynamic resizing with exact doubling and halving of s, the hash function needs to be uniform only when s is a power of two. On the other hand, some hashing algorithms provide uniform hashes only when s is a prime number.[7]
For open addressing schemes, the hash function should also avoid clustering, the mapping of two or more keys to consecutive slots. Such clustering may cause the lookup cost to skyrocket, even if the load factor is low and collisions are infrequent. The popular multiplicative hash[3] is claimed to have particularly poor clustering behavior.[7]
Cryptographic hash functions are believed to provide good hash functions for any table size s, either by modulo reduction or by bit masking. They may also be appropriate if there is a risk of malicious users trying to sabotage a network service by submitting requests designed to generate a large number of collisions in the server's hash tables. However, the risk of sabotage can also be avoided by cheaper methods (such as applying a secret salt to the data, or using a universal hash function).
Some authors claim that good hash functions should have the avalanche effect; that is, a single-bit change in the input key should affect, on average, half the bits in the output. Some popular hash functions do not have this property.

Perfect hash function


If all keys are known ahead of time, a perfect hash function can be used to create a perfect hash table that has no collisions. If minimal perfect hashing is used, every location in the hash table can be used as well. Perfect hashing allows for constant-time lookups in the worst case. This is in contrast to most chaining and open addressing methods, where the time for lookup is low on average, but may be very large (proportional to the number of entries) for some sets of keys.

Collision resolution
Hash collisions are practically unavoidable when hashing a random subset of a large set of possible keys. For example, if 2,500 keys are hashed into a million buckets, even with a perfectly uniform random distribution, according to the birthday problem there is a 95% chance of at least two of the keys being hashed to the same slot. Therefore, most hash table implementations have some collision resolution strategy to handle such events. Some common strategies are described below. All these methods require that the keys (or pointers to them) be stored in the table, together with the associated values.

Load factor
The performance of most collision resolution methods does not depend directly on the number n of stored entries. Instead, performance depends strongly on the table's load factor. Load factor is equal to n/s, the ratio of the number of stored entries n and the size s of the table's array of buckets. Sometimes this is referred to as the fill factor, as it represents the portion of the s buckets in the structure that are filled with one of the n stored entries. With a good hash function, the average lookup cost is nearly constant as the load factor increases from 0 to 0.7 (about 2/3 full) or so. Beyond that point, the probability of collisions and the cost of handling them increases. A low load factor is not especially beneficial. As load factor approaches 0, the proportion of unused areas in the hash table increases, but there is not necessarily any reduction in search cost. This results in wasted memory.

Separate chaining
In the strategy known as separate chaining, direct chaining, or simply chaining, each slot of the bucket array is a pointer to a linked list that contains the key-value pairs that hashed to the same location. Lookup requires scanning the list for an entry with the given key. Insertion requires adding a new entry record to either end of the list belonging to the hashed slot. Deletion requires searching the list and removing the element. (The technique is also called open hashing or closed addressing.)
Hash collision resolved by separate chaining.

Chained hash tables with linked lists are popular because they require only basic data structures with simple algorithms, and can use simple hash functions that are unsuitable for other methods. The cost of a table operation is that of scanning the entries of the selected bucket for the desired key. If the distribution of keys is sufficiently uniform, the average cost of a lookup depends only on the average number of keys per bucket, that is, on the load factor. Chained hash tables remain effective even when the number of table entries n is much higher than the number of slots. Their performance degrades more gracefully (linearly) with the load factor. For example, a chained hash table with 1000 slots and 10,000 stored keys (load factor 10) is five to ten times slower than a 10,000-slot table (load factor 1), but still 1000 times faster than a plain sequential list, and possibly even faster than a balanced search tree. For separate-chaining, the worst-case scenario is when all entries are inserted into the same bucket, in which case the hash table is ineffective and the cost is that of searching the bucket data structure. If the latter is a linear list, the lookup procedure may have to scan all its entries; so the worst-case cost is proportional to the number n of entries in the table.
The bucket chains are often implemented as ordered lists, sorted by the key field; this choice approximately halves the average cost of unsuccessful lookups, compared to an unordered list. However, if some keys are much more likely to come up than others, an unordered list with a move-to-front heuristic may be more effective. More sophisticated data structures, such as balanced search trees, are worth considering only if the load factor is large (about 10 or more), or if the hash distribution is likely to be very non-uniform, or if one must guarantee good performance even in a worst-case scenario. However, using a larger table and/or a better hash function may be even more effective in those cases.
Chained hash tables also inherit the disadvantages of linked lists. When storing small keys and values, the space overhead of the next pointer in each entry record can be significant. An additional disadvantage is that traversing a linked list has poor cache performance, making the processor cache ineffective.

Separate chaining with list heads
Some chaining implementations store the first record of each chain in the slot array itself.[4] The purpose is to increase cache efficiency of hash table access. To save memory space, such hash tables often have about as many slots as stored entries, meaning that many slots have two or more entries.
Hash collision by separate chaining with head records in the bucket array.

Separate chaining with other structures
Instead of a list, one can use any other data structure that supports the required operations. For example, by using a self-balancing tree, the theoretical worst-case time of common hash table operations (insertion, deletion, lookup) can be brought down to O(log n) rather than O(n). However, this approach is only worth the trouble and extra memory cost if long delays must be avoided at all costs (e.g. in a real-time application), or if one must guard against many entries hashed to the same slot (e.g. if one expects extremely non-uniform distributions, or in the case of web sites or other publicly accessible services, which are vulnerable to malicious key distributions in requests). The variant called array hash table uses a dynamic array to store all the entries that hash to the same slot.[8][9][10] Each newly inserted entry gets appended to the end of the dynamic array that is assigned to the slot. The dynamic array is resized in an exact-fit manner, meaning it is grown only by as many bytes as needed. Alternative techniques such as growing the array by block sizes or pages were found to improve insertion performance, but at a cost in space. This variation makes more efficient use of CPU caching and the translation lookaside buffer (TLB), because slot entries are stored in sequential memory positions. It also dispenses with the next pointers that are required by linked lists, which saves space. Despite frequent array resizing, space overheads incurred by operating system such as memory fragmentation, were found to be small. An elaboration on this approach is the so-called dynamic perfect hashing,[11] where a bucket that contains k entries is organized as a perfect hash table with k2 slots. While it uses more memory (n2 slots for n entries, in the worst case and n*k slots in the average case), this variant has guaranteed constant worst-case lookup time, and low amortized time for insertion.
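As an illustration, the basic separate-chaining strategy with linked lists described above can be sketched in C as follows. This is a minimal sketch, not from the original text: the names ChainedHash, ch_put and ch_get are placeholders, the string hash is an arbitrary simple choice, and resizing and deletion are omitted.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 16

typedef struct Entry {
    const char *key;
    int value;
    struct Entry *next;        /* next entry in the same bucket's chain */
} Entry;

typedef struct {
    Entry *bucket[NBUCKETS];
} ChainedHash;

static unsigned int hash_str(const char *s)
{
    unsigned int h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h;
}

/* Insert or update the binding for key: prepend a node to the bucket's chain. */
static void ch_put(ChainedHash *t, const char *key, int value)
{
    unsigned int i = hash_str(key) % NBUCKETS;
    Entry *e;
    for (e = t->bucket[i]; e != NULL; e = e->next)
        if (strcmp(e->key, key) == 0) { e->value = value; return; }
    e = malloc(sizeof *e);
    e->key = key;
    e->value = value;
    e->next = t->bucket[i];
    t->bucket[i] = e;
}

/* Scan the chain of the hashed bucket; returns 1 and sets *value if found. */
static int ch_get(const ChainedHash *t, const char *key, int *value)
{
    unsigned int i = hash_str(key) % NBUCKETS;
    const Entry *e;
    for (e = t->bucket[i]; e != NULL; e = e->next)
        if (strcmp(e->key, key) == 0) { *value = e->value; return 1; }
    return 0;
}

int main(void)
{
    ChainedHash t = { {NULL} };
    int v;
    ch_put(&t, "John Smith", 521);
    ch_put(&t, "Lisa Smith", 833);
    if (ch_get(&t, "John Smith", &v))
        printf("John Smith -> %d\n", v);
    return 0;
}

The cost of each operation is that of hashing the key plus scanning one chain, which is why the average cost depends only on the load factor.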

Open addressing
In another strategy, called open addressing, all entry records are stored in the bucket array itself. When a new entry has to be inserted, the buckets are examined, starting with the hashed-to slot and proceeding in some probe sequence, until an unoccupied slot is found. When searching for an entry, the buckets are scanned in the same sequence, until either the target record is found, or an unused array slot is found, which indicates that there is no such key in the table.[12] The name "open addressing" refers to the fact that the location ("address") of the item is not determined by its hash value. (This method is also called closed hashing; it should not be confused with "open hashing" or "closed addressing", which usually mean separate chaining.)
Well-known probe sequences include:
- Linear probing, in which the interval between probes is fixed (usually 1)
- Quadratic probing, in which the interval between probes is increased by adding the successive outputs of a quadratic polynomial to the starting value given by the original hash computation
- Double hashing, in which the interval between probes is computed by another hash function
A drawback of all these open addressing schemes is that the number of stored entries cannot exceed the number of slots in the bucket array. In fact, even with good hash functions, their performance dramatically degrades when the load factor grows beyond 0.7 or so. Thus a more aggressive resize scheme is needed. Separate chaining works correctly with any load factor, although performance is likely to be reasonable if it is kept below 2 or so. For many applications, these restrictions mandate the use of dynamic resizing, with its attendant costs.
Open addressing schemes also put more stringent requirements on the hash function: besides distributing the keys more uniformly over the buckets, the function must also minimize the clustering of hash values that are consecutive in the probe order. Using separate chaining, the only concern is that too many objects map to the same hash value; whether they are adjacent or nearby is completely irrelevant.
Open addressing only saves memory if the entries are small (less than four times the size of a pointer) and the load factor is not too small. If the load factor is close to zero (that is, there are far more buckets than stored entries), open addressing is wasteful even if each entry is just two words. A sketch of open addressing with linear probing follows.
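This minimal C sketch (not from the original text) illustrates open addressing with linear probing: insertion probes forward from the hashed slot until an empty slot (or the same key) is found, and a search stops at the first empty slot. Deletion and resizing are omitted, and the names oa_put and oa_get and the string hash are illustrative choices.

#include <stdio.h>
#include <string.h>

#define TABLE_SIZE 16           /* number of slots; keep the load factor well below 1 */

typedef struct {
    const char *key;            /* NULL marks an empty slot */
    int value;
} Slot;

static Slot table[TABLE_SIZE];

static unsigned int hash_str(const char *s)
{
    unsigned int h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h;
}

/* Probe forward from the hashed slot until an empty slot or the same key
   is found.  Returns 0 on success, -1 if the table is full. */
static int oa_put(const char *key, int value)
{
    unsigned int i = hash_str(key) % TABLE_SIZE;
    unsigned int probes;
    for (probes = 0; probes < TABLE_SIZE; probes++) {
        if (table[i].key == NULL || strcmp(table[i].key, key) == 0) {
            table[i].key = key;
            table[i].value = value;
            return 0;
        }
        i = (i + 1) % TABLE_SIZE;       /* linear probe, interval 1 */
    }
    return -1;
}

/* The search stops at the first empty slot, which proves the key is absent. */
static int oa_get(const char *key, int *value)
{
    unsigned int i = hash_str(key) % TABLE_SIZE;
    unsigned int probes;
    for (probes = 0; probes < TABLE_SIZE && table[i].key != NULL; probes++) {
        if (strcmp(table[i].key, key) == 0) { *value = table[i].value; return 1; }
        i = (i + 1) % TABLE_SIZE;
    }
    return 0;
}

int main(void)
{
    int v;
    oa_put("John Smith", 521);
    oa_put("Sandra Dee", 834);
    if (oa_get("Sandra Dee", &v))
        printf("Sandra Dee -> %d\n", v);
    return 0;
}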

Hash collision resolved by open addressing with linear probing (interval=1). Note that "Ted Baker" has a unique hash, but nevertheless collided with "Sandra Dee", which had previously collided with "John Smith".

Open addressing avoids the time overhead of allocating each new entry record, and can be implemented even in the absence of a memory allocator. It also avoids the extra indirection required to access the first entry of each bucket (that is, usually the only one). It also has better locality of reference, particularly with linear probing. With small record sizes, these factors can yield better performance than chaining, particularly for lookups. Hash tables with open addressing are also easier to serialize, because they do not use pointers.
This graph compares the average number of cache misses required to look up elements in tables with chaining and linear probing. As the table passes the 80%-full mark, linear probing's performance drastically degrades.

On the other hand, normal open addressing is a poor choice for large elements, because these elements fill entire CPU cache lines (negating the cache advantage), and a large amount of space is wasted on large empty table slots. If the open addressing table only stores references to elements (external storage), it uses space comparable to chaining even for large records but loses its speed advantage. Generally speaking, open addressing is better used for hash tables with small records that can be stored within the table (internal storage) and fit in a cache line. They are particularly suitable for elements of one word or less. If the table is expected to have a high load factor, the records are large, or the data is variable-sized, chained hash tables often perform as well or better. Ultimately, used sensibly, any kind of hash table algorithm is usually fast enough; and the percentage of a calculation spent in hash table code is low. Memory usage is rarely considered excessive. Therefore, in most cases the differences between these algorithms are marginal, and other considerations typically come into play.

Coalesced hashing
A hybrid of chaining and open addressing, coalesced hashing links together chains of nodes within the table itself.[12] Like open addressing, it achieves space usage and (somewhat diminished) cache advantages over chaining. Like chaining, it does not exhibit clustering effects; in fact, the table can be efficiently filled to a high density. Unlike chaining, it cannot have more elements than table slots.

Robin Hood hashing


One interesting variation on double-hashing collision resolution is Robin Hood hashing.[13] The idea is that a new key may displace a key already inserted, if its probe count is larger than that of the key at the current position. The net effect of this is that it reduces worst case search times in the table. This is similar to Knuth's ordered hash tables except that the criterion for bumping a key does not depend on a direct relationship between the keys. Since both the worst case and the variation in the number of probes is reduced dramatically, an interesting variation is to probe the table starting at the expected successful probe value and then expand from that position in both directions.[14] External Robin Hashing is an extension of this algorithm where the table is stored in an external file and each table position corresponds to a fixed-sized page or bucket with B records.[15]

Cuckoo hashing
Another alternative open-addressing solution is cuckoo hashing, which ensures constant lookup time in the worst case, and constant amortized time for insertions and deletions. It uses two or more hash functions, which means any key/value pair could be in two or more locations. For lookup, the first hash function is used; if the key/value is not found, then the second hash function is used, and so on. If a collision happens during insertion, then the key is re-hashed with the second hash function to map it to another bucket. If all hash functions are used and there is still a collision, then the key it collided with is removed to make space for the new key, and the old key is re-hashed with one of the other hash functions, which maps it to another bucket. If that location also results in a collision, then the process repeats until there is no collision or the process traverses all the buckets, at which point the table is resized. By combining multiple hash functions with multiple cells per bucket, very high space utilisation can be achieved.

Hopscotch hashing
Another alternative open-addressing solution is hopscotch hashing,[16] which combines the approaches of cuckoo hashing and linear probing, yet seems in general to avoid their limitations. In particular it works well even when the load factor grows beyond 0.9. The algorithm is well suited for implementing a resizable concurrent hash table. The hopscotch hashing algorithm works by defining a neighborhood of buckets near the original hashed bucket, where a given entry is always found. Thus, search is limited to the number of entries in this neighborhood, which is logarithmic in the worst case, constant on average, and with proper alignment of the neighborhood typically requires one cache miss. When inserting an entry, one first attempts to add it to a bucket in the neighborhood. However, if all buckets in this neighborhood are occupied, the algorithm traverses buckets in sequence until an open slot (an unoccupied bucket) is found (as in linear probing). At that point, since the empty bucket is outside the neighborhood, items are repeatedly displaced in a sequence of hops. (This is similar to cuckoo hashing, but with the difference that in this case the empty slot is being moved into the neighborhood, instead of items being moved out with the hope of eventually finding an empty slot.) Each hop brings the open slot closer to the original neighborhood, without invalidating the neighborhood property of any of the buckets along the way. In the end, the open slot has been moved into the neighborhood, and the entry being inserted can be added to it.

Dynamic resizing
To keep the load factor under a certain limit, e.g. under 3/4, many table implementations expand the table when items are inserted. For example, in Java's HashMap class the default load factor threshold for table expansion is 0.75. Since buckets are usually implemented on top of a dynamic array and any constant proportion for resizing greater than 1 will keep the load factor under the desired limit, the exact choice of the constant is determined by the same space-time tradeoff as for dynamic arrays. Resizing is accompanied by a full or incremental table rehash whereby existing items are mapped to new bucket locations. To limit the proportion of memory wasted due to empty buckets, some implementations also shrink the size of the tablefollowed by a rehashwhen items are deleted. From the point of space-time tradeoffs, this operation is similar to the deallocation in dynamic arrays.

Resizing by copying all entries


A common approach is to automatically trigger a complete resizing when the load factor exceeds some threshold rmax. Then a new larger table is allocated, all the entries of the old table are removed and inserted into this new table, and the old table is returned to the free storage pool. Symmetrically, when the load factor falls below a second threshold rmin, all entries are moved to a new smaller table.
If the table size increases or decreases by a fixed percentage at each expansion, the total cost of these resizings, amortized over all insert and delete operations, is still a constant, independent of the number of entries n and of the number m of operations performed.
For example, consider a table that was created with the minimum possible size and is doubled each time the load ratio exceeds some threshold. If m elements are inserted into that table, the total number of extra re-insertions that occur in all dynamic resizings of the table is at most m − 1. In other words, dynamic resizing roughly doubles the cost of each insert or delete operation.

Incremental resizing
Some hash table implementations, notably in real-time systems, cannot pay the price of enlarging the hash table all at once, because it may interrupt time-critical operations. If one cannot avoid dynamic resizing, a solution is to perform the resizing gradually:
- During the resize, allocate the new hash table, but keep the old table unchanged.
- In each lookup or delete operation, check both tables.
- Perform insertion operations only in the new table.
- At each insertion also move r elements from the old table to the new table.
- When all elements are removed from the old table, deallocate it.

To ensure that the old table is completely copied over before the new table itself needs to be enlarged, it is necessary to increase the size of the table by a factor of at least (r + 1)/r during resizing.

Monotonic keys
If it is known that key values will always increase (or decrease) monotonically, then a variation of consistent hashing can be achieved by keeping a list of the single most recent key value at each hash table resize operation. Upon lookup, keys that fall in the ranges defined by these list entries are directed to the appropriate hash functionand indeed hash tableboth of which can be different for each range. Since it is common to grow the overall number of entries by doubling, there will only be O(lg(N)) ranges to check, and binary search time for the redirection would be O(lg(lg(N))). As with consistent hashing, this approach guarantees that any key's hash, once issued, will never change, even when the hash table is later grown.

Other solutions
Linear hashing[17] is a hash table algorithm that permits incremental hash table expansion. It is implemented using a single hash table, but with two possible look-up functions. Another way to decrease the cost of table resizing is to choose a hash function in such a way that the hashes of most values do not change when the table is resized. This approach, called consistent hashing, is prevalent in disk-based and distributed hashes, where rehashing is prohibitively costly.

Performance analysis
In the simplest model, the hash function is completely unspecified and the table does not resize. For the best possible choice of hash function, a table of size n with open addressing has no collisions and holds up to n elements, with a single comparison for successful lookup, and a table of size n with chaining and k keys has the minimum max(0, k − n) collisions and O(1 + k/n) comparisons for lookup. For the worst choice of hash function, every insertion causes a collision, and hash tables degenerate to linear search, with Θ(k) amortized comparisons per insertion and up to k comparisons for a successful lookup.
Adding rehashing to this model is straightforward. As in a dynamic array, geometric resizing by a factor of b implies that only k/b^i keys are inserted i or more times, so that the total number of insertions is bounded above by b·k/(b − 1), which is O(k). By using rehashing to maintain k < n, tables using both chaining and open addressing can have unlimited elements and perform successful lookup in a single comparison for the best choice of hash function.
In more realistic models, the hash function is a random variable over a probability distribution of hash functions, and performance is computed on average over the choice of hash function. When this distribution is uniform, the assumption is called "simple uniform hashing" and it can be shown that hashing with chaining requires Θ(1 + k/n) comparisons on average for an unsuccessful lookup, and hashing with open addressing requires Θ(1/(1 − k/n)).[18] Both these bounds are constant, if we maintain k/n < c using table resizing, where c is a fixed constant less than 1.

Features
Advantages
The main advantage of hash tables over other table data structures is speed. This advantage is more apparent when the number of entries is large. Hash tables are particularly efficient when the maximum number of entries can be predicted in advance, so that the bucket array can be allocated once with the optimum size and never resized. If the set of key-value pairs is fixed and known ahead of time (so insertions and deletions are not allowed), one may reduce the average lookup cost by a careful choice of the hash function, bucket table size, and internal data structures. In particular, one may be able to devise a hash function that is collision-free, or even perfect (see below). In this case the keys need not be stored in the table.

Drawbacks
Although operations on a hash table take constant time on average, the cost of a good hash function can be significantly higher than the inner loop of the lookup algorithm for a sequential list or search tree. Thus hash tables are not effective when the number of entries is very small. (However, in some cases the high cost of computing the hash function can be mitigated by saving the hash value together with the key.)
For certain string processing applications, such as spell-checking, hash tables may be less efficient than tries, finite automata, or Judy arrays. Also, if each key is represented by a small enough number of bits, then, instead of a hash table, one may use the key directly as the index into an array of values. Note that there are no collisions in this case.
The entries stored in a hash table can be enumerated efficiently (at constant cost per entry), but only in some pseudo-random order. Therefore, there is no efficient way to locate an entry whose key is nearest to a given key. Listing all n entries in some specific order generally requires a separate sorting step, whose cost is proportional to log(n) per entry. In comparison, ordered search trees have lookup and insertion cost proportional to log(n), but allow finding the nearest key at about the same cost, and ordered enumeration of all entries at constant cost per entry. If the keys are not stored (because the hash function is collision-free), there may be no easy way to enumerate the keys that are present in the table at any given moment.
Although the average cost per operation is constant and fairly small, the cost of a single operation may be quite high. In particular, if the hash table uses dynamic resizing, an insertion or deletion operation may occasionally take time proportional to the number of entries. This may be a serious drawback in real-time or interactive applications.
Hash tables in general exhibit poor locality of reference, that is, the data to be accessed is distributed seemingly at random in memory. Because hash tables cause access patterns that jump around, this can trigger microprocessor cache misses that cause long delays. Compact data structures such as arrays searched with linear search may be faster, if the table is relatively small and keys are integers or other short strings. According to Moore's Law, cache sizes are growing exponentially and so what is considered "small" may be increasing. The optimal performance point varies from system to system.
Hash tables become quite inefficient when there are many collisions. While extremely uneven hash distributions are extremely unlikely to arise by chance, a malicious adversary with knowledge of the hash function may be able to supply information to a hash that creates worst-case behavior by causing excessive collisions, resulting in very poor performance, e.g. a denial of service attack.[19] In critical applications, universal hashing can be used; a data structure with better worst-case guarantees may be preferable.[20]


Uses
Associative arrays
Hash tables are commonly used to implement many types of in-memory tables. They are used to implement associative arrays (arrays whose indices are arbitrary strings or other complicated objects), especially in interpreted programming languages like AWK, Perl, and PHP. When storing a new item into a multimap and a hash collision occurs, the multimap unconditionally stores both items. When storing a new item into a typical associative array and a hash collision occurs, but the actual keys themselves are different, the associative array likewise stores both items. However, if the key of the new item exactly matches the key of an old item, the associative array typically erases the old item and overwrites it with the new item, so every item in the table has a unique key.

Database indexing
Hash tables may also be used as disk-based data structures and database indices (such as in dbm) although B-trees are more popular in these applications.

Caches
Hash tables can be used to implement caches, auxiliary data tables that are used to speed up the access to data that is primarily stored in slower media. In this application, hash collisions can be handled by discarding one of the two colliding entries, usually erasing the old item that is currently stored in the table and overwriting it with the new item, so every item in the table has a unique hash value.

Sets
Besides recovering the entry that has a given key, many hash table implementations can also tell whether such an entry exists or not. Those structures can therefore be used to implement a set data structure, which merely records whether a given key belongs to a specified set of keys. In this case, the structure can be simplified by eliminating all parts that have to do with the entry values. Hashing can be used to implement both static and dynamic sets.

Object representation
Several dynamic languages, such as Perl, Python, JavaScript, and Ruby, use hash tables to implement objects. In this representation, the keys are the names of the members and methods of the object, and the values are pointers to the corresponding member or method.

Unique data representation


Hash tables can be used by some programs to avoid creating multiple character strings with the same contents. For that purpose, all strings in use by the program are stored in a single hash table, which is checked whenever a new string has to be created. This technique was introduced in Lisp interpreters under the name hash consing, and can be used with many other kinds of data (expression trees in a symbolic algebra system, records in a database, files in a file system, binary decision diagrams, etc.)
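As a rough illustration of the idea, the fragment below interns C strings in one shared table with linear probing; the table size, the djb2-style hash and the helper name intern are assumptions of this sketch (it also omits resizing and the full-table case), not part of any particular Lisp or library implementation.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 1024                 /* fixed capacity, assumed large enough */

static char *pool[TABLE_SIZE];          /* the single shared table of strings */

static unsigned long hash_str(const char *s)
{
    unsigned long h = 5381;             /* djb2-style multiplicative hash */
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Return the canonical copy of s, creating it only if no equal string exists. */
static const char *intern(const char *s)
{
    unsigned long i = hash_str(s) % TABLE_SIZE;
    while (pool[i] != NULL) {           /* linear probing over occupied slots */
        if (strcmp(pool[i], s) == 0)
            return pool[i];             /* same contents: reuse the old string */
        i = (i + 1) % TABLE_SIZE;
    }
    pool[i] = strdup(s);                /* first occurrence: store a copy (POSIX strdup) */
    return pool[i];
}

int main(void)
{
    const char *a = intern("example");
    const char *b = intern("example");
    printf("same object: %s\n", a == b ? "yes" : "no");   /* prints "yes" */
    return 0;
}

The same pattern works for any data that can be hashed and compared for equality; only the hash and the comparison change.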

Implementations
In programming languages
Many programming languages provide hash table functionality, either as built-in associative arrays or as standard library modules. In C++11, for example, the unordered_map class provides hash tables for keys and values of arbitrary type. In PHP 5, the Zend 2 engine uses one of the hash functions from Daniel J. Bernstein to generate the hash values used in managing the mappings of data pointers stored in a hash table. In the PHP source code, it is labelled as DJBX33A (Daniel J. Bernstein, Times 33 with Addition). Python's built-in hash table implementation, in the form of the dict type, as well as Perl's hash type (%), are highly optimized, as they are used internally to implement namespaces. In the .NET Framework, support for hash tables is provided via the non-generic Hashtable and generic Dictionary classes, which store key-value pairs, and the generic HashSet class, which stores only values.

Independent packages
SparseHash [21] (formerly Google SparseHash): an extremely memory-efficient hash_map implementation, with only 2 bits/entry of overhead. The SparseHash library has several C++ hash map implementations with different performance characteristics, including one that optimizes for memory use and another that optimizes for speed.
SunriseDD [22]: an open source C library for hash table storage of arbitrary data objects with lock-free lookups, built-in reference counting and guaranteed order iteration. The library can participate in external reference counting systems or use its own built-in reference counting. It comes with a variety of hash functions and allows the use of runtime supplied hash functions via callback mechanism. Source code is well documented.
uthash [23]: an easy-to-use hash table for C structures.


History
The idea of hashing arose independently in different places. In January 1953, H. P. Luhn wrote an internal IBM memorandum that used hashing with chaining.[24] G. N. Amdahl, E. M. Boehme, N. Rochester, and Arthur Samuel implemented a program using hashing at about the same time. Open addressing with linear probing (relatively prime stepping) is credited to Amdahl, but Ershov (in Russia) had the same idea.[24]

References
[1] Thomas H. Cormen et al. (2009). Introduction to Algorithms (3rd ed.). MIT Press. pp. 253–280. ISBN 978-0-262-03384-8.
[2] Charles E. Leiserson, Amortized Algorithms, Table Doubling, Potential Method (http://videolectures.net/mit6046jf05_leiserson_lec13/), Lecture 13, course MIT 6.046J/18.410J Introduction to Algorithms, Fall 2005.
[3] Donald Knuth (1998). The Art of Computer Programming, Vol. 3: Sorting and Searching (2nd ed.). Addison-Wesley. pp. 513–558. ISBN 0-201-89685-0.
[4] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001). Introduction to Algorithms (2nd ed.). MIT Press and McGraw-Hill. pp. 221–252. ISBN 978-0-262-53196-2.
[5] Karl Pearson (1900). "On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling". Philosophical Magazine, Series 5, 50 (302): 157–175.
[6] Robin Plackett (1983). "Karl Pearson and the Chi-Squared Test". International Statistical Review (International Statistical Institute (ISI)) 51 (1): 59–72.
[7] Thomas Wang (1997), Prime Double Hash Table (http://www.concentric.net/~Ttwang/tech/primehash.htm). Accessed April 27, 2012.
[8] Askitis, Nikolas; Zobel, Justin (October 2005). Cache-conscious Collision Resolution in String Hash Tables (http://www.springerlink.com/content/b61721172558qt03/). Vol. 3772/2005. pp. 91–102. doi:10.1007/11575832_11. ISBN 978-3-540-29740-6.
[9] Askitis, Nikolas; Sinha, Ranjan (2010). "Engineering scalable, cache and space efficient tries for strings" (http://www.springerlink.com/content/86574173183j6565/). The VLDB Journal 17 (5): 633–660. doi:10.1007/s00778-010-0183-9. ISSN 1066-8888.
[10] Askitis, Nikolas (2009). Fast and Compact Hash Tables for Integer Keys (http://crpit.com/confpapers/CRPITV91Askitis.pdf). Vol. 91. pp. 113–122. ISBN 978-1-920682-72-9.
[11] Erik Demaine, Jeff Lind. 6.897: Advanced Data Structures. MIT Computer Science and Artificial Intelligence Laboratory. Spring 2003. http://courses.csail.mit.edu/6.897/spring03/scribe_notes/L2/lecture2.pdf
[12] Tenenbaum, Aaron M.; Langsam, Yedidyah; Augenstein, Moshe J. (1990). Data Structures Using C. Prentice Hall. pp. 456–461, 472. ISBN 0-13-199746-7.
[13] Celis, Pedro (1986). Robin Hood hashing (Technical report). CS-86-14.
[14] Viola, Alfredo (October 2005). "Exact distribution of individual displacements in linear probing hashing". Transactions on Algorithms (TALG) (ACM) 1 (2): 214–242. doi:10.1145/1103963.1103965.
[15] Celis, Pedro (March 1988). External Robin Hood Hashing (Technical report). TR246.
[16] Herlihy, Maurice; Shavit, Nir; Tzafrir, Moran (2008). "Hopscotch Hashing". DISC '08: Proceedings of the 22nd International Symposium on Distributed Computing. Arcachon, France: Springer-Verlag. pp. 350–364.
[17] Litwin, Witold (1980). "Linear hashing: A new tool for file and table addressing". Proc. 6th Conference on Very Large Databases. pp. 212–223.
[18] Doug Dunham. CS 4521 Lecture Notes (http://www.duluth.umn.edu/~ddunham/cs4521s09/notes/ch11.txt). University of Minnesota Duluth. Theorems 11.2, 11.6. Last modified 21 April 2009.
[19] Alexander Klink and Julian Wälde, Efficient Denial of Service Attacks on Web Application Platforms (http://events.ccc.de/congress/2011/Fahrplan/attachments/2007_28C3_Effective_DoS_on_web_application_platforms.pdf), December 28, 2011, 28th Chaos Communication Congress, Berlin, Germany.
[20] Crosby and Wallach, Denial of Service via Algorithmic Complexity Attacks (http://www.cs.rice.edu/~scrosby/hash/CrosbyWallach_UsenixSec2003.pdf).
[21] http://code.google.com/p/sparsehash/
[22] http://www.sunrisetel.net/software/devtools/sunrise-data-dictionary.shtml
[23] http://uthash.sourceforge.net/
[24] Mehta, Dinesh P.; Sahni, Sartaj. Handbook of Data Structures and Applications. pp. 9–15. ISBN 1-58488-435-5.


Further reading
Tamassia, Roberto; Goodrich, Michael T. (2006). "Chapter Nine: Maps and Dictionaries". Data Structures and Algorithms in Java (updated for Java 5.0) (4th ed.). Hoboken, NJ: Wiley. pp. 369–418. ISBN 0-471-73884-0.

External links
A Hash Function for Hash Table Lookup (http://www.burtleburtle.net/bob/hash/doobs.html) by Bob Jenkins.
Hash Tables (http://www.sparknotes.com/cs/searching/hashtables/summary.html) by SparkNotes; explanation using C.
Hash functions (http://www.azillionmonkeys.com/qed/hash.html) by Paul Hsieh.
Design of Compact and Efficient Hash Tables for Java (http://blog.griddynamics.com/2011/03/ultimate-sets-and-maps-for-java-part-i.html)
Libhashish (http://libhashish.sourceforge.net/) hash library.
NIST entry on hash tables (http://www.nist.gov/dads/HTML/hashtab.html)
Open addressing hash table removal algorithm from ICI programming language, ici_set_unassign in set.c (http://ici.cvs.sourceforge.net/ici/ici/set.c?view=markup) (and other occurrences, with permission).
A basic explanation of how the hash table works by Reliable Software (http://www.relisoft.com/book/lang/pointer/8hash.html)
Lecture on Hash Tables (http://compgeom.cs.uiuc.edu/~jeffe/teaching/373/notes/06-hashing.pdf)
Hash-tables in C (http://task3.cc/308/hash-maps-with-linear-probing-and-separate-chaining/): two simple and clear examples of hash table implementation in C with linear probing and chaining.
Open Data Structures - Chapter 5 - Hash Tables (http://opendatastructures.org/versions/edition-0.1e/ods-java/5_Hash_Tables.html)
MIT's Introduction to Algorithms: Hashing 1 (http://video.google.com/videoplay?docid=-727485696209877198&q=source:014117792397255896270&hl=en), MIT OCW lecture video.
MIT's Introduction to Algorithms: Hashing 2 (http://video.google.com/videoplay?docid=2307261494964091254&q=source:014117792397255896270&hl=en), MIT OCW lecture video.
How to sort a HashMap (Java) and keep the duplicate entries (http://www.lampos.net/sort-hashmap)


Linear probing
Linear probing is a scheme in computer programming for resolving hash collisions of values of hash functions by sequentially searching the hash table for a free location.[1] This is accomplished using two values: one as a starting value and one as an interval between successive values in modular arithmetic. The second value, which is the same for all keys and known as the stepsize, is repeatedly added to the starting value until a free space is found, or the entire table is traversed. (In order to traverse the entire table the stepsize should be relatively prime to the arraysize, which is why the array size is often chosen to be a prime number.)

newLocation = (startingValue + stepSize) % arraySize

This algorithm, which is used in open-addressed hash tables, provides good memory caching (if the stepsize is equal to one) through good locality of reference, but also results in clustering: an unfortunately high probability that where there has been one collision there will be more. The performance of linear probing is also more sensitive to input distribution when compared to double hashing, where the stepsize is determined by another hash function applied to the value instead of a fixed stepsize as in linear probing. Given an ordinary hash function H(x), a linear probing function H(x, i) would be:

H(x, i) = (H(x) + i) mod n

Here H(x) is the starting value, n the size of the hash table, and i the probe number; the probes examine consecutive locations, i.e. the stepsize is 1.
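A minimal sketch of a lookup with this probe sequence, assuming non-negative integer keys, a table of prime size SIZE and the sentinel EMPTY for free slots (both names are inventions of this example), with the step size of one that most implementations use for its cache friendliness:

#define SIZE 11                        /* prime table size (assumed) */
#define EMPTY (-1)                     /* sentinel for an unused slot */

/* Return the index holding key, or -1 if key is absent.
   Probes H(x), H(x)+1, H(x)+2, ... modulo SIZE.          */
int linear_probe_search(const int table[SIZE], int key)
{
    int start = key % SIZE;            /* H(x): the starting value */
    for (int i = 0; i < SIZE; i++) {
        int idx = (start + i) % SIZE;
        if (table[idx] == EMPTY)       /* free slot: the key cannot be further on */
            return -1;
        if (table[idx] == key)
            return idx;
    }
    return -1;                         /* entire table traversed */
}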

Dictionary operation in constant time


Using linear probing, dictionary operations can be implemented in constant time. In other words, insert, remove and find operations can be implemented in O(1), as long as the load factor of the hash table is a constant strictly less than one.[2] This analysis makes the (unrealistic) assumption that the hash function is completely random, but can be extended also to 5-independent hash functions.[3] Weaker properties, such as universal hashing, are not strong enough to ensure the constant-time operation of linear probing,[4] but one practical method of hash function generation, tabulation hashing, again leads to guaranteed constant expected-time performance despite not being 5-independent.[5]

References
[1] Dale, Nell (2003). C++ Plus Data Structures. Sudbury, MA: Jones and Bartlett Computer Science. ISBN 0-7637-0481-4.
[2] Knuth, Donald (1963), Notes on "Open" Addressing (http://algo.inria.fr/AofA/Research/11-97.html).
[3] Pagh, Anna; Pagh, Rasmus; Ružić, Milan (2009), "Linear probing with constant independence", SIAM Journal on Computing 39 (3): 1107–1120, doi:10.1137/070702278, MR2538852.
[4] Pătrașcu, Mihai; Thorup, Mikkel (2010), "On the k-independence required by linear probing and minwise independence" (http://people.csail.mit.edu/mip/papers/kwise-lb/kwise-lb.pdf), Automata, Languages and Programming, 37th International Colloquium, ICALP 2010, Bordeaux, France, July 6-10, 2010, Proceedings, Part I, Lecture Notes in Computer Science, 6198, Springer, pp. 715–726, doi:10.1007/978-3-642-14165-2_60.
[5] Pătrașcu, Mihai; Thorup, Mikkel (2011), "The power of simple tabulation hashing", Proceedings of the 43rd annual ACM Symposium on Theory of Computing (STOC '11), pp. 1–10, arXiv:1011.5200, doi:10.1145/1993636.1993638.


External links
How Caching Affects Hashing (http://www.siam.org/meetings/alenex05/papers/13gheileman.pdf) by Gregory L. Heileman and Wenbin Luo, 2005.
Open Data Structures - Section 5.2 - LinearHashTable: Linear Probing (http://opendatastructures.org/versions/edition-0.1e/ods-java/5_2_LinearHashTable_Linear_.html)

Quadratic probing
Quadratic probing is an open addressing scheme in computer programming for resolving collisions in hash tables, i.e. cases when an incoming data item's hash value indicates it should be stored in an already-occupied slot or bucket. Quadratic probing operates by taking the original hash index and adding successive values of an arbitrary quadratic polynomial until an open slot is found. For a given hash value H, the indices generated by linear probing are as follows:

H, H + 1, H + 2, H + 3, H + 4, ...

This method results in primary clustering, and as the cluster grows larger, the search for those items hashing within the cluster becomes less efficient. An example sequence using quadratic probing is:

H, H + 1, H + 4, H + 9, H + 16, ..., i.e. H + 1^2, H + 2^2, H + 3^2, H + 4^2, ...

Quadratic probing can be a more efficient algorithm in a closed hash table, since it better avoids the clustering problem that can occur with linear probing, although it is not immune. It also provides good memory caching because it preserves some locality of reference; however, linear probing has greater locality and, thus, better cache performance. Quadratic probing is used in the Berkeley Fast File System to allocate free blocks. The allocation routine chooses a new cylinder-group when the current is nearly full using quadratic probing, because of the speed it shows in finding unused cylinder-groups.

Quadratic Function
Let h(k) be a hash function that maps an element k to an integer in [0, m-1], where m is the size of the table. Let the ith probe position for a value k be given by the function

h(k, i) = (h(k) + c1·i + c2·i^2) mod m

where c2 ≠ 0. If c2 = 0, then h(k, i) degrades to a linear probe. For a given hash table, the values of c1 and c2 remain constant.

Examples: If h(k, i) = (h(k) + i + i^2) mod m, then the probe sequence will be h(k), h(k) + 2, h(k) + 6, h(k) + 12, ...

For m = 2^n, a good choice for the constants is c1 = c2 = 1/2, as the values of h(k, i) for i in [0, m-1] are all distinct. This leads to the probe sequence h(k), h(k) + 1, h(k) + 3, h(k) + 6, ... (the offsets are the triangular numbers), where the values increase by 1, 2, 3, ... For prime m > 2, most choices of c1 and c2 will make h(k, i) distinct for i in [0, (m-1)/2]. Such choices include c1 = c2 = 1/2, c1 = c2 = 1, and c1 = 0, c2 = 1. Because there are only about m/2 distinct probes for a given element, it is difficult to guarantee that insertions will succeed when the load factor is > 1/2.
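A small sketch of the power-of-two case: with c1 = c2 = 1/2 the offset of the ith probe is the triangular number i(i+1)/2, which needs no fractional arithmetic. The function name and signature are illustrative only:

/* ith probe index for h(k,i) = (h(k) + i/2 + i*i/2) mod m, with m a power of two.
   The offset i*(i+1)/2 is the ith triangular number, so successive probes
   advance by 1, 2, 3, ...                                                      */
unsigned quadratic_probe(unsigned h, unsigned i, unsigned m)
{
    return (h + i * (i + 1) / 2) & (m - 1);    /* m a power of two: mask replaces mod */
}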


Quadratic Probing Insertion


The problem here is to insert a key at an available key space in a given hash table using quadratic probing.[1]

Algorithm to Insert key in Hash Table


1. Get the key k
2. Set counter j = 0
3. Compute hash function h[k] = k % SIZE
4. If hashtable[h[k]] is empty
   (4.1) Insert key k at hashtable[h[k]]
   (4.2) Stop
   Else
   (4.3) The key space at hashtable[h[k]] is occupied, so we need to find the next available key space
   (4.4) Increment j
   (4.5) Compute new hash function h[k] = (k + j * j) % SIZE
   (4.6) Repeat Step 4 till j is more than SIZE of hash table
5. The hash table is full
6. Stop

C function for Key Insertion


int quadratic_probing_insert(int *hashtable, int key, int *empty)
{
    /* hashtable[] is an integer hash table; empty[] is another array which
       indicates whether the key space is occupied.  If an empty key space is
       found, the function returns the index of the bucket where the key is
       inserted, otherwise it returns (-1) if no empty key space is found.    */
    int j = 0, hk;
    hk = key % SIZE;
    while (j < SIZE) {
        if (empty[hk] == 1) {
            hashtable[hk] = key;
            empty[hk] = 0;
            return hk;
        }
        j++;
        hk = (key + j * j) % SIZE;
    }
    return -1;
}


Example to Insert key in Hash Table


There are two possible cases to consider:
Key space at position h[k] is empty: insert the key at that position.
Key space at position h[k] is occupied: compute the next hash function h[k].

Consider a hash table with SIZE = 8 in which slots 2 and 3 are occupied and slot 6 is free.

Suppose we want to insert a key 10 in the hash table.
h[k] = 10 % 8 = 2
Slot 2 being occupied, the hash function will search for a new available key space.
h[k] = (k + j * j) % SIZE = (10 + 1 * 1) % 8 = 3
Slot 3 is also occupied, so the hash function will search for the next available key space.
h[k] = (10 + 2 * 2) % 8 = 6
Slot 6 is empty, so the key will be inserted here.

Quadratic Probing Search


Algorithm to Search Element in Hash Table
1. Get the key k to be searched
2. Set counter j = 0
3. Compute hash function h[k] = k % SIZE
4. If the key space at hashtable[h[k]] is occupied
   (4.1) Compare the element at hashtable[h[k]] with the key k.
   (4.2) If they are equal
       (4.2.1) The key is found at the bucket h[k]
       (4.2.2) Stop
   Else
   (4.3) The element might be placed at the next location given by the quadratic function
   (4.4) Increment j
   (4.5) Compute new hash function h[k] = (k + j * j) % SIZE
   (4.6) Repeat Step 4 till j is greater than SIZE of hash table
5. The key was not found in the hash table
6. Stop

C function for Key Searching


int quadratic_probing_search(int *hashtable, int key, int *empty)
{
    /* If the key is found in the hash table, the function returns the index
       of the hashtable where the key is inserted, otherwise it returns (-1)
       if the key is not found.                                               */
    int j = 0, hk;
    hk = key % SIZE;
    while (j < SIZE) {
        if ((empty[hk] == 0) && (hashtable[hk] == key))
            return hk;
        j++;
        hk = (key + j * j) % SIZE;
    }
    return -1;
}

Limitations
For linear probing it is a bad idea to let the hash table get nearly full, because performance is degraded as the hash table gets filled. In the case of quadratic probing, the situation is even more drastic. With the exception of the triangular number case for a power-of-two-sized hash table, there is no guarantee of finding an empty cell once the table gets more than half full, or even before the table gets half full if the table size is not prime. This is because at most half of the table can be used as alternative locations to resolve collisions.[2]

If the hash table size is b (a prime greater than 3), it can be proven that the first ⌈b/2⌉ alternative locations, including the initial location h(k), are all distinct. Suppose two of the alternative locations are given by (h(k) + x^2) mod b and (h(k) + y^2) mod b, where 0 ≤ x, y ≤ b/2 and x ≠ y. If these two locations pointed to the same key space, then the following would have to be true:

h(k) + x^2 ≡ h(k) + y^2 (mod b)
x^2 ≡ y^2 (mod b)
x^2 - y^2 ≡ 0 (mod b)
(x - y)(x + y) ≡ 0 (mod b)

As b is a prime greater than 3, either (x - y) or (x + y) would have to be divisible by b. Since x and y are distinct, (x - y) is a non-zero value of magnitude less than b, so it cannot be divisible by b. Also, since 0 ≤ x, y ≤ b/2 and x ≠ y, (x + y) lies strictly between 0 and b, so it cannot be divisible by b either. Thus, by contradiction, the first ⌈b/2⌉ alternative locations after h(k) are unique. So an empty key space can always be found as long as at most ⌊b/2⌋ locations are filled, i.e., the hash table is not more than half full.


References
[1] Horowitz, Sahni, Anderson-Freed (2011). Fundamentals of Data Structures in C. University Press. ISBN 978-81-7371-605-8.
[2] Data Structures and Algorithm Analysis in C++. Pearson Education. 2009. ISBN 978-81-317-1474-4.

External links
Tutorial/quadratic probing (http://research.cs.vt.edu/AVresearch/hashing/quadratic.php)

Double hashing
Double hashing is a computer programming technique used in hash tables to resolve hash collisions, cases when two different values to be searched for produce the same hash key. It is a popular collision-resolution technique in open-addressed hash tables. Double hashing is implemented in many popular computer libraries.

Classical applied data structure


Double hashing with open addressing is a classical data structure on a table T. Let n be the number of elements stored in T; then T's load factor is α = n / |T|.

Double hashing approximates uniform open address hashing. That is, start by randomly, uniformly and independently selecting two universal hash functions h1 and h2 to build a double hashing table T. All elements are put in T by double hashing using h1 and h2: given a key k, the (i+1)-st hash location is computed by

h(i, k) = (h1(k) + i·h2(k)) mod |T|

Let T have a fixed load factor α, with 1 > α > 0. Bradford and Katehakis [1] showed that the expected number of probes for an unsuccessful search in T, still using these initially chosen hash functions, is 1/(1 - α), regardless of the distribution of the inputs. Previous results include: Guibas and Szemerédi [2] showed this bound holds for unsuccessful search for load factors up to a fixed constant; Lueker and Molodowitch [3] showed it held assuming ideal randomized hash functions; and Schmidt and Siegel [4] showed it with more realistic (c log n)-wise independent and uniform functions (for a suitable constant c).

Like linear probing, it uses one hash value as a starting point and then repeatedly steps forward an interval until the desired value is located, an empty location is reached, or the entire table has been searched; but this interval is decided using a second, independent hash function (hence the name double hashing). Unlike linear probing and quadratic probing, the interval depends on the data, so that even values mapping to the same location have different bucket sequences; this minimizes repeated collisions and the effects of clustering. In other words, given independent hash functions h1 and h2, the jth location in the bucket sequence for value k in a hash table of size |T| is:

h(j, k) = (h1(k) + j·h2(k)) mod |T|
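A sketch of a lookup built on this probe sequence; SIZE, EMPTY and the two illustrative hash functions are assumptions of the example, with h2 deliberately chosen so that it can never return zero:

#define SIZE 13                             /* prime table size (assumed) */
#define EMPTY (-1)

static unsigned h1(int k) { return (unsigned)k % SIZE; }
static unsigned h2(int k) { return 1 + (unsigned)k % (SIZE - 1); }   /* in [1, SIZE-1] */

/* Return the index of key in table, or -1 if it is not present. */
int double_hash_search(const int table[SIZE], int key)
{
    unsigned start = h1(key), step = h2(key);
    for (unsigned j = 0; j < SIZE; j++) {
        unsigned idx = (start + j * step) % SIZE;   /* (h1(k) + j*h2(k)) mod |T| */
        if (table[idx] == EMPTY)
            return -1;                              /* empty slot ends the probe sequence */
        if (table[idx] == key)
            return (int)idx;
    }
    return -1;
}

Because SIZE is prime and the step is never zero, the probe sequence visits every slot before repeating.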


Disadvantages
Linear probing and, to a lesser extent, quadratic probing are able to take advantage of the data cache by accessing locations that are close together. Double hashing has larger intervals and is not able to achieve this advantage. To avoid this situation, store your data with the second key as the row and your first key as the column. Doing this allows you to iterate on the column, thus preventing cache problems. This also prevents the need to rehash the second key. For instance:

/* pData[hk_2][hk_1] */
int hv_1 = Hash(v);
int hv_2 = Hash2(v);
int original_hash = hv_1;
while (pData[hv_2][hv_1]) {
    hv_1 = hv_1 + 1;
}

Like all other forms of open addressing, double hashing becomes linear as the hash table approaches maximum capacity. The only solution to this is to rehash to a larger size. On top of that, it is possible for the secondary hash function to evaluate to zero. For example, with a secondary hash function such as h2(k) = k mod 5 and the key k = 5, the function evaluates to zero, and the resulting sequence will always remain at the initial hash value. One possible solution is to change the secondary hash function to one that can never be zero, for instance h2(k) = 1 + (k mod 4).

This ensures that the secondary hash function will always be non-zero. Essentially, double hashing is hashing on an already hashed key.

Notes
[1] P. G. Bradford and M. Katehakis: A Probabilistic Study on Combinatorial Expanders and Hashing, SIAM Journal on Computing 2007 (37:1), 83–111. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.91.2647
[2] L. Guibas and E. Szemerédi: The Analysis of Double Hashing, Journal of Computer and System Sciences, 1978, 16, 226–274.
[3] G. S. Lueker and M. Molodowitch: More Analysis of Double Hashing, Combinatorica, 1993, 13(1), 83–96.
[4] J. P. Schmidt and A. Siegel: Double Hashing is Computable and Randomizable with Universal Hash Functions, manuscript.

External links
How Caching Affects Hashing (http://www.siam.org/meetings/alenex05/papers/13gheileman.pdf) by Gregory L. Heileman and Wenbin Luo, 2005.
Hash Table Animation (http://www.cs.pitt.edu/~kirk/cs1501/animations/Hashing.html)


Cuckoo hashing
Cuckoo hashing is a scheme in computer programming for resolving hash collisions of values of hash functions in a table. The name derives from the behavior of some species of cuckoo, where the cuckoo chick pushes the other eggs or young out of the nest when it hatches.

History
Cuckoo hashing was first described by Rasmus Pagh and Flemming Friche Rodler in 2001.[1]

Theory
The basic idea is to use two hash functions instead of only one. This provides two possible locations in the hash table for each key. In one of the commonly used variants of the algorithm, the hash table is split into two smaller tables of equal size, and each hash function provides an index into one of these two tables.

When a new key is inserted, a greedy algorithm is used: the new key is inserted in one of its two possible locations, "kicking out", that is, displacing, any key that might already reside in this location. This displaced key is then inserted in its alternative location, again kicking out any key that might reside there, until a vacant position is found, or the procedure enters an infinite loop. In the latter case, the hash table is rebuilt in-place using new hash functions: "There is no need to allocate new tables for the rehashing: We may simply run through the tables to delete and perform the usual insertion procedure on all keys found not to be at their intended position in the table." (Pagh & Rodler, "Cuckoo Hashing"[1])

Lookup requires inspection of just two locations in the hash table, which takes constant time in the worst case (see Big O notation). This is in contrast to many other hash table algorithms, which may not have a constant worst-case bound on the time to do a lookup.

Cuckoo hashing example. The arrows show the alternative location of each key. A new item would be inserted in the location of A by moving A to its alternative location, currently occupied by B, and moving B to its alternative location which is currently vacant. Insertion of a new item in the location of H would not succeed: Since H is part of a cycle (together with W), the new item would get kicked out again.

It can also be shown that insertions succeed in expected constant time,[1] even considering the possibility of having to rebuild the table, as long as the number of keys is kept below half of the capacity of the hash table, i.e., the load factor is below 50%. One method of proving this uses the theory of random graphs: one may form an undirected graph called the "Cuckoo Graph" that has a vertex for each hash table location, and an edge for each hashed value, with the endpoints of the edge being the two possible locations of the value. Then, the greedy insertion algorithm for adding a set of values to a cuckoo hash table succeeds if and only if the Cuckoo Graph for this set of values is a pseudoforest, a graph with at most one cycle in each of its connected components. This property is true with high probability for a random graph in which the number of edges is less than half the number of vertices.[2]
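The greedy insertion loop can be sketched as follows for the two-table variant; the table size, the two modulus-based hash functions (the same pair used in the example below) and the displacement bound MAX_KICKS, which stands in for proper cycle detection, are all assumptions of this sketch:

#include <stdbool.h>

#define SLOTS 11                 /* size of each of the two tables (assumed) */
#define EMPTY (-1)               /* both tables must be initialised to EMPTY */
#define MAX_KICKS 32             /* displacement bound in place of cycle detection */

static int t1[SLOTS], t2[SLOTS];

static unsigned h1(int k) { return (unsigned)k % SLOTS; }
static unsigned h2(int k) { return ((unsigned)k / SLOTS) % SLOTS; }

/* Insert key; returns false when a probable cycle is hit, in which case a
   real implementation would pick new hash functions and rehash in place.   */
bool cuckoo_insert(int key)
{
    for (int kicks = 0; kicks < MAX_KICKS; kicks++) {
        int evicted = t1[h1(key)];        /* place key in table 1 ...            */
        t1[h1(key)] = key;
        if (evicted == EMPTY) return true;
        key = evicted;                    /* ... kicking out whoever lived there */

        evicted = t2[h2(key)];            /* the displaced key goes to table 2   */
        t2[h2(key)] = key;
        if (evicted == EMPTY) return true;
        key = evicted;
    }
    return false;                         /* too many displacements: rehash needed */
}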


Example
The following hash functions are given (each table has 11 positions):

h(k) = k mod 11
h'(k) = ⌊k / 11⌋ mod 11

k       20  50  53  75  100  67  105   3  36  39
h(k)     9   6   9   9    1   1    6   3   3   6
h'(k)    1   4   4   6    9   6    9   0   3   3

(Tables omitted: "1. table for h(k)" and "2. table for h'(k)", positions 0 to 10, showing the contents of the two tables after each key is inserted in the order 20, 50, 53, 75, 100, 67, 105, 3, 36, 39, with keys being kicked back and forth between the two tables as collisions occur.)

Cycle
If you now wish to insert the element 6, then you get into a cycle. In the last row of the table we find the same initial situation as at the beginning again.

(Table omitted: the sequence of displacements triggered by inserting 6, listing for each step the considered key and the old and new values in table 1 and table 2, until the initial configuration repeats.)


Generalizations and applications


Generalizations of cuckoo hashing that use more than 2 alternative hash functions can be expected to utilize a larger part of the capacity of the hash table efficiently while sacrificing some lookup and insertion speed. Using just three hash functions increases the load to 91%. Another generalization of cuckoo hashing consists in using more than one key per bucket. Using just 2 keys per bucket permits a load factor above 80%. Other algorithms that use multiple hash functions include the Bloom filter. Cuckoo hashing can be used to implement a data structure equivalent to a Bloom filter. A simplified generalization of cuckoo hashing called skewed-associative cache is used in some CPU caches. A study by Zukowski et al.[3] has shown that cuckoo hashing is much faster than chained hashing for small, cache-resident hash tables on modern processors. Kenneth Ross[4] has shown bucketized versions of cuckoo hashing (variants that use buckets that contain more than one key) to be faster than conventional methods also for large hash tables, when space utilization is high. The performance of the bucketized cuckoo hash table was investigated further by Askitis,[5] with its performance compared against alternative hashing schemes. A survey by Mitzenmacher[6] presents open problems related to cuckoo hashing as of 2009.

References
[1] Pagh, Rasmus; Rodler, Flemming Friche (2001). "Cuckoo Hashing" (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.25.4189). Algorithms - ESA 2001. Lecture Notes in Computer Science. 2161. pp. 121–133. doi:10.1007/3-540-44676-1_10. ISBN 978-3-540-42493-2.
[2] Kutzelnigg, Reinhard (2006). "Fourth Colloquium on Mathematics and Computer Science" (http://www.dmtcs.org/dmtcs-ojs/index.php/proceedings/article/viewFile/590/1710). Discrete Mathematics and Theoretical Computer Science. pp. 403–406.
[3] Zukowski, Marcin; Heman, Sandor; Boncz, Peter (June 2006). Architecture-Conscious Hashing (http://www.cs.cmu.edu/~damon2006/pdf/zukowski06archconscioushashing.pdf) (PDF). Proceedings of the International Workshop on Data Management on New Hardware (DaMoN). Retrieved 2008-10-16.
[4] Ross, Kenneth (2006-11-08). Efficient Hash Probes on Modern Processors (http://domino.research.ibm.com/library/cyberdig.nsf/papers/DF54E3545C82E8A585257222006FD9A2/$File/rc24100.pdf) (PDF). IBM Research Report RC24100. Retrieved 2008-10-16.
[5] Askitis, Nikolas (2009). Fast and Compact Hash Tables for Integer Keys (http://crpit.com/confpapers/CRPITV91Askitis.pdf). Vol. 91. pp. 113–122. ISBN 978-1-920682-72-9.
[6] Mitzenmacher, Michael (2009-09-09). Some Open Questions Related to Cuckoo Hashing | Proceedings of ESA 2009 (http://www.eecs.harvard.edu/~michaelm/postscripts/esa2009.pdf) (PDF). Retrieved 2010-11-10.

A cool and practical alternative to traditional hash tables (http://www.ru.is/faculty/ulfar/CuckooHash.pdf), U. Erlingsson, M. Manasse, F. Mcsherry, 2006.
Cuckoo Hashing for Undergraduates, 2006 (http://www.it-c.dk/people/pagh/papers/cuckoo-undergrad.pdf), R. Pagh, 2006.
Cuckoo Hashing, Theory and Practice: Part 1 (http://mybiasedcoin.blogspot.com/2007/06/cuckoo-hashing-theory-and-practice-part.html), Part 2 (http://mybiasedcoin.blogspot.com/2007/06/cuckoo-hashing-theory-and-practice-part_15.html) and Part 3 (http://mybiasedcoin.blogspot.com/2007/06/cuckoo-hashing-theory-and-practice-part_19.html), Michael Mitzenmacher, 2007.
Naor, Moni; Segev, Gil; Wieder, Udi (2008). "History-Independent Cuckoo Hashing" (http://www.wisdom.weizmann.ac.il/~naor/PAPERS/cuckoo_hi_abs.html). International Colloquium on Automata, Languages and Programming (ICALP). Reykjavik, Iceland. Retrieved 2008-07-21.


External links
Cuckoo hash map written in C++ (http://sourceforge.net/projects/cuckoo-cpp/)
Static cuckoo hashtable generator for C/C++ (http://www.theiling.de/projects/lookuptable.html)
Cuckoo hashtable written in Java (http://lmonson.com/blog/?p=100)
Generic Cuckoo hashmap in Java (http://github.com/joacima/Cuckoo-hash-map/blob/master/CuckooHashMap.java)
Cuckoo hash table written in Haskell (http://hackage.haskell.org/packages/archive/hashtables/latest/doc/html/Data-HashTable-ST-Cuckoo.html)

Hopscotch hashing
Hopscotch hashing is a scheme in computer programming for resolving hash collisions of values of hash functions in a table using open addressing. It is also well suited for implementing a concurrent hash table. Hopscotch hashing was introduced by Maurice Herlihy, Nir Shavit and Moran Tzafrir in 2008. [1] The name is derived from the sequence of hops that characterize the table's insertion algorithm. The algorithm uses a single array of n buckets. For each bucket, its neighborhood is a small collection of nearby consecutive buckets (i.e. one with close indexes to the original hashed bucket). The desired property of the neighborhood is that the cost of finding an item in the buckets of the neighborhood is close to the cost of finding it in the bucket itself (for example, by having buckets in the neighborhood fall within the same cache line). The size of the neighborhood must be sufficient to accommodate a logarithmic number of items in the worst case (i.e. it must accommodate log(n) items), but only a constant number on average. If some bucket's neighborhood is filled, the table is resized.

In hopscotch hashing, as in cuckoo hashing, and unlike in linear probing, a given item will always be inserted-into and found-in the neighborhood of its hashed bucket. In other words, it will always be found either in its original hashed array entry, or in one of the next H-1 neighboring entries. H could, for example, be 32, the standard machine word size. The neighborhood is thus a "virtual" bucket that has fixed size and overlaps with the next H-1 buckets. To speed the search, each bucket (array entry) includes a "hop-information" word, an H-bit bitmap that indicates which of the next H-1 entries contain items that hashed to the current entry's virtual bucket. In this way, an item can be found quickly by looking at the word to see which entries belong to the bucket, and then scanning through the constant number of entries (most modern

Hopscotch hashing. Here, H is 4. In part (a), the item x is added with a hash value of 6. A linear probe finds that entry 13 is empty. Because 13 is more than 4 entries away from 6, the algorithm looks for an earlier entry to swap with 13. The first place to look in is H-1 = 3 entries before, at entry 10. That entry's hop information bit-map indicates that d, the item at entry 11, can be displaced to 13. After displacing d, Entry 11 is still too far from entry 6, so the algorithm examines entry 8. The hop information bit-map indicates that item c at entry 9 can be moved to entry 11. Finally, a is moved to entry 9. Part (b) shows the table state just after adding x.

processors support special bit manipulation operations that make the lookup in the "hop-information" bitmap very fast).

Here is how to add item x which was hashed to bucket i:
1. If the entry i is empty, add x to i and return.
2. Starting at entry i, use a linear probe to find an empty entry at index j.
3. If the empty entry's index j is within H-1 of entry i, place x there and return. Otherwise, entry j is too far from i. To create an empty entry closer to i, find an item y whose hash value lies between i and j, but within H-1 of j. Displacing y to j creates a new empty slot closer to i. Repeat until the empty entry is within H-1 of entry i, then place x there and return. If no such item y exists, or if the bucket i already contains H items, resize and rehash the table.

The idea is that hopscotch hashing "moves the empty slot towards the desired bucket". This distinguishes it from linear probing, which leaves the empty slot where it was found, possibly far away from the original bucket, and from cuckoo hashing which, in order to create a free bucket, moves an item out of one of the desired buckets in the target arrays, and only then tries to find the displaced item a new place.

To remove an item from the table, one simply removes it from the table entry. If the neighborhood buckets are cache aligned, then one could apply a reorganization operation in which items are moved into the now vacant location in order to improve alignment.

One advantage of hopscotch hashing is that it provides good performance at very high table load factors, even ones exceeding 0.9. Part of this efficiency is due to using a linear probe only to find an empty slot during insertion, not for every lookup as in the original linear probing hash table algorithm. Another advantage is that one can use any hash function, in particular simple ones that are close-to-universal.
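A sketch of a lookup driven by the hop-information word, with H = 32 as in the text; the bucket layout and the use of the GCC/Clang builtin __builtin_ctz to find the lowest set bit are assumptions of this example:

#include <stdint.h>

#define H 32                              /* neighborhood size = one machine word */

struct bucket {
    int      key;                         /* key stored in this array entry        */
    uint32_t hop_info;                    /* bit j set: entry i+j holds a key that
                                             hashed to bucket i                    */
};

/* Return the table index of key, or -1 if it is absent.  Only entries marked
   in the hop-information word of the hashed bucket are examined.              */
int hopscotch_find(const struct bucket *table, unsigned table_size,
                   unsigned hashed_bucket, int key)
{
    uint32_t hop = table[hashed_bucket].hop_info;
    while (hop != 0) {
        unsigned j   = (unsigned)__builtin_ctz(hop);       /* lowest set bit */
        unsigned idx = (hashed_bucket + j) % table_size;
        if (table[idx].key == key)
            return (int)idx;
        hop &= hop - 1;                                     /* clear that bit */
    }
    return -1;
}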


References
[1] Herlihy, Maurice; Shavit, Nir; Tzafrir, Moran (2008). "Hopscotch Hashing". DISC '08: Proceedings of the 22nd International Symposium on Distributed Computing. Arcachon, France: Springer-Verlag. pp. 350–364.


Hash function
A hash function is any algorithm or subroutine that maps large data sets of variable length, called keys, to smaller data sets of a fixed length. For example, a person's name, having a variable length, could be hashed to a single integer. The values returned by a hash function are called hash values, hash codes, hash sums, checksums or simply hashes.

Descriptions
Hash functions are mostly used to accelerate table lookup or data comparison tasks such as finding items in a database, detecting duplicated or similar records in a large file, finding similar stretches in DNA sequences, and so on.

A hash function that maps names to integers from 0 to 15. There is a collision between keys "John Smith" and "Sandra Dee".

A hash function should be referentially transparent (stable), i.e., if called twice on input that is "equal" (for example, strings that consist of the same sequence of characters), it should give the same result. This is a contract in many programming languages that allow the user to override equality and hash functions for an object: if two objects are equal, their hash codes must be the same. This is crucial to finding an element in a hash table quickly, because two copies of the same element would both hash to the same slot.

Some hash functions may map two or more keys to the same hash value, causing a collision. Such hash functions try to map the keys to the hash values as evenly as possible, because collisions become more frequent as hash tables fill up. Thus, the number of entries in a hash table is frequently restricted to about 80% of the table size. Depending on the collision-resolution algorithm used (such as double hashing or linear probing), other properties may be required as well. Although the idea was conceived in the 1950s,[1] the design of good hash functions is still a topic of active research.

Hash functions are related to (and often confused with) checksums, check digits, fingerprints, randomization functions, error correcting codes, and cryptographic hash functions. Although these concepts overlap to some extent, each has its own uses and requirements and is designed and optimized differently. The HashKeeper database maintained by the American National Drug Intelligence Center, for instance, is more aptly described as a catalog of file fingerprints than of hash values.

Hash tables
Hash functions are primarily used in hash tables, to quickly locate a data record (for example, a dictionary definition) given its search key (the headword). Specifically, the hash function is used to map the search key to an index; the index gives the place in the hash table where the corresponding record should be stored. Hash tables, in turn, are used to implement associative arrays and dynamic sets.

In general, a hashing function may map several different keys to the same index. Therefore, each slot of a hash table is associated with (implicitly or explicitly) a set of records, rather than a single record. For this reason, each slot of a hash table is often called a bucket, and hash values are also called bucket indices. Thus, the hash function only hints at the record's location: it tells where one should start looking for it. Still, in a half-full table, a good hash function will typically narrow the search down to only one or two entries.
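In code, the bucket idea amounts to reducing the hash value modulo the number of slots and then scanning the (usually very short) set of records stored there. The record layout and the injected hash function below are illustrative only, not a particular library's API:

#include <stddef.h>
#include <string.h>

struct record {                 /* one record in a bucket's chain */
    const char    *key;
    const char    *value;
    struct record *next;
};

/* Look up key in a chained hash table with nbuckets slots; hash() is
   whatever hash function the table was built with (assumed).          */
const char *table_lookup(struct record *const *buckets, size_t nbuckets,
                         unsigned long (*hash)(const char *), const char *key)
{
    struct record *r = buckets[hash(key) % nbuckets];   /* pick the bucket  */
    for (; r != NULL; r = r->next)                       /* scan its records */
        if (strcmp(r->key, key) == 0)
            return r->value;
    return NULL;                                         /* not present      */
}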


Caches
Hash functions are also used to build caches for large data sets stored in slow media. A cache is generally simpler than a hashed search table, since any collision can be resolved by discarding or writing back the older of the two colliding items. This is also used in file comparison.

Bloom filters
Hash functions are an essential ingredient of the Bloom filter, a compact data structure that provides an enclosing approximation to a set of keys.

Finding duplicate records


When storing records in a large unsorted file, one may use a hash function to map each record to an index into a table T, and collect in each bucket T[i] a list of the numbers of all records with the same hash value i. Once the table is complete, any two duplicate records will end up in the same bucket. The duplicates can then be found by scanning every bucket T[i] which contains two or more members, fetching those records, and comparing them. With a table of appropriate size, this method is likely to be much faster than any alternative approach (such as sorting the file and comparing all consecutive pairs).

Finding similar records


Hash functions can also be used to locate table records whose key is similar, but not identical, to a given key; or pairs of records in a large file which have similar keys. For that purpose, one needs a hash function that maps similar keys to hash values that differ by at most m, where m is a small integer (say, 1 or 2). If one builds a table T of all record numbers, using such a hash function, then similar records will end up in the same bucket, or in nearby buckets. Then one need only check the records in each bucket T[i] against those in buckets T[i+k] where k ranges between -m and m.

This class includes the so-called acoustic fingerprint algorithms, which are used to locate similar-sounding entries in large collections of audio files. For this application, the hash function must be as insensitive as possible to data capture or transmission errors, and to "trivial" changes such as timing and volume changes, compression, etc.[2]

Finding similar substrings


The same techniques can be used to find equal or similar stretches in a large collection of strings, such as a document repository or a genomic database. In this case, the input strings are broken into many small pieces, and a hash function is used to detect potentially equal pieces, as above. The RabinKarp algorithm is a relatively fast string searching algorithm that works in O(n) time on average. It is based on the use of hashing to compare strings.

Geometric hashing
This principle is widely used in computer graphics, computational geometry and many other disciplines, to solve many proximity problems in the plane or in three-dimensional space, such as finding closest pairs in a set of points, similar shapes in a list of shapes, similar images in an image database, and so on. In these applications, the set of all inputs is some sort of metric space, and the hashing function can be interpreted as a partition of that space into a grid of cells. The table is often an array with two or more indices (called a grid file, grid index, bucket grid, and similar names), and the hash function returns an index tuple. This special case of hashing is known as geometric hashing or the grid method. Geometric hashing is also used in telecommunications (usually under the name vector quantization) to encode and compress multi-dimensional signals.


Properties
Good hash functions, in the original sense of the term, are usually required to satisfy certain properties listed below. Note that different requirements apply to the other related concepts (cryptographic hash functions, checksums, etc.).

Determinism
A hash procedure must be deterministicmeaning that for a given input value it must always generate the same hash value. In other words, it must be a function of the data to be hashed, in the mathematical sense of the term. This requirement excludes hash functions that depend on external variable parameters, such as pseudo-random number generators or the time of day. It also excludes functions that depend on the memory address of the object being hashed, because that address may change during execution (as may happen on systems that use certain methods of garbage collection), although sometimes rehashing of the item is possible.

Uniformity
A good hash function should map the expected inputs as evenly as possible over its output range. That is, every hash value in the output range should be generated with roughly the same probability. The reason for this last requirement is that the cost of hashing-based methods goes up sharply as the number of collisions (pairs of inputs that are mapped to the same hash value) increases. Basically, if some hash values are more likely to occur than others, a larger fraction of the lookup operations will have to search through a larger set of colliding table entries. Note that this criterion only requires the value to be uniformly distributed, not random in any sense. A good randomizing function is (barring computational efficiency concerns) generally a good choice as a hash function, but the converse need not be true.

Hash tables often contain only a small subset of the valid inputs. For instance, a club membership list may contain only a hundred or so member names, out of the very large set of all possible names. In these cases, the uniformity criterion should hold for almost all typical subsets of entries that may be found in the table, not just for the global set of all possible entries. In other words, if a typical set of m records is hashed to n table slots, the probability of a bucket receiving many more than m/n records should be vanishingly small. In particular, if m is less than n, very few buckets should have more than one or two records. (In an ideal "perfect hash function", no bucket should have more than one record; but a small number of collisions is virtually inevitable, even if n is much larger than m; see the birthday paradox.)

When testing a hash function, the uniformity of the distribution of hash values can be evaluated by the chi-squared test.

Variable range
In many applications, the range of hash values may be different for each run of the program, or may change along the same run (for instance, when a hash table needs to be expanded). In those situations, one needs a hash function which takes two parameters: the input data z, and the number n of allowed hash values. A common solution is to compute a fixed hash function with a very large range (say, 0 to 2^32 - 1), divide the result by n, and use the division's remainder. If n is itself a power of 2, this can be done by bit masking and bit shifting. When this approach is used, the hash function must be chosen so that the result has fairly uniform distribution between 0 and n - 1, for any value of n that may occur in the application. Depending on the function, the remainder may be uniform only for certain values of n, e.g. odd or prime numbers.

We can allow the table size n to not be a power of 2 and still not have to perform any remainder or division operation, as these computations are sometimes costly. For example, let n be significantly less than 2^b. Consider a pseudo-random number generator (PRNG) function P(key) that is uniform on the interval [0, 2^b - 1]. A hash function uniform on the interval [0, n - 1] is ⌊n·P(key) / 2^b⌋. We can replace the division by a (possibly faster) right bit shift: ⌊n·P(key) / 2^b⌋ = (n·P(key)) >> b.
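Both reductions can be written in a few lines; here b = 32 and the function names are assumptions of this sketch, with p standing for any hash or PRNG value that is roughly uniform on 32-bit words:

#include <stdint.h>

/* Reduce a 32-bit hash to the range [0, n-1] by remainder. */
unsigned reduce_mod(uint32_t p, unsigned n)
{
    return p % n;
}

/* Same range, but computed as floor(n * p / 2^32): the widening multiply and
   right shift replace the division, as described above.                      */
unsigned reduce_scale(uint32_t p, unsigned n)
{
    return (unsigned)(((uint64_t)p * n) >> 32);
}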


Variable range with minimal movement (dynamic hash function)


When the hash function is used to store values in a hash table that outlives the run of the program, and the hash table needs to be expanded or shrunk, the hash table is referred to as a dynamic hash table. A hash function that will relocate the minimum number of records when the table is resized is desirable. What is needed is a hash function H(z,n) where z is the key being hashed and n is the number of allowed hash values such that H(z,n+1) = H(z,n) with probability close to n/(n+1). Linear hashing and spiral storage are examples of dynamic hash functions that execute in constant time but relax the property of uniformity to achieve the minimal movement property. Extendible hashing uses a dynamic hash function that requires space proportional to n to compute the hash function, and it becomes a function of the previous keys that have been inserted. Several algorithms that preserve the uniformity property but require time proportional to n to compute the value of H(z,n) have been invented.

Data normalization
In some applications, the input data may contain features that are irrelevant for comparison purposes. For example, when looking up a personal name, it may be desirable to ignore the distinction between upper and lower case letters. For such data, one must use a hash function that is compatible with the data equivalence criterion being used: that is, any two inputs that are considered equivalent must yield the same hash value. This can be accomplished by normalizing the input before hashing it, as by upper-casing all letters.
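A sketch of normalising before hashing, here by upper-casing ASCII letters so that names differing only in case hash identically; hash_bytes stands for whatever hash function the table actually uses and is passed in as a parameter of this illustrative helper:

#include <ctype.h>
#include <string.h>

/* Hash a name case-insensitively by normalising it first. */
unsigned long hash_name_ci(const char *name,
                           unsigned long (*hash_bytes)(const char *, size_t))
{
    char buf[256];
    size_t n = strlen(name);
    if (n >= sizeof buf)
        n = sizeof buf - 1;                              /* sketch: truncate very long names */
    for (size_t i = 0; i < n; i++)
        buf[i] = (char)toupper((unsigned char)name[i]);  /* equivalent inputs become equal   */
    buf[n] = '\0';
    return hash_bytes(buf, n);
}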

Continuity
A hash function that is used to search for similar (as opposed to equivalent) data must be as continuous as possible; two inputs that differ by a little should be mapped to equal or nearly equal hash values. Note that continuity is usually considered a fatal flaw for checksums, cryptographic hash functions, and other related concepts. Continuity is desirable for hash functions only in some applications, such as hash tables that use linear search.

Hash function algorithms


For most types of hashing functions the choice of the function depends strongly on the nature of the input data, and their probability distribution in the intended application.

Trivial hash function


If the datum to be hashed is small enough, one can use the datum itself (reinterpreted as an integer in binary notation) as the hashed value. The cost of computing this "trivial" (identity) hash function is effectively zero. This hash function is perfect, as it maps each input to a distinct hash value. The meaning of "small enough" depends on the size of the type that is used as the hashed value. For example, in Java, the hash code is a 32-bit integer. Thus the 32-bit integer Integer and 32-bit floating-point Float objects can simply use the value directly; whereas the 64-bit integer Long and 64-bit floating-point Double cannot use this method. Other types of data can also use this perfect hashing scheme. For example, when mapping character strings between upper and lower case, one can use the binary encoding of each character, interpreted as an integer, to index a table that gives the alternative form of that character ("A" for "a", "8" for "8", etc.). If each character is stored in 8 bits (as

in ASCII or ISO Latin 1), the table has only 2^8 = 256 entries; in the case of Unicode characters, the table would have 17·2^16 = 1114112 entries. The same technique can be used to map two-letter country codes like "us" or "za" to country names (26^2 = 676 table entries), 5-digit zip codes like 13083 to city names (100000 entries), etc. Invalid data values (such as the country code "xx" or the zip code 00000) may be left undefined in the table, or mapped to some appropriate "null" value.


Perfect hashing
A hash function that is injectivethat is, maps each valid input to a different hash valueis said to be perfect. With such a function one can directly locate the desired entry in a hash table, without any additional searching.

A perfect hash function for the four names shown

Minimal perfect hashing


A perfect hash function for n keys is said to be minimal if its range consists of n consecutive integers, usually from 0 to n1. Besides providing single-step lookup, a minimal perfect hash function also yields a compact hash table, without any vacant slots. Minimal perfect hash functions are much harder to find than perfect ones with a wider range.

Hashing uniformly distributed data


A minimal perfect hash function for the four names shown

If the inputs are bounded-length strings (such as telephone numbers, car license plates, invoice numbers, etc.), and each input may independently occur with uniform probability, then a hash function need only map roughly the same number of inputs to each hash value. For instance, suppose that each input is an integer z in the range 0 to N-1, and the output must be an integer h in the range 0 to n-1, where N is much larger than n. Then the hash function could be h = z mod n (the remainder of z divided by n), or h = ⌊z·n / N⌋ (the value z scaled down by n/N and truncated to an integer), or many other formulas.

Warning: h = z mod n was used in many of the original random number generators, but was found to have a number of issues. One of these is that as n approaches N, this function becomes less and less uniform.


Hashing data with other distributions


These simple formulas will not do if the input values are not equally likely, or are not independent. For instance, most patrons of a supermarket will live in the same geographic area, so their telephone numbers are likely to begin with the same 3 to 4 digits. In that case, if n is 10000 or so, the division formula ⌊z·n / N⌋, which depends mainly on the leading digits, will generate a lot of collisions; whereas the remainder formula z mod n, which is quite sensitive to the trailing digits, may still yield a fairly even distribution.

Hashing variable-length data


When the data values are long (or variable-length) character strings, such as personal names, web page addresses, or mail messages, their distribution is usually very uneven, with complicated dependencies. For example, text in any natural language has highly non-uniform distributions of characters, and character pairs, very characteristic of the language. For such data, it is prudent to use a hash function that depends on all characters of the string, and depends on each character in a different way. In cryptographic hash functions, a Merkle–Damgård construction is usually used. In general, the scheme for hashing such data is to break the input into a sequence of small units (bits, bytes, words, etc.) and combine all the units b[1], b[2], ..., b[m] sequentially, as follows:

    S ← S0;                      // Initialize the state.
    for k in 1, 2, ..., m do     // Scan the input data units:
        S ← F(S, b[k]);          //   Combine data unit k into the state.
    return G(S, n)               // Extract the hash value from the state.

This schema is also used in many text checksum and fingerprint algorithms. The state variable S may be a 32- or 64-bit unsigned integer; in that case, S0 can be 0, and G(S, n) can be just S mod n. The best choice of F is a complex issue and depends on the nature of the data. If the units b[k] are single bits, then F(S, b) could be, for instance

    if highbit(S) = 0 then return 2 * S + b
    else return (2 * S + b) ^ P

Here highbit(S) denotes the most significant bit of S; the '*' operator denotes unsigned integer multiplication with lost overflow; '^' is the bitwise exclusive or operation applied to words; and P is a suitable fixed word.[3]
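The schema above can be made concrete in a few lines of C. The following is a minimal sketch under stated assumptions: the input units are single bits, the state S is a 64-bit unsigned integer with S0 = 0, F is the shift-and-xor step just described, G(S, n) = S mod n, and the constant P is an arbitrary illustrative value rather than a recommended one.

    #include <stdint.h>
    #include <stddef.h>

    #define P 0x9E3779B97F4A7C15ULL       /* arbitrary fixed word, for illustration only */

    /* F(S, b): combine one input bit b into the state S. */
    static uint64_t F(uint64_t S, unsigned b) {
        uint64_t high = S >> 63;          /* highbit(S) */
        uint64_t next = (S << 1) | b;     /* 2*S + b, with the overflowed bit discarded */
        return high ? next ^ P : next;
    }

    /* Hash a byte string by feeding its bits through F, then extract with G(S, n) = S mod n. */
    uint64_t hash_bits(const unsigned char *data, size_t nbytes, uint64_t n) {
        uint64_t S = 0;                   /* S <- S0 */
        for (size_t i = 0; i < nbytes; i++)
            for (int j = 7; j >= 0; j--)  /* scan the input data units (bits) */
                S = F(S, (data[i] >> j) & 1u);
        return S % n;                     /* G(S, n): extract the hash value from the state */
    }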

Special-purpose hash functions


In many cases, one can design a special-purpose (heuristic) hash function that yields many fewer collisions than a good general-purpose hash function. For example, suppose that the input data are file names such as FILE0000.CHK, FILE0001.CHK, FILE0002.CHK, etc., with mostly sequential numbers. For such data, a function that extracts the numeric part k of the file name and returns k mod n would be nearly optimal. Needless to say, a function that is exceptionally good for a specific kind of data may have dismal performance on data with different distribution.
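A sketch of such a heuristic in C, assuming file names of exactly the form FILEnnnn.CHK; the format string and the fallback value are illustrative choices, not part of any particular system.

    #include <stdio.h>

    /* Extract the numeric part k of a name like "FILE0002.CHK" and return k mod n. */
    unsigned filename_hash(const char *name, unsigned n) {
        unsigned k = 0;
        if (sscanf(name, "FILE%u.CHK", &k) != 1)
            return 0;                /* fall back for names that do not match the pattern */
        return k % n;
    }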

Rolling hash
In some applications, such as substring search, one must compute a hash function h for every k-character substring of a given n-character string t, where k is a fixed integer and n is greater than k. The straightforward solution, which is to extract every such substring s of t and compute h(s) separately, requires a number of operations proportional to k·n. However, with the proper choice of h, one can use the technique of rolling hash to compute all those hashes with an effort proportional to k + n.
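As an illustration of the idea (one common choice of h, not the only possible one), the following C sketch computes a polynomial rolling hash of every k-character window; the base B is an arbitrary constant, arithmetic is modulo 2^64 via unsigned overflow, and k ≤ n is assumed.

    #include <stdint.h>
    #include <stddef.h>

    #define B 1000003ULL                      /* arbitrary base, for illustration only */

    /* Writes the hashes of all n-k+1 substrings of length k into out[]. */
    void rolling_hashes(const unsigned char *t, size_t n, size_t k, uint64_t *out) {
        uint64_t h = 0, Bk = 1;               /* Bk will hold B^k (mod 2^64) */
        for (size_t i = 0; i < k; i++) {
            h = h * B + t[i];                 /* hash of the first window */
            Bk *= B;
        }
        out[0] = h;
        for (size_t i = k; i < n; i++) {
            h = h * B + t[i] - Bk * t[i - k]; /* slide the window by one character */
            out[i - k + 1] = h;
        }
    }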


Universal hashing
A universal hashing scheme is a randomized algorithm that selects a hashing function h among a family of such functions, in such a way that the probability of a collision of any two distinct keys is 1/n, where n is the number of distinct hash values desired, independently of the two keys. Universal hashing ensures (in a probabilistic sense) that the hash function application will behave as well as if it were using a random function, for any distribution of the input data. It will however have more collisions than perfect hashing, and may require more operations than a special-purpose hash function.

Hashing with checksum functions


One can adapt certain checksum or fingerprinting algorithms for use as hash functions. Some of those algorithms will map arbitrarily long string data z, with any typical real-world distribution, no matter how non-uniform and dependent, to a 32-bit or 64-bit string, from which one can extract a hash value in 0 through n−1. This method may produce a sufficiently uniform distribution of hash values, as long as the hash range size n is small compared to the range of the checksum or fingerprint function. However, some checksums fare poorly in the avalanche test, which may be a concern in some applications. In particular, the popular CRC32 checksum provides only 16 bits (the higher half of the result) that are usable for hashing. Moreover, each bit of the input has a deterministic effect on each bit of the CRC32, that is, one can tell, without looking at the rest of the input, which bits of the output will flip if the input bit is flipped; so care must be taken to use all 32 bits when computing the hash from the checksum.[4]

Hashing with cryptographic hash functions


Some cryptographic hash functions, such as SHA-1, have even stronger uniformity guarantees than checksums or fingerprints, and thus can provide very good general-purpose hashing functions. In ordinary applications, this advantage may be too small to offset their much higher cost.[5] However, this method can provide uniformly distributed hashes even when the keys are chosen by a malicious agent. This feature may help protect services against denial of service attacks.

Hashing By Nonlinear Table Lookup


Tables of random numbers (for example, 256 random 32-bit integers) can provide high-quality non-linear functions to be used as hash functions or for other purposes such as cryptography. The key to be hashed is split into 8-bit (one-byte) parts, and each part is used as an index into the table of random values. The table values are then combined into the hash output value by arithmetic addition or by XOR. Because the table is just 1024 bytes in size, it will fit into the cache of modern microprocessors and allow very fast execution of the hashing algorithm. As the table value is on average much longer than 8 bits, one bit of input will affect nearly all output bits. This differs from multiplicative hash functions, where higher-value input bits do not affect lower-value output bits. This algorithm has proven to be very fast and of high quality for hashing purposes (especially hashing of integer keys).
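A minimal C sketch of this scheme for 32-bit keys, assuming a single 256-entry table of random 32-bit values combined by XOR; rand() is only a stand-in for a proper random source and does not fill all 32 bits well.

    #include <stdint.h>
    #include <stdlib.h>

    static uint32_t table[256];          /* 256 random 32-bit integers = 1024 bytes */

    void init_lookup_table(void) {
        for (int i = 0; i < 256; i++)
            table[i] = ((uint32_t)rand() << 17) ^ ((uint32_t)rand() << 7) ^ (uint32_t)rand();
    }

    uint32_t lookup_hash(uint32_t key) {
        uint32_t h = 0;
        for (int i = 0; i < 4; i++) {
            uint8_t b = (uint8_t)(key >> (8 * i));  /* i-th byte of the key */
            h ^= table[b];                          /* XOR the table value into the output */
        }
        return h;
    }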

Efficient Hashing Of Strings


Modern microprocessors allow for much faster processing if 8-bit character strings are not hashed by processing one character at a time, but by interpreting the string as an array of 32-bit or 64-bit integers and hashing/accumulating these "wide word" integer values by means of arithmetic operations (e.g. multiplication by a constant and bit-shifting). The remaining characters of the string, fewer than the word length of the CPU, must be handled differently (e.g. being processed one character at a time). This approach has proven to speed up hash code generation by a factor of five or more on modern microprocessors with a word size of 64 bits.
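The following C sketch illustrates the idea; the multiplier and the rotation are arbitrary illustrative choices, not a specific published hash, and memcpy is used to read unaligned words portably.

    #include <stdint.h>
    #include <string.h>

    uint64_t wide_word_hash(const char *s, size_t len) {
        const uint64_t MUL = 0x9E3779B97F4A7C15ULL;   /* arbitrary odd constant */
        uint64_t h = 0;
        size_t i = 0;
        for (; i + 8 <= len; i += 8) {                /* process full 64-bit words */
            uint64_t w;
            memcpy(&w, s + i, 8);
            h = (h ^ w) * MUL;
            h = (h << 31) | (h >> 33);                /* rotate to spread the bits */
        }
        for (; i < len; i++)                          /* remaining tail bytes, one at a time */
            h = (h ^ (unsigned char)s[i]) * MUL;
        return h;
    }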


Origins of the term


The term "hash" comes by way of analogy with its non-technical meaning, to "chop and mix". Indeed, typical hash functions, like the mod operation, "chop" the input domain into many sub-domains that get "mixed" into the output range to improve the uniformity of the key distribution. Donald Knuth notes that Hans Peter Luhn of IBM appears to have been the first to use the concept, in a memo dated January 1953, and that Robert Morris used the term in a survey paper in CACM which elevated the term from technical jargon to formal terminology.[1]

List of hash functions


Bernstein hash[6]
Fowler–Noll–Vo hash function (32, 64, 128, 256, 512, or 1024 bits)
Jenkins hash function (32 bits)
Pearson hashing (8 bits)
Zobrist hashing

References
[1] Knuth, Donald (1973). The Art of Computer Programming, volume 3, Sorting and Searching. pp. 506–542.
[2] "Robust Audio Hashing for Content Identification" by Jaap Haitsma, Ton Kalker and Job Oostveen (http://citeseer.ist.psu.edu/rd/11787382,504088,1,0.25,Download/http://citeseer.ist.psu.edu/cache/papers/cs/25861/http:zSzzSzwww.extra.research.philips.comzSznatlabzSzdownloadzSzaudiofpzSzcbmi01audiohashv1.0.pdf/haitsma01robust.pdf)
[3] Broder, A. Z. (1993). "Some applications of Rabin's fingerprinting method". Sequences II: Methods in Communications, Security, and Computer Science. Springer-Verlag. pp. 143–152.
[4] Bret Mulvey, Evaluation of CRC32 for Hash Tables (http://home.comcast.net/~bretm/hash/8.html), in Hash Functions (http://home.comcast.net/~bretm/hash/). Accessed April 10, 2009.
[5] Bret Mulvey, Evaluation of SHA-1 for Hash Tables (http://home.comcast.net/~bretm/hash/9.html), in Hash Functions (http://home.comcast.net/~bretm/hash/). Accessed April 10, 2009.
[6] http://www.cse.yorku.ca/~oz/hash.html

External links
General purpose hash function algorithms (C/C++/Pascal/Java/Python/Ruby) (http://www.partow.net/programming/hashfunctions/index.html)
Hash Functions and Block Ciphers by Bob Jenkins (http://burtleburtle.net/bob/hash/index.html)
The Goulburn Hashing Function (http://www.webcitation.org/query?url=http://www.geocities.com/drone115b/Goulburn06.pdf&date=2009-10-25+21:06:51) (PDF) by Mayur Patel
MIT's Introduction to Algorithms: Hashing 1 (http://video.google.com/videoplay?docid=-727485696209877198&q=source:014117792397255896270&hl=en) MIT OCW lecture video
MIT's Introduction to Algorithms: Hashing 2 (http://video.google.com/videoplay?docid=2307261494964091254&q=source:014117792397255896270&hl=en) MIT OCW lecture video
Hash Function Construction for Textual and Geometrical Data Retrieval (http://herakles.zcu.cz/~skala/PUBL/PUBL_2010/2010_WSEAS-Corfu_Hash-final.pdf) Latest Trends on Computers, Vol. 2, pp. 483–489, CSCC conference, Corfu, 2010


Perfect hash function


A perfect hash function for a set S is a hash function that maps distinct elements in S to a set of integers, with no collisions. A perfect hash function has many of the same applications as other hash functions, but with the advantage that no collision resolution has to be implemented. In mathematical terms, it is a total injective function.

Properties and uses


A perfect hash function for a specific set S that can be evaluated in constant time, and with values in a small range, can be found by a randomized algorithm in a number of operations that is proportional to the size of S. The minimal size of the description of a perfect hash function depends on the range of its function values: the smaller the range, the more space is required. Any perfect hash function suitable for use with a hash table requires at least a number of bits that is proportional to the size of S.

A perfect hash function with values in a limited range can be used for efficient lookup operations, by placing keys from S (or other associated values) in a table indexed by the output of the function. Using a perfect hash function is best in situations where there is a frequently queried large set, S, which is seldom updated. Efficient solutions to performing updates are known as dynamic perfect hashing, but these methods are relatively complicated to implement. A simple alternative to perfect hashing, which also allows dynamic updates, is cuckoo hashing.

Minimal perfect hash function


A minimal perfect hash function is a perfect hash function that maps n keys to n consecutive integers, usually [0..n−1] or [1..n]. A more formal way of expressing this is: Let j and k be elements of some finite set K. F is a minimal perfect hash function iff F(j) = F(k) implies j = k and there exists an integer a such that the range of F is a..a+|K|−1. It has been proved that a general purpose minimal perfect hash scheme requires at least 1.44 bits/key.[1] However, the smallest currently known schemes use around 2.5 bits/key.

A minimal perfect hash function F is order preserving if keys are given in some order a1, a2, ..., an and for any keys aj and ak, j < k implies F(aj) < F(ak). Order-preserving minimal perfect hash functions necessarily require Ω(n log n) bits to be represented. A minimal perfect hash function F is monotone if it preserves the lexicographical order of the keys. Monotone minimal perfect hash functions can be represented in very little space.

References
[1] Djamal Belazzougui, Fabiano C. Botelho, Martin Dietzfelbinger (2009). "Hash, displace, and compress" (http://cmph.sourceforge.net/papers/esa09.pdf) (PDF). Springer Berlin / Heidelberg. Retrieved 2011-08-11.

Further reading
Richard J. Cichelli. Minimal Perfect Hash Functions Made Simple, Communications of the ACM, Vol. 23, Number 1, January 1980.
Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Second Edition. MIT Press and McGraw-Hill, 2001. ISBN 0-262-03293-7. Section 11.5: Perfect hashing, pp. 245–249.
Fabiano C. Botelho, Rasmus Pagh and Nivio Ziviani. "Perfect Hashing for Data Management Applications" (http://arxiv.org/pdf/cs/0702159).
Fabiano C. Botelho and Nivio Ziviani. "External perfect hashing for very large key sets" (http://homepages.dcc.ufmg.br/~nivio/papers/cikm07.pdf). 16th ACM Conference on Information and Knowledge Management (CIKM07), Lisbon, Portugal, November 2007.
Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna. "Monotone minimal perfect hashing: Searching a sorted table with O(1) accesses" (http://vigna.dsi.unimi.it/ftp/papers/MonotoneMinimalPerfectHashing.pdf). In Proceedings of the 20th Annual ACM-SIAM Symposium On Discrete Mathematics (SODA), New York, 2009. ACM Press.
Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna. "Theory and practise of monotone minimal perfect hashing" (http://www.siam.org/proceedings/alenex/2009/alx09_013_belazzouguid.pdf). In Proceedings of the Tenth Workshop on Algorithm Engineering and Experiments (ALENEX). SIAM, 2009.


External links
Minimal Perfect Hashing (http://burtleburtle.net/bob/hash/perfect.html) by Bob Jenkins
gperf (http://www.gnu.org/software/gperf/) is a Free software C and C++ perfect hash generator
cmph (http://cmph.sourceforge.net/index.html) is Free Software implementing many perfect hashing methods
Sux4J (http://sux4j.dsi.unimi.it/) is Free Software implementing perfect hashing, including monotone minimal perfect hashing, in Java
MPHSharp (http://www.dupuis.me/node/9) is Free Software implementing many perfect hashing methods in C#

Universal hashing
Using universal hashing (in a randomized algorithm or data structure) refers to selecting a hash function at random from a family of hash functions with a certain mathematical property (see definition below). This guarantees a low number of collisions in expectation, even if the data is chosen by an adversary. Many universal families are known (for hashing integers, vectors, strings), and their evaluation is often very efficient. Universal hashing has numerous uses in computer science, for example in implementations of hash tables, randomized algorithms, and cryptography.

Introduction
Assume we want to map keys from some universe U into m bins (labelled [m] = {0, ..., m−1}). The algorithm will have to handle some data set S ⊆ U of n keys, which is not known in advance. Usually, the goal of hashing is to obtain a low number of collisions (keys from S that land in the same bin). A deterministic hash function cannot offer any guarantee in an adversarial setting if the size of U is greater than m·n, since the adversary may choose S to be precisely the preimage of a bin. This means that all data keys land in the same bin, making hashing useless. Furthermore, a deterministic hash function does not allow for rehashing: sometimes the input data turns out to be bad for the hash function (e.g. there are too many collisions), so one would like to change the hash function.

The solution to these problems is to pick a function randomly from a family of hash functions. A family of functions H = {h : U → [m]} is called a universal family if

    for all x, y in U with x ≠ y:   Pr[h(x) = h(y)] ≤ 1/m,

where the probability is over the random choice of h from H.[1] In other words, any two distinct keys of the universe collide with probability at most 1/m when the hash function h is drawn randomly from H. This is exactly the probability of collision we would expect if the hash function assigned truly random hash codes to every key. Sometimes, the definition is relaxed to allow collision probability O(1/m).

This concept was introduced by Carter and Wegman in 1977, and has found numerous applications in computer science (see, for example [2]). If we have an upper bound of ε < 1 on the collision probability, we say that we have ε-almost universality.

Many, but not all, universal families have the following stronger uniform difference property:

for all x, y in U with x ≠ y, when h is drawn randomly from the family H, the difference h(x) − h(y) mod m is uniformly distributed in [m]. Note that the definition of universality is only concerned with whether h(x) − h(y) = 0 (mod m), which counts collisions; the uniform difference property is stronger. (Similarly, a universal family can be XOR universal if, for all x ≠ y, the value h(x) ⊕ h(y) is uniformly distributed in [m], where ⊕ is the bitwise exclusive or operation. This is only possible if m is a power of two.)

An even stronger condition is pairwise independence: we have this property when, for all x ≠ y, the probability that x and y will hash to any pair of hash values z1, z2 is as if they were perfectly random: Pr[h(x) = z1 and h(y) = z2] = 1/m². Pairwise independence is sometimes called strong universality.

Another property is uniformity. We say that a family is uniform if all hash values are equally likely: Pr[h(x) = z] = 1/m for any hash value z. Universality does not imply uniformity. However, strong universality does imply uniformity.

Given a family with the uniform difference property, one can produce a pairwise independent or strongly universal hash family by adding a uniformly distributed random constant with values in [m] to the hash functions. (Similarly, if m is a power of two, we can achieve pairwise independence from an XOR universal hash family by doing an exclusive or with a uniformly distributed random constant.) Since a shift by a constant is sometimes irrelevant in applications (e.g. hash tables), a careful distinction between the uniform difference property and pairwise independence is sometimes not made.[3]

For some applications (such as hash tables), it is important for the least significant bits of the hash values to be also universal. When a family is strongly universal, this is guaranteed: if H is a strongly universal family with m = 2^L, then the family made of the functions h mod 2^L' for all h in H is also strongly universal for L' ≤ L. Unfortunately, the same is not true of (merely) universal families. For example, the family made of the identity function h(x) = x is clearly universal, but the family made of the function h(x) = x mod 2^L' fails to be universal.

Mathematical guarantees
For any fixed set S of n keys, using a universal family guarantees the following properties.
1. For any fixed x in S, the expected number of keys in the bin h(x) is n/m. When implementing hash tables by chaining, this number is proportional to the expected running time of an operation involving the key x (for example a query, insertion or deletion).
2. The expected number of pairs of keys x, y in S with x ≠ y that collide (h(x) = h(y)) is bounded above by n(n−1)/(2m), which is of order O(n²/m). When the number of bins, m, is n, the expected number of collisions is O(n). When hashing into n² bins, there are no collisions at all with probability at least a half.
3. The expected number of keys in bins with at least t keys in them is bounded above;[4] in particular, if the capacity of each bin is capped to three times the average size (t = 3n/m), the total number of keys in overflowing bins is at most O(m).[4] This only holds with a hash family whose collision probability is bounded above by 1/m. If a weaker definition is used, bounding it by O(1/m), this result is no longer true.

As the above guarantees hold for any fixed set S, they hold if the data set is chosen by an adversary. However, the adversary has to make this choice before (or independent of) the algorithm's random choice of a hash function. If the adversary can observe the random choice of the algorithm, randomness serves no purpose, and the situation is the same as deterministic hashing.

The second and third guarantee are typically used in conjunction with rehashing. For instance, a randomized algorithm may be prepared to handle some number of collisions. If it observes too many collisions, it chooses another random h from the family and repeats. Universality guarantees that the number of repetitions is a geometric random variable.

Constructions
Since any computer data can be represented as one or more machine words, one generally needs hash functions for three types of domains: machine words ("integers"); fixed-length vectors of machine words; and variable-length vectors ("strings").

Hashing integers
This section refers to the case of hashing integers that fit in machine words; thus, operations like multiplication, addition, division, etc. are cheap machine-level instructions. Let the universe to be hashed be {0, ..., |U|−1}. The original proposal of Carter and Wegman[1] was to pick a prime p ≥ |U| and define

    h_{a,b}(x) = ((a·x + b) mod p) mod m

where a and b are randomly chosen integers modulo p with a ≠ 0. Technically, adding b is not needed for universality (but it does make the hash function 2-independent).

To see that this is a universal family, note that h_{a,b}(x) = h_{a,b}(y) only holds when

    a·x + b ≡ a·y + b + i·m (mod p)

for some integer i between 0 and (p − 1)/m. If x ≠ y, their difference, x − y, is nonzero and has an inverse modulo p. Solving for a,

    a ≡ i·m·(x − y)^(−1) (mod p).

There are p − 1 possible choices for a (since a = 0 is excluded) and, varying i in the allowed range, roughly p/m possible values for the right hand side. Thus the collision probability is roughly (p/m)/(p − 1), which tends to 1/m for large p, as required. This analysis also shows that b does not have to be randomised in order to have universality.

Another way to see that this is a universal family is via the notion of statistical distance. Note that the inner values (a·x + b) mod p and (a·y + b) mod p differ by a·(x − y) mod p. Since x − y is nonzero and a is uniformly distributed in {1, ..., p−1}, this difference is also uniformly distributed in {1, ..., p−1}. The distribution of the difference modulo m is thus almost uniform, up to a difference in probability of ±1/p between the samples. As a result, the statistical distance to a uniform family is O(m/p), which becomes negligible when p ≫ m.
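A minimal C sketch of this family for 31-bit keys, using the prime p = 2^31 − 1 as an illustrative choice and rand() as a stand-in for a proper random source (it typically does not cover the full range):

    #include <stdint.h>
    #include <stdlib.h>

    #define P 2147483647ULL                     /* prime modulus 2^31 - 1 */

    typedef struct { uint64_t a, b; } cw_hash;

    /* Pick a in {1, ..., p-1} and b in {0, ..., p-1} at random. */
    cw_hash cw_init(void) {
        cw_hash h;
        h.a = 1 + (uint64_t)rand() % (P - 1);
        h.b = (uint64_t)rand() % P;
        return h;
    }

    /* h_{a,b}(x) = ((a*x + b) mod p) mod m; a*x fits in 64 bits for 31-bit inputs. */
    uint32_t cw_hash_key(cw_hash h, uint32_t x, uint32_t m) {
        return (uint32_t)(((h.a * x + h.b) % P) % m);
    }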

Avoiding modular arithmetic

The state of the art for hashing integers is the multiply-shift scheme described by Dietzfelbinger et al. in 1997.[5] By avoiding modular arithmetic, this method is much easier to implement and also runs significantly faster in practice (usually by at least a factor of four[6]). The scheme assumes the number of bins is a power of two, m = 2^M. Let w be the number of bits in a machine word. Then the hash functions are parametrised over odd positive integers a < 2^w (that fit in a word of w bits). To evaluate h_a(x), multiply x by a modulo 2^w and then keep the high order M bits as the hash code. In mathematical notation, this is

    h_a(x) = (a·x mod 2^w) div 2^(w−M)

and it can be implemented in C-like programming languages by

    (unsigned) (a*x) >> (w-M)

This scheme does not satisfy the uniform difference property and is only 2/m-almost-universal; for any x ≠ y, Pr[h_a(x) = h_a(y)] ≤ 2/m.

To understand the behavior of the hash function, notice that, if a·x mod 2^w and a·y mod 2^w have the same highest-order M bits, then a·(x − y) mod 2^w has either all 1's or all 0's as its highest-order M bits (depending on whether a·x mod 2^w or a·y mod 2^w is larger). Since a is a random odd integer and odd integers have inverses in the ring of integers modulo 2^w, the product a·(x − y) mod 2^w is uniformly distributed among the w-bit integers whose least significant set bit is in the same position as that of x − y. A case analysis on the position of that bit shows that the probability that the highest-order M bits are all 0's or all 1's is at most 2/2^M = 2/m, and this analysis is tight. To obtain a truly universal hash function, one can use the multiply-add-shift scheme

    h_{a,b}(x) = ((a·x + b) mod 2^(2w)) div 2^(2w−M)

where a is a random odd positive integer with a < 2^(2w) and b is a random non-negative integer with b < 2^(2w−M). With these choices of a and b, Pr[h_{a,b}(x) = h_{a,b}(y)] ≤ 1/m for all distinct x and y.[7]
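Both schemes are easy to state in C for w = 32 and 1 ≤ M ≤ 32; in this sketch, a is assumed to be a random odd 32-bit integer (64-bit in the multiply-add-shift variant) and b a random integer below 2^(64−M).

    #include <stdint.h>

    /* Multiply-shift: keep the high M bits of a*x computed modulo 2^32. */
    uint32_t multiply_shift(uint32_t a, uint32_t x, unsigned M) {
        return (a * x) >> (32 - M);
    }

    /* Multiply-add-shift: (a*x + b) mod 2^64, then keep the high M bits. */
    uint32_t multiply_add_shift(uint64_t a, uint64_t b, uint32_t x, unsigned M) {
        return (uint32_t)((a * x + b) >> (64 - M));
    }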

Hashing vectors
This section is concerned with hashing a fixed-length vector of machine words. Interpret the input as a vector x = (x_0, ..., x_{k−1}) of k machine words (integers of w bits each). If H is a universal family with the uniform difference property, the following family, dating back to Carter and Wegman,[1] also has the uniform difference property (and hence is universal):

    h(x) = (h_0(x_0) + h_1(x_1) + ... + h_{k−1}(x_{k−1})) mod m,

where each h_i ∈ H is chosen independently at random. If m is a power of two, one may replace summation by exclusive or.[8]

In practice, if double-precision arithmetic is available, this is instantiated with the multiply-shift hash family of.[9] Initialize the hash function with a vector a = (a_0, ..., a_{k−1}) of random odd integers on 2w bits each. Then, if the number of bins is m = 2^M for M ≤ w:

    h_a(x) = ((x_0·a_0 + x_1·a_1 + ... + x_{k−1}·a_{k−1}) mod 2^(2w)) div 2^(2w−M).

It is possible to halve the number of multiplications, which roughly translates to a two-fold speed-up in practice.[8] Initialize the hash function with a vector a = (a_0, ..., a_{k−1}) of random odd integers on 2w bits each. The following hash family is universal:[10]

    h_a(x) = (((x_0 + a_0)·(x_1 + a_1) + (x_2 + a_2)·(x_3 + a_3) + ...) mod 2^(2w)) div 2^(2w−M).

If double-precision operations are not available, one can interpret the input as a vector of half-words (w/2-bit integers). The algorithm will then use ⌈k/2⌉ multiplications, where k was the number of half-words in the vector. Thus, the algorithm runs at a "rate" of one multiplication per word of input.
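A C sketch of the vector multiply-shift family for w = 32, so that products and sums are taken modulo 2^64; a[] holds random odd 64-bit integers and 1 ≤ M ≤ 32 is assumed.

    #include <stdint.h>
    #include <stddef.h>

    uint32_t vector_hash(const uint64_t *a, const uint32_t *x, size_t k, unsigned M) {
        uint64_t sum = 0;
        for (size_t i = 0; i < k; i++)
            sum += a[i] * x[i];            /* accumulate modulo 2^64 */
        return (uint32_t)(sum >> (64 - M)); /* keep the high M bits */
    }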

The same scheme can also be used for hashing integers, by interpreting their bits as vectors of bytes. In this variant, the vector technique is known as tabulation hashing and it provides a practical alternative to multiplication-based universal hashing schemes.[11]


Hashing strings
This refers to hashing a variable-sized vector of machine words. If the length of the string can be bounded by a small number, it is best to use the vector solution from above (conceptually padding the vector with zeros up to the upper bound). The space required is the maximal length of the string, but the time to evaluate the hash is just the length of the string (the zero-padding can be ignored when evaluating the hash function without affecting universality[8]).

Now assume we want to hash x = (x_0, ..., x_ℓ), where a good bound on ℓ is not known a priori. A universal family proposed by[9] treats the string x as the coefficients of a polynomial modulo a large prime. If x_i ∈ [u], let p ≥ max{u, m} be a prime and define:

    h_a(x) = h_int((x_0 + x_1·a + x_2·a² + ... + x_ℓ·a^ℓ) mod p),

where a ∈ [p] is uniformly random and h_int is chosen randomly from a universal family mapping the integer domain [p] into [m].

Consider two strings x, y and let ℓ be the length of the longer one; for the analysis, the shorter string is conceptually padded with zeros up to length ℓ. A collision before applying h_int implies that a is a root of the polynomial with coefficients x − y. This polynomial has at most ℓ roots modulo p, so the collision probability is at most ℓ/p. The probability of collision through the random h_int brings the total collision probability to 1/m + ℓ/p. Thus, if the prime p is sufficiently large compared to the length of strings hashed, the family is very close to universal (in statistical distance).

To mitigate the computational penalty of modular arithmetic, two tricks are used in practice:[8]
1. One chooses the prime p to be close to a power of two, such as a Mersenne prime. This allows arithmetic modulo p to be implemented without division (using faster operations like addition and shifts). For instance, on modern architectures one can work with p = 2^61 − 1, while the x_i's are 32-bit values.
2. One can apply vector hashing to blocks. For instance, one applies vector hashing to each 16-word block of the string, and applies string hashing to the results. Since the slower string hashing is applied on a substantially smaller vector, this will essentially be as fast as vector hashing.

References
[1] Carter, Larry; Wegman, Mark N. (1979). "Universal Classes of Hash Functions". Journal of Computer and System Sciences 18 (2): 143–154. doi:10.1016/0022-0000(79)90044-8. Conference version in STOC'77.
[2] Miltersen, Peter Bro. "Universal Hashing" (http://www.webcitation.org/5hmOaVISI) (PDF). Archived from the original (http://www.daimi.au.dk/~bromille/Notes/un.pdf) on 24 June 2009.
[3] Motwani, Rajeev; Raghavan, Prabhakar (1995). Randomized Algorithms. Cambridge University Press. p. 221. ISBN 0-521-47465-5.
[4] Baran, Ilya; Demaine, Erik D.; Pătrașcu, Mihai (2008). "Subquadratic Algorithms for 3SUM" (http://people.csail.mit.edu/mip/papers/3sum/3sum.pdf). Algorithmica 50 (4): 584–596. doi:10.1007/s00453-007-9036-3.
[5] Dietzfelbinger, Martin; Hagerup, Torben; Katajainen, Jyrki; Penttonen, Martti (1997). "A Reliable Randomized Algorithm for the Closest-Pair Problem" (http://www.diku.dk/~jyrki/Paper/CP-11.4.1997.ps) (Postscript). Journal of Algorithms 25 (1): 19–51. doi:10.1006/jagm.1997.0873. Retrieved 10 February 2011.
[6] Thorup, Mikkel. "Text-book algorithms at SODA" (http://mybiasedcoin.blogspot.com/2009/12/text-book-algorithms-at-soda-guest-post.html).
[7] Woelfel, Philipp (1999). "Efficient Strongly Universal and Optimally Universal Hashing" (http://www.springerlink.com/content/a10p748w7pr48682/) (PDF). LNCS 1672. Mathematical Foundations of Computer Science 1999. pp. 262–272. doi:10.1007/3-540-48340-3_24. Retrieved 17 May 2011.
[8] Thorup, Mikkel (2009). "String hashing for linear probing" (http://www.siam.org/proceedings/soda/2009/SODA09_072_thorupm.pdf). Proc. 20th ACM-SIAM Symposium on Discrete Algorithms (SODA). pp. 655–664. Section 5.3.

[9] Dietzfelbinger, Martin; Gil, Joseph; Matias, Yossi; Pippenger, Nicholas (1992). "Polynomial Hash Functions Are Reliable (Extended Abstract)". Proc. 19th International Colloquium on Automata, Languages and Programming (ICALP). pp. 235–246.
[10] Black, J.; Halevi, S.; Krawczyk, H.; Krovetz, T. (1999). "UMAC: Fast and Secure Message Authentication" (http://www.cs.ucdavis.edu/~rogaway/papers/umac-full.pdf). Advances in Cryptology (CRYPTO '99). Equation 1.
[11] Pătrașcu, Mihai; Thorup, Mikkel (2011). "The power of simple tabulation hashing". Proceedings of the 43rd annual ACM Symposium on Theory of Computing (STOC '11). pp. 1–10. arXiv:1011.5200. doi:10.1145/1993636.1993638.


Further reading
Knuth, Donald Ervin (1998). The Art of Computer Programming, Vol. III: Sorting and Searching (2nd ed.). Reading, Mass.; London: Addison-Wesley. ISBN 0-201-89685-0.

External links
Open Data Structures - Section 5.1.1 - Multiplicative Hashing (http://opendatastructures.org/versions/edition-0.1e/ods-java/5_1_ChainedHashTable_Hashin.html#SECTION00811000000000000000)

K-independent hashing
A family of hash functions is said to be k-independent or k-universal[1] if selecting a hash function at random from the family guarantees that the hash codes of any k designated keys are independent random variables (see precise mathematical definitions below). Such families allow good average case performance in randomized algorithms or data structures, even if the input data is chosen by an adversary. The trade-offs between the degree of independence and the efficiency of evaluating the hash function are well studied, and many k-independent families have been proposed.

Introduction
The goal of hashing is usually to map keys from some large domain (universe) U into a smaller range, such as m bins (labelled [m] = {0, ..., m−1}). In the analysis of randomized algorithms and data structures, it is often desirable for the hash codes of various keys to "behave randomly". For instance, if the hash code of each key were an independent random choice in [m], the number of keys per bin could be analyzed using the Chernoff bound. A deterministic hash function cannot offer any such guarantee in an adversarial setting, as the adversary may choose the keys to be precisely the preimage of a bin. Furthermore, a deterministic hash function does not allow for rehashing: sometimes the input data turns out to be bad for the hash function (e.g. there are too many collisions), so one would like to change the hash function.

The solution to these problems is to pick a function randomly from a large family of hash functions. The randomness in choosing the hash function can be used to guarantee some desired random behavior of the hash codes of any keys of interest. The first definition along these lines was universal hashing, which guarantees a low collision probability for any two designated keys. The concept of k-independent hashing, introduced by Wegman and Carter in 1981,[2] strengthens the guarantees of random behavior to families of k designated keys, and adds a guarantee on the uniform distribution of hash codes.


Mathematical Definitions
The strictest definition, introduced by Wegman and Carter[2] under the name "strongly universal hash family", is the following. A family of hash functions H = {h : U → [m]} is k-independent if for any k distinct keys (x_1, ..., x_k) ∈ U^k and any k hash codes (not necessarily distinct) (y_1, ..., y_k) ∈ [m]^k, we have:

    Pr[h(x_1) = y_1 and ... and h(x_k) = y_k] = m^(−k),

where the probability is over h drawn randomly from H. This definition is equivalent to the following two conditions:
1. for any fixed x ∈ U, as h is drawn randomly from H, h(x) is uniformly distributed in [m];
2. for any fixed, distinct keys x_1, ..., x_k ∈ U, as h is drawn randomly from H, the hash codes h(x_1), ..., h(x_k) are independent random variables.

Often it is inconvenient to achieve the perfect joint probability of m^(−k) due to rounding issues. Following,[3] one may relax the definition and require only that, for any distinct keys (x_1, ..., x_k) and any hash codes (y_1, ..., y_k), the joint probability above is at most μ/m^k for some constant μ close to 1. Observe that, even if μ is close to 1, the hash codes h(x_i) are no longer independent random variables, which is often a problem in the analysis of randomized algorithms. Therefore, a more common alternative to dealing with rounding issues is to prove that the hash family is close in statistical distance to a k-independent family, which allows black-box use of the independence properties.

References
[1] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2009). Introduction to Algorithms (3rd ed.). MIT Press. ISBN 0-262-03384-4.
[2] Wegman, Mark N.; Carter, J. Lawrence (1981). "New hash functions and their use in authentication and set equality" (http://www.fi.muni.cz/~xbouda1/teaching/2009/IV111/Wegman_Carter_1981_New_hash_functions.pdf). Journal of Computer and System Sciences 22 (3): 265–279. doi:10.1016/0022-0000(81)90033-7. Conference version in FOCS'79. Retrieved 9 February 2011.
[3] Siegel, Alan (2004). "On universal classes of extremely random constant-time hash functions and their time-space tradeoff" (http://www.cs.nyu.edu/faculty/siegel/FASTH.pdf). SIAM Journal on Computing 33 (3): 505–543. Conference version in FOCS'89.

Further reading
Motwani, Rajeev; Raghavan, Prabhakar (1995). Randomized Algorithms. Cambridge University Press. p. 221. ISBN 0-521-47465-5.


Tabulation hashing
In computer science, tabulation hashing is a method for constructing universal families of hash functions by combining table lookup with exclusive or operations. It is simple and fast enough to be usable in practice, and has theoretical properties that (in contrast to some other universal hashing methods) make it usable with linear probing, cuckoo hashing, and the MinHash technique for estimating the size of set intersections. The first instance of tabulation hashing is Zobrist hashing (1969). It was later rediscovered by Carter & Wegman (1979) and studied in more detail by Pătrașcu & Thorup (2011).

Method
Let p denote the number of bits in a key to be hashed, and q denote the number of bits desired in an output hash value. Let r be a number smaller than p, and let t be the smallest integer that is at least as large as p/r. For instance, if r = 8, then an r-bit number is a byte, and t is the number of bytes per key. The key idea of tabulation hashing is to view a key as a vector of t r-bit numbers, use a lookup table filled with random values to compute a hash value for each of the r-bit numbers representing a given key, and combine these values with the bitwise binary exclusive or operation. The choice of t and r should be made in such a way that this table is not too large; e.g., so that it fits into the computer's cache memory.

The initialization phase of the algorithm creates a two-dimensional array T of dimensions 2^r by t, and fills the array with random numbers. Once the array T is initialized, it can be used to compute the hash value h(x) of any given key x. To do so, partition x into r-bit values, where x_0 consists of the low order r bits of x, x_1 consists of the next r bits, etc. (E.g., again, with r = 8, x_i is just the ith byte of x). Then, use these values as indices into T and combine them with the exclusive or operation:

    h(x) = T[x_0,0] ⊕ T[x_1,1] ⊕ T[x_2,2] ⊕ ...
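A C sketch for 32-bit keys with r = 8 and t = 4, so T has 4 columns of 256 random 32-bit entries; rand() is only a stand-in for a proper random source.

    #include <stdint.h>
    #include <stdlib.h>

    static uint32_t T[4][256];      /* T[i][b]: random value for byte b in position i */

    void tabulation_init(void) {
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 256; j++)
                T[i][j] = ((uint32_t)rand() << 17) ^ ((uint32_t)rand() << 7) ^ (uint32_t)rand();
    }

    uint32_t tabulation_hash(uint32_t x) {
        return T[0][x & 0xFF] ^
               T[1][(x >> 8) & 0xFF] ^
               T[2][(x >> 16) & 0xFF] ^
               T[3][(x >> 24) & 0xFF];
    }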

Universality
Carter & Wegman (1979) define a randomized scheme for generating hash functions to be universal if, for any two keys, the probability that they collide (that is, that they are mapped to the same value as each other) is 1/m, where m is the number of possible hash values. They defined a stronger property in the subsequent paper Wegman & Carter (1981): a randomized scheme for generating hash functions is k-independent if, for every k-tuple of keys, and each possible k-tuple of values, the probability that those keys are mapped to those values is 1/m^k. 2-independent hashing schemes are automatically universal, and any universal hashing scheme can be converted into a 2-independent scheme by storing a random number x in the initialization phase of the algorithm and adding x to each hash value, so universality is essentially the same as 2-independence, but k-independence for larger values of k is a stronger property, held by fewer hashing algorithms.

As Pătrașcu & Thorup (2011) observe, tabulation hashing is 3-independent but not 4-independent. For any single key x, T[x_0,0] is equally likely to take on any hash value, and the exclusive or of T[x_0,0] with the remaining table values does not change this property. For any two keys x and y, x is equally likely to be mapped to any hash value as before, and there is at least one position i where x_i ≠ y_i; the table value T[y_i,i] is used in the calculation of h(y) but not in the calculation of h(x), so even after the value of h(x) has been determined, h(y) is equally likely to be any valid hash value. Similarly, for any three keys x, y, and z, at least one of the three keys has a position i where its value z_i differs from the other two, so that even after the values of h(x) and h(y) are determined, h(z) is equally likely to be any valid hash value.

However, this reasoning breaks down for four keys because there are sets of keys w, x, y, and z where none of the four has a byte value that it does not share with at least one of the other keys. For instance, if the keys have two bytes each, and w, x, y, and z are the four keys that have either zero or one as their byte values, then each byte value in each position is shared by exactly two of the four keys. For these four keys, the hash values computed by tabulation hashing will always satisfy the equation h(w) ⊕ h(x) ⊕ h(y) ⊕ h(z) = 0, whereas for a 4-independent hashing scheme the same equation would only be satisfied with probability 1/m. Therefore, tabulation hashing is not 4-independent.

Siegel (2004) uses the same idea of using exclusive or operations to combine random values from a table, with a more complicated algorithm based on expander graphs for transforming the key bits into table indices, to define hashing schemes that are k-independent for any constant or even logarithmic value of k. However, the number of table lookups needed to compute each hash value using Siegel's variation of tabulation hashing, while constant, is still too large to be practical, and the use of expanders in Siegel's technique also makes it not fully constructive. One limitation of tabulation hashing is that it assumes that the input keys have a fixed number of bits. Lemire (2012) has studied variations of tabulation hashing that can be applied to variable-length strings, and shown that they can be universal (2-independent) but not 3-independent.


Application
Because tabulation hashing is a universal hashing scheme, it can be used in any hashing-based algorithm in which universality is sufficient. For instance, in hash chaining, the expected time per operation is proportional to the sum of collision probabilities, which is the same for any universal scheme as it would be for truly random hash functions, and is constant whenever the load factor of the hash table is constant. Therefore, tabulation hashing can be used to compute hash functions for hash chaining with a theoretical guarantee of constant expected time per operation.[1]

However, universal hashing is not strong enough to guarantee the performance of some other hashing algorithms. For instance, for linear probing, 5-independent hash functions are strong enough to guarantee constant time operation, but there are 4-independent hash functions that fail.[2] Nevertheless, despite only being 3-independent, tabulation hashing provides the same constant-time guarantee for linear probing.[3]

Cuckoo hashing, another technique for implementing hash tables, guarantees constant time per lookup (regardless of the hash function). Insertions into a cuckoo hash table may fail, causing the entire table to be rebuilt, but such failures are sufficiently unlikely that the expected time per insertion (using either a truly random hash function or a hash function with logarithmic independence) is constant. With tabulation hashing, on the other hand, the best bound known on the failure probability is higher, high enough that insertions cannot be guaranteed to take constant expected time. Nevertheless, tabulation hashing is adequate to ensure the linear-expected-time construction of a cuckoo hash table for a static set of keys that does not change as the table is used.[3]

Algorithms such as Karp–Rabin require the efficient computation of hashes of all consecutive sequences of characters; rolling hash functions are typically used for these problems. Tabulation hashing is used to construct families of strongly universal functions (for example, hashing by cyclic polynomials).

Notes
[1] Carter & Wegman (1979).
[2] For the sufficiency of 5-independent hashing for linear probing, see Pagh, Pagh & Ružić (2009). For examples of weaker hashing schemes that fail, see Pătrașcu & Thorup (2010).
[3] Pătrașcu & Thorup (2011).

References
Carter, J. Lawrence; Wegman, Mark N. (1979), "Universal classes of hash functions", Journal of Computer and System Sciences 18 (2): 143–154, doi:10.1016/0022-0000(79)90044-8, MR532173.
Lemire, Daniel (2012), "The universality of iterated hashing over variable-length strings", Discrete Applied Mathematics 160: 604–617, arXiv:1008.1715, doi:10.1016/j.dam.2011.11.009.

Pagh, Anna; Pagh, Rasmus; Ružić, Milan (2009), "Linear probing with constant independence", SIAM Journal on Computing 39 (3): 1107–1120, doi:10.1137/070702278, MR2538852.
Pătrașcu, Mihai; Thorup, Mikkel (2010), "On the k-independence required by linear probing and minwise independence" (http://people.csail.mit.edu/mip/papers/kwise-lb/kwise-lb.pdf), Automata, Languages and Programming, 37th International Colloquium, ICALP 2010, Bordeaux, France, July 6-10, 2010, Proceedings, Part I, Lecture Notes in Computer Science, 6198, Springer, pp. 715–726, doi:10.1007/978-3-642-14165-2_60.
Pătrașcu, Mihai; Thorup, Mikkel (2011), "The power of simple tabulation hashing", Proceedings of the 43rd annual ACM Symposium on Theory of Computing (STOC '11), pp. 1–10, arXiv:1011.5200, doi:10.1145/1993636.1993638.
Siegel, Alan (2004), "On universal classes of extremely random constant-time hash functions", SIAM Journal on Computing 33 (3): 505–543 (electronic), doi:10.1137/S0097539701386216, MR2066640.
Wegman, Mark N.; Carter, J. Lawrence (1981), "New hash functions and their use in authentication and set equality", Journal of Computer and System Sciences 22 (3): 265–279, doi:10.1016/0022-0000(81)90033-7, MR633535.


Cryptographic hash function


A cryptographic hash function is a hash function, that is, an algorithm that takes an arbitrary block of data and returns a fixed-size bit string, the (cryptographic) hash value, such that an (accidental or intentional) change to the data will (with very high probability) change the hash value. The data to be encoded is often called the "message", and the hash value is sometimes called the message digest or simply digest.

The ideal cryptographic hash function has four main or significant properties:
it is easy to compute the hash value for any given message
it is infeasible to generate a message that has a given hash
it is infeasible to modify a message without changing the hash
it is infeasible to find two different messages with the same hash

[Figure: A cryptographic hash function (specifically, SHA-1) at work. Note that even small changes in the source input (here in the word "over") drastically change the resulting output, by the so-called avalanche effect.]

Cryptographic hash functions have many information security applications, notably in digital signatures, message authentication codes (MACs), and other forms of authentication. They can also be used as ordinary hash functions, to index data in hash tables, for fingerprinting, to detect duplicate data or uniquely identify files, and as checksums to detect accidental data corruption. Indeed, in information security contexts, cryptographic hash values are sometimes called (digital) fingerprints, checksums, or just hash values, even though all these terms stand for functions with rather different properties and purposes.


Properties
Most cryptographic hash functions are designed to take a string of any length as input and produce a fixed-length hash value. A cryptographic hash function must be able to withstand all known types of cryptanalytic attack. As a minimum, it must have the following properties:

Preimage resistance: Given a hash value h, it should be difficult to find any message m such that h = hash(m). This concept is related to that of one-way function. Functions that lack this property are vulnerable to preimage attacks.

Second-preimage resistance: Given an input m1, it should be difficult to find another input m2, where m1 ≠ m2, such that hash(m1) = hash(m2). This property is sometimes referred to as weak collision resistance, and functions that lack this property are vulnerable to second-preimage attacks.

Collision resistance: It should be difficult to find two different messages m1 and m2 such that hash(m1) = hash(m2). Such a pair is called a cryptographic hash collision. This property is sometimes referred to as strong collision resistance. It requires a hash value at least twice as long as that required for preimage-resistance, otherwise collisions may be found by a birthday attack.

These properties imply that a malicious adversary cannot replace or modify the input data without changing its digest. Thus, if two strings have the same digest, one can be very confident that they are identical. A function meeting these criteria may still have undesirable properties. Currently popular cryptographic hash functions are vulnerable to length-extension attacks: given hash(m) and the length of m, but not m itself, an attacker can choose a suitable m' and calculate hash(m ∥ m'), where ∥ denotes concatenation. This property can be used to break naive authentication schemes based on hash functions. The HMAC construction works around these problems.

Ideally, one may wish for even stronger conditions. It should be impossible for an adversary to find two messages with substantially similar digests; or to infer any useful information about the data, given only its digest. Therefore, a cryptographic hash function should behave as much as possible like a random function while still being deterministic and efficiently computable.

Checksum algorithms, such as CRC32 and other cyclic redundancy checks, are designed to meet much weaker requirements, and are generally unsuitable as cryptographic hash functions. For example, a CRC was used for message integrity in the WEP encryption standard, but an attack was readily discovered which exploited the linearity of the checksum.

Degree of difficulty
In cryptographic practice, difficult generally means almost certainly beyond the reach of any adversary who must be prevented from breaking the system for as long as the security of the system is deemed important. The meaning of the term is therefore somewhat dependent on the application, since the effort that a malicious agent may put into the task is usually proportional to his expected gain. However, since the needed effort usually grows very quickly with the digest length, even a thousand-fold advantage in processing power can be neutralized by adding a few dozen bits to the latter.

In some theoretical analyses difficult has a specific mathematical meaning, such as not solvable in asymptotic polynomial time. Such interpretations of difficulty are important in the study of provably secure cryptographic hash functions but do not usually have a strong connection to practical security. For example, an exponential time algorithm can sometimes still be fast enough to make a feasible attack. Conversely, a polynomial time algorithm (e.g., one that requires n^20 steps for n-digit keys) may be too slow for any practical use.


Illustration
An illustration of the potential use of a cryptographic hash is as follows: Alice poses a tough math problem to Bob and claims she has solved it. Bob would like to try it himself, but would yet like to be sure that Alice is not bluffing. Therefore, Alice writes down her solution, computes its hash and tells Bob the hash value (whilst keeping the solution secret). Then, when Bob comes up with the solution himself a few days later, Alice can prove that she had the solution earlier by revealing it and having Bob hash it and check that it matches the hash value given to him before. (This is an example of a simple commitment scheme; in actual practice, Alice and Bob will often be computer programs, and the secret would be something less easily spoofed than a claimed puzzle solution).

Applications
Verifying the integrity of files or messages
An important application of secure hashes is verification of message integrity. Determining whether any changes have been made to a message (or a file), for example, can be accomplished by comparing message digests calculated before, and after, transmission (or any other event). For this reason, most digital signature algorithms only confirm the authenticity of a hashed digest of the message to be "signed." Verifying the authenticity of a hashed digest of the message is considered proof that the message itself is authentic.

Password verification
A related application is password verification. Passwords are usually not stored in cleartext, but instead in digest form, to improve security. To authenticate a user, the password presented by the user is hashed and compared with the stored hash. This also means that the original passwords cannot be retrieved if forgotten or lost, and they have to be replaced with new ones. The password is often concatenated with a random, non-secret salt value that is stored with the password. Because users have different salts, it is not feasible to store tables of precomputed hash values for common passwords. Key stretching functions, such as PBKDF2, typically use repeated invocations of a cryptographic hash to increase the time required to perform brute force attacks on stored password digests.

File or data identifier


A message digest can also serve as a means of reliably identifying a file; several source code management systems, including Git, Mercurial and Monotone, use the sha1sum of various types of content (file content, directory trees, ancestry information, etc.) to uniquely identify them. Hashes are used to identify files on peer-to-peer filesharing networks. For example, in an ed2k link, an MD4-variant hash is combined with the file size, providing sufficient information for locating file sources, downloading the file and verifying its contents. Magnet links are another example. Such file hashes are often the top hash of a hash list or a hash tree, which allows for additional benefits.

One of the main applications of a hash function is to allow the fast look-up of data in a hash table. Being hash functions of a particular kind, cryptographic hash functions lend themselves well to this application too. However, compared with standard hash functions, cryptographic hash functions tend to be much more expensive computationally. For this reason, they tend to be used in contexts where it is necessary for users to protect themselves against the possibility of forgery (the creation of data with the same digest as the expected data) by potentially malicious participants.


Pseudorandom generation and key derivation


Hash functions can also be used in the generation of pseudorandom bits, or to derive new keys or passwords from a single, secure key or password.

Hash functions based on block ciphers


There are several methods to use a block cipher to build a cryptographic hash function, specifically a one-way compression function. The methods resemble the block cipher modes of operation usually used for encryption. All well-known hash functions, including MD4, MD5, SHA-1 and SHA-2 are built from block-cipher-like components designed for the purpose, with feedback to ensure that the resulting function is not bijective. SHA-3 finalists include functions with block-cipher-like components (e.g., Skein, BLAKE) and functions based on other designs (e.g., JH, Keccak). A standard block cipher such as AES can be used in place of these custom block ciphers; that might be useful when an embedded system needs to implement both encryption and hashing with minimal code size or hardware area. However, that approach can have costs in efficiency and security. The ciphers in hash functions are built for hashing: they use large keys and blocks, can efficiently change keys every block, and have been designed and vetted for resistance to related-key attacks. General-purpose ciphers tend to have different design goals. In particular, AES has key and block sizes that make it nontrivial to use to generate long hash values; AES encryption becomes less efficient when the key changes each block; and related-key attacks make it potentially less secure for use in a hash function than for encryption.

Merkle–Damgård construction
A hash function must be able to process an arbitrary-length message into a fixed-length output. This can be achieved by breaking the input up into a series of equal-sized blocks, and operating on them in sequence using a one-way compression function. The compression function can either be specially designed for hashing or be built from a block cipher.

[Figure: The Merkle–Damgård hash construction.]

A hash function built with the Merkle–Damgård construction is as resistant to collisions as is its compression function; any collision for the full hash function can be traced back to a collision in the compression function. The last block processed should also be unambiguously length padded; this is crucial to the security of this construction. This construction is called the Merkle–Damgård construction. Most widely used hash functions, including SHA-1 and MD5, take this form. The construction has certain inherent flaws, including length-extension and generate-and-paste attacks, and cannot be parallelized. As a result, many entrants in the current NIST hash function competition are built on different, sometimes novel, constructions.


Use in building other cryptographic primitives


Hash functions can be used to build other cryptographic primitives. For these other primitives to be cryptographically secure, care must be taken to build them correctly.

Message authentication codes (MACs) (also called keyed hash functions) are often built from hash functions. HMAC is such a MAC.

Just as block ciphers can be used to build hash functions, hash functions can be used to build block ciphers. Luby–Rackoff constructions using hash functions can be provably secure if the underlying hash function is secure. Also, many hash functions (including SHA-1 and SHA-2) are built by using a special-purpose block cipher in a Davies–Meyer or other construction. That cipher can also be used in a conventional mode of operation, without the same security guarantees. See SHACAL, BEAR and LION.

Pseudorandom number generators (PRNGs) can be built using hash functions. This is done by combining a (secret) random seed with a counter and hashing it.

Some hash functions, such as Skein, Keccak, and RadioGatún, output an arbitrarily long stream and can be used as a stream cipher, and stream ciphers can also be built from fixed-length digest hash functions. Often this is done by first building a cryptographically secure pseudorandom number generator and then using its stream of random bytes as keystream. SEAL is a stream cipher that uses SHA-1 to generate internal tables, which are then used in a keystream generator more or less unrelated to the hash algorithm. SEAL is not guaranteed to be as strong (or weak) as SHA-1.
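The seed-plus-counter construction mentioned above can be sketched in a few lines of C++. The function name keystream is invented for this example, and std::hash stands in for a real cryptographic hash purely to keep the sketch self-contained; a genuine design would hash the seed and counter with something like SHA-256, and the output below must not be treated as secure.

    #include <cstdint>
    #include <functional>
    #include <string>
    #include <vector>

    // Illustrative only: std::hash is NOT a cryptographic hash. The point is the
    // structure: each output word is hash(seed || counter), and the seed stays secret.
    std::vector<std::uint64_t> keystream(const std::string& secret_seed, std::size_t n_words) {
        std::vector<std::uint64_t> out;
        std::hash<std::string> h;                              // stand-in hash function
        for (std::size_t counter = 0; counter < n_words; ++counter)
            out.push_back(h(secret_seed + "|" + std::to_string(counter)));
        return out;                                            // pseudorandom words
    }

The same pattern underlies simple key derivation: derived keys are obtained by hashing the master secret together with a distinct label or counter for each key.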

Concatenation of cryptographic hash functions


Concatenating outputs from multiple hash functions provides collision resistance as good as the strongest of the algorithms included in the concatenated result. For example, older versions of TLS/SSL use concatenated MD5 and SHA-1 sums; this ensures that a method to find collisions in one of the functions does not allow forging traffic protected with both functions. For Merkle–Damgård hash functions, the concatenated function is as collision-resistant as its strongest component,[1] but not more collision-resistant.[2] Joux[3] noted that 2-collisions lead to n-collisions: if it is feasible to find two messages with the same MD5 hash, it is effectively no more difficult to find as many messages as the attacker desires with identical MD5 hashes. Among the n messages with the same MD5 hash, there is likely to be a collision in SHA-1. The additional work needed to find the SHA-1 collision (beyond the exponential birthday search) is polynomial. This argument is summarized by Finney.[4] A more recent paper, with a full proof of the security of such a combined construction, gives a clearer and more complete explanation of the above.[5]

Cryptographic hash algorithms


There is a long list of cryptographic hash functions, although many have been found to be vulnerable and should not be used. Even if a hash function has never been broken, a successful attack against a weakened variant may undermine the experts' confidence and lead to its abandonment. For instance, in August 2004 weaknesses were found in a number of hash functions that were popular at the time, including SHA-0, RIPEMD, and MD5. This has called into question the long-term security of later algorithms which are derived from these hash functions; in particular, SHA-1 (a strengthened version of SHA-0), RIPEMD-128, and RIPEMD-160 (both strengthened versions of RIPEMD). Neither SHA-0 nor RIPEMD is widely used, since they were replaced by their strengthened versions. As of 2009, the two most commonly used cryptographic hash functions are MD5 and SHA-1. However, MD5 has been broken; an attack against it was used to break SSL in 2008.[6] The SHA-0 and SHA-1 hash functions were developed by the NSA. In February 2005, a successful attack on SHA-1 was reported, finding collisions in about 2^69 hashing operations, rather than the 2^80 expected for a 160-bit hash function. In August 2005, another successful attack on SHA-1 was reported, finding collisions in 2^63 operations.

Theoretical weaknesses of SHA-1 exist as well,[7][8] suggesting that it may become practical to break within years. New applications can avoid these problems by using more advanced members of the SHA family, such as SHA-2, or by using techniques such as randomized hashing[9][10] that do not require collision resistance. However, to ensure the long-term robustness of applications that use hash functions, there is a competition to design a replacement for SHA-2, which will be given the name SHA-3 and become a FIPS standard around 2012.[11]

Some of the following algorithms are used often in cryptography; consult the article for each specific algorithm for more information on the status of each algorithm. Note that this list does not include candidates in the current NIST hash function competition.
The table below lists the basic parameters of these algorithms. The best known collision, second-preimage, and preimage attacks against them, with their complexities and (where fewer than the full number of rounds is attacked) the number of rounds attacked,[13] are documented in the cited literature; see notes [14] through [27] below.[12]

Algorithm | Output size (bits) | Internal state size | Block size | Length size | Word size | Rounds
GOST | 256 | 256 | 256 | 256 | 32 | 256
HAVAL | 256/224/192/160/128 | 256 | 1,024 | 64 | 32 | 160/128/96
MD2 | 128 | 384 | 128 | – | 32 | 864
MD4 | 128 | 128 | 512 | 64 | 32 | 48
MD5 | 128 | 128 | 512 | 64 | 32 | 64
PANAMA | 256 | 8,736 | 256 | – | 32 | –
RadioGatún | Up to 608/1,216 (19 words) | 58 words | 3 words | – | 1–64 | –
RIPEMD | 128 | 128 | 512 | 64 | 32 | 48
RIPEMD-128/256 | 128/256 | 128/256 | 512 | 64 | 32 | 64
RIPEMD-160 | 160 | 160 | 512 | 64 | 32 | 80
RIPEMD-320 | 320 | 320 | 512 | 64 | 32 | 80
SHA-0 | 160 | 160 | 512 | 64 | 32 | 80
SHA-1 | 160 | 160 | 512 | 64 | 32 | 80
SHA-256/224 | 256/224 | 256 | 512 | 64 | 32 | 64
SHA-512/384 | 512/384 | 512 | 1,024 | 128 | 64 | 80
Tiger(2)-192/160/128 | 192/160/128 | 192 | 512 | 64 | 64 | 24
WHIRLPOOL | 512 | 512 | 512 | 256 | 8 | 10
Notes
[1] Note that any two messages that collide the concatenated function also collide each component function, by the nature of concatenation. For example, if concat(sha1(message1), md5(message1)) == concat(sha1(message2), md5(message2)) then sha1(message1) == sha1(message2) and md5(message1) == md5(message2). The concatenated function could have other problems that the strongest hash lacks -- for example, it might leak information about the message when the strongest component does not, or it might be detectably nonrandom when the strongest component is not -- but it can't be less collision-resistant.
[2] More generally, if an attack can produce a collision in one hash function's internal state, attacking the combined construction is only as difficult as a birthday attack against the other function(s). For the detailed argument, see the Joux and Finney references that follow.
[3] Antoine Joux. Multicollisions in Iterated Hash Functions. Application to Cascaded Constructions. LNCS 3152/2004, pages 306-316. Full text: http://www.springerlink.com/index/DWWVMQJU0N0A3UGJ.pdf
[4] http://article.gmane.org/gmane.comp.encryption.general/5154
[5] Jonathan J. Hoch and Adi Shamir (2008-02-20). On the Strength of the Concatenated Hash Combiner when All the Hash Functions are Weak. http://eprint.iacr.org/2008/075.pdf
[6] Alexander Sotirov, Marc Stevens, Jacob Appelbaum, Arjen Lenstra, David Molnar, Dag Arne Osvik, Benne de Weger, MD5 considered harmful today: Creating a rogue CA certificate. http://www.win.tue.nl/hashclash/rogue-ca/, accessed March 29, 2009.
[7] Xiaoyun Wang, Yiqun Lisa Yin, and Hongbo Yu, Finding Collisions in the Full SHA-1. http://people.csail.mit.edu/yiqun/SHA1AttackProceedingVersion.pdf
[8] Bruce Schneier, Cryptanalysis of SHA-1. http://www.schneier.com/blog/archives/2005/02/cryptanalysis_o.html (summarizes the Wang et al. results and their implications)
[9] Shai Halevi, Hugo Krawczyk, Update on Randomized Hashing. http://csrc.nist.gov/groups/ST/hash/documents/HALEVI_UpdateonRandomizedHashing0824.pdf
[10] Shai Halevi and Hugo Krawczyk, Randomized Hashing and Digital Signatures. http://www.ee.technion.ac.il/~hugo/rhash/
[11] NIST.gov - Computer Security Division - Computer Security Resource Center. http://csrc.nist.gov/groups/ST/hash/sha-3/index.html
[12] The internal state here means the "internal hash sum" after each compression of a data block. Most hash algorithms also internally use some additional variables, such as the length of the data compressed so far, since that is needed for the length padding at the end. See the Merkle–Damgård construction for details.
[13] When omitted, the attack applies to the full number of rounds.
[14] http://www.springerlink.com/content/2514122231284103/
[15] http://www.springerlink.com/content/n5vrtdha97a2udkx/
[16] http://eprint.iacr.org/2008/089.pdf
[17] http://www.springerlink.com/content/v6526284mu858v37/
[18] http://eprint.iacr.org/2010/016.pdf
[19] http://eprint.iacr.org/2009/223.pdf
[20] http://springerlink.com/content/d7pm142n58853467/
[21] http://eprint.iacr.org/2008/515
[22] http://www.springerlink.com/content/3540l03h1w31n6w7
[23] http://www.springerlink.com/content/3810jp9730369045/
[24] http://eprint.iacr.org/2008/469.pdf
[25] http://eprint.iacr.org/2008/270.pdf
[26] http://www.springerlink.com/content/u762587644802p38/
[27] https://www.cosic.esat.kuleuven.be/fse2009/slides/2402_1150_Schlaeffer.pdf


References

External links

Christof Paar, Jan Pelzl, "Hash Functions" (http://wiki.crypto.rub.de/Buch/movies.php), Chapter 11 of "Understanding Cryptography, A Textbook for Students and Practitioners", Springer, 2009. (The companion web site contains an online cryptography course that covers hash functions.)
"The ECRYPT Hash Function Website" (http://ehash.iaik.tugraz.at/wiki/The_eHash_Main_Page)
"Series of mini-lectures about cryptographic hash functions" (http://www.guardtime.com/educational-series-on-hashes/) by A. Buldas, 2011.
"Cryptographic Hash-Function Basics: Definitions, Implications, and Separations for Preimage Resistance, Second-Preimage Resistance, and Collision Resistance" (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.3.6200) by P. Rogaway, T. Shrimpton, 2004


Sets
Set (abstract data type)
In computer science, a set is an abstract data structure that can store certain values, without any particular order and with no repeated values. It is a computer implementation of the mathematical concept of a finite set. Unlike most other collection types, rather than retrieving a specific element from a set, one typically tests a value for membership in a set.

Some set data structures are designed for static or frozen sets that do not change after they are constructed. Static sets allow only query operations on their elements, such as checking whether a given value is in the set, or enumerating the values in some arbitrary order. Other variants, called dynamic or mutable sets, also allow the insertion and/or deletion of elements from the set.

An abstract data structure is a collection, or aggregate, of data. The data may be booleans, numbers, characters, or other data structures. If one considers the structure yielded by packaging[1] or indexing,[2] there are four basic data structures:[3][4]
1. unpackaged, unindexed: bunch
2. packaged, unindexed: set
3. unpackaged, indexed: string (sequence)
4. packaged, indexed: list (array)

In this view, the contents of a set are a bunch, and isolated data items are elementary bunches (elements). Whereas sets contain elements, bunches consist of elements. Further structuring may be achieved by considering the multiplicity of elements (sets become multisets, bunches become hyperbunches)[5] or their homogeneity (a record is a set of fields, not necessarily all of the same type).

Implementations
A set can be implemented in many ways. For example, one can use a list, ignoring the order of the elements and taking care to avoid repeated values. Sets are often implemented using various flavors of trees, tries, or hash tables. A set can be seen, and implemented, as a (partial) associative array, in which the value of each key-value pair has the unit type.

Type theory
In type theory, sets are generally identified with their indicator function: accordingly, a set of values of type A may be denoted by 2^A or ℘(A). (Subtypes and subsets may be modeled by refinement types, and quotient sets may be replaced by setoids.) The characteristic function F of a set S is defined as:

F(x) = 1 if x ∈ S, and F(x) = 0 if x ∉ S.

In theory, many other abstract data structures can be viewed as set structures with additional operations and/or additional axioms imposed on the standard operations. For example, an abstract heap can be viewed as a set structure with a min(S) operation that returns the element of smallest value.
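This indicator-function view can be made concrete in C++ by representing a set directly as a predicate. The names IntSet, singleton, set_union and set_intersection below are invented for this sketch; membership testing is just function application.

    #include <functional>

    // A set of ints modelled as its characteristic (indicator) function S : int -> bool.
    using IntSet = std::function<bool(int)>;

    IntSet empty_set()                          { return [](int) { return false; }; }
    IntSet singleton(int v)                     { return [v](int x) { return x == v; }; }
    IntSet set_union(IntSet a, IntSet b)        { return [a, b](int x) { return a(x) || b(x); }; }
    IntSet set_intersection(IntSet a, IntSet b) { return [a, b](int x) { return a(x) && b(x); }; }

    // Membership is function application: evens(4) yields true, evens(3) yields false.
    IntSet evens = [](int x) { return x % 2 == 0; };

Such a representation supports membership queries for arbitrary (even infinite) domains but, unlike the concrete implementations discussed below, cannot enumerate its elements.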


Operations
Core set-theoretical operations
One may define the operations of the algebra of sets (a C++ sketch follows the list):

union(S,T): returns the union of sets S and T.
intersection(S,T): returns the intersection of sets S and T.
difference(S,T): returns the difference of sets S and T.
subset(S,T): a predicate that tests whether the set S is a subset of set T.
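A possible realisation of these four operations in C++, using std::set and the sorted-range algorithms from the standard library (the free-function names mirror the operations above and are chosen for this sketch):

    #include <algorithm>
    #include <iterator>
    #include <set>

    std::set<int> set_union(const std::set<int>& s, const std::set<int>& t) {
        std::set<int> r;
        std::set_union(s.begin(), s.end(), t.begin(), t.end(), std::inserter(r, r.end()));
        return r;
    }

    std::set<int> set_intersection(const std::set<int>& s, const std::set<int>& t) {
        std::set<int> r;
        std::set_intersection(s.begin(), s.end(), t.begin(), t.end(), std::inserter(r, r.end()));
        return r;
    }

    std::set<int> set_difference(const std::set<int>& s, const std::set<int>& t) {
        std::set<int> r;
        std::set_difference(s.begin(), s.end(), t.begin(), t.end(), std::inserter(r, r.end()));
        return r;
    }

    bool subset(const std::set<int>& s, const std::set<int>& t) {
        // true when every element of S also appears in T
        return std::includes(t.begin(), t.end(), s.begin(), s.end());
    }

Because std::set iterates its elements in sorted order, each of these algorithms runs in time linear in the combined sizes of the two inputs.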

Static sets
Typical operations that may be provided by a static set structure S are:

is_element_of(x,S): checks whether the value x is in the set S.
is_empty(S): checks whether the set S is empty.
size(S) or cardinality(S): returns the number of elements in S.
iterate(S): returns a function that returns one more value of S at each call, in some arbitrary order.
enumerate(S): returns a list containing the elements of S in some arbitrary order.

build(x1,x2,…,xn): creates a set structure with values x1,x2,…,xn.
create_from(collection): creates a new set structure containing all the elements of the given collection or all the elements returned by the given iterator.

Dynamic sets
Dynamic set structures typically add:

create(): creates a new, initially empty set structure.
create_with_capacity(n): creates a new set structure, initially empty but capable of holding up to n elements.
add(S,x): adds the element x to S, if it is not present already.
remove(S,x): removes the element x from S, if it is present.
capacity(S): returns the maximum number of values that S can hold.

Some set structures may allow only some of these operations. The cost of each operation will depend on the implementation, and possibly also on the particular values stored in the set and the order in which they are inserted.

Additional operations
There are many other operations that can (in principle) be defined in terms of the above (a C++ sketch of map, filter, and fold appears at the end of this subsection), such as:

pop(S): returns an arbitrary element of S, deleting it from S.
map(F,S): returns the set of distinct values resulting from applying function F to each element of S.
filter(P,S): returns the subset containing all elements of S that satisfy a given predicate P.
fold(A0,F,S): returns the value A|S| after applying Ai+1 := F(Ai, e) for each element e of S.
clear(S): deletes all elements of S.
equal(S1,S2): checks whether the two given sets are equal (i.e. contain all and only the same elements).
hash(S): returns a hash value for the static set S such that if equal(S1,S2) then hash(S1) = hash(S2).

Other operations can be defined for sets with elements of a special type:

sum(S): returns the sum of all elements of S for some definition of "sum". For example, over integers or reals, it may be defined as fold(0, add, S).
nearest(S,x): returns the element of S that is closest in value to x (by some metric).
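The following sketch shows possible definitions of map, filter, and fold over a std::set<int>, following the descriptions above; the function names and the use of std::function are illustrative choices rather than a fixed interface.

    #include <functional>
    #include <numeric>
    #include <set>

    std::set<int> map_set(const std::function<int(int)>& f, const std::set<int>& s) {
        std::set<int> r;                       // duplicates produced by f collapse automatically
        for (int e : s) r.insert(f(e));
        return r;
    }

    std::set<int> filter_set(const std::function<bool(int)>& p, const std::set<int>& s) {
        std::set<int> r;
        for (int e : s) if (p(e)) r.insert(e);
        return r;
    }

    int fold_set(int a0, const std::function<int(int, int)>& f, const std::set<int>& s) {
        return std::accumulate(s.begin(), s.end(), a0, f);   // A_{i+1} := F(A_i, e)
    }

For example, sum(S) over integers can then be written as fold_set(0, std::plus<int>{}, S), matching the fold(0, add, S) definition above.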


Implementations
Sets can be implemented using various data structures, which provide different time and space trade-offs for various operations. Some implementations are designed to improve the efficiency of very specialized operations, such as nearest or union. Implementations described as "general use" typically strive to optimize the element_of, add, and delete operations. Sets are commonly implemented in the same way as associative arrays, namely, a self-balancing binary search tree for sorted sets (which has O(log n) for most operations), or a hash table for unsorted sets (which has O(1) average-case, but O(n) worst-case, for most operations). A sorted linear hash table[6] may be used to provide deterministically ordered sets.

Other popular methods include arrays. In particular, a subset of the integers 1..n can be implemented efficiently as an n-bit bit array, which also supports very efficient union and intersection operations. A Bloom map implements a set probabilistically, using a very compact representation but risking a small chance of false positives on queries.

The Boolean set operations can be implemented in terms of more elementary operations (pop, clear, and add), but specialized algorithms may yield lower asymptotic time bounds. If sets are implemented as sorted lists, for example, the naive algorithm for union(S,T) will take time proportional to the length m of S times the length n of T, whereas a variant of the list merging algorithm (sketched below) will do the job in time proportional to m+n. Moreover, there are specialized set data structures (such as the union-find data structure) that are optimized for one or more of these operations, at the expense of others.
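The merge-based union mentioned above can be sketched as follows; the sets are assumed to be stored as strictly increasing sorted vectors, and the function name sorted_union is chosen for this example.

    #include <vector>

    // Union of two sets stored as strictly increasing sorted vectors,
    // in time proportional to m + n (a variant of list merging).
    std::vector<int> sorted_union(const std::vector<int>& s, const std::vector<int>& t) {
        std::vector<int> r;
        std::size_t i = 0, j = 0;
        while (i < s.size() && j < t.size()) {
            if (s[i] < t[j])        r.push_back(s[i++]);
            else if (t[j] < s[i])   r.push_back(t[j++]);
            else                    { r.push_back(s[i]); ++i; ++j; }   // common element, emitted once
        }
        while (i < s.size()) r.push_back(s[i++]);
        while (j < t.size()) r.push_back(t[j++]);
        return r;
    }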

Language support
One of the earliest languages to support sets was Pascal; many languages now include it, whether in the core language or in a standard library.

Java offers the Set interface to support sets (with the HashSet class implementing it using a hash table), and the SortedSet sub-interface to support sorted sets (with the TreeSet class implementing it using a binary search tree).

Apple's Foundation framework (part of Cocoa) provides the Objective-C classes NSSet [7], NSMutableSet [8], NSCountedSet [9], NSOrderedSet [10], and NSMutableOrderedSet [11]. The CoreFoundation APIs provide the CFSet [12] and CFMutableSet [13] types for use in C.

Python has built-in set and frozenset types [14] since 2.4, and since Python 3.0 and 2.7 supports non-empty set literals using a curly-bracket syntax, e.g.: { x, y, z }.

The .NET Framework provides the generic HashSet [15] and SortedSet [16] classes that implement the generic ISet [17] interface.

Smalltalk's class library includes Set and IdentitySet, using equality and identity for the inclusion test respectively. Many dialects provide variations for compressed storage (NumberSet, CharacterSet), for ordering (OrderedSet, SortedSet, etc.) or for weak references (WeakIdentitySet).

Ruby's standard library includes a set [18] module which contains Set and SortedSet classes that implement sets using hash tables, the latter allowing iteration in sorted order.

OCaml's standard library contains a Set module, which implements a functional set data structure using binary search trees.

The GHC implementation of Haskell provides a Data.Set [19] module, which implements a functional set data structure using binary search trees.

The Tcl Tcllib package provides a set module which implements a set data structure based upon TCL lists.

As noted in the previous section, in languages which do not directly support sets but do support associative arrays, sets can be emulated using associative arrays, by using the elements as keys and using a dummy value as the values, which are ignored.


In C++
In C++, the Standard Template Library (STL) provides the set template class, which implements a sorted set using a binary search tree; SGI's STL also provides the hash_set template class, which implements a set using a hash table. In sets, the elements themselves are the keys, in contrast to sequenced containers, where elements are accessed using their (relative or absolute) position. Set elements must have a strict weak ordering. Some of the member functions in C++ and their descriptions are given in the table below:

set member functions


iterator begin();
    Returns an iterator to the first element of the set.

iterator end();
    Returns an iterator to the past-the-end position of the set.

bool empty() const;
    Checks whether the set container is empty (i.e. has a size of 0).

iterator find(const key_type &x) const;
    Searches the container for an element x; if found, returns an iterator to it, otherwise returns an iterator equal to set::end.

void insert(InputIterator first, InputIterator last);
pair<iterator, bool> insert(const value_type& a);
iterator insert(iterator position, const value_type& a);
    Inserts elements into the set. The first version inserts the range [first, last). The second version returns a pair, with pair::first set to an iterator pointing either to the newly inserted element or to the element that already had the same value in the set, and pair::second set to true if a new element was inserted or false if an element with the same value already existed. The third version returns an iterator pointing either to the newly inserted element or to the existing element with the same value, using position as a hint.

void clear();
    Removes all elements in the set, making its size 0.
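A brief usage example of these member functions (a sketch; the values are arbitrary):

    #include <iostream>
    #include <set>

    int main() {
        std::set<int> s;
        auto [it, inserted] = s.insert(42);     // pair<iterator, bool>; inserted == true
        s.insert(42);                           // duplicate value: this insert is a no-op
        s.insert({7, 19});
        if (s.find(7) != s.end())
            std::cout << "7 is in the set, size = " << s.size() << '\n';
        s.clear();                              // size becomes 0
        std::cout << std::boolalpha << "empty: " << s.empty() << '\n';
    }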

Multiset
A variation of the set is the multiset or bag, which is the same as a set data structure, but allows repeated ("equal") values (duplicates). The set of all bags over type T is given by the expression bag T. It is possible for objects in computer science to be considered "equal" under some equivalence relation but still distinct under another relation. Some types of multiset implementations will store distinct equal objects as separate items in the data structure; while others will collapse it down to one version (the first one encountered) and keep a positive integer count of the multiplicity of the element. C++'s Standard Template Library provides the multiset class for the sorted multiset, and SGI's STL provides the hash_multiset class, which implements a multiset using a hash table. For Java, third-party libraries provide multiset functionality: Apache Commons Collections provides the Bag [20] and SortedBag interfaces, with implementing classes like HashBag and TreeBag. Google Collections provides the Multiset [21] interface, with implementing classes like HashMultiset and TreeMultiset. Apple provides the NSCountedSet [9] class as part of Cocoa, and the CFBag [22] and CFMutableBag [23] types as part of CoreFoundation. Python's standard library includes collections.Counter [24], which is similar to a multiset. Smalltalk includes the Bag class, which can be instantiated to use either identity or equality as predicate for inclusion test.

Where a multiset data structure is not available, a workaround is to use a regular set, but override the equality predicate of its items to always return "not equal" on distinct objects (however, such a set will still not be able to store multiple occurrences of the same object), or to use an associative array mapping the values to their integer multiplicities (this will not be able to distinguish between equal elements at all). A sketch of the latter workaround appears after the list of bag operations below.

Typical operations on bags:

Bag membership: if B : bag T and x : T, then the predicate x in B is true if, and only if, x appears in B at least once.
Sub-bags: if B1, B2 : bag T, then B1 is a sub-bag of B2 if each element that occurs in B1 occurs in B1 no more often than it occurs in B2.
Counting bags: if B : bag T and x : T, then the number of times x occurs in B (a natural number) is given by the expression B # x.
Scaling bags: if B : bag T and n : N, then n B is a bag which contains the same elements as B, except that every element that occurs m times in B occurs n * m times in n B.
Bag union: if B1, B2 : bag T, then the union of B1 and B2 is a bag that contains just those values that occur in either B1 or B2, except that the number of times a value x occurs in the union is equal to (B1 # x) + (B2 # x).
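The count-based workaround and the bag operations above can be sketched in C++ as follows; the alias Bag and the function names are invented for this example.

    #include <map>
    #include <string>

    // Multiset workaround: an associative array mapping each value to its multiplicity.
    using Bag = std::map<std::string, unsigned>;

    void add(Bag& b, const std::string& x)              { ++b[x]; }
    unsigned count(const Bag& b, const std::string& x)  {         // B # x
        auto it = b.find(x);
        return it == b.end() ? 0u : it->second;
    }
    bool member(const Bag& b, const std::string& x)     { return count(b, x) > 0; }

    // Sub-bag: every element of b1 occurs in b2 at least as often as in b1.
    bool sub_bag(const Bag& b1, const Bag& b2) {
        for (const auto& [x, m] : b1)
            if (count(b2, x) < m) return false;
        return true;
    }

    // Bag union: multiplicities add, (B1 # x) + (B2 # x).
    Bag bag_union(const Bag& b1, const Bag& b2) {
        Bag r = b1;
        for (const auto& [x, m] : b2) r[x] += m;
        return r;
    }

As the text notes, this representation keeps one stored copy per distinct value, so it cannot distinguish between "equal but distinct" objects; it only remembers how many there were.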


References
[1] "Packaging" consists in supplying a container for an aggregation of objects in order to turn them into a single object. Consider a function call: without packaging, a function can be called to act upon a bunch only by passing each bunch element as a separate argument, which complicates the function's signature considerably (and is just not possible in some programming languages). By packaging the bunch's elements into a set, the function may now be called upon a single, elementary argument: the set object (the bunch's package). [2] Indexing is possible when the elements being considered are totally ordered. Being without order, the elements of a multiset (for example) do not have lesser/greater or preceding/succeeding relationships: they can only be compared in absolute terms (same/different). [3] Hehner, Eric C. R. (1981), "Bunch Theory: A Simple Set Theory for Computer Science", Information Processing Letters 12 (1) [4] Hehner, Eric C. R. (2004), A Practical Theory of Programming, second edition (http:/ / www. cs. utoronto. ca/ ~hehner/ aPToP/ ), [5] Hehner, Eric C. R. (2012), A Practical Theory of Programming, 2012-3-30 edition (http:/ / www. cs. toronto. edu/ ~hehner/ aPToP/ ), [6] Wang, Thomas (1997), Sorted Linear Hash Table (http:/ / www. concentric. net/ ~Ttwang/ tech/ sorthash. htm), [7] http:/ / developer. apple. com/ documentation/ Cocoa/ Reference/ Foundation/ Classes/ NSSet_Class/ [8] http:/ / developer. apple. com/ documentation/ Cocoa/ Reference/ Foundation/ Classes/ NSMutableSet_Class/ [9] http:/ / developer. apple. com/ documentation/ Cocoa/ Reference/ Foundation/ Classes/ NSCountedSet_Class/ [10] http:/ / developer. apple. com/ library/ mac/ #documentation/ Foundation/ Reference/ NSOrderedSet_Class/ Reference/ Reference. html [11] https:/ / developer. apple. com/ library/ mac/ #documentation/ Foundation/ Reference/ NSMutableOrderedSet_Class/ Reference/ Reference. html [12] http:/ / developer. apple. com/ documentation/ CoreFoundation/ Reference/ CFSetRef/ [13] http:/ / developer. apple. com/ documentation/ CoreFoundation/ Reference/ CFMutableSetRef/ [14] http:/ / docs. python. org/ library/ stdtypes. html#set-types-set-frozenset [15] http:/ / msdn. microsoft. com/ en-us/ library/ bb359438. aspx [16] http:/ / msdn. microsoft. com/ en-us/ library/ dd412070. aspx [17] http:/ / msdn. microsoft. com/ en-us/ library/ dd412081. aspx [18] http:/ / ruby-doc. org/ stdlib/ libdoc/ set/ rdoc/ index. html [19] http:/ / hackage. haskell. org/ packages/ archive/ containers/ 0. 2. 0. 1/ doc/ html/ Data-Set. html [20] http:/ / commons. apache. org/ collections/ api-release/ org/ apache/ commons/ collections/ Bag. html [21] http:/ / google-collections. googlecode. com/ svn/ trunk/ javadoc/ com/ google/ common/ collect/ Multiset. html [22] http:/ / developer. apple. com/ documentation/ CoreFoundation/ Reference/ CFBagRef/ [23] http:/ / developer. apple. com/ documentation/ CoreFoundation/ Reference/ CFMutableBagRef/ [24] http:/ / docs. python. org/ library/ collections. html#collections. Counter


Bit array
A bit array (also known as bitmap, bitset, bit string, or bit vector) is an array data structure that compactly stores bits. It can be used to implement a simple set data structure. A bit array is effective at exploiting bit-level parallelism in hardware to perform operations quickly. A typical bit array stores kw bits, where w is the number of bits in the unit of storage, such as a byte or word, and k is some nonnegative integer. If w does not divide the number of bits to be stored, some space is wasted due to internal fragmentation.

Definition
A bit array is a mapping from some domain (almost always a range of integers) to values in the set {0, 1}. The values can be interpreted as dark/light, absent/present, locked/unlocked, valid/invalid, et cetera. The point is that there are only two possible values, so they can be stored in one bit. The array can be viewed as a subset of the domain (e.g. {0, 1, 2, ..., n−1}), where a 1 bit indicates a number in the set and a 0 bit a number not in the set. This set data structure uses about n/w words of space, where w is the number of bits in each machine word. Whether the least significant bit or the most significant bit indicates the smallest-index number is largely irrelevant, but the former tends to be preferred.

Basic operations
Although most machines are not able to address individual bits in memory, nor have instructions to manipulate single bits, each bit in a word can be singled out and manipulated using bitwise operations. In particular:

OR can be used to set a bit to one: 11101010 OR 00000100 = 11101110
AND can be used to set a bit to zero: 11101010 AND 11111101 = 11101000
AND together with zero-testing can be used to determine whether a bit is set:
    11101010 AND 00000001 = 00000000 = 0
    11101010 AND 00000010 = 00000010 ≠ 0
XOR can be used to invert or toggle a bit:
    11101010 XOR 00000100 = 11101110
    11101110 XOR 00000100 = 11101010

To obtain the bit mask needed for these operations, we can use a bit shift operator to shift the number 1 to the left by the appropriate number of places, as well as bitwise negation if necessary.

Given two bit arrays of the same size representing sets, we can compute their union, intersection, and set-theoretic difference using n/w simple bit operations each (2n/w for difference), as well as the complement of either:

for i from 0 to n/w-1
    complement_a[i] := not a[i]
    union[i]        := a[i] or b[i]
    intersection[i] := a[i] and b[i]
    difference[i]   := a[i] and (not b[i])
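In C++ the single-bit operations and the word-at-a-time set operations might look as follows; the convention assumed here is that bit i lives in word i/64 at bit position i mod 64, and the helper names are chosen for this sketch.

    #include <cstdint>
    #include <vector>

    using Word = std::uint64_t;
    constexpr std::size_t W = 64;                 // bits per word

    void set_bit(std::vector<Word>& a, std::size_t i)     { a[i / W] |=  (Word{1} << (i % W)); }
    void clear_bit(std::vector<Word>& a, std::size_t i)   { a[i / W] &= ~(Word{1} << (i % W)); }
    void toggle_bit(std::vector<Word>& a, std::size_t i)  { a[i / W] ^=  (Word{1} << (i % W)); }
    bool test_bit(const std::vector<Word>& a, std::size_t i) {
        return (a[i / W] >> (i % W)) & Word{1};
    }

    // Word-at-a-time union of two bit arrays of equal length, n/w operations in total.
    void union_into(std::vector<Word>& dst, const std::vector<Word>& b) {
        for (std::size_t i = 0; i < dst.size(); ++i) dst[i] |= b[i];
    }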

If we wish to iterate through the bits of a bit array, we can do this efficiently using a doubly nested loop that loops through each word, one at a time. Only n/w memory accesses are required:

for i from 0 to n/w-1
    index := 0        // if needed
    word := a[i]
    for b from 0 to w-1
        value := word and 1 ≠ 0
        word := word shift right 1
        // do something with value
        index := index + 1    // if needed

Both of these code samples exhibit ideal locality of reference, and so get a large performance boost from a data cache. If a cache line is k words, only about n/wk cache misses will occur.


More complex operations


Population / Hamming weight
If we wish to find the number of 1 bits in a bit array, sometimes called the population count or Hamming weight, there are efficient branch-free algorithms that can compute the number of bits in a word using a series of simple bit operations. We simply run such an algorithm on each word and keep a running total. Counting zeros is similar. See the Hamming weight article for examples of an efficient implementation.

Sorting
Similarly, sorting a bit array is trivial to do in O(n) time using counting sort: we count the number of ones k, fill the last k/w words with ones, set only the low k mod w bits of the next word, and set the rest to zero.
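A counting-sort sketch in C++20, using std::popcount for the counting step. It assumes the convention that bit i is stored in word i/64 at bit position i mod 64 (so all 0 bits end up at the low indices), and that n is a multiple of 64; the word-filling detail therefore differs slightly from the description above, but the idea is the same.

    #include <bit>
    #include <cstdint>
    #include <vector>

    // Sort an n-bit array (n a multiple of 64): all 0 bits first, then all 1 bits.
    void sort_bits(std::vector<std::uint64_t>& a, std::size_t n) {
        std::size_t k = 0;                                   // number of 1 bits
        for (std::uint64_t w : a) k += std::popcount(w);
        std::size_t zeros = n - k;                           // 0 bits occupy indices [0, zeros)
        for (std::size_t i = 0; i < a.size(); ++i) {
            std::size_t lo = i * 64;                         // first bit index in this word
            if (lo + 64 <= zeros)      a[i] = 0;                              // all zeros
            else if (lo >= zeros)      a[i] = ~std::uint64_t{0};              // all ones
            else                       a[i] = ~std::uint64_t{0} << (zeros - lo);  // mixed word
        }
    }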

Inversion
Vertical flipping of a one-bit-per-pixel image, or some FFT algorithms, requires flipping the bits of individual words (so b31 b30 ... b0 becomes b0 ... b30 b31). When this operation is not available on the processor, it is still possible to proceed by successive passes, in this example on 32 bits:

exchange two 16-bit halfwords
exchange bytes by pairs (0xddccbbaa -> 0xccddaabb)
...
swap bits by pairs
swap bits (b31 b30 ... b1 b0 -> b30 b31 ... b0 b1)

The last operation can be written ((x & 0x55555555) << 1) | ((x & 0xAAAAAAAA) >> 1).
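Putting all the passes together gives the following C++ sketch of a full 32-bit reversal:

    #include <cstdint>

    std::uint32_t reverse_bits(std::uint32_t x) {
        x = (x << 16) | (x >> 16);                                   // exchange 16-bit halfwords
        x = ((x & 0x00FF00FFu) << 8) | ((x & 0xFF00FF00u) >> 8);     // exchange bytes by pairs
        x = ((x & 0x0F0F0F0Fu) << 4) | ((x & 0xF0F0F0F0u) >> 4);     // exchange nibbles by pairs
        x = ((x & 0x33333333u) << 2) | ((x & 0xCCCCCCCCu) >> 2);     // swap bits by pairs
        x = ((x & 0x55555555u) << 1) | ((x & 0xAAAAAAAAu) >> 1);     // swap adjacent bits
        return x;
    }

Each pass halves the size of the groups being exchanged, so five passes suffice for 32 bits.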

Find first one


The find first one or find first set operation identifies the index or position of the least significant one bit in a word, and has widespread hardware support and efficient algorithms for its computation. When a priority queue is stored in a bit array, find first one can be used to identify the highest priority element in the queue. To expand a word-size find first one to longer arrays, one can find the first nonzero word and then run find first one on that word. The related operations find first zero, count leading zeros, count leading ones, count trailing zeros, count trailing ones, and log base 2 (see find first set) can also be extended to a bit array in a straightforward manner.
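Expanding a word-size find-first-set to a whole array, as described above, can be sketched as follows in C++20, where std::countr_zero (from <bit>) provides the per-word operation; the function name find_first_one is chosen for this example.

    #include <bit>
    #include <cstdint>
    #include <optional>
    #include <vector>

    // Index of the least significant 1 bit in the whole array, if any.
    std::optional<std::size_t> find_first_one(const std::vector<std::uint64_t>& a) {
        for (std::size_t i = 0; i < a.size(); ++i)
            if (a[i] != 0)                         // first nonzero word
                return i * 64 + static_cast<std::size_t>(std::countr_zero(a[i]));
        return std::nullopt;                       // the array contains no 1 bits
    }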


Compression
Large bit arrays tend to have long streams of zeroes or ones. This phenomenon wastes storage and processing time. Run-length encoding is commonly used to compress these long streams. However, by compressing bit arrays too aggressively we run the risk of losing the benefits due to bit-level parallelism (vectorization). Thus, instead of compressing bit arrays as streams of bits, we might compress them as streams of bytes or words (see Bitmap index (compression)).

Examples:

compressedbitset [1]: WAH Compressed BitSet for Java
javaewah [2]: a compressed alternative to the Java BitSet class (using Enhanced WAH)
CONCISE [3]: COmpressed 'N' Composable Integer Set, another bitmap compression scheme for Java
EWAHBoolArray [4]: a compressed bitmap/bitset class in C++
CSharpEWAH [5]: a compressed bitset class in C#

Advantages and disadvantages


Bit arrays, despite their simplicity, have a number of marked advantages over other data structures for the same problems:

They are extremely compact; few other data structures can store n independent pieces of data in n/w words.
They allow small arrays of bits to be stored and manipulated in the register set for long periods of time with no memory accesses.
Because of their ability to exploit bit-level parallelism, limit memory access, and maximally use the data cache, they often outperform many other data structures on practical data sets, even those that are more asymptotically efficient.

However, bit arrays aren't the solution to everything. In particular:

Without compression, they are wasteful set data structures for sparse sets (those with few elements compared to their range) in both time and space. For such applications, compressed bit arrays, Judy arrays, tries, or even Bloom filters should be considered instead.
Accessing individual elements can be expensive and difficult to express in some languages. If random access is more common than sequential and the array is relatively small, a byte array may be preferable on a machine with byte addressing. A word array, however, is probably not justified due to the huge space overhead and additional cache misses it causes, unless the machine only has word addressing.

Applications
Because of their compactness, bit arrays have a number of applications in areas where space or efficiency is at a premium. Most commonly, they are used to represent a simple group of boolean flags or an ordered sequence of boolean values. Bit arrays are used for priority queues, where the bit at index k is set if and only if k is in the queue; this data structure is used, for example, by the Linux kernel, and benefits strongly from a find-first-zero operation in hardware. Bit arrays can be used for the allocation of memory pages, inodes, disk sectors, etc. In such cases, the term bitmap may be used. However, this term is frequently used to refer to raster images, which may use multiple bits per pixel. Another application of bit arrays is the Bloom filter, a probabilistic set data structure that can store large sets in a small space in exchange for a small probability of error. It is also possible to build probabilistic hash tables based on bit arrays that accept either false positives or false negatives.

Bit arrays and the operations on them are also important for constructing succinct data structures, which use close to the minimum possible space. In this context, operations like finding the nth 1 bit or counting the number of 1 bits up to a certain position become important.

Bit arrays are also a useful abstraction for examining streams of compressed data, which often contain elements that occupy portions of bytes or are not byte-aligned. For example, the compressed Huffman coding representation of a single 8-bit character can be anywhere from 1 to 255 bits long.

In information retrieval, bit arrays are a good representation for the posting lists of very frequent terms. If we compute the gaps between adjacent values in a list of strictly increasing integers and encode them using unary coding, the result is a bit array with a 1 bit in the nth position if and only if n is in the list. The implied probability of a gap of n is 1/2^n. This is also the special case of Golomb coding where the parameter M is 1; this parameter is only normally selected when −log(2−p)/log(1−p) ≤ 1, or roughly when the term occurs in at least 38% of documents.


Language support
The C programming language's bitfields, pseudo-objects found in structs with size equal to some number of bits, are in fact small bit arrays; they are limited in that they cannot span words. Although they give a convenient syntax, the bits are still accessed using bitwise operators on most machines, and they can only be defined statically (like C's static arrays, their sizes are fixed at compile-time). It is also a common idiom for C programmers to use words as small bit arrays and access bits of them using bit operators. A widely available header file included in the X11 system, xtrapbits.h, is "a portable way for systems to define bit field manipulation of arrays of bits." A more explanatory description of the aforementioned approach can be found in the comp.lang.c faq [6].

In C++, although individual bools typically occupy the same space as a byte or an integer, the STL type vector<bool> is a partial template specialization in which bits are packed as a space efficiency optimization. Since bytes (and not bits) are the smallest addressable unit in C++, the [] operator does not return a reference to an element, but instead returns a proxy reference. This might seem a minor point, but it means that vector<bool> is not a standard STL container, which is why the use of vector<bool> is generally discouraged. Another unique STL class, bitset,[7] creates a vector of bits fixed at a particular size at compile-time, and in its interface and syntax more resembles the idiomatic use of words as bit sets by C programmers. It also has some additional power, such as the ability to efficiently count the number of bits that are set. The Boost C++ Libraries provide a dynamic_bitset class[8] whose size is specified at run-time.

The D programming language provides bit arrays in both of its competing standard libraries. In Phobos, they are provided in std.bitmanip, and in Tango, they are provided in tango.core.BitArray. As in C++, the [] operator does not return a reference, since individual bits are not directly addressable on most hardware, but instead returns a bool.

In Java, the class BitSet creates a bit array that is then manipulated with functions named after bitwise operators familiar to C programmers. Unlike the bitset in C++, the Java BitSet does not have a "size" state (it has an effectiv