NOTES VI
File Organizations and Indexing
7 Indexes as Access Paths
8 Types of Single-level Indexes
8.1 Primary Indexes
8.2 Clustering Indexes
8.3 Secondary Indexes
Indexes as Access Paths
1 Introduction
- Indexes: access structures
  o The index is called an access path on the field
  o Used to speed up the retrieval of records in response to certain search conditions
  o Indexing fields: used to construct the index
- A single-level index is an auxiliary file that makes it more efficient to search for a record in the data file
- The index is usually specified on one field of the file (although it could be specified on several fields)
- One form of an index is a file of entries <field value, pointer to record>, which is ordered by field value
- The index file usually occupies considerably fewer disk blocks than the data file because its entries are much smaller
- A binary search on the index yields a pointer to the file record
Example: Given the following data file:
    EMPLOYEE(NAME, SSN, ADDRESS, JOB, SAL, ...)
Suppose that:
    record size R = 150 bytes
    block size B = 512 bytes
    r = 30000 records
Then, we get:
    blocking factor Bfr = B div R = 512 div 150 = 3 records/block
    number of file blocks b = ⌈r/Bfr⌉ = ⌈30000/3⌉ = 10000 blocks
For an index on the SSN field, assume the field size V_SSN = 9 bytes and the record pointer size P_R = 7 bytes. Then:
    index entry size R_I = (V_SSN + P_R) = (9 + 7) = 16 bytes
    index blocking factor Bfr_I = B div R_I = 512 div 16 = 32 entries/block
    number of index blocks b_I = ⌈r/Bfr_I⌉ = ⌈30000/32⌉ = 938 blocks
    binary search needs ⌈log2 b_I⌉ = ⌈log2 938⌉ = 10 block accesses
This is compared to an average linear search cost of:
    (b/2) = 10000/2 = 5000 block accesses
If the file records are ordered, the binary search cost would be:
    ⌈log2 b⌉ = ⌈log2 10000⌉ = 14 block accesses
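The arithmetic of this example can be checked with a short Python sketch. The helper names (blocking_factor, blocks_needed, binary_search_accesses) are illustrative and not part of the notes; the sketch assumes unspanned records, so the blocking factor is rounded down and block counts are rounded up.

```python
import math

def blocking_factor(block_size, entry_size):
    """Entries per block for unspanned storage: floor(B / R)."""
    return block_size // entry_size

def blocks_needed(num_entries, bfr):
    """Blocks required to hold num_entries at bfr entries per block."""
    return math.ceil(num_entries / bfr)

def binary_search_accesses(num_blocks):
    """Worst-case block accesses for a binary search over num_blocks."""
    return math.ceil(math.log2(num_blocks))

# Values from the EMPLOYEE example.
R, B, r = 150, 512, 30000
V_SSN, P_R = 9, 7

bfr = blocking_factor(B, R)                # 3 records/block
b = blocks_needed(r, bfr)                  # 10000 file blocks
bfr_i = blocking_factor(B, V_SSN + P_R)    # 32 entries/block
b_i = blocks_needed(r, bfr_i)              # 938 index blocks

print(binary_search_accesses(b_i))  # binary search on the index: 10
print(b // 2)                       # average linear search on the file: 5000
print(binary_search_accesses(b))    # binary search on an ordered file: 14
```

Note that the linear and binary search costs on the data file are computed from the number of file blocks b, not from the number of records r.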
2 Types of Single-Level Indexes
2.1 Primary Index
- Defined on an ordered data file
- The data file is ordered on a key field
- Includes one index entry for each block in the data file; the index entry has the key field value for the first record in the block, which is called the block anchor
  o A primary index is an ordered file whose records are of fixed length with two fields.
    The first field is of the same data type as the ordering key field (called the primary key) of the data file, and
    the second field is a pointer to a disk block (a block address).
    We refer to the two field values of index entry i as <K(i), P(i)>.
- Examples (refer to figure)
  o We use the NAME field as primary key, because that is the ordering key field of the file (assuming that each value of NAME is unique).
  o Each entry in the index has a NAME value and a pointer. The first three index entries are as follows:
    <K(1) = (Aaron, Ed), P(1) = address of block 1>
    <K(2) = (Adams, John), P(2) = address of block 2>
    <K(3) = (Alexander, Ed), P(3) = address of block 3>
- Indexes can also be characterized as dense or sparse.
  o A dense index has an index entry for every search key value (and hence every record) in the data file.
  o A sparse (or nondense) index has index entries for only some of the search values.
- A primary index is hence a nondense (sparse) index,
  o since it includes an entry for each disk block of the data file rather than for every search value (or every record).
- The index file for a primary index needs substantially fewer blocks than does the data file, for two reasons.
  o First, there are fewer index entries than there are records in the data file.
  o Second, each index entry is typically smaller in size than a data record because it has only two fields;
    consequently, more index entries than data records can fit in one block.
    A binary search on the index file hence requires fewer block accesses than a binary search on the data file.
- A record whose primary key value is K lies in the block whose address is P(i),
  o where K(i) ≤ K < K(i + 1).
  o The i-th block in the data file contains all such records because of the physical ordering of the file records on the primary key field.
  o To retrieve a record, given the value K of its primary key field,
    we do a binary search on the index file to find the appropriate index entry i, and
    then retrieve the data file block whose address is P(i).
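The retrieval steps above can be sketched in Python. This is a minimal in-memory illustration, not an on-disk implementation: the list data_blocks stands in for the data file, block numbers stand in for block addresses, and every record other than the three anchors named in the notes is made-up filler.

```python
from bisect import bisect_right

# Hypothetical toy data file: blocks of (NAME, payload) records, ordered on NAME.
data_blocks = [
    [("Aaron, Ed", 1), ("Abbott, Diane", 2)],
    [("Adams, John", 3), ("Akers, Jan", 4)],
    [("Alexander, Ed", 5), ("Alfred, Bob", 6)],
]

# Primary index: one <K(i), P(i)> entry per block; K(i) is the block anchor.
index_keys = [block[0][0] for block in data_blocks]   # K(i)
index_ptrs = list(range(len(data_blocks)))            # P(i), here a block number

def lookup(key):
    """Return the payload of the record with this primary key, or None."""
    # Binary search for the last index entry i with K(i) <= key.
    i = bisect_right(index_keys, key) - 1
    if i < 0:
        return None                        # key sorts before the first anchor
    block = data_blocks[index_ptrs[i]]     # the single data block access
    for k, payload in block:               # search inside the block in memory
        if k == key:
            return payload
    return None

print(lookup("Akers, Jan"))   # 4
print(lookup("Zed, Al"))      # None
```

The index holds only three entries for six records, which is what makes it nondense: a key that is not an anchor is still found, because it falls between two consecutive anchors.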
- Example 1
- Suppose that we have an ordered file
  o with r = 30,000 records stored on a disk
  o with block size B = 1024 bytes.
  o File records are of fixed size and are unspanned,
    with record length R = 100 bytes.
  o The blocking factor for the file would be bfr = ⌊B/R⌋ = ⌊1024/100⌋ = 10 records per block.
  o The number of blocks needed for the file is
    b = ⌈r/bfr⌉ = ⌈30,000/10⌉ = 3000 blocks.
  o A binary search on the data file would need approximately
    ⌈log2 b⌉ = ⌈log2 3000⌉ = 12 block accesses.
  o Now suppose that
    the ordering key field of the file is V = 9 bytes long,
    a block pointer is P = 6 bytes long, and
    we have constructed a primary index for the file.
    The size of each index entry is R_i = (9 + 6) = 15 bytes,
    so the blocking factor for the index is
    bfr_i = ⌊B/R_i⌋ = ⌊1024/15⌋ = 68 entries per block.
    The total number of index entries r_i is equal to the number of blocks in the data file, which is 3000.
    The number of index blocks is hence
    b_i = ⌈r_i/bfr_i⌉ = ⌈3000/68⌉ = 45 blocks.
    To perform a binary search on the index file would need
    ⌈log2 b_i⌉ = ⌈log2 45⌉ = 6 block accesses.
    To search for a record using the index, we need one additional block access to the data file for a total of 6 + 1 = 7 block accesses,
    an improvement over binary search on the data file, which required 12 block accesses.
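The chain of calculations in this example can be written as one small Python function. The function name and the dictionary keys are illustrative choices, not notation from the notes; records and index entries are assumed to be unspanned.

```python
import math

def accesses_with_primary_index(r, R, B, V, P):
    """Block accesses to fetch one record through a single-level primary index."""
    bfr = B // R                      # data records per block
    b = math.ceil(r / bfr)            # data file blocks
    bfr_i = B // (V + P)              # index entries per block
    r_i = b                           # one index entry per data block
    b_i = math.ceil(r_i / bfr_i)      # index blocks
    index_search = math.ceil(math.log2(b_i))
    return {
        "b": b,
        "bfr_i": bfr_i,
        "b_i": b_i,
        "without_index": math.ceil(math.log2(b)),  # binary search on the data file
        "with_index": index_search + 1,            # +1 for the data block itself
    }

cost = accesses_with_primary_index(r=30_000, R=100, B=1024, V=9, P=6)
print(cost["without_index"], cost["with_index"])   # 12 7
```

Changing the block size or the key length in the call shows how strongly the result depends on how many index entries fit in one block.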
- A major problem with a primary index (as with any ordered file) is insertion and deletion of records.
  o If we attempt to insert a record in its correct position in the data file,
    we have to not only move records to make space for the new record but also change some index entries,
    since moving records will change the anchor records of some blocks.
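A small Python sketch makes the effect on the anchors visible. It assumes the simplest possible layout, in which blocks are always fully packed and there is no overflow area or free space; the key values are invented for illustration.

```python
from bisect import insort

def blocks_of(sorted_keys, bfr):
    """Pack sorted keys into consecutive blocks of bfr records each."""
    return [sorted_keys[i:i + bfr] for i in range(0, len(sorted_keys), bfr)]

def anchors(blocks):
    """Block anchors: the first key of every block."""
    return [blk[0] for blk in blocks]

keys = [10, 20, 30, 40, 50, 60]            # hypothetical ordering key values
before = anchors(blocks_of(keys, bfr=2))   # [10, 30, 50]

insort(keys, 15)                           # insert a record in key order
after = anchors(blocks_of(keys, bfr=2))    # [10, 20, 40, 60]

print(before, after)
```

One insertion near the front of the file shifts every later record, so all anchors after the first change and the corresponding index entries must be rewritten.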
2.2 Clustering Index
- Defined on an ordered data file
- The data file is ordered on a non-key field
- A clustering index is also an ordered file with two fields;
  o the first field is of the same type as the clustering field of the data file, and
  o the second field is a block pointer.
- There is one entry in the clustering index for each distinct value of the clustering field, containing
  o the value and
  o a pointer to the first block in the data file that has a record with that value for its clustering field.
- Record insertion and deletion still cause problems, because the data records are physically ordered.
  o To alleviate the problem of insertion, it is common to reserve a whole block (or a cluster of contiguous blocks) for each value of the clustering field;
  o all records with that value are placed in the block (or block cluster).
    This makes insertion and deletion relatively straightforward.
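As a rough illustration of how the entries of a clustering index are derived, the Python sketch below scans a file ordered on a nonkey field and records the first block in which each distinct value appears. The file contents are hypothetical (for example, department numbers), block numbers stand in for block pointers, and no blocks are reserved per value.

```python
def build_clustering_index(blocks):
    """Return [(value, block_no)] with one entry per distinct clustering value."""
    index = []
    seen = set()
    for block_no, block in enumerate(blocks):
        for value in block:
            if value not in seen:            # first block containing this value
                seen.add(value)
                index.append((value, block_no))
    return index

# Hypothetical file ordered on a nonkey field, 3 records per block.
blocks = [[1, 1, 1], [2, 2, 3], [3, 3, 3], [4, 5, 5]]

index = build_clustering_index(blocks)
print(index)   # [(1, 0), (2, 1), (3, 1), (4, 3), (5, 3)]
```

Values 2 and 3 both point to block 1, because a new value does not necessarily start a new block in this layout. Reserving a whole block per value, as described above, removes that sharing.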
2.3 Secondary Index
- Defined on an unordered data file
- Can be defined on
  o a key field (with a unique value) or
  o a non-key field with duplicate values
- A secondary index is also an ordered file with two fields.
  o The first field is of the same data type as some nonordering field of the data file that is an indexing field.
  o The second field is either a block pointer or a record pointer.
- There can be many secondary indexes (and hence, indexing fields) for the same file.
- We first consider a secondary index access structure on a key field that has a distinct value for every record.
  o Such a field is sometimes called a secondary key.
  o In this case there is one index entry for each record in the data file.
    The index entry contains
    the value of the secondary key for the record and
    a pointer either to the block in which the record is stored or to the record itself.
- Indexes can also be characterized as dense or sparse.
  o A dense index has an index entry for every search key value (and hence every record) in the data file.
  o A sparse (or nondense) index has index entries for only some of the search values.
- Therefore, a secondary index on a key field is dense.
- We refer to the two field values of index entry i as <K(i), P(i)>.
  o The entries are ordered by value of K(i), so we can perform a binary search.
  o Because the records of the data file are not physically ordered by values of the secondary key field,
    we cannot use block anchors.
    That is why an index entry is created for each record in the data file, rather than for each block, as in the case of a primary index.
- The following figure illustrates a secondary index in which the pointers P(i) in the index entries are block pointers, not record pointers.
  o Once the appropriate block is transferred to main memory, a search for the desired record within the block can be carried out.
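A dense secondary index with block pointers can be sketched in Python as follows. The data file, its key values and its payloads are invented for illustration, and block numbers stand in for block addresses.

```python
from bisect import bisect_left

# Hypothetical unordered data file: blocks of (secondary_key, payload) records.
data_blocks = [
    [(9, "i"), (5, "e"), (13, "m")],
    [(8, "h"), (6, "f"), (15, "o")],
    [(3, "c"), (17, "q"), (1, "a")],
]

# Dense index: one <K(i), P(i)> entry per record, sorted on K(i).
# P(i) is a block pointer (here a block number), not a record pointer.
index = sorted((key, block_no)
               for block_no, block in enumerate(data_blocks)
               for key, _ in block)
index_keys = [k for k, _ in index]

def lookup(key):
    """Return the payload of the record with this secondary key, or None."""
    i = bisect_left(index_keys, key)           # binary search on the index
    if i == len(index_keys) or index_keys[i] != key:
        return None                            # dense index: no entry, no record
    block = data_blocks[index[i][1]]           # fetch the block P(i) points to
    for k, payload in block:                   # scan the block in main memory
        if k == key:
            return payload
    return None

print(lookup(6))   # f
print(lookup(7))   # None
```

Because the index is dense, a missing index entry proves the record does not exist, so an unsuccessful search never touches the data file.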
- A secondary index usually needs more storage space and longer search time than does a primary index,
  o because of its larger number of entries.
  o However, the improvement in search time for an arbitrary record is much greater for a secondary index than for a primary index,
    since we would have to do a linear search on the data file if the secondary index did not exist.
  o For a primary index, we could still use a binary search on the main file, even if the index did not exist.
- Example: the improvement in number of blocks accessed.
  o Consider the file of Example 1,
    with r = 30,000 fixed-length records of size R = 100 bytes stored on a disk
    with block size B = 1024 bytes.
    The file has b = 3000 blocks, as calculated in Example 1.
    To do a linear search on the file, we would require b/2 = 3000/2 = 1500 block accesses on the average.
  o Suppose that we construct a secondary index on a nonordering key field of the file that is V = 9 bytes long.
    As in Example 1, a block pointer is P = 6 bytes long,
    so each index entry is R_i = (9 + 6) = 15 bytes, and
    the blocking factor for the index is bfr_i = ⌊B/R_i⌋ = ⌊1024/15⌋ = 68 entries per block.
  o In a dense secondary index such as this,
    the total number of index entries r_i is equal to the number of records in the data file, which is 30,000.
  o The number of blocks needed for the index is hence
    b_i = ⌈r_i/bfr_i⌉ = ⌈30,000/68⌉ = 442 blocks.
  o A binary search on this secondary index needs
    ⌈log2 b_i⌉ = ⌈log2 442⌉ = 9 block accesses.
  o To search for a record using the index,
    we need an additional block access to the data file for a total of 9 + 1 = 10 block accesses,
    a vast improvement over the 1500 block accesses needed on the average for a linear search,
    but slightly worse than the seven block accesses required for the primary index.
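The comparison between linear search, the dense secondary index and the nondense primary index can be reproduced with the Python sketch below. The function name index_lookup_accesses is an illustrative choice; the only thing that differs between the two index calls is the number of index entries.

```python
import math

def index_lookup_accesses(num_index_entries, B, V, P):
    """Binary search over the index blocks plus one data block access."""
    bfr_i = B // (V + P)                          # index entries per block
    b_i = math.ceil(num_index_entries / bfr_i)    # index blocks
    return math.ceil(math.log2(b_i)) + 1

r, R, B, V, P = 30_000, 100, 1024, 9, 6
b = math.ceil(r / (B // R))                       # 3000 data blocks

linear = b // 2                                   # average linear search, no index
secondary = index_lookup_accesses(r, B, V, P)     # dense: one entry per record
primary = index_lookup_accesses(b, B, V, P)       # nondense: one entry per block

print(linear, secondary, primary)   # 1500 10 7
```

Ten times as many index entries cost only three extra block accesses, because the binary search cost grows with the logarithm of the number of index blocks.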
- Creating a secondary index on a nonkey field of a file.
- In this case, numerous records in the data file can have the same value for the indexing field.
  o There are several options for implementing such an index:
  o Option 1 is to include several index entries with the same K(i) value, one for each record. This would be a dense index.
  o Option 2 is to have variable-length records for the index entries, with a repeating field for the pointer.
    We keep a list of pointers <P(i,1), ..., P(i,k)> in the index entry for K(i),
    one pointer to each block that contains a record whose indexing field value equals K(i).
    In either option 1 or option 2, the binary search algorithm on the index must be modified appropriately.
  o Option 3, which is more commonly used, is
    to keep the index entries themselves at a fixed length and have a single entry for each index field value, but
    to create an extra level of indirection to handle the multiple pointers.
    In this nondense scheme,
    the pointer P(i) in index entry <K(i), P(i)> points to a block of record pointers;
    each record pointer in that block points to one of the data file records with value K(i) for the indexing field.
    If some value K(i) occurs in too many records, so that their record pointers cannot fit in a single disk block, a cluster or linked list of blocks is used.
    This technique is illustrated in the following figure.
    Retrieval via the index requires one or more additional block accesses because of the extra level,
    but the algorithms for searching the index and (more importantly) for inserting new records in the data file are straightforward.
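Option 3 can be sketched in Python under some simplifying assumptions: a Python list plays the role of each block of record pointers (so overflow into a cluster or linked list of blocks is not modelled), record ids stand in for record pointers, and the records themselves are invented.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical unordered file: record id -> value of the nonkey indexing field.
records = {0: 3, 1: 5, 2: 1, 3: 3, 4: 5, 5: 3}

# Extra level of indirection: one block of record pointers per distinct value.
pointer_blocks = defaultdict(list)
for rid, value in records.items():
    pointer_blocks[value].append(rid)

# Fixed-length index entries <K(i), P(i)>, one per distinct value, sorted on K(i).
index_keys = sorted(pointer_blocks)                     # K(i)
index_ptrs = [pointer_blocks[k] for k in index_keys]    # P(i) -> pointer block

def lookup(value):
    """Return the record pointers of all records whose field equals value."""
    i = bisect_left(index_keys, value)       # ordinary binary search on K(i)
    if i == len(index_keys) or index_keys[i] != value:
        return []
    return list(index_ptrs[i])               # one extra access: the pointer block

print(index_keys)   # [1, 3, 5]
print(lookup(3))    # [0, 3, 5]
print(lookup(2))    # []
```

The index itself has one entry per distinct value, so the unmodified binary search works; the duplicates are absorbed entirely by the pointer blocks.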
Types of Indexes
                Ordering field       Nonordering field
Key field       Primary index        Secondary index (key)
Nonkey field    Clustering index     Secondary index (nonkey)
Properties of Index Types
Type of Index        Number of (First-level) Index Entries                                          Dense or Nondense    Block Anchoring on the Data File
Primary              Number of blocks in data file                                                  Nondense             Yes
Clustering           Number of distinct index field values                                          Nondense             Yes/no (Note a)
Secondary (key)      Number of records in data file                                                 Dense                No
Secondary (nonkey)   Number of records (Note b) or number of distinct index field values (Note c)   Dense or Nondense    No