Hash Files

What Exactly Are Hash Files?

DataStage® Server Edition utilizes several different types of hash files. The
default and most versatile type is the dynamic (more precisely, type 30) hash
file. A hash file can simply be described as a file that distributes data
throughout a pre-sized, space-allocated file. Every row must have a key that
determines where the row will reside within the file. This key is run through
an algorithm (technically a hashing algorithm, hence the name) to determine
the location in the file for the row. One and only one row can exist at a single
location.
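
As a rough mental model, the address computation looks like the toy Python
sketch below. The hash function, group count, and names here are illustrative
assumptions, not DataStage®'s actual type 30 algorithm.

Code:

# Illustrative only: a toy model of key-to-location hashing.
def row_address(key: str, groups: int) -> int:
    """Map a row's key to one of a fixed number of pre-allocated groups."""
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF  # simple polynomial hash
    return h % groups  # the single location where this key's row must live

print(row_address("CUST-00042", groups=64))  # same key, same location, always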

A hash file is usually composed of two files: the data file and the overflow file.
Since the file is pre-sized and the space allocated, rows that cannot fit within
the data file spill over to the overflow file. The hash file is at its optimal
sizing when all of the rows are within the data file and none are in the
overflow. This is because during a row's retrieval, the row's address is first
probed in the data file using the highly efficient hashing algorithm; if the
row is not there, the overflow file is sequentially scanned. The more rows in
the overflow, the longer it takes to determine whether a row exists in a hash
file.

Code:

$ ls -al /var/opt/work/hash/tmpExample
total 2032
drwxr-xr-x   2 radnet  radnet      96 May 14 01:08 .
drwxr-x---  65 radnet  radnet    8192 May 14 01:53 ..
-rw-r--r--   1 radnet  radnet       0 May 14 01:08 .Type30
-rw-r--r--   1 radnet  radnet  774144 May 14 01:52 DATA.30
-rw-r--r--   1 radnet  radnet  249856 May 14 01:08 OVER.30
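
The retrieval order described above (direct probe into the data file, then a
sequential scan of the overflow) can be modeled in a few lines of Python. The
data structures are assumptions for illustration, not the engine's real
on-disk format; row_address() is reused from the earlier sketch.

Code:

# Toy model of retrieval: hashed probe first, then a linear overflow scan.
data_file = [None] * 64   # the pre-sized "DATA.30" analogue
overflow = []             # the spill area, an "OVER.30" analogue

def lookup(key: str):
    slot = row_address(key, groups=len(data_file))       # cheap, direct probe
    if data_file[slot] is not None and data_file[slot][0] == key:
        return data_file[slot][1]
    for k, row in overflow:   # costly: scan time grows with every spilled row
        if k == key:
            return row
    return None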

The beauty of the dynamic hash file, as compared to the static hash file, is
that it grows on demand. At a predetermined threshold, the file automatically
doubles its size and redistributes the data within it. This pulls the data out
of the overflow file and brings it back into the data file. The dynamic hash
file defaults to an initial sizing value of 1, which means the file will
constantly resize by doubling itself to meet the incoming data set. Clearing
the hash file shrinks it back to its initial sizing.
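
The doubling-and-redistribution behavior can be modeled as below. The 80%
load threshold and the redistribution policy are assumed values for
illustration, not the engine's actual split algorithm; row_address() is again
the toy hash from the first sketch.

Code:

# Toy model of dynamic resizing: past a load threshold, double the group
# count and re-hash every row, pulling spilled rows back out of overflow.
def maybe_resize(data_file, overflow, threshold=0.8):
    used = sum(1 for s in data_file if s is not None) + len(overflow)
    if used / len(data_file) < threshold:
        return data_file, overflow                   # still within sizing
    rows = [s for s in data_file if s is not None] + overflow
    new_file, new_overflow = [None] * (len(data_file) * 2), []
    for key, row in rows:                            # redistribute everything
        slot = row_address(key, groups=len(new_file))
        if new_file[slot] is None:
            new_file[slot] = (key, row)
        else:
            new_overflow.append((key, row))          # a collision still spills
    return new_file, new_overflow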

To find a row in a hash file, you must know the key that was used to deposit
the row somewhere within the file. The data is stored as a continuous string
of characters, with a column separator (the @AM system variable). Each column
is positional in nature and relative to the columns preceding it. It should
become obvious that a hash file is much like a sequential file, in that
columns are positional and referenced by position, not name. In fact, column
names are irrelevant; you only need to know the key construct and which data
is located in which positional column. Because data is stored as
a text string, there is no explicit data typing enforced within a hash file. The
only rule in a hash file is to not use restricted ASCII characters within your
data, as they are internal data separators. Furthermore, each row within a
hash file is allowed to have its own structure, because, again, data is stored
as a text string. Hash files are, in effect, non-structured, free-form scratch
files!
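
A toy encoding makes the positional, text-only storage concrete. The sketch
assumes UniVerse-style marks, where @AM is commonly CHAR(254); everything
else is illustrative.

Code:

# Toy model of positional, text-only row storage: columns are joined into
# one string with the attribute mark (@AM) as the separator.
AM = chr(254)  # the @AM column separator

def encode_row(columns):
    return AM.join(str(c) for c in columns)   # no data typing survives this

def column(row: str, position: int) -> str:
    return row.split(AM)[position - 1]        # columns found only by position

row = encode_row(["ACME Corp", "2010-05-14", 42])
print(column(row, 3))   # "42" comes back as text, not as a number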

How Are They Meant To Be Used by DataStage®?

Hash files should be limited to three main uses:


1. Small reference lookups by primary key
2. Small staging files, usually when a reference capability has to exist
3. Non-persistent scratchpad work files

Small reference lookups by primary key - DataStage® has the ability to cache
lookups in memory, as long as they are hash files. Lookups are not joins;
they are specific searches in a hash file for a row using the key value for
that row. Since you supply the key value, the algorithm quickly converts the
key to the location of the row within the hash file and returns the contents
at that address. The same activity through ODBC/OCI stages is unacceptable
for performance reasons, because those stages lack memory-caching capability.
In addition, as more instantiated jobs perform the same lookup, the database
becomes saturated with lookup queries.
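
The benefit of memory caching can be sketched as preloading the whole
reference file into a dictionary once, then resolving every incoming row
against memory instead of issuing a query per row. This is a conceptual model
with made-up keys and rows, not DataStage®'s cache implementation.

Code:

# Toy model of a preloaded, in-memory reference lookup; the sample keys
# and rows are hypothetical.
def preload(reference_rows):
    return {key: row for key, row in reference_rows}  # one pass, then O(1) hits

cache = preload([("CUST-00042", "ACME Corp"), ("CUST-00043", "Globex")])

for key in ("CUST-00042", "CUST-00099"):      # stand-in for the input stream
    print(key, "->", cache.get(key))          # no per-row database query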

Small staging files, usually when a reference capability has to exist - A
transformation job may prepare data that will be needed for reference in
another job. This is accomplished quite easily by staging the data into a hash
file. Making this data available for reference without a hash file would
require loading the reference into either a target table or some sort of ETL
scratch table.

Non-persistent scratchpad work files - Hash files are wonderful mechanisms for
staging data when it is needed for a reference lookup. The best choice for data
that is not needed for a reference lookup is a sequential file, because of the
appending nature of writing to a sequential file. Hash files have the overhead
of figuring out where the row should go, checking the space available within
the hash file, and then writing it to either the data section or the overflow.
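
The write-path overhead can be seen in the same toy model: every hash file
write has to compute an address and check for space, whereas a sequential
file simply appends. An illustrative sketch, reusing row_address() from the
first example:

Code:

# Toy contrast of write paths: addressed hash write versus blind append.
def hash_put(data_file, overflow, key, row):
    slot = row_address(key, groups=len(data_file))   # figure out where it goes
    if data_file[slot] is None:
        data_file[slot] = (key, row)                 # fits in the data section
    else:
        overflow.append((key, row))                  # spills to the overflow

def sequential_put(lines, row):
    lines.append(row)                                # append-only, no addressing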

How Can A Hash File Be Abused in a DataStage® Design?

1. Using a SQL interface to a hash file (UV/ODBC stage, UV/SQL commands). A hash file
has no secondary indexes to support relational queries. While the underlying engine
has the technology to support this type of activity, you do not want to go down the
path of building a database within your ETL tool! That defeats the entire purpose of
exposing data transformation through a simple graphical metaphor. The only exception
would be using the UV/ODBC stage to perform aggregations or sorts on the data, and
even then you should question the method.
2. Cramming obscene volumes into a hash file. If you are creating distributed hash files
or 64-bit hash files, you should question what you are trying to accomplish with the
hash file. These file options are meant for very large hash files (those exceeding 2.2
gigabytes of data). The knowledge required to create and maintain them is not everyday
ETL and is therefore highly specialized. There are alternatives, such as partitioning
your data or using smaller sets of source data, that keep you within the optimal usage
of a hash file.
3. Storing persistent, production data inside your ETL tool. Hash files should be used
as throwaway, disposable scratchpad work files. Anything else requires archival and
recovery procedures and processes to return your environment to its pre-corruption
state.
4. Preloading to memory when each row will only be looked up once. If every row will be
referenced only once, preloading the file to memory is a waste of time and memory: the
preload pass reads each row and puts it into memory, and during processing the row is
read again from memory. Without preloading, the row would still be read only once from
the file, with no memory overhead, and accessing the hash file moves a portion of it
into the operating system's disk cache, which can benefit direct reads anyway. (The
read-count arithmetic behind this point and the next is sketched after this list.)
5. Preloading to memory when the majority of rows will never be referenced. If the
number of input rows does not exceed the number of rows in the reference lookup,
preloading to memory is a waste of time and memory; this is especially true when the
reference lookup vastly exceeds the number of input rows. The maximum number of reads
against the reference lookup is the number of input rows, so preloading the remaining
reference rows into memory simply wastes time and memory.
6. Many parallel jobs writing to the same hash file. Careful consideration must be given
to how DataStage® handles parallel writes to a hash file, especially with write-delay
caching. A job using write-delay caching buffers its writes and then, at the flush
point, grabs the hash file and purges its data to disk. Instantiated jobs writing to
the same hash file will contend at the flush points. There is a point of diminishing
returns at which N+1 instantiations have a greater overall runtime than N
instantiations; this point is unique to each server configuration, and you have to
find it for yourself. In some cases 4 instances may have a better overall runtime
than 8 because of disk congestion and contention.
7. Using locking mechanisms for rows in the hash file. Locking features were added to
help use DataStage® in a more transactional mode, but they should NEVER be used for
normal ETL; the overhead degrades performance tremendously. DataStage® uses dirty
reads and writes to hash files. It is time-tested and true to run processes serially
and isolate the hash file that way, rather than run several processes in parallel and
rely on locking to keep them from clobbering one another's data.
8. Using inconsistent metadata across jobs. In DataStage® Designer, make sure that all
of your jobs have the exact same metadata for the hash file wherever it is used. The
system of record for any metadata should be the job that creates the file, be it
hashed or sequential. Using the correct metadata makes reference lookups easy, as you
must always specify the key in the hash file metadata. If you are twiddling the
primary key column metadata in Designer to "join" to a hash file in a Transformer
stage, then you did not understand the preceding explanation of the row address
hashing algorithm.
9. Changing the metadata on a hash file stage and expecting a different set of columns
to be used to find a row. DataStage® can be deceiving here; this is a relational
perspective, not a hash file perspective. DataStage® reads the entire contents of the
row into memory, so you should have complete metadata that accurately describes the
contents of the file. Do not alter definitions from one job to another expecting the
results to match your changed metadata.
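
To make the read-count arithmetic behind points 4 and 5 concrete, here is a
back-of-the-envelope model. One "file read" per row touched is a deliberate
simplification, not measured DataStage® behavior.

Code:

# Simplified read counts for a reference lookup, with and without preload.
def file_reads_with_preload(reference_rows: int, input_rows: int) -> int:
    return reference_rows         # preload touches every reference row once

def file_reads_without_preload(reference_rows: int, input_rows: int) -> int:
    return input_rows             # at most one direct read per input row

# 1,000 input rows against a 1,000,000-row reference:
print(file_reads_with_preload(1_000_000, 1_000))     # 1,000,000 reads, plus memory
print(file_reads_without_preload(1_000_000, 1_000))  # 1,000 reads, no cache cost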

Tricks, Features, and Unintended (but Cool) Results

There are a few tricks to making your hash files more efficient. The best
option is to stay with dynamic hash files and use initial sizing as your
primary tuning mechanism.