Hash Files

What Exactly Are Hash Files?

DataStage® Server Edition utilizes several different types of hash files. The
default and most versatile type is the dynamic (more precisely, type 30) hash
file. A hash file can simply be described as a file that distributes data
throughout a pre-sized, space-allocated file. Every row must have a key that
determines where the row will reside within the file. This key is run through
an algorithm (technically a hashing algorithm, hence the name) to determine
the location in the file for the row. One and only one row can exist at a single
location.
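
As a rough mental model, the address computation looks like the toy Python
sketch below. The hash function, group count, and names here are illustrative
assumptions, not DataStage®'s actual type 30 algorithm.

Code:

# Illustrative only: a toy model of key-to-location hashing.
def row_address(key: str, groups: int) -> int:
    """Map a row's key to one of a fixed number of pre-allocated groups."""
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF  # simple polynomial hash
    return h % groups  # the single location where this key's row must live

print(row_address("CUST-00042", groups=64))  # same key, same location, always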

A hash file is usually composed of two files: the data file and the overflow file.
Since the file is pre-sized and the space allocated, rows that cannot fit within
the data file spill over to the overflow file. The hash file is at its optimal
sizing when all of the rows are within the data file and none are in the
overflow. This is because during a row's retrieval, the row's address is first
probed in the data file using the highly efficient hashing algorithm; if the
row is not there, the overflow file is sequentially scanned. The more rows in
the overflow, the longer it takes to determine whether a row exists in a hash
file.

Code:

$ ls -al /var/opt/work/hash/tmpExample
total 2032
drwxr-xr-x   2 radnet  radnet      96 May 14 01:08 .
drwxr-x---  65 radnet  radnet    8192 May 14 01:53 ..
-rw-r--r--   1 radnet  radnet       0 May 14 01:08 .Type30
-rw-r--r--   1 radnet  radnet  774144 May 14 01:52 DATA.30
-rw-r--r--   1 radnet  radnet  249856 May 14 01:08 OVER.30
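
The retrieval order described above (direct probe into the data file, then a
sequential scan of the overflow) can be modeled in a few lines of Python. The
data structures are assumptions for illustration, not the engine's real
on-disk format; row_address() is reused from the earlier sketch.

Code:

# Toy model of retrieval: hashed probe first, then a linear overflow scan.
data_file = [None] * 64   # the pre-sized "DATA.30" analogue
overflow = []             # the spill area, an "OVER.30" analogue

def lookup(key: str):
    slot = row_address(key, groups=len(data_file))       # cheap, direct probe
    if data_file[slot] is not None and data_file[slot][0] == key:
        return data_file[slot][1]
    for k, row in overflow:   # costly: scan time grows with every spilled row
        if k == key:
            return row
    return None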

The beauty of the dynamic hash file, as compared to the static hash file, is
that it grows on demand. At a predetermined threshold, the file automatically
doubles its size and redistributes the data within it. This pulls the data out
of the overflow file and brings it back into the data file. The dynamic hash
file defaults to an initial sizing value of 1, which means the file will
constantly resize by doubling itself to meet the incoming data set. Clearing
the hash file shrinks it back to its initial sizing.
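
The doubling-and-redistribution behavior can be modeled as below. The 80%
load threshold and the redistribution policy are assumed values for
illustration, not the engine's actual split algorithm; row_address() is again
the toy hash from the first sketch.

Code:

# Toy model of dynamic resizing: past a load threshold, double the group
# count and re-hash every row, pulling spilled rows back out of overflow.
def maybe_resize(data_file, overflow, threshold=0.8):
    used = sum(1 for s in data_file if s is not None) + len(overflow)
    if used / len(data_file) < threshold:
        return data_file, overflow                   # still within sizing
    rows = [s for s in data_file if s is not None] + overflow
    new_file, new_overflow = [None] * (len(data_file) * 2), []
    for key, row in rows:                            # redistribute everything
        slot = row_address(key, groups=len(new_file))
        if new_file[slot] is None:
            new_file[slot] = (key, row)
        else:
            new_overflow.append((key, row))          # a collision still spills
    return new_file, new_overflow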

To find a row in a hash file, you must know the key that was used to deposit
the row somewhere within the file. The data is stored as a continuous string
of characters, with a column separator (the @AM system variable). Each column
is positional in nature and relative to the columns preceding it. It should
become obvious that a hash file is much like a sequential file, in that
columns are positional and referenced by position, not name. In fact, column
names are irrelevant; you only need to know the key construct and which data
is located in which positional column. Because data is stored as
a text string, there is no explicit data typing enforced within a hash file. The
only rule in a hash file is to not use restricted ASCII characters within your
data, as they are internal data separators. Furthermore, each row within a
hash file is allowed to have its own structure, because, again, data is stored
as a text string. Hash files are, in effect, non-structured, free-form scratch
files!
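
A toy encoding makes the positional, text-only storage concrete. The sketch
assumes UniVerse-style marks, where @AM is commonly CHAR(254); everything
else is illustrative.

Code:

# Toy model of positional, text-only row storage: columns are joined into
# one string with the attribute mark (@AM) as the separator.
AM = chr(254)  # the @AM column separator

def encode_row(columns):
    return AM.join(str(c) for c in columns)   # no data typing survives this

def column(row: str, position: int) -> str:
    return row.split(AM)[position - 1]        # columns found only by position

row = encode_row(["ACME Corp", "2010-05-14", 42])
print(column(row, 3))   # "42" comes back as text, not as a number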

How Are They Meant To Be Used by DataStage®?

Hash files should be limited to three main uses:


1. Small reference lookups by primary key
2. Small staging files, usually when a reference capability has to exist
3. Non-persistent scratchpad work files

Small reference lookups by primary key - DataStage® has the ability to cache
lookups in memory, as long as they are hash files. Lookups are not joins;
they are specific searches in a hash file for a row using the key value for
that row. Since you supply the key value, the algorithm quickly converts the
key to the location of the row within the hash file and returns the contents
at that address. The same activity through ODBC/OCI stages is unacceptable
for performance reasons, because those stages lack memory-caching capability.
In addition, as more instantiated jobs perform the same lookup, the database
becomes saturated with lookup queries.
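
The benefit of memory caching can be sketched as preloading the whole
reference file into a dictionary once, then resolving every incoming row
against memory instead of issuing a query per row. This is a conceptual model
with made-up keys and rows, not DataStage®'s cache implementation.

Code:

# Toy model of a preloaded, in-memory reference lookup; the sample keys
# and rows are hypothetical.
def preload(reference_rows):
    return {key: row for key, row in reference_rows}  # one pass, then O(1) hits

cache = preload([("CUST-00042", "ACME Corp"), ("CUST-00043", "Globex")])

for key in ("CUST-00042", "CUST-00099"):      # stand-in for the input stream
    print(key, "->", cache.get(key))          # no per-row database query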

Small staging files, usually when a reference capability has to exist - A
transformation job may prepare data that will be needed for reference in
another job. This is accomplished quite easily by staging the data into a hash
file. Making this data available for reference without a hash file would
require loading the reference into either a target table or some sort of ETL
scratch table.

Non-persistent scratchpad work files - Hash files are wonderful mechanisms for
staging data when it is needed for a reference lookup. The best choice for data
that is not needed for a reference lookup is a sequential file, because of the
appending nature of writing to a sequential file. Hash files have the overhead
of figuring out where the row should go, checking the space available within
the hash file, and then writing it to either the data section or the overflow.
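
The write-path overhead can be seen in the same toy model: every hash file
write has to compute an address and check for space, whereas a sequential
file simply appends. An illustrative sketch, reusing row_address() from the
first example:

Code:

# Toy contrast of write paths: addressed hash write versus blind append.
def hash_put(data_file, overflow, key, row):
    slot = row_address(key, groups=len(data_file))   # figure out where it goes
    if data_file[slot] is None:
        data_file[slot] = (key, row)                 # fits in the data section
    else:
        overflow.append((key, row))                  # spills to the overflow

def sequential_put(lines, row):
    lines.append(row)                                # append-only, no addressing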

How Can A Hash File Be Abused in a DataStage® Design?

1. Using a SQL interface to a hash file (UV/ODBC stage, UV/SQL commands). A hash file
has no secondary indexes to support relational queries. While the underlying engine
has the technology to support this type of activity, you do not want to go down the
path of building a database within your ETL tool! That defeats the entire purpose of
exposing data transformation through a simple graphical metaphor. The only exception
would be using the UV/ODBC stage to perform aggregations or sorts on the data, and
even then you should question the method.
2. Cramming obscene volumes into a hash file. If you are creating distributed hash files
or 64-bit hash files, you should question what you are trying to accomplish with the
hash file. These file options are meant for very large hash files (those exceeding 2.2
gigabytes of data). The knowledge required to create and maintain them is not everyday
ETL and is therefore highly specialized. There are alternatives, such as partitioning
your data or using smaller sets of source data, that keep you within the optimal usage
of a hash file.
3. Storing persistent, production data inside your ETL tool. Hash files should be used
as throwaway, disposable scratchpad work files. Anything else requires archival and
recovery procedures and processes to return your environment to its pre-corruption
state.
4. Preloading to memory when each row will only be looked up once. If every row will be
referenced only once, preloading the file to memory is a waste of time and memory: the
preload pass reads each row and puts it into memory, and during processing the row is
read again from memory. Without preloading, the row would still be read only once from
the file, with no memory overhead, and accessing the hash file moves a portion of it
into the operating system's disk cache, which can benefit direct reads anyway. (The
read-count arithmetic behind this point and the next is sketched after this list.)
5. Preloading to memory when the majority of rows will never be referenced. If the
number of input rows does not exceed the number of rows in the reference lookup,
preloading to memory is a waste of time and memory; this is especially true when the
reference lookup vastly exceeds the number of input rows. The maximum number of reads
against the reference lookup is the number of input rows, so preloading the remaining
reference rows into memory simply wastes time and memory.
6. Many parallel jobs writing to the same hash file. Careful consideration must be given
to how DataStage® handles parallel writes to a hash file, especially with write-delay
caching. A job using write-delay caching buffers its writes and then, at the flush
point, grabs the hash file and purges its data to disk. Instantiated jobs writing to
the same hash file will contend at the flush points. There is a point of diminishing
returns at which N+1 instantiations have a greater overall runtime than N
instantiations; this point is unique to each server configuration, and you have to
find it for yourself. In some cases 4 instances may have a better overall runtime
than 8 because of disk congestion and contention.
7. Using locking mechanisms for rows in the hash file. Locking features were added to
help use DataStage® in a more transactional mode, but they should NEVER be used for
normal ETL; the overhead degrades performance tremendously. DataStage® uses dirty
reads and writes to hash files. It is time-tested and true to run processes serially
and isolate the hash file that way, rather than run several processes in parallel and
rely on locking to keep them from clobbering one another's data.
8. Using inconsistent metadata across jobs. In DataStage® Designer, make sure that all
of your jobs have the exact same metadata for the hash file wherever it is used. The
system of record for any metadata should be the job that creates the file, be it
hashed or sequential. Using the correct metadata makes reference lookups easy, as you
must always specify the key in the hash file metadata. If you are twiddling the
primary key column metadata in Designer to "join" to a hash file in a Transformer
stage, then you did not understand the preceding explanation of the row address
hashing algorithm.
9. Changing the metadata on a hash file stage and expecting a different set of columns
to be used to find a row. DataStage® can be deceiving here; this is a relational
perspective, not a hash file perspective. DataStage® reads the entire contents of the
row into memory, so you should have complete metadata that accurately describes the
contents of the file. Do not alter definitions from one job to another expecting the
results to match your changed metadata.
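
To make the read-count arithmetic behind points 4 and 5 concrete, here is a
back-of-the-envelope model. One "file read" per row touched is a deliberate
simplification, not measured DataStage® behavior.

Code:

# Simplified read counts for a reference lookup, with and without preload.
def file_reads_with_preload(reference_rows: int, input_rows: int) -> int:
    return reference_rows         # preload touches every reference row once

def file_reads_without_preload(reference_rows: int, input_rows: int) -> int:
    return input_rows             # at most one direct read per input row

# 1,000 input rows against a 1,000,000-row reference:
print(file_reads_with_preload(1_000_000, 1_000))     # 1,000,000 reads, plus memory
print(file_reads_without_preload(1_000_000, 1_000))  # 1,000 reads, no cache cost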

Tricks, Features, and Unintended (but Cool) Results

There are a few tricks to making your hash files more efficient. The best
option is to stay with dynamic hash files and use initial sizing as your
primary tuning mechanism.