Parallel Job Advanced Developer Guide

Version 8 Release 1 Parallel Job Advanced Developer Guide is now available. This guide is based on the WebSphere DataStage Designer interface. Job design tips include DB2 database tips and Teradata database tips.

Uploaded by

Andres Mejia
Copyright
© Attribution Non-Commercial (BY-NC)

IBM WebSphere DataStage and QualityStage

Version 8 Release 1

Parallel Job Advanced Developer Guide

LC18-9892-02

Note: Before using this information and the product that it supports, read the information in "Notices" on page 817.

© Ascential Software Corporation 2001, 2005. © Copyright International Business Machines Corporation 2006, 2008. All rights reserved. US Government Users Restricted Rights: Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Contents
Chapter 1. Terminology

Chapter 2. Job design tips
WebSphere DataStage Designer interface
Processing large volumes of data
Modular development
Designing for good performance
Combining data
Sorting data
Default and explicit type conversions
Using Transformer stages
Using Sequential File stages
Using Database stages
Database sparse lookup vs. join
DB2 database tips
Oracle database tips
Teradata Database Tips

Chapter 3. Improving performance
Understanding a flow
Score dumps
Example score dump
Tips for debugging
Performance monitoring
Job monitor
Iostat
Load average
Runtime information
Performance data
OS/RDBMS specific tools
Performance analysis
Selectively rewriting the flow
Identifying superfluous repartitions
Identifying buffering issues
Resource estimation
Creating a model
Making a projection
Generating a resource estimation report
Examples of resource estimation
Resolving bottlenecks
Choosing the most efficient operators
Partitioner insertion, sort insertion
Combinable Operators
Disk I/O
Ensuring data is evenly partitioned
Buffering
Platform specific tuning
HP-UX
AIX
Disk space requirements of post-release 7.0.1 data sets

Chapter 4. Link buffering
Buffering assumptions
Controlling buffering
Buffering policy
Overriding default buffering behavior
Operators with special buffering requirements

Chapter 5. Specifying your own parallel stages
Defining custom stages
Defining custom stages
Defining build stages
Build stage macros
Informational macros
Flow-control macros
Input and output macros
Transfer Macros
How your code is executed
Inputs and outputs
Example Build Stage
Defining wrapped stages
Example wrapped stage

Chapter 6. Environment Variables
Buffering
APT_BUFFER_FREE_RUN
APT_BUFFER_MAXIMUM_MEMORY
APT_BUFFER_MAXIMUM_TIMEOUT
APT_BUFFER_DISK_WRITE_INCREMENT
APT_BUFFERING_POLICY
APT_SHARED_MEMORY_BUFFERS
Building Custom Stages
DS_OPERATOR_BUILDOP_DIR
OSH_BUILDOP_CODE
OSH_BUILDOP_HEADER
OSH_BUILDOP_OBJECT
OSH_BUILDOP_XLC_BIN
OSH_CBUILDOP_XLC_BIN
Compiler
APT_COMPILER
APT_COMPILEOPT
APT_LINKER
APT_LINKOPT
DB2 Support
APT_DB2INSTANCE_HOME
APT_DB2READ_LOCK_TABLE
APT_DBNAME
APT_RDBMS_COMMIT_ROWS
DB2DBDFT
Debugging
APT_DEBUG_OPERATOR
APT_DEBUG_MODULE_NAMES
APT_DEBUG_PARTITION
APT_DEBUG_SIGNALS
APT_DEBUG_STEP
APT_DEBUG_SUBPROC
APT_EXECUTION_MODE
APT_PM_DBX
APT_PM_GDB
APT_PM_LADEBUG
APT_PM_SHOW_PIDS
APT_PM_XLDB
APT_PM_XTERM
APT_SHOW_LIBLOAD
Decimal support

APT_DECIMAL_INTERM_PRECISION
APT_DECIMAL_INTERM_SCALE
APT_DECIMAL_INTERM_ROUND_MODE
Disk I/O
APT_BUFFER_DISK_WRITE_INCREMENT
APT_CONSISTENT_BUFFERIO_SIZE
APT_EXPORT_FLUSH_COUNT
APT_IO_MAP/APT_IO_NOMAP and APT_BUFFERIO_MAP/APT_BUFFERIO_NOMAP
APT_PHYSICAL_DATASET_BLOCK_SIZE
General Job Administration
Job Monitoring
APT_MONITOR_SIZE
APT_MONITOR_TIME
APT_NO_JOBMON
APT_PERFORMANCE_DATA
Look up support
APT_LUTCREATE_MMAP
APT_LUTCREATE_NO_MMAP
Miscellaneous
APT_COPY_TRANSFORM_OPERATOR
APT_DATE_CENTURY_BREAK_YEAR
APT_EBCDIC_VERSION
APT_IMPEXP_ALLOW_ZERO_LENGTH_FIXED_NULL
APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS
APT_INSERT_COPY_BEFORE_MODIFY
APT_OLD_BOUNDED_LENGTH
APT_OPERATOR_REGISTRY_PATH
APT_PM_NO_SHARED_MEMORY
APT_PM_NO_NAMED_PIPES
APT_PM_SOFT_KILL_WAIT
APT_PM_STARTUP_CONCURRENCY
APT_RECORD_COUNTS
APT_SAVE_SCORE
APT_SHOW_COMPONENT_CALLS
APT_STACK_TRACE
APT_WRITE_DS_VERSION
OSH_PRELOAD_LIBS
Network
NLS Support
APT_COLLATION_SEQUENCE
APT_COLLATION_STRENGTH
APT_ENGLISH_MESSAGES
APT_IMPEXP_CHARSET

APT_INPUT_CHARSET
APT_OS_CHARSET
APT_OUTPUT_CHARSET
APT_STRING_CHARSET
Oracle Support
Partitioning
APT_NO_PART_INSERTION
APT_PARTITION_COUNT
APT_PARTITION_NUMBER
Reading and writing files
APT_DELIMITED_READ_SIZE
APT_FILE_IMPORT_BUFFER_SIZE
APT_FILE_EXPORT_BUFFER_SIZE
APT_IMPORT_PATTERN_USES_FILESET
APT_MAX_DELIMITED_READ_SIZE
APT_PREVIOUS_FINAL_DELIMITER_COMPATIBLE
APT_STRING_PADCHAR
Reporting
APT_DUMP_SCORE
APT_ERROR_CONFIGURATION
APT_MSG_FILELINE
APT_PM_PLAYER_MEMORY
APT_PM_PLAYER_TIMING
APT_RECORD_COUNTS
OSH_DUMP
OSH_ECHO
OSH_EXPLAIN
OSH_PRINT_SCHEMAS
SAS Support
APT_HASH_TO_SASHASH
APT_NO_SASOUT_INSERT
APT_NO_SAS_TRANSFORMS
APT_SAS_ACCEPT_ERROR
APT_SAS_CHARSET
APT_SAS_CHARSET_ABORT
APT_SAS_COMMAND
APT_SASINT_COMMAND
APT_SAS_DEBUG
APT_SAS_DEBUG_IO
APT_SAS_DEBUG_LEVEL
APT_SAS_DEBUG_VERBOSE
APT_SAS_NO_PSDS_USTRING
APT_SAS_S_ARGUMENT
APT_SAS_SCHEMASOURCE_DUMP
APT_SAS_SHOW_INFO
APT_SAS_TRUNCATION
Sorting
APT_NO_SORT_INSERTION
APT_SORT_INSERTION_CHECK_ONLY
Sybase support
APT_SYBASE_NULL_AS_EMPTY
APT_SYBASE_PRESERVE_BLANKS
Teradata Support
APT_TERA_64K_BUFFERS

APT_TERA_NO_ERR_CLEANUP
APT_TERA_NO_SQL_CONVERSION
APT_TERA_NO_PERM_CHECKS
APT_TERA_SYNC_DATABASE
APT_TERA_SYNC_PASSWORD
APT_TERA_SYNC_USER
Transport Blocks
APT_AUTO_TRANSPORT_BLOCK_SIZE
APT_LATENCY_COEFFICIENT
APT_DEFAULT_TRANSPORT_BLOCK_SIZE
APT_MAX_TRANSPORT_BLOCK_SIZE/APT_MIN_TRANSPORT_BLOCK_SIZE
Guide to setting environment variables
Environment variable settings for all jobs
Optional environment variable settings

Chapter 7. Operators
Stage to Operator Mapping
Changeapply operator
Data flow diagram
changeapply: properties
Schemas
Changeapply: syntax and options
Example
Changecapture operator
Data flow diagram
Key and value fields
Changecapture: syntax and options
Changecapture example 1: all output results
Example 2: dropping output results
Checksum operator
Data flow diagram
Properties
Checksum: syntax and options
Checksum: example
Compare operator
Data flow diagram
compare: properties
Compare: syntax and options
Compare example 1: running the compare operator in parallel
Example 2: running the compare operator sequentially
Copy operator
Data flow diagram
Copy: properties
Copy: syntax and options
Preventing WebSphere DataStage from removing a copy operator
Copy example 1: The copy operator
Example 2: running the copy operator sequentially
Diff operator
Data flow diagram
diff: properties
Transfer behavior
Diff: syntax and options
Diff example 1: general example
Example 2: Dropping Output Results
Encode operator
Data flow diagram
encode: properties
Encode: syntax and options
Encoding WebSphere DataStage data sets
Example
Filter operator

Data flow diagram
filter: properties
Filter: syntax and options
Job monitoring information
Expressions
Input data types
Filter example 1: comparing two fields
Example 2: testing for a null
Example 3: evaluating input records
Job scenario: mailing list for a wine auction
Funnel operators
Data flow diagram
sortfunnel: properties
Funnel operator
Sort funnel operators
Generator operator
Data flow diagram
generator: properties
Generator: syntax and options
Using the generator operator
Example 1: using the generator operator
Example 2: executing the operator in parallel
Example 3: using generator with an input data set
Defining the schema for the operator
Timestamp fields
Head operator
Data flow diagram
head: properties
Head: syntax and options
Head example 1: head operator default behavior
Example 2: extracting records from a large data set
Example 3: locating a single record
Lookup operator
Data flow diagrams
lookup: properties
Lookup: syntax and options
Partitioning
Create-only mode
Lookup example 1: single lookup table record
Example 2: multiple lookup table record
Example 3: interest rate lookup example
Example 4: handling duplicate fields example
Merge operator
Data flow diagram
merge: properties
Merge: syntax and options
Merging records
Understanding the merge operator
Example 1: updating national data with state data
Example 2: handling duplicate fields
Job scenario: galactic industries
Missing records
Modify operator
Data flow diagram
modify: properties
Modify: syntax and options
Transfer behavior
Avoiding contiguous modify operators
Performing conversions
Allowed conversions
pcompress operator

Data flow diagram
pcompress: properties
Pcompress: syntax and options
Compressed data sets
Example
Peek operator
Data flow diagram
peek: properties
Peek: syntax and options
Using the operator
PFTP operator
Data flow diagram
Operator properties
Pftp: syntax and options
Restartability
pivot operator
Properties: pivot operator
Pivot: syntax and options
Pivot: examples
Remdup operator
Data flow diagram
remdup: properties
Remdup: syntax and options
Removing duplicate records
Using options to the operator
Using the operator
Example 1: using remdup
Example 2: using the -last option
Example 3: case-insensitive string matching
Example 4: using remdup with two keys
Sample operator
Data flow diagram
sample: properties
Sample: syntax and options
Example sampling of a data set
Sequence operator
Data flow diagram
sequence: properties
Sequence: syntax and options
Example of Using the sequence Operator
Switch operator
Data flow diagram
switch: properties
Switch: syntax and options
Job monitoring information
Example metadata and summary messages
Customizing job monitor messages
Tail operator
Data flow diagram
tail: properties
Tail: syntax and options
Tail example 1: tail operator default behavior
Example 2: tail operator with both options
Transform operator
Running your job on a non-NFS MPP
Data flow diagram
transform: properties
Transform: syntax and options
Transfer behavior
The transformation language
The transformation language versus C

Using the transform operator
Example 1: student-score distribution
Example 2: student-score distribution with a letter grade added to example
Example 3: student-score distribution with a class field added to example
Example 4. student record distribution with null score values and a reject
Example 5. student record distribution with null score values handled
Example 6. student record distribution with vector manipulation
Example 7: student record distribution using sub-record
Example 8: external C function calls
Writerangemap operator
Data flow diagram
writerangemap: properties
Writerangemap: syntax and options
Using the writerange operator

Chapter 8. The import/export library
Record schemas
Import example 1: import schema
Example 2: export schema
Field and record properties
Complete and partial schemas
Implicit import and export
Error handling during import/export
ASCII and EBCDIC conversion tables
Import operator
Data flow diagram
import: properties
Import: syntax and options
How to import data
Example 1: importing from a single data file
Example 2: importing from multiple data files
Export operator
Data flow diagram
export: properties
Export: syntax and options
How to export data
Export example 1: data set export to a single file
Example 2: Data Set Export to Multiple files
Import/export properties
Setting properties
Properties
Properties: reference listing

Chapter 9. The partitioning library
The entire partitioner
Using the partitioner
Data flow diagram
entire: properties
Syntax
The hash partitioner
Specifying hash keys
Example
Using the partitioner
Data flow diagram
hash: properties
Hash: syntax and options
The modulus partitioner
Data flow diagram
modulus: properties
Modulus: syntax and options


Example . . . 418
The random partitioner . . . 419
Using the partitioner . . . 420
Data flow diagram . . . 420
random: properties . . . 420
Syntax . . . 421
The range Partitioner . . . 421
Considerations when using range partitioning . . . 422
The range partitioning algorithm . . . 422
Specifying partitioning keys . . . 422
Creating a range map . . . 423
Example: configuring and using range partitioner . . . 425
Using the partitioner . . . 425
Data flow diagram . . . 426
range: properties . . . 426
Range: syntax and options . . . 426
Writerangemap operator . . . 428
Data flow diagram . . . 428
writerangemap: properties . . . 429
Writerangemap: syntax and options . . . 429
Using the writerange operator . . . 430
The makerangemap utility . . . 431
Makerangemap: syntax and options . . . 431
Using the makerangemap utility . . . 433
The roundrobin partitioner . . . 433
Using the partitioner . . . 434
Data flow diagram . . . 434
roundrobin: properties . . . 434
Syntax . . . 434
The same partitioner . . . 435
Using the partitioner . . . 435
Data flow diagram . . . 435
same: properties . . . 436
Syntax . . . 436

Chapter 10. The collection library . . . . . . . . . . . . . . . . . . . . . . . . 437


The ordered collector
Ordered collecting
ordered Collector: properties
Syntax
The roundrobin collector
Round robin collecting
roundrobin collector: properties
Syntax
The sortmerge collector
Understanding the sortmerge collector
Data flow diagram
Specifying collecting keys
sortmerge: properties
Sortmerge: syntax and options

Chapter 11. The restructure library . . . . . . . . . . . . . . . . . . . . . . . 445


The aggtorec operator . . . 445
Output formats . . . 445
aggtorec: properties . . . 446
Aggtorec: syntax and options . . . 446
Aggtorec example 1: the aggtorec operator without the toplevelkeys option . . . 448
Example 2: the aggtorec operator with multiple key options . . . 448
Example 3: The aggtorec operator with the toplevelkeys option . . . 449
The field_export operator . . . 450


Data flow diagram . . . 450
field_export: properties . . . 451
Field_export: syntax and options . . . 451
Example . . . 452
The field_import operator . . . 453
Data flow diagram . . . 453
field_import: properties . . . 454
Field_import: syntax and options . . . 454
Example . . . 455
The makesubrec operator . . . 457
Data flow diagram . . . 457
makesubrec: properties . . . 458
Transfer behavior . . . 458
Subrecord length . . . 458
Makesubrec: syntax and options . . . 459
The makevect operator . . . 460
Data flow diagram . . . 460
makevect: properties . . . 460
Transfer Behavior . . . 461
Non-consecutive fields . . . 461
Makevect: syntax and options . . . 461
Makevect example 1: The makevect operator . . . 462
Example 2: The makevect operator with missing input fields . . . 462
The promotesubrec Operator . . . 463
Data Flow Diagram . . . 463
promotesubrec: properties . . . 464
Promotesubrec: syntax and options . . . 464
Example . . . 464
The splitsubrec Operator . . . 465
Data Flow Diagram . . . 465
splitsubrec properties . . . 465
Splitsubrec: syntax and options . . . 466
Example . . . 466
The splitvect operator . . . 467
Data flow diagram . . . 467
splitvect: properties . . . 468
Splitvect: syntax and options . . . 468
Example . . . 468
The tagbatch operator . . . 469
Tagged fields and operator limitations . . . 469
Operator action and transfer behavior . . . 470
Data flow diagram . . . 470
tagbatch: properties . . . 471
Added, missing, and duplicate fields . . . 471
Input data set requirements . . . 472
Tagbatch: syntax and options . . . 472
Tagbatch example 1: simple flattening of tag cases . . . 474
Example 2: The tagbatch operator, missing and duplicate cases . . . 475
Example 3: The tagbatch operator with multiple keys . . . 476
The tagswitch operator . . . 477
Data flow diagram . . . 477
tagswitch: properties . . . 478
Input and output interface schemas . . . 478
The case option . . . 478
Using the operator . . . 478
Tagswitch: syntax and options . . . 479
Tagswitch example 1: default behavior . . . 480
Example 2: the tagswitch operator, one case chosen . . . 481

Chapter 12. The sorting library . . . . . . . . . . . . . . . . . . . . . . . . . 485


The tsort operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485


Configuring the tsort operator . . . 487
Using a sorted data set . . . 487
Specifying sorting keys . . . 488
Data flow diagram . . . 489
tsort: properties . . . 489
Tsort: syntax and options . . . 490
Example: using a sequential tsort operator . . . 493
Example: using a parallel tsort operator . . . 494
Performing a total sort . . . 495
Example: performing a total sort . . . 497
The psort operator . . . 499
Performing a partition sort . . . 499
Configuring the partition sort operator . . . 501
Using a sorted data set . . . 501
Data Flow Diagram . . . 503
psort: properties . . . 503
Psort: syntax and options . . . 504
Example: using a sequential partition sort operator . . . 506
Example: using a parallel partition sort operator . . . 507
Performing a total sort . . . 508
Range partitioning . . . 510
Example: Performing a Total Sort . . . 511

Chapter 13. The join library . . . . . . . . . . . . . . . . . . . . . . . . . . 515


Data flow diagrams . . . 515
Join: properties . . . 516
Transfer behavior . . . 517
Input data set requirements . . . 517
Memory use . . . 517
Job monitor reporting . . . 517
Comparison with other operators . . . 517
Input data used in the examples . . . 518
innerjoin operator . . . 518
Innerjoin: syntax and options . . . 519
Example . . . 519
leftouterjoin operator . . . 520
Leftouterjoin: syntax and options . . . 520
Example . . . 521
rightouterjoin operator . . . 522
Rightouterjoin: syntax and options . . . 522
Example . . . 523
fullouterjoin operator . . . 524
Fullouterjoin: syntax and options . . . 524
Example . . . 525

Chapter 14. The ODBC interface library . . . . . . . . . . . . . . . . . . . . . 527


Accessing ODBC from WebSphere DataStage
National Language Support
ICU character set options
Mapping between ODBC and ICU character sets
The odbcread operator
Data flow diagram
odbcread: properties
Odbclookup: syntax and options
Operator action
Column name conversion
Data type conversion
External data source record size
Reading external data source tables
Join operations


Odbcread example 1: reading an external data source table and modifying a field name . . . 533
The odbcwrite operator . . . 534
Writing to a multibyte database . . . 535
Data flow diagram . . . 535
odbcwrite: properties . . . 535
Operator action . . . 535
Where the odbcwrite operator runs . . . 536
Odbcwrite: syntax and options . . . 538
Example 1: writing to an existing external data source table . . . 540
Example 2: creating an external datasource table . . . 541
Example 3: writing to an external data source table using the modify operator . . . 542
Other features . . . 543
The odbcupsert operator . . . 543
Data flow diagram . . . 543
odbcupsert: properties . . . 544
Operator action . . . 544
Odbcupsert: syntax and options . . . 545
Example . . . 546
The odbclookup operator . . . 547
Data flow diagram . . . 548
odbclookup: properties . . . 549
Odbclookup: syntax and options . . . 549
Example . . . 551

Chapter 15. The SAS interface library . . . . . . . . . . . . . . . . . . . . . . 553


Using WebSphere DataStage to run SAS code
Writing SAS programs
Using SAS on sequential and parallel systems
Pipeline parallelism and SAS
Configuring your system to use the SAS interface operators
An example data flow
Representing SAS and non-SAS Data in DataStage
Getting input from a SAS data set
Getting input from a WebSphere DataStage data set or a SAS data set
Converting between data set types
Converting SAS data to WebSphere DataStage data
a WebSphere DataStage example
Parallelizing SAS steps
Executing PROC steps in parallel
Some points to consider in parallelizing SAS code
Using SAS with European languages
Using SAS to do ETL
The SAS interface operators
Specifying a character set and SAS mode
Parallel SAS data sets and SAS International
Specifying an output schema
Controlling ustring truncation
Generating a proc contents report
WebSphere DataStage-inserted partition and sort components
Long name support
Environment variables
The sasin operator
Data flow diagram
sasin: properties
Sasin: syntax and options
The sas operator
Data flow diagram
sas: properties
SAS: syntax and options
The sasout operator
Data flow diagram


sasout: properties . . . 593
Sasout: syntax and options . . . 593
The sascontents operator . . . 595
Data flow diagram . . . 595
sascontents: properties . . . 596
sascontents: syntax and options . . . 596
Example reports . . . 597

Chapter 16. The Oracle interface library . . . . . . . . . . . . . . . . . . . . . 599


Accessing Oracle from WebSphere DataStage
Changing library paths
Preserving blanks in fields
Handling # and $ characters in Oracle column names
National Language Support
ICU character set options
Mapping between ICU and Oracle character sets
The oraread operator
Data flow diagram
oraread: properties
Operator action
Where the oraread operator runs
Column name conversion
Data type conversion
Oracle record size
Targeting the read operation
Join operations
Oraread: syntax and options
Oraread example 1: reading an Oracle table and modifying a field name
Example 2: reading from an Oracle table in parallel with the query option
The orawrite operator
Writing to a multibyte database
Data flow diagram
orawrite: properties
Operator action
Data type conversion
Write modes
Matched and unmatched fields
Orawrite: syntax and options
Example 1: writing to an existing Oracle table
Example 2: creating an Oracle table
Example 3: writing to an Oracle table using the modify operator
The oraupsert operator
Data flow diagram
oraupsert: properties
Operator Action
Associated environment variables
Oraupsert: syntax and options
Example
The oralookup operator
Data flow diagram
Properties
Oralookup: syntax and options
Example

Chapter 17. The DB2 interface library . . . . . . . . . . . . . . . . . . . . . . 633


Configuring WebSphere DataStage access . . . 633
Establishing a remote connection to a DB2 server . . . 634
Handling # and $ characters in DB2 column names . . . 634
Using the -padchar option . . . 635
Running multiple DB2 interface operators in a single step . . . 635


National Language Support . . . 636
Specifying character settings . . . 636
Preventing character-set conversion . . . 636
The db2read operator . . . 637
Data flow diagram . . . 637
db2read: properties . . . 637
Operator action . . . 637
Conversion of a DB2 result set to a WebSphere DataStage data set . . . 638
Targeting the read operation . . . 639
Specifying open and close commands . . . 640
Db2read: syntax and options . . . 641
Db2read example 1: reading a DB2 table with the table option . . . 643
Example 2: reading a DB2 table sequentially with the -query option . . . 644
Example 3: reading a table in parallel with the -query option . . . 644
The db2write and db2load operators . . . 645
Data flow diagram . . . 645
db2write and db2load: properties . . . 645
Actions of the write operators . . . 646
How WebSphere DataStage writes the table: the default SQL INSERT statement . . . 646
Field conventions in write operations to DB2 . . . 647
Data type conversion . . . 647
Write modes . . . 648
Matched and unmatched fields . . . 649
Db2write and db2load: syntax and options . . . 649
db2load special characteristics . . . 656
Db2write example 1: Appending Data to an Existing DB2 Table . . . 657
Example 2: writing data to a DB2 table in truncate mode . . . 658
Example 3: handling unmatched WebSphere DataStage fields in a DB2 write operation . . . 659
Example 4: writing to a DB2 table containing an unmatched column . . . 660
The db2upsert operator . . . 661
Partitioning for db2upsert . . . 661
Data flow diagram . . . 661
db2upsert: properties . . . 661
Operator action . . . 662
Db2upsert: syntax and options . . . 663
The db2part operator . . . 665
Db2upsert: syntax and options . . . 666
Example . . . 667
The db2lookup operator . . . 668
Data flow diagram . . . 669
db2lookup: properties . . . 669
Db2lookup: syntax and options . . . 669
Example . . . 671
Considerations for reading and writing DB2 tables . . . 672
Data translation anomalies . . . 672
Using a node map . . . 672

Chapter 18. The Informix interface library . . . . . . . . . . . . . . . . . . . . 675


Configuring the INFORMIX user environment
Read operators for Informix
Data flow diagram
Read operator action
Execution mode
Column name conversion
Data type conversion
Informix example 1: Reading all data from an Informix table
Write operators for Informix
Data flow diagram
Operator action
Execution mode
Column name conversion


Data type conversion . . . 681
Write modes . . . 681
Matching WebSphere DataStage fields with columns of Informix table . . . 682
Limitations . . . 682
Example 2: Appending data to an existing Informix table . . . 683
Example 3: writing data to an INFORMIX table in truncate mode . . . 684
Example 4: Handling unmatched WebSphere DataStage fields in an Informix write operation . . . 685
Example 5: Writing to an INFORMIX table with an unmatched column . . . 686
hplread operator . . . 687
Special operator features . . . 687
Establishing a remote connection to the hplread operator . . . 687
Data flow diagram . . . 688
Properties of the hplread operator . . . 688
Hplread: syntax and options . . . 689
Example . . . 690
hplwrite operator for Informix . . . 690
Special operator features . . . 690
Data flow diagram . . . 690
Properties of the hplwrite operator . . . 691
hplwrite: syntax and options . . . 691
Examples . . . 692
infxread operator . . . 693
Data Flow Diagram . . . 693
infxread: properties . . . 693
Infxread: syntax and Options . . . 694
Example . . . 695
infxwrite operator . . . 695
Data flow diagram . . . 696
Properties of the infxwrite operator . . . 696
infxwrite: syntax and options . . . 696
Examples . . . 698
xpsread operator . . . 698
Data flow diagram . . . 698
Properties of the xpsread operator . . . 698
Xpsread: syntax and options . . . 699
Example . . . 700
xpswrite operator . . . 700
Data flow diagram . . . 701
Properties of the xpswrite operator . . . 701
Xpswrite: syntax and options . . . 701
Examples . . . 703

Chapter 19. The Teradata interface library . . . . . . . . . . . . . . . . . . . . 705


National language support
Teradata database character sets
Japanese language support
Specifying a WebSphere DataStage ustring character set
Teraread operator
Data flow diagram
teraread: properties
Specifying the query
Column name and data type conversion
teraread restrictions
Teraread: syntax and Options
Terawrite Operator
Data flow diagram
terawrite: properties
Column Name and Data Type Conversion
Correcting load errors
Write modes
Writing fields


Limitations . . . 714
Restrictions . . . 715
Terawrite: syntax and options . . . 715

Chapter 20. The Sybase interface library . . . . . . . . . . . . . . . . . . . . . 719


Accessing Sybase from WebSphere DataStage
Sybase client configuration
National Language Support
The asesybasereade and sybasereade Operators
Data flow diagram
asesybasereade and sybaseread: properties
Operator Action
Where asesybasereade and sybasereade Run
Column name conversion
Data type conversion
Targeting the read operation
Join Operations
Asesybasereade and sybasereade: syntax and Options
Sybasereade example 1: Reading a Sybase Table and Modifying a Field Name
The asesybasewrite and sybasewrite Operators
Writing to a Multibyte Database
Data flow diagram
asesybasewrite and sybasewrite: properties
Operator Action
Where asesybasewrite and sybasewrite Run
Data conventions on write operations to Sybase
Data type conversion
Write Modes
Matched and unmatched fields
asesybasewrite and sybasewrite: syntax and Options
Example 1: Writing to an Existing Sybase Table
Example 2: Creating a Sybase Table
Example 3: Writing to a Sybase Table Using the modify Operator
The asesybaseupsert and sybaseupsert Operators
Data flow diagram
asesybaseupsert and sybaseupsert: properties
Operator Action
Asesybaseupsert and sybaseupsert: syntax and Options
Example
The asesybaselookup and sybaselookup Operators
Data flow diagram
asesybaselookup and sybaselookup: properties
asesybaselookup and sybaselookup: syntax and Options
Example

Chapter 21. The SQL Server interface library . . . . . . . . . . . . . . . . . . . 749


Accessing SQL Server from WebSphere DataStage
UNIX
Windows
National Language Support
The sqlsrvrread operator
Data flow diagram
sqlsrvrread: properties
Operator action
Where the sqlsrvrread operator runs
Column name conversion
Data type conversion
SQL Server record size
Targeting the read operation
Join operations
Sqlsrvrread: syntax and options . . . 753
Sqlsrvrread example 1: Reading a SQL Server table and modifying a field name . . . 755
The sqlsrvrwrite operator . . . 756
Writing to a multibyte database . . . 756
Data flow diagram . . . 756
sqlsrvrwrite: properties . . . 756
Operator action . . . 757
Where the sqlsrvrwrite operator runs . . . 757
Data conventions on write operations to SQL Server . . . 757
Write modes . . . 758
Sqlsrvrwrite: syntax and options . . . 759
Example 1: Writing to an existing SQL Server table . . . 761
Example 2: Creating a SQL Server table . . . 762
Example 3: Writing to a SQL Server table using the modify operator . . . 763
The sqlsrvrupsert operator . . . 764
Data flow diagram . . . 764
sqlsrvrupsert: properties . . . 764
Operator action . . . 764
Sqlsrvrupsert: syntax and options . . . 765
Example . . . 767
The sqlsrvrlookup operator . . . 768
Data flow diagram . . . 769
sqlsrvrlookup: properties . . . 769
Sqlsrvrlookup: syntax and options . . . 770
Example . . . 772

Chapter 22. The iWay interface library . . . . . . . . . . . . . . . . . . . . . . 773


Accessing iWay from WebSphere DataStage
National Language Support
The iwayread operator
Data flow diagram
iwayread: properties
Operator action
Data type conversion
Iwayread: syntax and options
Example: Reading a table via iWay
The iwaylookup operator
Data flow diagram
iwaylookup: properties
Iwaylookup: syntax and options
Example: looking up a table via iWay

Chapter 23. The Netezza Interface Library . . . . . . . . . . . . . . . . . . . . 783


Netezza write operator
Netezza data load methods
nzload method
External table method
Write modes
Limitations of write operation
Character set limitations
Bad input records
Error logs
Syntax for nzwrite operation

Chapter 24. The Classic Federation interface library . . . . . . . . . . . . . . . . 787


Accessing the federated database from WebSphere DataStage . . . 787
National language support . . . 787
International components for unicode character set parameter . . . 788
Mapping between federated and ICU character sets . . . 788
Read operations with classicfedread . . . 788
classicfedread: properties . . . 789
Classicfedread: syntax and options . . . 789
Column name conversion . . . 791
Data type conversion . . . 791
Reading external data source tables . . . 792
Write operations with classicfedwrite . . . 792
Matched and unmatched fields . . . 793
Classicfedwrite: syntax and options . . . 793
Writing to multibyte databases . . . 797
Insert and update operations with classicfedupsert . . . 797
classicfedupsert: Properties . . . 798
Classicfedupsert: syntax and options . . . 798
Example of a federated table when a classicfedupsert operation is performed . . . 799
Lookup Operations with classicfedlookup . . . 800
classicfedlookup: properties . . . 801
classicfedlookup: syntax and options . . . 802
Example of a classicfedlookup operation . . . 803

Chapter 25. Header files . . . . . . . . . . . . . . . . . . . . . . . . . . . . 805


C++ classes - sorted by header file . . . 805
C++ macros - sorted by header file . . . 809

Product documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 811


Contacting IBM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 811

How to read syntax diagrams . . . . . . . . . . . . . . . . . . . . . . . . . 813

Product accessibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 815

Notices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 817
Trademarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 819

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 821


Chapter 1. Terminology
Because of the technical nature of some of the descriptions in this manual, it sometimes talks about details of the engine that drives parallel jobs. This involves the use of terms that might be unfamiliar to ordinary parallel job users.
• Operators. These underlie the stages in an IBM WebSphere DataStage job. A single stage might correspond to a single operator, or a number of operators, depending on the properties you have set, and whether you have chosen to partition or collect or sort data on the input link to a stage. At compilation, WebSphere DataStage evaluates your job design and will sometimes optimize operators out if they are judged to be superfluous, or insert other operators if they are needed for the logic of the job.
• OSH. This is the scripting language used internally by the WebSphere DataStage parallel engine.
• Players. Players are the workhorse processes in a parallel job. There is generally a player for each operator on each node. Players are the children of section leaders; there is one section leader per processing node. Section leaders are started by the conductor process running on the conductor node (the conductor node is defined in the configuration file).

Copyright IBM Corp. 2006, 2008


Chapter 2. Job design tips


These topics give some hints and tips for the good design of parallel jobs.

WebSphere DataStage Designer interface


The following are some tips for smooth use of the WebSphere DataStage Designer when actually laying out your job on the canvas.
• To re-arrange an existing job design, or insert new stage types into an existing job flow, first disconnect the links from the stage to be changed. The links then retain any meta data associated with them.
• A Lookup stage can only have one input stream, one output stream, and, optionally, one reject stream. Depending on the type of lookup, it can have several reference links. To change the use of particular Lookup links in an existing job flow, disconnect the links from the Lookup stage and then right-click to change the link type, for example, Stream to Reference.
• The Copy stage is a good placeholder between stages if you anticipate that new stages or logic will be needed in the future, because it lets you add them without damaging existing properties and derivations. When inserting a new stage, simply drag the input and output links from the Copy placeholder to the new stage. Unless the Force property is set in the Copy stage, WebSphere DataStage optimizes the actual copy out at runtime.

Processing large volumes of data


The ability to process large volumes of data in a short period of time depends on all aspects of the flow and the environment being optimized for maximum throughput and performance. Performance tuning and optimization are iterative processes that begin with job design and unit tests, proceed through integration and volume testing, and continue throughout the production life cycle of the application. Here are some performance pointers:
• When writing intermediate results that will only be shared between parallel jobs, always write to persistent data sets (using Data Set stages). You should ensure that the data is partitioned, and that the partitions, and sort order, are retained at every stage. Avoid format conversion or serial I/O.
• Data Set stages should be used to create restart points in the event that a job or sequence needs to be rerun. But, because data sets are platform and configuration specific, they should not be used for long-term backup and recovery of source data.
• Depending on available system resources, it might be possible to optimize overall processing time at run time by allowing smaller jobs to run concurrently. However, care must be taken to plan for scenarios when source files arrive later than expected, or need to be reprocessed in the event of a failure.
• Parallel configuration files allow the degree of parallelism and resources used by parallel jobs to be set dynamically at runtime. Multiple configuration files should be used to optimize overall throughput and to match job characteristics to available hardware resources in development, test, and production modes.

The proper configuration of scratch and resource disks and the underlying filesystem and physical hardware architecture can significantly affect overall job performance. Within clustered ETL and database environments, resource-pool naming can be used to limit processing to specific nodes, including database nodes when appropriate.

Modular development
You should aim to use modular development techniques in your job designs in order to maximize the reuse of parallel jobs and components and save yourself time.


• Use job parameters in your design and supply values at run time. This allows a single job design to process different data in different circumstances, rather than producing multiple copies of the same job with slightly different arguments.
• Using job parameters allows you to exploit the WebSphere DataStage Director's multiple invocation capability. You can run several invocations of a job at the same time with different runtime arguments.
• Use shared containers to share common logic across a number of jobs. Remember that shared containers are inserted when a job is compiled. If the shared container is changed, the jobs using it will need recompiling.

Designing for good performance


Here are some tips for designing good performance into your job from the outset.

Avoid unnecessary type conversions.


Be careful to use proper source data types, especially from Oracle. You can set the OSH_PRINT_SCHEMAS environment variable to verify that runtime schemas match the job design column definitions. If you are using stage variables on a Transformer stage, ensure that their data types match the expected result types.

Use Transformer stages sparingly and wisely


Do not have multiple stages where the functionality could be incorporated into a single stage, and use other stage types to perform simple transformation operations (see Using Transformer Stages for more guidance).

Increase sort performance where possible


Careful job design can improve the performance of sort operations, both in standalone Sort stages and in on-link sorts specified in the Inputs page Partitioning tab of other stage types. See Sorting Data for guidance.

Remove unneeded columns


Remove unneeded columns as early as possible within the job flow. Every additional unused column requires additional buffer memory, which can impact performance and make each row transfer from one stage to the next more expensive. If possible, when reading from databases, use a select list to read just the columns required, rather than the entire table.

Avoid reading from sequential files using the Same partitioning method.
Unless you have specified more than one source file, this will result in the entire file being read into a single partition, making the entire downstream flow run sequentially unless you explicitly repartition (see Using Sequential File Stages for more tips on using Sequential file stages).

Combining data
The two major ways of combining data in a WebSphere DataStage job are via a Lookup stage or a Join stage. How do you decide which one to use? Lookup and Join stages perform equivalent operations: combining two or more input data sets based on one or more specified keys. When one unsorted input is very large or sorting is not feasible, Lookup is preferred. When all inputs are of manageable size or are pre-sorted, Join is the preferred solution.


The Lookup stage is most appropriate when the reference data for all Lookup stages in a job is small enough to fit into available physical memory. Each lookup reference requires a contiguous block of physical memory. The Lookup stage requires all but the first input (the primary input) to fit into physical memory. If the reference to a lookup is directly from a DB2 or Oracle table and the number of input rows is significantly smaller than the reference rows, 1:100 or more, a Sparse Lookup might be appropriate. If performance issues arise while using Lookup, consider using the Join stage. The Join stage must be used if the data sets are larger than available memory resources.
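The rules of thumb above can be summarized as a rough decision procedure. The following Python sketch is illustrative only: the function name, its parameters, and the way the 1:100 ratio is tested are assumptions made for this example and are not part of WebSphere DataStage. It returns a suggestion, not a guarantee of the best-performing design.

```python
def choose_combine_stage(reference_bytes, available_memory_bytes,
                         input_rows=None, reference_rows=None,
                         reference_is_db2_or_oracle=False):
    """Suggest a stage type from the rules of thumb in the text.

    Hypothetical helper: all names and thresholds are assumptions
    chosen for illustration.
    """
    # Sparse Lookup: the reference comes directly from a DB2 or Oracle
    # table and the input is much smaller than the reference (1:100 or more).
    if (reference_is_db2_or_oracle
            and input_rows is not None
            and reference_rows is not None
            and input_rows * 100 <= reference_rows):
        return "Sparse Lookup"
    # A normal Lookup needs the reference data to fit in physical memory.
    if reference_bytes <= available_memory_bytes:
        return "Lookup"
    # Data sets larger than available memory must use Join.
    return "Join"


print(choose_combine_stage(10 * 2**20, 4 * 2**30))   # small reference data
print(choose_combine_stage(64 * 2**30, 4 * 2**30))   # reference exceeds memory
```

In practice the memory figure would have to cover the reference data of all Lookup stages in the job, as the text notes, so a single-stage check like this one is optimistic.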

Sorting data
Look at job designs and try to reorder the job flow to combine operations around the same sort keys if possible, and coordinate your sorting strategy with your hashing strategy. It is sometimes possible to rearrange the order of business logic within a job flow to leverage the same sort order, partitioning, and groupings.

If data has already been partitioned and sorted on a set of key columns, specify the "don't sort, previously sorted" option for the key columns in the Sort stage. This reduces the cost of sorting and takes greater advantage of pipeline parallelism.

When writing to parallel data sets, sort order and partitioning are preserved. When reading from these data sets, try to maintain this sorting if possible by using the Same partitioning method.

The stable sort option is much more expensive than non-stable sorts, and should only be used if there is a need to maintain row order other than as needed to perform the sort.

The performance of individual sorts can be improved by increasing the memory usage per partition using the Restrict Memory Usage (MB) option of the Sort stage. The default setting is 20 MB per partition. Note that sort memory usage can only be specified for standalone Sort stages; it cannot be changed for inline (on a link) sorts.

Default and explicit type conversions


When you are mapping data from source to target you might need to perform data type conversions. Some conversions happen automatically, and these can take place across the output mapping of any parallel job stage that has an input and an output link. Other conversions need a function to explicitly perform the conversion. These functions can be called from a Modify stage or a Transformer stage, and are listed in Appendix B of WebSphere DataStage Parallel Job Developer Guide. (Modify is the preferred stage for such conversions - see Using Transformer Stages.) The following table shows which conversions are performed automatically and which need to be explicitly performed. d indicates automatic (default) conversion, m indicates that manual conversion is required, a blank square indicates that conversion is not possible:
Table 1. Default and explicit type conversions, part 1

          Target
Source    int8  uint8  int16  uint16  int32  uint32  int64  uint64  sfloat  dfloat
int8            d      d      d       d      d       d      d       d       dm
uint8     d            d      d       d      d       d      d       d       d
int16     dm    d             d       d      d       d      d       d       d
uint16    d     d      d              d      d       d      d       d       d


Table 1. Default and explicit type conversions, part 1 (continued) Target Source int32 uint32 int64 uint64 sfloat dfloat decimal string ustring raw date time time stamp int8 dm d dm d dm dm dm dm dm m m m m m uint8 d d d d d d d d d int16 d d d d d d d dm dm uint16 d d d d d d d d d d d d d d dm d d m m m m m m m d d d d d dm dm d d d dm d d d d dm d d d d d d dm dm dm int32 uint32 d int64 d d uint64 d d d sfloat d d d d dfloat d d d d d

Table 2. Default and explicit type conversions, part 2 Target Source int8 uint8 int16 uint16 int32 uint32 int64 uint64 sfloat dfloat decimal string ustring raw date time timestamp m m m m m m m m m dm dm dm d decimal d d d d d d d d d dm string dm d dm dm dm m d d d dm dm ustring dm d dm dm dm m d d d dm dm d m m m m m m m m m m raw date m time m timestamp m

d = default conversion; m = modify operator conversion; blank = no conversion needed or provided.

You should also note the following points about type conversion:


• When converting from variable-length to fixed-length strings using default conversions, parallel jobs pad the remaining length with NULL (ASCII zero) characters.
• The environment variable APT_STRING_PADCHAR can be used to change the default pad character from an ASCII NULL (0x0) to another character; for example, an ASCII space (0x20) or a unicode space (U+0020).
• As an alternate solution, the PadString function can be used to pad a variable-length (Varchar) string to a specified length using a specified pad character. Note that PadString does not work with fixed-length (Char) string types. You must first convert Char to Varchar before using PadString.
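To make the effect of the pad character concrete, the following Python sketch models what padding a variable-length value to a fixed length produces. It is not DataStage code: the function pad_to_fixed is a hypothetical helper, and raising an error for an over-long value is an assumption of this example, not documented engine behavior.

```python
def pad_to_fixed(value: str, length: int, pad_char: str = "\x00") -> str:
    """Pad a variable-length string to a fixed length (illustration only)."""
    if len(pad_char) != 1:
        raise ValueError("pad_char must be a single character")
    if len(value) > length:
        # Assumption for this sketch; the text does not describe this case.
        raise ValueError("value is longer than the fixed length")
    return value.ljust(length, pad_char)


default_padded = pad_to_fixed("abc", 5)      # ASCII NULL padding, the default
space_padded = pad_to_fixed("abc", 5, " ")   # padding with an ASCII space (0x20)

print(repr(default_padded))  # 'abc\x00\x00'
print(repr(space_padded))    # 'abc  '
```

The two results have the same length but compare as different strings, which is why an unexpected pad character can cause key mismatches when padded values are compared or joined.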

Using Transformer stages


In general, it is good practice not to use more Transformer stages than you have to. You should especially avoid using multiple Transformer stages where the logic can be combined into a single stage. It is often better to use other stage types for certain types of operation:
• Use a Copy stage rather than a Transformer for simple operations such as:
  - Providing a job design placeholder on the canvas. (Provided you do not set the Force property to True on the Copy stage, the copy will be optimized out of the job at run time.)
  - Renaming columns.
  - Dropping columns.
  - Implicit type conversions (see Default and Explicit Type Conversions).
  Note that, if runtime column propagation is disabled, you can also use output mapping on a stage to rename, drop, or convert columns on a stage that has both inputs and outputs.
• Use the Modify stage for explicit type conversion (see Default and Explicit Type Conversions) and null handling.
• Where complex, reusable logic is required, or where existing Transformer-stage based job flows do not meet performance requirements, consider building your own custom stage (see Specifying your own parallel stages).
• Use a BASIC Transformer stage where you want to take advantage of user-defined functions and routines.

Using Sequential File stages


Certain considerations apply when reading and writing fixed-length fields using the Sequential File stage.
• If reading columns that have an inherently variable-width type (for example, integer, decimal, or varchar) then you should set the Field Width property to specify the actual fixed-width of the input column. Do this by selecting Edit Row... from the shortcut menu for a particular column in the Columns tab, and specify the width in the Edit Column Meta Data dialog box.
• If writing fixed-width columns with types that are inherently variable-width, then set the Field Width property and the Pad char property in the Edit Column Meta Data dialog box to match the width of the output column.

Other considerations are as follows:
• If a column is nullable, you must define the null field value and length in the Edit Column Meta Data dialog box.
• Be careful when reading delimited, bounded-length varchar columns (that is, varchars with the length option set). If the source file has fields which are longer than the maximum varchar length, these extra characters are silently discarded.
• Avoid reading from sequential files using the Same partitioning method. Unless you have specified more than one source file, this will result in the entire file being read into a single partition, making the entire downstream flow run sequentially unless you explicitly repartition.


Using Database stages


The best choice is to use connector stages if available for your database. The next best choice is the Enterprise database stages, as these give maximum parallel performance and features when compared to plug-in stages. The Enterprise stages are:
• DB2/UDB Enterprise
• Informix Enterprise
• Oracle Enterprise
• Teradata Enterprise
• SQLServer Enterprise
• Sybase Enterprise
• ODBC Enterprise
• iWay Enterprise
• Netezza Enterprise

You should avoid generating target tables in the database from your WebSphere DataStage job (that is, using the Create write mode on the database stage) unless they are intended for temporary storage only. This is because this method does not allow you to, for example, specify target table space, and you might inadvertently violate data-management policies on the database. If you want to create a table on a target database from within a job, use the Open command property on the database stage to explicitly create the table and allocate tablespace, or any other options required.

The Open command property allows you to specify a command (for example some SQL) that will be executed by the database before it processes any data from the stage. There is also a Close property that allows you to specify a command to execute after the data from the stage has been processed. (Note that, when using user-defined Open and Close commands, you might need to explicitly specify locks where appropriate.)

Database sparse lookup vs. join


Data read by any database stage can serve as the reference input to a Lookup stage. By default, this reference data is loaded into memory like any other reference link. When directly connected as the reference link to a Lookup stage, both DB2/UDB Enterprise and Oracle Enterprise stages allow the lookup type to be changed to Sparse and send individual SQL statements to the reference database for each incoming Lookup row. Sparse Lookup is only available when the database stage is directly connected to the reference link, with no intermediate stages. It is important to note that the individual SQL statements required by a Sparse Lookup are an expensive operation from a performance perspective. In most cases, it is faster to use a WebSphere DataStage Join stage between the input and DB2 reference data than it is to perform a Sparse Lookup. For scenarios where the number of input rows is significantly smaller (1:100 or more) than the number of reference rows in a DB2 or Oracle table, a Sparse Lookup might be appropriate.

DB2 database tips


If available, use the DB2 connector. Otherwise, always use the DB2 Enterprise stage in preference to the DB2/API plugin stage for reading from, writing to, and performing lookups against a DB2 Enterprise Server Edition with the Database Partitioning Feature (DPF). The DB2 Enterprise stage is designed for maximum performance and scalability against very large partitioned DB2 UNIX databases. The DB2/API plugin should only be used to read from and write to DB2 on other, non-UNIX platforms. You might, for example, use it to access mainframe editions through DB2 Connect.


Write vs. load


The DB2 Enterprise stage offers the choice between SQL methods (insert, update, upsert, delete) or fast loader methods when writing to a DB2 database. The choice between these methods depends on the required performance, database log usage, and recoverability considerations as follows:
• The write method (using insert, update, upsert, or delete) communicates directly with DB2 database nodes to execute instructions in parallel. All operations are logged to the DB2 database log, and the target table(s) can be accessed by other users. Time and row-based commit intervals determine the transaction size and availability of new rows to other applications.
• The load method requires that the user running the job has DBADM privilege on the target database. During a load operation an exclusive lock is placed on the entire DB2 tablespace into which the data is being loaded, and so this tablespace cannot be accessed by anyone else while the load is taking place. The load is also non-recoverable: if the load operation is terminated before it is completed, the contents of the table are unusable and the tablespace is left in the load pending state. If this happens, the WebSphere DataStage job must be re-run with the stage set to truncate mode to clear the load pending state.

Oracle database tips


When designing jobs that use Oracle sources or targets, note that the parallel engine will use its interpretation of the Oracle meta data (for example, exact data types) based on interrogation of Oracle, overriding what you might have specified in the Columns tab. For this reason it is best to import your Oracle table definitions using the Import Orchestrate Schema Definitions command from the WebSphere DataStage Designer. Choose the Database table option and follow the instructions from the wizard.

Loading and indexes


When you use the Load write method in an Oracle Enterprise stage, you are using the Parallel Direct Path load method. If you want to use this method to write tables that have indexes on them (including indexes automatically generated by primary key constraints), you must specify the Index Mode property (you can set it to Maintenance or Rebuild). An alternative is to set the environment variable APT_ORACLE_LOAD_OPTIONS to OPTIONS (DIRECT=TRUE, PARALLEL=FALSE). This allows the loading of indexed tables without index maintenance, but the load is performed sequentially. You can use the upsert write method to insert rows into an Oracle table without bypassing indexes or constraints. In order to automatically generate the SQL needed, set the Upsert Mode property to Auto-generated and identify the key column(s) on the Columns tab by selecting the Key check boxes.

Teradata Database Tips


You can use the Additional Connections Options property in the Teradata Enterprise stage (which is a dependent of DB Options Mode) to specify details about the number of connections to Teradata. The possible values of this are:
• sessionsperplayer. This determines the number of connections each player in the job has to Teradata. The number should be selected such that:
  (sessions per player * number of nodes * players per node) = total requested sessions
  The default value is 2. Setting this too low on a large system can result in so many players that the job fails due to insufficient resources.
• requestedsessions. This is a number between 1 and the number of vprocs in the database. The default is the maximum number of available sessions.
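The relationship between the two values can be checked with simple arithmetic. The Python sketch below is only a worked example of the formula above; the function names and the sample figures (4 nodes, 2 players per node) are assumptions made for illustration.

```python
def total_requested_sessions(sessions_per_player, nodes, players_per_node):
    """Apply the formula from the text to get the total session count."""
    return sessions_per_player * nodes * players_per_node


def sessions_per_player_for(requested_sessions, nodes, players_per_node):
    """Solve the same formula for sessions per player.

    Raises ValueError when the requested total cannot be divided evenly
    among the players (a choice made for this sketch).
    """
    players = nodes * players_per_node
    if requested_sessions % players != 0:
        raise ValueError("requested sessions do not divide evenly among players")
    return requested_sessions // players


# Hypothetical system: 4 nodes, 2 players per node, default of 2 sessions per player.
print(total_requested_sessions(2, 4, 2))   # 16
print(sessions_per_player_for(32, 4, 2))   # 4
```

Reading the formula this way also shows the trade-off the text describes: for a fixed total, a lower sessions-per-player value implies more players.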


Chapter 3. Improving performance


Use the information in these topics to help resolve any performance problems. These topics assume that basic steps to assure performance have been taken: a suitable configuration file has been set up, reasonable swap space configured and so on, and that you have followed the design guidelines laid down in Chapter 2, Job design tips, on page 3.

Understanding a flow
In order to resolve any performance issues it is essential to have an understanding of the flow of WebSphere DataStage jobs.

Score dumps
To help understand a job flow you might take a score dump. Do this by setting the APT_DUMP_SCORE environment variable to true and running the job (APT_DUMP_SCORE can be set in the Administrator client, under the Parallel → Reporting branch). This causes a report to be produced which shows the operators, processes and data sets in the job. The report includes information about:
• Where and how data is repartitioned.
• Whether WebSphere DataStage has inserted extra operators in the flow.
• The degree of parallelism each operator runs with, and on which nodes.
• Where data is buffered.

The dump score information is included in the job log when you run a job. The score dump is particularly useful in showing you where WebSphere DataStage is inserting additional components in the job flow. In particular WebSphere DataStage will add partition and sort operators where the logic of the job demands it. Sorts in particular can be detrimental to performance and a score dump can help you to detect superfluous operators and amend the job design to remove them.

Example score dump


The following score dump shows a flow with a single data set, which has a hash partitioner, partitioning on key a. It shows three operators: generator, tsort, and peek. Tsort and peek are combined, indicating that they have been optimized into the same process. All the operators in this flow are running on one node.
    ##I TFSC 004000 14:51:50(000) <main_program>
    This step has 1 data set:
    ds0: {op0[1p] (sequential generator)
          eOther(APT_HashPartitioner { key={ value=a }
          })->eCollectAny
          op1[2p] (parallel APT_CombinedOperatorController:tsort)}
    It has 2 operators:
    op0[1p] {(sequential generator)
        on nodes (
          lemond.torrent.com[op0,p0]
        )}
    op1[2p] {(parallel APT_CombinedOperatorController:
          (tsort)
          (peek)
        ) on nodes (
          lemond.torrent.com[op1,p0]
          lemond.torrent.com[op1,p1]
        )}
    It runs 3 processes on 2 nodes.
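When a score dump is long, it can help to summarize the operator section programmatically. The following Python sketch is an illustration only: it assumes that each operator entry starts a line in the form opN[Mp] {(sequential ... or opN[Mp] {(parallel ..., as in the example above, and the function name and regular expression are not part of WebSphere DataStage.

```python
import re

# Operator section copied from the example score dump above.
SCORE_DUMP = """\
It has 2 operators:
op0[1p] {(sequential generator)
    on nodes (
      lemond.torrent.com[op0,p0]
    )}
op1[2p] {(parallel APT_CombinedOperatorController:
      (tsort)
      (peek)
    ) on nodes (
      lemond.torrent.com[op1,p0]
      lemond.torrent.com[op1,p1]
    )}
It runs 3 processes on 2 nodes.
"""

# Matches an operator header such as "op1[2p] {(parallel" at the start of a line.
OPERATOR_RE = re.compile(r"^op(\d+)\[(\d+)p\] \{\((sequential|parallel)\b", re.MULTILINE)


def summarize_operators(score_text):
    """Return (operator number, partition count, execution mode) for each operator."""
    return [(int(num), int(parts), mode) for num, parts, mode in OPERATOR_RE.findall(score_text)]


operators = summarize_operators(SCORE_DUMP)
total_processes = sum(parts for _, parts, _ in operators)
for num, parts, mode in operators:
    print(f"op{num}: {mode}, {parts} partition(s)")
print(f"player processes: {total_processes}")
```

For this example, the partition counts (1p and 2p) add up to the 3 processes that the dump reports on its last line.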

Tips for debugging


• Use the Data Set Management utility, which is available in the Tools menu of the WebSphere DataStage Designer, to examine the schema, look at row counts, and delete a Parallel Data Set. You can also view the data itself.
• Check the WebSphere DataStage job log for warnings. These might indicate an underlying logic problem or unexpected data type conversion.
• Enable the APT_DUMP_SCORE and APT_RECORD_COUNTS environment variables. Also enable OSH_PRINT_SCHEMAS to ensure that a runtime schema of a job matches the design-time schema that was expected.
• The UNIX command od -xc displays the actual data contents of any file, including any embedded ASCII NULL characters.
• The UNIX command wc -lc filename displays the number of lines and characters in the specified ASCII text file. Dividing the total number of characters by the number of lines provides an audit to ensure that all rows are the same length. It is important to know that the wc utility works by counting UNIX line delimiters, so if the file has any binary columns, this count might be incorrect.
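The row-length audit described in the last tip can be sketched in Python as follows. This is a minimal illustration, not a replacement for wc: the function name is an assumption, and the count is in bytes, which equals the character count only for single-byte (ASCII) data.

```python
def audit_fixed_width(data: bytes):
    """Count newline delimiters and bytes (as wc -lc does), then check the row lengths."""
    line_count = data.count(b"\n")   # number of UNIX line delimiters
    char_count = len(data)           # number of bytes in the file
    pieces = data.split(b"\n")
    rows, trailing = pieces[:-1], pieces[-1]
    # Uniform means: at least one row, nothing after the last newline,
    # and every newline-terminated row has the same length.
    uniform = bool(rows) and not trailing and len({len(row) for row in rows}) == 1
    average = char_count / line_count if line_count else None
    return line_count, char_count, average, uniform


# Hypothetical file contents: two rows of three characters each.
print(audit_fixed_width(b"abc\ndef\n"))
```

To audit a real file, read it in binary mode (open(filename, "rb").read()) and pass the result to the function. As with wc, a binary column that happens to contain a newline byte makes the row count misleading.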

Performance monitoring
There are various tools you can use to aid performance monitoring, some provided with WebSphere DataStage and some general UNIX tools.

Job monitor
You access the WebSphere DataStage job monitor through the WebSphere DataStage Director (see WebSphere DataStage Director Client Guide). You can also use certain dsjob commands from the command line to access monitoring functions (see Retrieving Information for details).

The job monitor provides a useful snapshot of a job's performance at a moment of execution, but does not provide thorough performance metrics. That is, a job monitor snapshot should not be used in place of a full run of the job, or a run with a sample set of data. Due to buffering and to some job semantics, a snapshot image of the flow might not be a representative sample of the performance over the course of the entire job.

The CPU summary information provided by the job monitor is useful as a first approximation of where time is being spent in the flow. However, it does not include any sorts or similar operators that might be inserted automatically in a parallel job. For these components, the score dump can be of assistance. See Score dumps.

A worst-case scenario occurs when a job flow reads from a data set, and passes immediately to a sort on a link. The job will appear to hang, when, in fact, rows are being read from the data set and passed to the sort.

The operation of the job monitor is controlled by two environment variables: APT_MONITOR_TIME and APT_MONITOR_SIZE. By default the job monitor takes a snapshot every five seconds. You can alter the time interval by changing the value of APT_MONITOR_TIME, or you can have the monitor generate a new snapshot every so-many rows by following this procedure:
1. Select APT_MONITOR_TIME on the WebSphere DataStage Administrator environment variable dialog box, and press the set to default button.
2. Select APT_MONITOR_SIZE and set the required number of rows as the value for this variable.

Iostat
The UNIX tool iostat is useful for examining the throughput of various disk resources. If one or more disks have high throughput, understanding where that throughput is coming from is vital. If there are spare CPU cycles, I/O is often the culprit. The specifics of iostat output vary slightly from system to system. Here is an example from a Linux machine which shows a relatively light load (the first set of output is cumulative data since the machine was booted):
    Device:    tps     Blk_read/s   Blk_wrtn/s   Blk_read     Blk_wrtn
    dev8-0     13.50   144.09       122.33       346233038    293951288
    ...
    Device:    tps     Blk_read/s   Blk_wrtn/s   Blk_read     Blk_wrtn
    dev8-0     4.00    0.00         96.00        0            96

Load average
Ideally, a performant job flow should be consuming as much CPU as is available. The load average on the machine should be two to three times the number of processors on the machine (for example, an 8-way SMP should have a load average of roughly 16-24). Some operating systems, such as HPUX, show per-processor load average. In this case, load average should be 2-3, regardless of the number of CPUs on the machine. If the machine is not CPU-saturated, it indicates that a bottleneck might exist elsewhere in the flow. A useful strategy in this case is to over-partition your data, as more partitions cause extra processes to be started, utilizing more of the available CPU power. If the flow causes the machine to be fully loaded (all CPUs at 100%), then the flow is likely to be CPU limited, and some determination needs to be made as to where the CPU time is being spent (setting the APT_PM_PLAYER_TIMING environment variable can be helpful here; see the following section). The commands top or uptime can provide the load average.
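The rule of thumb above can be expressed as a small calculation. The following Python sketch is an illustration only: the function names and the three result labels are assumptions, and the thresholds simply restate the two-to-three-times guideline from this section.

```python
def target_load_range(cpu_count, per_processor=False):
    """Return the (low, high) load average suggested by the guideline above."""
    if per_processor:
        # Systems that report a per-processor load average: 2-3 regardless of CPU count.
        return (2.0, 3.0)
    return (2.0 * cpu_count, 3.0 * cpu_count)


def classify_load(load_average, cpu_count, per_processor=False):
    """Compare an observed load average with the target range."""
    low, high = target_load_range(cpu_count, per_processor)
    if load_average < low:
        return "below"   # not CPU-saturated: a bottleneck might exist elsewhere
    if load_average > high:
        return "above"
    return "within"


# An 8-way SMP, with a load average of 10 as reported by uptime (sample value).
print(target_load_range(8))
print(classify_load(10.0, 8))
```

The observed value can be taken from the output of top or uptime. On UNIX systems, the Python standard library function os.getloadavg() returns the same figures.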

Runtime information
When you set the APT_PM_PLAYER_TIMING environment variable, information is provided for each operator in a job flow. This information is written to the job log when the job is run. An example output is:
    ##I TFPM 000324 08:59:32(004) <generator,0> Calling runLocally: step=1, node=rh73dev04, op=0, ptn=0
    ##I TFPM 000325 08:59:32(005) <generator,0> Operator completed. status: APT_StatusOk elapsed: 0.04 user: 0.00 sys: 0.00 suser: 0.09 ssys: 0.02 (total CPU: 0.11)
    ##I TFPM 000324 08:59:32(006) <peek,0> Calling runLocally: step=1, node=rh73dev04, op=1, ptn=0
    ##I TFPM 000325 08:59:32(012) <peek,0> Operator completed. status: APT_StatusOk elapsed: 0.01 user: 0.00 sys: 0.00 suser: 0.09 ssys: 0.02 (total CPU: 0.11)
    ##I TFPM 000324 08:59:32(013) <peek,1> Calling runLocally: step=1, node=rh73dev04a, op=1, ptn=1
    ##I TFPM 000325 08:59:32(019) <peek,1> Operator completed. status: APT_StatusOk elapsed: 0.00 user: 0.00 sys: 0.00 suser: 0.09 ssys: 0.02 (total CPU: 0.11)

This output shows that each partition of each operator has consumed about one tenth of a second of CPU time during its runtime portion. In a real world flow, we'd see many operators, and many partitions. It is often useful to see how much CPU each operator (and each partition of each component) is using. If one partition of an operator is using significantly more CPU than others, it might mean the data is partitioned in an unbalanced way, and that repartitioning, or choosing different partitioning keys, might be a useful strategy.
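In a flow with many operators and partitions, collecting the total CPU figures by hand is tedious. The following Python sketch is an illustration only: it assumes that each "Operator completed" message is on one log line in the layout of the example output above, and the function names and the skew threshold of 2.0 are arbitrary assumptions, not product defaults.

```python
import re
from collections import defaultdict

# "Operator completed" messages modeled on the example output above.
LOG = """\
##I TFPM 000325 08:59:32(005) <generator,0> Operator completed. status: APT_StatusOk elapsed: 0.04 user: 0.00 sys: 0.00 suser: 0.09 ssys: 0.02 (total CPU: 0.11)
##I TFPM 000325 08:59:32(012) <peek,0> Operator completed. status: APT_StatusOk elapsed: 0.01 user: 0.00 sys: 0.00 suser: 0.09 ssys: 0.02 (total CPU: 0.11)
##I TFPM 000325 08:59:32(019) <peek,1> Operator completed. status: APT_StatusOk elapsed: 0.00 user: 0.00 sys: 0.00 suser: 0.09 ssys: 0.02 (total CPU: 0.11)
"""

# Captures the operator name, the partition number, and the total CPU seconds.
COMPLETED_RE = re.compile(r"<([^,>]+),(\d+)> Operator completed\..*?\(total CPU: ([\d.]+)\)")


def cpu_by_partition(log_text):
    """Return {operator: {partition: total CPU seconds}}."""
    usage = defaultdict(dict)
    for operator, partition, cpu in COMPLETED_RE.findall(log_text):
        usage[operator][int(partition)] = float(cpu)
    return dict(usage)


def skewed_operators(usage, ratio=2.0):
    """List operators whose busiest partition used more than ratio times the least busy one."""
    flagged = []
    for operator, partitions in usage.items():
        low, high = min(partitions.values()), max(partitions.values())
        if len(partitions) > 1 and high > ratio * low:
            flagged.append(operator)
    return flagged


usage = cpu_by_partition(LOG)
print(usage)
print("possibly unbalanced:", skewed_operators(usage))
```

For the example output, no operator is flagged, because both partitions of the peek operator used the same amount of CPU.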

If one operator is using a much larger portion of the CPU than others, it might be an indication that you've discovered a problem in your flow. Common sense is generally required here; a sort is going to use dramatically more CPU time than a copy. This will, however, give you a sense of which operators are the CPU hogs, and when combined with other metrics presented in this document can be very enlightening. Setting the environment variable APT_DISABLE_COMBINATION might be useful in some situations to get finer-grained information as to which operators are using up CPU cycles. Be aware, however, that setting this flag will change the performance behavior of your flow, so this should be done with care. Unlike the job monitor CPU percentages, setting APT_PM_PLAYER_TIMING will provide timings on every operator within the flow.

Performance data
You can record performance data about job objects and computer resource utilization in parallel job runs. You can record performance data in these ways:
• At design time, with the Designer client
• At run time, with either the Designer client or the Director client
Performance data is written to an XML file that is in the default directory C:\IBM\InformationServer\Server\Performance. You can override the default location by setting the environment variable APT_PERFORMANCE_DATA. Use the Administrator client to set a value for this variable at the project level, or use the Parameters page of the Job Properties window to specify a value at the job level.

Recording performance data at design time


At design time, you can set a flag to specify that you want to record performance data when the job runs. To record performance data at design time:
1. Open a job in the Designer client.
2. Click Edit > Job Properties.
3. Click the Execution page.
4. Select the Record job performance data check box.

Performance data is recorded each time that the job runs successfully.

Recording performance data at run time


You can use the Designer client or the Director client to record performance data at run time. To record performance data at run time:
1. Open a job in the Designer client, or select a job in the display area of the Director client.
2. Click the Run button on the toolbar to open the Job Run Options window.
3. Click the General page.
4. Select the Record job performance data check box.
Performance data is recorded each time that the job runs successfully.

Viewing performance data


Use the Performance Analysis window to view charts that interpret job performance and computer resource utilization. First you must record data about job performance. See Recording performance data at design time or Recording performance data at run time for more information.

You can view performance data in either the Designer client or the Director client. To view performance data:
1. Open the Performance Analysis window by using one of the following methods:
   • In the Designer client, click File > Performance Analysis.
   • In the Director client, click Job > Analyze Performance.
   • In either client, click the Performance Analysis toolbar button.
2. In the Performance Data group in the left pane, select the job run that you want to analyze. Job runs are listed in descending order according to the timestamp.
3. In the Charts group, select the chart that you want to view.
4. If you want to exclude certain job objects from a chart, use one of the following methods:
   • For individual objects, clear the check boxes in the Job Tree group.
   • For all objects of the same type, clear the check boxes in the Partitions, Stages, and Phases groups.
5. Optional: In the Filters group, change how data is filtered in a chart.
6. Click Save to save the job performance data in an archive. The archive includes the following files:
   • Performance data file named performance.xxxx (where xxxx is the suffix that is associated with the job run)
   • Computer descriptions file named description.xxxx
   • Computer utilization file named utilization.xxxx
   • Exported job definition named exportedjob.xxxx

When you open a performance data file, the system creates a mapping between the job stages that are displayed on the Designer client canvas and the operating system processes that define a job. The mapping might not create a direct relationship between stages and processes for these reasons:
• Some stages compile into many processes.
• Some stages are combined into a single process.

You can use the check boxes in the Filters area of the Performance Analysis window to include data about hidden operators in the performance data file. For example, Modify stages are combined with the previous stage in a job design. If you want to see the percentage of elapsed time that is used by a modify operator, clear the Hide Inserted Operators check box. Similarly, you can clear the Hide Composite Operators check box to expose performance data about composite operators.

You can delete performance data files by clicking Delete. All of the data files that belong to the selected job run, including the performance data, utilization data, computer description data, and job export data, are deleted from the server.

OS/RDBMS specific tools


Each OS and RDBMS has its own set of tools which might be useful in performance monitoring. Talking to the sysadmin or DBA might provide some useful monitoring strategies.

Performance analysis
Once you have carried out some performance monitoring, you can analyze your results. Bear in mind that, in a parallel job flow, certain operators might complete before the entire flow has finished, but the job isn't successful until the slowest operator has finished all its processing.

Selectively rewriting the flow


One of the most useful mechanisms in detecting the cause of bottlenecks in your flow is to rewrite portions of it to exclude stages from the set of possible causes. The goal of modifying the flow is to see the new, modified flow run noticeably faster than the original flow. If the flow is running at roughly an identical speed, change the flow further.

While editing a flow for testing, it is important to keep in mind that removing one stage might have unexpected effects in the flow. Comparing the score dump between runs is useful before concluding what has made the performance difference.

When modifying the flow, be careful not to introduce any new performance problems. For example, adding a Data Set stage to a flow might introduce disk contention with any other data sets being read. This is rarely a problem, but might be significant in some cases.

Moving data into and out of parallel operation are two very obvious areas of concern. Changing a job to write into a Copy stage (with no outputs) will throw the data away. Keep the degree of parallelism the same, with a nodemap if necessary. Similarly, landing any read data to a data set can be helpful if the data's point of origin is a flat file or RDBMS.

This pattern should be followed, removing any potentially suspicious operators while trying to keep the rest of the flow intact. Removing any custom stages should be at the top of the list.

Identifying superfluous repartitions


Superfluous repartitioning should be identified. Due to operator or license limitations (import, export, RDBMS ops, SAS, and so on) some stages will run with a degree of parallelism that is different than the default degree of parallelism. Some of these cannot be eliminated, but understanding where, when, and why these repartitions occur is important for flow analysis. Repartitions are especially expensive when the data is being repartitioned on an MPP system, where significant network traffic will result.

Sometimes you might be able to move a repartition upstream in order to eliminate a previous, implicit repartition. Imagine an Oracle stage performing a read (using the oraread operator). Some processing is done on the data and it is then hashed and joined with another data set. There might be a repartition after the oraread operator, and then the hash, when only one repartition is really necessary.

Similarly, specifying a nodemap for an operator might prove useful to eliminate repartitions. In this case, a Transformer stage sandwiched between a DB2 stage reading (db2read) and another one writing (db2write) might benefit from a nodemap placed on it to force it to run with the same degree of parallelism as the two DB2 operators, which avoids two repartitions.

Identifying buffering issues


Buffering is one of the more complex aspects of parallel job performance tuning. Buffering is described in detail in the Buffering topic. The goal of buffering on a specific link is to make the producing operator's output rate match the consumption rate of the downstream operator. In any flow where this is incorrect behavior for the flow (for example, the downstream operator has two inputs, and waits until it has exhausted one of those inputs before reading from the next), performance is degraded. Identifying these spots in the flow requires an understanding of how each operator involved reads its records, and is often only found by empirical observation.

You can diagnose a buffering tuning issue when a flow runs slowly when it is one massive flow, but each component runs quickly when broken up. For example, replacing an Oracle write stage with a Copy stage vastly improves performance, and writing that same data to a data set, then loading via an Oracle stage, also goes quickly. When the two are put together, performance is poor.

The Buffering topic details specific, common buffering configurations aimed at resolving various bottlenecks.

Resource estimation
New in this release, you can estimate and predict the resource utilization of parallel job runs by creating models and making projections in the Resource Estimation window.

A model estimates the system resources for a job, including the amount of scratch space, disk space, and CPU time that is needed for each stage to run on each partition. A model also estimates the data set throughput in a job. You can generate these types of models:
• Static models estimate disk space and scratch space only. These models are based on a data sample that is automatically generated from the record schema. Use static models at compilation time.
• Dynamic models predict disk space, scratch space, and CPU time. These models are based on a sampling of actual input data. Use dynamic models at run time.

An input projection estimates the size of all of the data sources in a job. You can project the size in megabytes or in number of records. A default projection is created when you generate a model.

The resource utilization results from a completed job run are treated as an actual model. A job can have only one actual model. In the Resource Estimation window, the actual model is the first model in the Models list. Similarly, the total size of the data sources in a completed job run is treated as an actual projection. You must select the actual projection in the Input Projections list to view the resource utilization statistics in the actual model. You can compare the actual model to your generated models to calibrate your modeling techniques.

Creating a model
You can create a static or dynamic model to estimate the resource utilization of a parallel job run. You can create models in the Designer client or the Director client. You must compile a job before you create a model. To create a model:
1. Open a job in the Designer client, or select a job in the Director client.
2. Open the Resource Estimation window by using one of the following methods:
   • In the Designer, click File > Estimate Resource.
   • In the Director, click Job > Estimate Resource.
   • Click the Resource Estimation toolbar button.
   The first time that you open the Resource Estimation window for a job, a static model is generated by default.
3. Click the Model toolbar button to display the Create Resource Model options.
4. Type a name in the Model Name field. The specified name must not already exist.
5. Select a type in the Model Type field.
6. If you want to specify a data sampling range for a dynamic model, use one of the following methods:
   • Click the Copy Previous button to copy the sampling specifications from previous models, if any exist.
   • Clear the Auto check box for a data source, and type values in the From and To fields to specify a record range.
7. Click Generate.

After the model is created, the Resource Estimation window displays an overview of the model that includes the model type, the number of data segments in the model, the input data size, and the data sampling description for each input data source. Use the controls in the left pane of the Resource Estimation window to view statistics about partition utilization, data set throughput, and operator utilization in the model. You can also compare the model to other models that you generate.

Static and dynamic models


Static models estimate resource utilization at compilation time. Dynamic models predict job performance at run time. The following table describes the differences between static and dynamic models. Use this table to help you decide what type of model to generate.
Table 3. Static and dynamic models

Job run
   Static models: Not required.
   Dynamic models: Required.

Sample data
   Static models: Requires automatic data sampling. Uses the actual size of the input data if the size can be determined. Otherwise, the sample size is set to a default value of 1000 records on each output link from each source stage.
   Dynamic models: Accepts automatic data sampling or a data range:
      • Automatic data sampling determines the sample size dynamically according to the stage type: For a database source stage, the sample size is set to 1000 records on each output link from the stage. For all other source stage types, the sample size is set to the minimum number of input records among all sources on all partitions.
      • A data range specifies the number of records to include in the sample for each data source.
      If the size of the sample data exceeds the actual size of the input data, the model uses the entire input data set.

Scratch space
   Static models: Estimates are based on a worst-case scenario.
   Dynamic models: Estimates are based on linear regression.

Disk space
   Static models: Estimates are based on a worst-case scenario.
   Dynamic models: Estimates are based on linear regression.

CPU utilization
   Static models: Not estimated.
   Dynamic models: Estimates are based on linear regression.

Number of records
   Static models: Estimates are based on a best-case scenario. No record is dropped. Input data is propagated from the source stages to all other stages in the job.
   Dynamic models: Dynamically determined. Best-case scenario does not apply. Input data is processed, not propagated. Records can be dropped. Estimates are based on linear regression.

Record size
   Static models: Solely determined by the record schema. Estimates are based on a worst-case scenario.
   Dynamic models: Dynamically determined by the actual record at run time. Estimates are based on linear regression.

Data partitioning
   Static models: Data is assumed to be evenly distributed among all partitions.
   Dynamic models: Dynamically determined. Estimates are based on linear regression.

When a model is based on a worst-case scenario, the model uses maximum values. For example, if a variable can hold up to 100 characters, the model assumes that the variable always holds 100 characters. When a model is based on a best-case scenario, the model assumes that no single input record is dropped anywhere in the data flow.

The accuracy of a model depends on these factors:

Schema definition
   The size of records with variable-length fields cannot be determined until the records are processed. Use fixed-length or bounded-length schemas as much as possible to improve accuracy.

Input data
   When the input data contains more records with one type of key field than another, the records might be unevenly distributed across partitions. Specify a data sampling range that is representative of the input data.

Parallel processing environment
   The availability of system resources when you run a job can affect the degree to which buffering occurs. Generate models in an environment that is similar to your production environment in terms of operating system, processor type, and number of processors.
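The worst-case assumption can be illustrated with a short calculation. The following Python sketch is an illustration only: the schema, the field sizes, and the function names are hypothetical, and the calculation ignores any per-record overhead. It is not the algorithm that the Resource Estimation window uses.

```python
def worst_case_record_bytes(max_field_bytes):
    """Worst case: every field is assumed to hold its maximum length."""
    return sum(max_field_bytes.values())


def worst_case_disk_mb(max_field_bytes, record_count):
    """Worst-case data size in megabytes for the given number of records."""
    total_bytes = worst_case_record_bytes(max_field_bytes) * record_count
    return total_bytes / (1024 * 1024)


# Hypothetical bounded-length schema: field name -> maximum size in bytes.
schema = {"name": 100, "city": 50, "id": 4}

print(worst_case_record_bytes(schema), "bytes per record")
print(f"{worst_case_disk_mb(schema, 5_000_000):.2f} MB for 5 million records")
```

If the name field usually holds far fewer than 100 characters, the actual size is much smaller than this figure, which is why tighter field bounds improve the accuracy of an estimate.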

Custom stages and dynamic models


To be estimated in dynamic models, Custom stages must support the end-of-wave functionality in the parallel engine. If a Custom stage serves as an import operator or needs scratch space or disk space, the stage must declare its type by calling the function APT_Operator::setExternalDataDirection() when the stage overrides the APT_Operator::describeOperator() function. Define the external data direction by using the following enumerated type:
    enum externalDataDirection{
        eNone,
        /** Data "source" operator - an import or database read. */
        eDataSource,
        /** Data "sink" operator - an export or database write. */
        eDataSink,
        /** Data "scratch" operator - a tsort or buffer operator. */
        eDataScratch,
        /** Data "disk" operator - an export or dataset/fileset. */
        eDataDisk
    };

Custom stages that need disk space and scratch space must call two additional functions within the dynamic scope of APT_Operator::runLocally():
• For disk space, call APT_Operator::setDiskSpace() to describe actual disk space usage.
• For scratch space, call APT_Operator::setScratchSpace() to describe actual scratch space usage.
Both functions accept values of APT_Int64.

Making a projection
You can make a projection to predict the resource utilization of a job by specifying the size of the data sources.
You must generate at least one model before you make a projection. Projections are applied to all existing models, except the actual model. To make a projection:
1. Open a job in the Designer client, or select a job in the Director client.
2. Open the Resource Estimation window by using one of the following methods:
   • In the Designer, click File > Estimate Resource.
   • In the Director, click Job > Estimate Resource.
   • Click the Resource Estimation toolbar button.
3. Click the Projection toolbar button to display the Make Resource Projection options.
4. Type a name in the Projection Name field. The specified name must not already exist.
5. Select the unit of measurement for the projection in the Input Units field.
6. Specify the input size upon which to base the projection by using one of the following methods:
   • Click the Copy Previous button to copy the specifications from previous projections, if any exist.
   • If the Input Units field is set to Size in Megabytes, type a value in the Megabytes (MB) field for each data source.
   • If the Input Units field is set to Number of Records, type a value in the Records field for each data source.
7. Click Generate.
The projection applies the input data information to the existing models, excluding the actual model, to predict the resource utilization for the given input data.
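This guide states that dynamic model estimates are based on linear regression, but it does not describe the calculation itself. The following Python sketch shows the general idea of fitting a line to sampled measurements and extrapolating to a larger input. It is an illustration only: the function names and the sample figures are hypothetical, and the sketch does not reproduce the calculation that the product performs.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def project(xs, ys, target_x):
    """Predict y at target_x from the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return slope * target_x + intercept


# Hypothetical samples: records processed versus CPU seconds used by one stage.
records = [1000, 2000, 3000, 4000]
cpu_seconds = [0.5, 0.9, 1.3, 1.7]

print(f"projected CPU for 5,000,000 records: {project(records, cpu_seconds, 5_000_000):.1f} s")
```

A projection of this kind is only as good as the sample: if the sampled range is not representative of the full input, the fitted line extrapolates poorly.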

Generating a resource estimation report


You can generate a report to see a resource estimation summary for a selected model. Reports contain an overview of the job, the model, and the input projection. The reports also give statistics about the partition utilization and data set throughput for each data source. To generate a report:
1. In the Resource Estimation window, select a model in the Models list.
2. Select a projection in the Input Projections list. If you do not select a projection, the default projection is used.
3. Click the Report toolbar button.
By default, reports are saved in the following directory:
C:\IBM\InformationServer\Clients\Classic\Estimation\server_name\project_name\job_name\html\report.html
You can print the report or rename it by using the controls in your Web browser.

Examples of resource estimation


You can use models and projections during development to optimize job design and to configure your environment for more efficient processing. The following examples show how to use resource estimation techniques for a job that performs these tasks:
• Consolidates input data from three distributed data sources
• Merges the consolidated data with records from a fourth data source
• Updates records in the fourth data source with the current date
• Saves the merged records to two different data sets based on the value of a specific field

In this example, each data source has 5 million records. You can use resource estimation models and projections to answer questions such as these:
• Which stage merges data most efficiently?
• When should data be sorted?
• Are there any performance bottlenecks?
• What are the disk and scratch space requirements if the size of the input data increases?

Example - Find the best stage to merge data


In this example, you create models to determine whether a Lookup, Merge, or Join stage is the most efficient stage to merge data. The example job consolidates input data from three data sources and merges this data with records from a fourth data source.

You can use Lookup, Merge, or Join stages to merge data. Lookup stages do not require the input data to be sorted, but the stage needs more memory to create a dynamic lookup table. Merge and Join stages require that the input data is sorted, but use less memory than a Lookup stage. To find out which stage is most efficient, design three jobs:
• Job 1 uses a Lookup stage to merge data. One input link to the Lookup stage carries the consolidated data from the three data sources. The other input link carries the data from the fourth data source and includes an intermediate Transformer stage to insert the current date.
• Job 2 uses a Merge stage to merge data. One input link to the Merge stage carries the consolidated data from the three data sources. The other input link carries the data from the fourth data source. This job includes an intermediate Transformer stage to insert the current date and a Sort stage to sort the data.
• Job 3 uses a Join stage to merge data by using a left outer join. One input link to the Join stage carries the consolidated data from the three data sources. The other input link carries the data from the fourth data source. This job includes an intermediate Transformer stage to insert the current date and a Sort stage to sort the data.

The next step is to generate an automatic dynamic model for each job. The models are based on a single-node configuration on Windows XP, with a 1.8 GHz processor and 2 GB of RAM. The following table summarizes the resource utilization statistics for each job:
Table 4. Resource utilization statistics

    Job                     CPU (seconds)   Disk (MB)   Scratch (MB)
    Job 1 (Lookup stage)    229.958         801.125     0
    Job 2 (Merge stage)     219.084         801.125     915.527
    Job 3 (Join stage)      209.25          801.125     915.527

By comparing the models, you see that Job 1 does not require any scratch space, but is the slowest of the three jobs. The Lookup stage also requires memory to build a lookup table for a large amount of reference data. Therefore, the optimal job design uses either a Merge stage or a Join stage to merge data.

Example - Decide when to sort data


In this example, you create models to decide whether to sort data before or after the data is consolidated from three data sources.

The previous example demonstrated that a Merge or Join stage is most efficient to merge data in the example job. These stage types require that the input data is sorted. Now you need to decide whether to sort the input data from your three data sources before or after you consolidate the data. To understand the best approach, design two jobs:
• Job 4 sorts the data first:
   1. Each data source is linked to a separate Sort stage.
   2. The sorted data is sent to a single Funnel stage for consolidation.
   3. The Funnel stage sends the data to the Merge or Join stage, where it is merged with the data from the fourth data source.
• Job 5 consolidates the data first:
   1. The three source stages are linked to a single Funnel stage that consolidates the data.
   2. The consolidated data is sent to a single Sort stage for sorting.
   3. The Sort stage sends the data to the Merge or Join stage, where it is merged with the data from the fourth data source.

Use the same processing configuration as in the first example to generate an automatic dynamic model for each job. The resource utilization statistics for each job are shown in the table:
Table 5. Resource utilization statistics
Job                           CPU (seconds)   Disk (MB)   Scratch (MB)
Job 4 (Sort before Funnel)    74.6812         515.125     801.086
Job 5 (Sort after Funnel)     64.1079         515.125     743.866

You can see that sorting data after consolidation is a better design because Job 5 uses approximately 14% less CPU time and 7% less scratch space than Job 4.
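The comparison between the two jobs can be checked with a few lines of arithmetic. The following Python sketch is illustrative only and is not part of WebSphere DataStage; the function name percent_saving is hypothetical, the figures are copied from Table 5, and savings are expressed relative to Job 4.

```python
def percent_saving(baseline, candidate):
    """Return how much smaller candidate is than baseline, as a percentage of baseline."""
    return (baseline - candidate) / baseline * 100.0


# Figures copied from Table 5 (Job 4 sorts before the Funnel stage, Job 5 after it).
job4 = {"cpu_s": 74.6812, "scratch_mb": 801.086}
job5 = {"cpu_s": 64.1079, "scratch_mb": 743.866}

cpu_saving = percent_saving(job4["cpu_s"], job5["cpu_s"])
scratch_saving = percent_saving(job4["scratch_mb"], job5["scratch_mb"])

print(f"CPU saving: {cpu_saving:.1f}%")
print(f"Scratch saving: {scratch_saving:.1f}%")
```

Disk usage is identical for both jobs in Table 5, so it is left out of the comparison.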

Example - Find bottlenecks


In this example, you use the partition utilization statistics in a model to identify any performance bottlenecks in a job. Then, you apply a job parameter to remove the bottleneck.
Models in the previous examples describe how to optimize the design of the example job. The best performance is achieved when you:
1. Consolidate the data from your three sources.
2. Sort the data.
3. Use a Merge or a Join stage to merge the data with records from a fourth data source.
An intermediate Transformer stage adds the current date to the records in the fourth data source before the data is merged. The Transformer stage appends the current date to each input record by calling the function DateToString(CurrentDate()) and assigning the returned value to a new output field. When you study the partition utilization statistics in the models, you notice a performance bottleneck in the Transformer stage:
• In Job 2, the Transformer stage uses 139.145 seconds out of the 219.084 seconds of total CPU time for the job.
• In Job 3, the Transformer stage uses 124.355 seconds out of the 209.25 seconds of total CPU time for the job.
A more efficient approach is to assign a job parameter to the new output field. After you modify the Transformer stage in each job, generate automatic dynamic models to compare the performance:


Table 6. Resource utilization statistics
Job                                                          CPU (seconds)   Disk (MB)   Scratch (MB)
Job 6 (Merge stage with job parameter in Transformer stage)  109.065         801.125     915.527
Job 7 (Join stage with job parameter in Transformer stage)   106.5           801.125     915.527

Job performance is significantly improved after you remove the bottleneck in the Transformer stage. Total CPU time for Jobs 6 and 7 is about half of the total CPU time for Jobs 2 and 3. CPU time for the Transformer stage is a small portion of total CPU time:
• In Job 6, the Transformer stage uses 13.8987 seconds out of the 109.065 seconds of total CPU time for the job.
• In Job 7, the Transformer stage uses 13.1489 seconds out of the 106.5 seconds of total CPU time for the job.
These models also show that job performance improves by approximately 2.4% when you merge data by using a Join stage rather than a Merge stage.
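One way to see the bottleneck and its removal is to express the Transformer stage CPU time as a share of each job's total CPU time. The following Python sketch is illustrative only; the function name stage_share is hypothetical, and the figures are the ones quoted in this example for Jobs 2, 3, 6, and 7.

```python
def stage_share(stage_cpu_s, total_cpu_s):
    """Return a stage's CPU time as a percentage of the job's total CPU time."""
    return stage_cpu_s / total_cpu_s * 100.0


# (Transformer stage CPU seconds, total job CPU seconds) as quoted in this example.
jobs = {
    "Job 2": (139.145, 219.084),
    "Job 3": (124.355, 209.25),
    "Job 6": (13.8987, 109.065),
    "Job 7": (13.1489, 106.5),
}

shares = {name: stage_share(stage, total) for name, (stage, total) in jobs.items()}

for name, share in shares.items():
    print(f"{name}: Transformer stage uses {share:.1f}% of total CPU time")
```

A stage that accounts for well over half of a job's CPU time, as in Jobs 2 and 3, is a natural first candidate when looking for a bottleneck.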

Example - Project resource requirements


In this example, you make projections to find out how much disk space and scratch space are needed when the input data size increases.
Each data source in the example job has 5 million records. According to your previous models, Job 7 requires approximately 800 MB of disk space and 915 MB of scratch space. Suppose the size of each data source increases as follows:
• 18 million records for data source 1
• 20 million records for data source 2
• 22 million records for data source 3
• 60 million records for data source 4
Make a projection by specifying the increased number of records for each data source. When the projection is applied to the model for Job 7, the estimation shows that approximately 3204 MB of disk space and 5035 MB of scratch space are needed. By estimating the disk allocation, the projection helps you prevent a job from stopping prematurely due to a lack of disk space.

Resolving bottlenecks
Choosing the most efficient operators


Because WebSphere DataStage offers a wide range of different stage types, with different operators underlying them, there can be several different ways of achieving the same effects within a job. This section contains some hints about preferred practice when designing for performance. When analyzing your flow, try substituting the preferred operators in the circumstances described.

Modify and transform


Modify, due to internal implementation details, is a particularly efficient operator. Any transformation which can be implemented in the Modify stage will be more efficient than implementing the same operation in a Transformer stage. Transformations that touch a single column (for example, keep/drop, type conversions, some string manipulations, null handling) should be implemented in a Modify stage rather than a Transformer.


Lookup and join


Lookup and join perform equivalent operations: combining two or more input data sets based on one or more specified keys. Lookup requires all but one (the first or primary) input to fit into physical memory. Join requires all inputs to be sorted. When one unsorted input is very large or sorting isn't feasible, lookup is the preferred solution. When all inputs are of manageable size or are pre-sorted, join is the preferred solution.
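The trade-off can be illustrated with a conceptual sketch. The Python below is not how the parallel engine implements the lookup or join operators; it only contrasts an in-memory table built from the reference input with a single pass over two pre-sorted inputs. The function names and sample rows are hypothetical, and the merge version assumes unique keys on the right-hand input.

```python
def hash_lookup(primary, reference, key):
    """Lookup style: hold the whole reference input in memory, then stream the primary input."""
    table = {row[key]: row for row in reference}
    return [{**row, **table[row[key]]} for row in primary if row[key] in table]


def sorted_merge_join(left, right, key):
    """Join style: inner join over inputs that are already sorted on key."""
    out, j = [], 0
    for row in left:
        # Advance the right-hand cursor until it reaches or passes the current key.
        while j < len(right) and right[j][key] < row[key]:
            j += 1
        if j < len(right) and right[j][key] == row[key]:
            out.append({**row, **right[j]})
    return out


orders = [{"id": 3, "qty": 5}, {"id": 1, "qty": 2}, {"id": 2, "qty": 7}]
names = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 4, "name": "d"}]

by_lookup = hash_lookup(orders, names, "id")  # orders need not be sorted
by_join = sorted_merge_join(
    sorted(orders, key=lambda r: r["id"]),
    sorted(names, key=lambda r: r["id"]),
    "id",
)
print(by_lookup)
print(by_join)
```

The lookup version never sorts its primary input but needs memory proportional to the reference input; the merge version keeps only a cursor into each input but depends on both being sorted.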

Partitioner insertion, sort insertion


Partitioner insertion and sort insertion each make writing a flow easier by alleviating the need for a user to think about either partitioning or sorting data. By examining the requirements of operators in the flow, the parallel engine can insert partitioners, collectors and sorts as necessary within a dataflow.
However, there are some situations where these features can be a hindrance. If data is pre-partitioned and pre-sorted, and the WebSphere DataStage job is unaware of this, you could disable automatic partitioning and sorting for the whole job by setting the following environment variables while the job runs:
• APT_NO_PART_INSERTION
• APT_NO_SORT_INSERTION
You can also disable partitioning on a per-link basis within your job design by explicitly setting a partitioning method of Same on the Input page Partitioning tab of the stage the link is input to. To disable sorting on a per-link basis, insert a Sort stage on the link, and set the Sort Key Mode option to Don't Sort (Previously Sorted).
We advise that average users leave both partitioner insertion and sort insertion alone, and that power users perform careful analysis before changing these options.

Combinable Operators
Combined operators generally improve performance at least slightly (in some cases the difference is dramatic). There might also be situations where combining operators actually hurts performance, however. Identifying such operators can be difficult without trial and error. The most common situation arises when multiple operators are performing disk I/O (for example, the various file stages and sort). In these sorts of situations, turning off combination for those specific stages might result in a performance increase if the flow is I/O bound. Combinable operators often provide a dramatic performance increase when a large number of variable length fields are used in a flow.

Disk I/O
Total disk throughput is often a fixed quantity that WebSphere DataStage has no control over. It can, however, be beneficial to follow some rules.
• If data is going to be read back in parallel, it should never be written as a sequential file. A data set or file set stage is a much more appropriate format.
• When importing fixed-length data, the Number of Readers per Node property on the Sequential File stage can often provide a noticeable performance boost as compared with a single process reading the data.


• Some disk arrays have read ahead caches that are only effective when data is read repeatedly in like-sized chunks. Setting the environment variable APT_CONSISTENT_BUFFERIO_SIZE=N will force stages to read data in chunks which are size N or a multiple of N.
• Memory mapped I/O, in many cases, contributes to improved performance. In certain situations, however, such as a remote disk mounted via NFS, memory mapped I/O might cause significant performance problems. Setting the environment variables APT_IO_NOMAP and APT_BUFFERIO_NOMAP true will turn off this feature and sometimes affect performance. (AIX and HP-UX default to NOMAP. Setting APT_IO_MAP and APT_BUFFERIO_MAP true can be used to turn memory mapped I/O on for these platforms.)

Ensuring data is evenly partitioned


Because of the nature of parallel jobs, the entire flow runs only as fast as its slowest component. If data is not evenly partitioned, the slowest component is often slow due to data skew. If one partition has ten records, and another has ten million, then a parallel job cannot make ideal use of the resources. Setting the environment variable APT_RECORD_COUNTS displays the number of records per partition for each component. Ideally, counts across all partitions should be roughly equal. Differences in data volumes between keys often skew data slightly, but any significant (for example, more than 5-10%) differences in volume should be a warning sign that alternate keys, or an alternate partitioning strategy, might be required.
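Once the per-partition record counts have been collected, a simple check can flag skew. The Python sketch below is illustrative only: it assumes the counts have already been extracted from the job log (parsing the APT_RECORD_COUNTS output is not shown), the function names are hypothetical, and the 10% threshold is an assumption taken from the upper end of the 5-10% guideline above.

```python
def partition_skew(record_counts):
    """Return the largest deviation from the mean partition size, as a fraction of the mean."""
    mean = sum(record_counts) / len(record_counts)
    if mean == 0:
        return 0.0
    return max(abs(count - mean) for count in record_counts) / mean


def is_skewed(record_counts, threshold=0.10):
    """Return True if any partition deviates from the mean by more than threshold."""
    return partition_skew(record_counts) > threshold


# Hypothetical record counts for a four-partition component.
balanced = [1000, 1020, 990, 990]
skewed = [10, 10_000_000]

print(is_skewed(balanced))
print(is_skewed(skewed))
```

Measuring deviation from the mean is only one possible definition of skew; comparing the largest partition with the smallest would serve the same purpose.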

Buffering
Buffering is intended to slow down input to match the consumption rate of the output. When the downstream operator reads very slowly, or not at all, for a length of time, upstream operators begin to slow down. This can cause a noticeable performance loss if the buffer's optimal behavior is something other than rate matching.
By default, each link has a 3 MB in-memory buffer. Once that buffer reaches half full, the operator begins to push back on the upstream operator's rate. Once the 3 MB buffer is filled, data is written to disk in 1 MB chunks.
In most cases, the easiest way to tune buffering is to eliminate the pushback and allow the buffer to write data to disk as necessary. Setting APT_BUFFER_FREE_RUN=N, or setting Buffer Free Run in the Output page Advanced tab on a particular stage, will do this. A buffer will read N * max_memory (3 MB by default) bytes before beginning to push back on the upstream operator. If there is enough disk space to buffer large amounts of data, this will usually fix any egregious slowdown issues caused by the buffer operator.
If there is a significant amount of memory available on the machine, increasing the maximum in-memory buffer size is likely to be very useful if buffering is causing any disk I/O. Setting the APT_BUFFER_MAXIMUM_MEMORY environment variable, or Maximum memory buffer size on the Output page Advanced tab on a particular stage, will do this. It defaults to 3145728 (3 MB).
For systems where small to medium bursts of I/O are not desirable, the 1 MB disk write chunk size might be too small. The environment variable APT_BUFFER_DISK_WRITE_INCREMENT, or Disk write increment on the Output page Advanced tab on a particular stage, controls this and defaults to 1048576 (1 MB). This setting cannot exceed max_memory * 2/3.
Finally, in a situation where a large, fixed buffer is needed within the flow, Queue upper bound on the Output page Advanced tab can be set equal to max_memory to force a buffer of exactly max_memory bytes. Such a buffer will block an upstream operator (until data is read by the downstream operator) once its buffer has been filled, so this setting should be used with extreme caution. This setting is rarely, if ever, necessary to achieve good performance, but might be useful in an attempt to squeeze every last byte of performance out of the system where it is desirable to eliminate buffering to disk entirely. No environment variable is available for this setting, so it can only be set at the individual stage level.
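The relationships between these settings can be worked through numerically. The following Python sketch is illustrative only; the function names are hypothetical, and the constants are the default values quoted in this section.

```python
DEFAULT_MAX_MEMORY = 3145728            # APT_BUFFER_MAXIMUM_MEMORY default (3 MB)
DEFAULT_DISK_WRITE_INCREMENT = 1048576  # APT_BUFFER_DISK_WRITE_INCREMENT default (1 MB)
DEFAULT_FREE_RUN = 0.5                  # APT_BUFFER_FREE_RUN default


def pushback_threshold(free_run=DEFAULT_FREE_RUN, max_memory=DEFAULT_MAX_MEMORY):
    """Bytes a buffer accepts before it begins to push back on the upstream operator."""
    return int(free_run * max_memory)


def max_disk_write_increment(max_memory=DEFAULT_MAX_MEMORY):
    """Largest permitted disk write increment: two thirds of max_memory."""
    return max_memory * 2 // 3


print(pushback_threshold())          # default settings
print(pushback_threshold(1000))      # a very large free run value
print(max_disk_write_increment())
print(DEFAULT_DISK_WRITE_INCREMENT <= max_disk_write_increment())
```

With the defaults, pushback starts at half of the 3 MB buffer. Raising the free run value moves that point to a multiple of max_memory, which is why a large value effectively removes pushback as long as disk space is available.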

Platform specific tuning
HP-UX


HP-UX has a limitation when running in 32-bit mode, which limits memory mapped I/O to 2 GB per machine. This can be an issue when dealing with large lookups. The Memory Windows options can provide a workaround for this memory limitation. Product Support can provide documentation on the Memory Windows options on request.

AIX
If you are running WebSphere DataStage Enterprise Edition on an RS/6000 SP or a network of workstations, verify your setting of the network parameter thewall.

Disk space requirements of post-release 7.0.1 data sets


Some parallel data sets generated with WebSphere DataStage 7.0.1 and later releases require more disk space than in release 7.0 when the columns are of type VarChar. This is due to changes added for performance improvements for bounded length VarChars in 7.0.1. The preferred solution is to use unbounded length VarChars (don't set any length) for columns where the maximum length is rarely used. Alternatively, you can set the environment variable APT_OLD_BOUNDED_LENGTH, but this is not recommended, as it leads to performance degradation.


Chapter 4. Link buffering


These topics contain an in-depth description of when and how WebSphere DataStage buffers data within a job, and how you can change the automatic settings if required.
WebSphere DataStage automatically performs buffering on the links of certain stages. This is primarily intended to prevent deadlock situations arising (where one stage is unable to read its input because a previous stage in the job is blocked from writing to its output).
Deadlock situations can occur where you have a fork-join in your job. This is where a stage has two output links whose data paths are joined together later in the job. The situation can arise where all the stages in the flow are waiting for each other to read or write, so none of them can proceed. No error or warning message is output for deadlock; your job will be in a state where it will wait forever for an input.
WebSphere DataStage automatically inserts buffering into job flows containing fork-joins where deadlock situations might arise. In most circumstances you should not need to alter the default buffering implemented by WebSphere DataStage. However, you might want to insert buffers in other places in your flow (to smooth and improve performance) or you might want to explicitly control the buffers inserted to avoid deadlocks. WebSphere DataStage allows you to do this, but use caution when altering the default buffer settings.
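The fork-join deadlock described above can be reproduced with a toy model. The Python sketch below is a deliberately simplified, single-threaded simulation and does not represent the parallel engine: a hypothetical producer writes every record to two links, and a hypothetical consumer must read all of link A before it reads anything from link B. The function name and the capacity parameter are assumptions made for illustration.

```python
from collections import deque


def run_fork_join(n_records, link_b_capacity):
    """Simulate a fork-join; return True if the flow completes, False if it deadlocks.

    link_b_capacity is the number of records link B can hold (None means unbounded).
    """
    link_a, link_b = deque(), deque()
    produced = consumed_a = consumed_b = 0
    while consumed_b < n_records:
        progressed = False
        # Producer: writes each record to both links, but blocks when link B is full.
        b_has_room = link_b_capacity is None or len(link_b) < link_b_capacity
        if produced < n_records and b_has_room:
            link_a.append(produced)
            link_b.append(produced)
            produced += 1
            progressed = True
        # Consumer: must see every record on link A before reading link B.
        if consumed_a < n_records:
            if link_a:
                link_a.popleft()
                consumed_a += 1
                progressed = True
        elif link_b:
            link_b.popleft()
            consumed_b += 1
            progressed = True
        if not progressed:
            return False  # every party is waiting on another: deadlock
    return True


print(run_fork_join(10, link_b_capacity=3))     # link B fills up before link A is finished
print(run_fork_join(10, link_b_capacity=None))  # link B can absorb the backlog
```

In the first call the producer stalls on the full link B while the consumer is still waiting for more data on link A, so neither can proceed. A buffer large enough to hold the backlog on link B lets the flow complete, which is the role of the automatically inserted buffering.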

Buffering assumptions
This section describes buffering in more detail, and in particular the design assumptions underlying its default behavior. Buffering in WebSphere DataStage is designed around the following assumptions:
• Buffering is primarily intended to remove the potential for deadlock in flows with fork-join structure.
• Throughput is preferable to overhead. The goal of the WebSphere DataStage buffering mechanism is to keep the flow moving with as little memory and disk usage as possible. Ideally, data should simply stream through the data flow and rarely land to disk. Upstream operators should tend to wait for downstream operators to consume their input before producing new data records.
• Stages in general are designed so that on each link between stages data is being read and written whenever possible. While buffering is designed to tolerate occasional backlog on specific links due to one operator getting ahead of another, it is assumed that operators are at least occasionally attempting to read and write data on each link.
Buffering is implemented by the automatic insertion of a hidden buffer operator on links between stages. The buffer operator attempts to match the rates of its input and output. When no data is being read from the buffer operator by the downstream stage, the buffer operator tries to throttle back incoming data from the upstream stage to avoid letting the buffer grow so large that it must be written out to disk.
The goal is to avoid situations where data has to be moved to and from disk needlessly, especially in situations where the consumer cannot process data at the same rate as the producer (for example, due to a more complex calculation).
Because the buffer operator wants to keep the flow moving with low overhead, it is assumed in general that it is better to cause the producing stage to wait before writing new records, rather than allow the buffer operator to consume resources.


Controlling buffering
WebSphere DataStage offers two ways of controlling the operation of buffering: you can use environment variables to control buffering on all links of all stages in all jobs, or you can make individual settings on the links of particular stages via the stage editors.

Buffering policy
You can set this via the APT_BUFFERING_POLICY environment variable, or via the Buffering mode field on the Inputs or Outputs page Advanced tab for individual stage editors.
The environment variable has the following possible values:
• AUTOMATIC_BUFFERING. Buffer a data set only if necessary to prevent a dataflow deadlock. This setting is the default if you do not define the environment variable.
• FORCE_BUFFERING. Unconditionally buffer all links.
• NO_BUFFERING. Do not buffer links. This setting can cause deadlock if used inappropriately.
The possible settings for the Buffering mode field are:
• (Default). This will take whatever the default settings are as specified by the environment variables (this will be Auto buffer unless you have explicitly changed the value of the APT_BUFFERING_POLICY environment variable).
• Auto buffer. Buffer data only if necessary to prevent a dataflow deadlock situation.
• Buffer. This will unconditionally buffer all data output from/input to this stage.
• No buffer. Do not buffer data under any circumstances. This could potentially lead to deadlock situations if not used carefully.
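The way a stage-level Buffering mode interacts with the environment variable can be summarized in a short sketch. The Python below is illustrative only and is not DataStage code: the function name effective_buffering_mode is hypothetical, and the mapping simply pairs each APT_BUFFERING_POLICY value with the Buffering mode label that has the same description above.

```python
import os

# APT_BUFFERING_POLICY values and the Buffering mode labels with matching descriptions.
POLICY_TO_MODE = {
    "AUTOMATIC_BUFFERING": "Auto buffer",
    "FORCE_BUFFERING": "Buffer",
    "NO_BUFFERING": "No buffer",
}


def effective_buffering_mode(stage_setting="(Default)", env=None):
    """Resolve the mode for a link: an explicit stage setting wins, otherwise the environment applies."""
    env = os.environ if env is None else env
    if stage_setting != "(Default)":
        if stage_setting not in POLICY_TO_MODE.values():
            raise ValueError(f"unknown Buffering mode: {stage_setting!r}")
        return stage_setting
    # If the variable is not defined, AUTOMATIC_BUFFERING is the default.
    policy = env.get("APT_BUFFERING_POLICY", "AUTOMATIC_BUFFERING")
    if policy not in POLICY_TO_MODE:
        raise ValueError(f"unknown APT_BUFFERING_POLICY value: {policy!r}")
    return POLICY_TO_MODE[policy]


print(effective_buffering_mode(env={}))
print(effective_buffering_mode(env={"APT_BUFFERING_POLICY": "FORCE_BUFFERING"}))
print(effective_buffering_mode("No buffer", env={"APT_BUFFERING_POLICY": "FORCE_BUFFERING"}))
```

Passing the environment as a dictionary keeps the sketch testable without changing the real process environment.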

Overriding default buffering behavior


Since the default value of APT_BUFFERING_POLICY is AUTOMATIC_BUFFERING, the default action of WebSphere DataStage is to buffer a link only if required to avoid deadlock. You can, however, override the default buffering operation in your job.
For example, some operators read an entire input data set before outputting a single record. The Sort stage is an example of this. Before a sort operator can output a single record, it must read all input to determine the first output record. Therefore, these operators internally buffer the entire output data set, eliminating the need for the default buffering mechanism. For this reason, WebSphere DataStage never inserts a buffer on the output of a sort.
You might also develop a customized stage that does not require its output to be buffered, or you might want to change the size parameters of the WebSphere DataStage buffering mechanism. In this case, you can set the various buffering parameters. These can be set via environment variables or via the Advanced tab on the Inputs or Outputs page for individual stage editors. What you set in the Outputs page Advanced tab will automatically appear in the Inputs page Advanced tab of the stage at the other end of the link (and vice versa).
The available environment variables are as follows:
• APT_BUFFER_MAXIMUM_MEMORY. Specifies the maximum amount of virtual memory, in bytes, used per buffer. The default size is 3145728 (3 MB). If your step requires 10 buffers, each processing node would use a maximum of 30 MB of virtual memory for buffering. If WebSphere DataStage has to buffer more data than Maximum memory buffer size, the data is written to disk.
• APT_BUFFER_DISK_WRITE_INCREMENT. Sets the size, in bytes, of blocks of data being moved to/from disk by the buffering operator. The default is 1048576 (1 MB). Adjusting this value trades amount of disk access against throughput for small amounts of data. Increasing the block size reduces disk access, but might decrease performance when data is being read/written in smaller units. Decreasing the block size increases throughput, but might increase the amount of disk access.
• APT_BUFFER_FREE_RUN. Specifies how much of the available in-memory buffer to consume before the buffer offers resistance to any new data being written to it, expressed as a fraction of Maximum memory buffer size. When the amount of buffered data is less than the Buffer free run value, input data is accepted immediately by the buffer. After that point, the buffer does not immediately accept incoming data; it offers resistance to the incoming data by first trying to output data already in the buffer before accepting any new input. In this way, the buffering mechanism avoids buffering excessive amounts of data and can also avoid unnecessary disk I/O. The default value is 0.5 (50% of Maximum memory buffer size or by default 1.5 MB). You must set Buffer free run greater than 0.0. Typical values are between 0.0 and 1.0. You can set Buffer free run to a value greater than 1.0. In this case, the buffer continues to store data up to the indicated multiple of Maximum memory buffer size before writing data to disk.
The available settings in the Inputs or Outputs page Advanced tab of stage editors are:
• Maximum memory buffer size (bytes). Specifies the maximum amount of virtual memory, in bytes, used per buffer. The default size is 3145728 (3 MB).
• Buffer free run (percent). Specifies how much of the available in-memory buffer to consume before the buffer resists. This is expressed as a percentage of Maximum memory buffer size. When the amount of data in the buffer is less than this value, new data is accepted automatically. When the data exceeds it, the buffer first tries to write some of the data it contains before accepting more. The default value is 50% of the Maximum memory buffer size. You can set it to greater than 100%, in which case the buffer continues to store data up to the indicated multiple of Maximum memory buffer size before writing to disk.
• Queue upper bound size (bytes). Specifies the maximum amount of data buffered at any time using both memory and disk. The default value is zero, meaning that the buffer size is limited only by the available disk space as specified in the configuration file (resource scratchdisk). If you set Queue upper bound size (bytes) to a non-zero value, the amount of data stored in the buffer will not exceed this value (in bytes) plus one block (where the data stored in a block cannot exceed 32 KB). If you set Queue upper bound size to a value equal to or slightly less than Maximum memory buffer size, and set Buffer free run to 1.0, you will create a finite capacity buffer that will not write to disk. However, the size of the buffer is limited by the virtual memory of your system and you can create deadlock if the buffer becomes full. (Note that there is no environment variable for Queue upper bound size.)
• Disk write increment (bytes). Sets the size, in bytes, of blocks of data being moved to/from disk by the buffering operator. The default is 1048576 (1 MB). Adjusting this value trades amount of disk access against throughput for small amounts of data. Increasing the block size reduces disk access, but might decrease performance when data is being read/written in smaller units. Decreasing the block size increases throughput, but might increase the amount of disk access.
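Two of the statements above lend themselves to a quick check: the per-node memory bound (10 buffers at 3 MB each) and the combination of settings that yields a finite capacity buffer that does not write to disk. The Python sketch below is illustrative only; the function names are hypothetical, and the second function encodes only the condition stated in the Queue upper bound size description, with "equal to or slightly less than" simplified to "non-zero and no larger than".

```python
MAX_MEMORY_DEFAULT = 3145728  # Maximum memory buffer size default, in bytes (3 MB)


def buffer_memory_per_node(num_buffers, max_memory=MAX_MEMORY_DEFAULT):
    """Upper bound, in bytes, on virtual memory used for buffering on one processing node."""
    return num_buffers * max_memory


def is_memory_only_buffer(queue_upper_bound, free_run, max_memory=MAX_MEMORY_DEFAULT):
    """True if the settings describe a finite capacity buffer that will not write to disk.

    Simplified condition: a non-zero Queue upper bound size no larger than
    Maximum memory buffer size, combined with a Buffer free run of 1.0.
    """
    return 0 < queue_upper_bound <= max_memory and free_run == 1.0


print(buffer_memory_per_node(10) / (1024 * 1024))             # in MB
print(is_memory_only_buffer(MAX_MEMORY_DEFAULT, 1.0))
print(is_memory_only_buffer(0, 1.0))                          # zero means no upper bound
```

Because such a buffer blocks the upstream operator when it is full, the second check identifies a configuration that carries the deadlock risk mentioned above.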

Operators with special buffering requirements


If you have built a custom stage that is designed to not consume one of its inputs, for example to buffer all records before proceeding, the default behavior of the buffer operator can end up being a performance bottleneck, slowing down the job. This section describes how to fix this problem. Although the buffer operator is not designed for buffering an entire data set as output by a stage, it is capable of doing so assuming sufficient memory or disk space is available to buffer the data. To achieve this you need to adjust the settings described above appropriately, based on your job. You might be able to solve your problem by modifying one buffering property, the Buffer free run setting. This controls the amount of memory/disk space that the buffer operator is allowed to consume before it begins to push back on the upstream operator.


The default setting for Buffer free run is 0.5 for the environment variable, (50% for Buffer free run on the Advanced tab), which means that half of the internal memory buffer can be consumed before pushback occurs. This biases the buffer operator to avoid allowing buffered data to be written to disk. If your stage needs to buffer large data sets, we recommend that you initially set Buffer free run to a very large value such as 1000, and then adjust according to the needs of your application. This will allow the buffer operator to freely use both memory and disk space in order to accept incoming data without pushback. We recommend that you set the Buffer free run property only for those links between stages that require a non-default value; this means altering the setting on the Inputs page or Outputs page Advanced tab of the stage editors, not the environment variable.


Chapter 5. Specifying your own parallel stages


In addition to the wide range of parallel stage types available, WebSphere DataStage allows you to define your own stage types, which you can then use in parallel jobs. There are three different types of stage that you can define:
• Custom. This allows knowledgeable Orchestrate users to specify an Orchestrate operator as a WebSphere DataStage stage. This is then available to use in WebSphere DataStage Parallel jobs.
• Build. This allows you to design and build your own operator as a stage to be included in WebSphere DataStage Parallel Jobs.
• Wrapped. This allows you to specify a UNIX command to be executed by a WebSphere DataStage stage. You define a wrapper file that in turn defines arguments for the UNIX command and inputs and outputs.
WebSphere DataStage Designer provides an interface that allows you to define a new WebSphere DataStage Parallel job stage of any of these types. This interface is also available from the repository tree of the WebSphere DataStage Designer. This topic describes how to use this interface.

Defining custom stages


To define a custom stage type:
1. Do one of:
   a. Choose File > New from the Designer menu. The New dialog box appears.
   b. Open the Stage Type folder and select the Parallel Custom Stage Type icon.
   c. Click OK. The Stage Type dialog box appears, with the General page on top.
   Or:
   d. Select a folder in the repository tree.
   e. Choose New > Other > Parallel Stage > Custom from the shortcut menu. The Stage Type dialog box appears, with the General page on top.
2. Fill in the fields on the General page as follows:
   • Stage type name. This is the name that the stage will be known by to WebSphere DataStage. Avoid using the same name as existing stages.
   • Parallel Stage type. This indicates the type of new Parallel job stage you are defining (Custom, Build, or Wrapped). You cannot change this setting.
   • Execution Mode. Choose the execution mode. This is the mode that will appear in the Advanced tab on the stage editor. You can override this mode for individual instances of the stage as required, unless you select Parallel only or Sequential only. See WebSphere DataStage Parallel Job Developer Guide for a description of the execution mode.
   • Mapping. Choose whether the stage has a Mapping tab or not. A Mapping tab enables the user of the stage to specify how output columns are derived from the data produced by the stage. Choose None to specify that output mapping is not performed, or choose Default to accept the default setting that WebSphere DataStage uses.
   • Preserve Partitioning. Choose the default setting of the Preserve Partitioning flag. This is the setting that will appear in the Advanced tab on the stage editor. You can override this setting for individual instances of the stage as required. See WebSphere DataStage Parallel Job Developer Guide for a description of the preserve partitioning flag.


   • Partitioning. Choose the default partitioning method for the stage. This is the method that will appear in the Inputs page Partitioning tab of the stage editor. You can override this method for individual instances of the stage as required. See WebSphere DataStage Parallel Job Developer Guide for a description of the partitioning methods.
   • Collecting. Choose the default collection method for the stage. This is the method that will appear in the Inputs page Partitioning tab of the stage editor. You can override this method for individual instances of the stage as required. See WebSphere DataStage Parallel Job Developer Guide for a description of the collection methods.
   • Operator. Enter the name of the Orchestrate operator that you want the stage to invoke.
   • Short Description. Optionally enter a short description of the stage.
   • Long Description. Optionally enter a long description of the stage.
3. Go to the Links page and specify information about the links allowed to and from the stage you are defining. Use this to specify the minimum and maximum number of input and output links that your custom stage can have, and to enable the ViewData feature for target data (you cannot enable target ViewData if your stage has any output links). When the stage is used in a job design, a ViewData button appears on the Input page, which allows you to view the data on the actual data target (provided some has been written there). In order to use the target ViewData feature, you have to specify an Orchestrate operator to read the data back from the target. This will usually be different from the operator that the stage has used to write the data (that is, the operator defined in the Operator field of the General page). Specify the reading operator and associated arguments in the Operator and Options fields. If you enable target ViewData, a further field appears in the Properties grid, called ViewData.
4. Go to the Creator page and optionally specify information about the stage you are creating. We recommend that you assign a version number to the stage so you can keep track of any subsequent changes. You can specify that the actual stage will use a custom GUI by entering the ProgID for a custom GUI in the Custom GUI Prog ID field. You can also specify that the stage has its own icon. You need to supply a 16 x 16 bit bitmap and a 32 x 32 bit bitmap to be displayed in various places in the WebSphere DataStage user interface. Click the 16 x 16 Bitmap button and browse for the smaller bitmap file. Click the 32 x 32 Bitmap button and browse for the large bitmap file. Note that bitmaps with 32-bit color are not supported. Click the Reset Bitmap Info button to revert to using the default WebSphere DataStage icon for this stage.
5. Go to the Properties page. This allows you to specify the options that the Orchestrate operator requires as properties that appear in the Stage Properties tab. For custom stages the Properties tab always appears under the Stage page.
6. Fill in the fields as follows:
   • Property name. The name of the property.
   • Data type. The data type of the property. Choose from:
      Boolean
      Float
      Integer
      String
      Pathname
      List
      Input Column
      Output Column
     If you choose Input Column or Output Column, when the stage is included in a job a drop-down list will offer a choice of the defined input or output columns.


     If you choose List, open the Extended Properties dialog box from the grid shortcut menu to specify what appears in the list.
   • Prompt. The name of the property that will be displayed on the Properties tab of the stage editor.
   • Default Value. The value the option will take if no other is specified.
   • Required. Set this to True if the property is mandatory.
   • Repeats. Set this to True if the property repeats (that is, you can have multiple instances of it).
   • Use Quoting. Specify whether the property will have quotes added when it is passed to the Orchestrate operator.
   • Conversion. Specifies the type of property as follows:
      • -Name. The name of the property will be passed to the operator as the option value. This will normally be a hidden property, that is, not visible in the stage editor.
      • -Name Value. The name of the property will be passed to the operator as the option name, and any value specified in the stage editor is passed as the value.
      • -Value. The value for the property specified in the stage editor is passed to the operator as the option name. Typically used to group operator options that are mutually exclusive.
      • Value only. The value for the property specified in the stage editor is passed as it is.
      • Input Schema. Specifies that the property will contain a schema string whose contents are populated from the Input page Columns tab.
      • Output Schema. Specifies that the property will contain a schema string whose contents are populated from the Output page Columns tab.
      • None. This allows the creation of properties that do not generate any osh, but can be used for conditions on other properties (for example, for use in a situation where you have mutually exclusive properties, but at least one of them must be specified).
   • Schema properties require format options. Select this check box to specify that the stage being specified will have a Format tab.
   If you have enabled target ViewData on the Links page, the following property is also displayed:
   • ViewData. Select Yes to indicate that the value of this property should be used when viewing data. For example, if this property specifies a file to write to when the stage is used in a job design, the value of this property will be used to read the data back if ViewData is used in the stage.
   If you select a conversion type of Input Schema or Output Schema, you should note the following:
   • Data Type is set to String.
   • Required is set to Yes.
   • The property is marked as hidden and will not appear on the Properties page when the custom stage is used in a job design.
   If your stage can have multiple input or output links, there is an Input Schema property or Output Schema property for each link. When the stage is used in a job design, the property will contain the following OSH for each input or output link:
-property_name record {format_properties} ( column_definition {format_properties}; ...)

v v v

Where: v property_name is the name of the property (usually `schema) v format_properties are formatting information supplied on the Format page (if the stage has one). v there is one column_definition for each column defined in the Columns tab for that link. The format_props in this case refers to per-column format information specified in the Edit Column Meta Data dialog box.

Chapter 5. Specifying your own parallel stages

33

Schema properties are mutually exclusive with schema file properties. If your custom stage supports both, you should use the Extended Properties dialog box to specify a condition of schemafile= for the schema property. The schema property is then only valid provided the schema file property is blank (or does not exist). 7. If you want to specify a list property, or otherwise control how properties are handled by your stage, choose Extended Properties from the Properties grid shortcut menu to open the Extended Properties dialog box. The settings you use depend on the type of property you are specifying: v Specify a category to have the property appear under this category in the stage editor. By default all properties appear in the Options category. v Specify that the property will be hidden and not appear in the stage editor. This is primarily intended to support the case where the underlying operator needs to know the JobName. This can be passed using a mandatory String property with a default value that uses a DS Macro. However, to prevent the user from changing the value, the property needs to be hidden. v If you are specifying a List category, specify the possible values for list members in the List Value field. v If the property is to be a dependent of another property, select the parent property in the Parents field. v Specify an expression in the Template field to have the actual value of the property generated at compile time. It is usually based on values in other properties and columns. v Specify an expression in the Conditions field to indicate that the property is only valid if the conditions are met. The specification of this property is a bar | separated list of conditions that are ANDed together. For example, if the specification was a=b|c!=d, then this property would only be valid (and therefore only available in the GUI) when property a is equal to b, and property c is not equal to d. 8. 
If your custom stage will create columns, go to the Mapping Additions page. It contains a grid that allows for the specification of columns created by the stage. You can also specify that column details are filled in from properties supplied when the stage is used in a job design, allowing for dynamic specification of columns. The grid contains the following fields: v Column name. The name of the column created by the stage. You can specify the name of a property you specified on the Property page of the dialog box to dynamically allocate the column name. Specify this in the form #property_name#, the created column will then take the value of this property, as specified at design time, as the name of the created column. v Parallel type. The type of the column (this is the underlying data type, not the SQL data type). Again you can specify the name of a property you specified on the Property page of the dialog box to dynamically allocate the column type. Specify this in the form #property_name#, the created column will then take the value of this property, as specified at design time, as the type of the created column. (Note that you cannot use a repeatable property to dynamically allocate a column type in this way.) v Nullable. Choose Yes or No to indicate whether the created column can contain a null. v Conditions. Allows you to enter an expression specifying the conditions under which the column will be created. This could, for example, depend on the setting of one of the properties specified in the Property page. You can propagate the values of the Conditions fields to other columns if required. Do this by selecting the columns you want to propagate to, then right-clicking in the source Conditions field and choosing Propagate from the shortcut menu. A dialog box asks you to confirm that you want to propagate the conditions to all columns. 9. Click OK when you are happy with your custom stage definition. The Save As dialog box appears. 10. 
Select the folder in the repository tree where you want to store the stage type and click OK.

34

Parallel Job Advanced Developer Guide

Defining custom stages


To define a custom stage type:
1. Do one of:
   a. Choose File > New from the Designer menu. The New dialog box appears.
   b. Open the Stage Type folder and select the Parallel Custom Stage Type icon.
   c. Click OK. The Stage Type dialog box appears, with the General page on top.
   Or:
   d. Select a folder in the repository tree.
   e. Choose New > Other > Parallel Stage > Custom from the shortcut menu. The Stage Type dialog box appears, with the General page on top.
2. Fill in the fields on the General page as follows:
   • Stage type name. This is the name by which the stage is known to WebSphere DataStage. Avoid using the same name as existing stages.
   • Parallel Stage type. This indicates the type of new parallel job stage you are defining (Custom, Build, or Wrapped). You cannot change this setting.
   • Execution Mode. Choose the execution mode. This is the mode that will appear in the Advanced tab on the stage editor. You can override this mode for individual instances of the stage as required, unless you select Parallel only or Sequential only. See WebSphere DataStage Parallel Job Developer Guide for a description of the execution mode.
   • Mapping. Choose whether the stage has a Mapping tab or not. A Mapping tab enables the user of the stage to specify how output columns are derived from the data produced by the stage. Choose None to specify that output mapping is not performed, or choose Default to accept the default setting that WebSphere DataStage uses.
   • Preserve Partitioning. Choose the default setting of the Preserve Partitioning flag. This is the setting that will appear in the Advanced tab on the stage editor. You can override this setting for individual instances of the stage as required. See WebSphere DataStage Parallel Job Developer Guide for a description of the preserve partitioning flag.
   • Partitioning. Choose the default partitioning method for the stage. This is the method that will appear in the Inputs page Partitioning tab of the stage editor. You can override this method for individual instances of the stage as required. See WebSphere DataStage Parallel Job Developer Guide for a description of the partitioning methods.
   • Collecting. Choose the default collection method for the stage. This is the method that will appear in the Inputs page Partitioning tab of the stage editor. You can override this method for individual instances of the stage as required. See WebSphere DataStage Parallel Job Developer Guide for a description of the collection methods.
   • Operator. Enter the name of the Orchestrate operator that you want the stage to invoke.
   • Short Description. Optionally enter a short description of the stage.
   • Long Description. Optionally enter a long description of the stage.
3. Go to the Links page and specify information about the links allowed to and from the stage you are defining.
   Use this page to specify the minimum and maximum number of input and output links that your custom stage can have, and to enable the ViewData feature for target data (you cannot enable target ViewData if your stage has any output links). When the stage is used in a job design, a ViewData button appears on the Input page, which allows you to view the data on the actual data target (provided some has been written there).
   In order to use the target ViewData feature, you have to specify an Orchestrate operator to read the data back from the target. This will usually be different from the operator that the stage has used to write the data (that is, the operator defined in the Operator field of the General page). Specify the reading operator and associated arguments in the Operator and Options fields.
   If you enable target ViewData, a further field appears in the Properties grid, called ViewData.
4. Go to the Creator page and optionally specify information about the stage you are creating. We recommend that you assign a version number to the stage so you can keep track of any subsequent changes.
   You can specify that the actual stage will use a custom GUI by entering the ProgID for a custom GUI in the Custom GUI Prog ID field.
   You can also specify that the stage has its own icon. You need to supply a 16 x 16 bitmap and a 32 x 32 bitmap to be displayed in various places in the WebSphere DataStage user interface. Click the 16 x 16 Bitmap button and browse for the smaller bitmap file. Click the 32 x 32 Bitmap button and browse for the larger bitmap file. Note that bitmaps with 32-bit color are not supported. Click the Reset Bitmap Info button to revert to using the default WebSphere DataStage icon for this stage.
5. Go to the Properties page. This allows you to specify the options that the Orchestrate operator requires as properties that appear in the Stage Properties tab. For custom stages the Properties tab always appears under the Stage page.
6. Fill in the fields as follows:
   • Property name. The name of the property.
   • Data type. The data type of the property. Choose from:
       Boolean
       Float
       Integer
       String
       Pathname
       List
       Input Column
       Output Column
     If you choose Input Column or Output Column, when the stage is included in a job a drop-down list will offer a choice of the defined input or output columns.
     If you choose List, open the Extended Properties dialog box from the grid shortcut menu to specify what appears in the list.
   • Prompt. The name of the property that will be displayed on the Properties tab of the stage editor.
   • Default Value. The value the option will take if no other is specified.
   • Required. Set this to True if the property is mandatory.
   • Repeats. Set this to True if the property repeats (that is, you can have multiple instances of it).
   • Use Quoting. Specify whether the property will have quotes added when it is passed to the Orchestrate operator.
   • Conversion. Specifies the type of property as follows:
       -Name. The name of the property will be passed to the operator as the option value. This will normally be a hidden property, that is, not visible in the stage editor.
       -Name Value. The name of the property will be passed to the operator as the option name, and any value specified in the stage editor is passed as the value.
       -Value. The value for the property specified in the stage editor is passed to the operator as the option name. Typically used to group operator options that are mutually exclusive.
       Value only. The value for the property specified in the stage editor is passed as it is.
       Input Schema. Specifies that the property will contain a schema string whose contents are populated from the Input page Columns tab.
       Output Schema. Specifies that the property will contain a schema string whose contents are populated from the Output page Columns tab.


       None. This allows the creation of properties that do not generate any osh, but can be used for conditions on other properties (for example, for use in a situation where you have mutually exclusive properties, but at least one of them must be specified).
   • Schema properties require format options. Select this check box to specify that the stage being specified will have a Format tab.
   If you have enabled target ViewData on the Links page, the following property is also displayed:
   • ViewData. Select Yes to indicate that the value of this property should be used when viewing data. For example, if this property specifies a file to write to when the stage is used in a job design, the value of this property will be used to read the data back if ViewData is used in the stage.
   If you select a conversion type of Input Schema or Output Schema, you should note the following:
   • Data Type is set to String.
   • Required is set to Yes.
   • The property is marked as hidden and will not appear on the Properties page when the custom stage is used in a job design.
   If your stage can have multiple input or output links, there is an Input Schema property or Output Schema property for each link.
   When the stage is used in a job design, the property will contain the following OSH for each input or output link:

       -property_name record {format_properties} ( column_definition {format_properties}; ...)

   Where:
   • property_name is the name of the property (usually schema).
   • format_properties is the formatting information supplied on the Format page (if the stage has one).
   • There is one column_definition for each column defined in the Columns tab for that link. The format_properties in this case refers to per-column format information specified in the Edit Column Meta Data dialog box.
   Schema properties are mutually exclusive with schema file properties. If your custom stage supports both, you should use the Extended Properties dialog box to specify a condition of schemafile= for the schema property. The schema property is then only valid provided the schema file property is blank (or does not exist).
7. If you want to specify a list property, or otherwise control how properties are handled by your stage, choose Extended Properties from the Properties grid shortcut menu to open the Extended Properties dialog box.
   The settings you use depend on the type of property you are specifying:
   • Specify a category to have the property appear under this category in the stage editor. By default all properties appear in the Options category.
   • Specify that the property will be hidden and not appear in the stage editor. This is primarily intended to support the case where the underlying operator needs to know the JobName. This can be passed using a mandatory String property with a default value that uses a DS Macro. However, to prevent the user from changing the value, the property needs to be hidden.
   • If you are specifying a List category, specify the possible values for list members in the List Value field.
   • If the property is to be a dependent of another property, select the parent property in the Parents field.
   • Specify an expression in the Template field to have the actual value of the property generated at compile time. It is usually based on values in other properties and columns.
   • Specify an expression in the Conditions field to indicate that the property is only valid if the conditions are met. The specification is a list of conditions separated by bars (|) that are ANDed together. For example, if the specification was a=b|c!=d, then this property would only be valid (and therefore only available in the GUI) when property a is equal to b, and property c is not equal to d.
8. If your custom stage will create columns, go to the Mapping Additions page. It contains a grid that allows for the specification of columns created by the stage. You can also specify that column details are filled in from properties supplied when the stage is used in a job design, allowing for dynamic specification of columns.
   The grid contains the following fields:
   • Column name. The name of the column created by the stage. You can specify the name of a property you specified on the Property page of the dialog box to dynamically allocate the column name. Specify this in the form #property_name#; the created column then takes the value of this property, as specified at design time, as its name.
   • Parallel type. The type of the column (this is the underlying data type, not the SQL data type). Again you can specify the name of a property you specified on the Property page of the dialog box to dynamically allocate the column type. Specify this in the form #property_name#; the created column then takes the value of this property, as specified at design time, as its type. (Note that you cannot use a repeatable property to dynamically allocate a column type in this way.)
   • Nullable. Choose Yes or No to indicate whether the created column can contain a null.
   • Conditions. Allows you to enter an expression specifying the conditions under which the column will be created. This could, for example, depend on the setting of one of the properties specified in the Property page.
   You can propagate the values of the Conditions fields to other columns if required. Do this by selecting the columns you want to propagate to, then right-clicking in the source Conditions field and choosing Propagate from the shortcut menu. A dialog box asks you to confirm that you want to propagate the conditions to all columns.
9. Click OK when you are happy with your custom stage definition. The Save As dialog box appears.
10. Select the folder in the repository tree where you want to store the stage type and click OK.
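The Conditions syntax described in step 7 can be illustrated with a short stand-alone sketch. This is not how WebSphere DataStage evaluates conditions internally; the function name conditionsMet and the choice to treat an unset property as an empty string are assumptions made for illustration. It only shows the rule stated above: conditions are separated by bars and ANDed together, and each one is an equality (=) or inequality (!=) test on a property value.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: returns true when every condition in a specification
// such as "a=b|c!=d" holds for the given property values.
// Assumption: a property that is not set compares as an empty string.
bool conditionsMet(const std::string& spec,
                   const std::map<std::string, std::string>& props) {
    std::istringstream in(spec);
    std::string cond;
    while (std::getline(in, cond, '|')) {          // split on the bar
        bool negate = false;
        std::string::size_type pos = cond.find("!=");
        std::string::size_type valueStart = 0;
        if (pos != std::string::npos) {            // inequality test
            negate = true;
            valueStart = pos + 2;
        } else {                                   // equality test
            pos = cond.find('=');
            if (pos == std::string::npos) return false;  // malformed condition
            valueStart = pos + 1;
        }
        const std::string name = cond.substr(0, pos);
        const std::string expected = cond.substr(valueStart);
        const auto it = props.find(name);
        const std::string actual = (it == props.end()) ? "" : it->second;
        const bool equal = (actual == expected);
        if (equal == negate) return false;         // this condition failed
    }
    return true;                                   // all conditions held
}
```

With property a set to b and property c set to anything other than d, conditionsMet("a=b|c!=d", props) returns true. Under the empty-string assumption, the specification schemafile= holds when the schema file property is blank or absent, which matches the schema property example in step 6.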

Defining build stages


You define a Build stage to provide a custom operator that can be executed from a parallel job stage. The stage will be available to all jobs in the project in which the stage was defined. You can make it available to other projects using the WebSphere DataStage Export facilities. The stage is automatically added to the job palette.

When defining a Build stage you provide the following information:
• Description of the data that will be input to the stage.
• Whether records are transferred from input to output. A transfer copies the input record to the output buffer. If you specify auto transfer, the operator transfers the input record to the output record immediately after execution of the per-record code. The code can still access data in the output buffer until it is actually written.
• Any definitions and header file information that needs to be included.
• Code that is executed at the beginning of the stage (before any records are processed).
• Code that is executed at the end of the stage (after all records have been processed).
• Code that is executed every time the stage processes a record.
• Compilation and build details for actually building the stage.

Note that the custom operator that your build stage executes must have at least one input data set and one output data set.
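The order in which the three pieces of code run, and the point at which an automatic transfer takes place, can be pictured with a small stand-alone simulation. This is not the code that WebSphere DataStage generates and it uses none of the build stage macros; the names Record, OutputBuffer, StageCode, and runStage are hypothetical, and a single string field stands in for a real record schema.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical input record with one field.
struct Record { std::string payload; };

// Hypothetical output buffer: the transferred input record plus one
// output column that the per-record code sets itself.
struct OutputBuffer {
    Record transferred;
    std::size_t length = 0;
};

// The three code sections supplied when defining the stage.
struct StageCode {
    std::function<void()> preLoop;                                // before any record
    std::function<void(const Record&, OutputBuffer&)> perRecord;  // once per record
    std::function<void()> postLoop;                               // after all records
};

// Simulates a stage with one input port and one output port.
std::vector<OutputBuffer> runStage(const std::vector<Record>& input,
                                   const StageCode& code,
                                   bool autoTransfer) {
    std::vector<OutputBuffer> written;
    code.preLoop();
    for (const Record& in : input) {
        OutputBuffer out;
        code.perRecord(in, out);
        if (autoTransfer) {
            out.transferred = in;   // transfer runs right after the per-record code
        }
        written.push_back(out);     // the record is written; the code can no longer change it
    }
    code.postLoop();
    return written;
}
```

In this sketch, per-record code that sets out.length to the payload size still has its value in the written record, because the transfer fills only the transferred part of the buffer. With autoTransfer set to false the transferred part stays empty unless the per-record code copies the record itself.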


The code for the Build stage is specified in C++. There are a number of macros available to make the job of coding simpler (see Build stage macros). There are also a number of header files available containing many useful functions; see Appendix A.

When you have specified the information and request that the stage is generated, WebSphere DataStage generates a number of files and then compiles these to build an operator which the stage executes. The generated files include:
• Header files (ending in .h)
• Source files (ending in .c)
• Object files (ending in .so)

The following shows a build stage in diagrammatic form:

[Figure: a build stage in diagrammatic form]

To define a Build stage:
1. Do one of:
   a. Choose File > New from the Designer menu. The New dialog box appears.
   b. Open the Stage Type folder and select the Parallel Build Stage Type icon.
   c. Click OK. The Stage Type dialog box appears, with the General page on top.
   Or:
   d. Select a folder in the repository tree.
   e. Choose New > Other > Parallel Stage > Build from the shortcut menu. The Stage Type dialog box appears, with the General page on top.
2. Fill in the fields on the General page as follows:
   • Stage type name. This is the name by which the stage is known to WebSphere DataStage. Avoid using the same name as existing stages.
   • Class Name. The name of the C++ class. By default this takes the name of the stage type.
   • Parallel Stage type. This indicates the type of new parallel job stage you are defining (Custom, Build, or Wrapped). You cannot change this setting.
   • Execution mode. Choose the default execution mode. This is the mode that will appear in the Advanced tab on the stage editor. You can override this mode for individual instances of the stage as required, unless you select Parallel only or Sequential only. See WebSphere DataStage Parallel Job Developer Guide for a description of the execution mode.
   • Preserve Partitioning. This shows the default setting of the Preserve Partitioning flag, which you cannot change in a Build stage. This is the setting that will appear in the Advanced tab on the stage editor. You can override this setting for individual instances of the stage as required. See WebSphere DataStage Parallel Job Developer Guide for a description of the preserve partitioning flag.
   • Partitioning. This shows the default partitioning method, which you cannot change in a Build stage. This is the method that will appear in the Inputs page Partitioning tab of the stage editor. You can override this method for individual instances of the stage as required. See WebSphere DataStage Parallel Job Developer Guide for a description of the partitioning methods.
   • Collecting. This shows the default collection method, which you cannot change in a Build stage. This is the method that will appear in the Inputs page Partitioning tab of the stage editor. You can override this method for individual instances of the stage as required. See WebSphere DataStage Parallel Job Developer Guide for a description of the collection methods.
   • Operator. The name of the operator that your code is defining and which will be executed by the WebSphere DataStage stage. By default this takes the name of the stage type.
   • Short Description. Optionally enter a short description of the stage.
   • Long Description. Optionally enter a long description of the stage.


3. Go to the Creator page and optionally specify information about the stage you are creating. We recommend that you assign a release number to the stage so you can keep track of any subsequent changes.
   You can specify that the actual stage will use a custom GUI by entering the ProgID for a custom GUI in the Custom GUI Prog ID field.
   You can also specify that the stage has its own icon. You need to supply a 16 x 16 bitmap and a 32 x 32 bitmap to be displayed in various places in the WebSphere DataStage user interface. Click the 16 x 16 Bitmap button and browse for the smaller bitmap file. Click the 32 x 32 Bitmap button and browse for the larger bitmap file. Note that bitmaps with 32-bit color are not supported. Click the Reset Bitmap Info button to revert to using the default WebSphere DataStage icon for this stage.
4. Go to the Properties page. This allows you to specify the options that the Build stage requires as properties that appear in the Stage Properties tab. For custom stages the Properties tab always appears under the Stage page.
   Fill in the fields as follows:
   • Property name. The name of the property. This will be passed to the operator you are defining as an option, prefixed with a hyphen (-) and followed by the value selected in the Properties tab of the stage editor.
   • Data type. The data type of the property. Choose from:
       Boolean
       Float
       Integer
       String
       Pathname
       List
       Input Column
       Output Column
     If you choose Input Column or Output Column, when the stage is included in a job a drop-down list will offer a choice of the defined input or output columns.
     If you choose List, open the Extended Properties dialog box from the grid shortcut menu to specify what appears in the list.
   • Prompt. The name of the property that will be displayed on the Properties tab of the stage editor.
   • Default Value. The value the option will take if no other is specified.
   • Required. Set this to True if the property is mandatory.
   • Conversion. Specifies the type of property as follows:
       -Name. The name of the property will be passed to the operator as the option value. This will normally be a hidden property, that is, not visible in the stage editor.
       -Name Value. The name of the property will be passed to the operator as the option name, and any value specified in the stage editor is passed as the value.
       -Value. The value for the property specified in the stage editor is passed to the operator as the option name. Typically used to group operator options that are mutually exclusive.
       Value only. The value for the property specified in the stage editor is passed as it is.
5. If you want to specify a list property, or otherwise control how properties are handled by your stage, choose Extended Properties from the Properties grid shortcut menu to open the Extended Properties dialog box.
   The settings you use depend on the type of property you are specifying:
   • Specify a category to have the property appear under this category in the stage editor. By default all properties appear in the Options category.
   • If you are specifying a List category, specify the possible values for list members in the List Value field.


   • If the property is to be a dependent of another property, select the parent property in the Parents field.
   • Specify an expression in the Template field to have the actual value of the property generated at compile time. It is usually based on values in other properties and columns.
   • Specify an expression in the Conditions field to indicate that the property is only valid if the conditions are met. The specification is a list of conditions separated by bars (|) that are ANDed together. For example, if the specification was a=b|c!=d, then this property would only be valid (and therefore only available in the GUI) when property a is equal to b, and property c is not equal to d.
   Click OK when you are happy with the extended properties.
6. Click on the Build page. The tabs here allow you to define the actual operation that the stage will perform.
   The Interfaces tab enables you to specify details about inputs to and outputs from the stage, and about automatic transfer of records from input to output. You specify port details, a port being where a link connects to the stage. You need a port for each possible input link to the stage, and a port for each possible output link from the stage.
   You provide the following information on the Input sub-tab:
   • Port Name. Optional name for the port. The default names for the ports are in0, in1, in2 ... . You can refer to them in the code using either the default name or the name you have specified.
   • Alias. Where the port name contains non-ASCII characters, you can give it an alias in this column (this is only available where NLS is enabled).
   • AutoRead. This defaults to True, which means the stage will automatically read records from the port. Otherwise you explicitly control read operations in the code.
   • Table Name. Specify a table definition in the WebSphere DataStage Repository which describes the meta data for the port. You can browse for a table definition by choosing Select Table from the menu that appears when you click the browse button. You can also view the schema corresponding to this table definition by choosing View Schema from the same menu. You do not have to supply a Table Name. If any of the columns in your table definition have names that contain non-ASCII characters, you should choose Column Aliases from the menu. The Build Column Aliases dialog box appears. This lists the columns that require an alias and lets you specify one.
   • RCP. Choose True if runtime column propagation is allowed for inputs to this port. Defaults to False. You do not need to set this if you are using the automatic transfer facility.
   You provide the following information on the Output sub-tab:
   • Port Name. Optional name for the port. The default names for the ports are out0, out1, out2 ... . You can refer to them in the code using either the default name or the name you have specified.
   • Alias. Where the port name contains non-ASCII characters, you can give it an alias in this column.
   • AutoWrite. This defaults to True, which means the stage will automatically write records to the port. Otherwise you explicitly control write operations in the code. Once records are written, the code can no longer access them.
   • Table Name. Specify a table definition in the WebSphere DataStage Repository which describes the meta data for the port. You can browse for a table definition. You do not have to supply a Table Name. A shortcut menu accessed from the browse button offers a choice of Clear Table Name, Select Table, Create Table, View Schema, and Column Aliases. The use of these is as described for the Input sub-tab.
   • RCP. Choose True if runtime column propagation is allowed for outputs from this port. Defaults to False. You do not need to set this if you are using the automatic transfer facility.
   The Transfer sub-tab allows you to connect an input buffer to an output buffer such that records will be automatically transferred from input to output. You can also disable automatic transfer, in which case you have to explicitly transfer data in the code. Transferred data sits in an output buffer and can still be accessed and altered by the code until it is actually written to the port.
   You provide the following information on the Transfer sub-tab:
   • Input. Select the input port to connect to the buffer from the drop-down list. If you have specified an alias, this will be displayed here.
   • Output. From the drop-down list, select the output port to which input records are transferred from the output buffer. If you have specified an alias, this will be displayed here.
   • Auto Transfer. This defaults to False, which means that you have to include code which manages the transfer. Set to True to have the transfer carried out automatically.
   • Separate. This is False by default, which means this transfer will be combined with other transfers to the same port. Set to True to specify that the transfer should be separate from other transfers.
   The Logic tab is where you specify the actual code that the stage executes.
   The Definitions sub-tab allows you to specify variables, include header files, and otherwise initialize the stage before processing any records.
   The Pre-Loop sub-tab allows you to specify code which is executed at the beginning of the stage, before any records are processed.
   The Per-Record sub-tab allows you to specify the code which is executed once for every record processed.
   The Post-Loop sub-tab allows you to specify code that is executed after all the records have been processed.
   You can type straight into these pages or cut and paste from another editor. The shortcut menu on the Pre-Loop, Per-Record, and Post-Loop pages gives access to the macros that are available for use in the code.
   The Advanced tab allows you to specify details about how the stage is compiled and built. Fill in the page as follows:
   • Compile and Link Flags. Allows you to specify flags that are passed to the C++ compiler.
   • Verbose. Select this check box to specify that the compile and build is done in verbose mode.
   • Debug. Select this check box to specify that the compile and build is done in debug mode. Otherwise, it is done in optimize mode.
   • Suppress Compile. Select this check box to generate files without compiling, and without deleting the generated files. This option is useful for fault finding.
   • Base File Name. The base filename for generated files. All generated files will have this name followed by the appropriate suffix. This defaults to the name specified under Operator on the General page.
   • Source Directory. The directory where generated .c files are placed. This defaults to the buildop folder in the current project directory. You can also set it using the DS_OPERATOR_BUILDOP_DIR environment variable in the WebSphere DataStage Administrator (see WebSphere DataStage Administrator Client Guide).
   • Header Directory. The directory where generated .h files are placed. This defaults to the buildop folder in the current project directory. You can also set it using the DS_OPERATOR_BUILDOP_DIR environment variable in the WebSphere DataStage Administrator (see WebSphere DataStage Administrator Client Guide).
   • Object Directory. The directory where generated .so files are placed. This defaults to the buildop folder in the current project directory. You can also set it using the DS_OPERATOR_BUILDOP_DIR environment variable in the WebSphere DataStage Administrator (see WebSphere DataStage Administrator Client Guide).
   • Wrapper directory. The directory where generated .op files are placed. This defaults to the buildop folder in the current project directory. You can also set it using the DS_OPERATOR_BUILDOP_DIR environment variable in the WebSphere DataStage Administrator (see WebSphere DataStage Administrator Client Guide).
7. When you have filled in the details in all the pages, click Generate to generate the stage. A window appears showing you the result of the build.

Build stage macros


There are a number of macros you can use when specifying Pre-Loop, Per-Record, and Post-Loop code. Insert a macro by selecting it from the shortcut menu. They are grouped into the following categories:
• Informational
• Flow-control
• Input and output
• Transfer

Informational macros
Use these macros in your code to determine the number of inputs, outputs, and transfers as follows:
• inputs(). Returns the number of inputs to the stage.
• outputs(). Returns the number of outputs from the stage.
• transfers(). Returns the number of transfers in the stage.

Flow-control macros
Use these macros to override the default behavior of the Per-Record loop in your stage definition:
• endLoop(). Causes the operator to stop looping, following completion of the current loop and after writing any auto outputs for this loop.
• nextLoop(). Causes the operator to immediately skip to the start of the next loop, without writing any outputs.
• failStep(). Causes the operator to return a failed status and terminate the job.

Input and output macros


These macros allow you to explicitly control the reading, writing, and transfer of individual records. Each of the macros takes an argument as follows:
• input is the index of the input (0 to n). If you have defined a name for the input port you can use this in place of the index in the form portname.portid_.
• output is the index of the output (0 to n). If you have defined a name for the output port you can use this in place of the index in the form portname.portid_.
• index is the index of the transfer (0 to n).

The following macros are available:
• readRecord(input). Immediately reads the next record from input, if there is one. If there is no record, the next call to inputDone() will return true.
• writeRecord(output). Immediately writes a record to output.
• inputDone(input). Returns true if the last call to readRecord() for the specified input failed to read a new record, because the input has no more records.
• holdRecord(input). Causes auto input to be suspended for the current record, so that the operator does not automatically read a new record at the start of the next loop. If auto is not set for the input, holdRecord() has no effect.
• discardRecord(output). Causes auto output to be suspended for the current record, so that the operator does not output the record at the end of the current loop. If auto is not set for the output, discardRecord() has no effect.
• discardTransfer(index). Causes auto transfer to be suspended, so that the operator does not perform the transfer at the end of the current loop. If auto is not set for the transfer, discardTransfer() has no effect.

Transfer macros
These macros allow you to explicitly control the transfer of individual records. Each of the macros takes an argument as follows:
• input is the index of the input (0 to n). If you have defined a name for the input port you can use this in place of the index in the form portname.portid_.
• output is the index of the output (0 to n). If you have defined a name for the output port you can use this in place of the index in the form portname.portid_.
• index is the index of the transfer (0 to n).

The following macros are available:
• doTransfer(index). Performs the transfer specified by index.
• doTransfersFrom(input). Performs all transfers from input.
• doTransfersTo(output). Performs all transfers to output.
• transferAndWriteRecord(output). Performs all transfers and writes a record for the specified output. Calling this macro is equivalent to calling the macros doTransfersTo() and writeRecord().

How your code is executed


This section describes how the code that you define when specifying a Build stage executes when the stage is run in a WebSphere DataStage job. The sequence is as follows:
1. Handles any definitions that you specified in the Definitions sub-tab when you entered the stage details.
2. Executes any code that was entered in the Pre-Loop sub-tab.
3. Loops repeatedly until either all inputs have run out of records, or the Per-Record code has explicitly invoked endLoop(). In the loop, performs the following steps:
   a. Reads one record for each input, except where any of the following is true:
      • The input has no more records left.
      • The input has Auto Read set to false.
      • The holdRecord() macro was called for the input last time around the loop.
   b. Executes the Per-Record code, which can explicitly read and write records, perform transfers, and invoke loop-control macros such as endLoop().
   c. Performs each specified transfer, except where any of the following is true:
      • The input of the transfer has no more records.
      • The transfer has Auto Transfer set to False.
      • The discardTransfer() macro was called for the transfer during the current loop iteration.
   d. Writes one record for each output, except where any of the following is true:
      • The output has Auto Write set to false.
      • The discardRecord() macro was called for the output during the current loop iteration.
4. If you have specified code in the Post-Loop sub-tab, executes it.
5. Returns a status, which is written to the WebSphere DataStage Job Log.
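The execution sequence above can be sketched as a small Python model. This is only an illustration under stated assumptions: real Build stage code is C++ and uses the macros directly, the class BuildStageModel and its method names are hypothetical stand-ins for those macros, records are modelled as dictionaries, and the point at which the loop checks for exhausted inputs (straight after the automatic reads) is an assumption rather than documented behavior. nextLoop() and failStep() are not modelled.

```python
class BuildStageModel:
    """Toy model of the Build stage loop (not the real operator runtime)."""

    def __init__(self, inputs, n_outputs, transfers,
                 auto_read=None, auto_write=None):
        self.queues = [list(q) for q in inputs]   # unread records per input
        self.current = [None] * len(inputs)       # record held in each input buffer
        self.done = [False] * len(inputs)         # what inputDone() would report
        self.auto_read = auto_read or [True] * len(inputs)
        self.auto_write = auto_write or [True] * n_outputs
        self.transfers = transfers                # (input, output, auto) tuples
        self.out_buf = [{} for _ in range(n_outputs)]
        self.written = [[] for _ in range(n_outputs)]
        self.held, self.no_write, self.no_transfer = set(), set(), set()
        self.stop = False

    # Hypothetical stand-ins for readRecord(), writeRecord(), endLoop(),
    # holdRecord(), discardRecord() and discardTransfer().
    def read_record(self, i):
        if self.queues[i]:
            self.current[i] = self.queues[i].pop(0)
        else:
            self.done[i] = True

    def write_record(self, o):
        self.written[o].append(dict(self.out_buf[o]))

    def end_loop(self):
        self.stop = True

    def hold_record(self, i):
        self.held.add(i)

    def discard_record(self, o):
        self.no_write.add(o)

    def discard_transfer(self, t):
        self.no_transfer.add(t)

    def run(self, per_record, pre_loop=None, post_loop=None):
        if pre_loop:
            pre_loop(self)                        # Pre-Loop code
        while not self.stop:
            held, self.held = self.held, set()    # holds apply to one loop only
            self.no_write.clear()
            self.no_transfer.clear()
            self.out_buf = [{} for _ in self.written]
            for i in range(len(self.queues)):     # automatic reads
                if self.auto_read[i] and not self.done[i] and i not in held:
                    self.read_record(i)
            if all(self.done):                    # every input is exhausted
                break
            per_record(self)                      # Per-Record code
            for t, (i, o, auto) in enumerate(self.transfers):
                # automatic transfers, unless discarded or the input is done
                if (auto and not self.done[i] and t not in self.no_transfer
                        and self.current[i] is not None):
                    self.out_buf[o].update(self.current[i])
            for o in range(len(self.written)):    # automatic writes
                if self.auto_write[o] and o not in self.no_write:
                    self.write_record(o)
        if post_loop:
            post_loop(self)                       # Post-Loop code
        return self.written


def per_record(stage):
    # Suppress the automatic write for even values, for this loop only.
    if stage.current[0]["n"] % 2 == 0:
        stage.discard_record(0)


stage = BuildStageModel([[{"n": 1}, {"n": 2}, {"n": 3}]], 1, [(0, 0, True)])
result = stage.run(per_record)
```

In this run the single input is read automatically, every record is transferred automatically, and only the records with odd values are written, because the stand-in for discardRecord() suppresses the automatic write for the others.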

Inputs and outputs


The input and output ports that you defined for your Build stage are where input and output links attach to the stage. By default, links are connected to ports in the order they are connected to the stage, but where your stage allows multiple input or output links you can change the link order using the Link Order tab on the stage editor.

When you specify details about the input and output ports for your Build stage, you need to define the meta data for the ports. You do this by loading a table definition from the WebSphere DataStage Repository. When you actually use your stage in a job, you have to specify meta data for the links that attach to these ports. For the job to run successfully the meta data specified for the port and that specified for the link should match. An exception to this is where you have runtime column propagation enabled for the job. In this case the input link meta data can be a super-set of the port meta data and the extra columns will be automatically propagated.

Using multiple inputs


Where you require your stage to handle multiple inputs, there are some special considerations. Your code needs to ensure the following:
• The stage only tries to access a column when there are records available. It should not try to access a column after all records have been read (use the inputDone() macro to check), and should not attempt to access a column unless either Auto Read is enabled on the link or an explicit read record has been performed.
• The reading of records from an input is terminated as soon as all the required records have been read from it. In the case of a port with Auto Read disabled, the code must determine when all required records have been read and call the endLoop() macro.

In most cases you can keep Auto Read enabled when you are using multiple inputs, because this minimizes the need for explicit control in your code. There are circumstances, however, when this is not appropriate. The following paragraphs describe some common scenarios:

Using auto read for all inputs


All ports have Auto Read enabled and so all record reads are handled automatically. You need to code the Per-Record loop so that each time it accesses a column on any input it first uses the inputDone() macro to determine whether there are any more records. This method is fine if you want your stage to read a record from every link, every time round the loop.

Using inputs with auto read enabled for some and disabled for others
You define one (or possibly more) inputs as Auto Read, and the rest with Auto Read disabled. You code the stage in such a way that the processing of records from the Auto Read input drives the processing of the other inputs. Each time round the loop, your code should call inputDone() on the Auto Read input and, when it returns true, call endLoop() to complete the actions of the stage. This method is fine where you process a record from the Auto Read input every time around the loop, and then process records from one or more of the other inputs depending on the results of processing the Auto Read record.

Using inputs with auto read disabled


Your code must explicitly perform all record reads. You should define Pre-Loop code which calls readRecord() once for each input to start processing. Your Per-Record code should call inputDone() for every input each time round the loop to determine whether a record was read on the most recent readRecord() and, if one was, call readRecord() again for that input. When all inputs run out of records, the Per-Record code should exit the loop. This method is intended where you want explicit control over how each input is treated.

Example Build Stage


This section shows you how to define a Build stage called Divide, which divides one number by another and writes the result and any remainder to an output link. The stage also checks whether you are trying to divide by zero and, if you are, sends the input record down a reject link. To demonstrate the use of properties, the stage also lets you define a minimum divisor. If the number you are dividing by is smaller than the minimum divisor you specify when adding the stage to a job, then the record is also rejected.

The input to the stage is defined as auto read, while the two outputs have auto write disabled. The code has to explicitly write the data to one or other of the output links. In the case of a successful division the data written is the original record plus the result of the division and any remainder. In the case of a rejected record, only the original record is written.

The input record has two columns: dividend and divisor. Output 0 has four columns: dividend, divisor, result, and remainder. Output 1 (the reject link) has two columns: dividend and divisor.

If the divisor column of an input record contains zero or is less than the specified minimum divisor, the record is rejected, and the code uses the macro transferAndWriteRecord(1) to transfer the data to port 1 and write it. Otherwise, the code uses doTransfersTo(0) to transfer the input record to Output 0, assigns the division results to result and remainder, and finally calls writeRecord(0) to write the record to output 0.

The stage is defined in WebSphere DataStage using the Stage Type dialog box as follows:
1. First, general details are supplied in the General tab.
2. Details about the stage's creation are supplied on the Creator page.
3. The optional property of the stage is defined in the Properties tab.
4. Details of the inputs and outputs are defined on the Interfaces tab of the Build page. Details about the single input to Divide are given on the Input sub-tab of the Interfaces tab. A table definition for the input link is available to be loaded from the WebSphere DataStage Repository. Details about the outputs are given on the Output sub-tab of the Interfaces tab. When you use the stage in a job, make sure that you use table definitions compatible with the tables defined in the input and output sub-tabs. Details about the transfers carried out by the stage are defined on the Transfer sub-tab of the Interfaces tab.
5. The code itself is defined on the Logic tab. In this case all the processing is done in the Per-Record loop and so is entered on the Per-Record sub-tab.
6. As this example uses all the compile and build defaults, all that remains is to click Generate to build the stage.
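The decision logic of the Divide stage can be modelled in a few lines of Python. This is a sketch of the routing rules described above, not the C++ that would go on the Per-Record sub-tab. The function name divide_per_record and the parameter minimum_divisor are hypothetical, and integer columns are assumed. Returning port 1 stands for transferAndWriteRecord(1), and returning port 0 stands for doTransfersTo(0) followed by writeRecord(0).

```python
def divide_per_record(record, minimum_divisor=0):
    """Return (output_port, record_written) for one input record."""
    dividend, divisor = record["dividend"], record["divisor"]

    # Reject link: divisor is zero or below the configured minimum.
    if divisor == 0 or divisor < minimum_divisor:
        return 1, dict(record)          # only the original columns are written

    # Main output: original columns plus result and remainder.
    out = dict(record)
    # Note: Python floors toward negative infinity, whereas C++ truncates
    # toward zero; the two agree for non-negative operands.
    out["result"] = dividend // divisor
    out["remainder"] = dividend % divisor
    return 0, out


port, written = divide_per_record({"dividend": 7, "divisor": 2})
```

Here a dividend of 7 and a divisor of 2 go to output 0 with a result of 3 and a remainder of 1, while a zero divisor would be routed to output 1 unchanged.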

Defining wrapped stages


You define a Wrapped stage to enable you to specify a UNIX command to be executed by a WebSphere DataStage stage. You define a wrapper file that handles arguments for the UNIX command and inputs and outputs. The Designer provides an interface that helps you define the wrapper. The stage will be available to all jobs in the project in which the stage was defined. You can make it available to other projects using the Designer Export facilities. You can add the stage to your job palette using palette customization features in the Designer. When defining a Wrapped stage you provide the following information:

• Details of the UNIX command that the stage will execute.
• Description of the data that will be input to the stage.
• Description of the data that will be output from the stage.
• Definition of the environment in which the command will execute.

The UNIX command that you wrap can be a built-in command, such as grep, a utility, such as SyncSort, or your own UNIX application. The only limitation is that the command must be pipe-safe (to be pipe-safe a UNIX command reads its input sequentially, from beginning to end).

You need to define meta data for the data being input to and output from the stage. You also need to define the way in which the data will be input or output. UNIX commands can take their inputs from standard in, or another stream, a file, or from the output of another command via a pipe. Similarly data is output to standard out, or another stream, to a file, or to a pipe to be input to another command. You specify what the command expects.

WebSphere DataStage handles data being input to the Wrapped stage and will present it in the specified form. If you specify a command that expects input on standard in, or another stream, WebSphere DataStage will present the input data from the job's data flow as if it were on standard in. Similarly it will intercept data output on standard out, or another stream, and integrate it into the job's data flow.

You also specify the environment in which the UNIX command will be executed when you define the wrapped stage.

To define a Wrapped stage:
1. Do one of the following:
   a. Choose File > New from the Designer menu. The New dialog box appears.
   b. Open the Stage Type folder and select the Parallel Wrapped Stage Type icon.
   c. Click OK. The Stage Type dialog box appears, with the General page on top.
   Or:
   d. Select a folder in the repository tree.
   e. Choose New > Other > Parallel Stage > Wrapped from the shortcut menu. The Stage Type dialog box appears, with the General page on top.
2. Fill in the fields on the General page as follows:
   • Stage type name. This is the name that the stage will be known by to WebSphere DataStage. Avoid using the same name as existing stages or the name of the actual UNIX command you are wrapping.
   • Category. The category that the new stage will be stored in under the stage types branch. Type in or browse for an existing category or type in the name of a new one. The category also determines what group in the palette the stage will be added to. Choose an existing category to add to an existing group, or specify a new category to create a new palette group.
   • Parallel Stage type. This indicates the type of new Parallel job stage you are defining (Custom, Build, or Wrapped). You cannot change this setting.
   • Wrapper Name. The name of the wrapper file WebSphere DataStage will generate to call the command. By default this will take the same name as the Stage type name.
   • Execution mode. Choose the default execution mode. This is the mode that will appear in the Advanced tab on the stage editor. You can override this mode for individual instances of the stage as required, unless you select Parallel only or Sequential only. See WebSphere DataStage Parallel Job Developer Guide for a description of the execution mode.
   • Preserve Partitioning. This shows the default setting of the Preserve Partitioning flag, which you cannot change in a Wrapped stage. This is the setting that will appear in the Advanced tab on the stage editor. You can override this setting for individual instances of the stage as required. See WebSphere DataStage Parallel Job Developer Guide for a description of the preserve partitioning flag.
   • Partitioning. This shows the default partitioning method, which you cannot change in a Wrapped stage. This is the method that will appear in the Inputs Page Partitioning tab of the stage editor. You can override this method for individual instances of the stage as required. See WebSphere DataStage Parallel Job Developer Guide for a description of the partitioning methods.
   • Collecting. This shows the default collection method, which you cannot change in a Wrapped stage. This is the method that will appear in the Inputs Page Partitioning tab of the stage editor. You can override this method for individual instances of the stage as required. See WebSphere DataStage Parallel Job Developer Guide for a description of the collection methods.
   • Command. The name of the UNIX command to be wrapped, plus any required arguments. The arguments that you enter here are ones that do not change with different invocations of the command. Arguments that need to be specified when the Wrapped stage is included in a job are defined as properties for the stage.
   • Short Description. Optionally enter a short description of the stage.
   • Long Description. Optionally enter a long description of the stage.
3. Go to the Creator page and optionally specify information about the stage you are creating. We recommend that you assign a release number to the stage so you can keep track of any subsequent changes. You can specify that the actual stage will use a custom GUI by entering the ProgID for a custom GUI in the Custom GUI Prog ID field. You can also specify that the stage has its own icon. You need to supply a 16 x 16 bit bitmap and a 32 x 32 bit bitmap to be displayed in various places in the WebSphere DataStage user interface. Click the 16 x 16 Bitmap button and browse for the smaller bitmap file. Click the 32 x 32 Bitmap button and browse for the large bitmap file. Note that bitmaps with 32-bit color are not supported. Click the Reset Bitmap Info button to revert to using the default WebSphere DataStage icon for this stage.
4. Go to the Properties page. This allows you to specify the arguments that the UNIX command requires as properties that appear in the stage Properties tab. For wrapped stages the Properties tab always appears under the Stage page. Fill in the fields as follows:
   • Property name. The name of the property that will be displayed on the Properties tab of the stage editor.
   • Data type. The data type of the property. Choose from:
       Boolean
       Float
       Integer
       String
       Pathname
       List
       Input Column
       Output Column
     If you choose Input Column or Output Column, when the stage is included in a job a drop-down list will offer a choice of the defined input or output columns. If you choose List you should open the Extended Properties dialog box from the grid shortcut menu to specify what appears in the list.
   • Prompt. The name of the property that will be displayed on the Properties tab of the stage editor.
   • Default Value. The value the option will take if no other is specified.
   • Required. Set this to True if the property is mandatory.
   • Repeats. Set this to True if the property repeats (that is, you can have multiple instances of it).
   • Conversion. Specifies the type of property as follows:

       -Name. The name of the property will be passed to the command as the argument value. This will normally be a hidden property, that is, not visible in the stage editor.
       -Name Value. The name of the property will be passed to the command as the argument name, and any value specified in the stage editor is passed as the value.
       -Value. The value for the property specified in the stage editor is passed to the command as the argument name. Typically used to group operator options that are mutually exclusive.
       Value only. The value for the property specified in the stage editor is passed as it is.
5. If you want to specify a list property, or otherwise control how properties are handled by your stage, choose Extended Properties from the Properties grid shortcut menu to open the Extended Properties dialog box. The settings you use depend on the type of property you are specifying:
   • Specify a category to have the property appear under this category in the stage editor. By default all properties appear in the Options category.
   • If you are specifying a List category, specify the possible values for list members in the List Value field.
   • If the property is to be a dependent of another property, select the parent property in the Parents field.
   • Specify an expression in the Template field to have the actual value of the property generated at compile time. It is usually based on values in other properties and columns.
   • Specify an expression in the Conditions field to indicate that the property is only valid if the conditions are met. The specification of this property is a bar (|) separated list of conditions that are ANDed together. For example, if the specification was a=b|c!=d, then this property would only be valid (and therefore only available in the GUI) when property a is equal to b, and property c is not equal to d.
   Click OK when you are happy with the extended properties.
6. Go to the Wrapped page. This allows you to specify information about the command to be executed by the stage and how it will be handled. The Interfaces tab is used to describe the inputs to and outputs from the stage, specifying the interfaces that the stage will need to function.
   Details about inputs to the stage are defined on the Inputs sub-tab:
   • Link. The link number. This is assigned for you and is read-only. When you actually use your stage, links will be assigned in the order in which you add them. In the example, the first link will be taken as link 0, the second as link 1 and so on. You can reassign the links using the stage editor's Link Ordering tab on the General page.
   • Table Name. The meta data for the link. You define this by loading a table definition from the Repository. Type in the name, or browse for a table definition. Alternatively, you can specify an argument to the UNIX command which specifies a table definition. In this case, when the wrapped stage is used in a job design, the designer will be prompted for an actual table definition to use.
   • Stream. Here you can specify whether the UNIX command expects its input on standard in, or another stream, or whether it expects it in a file. Click on the browse button to open the Wrapped Stream dialog box. In the case of a file, you should also specify whether the file to be read is given in a command line argument, or by an environment variable.
   Details about outputs from the stage are defined on the Outputs sub-tab:
   • Link. The link number. This is assigned for you and is read-only. When you actually use your stage, links will be assigned in the order in which you add them. In the example, the first link will be taken as link 0, the second as link 1 and so on. You can reassign the links using the stage editor's Link Ordering tab on the General page.
   • Table Name. The meta data for the link. You define this by loading a table definition from the Repository. Type in the name, or browse for a table definition.
   • Stream. Here you can specify whether the UNIX command will write its output to standard out, or another stream, or whether it outputs to a file. Click on the browse button to open the Wrapped Stream dialog box. In the case of a file, you should also specify whether the file to be written is specified in a command line argument, or by an environment variable.
   The Environment tab gives information about the environment in which the command will execute. Set the following on the Environment tab:
   • All Exit Codes Successful. By default WebSphere DataStage treats an exit code of 0 as successful and all others as errors. Select this check box to specify that all exit codes should be treated as successful other than those specified in the Failure Codes grid.
   • Exit Codes. The use of this depends on the setting of the All Exit Codes Successful check box. If All Exit Codes Successful is not selected, enter the codes in the Success Codes grid which will be taken as indicating successful completion. All others will be taken as indicating failure. If All Exit Codes Successful is selected, enter the exit codes in the Failure Codes grid which will be taken as indicating failure. All others will be taken as indicating success.
   • Environment. Specify environment variables and settings that the UNIX command requires in order to run.
7. When you have filled in the details in all the pages, click Generate to generate the stage.
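The exit code rules on the Environment tab amount to a small decision function. The following Python sketch shows one way to read them. The function command_succeeded and its parameter names are hypothetical, and the default of treating only 0 as success mirrors the default behavior described above rather than any documented grid contents.

```python
def command_succeeded(exit_code, all_exit_codes_successful=False,
                      success_codes=(0,), failure_codes=()):
    """Classify a wrapped command's exit code as success (True) or failure."""
    if all_exit_codes_successful:
        # Everything succeeds except the codes listed as failures.
        return exit_code not in failure_codes
    # Only the listed success codes succeed; everything else fails.
    return exit_code in success_codes


# grep exits with 1 when no lines match, which a job might want to accept.
grep_no_match_ok = command_succeeded(1, success_codes=(0, 1))
```

With the check box cleared and 0 and 1 listed as success codes, a grep that finds no matching lines would not fail the stage, while any other non-zero code still would.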

Example wrapped stage


This section shows you how to define a Wrapped stage called exhort, which runs the UNIX sort command in parallel. The stage sorts data in two files and outputs the results to a file. The incoming data has two columns, order number and code. The sort command sorts the data on the second field, code. You can optionally specify that the sort is run in reverse order. Wrapping the sort command in this way would be useful if you had a fixed sort operation that was likely to be needed in several jobs. Having it as an easily reusable stage would save having to configure a built-in sort stage every time you needed it. When included in a job and run, the stage will effectively call the sort command as follows:
sort -r -o outfile -k 2 infile1 infile2

The stage is defined in WebSphere DataStage using the Stage Type dialog box as follows:
1. First, general details are supplied in the General tab. The argument defining the second column as the key is included in the command because this does not vary.
2. The reverse order argument (-r) is included as a property because it is optional and might or might not be included when the stage is incorporated into a job.
3. The fact that the sort command expects two files as input is defined on the Input sub-tab on the Interfaces tab of the Wrapped page.
4. The fact that the sort command outputs to a file is defined on the Output sub-tab on the Interfaces tab of the Wrapped page.
   Note: When you use the stage in a job, make sure that you use table definitions compatible with the tables defined in the input and output sub-tabs.
5. Because all exit codes other than 0 are treated as errors, and because there are no special environment requirements for this command, you do not need to alter anything on the Environment tab of the Wrapped page. All that remains is to click Generate to build the stage.
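To make the split between fixed arguments and properties concrete, the following Python sketch assembles the command line that the example stage effectively runs. It is an illustration only: the function build_sort_command is hypothetical, the argument order simply follows the command line shown above, and the generated wrapper is not claimed to work this way internally.

```python
def build_sort_command(infiles, outfile, reverse=False):
    """Assemble the argument list for the wrapped sort command."""
    argv = ["sort"]
    if reverse:
        argv.append("-r")                 # optional property of the stage
    argv += ["-o", outfile, "-k", "2"]    # output file, then the fixed key argument
    argv += list(infiles)                 # the two input files
    return argv


command_line = " ".join(
    build_sort_command(["infile1", "infile2"], "outfile", reverse=True))
```

The key argument is always present because it is part of the Command field, whereas -r appears only when the property is set on a particular instance of the stage.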

Chapter 6. Environment Variables


These topics list the environment variables that are available for affecting the set up and operation of parallel jobs. There are many environment variables that affect the design and running of parallel jobs in WebSphere DataStage. Commonly used ones are exposed in the WebSphere DataStage Administrator client, and can be set or unset using the Administrator (see WebSphere DataStage Administrator Client Guide). There are additional environment variables, however. This topic describes all the environment variables that apply to parallel jobs. They can be set or unset as you would any other UNIX system variables, or you can add them to the User Defined section in the WebSphere DataStage Administrator environment variable tree.

The available environment variables are grouped according to function. They are summarized in the following table. The final section in this topic gives some guidance to setting the environment variables.

Category
    Environment Variable

Buffering
    APT_BUFFER_FREE_RUN
    APT_BUFFER_MAXIMUM_MEMORY
    APT_BUFFER_MAXIMUM_TIMEOUT
    APT_BUFFER_DISK_WRITE_INCREMENT
    APT_BUFFERING_POLICY
    APT_SHARED_MEMORY_BUFFERS
Building Custom Stages
    DS_OPERATOR_BUILDOP_DIR
    OSH_BUILDOP_CODE
    OSH_BUILDOP_HEADER
    OSH_BUILDOP_OBJECT
    OSH_BUILDOP_XLC_BIN
    OSH_CBUILDOP_XLC_BIN
Compiler
    APT_COMPILER
    APT_COMPILEOPT
    APT_LINKER
    APT_LINKOPT
DB2 Support
    APT_DB2INSTANCE_HOME
    APT_DB2READ_LOCK_TABLE
    APT_DBNAME
    APT_RDBMS_COMMIT_ROWS
    DB2DBDFT
Debugging
    APT_DEBUG_OPERATOR
    APT_DEBUG_MODULE_NAMES
    APT_DEBUG_PARTITION
    APT_DEBUG_SIGNALS
    APT_DEBUG_STEP
    APT_DEBUG_SUBPROC
    APT_EXECUTION_MODE
    APT_PM_DBX
    APT_PM_GDB
    APT_PM_SHOW_PIDS
    APT_PM_XLDB
    APT_PM_XTERM
    APT_SHOW_LIBLOAD
Decimal Support
    APT_DECIMAL_INTERM_PRECISION
    APT_DECIMAL_INTERM_SCALE
    APT_DECIMAL_INTERM_ROUND_MODE
Disk I/O
    APT_BUFFER_DISK_WRITE_INCREMENT
    APT_CONSISTENT_BUFFERIO_SIZE
    APT_EXPORT_FLUSH_COUNT
    APT_IO_MAP/APT_IO_NOMAP and APT_BUFFERIO_MAP/APT_BUFFERIO_NOMAP
    APT_PHYSICAL_DATASET_BLOCK_SIZE
General Job Administration
    APT_CHECKPOINT_DIR
    APT_CLOBBER_OUTPUT
    APT_CONFIG_FILE
    APT_DISABLE_COMBINATION
    APT_EXECUTION_MODE
    APT_ORCHHOME
    APT_STARTUP_SCRIPT
    APT_NO_STARTUP_SCRIPT
    APT_STARTUP_STATUS
    APT_THIN_SCORE
Job Monitoring
    APT_MONITOR_SIZE
    APT_MONITOR_TIME
    APT_NO_JOBMON
    APT_PERFORMANCE_DATA
Look Up support
    APT_LUTCREATE_NO_MMAP
Miscellaneous
    APT_COPY_TRANSFORM_OPERATOR
    APT_EBCDIC_VERSION
    APT_DATE_CENTURY_BREAK_YEAR
    APT_IMPEXP_ALLOW_ZERO_LENGTH_FIXED_NULL
    APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS
    APT_INSERT_COPY_BEFORE_MODIFY
    APT_OLD_BOUNDED_LENGTH
    APT_OPERATOR_REGISTRY_PATH
    APT_PM_NO_SHARED_MEMORY
    APT_PM_NO_NAMED_PIPES
    APT_PM_SOFT_KILL_WAIT
    APT_PM_STARTUP_CONCURRENCY
    APT_RECORD_COUNTS
    APT_SAVE_SCORE
    APT_SHOW_COMPONENT_CALLS
    APT_STACK_TRACE
    APT_WRITE_DS_VERSION
    OSH_PRELOAD_LIBS
Network
    APT_IO_MAXIMUM_OUTSTANDING
    APT_IOMGR_CONNECT_ATTEMPTS
    APT_PM_CONDUCTOR_HOSTNAME
    APT_PM_NO_TCPIP
    APT_PM_NODE_TIMEOUT
    APT_PM_SHOWRSH
    APT_PM_STARTUP_PORT
    APT_PM_USE_RSH_LOCALLY
    APT_RECVBUFSIZE
NLS
    APT_COLLATION_SEQUENCE
    APT_COLLATION_STRENGTH
    APT_ENGLISH_MESSAGES
    APT_IMPEXP_CHARSET
    APT_INPUT_CHARSET
    APT_OS_CHARSET
    APT_OUTPUT_CHARSET
    APT_STRING_CHARSET
Oracle Support
    APT_ORACLE_LOAD_DELIMITED
    APT_ORACLE_LOAD_OPTIONS
    APT_ORACLE_NO_OPS
    APT_ORACLE_PRESERVE_BLANKS
    APT_ORA_IGNORE_CONFIG_FILE_PARALLELISM
    APT_ORA_WRITE_FILES
    APT_ORAUPSERT_COMMIT_ROW_INTERVAL
    APT_ORAUPSERT_COMMIT_TIME_INTERVAL
Partitioning
    APT_NO_PART_INSERTION
    APT_PARTITION_COUNT
    APT_PARTITION_NUMBER
Reading and Writing Files
    APT_DELIMITED_READ_SIZE
    APT_FILE_IMPORT_BUFFER_SIZE
    APT_FILE_EXPORT_BUFFER_SIZE
    APT_IMPORT_PATTERN_USES_FILESET
    APT_MAX_DELIMITED_READ_SIZE
    APT_PREVIOUS_FINAL_DELIMITER_COMPATIBLE
    APT_STRING_PADCHAR
Reporting
    APT_DUMP_SCORE
    APT_ERROR_CONFIGURATION
    APT_MSG_FILELINE
    APT_PM_PLAYER_MEMORY
    APT_PM_PLAYER_TIMING
    APT_RECORD_COUNTS
    OSH_DUMP
    OSH_ECHO
    OSH_EXPLAIN
    OSH_PRINT_SCHEMAS
SAS Support
    APT_HASH_TO_SASHASH
    APT_NO_SASOUT_INSERT
    APT_NO_SAS_TRANSFORMS
    APT_SAS_ACCEPT_ERROR
    APT_SAS_CHARSET
    APT_SAS_CHARSET_ABORT
    APT_SAS_COMMAND
    APT_SASINT_COMMAND
    APT_SAS_DEBUG
    APT_SAS_DEBUG_IO
    APT_SAS_DEBUG_LEVEL
    APT_SAS_DEBUG_VERBOSE
    APT_SAS_NO_PSDS_USTRING
    APT_SAS_S_ARGUMENT
    APT_SAS_SCHEMASOURCE_DUMP
    APT_SAS_SHOW_INFO
    APT_SAS_TRUNCATION
Sorting
    APT_NO_SORT_INSERTION
    APT_SORT_INSERTION_CHECK_ONLY
Teradata Support
    APT_TERA_64K_BUFFERS
    APT_TERA_NO_ERR_CLEANUP
    APT_TERA_NO_PERM_CHECKS
    APT_TERA_NO_SQL_CONVERSION
    APT_TERA_SYNC_DATABASE
    APT_TERA_SYNC_USER
Transport Blocks
    APT_AUTO_TRANSPORT_BLOCK_SIZE
    APT_LATENCY_COEFFICIENT
    APT_DEFAULT_TRANSPORT_BLOCK_SIZE
    APT_MAX_TRANSPORT_BLOCK_SIZE/APT_MIN_TRANSPORT_BLOCK_SIZE

Buffering
These environment variables are all concerned with the buffering WebSphere DataStage performs on stage links to avoid deadlock situations. These settings can also be made on the Inputs page or Outputs page Advanced tab of the parallel stage editors.

APT_BUFFER_FREE_RUN
This environment variable is available in the WebSphere DataStage Administrator, under the Parallel category. It specifies how much of the available in-memory buffer to consume before the buffer resists accepting new data. This is expressed as a decimal representing the percentage of Maximum memory buffer size (for example, 0.5 is 50%). When the amount of data in the buffer is less than this value, new data is accepted automatically. When the data exceeds it, the buffer first tries to write some of the data it contains before accepting more. The default value is 50% of the Maximum memory buffer size. You can set it to greater than 100%, in which case the buffer continues to store data up to the indicated multiple of Maximum memory buffer size before writing to disk.
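The arithmetic behind this setting can be sketched in the shell. This is only an illustration: it assumes the documented defaults for APT_BUFFER_MAXIMUM_MEMORY and APT_BUFFER_FREE_RUN, and the name threshold is introduced here for the example and is not a product variable.

```shell
# Assumed values: the documented defaults for both variables.
APT_BUFFER_MAXIMUM_MEMORY=3145728   # 3 MB per buffer
APT_BUFFER_FREE_RUN=0.5             # 50% of the maximum memory buffer size
export APT_BUFFER_MAXIMUM_MEMORY APT_BUFFER_FREE_RUN

# threshold is a name used only in this sketch: the number of bytes the
# buffer accepts freely before it first tries to write data out.
threshold=$(awk -v max="$APT_BUFFER_MAXIMUM_MEMORY" -v run="$APT_BUFFER_FREE_RUN" \
    'BEGIN { printf "%d", max * run }')
echo "free-run threshold: $threshold bytes"
```

With these values the threshold is 1572864 bytes, that is, half of the 3 MB buffer.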

APT_BUFFER_MAXIMUM_MEMORY
Sets the default value of Maximum memory buffer size, which is the maximum amount of virtual memory, in bytes, used per buffer. The default value is 3145728 (3 MB).

APT_BUFFER_MAXIMUM_TIMEOUT
WebSphere DataStage buffering is self-tuning, which can theoretically lead to long delays between retries. This environment variable specifies the maximum wait, in seconds, before a retry. The default is 1.

APT_BUFFER_DISK_WRITE_INCREMENT
Sets the size, in bytes, of blocks of data being moved to/from disk by the buffering operator. The default is 1048576 (1 MB). Adjusting this value trades amount of disk access against throughput for small amounts of data. Increasing the block size reduces disk access, but might decrease performance when data is being read/written in smaller units. Decreasing the block size increases throughput, but might increase the amount of disk access.

APT_BUFFERING_POLICY
This environment variable is available in the WebSphere DataStage Administrator, under the Parallel category. Controls the buffering policy for all virtual data sets in all steps. The variable has the following settings:
• AUTOMATIC_BUFFERING (default). Buffer a data set only if necessary to prevent a data flow deadlock.
• FORCE_BUFFERING. Unconditionally buffer all virtual data sets. Note that this can slow down processing considerably.
• NO_BUFFERING. Do not buffer data sets. This setting can cause data flow deadlock if used inappropriately.

APT_SHARED_MEMORY_BUFFERS
Typically the number of shared memory buffers between two processes is fixed at 2. Setting this variable increases the number used. The likely result is an increase in both latency and performance, although neither is guaranteed. This setting can significantly increase memory use.

Building Custom Stages


These environment variables are concerned with the building of custom operators that form the basis of customized stages (as described in Specifying your own parallel stages).

DS_OPERATOR_BUILDOP_DIR
Identifies the directory in which generated buildops are created. By default this identifies a directory called buildop under the current project directory. If the directory is changed, the corresponding entry in APT_OPERATOR_REGISTRY_PATH needs to change to match the buildop folder.

OSH_BUILDOP_CODE
Identifies the directory into which buildop writes the generated .C file and build script. It defaults to the current working directory. The -C option of buildop overrides this setting.

OSH_BUILDOP_HEADER
Identifies the directory into which buildop writes the generated .h file. It defaults to the current working directory. The -H option of buildop overrides this setting.

OSH_BUILDOP_OBJECT
Identifies the directory into which buildop writes the dynamically loadable object file, whose extension is .so on Solaris, .sl on HP-UX, or .o on AIX. Defaults to the current working directory. The -O option of buildop overrides this setting.

OSH_BUILDOP_XLC_BIN
AIX only. Identifies the directory specifying the location of the shared library creation command used by buildop. On older AIX systems this defaults to /usr/lpp/xlC/bin/makeC++SharedLib_r for thread-safe compilation. On newer AIX systems it defaults to /usr/ibmcxx/bin/makeC++SharedLib_r. For non-thread-safe compilation, the default path is the same, but the name of the file is makeC++SharedLib.

OSH_CBUILDOP_XLC_BIN
AIX only. Identifies the directory specifying the location of the shared library creation command used by cbuildop. If this environment variable is not set, cbuildop checks the setting of OSH_BUILDOP_XLC_BIN for the path. On older AIX systems OSH_CBUILDOP_XLC_BIN defaults to /usr/lpp/xlC/bin/makeC++SharedLib_r for thread-safe compilation. On newer AIX systems it defaults to /usr/ibmcxx/bin/makeC++SharedLib_r. For non-thread-safe compilation, the default path is the same, but the name of the command is makeC++SharedLib.

Compiler
These environment variables specify details about the C++ compiler used by WebSphere DataStage in connection with parallel jobs.

APT_COMPILER
This environment variable is available in the WebSphere DataStage Administrator under the Parallel Compiler branch. Specifies the full path to the C++ compiler.

APT_COMPILEOPT
This environment variable is available in the WebSphere DataStage Administrator under the Parallel Compiler branch. Specifies extra options to be passed to the C++ compiler when it is invoked.

APT_LINKER
This environment variable is available in the WebSphere DataStage Administrator under the Parallel Compiler branch. Specifies the full path to the C++ linker.

APT_LINKOPT
This environment variable is available in the WebSphere DataStage Administrator under the Parallel Compiler branch. Specifies extra options to be passed to the C++ linker when it is invoked.

DB2 Support
These environment variables are concerned with setting up access to DB2 databases from WebSphere DataStage.

APT_DB2INSTANCE_HOME
Specifies the DB2 installation directory. This variable is set by WebSphere DataStage to values obtained from the DB2Server table, representing the currently selected DB2 server.

APT_DB2READ_LOCK_TABLE
If this variable is defined and the open option is not specified for the DB2 stage, WebSphere DataStage performs the following open command to lock the table:
lock table table_name in share mode

APT_DBNAME
Specifies the name of the database if you choose to leave out the Database option for DB2 stages. If APT_DBNAME is also not defined, DB2DBDFT is used to find the database name. These variables are set by WebSphere DataStage to values obtained from the DB2Server table, representing the currently selected DB2 server.

APT_RDBMS_COMMIT_ROWS
Specifies the number of records to insert into a data set between commits. The default value is 2048.

DB2DBDFT
For DB2 operators, you can set the name of the database by using the -dbname option or by setting APT_DBNAME. If you do not use either method, DB2DBDFT is used to find the database name. These variables are set by WebSphere DataStage to values obtained from the DB2Server table, representing the currently selected DB2 server.
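The order of precedence described above can be restated as a small shell sketch. The function resolve_db2_dbname and the database names SAMPLE and SALES are hypothetical and exist only for this illustration; the function's first argument stands in for the value of the -dbname option.

```shell
# resolve_db2_dbname is a hypothetical helper, not part of the product.
# $1 stands in for the -dbname option (empty when the option is not given).
resolve_db2_dbname() {
    if [ -n "${1-}" ]; then
        echo "$1"                 # the -dbname option wins
    elif [ -n "${APT_DBNAME-}" ]; then
        echo "$APT_DBNAME"        # otherwise APT_DBNAME is used
    else
        echo "${DB2DBDFT-}"       # otherwise fall back to DB2DBDFT
    fi
}

export DB2DBDFT=SAMPLE      # hypothetical database names
export APT_DBNAME=SALES
resolve_db2_dbname ""       # prints SALES
```

In normal use WebSphere DataStage sets these variables itself from the DB2Server table, so a sketch like this is mainly useful for reasoning about which name an operator will pick up.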

Debugging
These environment variables are concerned with the debugging of WebSphere DataStage parallel jobs.

APT_DEBUG_OPERATOR
Specifies the operators on which to start debuggers. If not set, no debuggers are started. If set to an operator number (as determined from the output of APT_DUMP_SCORE), debuggers are started for that single operator. If set to -1, debuggers are started for all operators.

APT_DEBUG_MODULE_NAMES
Specifies a list of module names, separated by white space, that identifies the modules to debug, that is, the modules in which internal IF_DEBUG statements are run. The subproc operator module (module name is subproc) is one example of a module that uses this facility.

APT_DEBUG_PARTITION
Specifies the partitions on which to start debuggers. One instance, or partition, of an operator is run on each node running the operator. If set to a single number, debuggers are started on that partition; if not set or set to -1, debuggers are started on all partitions. See the description of APT_DEBUG_OPERATOR for more information on using this environment variable. For example, setting APT_DEBUG_STEP to 0, APT_DEBUG_OPERATOR to 1, and APT_DEBUG_PARTITION to -1 starts debuggers for every partition of the second operator in the first step.
APT_DEBUG_OPERATOR   APT_DEBUG_PARTITION   Effect
not set              any value             no debugging
-1                   not set or -1         debug all partitions of all operators
-1                   >= 0                  debug a specific partition of all operators
>= 0                 -1                    debug all partitions of a specific operator
>= 0                 >= 0                  debug a specific partition of a specific operator
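The combinations in the table can be restated as a short shell sketch. The function debug_effect is a hypothetical helper written for this illustration; it only describes the documented effect and does not start any debugger.

```shell
# debug_effect is a hypothetical helper, not part of the product. It reads
# the two variables and reports the effect listed in the table.
debug_effect() {
    op=${APT_DEBUG_OPERATOR-}
    part=${APT_DEBUG_PARTITION:--1}   # not set is treated the same as -1
    if [ -z "$op" ]; then
        echo "no debugging"
        return
    fi
    if [ "$op" = "-1" ]; then ops="all operators"; else ops="operator $op"; fi
    if [ "$part" = "-1" ]; then parts="all partitions"; else parts="partition $part"; fi
    echo "debug $parts of $ops"
}

# The example from the text: every partition of the second operator
# (operator 1) in the first step (step 0).
export APT_DEBUG_STEP=0 APT_DEBUG_OPERATOR=1 APT_DEBUG_PARTITION=-1
debug_effect    # prints: debug all partitions of operator 1
```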

APT_DEBUG_SIGNALS
You can use the APT_DEBUG_SIGNALS environment variable to specify that signals such as SIGSEGV, SIGBUS, and so on, should cause a debugger to start. If any of these signals occurs within an APT_Operator::runLocally() function, a debugger such as dbx is invoked. Note that the UNIX and WebSphere DataStage variables DEBUGGER, DISPLAY, and APT_PM_XTERM, specifying a debugger and how the output should be displayed, must be set correctly.

APT_DEBUG_STEP
Specifies the steps on which to start debuggers. If not set or if set to -1, debuggers are started on the processes specified by APT_DEBUG_OPERATOR and APT_DEBUG_PARTITION in all steps. If set to a step number, debuggers are started for processes in the specified step.

APT_DEBUG_SUBPROC
Displays debug information about each subprocess operator.

APT_EXECUTION_MODE
This environment variable is available in the WebSphere DataStage Administrator under the Parallel branch. By default, the execution mode is parallel, with multiple processes. Set this variable to one of the following values to run an application in sequential execution mode:
• ONE_PROCESS: one-process mode
• MANY_PROCESS: many-process mode
• NO_SERIALIZE: many-process mode, without serialization

In ONE_PROCESS mode:
• The application executes in a single UNIX process. You need only run a single debugger session and can set breakpoints anywhere in your code.
• Data is partitioned according to the number of nodes defined in the configuration file.
• Each operator is run as a subroutine and is called the number of times appropriate for the number of partitions on which it must operate.

In MANY_PROCESS mode the framework forks a new process for each instance of each operator and waits for it to complete rather than calling operators as subroutines.

In both cases, the step is run entirely on the Conductor node rather than spread across the configuration.

NO_SERIALIZE mode is similar to MANY_PROCESS mode, but the WebSphere DataStage persistence mechanism is not used to load and save objects. Turning off persistence might be useful for tracking errors in derived C++ classes.
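As a minimal sketch, a debugging session in Korn shell syntax might select one-process mode as follows. Only the variable name and its values come from this guide; the surrounding workflow is assumed.

```shell
# Run the job sequentially in a single UNIX process so that one debugger
# session can reach every operator.
export APT_EXECUTION_MODE=ONE_PROCESS

# To return to the default parallel execution mode afterwards, unset it:
# unset APT_EXECUTION_MODE
```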

APT_PM_DBX
Set this environment variable to the path of your dbx debugger, if a debugger is not already included in your path. This variable sets the location; it does not run the debugger.

APT_PM_GDB
Linux only. Set this environment variable to the path of your gdb debugger, if a debugger is not already included in your path. This variable sets the location; it does not run the debugger.

APT_PM_LADEBUG
Tru64 only. Set this environment variable to the path of your ladebug debugger, if a debugger is not already included in your path. This variable sets the location; it does not run the debugger.

APT_PM_SHOW_PIDS
If this variable is set, players will output an informational message upon startup, displaying their process id.

APT_PM_XLDB
Set this environment variable to the path of your xldb debugger, if a debugger is not already included in your path. This variable sets the location; it does not run the debugger.

APT_PM_XTERM
If WebSphere DataStage invokes dbx, the debugger starts in an xterm window; this means WebSphere DataStage must know where to find the xterm program. The default location is /usr/bin/X11/xterm. You can override this default by setting the APT_PM_XTERM environment variable to the appropriate path. APT_PM_XTERM is ignored if you are using xldb.

APT_SHOW_LIBLOAD
If set, dumps a message to stdout every time a library is loaded. This can be useful for testing/verifying the right library is being loaded. Note that the message is output to stdout, NOT to the error log.

Decimal support

APT_DECIMAL_INTERM_PRECISION
Specifies the default maximum precision value for any decimal intermediate variables required in calculations. Default value is 38.

APT_DECIMAL_INTERM_SCALE
Specifies the default scale value for any decimal intermediate variables required in calculations. Default value is 10.

APT_DECIMAL_INTERM_ROUND_MODE
Specifies the default rounding mode for any decimal intermediate variables required in calculations. The default is round_inf.

Disk I/O
These environment variables are all concerned with when and how WebSphere DataStage parallel jobs write information to disk.

APT_BUFFER_DISK_WRITE_INCREMENT
For systems where small to medium bursts of I/O are not desirable, the default 1 MB disk write chunk size might be too small. APT_BUFFER_DISK_WRITE_INCREMENT controls this size and can be set larger than 1048576 (1 MB). The setting cannot exceed max_memory * 2/3.
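The upper limit can be worked out with shell arithmetic. This sketch assumes that max_memory refers to the maximum memory buffer size and uses its documented default of 3145728 bytes; the guide does not state this explicitly, so treat it as an assumption. The names max_memory and limit are used only in this example.

```shell
# Assumption: max_memory is the maximum memory buffer size, taken here at
# its documented default of 3145728 bytes (3 MB).
max_memory=3145728

# Upper limit for APT_BUFFER_DISK_WRITE_INCREMENT under that assumption.
limit=$((max_memory * 2 / 3))
echo "largest permitted increment: $limit bytes"

# Example: raise the increment from 1 MB to 2 MB, which equals the limit.
export APT_BUFFER_DISK_WRITE_INCREMENT=2097152
```

Under this assumption the limit is 2097152 bytes (2 MB), so a larger increment would also require a larger maximum memory buffer size.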

APT_CONSISTENT_BUFFERIO_SIZE
Some disk arrays have read ahead caches that are only effective when data is read repeatedly in like-sized chunks. Setting APT_CONSISTENT_BUFFERIO_SIZE=N will force stages to read data in chunks which are size N or a multiple of N.

APT_EXPORT_FLUSH_COUNT
Allows the export operator to flush data to disk more often than it typically does (data is explicitly flushed at the end of a job, although the OS might choose to do so more frequently). Set this variable to an integer which, in number of records, controls how often flushes should occur. Setting this value to a low number (such as 1) is useful for real time applications, but there is a small performance penalty associated with setting this to a low value.

APT_IO_MAP/APT_IO_NOMAP and APT_BUFFERIO_MAP/APT_BUFFERIO_NOMAP
In many cases memory mapped I/O contributes to improved performance. In certain situations, however, such as a remote disk mounted via NFS, memory mapped I/O might cause significant performance problems. Setting the environment variables APT_IO_NOMAP and APT_BUFFERIO_NOMAP to true turns off this feature, which can improve performance in those situations. (AIX and HP-UX default to NOMAP. Setting APT_IO_MAP and APT_BUFFERIO_MAP to true turns memory mapped I/O on for these platforms.)

APT_PHYSICAL_DATASET_BLOCK_SIZE
Specify the block size to use for reading and writing to a data set stage. The default is 128 KB.

General Job Administration


These environment variables are concerned with details about the running of WebSphere DataStage parallel jobs.

APT_CHECKPOINT_DIR
This environment variable is available in the WebSphere DataStage Administrator under the Parallel branch. By default, when running a job, WebSphere DataStage stores state information in the current working directory. Use APT_CHECKPOINT_DIR to specify another directory.

APT_CLOBBER_OUTPUT
This environment variable is available in the WebSphere DataStage Administrator under the Parallel branch. By default, if an output file or data set already exists, WebSphere DataStage issues an error and stops before overwriting it, notifying you of the name conflict. Setting this variable to any value permits WebSphere DataStage to overwrite existing files or data sets without a warning message.

APT_CONFIG_FILE
This environment variable is available in the WebSphere DataStage Administrator under the Parallel branch. Sets the path name of the configuration file. (You might want to include this as a job parameter, so that you can specify the configuration file at job run time).

APT_DISABLE_COMBINATION
This environment variable is available in the WebSphere DataStage Administrator under the Parallel branch. Globally disables operator combining. Operator combining is WebSphere DataStage's default behavior, in which two or more (in fact any number of) operators within a step are combined into one process where possible.

You might need to disable combining to facilitate debugging. Note that disabling combining generates more UNIX processes, and hence requires more system resources and memory. It also disables internal optimizations for job efficiency and run times.

APT_EXECUTION_MODE
This environment variable is available in the WebSphere DataStage Administrator under the Parallel branch. By default, the execution mode is parallel, with multiple processes. Set this variable to one of the following values to run an application in sequential execution mode:
• ONE_PROCESS: one-process mode
• MANY_PROCESS: many-process mode
• NO_SERIALIZE: many-process mode, without serialization

In ONE_PROCESS mode:
• The application executes in a single UNIX process. You need only run a single debugger session and can set breakpoints anywhere in your code.
• Data is partitioned according to the number of nodes defined in the configuration file.
• Each operator is run as a subroutine and is called the number of times appropriate for the number of partitions on which it must operate.

In MANY_PROCESS mode the framework forks a new process for each instance of each operator and waits for it to complete rather than calling operators as subroutines.

In both cases, the step is run entirely on the Conductor node rather than spread across the configuration.

NO_SERIALIZE mode is similar to MANY_PROCESS mode, but the WebSphere DataStage persistence mechanism is not used to load and save objects. Turning off persistence might be useful for tracking errors in derived C++ classes.

APT_ORCHHOME
Must be set by all WebSphere DataStage users to point to the top-level directory of the WebSphere DataStage parallel engine installation.

APT_STARTUP_SCRIPT
As part of running an application, WebSphere DataStage creates a remote shell on all WebSphere DataStage processing nodes on which the job runs. By default, the remote shell is given the same environment as the shell from which WebSphere DataStage is invoked. However, you can write an optional startup shell script to modify the shell configuration of one or more processing nodes. If a startup script exists, WebSphere DataStage runs it on remote shells before running your application. APT_STARTUP_SCRIPT specifies the script to be run. If it is not defined, WebSphere DataStage searches ./startup.apt, $APT_ORCHHOME/etc/startup.apt and $APT_ORCHHOME/etc/startup, in that order. APT_NO_STARTUP_SCRIPT disables running the startup script.
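The search order described above can be sketched as a shell function. The function find_startup_script is hypothetical and only mimics the documented order; it is not how the parallel engine is implemented, and it assumes APT_ORCHHOME is set as this guide requires.

```shell
# find_startup_script is a hypothetical helper that mimics the documented
# search order. It prints the script that would be used, or returns 1.
find_startup_script() {
    # APT_NO_STARTUP_SCRIPT, when set, disables the startup script.
    if [ -n "${APT_NO_STARTUP_SCRIPT+set}" ]; then
        return 1
    fi
    # An explicit APT_STARTUP_SCRIPT takes precedence over the defaults.
    if [ -n "${APT_STARTUP_SCRIPT-}" ]; then
        echo "$APT_STARTUP_SCRIPT"
        return 0
    fi
    # Otherwise search the three default locations, in order.
    for candidate in ./startup.apt \
                     "${APT_ORCHHOME-}/etc/startup.apt" \
                     "${APT_ORCHHOME-}/etc/startup"; do
        if [ -f "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

if script=$(find_startup_script); then
    echo "startup script: $script"
else
    echo "no startup script would be run"
fi
```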

APT_NO_STARTUP_SCRIPT
Prevents WebSphere DataStage from executing a startup script. By default, this variable is not set, and WebSphere DataStage runs the startup script. If this variable is set, WebSphere DataStage ignores the startup script. This might be useful when debugging a startup script. See also APT_STARTUP_SCRIPT.

APT_STARTUP_STATUS
Set this to cause messages to be generated as parallel job startup moves from phase to phase. This can be useful as a diagnostic if parallel job startup is failing.

APT_THIN_SCORE
Setting this variable decreases the memory usage of steps with 100 operator instances or more by a noticeable amount. To use this optimization, set APT_THIN_SCORE=1 in your environment. There are no performance benefits in setting this variable unless you are running out of real memory at some point in your flow or the additional memory is useful for sorting or buffering. This variable does not affect any specific operators which consume large amounts of memory, but improves general parallel job memory handling.

Job Monitoring
These environment variables are concerned with the Job Monitor on WebSphere DataStage.

APT_MONITOR_SIZE
This environment variable is available in the WebSphere DataStage Administrator under the Parallel branch. Determines the minimum number of records the WebSphere DataStage Job Monitor reports. The default is 5000 records.

APT_MONITOR_TIME
This environment variable is available in the WebSphere DataStage Administrator under the Parallel branch. Determines the minimum time interval in seconds for generating monitor information at runtime. The default is 5 seconds. This variable takes precedence over APT_MONITOR_SIZE.

APT_NO_JOBMON
Turn off job monitoring entirely.

APT_PERFORMANCE_DATA
Set this variable to turn on performance data output generation. APT_PERFORMANCE_DATA can be either set with no value, or be set to a valid path which will be used as the default location for performance data output.

Look up support

APT_LUTCREATE_MMAP
This is only valid on TRU64 systems. Set this to force lookup tables to be created using memory mapped files. By default on TRU64, lookup tables are created in memory allocated using malloc, for performance reasons. If malloced memory is not desirable, this variable can be used to switch over to memory mapped files.

APT_LUTCREATE_NO_MMAP
Set this to force lookup tables to be created using malloced memory. By default lookup table creation is done using memory mapped files. There might be situations, depending on the OS configuration or file system, where writing to memory mapped files causes poor performance. In these situations this variable can be set so that malloced memory is used, which should boost performance.

Miscellaneous

APT_COPY_TRANSFORM_OPERATOR
If set, distributes the shared object file of the sub-level transform operator and the shared object file of user-defined functions (not extern functions) via distribute-component in a non-NFS MPP.

APT_DATE_CENTURY_BREAK_YEAR
A four-digit year that marks the century to which two-digit dates belong. The default is 1900.

APT_EBCDIC_VERSION
Certain operators, including the import and export operators, support the ebcdic property specifying that field data is represented in the EBCDIC character set. The APT_EBCDIC_VERSION variable indicates the specific EBCDIC character set to use. Legal values are:
HP      Use the EBCDIC character set supported by HP terminals (this is the default setting, except on USS installations).
IBM     Use the EBCDIC character set supported by IBM 3780 terminals.
ATT     Use the EBCDIC character set supported by AT&T terminals.
USS     Use the IBM 1047 EBCDIC character set (this is the default setting on USS installations).
IBM037  Use the IBM 037 EBCDIC character set.
IBM500  Use the IBM 500 EBCDIC character set.
If the value of the variable is HP, IBM, ATT, or USS, then EBCDIC data is internally converted to/from 7-bit ASCII. If the value is IBM037 or IBM500, internal conversion is between EBCDIC and ISO-8859-1 (the 8-bit Latin-1 superset of ASCII, with accented character support).

APT_IMPEXP_ALLOW_ZERO_LENGTH_FIXED_NULL
When set, allows zero length null_field value with fixed length fields. This should be used with care as poorly formatted data will cause incorrect results. By default a zero length null_field value will cause an error.

APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS
When set, WebSphere DataStage will reject any string or ustring fields read that go over their fixed size. By default these records are truncated.

APT_INSERT_COPY_BEFORE_MODIFY
When defined, turns on automatic insertion of a copy operator before any modify operator (WARNING: if this variable is not set and the operator immediately preceding modify in the data flow uses a modify adapter, the modify operator will be removed from the data flow). Only set this if you write your own custom operators AND use modify within those operators.

APT_OLD_BOUNDED_LENGTH
Some parallel data sets generated with WebSphere DataStage 7.0.1 and later releases require more disk space than in 7.0 when the columns are of type VarChar. This is due to changes added for performance improvements for bounded length VarChars in 7.0.1. Set APT_OLD_BOUNDED_LENGTH to any value to revert to pre-7.0.1 storage behavior when using bounded length VarChars. Setting this variable can have adverse performance effects. The preferred and more performant solution is to use unbounded length VarChars (do not set any length) for columns where the maximum length is rarely used, rather than set this environment variable.

APT_OPERATOR_REGISTRY_PATH
Used to locate operator .apt files, which define what operators are available and which libraries they are found in.

APT_PM_NO_SHARED_MEMORY
By default, shared memory is used for local connections. If this variable is set, named pipes rather than shared memory are used for local connections. If both APT_PM_NO_NAMED_PIPES and APT_PM_NO_SHARED_MEMORY are set, then TCP sockets are used for local connections.

APT_PM_NO_NAMED_PIPES
Specifies not to use named pipes for local connections. Named pipes will still be used in other areas of WebSphere DataStage, including subprocs and setting up of the shared memory transport protocol in the process manager.
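The way these two variables select the transport for local connections can be restated as a shell sketch. The function local_transport is a hypothetical helper written for this illustration; it only reports which transport the rules above imply.

```shell
# local_transport is a hypothetical helper, not part of the product.
local_transport() {
    if [ -z "${APT_PM_NO_SHARED_MEMORY+set}" ]; then
        echo "shared memory"      # the default for local connections
    elif [ -z "${APT_PM_NO_NAMED_PIPES+set}" ]; then
        echo "named pipes"        # shared memory disabled
    else
        echo "TCP sockets"        # both shared memory and named pipes disabled
    fi
}

export APT_PM_NO_SHARED_MEMORY=1
local_transport    # named pipes, provided APT_PM_NO_NAMED_PIPES is not set
```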

APT_PM_SOFT_KILL_WAIT
Specifies the delay between SIGINT and SIGKILL during abnormal job shutdown. This gives processes time to run cleanups if they catch SIGINT. The default is zero.

APT_PM_STARTUP_CONCURRENCY
Setting this to a small integer determines the number of simultaneous section leader startups to be allowed. Setting this to 1 forces sequential startup. The default is defined by SOMAXCONN in sys/socket.h (currently 5 for Solaris, 10 for AIX).

APT_RECORD_COUNTS
Causes WebSphere DataStage to print, for each operator Player, the number of records consumed by getRecord() and produced by putRecord(). Abandoned input records are not necessarily accounted for. Buffer operators do not print this information.

APT_SAVE_SCORE
Sets the name and path of the file used by the performance monitor to hold temporary score data. The path must be visible from the host machine. The performance monitor creates this file, therefore it need not exist when you set this variable.

APT_SHOW_COMPONENT_CALLS
This forces WebSphere DataStage to display messages at job check time as to which user overloadable functions (such as checkConfig and describeOperator) are being called. This will not produce output at runtime and is not guaranteed to be a complete list of all user-overloadable functions being called, but an effort is made to keep this synchronized with any new virtual functions provided.

APT_STACK_TRACE
This variable controls the number of lines printed for stack traces. The values are:
• unset. 10 lines printed
• 0. infinite lines printed
• N. N lines printed
• none. no stack trace
The last setting can be used to disable stack traces entirely.
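The four cases can be restated as a shell sketch. The function stack_trace_setting is a hypothetical helper for this illustration and simply reports what the current value of APT_STACK_TRACE means according to the list above.

```shell
# stack_trace_setting is a hypothetical helper, not part of the product.
stack_trace_setting() {
    if [ -z "${APT_STACK_TRACE+set}" ]; then
        echo "10 lines"           # variable not set: the default
        return
    fi
    case "$APT_STACK_TRACE" in
        0)    echo "unlimited lines" ;;
        none) echo "no stack trace" ;;
        *)    echo "$APT_STACK_TRACE lines" ;;
    esac
}

export APT_STACK_TRACE=none   # disable stack traces entirely
stack_trace_setting           # prints: no stack trace
```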

APT_WRITE_DS_VERSION
By default, WebSphere DataStage saves data sets in the Orchestrate Version 4.1 format. APT_WRITE_DS_VERSION lets you save data sets in formats compatible with previous versions of Orchestrate. The values of APT_WRITE_DS_VERSION are:
• v3_0. Orchestrate Version 3.0
• v3. Orchestrate Version 3.1.x
• v4. Orchestrate Version 4.0
• v4_0_3. Orchestrate Version 4.0.3 and later versions up to but not including Version 4.1
• v4_1. Orchestrate Version 4.1 and later versions through and including Version 4.6

OSH_PRELOAD_LIBS
Specifies a colon-separated list of names of libraries to be loaded before any other processing. Libraries containing custom operators must be assigned to this variable or they must be registered. For example, in Korn shell syntax:
$ export OSH_PRELOAD_LIBS="orchlib1:orchlib2:mylib1"

Network
These environment variables are concerned with the operation of WebSphere DataStage parallel jobs over a network.

APT_IO_MAXIMUM_OUTSTANDING
Sets the amount of memory, in bytes, allocated to a WebSphere DataStage job on every physical node for network communications. The default value is 2097152 (2 MB). When you are executing many partitions on a single physical node, this number might need to be increased.

APT_IOMGR_CONNECT_ATTEMPTS
Sets the number of attempts for a TCP connect in case of a connection failure. This is necessary only for jobs with a high degree of parallelism in an MPP environment. The default value is 2 attempts (1 retry after an initial failure).

APT_PM_CONDUCTOR_HOSTNAME
The network name of the processing node from which you invoke a job should be included in the configuration file as either a node or a fastname. If the network name is not included in the configuration file, WebSphere DataStage users must set the environment variable APT_PM_CONDUCTOR_HOSTNAME to the name of the node invoking the WebSphere DataStage job.

APT_PM_NO_TCPIP
This turns off use of UNIX sockets to communicate between player processes at runtime. If the job is being run in an MPP (non-shared memory) environment, do not set this variable, as UNIX sockets are your only communications option.

APT_PM_NODE_TIMEOUT
This controls the number of seconds that the conductor will wait for a section leader to start up and load a score before deciding that something has failed. The default for starting a section leader process is 30. The default for loading a score is 120.

APT_PM_SHOWRSH
Displays a trace message for every call to RSH.

APT_PM_STARTUP_PORT
Use this environment variable to specify the port number from which the parallel engine will start looking for TCP/IP ports. By default, WebSphere DataStage starts looking at port 10000. If you know that ports in this range are used by another application, set APT_PM_STARTUP_PORT to start at a different port number. You should check the /etc/services file for reserved ports.

APT_PM_USE_RSH_LOCALLY
If set, startup will use rsh even on the conductor node.

NLS Support
These environment variables are concerned with WebSphere DataStage's implementation of NLS.

Note: You should not change the settings of any of these environment variables other than APT_COLLATION_STRENGTH if NLS is enabled on your server.

APT_COLLATION_SEQUENCE
This variable is used to specify the global collation sequence to be used by sorts, compares, and so on. This value can be overridden at the stage level.

APT_COLLATION_STRENGTH
Set this to define the specifics of the collation algorithm. This can be used to ignore accents, punctuation, or other details. It is set to one of Identical, Primary, Secondary, Tertiary, or Quartenary. Setting it to Default unsets the environment variable. For more information, see http://oss.software.ibm.com/icu/userguide/Collate_Concepts.html

APT_ENGLISH_MESSAGES
If set to 1, outputs every message issued with its English equivalent.

APT_IMPEXP_CHARSET
Controls the character encoding of ustring data imported and exported to and from WebSphere DataStage, and the record and field properties applied to ustring fields. Its syntax is:
APT_IMPEXP_CHARSET icu_character_set

APT_INPUT_CHARSET
Controls the character encoding of data input to schema and configuration files. Its syntax is:
APT_INPUT_CHARSET icu_character_set

APT_OS_CHARSET
Controls the character encoding WebSphere DataStage uses for operating system data such as the names of created files and the parameters to system calls. Its syntax is:
APT_OS_CHARSET icu_character_set

APT_OUTPUT_CHARSET
Controls the character encoding of WebSphere DataStage output messages and operators like peek that use the error logging system to output data input to the osh parser. Its syntax is:
APT_OUTPUT_CHARSET icu_character_set

APT_STRING_CHARSET
Controls the character encoding WebSphere DataStage uses when performing automatic conversions between string and ustring fields. Its syntax is:
APT_STRING_CHARSET icu_character_set


Oracle Support
These environment variables are concerned with the interaction between WebSphere DataStage and Oracle databases.

APT_ORACLE_LOAD_DELIMITED
If this is defined, the orawrite operator creates delimited records when loading into Oracle sqlldr. This method preserves leading and trailing blanks within string fields (VARCHARS in the database). The value of this variable is used as the delimiter. If this is defined without a value, the default delimiter is a comma. Note that you cannot load a string which has embedded double quotes if you use this.

APT_ORACLE_LOAD_OPTIONS
You can use the environment variable APT_ORACLE_LOAD_OPTIONS to control the options that are included in the Oracle load control file. You can load a table with indexes without using the Index Mode or Disable Constraints properties by setting the APT_ORACLE_LOAD_OPTIONS environment variable appropriately. You need to set the DIRECT option or the PARALLEL option to FALSE, for example:
APT_ORACLE_LOAD_OPTIONS=OPTIONS(DIRECT=FALSE,PARALLEL=TRUE)

In this example the stage would still run in parallel. However, since DIRECT is set to FALSE, the conventional path mode rather than the direct path mode would be used. If loading index-organized tables (IOTs), you should not set both DIRECT and PARALLEL to true, as direct parallel path load is not allowed for IOTs.
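As an illustration only, the following Python sketch composes a value in the same form as the APT_ORACLE_LOAD_OPTIONS example and rejects the one combination the text rules out for IOTs. The function name and its parameters are hypothetical helpers, not part of WebSphere DataStage.

```python
def oracle_load_options(direct, parallel, index_organized=False):
    """Build an OPTIONS(...) string in the form shown in the example above.

    Hypothetical helper: only the DIRECT and PARALLEL options are handled.
    """
    if index_organized and direct and parallel:
        # Direct parallel path load is not allowed for index-organized tables.
        raise ValueError("DIRECT and PARALLEL must not both be TRUE for IOTs")

    def flag(value):
        return "TRUE" if value else "FALSE"

    return f"OPTIONS(DIRECT={flag(direct)},PARALLEL={flag(parallel)})"
```

Calling oracle_load_options(False, True) reproduces the value used in the example, which you would then assign to APT_ORACLE_LOAD_OPTIONS.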

APT_ORACLE_NO_OPS
Set this if you do not have Oracle Parallel Server installed on an AIX system. It disables the OPS checking mechanism on the Oracle Enterprise stage.

APT_ORACLE_PRESERVE_BLANKS
Set this to set the PRESERVE BLANKS option in the control file. This preserves leading and trailing spaces. When PRESERVE BLANKS is not set Oracle removes the spaces and considers fields with only spaces to be NULL values.

APT_ORA_IGNORE_CONFIG_FILE_PARALLELISM
By default, WebSphere DataStage determines the number of processing nodes available for a parallel write to Oracle from the configuration file. Set APT_ORA_IGNORE_CONFIG_FILE_PARALLELISM to use the number of data files in the destination table's tablespace instead.

APT_ORA_WRITE_FILES
Set this to prevent the invocation of the Oracle loader when write mode is selected on an Oracle Enterprise destination stage. Instead, the sqlldr commands are written to a file, the name of which is specified by this environment variable. The file can be invoked once the job has finished to run the loaders sequentially. This can be useful in tracking down export and pipe-safety issues related to the loader.


APT_ORAUPSERT_COMMIT_ROW_INTERVAL APT_ORAUPSERT_COMMIT_TIME_INTERVAL
These two environment variables work together to specify how often target rows are committed when using the Upsert method to write to Oracle. Commits are made whenever the time interval period has passed or the row interval is reached, whichever comes first. By default, commits are made every 2 seconds or 5000 rows.
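The "whichever comes first" rule can be pictured with a small Python sketch. It assumes the defaults quoted above (5000 rows, 2 seconds); the function and parameter names are hypothetical and do not correspond to any engine API.

```python
def should_commit(rows_pending, seconds_elapsed, row_interval=5000, time_interval=2.0):
    """Return True when either commit threshold has been reached.

    row_interval stands in for APT_ORAUPSERT_COMMIT_ROW_INTERVAL and
    time_interval for APT_ORAUPSERT_COMMIT_TIME_INTERVAL (illustrative names).
    """
    return rows_pending >= row_interval or seconds_elapsed >= time_interval
```

With the defaults, a slow stream of rows is committed on the time threshold, while a fast stream is committed on the row threshold.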

Partitioning
The following environment variables are concerned with how WebSphere DataStage automatically partitions data.

APT_NO_PART_INSERTION
WebSphere DataStage automatically inserts partition components in your application to optimize the performance of the stages in your job. Set this variable to prevent this automatic insertion.

APT_PARTITION_COUNT
Read only. WebSphere DataStage sets this environment variable to the number of partitions of a stage. The number is based both on information listed in the configuration file and on any constraints applied to the stage. The number of partitions is the degree of parallelism of a stage. For example, if a stage executes on two processing nodes, APT_PARTITION_COUNT is set to 2. You can access the environment variable APT_PARTITION_COUNT to determine the number of partitions of the stage from within:
- an operator wrapper
- a shell script called from a wrapper
- getenv() in C++ code
- sysget() in the SAS language.

APT_PARTITION_NUMBER
Read only. On each partition, WebSphere DataStage sets this environment variable to the index number (0, 1, ...) of this partition within the stage. A subprocess can then examine this variable when determining which partition of an input file it should handle.
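As a rough sketch of how a subprocess might use these two read-only variables, the Python below reads them from the environment and keeps only the input lines that belong to its own partition. The round-robin split and the function names are assumptions made for illustration; the fallback to a single partition when the variables are absent is also an assumption.

```python
import os


def partition_info(env=None):
    """Return (partition_number, partition_count) read from the environment."""
    env = os.environ if env is None else env
    # Assumption: outside the engine the variables are absent, so act as one partition.
    count = int(env.get("APT_PARTITION_COUNT", "1"))
    number = int(env.get("APT_PARTITION_NUMBER", "0"))
    return number, count


def lines_for_partition(lines, env=None):
    """Select the lines this partition should handle (illustrative round-robin split)."""
    number, count = partition_info(env)
    return [line for index, line in enumerate(lines) if index % count == number]
```

A wrapper script running on partition 1 of 2 would, under this scheme, process every second line starting from the second one.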

Reading and writing files


These environment variables are concerned with reading and writing files.

APT_DELIMITED_READ_SIZE
By default, WebSphere DataStage reads ahead 500 bytes to get the next delimiter. For streaming inputs (socket, FIFO, and so on) this is sub-optimal, since WebSphere DataStage might block (and not output any records). When this variable is set to N, WebSphere DataStage reads N bytes (the minimum legal value is 2) instead of 500 when reading a delimited record. If a delimiter is not available within N bytes, N is multiplied by 2 (when this environment variable is not set, the factor is 4).


APT_FILE_IMPORT_BUFFER_SIZE
The value in kilobytes of the buffer for reading in files. The default is 128 (that is, 128 KB). It can be set to values from 8 upward, but is clamped to a minimum value of 8. That is, if you set it to a value less than 8, then 8 is used. Tune this upward for long-latency files (typically from heavily loaded file servers).

APT_FILE_EXPORT_BUFFER_SIZE
The value in kilobytes of the buffer for writing to files. The default is 128 (that is, 128 KB). It can be set to values from 8 upward, but is clamped to a minimum value of 8. That is, if you set it to a value less than 8, then 8 is used. Tune this upward for long-latency files (typically from heavily loaded file servers).

APT_IMPORT_PATTERN_USES_FILESET
When this is set, WebSphere DataStage will turn any file pattern into a fileset before processing the files. This allows the files to be processed in parallel as opposed to sequentially. By default, a file pattern concatenates (cats) the files together to be used as the input.

APT_MAX_DELIMITED_READ_SIZE
By default, when reading, WebSphere DataStage will read ahead 500 bytes to get the next delimiter. If it is not found, WebSphere DataStage looks ahead 4*500=2000 (1500 more) bytes, and so on (4X) up to 100,000 bytes. This variable controls the upper bound which is by default 100,000 bytes. Note that this variable should be used instead of APT_DELIMITED_READ_SIZE when a larger than 500 bytes read-ahead is desired.
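The growth of the read-ahead window described above can be sketched in Python as follows. This is only a model of the stated numbers (start at 500, multiply by 4, stop at 100,000); the function name is hypothetical, and clamping the final step to the upper bound is an assumption about how the limit is applied.

```python
def read_ahead_sizes(initial=500, factor=4, maximum=100_000):
    """List the successive look-ahead sizes tried while searching for a delimiter."""
    if initial <= 0 or factor <= 1:
        raise ValueError("initial must be positive and factor greater than 1")
    sizes = []
    size = initial
    while size < maximum:
        sizes.append(size)
        size *= factor
    # Assumption: the last attempt is clamped to the configured upper bound.
    sizes.append(maximum)
    return sizes
```

With the defaults this gives 500, 2000, 8000, 32000 and then 100000; raising the maximum models the effect of setting APT_MAX_DELIMITED_READ_SIZE.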

APT_PREVIOUS_FINAL_DELIMITER_COMPATIBLE
Set this to revert to the pre-release 7.5 behavior of the final delimiter whereby, when writing data, a space is inserted after every field in a record including the last one. (The new behavior is that a space is written after every field except the last one.)

APT_STRING_PADCHAR
Overrides the pad character of 0x0 (ASCII null), used by default when WebSphere DataStage extends, or pads, a string field to a fixed length.

Reporting
These environment variables are concerned with various aspects of WebSphere DataStage jobs reporting their progress.

APT_DUMP_SCORE
This environment variable is available in the WebSphere DataStage Administrator under the Parallel Reporting branch. It configures WebSphere DataStage to print a report showing the operators, processes, and data sets in a running job.

APT_ERROR_CONFIGURATION
Controls the format of WebSphere DataStage output messages. Note: Changing these settings can seriously interfere with WebSphere DataStage logging.


This variable's value is a comma-separated list of keywords (see table below). Each keyword enables a corresponding portion of the message. To disable that portion of the message, precede it with a !. Default formats of messages displayed by WebSphere DataStage include the keywords severity, moduleId, errorIndex, timestamp, opid, and message. The following table lists keywords, the length (in characters) of the associated components in the message, and the keyword's meaning. The characters ## precede all messages. The keyword lengthprefix appears in three locations in the table. This single keyword controls the display of all length prefixes.
| Keyword | Length | Meaning |
| --- | --- | --- |
| severity | 1 | Severity indication: F, E, W, or I. |
| vseverity | 7 | Verbose description of error severity (Fatal, Error, Warning, Information). |
| jobid | 3 | The job identifier of the job. This allows you to identify multiple jobs running at once. The default job identifier is 0. |
| moduleId | 4 | The module identifier. For WebSphere DataStage-defined messages, this value is a four byte string beginning with T. For user-defined messages written to the error log, this string is USER. For all outputs from a subprocess, the string is USBP. |
| errorIndex | 6 | The index of the message specified at the time the message was written to the error log. |
| timestamp | 13 | The message time stamp. This component consists of the string HH:MM:SS(SEQ), at the time the message was written to the error log. Messages generated in the same second have ordered sequence numbers. |
| ipaddr | 15 | The IP address of the processing node generating the message. This 15-character string is in octet form, with individual octets zero filled, for example, 104.032.007.100. |
| lengthprefix | 2 | Length in bytes of the following field. |
| nodename | variable | The node name of the processing node generating the message. |
| lengthprefix | 2 | Length in bytes of the following field. |
| opid | variable | The string <main_program> for error messages originating in your main program (outside of a step or within the APT_Operator::describeOperator() override). The string <node_nodename> representing system error messages originating on a node, where nodename is the name of the node. The operator originator identifier, represented by ident, partition_number, for errors originating within a step. This component identifies the instance of the operator that generated the message. ident is the operator name (with the operator index in parenthesis if there is more than one instance of it). partition_number defines the partition number of the operator issuing the message. |
| lengthprefix | 5 | Length, in bytes, of the following field. Maximum length is 15 KB. |
| message | variable | Error text. |
| | 1 | Newline character. |
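To make the keyword syntax concrete, here is a minimal Python sketch that assembles a comma-separated value with ! in front of disabled keywords. The helper name is hypothetical, and joining the entries without spaces is an assumption; the text above only states that the list is comma-separated.

```python
KEYWORDS = (
    "severity", "vseverity", "jobid", "moduleId", "errorIndex", "timestamp",
    "ipaddr", "lengthprefix", "nodename", "opid", "message",
)


def error_configuration(enable=(), disable=()):
    """Compose a value for APT_ERROR_CONFIGURATION (illustrative helper)."""
    for keyword in list(enable) + list(disable):
        if keyword not in KEYWORDS:
            raise ValueError(f"unknown keyword: {keyword}")
    # Enabled keywords are listed as-is; disabled keywords are prefixed with '!'.
    parts = list(enable) + ["!" + keyword for keyword in disable]
    return ",".join(parts)
```

For example, error_configuration(["severity", "message"], ["timestamp"]) yields severity,message,!timestamp. Bear in mind the note above that changing these settings can interfere with logging.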

APT_MSG_FILELINE
This environment variable is available in the WebSphere DataStage Administrator under the Parallel Reporting branch. Set this to have WebSphere DataStage log extra internal information for parallel jobs.

APT_PM_PLAYER_MEMORY
This environment variable is available in the WebSphere DataStage Administrator under the Parallel Reporting branch. Setting this variable causes each player process to report the process heap memory allocation in the job log when returning.

APT_PM_PLAYER_TIMING
This environment variable is available in the WebSphere DataStage Administrator under the Parallel Reporting branch. Setting this variable causes each player process to report its call and return in the job log. The message with the return is annotated with CPU times for the player process.

APT_RECORD_COUNTS
This environment variable is available in the WebSphere DataStage Administrator under the Parallel Reporting branch. Causes WebSphere DataStage to print to the job log, for each operator player, the number of records input and output. Abandoned input records are not necessarily accounted for. Buffer operators do not print this information.


OSH_DUMP
This environment variable is available in the WebSphere DataStage Administrator under the Parallel Reporting branch. If set, it causes WebSphere DataStage to put a verbose description of a job in the job log before attempting to execute it.

OSH_ECHO
This environment variable is available in the WebSphere DataStage Administrator under the Parallel Reporting branch. If set, it causes WebSphere DataStage to echo its job specification to the job log after the shell has expanded all arguments.

OSH_EXPLAIN
This environment variable is available in the WebSphere DataStage Administrator under the Parallel Reporting branch. If set, it causes WebSphere DataStage to place a terse description of the job in the job log before attempting to run it.

OSH_PRINT_SCHEMAS
This environment variable is available in the WebSphere DataStage Administrator under the Parallel Reporting branch. If set, it causes WebSphere DataStage to print the record schema of all data sets and the interface schema of all operators in the job log.

SAS Support
These environment variables are concerned with WebSphere DataStage interaction with SAS.

APT_HASH_TO_SASHASH
The WebSphere DataStage hash partitioner contains support for hashing SAS data. In addition, WebSphere DataStage provides the sashash partitioner, which uses an alternative non-standard hashing algorithm. Setting the APT_HASH_TO_SASHASH environment variable causes all appropriate instances of hash to be replaced by sashash. If the APT_NO_SAS_TRANSFORMS environment variable is set, APT_HASH_TO_SASHASH has no effect.

APT_NO_SASOUT_INSERT
This variable selectively disables the sasout operator insertions. It maintains the other SAS-specific transformations.

APT_NO_SAS_TRANSFORMS
WebSphere DataStage automatically performs certain types of SAS-specific component transformations, such as inserting an sasout operator and substituting sasRoundRobin for RoundRobin. Setting the APT_NO_SAS_TRANSFORMS variable prevents WebSphere DataStage from making these transformations.

APT_SAS_ACCEPT_ERROR
When a SAS procedure causes SAS to exit with an error, this variable prevents the SAS-interface operator from terminating. The default behavior is for WebSphere DataStage to terminate the operator with an error.

APT_SAS_CHARSET
When the -sas_cs option of a SAS-interface operator is not set and a SAS-interface operator encounters a ustring, WebSphere DataStage interrogates this variable to determine what character set to use. If this variable is not set, but APT_SAS_CHARSET_ABORT is set, the operator will abort; otherwise the -impexp_charset option or the APT_IMPEXP_CHARSET environment variable is accessed. Its syntax is:
APT_SAS_CHARSET icu_character_set | SAS_DBCSLANG

APT_SAS_CHARSET_ABORT
Causes a SAS-interface operator to abort if WebSphere DataStage encounters a ustring in the schema and neither the -sas_cs option nor the APT_SAS_CHARSET environment variable is set.
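The lookup order described for APT_SAS_CHARSET and APT_SAS_CHARSET_ABORT can be summarized in a short Python sketch. The function and argument names are hypothetical, and trying the -impexp_charset option before APT_IMPEXP_CHARSET is an assumption, since the text does not state which of the two takes precedence.

```python
import os


def resolve_sas_charset(sas_cs=None, impexp_charset=None, env=None):
    """Pick the character set used when a SAS-interface operator meets a ustring."""
    env = os.environ if env is None else env
    if sas_cs:
        # The -sas_cs option, when set, is used directly.
        return sas_cs
    if env.get("APT_SAS_CHARSET"):
        return env["APT_SAS_CHARSET"]
    if "APT_SAS_CHARSET_ABORT" in env:
        # Neither -sas_cs nor APT_SAS_CHARSET is set, so the operator aborts.
        raise RuntimeError("no SAS character set configured")
    # Assumption: the operator option is consulted before the environment variable.
    return impexp_charset or env.get("APT_IMPEXP_CHARSET")
```

Passing an explicit env dictionary, as in resolve_sas_charset(env={"APT_SAS_CHARSET": "UTF-8"}), makes the behavior easy to check without touching the real environment.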

APT_SAS_COMMAND
Overrides the $PATH directory for SAS with an absolute path to the basic SAS executable. An example path is:
/usr/local/sas/sas8.2/sas

APT_SASINT_COMMAND
Overrides the $PATH directory for SAS with an absolute path to the International SAS executable. An example path is:
/usr/local/sas/sas8.2int/dbcs/sas

APT_SAS_DEBUG
Set this to enable debugging in the SAS process coupled to the SAS stage. Messages appear in the SAS log, which might then be copied into the WebSphere DataStage log. Use APT_SAS_DEBUG=1, APT_SAS_DEBUG_IO=1, and APT_SAS_DEBUG_VERBOSE=1 to get all debug messages.

APT_SAS_DEBUG_IO
Set this to enable input/output debugging in the SAS process coupled to the SAS stage. Messages appear in the SAS log, which might then be copied into the WebSphere DataStage log.

APT_SAS_DEBUG_LEVEL
Its syntax is:
APT_SAS_DEBUG_LEVEL=[0-3]

Specifies the level of debugging messages to output from the SAS driver. The values of 1, 2, and 3 duplicate the output for the -debug option of the SAS operator: no, yes, and verbose.

APT_SAS_DEBUG_VERBOSE
Set this to enable verbose debugging in the SAS process coupled to the SAS stage. Messages appear in the SAS log, which might then be copied into the WebSphere DataStage log.


APT_SAS_NO_PSDS_USTRING
Set this to prevent WebSphere DataStage from automatically converting SAS char types to ustrings in a SAS parallel data set.

APT_SAS_S_ARGUMENT
By default, WebSphere DataStage executes SAS with -s 0. When this variable is set, its value is used instead of 0. Consult the SAS documentation for details.

APT_SAS_SCHEMASOURCE_DUMP
When using SAS Schema Source, causes the command line to be written to the log when executing SAS. You use it to inspect the data contained in a -schemaSource. Set this if you are getting an error when specifying the SAS data set containing the schema source.

APT_SAS_SHOW_INFO
Displays the standard SAS output from an import or export transaction. The SAS output is normally deleted since a transaction is usually successful.

APT_SAS_TRUNCATION
Its syntax is:
APT_SAS_TRUNCATION ABORT | NULL | TRUNCATE

Because a ustring of n characters does not fit into n characters of a SAS char value, the ustring value must be truncated beyond the space pad characters and \0. The sasin and sas operators use this variable to determine how to truncate a ustring value to fit into a SAS char field. TRUNCATE, which is the default, causes the ustring to be truncated; ABORT causes the operator to abort; and NULL exports a null field. For NULL and TRUNCATE, the first five occurrences for each column cause an information message to be issued to the log.
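The three modes can be illustrated with a simplified Python model in which the SAS char field is treated as a fixed number of bytes and the ustring is encoded before it is written. The function name, the UTF-8 encoding, and the byte-width model are all assumptions made for the sketch; the sasin and sas operators may handle padding and encoding differently.

```python
def fit_ustring(value, width, mode="TRUNCATE", encoding="utf-8"):
    """Fit a Unicode string into a field of `width` bytes using one of three modes."""
    data = value.encode(encoding)
    if len(data) <= width:
        return value
    if mode == "ABORT":
        raise ValueError(f"value needs {len(data)} bytes but the field holds {width}")
    if mode == "NULL":
        return None
    if mode == "TRUNCATE":
        # Drop any partial multi-byte character left at the cut point.
        return data[:width].decode(encoding, errors="ignore")
    raise ValueError(f"unknown mode: {mode}")
```

A five-character string that contains one two-byte character needs six bytes, so it does not fit a five-byte field even though the character counts match, which is the situation the variable addresses.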

Sorting
The following environment variables are concerned with how WebSphere DataStage automatically sorts data.

APT_NO_SORT_INSERTION
WebSphere DataStage automatically inserts sort components in your job to optimize the performance of the operators in your data flow. Set this variable to prevent this automatic insertion.

APT_SORT_INSERTION_CHECK_ONLY
When sorts are inserted automatically by WebSphere DataStage, if this is set, the sorts will just check that the order is correct; they won't actually sort. This is a better alternative to turning off partition and sort insertion using APT_NO_PART_INSERTION and APT_NO_SORT_INSERTION.


Sybase support
These environment variables are concerned with setting up access to Sybase databases from WebSphere DataStage.

APT_SYBASE_NULL_AS_EMPTY
Set APT_SYBASE_NULL_AS_EMPTY to 1 to extract null values as empty, and to load null values as " " when reading or writing an IQ database.

APT_SYBASE_PRESERVE_BLANKS
Set APT_SYBASE_PRESERVE_BLANKS to preserve trailing blanks while writing to an IQ database.

Teradata Support
The following environment variables are concerned with WebSphere DataStage interaction with Teradata databases.

APT_TERA_64K_BUFFERS
WebSphere DataStage assumes that the terawrite operator writes to buffers whose maximum size is 32 KB. Enable the use of 64 KB buffers by setting this variable. The default is that it is not set.

APT_TERA_NO_ERR_CLEANUP
Setting this variable prevents removal of error tables and the partially written target table of a terawrite operation that has not successfully completed. Set this variable for diagnostic purposes only. In some cases, setting this variable forces completion of an unsuccessful write operation.

APT_TERA_NO_SQL_CONVERSION
Set this to prevent the SQL statements you are generating from being converted to the character set specified for your stage (character sets can be specified at project, job, or stage level). The SQL statements are converted to LATIN1 instead.

APT_TERA_NO_PERM_CHECKS
Set this to bypass permission checking on the several system tables that need to be readable for the load process. This can speed up the start time of the load process slightly.

APT_TERA_SYNC_DATABASE
Specifies the database used for the terasync table. By default, the database used for the terasync table is specified as part of APT_TERA_SYNC_USER. If you want the database to be different, set this variable. You must then give APT_TERA_SYNC_USER read and write permission for this database.

APT_TERA_SYNC_PASSWORD
Specifies the password for the user identified by APT_TERA_SYNC_USER.


APT_TERA_SYNC_USER
Specifies the user that creates and writes to the terasync table.

Transport Blocks
The following environment variables are all concerned with the block size used for the internal transfer of data as jobs run. The following variables are used only for fixed-length records:
- APT_MIN_TRANSPORT_BLOCK_SIZE
- APT_MAX_TRANSPORT_BLOCK_SIZE
- APT_DEFAULT_TRANSPORT_BLOCK_SIZE
- APT_LATENCY_COEFFICIENT
- APT_AUTO_TRANSPORT_BLOCK_SIZE

APT_AUTO_TRANSPORT_BLOCK_SIZE
This environment variable is available in the WebSphere DataStage Administrator, under the Parallel category. When set, Orchestrate calculates the block size for transferring data internally as jobs run. It uses this algorithm:
if (recordSize * APT_LATENCY_COEFFICIENT < APT_MIN_TRANSPORT_BLOCK_SIZE)
    blockSize = minAllowedBlockSize
else if (recordSize * APT_LATENCY_COEFFICIENT > APT_MAX_TRANSPORT_BLOCK_SIZE)
    blockSize = maxAllowedBlockSize
else
    blockSize = recordSize * APT_LATENCY_COEFFICIENT

APT_LATENCY_COEFFICIENT
Specifies the number of writes to a block which transfers data between players. This variable allows you to control the latency of data flow through a step. The default value is 5. Specify a value of 0 to have a record transported immediately. This is only used for fixed length records. Note: Many operators have a built-in latency and are not affected by this variable.

APT_DEFAULT_TRANSPORT_BLOCK_SIZE
Specify the default block size for transferring data between players. It defaults to 131072 (128 KB).

APT_MAX_TRANSPORT_BLOCK_SIZE/ APT_MIN_TRANSPORT_BLOCK_SIZE
Specify the minimum and maximum allowable block size for transferring data between players. APT_MIN_TRANSPORT_BLOCK_SIZE cannot be less than 8192 which is its default value. APT_MAX_TRANSPORT_BLOCK_SIZE cannot be greater than 1048576 which is its default value. These variables are only meaningful when used in combination with APT_LATENCY_COEFFICIENT and APT_AUTO_TRANSPORT_BLOCK_SIZE.
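The block size algorithm given under APT_AUTO_TRANSPORT_BLOCK_SIZE can be written as a small Python function. The defaults below are the ones quoted in this section (latency coefficient 5, minimum 8192, maximum 1048576); the function and parameter names are illustrative only.

```python
def transport_block_size(record_size, latency_coefficient=5,
                         min_block=8192, max_block=1_048_576):
    """Compute the transport block size for fixed-length records.

    The product of record size and latency coefficient is clamped to the
    range [min_block, max_block], mirroring the algorithm shown earlier.
    """
    candidate = record_size * latency_coefficient
    if candidate < min_block:
        return min_block
    if candidate > max_block:
        return max_block
    return candidate
```

With the defaults, a 100-byte record gives the minimum of 8192, a 10,000-byte record gives 50,000, and a 1,000,000-byte record is capped at 1,048,576.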

Guide to setting environment variables


This section gives some guidance on which environment variables should be set in which circumstances.

Environment variable settings for all jobs


We recommend that you set the following environment variables for all jobs:
- APT_CONFIG_FILE
- APT_DUMP_SCORE
- APT_RECORD_COUNTS

Optional environment variable settings


We recommend setting the following environment variables as needed on a per-job basis. These variables can be used to tune the performance of a particular job flow, to assist in debugging, and to change the default behavior of specific parallel job stages.

Performance tuning
- APT_BUFFER_MAXIMUM_MEMORY
- APT_BUFFER_FREE_RUN
- TMPDIR. This defaults to /tmp. It is used for miscellaneous internal temporary data, including FIFO queues and Transformer temporary storage. As a minor optimization, it can be better to ensure that it is set to a file system separate from the WebSphere DataStage install directory.

Job flow debugging


- OSH_PRINT_SCHEMAS
- APT_DISABLE_COMBINATION
- APT_PM_PLAYER_TIMING
- APT_PM_PLAYER_MEMORY
- APT_BUFFERING_POLICY

Job flow design


- APT_STRING_PADCHAR


Chapter 7. Operators
The parallel job stages are built on operators. These topics describe those operators and are intended for knowledgeable Orchestrate users. The first section describes how WebSphere DataStage stages map to operators. Subsequent sections are an alphabetical listing and description of operators. Some operators are part of a library of related operators, and each of these has its own topic as follows:
- The Import/Export Library
- The Partitioning Library
- The Collection Library
- The Restructure Library
- The Sorting Library
- The Join Library
- The ODBC Interface Library
- The SAS Interface Library
- The Oracle Interface Library
- The DB2 Interface Library
- The Informix Interface Library
- The Sybase Interface Library
- The SQL Server Interface Library
- The iWay Interface Library

In these descriptions, the term WebSphere DataStage refers to the parallel engine that WebSphere DataStage uses to execute the operators.

Stage to Operator Mapping


There is not a one-to-one mapping between WebSphere DataStage stages and operators. Some stages are based on a range of related operators, and which one is used depends on the setting of the stage's properties. All of the stages can include common operators such as partition and sort depending on how they are used in a job. Table 7 shows the mapping between WebSphere DataStage stages and operators. Where a stage uses an operator with a particular option set, this option is also given. The WebSphere DataStage stages are listed by palette category in the same order in which they are described in the WebSphere DataStage Parallel Job Developer Guide.
Table 7. Stage to Operator Mapping

| WebSphere DataStage Stage | Operator | Options (where applicable) | Comment |
| --- | --- | --- | --- |
| File Set | | | Represents a permanent data set |
| Sequential File | Import Operator | -file, -filepattern | |
| | Export Operator | -file | |
| File Set | Import Operator | -fileset | |
| | Export Operator | -fileset | |
| Lookup File Set | Lookup Operator | -createOnly | |
| External Source | Import Operator | -source, -sourcelist | |
| External Target | Export Operator | -destination, -destinationlist | |
| Complex Flat File | Import Operator | | |
| Transformer | Transform Operator | | |
| BASIC Transformer | | | Represents server job transformer stage (gives access to BASIC transforms) |
| Aggregator | Group Operator | | |
| Join | fullouterjoin Operator | | |
| | innerjoin Operator | | |
| | leftouterjoin Operator | | |
| | rightouterjoin Operator | | |
| Merge | Merge Operator | | |
| Lookup | Lookup Operator | | |
| | The oralookup Operator | | for direct lookup in Oracle table ('sparse' mode) |
| | The db2lookup Operator | | for direct lookup in DB2 table ('sparse' mode) |
| | The sybaselookup Operator | | for direct lookup in table accessed via iWay ('sparse' mode) |
| | The sybaselookup Operator | | for direct lookup in Sybase table ('sparse' mode) |
| | The sqlsrvrlookup Operator | | |
| Funnel | Funnel Operators | | |
| | Sortfunnel Operator | | |
| | Sequence Operator | | |
| Sort | The psort Operator | | |
| | The tsort Operator | | |
| Remove Duplicates | Remdup Operator | | |
| Compress | Pcompress Operator | -compress | |
| Expand | Pcompress Operator | -expand | |
| Copy | Generator Operator | | |
| Modify | Modify Operator | | |
| Filter | Filter Operator | | |
| External Filter | | | Any executable command line that acts as a filter |
| Change Capture | Changecapture Operator | | |
| Change Apply | Changeapply Operator | | |
| Difference | Diff Operator | | |
| Compare | Compare Operator | | |
| Encode | Encode Operator | -encode | |
| Decode | Encode Operator | -decode | |
| Switch | Switch Operator | | |
| Generic | | | Any operator |
| Surrogate Key | Surrogate key operator | | |
| Column Import | The field_import Operator | | |
| Column Export | The field_export Operator | | |
| Make Subrecord | The makesubrec Operator | | |
| Split Subrecord | The splitsubrec Operator | | |
| Combine records | The aggtorec Operator | | |
| Promote subrecord | The makesubrec Operator | | |
| Make vector | The makevect Operator | | |
| Split vector | The splitvect Operator | | |
| Head | Head Operator | | |
| Tail | Tail Operator | | |
| Sample | Sample Operator | | |
| Peek | Peek Operator | | |
| Row Generator | Generator Operator | | |
| Column generator | Generator Operator | | |
| Write Range Map | Writerangemap Operator | | |
| SAS Parallel Data Set | | | Represents Orchestrate parallel SAS data set. |
| SAS | The sas Operator | | |
| | The sasout Operator | | |
| | The sasin Operator | | |
| | The sascontents Operator | | |
| DB2/UDB Enterprise | The db2read Operator | | |
| | The db2write and db2load Operators | | |
| | The db2upsert Operator | | |
| | The db2lookup Operator | | For in-memory ('normal') lookups |
| Oracle Enterprise | The oraread Operator | | |
| | The oraupsert Operator | | |
| | The orawrite Operator | | |
| | The oralookup Operator | | For in-memory ('normal') lookups |
| Informix Enterprise | The hplread operator | | |
| | The hplwrite Operator | | |
| | The infxread Operator | | |
| | The infxwrite Operator | | |
| | The xpsread Operator | | |
| | The xpswrite Operator | | |
| Teradata | Teraread Operator | | |
| | Terawrite Operator | | |
| Sybase | The sybasereade Operator | | |
| | The sybasewrite Operator | | |
| | The sybaseupsert Operator | | |
| | The sybaselookup Operator | | For in-memory ('normal') lookups |
| SQL Server | The sqlsrvrread Operator | | |
| | The sqlsrvrwrite Operator | | |
| | The sqlsrvrupsert Operator | | |
| | The sybaselookup Operator | | For in-memory ('normal') lookups |
| iWay | The iwayread Operator | | |
| | The iwaylookup Operator | | |

Changeapply operator
The changeapply operator takes the change data set output from the changecapture operator and applies the encoded change operations to a before data set to compute an after data set. If the before data set is identical to the before data set that was input to the changecapture operator, then the output after data set for changeapply is identical to the after data set that was input to the changecapture operator. That is:
change := changecapture(before, after)
after := changeapply(before, change)

You use the companion operator changecapture to provide a data set that contains the changes in the before and after data sets.


Data flow diagram


[Figure: the before data set (key:type; value:type; beforeRec:*) and the change data set (change_code:int8; key:type; value:type; ... changeRec:*) are the two inputs to changeapply, which produces the after output (afterRec:*).]

changeapply: properties
Table 8. changeapply Operator Properties

| Property | Value |
| --- | --- |
| Number of input data sets | 2 |
| Number of output data sets | 1 |
| Input interface schema | beforeRec:*, changeRec:* |
| Output interface schema | afterRec:* |
| Transfer behavior | changeRec:*->afterRec:*, dropping the change_code field; beforeRec:*->afterRec:* with type conversions |
| Input partitioning style | keys in same partition |
| Output partitioning style | keys in same partition |
| Preserve-partitioning flag in output data set | propagated |
| Composite operator | no |

The before input to changeapply must have the same fields as the before input that was input to changecapture, and an automatic conversion must exist between the types of corresponding fields. In addition, results are only guaranteed if the contents of the before input to changeapply are identical (in value and record order in each partition) to the before input that was fed to changecapture, and if the keys are unique.

The change input to changeapply must have been output from changecapture without modification. Because preserve-partitioning is set on the change output of changecapture (under normal circumstances you should not override this), the changeapply operator has the same number of partitions as the changecapture operator. Additionally, both inputs of changeapply are designated as same partitioning by the operator logic.

The changeapply operator performs these actions for each change record:
• If the before keys come before the change keys in the specified sort order, the before record is consumed and transferred to the output; no change record is consumed. This is a copy.
• If the before keys are equal to the change keys, the behavior depends on the code in the change_code field of the change record:
    Insert: The change record is consumed and transferred to the output; no before record is consumed. If key fields are not unique, and there is more than one consecutive insert with the same key, then changeapply applies all the consecutive inserts before existing records. This record order might be different from the after data set given to changecapture.
    Delete: The value fields of the before and change records are compared. If they are not the same, the before record is consumed and transferred to the output; no change record is consumed (copy). If the value fields are the same or if ignoreDeleteValues is specified, the change and before records are both consumed; no record is transferred to the output. If key fields are not unique, the value fields ensure that the correct record is deleted. If more than one record with the same keys have matching value fields, the first encountered is deleted. This might cause different record ordering than in the after data set given to the changecapture operator.
    Edit: The change record is consumed and transferred to the output; the before record is just consumed. If key fields are not unique, then the first before record encountered with matching keys is edited. This might be a different record from the one that was edited in the after data set given to the changecapture operator, unless the -keepCopy option was used.
    Copy: The change record is consumed. The before record is consumed and transferred to the output.
• If the before keys come after the change keys, behavior also depends on the change_code field:
    Insert: The change record is consumed and transferred to the output; no before record is consumed (the same as when the keys are equal).
    Delete: A warning is issued and the change record is consumed; no record is transferred to the output; no before record is consumed.
    Edit or Copy: A warning is issued and the change record is consumed and transferred to the output; no before record is consumed. This is an insert.

If the before input of changeapply is identical to the before input of changecapture and either keys are unique or copy records are used, then the output of changeapply is identical to the after input of changecapture. However, if the before input of changeapply is not the same (different record contents or ordering), or keys are not unique and copy records are not used, this fact is not detected and the rules described above are applied anyway, producing a result that might or might not be useful.
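The rules above amount to a merge of two sorted inputs. The following Python sketch illustrates that merge for one partition; it is not the operator's implementation. It assumes ascending key order, records represented as dictionaries, and the default change codes. The function name change_apply and its parameters are hypothetical, and the handling of an exhausted before input (treated like "before keys come after the change keys") is an assumption that the text does not state.

```python
# Default change codes, as given in the changeapply options table.
COPY, INSERT, DELETE, EDIT = 0, 1, 2, 3


def change_apply(before, changes, key, value, ignore_delete_values=False):
    """Illustrative merge of sorted before records with sorted change records.

    before, changes: lists of dicts sorted ascending on the key fields.
    key, value: tuples of field names. Returns (after_records, warnings).
    """
    def k(rec):
        return tuple(rec[f] for f in key)

    def v(rec):
        return tuple(rec[f] for f in value)

    def strip(change):
        # The change-to-after transfer drops the change_code field.
        return {f: x for f, x in change.items() if f != "change_code"}

    out, warnings = [], []
    b = c = 0
    while c < len(changes):
        chg = changes[c]
        code = chg["change_code"]
        if b < len(before) and k(before[b]) < k(chg):
            # Before keys come first: copy; no change record is consumed.
            out.append(before[b])
            b += 1
        elif b < len(before) and k(before[b]) == k(chg):
            if code == INSERT:
                out.append(strip(chg))
                c += 1
            elif code == DELETE:
                if ignore_delete_values or v(before[b]) == v(chg):
                    b += 1          # both consumed, nothing is output
                    c += 1
                else:
                    out.append(before[b])   # copy; the delete stays pending
                    b += 1
            elif code == EDIT:
                out.append(strip(chg))
                b += 1
                c += 1
            else:                   # COPY
                out.append(before[b])
                b += 1
                c += 1
        else:
            # Before keys come after the change keys
            # (assumption: also used when the before input is exhausted).
            if code == DELETE:
                warnings.append(("delete with no matching before record", k(chg)))
            else:
                if code != INSERT:
                    warnings.append(("edit or copy applied as insert", k(chg)))
                out.append(strip(chg))
            c += 1
    # Assumption: before records left after the last change are copies.
    out.extend(before[b:])
    return out, warnings


before = [
    {"month": "apr", "customer": 1, "balance": 10.0},
    {"month": "apr", "customer": 2, "balance": 20.0},
    {"month": "apr", "customer": 4, "balance": 40.0},
]
changes = [
    {"change_code": EDIT, "month": "apr", "customer": 1, "balance": 15.0},
    {"change_code": DELETE, "month": "apr", "customer": 2, "balance": 20.0},
    {"change_code": INSERT, "month": "apr", "customer": 3, "balance": 30.0},
]
after, warnings = change_apply(before, changes, ("month", "customer"), ("balance",))
```

In this usage, customer 1 is edited, customer 2 is deleted, customer 3 is inserted, and customer 4 is copied through unchanged.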

Schemas
The changeapply output data set has the same schema as the change data set, with the change_code field removed. The before interface schema is:
record (key:type; ... value:type; ... beforeRec:*;)

The change interface schema is:


record (change_code:int8; key:type; ... value:type; ... changeRec:*;)

The after interface schema is:


record (afterRec:*;)

Transfer behavior
The change to after transfer uses an internal transfer adapter to drop the change_code field from the transfer. This transfer is declared first, so the schema of the change data set determines the schema of the after data set.


Key comparison fields


An internal, generic comparison function compares key fields. An internal, generic equality function compares non-key fields. You adjust the comparison with parameters and equality functions for individual fields using the -param suboption of the -key, -allkeys, -allvalues, and -value options.

Changeapply: syntax and options


You must specify at least one -key field or specify the -allkeys option. Terms in italic typeface are option strings you supply. When your option string contains a space or a tab character, you must enclose it in single quotes.
changeapply
  -key input_field_name [-cs | -ci] [-asc | -desc] [-nulls first | last] [-param params]
  [-key input_field_name [-cs | -ci] [-asc | -desc] [-nulls first | last] [-param params] ...]
  | -allkeys [-cs | -ci] [-asc | -desc] [-nulls first | last] [-param params]
  [-allvalues [-cs | -ci] [-param params]]
  [-codeField field_name]
  [-copyCode n]
  [-collation_sequence locale | collation_file_pathname | OFF]
  [-deleteCode n]
  [-doStats]
  [-dropkey input_field_name ...]
  [-dropvalue input_field_name ...]
  [-editCode n]
  [-ignoreDeleteValues]
  [-insertCode n]
  [-value input_field_name [-ci | -cs] [-param params] ...]

Note: The -checkSort option has been deprecated. By default, partitioner and sort components are now inserted automatically.
Table 9. Changeapply options

Option: -key
Use: -key input_field_name [-cs | -ci] [-asc | -desc] [-nulls first | last] [-param params] [-key input_field_name [-cs | -ci] [-asc | -desc] [-nulls first | last] [-param params] ...]
Specify one or more key fields. You must specify at least one key for this option or specify the -allkeys option. These options are mutually exclusive. You cannot use a vector, subrecord, or tagged aggregate field as a value key. The -ci suboption specifies that the comparison of value keys is case insensitive. The -cs suboption specifies a case-sensitive comparison, which is the default. -asc and -desc specify ascending or descending sort order. -nulls first | last specifies the position of nulls. The -param suboption allows you to specify extra parameters for a key. Specify parameters using property=value pairs separated by commas.

Option: -allkeys
Use: -allkeys [-cs | -ci] [-asc | -desc] [-nulls first | last] [-param params]
Specify that all fields not explicitly declared are key fields. The suboptions are the same as the suboptions described for the -key option above. You must specify either the -allkeys option or the -key option. They are mutually exclusive.

Option: -allvalues
Use: -allvalues [-cs | -ci] [-param params]
Specify that all fields not otherwise explicitly declared are value fields. The -ci suboption specifies that the comparison of value keys is case insensitive. The -cs suboption specifies a case-sensitive comparison, which is the default. The -param suboption allows you to specify extra parameters for a key. Specify parameters using property=value pairs separated by commas. The -allvalues option is mutually exclusive with the -value and -allkeys options.

Option: -codeField
Use: -codeField field_name
The name of the change code field. The default is change_code. This should match the field name used in changecapture.

Option: -collation_sequence
Use: -collation_sequence locale | collation_file_pathname | OFF
This option determines how your string data is sorted. You can:
• Specify a predefined IBM ICU locale.
• Write your own collation sequence using ICU syntax, and supply its collation_file_pathname.
• Specify OFF so that string comparisons are made using Unicode code-point value order, independent of any locale or custom sequence.
By default, WebSphere DataStage sorts strings using byte-wise comparisons. For more information, reference this IBM ICU site: http://oss.software.ibm.com/icu/userguide/Collate_Intro.html

Option: -copyCode
Use: -copyCode n
Specifies the value for the change_code field in the change record for the copy result. The n value is an int8. The default value is 0. A copy record means that the before record should be copied to the output without modification.

Option: -deleteCode
Use: -deleteCode n
Specifies the value for the change_code field in the change record for the delete result. The n value is an int8. The default value is 2. A delete record means that a record in the before data set must be deleted to produce the after data set.

Option: -doStats
Use: -doStats
Configures the operator to display result information containing the number of input records and the number of copy, delete, edit, and insert records.

Option: -dropkey
Use: -dropkey input_field_name
Optionally specify that the field is not a key field. If you specify this option, you must also specify the -allkeys option. There can be any number of occurrences of this option.

Option: -dropvalue
Use: -dropvalue input_field_name
Optionally specify that the field is not a value field. If you specify this option, you must also specify the -allvalues option. There can be any number of occurrences of this option.

Option: -editCode
Use: -editCode n
Specifies the value for the change_code field in the change record for the edit result. The n value is an int8. The default value is 3. An edit record means that the value fields in the before data set must be edited to produce the after data set.

Option: -ignoreDeleteValues
Use: -ignoreDeleteValues
Do not check value fields on deletes. Normally, changeapply compares the value fields of delete change records to those in the before record to ensure that it is deleting the correct record. The -ignoreDeleteValues option turns off this behavior.

Option: -insertCode
Use: -insertCode n
Specifies the value for the change_code field in the output record for the insert result. The n value is an int8. The default value is 1. An insert means that a record must be inserted into the before data set to reproduce the after data set.

Option: -value
Use: -value field [-ci | -cs] [-param params]
Optionally specifies the name of a value field. The -value option might be repeated if there are multiple value fields. The value fields are modified by edit records, and can be used to ensure that the correct record is deleted when keys are not unique. Note that you cannot use a vector, subrecord, or tagged aggregate field as a value key. The -ci suboption specifies that the comparison of values is case insensitive. The -cs suboption specifies a case-sensitive comparison, which is the default. The -param suboption allows you to specify extra parameters for a key. Specify parameters using property=value pairs separated by commas. The -value and -allvalues options are mutually exclusive.

Example
This example assumes that the input data set records contain customer, month, and balance fields. The operation examines the customer and month fields of each input record for differences. By default, WebSphere DataStage inserts partition and sort components to meet the partitioning and sorting needs of the changeapply operator and other operators. Here is the data flow diagram for the example:

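[Figure: data flow for the example. The before and after data sets are each partitioned with hash and sorted with tsort. A copy operator sends the sorted before data set to both changecapture and changeapply. The change output of changecapture is the second input of changeapply, which writes the after data set.]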


Here is the osh command:
$ osh "hash -key month -key customer < beforeRaw.ds |
       tsort -key month -key customer |
       copy > before_capture.v > before_apply.v;
       hash -key month -key customer < afterRaw.ds |
       tsort -key month -key customer > after.v;
       changecapture -key month -key customer -value balance
         < before_capture.v < after.v > change.v;
       changeapply -key month -key customer -value balance
         < before_apply.v < change.v > after.ds"

Schema of the data sets in this example (key fields: month and customer; value field: balance):
customer:int16; month:string[3]; name:string[21]; accounttype:int8; balance:sfloat;

Changecapture operator
The changecapture operator takes two input data sets, denoted before and after, and outputs a single data set whose records represent the changes made to the before data set to obtain the after data set. The operator produces a change data set, whose schema is transferred from the schema of the after data set with the addition of one field: a change code with values encoding the four actions: insert, delete, copy, and edit. The preserve-partitioning flag is set on the change data set.

You can use the companion operator changeapply to combine the changes from the changecapture operator with the original before data set to reproduce the after data set. The changecapture operator is very similar to the diff operator described in Diff Operator.

Data flow diagram


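[Figure: the before data set (interface schema key:type; value:type; ... beforeRec:*;) and the after data set (interface schema key:type; value:type; ... afterRec:*;) are the inputs of changecapture, which writes the change output (change_code:int8; changeRec:*;).]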

Key and value fields


Records from the two input data sets are compared using key and value fields which must be top-level non-vector fields and can be nullable. Using the -param suboption of the -key, -allkeys, -allvalues, and -value options, you can provide comparison arguments to guide the manner in which key and value fields are compared. In the case of equal key fields, the value fields are compared to distinguish between the copy and edit cases.

Transfer behavior
In the insert and edit cases, the after input is transferred to output. In the delete case, an internal transfer adapter transfers the before keys and values to output. In the copy case, the after input is optionally transferred to output. Because an internal transfer adapter is used, no user transfer or view adapter can be used with the changecapture operator.

Determining differences
The changecapture output data set has the same schema as the after data set, with the addition of a change_code field. The contents of the output depend on whether the after record represents an insert, delete, edit, or copy to the before data set:
• Insert: a record exists in the after data set but not the before data set as indicated by the sorted key fields. The after record is consumed and transferred to the output. No before record is consumed. If key fields are not unique, changecapture might fail to identify an inserted record with the same key fields as an existing record. Such an insert might be represented as a series of edits, followed by an insert of an existing record. This has consequences for changeapply.
• Delete: a record exists in the before data set but not the after data set as indicated by the sorted key fields. The before record is consumed and the key and value fields are transferred to the output; no after record is consumed. If key fields are not unique, changecapture might fail to identify a deleted record if another record with the same keys exists. Such a delete might be represented as a series of edits, followed by a delete of a different record. This has consequences for changeapply.
• Edit: a record exists in both the before and after data sets as indicated by the sorted key fields, but the before and after records differ in one or more value fields. The before record is consumed and discarded; the after record is consumed and transferred to the output. If key fields are not unique, or sort order within a key is not maintained between the before and after data sets, spurious edit records might be generated for those records whose sort order has changed. This has consequences for changeapply.
• Copy: a record exists in both the before and after data sets as indicated by the sorted key fields, and furthermore the before and after records are identical in value fields as well. The before record is consumed and discarded; the after record is consumed and optionally transferred to the output. If no after record is transferred, no output is generated for the record; this is the default.

The operator produces a change data set, whose schema is transferred from the schema of the after data set, with the addition of one field: a change code with values encoding insert, delete, copy, and edit. The preserve-partitioning flag is set on the change data set.
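The four cases above can be read as a two-way merge over inputs sorted on the key fields. The following Python sketch illustrates the classification for one partition, assuming ascending key order, unique keys, dictionary records, and the default change codes. The function name change_capture and the keep_copy parameter are hypothetical stand-ins for the operator and its -keepCopy option; this is not the operator's implementation.

```python
# Default change codes, as given in the changecapture options table.
COPY, INSERT, DELETE, EDIT = 0, 1, 2, 3


def change_capture(before, after, key, value, keep_copy=False):
    """Illustrative diff of two sorted record lists.

    before, after: lists of dicts sorted ascending on the key fields.
    key, value: tuples of field names. Returns the list of change records.
    """
    def k(rec):
        return tuple(rec[f] for f in key)

    def v(rec):
        return tuple(rec[f] for f in value)

    changes = []
    b = a = 0
    while b < len(before) or a < len(after):
        if b == len(before) or (a < len(after) and k(after[a]) < k(before[b])):
            # Record exists only in after: insert; no before record consumed.
            changes.append({"change_code": INSERT, **after[a]})
            a += 1
        elif a == len(after) or k(before[b]) < k(after[a]):
            # Record exists only in before: delete; only the key and
            # value fields of the before record are transferred.
            rec = {f: before[b][f] for f in key + value}
            changes.append({"change_code": DELETE, **rec})
            b += 1
        else:
            # Keys are equal: value fields separate edit from copy.
            if v(before[b]) != v(after[a]):
                changes.append({"change_code": EDIT, **after[a]})
            elif keep_copy:
                changes.append({"change_code": COPY, **after[a]})
            b += 1
            a += 1
    return changes


before = [
    {"month": "apr", "customer": 1, "balance": 10.0},
    {"month": "apr", "customer": 2, "balance": 20.0},
    {"month": "apr", "customer": 4, "balance": 40.0},
]
after = [
    {"month": "apr", "customer": 1, "balance": 15.0},
    {"month": "apr", "customer": 3, "balance": 30.0},
    {"month": "apr", "customer": 4, "balance": 40.0},
]
changes = change_capture(before, after, ("month", "customer"), ("balance",))
```

With the default keep_copy=False, the unchanged record for customer 4 produces no output, which matches the default described above.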

Changecapture: syntax and options


changecapture
  -key input_field_name [-cs | -ci] [-asc | -desc] [-nulls first | last] [-param params]
  [-key input_field_name [-cs | -ci] [-asc | -desc] [-nulls first | last] [-param params] ...]
  | -allkeys [-cs | -ci] [-asc | -desc] [-nulls first | last] [-param params]
  [-allvalues [-cs | -ci] [-param params]]
  [-codeField field_name]
  [-copyCode n]
  [-collation_sequence locale | collation_file_pathname | OFF]
  [-deleteCode n]
  [-doStats]
  [-dropkey input_field_name ...]
  [-dropvalue input_field_name ...]
  [-editCode n]
  [-insertCode n]
  [-keepCopy | -dropCopy]
  [-keepDelete | -dropDelete]
  [-keepEdit | -dropEdit]
  [-keepInsert | -dropInsert]
  [-value input_field_name [-ci | -cs] [-param params] ...]

Terms in italic typeface are option strings you supply. When your option string contains a space or a tab character, you must enclose it in single quotes. You must specify either one or more -key fields or the -allkeys option. You can parameterize each key field's comparison operation and specify the expected sort order (the default is ascending).

Note: The -checkSort option has been deprecated. By default, partitioner and sort components are now inserted automatically.


Table 10. Changecapture options

Option: -key
Use: -key input_field_name [-cs | -ci] [-asc | -desc] [-nulls first | last] [-param params] [-key input_field_name [-cs | -ci] [-asc | -desc] [-nulls first | last] [-param params] ...]
Specify one or more key fields. You must specify either the -allkeys option or at least one key for the -key option. These options are mutually exclusive. You cannot use a vector, subrecord, or tagged aggregate field as a value key. The -ci option specifies that the comparison of value keys is case insensitive. The -cs option specifies a case-sensitive comparison, which is the default. -asc and -desc specify ascending or descending sort order. -nulls first | last specifies the position of nulls. The -param suboption allows you to specify extra parameters for a key. Specify parameters using property=value pairs separated by commas.

Option: -allkeys
Use: -allkeys [-cs | -ci] [-asc | -desc] [-nulls first | last] [-param params]
Specify that all fields not explicitly declared are key fields. The suboptions are the same as the suboptions described for the -key option above. You must specify either the -allkeys option or the -key option. They are mutually exclusive.

Option: -allvalues
Use: -allvalues [-cs | -ci] [-param params]
Specify that all fields not otherwise explicitly declared are value fields. The -ci option specifies that the comparison of value keys is case insensitive. The -cs option specifies a case-sensitive comparison, which is the default. The -param option allows you to specify extra parameters for a key. Specify parameters using property=value pairs separated by commas. The -allvalues option is mutually exclusive with the -value and -allkeys options. You must specify the -allvalues option when you supply the -dropvalue option.

Option: -codeField
Use: -codeField field_name
Optionally specify the name of the change code field. The default is change_code.

Option: -collation_sequence
Use: -collation_sequence locale | collation_file_pathname | OFF
This option determines how your string data is sorted. You can:
• Specify a predefined IBM ICU locale.
• Write your own collation sequence using ICU syntax, and supply its collation_file_pathname.
• Specify OFF so that string comparisons are made using Unicode code-point value order, independent of any locale or custom sequence.
By default, WebSphere DataStage sorts strings using byte-wise comparisons. For more information, reference this IBM ICU site: http://oss.software.ibm.com/icu/userguide/Collate_Intro.html

Option: -copyCode
Use: -copyCode n
Optionally specify the value of the change_code field in the output record for the copy result. The n value is an int8. The default value is 0. A copy result means that all keys and all values in the before data set are equal to those in the after data set.

Option: -deleteCode
Use: -deleteCode n
Optionally specify the value for the change_code field in the output record for the delete result. The n value is an int8. The default value is 2. A delete result means that a record exists in the before data set but not in the after data set as defined by the key fields.

Option: -doStats
Use: -doStats
Optionally configure the operator to display result information containing the number of input records and the number of copy, delete, edit, and insert records.

Option: -dropkey
Use: -dropkey input_field_name
Optionally specify that the field is not a key field. If you specify this option, you must also specify the -allkeys option. There can be any number of occurrences of this option.

Option: -dropvalue
Use: -dropvalue input_field_name
Optionally specify that the field is not a value field. If you specify this option, you must also specify the -allvalues option. There can be any number of occurrences of this option.

Option: -editCode
Use: -editCode n
Optionally specify the value for the change_code field in the output record for the edit result. The n value is an int8. The default value is 3. An edit result means all key fields are equal but one or more value fields are different.

Option: -insertCode
Use: -insertCode n
Optionally specify the value for the change_code field in the output record for the insert result. The n value is an int8. The default value is 1. An insert result means that a record exists in the after data set but not in the before data set as defined by the key fields.

Option: -keepCopy | -dropCopy, -keepDelete | -dropDelete, -keepEdit | -dropEdit, -keepInsert | -dropInsert
Use: -keepCopy | -dropCopy -keepDelete | -dropDelete -keepEdit | -dropEdit -keepInsert | -dropInsert
Optionally specifies whether to keep or drop copy, delete, edit, or insert records at output. By default, the operator creates an output record for all differences except copy.

Option: -value
Use: -value field_name [-ci | -cs] [-param params]
Optionally specifies one or more value fields. When a before and after record are determined to be copies based on the difference keys (as defined by -key), the value keys can then be used to determine if the after record is an edited version of the before record. Note that you cannot use a vector, subrecord, or tagged aggregate field as a value key. The -ci option specifies that the comparison of values is case insensitive. The -cs option specifies a case-sensitive comparison, which is the default. The -param option allows you to specify extra parameters for a key. Specify parameters using property=value pairs separated by commas. The -value and -allvalues options are mutually exclusive.

Changecapture example 1: all output results


This example assumes that the input data set records contain customer, month, and balance fields. The operation examines the customer and month fields of each input record for differences. By default, WebSphere DataStage inserts partition and sort components to meet the partitioning and sorting needs of the changecapture operator and other operators. Here is the osh command:
$ osh "changecapture -key month -key customer -value balance < before_capture.v < after.v > change.ds"


Example 2: dropping output results


In some cases, you might be interested only in some results of the changecapture operator. In this example, you keep only the output records of the edit, delete and insert results. That is, you explicitly drop the copy results so that the output data set contains records only when there is a difference between the before and after data records. As in Example 1, this example assumes that the before and after data sets are already sorted. Shown below is the data flow diagram for this example:
[Figure: data flow for this example. Within one step, the before and after data sets are the inputs of changecapture. Its output goes to switch (-key change_code), which writes output 0 (delete), output 1 (edit), and output 2 (insert).]

before and after data set schema (key fields: month and customer; value field: balance):
customer:int16; month:string[3]; name:string[21]; accounttype:int8; balance:sfloat;

output data sets schema:
change_code:int8; customer:int16; month:string[3]; name:string[21]; accounttype:int8; balance:sfloat;

You specify these key and value fields to the changecapture operator:
-key month -key customer -value balance

After you run the changecapture operator, you invoke the switch operator to divide the output records into data sets based on the result type. The switch operator in this example creates three output data sets: one for delete results, one for edit results, and one for insert results. It creates only three data sets, because you have explicitly dropped copy results from the changecapture operator by specifying -dropCopy. By creating a separate data set for each of the three remaining result types, you can handle each one differently:
-deleteCode 0 -editCode 1 -insertCode 2

Here is the osh command:


$ osh "changecapture -key month -key customer -value balance
         -dropCopy -deleteCode 0 -editCode 1 -insertCode 2
         < before.ds < after.ds |
       switch -key change_code
         > outDelete.ds > outEdit.ds > outInsert.ds"
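The effect of renumbering the change codes to 0, 1, and 2 and then splitting on the change code field can be pictured with a short Python sketch. It assumes that a code value n routes a record to output n, as the figure for this example suggests; the switch operator itself is documented elsewhere, and the function name split_by_change_code is hypothetical.

```python
def split_by_change_code(changes, n_outputs=3, code_field="change_code"):
    """Route each change record to the output list indexed by its code.

    Assumption for illustration: code 0 = delete, 1 = edit, 2 = insert,
    matching -deleteCode 0 -editCode 1 -insertCode 2 in this example.
    """
    outputs = [[] for _ in range(n_outputs)]
    for rec in changes:
        outputs[rec[code_field]].append(rec)
    return outputs


changes = [
    {"change_code": 1, "month": "apr", "customer": 1, "balance": 15.0},
    {"change_code": 0, "month": "apr", "customer": 2, "balance": 20.0},
    {"change_code": 2, "month": "apr", "customer": 3, "balance": 30.0},
    {"change_code": 2, "month": "may", "customer": 5, "balance": 50.0},
]
out_delete, out_edit, out_insert = split_by_change_code(changes)
```

Because copy results are dropped before the split, no fourth output is needed.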

Checksum operator
You can use the checksum operator to add a checksum field to your data records. You can use the same operator later in the flow to validate the data.

Data flow diagram


[Figure: the input data set (inRec:*;) is the input of checksum, which writes the output data set (outRec:*; checksum:string; crcbuffer:string;).]

Properties
Property                                         Value
Number of input data sets                        1
Number of output data sets                       1
Input interface schema                           inRec:*
Output interface schema                          outRec:*; checksum:string; crcbuffer:string
                                                 (the checksum field name can be changed, and the crcbuffer field is optional)
Transfer behavior                                inRec -> outRec without record modification
Execution mode                                   parallel (default) or sequential
Partitioning method                              any (parallel mode)
Collection method                                any (sequential mode)
Preserve-partitioning flag in output data set    propagated
Composite operator                               no
Combinable operator                              yes

The checksum operator:
• Takes any single data set as input.
• Has an input interface schema consisting of a single schema variable inRec and an output interface schema consisting of a single schema variable outRec.
• Copies the input data set to the output data set, and adds one, or possibly two, fields.

Checksum: syntax and options


checksum [-checksum_name field_name] [-export_name field_name] [-dropcol field | -keepcol field]

Table 11. Checksum options

Option: -checksum_name
Use: -checksum_name field_name
Specifies a name for the output field containing the checksum value. By default the field is named checksum.

Option: -export_name
Use: -export_name field_name
Specifies the name of the output field containing the buffer the checksum algorithm was run with. If this option is not specified, the checksum buffer is not written to the output data set.

Option: -dropcol
Use: -dropcol field
Specifies a field that will not be used to generate the checksum. This option can be repeated to specify multiple fields. This option is mutually exclusive with -keepcol.

Option: -keepcol
Use: -keepcol field
Specifies a field that will be used to generate the checksum. This option can be repeated to specify multiple fields. This option is mutually exclusive with -dropcol.

Checksum: example
In this example you use checksum to add a checksum field named check that is calculated from the fields week_total, month_total, and quarter_total. The osh command is:
$ osh "checksum -checksum_name check -keepcol week_total -keepcol month_total -keepcol quarter_total < in.ds > out0.ds"
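To make the effect of -keepcol, -dropcol, -checksum_name, and -export_name concrete, the following Python sketch adds a checksum field to a dictionary record. This section does not state which checksum algorithm the operator uses or how it builds the buffer, so the SHA-256 digest and the "|" separator used here are stand-in assumptions, and the function name add_checksum is hypothetical.

```python
import hashlib


def add_checksum(record, keepcol=None, dropcol=None,
                 checksum_name="checksum", export_name=None):
    """Return a copy of record with a checksum field added.

    Assumptions for illustration only: the buffer is the selected field
    values joined with "|", and the digest is SHA-256 in hexadecimal.
    """
    if keepcol and dropcol:
        raise ValueError("keepcol and dropcol are mutually exclusive")
    if keepcol:
        fields = [f for f in record if f in keepcol]
    else:
        fields = [f for f in record if f not in (dropcol or ())]
    buffer = "|".join(str(record[f]) for f in fields)
    out = dict(record)  # the input record itself is not modified
    out[checksum_name] = hashlib.sha256(buffer.encode("utf-8")).hexdigest()
    if export_name:
        out[export_name] = buffer  # optional second field with the buffer
    return out


row = {"id": 1, "week_total": 10, "month_total": 40, "quarter_total": 120}
checked = add_checksum(
    row,
    keepcol=["week_total", "month_total", "quarter_total"],
    checksum_name="check",
)
```

Recomputing the checksum later in the flow with the same field selection and comparing it with the stored check field is one way to validate that the selected fields have not changed.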


Compare operator
The compare operator performs a field-by-field comparison of records in two presorted input data sets. This operator compares the values of top-level non-vector data types such as strings. All appropriate comparison parameters are supported, for example, case sensitivity and insensitivity for string comparisons.

The compare operator does not change the schema, partitioning, or content of the records in either input data set. It transfers both data sets intact to a single output data set generated by the operator. The comparison results are also recorded in the output data set. By default, WebSphere DataStage inserts partition and sort components to meet the partitioning and sorting needs of the compare operator and other operators.

Data flow diagram


[Figure: two input data sets, each with interface schema key0:type0; ... keyN:typeN; inRec:*;, are the inputs of compare, which writes one output data set with schema result:int8; first:subrec(rec:*;); second:subrec(rec:*;);]


Note: If you do not specify key fields, the operator treats all fields as key fields.

compare: properties
Table 12. Compare properties

Property                                         Value
Number of input data sets                        2
Number of output data sets                       1
Input interface schema                           key0:type0; ... keyN:typeN; inRec:*;
Output interface schema                          result:int8; first:subrec(rec:*;); second:subrec(rec:*;);
Transfer behavior                                The first input data set is transferred to first.rec.
                                                 The second input data set is transferred to second.rec.
Execution mode                                   parallel (default) or sequential
Input partitioning style                         keys in same partition
Partitioning method                              same (parallel mode)
Collection method                                any (sequential mode)
Preserve-partitioning flag in output data set    propagated
Composite operator                               no

The compare operator:
• Compares only scalar data types. See Restrictions.
• Takes two presorted data sets as input and outputs one data set.
• Has an input interface schema consisting of the key fields and the schema variable inRec, and an output interface schema consisting of the result field of the comparison and a subrecord field containing each input record.
• Performs a field-by-field comparison of the records of the input data sets.
• Transfers the two input data sets to the single output data set without altering the input schemas, partitioning, or values.
• Writes to the output data set signed integers that indicate comparison results.

Restrictions
The compare operator:
• Compares only scalar data types, specifically string, integer, float, decimal, raw, date, time, and timestamp; you cannot use the operator to compare data types such as tagged aggregate, subrec, vector, and so on.
• Compares only fields explicitly specified as key fields, except when you do not explicitly specify any key field. In that case, the operator compares all fields that occur in both records.

Results field
The operator writes the following default comparison results to the output data set. In each case, you can specify an alternate value:
Description of comparison results, with the default value for each:
• The record in the first input data set is greater than the corresponding record in the second input data set. Default value: 1
• The record in the first input data set is equal to the corresponding record in the second input data set. Default value: 0
• The record in the first input data set is less than the corresponding record in the second input data set. Default value: -1
• The number of records in the first input data set is greater than the number of records in the second input data set. Default value: 2
• The number of records in the first input data set is less than the number of records in the second input data set. Default value: -2

When this operator encounters any of the mismatches described in the table shown above, you can force it to take one or both of the following actions:
• Terminate the remainder of the current comparison
• Output a warning message to the screen
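The default result values can be illustrated with a Python sketch that compares two lists of dictionary records pairwise. It assumes both inputs are already sorted and aligned within one partition. The function name compare and its parameters are hypothetical, and returning the record-count code as a separate value is a choice made for this sketch; this section does not describe how the operator writes that code to the output data set.

```python
def compare(first, second, keys=None, gt=1, eq=0, lt=-1, first_n=-2, second_n=2):
    """Illustrative record-by-record comparison of two presorted inputs.

    Returns (results, mismatch). Each result holds the comparison code and
    both input records; mismatch is the record-count code, or None.
    """
    results = []
    for a, b in zip(first, second):
        # With no explicit keys, compare all fields that occur in both records.
        fields = keys if keys else [f for f in a if f in b]
        ka = tuple(a[f] for f in fields)
        kb = tuple(b[f] for f in fields)
        if ka > kb:
            code = gt
        elif ka < kb:
            code = lt
        else:
            code = eq
        results.append({"result": code, "first": a, "second": b})

    if len(first) > len(second):
        mismatch = second_n   # first input has more records (default 2)
    elif len(first) < len(second):
        mismatch = first_n    # second input has more records (default -2)
    else:
        mismatch = None
    return results, mismatch


first = [
    {"name": "a", "age": 30, "gender": "f"},
    {"name": "b", "age": 41, "gender": "m"},
]
second = [
    {"name": "x", "age": 30, "gender": "f"},
    {"name": "y", "age": 40, "gender": "m"},
    {"name": "z", "age": 52, "gender": "f"},
]
results, mismatch = compare(first, second, keys=["age", "gender"])
```

In this usage the name field differs in every pair, but it does not affect the result because only age and gender are key fields.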

Compare: syntax and options


compare
  [-abortOnDifference]
  [-field fieldname [-ci | -cs] [-param params] ...] | [-key fieldname [-ci | -cs] [-param params] ...]
  [-collation_sequence locale | collation_file_pathname | OFF]
  [-first n]
  [-gt n | -eq n | -lt n]
  [-second n]
  [-warnRecordCountMismatch]


None of the options are required.


Table 13. Compare options Option -abortOnDifference Use -abortOnDifference Forces the operator to abort its operation each time a difference is encountered between two corresponding fields in any record of the two input data sets. This option is mutually exclusve with -warnRecordCountMismatch, -lt, -gt, -first, and -second. -collation_sequence -collation_sequence locale | collation_file_pathname | OFF This option determines how your string data is sorted. You can: v Specify a predefined IBM ICU locale v Write your own collation sequence using ICU syntax, and supply its collation_file_pathname v Specify OFF so that string comparisons are made using Unicode code-point value order, independent of any locale or custom sequence. By default, WebSphere DataStage sorts strings using byte-wise comparisons. For more information, reference this IBM ICU site: http://oss.software.ibm.com/icu/userguide/ Collate_Intro.html -field or -key -field fieldname [-ci | -cs] [-param params] | -key fieldname [-ci | -cs] [-param params] -field or -key is a key field to compare in the two input data sets. The maximum number of fields is the number of fields in the input data sets. If no key fields are explicitly specified, all fields shared by the two records being processed are compared. fieldname specifies the name of the field. -ci specifies that the comparison of strings is case-insensitive. -cs specifies case-sensitive string comparison, which is the default. The -param suboption allows you to specify extra parameters for a field. Specify parameters using property=value pairs separated by commas. -first -first n Configures the operator to write n (a signed integer between -128 and 127) to the output data set if the number of records in the second input data set exceeds the number of records in the first input data set. The default value is -2.


Table 13. Compare options (continued)

Option: -gt | -eq | -lt
Use: -gt n | -eq n | -lt n
Configures the operator to write n (a signed integer between -128 and 127) to the output data set if the record in the first input data set is:
v Greater than (-gt) the equivalent record in the second input data set. The default is 1.
v Equal to (-eq) the equivalent record in the second input data set. The default is 0.
v Less than (-lt) the equivalent record in the second input data set. The default is -1.

Option: -second
Use: -second n
Configures the operator to write n (an integer between -128 and 127) to the output data set if the number of records in the first input data set exceeds the number of records in the second input data set. The default value is 2.

Option: -warnRecordCountMismatch
Use: -warnRecordCountMismatch
Forces the operator to output a warning message when a comparison is aborted due to a mismatch in the number of records in the two input data sets.
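The result codes described for -gt, -eq, and -lt can be modeled outside DataStage. The following Python sketch is illustrative only and is not the operator's implementation. It assumes that records are dictionaries and that multiple key fields are compared in the order given (the text does not say how several keys are combined). The names compare_code and check_code are hypothetical. The -first and -second record-count codes are not modeled.

```python
from typing import Mapping, Sequence


def check_code(n: int) -> int:
    """Reject codes outside the signed 8-bit range the options allow."""
    if not -128 <= n <= 127:
        raise ValueError(f"result code {n} is outside -128..127")
    return n


def compare_code(first: Mapping, second: Mapping, keys: Sequence[str],
                 gt: int = 1, eq: int = 0, lt: int = -1) -> int:
    """Return the code for one pair of corresponding records.

    Assumption: keys are compared as a tuple, in the order given.
    Defaults mirror the documented defaults for -gt, -eq and -lt.
    """
    a = tuple(first[k] for k in keys)
    b = tuple(second[k] for k in keys)
    if a > b:
        return check_code(gt)
    if a < b:
        return check_code(lt)
    return check_code(eq)


# Hypothetical sample data using the name, age and gender fields.
first_ds = [{"name": "Ann", "age": 31, "gender": "F"},
            {"name": "Bob", "age": 45, "gender": "M"}]
second_ds = [{"name": "Ann", "age": 31, "gender": "F"},
             {"name": "Bob", "age": 40, "gender": "M"}]

codes = [compare_code(a, b, ["age", "gender"])
         for a, b in zip(first_ds, second_ds)]
print(codes)  # [0, 1]
```

Passing gt, eq, or lt explicitly corresponds to supplying your own n values on the command line.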

Compare example 1: running the compare operator in parallel


Each record has the fields name, age, and gender. All operations are performed on the key fields, age and gender. By default, WebSphere DataStage inserts partition and sort components to meet the partitioning and sorting needs of the compare operator and other operators. The compare operator runs in parallel mode, which is the default mode for this operator, and the -abortOnDifference option is selected to force the operator to abort at the first indication of mismatched records. Here is the osh code corresponding to these operations:
$ osh "compare -abortOnDifference -field age -field gender < sortedDS0.v < sortedDS1.v > outDS.ds"

The output record format for a successful comparison of records looks like this, assuming all default values are used:
result:0 first:name; second:age; third:gender;

Example 2: running the compare operator sequentially


By default, the compare operator executes in parallel on all processing nodes defined in the default node pool. However, you might want to run the operator sequentially on a single node. This could be useful when you intend to store a persistent data set to disk in a single partition. For example, your parallel job might perform data cleansing and data reduction on its input to produce an output data set that is much smaller than the input. Before storing the results to disk, or passing the result to a sequential job, you can use a sequential compare operator to store the data set to disk with a single partition.

To force the operator to execute sequentially, specify the [-seq] framework argument. When executed sequentially, the operator uses a collection method of any. A sequential operator using this collection method can have its collection method overridden by an input data set to the operator. Suppose you want to run the same job as shown in Compare example 1: running the compare operator in parallel, but you want the compare operator to run sequentially. Issue this osh command:
$ osh "compare -field gender -field age [-seq] < inDS0.ds < inDS1.ds > outDS.ds"

Copy operator
You can use the modify operator with the copy operator to modify the data set as the operator performs the copy operation. See Modify Operator for more information on modifying data.

Data flow diagram


input data set -> inRec:*; -> copy -> outRec:*; (one per output) -> output data sets

Copy: properties
Table 14. Copy properties

Property                                        Value
Number of input data sets                       1
Number of output data sets                      0 or more (0 - n) set by user
Input interface schema                          inRec:*
Output interface schema                         outRec:*
Transfer behavior                               inRec -> outRec without record modification
Execution mode                                  parallel (default) or sequential
Partitioning method                             any (parallel mode)
Collection method                               any (sequential mode)
Preserve-partitioning flag in output data set   propagated
Composite operator                              no
Combinable operator                             yes

The copy operator:


v Takes any single data set as input
v Has an input interface schema consisting of a single schema variable inRec and an output interface schema consisting of a single schema variable outRec
v Copies the input data set to the output data sets without affecting the record schema or contents

Copy: syntax and options


copy [-checkpoint n] [-force]

Table 15. Copy options

Option: -checkpoint
Use: -checkpoint n
Specifies the number of records copied from the input persistent data set to each segment of each partition of the output data set. The value of n must be positive. Its default value is 1. In order for this option to be specified, the input data set to the copy operator must be persistent and the operator must be run in parallel. The step containing the copy operator must be checkpointed, that is, you must have specified the keyword -restartable as part of the step definition.

Option: -force
Use: -force
Specifies that WebSphere DataStage must not attempt to optimize the step by removing the copy operator. In some cases, WebSphere DataStage can remove a copy operator if it determines that the copy operator is unnecessary. However, your job might require the copy operator to execute. In this case, you use the -force option. See Preventing WebSphere DataStage from removing a copy operator.

Preventing WebSphere DataStage from removing a copy operator


Before running a job, WebSphere DataStage optimizes each step. As part of this optimization, WebSphere DataStage removes unnecessary copy operators. However, this optimization can sometimes remove a copy operator that you do not want removed. For example, the following data flow imports a single file into a virtual data set, then copies the resulting data set to a new data set:


step: import -> copy -> outDS.ds
Here is the osh command:
$ osh "import -file inFile.dat -schema recordSchema | copy > outDS.ds"

This occurs:
1. The import operator reads the data file, inFile.dat, into a virtual data set. The virtual data set is written to a single partition because it reads a single data file. In addition, the import operator executes only on the processing node containing the file.
2. The copy operator runs on all processing nodes in the default node pool, because no constraints have been applied to the input operator. Thus, it writes one partition of outDS.ds to each processing node in the default node pool.
However, if WebSphere DataStage removes the copy operator as part of optimization, the resultant persistent data set, outDS.ds, would be stored only on the processing node executing the import operator. In this example, outDS.ds would be stored as a single partition data set on one node. To prevent removal, specify the -force option. The operator then explicitly performs the repartitioning operation to spread the data over the system.

Copy example 1: The copy operator


In this example, you sort the records of a data set. However, before you perform the sort, you use the copy operator to create two copies of the data set: a persistent copy, which is saved to disk, and a virtual data set, which is passed to the sort operator. Here is a data flow diagram of the operation:


step: copy -> persistent data set; copy -> tsort

Output data set 0 from the copy operator is written to outDS1.ds and output data set 1 is written to the tsort operator. The syntax is as follows:
$ osh "... | copy > outDS1.ds | tsort options ..."

Example 2: running the copy operator sequentially


By default, the copy operator executes in parallel on all processing nodes defined in the default node pool. However, you might have a job in which you want to run the operator sequentially, that is, on a single node. For example, you might want to store a persistent data set to disk in a single partition. You can run the operator sequentially by specifying the [seq] framework argument to the copy operator. When run sequentially, the operator uses a collection method of any. However, you can override the collection method of a sequential operator. This can be useful when you want to store a sorted data set to a single partition. Shown below is an osh command data flow example using the ordered collection operator with a sequential copy operator.
$ osh ". . . opt1 | ordered | copy [seq] > outDS.ds"

Diff operator
Note: The diff operator has been superseded by the changecapture operator. While the diff operator has been retained for backwards compatibility, you should use the changecapture operator for new development.

The diff operator performs a record-by-record comparison of two versions of the same data set (the before and after data sets) and outputs a data set whose records represent the difference between them. The operator assumes that the input data sets are hash-partitioned and sorted in ascending order on the key fields you specify for the comparison.

The comparison is performed based on a set of difference key fields. Two records are copies of one another if they have the same value for all difference keys. In addition, you can specify a set of value key fields. If two records are copies based on the difference key fields, the value key fields determine if one record is a copy or an edited version of the other.

The diff operator is very similar to the changecapture operator described in Changecapture Operator. In most cases, you should use the changecapture operator rather than the diff operator.

By default, WebSphere DataStage inserts partition and sort components to meet the partitioning and sorting needs of the diff operator and other operators. The diff operator does not behave like Unix diff.

Data flow diagram


The input data sets are known as the before and after data sets.

before data set: key0; keyN; value0; valueN; beforeRec:*;
after data set: key0; keyN; value0; valueN; afterRec:*;
diff output data set: diff:int8; beforeRec:*; afterRec:*

diff: properties
Table 16. Diff properties

Property                                        Value
Number of input data sets                       2
Number of output data sets                      1
Input interface schema
  before data set:                              key0; ... keyn; value0; ... valuen; beforeRec:*;
  after data sets:                              key0; ... keyn; value0; ... valuen; afterRec:*;
Output interface schema                         diff:int8; beforeRec:*; afterRec:*;
Transfer behavior
  before to output:                             beforeRec -> beforeRec without record modification
  after to output:                              afterRec -> afterRec without record modification
Execution mode                                  parallel (default) or sequential
Input partitioning style                        keys in same partition
Partitioning method                             any (parallel mode)
Collection method                               any (sequential mode)
Preserve-partitioning flag in output data set   propagated
Composite operator                              no

Transfer behavior
The operator produces a single output data set, whose schema is the concatenation of the before and after input schemas. Each record of the output data set has the following format:


diff:int8 | Fields from before record | Fields from after record that are not in before record

The usual name conflict resolution rules apply. The output data set contains a number of records in the range:
num_in_before <= num_in_output <= (num_in_before + num_in_after)

The number of records in the output data set depends on how many records are copies, edits, and deletes. If the before and after data sets are exactly the same, the number of records in the output data set equals the number of records in the before data set. If the before and after data sets are completely different, the output data set contains one record for each before and one record for each after data set record.
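The range above follows from simple counting. The Python sketch below is a worked check, not part of the product. It assumes default options (no -drop options), that every before record is classified as a copy, edit, or delete, and that every after record is classified as a copy, edit, or insert, with each copy or edit producing one combined output record. The function name output_record_count is hypothetical.

```python
def output_record_count(copies: int, edits: int, deletes: int, inserts: int):
    """Return (num_in_before, num_in_after, num_in_output) for given result counts.

    Assumption: a copy or an edit pairs one before record with one after
    record and yields a single output record.
    """
    num_in_before = copies + edits + deletes
    num_in_after = copies + edits + inserts
    num_in_output = copies + edits + deletes + inserts
    # The documented range: output is at least the before count and at most
    # the sum of both input counts.
    assert num_in_before <= num_in_output <= num_in_before + num_in_after
    return num_in_before, num_in_after, num_in_output


# Identical inputs: every record is a copy.
print(output_record_count(copies=5, edits=0, deletes=0, inserts=0))  # (5, 5, 5)
# Completely different inputs: only deletes and inserts.
print(output_record_count(copies=0, edits=0, deletes=3, inserts=4))  # (3, 4, 7)
```

Under these assumptions the output count equals the before count plus the number of inserts, which is why it can never fall below the before count.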

Key fields
The before data set's schema determines the difference key type. You can use an upstream modify operator to alter it. The after data set's key field(s) must have the same name as the before key field(s) and be either of the same data type or of a compatible data type. The same rule holds true for the value fields: the after data set's value field(s) must be of the same name and data type as the before value field(s). You can use an upstream modify operator to bring this about. Only top-level, non-vector, non-nullable fields can be used as difference keys. Only top-level, non-vector fields can be used as value fields. Value fields can be nullable.

Identical field names


When the two input data sets have the same field name, the diff operator retains the field of the first input, drops the identically named field from the second input, and issues a warning for each dropped field. Override the default behavior by modifying the second field name so that both versions of the field are retained in the output. (See Modify Operator.) You can then write a custom operator to select the version you require for a given job.
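The name-conflict rule can be pictured as a merge of two field lists. The following Python sketch is a simplified model, assuming schemas are plain lists of field names. The function diff_output_fields and the region field are hypothetical and used only for illustration.

```python
from typing import List, Sequence, Tuple


def diff_output_fields(before_fields: Sequence[str],
                       after_fields: Sequence[str]) -> Tuple[List[str], List[str]]:
    """Model the output field list: diff, then before fields, then new after fields.

    Returns (output_fields, dropped_after_fields). Each dropped field stands
    for one warning issued by the operator.
    """
    output = ["diff"] + list(before_fields)
    dropped = []
    for name in after_fields:
        if name in before_fields:
            dropped.append(name)   # identically named field from the second input
        else:
            output.append(name)    # field that exists only in the after data set
    return output, dropped


fields, dropped = diff_output_fields(
    ["customer", "month", "balance"],
    ["customer", "month", "balance", "region"],  # "region" is a made-up field
)
print(fields)   # ['diff', 'customer', 'month', 'balance', 'region']
print(dropped)  # ['customer', 'month', 'balance']
```

Renaming a field in the second input before the diff, as the text suggests, would move it from the dropped list into the output list.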

Determining differences
The diff operator reads the current record from the before data set, reads the current record from the after data set, and compares the records of the input data sets using the difference keys. The comparison results are classified as follows:
v Insert: A record exists in the after data set but not the before data set. The operator transfers the after record to the output. The operator does not copy the current before record to the output but retains it for the next iteration of the operator. The data type's default value is written to each before field in the output. By default, the operator writes a 0 to the diff field of the output record.
v Delete: A record exists in the before data set but not the after data set. The operator transfers the before record to the output. The operator does not copy the current after record to the output but retains it for the next iteration of the operator. The data type's default value is written to each after field in the output. By default, the operator writes a 1 to the diff field of the output record.


v Copy: The record exists in both the before and after data sets and the specified value field values have not been changed. The before and after records are both transferred to the output. By default, the operator writes a 2 to the diff (first) field of the output record.
v Edit: The record exists in both the before and after data sets; however, one or more of the specified value field values have been changed. The before and after records are both transferred to the output. By default, the operator writes a 3 to the diff (first) field of the output record.

Options are provided to drop each kind of output record and to change the numerical value written to the diff (first) field of the output record.

In addition to the difference key fields, you can optionally define one or more value key fields. If two records are determined to be copies because they have equal values for all the difference key fields, the operator then examines the value key fields.
v Records whose difference and value key fields are equal are considered copies of one another. By default, the operator writes a 2 to the diff (first) field of the output record.
v Records whose difference key fields are equal but whose value key fields are not equal are considered edited copies of one another. By default, the operator writes a 3 to the diff (first) field of the output record.
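The classification described above resembles a merge of two sorted lists. The following Python sketch is a minimal model of that logic, not the operator's code. It assumes both inputs are lists of dictionaries sorted in ascending order on the difference keys, that difference keys are unique within each input, and it uses None where the operator would write default field values. The function name diff_records is hypothetical.

```python
from typing import List, Mapping, Optional, Sequence, Tuple

Result = Tuple[int, Optional[Mapping], Optional[Mapping]]


def diff_records(before: Sequence[Mapping], after: Sequence[Mapping],
                 keys: Sequence[str], values: Sequence[str] = (),
                 insert: int = 0, delete: int = 1,
                 copy: int = 2, edit: int = 3) -> List[Result]:
    """Classify records as insert, delete, copy or edit.

    Returns (code, before_record, after_record) tuples. The default codes
    mirror the documented defaults for the diff field.
    """
    def key(rec):
        return tuple(rec[k] for k in keys)

    def val(rec):
        return tuple(rec[v] for v in values)

    out: List[Result] = []
    i = j = 0
    while i < len(before) or j < len(after):
        if i == len(before):                      # only after records remain
            out.append((insert, None, after[j]))
            j += 1
        elif j == len(after):                     # only before records remain
            out.append((delete, before[i], None))
            i += 1
        elif key(before[i]) == key(after[j]):     # same difference keys
            same = val(before[i]) == val(after[j])
            out.append((copy if same else edit, before[i], after[j]))
            i += 1
            j += 1
        elif key(after[j]) < key(before[i]):      # after record has no match
            out.append((insert, None, after[j]))  # before record is retained
            j += 1
        else:                                     # before record has no match
            out.append((delete, before[i], None)) # after record is retained
            i += 1
    return out


before_ds = [{"customer": 1, "balance": 10.0},
             {"customer": 2, "balance": 20.0},
             {"customer": 4, "balance": 40.0}]
after_ds = [{"customer": 1, "balance": 10.0},
            {"customer": 2, "balance": 25.0},
            {"customer": 3, "balance": 30.0}]

result = diff_records(before_ds, after_ds, keys=["customer"], values=["balance"])
print([code for code, _, _ in result])  # [2, 3, 0, 1]
```

Customer 1 is a copy, customer 2 is an edit because the balance changed, customer 3 is an insert, and customer 4 is a delete.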

Diff: syntax and options


You must specify at least one difference key to the operator using -key.
diff -key field [-ci | -cs] [-param params] [-key field [-ci | -cs] [-param params]...] [-allValues [-ci | -cs] [-param params]] [-collation_sequence locale | collation_file_pathname | OFF] [-copyCode n] [-deleteCode n] [-dropCopy] [-dropDelete] [-dropEdit] [-dropInsert] [-editCode n] [-insertCode n] [-stats] [-tolerateUnsorted] [-value field [-ci | -cs] [-param params] ...]

Table 17. Diff options

Option: -key
Use: -key field [-ci | -cs] [-param params]
Specifies the name of a difference key field. The -key option can be repeated if there are multiple key fields. Note that you cannot use a nullable, vector, subrecord, or tagged aggregate field as a difference key. The -ci option specifies that the comparison of difference key values is case insensitive. The -cs option specifies a case-sensitive comparison, which is the default. The -param suboption allows you to specify extra parameters for a key. Specify parameters using property=value pairs separated by commas.


Table 17. Diff options (continued)

Option: -allValues
Use: -allValues [-ci | -cs] [-param params]
Specifies that all fields other than the difference key fields identified by -key are used as value key fields. The operator does not use vector, subrecord, and tagged aggregate fields as value keys and skips fields of these data types. When a before and after record are determined to be copies based on the difference keys, the value keys can then be used to determine if the after record is an edited version of the before record. The -ci option specifies that the comparison of value keys is case insensitive. The -cs option specifies a case-sensitive comparison, which is the default. The -param suboption allows you to specify extra parameters for a key. Specify parameters using property=value pairs separated by commas.

Option: -collation_sequence
Use: -collation_sequence locale | collation_file_pathname | OFF
This option determines how your string data is sorted. You can:
v Specify a predefined IBM ICU locale
v Write your own collation sequence using ICU syntax, and supply its collation_file_pathname
v Specify OFF so that string comparisons are made using Unicode code-point value order, independent of any locale or custom sequence.
By default, WebSphere DataStage sorts strings using byte-wise comparisons. For more information, reference this IBM ICU site: http://oss.software.ibm.com/icu/userguide/Collate_Intro.html

Option: -copyCode
Use: -copyCode n
Specifies the value for the diff field in the output record when the before and after records are copies. The n value is an int8. The default value is 2. A copy means all key fields and all optional value fields are equal.

Option: -deleteCode
Use: -deleteCode n
Specifies the value for the diff field in the output record for the delete result. The n value is an int8. The default value is 1. A delete result means that a record exists in the before data set but not in the after data set as defined by the difference key fields.


Table 17. Diff options (continued)

Option: -dropCopy, -dropDelete, -dropEdit, -dropInsert
Use: -dropCopy -dropDelete -dropEdit -dropInsert
Specifies to drop the output record, meaning not generate it, for any one of the four difference result types. By default, an output record is always created by the operator. You can specify any combination of these four options.

Option: -editCode
Use: -editCode n
Specifies the value for the diff field in the output record for the edit result. The n value is an int8. The default value is 3. An edit result means all difference key fields are equal but one or more value key fields are different.

Option: -insertCode
Use: -insertCode n
Specifies the value for the diff field in the output record for the insert result. The n value is an int8. The default value is 0. An insert result means that a record exists in the after data set but not in the before data set as defined by the difference key fields.

Option: -stats
Use: -stats
Configures the operator to display result information containing the number of input records and the number of copy, delete, edit, and insert records.

Option: -tolerateUnsorted
Use: -tolerateUnsorted
Specifies that the input data sets are not sorted. By default, the operator generates an error and aborts the step when detecting unsorted inputs. This option allows you to process groups of records that might be arranged by the difference key fields but not sorted. The operator consumes input records in the order in which they appear on its input. If you use this option, no automatic partitioner or sort insertions are made.


Table 17. Diff options (continued)

Option: -value
Use: -value field [-ci | -cs] [-param params]
Optionally specifies the name of a value key field. The -value option can be repeated if there are multiple value fields. When a before and after record are determined to be copies based on the difference keys (as defined by -key), the value keys can then be used to determine if the after record is an edited version of the before record. Note that you cannot use a vector, subrecord, or tagged aggregate field as a value key. The -ci option specifies that the comparison of value keys is case insensitive. The -cs option specifies a case-sensitive comparison, which is the default. The -param suboption allows you to specify extra parameters for a key. Specify parameters using property=value pairs separated by commas.

Diff example 1: general example


The following example assumes that the input data set records contain a customer and month field. The operator examines the customer and month fields of each input record for differences. By default, WebSphere DataStage inserts partition and sort components to meet the partitioning and sorting needs of the diff operator and other operators. Here is the osh command:
$ osh " diff -key month -key customer < before.v < after.v > outDS.ds"

Example 2: Dropping Output Results


In some cases, you might be interested only in some results of the diff operator. In this example, you keep only the output records of the edit, delete and insert results. That is, you explicitly drop the copy results so that the output data set contains records only when there is a difference between the before and after data records.


Here is the data flow for this example:

before data set schema: customer:int16; month:string[3]; name:string[21]; accounttype:int8; balance:sfloat;
after data set schema: customer:int16; month:string[3]; name:string[21]; accounttype:int8; balance:sfloat;
(difference keys: customer and month; value key: balance)
step: before, after -> diff -> switch (-key diff) -> output 0 (delete), output 1 (edit), output 2 (insert)
output data sets schema: diff:int8; customer:int16; month:string[3]; name:string[21]; accounttype:int8; balance:sfloat;

You specify these key and value fields to the diff operator:
key=month key=customer value=balance

After you run the diff operator, you invoke the switch operator to divide the output records into data sets based on the result type. The switch operator in this example creates three output data sets: one for delete results, one for edit results, and one for insert results. It creates only three data sets, because you have explicitly dropped copy results from the diff operation by specifying -dropCopy. By creating a separate data set for each of the three remaining result types, you can handle each one differently:
deleteCode=0 editCode=1 insertCode=2

Here is the osh command:


$ osh "diff -key month -key customer -value balance -dropCopy -deleteCode 0 -editCode 1 -insertCode 2 < before.ds < after.ds | switch -key diff > outDelete.ds > outEdit.ds > outInsert.ds"
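The routing step of this example can be modeled in a few lines. The Python sketch below is illustrative only. It assumes that the switch operator sends a record whose diff value is n to output n, as the labels output 0 (delete), output 1 (edit), and output 2 (insert) suggest. The function route_by_diff and the sample records are hypothetical.

```python
from typing import List, Mapping, Sequence


def route_by_diff(records: Sequence[Mapping], num_outputs: int = 3) -> List[List[Mapping]]:
    """Split diff output records into one list per result code.

    Assumption: the value of the diff field is used directly as the
    output number (0 = delete, 1 = edit, 2 = insert in this example).
    """
    outputs: List[List[Mapping]] = [[] for _ in range(num_outputs)]
    for rec in records:
        code = rec["diff"]
        if not 0 <= code < num_outputs:
            raise ValueError(f"diff code {code} has no matching output")
        outputs[code].append(rec)
    return outputs


# Made-up diff output after -dropCopy -deleteCode 0 -editCode 1 -insertCode 2.
diff_output = [
    {"diff": 1, "customer": 2, "month": "Jan", "balance": 25.0},
    {"diff": 2, "customer": 3, "month": "Jan", "balance": 30.0},
    {"diff": 0, "customer": 4, "month": "Jan", "balance": 40.0},
]

out_delete, out_edit, out_insert = route_by_diff(diff_output)
print(len(out_delete), len(out_edit), len(out_insert))  # 1 1 1
```

Because copy results are dropped before routing, three outputs are enough to receive every remaining record.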

Encode operator
The encode operator encodes or decodes a WebSphere DataStage data set using a UNIX encoding command that you supply. The operator can convert a WebSphere DataStage data set from a sequence of records into a stream of raw binary data. The operator can also reconvert the data stream to a WebSphere DataStage data set.

Data flow diagram


input data set -> in:*; -> encode (mode = encode) -> encoded data set
encoded data set -> encode (mode = decode) -> out:*; -> decoded data set

In the figure shown above, the mode argument specifies whether the operator is performing an encoding or decoding operation. Possible values for mode are:
v encode: encode the input data set
v decode: decode the input data set

encode: properties
Table 18. encode properties

Property                                        Value
Number of input data sets                       1
Number of output data sets                      1
Input interface schema                          mode = encode: in:*; mode = decode: none
Output interface schema                         mode = encode: none; mode = decode: out:*;
Transfer behavior                               in -> out without record modification for an encode/decode cycle
Execution mode                                  parallel (default) or sequential
Partitioning method                             mode = encode: any; mode = decode: same
Collection method                               any
Preserve-partitioning flag in output data set   mode = encode: sets; mode = decode: propagates
Composite operator                              no
Combinable operator                             no

Encode: syntax and options


Terms in italic typeface are option strings you supply. When your option string contains a space or a tab character, you must enclose it in single quotes.

encode -command command_line [-direction encode | decode] | [-mode [encode | decode]] | [-encode | -decode]

Table 19. Encode options

Option: -command
Use: -command command_line
Specifies the command line used for encoding/decoding. The command line must configure the UNIX command to accept input from stdin and write its results to stdout. The command must be located in the search path of your job and be accessible by every processing node on which the encode operator executes.

Option: -direction or -mode
Use: -direction encode | decode
     -mode encode | decode
Specifies the mode of the operator. If you do not select a direction, it defaults to encode.

Option: -encode
Specify encoding of the data set. Encoding is the default mode.

Option: -decode
Specify decoding of the data set.

Encoding WebSphere DataStage data sets


Each record of a data set has defined boundaries that mark its beginning and end. The encode operator lets you invoke a UNIX command that encodes a WebSphere DataStage data set, which is in record format, into raw binary data and vice versa.

Processing encoded data sets


An encoded data set is similar to a WebSphere DataStage data set. An encoded, persistent data set is stored on disk in the same way as a normal data set, as two or more files:
v A single descriptor file
v One or more data files
However, an encoded data set cannot be accessed like a standard WebSphere DataStage data set, because its records are in an encoded binary format. Nonetheless, you can specify an encoded data set to any operator that does no field-based processing or reordering of the records. For example, you can invoke the copy operator to create a copy of the encoded data set.

You can further encode a data set using a different encoding operator to create an encoded-compressed data set. For example, you might compress the encoded file using WebSphere DataStage's pcompress operator (see Pcompress Operator), then invoke the uuencode command to convert a binary file to an emailable format. You would then restore the data set by first decompressing and then decoding the data set.

Encoded data sets and partitioning


When you encode a data set, you remove its normal record boundaries. The encoded data set cannot be repartitioned, because partitioning in WebSphere DataStage is performed record-by-record. For that reason, the encode operator sets the preserve-partitioning flag in the output data set. This prevents a WebSphere DataStage operator that uses a partitioning method of any from repartitioning the data set and causes WebSphere DataStage to issue a warning if any operator attempts to repartition the data set.


For a decoding operation, the operator takes as input a previously encoded data set. The preserve-partitioning flag is propagated from the input to the output data set.

Example
In the following example, the encode operator compresses a data set using the UNIX gzip utility. By default, gzip takes its input from stdin. The -c switch configures gzip to write its results to stdout, as the operator requires; when gzip reads from stdin, as it does here, it writes to stdout even without -c. Here is the osh code for this example:
$ osh " ... op1 | encode -command gzip > encodedDS.ds"

The following example decodes the previously compressed data set so that it can be used by other WebSphere DataStage operators. To do so, you use an instance of the encode operator with a mode of decode. In a converse operation to the encoding, you specify the same command, gzip, with the -d option to decode its input. Because the command line contains a space, it is enclosed in single quotes. Here is the osh command for this example:
$ osh "encode -decode -command 'gzip -d' < inDS.ds | op2 ..."

In this example, the command line uses the -d switch to specify the decompress mode of gzip.
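The requirement that the command read from stdin and write to stdout can be tried outside DataStage. The Python sketch below is an illustration only and does not show how the encode operator runs the command internally. It assumes a gzip executable on the search path. The names run_codec, encode_bytes, and decode_bytes are hypothetical.

```python
import shutil
import subprocess
from typing import Sequence

# True when a gzip executable can be found on the search path.
GZIP_AVAILABLE = shutil.which("gzip") is not None


def run_codec(argv: Sequence[str], data: bytes) -> bytes:
    """Feed data to the command's stdin and return what it writes to stdout."""
    done = subprocess.run(list(argv), input=data,
                          stdout=subprocess.PIPE, check=True)
    return done.stdout


def encode_bytes(data: bytes) -> bytes:
    """Compress with gzip; -c makes the use of stdout explicit."""
    return run_codec(["gzip", "-c"], data)


def decode_bytes(data: bytes) -> bytes:
    """Decompress with gzip -d, again writing to stdout."""
    return run_codec(["gzip", "-d", "-c"], data)
```

Calling decode_bytes(encode_bytes(payload)) on any byte string returns the original payload, which mirrors the encode/decode cycle the operator performs on a data set.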

Filter operator
The filter operator transfers the input records in which one or more fields meet the requirements you specify. If you request a reject data set, the filter operator transfers records that do not meet the requirements to the reject data set.

Data flow diagram


input data set -> inRec:*; -> filter -> outRec:*; (one per output) -> output data sets and optional reject data set

filter: properties
Table 20. filter properties

Property                                        Value
Number of input data sets                       1
Number of output data sets                      1 or more, and, optionally, a reject data set
Input interface schema                          inRec:*;


Table 20. filter properties (continued)

Property                                        Value
Output interface schema                         outRec:*;
Transfer behavior                               inRec -> outRec without record modification
Execution mode                                  parallel by default, or sequential
Partitioning method                             any (parallel mode)
Collection method                               any (sequential mode)
Preserve-partitioning flag in output data set   propagated
Composite operator                              no
Combinable operator                             yes

Filter: syntax and options


The -where option is required. Terms in italic typeface are option strings you supply. When your option string contains a space or a tab character, you must enclose it in single quotes.
filter -where P [-target dsNum] [-where P [-target dsNum] ... ] [-collation_sequence locale | collation_file_pathname | OFF] [-first] [-nulls first | last] [-reject]

Table 21. Filter options

Option: -where
Use: -where P [-target dsNum]
Specifies the predicate which determines the filter. In SQL, a predicate is an expression which evaluates as TRUE, FALSE, or UNKNOWN and whose value depends on the value of one or more field values. Enclose the predicate in single quotes. Single quotes within the predicate must be preceded by the backslash character (\), as in Example 3: Evaluating Input Records below. If a field is formatted as a special WebSphere DataStage data type, such as date or timestamp, enclose it in single quotes. Multi-byte Unicode character data is supported in predicate field names, constants, and literals. Multiple -where options are allowed. Each occurrence of -where causes the output data set number to be incremented by one, unless you use the -target suboption.

Option: -first
Use: -first
Records are output only to the data set corresponding to the first -where clause they match. The default is to write a record to the data sets corresponding to all -where clauses they match.


Table 21. Filter options (continued)

Option: -collation_sequence
Use: -collation_sequence locale | collation_file_pathname | OFF
This option determines how your string data is sorted. You can:
v Specify a predefined IBM ICU locale
v Write your own collation sequence using ICU syntax, and supply its collation_file_pathname
v Specify OFF so that string comparisons are made using Unicode code-point value order, independent of any locale or custom sequence.
By default, WebSphere DataStage sorts strings using byte-wise comparisons. For more information, reference this IBM ICU site: http://oss.software.ibm.com/icu/userguide/Collate_Intro.html

Option: -nulls
Use: -nulls first | last
By default, nulls are evaluated first, before other values. To override this default, specify -nulls last.

Option: -reject
Use: -reject
By default, records that do not meet specified requirements are dropped. Specify this option to override the default. If you do, attach a reject output data set to the operator.

Option: -target
Use: -target dsNum
An optional sub-property of where. Use it to specify the target data set for a where clause. Multiple -where clauses can direct records to the same output data set. If a target data set is not specified for a particular -where clause, the output data set for that clause is implied by the order of all -where properties that do not have the -target sub-property. For example:

Property                                Data set
-where "field1 < 4"                     0
-where "field2 like bb"                 1
-where "field3 like aa" -target         2
-where "field4 > 10" -target            0
-where "field5 like c.*"                2
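The numbering rule for -where clauses and the effect of -first can be checked with a small model. The Python sketch below is illustrative, not the operator's implementation. It assumes that the two -target clauses in the example name data sets 2 and 0, as the Data set column indicates, and it uses Python callables in place of predicate strings, with like approximated by simple string tests. The names assign_data_sets and matching_data_sets are hypothetical. The sketch lists one entry per matching clause; whether the operator writes a record twice when two matching clauses share a data set is not stated in the text.

```python
from typing import Callable, List, Mapping, Optional, Sequence, Tuple

Clause = Tuple[Callable[[Mapping], bool], Optional[int]]  # (predicate, -target or None)


def assign_data_sets(clauses: Sequence[Clause]) -> List[int]:
    """Return the output data set number for each -where clause.

    Clauses without a target are numbered 0, 1, 2, ... in their own order;
    clauses with a target use the number they name.
    """
    numbers: List[int] = []
    next_implied = 0
    for _predicate, target in clauses:
        if target is None:
            numbers.append(next_implied)
            next_implied += 1
        else:
            numbers.append(target)
    return numbers


def matching_data_sets(record: Mapping, clauses: Sequence[Clause],
                       first: bool = False) -> List[int]:
    """List the data sets a record is sent to; an empty list means no match.

    With first=True (the -first option) only the first matching clause counts.
    """
    hits: List[int] = []
    for (predicate, _target), number in zip(clauses, assign_data_sets(clauses)):
        if predicate(record):
            hits.append(number)
            if first:
                break
    return hits


clauses: List[Clause] = [
    (lambda r: r["field1"] < 4, None),
    (lambda r: r["field2"] == "bb", None),
    (lambda r: r["field3"] == "aa", 2),            # assumed -target 2
    (lambda r: r["field4"] > 10, 0),               # assumed -target 0
    (lambda r: r["field5"].startswith("c"), None),
]
print(assign_data_sets(clauses))  # [0, 1, 2, 0, 2]

record = {"field1": 1, "field2": "bb", "field3": "zz", "field4": 5, "field5": "x"}
print(matching_data_sets(record, clauses))              # [0, 1]
print(matching_data_sets(record, clauses, first=True))  # [0]
```

A record for which matching_data_sets returns an empty list is the kind of record that is dropped by default, or sent to the reject data set when -reject is specified.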

Job monitoring information


The filter operator reports business logic information which can be used to make decisions about how to process data. It also reports summary statistics based on the business logic. The business logic is included in the metadata messages generated by WebSphere DataStage as custom information. It is identified with:
name="BusinessLogic"

The output summary per criterion is included in the summary messages generated by WebSphere DataStage as custom information. It is identified with:
name="CriterionSummary"

The XML tags, criterion, case and where, are used by the filter operator when generating business logic and criterion summary custom information. These tags are used in the example information below.

Example metadata and summary messages


<response type="metadata">
  <component ident="filter">
    <componentstats startTime="2002-08-08 14:41:56"/>
    <linkstats portNum="0" portType="in"/>
    <linkstats portNum="0" portType="out"/>
    <linkstats portNum="1" portType="out"/>
    <custom_info Name="BusinessLogic" Desc="User-supplied logic to filter operator">
      <criterion name="where">
        <where value="true" output_port="0"/>
        <where value="false" output_port="1"/>
      </criterion>
    </custom_info>
  </component>
</response>
<response type="summary">
  <component ident="filter" pid="2239">
    <componentstats startTime="2002-08-08 14:41:59" stopTime="2002-08-08 14:42:40" percentCPU="99.5"/>
    <linkstats portNum="0" portType="in" recProcessed="1000000"/>
    <linkstats portNum="0" portType="out" recProcessed="500000"/>
    <linkstats portNum="1" portType="out" recProcessed="500000"/>
    <custom_info Name="CriterionSummary" Desc="Output summary per criterion">
      <where value="true" output_port="0" recProcessed="500000"/>
      <where value="false" output_port="1" recProcessed="500000"/>
    </custom_info>
  </component>
</response>
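A consumer of these messages might extract the per-criterion counts along the following lines. This is a minimal sketch using Python's standard xml.etree.ElementTree module; the abbreviated SUMMARY string is modeled on the summary message above, and the function name criterion_summary is hypothetical. How the messages are actually delivered to a monitoring client is not covered here.

```python
import xml.etree.ElementTree as ET

# Abbreviated sample modeled on the summary message (an assumption, not captured output).
SUMMARY = """
<response type="summary">
  <component ident="filter" pid="2239">
    <linkstats portNum="0" portType="in" recProcessed="1000000"/>
    <linkstats portNum="0" portType="out" recProcessed="500000"/>
    <linkstats portNum="1" portType="out" recProcessed="500000"/>
    <custom_info Name="CriterionSummary" Desc="Output summary per criterion">
      <where value="true" output_port="0" recProcessed="500000"/>
      <where value="false" output_port="1" recProcessed="500000"/>
    </custom_info>
  </component>
</response>
"""


def criterion_summary(xml_text):
    """Return (value, output_port, recProcessed) for each where tag
    found inside the CriterionSummary custom information."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for info in root.iter("custom_info"):
        if info.get("Name") != "CriterionSummary":
            continue
        for where in info.iter("where"):
            rows.append((where.get("value"),
                         int(where.get("output_port")),
                         int(where.get("recProcessed"))))
    return rows


for value, port, count in criterion_summary(SUMMARY):
    print(value, port, count)
```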

Customizing job monitor messages


WebSphere DataStage specifies the business logic and criterion summary information for the filter operator using the functions addCustomMetadata() and addCustomSummary(). You can also use these functions to generate similar information for the operators you write.

Expressions
The behavior of the filter operator is governed by expressions that you set. You can use the following elements to specify the expressions:
• Fields of the input data set
• Requirements involving the contents of the fields
• Optional constants to be used in comparisons
• The Boolean operators AND and OR to combine requirements

When a record meets the requirements, it is written unchanged to an output data set. Which of the output data sets it is written to is either implied by the order of your -where options or explicitly defined by means of the -target suboption.

The filter operator supports standard SQL expressions, except when comparing strings.


Input data types


If you specify a single field for evaluation, that field can be of any data type. Note that WebSphere DataStage's treatment of strings differs slightly from that of standard SQL.

If you compare fields, they must be of the same or compatible data types. Otherwise, the operation terminates with an error. Compatible data types are those that WebSphere DataStage converts by default. Regardless of any conversions, the whole record is transferred unchanged to the output. If the fields are not compatible upstream of the filter, you can convert the types by using the modify operator prior to using the filter.

Field data type conversion is based on the following rules:
• Any integer, signed or unsigned, when compared to a floating-point type, is converted to floating-point.
• Comparisons within a general type convert the smaller to the larger size (sfloat to dfloat, uint8 to uint16, and so on).
• When signed and unsigned integers are compared, unsigned are converted to signed.
• Decimal, raw, string, time, date, and timestamp do not figure in type conversions. When any of these is compared to another type, filter returns an error and terminates.

Note: The conversion of numeric data types might result in a loss of range and cause incorrect results. WebSphere DataStage displays a warning message to that effect when range is lost.

The input field can contain nulls. If it does, null values are less than all non-null values, unless you specify the operator's -nulls last option.

Supported Boolean expressions and operators


The following list summarizes the Boolean expressions that WebSphere DataStage supports. In the list, BOOLEAN denotes any Boolean expression.
1. true
2. false
3. six comparison operators: =, <>, <, >, <=, >=
4. is null
5. is not null
6. like abc (the second operand must be a regular expression; see Regular expressions)
7. between (for example, A between B and C is equivalent to B <= A and A <= C)
8. not BOOLEAN
9. BOOLEAN is true
10. BOOLEAN is false
11. BOOLEAN is not true
12. BOOLEAN is not false

Any of these can be combined using AND or OR.

Regular expressions
The description of regular expressions in this section has been taken from this publication: Rogue Wave, Tools.h++.


One-character regular expressions


The following rules determine one-character regular expressions that match a single character:
• Any character that is not a special character matches itself. Special characters are defined below.
• A backslash (\) followed by any special character matches the literal character itself; the backslash escapes the special character.
• The special characters are: + * ? . [ ] ^ $
• The period (.) matches any character except the newline; for example, .umpty matches either Humpty or Dumpty.
• A set of characters enclosed in brackets ([]) is a one-character regular expression that matches any of the characters in that set. For example, [akm] matches either an a, k, or m. A range of characters can be indicated with a dash. For example, [a-z] matches any lowercase letter. However, if the first character of the set is the caret (^), then the regular expression matches any character except those in the set. It does not match the empty string. For example, [^akm] matches any character except a, k, or m. The caret loses its special meaning if it is not the first character of the set.

Multi-character regular expressions


The following rules can be used to build multi-character regular expressions:
• A one-character regular expression followed by an asterisk (*) matches zero or more occurrences of the regular expression. For example, [a-z]* matches zero or more lowercase characters.
• A one-character regular expression followed by a plus (+) matches one or more occurrences of the regular expression. For example, [a-z]+ matches one or more lowercase characters.
• A one-character regular expression followed by a question mark (?) is optional: it can occur zero times or once in the string, no more. For example, xy?z matches either xyz or xz.
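The one-character and multi-character rules above can be tried out with Python's standard re module, which happens to use the same notation for these particular constructs. This is only an illustration: the filter operator uses its own matcher, and the whole-string matching done by the hypothetical matches helper below is an assumption, not documented behavior of the like operator.

```python
import re


def matches(pattern, text):
    """Return True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


# One-character rules
print(matches(".umpty", "Humpty"))   # period matches any single character
print(matches("[akm]", "k"))         # a set matches any one of its members
print(matches("[^akm]", "a"))        # a leading caret negates the set
print(matches("[^akm]", ""))         # a negated set never matches the empty string
print(matches(r"\.", "."))           # backslash escapes a special character

# Multi-character rules
print(matches("[a-z]*", ""))         # asterisk: zero or more occurrences
print(matches("[a-z]+", ""))         # plus: one or more occurrences
print(matches("xy?z", "xz"))         # question mark: zero or one occurrence
```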

Order of association
As in SQL, expressions are associated left to right. AND and OR have the same precedence. You can group fields and expressions in parentheses to affect the order of evaluation.
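Because AND and OR are described as having the same precedence, an ungrouped expression is evaluated strictly left to right. The sketch below models that rule; eval_left_to_right is a hypothetical helper, not part of the product. It is contrasted with Python's own rules, where and binds more tightly than or, to show why parentheses can matter.

```python
def eval_left_to_right(first, rest):
    """Fold (operator, operand) pairs left to right with equal precedence.

    first is the value of the leftmost requirement; rest is a list of
    ("and" | "or", value) pairs in the order they are written.
    """
    result = first
    for op, operand in rest:
        if op == "and":
            result = result and operand
        elif op == "or":
            result = result or operand
        else:
            raise ValueError("unknown operator: " + op)
    return result


# X or Y and Z, with X true and Z false:
# left to right this is (X or Y) and Z, which is false.
print(eval_left_to_right(True, [("or", False), ("and", False)]))

# Python itself reads the same text as X or (Y and Z), which is true.
print(True or False and False)
```

When the intended grouping differs from the left-to-right reading, parentheses make it explicit.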

String comparison
WebSphere DataStage operators sort string values according to these general rules:
• Characters are sorted in lexicographic order
• Strings are evaluated by their ASCII value
• Sorting is case sensitive, that is, uppercase letters appear before lowercase letters in sorted data
• Null characters appear before non-null characters in a sorted data set, unless you specify the nulls last option
• Byte-for-byte comparison is performed
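The effect of these rules on a small list can be sketched in Python. The helper sort_strings is hypothetical; it models byte-wise ASCII ordering with nulls (represented here by None) placed first by default, and is not the product's sort implementation.

```python
def sort_strings(values, nulls_last=False):
    """Sort strings byte for byte; None stands in for a null value."""
    def key(value):
        is_null = value is None
        # The first tuple element decides whether nulls go first or last.
        rank = is_null if nulls_last else not is_null
        return (rank, b"" if is_null else value.encode("ascii"))
    return sorted(values, key=key)


data = ["banana", None, "Apple", "apple", "Banana"]
print(sort_strings(data))                   # nulls first, uppercase before lowercase
print(sort_strings(data, nulls_last=True))  # same order, nulls at the end
```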

Filter example 1: comparing two fields


You want to compare fields A and O. If the data in field A is greater than the data in field O, the corresponding records are to be written to the output data set. Use the following osh command:
$ osh "... | filter -where 'A > O' ..."


Example 2: testing for a null


You want to test field A to see if it contains a null. If it does, you want to write the corresponding records to the output data set. Use the following osh command:
$ osh "... | filter -where 'A is null' ..."

Example 3: evaluating input records


You want to evaluate each input record to see if these conditions prevail:
• EITHER all the following are true
  - Field A does not have the value 0
  - Field a does not have the value 3
  - Field o has the value 0
• OR field q equals the string ZAG

Here is the osh command for this example:
$ osh "... | filter -where 'A <> 0 and a <> 3 and o=0 or q = \'ZAG\'' ... "

Job scenario: mailing list for a wine auction


The following extended example uses the filter operator to extract, from a large list of leads, the prospects who should be sent a wine auction catalog. A -where clause selects individuals at or above legal drinking age (adult) with sufficient income to be likely to respond to such a catalog (rich). To illustrate further uses of the -where option, the example also produces a list of all individuals who are either adult or rich (or both) and a list of all individuals who are adult.

Schema for implicit import


The example assumes you have created the following schema and stored it as filter_example.schema:
record (
  first_name: string[max=16];
  last_name: string[max=20];
  gender: string[1];
  age: uint8;
  income: decimal[9,2];
  state: string[2];
)

OSH syntax
osh "
filter
  -where 'age >= 21 and income > 50000.00'
  -where 'income > 50000.00'
  -where 'age >= 21' -target 1
  -where 'age >= 21'
  < [record@filter_example.schema] all12.txt
  0>| AdultAndRich.txt
  1>| AdultOrRich.txt
  2>| Adult.txt
"


The first -where option directs all records that have age >= 21 and income > 50000.00 to output 0, which is then directed to the file AdultAndRich.txt.

The second -where option directs all records that have income > 50000.00 to output 1, which is then directed to AdultOrRich.txt. The third -where option directs all records that have age >= 21 also to output 1 (because of the expression -target 1), which is then directed to AdultOrRich.txt.

The result of the second and third -where options is that records that satisfy either of the two conditions income > 50000.00 or age >= 21 are sent to output 1. A record that satisfies multiple -where options directed to the same output is written to that output only once, so the effect of these two options is exactly the same as:
-where 'income > 50000.00 or age >= 21'

The fourth -where option causes all records satisfying the condition age >= 21 to be sent to output 2, because the last -where option without a -target suboption directs records to output 1. This output is then sent to Adult.txt.

Input data
As a test case, the following twelve data records exist in an input file all12.txt.
John Parker M 24 0087228.46 MA
Susan Calvin F 24 0091312.42 IL
William Mandella M 67 0040676.94 CA
Ann Claybourne F 29 0061774.32 FL
Frank Chalmers M 19 0004881.94 NY
Jane Studdock F 24 0075990.80 TX
Seymour Glass M 18 0051531.56 NJ
Laura Engels F 57 0015280.31 KY
John Boone M 16 0042729.03 CO
Jennifer Sarandon F 58 0081319.09 ND
William Tell M 73 0021008.45 SD
Ann Dillard F 21 0004552.65 MI

Outputs
The following output comes from running WebSphere DataStage. Because of parallelism, the order of the records might be different for your installation. If order matters, you can apply the psort or tsort operator to the output of the filter operator. After the WebSphere DataStage job is run, the file AdultAndRich.txt contains:
John Parker M 24 0087228.46 MA
Susan Calvin F 24 0091312.42 IL
Ann Claybourne F 29 0061774.32 FL
Jane Studdock F 24 0075990.80 TX
Jennifer Sarandon F 58 0081319.09 ND

After the WebSphere DataStage job is run, the file AdultOrRich.txt contains:
John Parker M 24 0087228.46 MA
Susan Calvin F 24 0091312.42 IL
William Mandella M 67 0040676.94 CA
Ann Claybourne F 29 0061774.32 FL
Jane Studdock F 24 0075990.80 TX
Seymour Glass M 18 0051531.56 NJ
Laura Engels F 57 0015280.31 KY
Jennifer Sarandon F 58 0081319.09 ND
William Tell M 73 0021008.45 SD
Ann Dillard F 21 0004552.65 MI


After the WebSphere DataStage job is run, the file Adult.txt contains:
John Parker M 24 0087228.46 MA
Susan Calvin F 24 0091312.42 IL
William Mandella M 67 0040676.94 CA
Ann Claybourne F 29 0061774.32 FL
Jane Studdock F 24 0075990.80 TX
Laura Engels F 57 0015280.31 KY
Jennifer Sarandon F 58 0081319.09 ND
William Tell M 73 0021008.45 SD
Ann Dillard F 21 0004552.65 MI
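The three output files can be cross-checked with a small Python model of the job. This sketch is not osh and does not read all12.txt; it assumes the twelve distinct input records typed in as tuples, expresses the four -where clauses as Python functions, and hard-codes the output numbers 0, 1, 1, and 2 described in the walkthrough. The name run_filter is hypothetical.

```python
# (first_name, last_name, gender, age, income, state)
RECORDS = [
    ("John", "Parker", "M", 24, 87228.46, "MA"),
    ("Susan", "Calvin", "F", 24, 91312.42, "IL"),
    ("William", "Mandella", "M", 67, 40676.94, "CA"),
    ("Ann", "Claybourne", "F", 29, 61774.32, "FL"),
    ("Frank", "Chalmers", "M", 19, 4881.94, "NY"),
    ("Jane", "Studdock", "F", 24, 75990.80, "TX"),
    ("Seymour", "Glass", "M", 18, 51531.56, "NJ"),
    ("Laura", "Engels", "F", 57, 15280.31, "KY"),
    ("John", "Boone", "M", 16, 42729.03, "CO"),
    ("Jennifer", "Sarandon", "F", 58, 81319.09, "ND"),
    ("William", "Tell", "M", 73, 21008.45, "SD"),
    ("Ann", "Dillard", "F", 21, 4552.65, "MI"),
]

AGE, INCOME = 3, 4  # tuple positions of the two fields being tested

# One (predicate, output number) pair per -where clause, in command order.
CLAUSES = [
    (lambda r: r[AGE] >= 21 and r[INCOME] > 50000.00, 0),
    (lambda r: r[INCOME] > 50000.00, 1),
    (lambda r: r[AGE] >= 21, 1),          # -target 1
    (lambda r: r[AGE] >= 21, 2),
]


def run_filter(records, clauses):
    """Write each record once to every output whose clause it satisfies."""
    outputs = {port: [] for _, port in clauses}
    for record in records:
        written = set()
        for predicate, port in clauses:
            if port not in written and predicate(record):
                outputs[port].append(record)
                written.add(port)
    return outputs


result = run_filter(RECORDS, CLAUSES)
for port, name in [(0, "AdultAndRich"), (1, "AdultOrRich"), (2, "Adult")]:
    print(name, [r[1] for r in result[port]])
```

Unlike a parallel run, this sequential model always preserves input order within each output.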

Funnel operators
The funnel operators copy multiple input data sets to a single output data set. This operation is useful for combining separate data sets into a single large data set.

WebSphere DataStage provides two funnel operators:
• The funnel operator combines the records of the input data in no guaranteed order.
• The sortfunnel operator combines the input records in the order defined by the value(s) of one or more key fields, and the order of the output records is determined by these sorting keys.

By default, WebSphere DataStage inserts partition and sort components to meet the partitioning and sorting needs of the sortfunnel operator and other operators.

Data flow diagram


[Figure: input data sets (inRec:*; inRec:*; inRec:*) -> funnel or sortfunnel -> output data set (outRec:*)]

sortfunnel: properties
Table 22. sortfunnel properties

Property                                        Value
Number of input data sets                       N (set by user)
Number of output data sets                      1
Input interface schema                          inRec:*
Output interface schema                         outRec:*
Transfer behavior                               inRec -> outRec without record modification
Execution mode                                  parallel (default) or sequential
Input partitioning style                        sortfunnel operator: keys in same partition
Output partitioning style                       sortfunnel operator: distributed keys
Partitioning method                             funnel operator: round robin (parallel mode)
                                                sortfunnel operator: hash
Collection method                               any (sequential mode)
Preserve-partitioning flag in output data set   propagated
Composite operator                              no

Funnel operator

Non-deterministic input sequencing


The funnel operator processes its inputs using non-deterministic selection based on record availability. The funnel operator examines its input data sets in round-robin order. If the current record in a data set is ready for processing, the operator processes it. However, if the current record in a data set is not ready for processing, the operator does not halt execution. Instead, it moves on to the next data set and examines its current record for availability. This process continues until all the records have been transferred to output. The funnel operator is not combinable.
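The polling behavior described above can be modeled with a short Python sketch. It is a simplified, single-threaded illustration only: the NOT_READY marker stands in for a poll that finds no record available on an input, and both it and the function name funnel are assumptions made for this example rather than anything the operator exposes.

```python
from collections import deque

NOT_READY = object()  # hypothetical marker: nothing available on this poll


def funnel(inputs):
    """Combine several inputs by visiting them in round-robin order.

    Each input is a list whose items are either records or NOT_READY.
    An input that is not ready is skipped for that round instead of
    blocking the others.
    """
    queues = [deque(items) for items in inputs]
    output = []
    while any(queues):                 # stop when every input is exhausted
        for queue in queues:
            if not queue:
                continue
            item = queue.popleft()
            if item is NOT_READY:
                continue               # move on to the next input
            output.append(item)
    return output


print(funnel([["a1", "a2"], [NOT_READY, "b1"], ["c1"]]))
```

Changing where NOT_READY appears changes the output order, which is why the output order of the funnel operator cannot be relied on.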

Syntax
The funnel operator has no options. Its syntax is simply:
funnel

Note: We do not guarantee the output order of records transferred by means of the funnel operator. Use the sortfunnel operator to guarantee transfer order.

Sort funnel operators


Input requirements
The sortfunnel operator guarantees the order of the output records, because it combines the input records in the order defined by the value(s) of one or more key fields. The default partitioner and sort operator with the same keys are automatically inserted before the sortfunnel operator.


The sortfunnel operator requires that the record schema of all input data sets be identical. A parallel sortfunnel operator uses the default partitioning method local keys. See The Partitioning Library for more information on partitioning styles.

Primary and secondary keys


The sortfunnel operator allows you to set one primary key and multiple secondary keys. The sortfunnel operator first examines the primary key in each input record. For multiple records with the same primary key value, the sortfunnel operator then examines secondary keys to determine the order of records it outputs. For example, the following figure shows the current record in each of three input data sets:

[Figure: current record in each input data set]
data set 0: Jane Smith 42
data set 1: Paul Smith 34
data set 2: Mary Davis 42

If the data sets shown above are sortfunneled on the primary key, LastName, and then on the secondary key, Age, here is the result:

Mary Davis 42
Paul Smith 34
Jane Smith 42
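The primary and secondary key ordering can be reproduced with a sorted merge in Python. This is a sketch of the idea only, using the standard heapq.merge function on inputs that are each already sorted on the keys; the function name sortfunnel and the tuple layout (first name, last name, age) are assumptions for illustration.

```python
import heapq
from operator import itemgetter


def sortfunnel(inputs, key):
    """Merge inputs that are each already sorted on key into one sorted list."""
    return list(heapq.merge(*inputs, key=key))


# Current records of the three input data sets: (FirstName, LastName, Age)
data_set_0 = [("Jane", "Smith", 42)]
data_set_1 = [("Paul", "Smith", 34)]
data_set_2 = [("Mary", "Davis", 42)]

# Primary key LastName (position 1), secondary key Age (position 2)
merged = sortfunnel([data_set_0, data_set_1, data_set_2], key=itemgetter(1, 2))
for record in merged:
    print(*record)
```

Davis sorts before Smith on the primary key; the two Smith records are then ordered by the secondary key, Age.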

Funnel: syntax and options


The -key option is required. Multiple key options are allowed.
sortfunnel -key field [-cs | -ci] [-asc | -desc] [-nulls first | last] [-ebcdic] [-param params] [-key field [-cs | -ci] [-asc | -desc] [-nulls first | last] [-ebcdic] [-param params] ...] [-collation_sequence locale | collation_file_pathname | OFF]


Table 23. Funnel: syntax and options

-collation_sequence
  Use: -collation_sequence locale | collation_file_pathname | OFF
  This option determines how your string data is sorted. You can:
  • Specify a predefined IBM ICU locale
  • Write your own collation sequence using ICU syntax, and supply its collation_file_pathname
  • Specify OFF so that string comparisons are made using Unicode code-point value order, independent of any locale or custom sequence
  By default, WebSphere DataStage sorts strings using byte-wise comparisons.
  For more information, see this IBM ICU site: http://oss.software.ibm.com/icu/userguide/Collate_Intro.html

-key
  Use: -key field [-cs | -ci] [-asc | -desc] [-nulls first | last] [-ebcdic] [-param params]
  Specifies a key field of the sorting operation. The first -key defines the primary key field of the sort; lower-priority key fields are supplied on subsequent -key specifications.
  You must define a single primary key to the sortfunnel operator. You can define as many secondary keys as are required by your job. For each key, select the option and supply the field name. Each record field can be used only once as a key. Therefore, the total number of primary and secondary keys must be less than or equal to the total number of fields in the record.
  -cs | -ci are optional arguments for specifying case-sensitive or case-insensitive sorting. By default, the operator uses a case-sensitive algorithm for sorting, that is, uppercase strings appear before lowercase strings in the sorted data set. Specify -ci to override this default and perform case-insensitive sorting of string fields.
  -asc | -desc are optional arguments for specifying ascending or descending sorting. By default, the operator uses ascending sorting order, that is, smaller values appear before larger values in the sorted data set. Specify -desc to sort in descending sorting order instead, so that larger values appear before smaller values in the sorted data set.
  -nulls first | last: By default, fields containing null values appear first in the sorted data set. To override this default so that fields containing null values appear last in the sorted data set, specify -nulls last.
  -ebcdic: By default, data is represented in the ASCII character set. To represent data in the EBCDIC character set, specify this option.
  The -param suboption allows you to specify extra parameters for a field. Specify parameters using property=value pairs separated by commas.

In this osh example, the sortfunnel operator combines two input data sets into one sorted output data set:
$ osh "sortfunnel -key Lastname -key Age < out0.v < out1.v > combined.ds"

Generator operator
Often during the development of a WebSphere DataStage job, you will want to test the job using valid data. However, you might not have any data available to run the test, your data might be too large to execute the test in a timely manner, or you might not have data with the characteristics required to test the job.


The WebSphere DataStage generator operator lets you create a data set with the record layout that you pass to the operator. In addition, you can control the number of records in the data set, as well as the value of all record fields. You can then use this generated data set while developing, debugging, and testing your WebSphere DataStage job. To generate a data set, you pass to the operator a schema defining the field layout of the data set and any information used to control generated field values. This topic describes how to use the generator operator, including information on the schema options you use to control the generated field values.

Data flow diagram


[Figure: input data set (optional, inRec:*) -> generator -> output data set (outRec:*)]

generator: properties
Table 24. generator properties

Property                                        Value
Number of input data sets                       0 or 1
Number of output data sets                      1
Input interface schema                          inRec:*
Output interface schema                         supplied_schema; outRec:*
Transfer behavior                               inRec -> outRec without record modification
Execution mode                                  sequential (default) or parallel
Partitioning method                             any (parallel mode)
Collection method                               any (sequential mode)
Preserve-partitioning flag in output data set   propagated

Generator: syntax and options


generator -schema schema | -schemafile filename [-records num_recs] [-resetForEachEOW]

You must use either the -schema or the -schemafile argument to specify a schema to the operator. Terms in italic typeface are option strings you supply. When your option string contains a space or a tab character, you must enclose it in single quotes.


Table 25. generator Operator Options

-schema
  Use: -schema schema
  Specifies the schema for the generated data set. You must specify either -schema or -schemafile to the operator. If you supply an input data set to the operator, new fields with the specified schema are prepended to the beginning of each record.

-schemafile
  Use: -schemafile filename
  Specifies the name of a file containing the schema for the generated data set. You must specify either -schema or -schemafile to the operator. If you supply an input data set to the operator, new fields with the supplied schema are prepended to the beginning of each record.

-records
  Use: -records num_recs
  Specifies the number of records to generate. By default the operator generates an output data set with 10 records (in sequential mode) or 10 records per partition (in parallel mode). If you supply an input data set to the operator, any specification for -records is ignored. In this case, the operator generates one record for each record in the input data set.

-resetForEachEOW
  Use: -resetForEachEOW
  Specifies that the cycle should be repeated for each EOW.

Using the generator operator


During the development of a WebSphere DataStage job, you might find it convenient to execute your job against a data set with a well-defined content. This might be necessary because you want to:
• Run the program against a small number of records to test its functionality
• Control the field values of the data set to examine job output
• Test the program against a variety of data sets
• Run the program but have no available data

You pass to the generator operator a schema that defines the field layout of the data set. By default, the generator operator initializes record fields using a well-defined generation algorithm. For example, an 8-bit unsigned integer field in the first record of a generated data set is set to 0. The field value in each subsequently generated record is incremented by 1 until the generated field value is 255. The field value then wraps back to 0. However, you can also include information in the schema passed to the operator to control the field values of the generated data set. See Numeric Fields for more information on these options.


By default, the operator executes sequentially to generate a data set with a single partition containing 10 records. However, you can configure the operator to generate any number of records. If you configure the operator to execute in parallel, you control the number of records generated in each partition of the output data set. You can also pass an input data set to the operator. In this case, the operator prepends the generated record fields to the beginning of each record of the input data set to create the output.

Supported data types


The generator operator supports the creation of data sets containing most WebSphere DataStage data types, including fixed-length vectors and nullable fields. However, the generator operator does not support the following data types:
• Variable-length string and ustring types (unless you include a maximum-length specification)
• Variable-length raws (unless you include a maximum-length specification)
• Subrecords
• Tagged aggregates
• Variable-length vectors

Example 1: using the generator operator


In this example, you use the generator operator to create a data set with 1000 records where each record contains five fields. You also allow the operator to generate default field values. Here is the schema for the generated data set for this example:
record ( a:int32; b:int16; c:sfloat; d:string[10]; e:dfloat; )

This figure shows the data flow diagram for this example:

[Figure: generator -> newDS.ds]
To use the generator operator, first configure the schema:


$ rec_schema="record ( a:int32; b:int16; c:sfloat; d:string[10]; e:dfloat; )"

Then issue the generator command:


$ osh "generator -schema $rec_schema -records 1000 > newDS.ds"

This example defines an environment variable ($rec_schema) to hold the schema passed to the operator. Alternatively you can specify the name of a file containing the schema, as shown below:
$ osh "generator -schemafile s_file.txt -records 1000 > newDS.ds"

where the text file s_file.txt contains the schema.

Example 2: executing the operator in parallel


In the previous example, the operator executed sequentially to create an output data set with 1000 records in a single partition. You can also execute the operator in parallel. When executed in parallel, each partition of the generated data set contains the same number of records as determined by the setting for the -records option. For example, the following osh command executes the operator in parallel to create an output data set with 500 records per partition:
$ osh "generator -schemafile s_file -records 500 [par] > newDS.ds"

Note that the keyword [par] has been added to the example to configure the generator operator to execute in parallel.
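The record counts that follow from the -records rules can be summarized in a small Python helper. This is an illustrative sketch only; the function name generated_record_count and the partition count of 4 used in the example are assumptions, since the actual number of partitions depends on your configuration.

```python
def generated_record_count(records=10, partitions=1, input_records=None):
    """Total number of records the generator is expected to produce.

    records        value of -records (10 when the option is omitted)
    partitions     1 for sequential execution, otherwise the number of
                   partitions the operator runs on
    input_records  size of the input data set, or None when there is none
    """
    if input_records is not None:
        # With an input data set, -records is ignored:
        # one output record per input record.
        return input_records
    return records * partitions


print(generated_record_count(records=1000))               # Example 1, sequential
print(generated_record_count(records=500, partitions=4))  # Example 2 on 4 partitions
```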

Example 3: using generator with an input data set


You can pass an input data set to the generator operator. In this case, the generated fields are prepended to the beginning of each input record. The operator generates an output data set with the same number of records as the input data set; you cannot specify a record count. The following command creates an output data set from an input data set and a schema file:
$ osh "generator -schemafile s_file [par] < oldDS.ds > newDS.ds"

The figure below shows the output record of the generator operator:

[Figure: the output record consists of the generated fields, followed by the fields from the input record that are not also included in the generated fields]


For example, you can enumerate the records of a data set by prepending an int32 field that cycles from 0 upward. The generated fields are prepended to the beginning of each record. This means that when a field name in the generator schema duplicates a field name in the input data set, the field from the input data set is dropped. Note that WebSphere DataStage issues a warning message to inform you of the naming conflict. You can use the modify operator to rename the fields in the input data set to avoid the name collision. See Transform Operator for more information.
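How an output record is assembled when an input data set is supplied can be sketched with ordered Python dictionaries. This models only the field layout and the name-conflict rule described above; the helper name prepend_generated and the sample field names are hypothetical.

```python
def prepend_generated(generated, input_record):
    """Build an output record: generated fields first, then input fields.

    An input field whose name also appears among the generated fields
    is dropped, so the generated value wins.
    """
    output = dict(generated)               # generated fields come first
    for name, value in input_record.items():
        if name not in output:
            output[name] = value
    return output


generated = {"id": 0, "a": 1}              # from the generator schema
input_record = {"a": 9, "b": 2}            # from the input data set
print(prepend_generated(generated, input_record))
```

Renaming the conflicting input field before the generator, as suggested above, keeps both values.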

Defining the schema for the operator


The schema passed to the generator operator defines the record layout of the output data set. For example, the previous section showed examples using the following schema:
record ( a:int32; b:int16; c:sfloat; d:string[10]; e:dfloat; )

In the absence of any other specifications in the schema, the operator assigns default values to the fields of the output data set. However, you can also include information in the schema to control the values of the generated fields. This section describes the default values generated for all WebSphere DataStage data types and the use of options in the schema to control field values.

Schema syntax for generator options


You specify generator options within the schema in the same way you specify import/export properties. The following example shows the basic syntax of the generator properties:
record ( a:int32 {generator_options}; b:int16 {generator_options}; c:sfloat {generator_options}; d:string[10] {generator_options}; e:dfloat {generator_options}; )

Note that you include the generator options as part of the schema definition for a field. The options must be included within braces and before the trailing semicolon. Use commas to separate options for fields that accept multiple options. This table lists all options for the different WebSphere DataStage data types. Detailed information on these options follows the table.
Data type                                       Generator options for the schema
numeric (also decimal, date, time, timestamp)   cycle = {init = init_val, incr = incr_val, limit = limit_val}
                                                random = {limit = limit_val, seed = seed_val, signed}
date                                            epoch = date
                                                invalids = percentage
                                                function = rundate
decimal                                         zeros = percentage
                                                invalids = percentage
raw                                             no options available
string                                          cycle = {value = string_1, value = string_2, ... }
                                                alphabet = alpha_numeric_string
ustring                                         cycle = {value = ustring_1, value = ustring_2, ... }
                                                alphabet = alpha_numeric_ustring
time                                            scale = factor
                                                invalids = percentage
timestamp                                       epoch = date
                                                scale = factor
                                                invalids = percentage
nullable fields                                 nulls = percentage
                                                nullseed = number

Numeric fields
By default, the value of an integer or floating point field in the first record created by the operator is 0 (integer) or 0.0 (float). The field in each successive record generated by the operator is incremented by 1 (integer) or 1.0 (float).

The generator operator supports the cycle and random options for integer and floating point fields (as well as for all other fields except raw and string). The cycle option generates a repeating pattern of values for a field. The random option generates random values for a field. These options are mutually exclusive; that is, you can use only one of them with a field.

• cycle generates a repeating pattern of values for a field. Shown below is the syntax for this option:
cycle = {init = init_val, incr = incr_val, limit = limit_val}

where:
  init_val is the initial field value (value of the first output record). The default value is 0.
  incr_val is the increment value added to produce the field value in the next output record. The default value is 1 (integer) or 1.0 (float).
  limit_val is the maximum field value. When the generated field value is greater than limit_val, it wraps back to init_val. The default value of limit_val is the maximum allowable value for the field's data type.

You can specify the keyword part or partcount for any of these three option values. Specifying part uses the partition number of the operator on each processing node for the option value. The partition number is 0 on the first processing node, 1 on the next, and so on. Specifying partcount uses the number of partitions executing the operator for the option value. For example, if the operator executes on four processing nodes, partcount corresponds to a value of 4.

• random generates random values for a field. Shown below is the syntax for this option (all arguments to random are optional):
random = {limit = limit_val, seed = seed_val, signed}

where: limit_val is the maximum generated field value. The default value of limit_val is the maximum allowable value for the fields data type.


seed_val is the seed value for the random number generator used by the operator for the field. You do not have to specify seed_val. By default, the operator uses the same seed value for all fields containing the random option.
signed specifies that signed values are generated for the field (values between -limit_val and +limit_val). Otherwise, the operator creates values between 0 and +limit_val.
You can also specify the keyword part for seed_val and partcount for limit_val.
For example, the following schema generates a repeating cycle of values for the AccountType field and a random number for Balance:
record ( AccountType:int8 {cycle={init=0, incr=1, limit=24}}; Balance:dfloat {random={limit=100000, seed=34455}}; )
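The wrap-around rule for cycle can be illustrated with a small Python sketch. This is only a model of the behavior described above, not the generator operator itself; the function name cycle_values and the default limit of 127 (the int8 maximum) are assumptions made for the illustration.

```python
from itertools import islice


def cycle_values(init=0, incr=1, limit=127):
    """Yield init, init+incr, ...; wrap back to init once the value exceeds limit."""
    value = init
    while True:
        yield value
        value += incr
        if value > limit:  # "greater than limit_val" wraps back to init_val
            value = init


# Models AccountType:int8 {cycle={init=0, incr=1, limit=24}}
account_types = list(islice(cycle_values(init=0, incr=1, limit=24), 27))
print(account_types)
```

Under these assumptions, the field takes the values 0 through 24 and then starts again at 0.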

Date fields
By default, a date field in the first record created by the operator is set to January 1, 1960. The field in each successive record generated by the operator is incremented by one day. You can use the cycle and random options for date fields as shown above. When using these options, you specify the option values as a number of days. For example, to set the increment value for a date field to seven days, you use the following syntax:
record ( transDate:date {cycle={incr=7}}; transAmount:dfloat {random={limit=100000,seed=34455}}; )

In addition, you can use the following options: epoch, invalids, and function.
The epoch option sets the earliest generated date value for a field. You can use this option with any other date options. The syntax of epoch is:
epoch = date

where date sets the earliest generated date for the field. The date must be in yyyy-mm-dd format and leading zeros must be supplied for all portions of the date. If an epoch is not specified, the operator uses 1960-01-01. For example, the following schema sets the initial field value of transDate to January 1, 1998:
record ( transDate:date {epoch=1998-01-01}; transAmount:dfloat {random={limit=100000,seed=34455}}; )
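As a rough illustration of how the epoch and the increment interact, the following Python sketch lists the first few date values. It is a model under stated assumptions, not the operator's implementation: generated_dates is a hypothetical helper, and the limit wrap of cycle is not modeled.

```python
from datetime import date, timedelta


def generated_dates(count, epoch=date(1960, 1, 1), incr_days=1):
    """Return the first `count` date values: epoch, epoch + incr, epoch + 2*incr, ..."""
    return [epoch + timedelta(days=i * incr_days) for i in range(count)]


# Models transDate:date {epoch=1998-01-01}
daily = generated_dates(3, epoch=date(1998, 1, 1))
# Models transDate:date {cycle={incr=7}} with the default epoch
weekly = generated_dates(3, incr_days=7)

print([d.isoformat() for d in daily])
print([d.isoformat() for d in weekly])
```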

You can also specify the invalids option for a date field. This option specifies the percentage of generated fields containing invalid dates:
invalids = percentage

where percentage is a value between 0.0 and 100.0. WebSphere DataStage operators that process date fields can detect an invalid date during processing. The following example causes approximately 10% of transDate fields to be invalid:
record ( transDate:date {epoch=1998-01-01, invalids=10.0}; transAmount:dfloat {random={limit=100000, seed=34455}}; )

You can use the function option to set date fields to the current date:

function = rundate

There must be no other options specified to a field using function. The following schema causes transDate to have the current date in all generated records:
record ( transDate:date {function=rundate}; transAmount:dfloat {random={limit=100000, seed=34455}}; )

Decimal fields
By default, a decimal field in the first record created by the operator is set to 0. The field in each successive record generated by the operator is incremented by 1. The maximum value of the decimal is determined by the decimal's scale and precision. When the maximum value is reached, the decimal field wraps back to 0.
You can use the cycle and random options with decimal fields. See Numeric Fields for information on these options. In addition, you can use the zeros and invalids options with decimal fields. These options are described below.
The zeros option specifies the percentage of generated decimal fields where all bytes of the decimal are set to binary zero (0x00). Many operations performed on a decimal can detect this condition and either fail or return a flag signifying an invalid decimal value. The syntax for the zeros option is:
zeros = percentage

where percentage is a value between 0.0 and 100.0.
The invalids option specifies the percentage of generated decimal fields containing an invalid representation of 0xFF in all bytes of the field. Any operation performed on an invalid decimal detects this condition and either fails or returns a flag signifying an invalid decimal value. The syntax for invalids is:
invalids = percentage

where percentage is a value between 0.0 and 100.0. If you specify both zeros and invalids, the percentage for invalids is applied to the fields that are not first made zero. For example, if you specify zeros=50 and invalids=50, the operator generates approximately 50% of all values to be all zeros and only 25% (50% of the remainder) to be invalid.
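The arithmetic behind this interaction can be written out as a short Python sketch. The function name decimal_fractions is hypothetical; the calculation follows the rule stated above, in which invalids is applied only to the fields that were not already made zero.

```python
def decimal_fractions(zeros=0.0, invalids=0.0):
    """Return the expected (zero, invalid, valid) shares of generated decimal fields."""
    zero_share = zeros / 100.0
    # invalids applies only to the fields that were not first made zero
    invalid_share = (1.0 - zero_share) * (invalids / 100.0)
    valid_share = 1.0 - zero_share - invalid_share
    return zero_share, invalid_share, valid_share


print(decimal_fractions(zeros=50, invalids=50))
```

With zeros=50 and invalids=50, about half of the values are all zeros, a quarter are invalid, and the remaining quarter are ordinary decimal values.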

Raw fields
You can use the generator operator to create fixed-length raw fields or raw fields with a specified maximum length; you cannot use the operator to generate variable-length raw fields that have no maximum length. If the field has a maximum specified length, the length of the field is a random number between 1 and the maximum length. Maximum-length raw fields are variable-length fields with a maximum length defined by the max parameter in the form:
max_r:raw [max=10];

By default, all bytes of a raw field in the first record created by the operator are set to 0x00. The bytes of each successive record generated by the operator are incremented by 1 until a maximum value of 0xFF is reached. The operator then wraps byte values to 0x00 and repeats the cycle. You cannot specify any options to raw fields.
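For a fixed-length raw field, the default byte pattern can be modeled in a few lines of Python. This is a sketch of the behavior described above; raw_field is a hypothetical helper, and the random lengths of maximum-length fields are not modeled.

```python
def raw_field(record_index, length):
    """Return the bytes of a fixed-length raw field for the given record (0-based)."""
    # Every byte holds the same value; it increments per record and wraps after 0xFF.
    return bytes([record_index % 256]) * length


print(raw_field(0, 4), raw_field(1, 4), raw_field(256, 4))
```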


String fields
You can use the generator operator to create fixed-length string and ustring fields or string and ustring fields with a specified maximum length; you cannot use the operator to generate variable-length string fields that have no maximum length. If the field has a maximum specified length, the length of the string is a random number between 0 and the maximum length. Note that maximum-length string fields are variable-length fields with a maximum length defined by the max parameter in the form:
max_s: string [max=10];

In this example, the field max_s is variable length up to 10 bytes long.
By default, the generator operator initializes all bytes of a string field to the same alphanumeric character. When generating a string field, the operator uses the following characters, in the following order:
abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ

For example, the following field specification:


s: string[5];

produces successive string fields with the values:


aaaaa bbbbb ccccc ddddd ...

After the last character, capital Z, values wrap back to lowercase a and the cycle repeats.
Note: The alphabet property for ustring values accepts Unicode characters.
You can use the alphabet property to define your own list of alphanumeric characters used for generated string fields:
alphabet = alpha_numeric_string

This option sets all characters in the field to successive characters in the alpha_numeric_string. For example, this field specification:
s: string[3] {alphabet=abc};

produces strings with the following values:


aaa bbb ccc aaa ...
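The default character sequence and the alphabet property can be modeled with the following Python sketch. It only illustrates the pattern described above for fixed-length fields; string_field and DEFAULT_ALPHABET are names chosen for the illustration.

```python
DEFAULT_ALPHABET = (
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
)


def string_field(record_index, length, alphabet=DEFAULT_ALPHABET):
    """Return a fixed-length string whose characters are all the same alphabet entry."""
    # Successive records use successive characters and wrap after the last one.
    return alphabet[record_index % len(alphabet)] * length


print([string_field(i, 5) for i in range(4)])
print([string_field(i, 3, alphabet="abc") for i in range(4)])
```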

Note: The cycle option for ustring values accepts Unicode characters.
The cycle option specifies the list of string values assigned to generated string fields:
cycle = { value = string_1, value = string_2, ... }

The operator assigns string_1 to the string field in the first generated record, string_2 to the field in the second generated record, and so on. In addition:
• If you specify only a single value, all string fields are set to that value.


• If the generated string field is fixed length, the value string is truncated or padded with the default pad character 0x00 to the fixed length of the string.
• If the string field contains a maximum length setting, the length of the string field is set to the length of the value string. If the length of the value string is longer than the maximum string length, the value string is truncated to the maximum length.
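The truncation and padding rules for the string cycle option can be sketched in Python as follows. This is a model under assumptions: cycle_string is a hypothetical helper, and returning to the first value after the last one is assumed from the name of the option rather than stated explicitly above.

```python
def cycle_string(record_index, values, fixed_length=None, max_length=None):
    """Return the string assigned to the given record (0-based) from the value list."""
    value = values[record_index % len(values)]  # assumed to wrap after the last value
    if fixed_length is not None:
        # Fixed-length field: truncate, or pad with the default pad character 0x00.
        return value[:fixed_length].ljust(fixed_length, "\x00")
    if max_length is not None:
        # Maximum-length field: keep the value's own length, truncating if too long.
        return value[:max_length]
    return value


values = ["ab", "cdefg"]
print([cycle_string(i, values, fixed_length=4) for i in range(3)])
print([cycle_string(i, values, max_length=3) for i in range(2)])
```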

Time fields
By default, a time field in the first record created by the operator is set to 00:00:00 (midnight). The field in each successive record generated by the operator is incremented by one second. After reaching a time of 23:59:59, time fields wrap back to 00:00:00.
You can use the cycle and random options with time fields. See Numeric Fields for information on these options. When using these options, you specify the option values in numbers of seconds. For example, to set the value for a time field to a random value between midnight and noon, you use the following syntax:
record ( transTime:time {random={limit=43200, seed=83344}}; )

For a time field, midnight corresponds to an initial value of 0 and noon corresponds to 43,200 seconds (12 hours * 60 minutes * 60 seconds).
In addition, you can use the scale and invalids options with time fields. The scale option allows you to specify a multiplier to the increment value for time. The syntax of this option is:
scale = factor

The increment value is multiplied by factor before being added to the field. For example, the following schema generates two time fields:
record ( timeMinutes:time {scale=60}; timeSeconds:time; )

In this example, the first field increments by 60 seconds per record (one minute), and the second field increments by seconds. You use the invalids option to specify the percentage of invalid time fields generated:
invalids = percentage

where percentage is a value between 0.0 and 100.0. The following schema generates two time fields with different percentages of invalid values:
record ( timeMinutes:time {scale=60, invalids=10}; timeSeconds:time {invalids=15}; )
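The effect of scale on the default increment can be modeled with a short Python sketch. The helper time_field is hypothetical, invalid values are not modeled, and wrapping a scaled field with a modulo over 86,400 seconds is an assumption based on the wrap rule given for unscaled fields.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def time_field(record_index, scale=1):
    """Return the hh:mm:ss value of a generated time field for a 0-based record index."""
    # The one-second default increment is multiplied by scale; values wrap at midnight.
    seconds = (record_index * scale) % SECONDS_PER_DAY
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


print([time_field(i, scale=60) for i in range(3)])  # timeMinutes:time {scale=60}
print([time_field(i) for i in range(3)])            # timeSeconds:time
```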

Timestamp fields
A timestamp field consists of both a time and date portion. Timestamp fields support all valid options for both date and time fields. See Date Fields or Time Fields for more information.


By default, a timestamp field in the first record created by the operator is set to 00:00:00 (midnight) on January 1, 1960. The time portion of the timestamp is incremented by one second for each successive record. After reaching a time of 23:59:59, the time portion wraps back to 00:00:00 and the date portion increments by one day.
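A minimal Python sketch of the default timestamp sequence, assuming nothing beyond the rule above (timestamp_field is a hypothetical helper):

```python
from datetime import datetime, timedelta


def timestamp_field(record_index):
    """Return the default timestamp for a 0-based record index."""
    # One second per record; the date portion rolls over after 23:59:59.
    return datetime(1960, 1, 1) + timedelta(seconds=record_index)


print(timestamp_field(0), timestamp_field(86399), timestamp_field(86400))
```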

Null fields
By default, schema fields are not nullable. Specifying a field as nullable allows you to use the nulls and nullseed options within the schema passed to the generator operator.
Note: If you specify these options for a non-nullable field, the operator issues a warning and the field is set to its default value.
The nulls option specifies the percentage of generated fields that are set to null:
nulls = percentage

where percentage is a value between 0.0 and 100.0. The following example specifies that approximately 15% of all generated records contain a null for field a:
record ( a:nullable int32 {random={limit=100000, seed=34455}, nulls=15.0}; b:int16; )

The nullseed option sets the seed for the random number generator used to decide whether a given field will be null:
nullseed = seed

where seed specifies the seed value and must be an integer larger than 0.
In some cases, you might have multiple fields in a schema that support nulls. You can make all of these fields null in the same records by giving them the same nulls and nullseed values. For example, the following schema defines two fields as nullable:
record ( a:nullable int32 {nulls=10.0, nullseed=5663}; b:int16; c:nullable sfloat {nulls=10.0, nullseed=5663}; d:string[10]; e:dfloat; )

Since both fields a and c have the same settings for nulls and nullseed, whenever one field in a record is null the other is null as well.
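Why identical nulls and nullseed settings line up can be illustrated with Python's seeded random number generator. This is only an analogy: the operator's actual generator is not documented here, and null_pattern is a hypothetical helper.

```python
import random


def null_pattern(count, nulls, nullseed):
    """Return a list of booleans, True where the field would be null."""
    rng = random.Random(nullseed)  # same seed, same sequence of draws
    return [rng.random() * 100.0 < nulls for _ in range(count)]


# Fields a and c share nulls=10.0 and nullseed=5663.
pattern_a = null_pattern(1000, nulls=10.0, nullseed=5663)
pattern_c = null_pattern(1000, nulls=10.0, nullseed=5663)
print(pattern_a == pattern_c)
```

Because both fields draw the same pseudo-random sequence and compare it with the same percentage, the null decisions coincide record by record.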

Head operator
The head operator selects the first n records from each partition of an input data set and copies the selected records to an output data set. By default, n is 10 records. However, you can determine the following by means of options:
• The number of records to copy
• The partition from which the records are copied
• The location of the records to copy
• The number of records to skip before the copying operation begins.


This control is helpful in testing and debugging jobs with large data sets. For example, the -part option lets you see data from a single partition to ascertain if the data is being partitioned as you want. The -skip option lets you access a portion of a data set.
The tail operator performs a similar operation, copying the last n records from each partition. See Tail Operator.

Data flow diagram


[Figure: input data set -> head (inRec:*; -> outRec:*;) -> output data set]

head: properties
Table 26. head properties

Property: Value
Number of input data sets: 1
Number of output data sets: 1
Input interface schema: inRec:*
Output interface schema: outRec:*
Transfer behavior: inRec -> outRec without record modification

Head: syntax and options


head [-all | -nrecs count] [-part partition_number] [-period P] [-skip recs]

Table 27. Head options

Option: -all
Use: -all
Copy all input records to the output data set. You can skip records before head performs its copy operation by means of the -skip option. You cannot select the -all option and the -nrecs option at the same time.

Option: -nrecs
Use: -nrecs count
Specify the number of records (count) to copy from each partition of the input data set to the output data set. The default value of count is 10. You cannot specify this option and the -all option at the same time.

Option: -part
Use: -part partition_number
Copy records only from the indicated partition, partition_number. By default, the operator copies records from all partitions. You can specify -part multiple times to specify multiple partition numbers. Each time you do, specify the option followed by the number of the partition.

Option: -period
Use: -period P
Copy every Pth record in a partition, where P is the period. You can start the copy operation after records have been skipped (as defined by -skip). P must equal or be greater than 1. The default value of P is 1.

Option: -skip
Use: -skip recs
Ignore the first recs records of each partition of the input data set, where recs is the number of records to skip. The default skip count is 0.

Head example 1: head operator default behavior


In this example, no options have been specified to the head operator. The input data set consists of 100 sorted positive integers hashed into four partitions. The output data set consists of the first ten integers of each partition. The next table lists the input and output data sets by partition. The osh command is:
$ osh "head < in.ds > out.ds"

Partition 0
Input: 0 9 18 19 23 25 36 37 40 47 51
Output: 0 9 18 19 23 25 36 37 40 47

Partition 1
Input: 3 5 11 12 13 14 15 16 17 35 42 46 49 50 53 57 59
Output: 3 5 11 12 13 14 15 16 17 35

Partition 2
Input: 6 7 8 22 29 30 33 41 43 44 45 48 55 56 58
Output: 6 7 8 22 29 30 33 41 43 44

Partition 3
Input: 1 2 4 10 20 21 24 26 27 28 31 32 34 38 39 52 54
Output: 1 2 4 10 20 21 24 26 27 28
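The per-partition selection in this example can be reproduced with a small Python model of the head options. It is a sketch, not the operator: the function head and its parameter names are hypothetical, partitions are plain lists, and -period is assumed to start with the first record remaining after the skip.

```python
def head(partitions, nrecs=10, skip=0, period=1, parts=None, copy_all=False):
    """Return {partition_number: selected_records} for a list of partitions."""
    output = {}
    for number, records in enumerate(partitions):
        if parts is not None and number not in parts:
            continue  # -part: copy only from the listed partitions
        selected = records[skip::period]  # -skip, then every Pth record (assumed start)
        output[number] = selected if copy_all else selected[:nrecs]  # -all or -nrecs
    return output


partitions = [
    [0, 9, 18, 19, 23, 25, 36, 37, 40, 47, 51],
    [3, 5, 11, 12, 13, 14, 15, 16, 17, 35, 42, 46, 49, 50, 53, 57, 59],
    [6, 7, 8, 22, 29, 30, 33, 41, 43, 44, 45, 48, 55, 56, 58],
    [1, 2, 4, 10, 20, 21, 24, 26, 27, 28, 31, 32, 34, 38, 39, 52, 54],
]

for number, records in head(partitions).items():
    print(number, records)
```

Calling head(partitions, nrecs=1, skip=3, parts={2}) in this model picks out a single record from partition 2, in the same spirit as combining -nrecs, -skip, and -part on the command line.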


Example 2: extracting records from a large data set


In this example you use the head operator to extract the first 1000 records of each partition of a large data set, in.ds. To perform this operation, use the osh command:
$ osh "head -nrecs 1000 < in.ds > out0.ds"

For example, if in.ds is a data set of one megabyte, with 2500K records, out0.ds is a data set of 15.6 kilobytes with 4K records.

Example 3: locating a single record


In this example you use head to extract a single record from a particular partition to diagnose the record. The osh command is:
$ osh "head -nrecs 1 -skip 1234 -part 2 < in.ds > out0.ds"

Lookup operator
With the lookup operator, you can create and use lookup tables to modify your input data set. For example, you could map a field that should contain a two-letter U.S. state postal code to the name of the state, adding a FullStateName field to the output schema.
The operator performs in two modes: lookup mode and create-only mode:
• In lookup mode, the operator takes as input a single source data set, one or more lookup tables represented as WebSphere DataStage data sets, and one or more file sets. A file set is a lookup table that contains key-field information. There must be at least one lookup table or file set.
For each input record of the source data set, the operator performs a table lookup on each of the input lookup tables. The table lookup is based on the values of a set of lookup key fields, one set for each table. A source record and lookup record correspond when all of the specified lookup key fields have matching values.
Each record of the output data set contains all of the fields from a source record. Concatenated to the end of the output records are the fields from all the corresponding lookup records where corresponding source and lookup records have the same value for the lookup key fields.
The reject data set is an optional second output of the operator. This data set contains source records that do not have a corresponding entry in every input lookup table.
• In create-only mode, you use the -createOnly option to create lookup tables without doing the lookup processing step. This allows you to make and save lookup tables that you expect to need at a later time, making for faster start-up of subsequent lookup operations.
The lookup operator is similar in function to the merge operator and the join operators. To understand the similarities and differences, see Comparison with Other Operators.


Data flow diagrams


Create-only mode
[Figure: table0.ds, table1.ds, ... tablen.ds -> lookup -> fileset0.ds, fileset1.ds, ... filesetn.ds]

Lookup mode
[Figure: source.ds (key0; ...keyN; inRec:*;), the look ups fileset0.ds ... filesetn.ds (key0; ...keyN; filesetRec0:*; ... filesetRecN:*;) and table0.ds ... tableN.ds (key0; ...keyN; tableRec0:*; ... tableRecN:*;) -> lookup -> output.ds (outRec:*; tableRec1:*; ... tableRecN:*;), reject.ds (rejectRec:*;), and fileset0.ds ... filesetn.ds (when the save suboption is used)]

lookup: properties
Table 28. lookup properties

Property: Number of input data sets
Normal mode: T+1
Create-only mode: T

Property: Number of output data sets
Normal mode: 1 or 2 (output and optional reject)
Create-only mode: 0

Property: Input interface schema, input data set
Normal mode: key0:data_type; ... keyN:data_type; inRec:*;
Create-only mode: n/a

Property: Input interface schema, lookup data sets
Normal mode: key0:data_type; ... keyM:data_type; tableRec:*;
Create-only mode: key0:data_type; ... keyM:data_type; tableRec:*;

Property: Output interface schema, output data set
Normal mode: outRec:*; with lookup fields missing from the input data set concatenated
Create-only mode: n/a

Property: Output interface schema, reject data sets
Normal mode: rejectRec:*;
Create-only mode: n/a

Property: Transfer behavior, source to output
Normal mode: inRec -> outRec without record modification
Create-only mode: n/a

Property: Transfer behavior, lookup to output
Normal mode: tableRecN -> tableRecN, minus lookup keys and other duplicate fields
Create-only mode: n/a

Property: Transfer behavior, source to reject
Normal mode: inRec -> rejectRec without record modification (optional)
Create-only mode: n/a

Property: Transfer behavior, table to file set
Normal mode: key-field information is added to the table
Create-only mode: key-field information is added to the table

Property: Partitioning method
Normal mode: any (parallel mode); the default for table inputs is entire
Create-only mode: any (default is entire)

Property: Collection method
Normal mode: any (sequential mode)
Create-only mode: any

Property: Preserve-partitioning flag in output data set
Normal mode: propagated
Create-only mode: n/a

Property: Composite operator
Normal mode: yes
Create-only mode: yes

Lookup: syntax and options


Terms in italic typeface are option strings you supply. When your option string contains a space or a tab character, you must enclose it in single quotes. The syntax for the lookup operator in an osh command has two forms, depending on whether you are only creating one or more lookup tables or doing the lookup (matching) itself and creating file sets or using existing file sets.
lookup -createOnly -table -key field [-cs | -ci] [-param parameters] [-key field [-cs | -ci] [-param parameters]...] [-allow_dups] -save lookup_fileset [-diskpool pool] [-table -key field [-cs | -ci] [-param parameters] [-key field [-cs | -ci] [-param parameters]...] [-allow_dups] -save fileset_descriptor [-diskpool pool] ...]

or
lookup [-fileset fileset_descriptor] [-collation_sequence locale | collation_file_pathname | OFF] [-table key_specifications [-allow_dups] -save fileset_descriptor [-diskpool pool]...] [-ifNotFound continue | drop | fail | reject]

where a fileset, or a table, or both, must be specified, and key_specifications is a list of one or more strings of this form:
-key field [-cs | -ci]

Table 29. Lookup options

Option: -collation_sequence
Use: -collation_sequence locale | collation_file_pathname | OFF
This option determines how your string data is sorted. You can:
• Specify a predefined IBM ICU locale
• Write your own collation sequence using ICU syntax, and supply its collation_file_pathname
• Specify OFF so that string comparisons are made using Unicode code-point value order, independent of any locale or custom sequence.
By default, WebSphere DataStage sorts strings using byte-wise comparisons.
See http://oss.software.ibm.com/icu/userguide/Collate_Intro.html

Option: -createOnly
Use: -createOnly
Specifies the creation of one or more lookup tables; no lookup processing is to be done.

Option: -fileset
Use: [-fileset fileset_descriptor ...]
Specify the name of a fileset containing one or more lookup tables to be matched. These are tables that have been created and saved by an earlier execution of the lookup operator using the -createOnly option.
In lookup mode, you must specify either the -fileset option, or a table specification, or both, in order to designate the lookup table(s) to be matched against. There can be zero or more occurrences of the -fileset option. It cannot be specified in create-only mode.
Warning: The fileset already contains key specifications. When you follow -fileset fileset_descriptor by key_specifications, the keys specified do not apply to the fileset; rather, they apply to the first lookup table. For example:
lookup -fileset file1 -key field
is the same as:
lookup -fileset file1 -table -key field

Option: -ifNotFound
Use: -ifNotFound continue | drop | fail | reject
Specifies the operator action when a record of an input data set does not have a corresponding record in every input lookup table. The default action of the operator is to fail and terminate the step.
continue tells the operator to continue execution when a record of an input data set does not have a corresponding record in every input lookup table. The input record is transferred to the output data set along with the corresponding records from the lookup tables that matched. The fields in the output record corresponding to the lookup table(s) with no corresponding record are set to their default value or null if the field supports nulls.
drop tells the operator to drop the input record (refrain from creating an output record).
fail sets the operator to abort. This is the default.
reject tells the operator to copy the input record to the reject data set. In this case, a reject output data set must be specified.

Option: -table
Use: -table -key field [-ci | -cs] [-param parameters] [-key field [-ci | -cs] [-param parameters] ...] [-allow_dups] -save fileset_descriptor [-diskpool pool] ...
Specifies the beginning of a list of key fields and other specifications for a lookup table. The first occurrence of -table marks the beginning of the key field list for lookup table 1; the next occurrence of -table marks the beginning of the key fields for lookup table 2, and so on. For example:
lookup -table -key field -table -key field
The -key option specifies the name of a lookup key field. The -key option must be repeated if there are multiple key fields. You must specify at least one key for each table. You cannot use a vector, subrecord, or tagged aggregate field as a lookup key.
The -ci suboption specifies that the string comparison of lookup key values is to be case insensitive; the -cs option specifies case-sensitive comparison, which is the default.
The -param suboption provides extra parameters for the lookup key. Specify property=value pairs, without curly braces.
In create-only mode, the -allow_dups option causes the operator to save multiple copies of duplicate records in the lookup table without issuing a warning. Two lookup records are duplicates when all lookup key fields have the same value in the two records. If you do not specify this option, WebSphere DataStage issues a warning message when it encounters duplicate records and discards all but the first of the matching records.
In normal lookup mode, only one lookup table (specified by either -table or -fileset) can have been created with -allow_dups set.
The -save option lets you specify the name of a fileset to write this lookup table to; if -save is omitted, tables are written as scratch files and deleted at the end of the lookup. In create-only mode, -save is, of course, required.
The -diskpool option lets you specify a disk pool in which to create lookup tables. By default, the operator looks first for a lookup disk pool, then uses the default pool (""). Use this option to specify a different disk pool to use.
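The four -ifNotFound actions can be compared side by side with a small Python model. This is a sketch under assumptions, not the operator's implementation: records are dictionaries, there is a single lookup table, the function lookup and its arguments are hypothetical, and null is represented by None.

```python
def lookup(source, table, keys, if_not_found="fail"):
    """Match each source record against one lookup table on `keys`.

    Returns (output_records, rejected_records).
    """
    index = {}
    for rec in table:
        # Without -allow_dups, only the first of several duplicate records is kept.
        index.setdefault(tuple(rec[k] for k in keys), rec)
    payload_fields = [name for name in (table[0] if table else {}) if name not in keys]

    output, rejects = [], []
    for rec in source:
        match = index.get(tuple(rec[k] for k in keys))
        if match is None:
            if if_not_found == "fail":
                raise LookupError(f"no lookup record for {rec}")
            if if_not_found == "drop":
                continue
            if if_not_found == "reject":
                rejects.append(rec)
                continue
            # "continue": emit the record with the lookup fields set to null
            match = {name: None for name in payload_fields}
        out = dict(rec)
        for name, value in match.items():
            out.setdefault(name, value)  # source fields win on a name collision
        output.append(out)
    return output, rejects


source = [{"accountType": 1, "name": "John"}, {"accountType": 9, "name": "Mary"}]
table = [{"accountType": 1, "interestRate": 0.05}]

for action in ("continue", "drop", "reject"):
    print(action, lookup(source, table, ["accountType"], if_not_found=action))
```

In this model, fail raises an exception on the first unmatched record, which stands in for the operator terminating the step.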

Lookup table characteristics


The lookup tables input to the operator are created from WebSphere DataStage data sets. The lookup tables do not have to be sorted and should be small enough that all tables fit into physical memory on the processing nodes in your system. Lookup tables larger than physical memory do not cause an error, but they adversely affect the execution speed of the operator.


The memory used to hold a lookup table is shared among the lookup processes running on each machine. Thus, on an SMP, all instances of a lookup operator share a single copy of the lookup table, rather than having a private copy of the table for each process. This reduces memory consumption to that of a single sequential lookup process. Partitioning the lookup data, which in a non-shared-memory environment saves memory by creating smaller tables, also disables this memory sharing. As a result, there is no benefit to partitioning lookup tables on an SMP or cluster.

Partitioning
Normally (and by default), lookup tables are partitioned using the entire partitioning method so that each processing node receives a complete copy of the lookup table.
You can partition lookup tables using another partitioning method, such as hash, as long as you ensure that all records with the same lookup keys are partitioned identically. Otherwise, source records might be directed to a partition that doesn't have the proper table entry.
For example, if you are doing a lookup on keys a, b, and c, having both the source data set and the lookup table hash partitioned on the same keys would permit the lookup tables to be broken up rather than copied in their entirety to each partition. This explicit partitioning disables memory sharing, but the lookup operation consumes less memory, since the entire table is not duplicated. Note, though, that on a single SMP, hash partitioning does not actually save memory. On MPPs, or where shared memory can be used only in a limited way, or not at all, it can be beneficial.
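The requirement that source and lookup records with the same keys land in the same partition can be checked with a short Python sketch. It uses a CRC32 hash purely for illustration; the hash function used by the hash partitioner is not described here, and the field names rate and amount are hypothetical.

```python
import zlib

KEYS = ("a", "b", "c")


def partition_of(record, keys, num_partitions):
    """Pick a partition from the key values only (deterministic CRC32 hash)."""
    key_text = "|".join(str(record[k]) for k in keys)
    return zlib.crc32(key_text.encode("utf-8")) % num_partitions


def hash_partition(records, keys, num_partitions):
    """Split records into num_partitions lists according to their key hash."""
    parts = [[] for _ in range(num_partitions)]
    for rec in records:
        parts[partition_of(rec, keys, num_partitions)].append(rec)
    return parts


def key_of(rec):
    return tuple(rec[k] for k in KEYS)


table = [{"a": i, "b": i % 3, "c": "x", "rate": i * 0.5} for i in range(20)]
source = [{"a": i, "b": i % 3, "c": "x", "amount": 10 * i} for i in range(20)]

source_parts = hash_partition(source, KEYS, 4)
table_parts = hash_partition(table, KEYS, 4)

# Every source record finds its table entry inside its own partition.
all_local = all(
    any(key_of(src) == key_of(tab) for tab in table_parts[p])
    for p in range(4)
    for src in source_parts[p]
)
print(all_local)
```

Because the partition is computed from the key values alone, each partition holds only a slice of the table, yet no source record is separated from its matching entry.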

Create-only mode
In its normal mode of operation, the lookup operator takes in a source data set and one or more data sets from which the lookup tables are built. The lookup tables are actually represented as file sets, which can be saved if you wish but which are normally deleted as soon as the lookup operation is finished. There is also a mode, selected by the -createOnly option, in which there is no source data set; only the data sets from which the lookup tables are to be built are used as input. The resulting file sets, containing lookup tables, are saved to persistent storage. This create-only mode of operation allows you to build lookup tables when it is convenient, and use them for doing lookups at a later time. In addition, initialization time for the lookup processing phase is considerably shorter when lookup tables already exist. For example, suppose you have data sets data1.ds and data2.ds and you want to create persistent lookup tables from them using the name and ID fields as lookup keys in one table and the name and accountType fields in the other. For this use of the lookup operator, you specify the -createOnly option and two -table options. In this case, two suboptions for the -table options are specified: -key and -save. In osh, use the following command:
$ osh " lookup -createOnly -table -key name -key ID -save fs1.fs -table -key name -key accountType -save fs2.fs < data1.ds < data2.ds"


Lookup example 1: single lookup table record


This figure shows the lookup of a source record and a single lookup table record:

[Figure: a source record (key field 1 name = John, key field 2 ID = 27, other_source_fields) and a lookup record (key field 1 name = John, key field 2 ID = 27, payload_fields) produce an output record containing name = John, ID = 27, other_source_fields, and the other payload fields not including fields already in the source record.]
This figure shows the source and lookup record and the resultant output record. A source record and lookup record are matched if they have the same values for the key field(s). In this example, both records have John as the name and 27 as the ID number. In this example, the lookup keys are the first fields in the record. You can use any field in the record as a lookup key. Note that fields in a lookup table that match fields in the source record are dropped. That is, the output record contains all of the fields from the source record plus any fields from the lookup record that were not in the source record. Whenever any field in the lookup record has the same name as a field in the source record, the data comes from the source record and the lookup record field is ignored. Here is the command for this example:
$ osh "lookup -table -key Name -key ID < inSrc.ds < inLU1.ds > outDS.ds"
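The rule for combining a matched source record and lookup record can be expressed as a short Python sketch. Records are modeled as dictionaries; merge_records and the balance and branch fields are hypothetical names used only for this illustration.

```python
def merge_records(source_rec, lookup_rec):
    """Build the output record: all source fields, then lookup fields not in the source."""
    output = dict(source_rec)
    for name, value in lookup_rec.items():
        if name not in output:  # a same-named lookup field is ignored
            output[name] = value
    return output


source_rec = {"name": "John", "ID": 27, "balance": 100.0}
lookup_rec = {"name": "John", "ID": 27, "balance": 999.0, "branch": "North"}

output_rec = merge_records(source_rec, lookup_rec)
print(output_rec)
```

Here the balance value comes from the source record, and only branch is added from the lookup record.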

Example 2: multiple lookup table record


When there are multiple lookup tables as input, the lookup tables can all use the same key fields, or they can use different sets. The diagram shows the lookup of a source record and two lookup records where both lookup tables have the same key fields.


[Figure: a source record (key field 1 name = John, key field 2 ID = 27, other_source_fields), lookup record 1 (name = John, ID = 27, payload_fields_1), and lookup record 2 (name = John, ID = 27, payload_fields_2) produce an output record containing name = John, ID = 27, other_source_fields, payload_fields_1 (not including fields already in the source record), and payload_fields_2 (not including fields already in the source record or payload1/payload2 collision).]
The osh command for this example is:
$ osh " lookup -table -key name -key ID -table -key name -key ID < inSrc.ds < inLU1.ds < inLU2.ds > outDS.ds"

Note that in this example you specify the same key fields for both lookup tables. Alternatively, you can specify a different set of lookup keys for each lookup table. For example, you could use name and ID for the first lookup table and the fields accountType and minBalance (not shown in the figure) for the second lookup table. Each of the resulting output records would contain those four fields, where the values matched appropriately, and the remaining fields from each of the three input records. Here is the osh command for this example:
$ osh " lookup -table -key name -key ID -table -key accountType -key minBalance < inSrc.ds < inLU1.ds < inLU2.ds > outDS.ds"

Example 3: interest rate lookup example


The following figure shows the schemas for a source data set customer.ds and a lookup data set interest.ds. This operator looks up the interest rate for each customer based on the customer's account type. In this example, WebSphere DataStage inserts the entire partitioner (this happens automatically; you do not need to explicitly include it in your program) so that each processing node receives a copy of the entire lookup table.


customer.ds schema: customer:int16; month:string[3]; name:string[21]; accounttype:int8; balance:sfloat;

interest.ds schema: accounttype:int8; interestRate:sfloat;

[Figure: within a step, customer.ds (source) and interest.ds (lookup key; partitioned with entire) feed the lookup operator, which writes outDS.ds.]

outDS.ds schema: customer:int16; month:string[3]; name:string[21]; accounttype:int8; balance:sfloat; interestRate:sfloat;

Since the interest rate is not represented in the source data set record schema, the interestRate field from the lookup record has been concatenated to the source record. Here is the osh code for this example:
$ osh " lookup -table -key accountType < customer.ds < interest.ds > outDS.ds"
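The effect of this lookup can be modeled in a few lines of Python. This is only an illustrative sketch, not DataStage code: the function name lookup_interest and the sample records are hypothetical, only the field names come from the schemas above, and skipping unmatched source records is an assumption made to keep the sketch short.

```python
# Illustrative model of a keyed lookup; not DataStage code.
def lookup_interest(customers, interest_table, key="accountType"):
    """Return each customer record extended with the matching interestRate."""
    # The whole lookup table is held in memory, keyed on the lookup key.
    rate_by_key = {row[key]: row["interestRate"] for row in interest_table}
    output = []
    for cust in customers:
        if cust[key] in rate_by_key:
            # Concatenate the lookup field to a copy of the source record.
            output.append({**cust, "interestRate": rate_by_key[cust[key]]})
        # Assumption: unmatched source records are skipped in this sketch.
    return output


# Hypothetical sample data using the field names from the schemas above.
customers = [
    {"customer": 101, "month": "JUN", "name": "Lee", "accountType": 1, "balance": 500.0},
    {"customer": 102, "month": "JUN", "name": "Kim", "accountType": 2, "balance": 75.5},
]
interest = [
    {"accountType": 1, "interestRate": 0.02},
    {"accountType": 2, "interestRate": 0.035},
]
result = lookup_interest(customers, interest)
```

Each record in result carries the source fields in their original order followed by interestRate, which mirrors the outDS.ds schema.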

Example 4: handling duplicate fields example


If, in the previous example, the record schema for customer.ds also contained a field named interestRate, both the source and the lookup data sets would have a non-lookup-key field with the same name. By default, the interestRate field from the source record is written to the output record and the field from the lookup data set is ignored. If you want the interestRate field from the lookup data set to be output, rather than the value from the source record, you can use a modify operator before the lookup operator to drop the interestRate field from the source record. The following diagram shows record schemas for customer.ds and interest.ds in which both schemas have a field named interestRate.

customer.ds schema: customer:int16; month:string[3]; name:string[21]; balance:sfloat; accountType:int8; interestRate:sfloat;



interest.ds schema: accountType:int8; interestRate:sfloat;


To make the lookup table's interestRate field the one that is retained in the output, use a modify operator to drop interestRate from the source record. The interestRate field from the lookup table record is propagated to the output data set, because it is now the only field of that name. The following figure shows how to use a modify operator to drop the interestRate field:

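[Figure: within a single step, the source data set passes through a modify operator that drops the interestRate field; the result and the lookup data set (entire partitioned) are input to the lookup operator, which writes the output data set.]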
The osh command for this example is:
$ osh " modify -spec drop interestRate; < customer.ds | lookup -table -key accountType < interest.ds > outDS.ds"

Note that this is unrelated to using the -allow_dups option on a table, which deals with the case where two records in a lookup table are identical in all the key fields.

Merge operator
The merge operator combines a sorted master data set with one or more sorted update data sets. The fields from the records in the master and update data sets are merged so that the output record contains all the fields from the master record plus any additional fields from the matching update records. A master record and an update record are merged only if both of them have the same values for the merge key fields that you specify. Merge key fields are one or more fields that exist in both the master and update records.

By default, WebSphere DataStage inserts partition and sort components to meet the partitioning and sorting needs of the merge operator and other operators.

As part of preprocessing your data for the merge operator, you must remove duplicate records from the master data set. If you have more than one update data set, you must remove duplicate records from the update data sets as well.

This section describes how to use the merge operator. Included in this topic are examples using the remdup operator to preprocess your data.

The merge operator is similar in function to the lookup operator and the join operators. To understand the similarities and differences, see Comparison with Other Operators.

Data flow diagram


The merge operator merges a master data set with one or more update data sets.

[Figure: data flow for merge. Inputs: master (mkey0; ...mkeyN; masterRec:*;) and update1 through updaten (mkey0; ...mkeyN; updateRec0:*; through mkey0; ...mkeyN; updateRecn:*;). Outputs: merged output (masterRec:*; updateRec1:*; ... updateRecn:*;) and reject data sets reject1 through rejectn (rejectRec1:*; through rejectRecn:*;).]

merge: properties
Table 30. merge properties

Property                                        Value
Number of input data sets                       1 master; 1-n update
Number of output data sets                      1 output; 1-n reject (optional)
Input interface schema
  master data set:                              mKey0:data_type; ... mKeyk:data_type; masterRec:*;
  update data sets:                             mKey0:data_type; ... mKeyk:data_type; updateRecr:*; rejectRecr:*;
Output interface schema
  output data set:                              masterRec:*; updateRec1:*; updateRec2:*; ... updateRecn:*;
  reject data sets:                             rejectRecn:*;
Transfer behavior
  master to output:                             masterRec -> masterRec without record modification
  update to output:                             updateRecn -> outputRec
  update to reject:                             updateRecn -> rejectRecn without record modification (optional)
Input partitioning style                        keys in same partition
Output partitioning style                       distributed keys
Execution mode                                  parallel (default) or sequential
Preserve-partitioning flag in output data set   propagated

Merge: syntax and options


merge -key field [-ci | -cs] [-asc | -desc] [-ebcdic] [-nulls first | last] [-param params]
  [-key field [-ci | -cs] [-asc | -desc] [-ebcdic] [-nulls first | last] [-param params] ...]
  [-collation_sequence locale | collation_file_pathname | OFF]
  [-dropBadMasters | -keepBadMasters]
  [-nowarnBadMasters | -warnBadMasters]
  [-nowarnBadUpdates | -warnBadUpdates]


Table 31. Merge options

Option: -key
Use: -key field [-ci | -cs] [-asc | -desc] [-ebcdic] [-nulls first | last] [-param params] [-key field [-ci | -cs] [-asc | -desc] [-ebcdic] [-nulls first | last] [-param params] ...]
  Specifies the name of a merge key field. The -key option might be repeated if there are multiple merge key fields.
  The -ci option specifies that the comparison of merge key values is case insensitive. The -cs option specifies a case-sensitive comparison, which is the default.
  -asc | -desc are optional arguments for specifying ascending or descending sorting. By default, the operator uses ascending sorting order, that is, smaller values appear before larger values in the sorted data set. Specify -desc to sort in descending sorting order instead, so that larger values appear before smaller values in the sorted data set.
  -nulls first | last. By default, fields containing null values appear first in the sorted data set. To override this default so that fields containing null values appear last in the sorted data set, specify -nulls last.
  -ebcdic. By default, data is represented in the ASCII character set. To represent data in the EBCDIC character set, specify this option.
  The -param suboption allows you to specify extra parameters for a field. Specify parameters using property=value pairs separated by commas.

Option: -collation_sequence
Use: -collation_sequence locale | collation_file_pathname | OFF
  This option determines how your string data is sorted. You can:
  - Specify a predefined IBM ICU locale
  - Write your own collation sequence using ICU syntax, and supply its collation_file_pathname
  - Specify OFF so that string comparisons are made using Unicode code-point value order, independent of any locale or custom sequence.
  By default, WebSphere DataStage sorts strings using byte-wise comparisons. For more information, reference this IBM ICU site: http://oss.software.ibm.com/icu/userguide/Collate_Intro.htm

Option: -dropBadMasters
Use: -dropBadMasters
  Rejected masters are not output to the merged data set.

Option: -keepBadMasters
Use: -keepBadMasters
  Rejected masters are output to the merged data set. This is the default.

Option: -nowarnBadMasters
Use: -nowarnBadMasters
  Do not warn when rejecting bad masters.

Option: -nowarnBadUpdates
Use: -nowarnBadUpdates
  Do not warn when rejecting bad updates.

Option: -warnBadMasters
Use: -warnBadMasters
  Warn when rejecting bad masters. This is the default.

Option: -warnBadUpdates
Use: -warnBadUpdates
  Warn when rejecting bad updates. This is the default.

Merging records
The merge operator combines a master and one or more update data sets into a single, merged data set based upon the values of a set of merge key fields. Each record of the output data set contains all of the fields from a master record. Concatenated to the end of the output records are any fields from the corresponding update records that are not already in the master record. Corresponding master and update records have the same value for the specified merge key fields.

The action of the merge operator depends on whether you specify multiple update data sets or a single update data set. When merging a master data set with multiple update data sets, each update data set can contain only one record for each master record. When merging with a single update data set, the update data set can contain multiple records for a single master record. The following sections describe merging a master data set with a single update data set and with multiple update data sets.


Merging with a single update data set


The following diagram shows the merge of a master record and a single update record.

[Figure: a master record (name John, ID 27, other_master_fields) and an update record (name John, ID 27, other_update_fields), each with key fields name and ID. The output record contains name, ID, other_master_fields, and other_update_fields not including fields already in the master record.]
The figure shows the master and update records and the resultant merged record. A master record and an update record are merged only if they have the same values for the key field(s). In this example, both records have John as the Name and 27 as the ID value. Note that in this example the merge keys are the first fields in the record. You can use any field in the record as a merge key, regardless of its location.

The schema of the master data set determines the data types of the merge key fields. The schemas of the update data sets might be dissimilar but they must contain all merge key fields (either directly or through adapters).

The merged record contains all of the fields from the master record plus any fields from the update record which were not in the master record. Thus, if a field in the update record has the same name as a field in the master record, the data comes from the master record and the update field is ignored.

The master data set of a merge must not contain duplicate records where duplicates are based on the merge keys. That is, no two master records can have the same values for all merge keys.

For a merge using a single update data set, you can have multiple update records, as defined by the merge keys, for the same master record. In this case, you get one output record for each master/update record pair. In the figure above, if you had two update records with John as the Name and 27 as the value of ID, you would get two output records.
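The pairing rule for a single update data set can be sketched in Python as follows. This is a hypothetical model rather than DataStage code: the function name merge_single_update and the dept, phone, and city fields are invented for illustration, and master records without a matching update are simply skipped, which corresponds to the -dropBadMasters behavior.

```python
from collections import defaultdict


def merge_single_update(masters, updates, keys):
    """Emit one output record per master/update pair; master values win on name clashes."""
    # Group update records by their merge key values (several may share a key).
    updates_by_key = defaultdict(list)
    for upd in updates:
        updates_by_key[tuple(upd[k] for k in keys)].append(upd)

    merged = []
    for master in masters:
        for upd in updates_by_key.get(tuple(master[k] for k in keys), []):
            record = dict(master)
            for name, value in upd.items():
                # Only fields not already in the master record are appended.
                record.setdefault(name, value)
            merged.append(record)
    return merged


# Hypothetical data: one master record and two update records for the same key.
masters = [{"name": "John", "ID": 27, "dept": "sales", "phone": "x100"}]
updates = [
    {"name": "John", "ID": 27, "phone": "x999", "city": "Austin"},
    {"name": "John", "ID": 27, "phone": "x888", "city": "Boston"},
]
merged = merge_single_update(masters, updates, keys=["name", "ID"])
```

With these inputs merged holds two records, one per master/update pair, and both keep the master's phone value.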

Merging with multiple update data sets


In order to merge a master and multiple update data sets, all data sets must be sorted and contain no duplicate records where duplicates are based on the merge keys. That is, there must be at most one update record in each update data set with the same combination of merge key field values for each master record. In this case, the merge operator outputs a single record for each unique combination of merge key fields.


By default, WebSphere DataStage inserts partition and sort components to meet the partitioning and sorting needs of the merge operator and other operators. The following figure shows the merge of a master record and two update records (one update record from each of two update data sets):

[Figure: a master record (name John, ID 27, other_master_fields), update record 1 (name John, ID 27, other_update_fields1), and update record 2 (name John, ID 27, other_update_fields2), each with key fields name and ID. The output record contains name, ID, other_master_fields, other_update_fields1 (not including fields already in the master record), and other_update_fields2 (not including fields already in the master record or update record 1).]
Any fields in the first update record not in the master record are concatenated to the output record. Then, any fields in the second update record not in the master record or the first update record are concatenated to the output record. For each additional update data set, any fields not already in the output record are concatenated to the end of the output record.


Understanding the merge operator


The following diagram shows the overall data flow for a typical use of the merge operator:

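[Figure: within a single step, the master data set passes through remdup and the update data set passes through an optional remdup; both results are input to merge, which writes the output.]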
This diagram shows the overall process as one step. Note that the remdup operator is required only if you have multiple update data sets. If you have only a single update data set, the data set might contain more than one update record for each master record. Another method is to save the output of the remdup operator to a data set and then pass that data set to the merge operator, as shown in the following figure:


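[Figure: one step runs remdup on the master data set to produce the preprocessed master; an optional second step runs remdup on the update data set to produce the preprocessed update; a separate step then merges the preprocessed master and preprocessed update data sets and writes the output.]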
This method has the disadvantage that you need the disk space to store the pre-processed master and update data sets and the merge must be in a separate step from the remove duplicates operator. However, the intermediate files can be checked for accuracy before merging them, or used by other processing steps that require records without duplicates.

The merging operation


When data sets are merged, one is the master and all others are update data sets. The master data set is always connected to input 0 of the operator. The merged output data set always contains all of the fields from the records in the master data set. In addition, it contains any additional fields from the update data sets. The following diagram shows the record schema of the output data set of the merge operator, based on the record schema of the master and update data sets:


master data set: a:int8; b:int8; c:int8; d:int16; e:int16;

update data set: a:int8; b:int8; c:int8; f:int32; g:int32;

[Figure: the master and update data sets are input to merge within a single step.]

output data set: a:int8; b:int8; c:int8; d:int16; e:int16; f:int32; g:int32;

This data-flow diagram shows the record schema of the master and update data sets. The record schema of the master data set has five fields and all five of these appear in the record schema of the output data set. The update data set also has five fields, but only two of these (f and g) are copied to the output data set because the remaining fields (a, b, and c) already exist in the master data set.

If the example above is extended to include a second update data set with a record schema containing the following fields:
a b d h i
then the fields in the merged output record are now:
a b c d e f g h i
because the last two fields (h and i) occur only in the second update data set and not in the master or first update data set. The unique fields from each additional update data set are concatenated to the end of the merged output.

If there is a third update data set with a schema that contains the fields:
a b d h
it adds nothing to the merged output since none of the fields is unique.

Thus if the master and five update data sets are represented as:
M U1 U2 U3 U4 U5
where M represents the master data set and Un represents update data set n, and if the records in all six data sets contain a field named b, the output record has a value for b taken from the master data set. If a field named e occurs in the U2, U3, and U5 update data sets, the value in the output comes from the U2 data set since it is the first one encountered.

Therefore, the record schema of the merged output record is the sequential concatenation of the master and update schema(s) with overlapping fields having values taken from the master data set or from the first update record containing the field. The values for the merge key fields are taken from the master record, but are identical values to those in the update record(s).
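The concatenation rule can be checked with a short Python sketch. It is a model of the rule as stated above, not DataStage code; the function names merged_schema and value_source are hypothetical, and schemas are reduced to lists of field names.

```python
def merged_schema(master_fields, *update_schemas):
    """Concatenate schemas in input order, keeping only the first occurrence of each field."""
    output = list(master_fields)
    for schema in update_schemas:
        for name in schema:
            if name not in output:
                output.append(name)
    return output


def value_source(field, labeled_schemas):
    """Return the label of the first data set, in input order, that contains the field."""
    for label, schema in labeled_schemas:
        if field in schema:
            return label
    return None


# Field names taken from the example above.
master = ["a", "b", "c", "d", "e"]
update1 = ["a", "b", "c", "f", "g"]
update2 = ["a", "b", "d", "h", "i"]
update3 = ["a", "b", "d", "h"]

schema = merged_schema(master, update1, update2, update3)
inputs = [("M", master), ("U1", update1), ("U2", update2), ("U3", update3)]
source_of_b = value_source("b", inputs)
source_of_h = value_source("h", inputs)
```

Here schema comes out as the nine fields a through i in order, the value of b is taken from M, and the value of h is taken from U2 because U2 is the first input that contains it.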

Example 1: updating national data with state data


The following figure shows the schemas for a master data set named National.ds and an update data set named California.ds. The merge operation is performed to combine the two; the output is saved into a new data set named Combined.ds.

National.ds schema: customer:int16; month:string[3]; name:string[21]; balance:sfloat; salesman:string[8]; accountType:int8;

California.ds schema: customer:int16; month:string[3]; name:string[21]; calBalance:sfloat; status:string[8];

[Figure: National.ds and California.ds are input to merge within a single step, which writes Combined.ds.]

Combined.ds schema: customer:int16; month:string[3]; name:string[21]; balance:sfloat; salesman:string[8]; accountType:int8; calBalance:sfloat; status:string[8];

The National.ds master data set contains the following record:


National.ds schema: customer:int16; month:string[3]; name:string[21]; balance:sfloat; salesman:string[8]; accountType:int8;

Record in National.ds data set:
86111 JUN Jones, Bob 345.98 Steve 12


The Customer and Month fields are used as the merge key fields. You also have a record in the update data set named California.ds that contains the following record:

California.ds schema: customer:int16; month:string[3]; name:string[21]; CalBalance:sfloat; status:string[8];

Record in California.ds data set:
86111 JUN Jones, Bob 637.04 Normal


After you merge these records, the result is:
86111 JUN Jones, Bob 345.98 Steve 12 637.04 Normal

Record in combined.ds data set


This example shows that the CalBalance and Status fields from the update record have been concatenated to the fields from the master record. The combined record has the same values for the key fields as do both the master and the update records since they must be the same for the records to be merged. The following figure shows the data flow for this example. The original data comes from the data sets NationalRaw.ds and CaliforniaRaw.ds. National.ds and California.ds are created by first sorting and then removing duplicates from NationalRaw.ds and CaliforniaRaw.ds.


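[Figure: within a single step, NationalRaw.ds (master) passes through remdup to produce National.ds and CaliforniaRaw.ds (update) passes through remdup to produce California.ds; both are input to merge, which writes the output.]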
For the remdup operators and for the merge operator you specify the same two key fields:
- Option: key Value: Month
- Option: key Value: Customer

The steps for this example have been written separately so that you can check the output after each step. Because each step is separate, it is easier to understand the entire process. Later all of the steps are combined together into one step. The separate steps, shown as osh commands, are:
# Produce National.ds
$ osh "remdup -key Month -key Customer < NationalRaw.ds > National.ds"
# Produce California.ds
$ osh "remdup -key Month -key Customer < CaliforniaRaw.ds > California.ds"
# Perform the merge
$ osh "merge -key Month -key Customer < National.ds < California.ds > Combined.ds"

The following single-step version takes NationalRaw.ds and CaliforniaRaw.ds and produces Combined.ds without creating the intermediate files. When combining these three steps into one, you use a named virtual data set to connect the operators.
$ osh "remdup -key Month -key Customer < CaliforniaRaw.ds > California.v; remdup -key Month -key Customer < NationalRaw.ds | merge -key Month -key Customer < California.v > Combined.ds"

In this example, California.v is a named virtual data set used as input to merge.
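The remove-duplicates-then-merge flow can be modeled in Python to see what each stage contributes. This is a sketch under stated assumptions, not DataStage code: the helper names key_of, remdup, and merge are hypothetical stand-ins for the operators, the sketch assumes that the first record of each key group is the one retained, and the raw data below simply repeats the sample records from this example.

```python
def key_of(record, keys):
    """Build a tuple of the merge key values for a record."""
    return tuple(record[k] for k in keys)


def remdup(records, keys):
    """Sort on the keys and keep one record per key (assumed: the first one)."""
    unique = {}
    for rec in sorted(records, key=lambda r: key_of(r, keys)):
        unique.setdefault(key_of(rec, keys), rec)
    return list(unique.values())


def merge(masters, updates, keys):
    """Merge de-duplicated master and update records that share the same key."""
    updates_by_key = {key_of(u, keys): u for u in updates}
    merged = []
    for master in masters:
        update = updates_by_key.get(key_of(master, keys))
        if update is not None:
            record = dict(master)
            for name, value in update.items():
                record.setdefault(name, value)  # master fields take precedence
            merged.append(record)
    return merged


bob_national = {"customer": 86111, "month": "JUN", "name": "Jones, Bob",
                "balance": 345.98, "salesman": "Steve", "accountType": 12}
bob_california = {"customer": 86111, "month": "JUN", "name": "Jones, Bob",
                  "calBalance": 637.04, "status": "Normal"}

national_raw = [bob_national, dict(bob_national)]  # contains a duplicate
california_raw = [bob_california]
keys = ["month", "customer"]

combined = merge(remdup(national_raw, keys), remdup(california_raw, keys), keys)
```

Keeping the intermediate results of remdup in variables corresponds to the multi-step version with persistent data sets; passing them straight into merge, as here, corresponds to the single-step version.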

Example 2: handling duplicate fields


If the record schema for CaliforniaRaw.ds from the previous example is changed so that it now has a field named Balance, both the master and the update data sets will have a field with the same name. By default, the Balance field from the master record is output to the merged record and the field from the update data set is ignored.


The following figure shows record schemas for the NationalRaw.ds and CaliforniaRaw.ds in which both schemas have a field named Balance:

Master data set NationalRaw.ds schema: customer:int16; month:string[3]; balance:sfloat; salesman:string[8]; accountType:int8;

Update data set CaliforniaRaw.ds schema: customer:int16; month:string[3]; balance:sfloat; CalBalance:sfloat; status:string[8];

If you want the Balance field from the update data set to be output by the merge operator, you have two alternatives, both using the modify operator.
- Rename the Balance field in the master data set.
- Drop the Balance field from the master record.

In either case, the Balance field from the update data set propagates to the output record because it is the only Balance field. The following figure shows the data flow for both methods.

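[Figure: within a single step, the master data set passes through remdup and then through modify, which drops or renames the Balance field; CaliforniaRaw.ds (update) passes through remdup to the virtual data set California.v; both are input to merge, which writes the output.]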

Renaming a duplicate field


The osh command for this approach is:
$ osh "remdup -key Month -key Customer < CaliforniaRaw.ds > California.v; remdup -key Month -key Customer < NationalRaw.ds | modify OldBalance = Balance | merge -key Month -key Customer < California.v > Combined.ds"

The name of the Balance field has been changed to OldBalance. The Balance field from the update data set no longer conflicts with a field in the master data set and is added to records by the merge.

Dropping duplicate fields


Another method of handling duplicate field names is to drop Balance from the master record. The Balance field from the update record is written out to the merged record because it is now the only field with that name. The osh command for this approach is:
$ osh "remdup -key Month -key Customer < CaliforniaRaw.ds > California.v; remdup -key Month -key Customer < NationalRaw.ds | modify DROP Balance | merge -key Month -key Customer < California.v > Combined.ds"
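The difference between the two alternatives can be seen in a small Python model. It is illustrative only: rename_field, drop_field, and merge_pair are hypothetical helpers standing in for the modify and merge operators, and the update record's balance value of 400.0 is invented sample data.

```python
def rename_field(record, old, new):
    """Return a copy of the record with one field renamed."""
    return {(new if name == old else name): value for name, value in record.items()}


def drop_field(record, field):
    """Return a copy of the record without the named field."""
    return {name: value for name, value in record.items() if name != field}


def merge_pair(master, update):
    """Merge one master and one update record; master fields take precedence."""
    out = dict(master)
    for name, value in update.items():
        out.setdefault(name, value)
    return out


master = {"customer": 86111, "month": "JUN", "balance": 345.98,
          "salesman": "Steve", "accountType": 12}
update = {"customer": 86111, "month": "JUN", "balance": 400.0,
          "CalBalance": 637.04, "status": "Normal"}

default = merge_pair(master, update)  # master balance wins
renamed = merge_pair(rename_field(master, "balance", "OldBalance"), update)
dropped = merge_pair(drop_field(master, "balance"), update)
```

Renaming keeps both values (the master's under OldBalance and the update's under balance), while dropping keeps only the update's value. Which one to choose depends on whether the master's original balance is still needed downstream.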

Job scenario: galactic industries


This section contains an extended example that illustrates the use of the merge operator in a semi-realistic data flow. The example is followed by an explanation of why the operators were chosen. Files have been provided to allow you to run this example yourself. The files are in the $APT_ORCHHOME/examples/doc/mergeop subdirectory of the parallel engine directory.

Galactic Industries stores certain customer data in one database table and orders received for a given month in another table.

The customer table contains one entry per customer, indicating the location of the customer, how long she has been a customer, the customer contact, and other customer data. Each customer in the table is also assigned a unique identifier, cust_id. However, the customer table contains no information concerning what the customer has ordered.

The order table contains details about orders placed by customers; for each product ordered, the table lists the product name, amount, price per unit, and other product information. The order table can contain many entries for a given customer. However, the only information about customers in the order table is the customer identification field, indicated by a cust_id field which matches an entry in the customer table.

Each month Galactic Industries needs to merge the customer information with the order information to produce reports, such as how many of a given product were ordered by customers in a given region. Because the reports are reviewed by human eyes, they also need to perform a lookup operation which ties a description of each product to a product_id. Galactic Industries performs this merge and lookup operation using WebSphere DataStage.

The WebSphere DataStage solution is based on the fact that Galactic Industries has billions of customers, trillions of orders, and needs the reports fast. The osh script for the solution follows.
# import the customer file; store as a virtual data set.
import
  -schema $CUSTOMER_SCHEMA
  -file customers.txt
  -readers 4 |
peek -name -nrecs 1 >customers.v;

# import the order file; store as a virtual data set.
import
  -schema $ORDER_SCHEMA
  -file orders.txt
  -readers 4 |
peek -name -nrecs 1 >orders.v;

# import the product lookup table; store as a virtual data set.
import
  -schema $LOOKUP_SCHEMA
  -file lookup_product_id.txt |
entire | # entire partitioning only necessary in MPP environments
peek -name -nrecs 1 >lookup_product_id.v;

# merge customer data with order data; lookup product descriptions;
# store as a persistent data set.
merge
  -key cust_id
  -dropBadMasters # customer did not place an order this period
  < customers.v
  < orders.v
  1>| orders_without_customers.ds | # if not empty, we have a problem
lookup
  -key product_id
  -ifNotFound continue # allow products that don't have a description
  < lookup_product_id.v |
peek -name -nrecs 10 >| customer_orders.ds;

Why the merge operator is used


The merge operator is not the only component in the WebSphere DataStage library capable of merging the customer and order tables. An identical merged output data set could be produced with either the lookup or innerjoin operator. Furthermore, if the -dropBadMasters behavior was not chosen, merging could also be performed using the leftouterjoin operator.

Galactic Industries' needs make the merge operator the best choice. If the lookup operator were used, the customer table would be used as the lookup table. Since Galactic Industries has billions of customers and only a few gigabytes of RAM on its SMP, the data would have to spill over onto paging space, resulting in a dramatic decrease in processing speed.

Because Galactic Industries is interested in identifying entries in the order table that do not have a corresponding entry in the customer table, the merge operator is a better choice than the innerjoin or leftouterjoin operator, because the merge operator allows for the capture of bad update records in a reject data set (orders_without_customers.ds in the script above).

Why the lookup operator is used


Similar functionality can be obtained from merge, lookup, or one of the join operators. For the task of appending a product description field to each record, the lookup operator is most suitable in this case for the following reasons.
- Since there are around 500 different products and the length of each description is approximately 50 bytes, a lookup table consists of only about 25 kilobytes of data. The size of the data makes the implementation of a lookup table in memory feasible, and means that the scan of the lookup table based on the key field is a relatively fast operation.
- Use of either the merge or one of the join operators would necessitate a repartition and resort of the data based on the lookup key. In other words, having partitioned and sorted the data by cust_id to accomplish the merge of customer and order tables, Galactic Industries would then have to perform a second partitioning and sorting operation based on product_id in order to accomplish the lookup of the product description using either merge or innerjoin.

Given the small size of the lookup table and the huge size of the merged customer/order table, the lookup operator is clearly the more efficient choice.


Why the entire operator is used


The lookup data set lookup_product_id.v is entire partitioned. The entire partitioner copies all records in the lookup table to all partitions ensuring that all values are available to all records for which lookup entries are sought. The entire partitioner is only required in MPP environments, due to the fact that memory is not shared between nodes of an MPP. In an SMP environment, a single copy of the lookup table is stored in memory that can be accessed by all nodes of the SMP. Using the entire partitioner in the flow makes this example portable to MPP environments, and due to the small size of the lookup table is not particularly wasteful of resources.
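The following Python sketch models why entire partitioning makes the lookup work when partitions do not share memory. It is not DataStage code: entire_partition and hash_partition are hypothetical helpers, the sample orders and products are invented, and the partition count of 2 is arbitrary.

```python
def entire_partition(records, num_partitions):
    """Give every partition its own full copy of the records."""
    return [list(records) for _ in range(num_partitions)]


def hash_partition(records, key, num_partitions):
    """Send each record to one partition chosen from its key value."""
    partitions = [[] for _ in range(num_partitions)]
    for rec in records:
        partitions[hash(rec[key]) % num_partitions].append(rec)
    return partitions


# Hypothetical data: orders are spread by cust_id, products are copied everywhere.
orders = [
    {"cust_id": 1, "product_id": 10},
    {"cust_id": 2, "product_id": 11},
    {"cust_id": 3, "product_id": 10},
]
products = [
    {"product_id": 10, "description": "widget"},
    {"product_id": 11, "description": "gadget"},
]

num_partitions = 2
order_parts = hash_partition(orders, "cust_id", num_partitions)
product_parts = entire_partition(products, num_partitions)

# Each partition resolves its own orders using only its local copy of the table.
resolved = {}
for local_orders, local_products in zip(order_parts, product_parts):
    table = {p["product_id"]: p["description"] for p in local_products}
    for order in local_orders:
        resolved[order["cust_id"]] = table[order["product_id"]]
```

Because every partition holds all product records, no order fails to find its description regardless of which partition it landed in. The cost is one copy of the table per partition, which is acceptable here only because the table is small.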

Missing records
The merge operator expects that for each master record there exists a corresponding update record, based on the merge key fields, and vice versa. If the merge operator takes a single update data set as input, the update data set might contain multiple update records for a single master record. By using command-line options to the operator, you can specify the action of the operator when a master record has no corresponding update record (a bad master record) or when an update record has no corresponding master record (a bad update record).

Handling bad master records


A master record with no corresponding update record is called a bad master. When a master record is encountered which has no corresponding update record, you can specify whether the master record is to be copied to the output or dropped. You can also request that you get a warning message whenever this happens. By default, the merge operator writes a bad master to the output data set and issues a warning message. Default values are used for fields in the output record which would normally have values taken from the update data set. You can specify -nowarnBadMasters to the merge operator to suppress the warning message issued for each bad master record. Suppose the data in the master record is:

Record in National.ds data set:
86111 JUN Lee, Mary 345.98 Steve 12


The first field, Customer, and the second field, Month, are the key fields. If the merge operator cannot find a record in the update data set for customer 86111 for the month of June, then the output record consists of the master record's fields followed by the fields that would normally come from the update data set.

The last two fields in the output record, OldBalance and Status, come from the update data set. Since there is no update record from which to get values for the OldBalance and Status fields, default values are written to the output record. This default value of a field is the default for that particular data type. Thus, if the value is an sfloat, the field in the output data set has a value of 0.0. For a fixed-length string, the default value for every byte is 0x00.

If you specify -dropBadMasters, master records with no corresponding update record are discarded (not copied to the output data set).
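The treatment of bad masters can be sketched in Python as follows. This is a hypothetical model, not DataStage code: merge_with_bad_masters and default_for are invented names, the field names and the string[8] length for Status are assumed for illustration, and only the two type defaults stated above (0.0 for sfloat, 0x00 bytes for fixed-length strings) are modeled.

```python
def default_for(type_name):
    """Return the default value for the two data types discussed in the text."""
    if type_name == "sfloat":
        return 0.0
    if type_name.startswith("string[") and type_name.endswith("]"):
        length = int(type_name[len("string["):-1])
        return "\x00" * length  # every byte of a fixed-length string is 0x00
    raise ValueError("type not covered by this sketch: " + type_name)


def merge_with_bad_masters(masters, updates, keys, update_field_types,
                           drop_bad_masters=False):
    """Merge records; fill update fields with defaults for masters without an update."""
    updates_by_key = {tuple(u[k] for k in keys): u for u in updates}
    output = []
    for master in masters:
        update = updates_by_key.get(tuple(master[k] for k in keys))
        if update is None:
            if drop_bad_masters:
                continue  # models -dropBadMasters
            # Models the default (-keepBadMasters): use type defaults instead.
            update = {name: default_for(t) for name, t in update_field_types.items()}
        record = dict(master)
        for name, value in update.items():
            record.setdefault(name, value)
        output.append(record)
    return output


masters = [{"Customer": 86111, "Month": "JUN", "Name": "Lee, Mary",
            "Balance": 345.98, "Salesman": "Steve", "AccountType": 12}]
update_types = {"OldBalance": "sfloat", "Status": "string[8]"}

kept = merge_with_bad_masters(masters, [], ["Customer", "Month"], update_types)
dropped = merge_with_bad_masters(masters, [], ["Customer", "Month"], update_types,
                                 drop_bad_masters=True)
```

In kept, the single output record carries the master's fields plus OldBalance and Status set to their type defaults; dropped is empty.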

Handling bad update records


When an update record is encountered which has no associated master record, you can control whether the update record is dropped or is written out to a separate reject data set.


In order to collect bad update records from an update data set, you attach one output data set, called a reject data set, for each update data set. The presence of a reject data set configures the merge operator to write bad update records to the reject data set.

In the case of a merge operator taking as input multiple update data sets, you must attach a reject data set for each update data set if you want to save bad update records. You cannot selectively collect bad update records from a subset of the update data sets.

By default, the merge operator issues a warning message when it encounters a bad update record. You can use the -nowarnBadUpdates option to the operator to suppress this warning.

For example, suppose you have a data set named National.ds that has one record for each value of the key field Customer. You also have an update data set named California.ds, which also has one record per Customer. If you now merge these two data sets, and include a reject data set, bad update records are written to the reject data set for all customer records from California.ds that are not already in National.ds. If the reject data set is empty after the completion of the operator, it means that all of the California.ds customers already have National.ds records. The following diagram shows an example using a reject data set.

[Figure: a step in which the merge operator writes the merged output to Combined.ds and the bad update records to CalReject.ds.]

In osh, the command is:


$ osh "merge -key customer < National.ds < California.ds > Combined.ds > CalReject.ds"

After this step executes, CalReject.ds contains all records from the update data set that did not have a corresponding record in the master data set. The following diagram shows the merge operator with multiple update sets (U1 and U2) and reject data sets (R1 and R2). In the figure, M indicates the master data set and O indicates the merged output data set.

Chapter 7. Operators


[Figure: the merge operator with input data sets M, U1, and U2 and output data sets O, R1, and R2]


As you can see, you must specify a reject data set for each update data set in order to save bad update records. You must also specify the output reject data sets in the same order as you specified the input update data sets. For example:
$ osh "merge -key customer < National.ds < California.ds < NewYork.ds > Combined.ds > CalRejects.ds > NewYorkRejects.ds"
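
The pairing of update and reject data sets can be modelled outside osh. The following Python sketch is illustrative only: the function merge_with_rejects and the dictionary-based records are hypothetical and are not part of WebSphere DataStage, and the sketch assumes that master values win when a field name appears in both records. It shows that the nth reject list receives the bad update records of the nth update data set.

```python
def merge_with_rejects(master, updates, key):
    """Model of merge: return (merged records, one reject list per update set)."""
    # Index master records by key; copy them so the inputs are not modified.
    merged = {rec[key]: dict(rec) for rec in master}
    rejects = []
    for update_set in updates:
        bad = []
        for rec in update_set:
            if rec[key] in merged:
                # Assumption: fields already present on the master are kept.
                for name, value in rec.items():
                    merged[rec[key]].setdefault(name, value)
            else:
                # No master record has this key, so this is a bad update record.
                bad.append(rec)
        rejects.append(bad)
    return list(merged.values()), rejects


national = [{"customer": 1, "name": "A"}, {"customer": 2, "name": "B"}]
california = [{"customer": 2, "ca_sales": 10}, {"customer": 3, "ca_sales": 5}]
new_york = [{"customer": 1, "ny_sales": 7}, {"customer": 4, "ny_sales": 9}]

combined, (cal_rejects, ny_rejects) = merge_with_rejects(
    national, [california, new_york], "customer"
)
```

Because the reject lists are returned in the same order as the update lists, swapping the two update inputs would also swap which list holds the California rejects, which mirrors the ordering rule above.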

Modify operator
The modify operator takes a single data set as input and alters (modifies) the record schema of the input data set to create the output data set. The modify operator changes the representation of data before or after it is processed by another operator, or both. Use it to modify:
• Elements of the record schema of an input data set to the interface required by the operator to which it is input
• Elements of an operator's output to those required by the data set that receives the results

The operator performs the following modifications:
• Keeping and dropping fields
• Renaming fields
• Changing a field's data type
• Changing the null attribute of a field

The modify operator has no usage string.

Data flow diagram


[Figure: input data set -> modify -> output data set]



modify: properties
Table 32. modify properties

Property                                        Value
Number of input data sets                       1
Number of output data sets                      1
Partitioning method                             any (parallel mode)
Collection method                               any (sequential mode)
Preserve-partitioning flag in output data set   propagated
Composite operator                              no

Modify: syntax and options


Terms in italic typeface are option strings you supply. When your option string contains a space or a tab character, you must enclose it in single quotes.
modify 'modify_spec1; modify_spec2; ... modify_specn;'

where each modify_spec specifies a conversion you want to perform. Performing Conversions describes the conversions that the modify operator can perform.
• Enclose the list of modifications in single quotation marks.
• Separate the modifications with a semi-colon.
• If modify_spec takes more than one argument, separate the arguments with a comma and terminate the argument list with a semi-colon, as in the following example:
modify 'keep field1, field2, ... fieldn;'

Multi-byte Unicode character data is supported for fieldnames in the modify specifications below.

The modify_spec can be one of the following:
• DROP
• KEEP
• replacement_spec
• NOWARN

To drop a field:
DROP fieldname [, fieldname ...]

To keep a field:
KEEP fieldname [, fieldname ...]

To change the name or data type of a field, or both, specify a replacement_spec, which takes the form:
new-fieldname [:new-type] = [explicit-conversion-spec] old-fieldname

Replace the old field name with the new one. The default type of the new field is the same as that of the old field unless it is specified by the output type of the conversion-spec, if provided. Multiple new fields can be instantiated based on the same old field.


When there is an attempt to put a null in a field that has not been defined as nullable, WebSphere DataStage issues an error message and terminates the job. However, a warning is issued at step-check time. To disable the warning, specify the NOWARNF option.

Transfer behavior
Fields of the input data set that are not acted on by the modify operator are transferred to the output data set unchanged. In addition, changes made to fields are permanent. Thus:
• If you drop a field from processing by means of the modify operator, it does not appear in the output of the operation for which you have dropped it.
• If you use an upstream modify operator to change the name, type, or both of an input field, the change is permanent in the output unless you restore the field name, type, or both by invoking the modify operator downstream of the operation for whose sake the field name was changed.

In the following example, the modify operator changes field names upstream of an operation and restores them downstream of the operation, as indicated in the following table.
                         Source Field Name   Destination Field Name
Upstream of Operator     aField              field1
                         bField              field2
                         cField              field3
Downstream of Operator   field1              aField
                         field2              bField
                         field3              cField

You set these fields with the command:


$ osh " ... | modify 'field1=aField; field2=bField; field3=cField;' | op | modify 'aField=field1; bField=field2; cField=field3;' ..."
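
The upstream rename and downstream restore can be pictured with a small Python model. This is a sketch only: rename_fields and the dictionary records are hypothetical stand-ins for the modify operator and a data set record, and the extra field named other is invented to show the transfer of untouched fields.

```python
def rename_fields(record, spec):
    """Apply new=old renames (a model of modify 'new=old;') to one record."""
    old_to_new = {old: new for new, old in spec.items()}
    # Fields that the spec does not mention are transferred unchanged.
    return {old_to_new.get(name, name): value for name, value in record.items()}


# Each spec maps the new field name to the old field name, as in new=old.
upstream = {"field1": "aField", "field2": "bField", "field3": "cField"}
downstream = {"aField": "field1", "bField": "field2", "cField": "field3"}

record = {"aField": 1, "bField": 2, "cField": 3, "other": 4}
renamed = rename_fields(record, upstream)       # what the operator op would see
restored = rename_fields(renamed, downstream)   # names restored after op
```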

Avoiding contiguous modify operators


Set the APT_INSERT_COPY_BEFORE_MODIFY environment variable to enable the automatic insertion of a copy operator before a modify operator. This process ensures that your data flow does not have contiguous modify operators, a practice which is not supported in WebSphere DataStage. When this variable is not set and the operator immediately preceding a modify operator in the data flow also includes a modify operator, WebSphere DataStage removes the downstream modify operator.

Performing conversions
The section Allowed Conversions provides a complete list of conversions you can effect using the modify operator. This section discusses these topics:
• Performing Conversions
• Keeping and Dropping Fields
• Renaming Fields
• Duplicating a Field and Giving It a New Name
• Changing a Field's Data Type
• Default Data Type Conversion
• Date Field Conversions
• Decimal Field Conversions
• Raw Field Length Extraction
• String and Ustring Field Conversions
• String Conversions and Lookup Tables
• Time Field Conversions
• Timestamp Field Conversions
• The modify Operator and Nulls
• The modify Operator and Partial Schemas
• The modify Operator and Vectors
• The modify Operator and Aggregate Schema Components

Keeping and dropping fields


Invoke the modify operator to keep fields in or drop fields from the output. Here are the effects of keeping and dropping fields:
• If you choose to drop a field or fields, all fields are retained except those you explicitly drop.
• If you choose to keep a field or fields, all fields are excluded except those you explicitly keep.

In osh you specify either the keyword keep or the keyword drop to keep or drop a field, as follows:
modify 'keep field1, field2, ... fieldn;'
modify 'drop field1, field2, ... fieldn;'
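
The complementary effect of keep and drop can be sketched in Python. The functions keep and drop below are hypothetical models that operate on a dictionary standing in for one record; they are not DataStage APIs.

```python
def keep(record, *fields):
    """Model of modify 'keep ...;': exclude every field not explicitly kept."""
    return {name: value for name, value in record.items() if name in fields}


def drop(record, *fields):
    """Model of modify 'drop ...;': retain every field not explicitly dropped."""
    return {name: value for name, value in record.items() if name not in fields}


record = {"field1": 10, "field2": 20, "field3": 30}
kept = keep(record, "field1", "field2")
dropped = drop(record, "field1", "field2")
```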

Renaming fields
To rename a field, specify the attribution operator (=), as follows:
modify 'newField1=oldField1; newField2=oldField2; ...newFieldn=oldFieldn;'

Duplicating a field and giving it a new name


You can duplicate a field and give it a new name, that is, create multiple new names for the same old name. You can also convert the data type of a field and give it a new name.

Note: This does not work with aggregates.

To duplicate and rename a field, or duplicate it and change its data type, use the attribution operator (=). The operation must be performed by one modify operator, that is, the renaming and duplication must be specified in the same command as follows:
$ osh "modify 'a_1 = a; a_2 = a;' "
$ osh "modify 'c_1 = conversionSpec(c); c_2 = conversionSpec(c);' "

where:
• a and c are the original field names; a_1 and a_2 are the duplicated field names; c_1 and c_2 are the duplicated and converted field names
• conversionSpec is the data type conversion specification, discussed in the next section

Changing a field's data type


Sometimes, although field names are the same, an input field is of a type that differs from that of the same field in the output, and conversion must be performed. WebSphere DataStage often automatically changes the type of the source field to match that of the destination field. Sometimes, however, you must invoke the modify operator to perform explicit conversion. The next sections discuss default data type conversion and data type conversion errors. The subsequent sections discuss non-default conversions of WebSphere DataStage data types.

Default data type conversion


For a data set to be used as input to or output from an operator, its record schema must be compatible with that of the operator's interface. That is:
• The names of the data set's fields must be identical to the names of the corresponding fields in the operator interface. Use the modify operator to change them if they are not (see Renaming Fields).
• The data type of each field in the data set must be compatible with that of the corresponding field in the operator interface. Data types are compatible if WebSphere DataStage can perform a default data type conversion, translating a value in a source field to the data type of a destination field.

The following figure shows an input data set schema in which the data types of some fields do not match those of the corresponding fields, and WebSphere DataStage's default conversion of these types:
[Figure: an input data set with schema field1:int8; field2:int16; field3:int16; is read by the modify operator, whose interface is field1:int32; field2:int16; field3:sfloat; and which writes an output data set with schema field1:int32; field2:int16; field3:sfloat;]


In this example, the disparate fields are compatible and:
• The data type of field1 is automatically converted from int8 to int32.
• The data type of field3 is automatically converted from int16 to sfloat.

WebSphere DataStage performs default type conversions on WebSphere DataStage built-in numeric types (integer and floating point) as defined in C: A Reference Manual (3rd edition) by Harbison and Steele. WebSphere DataStage also performs default data conversions involving decimal, date, time, and timestamp fields. The remaining allowable data type conversions are performed explicitly, using the modify operator, as described in this topic.

The following table shows the default data type conversions performed by WebSphere DataStage and the conversions that you can perform with the modify operator.
Destination Field Source Field int8 int8 uint8 d uint8 d int16 d d uint16 d d int32 d d uint32 d d int64 d d uint64 d d sfloat d d dfloat dm d


Destination Field Source Field int8 int16 uint16 int32 uint32 int64 uint64 sfloat dfloat decimal string ustring raw date time time stamp dm d dm d dm d dm dm dm dm dm m m m m m uint8 d d d d d d d d d d d d d d d d d d d dm dm d d d d d d d d d d d d d d dm d d m m m m m m m d d d d d dm dm d d d dm d d d d dm d d d d d d dm dm dm int16 uint16 d int32 d d uint32 d d d int64 d d d d uint64 d d d d d sfloat d d d d d d dfloat d d d d d d d

Source Field decimal int8 uint8 int16 uint16 int32 uint32 int64 uint64 sfloat dfloat decimal string ustring raw date time timestamp m m m m m m m m m dm dm dm d d d d d d d d d d dm string dm d dm dm dm m d d d dm dm ustring dm d dm dm dm m d d d dm dm d m m m m m m m m m m raw date m time m timestamp m

d = default conversion; m = modify operator conversion; blank = no conversion needed or provided


Data type conversion errors


A data type conversion error occurs when a conversion cannot be performed. WebSphere DataStage's action when it detects such an error differs according to whether the destination field has been defined as nullable, according to the following three rules:
• If the destination field has been defined as nullable, WebSphere DataStage sets it to null.
• If the destination field has not been defined as nullable but you have directed modify to convert a null to a value, WebSphere DataStage sets the destination field to the value. To convert a null to a value, supply the handle_null conversion specification. For complete information on converting a null to a value, see Out-of-Band to Normal Representation.
• If the destination field has not been defined as nullable, WebSphere DataStage issues an error message and terminates the job. However, a warning is issued at step-check time. To disable the warning, specify the nowarn option.
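
The three rules can be read as a decision order. The Python sketch below is a simplified model under stated assumptions: convert_field, ConversionError, and the replacement parameter are hypothetical names, and Python's ValueError and TypeError stand in for a conversion that cannot be performed.

```python
class ConversionError(Exception):
    """Raised when a non-nullable destination has no fallback value."""


_UNSET = object()  # sentinel meaning that no replacement value was supplied


def convert_field(value, converter, nullable, replacement=_UNSET):
    try:
        return converter(value)
    except (ValueError, TypeError):
        if nullable:
            # Rule 1: a nullable destination field is set to null.
            return None
        if replacement is not _UNSET:
            # Rule 2: a handle_null-style replacement value is used instead.
            return replacement
        # Rule 3: otherwise the conversion error terminates the job.
        raise ConversionError(f"cannot convert {value!r}")
```

For instance, convert_field("x", int, nullable=True) returns None, while the same call with nullable=False and no replacement raises ConversionError.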

How to convert a data type


To convert the data type of a field, pass the following argument to the modify operator:
destField[ : dataType] = [conversionSpec](sourceField);

where:
• destField is the field in the output data set
• dataType optionally specifies the data type of the output field. This option is allowed only when the output data set does not already have a record schema, which is typically the case.
• sourceField specifies the field in the input data set
• conversionSpec specifies the data type conversion specification; you need not specify it if a default conversion exists (see Default Data Type Conversion). A conversion specification can be double quoted, single quoted, or not quoted, but it cannot be a variable.

Note that once you have used a conversion specification to perform a conversion, WebSphere DataStage performs the necessary modifications to translate a conversion result to the numeric data type of the destination. For example, you can use the conversion hours_from_time to convert a time to an int8, or to an int16, int32, dfloat, and so on.

Date field conversions


WebSphere DataStage performs no automatic type conversion of date fields. Either an input data set must match the operator interface or you must effect a type conversion by means of the modify operator. The following table lists the conversions involving the date field. For a description of the formats, refer to date formats.
Conversion Specification
    Description

dateField = date_from_days_since[date](int32Field)
    date from days since
    Converts an integer field into a date by adding the integer to the specified base date. The date must be in the format yyyy-mm-dd.

dateField = date_from_julian_day(uint32Field)
    date from Julian day

dateField = date_from_string[date_format | date_uformat](stringField)
dateField = date_from_ustring[date_format | date_uformat](ustringField)
    date from string or ustring
    Converts the string or ustring field to a date representation using the specified date_format. By default, the string format is yyyy-mm-dd. date_format and date_uformat are described in date formats.

dateField = date_from_timestamp(tsField)
    date from timestamp
    Converts the timestamp to a date representation.

int8Field = month_day_from_date(dateField)
    day of month from date

int8Field = weekday_from_date[originDay](dateField)
    day of week from date
    originDay is a string specifying the day considered to be day zero of the week. You can specify the day using either the first three characters of the day name or the full day name. If omitted, Sunday is defined as day zero. The originDay can be either single- or double-quoted or the quotes can be omitted.

int16Field = year_day_from_date(dateField)
    day of year from date (returned value 1-366)

int32Field = days_since_from_date[source_date](dateField)
    days since date
    Returns a value corresponding to the number of days from source_date to the contents of dateField. source_date must be in the form yyyy-mm-dd and can be quoted or unquoted.

uint32Field = julian_day_from_date(dateField)
    Julian day from date

int8Field = month_from_date(dateField)
    month from date

dateField = next_weekday_from_date[day](dateField)
    next weekday from date
    The destination contains the date of the specified day of the week soonest after the source date (including the source date). day is a string specifying a day of the week. You can specify day by either the first three characters of the day name or the full day name. The day can be quoted in either single or double quotes or quotes can be omitted.

dateField = previous_weekday_from_date[day](dateField)
    previous weekday from date
    The destination contains the closest date for the specified day of the week earlier than the source date (including the source date). The day is a string specifying a day of the week. You can specify day using either the first three characters of the day name or the full day name. The day can be either single- or double-quoted or the quotes can be omitted.

stringField = string_from_date[date_format | date_uformat](dateField)
ustringField = ustring_from_date[date_format | date_uformat](dateField)
    strings and ustrings from date
    Converts the date to a string or ustring representation using the specified date_format. By default, the string format is yyyy-mm-dd. date_format and date_uformat are described in date formats.

tsField = timestamp_from_date[time](dateField)
    timestamp from date
    The time argument optionally specifies the time to be used in building the timestamp result and must be in the form hh:nn:ss. If omitted, the time defaults to midnight.

int16Field = year_from_date(dateField)
    year from date

int8Field = year_week_from_date(dateField)
    week of year from date

A date conversion to or from a numeric field can be specified with any WebSphere DataStage numeric data type. WebSphere DataStage performs the necessary modifications and either translates a numeric field to the source data type shown above or translates a conversion result to the numeric data type of the destination. For example, you can use the conversion month_day_from_date to convert a date to an int8, or to an int16, int32, dfloat, and so on.
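
The arithmetic behind a few of these conversions can be sketched with Python's standard datetime module. The functions below borrow the conversion names for readability, but they are hypothetical models written for this illustration, not DataStage code, and the helper _weekday_index is likewise an assumption. They follow the descriptions above: both weekday conversions include the source date itself, and day names can be given in full or as their first three characters.

```python
from datetime import date, timedelta

# date.weekday() numbers the days Monday=0 through Sunday=6.
WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def _weekday_index(day):
    """Accept a full day name or its first three characters, in any case."""
    day = day.lower()
    for index, name in enumerate(WEEKDAYS):
        if day in (name, name[:3]):
            return index
    raise ValueError(f"unknown day name: {day}")


def date_from_days_since(base, days):
    # base is a yyyy-mm-dd string; the integer is added to that base date.
    return date.fromisoformat(base) + timedelta(days=days)


def days_since_from_date(source_date, d):
    # Number of days from source_date (yyyy-mm-dd) to d.
    return (d - date.fromisoformat(source_date)).days


def next_weekday_from_date(day, d):
    # Soonest date on or after d that falls on the requested weekday.
    return d + timedelta(days=(_weekday_index(day) - d.weekday()) % 7)


def previous_weekday_from_date(day, d):
    # Closest date on or before d that falls on the requested weekday.
    return d - timedelta(days=(d.weekday() - _weekday_index(day)) % 7)
```

For example, 21 September 2005 was a Wednesday, so next_weekday_from_date("wed", date(2005, 9, 21)) returns the same date, while "fri" gives 23 September 2005.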

date formats
Four conversions, string_from_date, ustring_from_date, date_from_string, and date_from_ustring, take as a parameter of the conversion a date format or a date uformat. These formats are described below. The default format of the date contained in the string is yyyy-mm-dd. The format string requires that you provide enough information for WebSphere DataStage to determine a complete date (either day, month, and year, or year and day of year).

date uformat
The date uformat provides support for international components in date fields. Its syntax is:
String%macroString%macroString%macroString

where %macro is a date formatting macro such as %mmm for a 3-character English month. See the following table for a description of the date format macros. Only the String components of date uformat can include multi-byte Unicode characters.

date format
The format string requires that you provide enough information for WebSphere DataStage to determine a complete date (either day, month, and year, or year and day of year). The format_string can contain one or a combination of the following elements:
Table 33. Date format tags

Tag       Variable width   Description                                  Value range              Options
          availability
%d        import           Day of month, variable width                 1...31                   s
%dd                        Day of month, fixed width                    01...31                  s
%ddd      with v option    Day of year                                  1...366                  s, v
%m        import           Month of year, variable width                1...12                   s
%mm                        Month of year, fixed width                   01...12                  s
%mmm                       Month of year, short name, locale specific   Jan, Feb ...             t, u, w
%mmmm     import/export    Month of year, full name, locale specific    January, February ...    t, u, w, -N, +N
%yy                        Year of century                              00...99                  s
%yyyy                      Four digit year                              0001...9999
%NNNNyy                    Cutoff year plus year of century             yy = 00...99             s
%e                         Day of week, Sunday = day 1                  1...7
%E                         Day of week, Monday = day 1                  1...7
%eee                       Weekday short name, locale specific          Sun, Mon ...             t, u, w
%eeee     import/export    Weekday long name, locale specific           Sunday, Monday ...       t, u, w, -N, +N
%W        import           Week of year (ISO 8601, Mon)                 1...53                   s
%WW                        Week of year (ISO 8601, Mon)                 01...53                  s

When you specify a date format string, prefix each component with the percent symbol (%) and separate the string's components with a suitable literal character. The default date_format is %yyyy-%mm-%dd.

Where indicated the tags can represent variable-width data elements. Variable-width date elements can omit leading zeroes without causing errors.

The following options can be used in the format string where indicated in the table:

s     Specify this option to allow leading spaces in date formats. The s option is specified in the form:
      %(tag,s)
      where tag is the format string. For example:
      %(m,s)
      indicates a numeric month of year field in which values can contain leading spaces or zeroes and be one or two characters wide. If you specified the following date format property:
      %(d,s)/%(m,s)/%yyyy
      then the following dates would all be valid:
      8/ 8/1958
      08/08/1958
      8/8/1958

v     Use this option in conjunction with the %ddd tag to represent day of year in variable-width format. So the following date property:
      %(ddd,v)
      represents values in the range 1 to 366. (If you omit the v option then the range of values would be 001 to 366.)

u     Use this option to render uppercase text on output.

w     Use this option to render lowercase text on output.

t     Use this option to render titlecase text (initial capitals) on output.

The u, w, and t options are mutually exclusive. They affect how text is formatted for output. Input dates will still be correctly interpreted regardless of case.

-N    Specify this option to left justify long day or month names so that the other elements in the date will be aligned.

+N    Specify this option to right justify long day or month names so that the other elements in the date will be aligned.

Names are left justified or right justified within a fixed width field of N characters (where N is between 1 and 99). Names will be truncated if necessary. The following are examples of justification in use:

%dd-%(mmmm,-5)-%yyyy
21-Augus-2006

%dd-%(mmmm,-10)-%yyyy
21-August    -2005

%dd-%(mmmm,+10)-%yyyy
21-    August-2005
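
The truncation and padding rule for the -N and +N options can be reproduced in a few lines of Python. This is a sketch only: justify_name is a hypothetical helper, and it assumes that padding uses spaces and that truncation keeps the leading characters, as the examples above suggest.

```python
def justify_name(name, option):
    """Fit a day or month name into N characters; option is like '-10' or '+10'."""
    sign, width = option[0], int(option[1:])
    if sign not in "+-" or not 1 <= width <= 99:
        raise ValueError(f"invalid justification option: {option}")
    truncated = name[:width]          # names are truncated if necessary
    if sign == "-":
        return truncated.ljust(width)  # -N: left justify, pad on the right
    return truncated.rjust(width)      # +N: right justify, pad on the left


left_5 = justify_name("August", "-5")
left_10 = justify_name("August", "-10")
right_10 = justify_name("August", "+10")
```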

The locale for determining the setting of the day and month names can be controlled through the locale tag. This has the format:
%(L,locale)

Where locale specifies the locale to be set using the language_COUNTRY.variant naming convention supported by ICU. See NLS Guide for a list of locales. The default locale for month names and weekday names markers is English unless overridden by a %L tag or the APT_IMPEXP_LOCALE environment variable (the tag takes precedence over the environment variable if both are set).

Use the locale tag in conjunction with your date format. For example, the format string:
%(L,es)%eeee, %dd %mmmm %yyyy
specifies the Spanish locale and would result in a date with the following format:
miércoles, 21 septiembre 2005


The format string is subject to the restrictions laid out in the following table. A format string can contain at most one tag from each row. In addition some rows are mutually incompatible, as indicated in the incompatible with column. When some tags are used the format string requires that other tags are present too, as indicated in the requires column.

Table 34. Format tag restrictions

Element        Numeric format tags      Text format tags   Requires              Incompatible with
year           %yyyy, %yy, %[nnnn]yy
month          %mm, %m                  %mmm, %mmmm        year                  week of year
day of month   %dd, %d                                     month                 day of week, week of year
day of year    %ddd                                        year                  day of month, day of week, week of year
day of week    %e, %E                   %eee, %eeee        month, week of year   day of year
week of year   %WW                                         year                  month, day of month, day of year

When a numeric variable-width input tag such as %d or %m is used, the field to the immediate right of the tag (if any) in the format string cannot be either a numeric tag, or a literal substring that starts with a digit. For example, all of the following format strings are invalid because of this restriction:
%d%m-%yyyy
%d%mm-%yyyy
%(d)%(mm)-%yyyy
%h00 hours

The year_cutoff is the year defining the beginning of the century in which all two-digit years fall. By default, the year cutoff is 1900; therefore, a two-digit year of 97 represents 1997. You can specify any four-digit year as the year cutoff. All two-digit years then specify the next possible year ending in the specified two digits that is the same or greater than the cutoff. For example, if you set the year cutoff to 1930, the two-digit year 30 corresponds to 1930, and the two-digit year 29 corresponds to 2029.

On import and export, the year_cutoff is the base year. This property is mutually exclusive with days_since, text, and julian.

You can include literal text in your date format. Any Unicode character other than null, backslash, or the percent sign can be used (although it is better to avoid control codes and other non-graphic characters). The following table lists special tags and escape sequences:
Tag    Escape sequence
%%     literal percent sign
\%     literal percent sign
\n     newline
\t     horizontal tab
\\     single backslash
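
The year cutoff rule described earlier in this section (a two-digit year maps to the next year ending in those digits that is the same as or greater than the cutoff) can be checked with a short Python sketch. The function name resolve_two_digit_year is hypothetical and is used here only for illustration.

```python
def resolve_two_digit_year(yy, year_cutoff=1900):
    """Return the first year >= year_cutoff whose last two digits equal yy."""
    if not 0 <= yy <= 99:
        raise ValueError("yy must be between 0 and 99")
    # Start in the century that contains the cutoff year.
    year = year_cutoff - year_cutoff % 100 + yy
    if year < year_cutoff:
        # The candidate falls before the cutoff, so use the following century.
        year += 100
    return year
```

With the default cutoff of 1900, a two-digit year of 97 resolves to 1997; with a cutoff of 1930, the years 30 and 29 resolve to 1930 and 2029.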

For example, the format string %mm/%dd/%yyyy specifies that slashes separate the string's date components; the format %ddd-%yy specifies that the string stores the date as a value from 1 to 366, derives the year from the current year cutoff of 1900, and separates the two components with a dash (-).

The diagram shows the modification of a date field to three integers. The modify operator takes:
• The day of the month portion of a date field and writes it to an 8-bit integer
• The month portion of a date field and writes it to an 8-bit integer
• The year portion of a date field and writes it to a 16-bit integer

input data set schema:
    dField:date;

modify:
    dayField:int8 = month_day_from_date(dField);
    monthField:int8 = month_from_date(dField);
    yearField:int16 = year_from_date(dField);

output data set schema:
    dayField:int8; monthField:int8; yearField:int16;

Use the following osh command:


$ osh "...| modify 'dayField = month_day_from_date(dField); monthField = month_from_date(dField); yearField = year_from_date(dField);' | ..."

Decimal field conversions


By default WebSphere DataStage converts decimal fields to and from all numeric data types and to and from string fields. The default rounding method of these conversions is truncate toward zero. However, the modify operator can specify a different rounding method. See rounding type. The operator can specify fix_zero so that a source decimal containing all zeros (by default illegal) is treated as a valid decimal with a value of zero.

WebSphere DataStage does not perform range or representation checks of the fields when a source and destination decimal have the same precision and scale. However, you can specify the decimal_from_decimal conversion to force WebSphere DataStage to perform an explicit range and representation check. This conversion is useful when one decimal supports a representation of zeros in all its digits (normally illegal) and the other does not.

The following table lists the conversions involving decimal fields:
Conversion                  Conversion Specification
decimal from decimal        decimalField = decimal_from_decimal[r_type](decimalField)
decimal from dfloat         decimalField = decimal_from_dfloat[r_type](dfloatField)
decimal from string         decimalField = decimal_from_string[r_type](stringField)
decimal from ustring        decimalField = decimal_from_ustring[r_type](ustringField)
scaled decimal from int64   target_field:decimal[p,s] = scaled_decimal_from_int64[no_warn](int64field)
dfloat from decimal         dfloatField = dfloat_from_decimal[fix_zero](decimalField)
dfloat from decimal         dfloatField = mantissa_from_decimal(decimalField)
dfloat from dfloat          dfloatField = mantissa_from_dfloat(dfloatField)
int32 from decimal          int32Field = int32_from_decimal[r_type, fix_zero](decimalField)
int64 from decimal          int64Field = int64_from_decimal[r_type, fix_zero](decimalField)
string from decimal         stringField = string_from_decimal[fix_zero] [suppress_zero](decimalField)
ustring from decimal        ustringField = ustring_from_decimal[fix_zero] [suppress_zero](decimalField)
uint64 from decimal         uint64Field = uint64_from_decimal[r_type, fix_zero](decimalField)

A decimal conversion to or from a numeric field can be specified with any WebSphere DataStage numeric data type. WebSphere DataStage performs the necessary modification. For example, int32_from_decimal converts a decimal either to an int32 or to any numeric data type, such as int16 or uint32.

The scaled decimal from int64 conversion takes an integer field and converts the field to a decimal of the specified precision (p) and scale (s) by dividing the field by 10 raised to the power of the scale (10^2 in the following example). For example, the conversion:
Decfield:decimal[8,2]=scaled_decimal_from_int64(intfield)

where intfield = 12345678 would set the value of Decfield to 123456.78. The fix_zero specification causes a decimal field containing all zeros (normally illegal) to be treated as a valid zero. Omitting fix_zero causes WebSphere DataStage to issue a conversion error when it encounters a decimal field containing all zeros. Data Type Conversion Errors discusses conversion errors. The suppress_zero argument specifies that the returned string value will have no leading or trailing zeros. Examples: 000.100 -> 0.1; 001.000 -> 1; -001.100 -> -1.1
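
Both the scaling and the zero suppression can be reproduced with Python's decimal module. The sketch below is a model under stated assumptions: the two functions are hypothetical, the scaling ignores the precision check, and suppress_zero here only mimics the leading and trailing zero removal shown in the examples.

```python
from decimal import Decimal


def scaled_decimal_from_int64(value, scale):
    """Divide an integer by 10**scale exactly, as in the example above."""
    # scaleb shifts the decimal exponent, so no floating-point error occurs.
    return Decimal(value).scaleb(-scale)


def suppress_zero(text):
    """Return the decimal text without leading or trailing zeros."""
    # normalize() strips trailing zeros; the "f" format avoids exponent notation.
    return format(Decimal(text).normalize(), "f")


scaled = scaled_decimal_from_int64(12345678, 2)
```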

rounding type
You can optionally specify a value for the rounding type (r_type) of many conversions. The values of r_type are:
• ceil: Round the source field toward positive infinity. This mode corresponds to the IEEE 754 Round Up mode.
  Examples: 1.4 -> 2, -1.6 -> -1
• floor: Round the source field toward negative infinity. This mode corresponds to the IEEE 754 Round Down mode.
  Examples: 1.6 -> 1, -1.4 -> -2
• round_inf: Round or truncate the source field toward the nearest representable value, breaking ties by rounding positive values toward positive infinity and negative values toward negative infinity. This mode corresponds to the COBOL ROUNDED mode.
  Examples: 1.4 -> 1, 1.5 -> 2, -1.4 -> -1, -1.5 -> -2
• trunc_zero (default): Discard any fractional digits to the right of the right-most fractional digit supported in the destination, regardless of sign. For example, if the destination is an integer, all fractional digits are truncated. If the destination is another decimal with a smaller scale, round or truncate to the scale size of the destination decimal. This mode corresponds to the COBOL INTEGER-PART function.
  Examples: 1.6 -> 1, -1.6 -> -1

The diagram shows the conversion of a decimal field to a 32-bit integer with a rounding mode of ceil rather than the default mode of truncate to zero:

input data set schema:
    dField:decimal[6,2];

modify:
    field1 = int32_from_decimal[ceil](dField);

output data set schema:
    field1:int32;

The osh syntax for this conversion is:


field1 = int32_from_decimal[ceil,fix_zero] (dField);

where fix_zero ensures that a source decimal containing all zeros is treated as a valid representation.
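
The four r_type values behave like rounding modes in Python's decimal module, which makes the examples in the rounding type list easy to verify. The mapping below is an illustrative assumption (the names R_TYPES and int_from_decimal are hypothetical), not a statement about how WebSphere DataStage implements rounding.

```python
from decimal import Decimal, ROUND_CEILING, ROUND_DOWN, ROUND_FLOOR, ROUND_HALF_UP

# Assumed correspondence between r_type values and decimal rounding modes.
R_TYPES = {
    "ceil": ROUND_CEILING,       # toward positive infinity
    "floor": ROUND_FLOOR,        # toward negative infinity
    "round_inf": ROUND_HALF_UP,  # nearest value, ties away from zero
    "trunc_zero": ROUND_DOWN,    # discard fractional digits (toward zero)
}


def int_from_decimal(text, r_type="trunc_zero"):
    """Round a decimal, given as text, to an integer using the named r_type."""
    rounded = Decimal(text).quantize(Decimal("1"), rounding=R_TYPES[r_type])
    return int(rounded)
```

For example, int_from_decimal("1.4", "ceil") gives 2, whereas the default trunc_zero gives 1 for the same input.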

Raw field length extraction


Invoke the modify operator and the raw_length option to extract the length of a raw field. This specification returns an int32 containing the length of the raw field and optionally passes through the source field. The diagram shows how to find the length of aField using the modify operator and the raw_length option:


input data set schema:
    aField:raw[16];

modify:
    field1 = raw_length(aField); field2 = aField

output data set schema:
    field1:int32; field2:raw;

Use the following osh commands to specify the raw_length conversion of a field:
$ modifySpec="field1 = raw_length(aField); field2 = aField;"
$ osh " ... | modify '$modifySpec' |... "

Notice that a shell variable (modifySpec) has been defined containing the specifications passed to the operator.
Conversion Specification                 Description
rawField = raw_from_string(string)       Returns string in raw representation.
rawField = u_raw_from_string(ustring)    Returns ustring in raw representation.
int32Field = raw_length(raw)             Returns the length of the raw field.

String and ustring field conversions


Use the modify operator to perform the following modifications involving string and ustring fields:
• Extract the length of a string.
• Convert long strings to shorter strings by string extraction.
• Convert strings to and from numeric values using lookup tables (see String Conversions and Lookup Tables).


Conversion Specification
    Description

stringField = string_trim[character, direction, justify](string)
    You can use this function to remove the characters used to pad variable-length strings when they are converted to fixed-length strings of greater length. By default, these characters are retained when the fixed-length string is then converted back to a variable-length string.
    The character argument is the character to remove. It defaults to NULL. The value of the direction and justify arguments can be either begin or end; direction defaults to end, and justify defaults to begin. justify has no effect when the target string has variable length.
    Examples:
    name:string = string_trim[NULL, begin](name)
    removes all leading ASCII NULL characters from the beginning of name and places the remaining characters in an output variable-length string with the same name.
    hue:string[10] = string_trim[Z, end, begin](color)
    removes all trailing Z characters from color, and left justifies the resulting hue fixed-length string.

stringField = substring(string, starting_position, length)
ustringField = u_substring(ustring, starting_position, length)
    Copies parts of strings and ustrings to shorter strings by string extraction. The starting_position specifies the starting location of the substring; length specifies the substring length. The arguments starting_position and length are uint16 types and must be non-negative (>= 0).

stringField = lookup_string_from_int16[tableDefinition](int16Field)
ustringField = lookup_ustring_from_int16[tableDefinition](int16Field)
    Converts numeric values to strings and ustrings by means of a lookup table.

int16Field = lookup_int16_from_string[tableDefinition](stringField)
int16Field = lookup_int16_from_ustring[tableDefinition](ustringField)
uint32 = lookup_uint32_from_string[tableDefinition](stringField)
uint32 = lookup_uint32_from_ustring[tableDefinition](ustringField)
    Converts strings and ustrings to numeric values by means of a lookup table.

stringField = lookup_string_from_uint32[tableDefinition](uint32Field)
ustringField = lookup_ustring_from_uint32[tableDefinition](uint32Field)
    Converts numeric values to strings and ustrings by means of a lookup table.

stringField = string_from_ustring(ustring)
    Converts ustrings to strings.

ustringField = ustring_from_string(string)
    Converts strings to ustrings.

decimalField = decimal_from_string(stringField)
    Converts strings to decimals.

decimalField = decimal_from_ustring(ustringField)
    Converts ustrings to decimals.


Conversion Specification
    Description

stringField = string_from_decimal[fix_zero] [suppress_zero](decimalField)
    Converts decimals to strings. fix_zero causes a decimal field containing all zeros to be treated as a valid zero. suppress_zero specifies that the returned string value will have no leading or trailing zeros. Examples: 000.100 -> 0.1; 001.000 -> 1; -001.100 -> -1.1

ustringField = ustring_from_decimal[fix_zero] [suppress_zero](decimalField)
    Converts decimals to ustrings. See string_from_decimal above for a description of the fix_zero and suppress_zero arguments.

dateField = date_from_string[date_format | date_uformat](stringField)
dateField = date_from_ustring[date_format | date_uformat](ustringField)
    date from string or ustring
    Converts the string or ustring field to a date representation using the specified date_format or date_uformat. By default, the string format is yyyy-mm-dd. date_format and date_uformat are described in date Formats.

stringField = string_from_date[date_format | date_uformat](dateField)
ustringField = ustring_from_date[date_format | date_uformat](dateField)
    strings and ustrings from date
    Converts the date to a string or ustring representation using the specified date_format or date_uformat. By default, the string format is yyyy-mm-dd. date_format and date_uformat are described in date Formats.

int32Field = string_length(stringField)
int32Field = ustring_length(ustringField)
    Returns an int32 containing the length of a string or ustring.

stringField = substring[startPosition,len](stringField)
ustringField = substring[startPosition,len](ustringField)
    Converts long strings/ustrings to shorter strings/ustrings by string extraction. The startPosition specifies the starting location of the substring; len specifies the substring length. If startPosition is positive, it specifies the byte offset into the string from the beginning of the string. If startPosition is negative, it specifies the byte offset from the end of the string.

stringField = uppercase_string(stringField)
ustringField = uppercase_ustring(ustringField)
    Convert strings and ustrings to all uppercase. Non-alphabetic characters are ignored in the conversion.

stringField = lowercase_string(stringField)
ustringField = lowercase_ustring(ustringField)
    Convert strings and ustrings to all lowercase. Non-alphabetic characters are ignored in the conversion.

stringField = string_from_time[time_format | time_uformat](timeField)
ustringField = ustring_from_time[time_format | time_uformat](timeField)
    string and ustring from time
    Converts the time to a string or ustring representation using the specified time_format or time_uformat. The time_format options are described below.


Conversion Specification
    Description

stringField = string_from_timestamp[timestamp_format | timestamp_uformat](tsField)
ustringField = ustring_from_timestamp[timestamp_format | timestamp_uformat](tsField)
    strings and ustrings from timestamp
    Converts the timestamp to a string or ustring representation using the specified timestamp_format or timestamp_uformat. By default, the string format is %yyyy-%mm-%dd %hh:%nn:%ss. The timestamp_format and timestamp_uformat options are described in timestamp Formats.

tsField = timestamp_from_string[timestamp_format | timestamp_uformat](stringField)
tsField = timestamp_from_ustring[timestamp_format | timestamp_uformat](ustringField)
    timestamp from strings and ustrings
    Converts the string or ustring to a timestamp representation using the specified timestamp_format or timestamp_uformat. By default, the string format is yyyy-mm-dd hh:mm:ss. The timestamp_format and timestamp_uformat options are described in timestamp Formats.

timeField = time_from_string[time_format | time_uformat](stringField)
timeField = time_from_ustring[time_format | time_uformat](ustringField)
    time from string and ustring
    Converts the string or ustring to a time representation using the specified time_format or time_uformat. The time format options are described below.
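The suppress_zero examples listed for string_from_decimal in the table above (000.100 -> 0.1; 001.000 -> 1; -001.100 -> -1.1) can be reproduced with a small model. The following Python sketch is only an illustration of the documented examples; the function name is hypothetical, it is not part of osh, and the treatment of a negative zero is an assumption.

```python
def suppress_zero(decimal_text: str) -> str:
    """Drop leading and trailing zeros from a decimal written as text."""
    negative = decimal_text.startswith("-")
    digits = decimal_text.lstrip("+-")
    integer_part, _, fraction_part = digits.partition(".")
    # Keep a single zero in front of the decimal point.
    integer_part = integer_part.lstrip("0") or "0"
    fraction_part = fraction_part.rstrip("0")
    result = integer_part + ("." + fraction_part if fraction_part else "")
    # Assumption: a value of zero is written without a sign.
    if negative and result != "0":
        result = "-" + result
    return result


print(suppress_zero("000.100"))
print(suppress_zero("001.000"))
print(suppress_zero("-001.100"))
```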

The following osh command converts a string field to lowercase:


osh "... | modify 'lname=lowercase_string(lname)' | peek"

The diagram shows a modification that renames aField to field1 and produces field2 from bField by extracting the first eight bytes of bField:

input data set schema:
    aField:int8; bField:string[16];

modify:
    field1 = aField; field2 = substring[0,8](bField);

output data set schema:
    field1:int8; field2:string[8];

The following osh command performs the substring extraction:


modify field1 = aField; field2 = substring[0,8](bField);
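The offset rules for substring[startPosition,len] can be modelled outside osh. The following Python sketch is an illustration only: the function is hypothetical, and it assumes that a negative startPosition of -1 addresses the last byte, which the description above does not state explicitly.

```python
def substring(value: bytes, start_position: int, length: int) -> bytes:
    """Model of substring[startPosition,len] applied to a byte string."""
    if length < 0:
        raise ValueError("length must not be negative")
    if start_position < 0:
        # Assumption: -1 addresses the last byte, as in Python indexing.
        start_position = max(len(value) + start_position, 0)
    return value[start_position:start_position + length]


b_field = b"ABCDEFGHIJKLMNOP"          # a string[16] value
print(substring(b_field, 0, 8))        # first eight bytes, as in substring[0,8](bField)
print(substring(b_field, -4, 4))       # four bytes counted from the end
```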

The diagram shows the extraction of the string_length of aField. The length is included in the output as field1.


input data set schema:
    aField:string[16];

modify:
    field1 = string_length(aField); field2 = aField;

output data set schema:
    field1:int32; field2:string;

The following osh commands extract the length of the string in aField and place it in field1 of the output:
$ modifySpec="field1 = string_length(aField); field2 = aField;"
$ osh " ... | modify '$modifySpec' | ... "

Notice that a shell variable (modifySpec) has been defined containing the specifications passed to the operator.

String conversions and lookup tables


You can construct a string lookup table to use when default conversions do not yield satisfactory results. A string lookup table is a table of two columns and as many rows as are required to perform a conversion to or from a string as shown in the following table:
Numeric Value   String or Ustring
numVal1         string1 | ustring1
numVal2         string2 | ustring2
...             ...
numValn         stringn | ustringn

Each row of the lookup table specifies an association between a 16-bit integer or unsigned 32-bit integer value and a string or ustring. WebSphere DataStage scans the Numeric Value or the String or Ustring column until it encounters the value or string to be translated. The output is the corresponding entry in the row.

The numeric value to be converted might be of the int16 or the uint32 data type. WebSphere DataStage converts strings to values of the int16 or uint32 data type using the same table.

If the input contains a numeric value or string that is not listed in the table, WebSphere DataStage operates as follows:
- If a numeric value is unknown, an empty string is returned by default. However, you can set a default string value to be returned by the string lookup table.
- If a string has no corresponding value, 0 is returned by default. However, you can set a default numeric value to be returned by the string lookup table.

Here are the options and arguments passed to the modify operator to create a lookup table:

intField = lookup_int16_from_string[tableDefinition](source_stringField); |
intField = lookup_int16_from_ustring[tableDefinition](source_ustringField);

OR:
intField = lookup_uint32_from_string[tableDefinition](source_stringField); |
intField = lookup_uint32_from_ustring[tableDefinition](source_ustringField);

OR:
stringField = lookup_string_from_int16[tableDefinition](source_intField); |
ustringField = lookup_ustring_from_int16[tableDefinition](source_intField);

OR:
stringField = lookup_string_from_uint32[tableDefinition](source_intField); |
ustringField = lookup_ustring_from_uint32[tableDefinition](source_intField);

where: tableDefinition defines the rows of a string or ustring lookup table and has the following form:
{propertyList} (string | ustring = value; string | ustring= value; ... )

where:
- propertyList is one or more of the following options; the entire list is enclosed in braces, and properties are separated by commas if there is more than one:
    case_sensitive: perform a case-sensitive search for matching strings; the default is case-insensitive.
    default_value = defVal: the default numeric value returned for a string that does not match any of the strings in the table.
    default_string = defString: the default string returned for numeric values that do not match any numeric value in the table.
- string or ustring specifies a comma-separated list of strings or ustrings associated with value; enclose each string or ustring in quotes.
- value specifies a comma-separated list of 16-bit integer values associated with string or ustring.

The diagram shows an operator and data set requiring type conversion:


input data set schema:
    fName:string; lName:string; gender:string;

modify:
    gender = lookup_int16_from_string[tableDefinition](gender)

output data set schema:
    fName:string; lName:string; gender:int16;

sample

Whereas gender is defined as a string in the input data set, the SampleOperator defines the field as a 16-bit integer. The default conversion operation cannot work in this case, because by default WebSphere DataStage converts a string to a numeric representation and gender does not contain the character representation of a number. Instead the gender field contains the string values male, female, m, or f. You must therefore specify a string lookup table to perform the modification.

The gender lookup table required by the example shown above is shown in the following table:

Numeric Value   String
0               f
0               female
1               m
1               male

The value f or female is translated to a numeric value of 0; the value m or male is translated to a numeric value of 1. The following osh code performs the conversion:
modify gender = lookup_int16_from_string[{default_value = 2} ("f" = 0; "female" = 0; "m" = 1; "male" = 1;)] (gender);

In this example, gender is the name of both the source and the destination fields of the translation. In addition, the string lookup table defines a default value of 2; if gender contains a string that is not one of f, female, m, or male, the lookup table returns a value of 2.
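The scanning and default rules of a string lookup table can be illustrated with a short model. The following Python sketch is not osh and does not use any DataStage API; the class and method names are hypothetical. It assumes only what is described above: rows are scanned in order, string matching is case-insensitive unless case_sensitive is set, and default_value or default_string is returned when nothing matches.

```python
class StringLookupTable:
    """Illustrative model of a modify-operator string lookup table."""

    def __init__(self, rows, case_sensitive=False, default_value=0, default_string=""):
        self.rows = list(rows)              # (string, numeric value) pairs in table order
        self.case_sensitive = case_sensitive
        self.default_value = default_value
        self.default_string = default_string

    def _key(self, text):
        return text if self.case_sensitive else text.lower()

    def to_number(self, text):
        """Model of lookup_int16_from_string: scan the string column."""
        wanted = self._key(text)
        for string, value in self.rows:
            if self._key(string) == wanted:
                return value
        return self.default_value

    def to_string(self, number):
        """Model of lookup_string_from_int16: scan the numeric column."""
        for string, value in self.rows:
            if value == number:
                return string
        return self.default_string


gender = StringLookupTable(
    [("f", 0), ("female", 0), ("m", 1), ("male", 1)],
    default_value=2,
)
print(gender.to_number("female"))   # listed string
print(gender.to_number("unknown"))  # unlisted string falls back to default_value
```

In this model a reverse lookup returns the first row that carries the requested number, which follows from the scan order described above.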


Time field conversions


WebSphere DataStage performs no automatic conversions to or from the time data type. You must invoke the modify operator if you want to convert a source or destination time field. Most time field conversions extract a portion of the time, such as hours or minutes, and write it into a destination field.
Conversion Specification
    Description

int8Field = hours_from_time(timeField)
    hours from time

int32Field = microseconds_from_time(timeField)
    microseconds from time

int8Field = minutes_from_time(timeField)
    minutes from time

dfloatField = seconds_from_time(timeField)
    seconds from time

dfloatField = midnight_seconds_from_time(timeField)
    seconds-from-midnight from time

stringField = string_from_time[time_format | time_uformat](timeField)
ustringField = ustring_from_time[time_format | time_uformat](timeField)
    string and ustring from time
    Converts the time to a string or ustring representation using the specified time_format or time_uformat. The time formats are described below.

timeField = time_from_midnight_seconds(dfloatField)
    time from seconds-from-midnight

timeField = time_from_string[time_format | time_uformat](stringField)
timeField = time_from_ustring[time_format | time_uformat](ustringField)
    time from string
    Converts the string or ustring to a time representation using the specified time_format or time_uformat. The time format options are described below.

timeField = time_from_timestamp(tsField)
    time from timestamp

tsField = timestamp_from_time[date](timeField)
    timestamp from time
    The date argument is required. It specifies the date portion of the timestamp and must be in the form yyyy-mm-dd.

Time conversion to a numeric field can be used with any WebSphere DataStage numeric data type. WebSphere DataStage performs the necessary modifications to translate a conversion result to the numeric data type of the destination. For example, you can use the conversion hours_from_time to convert a time to an int8, or to an int16, int32, dfloat, and so on.

time Formats
Four conversions, string_from_time, ustring_from_time, time_from_string, and time_from_ustring, take a time format or a time uformat as a parameter of the conversion. These formats are described below. The default format of the time contained in the string is hh:mm:ss.

time Uformat
The time uformat provides support for international components in time fields. Its syntax is:
String%macroString%macroString%macroString

where %macro is a time formatting macro such as %hh for a two-digit hour. See time Format below for a description of the time format macros. Only the String components of time uformat can include multi-byte Unicode characters.


time Format
The string_from_time and time_from_string conversions take a format as a parameter of the conversion. The default format of the time in the string is hh:mm:ss. However, you can specify an optional format string defining the time format of the string field. The format string must contain a specification for hours, minutes, and seconds. The possible components of the time_format string are given in the following table:
Table 35. Time format tags

Tag       Variable width availability   Description                      Value range   Options
%h        import                        Hour (24), variable width        0...23        s
%hh       -                             Hour (24), fixed width           0...23        s
%H        import                        Hour (12), variable width        1...12        s
%HH       -                             Hour (12), fixed width           01...12       s
%n        import                        Minutes, variable width          0...59        s
%nn       -                             Minutes, fixed width             0...59        s
%s        import                        Seconds, variable width          0...59        s
%ss       -                             Seconds, fixed width             0...59        s
%s.N      import                        Seconds + fraction (N = 0...6)   -             s, c, C
%ss.N     -                             Seconds + fraction (N = 0...6)   -             s, c, C
%SSS      with v option                 Milliseconds                     0...999       s, v
%SSSSSS   with v option                 Microseconds                     0...999999    s, v
%aa       German                        am/pm marker, locale specific    am, pm        u, w

By default, the format of the time contained in the string is %hh:%nn:%ss. However, you can specify a format string defining the format of the string field.

You must prefix each component of the format string with the percent symbol. Separate the string's components with any character except the percent sign (%).

Where indicated, the tags can represent variable-width fields on import, export, or both. Variable-width elements can omit leading zeroes without causing errors.

The following options can be used in the format string where indicated:

s
    Specify this option to allow leading spaces in time formats. The s option is specified in the form:
    %(tag,s)
    Where tag is the format string. For example:
    %(n,s)


    indicates a minute field in which values can contain leading spaces or zeroes and be one or two characters wide. If you specified the following time format property:
    %(h,s):%(n,s):%(s,s)
    Then the following times would all be valid:
    20: 6:58
    20:06:58
    20:6:58

v
    Use this option in conjunction with the %SSS or %SSSSSS tags to represent milliseconds or microseconds in variable-width format. So the time property:
    %(SSS,v)
    represents values in the range 0 to 999. (If you omit the v option then the range of values would be 000 to 999.)

u
    Use this option to render the am/pm text in uppercase on output.

w
    Use this option to render the am/pm text in lowercase on output.

c
    Specify this option to use a comma as the decimal separator in the %ss.N tag.

C
    Specify this option to use a period as the decimal separator in the %ss.N tag.

The c and C options override the default setting of the locale. The locale for determining the setting of the am/pm string and the default decimal separator can be controlled through the locale tag. This has the format:
%(L,locale)

Where locale specifies the locale to be set using the language_COUNTRY.variant naming convention supported by ICU. See NLS Guide for a list of locales. The default locale for the am/pm string and separator markers is English unless overridden by a %L tag or the APT_IMPEXP_LOCALE environment variable (the tag takes precedence over the environment variable if both are set).

Use the locale tag in conjunction with your time format, for example:
%L(es)%HH:%nn %aa
Specifies the Spanish locale.

The format string is subject to the restrictions laid out in the following table. A format string can contain at most one tag from each row. In addition, some rows are mutually incompatible, as indicated in the incompatible with column. When some tags are used, the format string requires that other tags are present too, as indicated in the requires column.

Table 36. Format tag restrictions

Element                Numeric format tags           Text format tags   Requires     Incompatible with
hour                   %hh, %h, %HH, %H              -                  -            -
am/pm marker           -                             %aa                hour (%HH)   hour (%hh)
minute                 %nn, %n                       -                  -            -
second                 %ss, %s                       -                  -            -
fraction of a second   %ss.N, %s.N, %SSS, %SSSSSS    -                  -            -


You can include literal text in your time format. Any Unicode character other than null, backslash, or the percent sign can be used (although it is better to avoid control codes and other non-graphic characters). The following table lists special tags and escape sequences:

Tag   Escape sequence
%%    literal percent sign
\%    literal percent sign
\n    newline
\t    horizontal tab
\\    single backslash
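The effect of the s option on the format %(h,s):%(n,s):%(s,s), described earlier in this section, can be modelled with a short parser. The following Python sketch is an illustration only, not the DataStage import logic; the function name and the regular expression are assumptions chosen to accept fields that are one or two characters wide with an optional leading space or zero.

```python
import re
from datetime import time

# One field: an optional leading space or digit followed by a digit,
# so "6", " 6", "06", and "20" are all accepted.
_FIELD = r"([ \d]?\d)"
_PATTERN = re.compile(rf"{_FIELD}:{_FIELD}:{_FIELD}")


def parse_time_with_spaces(text: str) -> time:
    """Parse hours:minutes:seconds where each field may have a leading space."""
    match = _PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid time: {text!r}")
    hours, minutes, seconds = (int(group) for group in match.groups())
    return time(hours, minutes, seconds)


for sample in ("20: 6:58", "20:06:58", "20:6:58"):
    print(parse_time_with_spaces(sample))
```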

Converting Time Fields to Integers Example


The following figure shows the conversion of a time field to two 8-bit integers, where:
- The hours_from_time conversion specification extracts the hours portion of a time field and writes it to an 8-bit integer.
- The minutes_from_time conversion specification extracts the minutes portion of a time field and writes it to an 8-bit integer.

input data set schema:
    tField:time;

modify:
    hoursField = hours_from_time(tField); minField = minutes_from_time(tField);

output data set schema:
    hoursField:int8; minField:int8;

The following osh code converts the hours portion of tField to the int8 hoursField and the minutes portion to the int8 minField:
modify hoursField = hours_from_time(tField); minField = minutes_from_time(tField);
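To make the extracted values concrete, the following Python sketch models a few of the time conversions with the standard datetime module. It is an illustration only: the functions share names with the osh conversions for readability but are ordinary Python functions, and the assumption that seconds-from-midnight includes the fractional part is inferred from the dfloat result type.

```python
from datetime import time


def hours_from_time(value: time) -> int:
    return value.hour


def minutes_from_time(value: time) -> int:
    return value.minute


def midnight_seconds_from_time(value: time) -> float:
    # Assumption: the fractional second is included, since the result is a dfloat.
    return (value.hour * 3600 + value.minute * 60 + value.second
            + value.microsecond / 1_000_000)


t_field = time(20, 6, 58)
record = {
    "hoursField": hours_from_time(t_field),
    "minField": minutes_from_time(t_field),
}
print(record)
print(midnight_seconds_from_time(t_field))
```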

Timestamp field conversions


By default WebSphere DataStage converts a source timestamp field only to either a time or date destination field. However, you can invoke the modify operator to perform other conversions.
Conversion Specification
    Description

dfloatField = seconds_since_from_timestamp[timestamp](tsField)
    seconds_since from timestamp


Conversion Specification
    Description

tsField = timestamp_from_seconds_since[timestamp](dfloatField)
    timestamp from seconds_since

stringField = string_from_timestamp[timestamp_format | timestamp_uformat](tsField)
ustringField = ustring_from_timestamp[timestamp_format | timestamp_uformat](tsField)
    strings and ustrings from timestamp
    Converts the timestamp to a string or ustring representation using the specified timestamp_format or timestamp_uformat. By default, the string format is %yyyy-%mm-%dd %hh:%nn:%ss. The timestamp_format and timestamp_uformat options are described in timestamp Formats.

int32Field = timet_from_timestamp(tsField)
    timet_from_timestamp
    int32Field contains a timestamp as defined by the UNIX time_t representation.

dateField = date_from_timestamp(tsField)
    date from timestamp
    Converts the timestamp to a date representation.

tsField = timestamp_from_string[timestamp_format | timestamp_uformat](stringField)
tsField = timestamp_from_ustring[timestamp_format | timestamp_uformat](ustringField)
    timestamp from strings and ustrings
    Converts the string or ustring to a timestamp representation using the specified timestamp_format or timestamp_uformat. By default, the string format is yyyy-mm-dd hh:mm:ss. The timestamp_format and timestamp_uformat options are described in timestamp Formats.

tsField = timestamp_from_timet(int32Field)
    timestamp from time_t
    int32Field must contain a timestamp as defined by the UNIX time_t representation.

tsField = timestamp_from_date[time](dateField)
    timestamp from date
    The time argument optionally specifies the time to be used in building the timestamp result and must be in the form hh:mm:ss. If omitted, the time defaults to midnight.

tsField = timestamp_from_time[date](timeField)
    timestamp from time
    The date argument is required. It specifies the date portion of the timestamp and must be in the form yyyy-mm-dd.

tsField = timestamp_from_date_time(date, time)
    Returns a timestamp from date and time. The date specifies the date portion (yyyy-mm-dd) of the timestamp. The time argument specifies the time to be used when building the timestamp. The time argument must be in the hh:nn:ss format.

timeField = time_from_timestamp(tsField)
    time from timestamp

Timestamp conversion to a numeric field can be used with any WebSphere DataStage numeric data type. WebSphere DataStage performs the necessary conversions to translate a conversion result to the numeric data type of the destination. For example, you can use the conversion timet_from_timestamp to convert a timestamp to an int32, dfloat, and so on.


timestamp formats
The string_from_timestamp, ustring_from_timestamp, timestamp_from_string, and timestamp_from_ustring conversions take a timestamp format or timestamp uformat argument. The default format of the timestamp contained in the string is yyyy-mm-dd hh:mm:ss. However, you can specify an optional format string defining the data format of the string field.

timestamp format
The format options of timestamp combine the formats of the date and time data types. The default timestamp format is as follows:
%yyyy-%mm-%dd %hh:%nn:%ss

timestamp uformat
For timestamp uformat, concatenate the date uformat with the time uformat. The two formats can be in any order, and their components can be mixed. These formats are described in date Uformat under Date Field Conversions and time Uformat under Time Field Conversions.

The following diagram shows the conversion of a date field to a timestamp field. As part of the conversion, the operator sets the time portion of the timestamp to 10:00:00.

input data set schema:
    dField:date;

modify:
    tsField = timestamp_from_date[10:00:00](dField);

output data set schema:
    tsField:timestamp;

To specify the timestamp_from_date conversion and set the time to 10:00:00, use the following osh command:
modify tsField=timestamp_from_date[10:00:00](dField);
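The way timestamp_from_date and timestamp_from_time combine their argument with the source field can be pictured with Python's datetime module. This sketch is only a model: the functions are hypothetical Python stand-ins for the osh conversions, and the sample date is invented for illustration. The midnight default follows the description of the time argument above.

```python
from datetime import date, datetime, time


def timestamp_from_date(date_field: date, time_arg: time = time(0, 0, 0)) -> datetime:
    """The time argument is optional and defaults to midnight."""
    return datetime.combine(date_field, time_arg)


def timestamp_from_time(time_field: time, date_arg: date) -> datetime:
    """The date argument is required."""
    return datetime.combine(date_arg, time_field)


d_field = date(2008, 1, 15)   # hypothetical sample value
print(timestamp_from_date(d_field, time(10, 0, 0)))
print(timestamp_from_date(d_field))
```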

The modify operator and nulls


All WebSphere DataStage data types support nulls. As part of processing a record, an operator can detect a null and take the appropriate action; for example, it can omit the null field from a calculation or signal an error condition.

WebSphere DataStage represents nulls in two ways:
- It allocates a single bit to mark a field as null. This type of representation is called an out-of-band null.
- It designates a specific field value to indicate a null, for example a numeric field's most negative possible value. This type of representation is called an in-band null. In-band null representation can be disadvantageous because you must reserve a field value for nulls and this value cannot be treated as valid data elsewhere.

The modify operator can change a null representation from an out-of-band null to an in-band null and from an in-band null to an out-of-band null.

The record schema of an operator's input or output data set can contain fields defined to support out-of-band nulls. In addition, fields of an operator's interface might also be defined to support out-of-band nulls. The next table lists the rules for handling nullable fields when an operator takes a data set as input or writes to a data set as output.

Source Field   Destination Field   Result
not_nullable   not_nullable        Source value propagates to destination.
not_nullable   nullable            Source value propagates; destination value is never null.
nullable       not_nullable        If the source value is not null, the source value propagates. If the source value is null, a fatal error occurs, unless you apply the modify operator, as in Out-of-Band to Normal Representation.
nullable       nullable            Source value or null propagates.
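The four rows of the table can be read as a single rule. The following Python sketch is a model only, with Python's None standing in for an out-of-band null; the function name and the exception types are assumptions made for illustration and do not correspond to DataStage error classes.

```python
def propagate(value, source_nullable: bool, destination_nullable: bool):
    """Model of how a field value moves from a source to a destination field."""
    if value is None and not source_nullable:
        raise ValueError("a not_nullable source field cannot hold a null")
    if value is None and not destination_nullable:
        # nullable source, not_nullable destination, null value: fatal error
        raise RuntimeError("fatal error: null written to a not_nullable destination")
    return value


print(propagate(5, source_nullable=False, destination_nullable=True))
print(propagate(None, source_nullable=True, destination_nullable=True))
```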

Out-of-band to normal representation


The modify operator can change a field's null representation from a single bit to a value you choose, that is, from an out-of-band to an in-band representation. Use this feature to prevent fatal data type conversion errors that occur when a destination field has not been defined as supporting nulls. See Data Type Conversion Errors.

To change a field's null representation from a single bit to a value you choose, use the following osh syntax:
destField[: dataType ] = handle_null ( sourceField , value )

where:
- destField is the destination field's name.
- dataType is its optional data type; use it if you are also converting types.
- sourceField is the source field's name.
- value is the value you wish to represent a null in the output.

The destField is converted from a WebSphere DataStage out-of-band null to a value of the field's data type. For a numeric field, value can be a numeric value; for decimal, string, time, date, and timestamp fields, value can be a string. Conversion specifications are described in:
- Date Field Conversions
- Decimal Field Conversions
- String and Ustring Field Conversions
- Time Field Conversions
- Timestamp Field Conversions

For example, the diagram shows the modify operator converting the WebSphere DataStage out-of-band null representation in the input to an output value that is written when a null is encountered:


input data set schema:
    aField:nullable int8; bField:nullable string[4];

modify:
    aField = handle_null(aField, -128); bField = handle_null(bField, XXXX);

output data set schema:
    aField:int8; bField:string[4];

While in the input fields a null takes the WebSphere DataStage out-of-band representation, in the output a null in aField is represented by -128 and a null in bField is represented by ASCII XXXX (0x58 in all bytes). To make the output aField contain a value of -128 whenever the input contains an out-of-band null, and the output bField contain a value of XXXX whenever the input contains an out-of-band null, use the following osh code:
$ modifySpec="aField = handle_null(aField, -128); bField = handle_null(bField, XXXX);"
$ osh " ... | modify '$modifySpec' | ... "

Notice that a shell variable (modifySpec) has been defined containing the specifications passed to the operator.
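The effect of handle_null on a record can be sketched in a few lines. In the following Python model, None stands in for the out-of-band null; the function is a hypothetical stand-in for the osh conversion, not a DataStage API, and the sample records are invented for illustration.

```python
def handle_null(source_value, value):
    """Replace an out-of-band null (None in this model) with an in-band value."""
    return value if source_value is None else source_value


def convert(record: dict) -> dict:
    return {
        "aField": handle_null(record["aField"], -128),
        "bField": handle_null(record["bField"], "XXXX"),
    }


print(convert({"aField": None, "bField": None}))
print(convert({"aField": 12, "bField": "abcd"}))
```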

Normal to out-of-band representation


The modify operator can change a field's null representation from a normal field value to a single bit, that is, from an in-band to an out-of-band representation. To change a field's null representation to out-of-band, use the following osh syntax:
destField [: dataType ] = make_null(sourceField , value );

Where:
- destField is the destination field's name.
- dataType is its optional data type; use it if you are also converting types.
- sourceField is the source field's name.
- value is the value of the source field when it is null.

A source field that contains value is converted to a WebSphere DataStage out-of-band null in the destination field. For a numeric field, value can be a numeric value; for decimal, string, time, date, and timestamp fields, value can be a string.

For example, the diagram shows a modify operator converting the value representing a null in an input field (-128 or XXXX) to the WebSphere DataStage single-bit null representation in the corresponding field of the output data set:


input data set schema:
    aField:nullable int8; bField:nullable string[4];

modify:
    aField = make_null(aField, -128); bField = make_null(bField, XXXX);

output data set schema:
    aField:nullable int8; bField:nullable string[4];

In the input a null value in aField is represented by -128 and a null value in bField is represented by ASCII XXXX, but in both output fields a null value is represented by WebSphere DataStage's single bit. The following osh syntax causes the aField of the output data set to be set to the WebSphere DataStage single-bit null representation if the corresponding input field contains -128 (in-band null), and the bField of the output to be set to WebSphere DataStage's single-bit null representation if the corresponding input field contains XXXX (in-band null).
$ modifySpec="aField = make_null(aField, -128); bField = make_null(bField, XXXX);"
$ osh " ... | modify '$modifySpec' | ... "

Notice that a shell variable (modifySpec) has been defined containing the specifications passed to the operator.
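The reverse direction can be modelled the same way. In this Python sketch, None again stands in for the out-of-band null and the function is a hypothetical stand-in for the osh conversion. The last line illustrates the drawback of in-band nulls noted earlier: a genuine data value equal to the reserved value cannot be told apart from a null.

```python
def make_null(source_value, value):
    """Turn the reserved in-band value into an out-of-band null (None here)."""
    return None if source_value == value else source_value


def convert(record: dict) -> dict:
    return {
        "aField": make_null(record["aField"], -128),
        "bField": make_null(record["bField"], "XXXX"),
    }


print(convert({"aField": -128, "bField": "XXXX"}))
print(convert({"aField": 12, "bField": "abcd"}))
# A legitimate reading of -128 is also reported as null:
print(make_null(-128, -128) is None)
```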

The null and notnull conversions


WebSphere DataStage supplies two other conversions to use with nullable fields, called null and notnull.
- The null conversion sets the destination field to 1 if the source field is null and to 0 otherwise.
- The notnull conversion sets the destination field to 1 if the source field is not null and to 0 if it is null.

In osh, define a null or notnull conversion as follows:
destField [:dataType] = null( sourceField );
destField [:dataType] = notnull( sourceField );

By default, the data type of the destination field is int8. Specify a different destination data type to override this default. WebSphere DataStage issues a warning if the source field is not nullable or the destination field is nullable.
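As a quick illustration of the two flags, the following Python sketch models null and notnull with None standing in for a null source field. The function names are hypothetical and chosen only to avoid confusion with Python's own None; they are not osh identifiers.

```python
def null_flag(source_value) -> int:
    """Model of the null conversion: 1 if the source is null, 0 otherwise."""
    return 1 if source_value is None else 0


def notnull_flag(source_value) -> int:
    """Model of the notnull conversion: 1 if the source is not null, 0 if it is."""
    return 0 if source_value is None else 1


for sample in (None, 0, "abcd"):
    print(repr(sample), null_flag(sample), notnull_flag(sample))
```

Note that a source value of 0 is not null in this model, so the two flags always sum to 1.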

The modify operator and partial schemas


You can invoke a modify operator to change certain characteristics of a data set containing a partial record schema. (Complete and Partial Schemas discusses partial schemas and their definition.) When the modify operator drops a field from the intact portion of the record, it drops only the field definition. The contents of the intact record are not altered. Dropping the definition means you can no longer access that portion of the intact record.


The modify operator and vectors


The modify operator cannot change the length of a vector or the vector's length type (fixed or variable). However, you can use the operator either to translate the name of a vector or to convert the data type of the vector elements.

The modify operator and aggregate schema components


Data set and operator interface schema components can contain aggregates (subrecords and tagged aggregates). You can apply modify adapters to aggregates, with these restrictions:
- Subrecords might be translated only to subrecords.
- Tagged fields might be translated only to tagged fields.
- Within subrecords and tagged aggregates, only elements of the same level can be bound by the operator.

The diagram shows an operation in which both the input data set and the output contain aggregates:

input data set schema:
    fName:string;
    lName:string;
    purchase:subrec(
        itemNum:int32;
        price:sfloat;)
    date:tagged(
        intDate:int32;
        stringDate:string;)

modify:
    subField.subF1 = purchase.itemNum
    subField.subF2 = purchase.price
    tagField.tagF1 = date.intDate
    tagField.tagF2 = date.stringDate

output data set schema:
    fName:string;
    lName:string;
    subField:subrec(
        subF1:int32;
        subF2:sfloat;)
    tagField:tagged(
        tagF1:int32;
        tagF2:string;)

In this example, purchase contains an item number and a price for a purchased item; date contains the date of purchase represented as either an integer or a string. You must translate the aggregate purchase to the interface component subField and the tagged component date to tagField.
To translate aggregates:
1. Translate the aggregate of an input data set to an aggregate of the output. To translate purchase, the corresponding output component must be a compatible aggregate type. The type is subrecord and the component is subField. The same principle applies to the elements of the subrecord.
2. Translate the individual fields of the data set's aggregate to the individual fields of the operator's aggregate.


If multiple elements of a tagged aggregate in the input are translated, they must all be bound to members of a single tagged component of the output's record schema. That is, all elements of tagField must be bound to a single aggregate in the input. Here is the osh code to rename purchase.price to subField.subF2.
$ modifySpec="subField = purchase;
  subField.subF1 = purchase.itemNum;
  subField.subF2 = purchase.price;
  tagField = date;
  tagField.tagF1 = date.intDate;
  tagField.tagF2 = date.stringDate;"
$ osh " ... | modify $modifySpec | ... "
In this specification, subField = purchase translates the aggregate, and the two assignments that follow it translate its fields.

Notice that a shell variable (modifySpec) has been defined containing the specifications passed to the operator. Aggregates might contain nested aggregates. When you translate nested aggregates, all components at one nesting level in an input aggregate must be bound to all components at one level in an output aggregate. The table shows sample input and output data sets containing nested aggregates. In the input data set, the record purchase contains a subrecord description containing a description of the item:
Level   Schema 1 (for input data set)     Level   Schema 2 (for output data set)
0       purchase: subrec (                        ...
1         itemNum: int32;                 n       subField (
1         price: sfloat;                  n+1       subF1;
1         description: subrec (           n+1       subF2;
2           color: int32;                         );
2           size: int8;                           ...
          );
        );

Note that:
- itemNum and price are at the same nesting level in purchase.
- color and size are at the same nesting level in purchase.
- subF1 and subF2 are at the same nesting level in subField.
You can bind:
- purchase.itemNum and purchase.price (both level 1) to subField.subF1 and subField.subF2, respectively
- purchase.description.color and purchase.description.size (both level 2) to subField.subF1 and subField.subF2, respectively
You cannot bind two elements of purchase at different nesting levels to subF1 and subF2. Therefore, you cannot bind itemNum (level 1) to subF1 and size (level 2) to subF2.
Note: WebSphere DataStage features several operators that modify the record schema of the input data set and the level of fields within records. Two of them act on tagged subrecords. See the topic on the restructure operators.
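The same-level rule for nested aggregates can be expressed as a small check. The following Python sketch is a simplified model, not part of osh: it assumes that field paths are written with dots, that the nesting level equals the number of dots in the path, and the helper names nesting_level and can_bind are hypothetical.

```python
from typing import Sequence


def nesting_level(path: str) -> int:
    """Level of a dotted field path: purchase is 0, purchase.itemNum is 1, and so on."""
    return path.count(".")


def can_bind(source_paths: Sequence[str]) -> bool:
    """Return True when every source field sits at the same nesting level."""
    levels = {nesting_level(p) for p in source_paths}
    return len(levels) <= 1
```

With this model, purchase.itemNum and purchase.price can be bound together, as can purchase.description.color and purchase.description.size, while a mix of purchase.itemNum and purchase.description.size is rejected.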


Allowed conversions
The table lists all allowed data type conversions arranged alphabetically. The form of each listing is:
conversion_name ( source_type , destination_type )

Conversion Specification
date_from_days_since ( int32 , date )
date_from_julian_day ( int32 , date )
date_from_string ( string , date )
date_from_timestamp ( timestamp , date )
date_from_ustring ( ustring , date )
days_since_from_date ( date , int32 )
decimal_from_decimal ( decimal , decimal )
decimal_from_dfloat ( dfloat , decimal )
decimal_from_string ( string , decimal )
decimal_from_ustring ( ustring , decimal )
dfloat_from_decimal ( decimal , dfloat )
hours_from_time ( time , int8 )
int32_from_decimal ( decimal , int32 )
int64_from_decimal ( decimal , int64 )
julian_day_from_date ( date , uint32 )
lookup_int16_from_string ( string , int16 )
lookup_int16_from_ustring ( ustring , int16 )
lookup_string_from_int16 ( int16 , string )
lookup_string_from_uint32 ( uint32 , string )
lookup_uint32_from_string ( string , uint32 )
lookup_uint32_from_ustring ( ustring , uint32 )
lookup_ustring_from_int16 ( int16 , ustring )
lookup_ustring_from_int32 ( uint32 , ustring )
lowercase_string ( string , string )
lowercase_ustring ( ustring , ustring )
mantissa_from_dfloat ( dfloat , dfloat )
mantissa_from_decimal ( decimal , dfloat )
microseconds_from_time ( time , int32 )
midnight_seconds_from_time ( time , dfloat )
minutes_from_time ( time , int8 )
month_day_from_date ( date , int8 )
month_from_date ( date , int8 )
next_weekday_from_date ( date , date )
notnull ( any , int8 )
null ( any , int8 )
previous_weekday_from_date ( date , date )
raw_from_string ( string , raw )
raw_length ( raw , int32 )
seconds_from_time ( time , dfloat )
seconds_since_from_timestamp ( timestamp , dfloat )
string_from_date ( date , string )
string_from_decimal ( decimal , string )
string_from_time ( time , string )
string_from_timestamp ( timestamp , string )
string_from_ustring ( ustring , string )
string_length ( string , int32 )
substring ( string , string )
time_from_midnight_seconds ( dfloat , time )
time_from_string ( string , time )
time_from_timestamp ( timestamp , time )
time_from_ustring ( ustring , time )
timestamp_from_date ( date , timestamp )
timestamp_from_seconds_since ( dfloat , timestamp )
timestamp_from_string ( string , timestamp )
timestamp_from_time ( time , timestamp )
timestamp_from_timet ( int32 , timestamp )
timestamp_from_ustring ( ustring , timestamp )
timet_from_timestamp ( timestamp , int32 )
uint64_from_decimal ( decimal , uint64 )
uppercase_string ( string , string )
uppercase_ustring ( ustring , ustring )
u_raw_from_string ( ustring , raw )
ustring_from_date ( date , ustring )
ustring_from_decimal ( decimal , ustring )
ustring_from_string ( string , ustring )
ustring_from_time ( time , ustring )
ustring_from_timestamp ( timestamp , ustring )
ustring_length ( ustring , int32 )
u_substring ( ustring , ustring )
weekday_from_date ( date , int8 )
year_day_from_date ( date , int16 )
year_from_date ( date , int16 )
year_week_from_date ( date , int8 )


pcompress operator
The pcompress operator uses the UNIX compress utility to compress or expand a data set. The operator converts a WebSphere DataStage data set from a sequence of records into a stream of raw binary data; conversely, the operator reconverts the data stream into a WebSphere DataStage data set.
Data flow diagram

compress mode: input data set -> (in:*;) pcompress -> compressed data set
expand mode: compressed data set -> pcompress (out:*;) -> output data set


The mode of the pcompress operator determines its action. Possible values for the mode are:
- compress: compress the input data set
- expand: expand the input data set

pcompress: properties
Table 37. pcompress operator
Property                                        Value
Number of input data sets                       1
Number of output data sets                      1
Input interface schema                          mode = compress: in:*;
Output interface schema                         mode = expand: out:*;
Transfer behavior                               in -> out without record modification for a compress/decompress cycle
Execution mode                                  parallel (default) or sequential
Partitioning method                             mode = compress: any; mode = expand: same
Collection method                               any
Preserve-partitioning flag in output data set   mode = compress: sets; mode = expand: propagates
Composite operator                              yes:APT_EncodeOperator

Pcompress: syntax and options


pcompress [-compress | -expand] [-command compress | gzip]

Table 38. Pcompress options

Option: -compress
Use: -compress
This option is the default mode of the operator. The operator takes a data set as input and produces a compressed version as output.

Option: -expand
Use: -expand
This option puts the operator in expand mode. The operator takes a compressed data set as input and produces an uncompressed data set as output.

Option: -command
Use: -command compress | gzip
Optionally specifies the UNIX command to be used to perform the compression or expansion. When you specify compress, the operator uses the UNIX command, compress -f, for compression and the UNIX command, uncompress, for expansion. When you specify gzip, the operator uses the UNIX command, gzip -, for compression and the UNIX command, gzip -d -, for expansion.

The default mode of the operator is -compress, which takes a data set as input and produces a compressed version as output. Specifying -expand puts the operator in expand mode, which takes a compressed data set as input and produces an uncompressed data set as output.

Compressed data sets


Each record of a WebSphere DataStage data set has defined boundaries that mark its beginning and end. The pcompress operator invokes the UNIX compress utility to change a WebSphere DataStage data set, which is in record format, into raw binary data and vice versa.

Processing compressed data sets


A compressed data set is similar to a WebSphere DataStage data set. A compressed, persistent data set is represented on disk in the same way as a normal data set, by two or more files: a single descriptor file and one or more data files.
A compressed data set cannot be accessed like a standard WebSphere DataStage data set. A compressed data set cannot be processed by most WebSphere DataStage operators until it is decoded, that is, until its records are returned to their normal WebSphere DataStage format. Nonetheless, you can specify a compressed data set to any operator that does not perform field-based processing or reorder the records. For example, you can invoke the copy operator to create a copy of the compressed data set.
You can further encode a compressed data set, using an encoding operator, to create a compressed-encoded data set. (See Encode Operator.) You would then restore the data set by first decoding and then decompressing it.

Compressed data sets and partitioning


When you compress a data set, you remove its normal record boundaries. The compressed data set must not be repartitioned before it is expanded, because partitioning in WebSphere DataStage is performed record-by-record. For that reason, the pcompress operator sets the preserve-partitioning flag in the output data set. This prevents a WebSphere DataStage operator that uses a partitioning method of any from repartitioning the data set to optimize performance and causes WebSphere DataStage to issue a warning if any operator attempts to repartition the data set.


For an expand operation, the operator takes as input a previously compressed data set. If the preserve-partitioning flag in this data set is not set, WebSphere DataStage issues a warning message.
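The loss of record boundaries can be illustrated outside WebSphere DataStage. The sketch below is only an analogy under stated assumptions: it uses Python's standard gzip module rather than the pcompress operator, and it models records as newline-terminated byte strings, which is not the data set's real on-disk format.

```python
import gzip

# Two hypothetical records, each with a clear boundary (one per line).
records = [b"Mary Smith|33", b"John Doe|34"]
stream = b"\n".join(records) + b"\n"

# After compression the data is one opaque binary stream; there is no
# way to pick out an individual record without expanding it first.
compressed = gzip.compress(stream)

# Expanding the stream restores the original record boundaries.
restored = gzip.decompress(compressed).splitlines()
```

Because a record-by-record partitioner has nothing to split in the compressed stream, the whole stream has to stay together until it is expanded, which is the reason given above for setting the preserve-partitioning flag.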

Using orchadmin with a compressed data set


The orchadmin utility manipulates persistent data sets. However, the records of a compressed data set are not in the normal form. For that reason, you can invoke only a subset of the orchadmin commands to manipulate a compressed data set. These commands are as follows:
- delete to delete a compressed data set
- copy to copy a compressed data set
- describe to display information about the data set before compression

Example
This example consists of two steps. In the first step, the pcompress operator compresses the output of the upstream operator before it is stored on disk:

step1: op1 -> pcompress (mode=compress) -> compressDS.ds

In osh, the default mode of the operator is -compress, so you need not specify any option:
$ osh " ... op1 | pcompress > compressDS.ds "

In the next step, the pcompress operator expands the same data set so that it can be used by another operator.


step2: compressDS.ds -> pcompress (mode=expand) -> op2

Use the osh command:


$ osh "pcompress -expand < compressDS.ds | op2 ... "

Peek operator
The peek operator lets you print record field values to the screen as the operator copies records from its input data set to one or more output data sets. This might be helpful for monitoring the progress of your job, or to diagnose a bug in your job.

Data flow diagram

input data set -> (inRec:*;) peek (outRec:*; outRec:*; outRec:*;) -> output data sets

peek: properties
Table 39. peek properties
Property                                        Value
Number of input data sets                       1
Number of output data sets                      N (set by user)
Input interface schema                          inRec:*
Output interface schema                         outRec:*
Transfer behavior                               inRec -> outRec without record modification
Execution mode                                  parallel (default) or sequential
Partitioning method                             any (parallel mode)
Collection method                               any (sequential mode)
Preserve-partitioning flag in output data set   propagated
Composite operator                              no

Peek: syntax and options


Terms in italic typeface are option strings you supply. When your option string contains a space or a tab character, you must enclose it in single quotes.
peek [-all]|[-nrecs numrec] [-dataset] [-delim string] [-field fieldname ... ] [-name] [-part part_num] [-period P] [-skip N] [-var input_schema_var_name]

There are no required options.


Table 40. Peek options

Option: -all
Use: -all
Causes the operator to print all records. The default operation is to print 10 records per partition.

Option: -dataset
Use: -dataset
Specifies to write the output to a data set. The record schema of the output data set is: record(rec:string;)

Option: -delim
Use: -delim string
Uses the string string as a delimiter on top-level fields. Other possible values for this are: nl (newline), tab, and space. The default is the space character.

Option: -field
Use: -field fieldname
Specifies the field whose values you want to print. The default is to print all field values. There can be multiple occurrences of this option.

Option: -name
Use: -name
Causes the operator to print the field name, followed by a colon, followed by the field value. By default, the operator prints only the field value, followed by a space.

Option: -nrecs
Use: -nrecs numrec
Specifies the number of records to print per partition. The default is 10.

Option: -period
Use: -period p
Causes the operator to print every pth record per partition, starting with the first record. p must be >= 1.

Option: -part
Use: -part part_num
Causes the operator to print the records for a single partition number. By default, the operator prints records from all partitions.

Option: -skip
Use: -skip n
Specifies to skip the first n records of every partition. The default value is 0.

Option: -var
Use: -var input_schema_var_name
Explicitly specifies the name of the operator's input schema variable. This is necessary when your input data set contains a field named inRec.

Using the operator


The peek operator reads the records from a single input data set and copies the records to zero or more output data sets. For a specified number of records per partition, where the default is 10, the record contents are printed to the screen. By default, the value of all record fields is printed. You can optionally configure the operator to print a subset of the input record fields. For example, the diagram shows the peek operator, using the -name option to dump both the field names and field values for ten records from every partition of its input data set, between two other operators in a data flow:


step: op1 -> peek (name) -> op2

This data flow can be implemented with the osh command:


$ osh " ... op1 | peek -name | op2 ... "

The output of this example is similar to the following:


ORCHESTRATE VX.Y
16:30:49 00 APT configuration file: ./config.apt
From[1,0]: 16:30:58 00 Name:Mary Smith Age:33 Income:17345 Zip:02141 Phone:555-1212
From[1,1]: 16:30:58 00 Name:John Doe Age:34 Income:67000 Zip:02139 Phone:555-2121
16:30:59 00 Step execution finished with status = OK.
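The way the -name and -delim options shape each printed record can be sketched in Python. This is a rough model of the formatting only, not the operator's implementation; the function name format_record and its parameters are hypothetical, and the sketch does not emit the trailing delimiter that the operator is described as printing after each value.

```python
from typing import Any, Dict


def format_record(record: Dict[str, Any], show_names: bool = False, delim: str = " ") -> str:
    """Join the field values of one record with the delimiter.

    With show_names=True each value is prefixed by its field name and a
    colon, which models the -name option.
    """
    parts = []
    for name, value in record.items():
        parts.append(f"{name}:{value}" if show_names else str(value))
    return delim.join(parts)


mary = {"Name": "Mary Smith", "Age": 33, "Income": 17345, "Zip": "02141", "Phone": "555-1212"}
line = format_record(mary, show_names=True)
```

Here line holds the record portion of the first From[1,0] entry in the sample output; the partition prefix and timestamp are added by the framework and are not modelled.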

PFTP operator
The PFTP (parallel file transfer protocol) Enterprise operator transfers files to and from multiple remote hosts. It works by forking an FTP client executable to transfer a file to or from multiple remote hosts using a URI (Uniform Resource Identifier). This section describes the operator and also addresses issues such as restartability, describing how you can restart an ftp transfer from the point where it was stopped if a transfer fails. The restart occurs at the file boundary.


Data flow diagram

get mode: input URI0, input URI1, ... input URIn -> pftp (get) -> (outRec:*) output data set
put mode: input data set (inRec:*) -> pftp (put) -> output URI0, output URI1, ... output URIn

Operator properties
Property                                        Value
Number of input data sets                       0 <= N <= 1 (zero or one input data set in put mode)
Number of output data sets                      0 <= N <= 1 (zero or one output data set in get mode)
Input interface schema                          inputRec:* (in put mode)
Output interface schema                         outputRec:* (in get mode)
Transfer behavior                               inputRec:* is exported according to a user supplied schema and written to a pipe for ftp transfer. outputRec:* is imported according to a user supplied schema by reading a pipe written by ftp.
Default execution mode                          Parallel
Input partitioning style                        None
Output partitioning style                       None
Partitioning method                             None
Collection method                               None
Preserve-partitioning flag in input data set    Not Propagated
Preserve-partitioning flag in output data set   Not Propagated
Restartable                                     Yes
Combinable operator                             No
Consumption pattern                             None
Composite Operator                              No

Pftp: syntax and options


Terms in italic typeface are option strings you supply. When your option string contains a space or a tab character, you must enclose it in single quotes. The options within [ ] are optional.
pftp -mode put | get
  [-schema schema | -schemafile schemafile]
  -uri uri1 [-open_command cmd] [-uri uri2 [-open_command cmd]... ]
  [-ftp_call ftp_command]
  [-user user1 [-user user2...]]
  [-password password1 [-password password2...]]
  [-overwrite]
  [-transfer_type[ascii,binary]]
  [-xfer_mode[ftp,sftp]]
  [[-restartable_transfer [-job_id job_id][-checkpointdir checkpoint_dir]]
  |[-abandon_transfer [-job_id job_id][-checkpointdir checkpoint_dir]]
  |[-restart_transfer [-job_id job_id][-checkpointdir checkpoint_dir]]]

Table 41. Pftp2 options

Option: -mode
Use: -mode put | -mode get
put or get


Option: -uri
Use: -uri uri1 [-uri uri2...]
The URIs (Uniform Resource Identifiers) are used to transfer or access files from or to multiple hosts. There can be multiple URIs. You can specify one or more URIs, or a single URI with a wild card in the path to retrieve all the files that match the wild card. pftp collects all the retrieved file names before starting the file transfer. pftp supports limited wild carding. Get commands are issued in sequence for files when a wild card is specified.
You can specify an absolute or a relative pathname.
For put operations the syntax for a relative path is:
ftp://remotehost.domain.com/path/remotefilename
Where path is the path relative to the user's home directory.
For put operations the syntax for an absolute path is:
ftp://remotehost.domain.com//path/remotefilename
While connecting to the mainframe system, the syntax for an absolute path is:
ftp://remotehost.domain.com/\path.remotefilename\
Where path is the absolute path of the user's home directory.
For get operations the syntax for a relative path is:
ftp://host/path/filename
Where path is the path relative to the user's home directory.
For get operations the syntax for an absolute path is:
ftp://host//path/filename
While connecting to the mainframe system, the syntax for an absolute path is:
ftp://host//\path.remotefilename\
Where path is the absolute path of the user's home directory.

Option: -open_command
Use: -open_command cmd
Needed only if any operations need to be performed besides navigating to the directory where the file exists. This is a sub-option of the URI option. At most one open_command can be specified for a URI.
Example: -uri ftp://remotehost/fileremote1.dat -open_command verbose


Option: -user
Use: -user username1 [-user username2...]
With each URI you can specify the user name to connect to the URI. If not specified, ftp tries to use the .netrc file in the user's home directory. There can be multiple user names. User1 corresponds to URI1. When the number of user names is less than the number of URIs, the last user name is set for the remaining URIs.
Example: -user User1 -user User2

Option: -password
Use: -password password1 [-password password2...]
With each URI you can specify the password to connect to the URI. If not specified, ftp tries to use the .netrc file in the user's home directory. There can be multiple passwords. Password1 corresponds to URI1. When the number of passwords is less than the number of URIs, the last password is set for the remaining URIs.
Note: The number of passwords should be equal to the number of user names.
Example: -password Secret1 -password Secret2

Option: -schema
Use: -schema schema
You can specify the schema for get or put operations. This option is mutually exclusive with -schemafile.
Example: -schema record(name:string;)

Option: -schemafile
Use: -schemafile schemafile
You can specify the schema for get or put operations in a schema file. This option is mutually exclusive with -schema.
Example: -schemafile file.schema

Option: -ftp_call
Use: -ftp_call cmd
The ftp command to call for get or put operations. The default is ftp. You can include an absolute path with the command.
Example: -ftp_call /opt/gnu/bin/wuftp


Option: -force_config_file_parallelism
Use: -force_config_file_parallelism
Optionally limits the number of pftp players via the APT_CONFIG_FILE configuration file. The operator executes with a maximum degree of parallelism as determined by the configuration file. The operator will execute with a lesser degree of parallelism if the number of get arguments is less than the number of nodes in the configuration file. In some cases this might result in more than one file being transferred per player.

Option: -overwrite
Use: -overwrite
Overwrites remote files in ftp put mode. When this option is not specified, the remote file is not overwritten.


Option: -restartable_transfer | -restart_transfer | -abandon_transfer
Use: This option is used to initiate a restartable ftp transfer. The restartability option in get mode will reinitiate the ftp transfer at the file boundary. The transfer of the files that failed half way is restarted from the beginning or zero file location. The file URIs that were transferred completely are not transferred again. Subsequently, the downloaded URIs are imported to the data set from the downloaded temporary folder path.
- -restartable_transfer: A restartable pftp session is initiated as follows:
  osh "pftp -uri ftp://remotehost/file.dat -user user -password secret -restartable_transfer -job_id 100 -checkpointdir chkdir -mode put < input.ds"
- -restart_transfer: If the transfer fails, the restartable pftp session is resumed as follows:
  osh "pftp -uri ftp://remotehost/file.dat -user user -password secret -restart_transfer -job_id 100 -checkpointdir chkdir -mode put < input.ds"
- -abandon_transfer: Used to abort the operation. The restartable pftp session is abandoned as follows:
  osh "pftp -uri ftp://remotehost/file.dat -user user -password secret -abandon_transfer -job_id 100 -checkpointdir chkdir -mode put < input.ds"


Option: -job_id
Use: This is an integer that specifies the job identifier of the restartable transfer job. This is a dependent option of -restartable_transfer, -restart_transfer, or -abandon_transfer.
Example: -job_id 101

Option: -checkpointdir
Use: This is the directory name or path of the location where the pftp restartable job id folder can be created. The checkpoint folder must exist.
Example: -checkpointdir "/apt/linux207/orch_master/apt/folder"

Option: -transfer_type
Use: This option is used to specify the data transfer type. You can choose either ASCII or Binary as the data transfer type.
Example: -transfer_type binary

Option: -xfer_mode
Use: This option is used to specify the data transfer protocol. You can choose either FTP or SFTP mode of data transfer.
Example: -xfer_mode sftp
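The rule that pairs -user values with URIs (the last user name is reused when fewer user names than URIs are given) can be sketched as follows. This Python model is illustrative only; the function name assign_users is hypothetical, and returning None to represent the fallback to the .netrc file is an assumption of the sketch. The same rule is described for -password.

```python
from typing import List, Optional, Sequence


def assign_users(uris: Sequence[str], users: Sequence[str]) -> List[Optional[str]]:
    """Pair each URI with a user name, reusing the last name for leftover URIs."""
    if not users:
        # No -user given: ftp is described as falling back to the .netrc file.
        return [None] * len(uris)
    last = len(users) - 1
    return [users[min(i, last)] for i in range(len(uris))]
```

For three URIs and the options -user User1 -user User2, the third URI is paired with User2.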

Restartability
You can specify that the FTP operation runs in restartable mode. To do this you:
1. Specify the -restartable_transfer option
2. Specify a unique job_id for the transfer
3. Optionally specify a checkpoint directory for the transfer using the -checkpointdir option (if you do not specify a checkpoint directory, the current working directory is used)
When you run the job that performs the FTP operation, information about the transfer is written to a restart directory identified by the job id located in the checkpoint directory prefixed with the string pftp_jobid_. For example, if you specify a job_id of 100 and a checkpoint directory of /home/bgamsworth/checkpoint the files would be written to /home/bgamsworth/checkpoint/pftp_jobid_100.
If the FTP operation does not succeed, you can rerun the same job with the option set to restart or abandon. For a production environment you could build a job sequence that performed the transfer, then tested whether it was successful. If it was not, another job in the sequence could use another PFTP operator with the restart transfer option to attempt the transfer again using the information in the restart directory.
For get operations, WebSphere DataStage reinitiates the FTP transfer at the file boundary. The transfer of the files that failed half way is restarted from the beginning or zero file location. The file URIs that were transferred completely are not transferred again. Subsequently, the downloaded URIs are imported to the data set from the temporary folder path.
If the operation repeatedly fails, you can use the abandon_transfer option to abandon the transfer and clear the temporary restart directory.
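The naming of the restart directory can be reproduced with a short helper. This Python sketch only builds the path string from the rule stated in this section; the function name restart_directory is hypothetical, and using "." to represent the current working directory is an assumption.

```python
import posixpath


def restart_directory(job_id: int, checkpoint_dir: str = ".") -> str:
    """Build the restart directory path: <checkpoint_dir>/pftp_jobid_<job_id>."""
    return posixpath.join(checkpoint_dir, f"pftp_jobid_{job_id}")


path = restart_directory(100, "/home/bgamsworth/checkpoint")
```

With a job_id of 100 and the checkpoint directory from the example above, path is /home/bgamsworth/checkpoint/pftp_jobid_100. A job sequence that tests for a failed transfer could check this directory before choosing between the restart and abandon options.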


pivot operator
Use the Pivot Enterprise stage to pivot data horizontally. The pivot operator maps a set of fields in an input row to a single column in multiple output records. This type of mapping operation is known as horizontal pivoting. The data output by the pivot operator usually has fewer fields, but more records than the input data. You can map several sets of input fields to several output columns. You can also output any of the fields in the input data with the output data. You can generate a pivot index that will assign an index number to each record with a set of pivoted data.

Properties: pivot operator


Property                                        Value
Number of input data sets                       1
Number of output data sets                      1
Input interface schema                          inRec:*
Output interface schema                         outRec:*; pivotField:* ... pivotFieldn:*; pivotIndex:int;
Execution mode                                  parallel (default) or sequential
Partitioning method                             any (parallel mode)
Collection method                               any (sequential mode)
Preserve-partitioning flag in output data set   propagated
Composite operator                              no
Combinable operator                             yes

The pivot operator:
- Takes any single data set as input
- Has an input interface schema consisting of a single schema variable inRec
- Copies the input data set to the output data set, pivoting data in multiple input fields to single output fields in multiple records

Pivot: syntax and options


pivot -horizontal
  -derive field_name -from field_name [-from field_name]... -type type
  [-derive field_name -from field_name [-from field_name]... -type type]...
  [-index field_name]

Table 42. Pivot options

Option: -horizontal
Use: -horizontal
Specifies that the operator will perform a horizontal pivot operation.

Option: -derive
Use: -derive field_name
Specifies a name for an output field.

Option: -from
Use: -from field_name
Specifies the name of the input field from which the output field is derived.

Option: -type
Use: -type type
Specifies the type of the output field.

Option: -index
Use: -index field_name
Specifies that an index field will be generated for pivoted data.

Pivot: examples
In this example you use the pivot operator to pivot the data shown in the first table to produce the data shown in the second table. This example has a single pivoted output field and a generated index added to the data.
Table 43. Simple pivot operation - input data
REPID   last_name   Jan_sales   Feb_sales   Mar_sales
100     Smith       1234.08     1456.80     1578.00
101     Yamada      1245.20     1765.00     1934.22

Table 44. Simple pivot operation - output data
REPID   last_name   Q1sales   Pivot_index
100     Smith       1234.08   0
100     Smith       1456.80   1
100     Smith       1578.00   2
101     Yamada      1245.20   0
101     Yamada      1765.00   1
101     Yamada      1934.22   2

The osh command is:


$ osh "pivot -horizontal -derive REPID -from REPID -type string -index pivot_index -derive last_name -from last_name -type string -derive Q1sales -from Jan_sales -from Feb_sales -from Mar_sales -type decimal[10,2]"
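The mapping that this command performs can be modelled in Python to make the row expansion explicit. This is a sketch of the observable behavior in the tables above, not the operator's implementation: the function name pivot_horizontal is hypothetical, values are kept as strings rather than decimal[10,2], and the sketch assumes that every output field derived from more than one input field lists the same number of input fields.

```python
from typing import Dict, List, Optional


def pivot_horizontal(rows: List[dict], derive: Dict[str, List[str]],
                     index_field: Optional[str] = None) -> List[dict]:
    """Emit one output record per pivoted input field for each input row.

    derive maps each output field name to the input fields it is taken
    from; a single-source field is copied into every output record.
    """
    width = max(len(sources) for sources in derive.values())
    output = []
    for row in rows:
        for i in range(width):
            record = {}
            for out_name, sources in derive.items():
                source = sources[i] if len(sources) > 1 else sources[0]
                record[out_name] = row[source]
            if index_field is not None:
                record[index_field] = i  # index restarts at 0 for each input row
            output.append(record)
    return output


rows = [
    {"REPID": "100", "last_name": "Smith", "Jan_sales": "1234.08",
     "Feb_sales": "1456.80", "Mar_sales": "1578.00"},
    {"REPID": "101", "last_name": "Yamada", "Jan_sales": "1245.20",
     "Feb_sales": "1765.00", "Mar_sales": "1934.22"},
]
derive = {
    "REPID": ["REPID"],
    "last_name": ["last_name"],
    "Q1sales": ["Jan_sales", "Feb_sales", "Mar_sales"],
}
result = pivot_horizontal(rows, derive, index_field="pivot_index")
```

Each input row yields three output records, one per -from field of Q1sales, so result contains the six rows of Table 44. Passing several multi-source entries in derive models a pivot with multiple pivoted output fields.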

In this example, you use the pivot operator to pivot the data shown in the first table to produce the data shown in the second table. This example has multiple pivoted output fields.
Table 45. Pivot operation with multiple pivot columns - input data
REPID   last_name   Q1sales   Q2sales   Q3sales   Q4sales
100     Smith       4268.88   5023.90   4321.99   5077.63
101     Yamada      4944.42   5111.88   4500.67   4833.22

Table 46.
REPID   last_name   halfyear1   halfyear2
100     Smith       4268.88     4321.99
100     Smith       5023.90     5077.63
101     Yamada      4944.42     4500.67
101     Yamada      5111.88     4833.22

The osh command is:


$ osh "pivot -horizontal -derive REPID -from REPID -type string -derive last_name -from last_name -type string -derive halfyear1 -from Q1sales -from Q2sales -type decimal[10,2] -derive halfyear2 -from Q3sales -from Q4sales -type decimal[10,2]"

Remdup operator
The remove-duplicates operator, remdup, takes a single sorted data set as input, removes all duplicate records, and writes the results to an output data set. Removing duplicate records is a common way of cleansing a data set before you perform further processing. Two records are considered duplicates if they are adjacent in the input data set and have identical values for the key field(s). A key field is any field you designate to be used in determining whether two records are identical. For example, a direct mail marketer might use remdup to aid in householding, the task of cleansing a mailing list to prevent multiple mailings going to several people in the same household. The input data set to the remove duplicates operator must be sorted so that all records with identical key values are adjacent. By default, WebSphere DataStage inserts partition and sort components to meet the partitioning and sorting needs of the remdup operator and other operators.

Data flow diagram


input data set -> (inRec:*;) remdup (outRec:*;) -> output data set

remdup: properties
Table 47. remdup properties
Property                                        Value
Number of input data sets                       1
Number of output data sets                      1
Input interface schema                          inRec:*
Output interface schema                         outRec:*
Transfer behavior                               inRec -> outRec without record modification
Execution mode                                  parallel (default) or sequential
Input partitioning style                        keys in same partition
Partitioning method                             same (parallel mode)
Collection method                               any (sequential mode)
Preserve-partitioning flag in output data set   propagated
Restartable                                     yes
Composite operator                              no

Remdup: syntax and options


remdup [-key field [-cs | -ci] [-ebcdic] [-hash] [-param params] ...] [-collation_sequence locale | collation_file_pathname | OFF][-first | -last]


Table 48. remdup options

Option: -collation_sequence
Use: -collation_sequence locale | collation_file_pathname | OFF
This option determines how your string data is sorted. You can:
- Specify a predefined IBM ICU locale
- Write your own collation sequence using ICU syntax, and supply its collation_file_pathname
- Specify OFF so that string comparisons are made using Unicode code-point value order, independent of any locale or custom sequence.
By default, WebSphere DataStage sorts strings using byte-wise comparisons. For more information, reference this IBM ICU site: http://oss.software.ibm.com/icu/userguide/Collate_Intro.htm

Option: -first
Use: -first
Specifies that the first record of the duplicate set is retained. This is the default.

Option: -last
Use: -last
Specifies that the last record of the duplicate set is retained. The options -first and -last are mutually exclusive.

Option: -key
Use: -key field [-cs | -ci] [-ebcdic] [-hash] [-param params]
Specifies the name of a key field. The -key option can be repeated for as many fields as are defined in the input data set's record schema.
The -cs option specifies case-sensitive comparison, which is the default. The -ci option specifies a case-insensitive comparison of the key fields.
By default data is represented in the ASCII character set. To represent data in the EBCDIC character set, specify the -ebcdic option.
The -hash option specifies hash partitioning using this key.
The -param suboption allows you to specify extra parameters for a field. Specify parameters using property=value pairs separated by commas.

Removing duplicate records


The remove duplicates operator determines if two adjacent records are duplicates by comparing one or more fields in the records. The fields used for comparison are called key fields. When using this operator, you specify which of the fields on the record are to be used as key fields. You can define only one key field or as many as you need. Any field on the input record can be used as a key field. The determination that two records are identical is based solely on the key field values; all other fields on the record are ignored.

If the values of all of the key fields for two adjacent records are identical, then the records are considered duplicates. When two records are duplicates, one of them is discarded and one retained. By default, the first record of a duplicate pair is retained and any subsequent duplicate records in the data set are discarded. This action can be overridden with an option to keep the last record of a duplicate pair.

In order for the operator to recognize duplicate records as defined by the key fields, the records must be adjacent in the input data set. This means that the data set must have been hash partitioned, then sorted, using the same key fields for the hash and sort as you want to use for identifying duplicates. By default, WebSphere DataStage inserts partition and sort components to meet the partitioning and sorting needs of the remdup operator and other operators.

For example, suppose you want to sort the data set first by the Month field and then by the Customer field and then use these two fields as the key fields for the remove duplicates operation. Use the following osh command:
$ osh "remdup -key Month -key Customer < inDS.ds > outDS.ds"

In this example, the partition and sort components inserted by WebSphere DataStage guarantee that all records with the same key field values are in the same partition of the data set. For example, all of the January records for Customer 86111 are processed together as part of the same partition.

Using options to the operator


By default, the remdup operator retains the first record of a duplicate pair and discards any subsequent duplicate records in the data set. Suppose you have a data set which has been sorted on two fields: Month and Customer. Each record has a third field for the customer's current Balance, and the data set can contain multiple records for a customer's balance for any month. When using the remdup operator to cleanse this data set, by default, only the first record is retained for each customer in each month and all the others are discarded as duplicates. For example, if the records in the data set are:
Month   Customer   Balance
Apr     86111      787.38
Apr     86111      459.32
Apr     86111      333.21
May     86111      134.66
May     86111      594.26

The default result of removing duplicate records on this data set is:
Month   Customer   Balance
Apr     86111      787.38
May     86111      134.66

Using the -last option, you can specify that the last duplicate record is to be retained rather than the first. This can be useful if you know, for example, that the last record in a set of duplicates is always the most recent record.


For example, if the osh command is:


$ osh "remdup -key Month -key Customer -last < inDS.ds > outDS.ds"

the output would be given by:

Month   Customer   Balance
Apr     86111      333.21
May     86111      594.26
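The effect of -first and -last on the sample records can be mimicked outside DataStage. The following is a minimal Python sketch, not the operator's implementation: the function name remdup and its keep parameter are hypothetical, and the input is assumed to be already sorted on the key fields so that duplicates are adjacent.

```python
from itertools import groupby


def remdup(records, keys, keep="first"):
    """Keep one record from each run of adjacent records whose key values match."""
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    kept = []
    # groupby only groups adjacent items, which mirrors the adjacency rule.
    for _, run in groupby(records, key=lambda rec: tuple(rec[k] for k in keys)):
        run = list(run)
        kept.append(run[0] if keep == "first" else run[-1])
    return kept


# Sample records from the text, already sorted on Month and Customer.
records = [
    {"Month": "Apr", "Customer": 86111, "Balance": 787.38},
    {"Month": "Apr", "Customer": 86111, "Balance": 459.32},
    {"Month": "Apr", "Customer": 86111, "Balance": 333.21},
    {"Month": "May", "Customer": 86111, "Balance": 134.66},
    {"Month": "May", "Customer": 86111, "Balance": 594.26},
]

if __name__ == "__main__":
    for rec in remdup(records, ["Month", "Customer"], keep="last"):
        print(rec)
```

Because only adjacent records are compared, passing unsorted input to this sketch leaves non-adjacent duplicates in place, which is the same reason the operator needs sorted input.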

If a key field is a string, you have a choice about how the value from one record is compared with the value from the next record. The default is that the comparison is case sensitive. If you specify the -ci option, the comparison is case insensitive. In osh, specify the -ci suboption of the -key option, as in this command:
$ osh "remdup -key Month -ci < inDS.ds > outDS.ds"

With this option specified, month values of JANUARY and January match, whereas without the case-insensitive option they do not match. For example, if your input data set is:

Month   Customer   Balance
Apr     59560      787.38
apr     43455      459.32
apr     59560      333.21
May     86111      134.66
may     86111      594.26

The output from a case-sensitive sort is:

Month   Customer   Balance
Apr     59560      787.38
May     86111      134.66
apr     43455      459.32
apr     59560      333.21
may     86111      594.26

Thus the two April records for customer 59560 are not recognized as a duplicate pair because they are not adjacent to each other in the input. To remove all duplicate records regardless of the case of the Month field, use the following statement in osh:
$ osh "remdup -key Month -ci -key Customer < inDS.ds > outDS.ds"

This causes the result of sorting the input to be:

Month   Customer   Balance
apr     43455      459.32
Apr     59560      787.38
apr     59560      333.21
May     86111      134.66
may     86111      594.26

The output from the remdup operator will then be:

Month   Customer   Balance
apr     43455      459.32
Apr     59560      787.38
May     86111      134.66
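The interaction between the sort order and the case-insensitive key can be traced with a short Python sketch. This is an illustration under stated assumptions, not DataStage code: sort_and_remdup and fold are hypothetical helpers, case-insensitive comparison is approximated with str.lower(), Python's stable sort stands in for the inserted sort component, and the fifth record's month is taken to be the lower-case value may.

```python
from itertools import groupby


def fold(value, case_insensitive):
    """Lower-case string keys when the key is flagged as case insensitive."""
    if case_insensitive and isinstance(value, str):
        return value.lower()
    return value


def sort_and_remdup(records, keys, ci_keys=()):
    """Sort on the keys, then keep the first record of each adjacent duplicate run."""
    def key_of(rec):
        return tuple(fold(rec[k], k in ci_keys) for k in keys)

    ordered = sorted(records, key=key_of)  # stable: ties keep input order
    return [next(run) for _, run in groupby(ordered, key=key_of)]


records = [
    {"Month": "Apr", "Customer": 59560, "Balance": 787.38},
    {"Month": "apr", "Customer": 43455, "Balance": 459.32},
    {"Month": "apr", "Customer": 59560, "Balance": 333.21},
    {"Month": "May", "Customer": 86111, "Balance": 134.66},
    {"Month": "may", "Customer": 86111, "Balance": 594.26},
]

if __name__ == "__main__":
    # Case-sensitive keys: no two records share both key values.
    print(len(sort_and_remdup(records, ["Month", "Customer"])))
    # Case-insensitive Month: the Apr/apr and May/may pairs collapse.
    for rec in sort_and_remdup(records, ["Month", "Customer"], ci_keys=("Month",)):
        print(rec)
```

With case-sensitive keys all five records survive; with Month treated as case insensitive, three remain, matching the tables above.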

Using the operator


The remdup operator takes a single data set as input, removes all duplicate records, and writes the results to an output data set. As part of this operation, the operator copies an entire record from the input data set to the output data set without altering the record. Only one record is output for each set of duplicate records.

Example 1: using remdup


The following is an example of use of the remdup operator. Use the osh command:
$ osh "remdup -key Month < indDS.ds > outDS.ds"

This example removes all records in the same month except the first record. The output data set thus contains at most 12 records.

Example 2: using the -last option


In this example, the last record of each duplicate pair is output rather than the first, because of the -last option. Use the osh command:
$ osh "remdup -key Month -last < indDS.ds > outDS.ds"

Example 3: case-insensitive string matching


This example shows use of case-insensitive string matching. Use the osh command:
$ osh "remdup -key Month -ci -last < indDS.ds > outDS.ds"

The results differ from those of the previous example if the Month field has mixed-case data values such as May and MAY. When the case-insensitive comparison option is used these values match and when it is not used they do not.

Example 4: using remdup with two keys


This example retains the first record in each month for each customer. Therefore there are no more than 12 records in the output for each customer. Use the osh command:
$ osh "remdup -key Month -ci -key Customer < inDS.ds > outDS.ds"


Sample operator
The sample operator is useful when you are building, testing, or training data sets for use with the WebSphere DataStage data-modeling operators. The sample operator allows you to:
- Create disjoint subsets of an input data set by randomly sampling the input data set to assign a percentage of records to output data sets. WebSphere DataStage uses a pseudo-random number generator to randomly select, or sample, the records of the input data set to determine the destination output data set of a record. You supply the initial seed of the random number generator. By changing the seed value, you can create different record distributions each time you sample a data set, and you can recreate a given distribution by using the same seed value. A record distribution is repeatable if you use the same:
  - Seed value
  - Number of output data sets
  - Percentage of records assigned to each data set
  No input record is assigned to more than one output data set. The sum of the percentages of all records assigned to the output data sets must be less than or equal to 100%.
- Alternatively, you can specify that every nth record be written to output data set 0.

Data flow diagram

[Figure: one input data set (inRec:*) feeds the sample operator, which writes to multiple output data sets (outRec:* each)]

sample: properties
Table 49. sample properties

Property                                        Value
Number of input data sets                       1
Number of output data sets                      N (set by user)
Input interface schema                          inRec:*
Output interface schema                         outRec:*
Transfer behavior                               inRec -> outRec without record modification
Execution mode                                  parallel (default) or sequential
Partitioning method                             any (parallel mode)
Collection method                               any (sequential mode)
Preserve-partitioning flag in output data set   propagated
Composite operator                              no

Sample: syntax and options


sample -percent percent output_port_num [-percent percent output_port_num ...] | -sample sample [-maxoutputrows maxout] [-seed seed_val]

Either the -percent option must be specified for each output data set or the -sample option must be specified.
Table 50. Sample options

Option: -maxoutputrows
Use: -maxoutputrows maxout
  Optionally specifies the maximum number of rows to be output per process. Supply an integer >= 1 for maxout.

Option: -percent
Use: -percent percent output_port_num
  Specifies the sampling percentage for each output data set. You specify the percentage as an integer value in the range of 0, corresponding to 0.0%, to 100, corresponding to 100.0%. The sum of the percentages specified for all output data sets cannot exceed 100.0%. The output_port_num following percent is the output data set number.
  The -percent and -sample options are mutually exclusive. One must be specified.

Option: -sample
Use: -sample sample
  Specifies that each nth record is written to output 0. Supply an integer >= 1 for sample to indicate the value for n.
  The -sample and -percent options are mutually exclusive. One must be specified.

Option: -seed
Use: -seed seed_val
  Initializes the random number generator used by the operator to randomly sample the records of the input data set. seed_val must be a 32-bit integer.
  The operator uses a repeatable random number generator, meaning that the record distribution is repeatable if you use the same seed_val, number of output data sets, and percentage of records assigned to each data set.


Example sampling of a data set


This example configures the sample operator to generate three output data sets from an input data set. Output data set 0 receives 5.0% of the records of the input data set, data set 1 receives 10.0%, and data set 2 receives 15.0%.

[Figure: InDS.ds feeds the sample operator in a single step; the operator writes 5.0% of the records to outDS0.ds, 10.0% to outDS1.ds, and 15.0% to outDS2.ds]


Use this osh command to implement this example:
$ osh "sample -seed 304452 -percent 5 0 -percent 10 1 -percent 15 2 < inDS.ds > outDS0.ds > outDS1.ds > outDS2.ds"

In this example, you specify a seed value of 304452, a sampling percentage for each output data set, and three output data sets.
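The two sampling modes can be pictured with a small Python sketch. It is only an illustration of the behavior described above: sample_percent and sample_every_nth are hypothetical names, Python's random module stands in for the operator's generator (so the actual record distribution will differ from DataStage's), and "every nth record" is assumed to mean records n, 2n, 3n, and so on.

```python
import random


def sample_percent(records, percents, seed):
    """Assign each record to at most one output according to the percentages."""
    if sum(percents) > 100:
        raise ValueError("percentages must not sum to more than 100")
    rng = random.Random(seed)  # same seed and percentages: same distribution
    outputs = [[] for _ in percents]
    for rec in records:
        draw = rng.random() * 100  # uniform in [0, 100)
        upper = 0
        for port, pct in enumerate(percents):
            upper += pct
            if draw < upper:
                outputs[port].append(rec)
                break  # a record never goes to a second output
    return outputs


def sample_every_nth(records, n):
    """Write every nth record to output 0 (assumed to be records n, 2n, ...)."""
    if n < 1:
        raise ValueError("n must be >= 1")
    return records[n - 1::n]


if __name__ == "__main__":
    outs = sample_percent(list(range(1000)), [5, 10, 15], seed=304452)
    print([len(o) for o in outs])
    print(sample_every_nth(list(range(1, 11)), 3))
```

Records whose draw falls above the cumulative total (here, above 30) are assigned to no output, which is why the percentages only need to sum to at most 100.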

Sequence operator
Using the sequence operator, you can copy multiple input data sets to a single output data set. The sequence operator takes one or more data sets as input and copies all records from the first input data set to the output data set, then all the records from the second input data set, and so on. This operation is useful when you want to combine separate data sets into a single large data set. The record schema of all input data sets must be identical.

You can execute the sequence operator either in parallel (the default) or sequentially. Sequential mode allows you to specify a collection method for an input data set to control how the data set partitions are combined by the operator.

This operator differs from the funnel operator, described in Funnel Operators, in that the funnel operator does not guarantee the record order in the output data set.

Data flow diagram

[Figure: multiple input data sets (inRec:* each) feed the sequence operator, which writes a single output data set (outRec:*)]

sequence: properties
Table 51. sequence properties

Property                                        Value
Number of input data sets                       N (set by user)
Number of output data sets                      1
Input interface schema                          inRec:*
Output interface schema                         outRec:*
Transfer behavior                               inRec -> outRec without record modification
Execution mode                                  parallel (default) or sequential
Partitioning method                             round robin (parallel mode)
Collection method                               any (sequential mode)
Preserve-partitioning flag in output data set   propagated
Composite operator                              no

Sequence: syntax and options


The syntax for the sequence operator in an osh command is simply:
sequence

It has no operator-specific options.

Example of Using the sequence Operator


This example uses the sequence operator to combine multiple data sets created by multiple steps, before passing the combined data to another operator op1. The diagram shows data flow for this example:

[Figure: steps A, B, and C each write an output data set; a further step reads the three data sets with the sequence operator and passes the combined data to op1]

The following osh commands create the data sets:


$ osh " ... > outDS0.ds"
$ osh " ... > outDS1.ds"
$ osh " ... > outDS2.ds"

The osh command for the step beginning with the sequence operator is:
$ osh "sequence < outDS0.ds < outDS1.ds < outDS2.ds | op1 ... "
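The ordering guarantee is the point of this operator, and it can be sketched in a few lines of Python. This is a sketch under assumptions, not the operator itself: the function name sequence_data_sets is hypothetical, each data set is modeled as a list of dictionaries, and the schema check is reduced to comparing field names.

```python
from itertools import chain


def sequence_data_sets(*data_sets):
    """Concatenate data sets in the order given, preserving record order."""
    schemas = {tuple(ds[0].keys()) for ds in data_sets if ds}
    if len(schemas) > 1:
        raise ValueError("all input data sets must share one record schema")
    return list(chain.from_iterable(data_sets))


if __name__ == "__main__":
    ds0 = [{"id": 1}, {"id": 2}]
    ds1 = [{"id": 3}]
    ds2 = [{"id": 4}, {"id": 5}]
    print(sequence_data_sets(ds0, ds1, ds2))
```

All records of the first data set appear before any record of the second, which is the property the funnel operator does not promise.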

Switch operator
The switch operator takes a single data set as input. The input data set must have an integer field to be used as a selector, or a selector field whose values can be mapped, implicitly or explicitly, to int8. The switch operator assigns each input record to one of multiple output data sets based on the value of the selector field.

Data flow diagram

[Figure: one input data set (inRec:*) feeds the switch operator, which writes to output data sets 0, 1, ... N (outRec:* each)]

The switch operator is analogous to a C switch statement, which causes the flow of control in a C program to branch to one of several cases based on the value of a selector variable, as shown in the following C program fragment.
switch (selector)
{
    case 0:        // if selector = 0,
                   // write record to output data set 0
        break;
    case 1:        // if selector = 1,
                   // write record to output data set 1
        break;
    .
    .
    .
    case discard:  // if selector = discard value
                   // skip record
        break;
    default:       // if selector is invalid,
                   // abort operator and end step
}

You can attach up to 128 output data sets to the switch operator, corresponding to 128 different values for the selector field. Note that the selector value for each record must be in or be mapped to the range 0 to N-1, where N is the number of data sets attached to the operator, or be equal to the discard value. Invalid selector values normally cause the switch operator to terminate and the step containing the operator to return an error. However, you can set an option that allows records whose selector field does not correspond to that range to be either dropped silently or treated as allowed rejects.

You can set a discard value for the selector field. Records whose selector field contains or is mapped to the discard value are dropped, that is, not assigned to any output data set.

switch: properties
Table 52. switch properties

Property                                        Value
Number of input data sets                       1
Number of output data sets                      1 <= N <= 128
Input interface schema                          selector field:any data type;inRec:*
Output interface schema                         outRec:*
Preserve-partitioning flag in output data set   propagated

Switch: syntax and options


Terms in italic typeface are option strings you supply. When your option string contains a space or a tab character, you must enclose it in single quotes.
switch [-allowRejects] | [-ifNotFound ignore | allow | fail] | [-ignoreRejects] | [-hashSelector]
  [-case "selector_value = output_ds"]
  [-collation_sequence locale | collation_file_pathname | OFF]
  [-discard discard_value]
  [-key field_name [-cs | -ci] [-param params]]

If the selector field is of type integer and has no more than 128 values, there are no required options, otherwise you must specify a mapping using the -case option.
Table 53. switch options

Option: -allowRejects
Use: -allowRejects
  Rejected records (whose selector value is not in the range 0 to N-1, where N is the number of data sets attached to the operator, or equal to the discard value) are assigned to the last output data set.
  This option is mutually exclusive with the -ignoreRejects, -ifNotFound, and -hashSelector options.

Option: -case
Use: -case mapping
  Specifies the mapping between actual values of the selector field and the output data sets. mapping is a string of the form selector_value = output_ds, where output_ds is the number of the output data set to which records with that selector value should be written (output_ds can be implicit, as shown in the example below).
  You must specify an individual mapping for each value of the selector field you want to direct to one of the output data sets, thus -case is invoked as many times as necessary to specify the complete mapping.
  Multi-byte Unicode character data is supported for ustring selector values.
  Note: This option is incompatible with the -hashSelector option.

Option: -collation_sequence
Use: -collation_sequence locale | collation_file_pathname | OFF
  This option determines how your string data is sorted. You can:
  - Specify a predefined IBM ICU locale
  - Write your own collation sequence using ICU syntax, and supply its collation_file_pathname
  - Specify OFF so that string comparisons are made using Unicode code-point value order, independent of any locale or custom sequence.
  By default, WebSphere DataStage sorts strings using byte-wise comparisons.
  For more information, reference this IBM ICU site: http://oss.software.ibm.com/icu/userguide/Collate_Intro.htm

Option: -discard
Use: -discard discard_value
  Specifies an integer value of the selector field, or the value to which it was mapped using -case, that causes a record to be discarded by the operator.
  Note that discard_value must be outside the range 0 to N-1, where N is the number of data sets attached to the operator.
  Note: This option is mutually exclusive with -hashSelector.

Option: -hashSelector
Use: -hashSelector
  A boolean; when this is set, records are hashed on the selector field modulo the number of output data sets and assigned to an output data set accordingly. The selector field must be of a type that is convertible to uint32 and must not be nullable.
  Note: This option is incompatible with the -case, -discard, -allowRejects, -ignoreRejects, and -ifNotFound options.

Option: -ifNotFound
Use: -ifNotFound {allow | fail | ignore}
  Specifies what the operator should do if a data set corresponding to the selector value does not exist:
  allow   Rejected records (whose selector value is not in the range 0 to N-1 or equal to the discard value) are assigned to the last output data set. If this option value is used, you cannot explicitly assign records to the last data set.
  fail    When an invalid selector value is found, return an error and terminate. This is the default.
  ignore  Drop the record containing the out-of-range value and continue.
  Note: This option is incompatible with the -allowRejects, -ignoreRejects, and -hashSelector options.

Option: -ignoreRejects
Use: -ignoreRejects
  Drop the record containing the out-of-range value and continue.
  Note: This option is mutually exclusive with the -allowRejects, -ifNotFound, and -hashSelector options.

Option: -key
Use: -key field_name [-cs | -ci]
  Specifies the name of a field to be used as the selector field. The default field name is selector. This field can be of any data type that can be converted to int8, or any non-nullable type if -case options are specified. Field names can contain multi-byte Unicode characters.
  Use the -ci flag to specify that field_name is case-insensitive. The -cs flag specifies that field_name is treated as case sensitive, which is the default.

In this example, you create a switch operator and attach three output data sets numbered 0 through 2. The switch operator assigns input records to each output data set based on the selector field, whose year values have been mapped to the numbers 0 or 1 by means of the -case option. A selector field value that maps to an integer other than 0 or 1 causes the operator to write the record to the last data set. You cannot explicitly assign input records to the last data set if the -ifNotFound option is set to allow.

With these settings, records whose year field has the value 1990, 1991, or 1992 go to outDS0.ds. Those whose year value is 1993 or 1994 go to outDS1.ds. Those whose year is 1995 are discarded. Those with any other year value are written to outDS2.ds, since rejects are allowed by the -ifNotFound setting.

Note that because the -ifNotFound option is set to allow rejects, switch does not let you map any year value explicitly to the last data set (outDS2.ds), as that is where rejected records are written. Note also that it was unnecessary to specify an output data set for 1991 or 1992, since without an explicit mapping indicated, -case maps values across the output data sets, starting from the first (outDS0.ds). You can map more than one selector field value to a given output data set.

The operator also verifies that if a -case entry maps a selector field value to a number outside the range 0 to N-1, that number corresponds to the value of the -discard option.

[Figure: InDS.ds (schema: income:int32; year:string; name:string; state:string;) feeds the switch operator in a single step, with selector = year and input interface selector:type; inRec:*; the operator writes outRec:* to outDS0.ds, outDS1.ds, and outDS2.ds]


In this example, records with the selector field mapped to:
- 0 are written to outDS0.ds
- 1 are written to outDS1.ds
- 5 are discarded
- any other value are treated as rejects and written to outDS2.ds.

In most cases, your input data set does not contain an 8-bit integer field to be used as the selector; therefore, you use the -case option to map its actual values to the required range of integers. In this example, the record schema of the input data set contains a string field named year, which you must map to 0 or 1. Specify the mapping with the following osh code:
$ osh "switch -case 1990=0 -case 1991 -case 1992 -case 1993=1 -case 1994=1 -case 1995=5 -discard 5 -ifNotFound allow -key year < inDS.ds > outDS0.ds > outDS1.ds > outDS2.ds "

Note that by default output data sets are numbered starting from 0. You could also include explicit data set numbers, as shown below:
$ osh "switch -discard 3 < inDS.ds 0> outDS0.ds 1> outDS1.ds 2> outDS2.ds "
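The routing rules in the year example (case mapping, discard value, and reject handling) can be summarized in a short Python sketch. It is an approximation for illustration only: the function switch_records and its parameters are hypothetical, the case mapping is written out explicitly for every year rather than relying on implicit output numbers, and only the -ifNotFound behaviors described above are modeled.

```python
def switch_records(records, key, n_outputs, case_map, discard=None, if_not_found="fail"):
    """Route each record to an output list based on its mapped selector value."""
    if if_not_found not in ("allow", "fail", "ignore"):
        raise ValueError("if_not_found must be allow, fail, or ignore")
    last = n_outputs - 1
    if if_not_found == "allow" and last in case_map.values():
        raise ValueError("with allow, the last output is reserved for rejects")
    outputs = [[] for _ in range(n_outputs)]
    for rec in records:
        target = case_map.get(rec[key])  # None when the value has no mapping
        if target is not None and target == discard:
            continue  # discarded: not written to any output
        if target is not None and 0 <= target < n_outputs:
            outputs[target].append(rec)
        elif if_not_found == "allow":
            outputs[last].append(rec)  # rejects go to the last output
        elif if_not_found == "ignore":
            continue  # silently dropped
        else:
            raise ValueError(f"invalid selector value: {rec[key]!r}")
    return outputs


# Explicit form of the mapping used in the osh example (assumed equivalent).
CASES = {"1990": 0, "1991": 0, "1992": 0, "1993": 1, "1994": 1, "1995": 5}

if __name__ == "__main__":
    recs = [{"year": y} for y in ("1990", "1993", "1995", "2001", "1992")]
    for port, out in enumerate(switch_records(recs, "year", 3, CASES, discard=5, if_not_found="allow")):
        print(port, [r["year"] for r in out])
```

In this sketch the 1995 record is dropped because it maps to the discard value 5, while 2001 has no mapping and lands in the last output as a reject.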


Job monitoring information


The switch operator reports business logic information which can be used to make decisions about how to process data. It also reports summary statistics based on the business logic. The business logic is included in the metadata messages generated by WebSphere DataStage as custom information. It is identified with:
name="BusinessLogic"

The output summary per criterion is included in the summary messages generated by WebSphere DataStage as custom information. It is identified with:
name="CriterionSummary"

The XML tags criterion, case and where are used by the switch operator when generating business logic and criterion summary custom information. These tags are used in the example information below.

Example metadata and summary messages


<response type="metadata">
  <component ident="switch">
    <componentstats startTime="2002-08-08 14:41:56"/>
    <linkstats portNum="0" portType="in"/>
    <linkstats portNum="0" portType="out"/>
    <linkstats portNum="1" portType="out"/>
    <linkstats portNum="2" portType="out"/>
    <custom_info Name="BusinessLogic" Desc="User-supplied logic to switch operator">
      <criterion name="key">tfield</criterion>
      <criterion name="case">
        <case value=" 0" output_port="0"></case>
        <case value=" 1" output_port="1"></case>
        <case value=" 2" output_port="2"></case>
      </criterion>
    </custom_info>
  </component>
</response>
<response type="summary">
  <component ident="switch" pid="2239">
    <componentstats startTime="2002-08-08 14:41:59" stopTime="2002-08-08 14:42:40" percentCPU="99.5"/>
    <linkstats portNum="0" portType="in" recProcessed="1000000"/>
    <linkstats portNum="0" portType="out" recProcessed="250000"/>
    <linkstats portNum="1" portType="out" recProcessed="250000"/>
    <linkstats portNum="2" portType="out" recProcessed="250000"/>
    <custom_info Name="CriterionSummary" Desc="Output summary per criterion">
      <case value=" 0" output_port="0" recProcessed="250000"/>
      <case value=" 1" output_port="1" recProcessed="250000"/>
      <case value=" 2" output_port="2" recProcessed="250000"/>
    </custom_info>
  </component>
</response>

Customizing job monitor messages


WebSphere DataStage specifies the business logic and criterion summary information for the switch operator using the functions addCustomMetadata() and addCustomSummary(). You can also use these functions to generate this kind of information for the operators you write.


Tail operator
The tail operator copies the last N records from each partition of its input data set to its output data set. By default, N is 10 records. However, you can determine the following by means of options:
- The number of records to copy
- The partition from which the records are copied

This control is helpful in testing and debugging jobs with large data sets. The head operator performs a similar operation, copying the first N records from each partition. See Head Operator.

Data flow diagram

[Figure: one input data set (inRec:*) feeds the tail operator, which writes one output data set (outRec:*)]

tail: properties
Table 54. tail properties

Property                     Value
Number of input data sets    1
Number of output data sets   1
Input interface schema       inRec:*
Output interface schema      outRec:*
Transfer behavior            inRec -> outRec without record modification

Tail: syntax and options


tail [-nrecs count] [-part partition_number]

Table 55. Tail options

Option: -nrecs
Use: -nrecs count
  Specify the number of records (count) to copy from each partition of the input data set to the output data set. The default value of count is 10.

Option: -part
Use: -part partition_number
  Copy records only from the indicated partition. By default, the operator copies records from all partitions.
  You can specify -part multiple times to specify multiple partition numbers. Each time you do, specify the option followed by the number of the partition.

Tail example 1: tail operator default behavior


In this example, no options have been specified to the tail operator. The input data set consists of 60 sorted records (positive integers) hashed into four partitions. The output data set consists of the last ten records of each partition. The osh command for the example is:
$ osh "tail < in.ds > out.ds"

Table 56. tail Operator Input and Output for Example 1

Partition 0
  Input:  0 9 18 19 23 25 36 37 40 47 51
  Output: 9 18 19 23 25 36 37 40 47 51
Partition 1
  Input:  3 5 11 12 13 14 15 16 17 35 42 46 49 50 53 57 59
  Output: 16 17 35 42 46 49 50 53 57 59
Partition 2
  Input:  6 7 8 22 29 30 33 41 43 44 45 48 55 56 58
  Output: 30 33 41 43 44 45 48 55 56 58
Partition 3
  Input:  1 2 4 10 20 21 24 26 27 28 31 32 34 38 39 52 54
  Output: 26 27 28 31 32 34 38 39 52 54

Example 2: tail operator with both options


In this example, both the -nrecs and -part options are specified to the tail operator to request that the last 3 records of Partition 2 be output. The input data set consists of 60 sorted records (positive integers) hashed into four partitions. The output data set contains only the last three records of Partition 2. Table 57 shows the input and output data. The osh command for this example is:
$ osh "tail -nrecs 3 -part 2 < in.ds > out0.ds"

Table 57. tail Operator Input and Output for Example 2

Partition 0
  Input:  0 9 18 19 23 25 36 37 40 47 51
  Output: (none)
Partition 1
  Input:  3 5 11 12 13 14 15 16 17 35 42 46 49 50 53 57 59
  Output: (none)
Partition 2
  Input:  6 7 8 22 29 30 33 41 43 44 45 48 55 56 58
  Output: 55 56 58
Partition 3
  Input:  1 2 4 10 20 21 24 26 27 28 31 32 34 38 39 52 54
  Output: (none)
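Both tail examples can be reproduced with a small Python sketch that treats each partition as a list. This is an illustration, not the operator: the function name tail and its nrecs and parts parameters are hypothetical stand-ins for the -nrecs and -part options, and the partition contents are taken from the example, with each partition assumed to be in sorted order.

```python
def tail(partitions, nrecs=10, parts=None):
    """Return the last nrecs records of each selected partition, keyed by partition number."""
    selected = range(len(partitions)) if parts is None else parts
    result = {}
    for p in selected:
        recs = partitions[p]
        # max() guards the nrecs=0 case, where recs[-0:] would return everything.
        result[p] = recs[max(len(recs) - nrecs, 0):]
    return result


# The 60 records of the example, hashed into four partitions.
PARTITIONS = [
    [0, 9, 18, 19, 23, 25, 36, 37, 40, 47, 51],
    [3, 5, 11, 12, 13, 14, 15, 16, 17, 35, 42, 46, 49, 50, 53, 57, 59],
    [6, 7, 8, 22, 29, 30, 33, 41, 43, 44, 45, 48, 55, 56, 58],
    [1, 2, 4, 10, 20, 21, 24, 26, 27, 28, 31, 32, 34, 38, 39, 52, 54],
]

if __name__ == "__main__":
    print(tail(PARTITIONS))                      # default: last 10 per partition
    print(tail(PARTITIONS, nrecs=3, parts=[2]))  # last 3 of partition 2 only
```

The default call returns ten records for every partition, and the second call returns only 55, 56, and 58 from partition 2.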


Transform operator
The transform operator modifies your input records, or transfers them unchanged, guided by the logic of the transformation expression you supply. You build transformation expressions using the Transformation Language, which is the language that defines expression syntax and provides built-in functions. By using the Transformation Language with the transform operator, you can:
- Transfer input records to one or more outputs
- Define output fields and assign values to them based on your job logic
- Use local variables and input and output fields to perform arithmetic operations, make function calls, and construct conditional statements
- Control record processing behavior by explicitly calling writerecord dataset_number, droprecord, and rejectrecord as part of a conditional statement

Running your job on a non-NFS MPP


At run time, the transform operator distributes its shared library to remote nodes on non-NFS MPP systems. To prevent your job from aborting, these three conditions must be satisfied:
1. The APT_COPY_TRANSFORM_OPERATOR environment variable must be set.
2. Users must have create privileges on the project directory paths on all remote nodes at runtime. For example, the transform library trx.so is created on the conductor node at this location: /opt/IBM/InformationServer/Server/Projects/simple/RT_BP1.O
3. Rename $APT_ORCHHOME/etc/distribute-component.example to $APT_ORCHHOME/etc/distribute-component and make the file executable:
chmod 755 $APT_ORCHHOME/etc/distribute-component

Data flow diagram

[Figure: input.ds, file sets fileset1 ... filesetN, and lookup tables table0.ds ... tableN.ds feed the transform operator, which writes output data sets, reject data sets, and output file sets (when the save suboption is used)]

transform: properties
Table 58. transform properties

Property                                        Value
Number of input data sets                       1 plus the number of lookup tables specified on the command line
Number of output data sets                      1 or more and, optionally, 1 or more reject data sets
Transfer behavior                               See Transfer Behavior
Execution mode                                  parallel by default, or sequential
Partitioning method                             any (parallel mode)
Collection method                               any (sequential mode)
Preserve-partitioning flag in output data set   propagated
Composite operator                              yes
Combinable operator                             yes

Transform: syntax and options


Terms in italic typeface are option strings you supply. When your option string contains a space or a tab character, you must enclose it in single quotes.
transform
  -fileset fileset_description
  -table -key field [ci | cs] [-key field [ci | cs] ...] [-allow_dups] [-save fileset_descriptor] [-diskpool pool] [-schema schema | -schemafile schema_file]
  [-argvalue job_parameter_name=job_parameter_value ...]
  [-collation_sequence locale | collation_file_pathname | OFF]
  [-expression expression_string | -expressionfile expressionfile_path]
  [-maxrejectlogs integer]
  [-sort [-input | -output [port] -key field_name sort_key_suboptions ...]
  [-part [-input | -output [port] -key field_name part_key_suboptions ...]
  [-flag {compile | run | compileAndRun} [flag_compilation_options]]
  [-inputschema schema | -inputschemafile schema_file]
  [-outputschema schema | -outputschemafile schema_file]
  [-reject [-rejectinfo reject_info_column_name_string]]

Where:

sort_key_suboptions are:
[-ci | -cs] [-asc | -desc] [-nulls {first | last}] [-param params]

part_key_suboptions are:
[-ci | -cs] [-param params]

flag_compilation_options are:
[-dir dir_name_for_compilation] [-name library_path_name] [-optimize | -debug] [-verbose] [-compiler cpath] [-staticobj absolute_path_name] [-sharedobj absolute_path_name] [-compileopt options] [-linker lpath] [-linkopt options] [-t options]

The -table and -fileset options allow you to use conditional lookups.

Note: The following option values can contain multi-byte Unicode values:
- the field names given to the -inputschema and -outputschema options and the ustring values
- -inputschemafile and -outputschemafile files
- -expression option string and the -expressionfile option filepath
- -sort and -part key-field names
- -compiler, -linker, and -dir pathnames
- -name file name
- -staticobj and -sharedobj pathnames
- -compileopt and -linkopt pathnames
Option: -argvalue
Use: -argvalue job_parameter_name=job_parameter_value
This option is similar to the -params top-level osh option, but the initialized variables apply to a transform operator rather than to an entire job. The global variable given by job_parameter_name is initialized with the value given by job_parameter_value. In your osh script, you reference the value with [&job_parameter_name]; the job_parameter_value component replaces each occurrence of [&job_parameter_name].

Option: -collation_sequence
Use: -collation_sequence locale | collation_file_pathname | OFF
This option determines how your string data is sorted. You can:
• Specify a predefined IBM ICU locale
• Write your own collation sequence using ICU syntax, and supply its collation_file_pathname
• Specify OFF so that string comparisons are made using Unicode code-point value order, independent of any locale or custom sequence.
By default, WebSphere DataStage sorts strings using byte-wise comparisons. For more information, reference this IBM ICU site: http://oss.software.ibm.com/icu/userguide/Collate_Intro.htm

Option: -expression
Use: -expression expression_string
This option lets you specify expressions written in the Transformation Language. The expression string might contain multi-byte Unicode characters. Unless you choose the -flag option with run, you must use either the -expression or -expressionfile option. The -expression and -expressionfile options are mutually exclusive.

Option: -expressionfile
Use: -expressionfile expression_file
This option lets you specify expressions written in the Transformation Language. The expression must reside in an expression_file; the value is the name and path of the file and might include multi-byte Unicode characters. Use an absolute path; otherwise the file is looked for in the current UNIX directory. Unless you choose the -flag option with run, you must choose either the -expression or -expressionfile option. The -expressionfile and -expression options are mutually exclusive.

Option: -flag
Use: -flag {compile | run | compileAndRun} suboptions
compile: This option indicates that you wish to check the Transformation Language expression for correctness, and compile it. An appropriate version of a C++ compiler must be installed on your computer. Field information used in the expression must be known at compile time; therefore, input and output schemas must be specified.
run: This option indicates that you wish to use a pre-compiled version of the Transformation Language code. You do not need to specify input and output schemas or an expression because these elements have been supplied at compile time. However, you must add the directory containing the pre-compiled library to your library search path. This is not done by the transform operator. You must also use the -name suboption to provide the name of the library where the pre-compiled code resides.
compileAndRun: This option indicates that you wish to compile and run the Transformation Language expression. This is the default value. An appropriate version of a C++ compiler must be installed on your computer.
You can supply schema information in the following ways:
• You can omit all schema specifications. The transform operator then uses the upstream operator's output schema as its input schema, and the schema for each output data set contains all the fields from the input record plus any new fields you create for a data set.
• You can omit the input data set schema, but specify schemas for all output data sets or for selected data sets. The transform operator then uses the upstream operator's output schema as its input schema. Any output schemas specified on the command line are used unchanged, and output data sets without schemas contain all the fields from the input record plus any new fields you create for a data set.
• You can specify an input schema, but omit all output schemas or omit some output schemas. The transform operator then uses the input schema as specified. Any output schemas specified on the command line are used unchanged, and output data sets without schemas contain all the fields from the input record plus any new fields you create for a data set.

The -flag option has the following suboptions:
-dir dir_name lets you specify a compilation directory. By default, compilation occurs in the TMPDIR directory or, if this environment variable does not point to an existing directory, in the /tmp directory. Whether you specify it or not, you must make sure the directory for compilation is in the library search path.
-name file_name lets you specify the name of the file containing the compiled code. If you use the -dir dir_name suboption, this file is in the dir_name directory.
The following examples show how to use the -dir and -name options in an osh command line:
For development:
osh transform -inputschema schema -outputschema schema -expression expression -flag compile -dir dir_name -name file_name
For your production machine:
osh ... | transform -flag run -name file_name | ...
The library file must be copied to the production machine.
-flag compile and -flag compileAndRun have these additional suboptions:
-optimize specifies the optimize mode for compilation.
-debug specifies the debug mode for compilation.
-verbose causes verbose messages to be output during compilation.
-compiler cpath lets you specify the compiler path when the compiler is not in the default directory. The default compiler path for each operating system is:
Solaris: /opt/SUNPRO6/SUNWspro/bin/CC
AIX: /usr/vacpp/bin/xlC_r
Tru64: /bin/cxx
HP-UX: /opt/aCC/bin/aCC
-staticobj absolute_path_name
-sharedobj absolute_path_name
These two suboptions specify the location of your static and dynamic-linking C-object libraries. The file suffix can be omitted. See External Global C-Function Support for details.
-compileopt options lets you specify additional compiler options. These options are compiler-dependent. Pathnames might contain multi-byte Unicode characters.
-linker lpath lets you specify the linker path when the linker is not in the default directory. The default linker path of each operating system is the same as the default compiler path listed above.
-linkopt options lets you specify link options to the compiler. Pathnames might contain multi-byte Unicode characters.

Option: -inputschema
Use: -inputschema schema
Use this option to specify an input schema. The schema might contain multi-byte Unicode characters. An error occurs if an expression refers to an input field not in the input schema. The -inputschema and the -inputschemafile options are mutually exclusive. The -inputschema option is not required when you specify compileAndRun or run for the -flag option; however, when you specify compile for the -flag option, you must include either the -inputschema or the -inputschemafile option. See the -flag option description in this table for information on the compile value.

Option: -inputschemafile
Use: -inputschemafile schema_file
Use this option to specify an input schema. An error occurs if an expression refers to an input field not in the input schema. To use this option, the input schema must reside in a schema_file, where schema_file is the name and path of the file and might contain multi-byte Unicode characters. You can use an absolute path; otherwise the file is looked for in the current UNIX directory. The -inputschemafile and the -inputschema options are mutually exclusive. The -inputschemafile option is not required when you specify compileAndRun or run for the -flag option; however, when you specify compile for the -flag option, you must include either the -inputschema or the -inputschemafile option. See the -flag option description in this table for information on the compile value.

Option: -maxrejectlogs
Use: -maxrejectlogs integer
An information log is generated every time a record is written to the reject output data set. Use this option to specify the maximum number of reject logs the transform operator generates. The default is 50. When you specify -1 for this option, an unlimited number of information logs are generated.

Option: -outputschema
Use: -outputschema schema
Use this option to specify an output schema. An error occurs if an expression refers to an output field not in the output schema. The -outputschema and -outputschemafile options are mutually exclusive. The -outputschema option is not required when you specify compileAndRun or run for the -flag option; however, when you specify compile for the -flag option, you must include either the -outputschema or the -outputschemafile option. See the -flag option description in this table for information on the compile value. For multiple output data sets, repeat the -outputschema or -outputschemafile option to specify the schema for all output data sets.

Option: -outputschemafile
Use: -outputschemafile schema_file
Use this option to specify an output schema. An error occurs if an expression refers to an output field not in the output schema. To use this option, the output schema must reside in a schema_file, where schema_file is the name and path of the file. You can use an absolute path; otherwise the file is looked for in the current UNIX directory. The -outputschemafile and the -outputschema options are mutually exclusive. The -outputschemafile option is not required when you specify compileAndRun or run for the -flag option; however, when you specify compile for the -flag option, you must include either the -outputschema or the -outputschemafile option. See the -flag option description in this table for information on the compile value. For multiple output data sets, repeat the -outputschema or -outputschemafile option to specify the schema for all output data sets.

Option: -part
Use: -part {-input | -output [port]} -key field_name [-ci | -cs] [-param params]
You can use this option 0 or more times. It indicates that the data is hash partitioned. The required field_name is the name of a partitioning key. Exactly one of the suboptions -input and -output [port] must be present. These suboptions determine whether partitioning occurs on the input data or the output data. The default for port is 0. If port is specified, it must be an integer which represents an output data set where the data is partitioned. The suboptions to the -key option are -ci for case-insensitive partitioning, or -cs for case-sensitive partitioning. The default is case-sensitive. Use the -param suboption to specify any property=value pairs. Separate the pairs by commas (,).

Option: -reject
Use: -reject [-rejectinfo reject_info_column_name_string]
This is optional. You can use it only once. When a null field is used in an expression, this option specifies that the input record containing the field is not dropped, but is sent to the output reject data set. The -rejectinfo suboption specifies the column name for the reject information.

Option: -sort
Use: -sort {-input | -output [port]} -key field_name [-ci | -cs] [-asc | -desc] [-nulls {first | last}] [-param params]
You can use this option 0 or more times. It indicates that the data is sorted for each partition. The required field_name is the name of a sorting key. Exactly one of the suboptions -input and -output [port] must be present. These suboptions determine whether sorting occurs on the input data or the output data. The default for port is 0. If port is specified, it must be an integer that represents the output data set where the data is sorted. You can specify -ci for a case-insensitive sort, or -cs for a case-sensitive sort. The default is case-sensitive. You can specify -asc for an ascending order sort or -desc for a descending order sort. The default is ascending. You can specify -nulls {first | last} to determine where null values should sort. The default is that nulls sort first. You can use -param params to specify any property=value pairs. Separate the pairs by commas (,).

Option: -table
Use: -table -key field [-ci | -cs] [-key field [-ci | -cs] ...] [-allow_dups] [-save fileset_descriptor] [-diskpool pool] [-schema schema | -schemafile schema_file]
Specifies the beginning of a list of key fields and other specifications for a lookup table. The first occurrence of -table marks the beginning of the key field list for lookup table1; the next occurrence of -table marks the beginning of the key fields for lookup table2, and so on. For example:
lookup -table -key field -table -key field
The -key option specifies the name of a lookup key field. The -key option must be repeated if there are multiple key fields. You must specify at least one key for each table. You cannot use a vector, subrecord, or tagged aggregate field as a lookup key.
The -ci suboption specifies that the string comparison of lookup key values is to be case-insensitive; the -cs suboption specifies case-sensitive comparison, which is the default.
In create-only mode, the -allow_dups option causes the operator to save multiple copies of duplicate records in the lookup table without issuing a warning. Two lookup records are duplicates when all lookup key fields have the same value in the two records. If you do not specify this option, WebSphere DataStage issues a warning message when it encounters duplicate records and discards all but the first of the matching records. In normal lookup mode, only one lookup table (specified by either -table or -fileset) can have been created with -allow_dups set.
The -save option lets you specify the name of a fileset to write this lookup table to; if -save is omitted, tables are written as scratch files and deleted at the end of the lookup. In create-only mode, -save is required.
The -diskpool option lets you specify a disk pool in which to create lookup tables. By default, the operator looks first for a lookup disk pool, then uses the default pool (""). Use this option to specify a different disk pool to use.
The -schema suboption specifies the schema that interprets the contents of the string or raw fields by converting them to another data type. The -schemafile suboption specifies the name of a file containing the schema that interprets the content of the string or raw fields by converting them to another data type. Specify either -schema or -schemafile, not both. One of them is required when the -flag option is set to compile, but neither is required for compileAndRun or run.

Option: -fileset
Use: [-fileset fileset_descriptor ...]
Specify the name of a fileset containing one or more lookup tables to be matched. In lookup mode, you must specify either the -fileset option, or a table specification, or both, in order to designate the lookup table(s) to be matched against. There can be zero or more occurrences of the -fileset option. It cannot be specified in create-only mode.
Warning: The fileset already contains key specifications. When you follow -fileset fileset_descriptor by key_specifications, the keys specified do not apply to the fileset; rather, they apply to the first lookup table. For example:
lookup -fileset file -key field
is the same as:
lookup -fileset file -table -key field

Transfer behavior
You can transfer your input fields to your output fields using any one of the following methods:
• Set the value of the -flag option to compileAndRun. For example:
osh "... | transform -expression expression -flag compileAndRun -dir dir_name -name file_name | ..."
• Use schema variables as part of the schema specification. A partial schema might be used for both the input and output schemas. This example shows a partial schema in the output:
osh "transform -expression expression -inputschema record(a:int32;b:string[5];c:time) -outputschema record(d:dfloat;outRec:*;) -flag compile ..."
where the schema for output 0 is:
record(d:dfloat;a:int32;b:string[5];c:time)
This example shows partial schemas in the input and the output:
osh "transform -expression expression -inputschema record(a:int32;b:string[5];c:time;Inrec:*) -outputschema record(d:dfloat;outRec:*;) -flag compile ..."
osh "... | transform -flag run ... | ..."
Output 0 contains the fields d, a, b, and c, plus any fields propagated from the upstream operator.
• Use name matching between input and output fields in the schema specification. When input and output field names match and no assignment is made to the output field, the input field is transferred to the output data set unchanged. Any input field which does not have a corresponding output field is dropped. For example:
osh "transform -expression expression -inputschema record(a:int32;b:string[5];c:time) -outputschema record(a:int32;) -outputschema record(a:int32;b:string[5];c:time) -flag compile ..."
Field a is transferred from input to output 0 and output 1. Fields b and c are dropped in output 0, but are transferred from input to output 1.
• Specify a reject data set. In the Transformation Language, it is generally illegal to use a null field in expressions except in the following cases:
  - In function calls to notnull(field_name) and null(field_name)
  - In an assignment statement of the form a=b where a and b are both nullable and b is null
  - In these expressions:
if (null(a)) b=a else b=a+1
if (notnull(a)) b=a+1 else b=a
b=null(a)?a:a+1;
b=notnull(a)?a+1:a;
If a null field is used in an expression in other than these cases and a reject set is specified, the whole input record is transferred to the reject data set.

The transformation language


The Transformation Language is a subset of C, with extensions specific to dealing with records.

General structure
As in C, statements must be terminated by semi-colons and compound statements must be grouped by braces. Both C and C++ style comments are allowed.

Names and keywords


Names of fields in records, local variable names, and language keywords can consist of alphanumeric characters plus the underscore character. They cannot begin with a numeric character. Names in the Transformation Language are case-sensitive but keywords are case-insensitive. The available keywords fall into five groups:
• The keyword extern is used to declare global C functions. See External Global C-Function Support below.
• The keywords global, initialize, mainloop, and finish mark blocks of code that are executed at different stages of record processing. An explanation of these keywords is in Code Segmentation Keywords.
• The keywords droprecord, writerecord, and rejectrecord control record processing. See Record Processing Control.
• The keywords inputname and outputname are used to declare data set aliases. See Specifying Data Sets.
• The tablename keyword is used to identify lookup tables by name. See Specifying Lookup Tables.
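The naming rule above can be sketched as a small C check. This is only an illustration: is_valid_name is a hypothetical helper, not part of the product, and it assumes plain ASCII names even though field names can contain multi-byte Unicode characters.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (an assumption for illustration): returns true when
 * `name` is non-empty, contains only ASCII letters, digits, and
 * underscores, and does not begin with a digit. */
bool is_valid_name(const char *name)
{
    if (name == NULL || name[0] == '\0')
        return false;
    if (isdigit((unsigned char)name[0]))
        return false;                 /* names cannot begin with a numeric character */
    for (size_t i = 0; name[i] != '\0'; i++) {
        unsigned char ch = (unsigned char)name[i];
        if (!isalnum(ch) && ch != '_')
            return false;             /* only alphanumerics and underscore are allowed */
    }
    return true;
}
```

Under these assumptions, is_valid_name("field_a") is true, while is_valid_name("2nd_field") is false.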

External global C-function support


Standard C functions are supported in the Transformation Language. Declare them in your expression file using the extern keyword and place them before your code segmentation keywords. The syntax for an external C function declaration is:
extern return_type function_name ([argument_type argument_name, ...]);

Here is an expression file fragment that incorporates external C-function declarations:


// externs this C function: int my_times(int x, int y) { ... }
extern int32 my_times(int32 x, int32 y);
// externs this C function: void my_print_message(char *msg) { ... }
extern void my_print_message(string msg);
inputname 0 in0;
outputname 0 out0;
mainloop {
  ...
}
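For orientation, the C side of the two declarations might look like the sketch below. The function bodies are assumptions made for illustration; the guide elides them, and the real functions are whatever your library provides.

```c
#include <stdio.h>

/* Corresponds to: extern int32 my_times(int32 x, int32 y);
 * The body is a hypothetical example; the guide does not show it. */
int my_times(int x, int y)
{
    return x * y;
}

/* Corresponds to: extern void my_print_message(string msg);
 * A schema string arrives as a char *. Writing to stderr is an
 * assumption made for this sketch. */
void my_print_message(char *msg)
{
    if (msg != NULL)
        fprintf(stderr, "%s\n", msg);
}
```

The C signatures use int and char * because those are the native types associated with the int32 and string schema types.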

C function schema types and associated C types


The C function return and argument types can be any of the WebSphere DataStage schema types listed below with their associated C types.
Schema Type    Associated Native C Type
int8    signed char
uint8    unsigned char
int16    short
uint16    unsigned short
int32    int
uint32    unsigned int
int64    long long for Solaris and AIX
uint64    unsigned long long for Solaris and AIX
sfloat    float
dfloat    double
string    char *
void    void

Specifying the location of your C libraries


To specify the locations of your static and dynamically-linked libraries, use the -staticobj and -sharedobj suboptions of the -flag option. These two suboptions take absolute path names as values. The file suffix is optional. The syntax is:
-staticobj absolute_path_name -sharedobj absolute_path_name

An example static library specification is:


-flag compile -name generate_statistics -staticobj /external_functions/static/part_statistics.o

An example dynamic library specification is:


-flag compile ... -sharedobj /external_functions/dynamic/generate

The shared object file name has lib prepended to it and has a platform-dependent object-file suffix: .so for Sun Solaris and Linux; .sl for HP-UX, and .o for AIX. The file must reside in this directory:
/external_functions/dynamic
For this example, the object filepath on Solaris is:
/external_functions/dynamic/libgenerate.so
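The naming convention for shared objects can be sketched in C as follows. shared_object_path is a hypothetical helper written for this illustration; it only mimics the rule described above (prepend lib to the file name, append the platform suffix) and is not how the operator itself is implemented.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: given the value passed to -sharedobj (an absolute
 * path without suffix) and a platform suffix such as ".so", writes
 * "<dir>/lib<name><suffix>" into out. Returns 0 on success, -1 when the
 * input has no directory or file-name part or when out is too small. */
int shared_object_path(const char *sharedobj, const char *suffix,
                       char *out, size_t out_size)
{
    const char *slash;
    int written;

    if (sharedobj == NULL || suffix == NULL || out == NULL)
        return -1;
    slash = strrchr(sharedobj, '/');          /* last path separator */
    if (slash == NULL || slash[1] == '\0')
        return -1;
    written = snprintf(out, out_size, "%.*s/lib%s%s",
                       (int)(slash - sharedobj), sharedobj,  /* directory part */
                       slash + 1, suffix);                   /* file name and suffix */
    if (written < 0 || (size_t)written >= out_size)
        return -1;                            /* output was truncated */
    return 0;
}
```

With the -sharedobj value /external_functions/dynamic/generate and the suffix .so, the helper produces /external_functions/dynamic/libgenerate.so.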

Dynamically-linked libraries must be manually deployed to all running nodes. Add the library-file locations to your library search path. See Example 8: External C Function Calls for an example job that includes C header and source files, a Transformation Language expression file with calls to external C functions, and an osh script.

Code segmentation keywords


The Transformation Language provides keywords to specify when code is executed. Refer to Example 1: Student-Score Distribution for an example of how to use these keywords.
• global {job_parameters} Use this syntax to declare a set of global variables whose values are supplied by osh parameters. Values cannot be set with the Transformation Language. A warning message is issued if a value is missing.
• initialize {statements} Use this syntax to mark a section of code you want to be executed once before the main record loop starts. Global variables whose values are not given through osh parameters should be defined in this segment.
• mainloop {statements} Use this syntax to indicate the main record loop code. The mainloop segment is executed once for each input record.
• finish {statements} Use this syntax to mark a section of code you want to be executed once after the main record loop terminates.
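The execution order of the segments can be pictured with a plain C analogy. The sketch below is not Transformation Language code: the struct, the function names, and the threshold parameter (standing in for a value supplied as a job parameter) are all assumptions chosen for illustration.

```c
#include <stddef.h>

/* State that outlives individual records, comparable to variables
 * defined in the initialize segment. */
struct score_stats {
    int records_seen;
    int high_scores;
    double high_fraction;   /* computed once, in the finish step */
};

/* Analogue of initialize {}: runs once before the first record. */
void stats_initialize(struct score_stats *st)
{
    st->records_seen = 0;
    st->high_scores = 0;
    st->high_fraction = 0.0;
}

/* Analogue of mainloop {}: runs once for each input record. */
void stats_mainloop(struct score_stats *st, int score, int threshold)
{
    st->records_seen++;
    if (score >= threshold)
        st->high_scores++;
}

/* Analogue of finish {}: runs once after the last record. */
void stats_finish(struct score_stats *st)
{
    if (st->records_seen > 0)
        st->high_fraction = (double)st->high_scores / st->records_seen;
}

/* Driver that applies the three steps in the documented order. */
void run_segments(struct score_stats *st, const int *scores,
                  size_t count, int threshold)
{
    stats_initialize(st);
    for (size_t i = 0; i < count; i++)
        stats_mainloop(st, scores[i], threshold);
    stats_finish(st);
}
```

The point of the analogy is only the ordering: setup happens once, per-record work is repeated, and summary work happens once at the end.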

Record processing control


The transform operator processes one input record at a time, generating zero or any number of output records and zero or one reject record for each input record, terminating when there are no more input records to process. The transform operator automatically reads records by default. You do not need to specify this action. The Transformation Language lets you control the input and output of records with the following keywords.
• writerecord n; Use this syntax to force an output record to be written to the specific data set whose port number is n.
• droprecord; Use this syntax to prevent the current input record from being written.
• rejectrecord; If you declare a reject data set, you can use this syntax to direct the current input record to it. You should only send a record to the reject data set if it is not going to another output data set.
Note: Processing exceptions, such as null values for non-nullable fields, cause a record to be written to the reject data set if you have specified one. Otherwise the record is simply dropped.

Specifying lookup tables


You specify a lookup table using the tablename keyword. This name corresponds to a lookup table object of the same name. A lookup table can come from an input to the operator or from a fileset; therefore, the order of parameters on the command line is used to determine the number associated with the table.

The name of any field in the lookup schema, other than key fields, can be used to access the field value, such as table1.field1. If a field is accessed when is_match() returns false, the value of the field is null if it is nullable; otherwise it has its default value. Here is an example of lookup table usage:
transform -expressionfile trx1 -table -key a -fileset sKeyTable.fs < dataset.v < table.v > target.v
trx1:
inputname 0 in1;
outputname 0 out0;
tablename 0 tbl1;
tablename 1 sKeyTable;
mainloop {
  // This code demonstrates the interface without doing anything really
  // useful
  int nullCount;
  nullCount = 0;
  lookup(sKeyTable);
  if (!is_match(sKeyTable)) // if there's no match
  {
    lookup(tbl1);
    if (!is_match(tbl1))
    {
      out0.field2 = "missing";
    }
  }
  else
  {
    // Loop through the results
    while (is_match(sKeyTable))
    {
      if (is_null(sKeyTable.field1))
      {
        nullCount++;
      }
      next_match(sKeyTable);
    }
  }
  writerecord 0;
}

Specifying data sets


By default, the transform operator supports a single input data set, one or more output data sets, and a single optional reject data set. There is no default correspondence between input and output. You must use writerecord port to specify where you want your output sent. You can assign a name to each data set for unambiguous reference, using this syntax:
inputname 0 input-dataset-name;
outputname n output-dataset-name;

Because the transform operator accepts only a single input data set, the data set number for inputname is 0. You can specify 0 through (the number of output data sets - 1) for the outputname data set number. For example:
inputname 0 input-grades;
outputname 0 output-low-grades;
outputname 1 output-high-grades;

Data set numbers cannot be used to qualify field names. You must use the inputname and outputname data set names to qualify field names in your Transformation Language expressions. For example:
output-low-grades.field-a = input-grades.field-a + 10;
output-high-grades.field-a = output-low-grades.field-a - 10;

Field names that are not qualified by a data set name always default to output data set 0. It is good practice to use the inputname data set name to qualify input fields in expressions, and use the outputname data set name to qualify output fields even though these fields have unique names among all data sets. The Transformation Language does not attempt to determine if an unqualified, but unique, name exists in another data set. The inputname and outputname statements must appear first in your Transformation Language code. For an example, see the Transformation Language section of Example 2: Student-Score Distribution With a Letter Grade Added to Example 1.

Data types and record fields


The Transformation Language supports all legal WebSphere DataStage schemas and all record types. Input and output fields can only be defined within the input/output schemas. You must define them using the operator options, not through transformation expressions. Refer to Syntax and Options for the details of the transform operator options. You can reference input and output data set fields by name. Use the normal WebSphere DataStage dot notation (for example, s.field1) for references to subrecord fields. Note that language keywords are not reserved, so field names can be the same as keywords if they are qualified by data set names, for example, in0.fielda. Fields might appear in expressions. Fields that appear on the left side of an assignment statement must be output fields. New values cannot be assigned to input fields. The fieldtype, or data type of a field, can be any legal WebSphere DataStage data type. Fieldtypes can be simple or complex. The table lists the simple field types. The complex field types follow.
Data Type    Forms    Meaning
integer    int8, int16, int32, int64    1, 2, 4, and 8-byte signed integers
integer    uint8, uint16, uint32, uint64    1, 2, 4, and 8-byte unsigned integers
floating point    sfloat    Single-precision floating point
floating point    dfloat    Double-precision floating point
string    string    Variable-length string
string    string[max=n_codepoint_units]    Variable-length string with upper bound on length
string    string[n_codepoint_units]    Fixed-length string
ustring    ustring
ustring    ustring[max=n_codepoint_units]
ustring    ustring[n_codepoint_units]
decimal    decimal[p]    Decimal value with p (precision) digits. p must be between 1 and 255 inclusive.
decimal    decimal[p, s]    Decimal value with p digits and s (scale) digits to the right of the decimal point. p must be between 1 and 255 inclusive, and s must be between 0 and p inclusive.
date and time    date    Date with year, month, and day
date and time    time    Time with one second resolution
date and time    time[microseconds]    Time with one microsecond resolution
date and time    timestamp    Date/time with one second resolution
date and time    timestamp[microseconds]    Date/time with one microsecond resolution
raw    raw    Variable length binary data.
raw    raw[max=n]    Variable length binary data with at most n bytes.
raw    raw[n]    Fixed length (n-byte) binary data.
raw    raw[align=k]    Variable length binary data, aligned on k-byte boundary (k = 1, 2, 4, or 8).
raw    raw[max=n, align=k]    Variable length binary data with at most n bytes, aligned on k-byte boundary (k = 1, 2, 4, or 8)
raw    raw[n, align=k]    Fixed length (n-byte) binary data, aligned on k-byte boundary (k = 1, 2, 4, or 8)
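The numeric constraints stated in the table (decimal precision and scale, raw alignment) can be expressed as simple checks. The two C functions below are hypothetical helpers written for illustration; they encode only the limits quoted in the table and are not part of the product.

```c
#include <stdbool.h>

/* decimal[p, s]: p must be between 1 and 255 inclusive,
 * and s must be between 0 and p inclusive. */
bool decimal_params_valid(int precision, int scale)
{
    return precision >= 1 && precision <= 255 &&
           scale >= 0 && scale <= precision;
}

/* raw[align=k]: k must be 1, 2, 4, or 8. */
bool raw_alignment_valid(int k)
{
    return k == 1 || k == 2 || k == 4 || k == 8;
}
```

For example, decimal[10, 2] satisfies the limits, while decimal[5, 6] does not because the scale exceeds the precision.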

WebSphere DataStage supports the following complex field types:
• vector fields
• subrecord fields
• tagged fields
Note: Tagged fields cannot be used in expressions; they can only be transferred from input to output.

Local variables
Local variables are used for storage apart from input and output records. You must declare and initialize them before use within your transformation expressions. The scope of local variables differs depending on which code segment defines them:
• Local variables defined within the global and initialize code segments can be accessed before, during, and after record processing.
• Local variables defined in the mainloop code segment are only accessible for the current record being processed.
• Local variables defined in the finish code segment are only accessible after all records have been processed.
Local variables can represent any of the simple value types:
• int8, uint8, int16, uint16, int32, uint32, int64, uint64
• sfloat, dfloat
• decimal
• string
• date, time, timestamp
• raw

Declarations are similar to C, as in the following examples:
• int32 a[100]; declares a to be an array of 100 32-bit integers
• dfloat b; declares b to be a double-precision float
• string c; declares c to be a variable-length string
• string[n] e; declares e to be a string of length n
• string[n] f[m]; declares f to be an array of m strings, each of length n
• decimal[p] g; declares g to be a decimal value with p (precision) digits
• decimal[p, s] h; declares h to be a decimal value with p (precision) digits and s (scale) digits to the right of the decimal point
You cannot initialize variables as part of the declaration. They can only be initialized on a separate line. For example:
int32 a;
a = 0;

The result is uncertain if a local variable is used without being initialized. There are no local variable pointers or structures, but you can use arrays.

Expressions
The Transformation Language supports standard C expressions, with the usual operator precedence and use of parentheses for grouping. It also supports field names as described in Data types and record fields , where the field name is specified in the schema for the data set.

Language elements
The Transformation Language supports the following elements:
• Integer, character, floating point, and string constants
• Local variables
• Field names
• Arithmetic operators
• Function calls
• Flow control
• Record processing control
• Code segmentation
• Data set name specification

Note that there are no date, time, or timestamp constants.

operators
The Transformation Language supports several unary operators, which all apply only to simple value types.
Symbol  Name                      Applies to  Comments
~       Ones complement           Integer     ~a returns an integer with the value of each bit reversed
!       Complement                Integer     !a returns 1 if a is 0; otherwise returns 0
+       Unary plus                Numeric     +a returns a
-       Unary minus               Numeric     -a returns the negative of a
++      Incrementation operator   Integer     a++ or ++a returns a + 1
--      Decrementation operator   Integer     a-- or --a returns a - 1
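The unary operators in the table above behave like their C counterparts, so their effect can be checked in plain C. The following is a minimal sketch, not Transformation Language code: the function names are hypothetical, and int32_t stands in for the int32 type on the assumption that integers are stored in two's complement form.

```c
#include <stdint.h>

/* Hypothetical helper names; each wraps one unary operator from the table. */

/* ~a: every bit of a is reversed (two's complement assumed, so ~0 is -1). */
int32_t ones_complement(int32_t a) { return ~a; }

/* !a: 1 if a is 0, otherwise 0. */
int32_t complement(int32_t a) { return !a; }

/* -a: the negative of a. */
int32_t unary_minus(int32_t a) { return -a; }

/* a++ used as a statement: a holds a + 1 afterwards. */
int32_t after_increment(int32_t a) { a++; return a; }

/* a-- used as a statement: a holds a - 1 afterwards. */
int32_t after_decrement(int32_t a) { a--; return a; }
```

The increment and decrement helpers use the operators as standalone statements, which sidesteps the difference between prefix and postfix results in C.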

The Transformation Language supports a number of binary operators, and one ternary operator.
Symbol  Name                      Applies to                      Comments
+       Addition                  Numeric
-       Subtraction               Numeric
*       Multiplication            Numeric
/       Division                  Numeric
%       Modulo                    Integers                        a % b returns the remainder when a is divided by b
<<      Left shift                Integer                         a << b returns a left-shifted b-bit positions
>>      Right shift               Integer                         a >> b returns a right-shifted b-bit positions
==      Equals                    Any; a and b must be numeric    a == b returns 1 (true) if a equals b and 0 (false) otherwise.
                                  or of the same data type
<       Less than                 Same as ==                      a < b returns 1 if a is less than b and 0 otherwise. (See the note below the table.)
>       Greater than              Same as ==                      a > b returns 1 if a is greater than b and 0 otherwise. (See the note below the table.)
<=      Less than or equal to     Same as ==                      a <= b returns 1 if a < b or a == b, and 0 otherwise. (See the note below the table.)
>=      Greater than or equal to  Same as ==                      a >= b returns 1 if a > b or a == b, and 0 otherwise. (See the note below the table.)
!=      Not equals                Same as ==                      a != b returns 1 if a is not equal to b, and 0 otherwise.
^       Bitwise exclusive OR      Integer                         a ^ b returns an integer with bit value 1 in each bit position where the
                                                                  bit values of a and b differ, and a bit value of 0 otherwise.
&       Bitwise AND               Integer                         a & b returns an integer with bit value 1 in each bit position where the
                                                                  bit values of a and b are both 1, and a bit value of 0 otherwise.
|       Bitwise (inclusive) OR    Integer                         a | b returns an integer with a bit value 1 in each bit position where the
                                                                  bit value a or b (or both) is 1, and 0 otherwise.
&&      Logical AND               Any; a and b must be numeric    a && b returns 0 if either a == 0 or b == 0 (or both), and 1 otherwise.
                                  or of the same data type
||      Logical OR                Any; a and b must be numeric    a || b returns 1 if either a != 0 or b != 0 (or both), and 0 otherwise.
                                  or of the same data type
+       Concatenation             String                          a + b returns the string consisting of substring a followed by substring b.
?:      Ternary conditional                                       The ternary operator lets you write a conditional expression without using
                                                                  the if...else keyword. a ? b : c returns the value of b if a is true
                                                                  (non-zero) and the value of c if a is false.
=       Assignment                Any scalar; a and b must be     a = b places the value of b into a. Also, you can use = to do default
                                  numeric, numeric strings, or    conversions among integers, floats, decimals, and numeric strings.
                                  of the same data type

Note: For the <, >, <=, and >= operators, if a and b are strings, lexicographic order is used. If a and b are date, time, or timestamp, temporal order is used.

The expression a * b * c evaluates as (a * b) * c. We describe this by saying that multiplication has left-to-right associativity. The expression a + b * c evaluates as a + (b * c). We describe this by saying that multiplication has higher precedence than addition.

The following table describes the precedence and associativity of the Transformation Language operators. Operators listed in the same row of the table have the same precedence, and you use parentheses to force a particular order of evaluation. Operators in a higher row have a higher order of precedence than operators in a lower row.
Table 59. Precedence and Associativity of Operators
Operators                 Associativity
() []                     left to right
! ~ ++ -- + - (unary)     right to left
* / %                     left to right
+ - (binary)              left to right
<< >>                     left to right
< <= > >=                 left to right
== !=                     left to right
&                         left to right
^                         left to right
|                         left to right
&&                        left to right
|| :                      left to right for ||, right to left for :
?                         right to left
=                         right to left
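Because the text states that the Transformation Language uses standard C expressions with the usual precedence, the rules in the table can be tried out in plain C. The sketch below is an illustration under that assumption, not Transformation Language code; all function names are hypothetical and int32_t stands in for int32.

```c
#include <stdint.h>

/* Bitwise operators on integers. */
int32_t bit_and(int32_t a, int32_t b) { return a & b; }   /* 1 where both bits are 1 */
int32_t bit_or(int32_t a, int32_t b)  { return a | b; }   /* 1 where either bit is 1 */
int32_t bit_xor(int32_t a, int32_t b) { return a ^ b; }   /* 1 where the bits differ */

/* Precedence: * binds tighter than +, so this is a + (b * c). */
int32_t mul_before_add(int32_t a, int32_t b, int32_t c) { return a + b * c; }

/* Left-to-right associativity: this is (a - b) - c, not a - (b - c). */
int32_t sub_left_to_right(int32_t a, int32_t b, int32_t c) { return a - b - c; }

/* Right-to-left associativity of the ternary operator:
   parsed as a > 0 ? 1 : (a < 0 ? -1 : 0). */
int32_t sign_of(int32_t a) { return a > 0 ? 1 : a < 0 ? -1 : 0; }

/* Right-to-left associativity of assignment: parsed as x = (y = v). */
int32_t chained_assign(int32_t v)
{
    int32_t x;
    int32_t y;
    x = y = v;
    return x + y;
}
```

When an expression mixes operators from different rows, adding parentheses as in the comments above makes the intended grouping explicit.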

Conditional Branching
The Transformation Language provides facilities for conditional branching. The following sections describe constructs available for conditional branching.

if ... else
if (expression) statement1 else statement2;

If expression evaluates to a non-zero value (true) then statement1 is executed. If expression evaluates to 0 (false) then statement2 is executed. Both statement1 and statement2 can be compound statements. You can omit else statement2. In that case, if expression evaluates to 0 the if statement has no effect. Sample usage:
if (a < b)
    abs_difference = b - a;
else
    abs_difference = a - b;

This code sets abs_difference to the absolute value of b - a.

For Loop


for ( expression1 ; expression2; expression3) statement;

The order of execution is:
1. expression1. It is evaluated only once to initialize the loop variable.
2. expression2. If it evaluates to false, the loop terminates; otherwise, these are executed in order:
   statement
   expression3
   Control then returns to 2.

A sample usage is:


sum = 0;
sum_squares = 0;
for (i = 1; i <= n; i++)
{
    sum = sum + i;
    sum_squares = sum_squares + i*i;
}

This code sets sum to the sum of the first n integers and sum_squares to the sum of the squares of the first n integers.

While Loop
while ( expression ) statement ;

In a while loop, statement, which might be a compound statement, is executed repeatedly as long as expression evaluates to true. A sample usage is:
sum = 0;
i = 0;
while ((i < n) && (a[i] >= 0))
{
    sum = sum + a[i];
    i++;
}

This sums the array elements a[0] through a[n-1], stopping early if a negative array element is encountered.

Break
The break command causes a for or while loop to exit immediately. For example, the following code does the same thing as the while loop shown immediately above:
sum = 0;
for (i = 0; i < n; i++)
{
    if (a[i] >= 0)
        sum = sum + a[i];
    else
        break;
}
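The claim that the while loop and the break-based for loop compute the same result can be checked with a small C sketch. This is an illustration only, assuming the loop constructs behave as in C. The function names and the pointer parameters are hypothetical and exist only so the loops can be called from C; the Transformation Language itself has no pointers. The while version tests i < n before reading a[i], so that a[n] is never accessed.

```c
#include <stdint.h>

/* While-loop form: stop at the first negative element or at the end. */
int32_t sum_until_negative_while(const int32_t *a, int32_t n)
{
    int32_t sum = 0;
    int32_t i = 0;
    while ((i < n) && (a[i] >= 0)) {
        sum = sum + a[i];
        i++;
    }
    return sum;
}

/* For-loop form: break leaves the loop at the first negative element. */
int32_t sum_until_negative_break(const int32_t *a, int32_t n)
{
    int32_t sum = 0;
    int32_t i;
    for (i = 0; i < n; i++) {
        if (a[i] >= 0)
            sum = sum + a[i];
        else
            break;
    }
    return sum;
}
```

For an input such as 3, 4, -1, 5, both forms stop at the third element and return 7.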

Continue
The continue command is related to the break command, but used less often. It causes control to jump to the top of the loop. In the while loop, the test part is executed immediately. In a for loop, control passes to the increment step. If you want to sum all positive array entries in the array a[n], you can use the continue statement as follows:
sum = 0;
for (i = 0; i < n; i++)
{
    if (a[i] <= 0)
        continue;
    sum = sum + a[i];
}

This example could easily be written using an else statement rather than a continue statement. The continue statement is most useful when the part of the loop that follows is complicated, to avoid nesting the program too deeply.
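The equivalence between the continue form and the else form can be sketched in plain C, on the assumption that continue and if...else behave as in C. The function names and pointer parameters below are hypothetical and serve only to make the loops callable; they are not part of the Transformation Language.

```c
#include <stdint.h>

/* continue form: skip non-positive entries and go to the increment step. */
int32_t sum_positive_continue(const int32_t *a, int32_t n)
{
    int32_t sum = 0;
    int32_t i;
    for (i = 0; i < n; i++) {
        if (a[i] <= 0)
            continue;
        sum = sum + a[i];
    }
    return sum;
}

/* else form: the same test, with the addition moved into the else branch. */
int32_t sum_positive_else(const int32_t *a, int32_t n)
{
    int32_t sum = 0;
    int32_t i;
    for (i = 0; i < n; i++) {
        if (a[i] <= 0) {
            /* nothing to add for this entry */
        } else {
            sum = sum + a[i];
        }
    }
    return sum;
}
```

With a short loop body the two forms are equally readable; the continue form pays off when the code after the test is long, because it avoids one level of nesting.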

Built-in functions
This section defines functions that are provided by the Transformation Language. It is presented in a series of tables that deal with data transformation functions of the following types:
- Lookup table functions
- Data conversion functions
- Mathematical functions
- String field functions
- Ustring field functions
- Bit manipulation functions
- Job monitoring functions
- Miscellaneous functions

When a function generates an output value, it returns the result. For functions with optional arguments, simply omit the optional argument to accept the default value. Default conversions among integer, float, decimal, and numeric string types are supported in the input arguments and the return value of the function. All integers can be signed or unsigned. The transform operator has default NULL handling at the record-level with individual field overrides. Options can be entered at the record level or the field level.

Lookup table functions


Function                        Description
lookup( lookup_table )          Performs a lookup on the table using the current input record. It fills the
                                current record of the lookup table with the first record found. If a match is
                                not found, the current record is empty. If this is called multiple times on
                                the same record, the record is filled with the current match if there is one
                                and a new lookup will not be done.
next_match( lookup_table )      Gets the next record matched in the lookup and puts it into the current
                                record of the table.
clear_lookup( lookup_table )    Checks to see if the current lookup record has a match. If this method
                                returns false directly after the lookup() call, no matches were found in the
                                table. Returns a boolean value specifying whether the record is empty or not.
int8 is_match( lookup_table )   Checks to see if the current lookup record has a match. If this method
                                returns false directly after the lookup() call, no matches were found in the
                                table. Returns a boolean value specifying whether the record is empty or not.

Data conversion functions


date field functions WebSphere DataStage performs no automatic type conversion of date fields. Either an input data set must match the operator interface or you must effect a type conversion by means of the transform or modify operator. A date conversion to or from a numeric field can be specified with any WebSphere DataStage numeric data type. WebSphere DataStage performs the necessary modifications and either translates a numeric field to the source data type or translates a conversion result to the numeric data type of the destination. For example, you can use th