Monday, January 31, 2011

Complex Flat File Stages

The Complex Flat File stage lets you convert data extracted from complex flat files that are generated on an IBM® mainframe. A complex flat file has hierarchical structure in its arrangement of columns. It is physically flat (that is, it has no pointers or other complicated infrastructure), but logically represents parent-child relationships. You can use multiple record types to achieve this hierarchical structure.

Recognizing a Hierarchical Structure

For example, use records with various structures for different types of information, such as an 'E' record for static employee information and an 'S' record for employee monthly payroll information, or for repeating groups of information (twelve months of revenue). You can also combine these record groupings, and in the case of repeating data, you can flatten nested OCCURS groups.

Managing Repeating Groups and Internal Structures

You can easily load, manage, and use repeating groups and internal record structures such as GROUP fields and OCCURS. You can ignore GROUP data columns that are displayed as raw data and have no logical use for most applications. The metadata can be flattened into a normalized set of columns at load time, so that no arrays exist at run time.

Selecting subsets of columns

You can select a subset of columns from a large COBOL File Description (CFD). This filtering process results in performance gains since the stage no longer parses and processes hundreds of columns if you only need a few.

Complex flat files can also include legacy data types.

Output Links

The Complex Flat File stage supports multiple outputs. An output link specifies the data you are extracting, which is a stream of rows to be read.

When using the Complex Flat File stage to process a large number of columns, for example, more than 300, use only one output link in your job. This dramatically improves the performance of the GUI when loading, saving, or building these columns. Having more than one output link causes a save or load sequence each time you change tabs.

The Complex Flat File stage does not support reference lookup capability or input links.

Configuring the dsenv file

The dsenv file contains a series of shell arguments that are used when the engine starts. Interactive users, other programs, and scripts can use the dsenv file. For some ODBC connections, plug-ins, and connectors, and for interactions with external applications such as IBM® WebSphere® MQ, you must add environment variables to enable interactive use of ODBC drivers to make a connection to an ODBC data source. This lets IBM InfoSphere™ DataStage® inherit the correct environment for ODBC connections.

Before you begin

You must be logged in on the server with the operating system credentials of an InfoSphere DataStage administrator.

Back up the dsenv file before you edit it.

For a connection that uses a wire protocol driver, you do not have to modify the dsenv file.


  1. Edit the dsenv file. The file is located in $DSHOME, where $DSHOME identifies the InfoSphere DataStage installation directory. The default directory is /opt/IBM/InformationServer/Server/DSEngine.
  2. Specify the following information in the dsenv file:
    • Environment variables for the database client software
    • Database home location
    • Database library directory
    Table 1. Names of the library path environment variable, by operating system

    Operating system          Library path environment variable
    HP-UX on Intel Itanium    LD_LIBRARY_PATH

    The following examples show typical entries for commonly used databases. The entries vary slightly depending on your operating system. See the data source documentation for more information.

    Sybase 11
    LANG=
    export LANG
    SYBASE=/export/home/sybase/sybase
    export SYBASE
    LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$SYBASE/lib:/usr/lib:/lib
    export LD_LIBRARY_PATH
    Informix® XPS 9.3
    DB2® Version 9.5
    DB2DIR=/opt/IBM/db2/V9.5
    export DB2DIR
    DB2INSTANCE=db2inst1
    export DB2INSTANCE
    INSTHOME=/export/home/db2inst1
    export INSTHOME
    PATH=$PATH:$INSTHOME/sqllib/bin:$INSTHOME/sqllib/adm:$INSTHOME/sqllib/misc
    export PATH
    LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$INSTHOME/sqllib/lib
    export LD_LIBRARY_PATH
    THREADS_FLAG=native
    export THREADS_FLAG
  3. Save your changes.
  4. Stop and restart the IBM InfoSphere Information Server engine. See Stopping and starting the server engine.
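The procedure above can be sketched as a short shell session. The install path is the default from step 1, the appended variables are an abbreviated form of the DB2 example from step 2, and `uv -admin -stop/-start` is the conventional way to restart the engine; adjust all of these for your environment:

```shell
# Assumed default install path -- adjust for your system
DSHOME=/opt/IBM/InformationServer/Server/DSEngine

# Back up dsenv before editing (as recommended above)
cp "$DSHOME/dsenv" "$DSHOME/dsenv.bak"

# Append the database client environment (abbreviated DB2 example from the text)
cat >> "$DSHOME/dsenv" <<'EOF'
DB2DIR=/opt/IBM/db2/V9.5; export DB2DIR
DB2INSTANCE=db2inst1; export DB2INSTANCE
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/export/home/db2inst1/sqllib/lib; export LD_LIBRARY_PATH
EOF

# Stop and restart the engine so the changes take effect
"$DSHOME/bin/uv" -admin -stop
"$DSHOME/bin/uv" -admin -start
```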

How to Run a Job from the Command Line?

Below is the syntax for running DataStage jobs from the command line.

Command Syntax:
dsjob [-file <file> <server> | [-server <server>] [-user <user>] [-password <password>]]

Valid primary command options are:

Status code = -9999 DSJE_DSJOB_ERROR

dsjob -run
[-mode <NORMAL | RESET | VALIDATE>]
[-param <name>=<value>]
[-warn <n>]
[-rows <n>]
[-opmetadata <TRUE | FALSE>]

Status code = -9999 DSJE_DSJOB_ERROR


It's been quite some time since we moved to parallel jobs with IBM DataStage;
the dsjob command has helped me write some scheduling scripts.

One more point I would like to add: source dsenv first, and then run dsjob:

. $DSHOME/dsenv
$DSHOME/bin/dsjob -run

And if the job takes parameters, you can pass them with an additional
parameter list.
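For example, a full invocation with parameters might look like the following sketch (the project, job, and parameter names are hypothetical):

```shell
. $DSHOME/dsenv
$DSHOME/bin/dsjob -run \
    -mode NORMAL \
    -param "SrcDir=/data/in" \
    -param "RunDate=20110131" \
    -jobstatus MyProject MyJob
```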


(In Informatica, the equivalent is the pmcmd command, which runs a mapping from the command prompt.)


-run [ -mode <NORMAL | RESET | VALIDATE> ] [ -param <name>=<value> ] [ -warn <n> ] [ -rows <n> ] [ -wait ] [ -stop ] [ -jobstatus ] [ -userstatus ] project job

Saturday, January 29, 2011

DataStage Job run from Unix Command Line

I am running a DataStage job from the Unix command line with job-level parameters, and the job aborts. Can someone tell me if there is a syntax problem in the dsjob command below?

dsjob -run \
-param "TargetFileDirectory=/datahub/dev/wc/target"\
-param "SCRIPTS_DIR=/datahub/dev/wc/scripts"\
-param "Filename=GYLD090504"\
DEV_WC jbWCPatternIdentifyReplace_bkupON_12May2009

I am able to execute the same job successfully by hardcoding the parameter values in the job and running the command below.

dsjob -run DEV_WC jbWCPatternIdentifyReplace_bkupON_12May2009

Putting a space in front of the \ also did not work, but with the syntax below it worked:

dsjob -run -param "TargetFileDirectory=/datahub/dev/wc/target" -param "SCRIPTS_DIR=/datahub/dev/wc/scripts" -param "Filename=GYLD090504" -jobstatus DEV_WC jbWCPatternIdentifyReplace_bkupON_12May2009

How can I identify the duplicate rows in a seq or comma delimited file?

The case is: the source has 4 fields, such as agent ID, agent name, etc. Our requirement is that the ID should not be repeated. So how can I identify the duplicate rows, set a flag, and send the rejects to a specified reject file? The source system's data is given to us directly; that's why we are getting these duplicates. If it had a primary key set up already, it would have been very easy. Thanks in advance.
Ans: Sort the sequential file on the key AGENT_ID and set the option "Create Key Change Column" to TRUE in the Sort stage. The first record of each key group gets the value 1 in the KeyChange field, and the duplicate records that follow get the value 0 (zero). Now reject the records that have the value 0.
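Outside DataStage, the same duplicate check can be sketched with awk on the delimited file itself. The file names (agents.csv, clean.csv, rejects.csv) are made up for the example, and field 1 is assumed to be the agent ID:

```shell
# Build a tiny example input: two unique agent IDs plus one duplicate
printf 'A1,Smith\nA2,Jones\nA1,Brown\n' > agents.csv

# First occurrence of each ID goes to clean.csv; repeats go to rejects.csv
awk -F',' 'seen[$1]++ { print > "rejects.csv"; next } { print > "clean.csv" }' agents.csv

cat rejects.csv   # prints: A1,Brown
```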


Datastage Faqs

Types of Lookups:

In a lookup, the first link is the 'primary' link; the other links are called 'lookup' links.

When a lookup link comes from a stage other than a database stage, all data from the lookup link is read into memory. Then, for each row from the primary link, the lookup is performed.

If the source of a lookup is a database, there can be two types of lookups:

Normal and Sparse lookup.

Normal Lookup --- all the data from the database is read into memory, and then the lookup is performed.

Sparse Lookup --- for each incoming row from the primary link, the SQL is fired on the database at run time.

1.What is a view?

A view is a virtual or logical copy of the original table, with the schema parts only.

2)Difference b/w materialized view and view?

A view is nothing but a set of SQL statements that join single or multiple tables and show the data; however, views do not hold the data themselves but point to it.

A materialized view, by contrast, is a concept mainly used in data warehousing: these views contain the data itself, the reason being that it is easier/faster to access. The main purpose of a materialized view is to do calculations and display data from multiple tables using joins.

3)How to remove duplicates in ds and how it keep?

Other than the Remove Duplicates stage, we can also use the Aggregator stage to count the number of records that exist for the key columns. If more than one record exists for a key column, they are considered duplicates: using a Transformer we can set a stage variable COUNT and check whether COUNT > 1; if so, using a constraint, reject those duplicate records into a reject file.

4)how to remove the locked jobs using datastage?

Alternatively, just restart the DataStage services from the "Administrative Tools" section of the Control Panel.

5)Did you Parameterize the job or hard-coded the values in the jobs?

Always parameterized the job. The values come either from Job Properties or from a 'Parameter Manager', a third-party tool. There is no way you should hard-code parameters in your jobs. The most often parameterized variables in a job are: DB DSN name, username, password, and the dates with respect to which the data is to be looked at.

6)What is Hash file stage and what is it used for?
We can also use the Hash File stage to avoid/remove duplicate rows by specifying the hash key on a particular field.

7)What are other Performance tunings you have done in your last project to increase the performance of slowly running jobs?
1.Staged the data coming from ODBC/OCI/DB2UDB stages or any database on the server using Hash/Sequential files, for optimum performance and also for data recovery in case the job aborts.

2.Tuned the OCI stage for 'Array Size' and 'Rows per Transaction' numerical values for faster inserts, updates and selects.

3.Tuned the 'Project Tunables' in Administrator for better performance.

4.Used sorted data for Aggregator.

5.Sorted the data as much as possible in DB and reduced the use of DS-Sort for better performance of jobs

6.Removed the data not used from the source as early as possible in the job.

7.Worked with DB-admin to create appropriate Indexes on tables for better performance of DS queries

8.Converted some of the complex joins/business logic in DS to stored procedures in the database for faster execution of the jobs.

9.If an input file has an excessive number of rows and can be split-up then use standard logic to run jobs in parallel.

10.Before writing a routine or a transform, make sure that there is not the functionality required in one of the standard routines supplied in the sdk or ds utilities categories.

11.Constraints are generally CPU intensive and take a significant amount of time to process. This may be the case if the constraint calls routines or external macros but if it is inline code then the overhead will be minimal.

12.Try to have the constraints in the 'Selection' criteria of the jobs itself. This will eliminate the unnecessary records even getting in before joins are made.

13.Tuning should occur on a job-by-job basis.

14.Use the power of DBMS.

15.Try not to use a sort stage when you can use an ORDER BY clause in the database.

16.Using a constraint to filter a record set is much slower than performing a SELECT … WHERE….

17.Make every attempt to use the bulk loader for your particular database. Bulk loaders are generally faster than using ODBC or OLE.


8)What will you do in a situation where somebody wants to send you a file, use that file as an input or reference, and then run the job?

A. Under Windows: Use the 'WaitForFileActivity' under the Sequencers and then run the job. May be you can schedule the sequencer around the time the file is expected to arrive.
B. Under UNIX: Poll for the file. Once the file has arrived, start the job or sequencer that depends on the file.
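A minimal polling loop in shell might look like this. The trigger path is made up, the file's arrival is simulated with a background `touch`, and in a real setup the commented-out dsjob call would start the job or sequencer:

```shell
TRIGGER=/tmp/poll_demo_trigger.$$

# Simulate the sender: the file "arrives" after one second
( sleep 1; touch "$TRIGGER" ) &

# Poll until the file exists
while [ ! -f "$TRIGGER" ]; do
    sleep 1    # in production, poll less aggressively (e.g. every 60s)
done
echo "file arrived, starting job"
# $DSHOME/bin/dsjob -run MyProject MyJob   # hypothetical project/job names
rm -f "$TRIGGER"
```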

9)What are the command line functions that import and export the DS jobs?
A. dsimport.exe- imports the DataStage components.
B. dsexport.exe- exports the DataStage components.

10)What is the utility you use to schedule the jobs on a UNIX server other than using Ascential Director?
Use the crontab utility along with the dsexecute() function, with proper parameters passed.
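A crontab entry for this might look like the following sketch. All paths, the project name, and the job name are hypothetical; note that the entry sources dsenv first so that dsjob runs with the right environment:

```shell
# m h dom mon dow  command -- run MyJob every day at 02:00
0 2 * * * . /opt/IBM/InformationServer/Server/DSEngine/dsenv; /opt/IBM/InformationServer/Server/DSEngine/bin/dsjob -run -jobstatus MyProject MyJob >> /tmp/myjob.log 2>&1
```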

11)Difference between Hashfile and Sequential File?

A hash file stores data based on a hash algorithm and a key value; a sequential file is just a file with no key column. A hash file can be used as a reference for a lookup; a sequential file cannot.

12)How can we implement Lookup in DataStage Server jobs?
We can use a hash file as a lookup in server jobs. The hash file needs at least one key column to be created.


By using hashed files you can implement the lookup in DataStage; hashed files store data based on a hashing algorithm and key values.

13)what are the Job parameters?
These parameters are used to provide administrative access and to change run-time values of the job.

In the Parameters tab we can define the name, prompt, type, and value.

14)Whats difference betweeen operational data stage (ODS) & data warehouse?
The data which is volatile is in the ODS, and the data which is non-volatile, historical, and time-variant is in the DWH; in simple terms, the ODS holds dynamic data.


A data warehouse is a decision-support database for organizational needs. It is a subject-oriented, non-volatile, integrated, time-variant collection of data.

An ODS (Operational Data Store) is an integrated collection of related information. It contains at most about 90 days of information.


An ODS (operational data store) is part of the transactional database world: it keeps integrated data from different transactional databases and allows common operations across the organisation, e.g. banking transactions.


An operational data store (or "ODS") is a database designed to integrate data from multiple sources to facilitate operations, analysis and reporting. Because the data originates from multiple sources, the integration often involves cleaning, redundancy resolution and business rule enforcement. An ODS is usually designed to contain low level or atomic (indivisible) data such as transactions and prices as opposed to aggregated or summarized data such as net contributions. Aggregated data is usually stored in the Data warehouse.

How do you pass filename as the parameter for a job?
During job development we can create a parameter 'FILE_NAME', and the value can be passed in while running the job.


1. Go to DataStage Administrator -> Projects -> Properties -> Environment -> UserDefined. Here you can see a grid where you can enter your parameter name and the corresponding path of the file.

2. Go to the Stage tab of the job, select the NLS tab, click "Use Job Parameter" and select the parameter name which you entered above. The selected parameter name appears in the text box beside the "Use Job Parameter" button. Copy the parameter name from the text box and use it in your job, keeping the project default in the text box.

How do you eliminate duplicate rows?

Use the Remove Duplicates stage: it takes a single sorted data set as input, removes all duplicate records, and writes the results to an output data set.


delete from emp where rowid not in (select max(rowid) from emp group by column_name) -- min(rowid) works equally well


Duplicates can also be eliminated by loading the corresponding data into a hash file. Specify the columns on which you want to eliminate duplicates as the keys of the hash file.

What about System variables?

DataStage provides a set of variables containing useful system information that you can access from a transform or routine. System variables are read-only.

@DATE The internal date when the program started. See the Date function.

@DAY The day of the month extracted from the value in @DATE.

@FALSE The compiler replaces the value with 0.

@FM A field mark, Char(254).

@IM An item mark, Char(255).

@INROWNUM Input row counter. For use in constraints and derivations in Transformer stages.

@OUTROWNUM Output row counter (per link). For use in derivations in Transformer stages.

@LOGNAME The user login name.

@MONTH The current month extracted from the value in @DATE.

@NULL The null value.

@NULL.STR The internal representation of the null value, Char(128).

@PATH The pathname of the current DataStage project.

@SCHEMA The schema name of the current DataStage project.

@SM A subvalue mark (a delimiter used in UniVerse files), Char(252).

Status codes returned by system processes or commands.

@TIME The internal time when the program started. See the Time function.

@TM A text mark (a delimiter used in UniVerse files), Char(251).

@TRUE The compiler replaces the value with 1.

@USERNO The user number.

@VM A value mark (a delimiter used in UniVerse files), Char(253).

@WHO The name of the current DataStage project directory.

@YEAR The current year extracted from @DATE.

REJECTED Can be used in the constraint expression of a Transformer stage of an output link. REJECTED is initially TRUE, but is set to FALSE whenever an output link is successfully written.

How do you merge two files in DS?

Either used Copy command as a Before-job subroutine if the metadata of the 2 files are same or created a job to concatenate the 2 files into one if the metadata is different.

How do we do the automation of dsjobs?

We can call a DataStage batch job from the command prompt using 'dsjob', and we can also pass all the parameters from the command prompt.
Then call this shell script from any of the schedulers available on the market.
The second option is to schedule these jobs using the DataStage Director.
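When such a wrapper script is called from a scheduler, it usually has to turn dsjob's exit status into a pass/fail decision. A sketch follows; the status mapping assumes `-jobstatus` is used, where 1 = finished OK and 2 = finished with warnings (verify the exact codes against your DataStage version):

```shell
# Translate a dsjob -jobstatus exit code into a readable result
check_status() {
    case "$1" in
        1) echo "job finished OK" ;;
        2) echo "job finished with warnings" ;;
        *) echo "job failed (status $1)"; return 1 ;;
    esac
}

# In a real wrapper (names hypothetical):
#   $DSHOME/bin/dsjob -run -jobstatus MyProject MyJob
#   check_status $?
check_status 2   # prints: job finished with warnings
```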

what's the difference between Datastage Developers and Datastage Designers. What are the skill's required for this.

A DataStage developer is one who codes the jobs. A DataStage designer is one who designs the job; that is, he deals with the blueprints and designs the jobs and the stages that are required in developing the code.

Importance of Surrogate Key in Data warehousing?

A surrogate key is a primary key for a dimension table. Its main importance is that it is independent of the underlying database, i.e. a surrogate key is not affected by changes going on in the database.
It is a numeric key: it is the primary key in the dimension table and a foreign key in the fact table. It is used to handle missing data and complex situations in DataStage.


The concept of a surrogate key comes into play when there are slowly changing dimensions in a table.
In such conditions there is a need for a key by which we can identify the changes made in the dimensions.
These slowly changing dimensions can be of three types, namely SCD1, SCD2, and SCD3.
Surrogate keys are system-generated: mainly they are just sequences of numbers, but they can be alphanumeric values as well.

what is difference between data stage and informatica
There are some very good articles on these differences which help to give an idea; basically it depends on what you are trying to accomplish.

Datstage solutions, knowledge base, FAQ and best practices

1.What are the ways to execute datastage jobs?

A job can be run using a few different methods:

* from Datastage Director (menu Job -> Run now...)
* from command line using a dsjob command
* Datastage routine can run a job (DsRunJob command)
* by a job sequencer
2.How to invoke a Datastage shell command?

Datastage shell commands can be invoked from :

* Datastage administrator (projects tab -> Command)
* Telnet client connected to the datastage server

3.How to stop a job when its status is running?

To stop a running job, go to DataStage Director and click the Stop button (or Job -> Stop from the menu). If it doesn't help, go to Job -> Cleanup Resources, select the process which holds the lock, and click Logout.

If it still doesn't help go to the datastage shell and invoke the following command:
It will open an administration panel. Go to 4.Administer processes/locks , then try invoking one of the clear locks commands (options 7-10).

4. How to release a lock held by jobs?

Go to the datastage shell and invoke the following command:
It will open an administration panel. Go to 4.Administer processes/locks , then try invoking one of the clear locks commands (options 7-10).

5. How to run and schedule a job from command line?

To run a job from command line use a dsjob command

Command Syntax: dsjob [-file <file> <server> | [-server <server>] [-user <user>] [-password <password>]] -run <project> <job>

6. User privileges for the default DataStage roles?

The role privileges are:

* DataStage Developer - user with full access to all areas of a DataStage project
* DataStage Operator - has privileges to run and manage deployed DataStage jobs
* -none- - no permission to log on to DataStage

7. What is a command to analyze hashed file?

There are two ways to analyze a hashed file. Both should be invoked from the datastage command shell. These are:

* FILE.STAT command
* ANALYZE.FILE command

8. Is it possible to run two versions of datastage on the same pc?

Yes, even though different versions of Datastage use different system dll libraries.
To dynamically switch between Datastage versions install and run DataStage Multi-Client Manager. That application can unregister and register system libraries used by Datastage.

9. How to send notifications from Datastage as a text message (sms) to a cell phone

There are a few possible methods of sending SMS messages from Datastage. However, there is no easy way to do it directly from Datastage, and all the methods described below require some effort.

The easiest way of doing that from the Datastage standpoint is to configure an SMTP (email) server as a mobile phone gateway. In that case, a Notification Activity can be used to send message with a job log and any desired details. DSSendMail Before-job or After-job subroutine can be also used to send sms messages.
If configured properly, the recipients email address will have the following format:

If there is no possibility of configuring a mail server to send text messages, you can work around it by using an external application run directly from the operating system. There are a whole bunch of Unix scripts and applications to send SMS messages.
In that solution, you will need to create a batch script which will take care of sending messages and invoke it from Datastage using ExecDOS or ExecSh subroutines passing the required parameters (like phone number and message body).

Please keep in mind that all these solutions may require contacting the local cellphone provider first and, depending on the country, may not be free of charge; in some cases the provider may not support the capability at all.

10.Error in Link collector - Stage does not support in-process active-to-active inputs or outputs

To get rid of the error just go to the Job Properties -> Performance and select Enable row buffer.
Then select Inter process which will let the link collector run correctly.
Buffer size set to 128Kb should be fine, however it's a good idea to increase the timeout.

11.What is the DataStage equivalent to like option in ORACLE

The following statement in Oracle:
select * from ARTICLES where article_name like '%WHT080%';

Can be written in DataStage (for example as the constraint expression):
incol.empname matches '...WHT080...'

12.what is the difference between logging text and final text message in terminator stage

Every stage has a 'Logging Text' area on their General tab which logs an informational message when the stage is triggered or started.

* Informational - is a green line, DSLogInfo() type message.
* The Final Warning Text - the red fatal message, which is included in the sequence abort message.

13.Error in STPstage - SOURCE Procedures must have an output link

The error appears in Stored Procedure (STP) stage when there are no stages going out of that stage.
To get rid of it go to 'stage properties' -> 'Procedure type' and select Transform

14. How to invoke an Oracle PLSQL stored procedure from a server job

To run a pl/sql procedure from Datastage a Stored Procedure (STP) stage can be used.
However it needs a flow of at least one record to run.

It can be designed in the following way:

* source odbc stage which fetches one record from the database and maps it to one column - for example: select sysdate from dual
* A transformer which passes that record through. If required, add pl/sql procedure parameters as columns on the right-hand side of the transformer's mapping
* Put Stored Procedure (STP) stage as a destination. Fill in connection parameters, type in the procedure name and select Transform as procedure type. In the input tab select 'execute procedure for each row' (it will be run once).

15. Is it possible to run a server job in parallel?

Yes, even server jobs can be run in parallel.
To do that go to 'Job properties' -> General and check the Allow Multiple Instance button.
The job can now be run simultaneously from one or many sequence jobs. When it happens datastage will create new entries in Director and new job will be named with automatically generated suffix (for example second instance of a job named JOB_0100 will be named JOB_0100.JOB_0100_2). It can be deleted at any time and will be automatically recreated by datastage on the next run.

16.Datastage routine to open a text file with error catching

Note! work_dir and file1 are parameters passed to the routine.

* open file1
OPENSEQ work_dir : '\' : file1 TO H.FILE1 THEN
CALL DSLogInfo("******************** File " : file1 : " opened successfully", "JobControl")
END ELSE
CALL DSLogInfo("Unable to open file", "JobControl")
END

17. Datastage routine which reads the first line from a text file

Note! work_dir and file1 are parameters passed to the routine.

* open file1
OPENSEQ work_dir : '\' : file1 TO H.FILE1 THEN
CALL DSLogInfo("******************** File " : file1 : " opened successfully", "JobControl")
END ELSE
CALL DSLogInfo("Unable to open file", "JobControl")
END

* read the first line; warn if the file is empty
READSEQ FILE1.RECORD FROM H.FILE1 ELSE
Call DSLogWarn("******************** File is empty", "JobControl")
END

firstline = Trim(FILE1.RECORD[1,32]," ","A") ;* will read the first 32 chars
Call DSLogInfo("******************** Record read: " : firstline, "JobControl")

18.How to test a datastage routine or transform?

To test a datastage routine or transform go to the Datastage Manager.
Navigate to Routines, select a routine you want to test and open it. First compile it and then click 'Test...' which will open a new window. Enter test parameters in the left-hand side column and click run all to see the results.
Datastage will remember all the test arguments during future tests.

19.When hashed files should be used? What are the benefits or using them?

Hashed files are the best way to store data for lookups. They're very fast when looking up the key-value pairs.
Hashed files are especially useful for storing dictionary-like information (customer details, countries, exchange rates). Stored this way, the data can be shared across the project and accessed from different jobs.

20.How to construct a container and deconstruct it or switch between local and shared?

To construct a container go to Datastage designer, select the stages that would be included in the container and from the main menu select Edit -> Construct Container and choose between local and shared.
Local containers are only visible in the current job, while shared containers can be re-used. Shared containers can be viewed and edited in Datastage Manager under the 'Routines' menu.
Local Datastage containers can be converted at any time to shared containers in datastage designer by right clicking on the container and selecting 'Convert to Shared'. In the same way it can be converted back to local.

21.Corresponding datastage data types to ORACLE types?

Most of the datastage variable types map very well to oracle types. The biggest problem is mapping the oracle NUMBER(x,y) format correctly.

The best way to do that in Datastage is to convert oracle NUMBER format to Datastage Decimal type and to fill in Length and Scale column accordingly.

There are no problems with string mappings: oracle Varchar2 maps to datastage Varchar, and oracle char to datastage char.

22.How to adjust commit interval when loading data to the database?

In earlier versions of datastage the commit interval could be set up in:
General -> Transaction size (in version 7.x it's obsolete)

Starting from Datastage 7.x it can be set up in properties of ODBC or ORACLE stage in Transaction handling -> Rows per transaction.
If set to 0, the commit will be issued at the end of a successful transaction.

23.What is the use of INROWNUM and OUTROWNUM datastage variables?

@INROWNUM and @OUTROWNUM are internal datastage variables which do the following:

* @INROWNUM counts incoming rows to a transformer in a datastage job
* @OUTROWNUM counts outgoing rows from a transformer in a datastage job

These variables can be used to generate sequences, primary keys, id's, numbering rows and also for debugging and error tracing.
They play a similar role to sequences in Oracle.

24.Datastage trim function cuts out more characters than expected

By default the datastage Trim function works this way:
Trim(" a  b  c  d ") will return "a b c d", while in many other programming/scripting languages "a  b  c  d" would be the expected result.
That is because by default the R parameter is assumed, where R removes leading and trailing occurrences of a character and reduces multiple occurrences to a single occurrence.

To get "a  b  c  d" as the result, use the trim function in the following way: Trim(" a  b  c  d ", " ", "B")

25. Database update actions in ORACLE stage

The destination table can be updated using various Update actions in Oracle stage. Be aware of the fact that it's crucial to select the key columns properly as it will determine which column will appear in the WHERE part of the SQL statement. Available actions:

* Clear the table then insert rows - deletes the contents of the table (DELETE statement) and adds new rows (INSERT).
* Truncate the table then insert rows - deletes the contents of the table (TRUNCATE statement) and adds new rows (INSERT).
* Insert rows without clearing - only adds new rows (INSERT statement).
* Delete existing rows only - deletes matched rows (issues only the DELETE statement).
* Replace existing rows completely - deletes the existing rows (DELETE statement), then adds new rows (INSERT).
* Update existing rows only - updates existing rows (UPDATE statement).
* Update existing rows or insert new rows - updates existing data rows (UPDATE) or adds new rows (INSERT). An UPDATE is issued first and if it succeeds the INSERT is omitted.
* Insert new rows or update existing rows - adds new rows (INSERT) or updates existing rows (UPDATE). An INSERT is issued first and if it succeeds the UPDATE is omitted.
* User-defined SQL - the data is written using a user-defined SQL statement.
* User-defined SQL file - the data is written using a user-defined SQL statement from a file.

26. Use and examples of ICONV and OCONV functions?

ICONV and OCONV functions are quite often used to handle data in Datastage.
ICONV converts a string to an internal storage format and OCONV converts an expression to an output format.
Iconv (string, conversion code)
Oconv(expression, conversion )

Some useful iconv and oconv examples:
Iconv("10/14/06", "D2/") = 14167
Oconv(14167, "D-E") = "14-10-2006"
Oconv(14167, "D DMY[,A,]") = "14 OCTOBER 2006"
Oconv(12003005, "MD2$,") = "$120,030.05"

That expression formats a number and rounds it to 2 decimal places:

Iconv and oconv can be combined in one expression to reformat date format easily:
Oconv(Iconv("10/14/06", "D2/"),"D-E") = "14-10-2006"

27.ERROR 81021 Calling subroutine DSR_RECORD ACTION=2

Error message:

DataStage Repository Interface:
Error calling subroutine: DSR_RECORD (Action=2);
check DataStage is set up correctly in project
Development (Internal Error (81021))

Datastage system help gives the following error description:
SYS.HELP. 081021
MESSAGE.. dsrpc: Error writing to Pipe.

The problem appears when a job sequence is used and it contains many stages (usually more than 10) and very often when a network connection is slow.

Basically, the cause of the problem is a communication failure between the DataStage client and the server.

The solution to the issue is:
# Do not log in to Datastage Designer using the 'Omit' option on the login screen. Type in the username and password explicitly, and the job should compile successfully.
# If the above does not help, execute the DS.REINDEX ALL command from the Datastage shell.

28. How to check Datastage internal error descriptions

# To check the description of an error number, go to the Datastage shell (from the Administrator client or by telnet to the server machine) and invoke the following command: SELECT * FROM SYS.MESSAGE WHERE @ID='081021'; - where in this case 081021 is the error number

The command will produce a brief error description which probably will not be helpful in resolving an issue but can be a good starting point for further analysis.

29. Error timeout waiting for mutex

The error message usually looks as follows:
... ds_ipcgetnext() - timeout waiting for mutex

There may be several reasons for the error and thus several ways to get rid of it.
The error usually appears when using Link Collector, Link Partitioner and Interprocess (IPC) stages. It may also appear when doing a lookup with the use of a hash file or if a job is very complex, with the use of many transformers.

There are a few things to consider to work around the problem:
- increase the buffer size (up to 1024K) and the Timeout value in the Job properties (on the Performance tab).
- ensure that the key columns in active stages or hashed files are composed of allowed characters – get rid of nulls and try to avoid language-specific characters which may cause the problem.
- try to simplify the job as much as possible (especially if it is very complex). Consider splitting it into two or three smaller jobs, and review fetches and lookups and try to optimize them (especially have a look at the SQL statements).

30. ERROR 30107 Subroutine failed to complete successfully

Error message:

Error calling subroutine:
DSR_RECORD (Action=2); or *DataStage*DSR_SELECT (Action=7);
check DataStage is set up correctly in project Development
(Subroutine failed to complete successfully(30107))

Datastage system help gives the following error description:
SYS.HELP. 930107
MESSAGE.. DataStage/SQL: Illegal placement of parameter markers

The problem appears when a project is moved from one environment to another (for example when deploying a project from a development environment to production).

The solution to the issue is:
# Rebuild the repository index by executing the DS.REINDEX ALL command from the Datastage shell

31. Datastage Designer hangs when editing job activity properties

The problem appears when running Datastage Designer under Windows XP after installing patches or Service Pack 2 for Windows.
After opening a job sequence and navigating to the job activity properties window the application freezes and the only way to close it is from the Windows Task Manager.

The solution to the problem is very simple: just download and install the “XP SP2 patch” for the Datastage client.
It can be found on the IBM client support site (login required).

Go to the software updates section and select an appropriate patch from the Recommended DataStage patches section.
Sometimes users face problems when trying to log in (for example when the license doesn’t cover the IBM Active Support); in that case it may be necessary to contact IBM support directly.


Search 4 DataStage

Friday, January 21, 2011

what is the use of Partitioning Algorithm in Datastage?

When a job is running on multiple nodes the partitioning algorithm specifies the rule used to allocate each row to a specific node (or partition of the data). There are eight partitioning algorithms (six, really, because you can discount "Auto" and "Same").

Round robin allocates rows as the name suggests: for example, with four nodes, rows go to nodes 0,1,2,3,0,1,2,3 and so on. Random allocates randomly. Key-based algorithms take a key value and apply a function to it to determine the node. For example, if you have an integer key, the Modulus algorithm divides the key by the number of nodes and the remainder is the node to which that row is allocated.
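Assuming four nodes, those two rules can be sketched as simple Python functions mapping rows (or keys) to node numbers (round_robin and modulus here are illustrative helpers, not DataStage APIs):

```python
NODES = 4  # number of partitions/nodes, for illustration

def round_robin():
    """Yield node numbers 0,1,2,3,0,1,2,3,... one per incoming row."""
    i = 0
    while True:
        yield i % NODES
        i += 1

def modulus(key):
    """Key-based rule: divide the integer key by the node count;
    the remainder is the node the row is sent to."""
    return key % NODES

rr = round_robin()
print([next(rr) for _ in range(6)])          # [0, 1, 2, 3, 0, 1]
print([modulus(k) for k in (10, 11, 12, 13)])  # [2, 3, 0, 1]
```

Note the key property of Modulus (and key-based rules generally): every row with the same key value lands on the same node, which is what makes downstream per-key operations such as aggregation or remove-duplicates work.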

Null fields

I have been struggling with null handling on the PX side. (Boy, do I miss Server for this)

One of the errors is :
Seq_Contract: When checking operator: When validating export schema: At field "PRID": "null_field" length (0) must match field's fixed width (12)

Another one is :
Seq_Contract: When checking operator: When validating export schema: At field "PRPARENTID": Exporting a fixed length field with a "null_field" of length 0 will prohibit a re-import of this field

I am also experiencing problems with View Data both in sequential stages and Oracle Enterprise.

I pretty much tried everything including:
  • NullToEmpty() function in input fields
  • Edit Column Metadata -> Parallel -> Properties -> Nullable -> (different values such as "" or @)


The null handling that's in the stage properties of the Sequential File stage applies to ALL of your nullable columns globally. So unless all of your columns are of the same length for fixed-length fields (and the Null value you've specified is exactly that length), or they are all variable-length fields, then you have to specify the null-handling for each column. To do that, go into the Columns tab, highlight the column in question, right click and select Edit Row. That will bring you to the Edit Column Meta Data page where you can specify the Actual field length, Null field length, and Null field value for that column. Keep repeating this for each nullable column.

Wow, this is a major difference from the server version.


What if you want nulls to ultimately be stored in your database?

What would you set Actual field length, Null field length, and Null field value to in that case?

The solution provided above was to remove the warnings generated with sequential file stage for not handling null values.

If you do want nulls, use the make_null Orchestrate function in a Modify stage or the SetNull() function in a Transformer.

A zero length string does not automatically become a null. You have to explicitly set it as NULL. You can use the following derivation in Transformer.

If Len(Trim(Input_Column)) > 0 Then Input_Column Else SetNull()
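The same rule can be mimicked in Python, with None standing in for NULL (to_null_if_empty is an illustrative helper, not a DataStage function):

```python
def to_null_if_empty(value):
    """Equivalent of: If Len(Trim(Input_Column)) > 0 Then Input_Column Else SetNull().
    A non-empty trimmed value passes through; anything else becomes an explicit NULL."""
    if value is not None and len(value.strip()) > 0:
        return value
    return None  # explicit NULL; a zero-length string is not a null by itself

print(to_null_if_empty("ABC123"))  # ABC123
print(to_null_if_empty("   "))     # None
print(to_null_if_empty(""))        # None
```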

Problem with FTP Stage

I am trying to get a file from another Unix server in my environment. The URI is defined as URI=ftp://hostname/directorypath/filename. I am getting an error when I run the job. The error is "FTP_Enterprise_0: uri : ftp://hostname/directorypath/filename is not valid remote file." But the file exists on the remote server.


Check file permissions for the user. Check to see if the file is locked. Check the file and path for case.

I changed the file permissions using chmod 777. The file is not locked, and I have checked the path and filename; everything looks the same.

Is the file on the remote server located in your user account's home directory? If not, then in your directory path, try putting enough '../' in there to navigate backwards from your home directory to root, then down to the right directory. So if the file is /apps/Ascential/filename, and your home directory is /home/yourname, then your URI would look like ftp://hostname/../../apps/Ascential/filename.
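That '../' trick can be sketched as a small helper, assuming the FTP server resolves paths relative to the login home directory (ftp_uri and the hostname/paths shown are only examples):

```python
def ftp_uri(hostname, home_dir, target_path):
    """Build an FTP URI that backs out of the login home directory to root,
    then walks down to the absolute target path."""
    depth = len([p for p in home_dir.split("/") if p])  # /home/yourname -> 2 levels
    relative = "../" * depth + target_path.lstrip("/")
    return f"ftp://{hostname}/{relative}"

print(ftp_uri("hostname", "/home/yourname", "/apps/Ascential/filename"))
# ftp://hostname/../../apps/Ascential/filename
```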


Is it possible in Datastage ETL to READ MULTIPLE HEADERS FROM A SEQUENTIAL FILE (as input data), distinguish the data based on its header, and have it transformed?

Sample Input File:
Employee-ID, Employee-Name, Employee-Salary
Employee-ID, Leave-Date
101,John, 5000
102,Keane, 8000

Output File-1:
Employee-ID, Employee-Name, Employee-Salary
101,John, 5000
102,Keane, 8000

Output File-2:
Employee-ID, Leave-Date


For a server job, read the file as a single VarChar field.
In a Transformer stage, determine whether the first comma-delimited field (use the Field() function) is an integer (for example Matches "1N0N") - if it matches, you don't have a header line.
If the second field is a date (Matches "2N'-'2N'-'4N") then the row goes to the second output; otherwise it doesn't.
Use the Transformer stage (more Field() functions) for all the other parsing.
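The routing logic above can be sketched in Python, with regular expressions standing in for the Matches patterns (split_rows and the header test via the literal "Date" are illustrative simplifications):

```python
import re

IS_INT = re.compile(r"^\d+$")                  # like Matches "1N0N"
IS_DATE = re.compile(r"^\d{2}-\d{2}-\d{4}$")   # like Matches "2N'-'2N'-'4N"

def split_rows(lines):
    """Route each row: header lines go to their matching output; data rows are
    routed by whether the second comma-delimited field looks like a date."""
    out1, out2 = [], []
    for line in lines:
        fields = [f.strip() for f in line.split(",")]
        if not IS_INT.match(fields[0]):        # non-integer first field -> header line
            (out2 if "Date" in line else out1).append(line)
        elif IS_DATE.match(fields[1]):         # date in field 2 -> second output
            out2.append(line)
        else:
            out1.append(line)
    return out1, out2

sample = [
    "Employee-ID, Employee-Name, Employee-Salary",
    "Employee-ID, Leave-Date",
    "101,John, 5000",
    "102,Keane, 8000",
]
out1, out2 = split_rows(sample)
print(out1)  # first header plus the two data rows
print(out2)  # second header only
```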

fatal error getRealTableName

I am using an Oracle Enterprise stage to get data. When I press the 'View Data' key I get the error 'Fatal Error: getRealTable: missing name', but only if I use UDF SQL (Oracle Enterprise -> Output -> Properties -> Source -> Read Method = udf sql). When the read method = 'table' it's OK, no errors.
Why? I need to use UDF SQL.

##I IIS-DSEE-TFCN-00001 06:42:45(000)
IBM WebSphere DataStage Enterprise Edition
Copyright (c) 2001, 2005-2007 IBM Corporation. All rights reserved

##I IIS-DSEE-TOSH-00002 06:42:45(001) orchgeneral: loaded
##I IIS-DSEE-TOSH-00002 06:42:45(002) orchsort: loaded
##I IIS-DSEE-TOSH-00002 06:42:45(003) orchstats: loaded
>##F IIS-DSEE-TDOR-00490 06:42:45(004) Fatal Error: getRealTable: missing name


I don't have 8.x, but in 7.5.2 there is a bug where the table name is required even though it isn't used when processing the user-defined query. You might try providing it.


Well, I certainly provide the table name;
my UDF query looks like 'select * from emp'.
'select 1 from dual' fails too. Actually any UDF query or auto-generated query fails with the same error (getRealTable: missing name).
Only if ReadMethod=Table is it OK.
I've explored a database trace (my DB is Oracle) but there's nothing criminal there; Datastage generates the same query for both read methods ('udf' or 'table').


you have to fill in the Table Name property (whatever it's called) in addition to mentioning the table name in the query. Also, you may need to qualify the table name with the schema/owner name depending on the id you are using to connect to the database.


When I choose 'udf query' as the read method there's no way to fill in the table property (it becomes unavailable).
I also tried to qualify the schema name, but the result was the same: failure.

Dataset files are getting very huge, can I clean them up?

A better plan is to identify the Data Sets that are no longer required, and to use either orchadmin or the Data Set Management tool to delete them. The latter will show the pathnames of the segment files, so you can correlate these with what you are seeing in the directory. This way ALL files that belong to the Data Set, including its control file, are deleted. Your way you are liable to delete part of a Data Set. Not a good plan, as I said.


The Data Set Management utility is on the Tools menu of the Manager client (versions 7.x) or the Designer client (versions 8.x and later). This is probably the easier option if you're new.

The orchadmin executable is to be found in $APT_ORCHHOME/bin and is the main topic of discussion in the Orchestrate Administration manual, which you may need to request from your support provider - it's one of the "OEM manuals".

What is new for DataStage 8 on the Information Server

DataStage 8 on the Information Server looks the same as previous releases but has some major changes under the hood and a few extra bells and whistles.

This post looks at what is new or changed in DataStage jobs. There are a lot of new functions for managing, running and reporting on jobs but I will talk about that in another post or you can look back at my (much) earlier DataStage Hawk preview post.

Goodbye DataStage 7

It's time to bid goodbye to tired old DataStage 7.

You did a good job, you struggled on for as long as you could, but like all DataStage versions through the annals of history you didn't have the right metadata repository and you didn't play well with your brothers and sisters in your suite.

DataStage 8 on the other hand is much shinier and comes with a better metadata story as you get the new Metadata Server and the common connectors:

Release Date

The Windows version of the Information Server and DataStage 8 are out now. No sign yet of the version for other platforms.

DataStage Versions

DataStage 8 can only upgrade a DataStage 7 server, it cannot upgrade previous versions of servers though it can co-exist with previous versions. DataStage 8 can however import and upgrade export files from earlier versions of DataStage. I don't know how far back this support goes.

All the DataStage 7.x versions are available in version 8:

  • DataStage Enterprise Edition: Parallel, Server and Sequence Jobs
  • DataStage Server Edition: Server and Sequence Jobs
  • DataStage MVS: Mainframe Jobs
  • DataStage Enterprise for z/OS: runs on Unix System Services

DataStage for PeopleSoft: 2 CPU limit with Server and Sequence jobs.

I don't know whether you will ever see this version of DataStage in the PeopleSoft EPM bundle, however you may be able to upgrade existing PeopleSoft implementations to this version. Drop me a message if you try.

DataStage Addons

The DataStage Enterprise Packs and Change Data Capture components are available in version 8 as shown in the version 8 architecture overview:

DataStage Architecture Overview

Enterprise PACKs

  • SAP BW Pack
    • BAPI: (Staging Business API) loads from any source to BW.
    • OpenHub: extract data from BW.
  • SAP R/3 Pack
    • ABAP: (Advanced Business Application Processing) auto generate ABAP, Extraction Object Builder, SQL Builder, Load and execute ABAP from DataStage, CPI-C Data Transfer, FTP Data Transfer, ABAP syntax check, background execution of ABAP.
    • IDoc: create source system, IDoc listener for extract, receive IDocs, send IDocs.
    • BAPI: BAPI explorer, import export Tables Parameters Activation, call and commit BAPI.
  • Siebel Pack
    • EIM: (data integration manager) interface tables
    • Business Component: access business views via Siebel Java Data Bean
    • Direct Access: use a metadata browser to select data to extract
    • Hierarchy: for extracts from Siebel to SAP BW.
  • Oracle Applications Pack
    • Oracle flex fields: extract using enhanced processing techniques.
    • Oracle reference data structures: simplified access using the Hierarchy Access component.
    • Metadata browser and importer
  • DataStage Pack for PeopleSoft Enterprise
    • Import business metadata via a metadata browser.
    • Extract data from PeopleSoft tables and trees.
  • JD Edwards Pack
    • Standard ODBC calls
    • Pre-joined database tables via business views

Change Data Capture

These are add-on products (at an additional fee) that attach themselves to source databases and perform change data capture. Most source system database owners I've come across don't like you playing with their production transactional database and will not let you near it with a ten-foot pole, but I guess there are exceptions:

  • Oracle
  • Microsoft SQL Server
  • DB2 for z/OS
  • IMS

There are three ways to get incremental feeds on the Information Server: the CDC products for DataStage, the Replication Server (renamed Information Integrator: Replication Edition, does DB2 replication very well) and the change data capture functions within DataStage jobs such as the parallel CDC stage.

Removed Functions

These are the functions that are not in DataStage 8; please imagine the last waltz playing in your head as you peruse this list:

  • dssearch command line function
  • dsjob "-import"
  • Version Control tool
  • Released jobs
  • Oracle 8i native database stages
  • ClickPack

The loss of the Version Control tool is not a big deal as the import/export functions have been improved. Building a release file as an export in version 8 is easier than building it in the Version Control tool in version 7.

Database Connectivity

The common connection objects functionality means the very wide range of DataStage database connections are now available across Information Server products.

Latest supported databases for version 8:

  • DB2 8.1, 8.2 and 9.1
  • Oracle 9i, 10g, 10gR2 (not Oracle 8)
  • SQL Server 2005 plus stored procedures.
  • Teradata v2r5.1, v2r6.0, v2r6.1 (DB server) / 8.1 (TTU) plus Teradata Parallel Transport (TPT) and stored procedures and macro support, reject links for bulk loads, restart capability for parallel bulk loads.
  • Sybase ASE 15, Sybase IQ 11.5, 12.5, 12.7
  • Informix 10 (IDS)
  • SAS 612, 8.1, 9.1 and 9.1.3
  • IBM WS MQ 6.1, WS MB 5.1
  • Netezza v3.1
  • ODBC 3.5 standard and level 3 compliant
  • UniData 6 and UniVerse ?
  • Red Brick ?

This is not the complete list: some database versions are missing, and more databases can be accessed through the ODBC stage.

New Database Connector Functions

This is a big area of improvement.

  • LOB/BLOB/CLOB Data: pictures, documents etc. of any size can now be moved between databases. After choosing the LOB data type you can choose to pass the data inline or as a link reference.
  • Reject Links: optionally append error codes and messages, conditionally filter types of rejection, fail a job based on a percentage threshold of failures.
  • Schema Reconciliation: where the hell has this function been all my life? Automatically compare your DataStage schema to the database schema and perform minor data type conversions.
  • Improved SQL Builder that supports more database types, although if you didn't like the version 7 one you won't like the version 8 one either. (Kim Duke, I'm looking at you).
  • Test button on connectors. Test! You don't have to view data or run a job to find out if the stupid thing works.
  • Drag and drop your configured database connections onto jobs.
  • Before and after SQL defined per job or per node with a failure handling option. Neater than previous versions.

DataStage 8 gives you access to the latest versions of databases that DataStage 7 may never get. Extra functions on all connectors include improved reject handling, LOB support and easier stage configuration.

Code Packs

These packs can be used by server and/or parallel jobs to interact with other coding languages. This lets you access programming modules or functions within a job:

  • Java Pack: produce or consume rows for DataStage Parallel or Server jobs. Use a java transformer.
  • Web Service Pack: access web services operations in a Server job transformer or Server routine.
  • XML Pack: read, write or transform XML files in parallel or server jobs.

The DataStage stages, custom stages, transformer functions and routines will usually be faster at transforming data than these packs however they are useful for re-using existing code.

New Stages

A new stage from the IBM software family, new stages from new partners, and the convergence of QualityStage functions into DataStage. Apart from the SCD stage these all come at an additional cost.

  • WebSphere Federation and Classic Federation
  • Netezza Enterprise Stage
  • SFTP Enterprise Stage
  • iWay Enterprise Stage
  • Slowly Changing Dimension: for type 1 and type 2 SCDs.
  • Six QualityStage stages

There are four questions that have been asked since the dawn of time. What is the meaning of life? What's this rash that comes and goes? If you leave me can I come too? How do I populate a slowly changing dimension using DataStage? The answers being 42, visit a clinic, piss off and use the new SCD stage.

New Functions Existing Stages

  • Complex Flat File Stage: Multi Format File (MFF) support in addition to existing COBOL file support.
  • Surrogate Key Generator: now maintains the key source via integrated state file or DBMS sequence.
  • Lookup Stage: range lookups by defining high and low range fields on the input or reference data table. Updatable in-memory lookups.
  • Transformer Stage: new surrogate key functions Initialize() and GetNextKey().
  • Enterprise FTP Stage: now choose between ftp and sftp transfer.

You can achieve most of these functions in the current version with extra coding except for in-memory lookups. This is a killer function in DataStage 8.


Platforms and Compilers

These are the platforms for the released Windows version and the yet-to-be-released Linux/Unix version, along with the C++ compiler that you only need for parallel jobs that use transformers. You do not need this compiler for Server Edition.

-Windows 2003 SP1
•Visual Studio .NET 2003 C++, Visual Studio .NET 2005 C++ or Visual Studio .NET 2005 Express Edition C++
-AIX 5.2 & 5.3
•XL C/C++ Enterprise Edition 7.0, 8.0 compiler
-HP-UX 11i v1 & v2
•aC++ A.03.63 compiler
-Red Hat AS 4.0
•gcc 3.2.3 compiler
-SuSE ES 9.0
•gcc 3.3.3 compiler
-Solaris 2.9 & 2.10
•Sun Studio 9, 10, 11 compiler

Database Repository

Note the database compatibility for the Metadata Server repository is the latest versions of the three DBMS engines. DB2 is an optional extra in the bundle if you don't want to use an existing database.

  • DB2 9
    -IBM Information Server does not support the Database Partitioning Feature (DPF) for use in the repository layer
    -DB2 Restricted Enterprise Edition 9 is included with IBM Information Server and is an optional part of the installation; however, its use is restricted to hosting the IBM Information Server repository layer and cannot be used for other applications
  • Oracle 10g
  • SQL Server 2005

If you are a cheapskate and you really don't like DB2, in fact you would cross the street if you saw it coming in the other direction, you might be able to load the repository into a free (express) version of SQL Server or Oracle, however you might hit a problem with the DBMS license CPU restriction. If you get this working drop me a comment.


Foreign language support for the graphical tools and product messages:

Chinese (Simplified and Traditional), Czech, Danish, Finnish, French, German, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish and Swedish.