Friday, March 22, 2013

Why Entire partition is used in LOOKUP stage ?

Entire partition has all data across the nodes So while matching(in lookup) the records all data should be present across all nodes.

For lookup sorting is not required.so when we are not using entire partition then reference data splits into all nodes. Then each primary record need check with all nodes for matched reference record.Then we face performance issue.If we use entire in lookup then one primary record needs to look into 1 node is enough.if match found then that record goes to target otherwise it move to reject,drop etc(based on requirement)no need check in another node.In this case if we are running job in 4 nodes then at a time 4 records should process.
Note:Please remember we go for lookup only we have small reference data.If we go for big data it is performance issue(I/O work will increase here) and also some times job will abort.

Wednesday, December 14, 2011

FastTrack Makes Your DataStage Development Faster

IBM introduced a tool called FastTrack that is a source to target mapping tool that is plugged straight into the Information Server and runs inside a browser. 
The tool was introduced with the Information Server and is available in the 8.1 version.
As the name suggests IBM are using it to help in the analysis and design stage of a data integration project to do the source to target mapping and the definition of the transform rules.  Since it is an Information Server product it runs against the Metadata Server and can share metadata with the other products and it can run inside a browser.
I have talked about it previously in New Product: IBM FastTrack for Source To Target Mapping and FastTrack Excel out of your DataStage project but now I have had the chance to see it in action on a Data Warehouse project.  We have been using the tool for a few weeks now and we are impressed.  It’s been easier to learn than other Information Server products and it manages to fit most of what you need inside frames on a single browse screen.  Very few bugs and it has been in the hands of someone who doesn’t know a lot about DataStage and they have been able to complete mappings and generate DataStage jobs.
I hope to get some screenshots up in the weeks to come but here are some observations in how we have saved time with FastTrack:
  1. FastTrack provides faster access to metadata.  In an Excel/Word mapping environment you need to copy and paste your metadata from a tool that can show it into your mapping document.  FastTrack can see metadata imported through any of the Information Server import methods such as DataStage plugin imports, the import bridges from Meta Integration Technologies Inc (MITI): ErWin, InfoSphere Data Architect, Cognos, Business Objects, Informatica etc.  The imports via the database connectors from any other Information Server product such as the table definitions imported and profiled by Information Analyser.  You can import an entire database in a few minutes and drag and drop it onto your various mappings.
  2. FastTrack lets you map columns from XML and Cobol Copybook hierarchies to flat file relational database targets without any metadata massaging.  In Excel you would spend days cutting and chopping XML and Cobol complex flat file definitions.  With FastTrack you can access a definition imported through the DataStage Cobol or XML importers and map away.
  3. FastTrack lets you do source to target mapping straight out of your modelling tool.  You can import your model straight into the Metadata Server via a bridge and start mapping it.  No mucking around with database DDLs and no need to get access to create database schemas.  This can be handy in the early days of a project.
  4. FastTrack has some great auto mapping functions.  There is a discovery functions where you drag and drop the source or target table onto one side of the mapping and then use the discover function to find candidate matches for the other side – then choose the “Best Match” to take the first of the candidates.  If you choose multiple columns you can Discover and Best Match all the columns in your table.  It searches through for matching column names against the candidate tables.
  5. FastTrack can map auto match on the business glossary terms attached to those columns.  It is one of the few products in the Information Server that makes productive use of the Business Glossary to speed things up.  Of course you need to create your Glossary and fill it with terms and map those terms to physical metadata first!  FastTrack lets you add Glossary terms to physical columns as you map.
  6. FastTrack lets you balance the mapping between business analysts and ETL developers.  Both can use the tool – it’s an Excel style interface – but business analysts may be faster at mapping early in the project as they gather requirements and rules and ETL   This can help avoid bottlenecks on your team if anyone can do mapping and can push the results straight into DataStage.
  7. FastTrack creates DataStage jobs.  These jobs have the source and target connectors already loaded and configured and stages such as lookup, join and transformer already built.  It even lets you add Transformer derivations such as macros, job parameters and functions from a function list.
  8. FastTrack handles DataStage naming conventions.  FastTrack holds a set of rules for naming DataStage stages and links that you can configure to match your naming convention.  Normal DataStage development means dragging and dropping stages and links onto a canvas and renaming every one.  FastTrack does the naming for you.
  9. FastTrack lets you add the links for joins and lookups.  I don’t know if you’ve tried to map joins and lookups in Excel but it’s not pretty – you have room to map the extra columns but there is no easy way to show the key fields that join the two sources together.  Generally you make a note of it under the mapping.  In FastTrack you choose the join/lookup table, choose the key fields that do the join and bring in the extra columns for mapping to the output table and it generates the required stages in DataStage.
  10. FastTrack shows progress of mapping tasks.  Once you have created a mapping for all interfaces FastTrack will produce a report showing how much of each mapping has been finished saving you the time of tracking progress manually.
What FastTrack can do better.
  1. Better bulk export and import functions – preferably XML and Excel.  Excel for when we produce documentation.  XML for when we want to back it up or move it between repositories.  (Or export, run a search and replace to globally rename a transform value and import it again).
  2. Global search and replace on transformation values, similar to the search and replace in the DataStage Transformer, for globally renaming things like function names and job parameter values.
  3. More DataStage stages – it currently lets you configure settings for Lookup, Join, Connectors and Transformers.  Would like to see Surrogate Key, Change Data Capture and Slowly Changing Dimension support – though it’s debatable whether those are business analyst functions for FastTrack or developer functions for DataStage.  It would be cool to define Type 1 Type 2 and key fields for dimension table mapping.
  4. Let you run Discover and Best Match on Business Glossary terms so you can find terms that suit the column name you are mapping.
  5. Discover transformation rules as well as mappings … oh hang on, that’s in the next release!
  6. Reverse engineer DataStage Server jobs so you can generate DataStage Enterprise jobs from a Server job mapping.
  7. More flexible licensing.  You buy licenses in packs of 10 – and that’s too many for a lot of customers!

Wednesday, November 30, 2011

IBM® InfoSphere™ Information Server 8.7

IBM® InfoSphere™ Information Server 8.7 helps you take delivery of trusted information to the next level with a more comprehensive integration platform. New InfoSphere Information Server capabilities help strengthen collaboration between business and IT professionals to make it easier for strategic initiatives, such as business analytics, master data management, data warehousing, and application consolidation and migration projects, to utilize and act on trusted information.
IBM InfoSphere Information Server 8.7 helps you turn insight into actionable results by:
Examples of the significant new integration capabilities delivered as part of InfoSphere Information Server 8.7 include:

Friday, November 18, 2011

Conversion error calling conversion routine date from string data may have been lost

I have a file which has five columms. A field "pdate" is char data type which has date values (format ddmmyy eg 011008). This date value is need to be converted before loading into a target table in oracle 10g.
I have used the converion function StringToDate(ln_TTS_DelvyDate_in.pu_date,"%dd-%mm-%yy") to convert the char value into date.
.

The job is: Seq file--->TRANSFORMER---->SeqFile

When I'm executing this job it gives the following error:

Tr_Split,1: Conversion error calling conversion routine date_from_string data may have been lost


In the output file
The "pudate" has '*******' being propagated.


The input date column will not have null values.

Input Data- Char data type

011008

Output Data - date data type

******
-------------------------------------------






Solution::::

The conversion function should not have hyphens as your data doesn't have it.

StringToDate(ln_TTS_DelvyDate_in.pu_date,"%dd%mm%yy")

should work.

or   This warning will also occurs if data itself contains junk data that does not support the required format even though conversion function StringToDate(ln_TTS_DelvyDate_in.pu_date,"%dd%mm%yy") is present. Best case is replace the source file which does not have junk data.




TNS entry error when trying to connect to a database using DS in order to import the table definition?

This usually indicates that either the DataStage userid running the job does not have read access to the Oracle run-time client files, or that the environment variable ORACLE_HOME is not defined.


ORA-12154: TNS:could not resolve service name
The above  error indicate that the connection identifier or service name specified in the DataStage job (or in the ORACLE_SID environment variable or ODBC definition) was not known to either the Oracle client or the Oracle server. The connection/service identifiers known to the Oracle run-time client are defined in the tnsnames.ora file, $ORACLE_HOME/network/admin/tnsnames.ora
Verify that the identifier specified for the failing Oracle connection has been defined in tnsnames.ora. If it is correctly defined, then next verify that the ORACLE_HOME environment variable is correctly defined, and that tnsnames.ora file has correct read permissions.
If the above items are configured correctly, also check the listener.log on Oracle server to confirm that the service id (or the database it maps to) are known to the Oracle server.
Setting up environment variables required to use Oracle run-time client
The Oracle client requires that the following environment variables be defined. These should be set in the .dsenv file in the DataStage DSEngine directory.

ORACLE_HOME=/home/oracle
LIBPATH=$LIBPATH:$ORACLE_HOME/lib:
PATH=$PATH:$ORACLE_HOME/bin

Change the path defined for ORACLE_HOME to the correct path for your system. ORACLE_HOME should be set to the absolute path to the home Oracle directory which is the directory level directly above the lib and bin directories.

Please also note that the name of the library path environment variable varies with different operating systems:
  • AIX - use LIBPATH
  • Solaris - use LD_LIBRARY_PATH
  • HP-UX - use SHLIB_PATH
  • Linux - use LD_LIBRARY_PATH

Tuesday, November 1, 2011

Parallel job: Retrieve sql codes on a failed upsert

Parallel job: Retrieve sql codes on a failed upsert 

Introduction

When an enterprise database stage such as DB2 or Oracle is set to upsert it is possible to create a reject link to trap rows that fail any update or insert statements. By default this reject link holds just the columns written to the stage, they do not show any columns indicating why the row was rejected and often no warnings or error messages appear in the job log.

Steps

There is an undocumented feature in the DB2 and Oracle enterprise stage where a reject link out of the stage will carry two new fields, sqlstate and sqlcode. These hold the return codes from the RDBMS engine for failed upsert transactions. The fields are called sqlstate and sqlcode.
To see these values add a peek to your reject link, the sqlstate and sqlcode should turn up for each rejected row in the job log. To trap these values add a copy stage to your reject link, add sqlstate and sqlcode to the list of output columns, on the output columns tab check the "Runtime column propagation" check box, this will turn your two new columns from invalid red columns to black and let your job compile. If you do not see this check box use the Administrator tool to turn on column propagation for your project.
When the job runs and a RDBMS reject occurs the record is sent down the reject link, two new columns are propagated down that link and are defined by the copy stage and can then be written out to an error handling table or file.
If you do not want to turn on column propagation for your project you can still define the two new columns with a Modify stage by creating them in two specifications. sqlcode=sqlcode and sqlstate=sqlstate. Despite column propagation being turned off the Modify stage will still find the two columns on the input link and use the specification to add them to the output schema.

Examples

Oracle By default, oraupsert produces no output data set. By using the -reject option, you can specify an optional output data set containing the records that fail to be inserted or updated. It’s syntax is: -reject filename For a failed insert record, these sqlcodes cause the record to be transferred to your reject dataset: -1400: cannot insert NULL -1401: inserted value too large for column -1438: value larger than specified precision allows for this column -1480: trailing null missing from string bind value
For a failed update record, these sqlcodes cause the record to be transferred to your reject dataset: -1: unique constraint violation -1401: inserted value too large for column -1403: update record not found -1407: cannot update to null -1438: value larger than specified precision allows for this column -1480: trailing null missing from string bind value An insert record that fails because of a unique constraint violation (sqlcode of -1) is used for updating.
DB2 When you specify the -reject option, any update record that receives a status of SQL_PARAM_ERROR is written to your reject data set. It’s syntax is: -reject filename

Conclusion

Always place a reject link on a Database stage that performs an upsert. There is no other way to trap rows that fail that upsert statement.
For other database actions such as load or import a different method of trapping rejects and messages is required.

Friday, October 21, 2011

AutoSys Cheatsheet


AutoSys: UNIX
Cd to the "autouser" ($AUTOUSER) directory and "." (or source) the "ksh" file. Ex: ". ./autosys.ksh.machine" After installing AutoSys, first make sure that the DB is up and running. Check the installation by running the command chk_auto_up to verify connection to the DB and event processor.
Enter the KEYS through "gatekeeper", add keys
Run the "autosys_secure" command to set the AutoSys Edit and Exec Super users (and also to enter NT users/passwords)
Start the Event Processor by running the command "eventor"
Shutdown AutoSys: "sendevent -E STOP_DEMON"
To start the AutoSys GUI set your DISPLAY and run the command "autosc &".
NT: Start AutoSys from start->programs->AutoSys-> administrator ->Graphical User Interface ->Command Prompt
Command Line Commands:
1. gatekeeper: Allows you to enter the License Keys which allow you to run AutoSys.
2. eventor [-M machine_name] : Starts the event processor.
3. autorep -J [ALL | Job_name] [-q] [> file_name], -d (detail), -r (run number), -o (override), jil < file_na -G (global var report), -M -q for machine definitions.
Ex: autorep -J job_name -d
autorep -J job_name -d
autorep -J job_name -q > file_name queries the DB & save job Dfn. Into a file
vi file_name
When you want a report of a box use the -L0 option
Autorep -J job_name -l1 report on the job for the day -1 (prev day)
4. sendevent -E STARTJOB -J job_name, sendevent -E FORCE_STARTJOB -J job_name, [JOB_ON_ICE, JOB_OFF_ICE, JOB_ON_HOLD, JOB_OFF_HOLD, SET_GLOBAL, STOP_DEMON. . . .]
sendevent -E STOP_DEMON - to stop AutoSys
(ex: sendevent -E SET_GLOBAL -G "var_name=/home/mydir" to set a var)
(ex: sendevent -E SET_GLOBAL -G "var_name=DELETE" to delete a var)]
5. chk_auto_up: checks to see if event processor and the DB are both up.
6. autoping -m machine: verify that both client & server are correctly configured.
7. cron2jil -f cronfile [-d outdir] [-I incl_file] [-m machine] [-p prefix]
8. jil
To insert a job directly into the DB
insert_job: job.id job_type: c
machine: machine_name
command: echo testing jil
[go | ;] (depending on the DB you are using)
Template example:
/* ----------------- template ----------------- */
insert_job: template job_type: c
box_name: box1
command: ls -l
machine: localhost
owner: lyota01@TANT-A01
permission: gx,ge,wx,we,mx,me
date_conditions: 1
days_of_week: all
start_times: "15:00, 14:00"
run_window: "14:00 - 6:00"
condition: s (job1)
description: "description field"
n_retrys: 12
term_run_time: 60
box_terminator: 1
job_terminator: 1
std_out_file: /tmp/std_out
std_err_file: /tmp/std_err
min_run_alarm: 5
max_run_alarm: 10
alarm_if_fail: 1
max_exit_success: 2
chk_files: /tmp 2000
profile: /tmp/.profile
job_load: 25
priority: 1
auto_delete: 12
9. autosyslog -e: same as tail -f autosys_log_file. This command must be run from the machine where the server resides if used with the -e option. Else it can be used with the -J option to see that job's run log.
10. job_depends: -[c|d|t] -J jobname [-F "mm/dd/yy time"] [-T "mm/dd/yy time"] (Note: It will only print out the first occurrence found)
11. monbro -n monitor_name: Allows you to run from command line monitor/browser programs previously created using the monitor/browser GUI.exec superuser: AUTOSYS superuser
12. autocal_asc full_cal_name: prints, adds & deletes custom calendar definitions.
13. autostatus: Reports the current status of a specific job, or the value of an AutoSys global variable. Ex: autostatus -J job_name, -S instance
14. autotimezone -l : Allows additions, deletions, and queries to the timezones table (-l provides list).
15. autotrack: Tracks & report changes to the AutoSys DB. Ex: autotrack -l 2 (level 2) [sets the tracking level] autotrack -U sys -v (user sys: verbose) To start using the autotrack utility type: autotrack -u to set tracking level 1 or 2. By default it is set to 0. Autotrack -l will list the current tracking level. Options -[J, U, m, F, T, and t] are to request reporting on a specific Job, User, machine, time window (-F -T), and event type (t). Type is used in conjunction w/other parameters. autotrack w/no arguments retrieves information an all events omitting detail. -v option is for verbose.
16. autosys_secure: to change edit, exec superusers, change DB passwd, change remote authentication method.
17. chase [-A|E]: Makes sure that jobs claiming to be running in the client machine are running. The "-E" option restarts the job.
18. archive_events: to archive events in the DB which are older than x days to prev DB from becoming full.
19. clean_files: Deletes old remote agent log files. It does it by searching the DB for all machines which have had jobs started on them.
20. autostatad: to get the status of a PeopleSoft job. You can define one of the user definable buttons to view PeopleSoft job: Autocons*userButton1Label: Adapter Status
User definable buttons: There are user definable buttons in the operator's console.
How to configure:
Autocons*userButton1Command: /autosys/bin/autostatad -J $JOB -g & (which allows you to have a command button on the operator's console.)
Dependencies:
success (job) and s(job_b)
failure(job_a) or f (job_b)
notrunning (job)
terminated(job)
exitcode(job) > 5 and exitcode(job_b) != 10
value(global_name)=100
done(job)
Hostscape: Schedule a job to run every x minutes & then go into forecasting. Make that job fail.
· Solid black line: Hostscape can communicate with the remote agent in the client machine.
· Solid red line: Hostscape can't communicate with the remote agent but it can communicate with the internet daemon (inetd) running on that machine..
· Dashed red line: Hostscape can't communicate with the client machine at all. Client is probably down.
· Accessing a variable name: $$GLOBAL_VAR_NAME (unless used in dependency condition with a job definition. If used in the "command" field, you must use the $$)
Tunable Parameters:
· $AUTOUSER/config.ACE
· $AUTOUSER/autosys.ksh.xxx
· /etc/auto.profile
· /etc/inetd.conf
· /etc/services
Notify.Ace: The alarms to notify on are:
(There is an example in $AUTOSYS/install/data/Notify.Ace).
· DB_ROLLOVER
· DB_PROBLEM
· EP_HIGH_AVAILABILITY
· EP_ROLLOVER
· EP_SHUTDOWN
Where to go to find the Errors:
· $AUTOUSER/out/event_demon.$AUTOSERV
($AUTOUSER/out/event_demon.ACE)
· Output from the job definition output & error files
· /tmp files created for job_run at client machine
· $AUTOSYS/out/DBMaint.out for DB problems
· $SYBASE/install/errorlog_$DSQUERY when event server will not start.
· NT: AutoNuTc\lib/X11\app-defaults\xpert
AutoSys Maintenance: DBMaint @$AUTOSYS/bin
Once a day the Database goes into a maintenance cycle. Every day at 3:00am it runs a program called DBMaint. This is user configurable. The program runs DBstatistics which is found in $AUTOSYS/bin.
app-defaults file: /usr/openwin/lib/app-defaults directory. Autocons, Xpert, etc.. ( or: /usr/lib/X11/app-defaults, /autosys/bin/X11/app-defaults)
Environment file: /etc./auto.profile
C programs: $AUTOSYS/code
Where to change AutoSys screen fonts: /usr/openwin/lib/app-defaults
Where to look for troubleshooting: Chapter 15
Summary of commands: Appendix C
$AUTO_JOB_NAME: when naming a file dynamically using as prefix AutoSys's job name.
$AUTORUN: unique identifier for the run of that job
$AUTOPID: unique identifier for that job's run number (PID)
$JOID: DB identifier for a job. To extract from the DB: select joid from job where job_name=" "
Creating a Virtual Machine:
insert_machine: virtual
type: v /* default, not required */
machine: real_1
machine: real_2
max_load: 100
factor: 0.5 /* used to describe the relative processing power of a machine. Usually between 0.0-1.0*/
machine: real_2
max_load: 60 /* this is designed to limit the loading of a machine */
Load Balancing, Queuing, priorities:
insert_job: test_load
machine: localhost
command: echo "Test load balancing"
job_load: 50
priority: 1 /* this only affects queues */
Note: For 5.0 we will be using information from ServerVision's towards our load balancer which is composed of 26 categories such as i/o usage, disk usage, CPU usage, etc.
Testing:
zql
zql -U autosys -P autosys
NOTES:
When a job is stuck in the starting condition this means that the event processor communicated with the remote agent and passed all the information the remote agent ran the job but was not able to communicate to the DB. Once testing is done with AutoSys one should change the default refresh interval for AutoSys. This is so there is less querying to the DB. When AutoSys goes from dual mode to single mode, always run the autobcp command before bringing AutoSys back to dual mode/High Availability. Default behavior for stdout is to always appends. If you want to overwrite the file enter the following, no spaces: ">file.out"
Box Logic
Use boxes to group jobs with like scheduling parameters, not as means of grouping jobs organizationally. For example, if you have a number of jobs that run daily at 1:00 a.m., you could put all these jobs in a box and assigning a daily start condition to the box. However, a variety of account processing jobs with diverse starting conditions should not be grouped in the same box.
Default Box Job Behavior
Some important rules to remember about boxes are:
  • Jobs run only once per box execution.
  • Jobs in a box will start only if the box itself is running.
  • As long as any job in a box is running, the box remains in RUNNING state; the box cannot complete until all jobs have run.
  • By default, a box will return a status of SUCCESS only when all the jobs in the box have run and the status of all the jobs is "success." Default SUCCESS is described in Default Box Success and Box Failure on page 5-13.
  • By default, a box will return a status of FAILURE only when all jobs in the box have run and the status of one or more of the jobs is "failure." Default FAILURE is described in Default Box Success and Box Failure on page 5-13.
  • Unless otherwise specified, a box will run indefinitely until it reaches a status of SUCCESS or FAILURE. For a description of how to override this behavior, see Box Job Attributes and Terminators on page 5-6.
  • Changing the state of a box to INACTIVE (via the sendevent command) changes the state of all the jobs in the box to INACTIVE.
When you Should Not Use a Box
The fact that all jobs in a box change status when a box starts running has lead some to use boxes to implement "job cycle" behavior. Be aware that placing jobs in a box to achieve this end may bring with it undesired behavior due to the nature of boxes.
Avoid the temptation to put jobs in a box as a short cut for performing events (such as ON_ICE or ON_HOLD) on a large number of jobs at once. You will most likely find that the default behavior of boxes inhibits the expected execution of the jobs you placed in the box.
Likewise, you should not place jobs in a box solely because you want to run reports on all of them. When you run autorep on a box, you will get a report on the box and all the jobs in the box (unless you use the -L0 option). In addition, if you use wildcarding when specifying a job name, you could get duplicate entries in your report. For example, suppose you have a box named "acnt_box" containing three jobs named "acnt_job1", "acnt_job2", and "daily_rep". If you specify acnt% as the job name for the autorep report, the report will have an entry for the box "acnt_box" and an entry for each job in the box. Then autorep will continue searching for all job names matching the wildcard characters and, thus, will list "acnt_job1" and "acnt_job2" a second time.
What Happens when a Box Runs
As soon as a box starts running, all the jobs in the box (including sub-boxes) change to status ACTIVATED, meaning they are eligible to run. (Because of this, jobs in boxes do not retain their statuses from previous box cycles.) Then each job is analyzed for additional starting conditions. All jobs with no additional starting conditions are started, without any implied ordering or prioritizing. Jobs with additional starting conditions remain in the ACTIVATED state until those additional dependencies have been met. The box remains in the RUNNING state as long as there are activated or running jobs in the box.
If a box is terminated before a job in it was able to start, the status of that job will change directly from ACTIVATED to INACTIVE.
Note o Jobs in a box cannot start unless the box is running. However, once the job starts running, it will continue to run even if the box is later stopped for some reason.
Time Conditions in a Box
Each job in a box will run only once per box execution. Therefore, you should not define more than one time attribute for any job in a box because the job will only run the first time. If you want to put a job in a box, but you also want it to run more than once, you must assign multiple start time conditions to the box itself, and define no time conditions for the job. Remember also that the box must be running before the job can start. Do not assign a start time for a job in a box if the box will not be running at that time. If you do, the next time the box starts the job will start immediately.
The following example illustrates a scenario that would not work properly if placed in a box.
"job_a" is defined to run repeatedly until it succeeds. "job_report" has one starting condition-the success of "job_a".
How Job Status Changes Affect Box Status
If a box that is not running contains a job that changes status, as a result of a FORCE_STARTJOB or CHANGE_STATUS event, the new job status could change the status of its container box. A change of status of the box could trigger the start of downstream jobs that are dependent on the box.
If a box contained only one job, and the job changed status, the box status would change.