Wednesday, November 13, 2013

RUNSTATS in DB2

Background

The RUNSTATS utility updates statistics in the system catalog tables to help with the query optimization process. Without these statistics, the database manager could make a decision that would adversely affect the performance of an SQL statement. The RUNSTATS utility allows you to collect statistics on the data contained in the tables, indexes, or both tables and indexes. Use the RUNSTATS utility to collect statistics based on both the table and the index data to provide accurate information to the access plan selection process in the following situations:
  • When a table has been loaded with data, and the appropriate indexes have been created.
  • When a table has been reorganized with the REORG utility.
  • When there have been extensive updates, deletions, and insertions that affect a table and its indexes. ("Extensive" in this case may mean that 10 to 20 percent of the table and index data has been affected.)
  • Before binding application programs whose performance is critical.
  • When you want to compare new statistics with previous statistics. Running statistics on a periodic basis enables you to discover performance problems at an early stage.
  • When the prefetch quantity is changed.
  • When you have used the REDISTRIBUTE NODEGROUP utility.
When optimizing SQL queries, the decisions made by the SQL compiler are heavily influenced by the optimizer's model of the database contents. This data model is used by the optimizer to estimate the costs of alternative access paths that can be used to resolve a particular query. A key element in the data model is the set of statistics gathered about the data contained in the database and stored in the system catalog tables. This includes statistics for tables, nicknames, indexes, columns, and user-defined functions (UDFs). A change in the data statistics can result in a change in the choice of access plan selected as the most efficient method of accessing the desired data.
Examples of the available statistics that help define the data model to the optimizer include (a sample catalog query follows this list):
  • The number of pages in a table and the number of pages that are not empty.
  • The degree to which rows have been moved from their original page to other (overflow) pages.
  • The number of rows in a table.
  • Statistics about individual columns, such as the number of distinct values in a column.
  • The degree of clustering of an index; that is, the extent to which the physical sequence of rows in a table follows an index.
  • Statistics about the index such as the number of index levels and the number of leaf pages in each index.
  • The number of occurrences of frequently used column values.
  • The distribution of column values across the range of values present in the column.
  • Cost estimates for user-defined functions (UDFs).
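For example, a quick way to see some of these column-level statistics is to query the catalog directly. A minimal sketch, where DB_NAME, MYSCHEMA and MYTABLE are placeholders for your own database, schema and table:
 db2 -v connect to DB_NAME 
 # MYSCHEMA and MYTABLE are placeholders 
 db2 -v "select colname, colcard, numnulls, high2key, low2key from syscat.columns where tabschema = 'MYSCHEMA' and tabname = 'MYTABLE'" 
 db2 -v terminate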
RUNSTATS can help you determine how performance is related to changes in your database. The statistics show the data distribution within a table. When used routinely, RUNSTATS provides data about tables and indexes over a period of time, thereby allowing performance trends to be identified for your data model as it evolves over time. Rebind applications that use static SQL after using RUNSTATS so that the query optimizer can choose the best access plan given the new statistics. However, for applications using dynamic SQL (e.g. most vendor applications) rebinding is not necessary since the statement will be optimized based on the statistics at run time. When statistical information about tables is not accurate, it may cause performance problems. In a worst-case scenario, a particular SQL statement may cause DB2 to use a table scan instead of an index scan.
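If you need to rebind all packages in a database after collecting new statistics, the db2rbind utility can do it in one pass. A minimal sketch, where DB_NAME and the log file name are placeholders:
 # DB_NAME and rbind.log are placeholders 
 db2rbind DB_NAME -l rbind.log all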

How to update the statistics

Statistics for objects are updated in the system catalog tables only when explicitly requested. There are several ways to update some or all of the statistics:
  • Using the RUNSTATS (run statistics) utility.
  • Using LOAD, with statistics collection options specified.
  • Coding SQL UPDATE statements that operate against a set of predefined catalog views (see the sketch after this list).
  • Using the "reorgchk update statistics" command.
When you do not know all the table names exactly, or there are too many of them, the easiest way to run RUNSTATS is to use the "db2 reorgchk update statistics" command. The exact script looks like this:
 db2 -v connect to DB_NAME 
 db2 -v "select tbname, nleaf, nlevels, stats_time from sysibm.sysindexes" 
 db2 -v reorgchk update statistics on table all 
 db2 -v "select tbname, nleaf, nlevels, stats_time from sysibm.sysindexes" 
 db2 -v terminate
The example we chose above does not require table names. This one command performs RUNSTATS on all tables.
Remember: Don't run the RUNSTATS utility until after you have populated the database.
If you know the table names, and to avoid a single run over a large number of tables that may take a long time to complete, it is preferable to run RUNSTATS on each table one at a time. The command looks like the following:
 db2 -v runstats on table TAB_NAME and indexes all
This collects statistics for the table and all of its indexes (basic level).
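If you also want distribution statistics and detailed index statistics, a slightly richer variant looks like the sketch below; note that the table name should normally be qualified with its schema (MYSCHEMA.TAB_NAME is a placeholder):
 # MYSCHEMA.TAB_NAME is a placeholder 
 db2 -v "runstats on table MYSCHEMA.TAB_NAME with distribution and detailed indexes all"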

Checking to see if RUNSTATS has been run

One quick way to see whether RUNSTATS has been performed on your database is to query some system catalog tables. For example, as shown in the script above, you can run this command:
 db2 -v "select tbname, nleaf, nlevels, stats_time from sysibm.sysindexes"
If RUNSTATS has not yet been run, you will see "-1" for the nleaf and nlevels columns, and a "-" for the stats_time column. If RUNSTATS has been run, these columns contain actual values, and the stats_time column contains the timestamp of when RUNSTATS ran. If the time shown in stats_time seems too old, it is time to run RUNSTATS again.
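You can also check the table-level view SYSCAT.TABLES. The sketch below lists user tables whose statistics are missing or older than 30 days; the 30-day threshold is only an example:
 db2 -v "select tabschema, tabname, card, stats_time from syscat.tables where type = 'T' and (stats_time is null or stats_time < current timestamp - 30 days)"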

Reference: http://www.ibm.com/developerworks/data/library/techarticle/anshum/0107anshum.html

DB2 Performance fundamentals


  1. Have enough memory.
    • For a 32-bit system, use at least 512 MB of RAM per CPU, up to 4 GB per machine, to support the buffer pools, DB2 agents, and other shared memory objects required for a large number of concurrent users. (See Buffer pool size (BUFFPAGE) for more information on buffer pools.) More memory may be needed to support applications that run locally or as stored procedures. On AIX®, additional memory can be used by the JFS file cache to supplement the buffer pool.
    • For a 64-bit system, the buffer pool can be practically any size. However, for most e-commerce OLTP applications that use a large database, the buffer pool doesn't really need to be more than 8 GB. Bigger is still better, but at some point you'll experience diminishing returns as the buffer pool hit ratio moves to the 98+% range. The number of concurrent users (with its impact on the number of DB2 agents) determines how much more memory is required.
    • The amount of memory required by each user connection into the database (that is, a DB2 agent) depends on the nature of the SQL statements performed by the application -- such as the number of concurrent cursors opened and the amount of sorting and temp space required. For OLTP applications, there should be less sorting and temp space required and only a handful of concurrent cursors opened at a time.
    • Rule of thumb: Use a minimum of 1 MB for UNIX and 500 KB for Windows for each DB2 agent. If fenced stored procedures are used, then each user connection has two DB2 agents, in addition to the memory required to run the stored procedure application.
  2. Have sufficient I/O handling capability.
    • There must be enough disk arms to ensure sufficient I/O parallelism to support a high volume of concurrent transactions. There should be at least 5 to 10 disks per CPU for a moderate workload, and 20 disks for a high-I/O OLTP workload. The operating system (including paging space), DB2 logs, and DB2 table spaces should each have their own dedicated disks. There should be multiple disks for the DB2 logs, for tables, and for indexes.
    • The proper way to estimate the I/O handling capability that is needed for good performance is actually to prototype the transactions and find out how many I/Os are required per transaction, and how many transactions per second are required. Then find out the I/O rate for the disk controller and the disk subsystem to help determine how many controllers and disks are required.
  3. Have sufficient network bandwidth. There must be enough network bandwidth to support the workload. Make sure that the network or any intermediate hubs are not a bottleneck. This is especially significant when supporting access from remote locations. For example, a T1 line supports 1.544 Mbit/sec, which is only about 0.193 MB/sec, whereas a typical 10 Mbit/sec Ethernet LAN can support roughly 6x that throughput at 1.25 MB/sec. Use commands such as netstat on UNIX to monitor the traffic volumes on your connections.
  4. Use the DB2 Performance Configuration Wizard from the DB2 Control Center to set up the initial DB2 Database Manager and Database Configuration parameters. This tool will ask you a series of questions about the nature of the workload to determine a starting set of configuration parameter values. You can modify these parameters to suit your production workload.
  5. Index your table columns appropriately.
    • Ensure that columns that are joined in queries have indexes.
    • It can improve performance if columns involved in ORDER BY and GROUP BY are indexed.
    • Frequently accessed data can also be included within an index as INCLUDE columns.
    • Use the Index Advisor (also known as Index Wizard from the DB2 Control Center) to help determine a good set of indexes to use, based on the tables and SQL statements that you use.
  6. Ensure that applications hold locks for as short a time as possible.
    • When a user operation involves multiple interactions, each interaction should have its own transaction to commit and should free up all locks before returning activity to the user. Keep the duration of a transaction as short as possible by starting its first SQL statement (which starts a transaction) as late as possible, and its updates (inserts, updates, and deletes, which use exclusive locks) as close to the commit stage as possible.
    • Use of the DB2 registry parameter DB2_RR_TO_RS can improve concurrency by not locking the next key of the row that was inserted or updated. This can be used if the isolation level RR (Repeatable Read) is not used by any programs that operate on the same set of tables. Use DB2 Snapshot to monitor the number of deadlocks and lock waits.
  7. Use stored procedures or compound SQL to minimize the network cost.
    • Minimizing the number of network trips for your SQL statements saves on network latency and context switches, which can result in the application holding onto locks for a shorter period of time. Generally, a stored procedure should be used when an OLTP transaction has more than 4 or 5 statements.
    • On the other hand, if there is some complicated CPU-intensive processing involved in the application logic, leaving this in a stored procedure running on the database server can use up excessive CPU cycles on the database server at the expense of database operations. In this case, either do not use a stored procedure, or execute part of the logic on the client side and the rest in a stored procedure.
  8. Use SQL efficiently.
    • In general, don't use multiple SQL statements where one will do. When you provide more detailed search conditions by having more predicates in a query, the optimizer has a chance to make better choices. You should also be selective in your query so that the database does not return more rows and columns than you need. For example, use SQL to filter the rows you want; don't return all the rows and then require the application to do the filtering.
  9. Analyze the access plan.
    • Use Visual Explain or db2exfmt to analyze each SQL statement. Make sure appropriate indexes are used to minimize the rows that have to be fetched internally when selecting and joining tables (see the sketch below).
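A minimal sketch of such an explain workflow, assuming the explain tables have already been created (for example from the EXPLAIN.DDL file shipped under sqllib/misc) and using a made-up database name and query:
 db2 -v connect to DB_NAME 
 db2 -v set current explain mode explain 
 # the statement below is a made-up example; in explain mode it is explained, not executed 
 db2 -v "select * from MYSCHEMA.ORDERS where cust_id = 100" 
 db2 -v set current explain mode no 
 db2 -v terminate 
 db2exfmt -d DB_NAME -1 -o explain_plan.txt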

    Reference: http://www.ibm.com/developerworks/data/library/techarticle/anshum/0107anshum.html

Tuesday, November 12, 2013

DataStage Performance Tuning


 

Performance Tuning - Basics

Parallelism :
Parallelism in DataStage Jobs should be optimized rather than maximized. The degree of parallelism of a DataStage Job is determined by the number of nodes defined in the configuration file, for example four-node, eight-node, etc. A configuration file with a larger number of nodes generates a larger number of processes and in turn adds processing overhead compared to a configuration file with a smaller number of nodes. Therefore, while choosing the configuration file, one must weigh the benefits of increased parallelism against the losses in processing efficiency (increased processing overhead and slower start-up time). Ideally, if the amount of data to be processed is small, a configuration file with fewer nodes should be used, while if the data volume is larger, a configuration file with a larger number of nodes should be used.
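For illustration, a minimal two-node configuration file might look like the sketch below; the host name "etlhost" and the disk and scratch paths are placeholders that must match your environment:
 { 
   node "node1" 
   { 
     fastname "etlhost" 
     pools "" 
     resource disk "/data/ds/node1" {pools ""} 
     resource scratchdisk "/scratch/node1" {pools ""} 
   } 
   node "node2" 
   { 
     fastname "etlhost" 
     pools "" 
     resource disk "/data/ds/node2" {pools ""} 
     resource scratchdisk "/scratch/node2" {pools ""} 
   } 
 }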

Partitioning :
Proper partitioning of data is another aspect of DataStage Job design, which significantly improves overall job performance. Partitioning should be set in such a way so as to have balanced data flow i.e. nearly equal partitioning of data should occur and data skew should be minimized.

Memory :
In DataStage Jobs where a high volume of data is processed, the virtual memory settings for the job should be optimised. Jobs often abort in cases where a single lookup has multiple reference links; this happens due to low temp memory space. In such jobs, $APT_BUFFER_MAXIMUM_MEMORY, $APT_MONITOR_SIZE and $APT_MONITOR_TIME should be set to sufficiently large values, for example as in the sketch below.
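An illustrative sketch only: the values below are arbitrary starting points, not recommendations, and these variables are normally defined through the DataStage Administrator or as job parameters rather than shell exports:
 # values are illustrative only; tune them for your data volumes 
 export APT_BUFFER_MAXIMUM_MEMORY=150000000 
 export APT_MONITOR_SIZE=500000 
 export APT_MONITOR_TIME=30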

Performance Analysis of Various Stages in DataStage

Sequential File Stage -
The Sequential File Stage is a file Stage and the most common I/O Stage used in a DataStage Job. It is used to read data from or write data to one or more flat files. It can have only one input link or one output link, and it can also have one reject link. While handling huge volumes of data, this Stage can itself become one of the major bottlenecks, as reading from and writing to it is slow. Sequential files should be used in the following conditions:
  • When reading a flat file (fixed width or delimited) from a UNIX environment that has been FTPed from some external system.
  • When some UNIX operation has to be done on the file.
Do not use sequential files for intermediate storage between jobs: this causes a performance overhead, as data has to be converted before being written to and read from a UNIX file. In order to read from the Stage faster, the number of readers per node can be increased (the default value is one).

Data Set Stage :
The Data Set Stage is a file Stage which allows reading data from or writing data to a dataset. This Stage can have a single input link or a single output link. It does not support a reject link. It can be configured to operate in sequential mode or parallel mode. DataStage parallel extender jobs use datasets to store the data being operated on in a persistent form. Datasets are operating system files which by convention have the suffix .ds. Datasets are much faster compared to sequential files: data is spread across multiple nodes and is referenced by a control file. Datasets are not UNIX files and no UNIX operation can be performed on them. Usage of datasets results in good performance in a set of linked jobs, and they help in achieving end-to-end parallelism by writing data in partitioned form and maintaining the sort order.
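Because datasets are not plain UNIX files, they should be listed and removed with the orchadmin utility rather than UNIX commands. A minimal sketch, with a placeholder dataset path:
 # /data/ds/customer.ds is a placeholder path 
 orchadmin ll /data/ds/customer.ds 
 orchadmin rm /data/ds/customer.ds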

Lookup Stage –
The Lookup Stage is an active Stage. It is used to perform a lookup against any parallel job Stage that can output data. The Lookup Stage can have a reference link, a single input link, a single output link and a single reject link. A lookup is faster when the data volume is small. The Stage can have multiple reference links (though a sparse lookup can have only one reference link). The optional reject link carries source records that do not have a corresponding entry in the reference data. The Lookup Stage and the type of lookup should be chosen depending on the functionality and the volume of data. The sparse lookup type should be chosen only if the primary input data volume is small. If the reference data volume is large, usage of the Lookup Stage should be avoided, as all of the reference data is pulled into local memory.

Join Stage :
The Join Stage performs a join operation on two or more datasets input to the Stage and produces one output dataset. It can have multiple input links and one output link. There are three types of join operations: inner join, left/right outer join and full outer join. Join should be used when the data volume is high; it is a good alternative to the Lookup Stage when handling huge volumes of data, because the Join Stage uses a paging method for data matching instead of pulling the reference data into memory.

Merge Stage :
The Merge Stage is an active Stage. It can have multiple input links and a single output link, and it supports as many reject links as there are update input links. The Merge Stage takes sorted input: it combines a sorted master data set with one or more sorted update data sets. The columns from the records in the master and update data sets are merged so that the output record contains all the columns from the master record plus any additional columns from each update record. A master record and an update record are merged only if both of them have the same values for the merge key column(s) that you specify. Merge key columns are one or more columns that exist in both the master and update records. For a Merge Stage to work properly, the master dataset and the update datasets should contain unique records. The Merge Stage is generally used to combine datasets or files.

Sort Stage :
The Sort Stage is an active Stage. It is used to sort the input dataset in either ascending or descending order. The Sort Stage offers a variety of options, such as retaining the first or last record when removing duplicate records, stable sorting, and specifying the algorithm used for sorting to improve performance. Even though data can be sorted on a link, the Sort Stage is used when the data to be sorted is huge. When we sort data on a link (the sort/unique option), I/O to disk takes place once the data size goes beyond the fixed memory limit, which incurs an overhead. Therefore, if the volume of data is large, an explicit Sort Stage should be used instead of a sort on the link. The Sort Stage also gives an option to increase the buffer memory used for sorting, which means lower I/O and better performance.

Transformer Stage :
The Transformer Stage is an active Stage, which can have a single input link and multiple output links. It is a very robust Stage with a lot of inbuilt functionality. The Transformer Stage always generates C++ code, which is then compiled into a parallel component, so the overheads of using a Transformer Stage are high. Therefore, in any job, it is imperative that the use of Transformers is kept to a minimum and other Stages are used instead, such as:
  • The Copy Stage, for mapping input links to multiple output links without any transformations.
  • The Filter Stage, for filtering out data based on certain criteria.
  • The Switch Stage, for mapping a single input link to multiple output links based on the value of a selector field.
It is also advisable to reduce the number of Transformers in a Job by combining the logic into a single Transformer rather than having multiple Transformers.

Funnel Stage –
The Funnel Stage is used to combine multiple inputs into a single output stream, but its presence reduces the performance of a job; in our observations it increased the time taken by a job by around 30%. When a Funnel Stage is needed in a large job, it is better to isolate it in its own job: write the outputs to datasets and funnel them in a new job. The Funnel Stage should be run in "continuous" mode, without hindrance.

Overall Job Design :
While designing DataStage Jobs, care should be taken that a single job is not overloaded with Stages. Each extra Stage in a Job leaves fewer resources available for every other Stage, which directly affects the Job's performance.
If possible, big jobs with a large number of Stages should be logically split into smaller units. Also, if a particular Stage has been identified as taking a lot of time in a job, such as a Transformer Stage with complex functionality, a lot of Stage variables and transformations, then the jobs could be designed in such a way that this Stage is put in a separate job altogether (more resources for the Transformer Stage!). While designing jobs, care must also be taken that unnecessary column propagation is not done: columns which are not needed in the job flow should not be propagated from one Stage to another and from one job to the next. As far as possible, RCP (Runtime Column Propagation) should be disabled in the jobs. Sorting in a job should also be handled with care; try to minimise the number of sorts in a job.
Design a job in such a way as to combine operations around the same sort keys, and if possible maintain the same hash keys. A most often neglected option is "don't sort if previously sorted" in the Sort Stage; set this option to "true" where the data is already sorted, as it improves Sort Stage performance a great deal. In the Transformer Stage, "Preserve Sort Order" can be used to maintain the sort order of the data and reduce sorting in the job. In a Transformer, a minimum of Stage variables should be used: the more Stage variables, the lower the performance. An overloaded Transformer can choke the data flow and lead to bad performance or even failure of the job at some point. In order to minimise the load on Transformers we can:
  • Avoid unnecessary function calls. For example, a varchar field holding a date value can be type cast into a Date type by simply formatting the input value; there is no need to use the StringToDate function, as implicit conversion of data types can be relied on.
  • Reduce the number of Stage variables used. We observed in a previous project that removing 5 Stage variables and 6 function calls reduced the runtime of a job from 2 hours to approximately 1 hour 10 minutes with 100 million input records.
  • Balance the load on Transformers by sharing the transformations across the existing Transformers. This ensures a smooth flow of data.
  • If you require type casting, renaming of columns or addition of new columns, use the Copy or Modify Stages to achieve this.
  • Whenever you have to use lookups on large tables, look at options such as unloading the lookup tables to datasets, or using user-defined join SQL to reduce the lookup volume with the help of temp tables.
The Copy Stage should be used instead of a Transformer for simple operations, including:
  • a job design placeholder between Stages
  • renaming columns
  • dropping columns
  • implicit (default) type conversions
The "upsert" write method works well if the data is sorted on the primary key column of the table being loaded; alternatively, determine whether the record already exists so that "Insert" and "Update" can be handled separately. It is sometimes possible to re-arrange the order of business logic within a job flow to leverage the same sort order, partitioning and groupings. Don't read from a Sequential File using SAME partitioning: unless more than one source file is specified, this will read the entire file into a single partition, making the entire downstream flow run sequentially (unless it is repartitioned).

Monday, September 9, 2013

Secondary Index in Teradata

A nice article on secondary indexes (SI) can be found at the link below: http://www.teradatatech.com/?p=815

Friday, March 22, 2013

Why is Entire partitioning used in the LOOKUP stage?

Entire partitioning places a complete copy of the data on every node, so while matching records in a lookup, all of the reference data is present on every node.

For a lookup, sorting is not required. When we do not use Entire partitioning, the reference data is split across the nodes, so each primary record would have to be checked against every node to find a matching reference record, and we face a performance issue. If we use Entire partitioning in the lookup, a primary record only needs to look in one node: if a match is found the record goes to the target, otherwise it moves to reject, drop, etc. (based on the requirement), with no need to check another node. In this case, if we are running the job on 4 nodes, 4 records are processed at a time.
Note: Please remember that we go for a lookup only when the reference data is small. If we use it for big data there is a performance issue (the I/O work increases), and sometimes the job will even abort.

Wednesday, December 14, 2011

FastTrack Makes Your DataStage Development Faster

IBM has introduced FastTrack, a source-to-target mapping tool that plugs straight into the Information Server and runs inside a browser.
The tool was introduced with the Information Server and is available in the 8.1 version.
As the name suggests, IBM is using it to help in the analysis and design stage of a data integration project, to do the source-to-target mapping and the definition of the transformation rules. Since it is an Information Server product, it runs against the Metadata Server and can share metadata with the other products.
I have talked about it previously in New Product: IBM FastTrack for Source To Target Mapping and FastTrack Excel out of your DataStage project, but now I have had the chance to see it in action on a Data Warehouse project.  We have been using the tool for a few weeks now and we are impressed.  It has been easier to learn than other Information Server products, and it manages to fit most of what you need inside frames on a single browser screen.  There have been very few bugs, and it has been in the hands of someone who doesn't know a lot about DataStage and they have been able to complete mappings and generate DataStage jobs.
I hope to get some screenshots up in the weeks to come but here are some observations in how we have saved time with FastTrack:
  1. FastTrack provides faster access to metadata.  In an Excel/Word mapping environment you need to copy and paste your metadata from a tool that can show it into your mapping document.  FastTrack can see metadata imported through any of the Information Server import methods: DataStage plugin imports; the import bridges from Meta Integration Technologies Inc (MITI) for ErWin, InfoSphere Data Architect, Cognos, Business Objects, Informatica, etc.; and imports via the database connectors from any other Information Server product, such as the table definitions imported and profiled by Information Analyser.  You can import an entire database in a few minutes and drag and drop it onto your various mappings.
  2. FastTrack lets you map columns from XML and Cobol Copybook hierarchies to flat file relational database targets without any metadata massaging.  In Excel you would spend days cutting and chopping XML and Cobol complex flat file definitions.  With FastTrack you can access a definition imported through the DataStage Cobol or XML importers and map away.
  3. FastTrack lets you do source to target mapping straight out of your modelling tool.  You can import your model straight into the Metadata Server via a bridge and start mapping it.  No mucking around with database DDLs and no need to get access to create database schemas.  This can be handy in the early days of a project.
  4. FastTrack has some great auto mapping functions.  There is a discovery function where you drag and drop the source or target table onto one side of the mapping and then use the Discover function to find candidate matches for the other side – then choose "Best Match" to take the first of the candidates.  If you choose multiple columns you can Discover and Best Match all the columns in your table.  It searches for matching column names against the candidate tables.
  5. FastTrack can map auto match on the business glossary terms attached to those columns.  It is one of the few products in the Information Server that makes productive use of the Business Glossary to speed things up.  Of course you need to create your Glossary and fill it with terms and map those terms to physical metadata first!  FastTrack lets you add Glossary terms to physical columns as you map.
  6. FastTrack lets you balance the mapping between business analysts and ETL developers.  Both can use the tool – it's an Excel style interface – but business analysts may be faster at mapping early in the project as they gather requirements and rules, with ETL developers picking it up later.  This can help avoid bottlenecks on your team if anyone can do mapping and can push the results straight into DataStage.
  7. FastTrack creates DataStage jobs.  These jobs have the source and target connectors already loaded and configured and stages such as lookup, join and transformer already built.  It even lets you add Transformer derivations such as macros, job parameters and functions from a function list.
  8. FastTrack handles DataStage naming conventions.  FastTrack holds a set of rules for naming DataStage stages and links that you can configure to match your naming convention.  Normal DataStage development means dragging and dropping stages and links onto a canvas and renaming every one.  FastTrack does the naming for you.
  9. FastTrack lets you add the links for joins and lookups.  I don’t know if you’ve tried to map joins and lookups in Excel but it’s not pretty – you have room to map the extra columns but there is no easy way to show the key fields that join the two sources together.  Generally you make a note of it under the mapping.  In FastTrack you choose the join/lookup table, choose the key fields that do the join and bring in the extra columns for mapping to the output table and it generates the required stages in DataStage.
  10. FastTrack shows progress of mapping tasks.  Once you have created a mapping for all interfaces FastTrack will produce a report showing how much of each mapping has been finished saving you the time of tracking progress manually.
What FastTrack can do better.
  1. Better bulk export and import functions – preferably XML and Excel.  Excel for when we produce documentation.  XML for when we want to back it up or move it between repositories.  (Or export, run a search and replace to globally rename a transform value and import it again).
  2. Global search and replace on transformation values, similar to the search and replace in the DataStage Transformer, for globally renaming things like function names and job parameter values.
  3. More DataStage stages – it currently lets you configure settings for Lookup, Join, Connectors and Transformers.  Would like to see Surrogate Key, Change Data Capture and Slowly Changing Dimension support – though it’s debatable whether those are business analyst functions for FastTrack or developer functions for DataStage.  It would be cool to define Type 1 Type 2 and key fields for dimension table mapping.
  4. Let you run Discover and Best Match on Business Glossary terms so you can find terms that suit the column name you are mapping.
  5. Discover transformation rules as well as mappings … oh hang on, that’s in the next release!
  6. Reverse engineer DataStage Server jobs so you can generate DataStage Enterprise jobs from a Server job mapping.
  7. More flexible licensing.  You buy licenses in packs of 10 – and that’s too many for a lot of customers!

Wednesday, November 30, 2011

IBM® InfoSphere™ Information Server 8.7

IBM® InfoSphere™ Information Server 8.7 helps you take delivery of trusted information to the next level with a more comprehensive integration platform. New InfoSphere Information Server capabilities help strengthen collaboration between business and IT professionals to make it easier for strategic initiatives, such as business analytics, master data management, data warehousing, and application consolidation and migration projects, to utilize and act on trusted information.
IBM InfoSphere Information Server 8.7 helps you turn insight into actionable results by:
Examples of the significant new integration capabilities delivered as part of InfoSphere Information Server 8.7 include: