Datasets are the best choice for storing intermediate results. A Dataset preserves the partitioning and sort order if they have been set, which saves re-partitioning and re-sorting and makes the job more robust.
Performance of the job can be improved if:
1) Unnecessary columns are removed from the upstream and downstream links.
2) Removing these unnecessary columns helps reduce memory consumption.
3) Always specify the list of columns in the SELECT statement when reading from a database. This keeps unnecessary column data out of the job, which saves memory and network bandwidth.
4) Use RCP (Runtime Column Propagation) very carefully.
5) Understand the data types before using them in the job. Profile the data before bringing it into the job.
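In DataStage the SELECT is generated from the stage properties, but the principle of naming only the columns you need applies to any database reader. A minimal sketch using Python's built-in sqlite3 module (the table and column names are illustrative):

```python
import sqlite3

# In-memory database with an illustrative customers table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, email TEXT, notes TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [(1, "Ann", "ann@example.com", "long free-text notes..."),
     (2, "Bob", "bob@example.com", "more notes...")],
)

# Avoid SELECT *: name only the columns the job actually uses,
# so wide columns like `notes` never cross the network.
rows = conn.execute("SELECT id, name FROM customers").fetchall()
print(rows)  # each row carries 2 fields instead of 4
```

The same query with `SELECT *` would drag the wide `notes` column through every downstream link even if no stage ever reads it.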
Always perform the following checks before using a sort in the design:
1) Is the sort really needed?
2) What volume of data is going to be sorted?
3) Is the data read from a database and then sorted in the job? Could we instead sort the data in the database and bring it in already sorted?
4) What values are set in the system for the Sort stage?
Giving attention to these questions before applying a sort will help us create a more performant job.
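Question 3 above is often the biggest win: the database can return rows already sorted, especially when an index exists on the sort key, so the job needs no Sort stage at all. A hedged sketch with sqlite3 (the schema and index are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(3, 30.0), (1, 10.0), (2, 20.0)])

# An index on the sort key lets the database return sorted rows cheaply.
conn.execute("CREATE INDEX idx_orders_id ON orders (id)")

# Push the sort into the database with ORDER BY instead of
# sorting inside the job.
sorted_rows = conn.execute("SELECT id, amount FROM orders ORDER BY id").fetchall()
print(sorted_rows)
```

The job then reads pre-sorted data and only needs to declare that the input is already sorted, rather than paying for the sort itself.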
Parallelism is not always good:
Remember that parallelism is not always beneficial. You have to think about the design of the job and the configuration. The degree of parallelism is determined by the configuration file, which defines how many nodes are available. Increased parallelism helps distribute the work, but it also brings more overhead.
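The trade-off can be seen with a back-of-the-envelope model: each extra node adds fixed startup and coordination cost, while the per-row work shrinks. All the constants below are invented purely for illustration, not measured DataStage figures:

```python
def job_time(rows, nodes, per_row=0.001, startup=1.0, coord_per_node=0.5):
    # Toy cost model: fixed startup, plus coordination that grows
    # with the node count, plus each node's share of the row work
    # (nodes run concurrently, so row work divides by nodes).
    return startup + nodes * coord_per_node + (rows / nodes) * per_row

# Small data: adding nodes makes the job SLOWER (overhead dominates).
print(job_time(1_000, 1), job_time(1_000, 4))

# Large data: the same parallelism pays off.
print(job_time(100_000, 1), job_time(100_000, 4))
```

The model is crude, but it captures why the right degree of parallelism depends on data volume: past some point, more nodes only add overhead.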
To get the maximum performance from job:
To get the maximum performance from a job, start the job design with a smaller set of data and then increase the volume. The best-performing job comes from experimenting with the design, for example by trying different partitioning methods.
Point to remember while partitioning the data:
While partitioning the data, make sure the partitions hold roughly equal amounts of data. Unequal (skewed) partitions make the job less performant, because the largest partition determines the overall runtime.
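Skew typically appears when a few key values dominate the data: a hash partitioner sends every row with the same key to the same partition. A language-agnostic sketch of checking partition balance (the node count, hash choice, and key distribution are all illustrative):

```python
from collections import Counter
import zlib

NODES = 4  # degree of parallelism, as defined in the configuration file

def partition(key: str) -> int:
    # Stable hash of the key modulo the node count,
    # mimicking hash partitioning on a key column.
    return zlib.crc32(key.encode()) % NODES

# Illustrative key distribution: one value dominates, producing skew.
keys = ["US"] * 700 + ["DE"] * 150 + ["FR"] * 100 + ["JP"] * 50

counts = Counter(partition(k) for k in keys)
largest = max(counts.values())
print(counts)

# The biggest partition gates the job: compare it to the ideal
# even share of len(keys) / NODES rows per node.
print(f"skew factor: {largest / (len(keys) / NODES):.2f}")
```

A skew factor near 1.0 means well-balanced partitions; here the 700 identical "US" keys land on a single node, so that node does most of the work regardless of how many nodes exist.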