Tuesday, February 1, 2011

Sort stage to remove duplicate

1) what is the advantage of using sort stage over remove duplicate stage in removing duplicates.
2) Is there any way in which we can specify which record to retain(like retaining the last record or retaining the first) when we remove duplicate using transform stage, similarly in sort stage also.

**************
1)The advantage of using sort stage over remove duplicate stage is that sort stage allows us to capture the duplicate records whereas remove duplicate stage does not.
2) Using a sort stage we can only retain the first record.
Normally we go for retaining last when we sort a particular field in ascending order and try to get the last rec. The same can be done using sort stage by sorting in descending order to retain the first record.
********
Yes indeed...
Use a sort stage with the option "Create Key Change Column" set to True.
Thil will create a column Keychange at the output of the sort as below.

Input
-----
col_a
------
1
1
1
2
2
3
4
5
6
6

output
-------
col_a Keychange
------------------
1 1
1 0
1 0
2 1
2 0
3 1
4 1
5 1
6 1
6 0

For the first occurance of a particular value, it assigns 1 and for subsequent occurances of the same value it assigns 0.
Now you can use a transformer with constraint Keychange=0 to capture duplicate records.

*********
Can u tell me what is create cluster key change value in sort stage ... if my requirement is that i have to remove duplicate, i dont want to capture the duplicate records. Is that fine to remove duplicate using input tab sort ,unique option of any stage .
******

the options spcd above for retaining duplicate values using sort stage will obviously work........but as far as I remember there is an option in the properties of the sort stage namely ::ALLOW DUPLICATES.If this property is set to true I think it will satisfy the requirement(provided that u r using datastage 8.x.



If yhe property "cluster key change column" is set to true it will eventually create a cluster key change column in the output.concentrate on the following example::

i/p_col cluster _key _change_col
1 1
1 0
1 0
2 1
2 0
3 1
1 1
2 1
6 1
6 0

the logic is that every i/p value will check the value immediete above of it.If it finds a match the o/p is 0 else 1.doesn't matter if it is appeared previously or not.


I think u should use a remove duplicate stage to filter out the duplicates.

3 comments:

  1. Hello submitter,
    Here is your mistake.
    key change and cluster key change do the exactly same, only difference cluster key change require sorted or sorted group data.
    That means sort key mode should not be sort for
    cluster key change.
    mail me if require informdaya@gmail.com

    ReplyDelete
  2. Hello sir,
    How to remove duplicats from file1 contants 1,1,2,2,3,4,5

    and output result is file2-1,1,2,2 and file3- 3,4,5

    ReplyDelete
    Replies
    1. Hi Dhanraj,

      Here
      seq--->copy----->agg(1)====>join(2)---->filter====>ds(1) & (2)

      in agg we take the count based on column and
      in join we just join the main data and agg data and
      in filter write condition like count=1 then data will move to ds(1),
      count>1 then data will move to ds(2).

      That's it.

      Delete