Duplicate removal from a table using Informatica

0

votes

I have a scenario to be implemented in informatica where I need to remove duplicate records from a table based on PK. But I need to keep the 1st occurrence of the PK values and remove the others(in case of duplicate PK).

For example, If my source has 1,1,1,2,3,3,4,5,4. I want to see my target data as 1,2,3,4,5. I have to read data from the same table and need to load into the same table., no new table can be introduced. please help me with your inputs.

Thanks in Advance!

etlinformaticainformatica-powercenter

1

votes

I suppose you want the first occurrence because there are other (data) columns in addition to the key you entered. Therefore you want

1,b
1,c
1,a
2,d
3,c
3,d
4,e
5,f
4,b

Turned into

1,b
2,d
3,c
4,e
5,f

??

In that case try this mapping layout:

SRC -> SQ -> SRT -> AGG -> TGT
      SEQ /

Where the sorter is set to sort on the KEY,sequence_port (desc) And the aggregator is set to group by the KEY, and the sequence_port should not go out of the sorter

Hope you can follow me :)

1

votes

There are multiple ways to do this, the simplest would be too do it in the SQL override.

Assuming the example quoted above, the SQL would be like this. General idea is to set a row number for a primary key ( so if you have 3 rows with same pk they will have 1,2,3 as row numbers before being reset for the next pk)

SQL:

select * from (
Select primary_key,column2 row_number() over (partition by primary_key order     by primary_key) as distinct_key) where distinct_key=1

Before:

1,b
1,c
1,a
2,d
3,c
3,d

Intermediate query:

1,c,1
1,a,2
2,d,1
3,c,1
3,d,2

output:

1,c
2,d
3,d

0

votes

I am able to achieve this by following the below steps.

1. Passing Sorted data(keys are EMP_ID, MOBILE, DEPTID) to an expression.
2. Creating the following variable ports in the expression and getting the counts.
V_CURR_EMP_ID = EMP_ID      
V_CURR_MOBILE = MOBILE          
V_CURR_DEPTID = DEPTID 
V_COUNT =                   
IIF(V_CURR_EMP_ID=V_PREV_EMP_ID AND V_CURR_MOBILE=V_PREV_MOBILE AND V_CURR_DEPTID=V_PREV_DEPTID ,V_COUNT+1,1)
V_PREV_EMP_ID  = EMP_ID 
V_PREV_MOBILE  = MOBILE
V_PREV_DEPTID  = DEPTID
O_COUNT =V_COUNT

3. In the next transformation which is filter, I am taking only the records which have count more than 1 and deleting them using update strategy(DD_DELETE).

Here is the mapping flow. SQ->SRTR->EXP->FIL->UPD->TGT

Also, when I tried to delete them using aggregator , it is deleting only the first occurrence of duplicates but not all.

Thanks again for your inputs!

Duplicate removal from a table using Informatica

3 Answers