Being a novice with SPSS, I am struggling to find duplicate cases based on a string variable in a dataset containing approximately 33,000 cases.

I have a variable named "nr" that is supposed to be a unique ID for every case. However, it turns out that some cases have been entered with two different values of "nr", the only difference being the last character. As a result, a single case shows up as two separate rows.

The structure of the variable "nr" is as follows: XX-XXXXXXX-X or X-XXXXXXX-X, i.e. 2-7-1 characters or 1-7-1 characters.

I would like to sort out all cases that have a "nr" equal to another case except for the last character.

To illustrate, with successful syntax I would hopefully be able to pick cases like these out of the whole dataset:

20-4026988-2
20-4026988-3

5-4026992-5
5-4026992-8

20-4027281-2
20-4027281-3

Does anyone have an idea how to write syntax for this? I would be very grateful for any input!

2 Answers

0
votes

I suggest creating a new variable without that last character, and then looking for the duplicates:

* first creating some sample data to play with.    
data list list/ID (a15).
begin data.
20-4026988-2
12-2345678-7
20-4026988-3
5-4026992-5
5-4026992-8
12-1234567-1
20-4027281-2
6-1234567-1
20-4027281-3
end data.

* now creating the new variable and counting the occurrences of each shortened ID.
string ShortID (a15).
compute ShortID=char.substr(ID,1,char.rindex(ID,"-")).
* also possible: compute ShortID=char.substr(ID,1,char.length(rtrim(ID))-1).
aggregate outfile=* mode=addvariables /break=ShortID /occurrences=n.

* at this point you can filter based on the number of `occurrences` or sort by it.
sort cases by occurrences (d) ShortID.
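
To show only the suspected duplicates, the filtering step mentioned above could look like the following minimal sketch. It assumes the `ID`, `ShortID` and `occurrences` variables created by the syntax above; `DupFlag` is a hypothetical name introduced here for illustration. In the real dataset the shortened ID would be built from `nr` rather than `ID`.

```spss
* flag cases whose shortened ID occurs more than once (DupFlag is a hypothetical name).
compute DupFlag = (occurrences > 1).
* temporarily hide all other cases without deleting them.
filter by DupFlag.
list variables=ID ShortID occurrences.
* switch the filter off again to get all cases back.
filter off.
```

With the sample data above, this lists the six cases that make up the three duplicated pairs. If the other cases are not needed at all, `select if occurrences > 1.` would drop them permanently instead of just filtering them.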
0
votes

After removing the last character, you can use Data > Identify Duplicate Cases to find the duplicates. It has a number of useful options for this.
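
For those who prefer syntax to the menu, a rough equivalent can be sketched with `MATCH FILES`. This is a simplified approximation and not necessarily what the dialog itself pastes. It assumes the shortened ID is already stored in a variable called `ShortID`; `PrimaryFirst`, `PrimaryLast` and `InDupGroup` are names chosen here for illustration.

```spss
* cases must be sorted by the matching variable first.
sort cases by ShortID.
* mark the first and the last case within each group of equal ShortID values.
match files /file=* /by ShortID /first=PrimaryFirst /last=PrimaryLast.
* a case that is both first and last in its group is unique; all others have a twin.
compute InDupGroup = (PrimaryFirst = 0 or PrimaryLast = 0).
execute.
```

Cases with `InDupGroup` equal to 1 belong to a group of two or more rows that share the same shortened ID.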