How do I perform a left outer join using SPSS commands?

votes

Can SPSS commands (e.g., MERGE FILES) be used to perform a left outer join between 2 SPSS datasets? Assume that the join field is not unique in either dataset.

Example: Let the left Dataset1 contains 2 fields - ClassNbr and Fact1 - and these 4 records . . .

1 A
1 D
2 A
3 B

Let Dataset2 contains 2 fields - ClassNbr and Fact2 - and these 3 records . . .

1 XX
1 XY
3 ZZ

I want to join Dataset1 and Dataset2 on ClassNbr. The desired result is a 6 record dataset as follows:

1 A XX
1 A XY
1 D XX
1 D XY
2 A (NULL)
3 B ZZ

I would prefer a solution that uses SPSS commands (as opposed to SQL/Python/etc.).

spss

old question, but I find the answers too complicated. Can't you use merge as "outer partial join"? – chuse

2 Answers

votes

As far as I'm aware you can not do this directly. One potential way to do the workaround is to "reshape" the data from long format to wide format (using casestovars), do the merge, and then reshape back into long format (using varstocases). Below is a use example (if any clarification is needed on the code just ask).

data list free / ClassNbr (F1) Fact1 (A1).
begin data
1 A
1 D
2 A
3 B
end data.
dataset name data1.

casestovars 
/id = ClassNbr.

data list free / ClassNbr (F1) Fact2 (A2).
begin data
1 XX
1 XY
3 ZZ
end data.
dataset name data2.

casestovars 
/id = ClassNbr.

match files file = 'data1'
/file = 'data2'
/by ClassNbr.
execute.

varstocases
/make Fact1 FROM Fact1.1 to Fact1.2
/null = KEEP.
varstocases
/make Fact2 FROM Fact2.1 to Fact2.2
/null = KEEP.

This creates some cases that you do not want, here I have just defined a set of commands to identify those cases and take them out (I'm sure this could be improved to be more efficient).

*now cleaning up the extra records.
compute flag = 0.
if ClassNbr = lag(ClassNbr) and Fact1 = lag(Fact1) and Fact2 = lag(Fact2) flag = 1.
select if flag = 0.
execute.
if Fact1 = " " and Fact2 = " " flag = 1.
select if flag = 0.
execute.
if ClassNbr = lag(ClassNbr) and Fact1 = lag(Fact1) and Fact2 = " " flag = 1.
select if flag = 0.
execute.
if ClassNbr = lag(ClassNbr) and Fact2 = lag(Fact2) and Fact1 = " " flag = 1.
select if flag = 0.
execute.

I'm sure it would be possible to make this more robust (probably making some custom python functions). But hopefully this helps get you started.

votes

You can do this if you install the "STATS CARTPROD" extension bundle. With this extension you can create a cartesian product as an intermediate step to create an outer join.

Since SPSS 22 you can download it directly from the progam menu Extra->Extension Bundles->Install and Download extension bundles. You can also download and install it manualy from here: https://www.ibm.com/developerworks/community/files/app?lang=en#/file/d0afcd4e-6d5d-4779-84ef-2b68bc81b861 Note that you must have installed "Python Essentials for SPSS" in order to get it work.

*** create the example data.

DATA LIST FREE / classnbr1 (F1) fact1 (A1).
BEGIN DATA
1 A
1 D
2 A
3 B
END DATA.
DATASET NAME data1.

DATA LIST FREE / classnbr2 (F1) fact2 (A2).
BEGIN DATA
1 XX
1 XY
3 ZZ
END DATA.
DATASET NAME data2.

I ran into problems when using capital letters in variable names while using the "STATS CARTPROD" extension. It is also important that "classnbr" has different variable names in both datasets.

*** create cartesian product using the STATS CARTPROD extension.

DATASET ACTIVATE data1.

STATS CARTPROD INPUT2=data2
   VAR1=classnbr1 fact1 VAR2=classnbr2 fact2
/SAVE OUTFILE="C:\MY FOLDER\cardprod.sav" DSNAME = cart.
EXECUTE.

*** create an equi join.

SELECT IF classnbr1 = classnbr2.
EXECUTE.

DELETE VARIABLES classnbr2.

Now include the cases which have no match in data2.

*** create left outer join
* assuming both data sets are ordered by classnbr1 and fact1

ADD FILES 
   /FILE = cart
   /FILE = data1
   /BY classnbr1 fact1.
EXECUTE.
DATASET NAME outer_join.
DATASET ACTIVATE outer_join.

COMPUTE select=1.
IF (length(fact2)=0 AND classnbr1=LAG(classnbr1) AND fact1=LAG(fact1)) select=0.
EXECUTE.

SELECT IF select = 1.
EXECUTE.
DELETE VARIABLES select.

However you might get into some trouble when using very big data sets. In that case the cartesian product will be huge.

To alleviate this effect a bit, you can drop all the cases from the data sets wich don't have a corresponding match on the respective other data set, before producing the cartesian product.

This is, how it can be done:

*** create the example data.
*** (I added an additional case to the second data set, which will be deleted 
in the result, since it has no match in the first data set)

DATA LIST FREE / classnbr1 (F1) fact1 (A1).
BEGIN DATA
1 A
1 D
2 A
3 B
END DATA.
DATASET NAME data1.

DATA LIST FREE / classnbr2 (F1) fact2 (A2).
BEGIN DATA
1 XX
1 XY
3 ZZ
4 XY
END DATA.
DATASET NAME data2.


*** select cases who (don't) have a matching correspondent in the other dataset

** Create a list of unique key values of data set data2 
** (In this Example the key Value is classnbr2).

DATASET ACTIVATE data2.
DATASET COPY data2_keylist.
DATASET ACTIVATE data2_keylist.

* Assuming the data set is already sorted by the key value.
* Mark the first occurance of every key kalue in the data set.
COMPUTE list = 1.
IF classnbr2 = LAG(classnbr2) list = 0.
SELECT IF list=1.
EXECUTE.

* Delete all variables except the (now unique) key value

MATCH FILES
   /FILE *
   /KEEP classnbr2.
EXECUTE.

** Match the list of data2 key values to data1 in order to mark
** which cases of data1 have at least one correspondent case in data 2.
DATASET ACTIVATE data1.
MATCH FILES
   /FILE *
   /TABLE data2_keylist
      /RENAME classnbr2=classnbr1
      /IN data2
   /BY classnbr1.
EXECUTE.

** Remove cases from data1 who don't have a correspondent in data2
** and store them in another dataset, because we need to add them later.
DATASET COPY date1_nomatch.
SELECT IF data2=1.
EXECUTE.

DATASET ACTIVATE date1_nomatch.
SELECT IF data2=0.
EXECUTE.

** Now doing the same for the other data set.

** Create a list of unique key values of data set data1 
** (In this Example the key Value is classnbr1).

DATASET ACTIVATE data1.
DATASET COPY data1_keylist.
DATASET ACTIVATE data1_keylist.

* Assuming the data set is already sorted by the key value.
* Mark the first occurance of every key kalue in the data set.
COMPUTE list = 1.
IF classnbr1 = LAG(classnbr1) list = 0.
SELECT IF list=1.
EXECUTE.

* Delete all variables except the (now unique) key value

MATCH FILES
   /FILE *
   /KEEP classnbr1.
EXECUTE.

** Match the list of data2 key values to data1 in order to mark
** which cases of data1 have at least one correspondent case in data 2.
DATASET ACTIVATE data2.
MATCH FILES
   /FILE *
   /TABLE data1_keylist
      /RENAME classnbr1=classnbr2
      /IN data1
   /BY classnbr2.
EXECUTE.


** Remove cases from data1 who don't have a correspondent in data2.
SELECT IF data1=1.
EXECUTE.

*** create a cartesian product of the two reduced datasets.
DATASET ACTIVATE data1.

STATS CARTPROD INPUT2=data2
   VAR1=classnbr1 fact1 VAR2=classnbr2 fact2
/SAVE OUTFILE="C:\MY FOLDER\cardprod.sav" DSNAME = outer_join.
EXECUTE.

*** create an equi join.
SELECT IF classnbr1 = classnbr2.
EXECUTE.

DELETE VARIABLES classnbr2.


*** create left outer join by adding the cases from date1_nomatch.
DATASET ACTIVATE outer_join.
ADD FILES 
   /FILE = *
   /FILE = date1_nomatch
   /BY classnbr1 fact1
   /DROP data2.
EXECUTE.

* Some cleaning up.
DATASET CLOSE data1_keylist.
DATASET CLOSE date1_nomatch.
DATASET CLOSE data2_keylist.