I am comparing two datasets to look for duplicate entries on certain columns.
I have done this first in SAS using the PROC SQL command as below(what I consider the true outcome) using the following query:
proc sql;
CREATE TABLE t1 AS
SELECT a.*, b.*
FROM
local AS a INNER JOIN neighbor AS b
ON a.surname1 = b.surname2
AND a.yob1 = b.yob2
AND a.cob1 = b.cob2;
quit;
I output this result to csv giving output_sas.csv
I have also done this in Python using SQLite3 using the same query:
conn = sqlite3.connect(file_path + db_name)
cur = conn.cursor()
cur.execute("""
CREATE TABLE t1 AS
SELECT a.*, b.*
FROM
local AS a INNER JOIN neighbor AS b
ON a.surname1 = b.surname2
AND a.yob1 = b.yob2
AND a.cob1 = b.cob2
""")
I output this to csv giving output_python.csv.
The problem:
The outputs should be the same but they are not:
output_sas.csv contains 123 more records than output_python.csv.
Within the SAS output file, there are 123 records that contain blank space "" within the yob1 and yob2 columns i.e. as an example, The 123 records in the sas_data.csv look like this sample:
yob1 yob2 cob1 cob2 surname1 surname2
"" "" 1 1 xx xx
"" "" 2 2 yy yy
.
.
.
# Continues for 123 records
I find that this difference is due to the yob1 and yob2 columns, which, in the above 123 records contains blank space. These 123 record pairs are missing from the output_python.csv file.
[Note: In this work, a string of length zero corresponds to a missing value]
In short:
The PROC SQL routine in SAS is evaluating blank space as equal i.e. "" == "" -> TRUE.
The Python SQLite code appears to be doing opposite i.e. "" == "" ->
FALSE
This is happening even though "" == "" -> True in Python.
The question:
Why is this the case and what do I need to change to match up the SQLite output to the PROC SQL output?
Note: Both routines are using the same input datasets. They are entirely equal, and I even manually amend the Python code to ensure that the columns yob1 and yob2 contain "" for missing values.
Update 1:
At the moment my SAS PROC SQL code works on uses data1.sas7bdat, named local and data2.sas7bdat, named neighbor.
To use the same dataset in Python, in SAS, I export these to csv and read in to Python.
If I do:
import pandas as pd
# read in
dflocal = pd.read_csv(csv_path_local, index_col=False)
dfneighbor = pd.read_csv(csv_path_neighbor, index_col=False)
Pandas converts missing values to nan. We can use isnull() to find the number of nan values in each of the columns:
# find null / nan values in yob1 and yob2 in each dataset
len(dflocal.loc[dflocal.yob1.isnull()])
78
len(dfneighbor.loc[dfneighbor.yob2.isnull()])
184
To solve the null value problem, I then explicitly convert nan to a string of length zero "" by running:
dflocal['yob1'].fillna(value="", axis=0, inplace=True)
dfneighbor['yob2'].fillna(value="", axis=0, inplace=True)
We can test if the values got updated by testing a known nan:
dflocal.iloc[393].yob1
`""`
type(dflocal.iloc[393].yob1)
str
So they are a string of length 0.
Then read these into SQL via:
dflocal.to_sql('local', con=conn, flavor='sqlite', if_exists='replace', index=False)
dfneighbor.to_sql('neighbor', con=conn, flavor='sqlite', if_exists='replace', index=False)
Then execute the same SQLite3 code:
conn = sqlite3.connect(file_path + db_name)
cur = conn.cursor()
cur.execute("""
CREATE TABLE t1 AS
SELECT a.*, b.*
FROM
local AS a INNER JOIN neighbor AS b
ON a.surname1 = b.surname2
AND a.yob1 = b.yob2
AND a.cob1 = b.cob2
""")
Even though I have made this explicit change I STILL get the same missing 123 values, even though, the null values have been changed to a string of length zero "".
Potential Solution:
However, if I instead import the dataset with the na_filter=False argument , this does the conversion from null to "" for me.
dflocal = pd.read_csv(csv_path_local, index_col=False, na_filter=False)
dfneighbor = pd.read_csv(csv_path_neighbor, index_col=False, na_filter=False")
# find null / nan values in yob1 and yob2 in each dataset
len(dflocal.loc[dflocal.yob1.isnull()])
0
len(dfneighbor.loc[dfneighbor.yob2.isnull()])
0
When I import these datasets to my database and run this through the same SQL code:
conn = sqlite3.connect(file_path + db_name)
cur = conn.cursor()
cur.execute("""
CREATE TABLE t1 AS
SELECT a.*, b.*
FROM
local AS a INNER JOIN neighbor AS b
ON a.surname1 = b.surname2
AND a.yob1 = b.yob2
AND a.cob1 = b.cob2
""")
HOORAY I GET THE SAME OUTPUT AS THE SAS CODE!
But Why does the first solution not work? I'm doing the same thing in both cases (the first doing it manually with fill_na, and the second using na_filter=False).
"". - Chuck""vs what was being used in the SAS PROC SQL code. They were indeed a string of length zero (and the same as the SAS data file, but for some reason when they get passed into the SQLite code it's a case of "Uh Wrong Answer Geoff". - Chuck