Difference between SQLite and PROC SQL when evaluating blank space

Question

I am comparing two datasets to look for duplicate entries on certain columns.

I have done this first in SAS using the PROC SQL command as below(what I consider the true outcome) using the following query:

proc sql;
CREATE TABLE t1 AS
SELECT a.*, b.*
FROM
    local AS a INNER JOIN neighbor AS b
    ON a.surname1 = b.surname2
    AND a.yob1 = b.yob2
    AND a.cob1 = b.cob2;
quit;

I output this result to csv giving output_sas.csv

I have also done this in Python using SQLite3 using the same query:

conn = sqlite3.connect(file_path + db_name)
cur = conn.cursor()
cur.execute("""
CREATE TABLE t1 AS
SELECT a.*, b.*
FROM
    local AS a INNER JOIN neighbor AS b
    ON a.surname1 = b.surname2
    AND a.yob1 = b.yob2
    AND a.cob1 = b.cob2
""")

I output this to csv giving output_python.csv.

The problem:

The outputs should be the same but they are not:

output_sas.csv contains 123 more records than output_python.csv.

Within the SAS output file, there are 123 records that contain blank space "" within the yob1 and yob2 columns i.e. as an example, The 123 records in the sas_data.csv look like this sample:

yob1 yob2 cob1 cob2 surname1 surname2
""     ""     1    1    xx       xx
""     ""     2    2    yy       yy
.
.
.
# Continues for 123 records

I find that this difference is due to the yob1 and yob2 columns, which, in the above 123 records contains blank space. These 123 record pairs are missing from the output_python.csv file.

[Note: In this work, a string of length zero corresponds to a missing value]

In short:

The PROC SQL routine in SAS is evaluating blank space as equal i.e. "" == "" -> TRUE.

The Python SQLite code appears to be doing opposite i.e. "" == "" -> FALSE

This is happening even though "" == "" -> True in Python.

The question:

Why is this the case and what do I need to change to match up the SQLite output to the PROC SQL output?

Note: Both routines are using the same input datasets. They are entirely equal, and I even manually amend the Python code to ensure that the columns yob1 and yob2 contain "" for missing values.

Update 1:

At the moment my SAS PROC SQL code works on uses data1.sas7bdat, named local and data2.sas7bdat, named neighbor.

To use the same dataset in Python, in SAS, I export these to csv and read in to Python.

If I do:

import pandas as pd
# read in
dflocal     = pd.read_csv(csv_path_local, index_col=False)
dfneighbor  = pd.read_csv(csv_path_neighbor, index_col=False)

Pandas converts missing values to nan. We can use isnull() to find the number of nan values in each of the columns:

# find null / nan values in yob1 and yob2 in each dataset
len(dflocal.loc[dflocal.yob1.isnull()])
78
len(dfneighbor.loc[dfneighbor.yob2.isnull()])
184

To solve the null value problem, I then explicitly convert nan to a string of length zero "" by running:

dflocal['yob1'].fillna(value="", axis=0, inplace=True)
dfneighbor['yob2'].fillna(value="", axis=0, inplace=True)

We can test if the values got updated by testing a known nan:

dflocal.iloc[393].yob1
`""`
type(dflocal.iloc[393].yob1)
str

So they are a string of length 0.

Then read these into SQL via:

dflocal.to_sql('local', con=conn, flavor='sqlite', if_exists='replace', index=False)
dfneighbor.to_sql('neighbor', con=conn, flavor='sqlite', if_exists='replace', index=False)

Then execute the same SQLite3 code:

conn = sqlite3.connect(file_path + db_name)
cur = conn.cursor()
cur.execute("""
CREATE TABLE t1 AS
SELECT a.*, b.*
FROM
    local AS a INNER JOIN neighbor AS b
    ON a.surname1 = b.surname2
    AND a.yob1 = b.yob2
    AND a.cob1 = b.cob2
""")

Even though I have made this explicit change I STILL get the same missing 123 values, even though, the null values have been changed to a string of length zero "".

Potential Solution:

However, if I instead import the dataset with the na_filter=False argument , this does the conversion from null to "" for me.

dflocal     = pd.read_csv(csv_path_local, index_col=False, na_filter=False)
dfneighbor  = pd.read_csv(csv_path_neighbor, index_col=False, na_filter=False")

# find null / nan values in yob1 and yob2 in each dataset

len(dflocal.loc[dflocal.yob1.isnull()])
0
len(dfneighbor.loc[dfneighbor.yob2.isnull()])
0

When I import these datasets to my database and run this through the same SQL code:

conn = sqlite3.connect(file_path + db_name)
cur = conn.cursor()
cur.execute("""
CREATE TABLE t1 AS
SELECT a.*, b.*
FROM
    local AS a INNER JOIN neighbor AS b
    ON a.surname1 = b.surname2
    AND a.yob1 = b.yob2
    AND a.cob1 = b.cob2
""")

HOORAY I GET THE SAME OUTPUT AS THE SAS CODE!

But Why does the first solution not work? I'm doing the same thing in both cases (the first doing it manually with fill_na, and the second using na_filter=False).

Regarding your last statement "I'm doing the same thing in both cases", it seems pretty clear that this is not the case... — user2877959
@user2877959 So I absolutely agree with you. Which is why I went in and manually checked the value and type of all cells which I had changed to "" vs what was being used in the SAS PROC SQL code. They were indeed a string of length zero (and the same as the SAS data file, but for some reason when they get passed into the SQLite code it's a case of "Uh Wrong Answer Geoff". — Chuck

user2877959 user2877959 · Accepted Answer · 2017-08-09T11:29:23

In SAS, there isn't really a concept of null values for characters. It is more of an empty string. However, in most SQL implementations (including SQlite, I assume), a null value and an empty string will be different.

A blank value in SAS is indeed evaluated as "" = "" which is true

In your average DBMS however, what you would call 'blank values' are often null values, not empty strings (""). And null=null is not true. You cannot compare null values with anything, including null values.

What you could do is change your SQlite to

CREATE TABLE t1 AS
SELECT a.*, b.*
FROM
    local AS a INNER JOIN neighbor AS b
    ON a.surname1 = b.surname2
    AND coalesce(a.yob1,'') = coalesce(b.yob2,'')
    AND a.cob1 = b.cob2

The coalesce function will replace yob with an empty string when yob is null.

Be aware however that, if yob1 is null and yob2 actually is an empty string, adding those coalesce functions will change what would have been a null='' condition, which is not true, to a ''='' which is true. If that is not what you'd want, you could also just write it like this:

CREATE TABLE t1 AS
SELECT a.*, b.*
FROM
    local AS a INNER JOIN neighbor AS b
    ON a.surname1 = b.surname2
    AND (a.yob1 = b.yob2
        OR (a.yob1 is null AND a.yob2 is null)
        )
    AND a.cob1 = b.cob2