2
votes

I'm trying to unit-test a function that deals with csv files using pytest. While my function works, I feel like there's a lot of code repetition when creating "sample" csv files in my project directory to test the function. The actual csv file that holds the real data has millions of records.

These are not the only csv files I have to test in my module, so it would be immensely helpful to know the best way to test functions that work with different file structures.

Right now I'm creating a very short csv file that mimics the actual file's schema, with a single line of data, plus the expected dataframe output after the file is processed by the function.

Perhaps mocking is the way to go? But I feel like you shouldn't need to mock for this kind of testing.

Test Function

@pytest.mark.parametrize('test_file, expected', [
    (r'Path\To\Project\Output\Folder\mock_sales1.csv',
     pd.DataFrame([['A0A0A0', 1, 4000]], columns=['Postal_Code', 'Store_Num', 'Sales'])),
    (r'Path\To\Project\Output\Folder\mock_sales2.csv',
     pd.DataFrame([['A0A0A0', 1, 4000]], columns=['Postal_Code', 'Store_Num', 'Sales']))
])
def test_sales_dataframe(test_file, expected):
    # This part is repetitive; different tests each need a separate file written within the test function.
    # Writing sample file to test that files with 7 columns are read correctly.
    mock_sales1 = [['Data0', 'A0A0A0', 1, 'Data3', 'Data4', 'Data5', 4000]]
    with open(r'Path\To\Project\Output\Folder\mock_sales1.csv', 'w', newline='') as file:
        writer = csv.writer(file)
        writer.writerows(mock_sales1)
    # Writing sample file to test that files with 8 columns are read correctly.
    mock_sales2 = [['Data0', 'A0A0A0', 1, 'Data3', 'Data4', 'Data5', 'Data6', 4000]]
    with open(r'Path\To\Project\Output\Folder\mock_sales2.csv', 'w', newline='') as file:
        writer = csv.writer(file)
        writer.writerows(mock_sales2)

    sales_df = mks_sales_dataframe(test_file)
    testing.assert_frame_equal(expected, sales_df)

    os.remove(r'Path\To\Project\Output\Folder\mock_sales1.csv')
    os.remove(r'Path\To\Project\Output\Folder\mock_sales2.csv')

Main Function

def mks_sales_dataframe(file):
    try:
        with open(file, 'r') as f:
            reader = csv.reader(f)
            num_cols = len(next(reader))
            # The number of columns varies between files; this picks out which
            # columns should be read below. This is the part I'm testing!
            columns = [1, 2, (num_cols - 1)]

        sales_df = pd.read_csv(file, usecols=columns, names=['Postal_Code', 'Store_Num', 'Sales'])
        return sales_df
    except FileNotFoundError:
        raise FileNotFoundError(file)

The test passes as intended. However, for every different test I have to create a sample csv file within the test function and delete each file once the test is finished. As you can imagine, that's a lot of repetitive code within a single test function, which feels quite clunky and wordy, especially when the test is parametrized.

Do I understand this correctly? In test_sales_dataframe, you first create the mock_sales1.csv and mock_sales2.csv files with fixed content, then you call mks_sales_dataframe to read one of these .csv files, and then you check that the result equals expected? – pschill
You could just store mock_sales1.csv and mock_sales2.csv (and all your other test data files) in a testdata folder next to your test code. Then your test becomes the two lines sales_df = mks_sales_dataframe(test_file) and testing.assert_frame_equal(expected, sales_df). – pschill
@pschill Your first comment is correct. – ShockDoctor
@pschill That's what I was doing originally, but I thought that deleting the files after each usage reduced the clutter. Ultimately, there are going to be more and more files (there are many different sources of data I have to read and test), so I felt that having a directory that permanently stored test files was unnecessary UNTIL the tests are actually run. Basically, is there a way I could create and delete the test files separately from the tests themselves? I tried to use a fixture but it didn't really work (and didn't make much sense). Just a regular function, maybe? – ShockDoctor
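
For what it's worth, a minimal sketch of the fixture idea from the last comment, assuming pytest's built-in tmp_path_factory fixture; the sample_csv_dir name and the sample rows are illustrative, mirroring the 7- and 8-column files from the question:

import csv

import pytest

SAMPLE_FILES = {
    'mock_sales1.csv': [['Data0', 'A0A0A0', 1, 'Data3', 'Data4', 'Data5', 4000]],
    'mock_sales2.csv': [['Data0', 'A0A0A0', 1, 'Data3', 'Data4', 'Data5', 'Data6', 4000]],
}

@pytest.fixture(scope='module')
def sample_csv_dir(tmp_path_factory):
    # Create the sample files once per test module in a pytest-managed temp directory.
    directory = tmp_path_factory.mktemp('sample_csv')
    for name, rows in SAMPLE_FILES.items():
        with open(directory / name, 'w', newline='') as f:
            csv.writer(f).writerows(rows)
    return directory  # pytest removes its own temp directories, so no manual cleanup

A test would then request sample_csv_dir and build its path from it, e.g. mks_sales_dataframe(sample_csv_dir / 'mock_sales1.csv'), so no per-test file writing or os.remove is needed.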

2 Answers

4
votes

I think the problem is that your test input and expected output are strongly tied but live in two different places: one in the parameters and the other in the test body.
If you change one parameter, you'll need to change the body of your test, which isn't right in my opinion, on top of the duplicated code.

I think you should parametrize the test with (test_data, expected) and write the input to a temporary file.
Then you call your function and compare the expected and actual output.

@pytest.mark.parametrize('test_data, expected', [
    ([['Data0', 'A0A0A0', 1, 'Data3', 'Data4', 'Data5', 4000]],
      pd.DataFrame([['A0A0A0', 1, 4000]], columns=['Postal_Code', 'Store_Num', 'Sales'])),
    ([['Data0', 'A0A0A0', 1, 'Data3', 'Data4', 'Data5', 'Data6', 4000]],
      pd.DataFrame([['A0A0A0', 1, 4000]], columns=['Postal_Code', 'Store_Num', 'Sales']))
])
def test_sales_dataframe(test_data, expected):

    # Write your test data in a temporary file
    tmp_file = r'Path\To\Project\Output\Folder\tmp.csv'
    with open(tmp_file, 'w', newline='') as file:
        writer = csv.writer(file)
        writer.writerows(test_data)

    # Process the data
    sales_df = mks_sales_dataframe(tmp_file)

    # Compare expected and actual output
    testing.assert_frame_equal(expected, sales_df)

    # Clean the temporary file
    os.remove(tmp_file)
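
A possible variation, assuming pytest's built-in tmp_path fixture (and the same imports and mks_sales_dataframe function as in the question), avoids both the hard-coded path and the manual cleanup:

@pytest.mark.parametrize('test_data, expected', [
    ([['Data0', 'A0A0A0', 1, 'Data3', 'Data4', 'Data5', 4000]],
     pd.DataFrame([['A0A0A0', 1, 4000]], columns=['Postal_Code', 'Store_Num', 'Sales'])),
])
def test_sales_dataframe(test_data, expected, tmp_path):
    # tmp_path is a fresh per-test temporary directory managed by pytest.
    tmp_file = tmp_path / 'tmp.csv'
    with open(tmp_file, 'w', newline='') as file:
        csv.writer(file).writerows(test_data)

    sales_df = mks_sales_dataframe(tmp_file)
    testing.assert_frame_equal(expected, sales_df)
    # No os.remove needed: pytest discards the directory after the run.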

You could also create the .csv files up front and add them as test resources, but then your input and your expected output live in different places, which is not that great.

1
votes

One way to reduce some of the repetition is to use the setUp and tearDown methods of a unittest.TestCase:

import os
import csv
import unittest

test_file = 'test.csv'
rows = [
    ['0a', '0b', '0c'],
    ['1a', '1b', '1c'],
]


class TestCsv(unittest.TestCase):

    def setUp(self):
        with open(test_file, 'w', newline='') as csv_file:
            writer = csv.writer(csv_file, dialect='excel')
            writer.writerows(rows)

    def tearDown(self):
        os.remove(test_file)

    def test_read_line(self):
        with open(test_file, 'r') as csv_file:
            reader = csv.reader(csv_file, dialect='excel')
            self.assertEqual(next(reader), rows[0])
            self.assertEqual(next(reader), rows[1])


if __name__ == "__main__":
    unittest.main()
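
If you want to stay with pytest rather than unittest, the same setup/teardown pattern can be expressed as a yield fixture; a minimal sketch reusing test_file and rows from the snippet above:

import pytest

@pytest.fixture
def csv_file():
    # Equivalent of setUp: write the file before the test runs.
    with open(test_file, 'w', newline='') as f:
        csv.writer(f, dialect='excel').writerows(rows)
    yield test_file
    # Equivalent of tearDown: remove the file after the test finishes.
    os.remove(test_file)


def test_read_line(csv_file):
    with open(csv_file, newline='') as f:
        reader = csv.reader(f, dialect='excel')
        assert next(reader) == rows[0]
        assert next(reader) == rows[1]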