I'm trying to unit-test a function that deals with csv files with Pytest. While my function works, I feel like there's a lot of code repetition when creating "sample" csv files in my project directory to test the function. The actual csv file that holds the real data has millions of records.
These are not the only csv files I have to test in my module, so it would be immensely helpful to know what's the best way to test functions that work with different file structures.
Right now I'm creating a very short csv file that mimics the actual file's schema, with a single line of data, plus the expected dataframe output after the file is processed through the function.
Perhaps mocking is the way to go? But I feel like mocking shouldn't be necessary for this kind of file-based testing.
Test Function
import csv
import os

import pandas as pd
import pytest
from pandas import testing


@pytest.mark.parametrize('test_file, expected', [
    (r'Path\To\Project\Output\Folder\mock_sales1.csv',
     pd.DataFrame([['A0A0A0', 1, 4000]], columns=['Postal_Code', 'Store_Num', 'Sales'])),
    (r'Path\To\Project\Output\Folder\mock_sales2.csv',
     pd.DataFrame([['A0A0A0', 1, 4000]], columns=['Postal_Code', 'Store_Num', 'Sales']))
])
def test_sales_dataframe(test_file, expected):
    # This part is repetitive: each test needs a separate file written within the test function.
    # Sample file to check that files with 7 columns are read correctly.
    mock_sales1 = [['Data0', 'A0A0A0', 1, 'Data3', 'Data4', 'Data5', 4000]]
    with open(r'Path\To\Project\Output\Folder\mock_sales1.csv', 'w', newline='') as file:
        writer = csv.writer(file)
        writer.writerows(mock_sales1)
    # Sample file to check that files with 8 columns are read correctly.
    mock_sales2 = [['Data0', 'A0A0A0', 1, 'Data3', 'Data4', 'Data5', 'Data6', 4000]]
    with open(r'Path\To\Project\Output\Folder\mock_sales2.csv', 'w', newline='') as file:
        writer = csv.writer(file)
        writer.writerows(mock_sales2)
    sales_df = mks_sales_dataframe(test_file)
    testing.assert_frame_equal(expected, sales_df)
    os.remove(r'Path\To\Project\Output\Folder\mock_sales1.csv')
    os.remove(r'Path\To\Project\Output\Folder\mock_sales2.csv')
Main Function
def mks_sales_dataframe(file):
    try:
        with open(file, 'r') as f:
            reader = csv.reader(f)
            num_cols = len(next(reader))
        # The number of columns varies between files; these indices pick out
        # columns 1, 2 and the last column. This is the part I'm testing!
        columns = [1, 2, (num_cols - 1)]
        sales_df = pd.read_csv(file, usecols=columns, names=['Postal_Code', 'Store_Num', 'Sales'])
        return sales_df
    except FileNotFoundError:
        raise FileNotFoundError(file)
The test passes as intended. However, for every test I have to create the sample csv files within the test function and delete each one once the test is finished. As you can imagine, that is a lot of repetitive code within a single test function, which feels quite clunky and wordy, especially when the test is parameterized.
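One way to cut the repetition (a sketch, not the only approach) is to let pytest's built-in tmp_path fixture create and clean up a temporary directory for each test, and to parametrize on the row contents instead of on hard-coded file paths. The function under test is copied from the question; the parameter names and sample rows are illustrative:

```python
import csv

import pandas as pd
import pytest
from pandas import testing


def mks_sales_dataframe(file):
    # Function under test (from the question): read columns 1, 2 and the
    # last column, whatever the total column count of the file is.
    with open(file, 'r') as f:
        num_cols = len(next(csv.reader(f)))
    return pd.read_csv(file, usecols=[1, 2, num_cols - 1],
                       names=['Postal_Code', 'Store_Num', 'Sales'])


@pytest.mark.parametrize('rows, expected', [
    # 7-column file: Sales is the last column.
    ([['Data0', 'A0A0A0', 1, 'Data3', 'Data4', 'Data5', 4000]],
     pd.DataFrame([['A0A0A0', 1, 4000]],
                  columns=['Postal_Code', 'Store_Num', 'Sales'])),
    # 8-column file: one extra column before Sales.
    ([['Data0', 'A0A0A0', 1, 'Data3', 'Data4', 'Data5', 'Data6', 4000]],
     pd.DataFrame([['A0A0A0', 1, 4000]],
                  columns=['Postal_Code', 'Store_Num', 'Sales'])),
])
def test_sales_dataframe(tmp_path, rows, expected):
    # tmp_path is a unique per-test directory managed by pytest:
    # no hard-coded paths and no manual os.remove() cleanup needed.
    test_file = tmp_path / 'mock_sales.csv'
    with open(test_file, 'w', newline='') as f:
        csv.writer(f).writerows(rows)
    testing.assert_frame_equal(expected, mks_sales_dataframe(test_file))
```

Because the file is written inside a per-test temporary directory, the two os.remove() calls disappear entirely, and adding a new column layout is just one more parametrize entry.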
Comments
In test_sales_dataframe, you first create the mock_sales1.csv and mock_sales2.csv files with a fixed content, then you call mks_sales_dataframe to read one of these .csv files, and then you check that the result equals expected? – pschill
You could store mock_sales1.csv and mock_sales2.csv (and all your other test data files) in a testdata folder next to your test code. Then your test becomes the two lines sales_df = mks_sales_dataframe(test_file) and testing.assert_frame_equal(expected, sales_df). – pschill