Convert dask to pandas dataframe

Question

I have a quite similiar question to this one: Dask read_csv-- Mismatched dtypes found in `pd.read_csv`/`pd.read_table`

I am running the following script:

import pandas as pd
import dask.dataframe as dd
df2 = dd.read_csv("Path/*.csv", sep='\t', encoding='unicode_escape', sample=2500000)
df2 = df2.loc[~df2['Type'].isin(['STVKT','STKKT', 'STVK', 'STKK', 'STKET', 'STVET', 'STK', 'STKVT', 'STVVT', 'STV', 'STVZT', 'STVV', 'STKV', 'STVAT', 'STKAT', 'STKZT', 'STKAO', 'STKZE', 'STVAO', 'STVZE', 'STVT', 'STVNT'])]
df2 = df.compute()

And i get the following errror: ValueError: Mismatched dtypes found in pd.read_csv/pd.read_table.

How can I avoid that? I have over 32 columns, so i can't setup the dtypes upfront. As a hint it is also written: Specify dtype option on import or set low_memory=False

mdurant mdurant · Accepted Answer · 2020-04-15T16:17:41

When Dask loads your CSV, it tries to derive the dtypes from the header of the file, and then assumes that the rest of the parts of the files have the same dtypes for each column. Sine pandas types from csv depend on the set of values seen, this is where the error comes from.

To fix, you either have to explicitly tell dask what types to expect, or increase the size of the portion dask tries to guess types from (sample=).

The error message should have told you which columns were not matching and the types found, so you only need to specify those to get things working.

Convert dask to pandas dataframe

2 Answers