
I'm currently loading a 4000-column CSV into Dask, doing some transformations, and then converting it to Pandas.
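Roughly, the pipeline looks like this (a simplified sketch; the file name and the transformation step are just placeholders for my real code):

    import dask.dataframe as dd

    # simplified stand-in for my real pipeline (file name and the
    # transformation step are placeholders)
    ddf = dd.read_csv("big_file.csv")   # ~4000 columns, all ending up as float64
    ddf = ddf.fillna(0)                 # "some transformations"
    df = ddf.compute()                  # materialize as a pandas DataFrame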

By default all fields are loaded as float64, so a LOT of RAM is wasted and I've had some dead kernels while experimenting.

Still using Pandas <1.0 for compatibility.

The question is: without downcasting to integers (5 != 5.00), but maybe converting from float64 to a smaller float dtype, is there a way to automatically detect the lowest precision each column can take? Casting everything from float to int8 takes the dataframe from ~50 GB to ~800 MB approx (yes, I'm destroying data, it was just a test).
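For reference, here is a self-contained sketch of the int8 test and of the kind of per-column detection I'm imagining (toy data and made-up column names; as far as I can tell, pd.to_numeric with downcast="float" only goes down to float32, not float16):

    import numpy as np
    import pandas as pd

    # toy stand-in for the real 4000-column frame (column names are made up)
    df = pd.DataFrame(np.random.rand(1000, 50) * 90,
                      columns=["c%d" % i for i in range(50)])
    print(df.memory_usage(deep=True).sum())    # everything is float64

    # the blunt test I mentioned: cast everything to int8 -- tiny, but decimals are gone
    df_int8 = df.astype(np.int8)
    print(df_int8.memory_usage(deep=True).sum())

    # closer to what I'm after: a per-column lossless downcast that stays float
    # (pd.to_numeric with downcast="float" stops at float32)
    df_small = df.apply(lambda col: pd.to_numeric(col, downcast="float"))
    print(df_small.dtypes.value_counts())
    print(df_small.memory_usage(deep=True).sum())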

Huge dataframes are common in Machine Learning, so I think there must be a standardized way to optimize them, but no luck finding it... yet.

And after this conversion, is Parquet OK for persisting the dataframe while keeping the data types?
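Continuing from the sketch above, this is the round trip I have in mind (placeholder path; needs pyarrow or fastparquet installed):

    # persist the downcasted frame and check that dtypes survive the round trip
    df_small.to_parquet("data.parquet")
    restored = pd.read_parquet("data.parquet")
    print(restored.dtypes.equals(df_small.dtypes))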

Thanks!!

If there are nulls, it will be set to float, won't it? - Joe Ferndz
There can be nulls in some (not all) columns, but for columns with a max value of 90.00, float64 sounds like a waste of RAM. - Alejandro
What about float64 -> float16? I'm not sure exactly what your question is because you said you don't want to downcast, but your last sentence still mentions conversion. - tdy
You are right, in this scenario all data will be float. I mean don't change from float to int... yet; I just want to automatically determine the lowest precision I could set for each column. Question edited, thanks! - Alejandro
Maybe you can apply/map some of the ideas here: determine-precision-and-scale-of-particular-number-in-python - tdy