I'm currently loading a 4000-column CSV into dask, doing some transformations and then converting it to Pandas.
By default all fields are loaded as float64, so a LOT of RAM is wasted and I've killed a few kernels while experimenting.
Still using Pandas <1.0 for compatibility.
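For reference, the loading step looks roughly like this. I know dask lets me pass a dtype mapping to read_csv, but with 4000 columns I can't write one by hand; the column names and dtypes below are just hypothetical placeholders:

```python
import dask.dataframe as dd
import numpy as np

# Hypothetical schema fragment: only columns whose types I already know.
dtypes = {
    "user_id": np.int32,    # assumption: ids fit in 32 bits
    "score": np.float32,    # assumption: float32 precision is enough here
}

ddf = dd.read_csv("data.csv", dtype=dtypes)  # other columns are still inferred (float64 for my data)
df = ddf.compute()                           # materialize as a plain pandas DataFrame
```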
The question is: without downcasting floats to integers (5 != 5.00), is there a way to detect the smallest safe type for each column in bulk, e.g. going from float64 to a smaller float like float16? Forcing everything from float to int8 takes the dataframe from ~50 GB to roughly 800 MB (yes, I'm destroying data there, it was just a test).
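To make it concrete, this is the kind of per-column detection I have in mind, written as a minimal sketch: it assumes an all-float64 pandas DataFrame called df and only accepts a smaller dtype when casting back to float64 reproduces every value exactly.

```python
import pandas as pd

def roundtrips(original, candidate):
    # True when the candidate column, cast back to float64, matches the
    # original exactly (NaN positions count as equal).
    back = candidate.astype("float64")
    return bool(((back == original) | (back.isna() & original.isna())).all())

def shrink_column(col):
    # Try integers first when there are no NaNs and every value is whole
    # (integer columns in pandas <1.0 cannot hold NaN).
    if not col.isna().any() and (col % 1 == 0).all():
        as_int = pd.to_numeric(col, downcast="integer")
        if roundtrips(col, as_int):
            return as_int
    # Otherwise try a smaller float, keeping it only if nothing changes.
    # Note: to_numeric's float downcast stops at float32, not float16.
    as_float = pd.to_numeric(col, downcast="float")
    if roundtrips(col, as_float):
        return as_float
    return col  # keep float64 when nothing smaller is lossless

# Rebuild column by column so each one keeps its own (possibly smaller) dtype.
df_small = pd.DataFrame({name: shrink_column(df[name]) for name in df.columns})
```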
Huge dataframes are common in Machine Learning, so I think there must be a standardized way to optimize their memory usage, but I've had no luck finding it... yet.
And after this conversion, is Parquet OK for persisting the dataframe while keeping the data types?
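Concretely, this is the round trip I'm hoping works (a minimal sketch; df_small is the downcast frame from the snippet above, the file name is made up, and it assumes pyarrow or fastparquet is installed):

```python
import pandas as pd

# Parquet stores each column's dtype in the file schema, so the shrunk
# types should survive a save/load round trip.
df_small.to_parquet("data_small.parquet", engine="pyarrow")

restored = pd.read_parquet("data_small.parquet", engine="pyarrow")
assert (restored.dtypes == df_small.dtypes).all()
```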
Thanks!!
float64 -> float16? I'm not sure exactly what your question is because you said you don't want to downcast, but your last sentence still mentions conversion. - tdy

apply/map some of the ideas here: determine-precision-and-scale-of-particular-number-in-python - tdy