
After watching the mind-blowing webinar at the RStudio conference, I was pumped enough to dump an entire SQL Server table to parquet files. The result was 2,886 files (78 entities over 37 months) with around 700 million rows in total.
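For reference, this is roughly how the dump can be opened with the arrow package; the directory path and the partition names (entity, month) are made up for illustration, since the real layout isn't shown here:

```r
library(arrow)
library(dplyr)

# Point arrow at the directory of parquet files, e.g.
# C:/data/parquet/A01/2020-01/part-0.parquet
# The partition columns are read from the directory structure,
# and no row data is loaded at this point.
ds <- open_dataset("C:/data/parquet", partitioning = c("entity", "month"))

ds  # prints the number of files and the combined schema
```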

A basic select returned all rows in less than 15 seconds (just an out-of-this-world result!). At the webinar, Neal Richardson from Ursa Labs was showcasing the NYC Taxi dataset, with 2 billion rows, in under 4 seconds.
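A basic select along these lines (the entity and month values are placeholders; the real query isn't shown here):

```r
# Pull one entity/month slice into an R data frame. filter() is pushed down to
# Arrow, so only the matching files are read before collect().
one_slice <- ds %>%
  filter(entity == "A01", month == "2020-01") %>%
  collect()
```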

I felt it was time to do something more daring, like a basic mean, sd, and mode over a year's worth of data, but that took about a minute per month, so I sat waiting 12.4 minutes for a reply from R.
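For illustration, a query of that shape might look like the following; the column names (entity, month, amount) are placeholders rather than the real schema, and collect() is the step that pulls the whole year into R:

```r
# Materialise a full year in R, then aggregate -- this is the expensive path,
# because every row (and every decimal value) has to be converted to R types.
year_of_data <- ds %>%
  filter(month >= "2020-01", month <= "2020-12") %>%
  collect()

year_of_data %>%
  group_by(entity) %>%
  summarise(mean_amount = mean(amount),
            sd_amount   = sd(amount))
```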


What is the issue? My badly written R query, simply too many files, or the granularity (decimal values)?

Any ideas?

PS: I did not want to open a Jira case on the Apache Arrow board, as I see that Google search does not retrieve answers from there.


1 Answer


My guess (without actually looking at the data or profiling the query) is that it comes down to two things:

  1. You're right: the decimal type is going to require some work to convert to an R type, because R doesn't have a decimal type, so that will be slower than just reading in an int32 or float64 column.
  2. You're still reading ~350 million rows of data into your R session, and that's going to take some time. In the example query in the arrow package vignette, more data is filtered out before it's pulled into R (and the filtering is very fast); a sketch of that pattern follows below.
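A minimal sketch of that pattern, with placeholder column names (entity, month, amount) standing in for your real schema: filter() and select() are evaluated by Arrow before anything is converted to R, so only the needed subset crosses into your R session.

```r
library(arrow)
library(dplyr)

# Same hypothetical layout as in the question: parquet files partitioned
# by entity and month.
ds <- open_dataset("C:/data/parquet", partitioning = c("entity", "month"))

ds %>%
  filter(entity == "A01", month >= "2020-01", month <= "2020-12") %>%
  select(entity, amount) %>%   # skip columns you don't need (e.g. unused decimal columns)
  collect() %>%                # only the filtered subset is converted to R types
  group_by(entity) %>%
  summarise(mean_amount = mean(amount),
            sd_amount   = sd(amount))
```

The less data that survives the filter() and select(), the less decimal-to-double conversion work there is, and the faster the collect() step runs.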