After watching the mind-blowing webinar at Rstudio conference here I was pumped enough to dump an entire SQL server table to parquet files. The result was 2886 files, (78 entities over 37 months) with around 700 millons rows in total.
Doing a basic select returned all rows in less than 15 seconds! (Just out of this world result!!) At the webinar Neal Richardson from Ursa Labs was showcasing the Ny-Taxi dataset with 2 billions rows under 4 seconds.
I felt it was time to do something more daring like basic mean, sd, mode over a year worth of data, but that took a minute per month, so I was sitting 12.4 minutes waiting for a reply from R.
What is the issue? My badly written R-query? or simply too many files or granularity (decimal values?)??
Any ideas??
PS: I did not want to put a Jira-case in apache-arrow board as I see google search does not retrieve answers from there.