How do I read a partitioned Parquet file into R with arrow (without Spark)?
The situation
- Parquet files are created with a Spark pipeline and saved on S3
- They are read in RStudio/RShiny, with one column as an index, for further analysis
The parquet file structure
The Parquet output created by my Spark job consists of several part files:
tree component_mapping.parquet/
component_mapping.parquet/
├── _SUCCESS
├── part-00000-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00001-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00002-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00003-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00004-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── etc
How do I read this component_mapping.parquet into R?
What I tried
install.packages("arrow")
library(arrow)
my_df<-read_parquet("component_mapping.parquet")
but this fails with the error
IOError: Cannot open for reading: path 'component_mapping.parquet' is a directory
It works if I just read one file of the directory
install.packages("arrow")
library(arrow)
my_df<-read_parquet("component_mapping.parquet/part-00000-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet")
but I need to load all of the part files in order to query the data.
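Since reading a single part file works, one possible stopgap is to read every part file and stack the results. The following is only a sketch under assumptions: the helper name `read_parquet_dir` is made up for illustration, the directory is assumed to be available on a local path, and all parts are assumed to share the same columns. Everything is pulled into memory at once.

```r
library(arrow)

# Hypothetical helper: read every part file in a Spark output directory
# and stack the rows into one data frame.
read_parquet_dir <- function(dir) {
  # Keep only the *.parquet part files; marker files such as _SUCCESS are skipped
  part_files <- list.files(dir, pattern = "\\.parquet$", full.names = TRUE)

  # Read each part with read_parquet(), which already works for single files
  parts <- lapply(part_files, read_parquet)

  # Stack all parts row-wise (assumes identical columns in every part)
  do.call(rbind, parts)
}

# Usage, assuming the directory from the tree listing above:
# my_df <- read_parquet_dir("component_mapping.parquet")
```

This avoids any change to `read_parquet()` itself, but it does not scale well if the combined data does not fit in memory.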
What I found in the documentation
In the Apache Arrow documentation (https://arrow.apache.org/docs/r/reference/read_parquet.html and https://arrow.apache.org/docs/r/reference/ParquetReaderProperties.html) I found that there are some properties for the read_parquet() command, but I can't get them working and cannot find any examples.
read_parquet(file, col_select = NULL, as_data_frame = TRUE, props = ParquetReaderProperties$create(), ...)
How do I set the properties correctly to read the full directory?
# should it be one of these methods?
$read_dictionary(column_index)
or
$set_read_dictionary(column_index, read_dict)
Any help would be much appreciated.
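A possible direction, sketched here under assumptions: recent versions of the arrow package include a Dataset API, where `open_dataset()` takes a directory rather than a single file. As far as I can tell, the `read_dictionary` properties only control whether a column is read as dictionary-encoded, so they are probably unrelated to reading a directory. The helper name `read_parquet_dataset` is hypothetical, and the example assumes dplyr is installed and that the installed arrow build includes dataset support.

```r
library(arrow)
library(dplyr)

# Hypothetical helper: open a directory of Parquet part files as one
# arrow Dataset and materialise it as a data frame.
read_parquet_dataset <- function(dir) {
  # open_dataset() scans the directory; files whose names start with "_"
  # or "." (for example _SUCCESS) are ignored by default
  ds <- open_dataset(dir, format = "parquet")

  # collect() pulls the whole dataset into memory as a data frame
  collect(ds)
}

# Usage, assuming the directory from the tree listing above:
# my_df <- read_parquet_dataset("component_mapping.parquet")
```

If only part of the data is needed, dplyr verbs such as `filter()` and `select()` can be applied to the dataset before `collect()`, so that less data is loaded into R.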