10 votes

How do I read a partitioned parquet file into R with arrow (without any Spark)

The situation

  1. Parquet files are created by a Spark pipeline and saved on S3
  2. They are read with RStudio/RShiny, with one column used as an index for further analysis

The parquet file structure

The parquet output created by my Spark job consists of several parts:

tree component_mapping.parquet/
component_mapping.parquet/
├── _SUCCESS
├── part-00000-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00001-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00002-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00003-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00004-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── etc

How do I read this component_mapping.parquet into R?

What I tried

install.packages("arrow")
library(arrow)
my_df <- read_parquet("component_mapping.parquet")

but this fails with the error

IOError: Cannot open for reading: path 'component_mapping.parquet' is a directory

It works if I read just one file from the directory

install.packages("arrow")
library(arrow)
my_df<-read_parquet("component_mapping.parquet/part-00000-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet")

but I need to load all of the parts in order to query across them.

What I found in the documentation

In the Apache Arrow documentation https://arrow.apache.org/docs/r/reference/read_parquet.html and https://arrow.apache.org/docs/r/reference/ParquetReaderProperties.html I found that there are some properties for the read_parquet() command, but I can't get them working and cannot find any examples.

read_parquet(file, col_select = NULL, as_data_frame = TRUE, props = ParquetReaderProperties$create(), ...)
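
From that signature, the closest thing I can see is col_select, which seems to work on a single part file like this (the column name below is just a placeholder), but that still reads only one part:

library(arrow)

# hedged sketch: col_select takes a character vector (or tidy selection) of columns to keep,
# but it still applies to a single file, not the whole directory
one_part <- read_parquet(
  "component_mapping.parquet/part-00000-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet",
  col_select = c("component_id")  # "component_id" is a hypothetical column name
)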

How do I set the properties correctly to read the full directory?

# presumably one of these methods
$read_dictionary(column_index)
# or
$set_read_dictionary(column_index, read_dict)

Help would be much appreciated.


4 Answers

9 votes

Solution for: Read partitioned parquet files from local file system into R dataframe with arrow

As I would like to avoid using any Spark or Python on the RShiny server, I can't use other libraries such as sparklyr, SparkR, or reticulate with dplyr, as described e.g. in How do I read a Parquet in R and convert it to an R DataFrame?

I have now solved my task with your proposal, using arrow together with lapply and rbindlist:

my_df <- data.table::rbindlist(
  lapply(Sys.glob("component_mapping.parquet/part-*.parquet"), arrow::read_parquet)
)
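
Since I need one column as an index for further analysis, keying the resulting data.table on that column is a possible follow-up (the column name component_id is only a placeholder):

library(data.table)

# set a key on the (hypothetical) index column for fast, indexed subsetting
setkey(my_df, component_id)

# subsequent lookups by that column then use the key
subset_df <- my_df[.("some_component_id")]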

Looking forward to the Apache Arrow dataset functionality becoming available. Thanks!

6 votes

Reading a directory of files is not something you can achieve by setting an option to the (single) file reader. If memory isn't a problem, today you can lapply/map over the directory listing and rbind/bind_rows into a single data.frame. There's probably a purrr function that does this cleanly. In that iteration over the files, you also can select/filter on each if you only need a known subset of the data.
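
For example, a rough sketch of that iteration (purrr::map_dfr is just one way to do it; the column names and filter condition below are placeholders):

library(arrow)
library(dplyr)
library(purrr)

# list the individual part files in the directory
files <- list.files("component_mapping.parquet", pattern = "^part-.*\\.parquet$", full.names = TRUE)

# read each part, keep only the columns/rows you need, then row-bind everything
my_df <- map_dfr(files, function(f) {
  read_parquet(f, col_select = c("component_id", "value")) %>%  # hypothetical columns
    filter(component_id == "abc")                               # hypothetical filter
})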

In the Arrow project, we're actively developing a multi-file dataset API that will let you do what you're trying to do, as well as push down row and column selection to the individual files and much more. Stay tuned.

2 votes

Solution for: Read partitioned parquet files from S3 into R dataframe using arrow

Since it took me a very long time to figure out a solution and I was not able to find anything on the web, I would like to share this solution for reading partitioned parquet files from S3.

library(arrow)
library(aws.s3)
library(data.table)   # for rbindlist()

bucket <- "mybucket"
prefix <- "my_prefix"

# use aws.s3 to list all "part-" files (Key) in the parquet folder for the given bucket and prefix
files <- rbindlist(get_bucket(bucket = bucket, prefix = prefix))$Key

# apply aws.s3::s3read_using to each file, using arrow::read_parquet to decode the parquet format
data <- lapply(files, function(x) {
  s3read_using(FUN = arrow::read_parquet, object = x, bucket = bucket)
})

# concatenate all parts into one data.frame
data <- do.call(rbind, data)

What a mess, but it works.
@neal-richardson is there a way to use arrow directly to read from S3? I couldn't find anything about this in the R documentation.

2 votes

As @neal-richardson alluded to in his answer, more work has been done on this, and with the current arrow package (I'm running 4.0.0) this is possible.

I noticed your files used snappy compression, which requires a special build flag before installation. (Installation documentation here: https://arrow.apache.org/docs/r/articles/install.html)

Sys.setenv("ARROW_WITH_SNAPPY" = "ON")
install.packages("arrow",force = TRUE)

The Dataset API implements the functionality you are looking for, with multi-file datasets. While the documentation does not yet include a wide variety of examples, it does provide a clear starting point. https://arrow.apache.org/docs/r/reference/Dataset.html

The example below shows a minimal way to read a multi-file dataset from a given directory and convert it to an in-memory R data frame. The API also supports filtering criteria and selecting a subset of columns, though I'm still trying to figure out the syntax myself (a hedged sketch follows the example below).

library(arrow)

## Define the dataset
DS <- arrow::open_dataset(sources = "/path/to/directory")
## Create a scanner
SO <- Scanner$create(DS)
## Load it as an Arrow Table in memory
AT <- SO$ToTable()
## Convert it to an R data frame
DF <- as.data.frame(AT)
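
For the filtering and column selection mentioned above, the Dataset API also works with dplyr verbs; here is a hedged sketch of what that syntax might look like (the column names are hypothetical):

library(arrow)
library(dplyr)

ds <- open_dataset("/path/to/directory")

# select/filter are pushed down to the individual parquet files;
# collect() materializes only the matching rows as an R data frame
DF <- ds %>%
  select(component_id, value) %>%   # hypothetical column names
  filter(component_id == "abc") %>%
  collect()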