10 votes

How do I read a partitioned parquet file into R with arrow (without any Spark)

The situation

  1. Parquet files are created by a Spark pipeline and saved on S3
  2. They are read with RStudio/RShiny, with one column used as an index for further analysis

The parquet file structure

The parquet output created by my Spark job consists of several parts:

tree component_mapping.parquet/
component_mapping.parquet/
├── _SUCCESS
├── part-00000-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00001-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00002-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00003-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00004-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── etc

How do I read this component_mapping.parquet into R?

What I tried

install.packages("arrow")
library(arrow)
my_df <- read_parquet("component_mapping.parquet")

but this fails with the error

IOError: Cannot open for reading: path 'component_mapping.parquet' is a directory

It works if I read just one file from the directory

install.packages("arrow")
library(arrow)
my_df<-read_parquet("component_mapping.parquet/part-00000-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet")

but I need to load all of the parts in order to query across them.

What I found in the documentation

In the Apache Arrow documentation https://arrow.apache.org/docs/r/reference/read_parquet.html and https://arrow.apache.org/docs/r/reference/ParquetReaderProperties.html I found that there are some properties for the read_parquet() command, but I can't get them working and cannot find any examples.

read_parquet(file, col_select = NULL, as_data_frame = TRUE, props = ParquetReaderProperties$create(), ...)
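
From that signature, the closest thing I can see is col_select, which seems to work on a single part file like this (the column name below is just a placeholder), but that still reads only one part:

library(arrow)

# hedged sketch: col_select takes a character vector (or tidy selection) of columns to keep,
# but it still applies to a single file, not the whole directory
one_part <- read_parquet(
  "component_mapping.parquet/part-00000-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet",
  col_select = c("component_id")  # "component_id" is a hypothetical column name
)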

How do I set the properties correctly to read the full directory?

# presumably one of these methods
$read_dictionary(column_index)
# or
$set_read_dictionary(column_index, read_dict)

Help would be much appreciated.


4 Answers

9 votes

Solution for: Read partitioned parquet files from local file system into R dataframe with arrow

As I would like to avoid using any Spark or Python on the RShiny server, I can't use other libraries such as sparklyr, SparkR, or reticulate with dplyr, as described e.g. in How do I read a Parquet in R and convert it to an R DataFrame?

I have now solved my task with your proposal, using arrow together with lapply and rbindlist:

my_df <- data.table::rbindlist(
  lapply(Sys.glob("component_mapping.parquet/part-*.parquet"), arrow::read_parquet)
)
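
Since I need one column as an index for further analysis, keying the resulting data.table on that column is a possible follow-up (the column name component_id is only a placeholder):

library(data.table)

# set a key on the (hypothetical) index column for fast, indexed subsetting
setkey(my_df, component_id)

# subsequent lookups by that column then use the key
subset_df <- my_df[.("some_component_id")]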

Looking forward to the Apache Arrow dataset functionality becoming available. Thanks!

6 votes

Reading a directory of files is not something you can achieve by setting an option to the (single) file reader. If memory isn't a problem, today you can lapply/map over the directory listing and rbind/bind_rows into a single data.frame. There's probably a purrr function that does this cleanly. In that iteration over the files, you also can select/filter on each if you only need a known subset of the data.
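
For example, a rough sketch of that iteration (purrr::map_dfr is just one way to do it; the column names and filter condition below are placeholders):

library(arrow)
library(dplyr)
library(purrr)

# list the individual part files in the directory
files <- list.files("component_mapping.parquet", pattern = "^part-.*\\.parquet$", full.names = TRUE)

# read each part, keep only the columns/rows you need, then row-bind everything
my_df <- map_dfr(files, function(f) {
  read_parquet(f, col_select = c("component_id", "value")) %>%  # hypothetical columns
    filter(component_id == "abc")                               # hypothetical filter
})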

In the Arrow project, we're actively developing a multi-file dataset API that will let you do what you're trying to do, as well as push down row and column selection to the individual files and much more. Stay tuned.

2 votes

Solution for: Read partitioned parquet files from S3 into R dataframe using arrow

Since it took me a very long time to figure out a solution and I was not able to find anything on the web, I would like to share this solution for reading partitioned parquet files from S3.

library(arrow)
library(aws.s3)
library(data.table)   # for rbindlist()

bucket <- "mybucket"
prefix <- "my_prefix"

# use aws.s3 to list all "part-" files (Key) in the parquet folder for the given bucket and prefix
files <- rbindlist(get_bucket(bucket = bucket, prefix = prefix))$Key

# apply aws.s3::s3read_using to each file, using arrow::read_parquet to decode the parquet format
data <- lapply(files, function(x) {
  s3read_using(FUN = arrow::read_parquet, object = x, bucket = bucket)
})

# concatenate all parts into one data.frame
data <- do.call(rbind, data)

What a mess, but it works.
@neal-richardson is there a way to use arrow directly to read from S3? I couldn't find anything about this in the R documentation.

2 votes

As @neal-richardson alluded to in his answer, more work has been done on this, and with the current arrow package (I'm running 4.0.0) this is possible.

I noticed your files used snappy compression, which requires a special build flag before installation. (Installation documentation here: https://arrow.apache.org/docs/r/articles/install.html)

Sys.setenv("ARROW_WITH_SNAPPY" = "ON")
install.packages("arrow",force = TRUE)

The Dataset API implements the functionality you are looking for, with multi-file datasets. While the documentation does not yet include a wide variety of examples, it does provide a clear starting point. https://arrow.apache.org/docs/r/reference/Dataset.html

The example below shows a minimal way to read a multi-file dataset from a given directory and convert it to an in-memory R data frame. The API also supports filtering criteria and selecting a subset of columns, though I'm still trying to figure out the syntax myself (a hedged sketch follows the example below).

library(arrow)

## Define the dataset
DS <- arrow::open_dataset(sources = "/path/to/directory")
## Create a scanner
SO <- Scanner$create(DS)
## Load it as an Arrow Table in memory
AT <- SO$ToTable()
## Convert it to an R data frame
DF <- as.data.frame(AT)
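
For the filtering and column selection mentioned above, the Dataset API also works with dplyr verbs; here is a hedged sketch of what that syntax might look like (the column names are hypothetical):

library(arrow)
library(dplyr)

ds <- open_dataset("/path/to/directory")

# select/filter are pushed down to the individual parquet files;
# collect() materializes only the matching rows as an R data frame
DF <- ds %>%
  select(component_id, value) %>%   # hypothetical column names
  filter(component_id == "abc") %>%
  collect()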