22
votes

Is it possible to read Parquet files from Scala without using Apache Spark?

I found a project which allows us to read and write Avro files using plain Scala.

https://github.com/sksamuel/avro4s

However, I can't find a way to read and write Parquet files from a plain Scala program without using Spark.


3 Answers

20
votes

It's straightforward enough to do using the parquet-mr project, which is the project Alexey Raga is referring to in his answer.

Some sample code:

import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetReader
import org.apache.parquet.hadoop.ParquetReader

val reader = AvroParquetReader.builder[GenericRecord](path).build().asInstanceOf[ParquetReader[GenericRecord]]
// iter is of type Iterator[GenericRecord]
val iter = Iterator.continually(reader.read).takeWhile(_ != null)
// if you want a list then...
val list = iter.toList

This will return standard Avro GenericRecords, but if you want to turn them into a Scala case class, you can use my avro4s library, which you linked to in your question, to do the marshalling for you. Assuming you are using version 1.30 or higher:

import com.sksamuel.avro4s.RecordFormat

case class Bibble(name: String, location: String)
val format = RecordFormat[Bibble]
// then for a given record
val bibble = format.from(record)

We can obviously combine that with the original iterator in one step:

val reader = AvroParquetReader.builder[GenericRecord](path).build().asInstanceOf[ParquetReader[GenericRecord]]
val format = RecordFormat[Bibble]
// iter is now an Iterator[Bibble]
val iter = Iterator.continually(reader.read).takeWhile(_ != null).map(format.from)
// and list is now a List[Bibble]
val list = iter.toList
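
Writing works the same way in reverse. Here is a minimal sketch, using AvroParquetWriter from parquet-mr together with RecordFormat.to to turn case classes back into GenericRecords; the file name and records are just placeholders:

import org.apache.parquet.avro.AvroParquetWriter

val format = RecordFormat[Bibble]
val records = List(Bibble("a", "here"), Bibble("b", "there")).map(format.to)

// the writer needs the Avro schema, which we can take from any of the records
val writer = AvroParquetWriter.builder[GenericRecord](new Path("bibbles.parquet"))
  .withSchema(records.head.getSchema)
  .build()

records.foreach(writer.write)
writer.close()
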
7
votes

There is also a relatively new project called eel; it is a lightweight (non-distributed processing) toolkit for using some of the 'big data' technologies in the small.
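
Reading a Parquet file with it is roughly a one-liner. A rough sketch, assuming eel's ParquetSource and toDataStream API (the exact package and method names vary between eel versions, so treat this as an assumption):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import io.eels.component.parquet.ParquetSource

// eel expects an implicit Hadoop configuration and filesystem in scope
implicit val conf = new Configuration()
implicit val fs = FileSystem.get(conf)

// rows is a collection of eel Row objects read from the file
val rows = ParquetSource(new Path("bibbles.parquet")).toDataStream().collect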

3
votes

Yes, you don't have to use Spark to read/write Parquet. Just use the Parquet library directly from your Scala code (that's what Spark does anyway): http://search.maven.org/#search%7Cga%7C1%7Cparquet
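
If it helps, the sbt coordinates for this approach look something like the following (the version numbers are only illustrative, check Maven Central for current ones):

libraryDependencies ++= Seq(
  "org.apache.parquet" % "parquet-avro" % "1.8.1",    // AvroParquetReader / AvroParquetWriter
  "org.apache.hadoop" % "hadoop-client" % "2.7.3",    // Path, Configuration, FileSystem
  "com.sksamuel.avro4s" %% "avro4s-core" % "1.6.2"    // optional: case class <-> GenericRecord
)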