
I have two streams of data that have limited duration (typically 1-60 seconds) and I want to store them in a compressed data file for later retrieval. Right now I am using HDF5, but I've heard about Parquet and want to give it a try.

Stream 1:

The data is arriving as a series of records, approximately 2500 records per second. Each record is a tuple (timestamp, tag, data) with the following sizes:

  • timestamp: 64-bit value
  • tag: 8-bit value
  • data: variable-length octets (typically about 100 bytes per record, sometimes more, sometimes less)

Stream 2:

The data is arriving as a series of records, approximately 100000 records per second. Each record is a tuple (timestamp, index, value) with the following sizes:

  • timestamp: 64-bit value
  • index: 16-bit value
  • value: 32-bit value
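For reference, here is my guess at how these two record layouts would map to Arrow types (the unsigned widths are my own assumption; I haven't verified which logical types Parquet would actually use):

    import pyarrow as pa

    # Stream 1: (timestamp, tag, data)
    stream1_schema = pa.schema([
        ("timestamp", pa.uint64()),  # 64-bit value
        ("tag", pa.uint8()),         # 8-bit value
        ("data", pa.binary()),       # variable-length octets
    ])

    # Stream 2: (timestamp, index, value)
    stream2_schema = pa.schema([
        ("timestamp", pa.uint64()),  # 64-bit value
        ("index", pa.uint16()),      # 16-bit value
        ("value", pa.uint32()),      # 32-bit value
    ])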

Can I do this with Apache Parquet? I am totally new to this and can't seem to find the right documentation; I found documentation about reading/writing entire tables, but in my case I need to write to the tables incrementally, in batches of some number of rows (depending on how large a buffer I want to use).

I am interested in both Java and Python and can explore in either, but I'm more fluent in Python.

I found this page for pyarrow: https://arrow.apache.org/docs/python/parquet.html. It talks about row groups, ParquetWriter, and read_row_group(), but I can't tell whether it supports my use case.
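From that page, this is roughly what I imagine incremental writing would look like for Stream 1. Here buffered_batches() is a made-up placeholder for however I end up buffering the incoming records, and I haven't verified the details:

    import pyarrow as pa
    import pyarrow.parquet as pq

    schema = pa.schema([
        ("timestamp", pa.uint64()),
        ("tag", pa.uint8()),
        ("data", pa.binary()),
    ])

    with pq.ParquetWriter("stream1.parquet", schema, compression="snappy") as writer:
        # placeholder: yields lists of (timestamp, tag, data) tuples
        for batch in buffered_batches():
            timestamps, tags, payloads = zip(*batch)
            table = pa.Table.from_arrays(
                [pa.array(timestamps, pa.uint64()),
                 pa.array(tags, pa.uint8()),
                 pa.array(payloads, pa.binary())],
                schema=schema,
            )
            # each call appends one row group to the open file
            writer.write_table(table)

Is something like this the intended usage?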

Any suggestions?


1 Answer


Parquet is the go-to file format for large datasets; you can find a number of articles (like this) that use Parquet with big data.

Since your streams arrive at high frequency and throughput, they can be classified as big data, so Parquet is highly recommended.

That said, the direction you are looking into doesn't seem scalable. Since the solution is not restricted to plain Python, I would recommend looking into Spark Streaming with Python instead of pyarrow. You can read the input and produce Parquet output in batches with a script as simple as this.
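For illustration, here is a minimal sketch of that approach using the Structured Streaming API (my own example, not the linked article's code). It assumes Stream 2 records land as CSV files in an input directory; in_dir, out_dir, and chk_dir are placeholder paths:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, LongType, IntegerType

    spark = SparkSession.builder.appName("stream-to-parquet").getOrCreate()

    # Stream 2 layout: (timestamp, index, value). Spark has no unsigned
    # integer types, so the 16-bit index and 32-bit value are widened here.
    schema = StructType([
        StructField("timestamp", LongType(), False),
        StructField("index", IntegerType(), False),
        StructField("value", LongType(), False),
    ])

    # Watch a directory for newly arriving CSV files.
    stream = spark.readStream.schema(schema).csv("in_dir")

    # Write each micro-batch out as Parquet; the file sink requires a checkpoint.
    query = (stream.writeStream
             .format("parquet")
             .option("path", "out_dir")
             .option("checkpointLocation", "chk_dir")
             .start())

    query.awaitTermination()

Each micro-batch then lands as one or more Parquet files under out_dir.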

Please let me know if you have any concerns.