I have two streams of data that have limited duration (typically 1-60 seconds) and I want to store them in a compressed data file for later retrieval. Right now I am using HDF5, but I've heard about Parquet and want to give it a try.
Stream 1:
The data is arriving as a series of records, approximately 2500 records per second. Each record is a tuple (timestamp, tag, data) with the following sizes:
- timestamp: 64-bit value
- tag: 8-bit value
- data: variable-length octets (typically about 100 bytes per record, sometimes more, sometimes less)
Stream 2:
The data is arriving as a series of records, approximately 100000 records per second. Each record is a tuple (timestamp, index, value) with the following sizes:
- timestamp: 64-bit value
- index: 16-bit value
- value: 32-bit value
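
In pyarrow terms, I think the two schemas would look roughly like this (this is just my guess at the type mapping, e.g. plain unsigned ints for the timestamps rather than pa.timestamp, and binary for the variable-length data):

```python
import pyarrow as pa

# My guess at the type mapping -- not sure these are the best choices.
stream1_schema = pa.schema([
    ("timestamp", pa.uint64()),  # 64-bit timestamp
    ("tag", pa.uint8()),         # 8-bit tag
    ("data", pa.binary()),       # variable-length octets, ~100 bytes per record
])

stream2_schema = pa.schema([
    ("timestamp", pa.uint64()),  # 64-bit timestamp
    ("index", pa.uint16()),      # 16-bit index
    ("value", pa.uint32()),      # 32-bit value
])
```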
Can I do this with Apache Parquet? I am totally new to this and can't seem to find the right documentation; I have found documentation about reading/writing entire tables, but in my case I need to write to the file incrementally, in batches of some number of rows (depending on how large a buffer I want to use).
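
To make the question concrete, this is the sort of incremental-write pattern I have in mind, using the ParquetWriter class from the pyarrow docs linked below (the buffered records and the "zstd" compression choice are just placeholders), but I can't tell whether this is the intended way to use it:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Same guess at the stream 2 schema as above.
schema = pa.schema([
    ("timestamp", pa.uint64()),
    ("index", pa.uint16()),
    ("value", pa.uint32()),
])

with pq.ParquetWriter("stream2.parquet", schema, compression="zstd") as writer:
    # In real use this loop would run until the stream ends, flushing
    # whatever has accumulated in the buffer once it reaches the batch size.
    for _ in range(3):
        buffered = {  # made-up buffered records
            "timestamp": [1, 2, 3],
            "index": [10, 11, 12],
            "value": [100, 200, 300],
        }
        writer.write_table(pa.table(buffered, schema=schema))
```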
I am interested in both Java and Python and can explore in either, but I'm more fluent in Python.
I found this page for pyarrow: https://arrow.apache.org/docs/python/parquet.html --- it talks about row groups, ParquetWriter, and read_row_group(), but I can't tell if it supports my use case.
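
For retrieval, this is roughly what I'd expect to do with the file from the sketch above, though I don't know whether reading one row group at a time is the right way to get the batches back:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("stream2.parquet")
print(pf.num_row_groups)            # I assume roughly one row group per write_table() call
first_batch = pf.read_row_group(0)  # comes back as a pyarrow Table
everything = pq.read_table("stream2.parquet")  # or just read the whole file at once
```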
Any suggestions?