2
votes

I am trying to import a large CSV file (50GB+) without a header row into a pyarrow table. The overall goal is to export the file to the Parquet format and then process it further in a Pandas or Dask DataFrame. How can I specify the column names and column dtypes in pyarrow for the CSV file?

I have considered adding a header to the CSV file, but that forces a complete rewrite of the file, which looks like unnecessary overhead. As far as I know, pyarrow provides schemas to define the dtypes of specific columns, but the docs lack a concrete example of doing so while converting a CSV file to an Arrow table.

As a simple example, imagine that this CSV file has just the two columns "A" and "B". My current code looks like this:

    import numpy as np
    import pandas as pd
    import pyarrow as pa
    import pyarrow.csv  # the csv submodule must be imported for pa.csv to resolve

    df_with_header = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

    print(df_with_header)
    # Write the sample data without a header row
    df_with_header.to_csv("data.csv", header=False, index=False)

    df_without_header = pd.read_csv('data.csv', header=None)
    print(df_without_header)

    # Intended column types; the names 'A' and 'B' do not exist in the file
    opts = pa.csv.ConvertOptions(column_types={'A': 'int8',
                                               'B': 'int8'})

    table = pa.csv.read_csv(input_file="data.csv", convert_options=opts)
    print(table)

When I print the final table, the column names have not changed; the values of the first row are used as names instead:

    pyarrow.Table
    1: int64
    3: int64

How can I change the loaded column names and dtypes? Is it also possible, for example, to pass in a dict containing the names and their dtypes?

1
You can provide the types and names of the columns using ConvertOptions: arrow.apache.org/docs/python/generated/… and arrow.apache.org/docs/python/generated/… - 0x26res
Do you have a short example of how to provide or set up the correct dictionary for the column_types param in ConvertOptions? - azo91

1 Answer

4
votes

You can specify type overrides for columns:

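    import io

    import pyarrow as pa
    from pyarrow import csv

    # In-memory CSV with a header row
    fp = io.BytesIO(b'one,two,three\n1,2,3\n4,5,6')
    fp.seek(0)
    table = csv.read_csv(
        fp,
        convert_options=csv.ConvertOptions(
            # Override the inferred type for each named column
            column_types={
                'one': pa.int8(),
                'two': pa.int8(),
                'three': pa.int8(),
            }
        ))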

But in your case the file has no header, and as far as I can tell this use case is not supported in the current Arrow release:

    fp = io.BytesIO(b'1,2,3\n4,5,6')
    fp.seek(0)
    table = csv.read_csv(
        fp,
        parse_options=csv.ParseOptions(header_rows=0)
    )

This raises:

pyarrow.lib.ArrowInvalid: header_rows == 0 needs explicit column names

The code is here: https://github.com/apache/arrow/blob/3cf8f355e1268dd8761b99719ab09cc20d372185/cpp/src/arrow/csv/reader.cc#L138

This is similar to the question "apache arrow - reading csv file".

There should be a fix for it in the next version: https://github.com/apache/arrow/pull/4898