I'm using a CTAS statement to create a Parquet file from a CSV file in Apache Drill. I've run multiple experiments changing various configuration parameters, and even tried writing to tmpfs.

My tests always take the same amount of time. I'm not IO bound. I may be CPU bound: one Java thread is consistently at 100% most of the time.

Experiments tried:

store.parquet.compression=none
store.parquet.page-size=8192
planner.slice_target=10000
store.parquet.block-size=104857600
store.text.estimated_row_size_bytes=4k
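
For reference, here is a minimal sketch of how settings like these can be applied per session with ALTER SESSION. The value 4096.0 is an assumption about what "4k" means, written as a decimal on the assumption that this option is stored as a double; the final query just reads the values back from sys.options to check that they took effect.

```sql
-- Session-level equivalents of the experiments listed above.
ALTER SESSION SET `store.parquet.compression` = 'none';
ALTER SESSION SET `store.parquet.page-size` = 8192;
ALTER SESSION SET `planner.slice_target` = 10000;
ALTER SESSION SET `store.parquet.block-size` = 104857600;
-- Assumption: "4k" means 4096 bytes, written as a decimal value.
ALTER SESSION SET `store.text.estimated_row_size_bytes` = 4096.0;

-- Read the options back to confirm what the session is using.
SELECT * FROM sys.options
WHERE name LIKE 'store.parquet%' OR name = 'planner.slice_target';
```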

I've reached the conclusion that perhaps Drill is single-threaded for writing. Can anybody confirm this?

With a 12-core server I have plenty of headroom available that is not being utilised.

Is it possible to run multiple drillbits on a single server?

Update: It appears that the performance is the same whether the CTAS output format is CSV or Parquet, so the limitation appears to be Drill's ability to write data in general.

Update 2: Changing the input CSV made a significant difference in performance. Originally the input file had no header, and the CTAS statement was of the form:

CREATE TABLE (col1, col2, col3, ...) AS SELECT columns[0], columns[1], columns[2] from filename;

Switching to a CSV file with a header line changed the statement to something like:

CREATE TABLE (name1, name2, name3, ...) AS SELECT name1, name2, name3 from filename;

where name1, name2, etc. are defined in the header line. This reduced the overall execution time from a consistent 13 minutes to 9 minutes.
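
As a rough sketch of the header-based variant, assuming a writable dfs.tmp workspace and a hypothetical input path, a table function can tell Drill to read the first line as column names. The table name output_parquet and the file path are placeholders, not taken from my actual setup.

```sql
-- Write CTAS output as Parquet.
ALTER SESSION SET `store.format` = 'parquet';

-- Hypothetical table name and path; dfs.tmp is assumed to be writable.
-- extractHeader => true makes Drill take column names from the first line,
-- so columns can be selected by name instead of columns[n].
CREATE TABLE dfs.tmp.`output_parquet` AS
SELECT name1, name2, name3
FROM TABLE(dfs.`/path/to/input.csv`(type => 'text',
                                    fieldDelimiter => ',',
                                    extractHeader => true));
```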

1 Answer

You cannot run multiple drillbits on a single server.

Yes, in my observation Drill also uses a lot of processing power. CPU usage often goes to 300-400% when we're computing on a large data set, and I think it uses a single thread for writing Parquet files.