The problem:
- Parquet read performance of Drill appears to be 5x - 10x worse when reading from Azure Storage, which renders it unusable for larger data workloads.
- It appears to be a problem only when reading Parquet files. Reading CSV, on the other hand, performs normally.
Let's have:
- Azure Blob Storage account with a ~1GB source.csv and Parquet files containing the same data
- Azure Premium File Storage with the same files
- A local disk folder containing the same files
- Drill running on an Azure VM in single-node (embedded) mode
Drill configuration:
- Azure Blob Storage plugin working as namespace `blob`
- Azure Files mounted via SMB to /data/dfs, used as namespace `dfs`
- Local disk folder used as namespace `local`
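For reference, a minimal sketch of how the Azure Files share could be mounted over SMB. The account name, share name, and credentials file path are placeholders, and the `cache`/`rsize` options are assumptions worth experimenting with, since they directly affect small-read performance:

```shell
# Hypothetical mount of the Azure Files share at /data/dfs.
# <account> and <share> are placeholders; the credentials file holds
# username/password. vers=3.0 is required by Azure Files; cache=strict
# and a large rsize can matter for read-heavy workloads.
sudo mkdir -p /data/dfs
sudo mount -t cifs //<account>.file.core.windows.net/<share> /data/dfs \
    -o vers=3.0,credentials=/etc/smbcredentials/<account>.cred,cache=strict,rsize=1048576,serverino
```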
The VM
- Standard E4s v3 (4 vcpus, 32 GiB memory)
- 256GB SSD
- NIC 2Gbps
- 6400 IOPS / 96MBps
Azure Premium Files Share
- 1000GB
- 1000 IOPS base / 3000 IOPS Burst
- 120MB/s throughput
Storage benchmarks
- Measured with `dd`, 1GB of data, various block sizes, `conv=fdatasync`; the FS cache was dropped before each read test (`sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"`)
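The `dd` runs can be sketched roughly as below. `TARGET`, `BS`, and `COUNT` are placeholders with deliberately small defaults for illustration; the actual runs used ~1GB of data (e.g. `bs=1024k count=1024`) against the local disk, the SMB mount at /data/dfs, and the blob-backed paths:

```shell
# Sketch of the dd benchmark; sizes kept small here for illustration.
TARGET="${TARGET:-/tmp}"
BS="${BS:-64}"        # block size in KiB
COUNT="${COUNT:-16}"  # number of blocks

# Write test: conv=fdatasync forces a flush so the reported speed
# reflects storage throughput, not the page cache.
dd if=/dev/zero of="$TARGET/ddtest.bin" bs="${BS}k" count="$COUNT" conv=fdatasync

# Drop the filesystem cache so the read test hits storage, not RAM
# (silently skipped when not running as root).
sudo -n sh -c "echo 3 > /proc/sys/vm/drop_caches" 2>/dev/null || true

# Read test.
dd if="$TARGET/ddtest.bin" of=/dev/null bs="${BS}k"
rm -f "$TARGET/ddtest.bin"
```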
Local disk
+-------+-----------------+--------+
| Mode  | Block size (KB) | Speed  |
+-------+-----------------+--------+
| Write | 1024            | 37MB/s |
| Write | 64              | 16MB/s |
| Read  | 1024            | 70MB/s |
| Read  | 64              | 44MB/s |
+-------+-----------------+--------+
Azure Premium File Storage SMB mount
+-------+-----------------+---------+
| Mode  | Block size (KB) | Speed   |
+-------+-----------------+---------+
| Write | 1024            | 100MB/s |
| Write | 64              | 23MB/s  |
| Read  | 1024            | 88MB/s  |
| Read  | 64              | 40MB/s  |
+-------+-----------------+---------+
Azure Blob
The maximum known throughput of Azure Blob Storage here is 60MB/s; upload/download speeds are clamped to the target storage's read/write limits.
Drill benchmarks
- The filesystem cache was purged before every read test.
- IO performance was observed with `iotop`.
- The queries were kept deliberately simple for demonstration; execution time grows linearly for more complex queries.
Sample queries:
-- Query A: Reading parquet
select sum(`Price`) as test from namespace.`Parquet/**/*.parquet`;
-- Query B: Reading CSV
select sum(CAST(`Price` as DOUBLE)) as test from namespace.`sales.csv`;
Results
+-------------+--------------------+----------+-----------------+
| Query | Source (namespace) | Duration | Disk read usage |
+-------------+--------------------+----------+-----------------+
| A (Parquet) | dfs(smb) | 14.8s | 2.8 - 3.5 MB/s |
| A (Parquet) | blob | 24.5s | N/A |
| A (Parquet) | local | 1.7s | 40 - 80 MB/s |
+-------------+--------------------+----------+-----------------+
| B (CSV) | dfs(smb) | 22s | 30 - 60 MB/s |
| B (CSV) | blob | 29s | N/A |
| B (CSV) | local | 18s | 68 MB/s |
+-------------+--------------------+----------+-----------------+
Observations
- When reading Parquet, more threads spawn, but only the `cifsd` process consumes the IO bandwidth.
- Tried tuning the Parquet reader performance as described here, but without any significant results.
- There is a large peak of egress data when querying Parquet files from Azure Storage, exceeding the Parquet data size several times over: the files total ~300MB, but the egress peak for a single read query is about 2.5GB (roughly 8x the data size).
Conclusion
- Reading Parquet files from Azure Files is, for some reason, slowed to ridiculous speeds.
- Reading Parquet files from Azure Blob is even slightly slower.
- Reading Parquet files from the local filesystem is nicely fast, but not suitable for real use.
- Reading CSV from any source utilizes storage throughput normally, so I assume some problem or misconfiguration of the Parquet reader.
The questions
- Why is Parquet read performance from Azure Storage so drastically reduced?
- Is there a way to optimize it?