
The problem:

  • Drill's Parquet read performance appears to be 5x - 10x worse when reading from Azure Storage, which makes it unusable for bigger data workloads.
  • The problem appears only when reading Parquet files; reading CSV, on the other hand, runs at normal speed.

Let's have:

  • An Azure Blob Storage account with a ~1GB source.csv and Parquet files containing the same data
  • An Azure Premium Files share with the same files
  • A local disk folder containing the same files
  • Drill running on an Azure VM in single-node mode

Drill configuration:

  • Azure Blob Storage plugin configured as namespace blob
  • Azure Files mounted over SMB at /data/dfs, used as namespace dfs (see the mount sketch after this list)
  • Local disk folder used as namespace local
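
For reference, the dfs namespace is the Azure Files share mounted over SMB roughly like this (a minimal sketch with placeholder account, share, and key values, not the exact command used):

# mount the Azure Files share over SMB 3.0 at /data/dfs (angle-bracket values are placeholders)
sudo mkdir -p /data/dfs
sudo mount -t cifs //<storage-account>.file.core.windows.net/<share-name> /data/dfs \
  -o vers=3.0,username=<storage-account>,password=<storage-account-key>,dir_mode=0777,file_mode=0777,serverino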

The VM

  • Standard E4s v3 (4 vcpus, 32 GiB memory)
  • 256GB SSD
  • NIC 2Gbps
  • 6400 IOPS / 96MB/s

Azure Premium Files Share

  • 1000GB
  • 1000 IOPS base / 3000 IOPS Burst
  • 120MB/s throughput

Storage benchmarks

  • Measured with dd: 1GB of data, various block sizes, conv=fdatasync (see the commands below)
  • The FS cache was dropped before each read test (sudo sh -c "echo 3 > /proc/sys/vm/drop_caches")
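
The benchmark commands were along these lines (a sketch; the test file path is a placeholder and bs=1024k assumes the block sizes in the tables below are KB values):

# write test: 1GB of data, flushed to the target storage before dd exits
dd if=/dev/zero of=/data/dfs/dd_testfile bs=1024k count=1024 conv=fdatasync
# drop the filesystem cache, then run the read test against the same file
sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
dd if=/data/dfs/dd_testfile of=/dev/null bs=1024k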

Local disk

+-------+------------+--------+
| Mode  | Block size | Speed  |
+-------+------------+--------+
| Write |       1024 | 37MB/s |
| Write |         64 | 16MB/s |
| Read  |       1024 | 70MB/s |
| Read  |         64 | 44MB/s |
+-------+------------+--------+

Azure Premium File Storage SMB mount

+-------+------------+---------+
| Mode  | Block size |  Speed  |
+-------+------------+---------+
| Write |       1024 | 100MB/s |
| Write |         64 | 23MB/s  |
| Read  |       1024 | 88MB/s  |
| Read  |         64 | 40MB/s  |
+-------+------------+---------+

Azure Blob

The maximum known throughput of Azure Blob Storage here is 60MB/s; upload/download speeds are clamped to the target storage's read/write limits.


Drill benchmarks

  • The filesystem cache was purged before every read test.
  • IO performance was observed with iotop (see the invocation after this list).
  • The queries were kept deliberately simple for demonstration; execution time grows linearly for more complex queries.
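
The iotop invocation was roughly as follows (a sketch; -o restricts the output to threads that are actually doing IO):

# show only threads currently performing IO, refreshing once per second
sudo iotop -o -d 1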

Sample queries:

-- Query A: Reading parquet
select sum(`Price`) as test from namespace.`Parquet/**/*.parquet`;

-- Query B: Reading CSV
select sum(CAST(`Price` as DOUBLE)) as test from namespace.`sales.csv`;

Results

+-------------+--------------------+----------+-----------------+
|    Query    | Source (namespace) | Duration | Disk read usage |
+-------------+--------------------+----------+-----------------+
| A (Parquet) | dfs(smb)           | 14.8s    | 2.8 - 3.5 MB/s  |
| A (Parquet) | blob               | 24.5s    | N/A             |
| A (Parquet) | local              | 1.7s     | 40 - 80 MB/s    |
+-------------+--------------------+----------+-----------------+
| B (CSV)     | dfs(smb)           | 22s      | 30 - 60 MB/s    |
| B (CSV)     | blob               | 29s      | N/A             |
| B (CSV)     | local              | 18s      | 68 MB/s         |
+-------------+--------------------+----------+-----------------+

Observations

  • When reading Parquet, more threads are spawned, but only the cifsd (SMB client) kernel thread accounts for the IO throughput.
  • I tried tuning the Parquet reader performance as described here, but without any significant results (see the sketch after this list).
  • There is a large egress peak while querying Parquet from Azure Storage that exceeds the Parquet data size several times over: the Parquet files total ~300MB, but the egress peak for a single read query is about 2.5GB.
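
The tuning attempts mentioned above were along these lines (a sketch of Drill's asynchronous Parquet page reader session options; the values shown are illustrative and option availability may vary by Drill version):

ALTER SESSION SET `store.parquet.reader.pagereader.async` = true;
ALTER SESSION SET `store.parquet.reader.pagereader.bufferedread` = true;
ALTER SESSION SET `store.parquet.reader.pagereader.buffersize` = 4194304;
ALTER SESSION SET `store.parquet.reader.columnreader.async` = true;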

Conclusion

  • Reading Parquet from Azure Files is, for some reason, slowed down to unusable speeds.
  • Reading Parquet from Azure Blob is even a bit slower.
  • Reading Parquet from the local filesystem is nicely fast, but a local disk is not suitable for real use.
  • Reading CSV from any source utilizes the storage throughput normally, so I assume there is a problem or misconfiguration in the Parquet reader.

The questions

  • Why is Parquet read performance from Azure Storage so drastically reduced?
  • Is there a way to optimize it?
Have you attempted anything in Databricks? I am noticing ridiculously poor performance when writing Parquet to Blob Storage from a Databricks DataFrame. (JoshuaJames)
I am facing the same issue but with a different use case: my application reads about 48 files of roughly 1MB each. The first response from Azure Blob takes ~5 minutes, the next request takes ~2 minutes, and later requests gradually drop to around 400 milliseconds. Is there anything related to caching? (Milesh)

1 Answer


I assume that you have already cross-checked the IO performance issue using Azure Monitor. If the issue still persists, I would like to work on it with you more closely. This may require a deeper investigation, so if you have a support plan, I request that you file a support ticket; otherwise, please let us know and we will try to help you get one-time free technical support. In that case, could you send an email to AzCommunity[at]Microsoft[dot]com referencing this thread? Please mention "ATTN subm" in the subject field. Thank you for your cooperation on this matter; I look forward to your reply.