The problem:
- Parquet read performance of Drill appears to be 5x - 10x worse when reading from Azure Storage, which renders it unusable for larger data workloads.
- It appears to be a problem only when reading Parquet files. Reading CSV, on the other hand, performs normally.
Let's have:
- Azure Blob Storage account with a ~1GB source.csv and Parquet files containing the same data
- Azure Premium File Storage with the same files
- A local disk folder containing the same files
- Drill running on an Azure VM in single-node (embedded) mode
Drill configuration:
- Azure Blob Storage plugin working as namespace `blob`
- Azure Files mounted via SMB to /data/dfs, used as namespace `dfs`
- Local disk folder used as namespace `local`
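For reference, a minimal sketch of how the Azure Files share could be mounted over SMB. The account name, share name, and credentials file path are placeholders, and the `cache`/`rsize` options are assumptions worth experimenting with, since they directly affect small-read performance:

```shell
# Hypothetical mount of the Azure Files share at /data/dfs.
# <account> and <share> are placeholders; the credentials file holds
# username/password. vers=3.0 is required by Azure Files; cache=strict
# and a large rsize can matter for read-heavy workloads.
sudo mkdir -p /data/dfs
sudo mount -t cifs //<account>.file.core.windows.net/<share> /data/dfs \
    -o vers=3.0,credentials=/etc/smbcredentials/<account>.cred,cache=strict,rsize=1048576,serverino
```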
The VM
- Standard E4s v3 (4 vcpus, 32 GiB memory)
- 256GB SSD
- NIC 2Gbps
- 6400 IOPS / 96MBps
Azure Premium Files Share
- 1000GB
- 1000 IOPS base / 3000 IOPS Burst
- 120MB/s throughput
Storage benchmarks
- Measured with `dd`, 1GB of data, various block sizes, `conv=fdatasync`; the FS cache was dropped before each read test (`sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"`)
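The `dd` runs can be sketched roughly as below. `TARGET`, `BS`, and `COUNT` are placeholders with deliberately small defaults for illustration; the actual runs used ~1GB of data (e.g. `bs=1024k count=1024`) against the local disk, the SMB mount at /data/dfs, and the blob-backed paths:

```shell
# Sketch of the dd benchmark; sizes kept small here for illustration.
TARGET="${TARGET:-/tmp}"
BS="${BS:-64}"        # block size in KiB
COUNT="${COUNT:-16}"  # number of blocks

# Write test: conv=fdatasync forces a flush so the reported speed
# reflects storage throughput, not the page cache.
dd if=/dev/zero of="$TARGET/ddtest.bin" bs="${BS}k" count="$COUNT" conv=fdatasync

# Drop the filesystem cache so the read test hits storage, not RAM
# (silently skipped when not running as root).
sudo -n sh -c "echo 3 > /proc/sys/vm/drop_caches" 2>/dev/null || true

# Read test.
dd if="$TARGET/ddtest.bin" of=/dev/null bs="${BS}k"
rm -f "$TARGET/ddtest.bin"
```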
Local disk
+-------+-----------------+--------+
| Mode  | Block size (KB) | Speed  |
+-------+-----------------+--------+
| Write | 1024            | 37MB/s |
| Write | 64              | 16MB/s |
| Read  | 1024            | 70MB/s |
| Read  | 64              | 44MB/s |
+-------+-----------------+--------+
Azure Premium File Storage SMB mount
+-------+-----------------+---------+
| Mode  | Block size (KB) | Speed   |
+-------+-----------------+---------+
| Write | 1024            | 100MB/s |
| Write | 64              | 23MB/s  |
| Read  | 1024            | 88MB/s  |
| Read  | 64              | 40MB/s  |
+-------+-----------------+---------+
Azure Blob
The maximum known throughput of Azure Blob Storage here is 60MB/s; upload/download speeds are clamped to the target storage's read/write limits.
Drill benchmarks
- The filesystem cache was purged before every read test.
- IO performance was observed with `iotop`.
- The queries were kept deliberately simple for demonstration; execution time grows linearly for more complex queries.
Sample queries:
-- Query A: Reading parquet
select sum(`Price`) as test from namespace.`Parquet/**/*.parquet`;
-- Query B: Reading CSV
select sum(CAST(`Price` as DOUBLE)) as test from namespace.`sales.csv`;
Results
+-------------+--------------------+----------+-----------------+
| Query | Source (namespace) | Duration | Disk read usage |
+-------------+--------------------+----------+-----------------+
| A (Parquet) | dfs(smb) | 14.8s | 2.8 - 3.5 MB/s |
| A (Parquet) | blob | 24.5s | N/A |
| A (Parquet) | local | 1.7s | 40 - 80 MB/s |
+-------------+--------------------+----------+-----------------+
| B (CSV) | dfs(smb) | 22s | 30 - 60 MB/s |
| B (CSV) | blob | 29s | N/A |
| B (CSV) | local | 18s | 68 MB/s |
+-------------+--------------------+----------+-----------------+
Observations
- When reading Parquet, more threads spawn, but only the `cifsd` process consumes the IO bandwidth.
- Tried tuning the Parquet reader performance as described here, but without any significant results.
- There is a large peak of egress data when querying Parquet files from Azure Storage, exceeding the Parquet data size several times over: the files total ~300MB, but the egress peak for a single read query is about 2.5GB (roughly 8x the data size).
Conclusion
- Reading Parquet files from Azure Files is, for some reason, slowed to ridiculous speeds.
- Reading Parquet files from Azure Blob is even slightly slower.
- Reading Parquet files from the local filesystem is nicely fast, but not suitable for real use.
- Reading CSV from any source utilizes storage throughput normally, so I assume some problem or misconfiguration of the Parquet reader.
The questions
- Why is Parquet read performance from Azure Storage so drastically reduced?
- Is there a way to optimize it?