
When opening an HDF5 file with h5py, you can pass in a Python file-like object. I have done so, with the file-like object being a custom implementation of my own network-based transport layer.
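As a rough illustration of the idea (not the actual transport layer), a read-only file-like object needs little more than seek, tell, and readinto. The class name RangeReader and the fetch_range callable below are hypothetical stand-ins for whatever fetches bytes over the network.

```python
import io


class RangeReader(io.RawIOBase):
    """Read-only, seekable file-like object backed by a range-fetch callable."""

    def __init__(self, fetch_range, size):
        super().__init__()
        # Hypothetical callable: fetch_range(offset, length) -> bytes
        self._fetch_range = fetch_range
        self._size = size
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            pos = offset
        elif whence == io.SEEK_CUR:
            pos = self._pos + offset
        elif whence == io.SEEK_END:
            pos = self._size + offset
        else:
            raise ValueError(f"unsupported whence: {whence}")
        if pos < 0:
            raise ValueError("negative seek position")
        self._pos = pos
        return self._pos

    def readinto(self, buffer):
        # Never request bytes past the end of the remote object.
        length = min(len(buffer), max(self._size - self._pos, 0))
        if length == 0:
            return 0
        data = self._fetch_range(self._pos, length)
        buffer[:len(data)] = data
        self._pos += len(data)
        return len(data)
```

An instance of such a class could then be passed as h5py.File(reader, 'r'), assuming h5py is installed and the remote bytes form a valid HDF5 file.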

This works well: I can slice large HDF5 files over a high-latency transport layer. However, HDF5 appears to provide its own file-locking functionality, so if you open multiple files read-only within the same process (threading model), the operations still effectively run in series.

There are drivers in HDF5 that support parallel operations, such as h5py.File(f, driver='mpio'), but this doesn't appear to apply to Python file-like objects, which use h5py.File(f, driver='fileobj').

The only solution I see is to use multiprocessing. However, its scalability is very limited: you can realistically only run tens of processes because of the overhead. My transport layer uses asyncio and is capable of parallel operations on the scale of thousands or tens of thousands, allowing me to build a longer queue of slow file-read operations, which boosts my total throughput.
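To make the queueing idea concrete, here is a minimal sketch of bounded-concurrency reads with asyncio. The fetch_range coroutine is a hypothetical stand-in that simulates one slow network read with a short sleep; read_many and max_in_flight are likewise illustrative names, not part of any real transport layer.

```python
import asyncio


async def fetch_range(offset, length):
    """Hypothetical stand-in for one slow, high-latency read."""
    await asyncio.sleep(0.01)  # simulated network latency
    return bytes(length)       # zero-filled placeholder payload


async def read_many(requests, max_in_flight=10_000):
    """Run many reads concurrently, capped at max_in_flight at a time."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def read_one(offset, length):
        async with semaphore:
            return await fetch_range(offset, length)

    # gather preserves the order of the input requests.
    return await asyncio.gather(*(read_one(o, n) for o, n in requests))


def run_demo():
    requests = [(i * 4096, 4096) for i in range(1000)]
    return asyncio.run(read_many(requests))
```

Because each pending read is only a suspended coroutine rather than a process, the queue depth can grow far beyond what multiprocessing allows, at the cost of the memory needed to hold the in-flight buffers.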

I can achieve 1.5 GB/s of large-file, random-seek, binary reads with my transport layer against a local S3 interface when I queue 10k I/O operations in parallel (this requires 50 GB of RAM to service the requests, an acceptable trade-off for the throughput).

Is there any way I can disable the h5py file locking when using driver='fileobj'?


1 Answer


You just need to set the environment variable HDF5_USE_FILE_LOCKING to FALSE.

For example:

On Linux or macOS, in a terminal: export HDF5_USE_FILE_LOCKING=FALSE

On Windows, in Command Prompt (CMD): set HDF5_USE_FILE_LOCKING=FALSE
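The same variable can also be set from inside Python, as in the sketch below. The helper name disable_hdf5_file_locking is hypothetical. Setting the variable before h5py is imported is the safest ordering, since the setting may be ignored if the HDF5 library has already read it. Note that this only controls HDF5's file locking; whether it removes the in-process serialisation described in the question is not confirmed here.

```python
import os
import sys
import warnings


def disable_hdf5_file_locking():
    """Set HDF5_USE_FILE_LOCKING=FALSE for the current process."""
    if "h5py" in sys.modules:
        # Assumption: the safest ordering is to set this before importing h5py.
        warnings.warn("h5py is already imported; the setting may be ignored")
    os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"
    return os.environ["HDF5_USE_FILE_LOCKING"]


disable_hdf5_file_locking()
# Import h5py only after this point, e.g.:
# import h5py
```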