
Dask read_csv timeout on S3 for big files

import dask.dataframe as dd
import s3fs
from dask.distributed import Client

s3fs.S3FileSystem.read_timeout = 5184000  # 60 days
s3fs.S3FileSystem.connect_timeout = 5184000  # 60 days

client = Client('a_remote_scheduler_ip_here:8786')

df = dd.read_csv('s3://dask-data/nyc-taxi/2015/*.csv')
len(df)

len(df) raises a timeout exception; if the file is small, it works fine.

I think we need a way to set s3fs.S3FileSystem.read_timeout on the remote workers, not just in the local code, but I have no idea how to do it.

Here is a part of the stack trace:

File "/opt/conda/lib/python3.6/site-packages/dask/bytes/utils.py", line 238, in read_block File "/opt/conda/lib/python3.6/site-packages/s3fs/core.py", line 1333, in read File "/opt/conda/lib/python3.6/site-packages/s3fs/core.py", line 1303, in _fetch File "/opt/conda/lib/python3.6/site-packages/s3fs/core.py", line 1520, in _fetch_range File "/opt/conda/lib/python3.6/site-packages/botocore/response.py", line 81, in read botocore.exceptions.ReadTimeoutError: Read timeout on endpoint URL: "None"

Side-note: s3fs is a virtual file system on top of Amazon S3. It is recommended that systems interact directly with S3 rather than using s3fs. - John Rotenstein
dask uses s3fs behind the scenes, I think. - Võ Trường Duy
(s3fs does interact directly with S3, which is a REST API over HTTP) - mdurant

1 Answer


Setting the timeouts via the class attribute seems like a reasonable thing to do, but you are using a client talking to workers in other processes or machines. Therefore, you would need to set the attribute on the copy of the class in each worker process for your method to take effect.
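For example, here is a minimal sketch using Client.run, which executes a function once on each worker (the scheduler address is the placeholder from your question, and the timeout values are arbitrary):

def set_s3fs_timeouts(read=600, connect=600):
    # Runs inside each worker process, so the attributes are set
    # on the worker's own copy of s3fs.S3FileSystem.
    import s3fs
    s3fs.S3FileSystem.read_timeout = read
    s3fs.S3FileSystem.connect_timeout = connect

client = Client('a_remote_scheduler_ip_here:8786')
client.run(set_s3fs_timeouts)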

Better, perhaps, would be to set the blocksize used by read_csv (64MB by default) to a smaller number. I assume that you are on a slow network, which is why you are getting timeouts. If you need values below 5MB, the default readahead size in s3fs, then you should also pass default_block_size amongst the storage_options passed to read_csv.
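A sketch of what that might look like (the 4MB figure is just an example, not a recommendation):

df = dd.read_csv(
    's3://dask-data/nyc-taxi/2015/*.csv',
    blocksize=4_000_000,  # smaller dask partitions -> shorter individual S3 reads
    storage_options={'default_block_size': 4_000_000},  # s3fs readahead size
)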

Note, finally, that both s3fs and dask allow for retries, on connection errors and general task errors respectively. That may be enough to help you in the case that you only get this for the occasional laggy read.
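On the dask side, a minimal sketch of task-level retries (the retry count is arbitrary):

# Failed tasks are re-run up to 3 times before the computation errors out.
future = client.compute(df.size, retries=3)
print(future.result())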