dask read_csv timeout on s3 for big files
import s3fs
import dask.dataframe as dd
from dask.distributed import Client

# patch the class attributes locally (note: 5184000 s is 60 days, not one day)
s3fs.S3FileSystem.read_timeout = 5184000
s3fs.S3FileSystem.connect_timeout = 5184000

client = Client('a_remote_scheduler_ip_here:8786')
df = dd.read_csv('s3://dask-data/nyc-taxi/2015/*.csv')
len(df)
len(df) raises a timeout exception for the big files; if the file is small, it works fine.
I think we need a way to set s3fs.S3FileSystem.read_timeout on the remote workers, not just in the local code, but I have no idea how to do it.
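A rough sketch of two possible approaches (not a confirmed fix): either pass the timeouts through storage_options, which dask forwards to the s3fs.S3FileSystem constructor on whichever worker opens the file (this assumes an s3fs version that supports config_kwargs, which is handed to botocore.client.Config), or patch the class attributes on every worker with client.run. The scheduler address and timeout values below are placeholders.

import dask.dataframe as dd
from dask.distributed import Client

client = Client('a_remote_scheduler_ip_here:8786')

# Option 1: per-filesystem timeouts via storage_options.
# config_kwargs is passed to botocore.client.Config, which accepts
# read_timeout and connect_timeout (in seconds).
df = dd.read_csv(
    's3://dask-data/nyc-taxi/2015/*.csv',
    storage_options={'config_kwargs': {'read_timeout': 600,
                                       'connect_timeout': 60}},
)
len(df)

# Option 2: patch the class attributes in every worker process.
# Workers may already hold a cached S3FileSystem instance, so this is
# best done before any S3 access happens on them.
def set_s3fs_timeouts():
    import s3fs
    s3fs.S3FileSystem.read_timeout = 5184000
    s3fs.S3FileSystem.connect_timeout = 5184000

client.run(set_s3fs_timeouts)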
Here is part of the stack trace:
File "/opt/conda/lib/python3.6/site-packages/dask/bytes/utils.py", line 238, in read_block File "/opt/conda/lib/python3.6/site-packages/s3fs/core.py", line 1333, in read File "/opt/conda/lib/python3.6/site-packages/s3fs/core.py", line 1303, in _fetch File "/opt/conda/lib/python3.6/site-packages/s3fs/core.py", line 1520, in _fetch_range File "/opt/conda/lib/python3.6/site-packages/botocore/response.py", line 81, in read botocore.exceptions.ReadTimeoutError: Read timeout on endpoint URL: "None"