2
votes

I have input data stored as a single large file on S3. I want Dask to chop the file automatically, distribute to workers and manage the data flow. Hence the idea of using distributed collection, e.g. bag.

On each worker I have a command line tools (Java) that read the data from file(s). Therefore I'd like to write a whole chunk of data into file, call external CLI/code to process the data and then read the results from output file. This looks like processing batches of data instead of record-at-a-time.

What would be the best approach to this problem? Is it possible to write partition to disk on a worker and process it as a whole?

PS. It nor necessary, but desirable, to stay in a distributed collection model because other operations on data might be simpler Python functions that process data record by record.

1
Hi and welcome to Stack Overflow, please take a time to go through the welcome tour to know your way around here (and also to earn your first badge), read how to create a Minimal, Complete, and Verifiable example and also check How to Ask Good Questions so you increase your chances to get feedback and useful answers. - DarkCygnus

1 Answers

3
votes

You probably want the read_bytes function. This breaks the file into many chunks cleanly split by a delimiter (like an endline). It gives you back a list of dask.delayed objects that point to those blocks of bytes.

There is more information on this documentation page: http://dask.pydata.org/en/latest/bytes.html

Here is an example from the docstring:

>>> sample, blocks = read_bytes('s3://bucket/2015-*-*.csv', delimiter=b'\n')