I want to process ~500 GB of data spread across 64 JSON files, each containing about 5M records. Essentially, I need to apply a PySpark map function to each of the ~300M records.
To test my PySpark map function, I have set up a Google Dataproc cluster (1 master, 5 workers) to process just one JSON file.
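For context, the test job looks roughly like this (bucket paths and names are placeholders, and I am assuming the records are newline-delimited JSON):

```python
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-map-test").getOrCreate()
sc = spark.sparkContext

def process_record(record):
    # Stand-in for my real map function, which lives in my external modules.
    return {"id": record.get("id"), "processed": True}

# Each file has one JSON record per line.
records = sc.textFile("gs://my-bucket/input/file_01.json").map(json.loads)
results = records.map(process_record)
results.saveAsTextFile("gs://my-bucket/output/file_01")
```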
What is the best practice here?
Should I copy all the files to the master node (to make use of the Hadoop Distributed File System in Dataproc), or will it be equally efficient to keep the files in my GCS bucket and point my PySpark job at the gs:// paths?
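In other words, is the choice really just the input path, or is there more to it? Continuing from the snippet above (paths are made up):

```python
# Option A: read straight from the GCS bucket via the gs:// connector.
records = sc.textFile("gs://my-bucket/input/*.json").map(json.loads)

# Option B: copy the files into the cluster's HDFS first and read from there.
records = sc.textFile("hdfs:///data/input/*.json").map(json.loads)
```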
Also, my code imports quite a few external modules, which I have copied to the master node, and the imports work fine there. What is the best practice for making those modules available on all the worker nodes, so that PySpark does not hit import errors when it runs on the workers? (See the sketch at the end of this post for what I am considering.)
I have read a few articles on the Google Cloud website but did not get a clear answer on where to store the files.
I could manually copy the external modules to each worker node, but that won't work in production, where I will be dealing with at least 100 nodes.
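Is shipping the modules with the job the right approach instead? Something like the sketch below is what I have in mind ("mydeps.zip" and "mypackage" are placeholders for my real dependencies, and I am assuming `sc.addPyFile` accepts a gs:// path on Dataproc):

```python
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-map-test").getOrCreate()
sc = spark.sparkContext

# Ship the zipped modules so they are importable on the driver and every executor,
# instead of copying them to each worker by hand.
sc.addPyFile("gs://my-bucket/code/mydeps.zip")

from mypackage import process_record  # placeholder module/function

records = sc.textFile("gs://my-bucket/input/file_01.json").map(json.loads)
results = records.map(process_record)
```

Or would it be more standard to pass the zip via --py-files when submitting with gcloud dataproc jobs submit pyspark, or to install the dependencies on every node with a Dataproc initialization action?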