I am currently working on an SGE cluster and have Python code which submits many jobs to run in parallel.
The output at the end of my code is a set of files containing numerical data. Each Python job performs some calculation and then contributes to each output file in turn: it reads in the data currently in the file, adds what it has computed, and writes the result back to the file.
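Roughly, each job's update step looks like this (a minimal sketch with made-up names; I'm assuming numpy-style text files here, but the exact format isn't the point):

    import numpy as np

    def add_to_file(path, contribution):
        # Read the current contents, add this job's result, write it back.
        # (add_to_file and contribution are hypothetical names for illustration.)
        data = np.loadtxt(path)
        data += contribution
        np.savetxt(path, data)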
My problem is this: because all of the jobs run in parallel, and all of the jobs contribute to each of the output files, my jobs conflict with one another. I often get errors about incompatible file sizes and the like. I believe the cause is that two jobs will sometimes try to read and write the same file at around the same time.
My question is this: when running (potentially many) jobs in parallel that each contribute to the same file multiple times, is there a good-practice way of ensuring that they don't try to write to the file concurrently? Are there any Pythonic or SGE solutions to this problem?
My naive idea was to have a txt file which contains a 1 or 0 indicating whether a file is currently being accessed: a job would only write to a file when the value is set to 0, and would change the value to 1 whilst it is outputting. Is this bad practice / a dumb idea?
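Concretely, I had in mind something like the following (untested; the flag file name and polling interval are made up, and the flag file is assumed to already exist containing "0"):

    import time

    FLAG_FILE = "access.txt"  # hypothetical flag file: "0" = free, "1" = busy

    def locked_update(path, update_fn):
        # Poll until the flag reads 0, i.e. no other job is writing.
        while True:
            with open(FLAG_FILE) as f:
                if f.read().strip() == "0":
                    break
            time.sleep(0.1)
        # Claim the file. Note: the gap between the check above and this
        # write is not atomic, so two jobs could still slip through together.
        with open(FLAG_FILE, "w") as f:
            f.write("1")
        try:
            update_fn(path)  # the read-modify-write step shown earlier
        finally:
            with open(FLAG_FILE, "w") as f:
                f.write("0")  # release the file

Each job would then call locked_update(path, add_to_file) before touching an output file.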