
I have a batch of Python jobs that differ only in the input file they read, say:

python main.py --input=file1.json > log_file1.txt
python main.py --input=file2.json > log_file2.txt
python main.py --input=file3.json > log_file3.txt
...

All these jobs are independent and use a prebuilt Anaconda environment.

I'm able to run my code on an on-demand EC2 instance using the following workflow:

  • Mount an EBS volume with the input files and prebuilt conda environment.
  • Activate the conda environment.
  • Run the Python programs, such that each one reads a different input file and writes to a separate log file. Both the input files and the log files live on the EBS volume (a rough script form of this workflow is sketched below).
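
In script form, the workflow on a single instance looks roughly like this (the device name, mount point, and environment name are placeholders for my actual setup):

sudo mount /dev/xvdf /data                                     # EBS volume with inputs and the conda env
source /data/miniconda3/bin/activate myenv                     # activate the prebuilt environment
python main.py --input=/data/file1.json > /data/log_file1.txt  # one job per input file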

Now, I want to scale this to use AWS spot instances -- basically, if I have N jobs, request N spot instances that each run one of the above jobs, reading a different file from an existing volume and writing its output to a different file on the same volume. But I couldn't find a comprehensive guide on how to go about it. Any help would be appreciated.

It is possible that AWS Batch would meet your needs. See: What Is AWS Batch? – John Rotenstein
Thanks! I did consider using AWS Batch, and followed this tutorial -- aws.amazon.com/blogs/compute/… -- to understand how to read data in my jobs, but I was running into some issues. I'll give it another try and get back. – user2966082

1 Answer


Maybe this will give you something to ponder, as my solution isn't exactly like yours, but here goes. (Oh, and I'm going to look at Batch as well; I just haven't gotten there.) I have decent-sized stock option files that I analyze and transform for 500 different symbols. I've used some tools to figure out that my memory demand on the largest files is around 4 GB max. I spin up one spot instance with at least 30 GB of memory, built from an image I make of the EC2 instance and EBS store, so it's always like the one I test on, just with more memory.

I run a shell script that breaks the 500 or so symbols into 6-10 chunks and runs them concurrently on one machine. I'm not time sensitive, so I don't really need multiple machines in parallel. But I could; I would just run a different script.

Here's the script:

for y in {0..500..50}
do
    # each pass handles a slice of 50 symbols: [start_slice, end_slice)
    start_slice=$y
    end_slice=$((y + 50))

    /usr/local/bin/pipenv run ~/.local/share/virtualenvs/ec2-user-zzkNbF-x/bin/python /home/ec2-user/code/etrade_getoptionsdata/get_bidask_of_baseline_combos_intraday_parallel.py -s $start_slice -e $end_slice &
done

My environment is pipenv, and I put the full environment path in the command so it has access to all my modules. Again, the script just breaks the same analysis into chunks of 50 symbols each.

In my Python file, I use a for loop driven by the passed-in arguments -s and -e: for key_cons in keys_list[s:e].

To launch the shell script, I've been playing around with nohup ./shell.sh & so it runs in the background and won't stop when my SSH session ends.

If you need one instance per job, then that's what it takes; each individual transformation I run takes 30-45 seconds, so the full run still takes a couple of hours.
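
If you did go the one-instance-per-job route, a rough sketch with the AWS CLI could look like the following. The AMI ID, instance type, and paths are all placeholders, and the AMI is assumed to boot with the conda environment and data already in place. (Note that a standard EBS volume can only be attached to one instance at a time, so each instance would need its own copy of the data.)

for i in 1 2 3
do
    # write a per-job startup script; $i selects the input file
    cat > userdata_$i.sh <<EOF
#!/bin/bash
source /data/miniconda3/bin/activate myenv
python /data/main.py --input=/data/file$i.json > /data/log_file$i.txt
EOF

    # request a spot instance that runs the startup script on boot
    aws ec2 run-instances \
        --image-id ami-01234567890abcdef \
        --instance-type c5.large \
        --instance-market-options MarketType=spot \
        --user-data file://userdata_$i.sh
done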

Let me know if you have any questions.