
Here's my CPU information:

(screenshot of CPU information omitted; per the update below, the machine has 6 physical cores and 12 logical CPUs)

I use Ray to train reinforcement learning algorithms, where I define a Learner class decorated with @ray.remote(num_cpus=2) and a Worker class decorated with @ray.remote(num_cpus=1). To get maximum performance, how many workers can I have?

I used to set the number of workers to 8-10, but today I came across this post, which says:

For many workloads (especially numerical workloads), you often cannot expect a greater speedup than the number of physical CPUs.

This seems to say that the number of physical CPUs bounds the number of processes running in parallel. Does this mean I should use no more than 4 workers to get maximum performance, assuming the workers are CPU-intensive? I hope someone can provide a detailed explanation (or a reference). Thanks in advance.

Update

Thanks to @AMC and @KlausD. for their comments. I've updated my question here, hoping it makes things clearer.

I have done some tests. For example, I ran experiments with 1, 3, and 6 workers separately. Here are the results:

  • For 1-worker case, it takes 4m17s to run 400 steps
  • For 3-worker case, it takes 4m29s on average to run 400 steps
  • For 6-worker case, it takes 5m30s on average to run 400 steps

I concluded that CPU contention occurred in the 6-worker case. However, when I opened top (where I could see 12 CPUs) to check CPU usage, all workers were using around 100% CPU, so I had no way to tell whether my conclusion was right.

I also wrote a small program for further testing. The code is provided below:

from time import time
import numpy as np
import ray


@ray.remote(num_cpus=1)
def f(x, y):
    start = time()
    while True:
        x += y
        if np.mean(x) > 100:
            break
    return time() - start

if __name__ == '__main__':
    # I intend to make x and y large to increase the cpu usage.
    x = np.random.rand(1000, 10000)
    y = np.random.uniform(0, 3, (1000, 10000))
    print('x mean:', np.mean(x))
    print('y mean:', np.mean(y))
    for n in range(1, 30, 3):
        ray.init()

        start = time()
        result = ray.get([f.remote(x, y) for _ in range(n)])

        print('Num of workers:', n)
        # print('Run time:', result)
        print('Average run time:', np.mean(result))
        print('Ray run time:', time() - start)
        ray.shutdown()

Here are the results:

x mean: 0.4998949941471149
y mean: 1.4997634832632463

Num of workers: 1
Average run time: 1.3638701438903809
Ray run time: 2.1305620670318604

Num of workers: 4
Average run time: 3.1797224283218384
Ray run time: 4.065998554229736

Num of workers: 7
Average run time: 5.139907530375889
Ray run time: 6.446819543838501

Num of workers: 10
Average run time: 7.569052147865295
Ray run time: 8.996447086334229

Num of workers: 13
Average run time: 8.455958109635572
Ray run time: 11.761570692062378

Num of workers: 16
Average run time: 7.848772034049034
Ray run time: 13.739320278167725

Num of workers: 19
Average run time: 8.033894174977354
Ray run time: 16.16210103034973

Num of workers: 22
Average run time: 8.699185609817505
Ray run time: 18.566803693771362

Num of workers: 25
Average run time: 8.966830835342407
Ray run time: 21.45942711830139

Num of workers: 28
Average run time: 8.584995950971331
Ray run time: 23.2943696975708

I was expecting the 4-worker case, at least, to take almost the same time as the 1-worker case, since I have 6 physical cores. But the results suggest a different story. Furthermore, I don't understand why the average run time stops increasing once the number of workers exceeds 10.

How about you test some numbers and see what is the fastest choice? - Klaus D.
I hope someone could provide me a detailed explanation(or reference). A detailed explanation of what, exactly? The contents of that link seem quite clear, no? Also, I think the title of the post might be too ambiguous/vague. - AMC
Hi, I've updated my question, adding some experimental results. I hope it is clearer this time. - Maybe

1 Answer


The number of processes you can run in parallel depends on the number of workers your computer can spin up, which is directly related to the CPU cores and processors available (dual-core systems and so on). The more workers your CPU cores can support, the more processes you can run simultaneously. I work on a Linux machine, and one way to check your CPU information is:

cat /proc/cpuinfo
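From Python, a quick cross-platform check is a one-liner; note that os.cpu_count() reports *logical* CPUs (hardware threads), which on a hyper-threaded machine is typically twice the physical core count:

```python
import os

# os.cpu_count() returns the number of logical CPUs (hardware threads),
# not physical cores; on hyper-threaded CPUs these differ by a factor of 2.
logical = os.cpu_count()
print("Logical CPUs:", logical)
```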

If you are running multiple processes in Python, I would recommend the concurrent.futures library, as it does a good job of automatically spinning up the right number of workers for optimal parallel performance based on your machine's specs, although you can override that if you would like more or fewer workers.

So, to answer your question about cores: the cores are what's running the show beneath the metal heatsink on the processor. They are physical chips inside your computer, neither metaphorical nor logical. Each is an entire CPU in its own right, in that they can all run completely different bits of code at the same time.

The reason your experiment was counterproductive is very simple. It takes your computer time to create and assign tasks (processes) to workers (CPU cores), and additional time to tear those processes down when they complete. That is the extra time you kept seeing in your experiment, and it is why 1 worker finished faster than any other number of workers. You may have noticed that the more workers you used, the longer it took: more processes had to be created, assigned to workers, and closed when finished. Since the computation itself was not heavy enough for parallelism to save any time, one worker already did the task at optimal efficiency; every additional worker only added process creation and teardown cost, making your code slower overall.

Usually, multiprocessing is only advisable for very heavy CPU-bound operations (intensive numerical computation, games, etc.); otherwise it becomes counterproductive and just slows your program down. For work that is not CPU-bound, you should look into threads (the module I recommended also supports them) or asynchronous code with asyncio.
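For I/O-bound work, a thread pool from the same module is often the better fit, since threads share memory and skip the process-spawn cost. A minimal sketch, using sleep as a stand-in for I/O waits:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(i):
    time.sleep(0.1)  # stands in for a network call or disk read
    return i

start = time.time()
# Ten 0.1 s "I/O waits" overlap across ten threads, so the total is
# roughly 0.1 s rather than the ~1 s a serial loop would take.
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(fake_io, range(10)))
elapsed = time.time() - start

print(f"done in {elapsed:.2f}s:", results)
```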