Here's my CPU information: 6 physical cores, 12 logical CPUs.
I use Ray to train reinforcement learning algorithms, where I define a `Learner` class decorated with `@ray.remote(num_cpus=2)` and a `Worker` class decorated with `@ray.remote(num_cpus=1)`. To get maximum performance, how many workers can I have?
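For reference, the setup looks roughly like this (a minimal sketch; the method names and bodies are placeholders, not my actual training code):

```python
import ray

ray.init()

@ray.remote(num_cpus=2)
class Learner:
    # Placeholder body: the real class holds the model and runs the updates.
    def learn(self, batch):
        pass

@ray.remote(num_cpus=1)
class Worker:
    # Placeholder body: the real class collects rollouts for the Learner.
    def rollout(self):
        pass

learner = Learner.remote()
workers = [Worker.remote() for _ in range(8)]  # the number I'm asking about
```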
I used to set the number of workers to 8-10, but today I came across this post, which says:

> For many workloads (especially numerical workloads), you often cannot expect a greater speedup than the number of physical CPUs.

This seems to say that the number of physical CPUs bounds the number of processes that can run in parallel. Does this mean I should use no more than 4 workers (6 physical cores minus the 2 CPUs reserved by the Learner) to get maximum performance, assuming the workers are CPU-intensive? I hope someone can provide a detailed explanation (or a reference). Thanks in advance.
Update
Thanks to @AMC and @KlausD. for their comments. I've updated my question here, hoping it makes things clearer.
I have done some tests. For example, I ran experiments with 1, 3, and 6 workers separately. Here are the results:
- For the 1-worker case, it takes 4m17s to run 400 steps
- For the 3-worker case, it takes 4m29s on average to run 400 steps
- For the 6-worker case, it takes 5m30s on average to run 400 steps
I concluded that CPU contention had occurred in the 6-worker case. However, when I opened `top` (which showed 12 CPUs) to check the CPU usage, all workers were using around 100% CPU, so I had no clue whether my conclusion was right.
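In case it helps, here is how the core counts could be double-checked programmatically (a small sketch; it assumes `psutil` is installed, and `ray.cluster_resources()` is only used to show what Ray itself detects):

```python
import os

import psutil  # assumption: psutil is available; only used for the physical-core count
import ray

print('Logical CPUs :', os.cpu_count())                   # top shows 12 on my machine
print('Physical CPUs:', psutil.cpu_count(logical=False))  # 6 on my machine

ray.init()
print('Ray resources:', ray.cluster_resources())          # the 'CPU' entry is what Ray schedules with
ray.shutdown()
```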
I also wrote a small program for further testing. The code is provided below:
```python
from time import time

import numpy as np
import ray


@ray.remote(num_cpus=1)
def f(x, y):
    start = time()
    while True:
        x += y
        if np.mean(x) > 100:
            break
    return time() - start


if __name__ == '__main__':
    # I intend to make x and y large to increase the CPU usage.
    x = np.random.rand(1000, 10000)
    y = np.random.uniform(0, 3, (1000, 10000))
    print('x mean:', np.mean(x))
    print('y mean:', np.mean(y))
    for n in range(1, 30, 3):
        ray.init()
        start = time()
        result = ray.get([f.remote(x, y) for _ in range(n)])
        print('Num of workers:', n)
        # print('Run time:', result)
        print('Average run time:', np.mean(result))
        print('Ray run time:', time() - start)
        ray.shutdown()
```
Here's the result:

```
x mean: 0.4998949941471149
y mean: 1.4997634832632463
Num of workers: 1
Average run time: 1.3638701438903809
Ray run time: 2.1305620670318604
Num of workers: 4
Average run time: 3.1797224283218384
Ray run time: 4.065998554229736
Num of workers: 7
Average run time: 5.139907530375889
Ray run time: 6.446819543838501
Num of workers: 10
Average run time: 7.569052147865295
Ray run time: 8.996447086334229
Num of workers: 13
Average run time: 8.455958109635572
Ray run time: 11.761570692062378
Num of workers: 16
Average run time: 7.848772034049034
Ray run time: 13.739320278167725
Num of workers: 19
Average run time: 8.033894174977354
Ray run time: 16.16210103034973
Num of workers: 22
Average run time: 8.699185609817505
Ray run time: 18.566803693771362
Num of workers: 25
Average run time: 8.966830835342407
Ray run time: 21.45942711830139
Num of workers: 28
Average run time: 8.584995950971331
Ray run time: 23.2943696975708
```
I was expecting that at least the 4-worker case would take almost the same time as the 1-worker case, since I have 6 physical cores, but the results suggest a different story. Furthermore, why does the `Average run time` stop increasing once the number of workers exceeds 10?
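If it helps, I can also rerun the benchmark with Ray capped at the number of physical cores; a rough sketch (reusing `f`, `x`, and `y` from the script above, not yet tested) would be:

```python
# Untested variant: cap Ray at the 6 physical cores instead of letting it use all logical CPUs.
ray.init(num_cpus=6)
start = time()
result = ray.get([f.remote(x, y) for _ in range(6)])
print('Average run time:', np.mean(result))
print('Ray run time:', time() - start)
ray.shutdown()
```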