2 votes

I have encountered weird behavior with my algorithm/CPU and I am wondering what could be causing it.

The CPU I am using is an AMD Threadripper 2990WX (32c/64t); the OS is Ubuntu 18.04 LTS with the 4.15.0-64-generic kernel.

The algorithm (Julia 1.0.3):

using Distributed  # implied when starting with `julia -p N`; explicit here so the snippet is self-contained

@sync @distributed for var in range(0.1, step=0.1, stop=10.0)
    res = do_heavy_stuff(var)   # solves a differential equation; basically
                                # multiplies 200x200 matrices many times
    save(filename, "RES", res)  # presumably JLD/FileIO save; `filename` is defined elsewhere
end

Function do_heavy_stuff(var) takes ~3 hours to solve on a single CPU core. When I launch it in parallel with 10 processes (julia -p 10 my_code.jl), each parallel iteration takes ~4 hours, meaning every 4 hours I get 10 files saved. That slowdown is expected, as the CPU frequency drops from 4.1 GHz to 3.4 GHz.

If I launch 3 separate instances with 10 processes each, for a total utilization of 30 cores, one loop cycle still takes ~4 hours, meaning I get 30 runs completed and saved every 4 hours.

However, if I run 2 instances with 30 processes each at once (julia -p 30 my_code.jl; one at nice 0, the other at nice +10), htop shows 60+ threads of CPU utilization, but the algorithm becomes extremely slow (after 20 hours, still zero files saved). Furthermore, the CPU temperature is abnormally low (~45 °C instead of the expected ~65 °C).

From this I can only guess that using (almost) all threads of my CPU makes it do something useless that eats up CPU cycles while no floating-point work gets done. I see no I/O to the SSD, and I use only half of the RAM.

I ran mpstat (mpstat -A: https://pastebin.com/c19nycsT) and I can see that all of my cores are essentially idle, which explains the low temperature. However, I still don't understand what exactly the bottleneck is. How do I troubleshoot from here? Is there any way to see (without touching the hardware) whether the problem is RAM bandwidth or something else?

EDIT: It came to my attention that I was using mpstat wrong. Apparently mpstat -A gives CPU stats accumulated since boot, while what I needed was short-interval results, which can be obtained with mpstat -P ALL 2. Unfortunately, I only learned this after I had killed the code in question, so I have no real mpstat data. However, I am still interested: how would one troubleshoot a situation where the cores seem to be doing something but no results appear? How do I find the bottleneck?
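For the record, here is the kind of probe I have in mind (just a sketch; probe() is a made-up helper, and I assume the workers are already started with julia -p N): run the same 200x200 multiplication on every worker at the same time and compare the per-worker timing with what a single worker achieves alone. If the timing only blows up when all workers run together, the bottleneck is a shared resource (e.g. memory bandwidth) rather than the algorithm itself.

using Distributed  # workers assumed to be started with `julia -p N`

@everywhere function probe(reps = 1000)
    A = rand(200, 200)
    B = rand(200, 200)
    # time `reps` multiplications of the same size my solver uses
    return @elapsed for _ in 1:reps
        A * B
    end
end

# start the probe on every worker at once, then collect the timings
futures = [remotecall(probe, w) for w in workers()]
println(fetch.(futures))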

Do note that you're actually running multiple Julia processes, not threads. - pfitzseb
Yes, thank you, I am aware of this and of the separation of resources. Are you implying that, because of this, I am getting too many cache misses? - MrModern
No, I just wanted to point out you weren't using the right terminology. I don't know what might cause the slowdown you observe. - pfitzseb
When you do sync on processes or threads, there is a potential for the sync to cause all but one worker to wait until the last one finishes. You may need to look for resource contention or such an all-but-one-waiting state in your code's execution. It might help to post a brief but working example of the problem. - Bill
I edited the question to fix the terminology. I am aware that @sync might make some processes wait for others, but in my past experience with @sync, waiting processes don't show up in htop as using CPU cycles. Providing a working example is easy; providing a brief and working example is very difficult, if not impossible. - MrModern

1 Answer

0 votes

Since you are using multiprocessing, there are two likely reasons for the observed behavior:

  • long delays on I/O. When you process lots of disk data or read data over the network, your processes naturally stall. In that case CPU utilization can be low even though total execution time is long.
  • high variance in the execution time of do_heavy_stuff. This variance could come from unstable I/O or from different model parameters leading to different solve times. Why this is a problem requires understanding how @distributed shares the workload among worker processes: each worker gets an equal slice of the for loop's range. For example, with 4 workers the first one gets var in the range 0.1:0.1:2.5, the second 2.6:0.1:5.0, and so on. Now, if some var values produce heavy tasks, the first worker might end up with 5 hours of work while the others get 1 hour each. That means @sync only completes after 5 hours, with just one CPU actually working the whole time (see the sketch below).

Looking at your post, I would strongly bet on the second reason.
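If that is indeed the case, one way to confirm and work around it (just a sketch, not tested against your code; it assumes do_heavy_stuff and the JLD-style save are already defined on every worker, and the per-run filename is made up) is to replace the statically partitioned @distributed loop with pmap, which hands out one var at a time to whichever worker is free:

using Distributed

# @distributed splits the 100-element range into equal contiguous chunks up
# front; pmap instead schedules dynamically, so a few slow parameter values
# cannot pin most of the work on a single worker.
@everywhere function process_one(var)
    res = do_heavy_stuff(var)         # the heavy solver from the question
    save("res_$var.jld", "RES", res)  # hypothetical per-run filename
    return nothing
end

pmap(process_one, range(0.1, step = 0.1, stop = 10.0))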