2
votes

Is there a convenient way to get the 'status' of map/pmap in Julia?

If I had an array a = [1:10] I'd like to either:

1: enumerate the array and use an if-conditional to add a print command

map((index, value) -> 5*value ......, enumerate(a))

and where the "......." are, there would be a way to 'chain' the anonymous function to something like

"5*value and then print index/length(a) if index%200 == 0"

2: know if there is already an existing option for this. pmap is intended for parallel tasks, which are usually large processes, so it would make sense for this to already exist?

Additionally, is there a way to make anonymous functions do two 'separate' things, one after the other?

Example

if I have

a = [1:1000]
function f(n) # something that takes a huge amount of time
end

and I execute

map(x -> f(x), a)

the REPL would print out the status

"0.1 completed"
.
.
.
"0.9 completed"
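For what it's worth, the "two separate things one after the other" part of the question can be sketched with a begin...end block inside the anonymous function; the last expression is the value map keeps. This is a minimal sketch using the 5*value computation and the every-200 print from the question above:

```julia
a = collect(1:1000)

map(i -> begin
        v = 5 * a[i]                 # first thing: the actual computation
        if i % 200 == 0              # second thing: occasional progress print
            println(i / length(a), " completed")
        end
        v                            # value returned to map
    end, 1:length(a))
```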

Solution

Chris Rackauckas answer

A bit odd that the ProgressMeter package doesn't include this by default.

Pkg.add("ProgressMeter")
Pkg.clone("https://github.com/slundberg/PmapProgressMeter.jl")
@everywhere using ProgressMeter
@everywhere using PmapProgressMeter
pmap(x->begin sleep(1); x end, Progress(10), 1:10)

PmapProgressMeter on github

4
It is not clear what you mean by status. Please give an example of the output you want for a particular input vector. – David P. Sanders
'status' as in how far through the evaluation; I posted an example to clarify. – isebarn

4 Answers

3
votes

ProgressMeter.jl has a branch for pmap.

You can also make the Juno progress bar work inside of pmap. This relies on undocumented internals, so you should ask in the Gitter channel if you want more information; posting it publicly would just confuse people if/when it changes.

1
votes

You can create a function with 'state' as you ask, by implementing a 'closure'. E.g.

julia> F = function ()
  ClosedVar = 5
  return (x) -> x + ClosedVar
end;
julia> f = F();
julia> f(5)
10
julia> ClosedVar = 1000;
julia> f(5)
10

As you can see, the function f maintains 'state': the internal variable ClosedVar is local to F, and f retains access to it even though F itself has long since gone out of scope.

Note the difference with normal, non-closed function definition:

julia> MyVar = 5;
julia> g(x) = x + MyVar;
julia> g(5)
10
julia> MyVar = 1000;
julia> g(5)
1005

You can create your own closure which interrogates / updates its closed variables when run, and does something different according to its state each time.

Having said that, from your example you seem to expect that pmap will run sequentially. This is not guaranteed, so don't rely on a 'which index is this thread processing' approach to print every 200 operations. You would probably have to maintain a closed 'counter' variable inside your closure and rely on that, which presumably also implies your closure needs to be accessible @everywhere.
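As a minimal sketch of that idea (make_counter and its arguments are hypothetical names, not from any package), a closed counter that prints every step calls could look like:

```julia
# Hypothetical helper: returns a worker function that closes over `count`
# and reports progress every `step` calls.
function make_counter(total, step)
    count = 0
    return x -> begin
        count += 1                  # closed variable, updated on each call
        if count % step == 0
            println(count / total, " completed")
        end
        5 * x                       # the "real" work
    end
end

f = make_counter(1000, 200)
map(f, 1:1000)    # prints "0.2 completed", "0.4 completed", ...
```

Note that under pmap each worker would get its own copy of the closure, and hence its own count; that is the per-worker caveat described above.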

1
votes

Why not just include it in your function's definition to print this information? E.g.

function f(n) #something that takes a huge amount of time
    ...
    do stuff.
    ...
    println("completed $n")
end 

And, if desired, you can add an extra argument to your function that would contain the 0.1, ..., 0.9 from your example (I'm not quite sure what those represent, but whatever they are, they can just be an argument to your function).

If you take a look at the example below on pmap and @parallel you will find an example of a function fed to pmap that prints output.

See also this and this SO post on info for feeding multiple arguments to functions used with map and pmap.
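A minimal sketch of the extra-argument idea (the fraction-completed argument and the sleep stand-in are assumptions, not the asker's actual f):

```julia
# Stand-in for the slow function; `frac` is the fraction of jobs completed
# once this call finishes.
function f(n, frac)
    sleep(0.1)                  # pretend to do expensive work
    println("$frac completed")
end

a = collect(1:10)
map(i -> f(a[i], i / length(a)), 1:length(a))    # prints "0.1 completed" ... "1.0 completed"
```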



The Julia documentation advises that

pmap() is designed for the case where each function call does a large amount of work. In contrast, @parallel for can handle situations where each iteration is tiny, perhaps merely summing two numbers.

There are several reasons for this. First, pmap incurs greater start-up costs initiating jobs on workers. Thus, if the jobs are very small, these start-up costs may dominate. Conversely, however, pmap does a "smarter" job of allocating jobs amongst workers. In particular, it builds a queue of jobs and sends a new job to each worker whenever that worker becomes available. @parallel, by contrast, divvies up all work to be done amongst the workers when it is called. As such, if some workers take longer on their jobs than others, you can end up with a situation where most of your workers have finished and are idle while a few remain active for an inordinate amount of time, finishing their jobs. Such a situation, however, is less likely to occur with very small and simple jobs.

The following illustrates this: suppose we have two workers, one of which is slow and the other of which is twice as fast. Ideally, we would want to give the fast worker twice as much work as the slow worker (or we could have fast and slow jobs, but the principle is exactly the same). pmap will accomplish this, but @parallel won't.

For each test, we initialize the following:

addprocs(2)

@everywhere begin
    function parallel_func(idx)
        workernum = myid() - 1 
        sleep(workernum)
        println("job $idx")
    end
end

Now, for the @parallel test, we run the following:

@parallel for idx = 1:12
    parallel_func(idx)
end

And get back print output:

julia>     From worker 2:    job 1
    From worker 3:    job 7
    From worker 2:    job 2
    From worker 2:    job 3
    From worker 3:    job 8
    From worker 2:    job 4
    From worker 2:    job 5
    From worker 3:    job 9
    From worker 2:    job 6
    From worker 3:    job 10
    From worker 3:    job 11
    From worker 3:    job 12

It's almost sweet. The workers have "shared" the work evenly. Note that each worker has completed 6 jobs, even though worker 2 is twice as fast as worker 3. It may be touching, but it is inefficient.

For the pmap test, I run the following:

pmap(parallel_func, 1:12)

and get the output:

From worker 2:    job 1
From worker 3:    job 2
From worker 2:    job 3
From worker 2:    job 5
From worker 3:    job 4
From worker 2:    job 6
From worker 2:    job 8
From worker 3:    job 7
From worker 2:    job 9
From worker 2:    job 11
From worker 3:    job 10
From worker 2:    job 12

Now, note that worker 2 has performed 8 jobs and worker 3 has performed 4. This is exactly in proportion to their speed, and what we want for optimal efficiency. pmap is a hard task master - from each according to their ability.

0
votes

One other possibility would be to use a SharedArray as a counter shared amongst the workers. E.g.

addprocs(2)

Counter = convert(SharedArray, zeros(Int64, nworkers()))

## Make sure each worker has the SharedArray declared on it, so that it need not be fed as an explicit argument
function sendto(p::Int; args...)
  for (nm, val) in args
    @spawnat(p, eval(Main, Expr(:(=), nm, val)))
  end
end

for pid in workers()
  sendto(pid, Counter = Counter)
end
@everywhere global Counter


@everywhere begin
    function do_stuff(n)
        sleep(rand())
        Counter[(myid()-1)] += 1
        TotalJobs = sum(Counter)
        println("Jobs Completed = $TotalJobs")
    end
end

pmap(do_stuff, 1:10)