How can I efficiently use tensorflow if I have a CPU with 64 cores?

Question

I have access to a machine with one CPU that has 64 cores, and I installed TensorFlow from Anaconda. I know that with multiple CPUs I could distribute the computation by specifying the CPU IDs, as below (adapted from here):

with tf.device("/cpu:0"):
    a = tf.Variable(tf.ones(()))
    a = tf.square(a)
with tf.device("/cpu:1"):
    b = tf.Variable(tf.ones(()))
    b = tf.square(b)
with tf.device("/cpu:2"):
    loss = a+b
opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
train_op = opt.minimize(loss)

sess = tf.Session()
sess.run(tf.global_variables_initializer())
for i in range(10):
    loss0, _ = sess.run([loss, train_op])
    print("loss", loss0)

The example above assumes three CPUs. Can I run deep learning workloads efficiently on the hardware I have (1 CPU, 64 cores)? Can someone help or guide me?

UPDATE :

The CPU is an Intel Xeon Phi processor.
Also, please note that I don't have administrator privileges, so I cannot compile any libraries. I installed all Python libraries via Anaconda.

To understand what is happening, I used the Timeline tool (from here) on the code above, like this:

import tensorflow as tf
from tensorflow.python.client import timeline


with tf.device("/cpu:0"):
    a = tf.Variable(tf.ones(()))
    a = tf.square(a)
with tf.device("/cpu:0"):
    b = tf.Variable(tf.ones(()))
    b = tf.square(b)
with tf.device("/cpu:0"):
    loss = a+b
opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
train_op = opt.minimize(loss)

sess = tf.Session()
sess.run(tf.global_variables_initializer())
for i in range(10):
    # Request a full trace for this step; run_metadata receives the step statistics.
    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    loss0, _ = sess.run([loss, train_op], options=run_options, run_metadata=run_metadata)
    print("loss", loss0)

# Create the Timeline object, and write it to a json
tl = timeline.Timeline(run_metadata.step_stats)
ctf = tl.generate_chrome_trace_format()
with open('timeline_execution1.json', 'w') as f:
    f.write(ctf)

I then added config=tf.ConfigProto(intra_op_parallelism_threads=#, inter_op_parallelism_threads=#) to tf.Session() and generated a JSON file for each setting to view the timelines in Chrome. The outputs differed, but I could conclude only one thing: the program uses 4 cores regardless of the options I pass to tf.Session().

Isn't a single CPU with multiple cores the default anyway? As in, what did you try and what was the result? That said, 64-core CPUs are either obscure ARMs or Xeon Phis AFAIK. The exact architecture probably matters for this question. — MSalters
@MSalters : The cores are Intel Xeon Phi processor model. I will update the question. — dexterdev
/cpu:0 refers to all 64 cores and parallelization happens automatically. You can also create logical devices like cpu:1 and cpu:2, but they refer to the same physical CPU, so I don't see a reason to use those — Yaroslav Bulatov
@YaroslavBulatov : Can I virtually split a 1-CPU, 64-core system into something like 8 nodes with 8 cores each? I know this is a dumb question. Regarding the logical cpu:1 and cpu:2, I think they would only be useful if I had multiple nodes. — dexterdev
Not easily. TensorFlow doesn't support pinning to physical CPU cores. You could use distributed TensorFlow with 8 separate processes running TensorFlow, and use OS-level tools (taskset) to pin each process to a particular set of cores — Yaroslav Bulatov
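
The pinning approach from the last comment could be sketched roughly as follows. This is a minimal illustration, not a tested recipe for Xeon Phi: it splits the core IDs into equal groups and builds one `taskset` command line per group. The script name `worker.py` and the `--task_index` flag are hypothetical placeholders for whatever distributed TensorFlow worker script is being launched, and the helper names are made up for this example.

```python
import shlex


def core_groups(total_cores, n_groups):
    """Split core IDs 0..total_cores-1 into n_groups contiguous, equal-sized groups."""
    if total_cores % n_groups != 0:
        raise ValueError("total_cores must be divisible by n_groups")
    size = total_cores // n_groups
    return [list(range(i * size, (i + 1) * size)) for i in range(n_groups)]


def taskset_command(cores, argv):
    """Build a command line that runs argv restricted to the given core IDs."""
    core_list = ",".join(str(c) for c in cores)
    return ["taskset", "-c", core_list] + list(argv)


# Print one launch command per group of 8 cores (nothing is executed here).
for index, cores in enumerate(core_groups(64, 8)):
    # worker.py and --task_index are hypothetical names, not part of TensorFlow.
    argv = ["python", "worker.py", "--task_index=%d" % index]
    print(" ".join(shlex.quote(part) for part in taskset_command(cores, argv)))
```

Each printed command could then be started from a shell or with `subprocess.Popen`. On Linux, calling `os.sched_setaffinity(0, cores)` at the top of the worker process is an alternative to `taskset` that needs no external tool.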

bn2302 · Accepted Answer · 2017-08-08T14:34:48

If you have an Intel CPU (possibly a Xeon Phi), compiling TensorFlow with MKL might speed things up.

You can see how it's done here
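
Whichever build is used, it may help to size the thread pools explicitly. The sketch below assumes the TensorFlow 1.x API used in the question. The values (64 intra-op threads, 2 inter-op threads, `KMP_BLOCKTIME=0`) are illustrative starting points, not measured optima, and the `OMP_NUM_THREADS`, `KMP_BLOCKTIME` and `KMP_AFFINITY` variables are only expected to matter for builds linked against MKL/OpenMP. The helper name `thread_settings` is hypothetical.

```python
import os


def thread_settings(physical_cores, inter_op=2):
    """Return illustrative OpenMP environment variables and ConfigProto arguments."""
    env = {
        "OMP_NUM_THREADS": str(physical_cores),
        "KMP_BLOCKTIME": "0",
        "KMP_AFFINITY": "granularity=fine,compact,1,0",
    }
    config_kwargs = {
        # Threads used inside a single op (e.g. one large matmul).
        "intra_op_parallelism_threads": physical_cores,
        # Number of independent ops that may run concurrently.
        "inter_op_parallelism_threads": inter_op,
    }
    return env, config_kwargs


env, config_kwargs = thread_settings(64)
# Set the variables before TensorFlow is imported so the runtime can see them.
os.environ.update(env)

try:
    import tensorflow as tf
except ImportError:
    tf = None

# Only build a session when the TF 1.x API from the question is available.
if tf is not None and hasattr(tf, "ConfigProto"):
    config = tf.ConfigProto(**config_kwargs)
    with tf.Session(config=config) as sess:
        print(sess.run(tf.constant(1.0)))
```

Timing a fixed number of training steps under a few different settings is a more reliable guide than any default, since the best values depend on the model and on the build.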
