I have a machine with 1 CPU and 64 cores, and I installed TensorFlow from Anaconda. I know that if I had multiple CPUs, I could distribute computation by specifying the CPU IDs, as below (adapted from here):
import tensorflow as tf

# Place each part of the graph on its own CPU device
with tf.device("/cpu:0"):
    a = tf.Variable(tf.ones(()))
    a = tf.square(a)
with tf.device("/cpu:1"):
    b = tf.Variable(tf.ones(()))
    b = tf.square(b)
with tf.device("/cpu:2"):
    loss = a + b
opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
train_op = opt.minimize(loss)
sess = tf.Session()
sess.run(tf.global_variables_initializer())
for i in range(10):
    loss0, _ = sess.run([loss, train_op])
    print("loss", loss0)
The example above assumes three CPUs. Can I do deep learning efficiently on my current hardware (1 CPU, 64 cores)? Any help or guidance would be appreciated.
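For reference, here is a minimal sketch of how the three-device snippet could be tried on a single physical CPU, assuming the TensorFlow 1.x API used in this post. The helper names `logical_cpu_names` and `make_session` are hypothetical. The `device_count` field of `tf.ConfigProto` asks TensorFlow to expose several logical CPU devices; they all map to the same physical CPU, so this is not expected to make anything faster.

```python
def logical_cpu_names(n_devices):
    """Return TF 1.x device strings for n logical CPU devices."""
    if n_devices < 1:
        raise ValueError("need at least one device")
    return ["/cpu:%d" % i for i in range(n_devices)]


def make_session(n_devices):
    """Create a TF 1.x session that exposes n logical CPU devices.

    Assumption: TensorFlow 1.x is installed. The import is done lazily so
    that the helper above can be used without TensorFlow.
    """
    import tensorflow as tf

    # device_count maps a device type to the number of devices to expose
    config = tf.ConfigProto(device_count={"CPU": n_devices})
    return tf.Session(config=config)
```

With this, `sess = make_session(3)` would stand in for `tf.Session()` in the snippet above, and `logical_cpu_names(3)` gives the strings to pass to `tf.device`.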
UPDATE:
The cores belong to an Intel Xeon Phi processor.
Also, please note that I don't have administrator privileges, so I cannot compile any libraries. I installed all Python libraries via Anaconda.
To try to understand what is happening, I applied the Timeline tool (from here) to the code above, as follows:
import tensorflow as tf
from tensorflow.python.client import timeline

with tf.device("/cpu:0"):
    a = tf.Variable(tf.ones(()))
    a = tf.square(a)
with tf.device("/cpu:0"):
    b = tf.Variable(tf.ones(()))
    b = tf.square(b)
with tf.device("/cpu:0"):
    loss = a + b
opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
train_op = opt.minimize(loss)
sess = tf.Session()
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
sess.run(tf.global_variables_initializer())
for i in range(10):
    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    loss0, _ = sess.run([loss, train_op], options=run_options, run_metadata=run_metadata)
    print("loss", loss0)
# Create the Timeline object, and write it to a json
tl = timeline.Timeline(run_metadata.step_stats)
ctf = tl.generate_chrome_trace_format()
with open('timeline_execution1.json', 'w') as f:
    f.write(ctf)
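The JSON written above is in Chrome trace format, so it can also be inspected from Python instead of by eye. Below is a rough sketch using only the standard library. The function names are hypothetical, and it assumes the file has a top-level "traceEvents" list whose duration events have "ph" equal to "X" and carry "pid", "tid", "name" and "dur" fields. Note that a (pid, tid) pair is a lane in the trace view, which is not necessarily the same thing as an OS thread or a core.

```python
import json
from collections import defaultdict


def summarize_trace(trace):
    """Return (number of lanes used, total duration per op name).

    `trace` is a dict in Chrome trace format, e.g. the parsed content
    of timeline_execution1.json.
    """
    lanes = set()
    time_per_op = defaultdict(int)
    for event in trace.get("traceEvents", []):
        # Skip metadata and flow events; keep only complete duration events
        if event.get("ph") != "X":
            continue
        lanes.add((event.get("pid"), event.get("tid")))
        time_per_op[event.get("name")] += event.get("dur", 0)
    return len(lanes), dict(time_per_op)


def summarize_trace_file(path):
    """Load a trace file from disk and summarize it."""
    with open(path) as f:
        return summarize_trace(json.load(f))
```

Calling `summarize_trace_file('timeline_execution1.json')` on traces produced with different session options gives a quick way to compare how many lanes each run actually occupied.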
I then generated several JSON files, to view the timelines in Chrome, by passing config=tf.ConfigProto(intra_op_parallelism_threads=#, inter_op_parallelism_threads=#) to tf.Session(), and I got different outputs. The only thing I could understand from them is that the program uses 4 cores, whatever options I give inside tf.Session().
/cpu:0 refers to all 64 cores and parallelization happens automatically. You can also create logical devices like cpu:1 and cpu:2, but they refer to the same physical CPU, so I don't see a reason to use those – Yaroslav Bulatov
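Following that comment, a single /cpu:0 device with explicit thread pool sizes may be all that is needed. Here is a minimal sketch, assuming the TensorFlow 1.x API from the question. The helper names and the default of 2 inter-op threads are assumptions made for illustration, not recommended values.

```python
import os


def thread_pool_settings(n_cores=None, inter_op=2):
    """Build keyword arguments for tf.ConfigProto (TF 1.x).

    intra_op_parallelism_threads: threads used inside a single op.
    inter_op_parallelism_threads: threads used to run independent ops.
    """
    if n_cores is None:
        n_cores = os.cpu_count() or 1
    return {
        "intra_op_parallelism_threads": max(1, n_cores),
        "inter_op_parallelism_threads": max(1, inter_op),
    }


def make_session(settings):
    """Create a TF 1.x session with the given thread pool sizes.

    Assumption: TensorFlow 1.x is installed; imported lazily on purpose.
    """
    import tensorflow as tf

    return tf.Session(config=tf.ConfigProto(**settings))
```

For example, `make_session(thread_pool_settings(64))` would request 64 intra-op threads. The toy graph in the question only squares and adds scalars, so there is probably too little work per op for extra threads to show up in the timeline; a larger op, such as a big matrix multiplication, would likely be a fairer test.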