deep neural network's precision for image recognition, float or double?

Question

Neural networks for image recognition can be really big. There can be thousands of inputs/hidden neurons, millions of connections what can take up a lot of computer resources.

While float being commonly 32bit and double 64bit in c++, they don't have much performance difference in speed yet using floats can save up some memory.

Having a neural network what is using sigmoid as an activation function, if we could choose of which variables in neural network can be float or double which could be float to save up memory without making neural network unable to perform?

While inputs and outputs for training/test data can definitely be floats because they do not require double precision since colors in image can only be in range of 0-255 and when normalized 0.0-1.0 scale, unit value would be 1 / 255 = 0.0039~

1. what about hidden neurons output precision, would it be safe to make them float too?

hidden neuron's output gets it value from the sum of previous layer neuron's output * its connection weight to currently being calculating neuron and then sum being passed into activation function(currently sigmoid) to get the new output. Sum variable itself could be double since it could become a really large number when network is big.

2. what about connection weights, could they be floats?

while inputs and neuron's outputs are at the range of 0-1.0 because of sigmoid, weights are allowed to be bigger than that.

Stochastic gradient descent backpropagation suffers on vanishing gradient problem because of the activation function's derivative, I decided not to put this out as an a question of what precision should gradient variable be, feeling that float will simply not be precise enough, specially when network is deep.

sastanin sastanin · Accepted Answer · 2016-11-10T22:52:47

what about hidden neurons output precision, would it be safe to make them float too?

Using float32 everywhere is usually the safe first choice for most of the neural network applications. GPUs currently support only float32, so many practitioners stick to float32 everywhere. For many applications, even 16-bit floating point values could be sufficient. Some extreme examples show that high accuracy networks can be trained with only as little as 2-bits per weight (https://arxiv.org/abs/1610.00324).

The complexity of the deep networks is usually limited not by the computation time, but by the amount of RAM on a single GPU and throughput of the memory bus. Even if you're working on CPU, using a smaller data type still helps to use the cache more efficiently. You're rarely limited by the machine datatype precision.

since colors in image can only be in range of 0-255,

You're doing it wrong. You force the network to learn the scale of your input data, when it is already known (unless you're using a custom weight initialization procedure). The better results are usually achieved when the input data is normalized to the range (-1, 1) or (0, 1) and the weights are initialized to have the average output of the layer at the same scale. This is a popular initialization technique: http://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization

If inputs are in the range [0, 255], then with an average input being ~ 100, and weights being ~ 1, the activation potential (the argument of the activation function) is going to be ~ 100×N, where N is the number of layer inputs, likely far away in the "flat" part of the sigmoid. So either you initialize your weights to be ~ 1/(100×N), or you scale your data and use any popular initialization method. Otherwise the network will have to spend a lot of training time just to bring the weights to this scale.

Stochastic gradient descent backpropagation suffers on vanishing gradient problem because of the activation function's derivative, I decided not to put this out as an a question of what precision should gradient variable be, feeling that float will simply not be precise enough, specially when network is deep.

It's much less a matter of machine arithmetic precision, but the scale of the outputs for each of the layers. In practice:

preprocess input data (normalize to (-1, 1) range)
if you have more than 2 layers, then don't use sigmoids, use rectified linear units instead
initialize weights carefully
use batch normalization

This video should be helpful to learn these concepts if you're not familiar with them.

deep neural network's precision for image recognition, float or double?

2 Answers