EDIT:
Thanks for all the suggestions. I should have explained this better in my original question, so sorry for not being clear. The lines from prod[0] <= prod[0] + input[0] x weight1[i]; up to prod[25] are executed over 200 clock cycles, but the 26 multiplications in each cycle do happen in parallel. Is that too much at one time, or do I need to do 1 multiplication (32 bit x 32 bit) per clock cycle, which would mean spreading all 200 x 26 multiplications over 200 x 26 clock cycles?
Also, I have 6 stages to pipeline the whole thing through:
1. 26 matrix multiplications in parallel, repeated 200 times (200 clock cycles).
2. Set a control flag so that the activation module (a LUT with about 200 elements) can do its thing and send back the result. Repeated 25 times, once per neuron (over 25 clock cycles).
3. Reset the flag to stop the activation function. Reset the counters and everything else to prepare for the next part.
4. 3 matrix multiplications in parallel, repeated 25 times (25 clock cycles).
5. Call the activation function again.
6. Find the max of the three values obtained from stage 5 to figure out what the output should be (this is a classification problem).
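To make the sequencing concrete, the six stages could be driven by a small controller along these lines. This is only a sketch under assumptions: the module name ff_ctrl, the port names, the state encoding and the 3-cycle length of the second activation stage are all made up for illustration and are not taken from my actual code.

```verilog
// Hypothetical stage sequencer for the six stages listed above.
// Stage lengths follow the list; the length of S_ACT2 (3 cycles) is an assumption.
module ff_ctrl (
    input  wire       clk,
    input  wire       rst,    // synchronous reset
    input  wire       start,  // pulse to begin one classification
    output reg  [2:0] stage,  // current stage, see localparams below
    output reg        done    // one-cycle pulse when the result is ready
);
    localparam [2:0] S_IDLE = 3'd0, S_MAC1 = 3'd1, S_ACT1 = 3'd2, S_PREP = 3'd3,
                     S_MAC2 = 3'd4, S_ACT2 = 3'd5, S_MAX  = 3'd6;

    reg [7:0] cnt;  // cycle counter within a stage (must be able to reach 199)

    always @(posedge clk) begin
        if (rst) begin
            stage <= S_IDLE;
            cnt   <= 8'd0;
            done  <= 1'b0;
        end else begin
            done <= 1'b0;
            cnt  <= cnt + 8'd1;  // default: keep counting; overridden on stage changes
            case (stage)
                S_IDLE: begin cnt <= 8'd0; if (start) stage <= S_MAC1; end
                S_MAC1: if (cnt == 8'd199) begin stage <= S_ACT1; cnt <= 8'd0; end
                S_ACT1: if (cnt == 8'd24)  begin stage <= S_PREP; cnt <= 8'd0; end
                S_PREP: begin stage <= S_MAC2; cnt <= 8'd0; end
                S_MAC2: if (cnt == 8'd24)  begin stage <= S_ACT2; cnt <= 8'd0; end
                S_ACT2: if (cnt == 8'd2)   begin stage <= S_MAX;  cnt <= 8'd0; end
                S_MAX:  begin stage <= S_IDLE; done <= 1'b1; end
                default: stage <= S_IDLE;
            endcase
        end
    end
endmodule
```

The datapath blocks would then use stage as their enable, for example accumulating only while stage equals S_MAC1.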
Are the above stages and the number of multiplications per clock cycle (25 in stage 1, and 3 in stage 4) ok?
Or should I redo it so that stage 1 executes over 200x25 clock cycles and stage 4 executes over 10x3 clock cycles?
Thanks a lot for all the help and advice guys,
Faisal
ORIGINAL:
I have coded and simulated a module for the feedforward algorithm of an artificial neural network. The synthesis tools (Synopsys Design Compiler and Xilinx ISE 14.4) take a long time...synthesis has been running for over 9 hours now, so I have a really bad feeling about it! The simulation results are correct though.
I have an idea for a NEW design (at the end of this message) but wanted to run it by experienced people to see whether it is more efficient than my current implementation (see below) or worse, and if worse, how I can make so many arithmetic operations more efficient.
Some background on the network:
Input Layer has 200 inputs,
Hidden Layer has 25 neurons,
Output Layer has 3 outputs.
Verilog Code Idea: I only have one module, which implements the entire algorithm.
- First step is to multiply the inputs (200 of them) with the weights (200 of them) for each neuron (there are 25 neurons). It calculates
prod[0] <= prod[0] + input[0] x weight1[i]; i = 0 to 200-1
.......
prod[25] <= prod[25] + input[25] x weight1[i]; i = 0 to 200-1
It repeats the above 200 times, once for each of the 200 inputs, and calculates all 25 neurons simultaneously.
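A minimal sketch of this step as a bank of parallel multiply-accumulate units is below. It is not my actual code: the module name mac_layer, the flattened weight_flat bus, signed 32-bit operands, and one input sample arriving per clock cycle are all assumptions made for illustration.

```verilog
// Sketch only: module, port and parameter names are hypothetical.
// Assumes signed operands and one input sample presented per clock cycle.
module mac_layer #(
    parameter N_NEURONS = 25,  // neurons computed in parallel
    parameter W         = 32   // operand width in bits
) (
    input  wire                     clk,
    input  wire                     rst,          // synchronous, clears the accumulators
    input  wire                     en,           // high while samples are streaming in
    input  wire signed [W-1:0]      in_sample,    // the input value for the current cycle
    input  wire [N_NEURONS*W-1:0]   weight_flat,  // each neuron's weight for that input
    output wire [N_NEURONS*2*W-1:0] prod_flat     // one 2*W-bit running sum per neuron
);
    genvar n;
    generate
        for (n = 0; n < N_NEURONS; n = n + 1) begin : g_mac
            // Slice out this neuron's weight; signed wires keep the multiply signed.
            wire signed [W-1:0]   w = weight_flat[n*W +: W];
            wire signed [2*W-1:0] p = in_sample * w;
            reg  signed [2*W-1:0] acc;

            always @(posedge clk) begin
                if (rst)
                    acc <= {(2*W){1'b0}};
                else if (en)
                    acc <= acc + p;  // no guard bits: the sum can overflow 2*W bits
            end

            assign prod_flat[n*2*W +: 2*W] = acc;
        end
    endgenerate
endmodule
```

Written this way, the hardware contains N_NEURONS multipliers that are each reused for all 200 cycles, rather than one multiplier per input/weight pair.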
- Next, the ANN activation function is applied to the above results. This is done using a LUT with close to 200 elements (I used a case statement). To do this, I wrote a separate activation.v file and had to instantiate it 25 times, once per neuron!
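For reference, the general shape of such a case-based LUT is sketched below. The module name activation_lut, the 8-bit index, the 16-bit output and every table value are placeholders invented for illustration; they are not the contents of my activation.v.

```verilog
// Hypothetical case-based activation LUT.
// All table values are placeholders, not a real sigmoid table.
module activation_lut (
    input  wire        clk,
    input  wire [7:0]  addr,  // quantised pre-activation value (assumed 8-bit index)
    output reg  [15:0] y      // activation output (assumed 16-bit fixed point)
);
    always @(posedge clk) begin
        case (addr)
            8'd0:    y <= 16'h0000;  // placeholder entry
            8'd1:    y <= 16'h0010;  // placeholder entry
            8'd2:    y <= 16'h0020;  // placeholder entry
            // remaining entries omitted in this sketch
            default: y <= 16'hFFFF;  // placeholder for all indices not listed
        endcase
    end
endmodule
```

Because the output is registered, one instance like this could in principle be shared by feeding it the 25 neuron results one per clock cycle instead of instantiating it 25 times, at the cost of 25 cycles of latency.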
- Next, it multiplies the above results with the weights for the last 3 neurons:
prod_2[0] <= prod_2[0] + prod[0]*weight2[i]; i = 0 to 25-1;
prod_2[1] <= prod_2[1] + prod[0]*weight2[i]; i = 0 to 25-1;
prod_2[2] <= prod_2[2] + prod[0]*weight2[i]; i = 0 to 25-1;
This repeats 25 times, once for each of the 25 prod inputs.
It calculates all 3 output neurons simultaneously.
- Last step is to call sigmoid on prod_2[0 to 2]. I had to instantiate 3 more activation modules for this.
The simulated results are great, but the design is probably horribly inefficient!
So, I wanna know if this is a better idea:
Top_Module -> Neuron -> Multiplication and Activation
The top module calls the neuron function (need to instantiate 28 neurons for this!) and passes the relevant inputs and weights to it. The neuron function calls a Multiplication module (need 200 of them for first part and 25 of them for second part). The neuron function next adds the above 200 results (and 25 in part 2) and calls the activation function. The output finally goes back to the Top_Module.
My Questions:
Is this more efficient than the earlier implementation, which was performing everything in one module?
Is instantiating 28 neurons and each neuron instantiating 200 multiplication modules bad or good?
Any other ideas to make my code efficient so that Synopsys Design Compiler does not take 12 hours?!!
If I do this, each neuron has 200 inputs and 200 weights as its input ports. I didn't think Verilog modules could pass arrays between each other. If not, will I have to manually write out all 400 ports instead of passing the arrays?
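One commonly described workaround in plain Verilog-2001, where ports cannot be unpacked arrays, is to flatten each array into a single wide vector and slice it back apart inside the module (SystemVerilog allows array ports directly). The sketch below shows that idea only; neuron_flat, in_flat, weight_flat and the one-multiplier-per-neuron datapath are hypothetical names and choices, not part of my design.

```verilog
// Hypothetical neuron with its two arrays flattened into wide vector ports.
// Uses a single multiplier and walks through the N inputs one per clock cycle.
module neuron_flat #(
    parameter N = 200,  // inputs per neuron (this sketch assumes N <= 255)
    parameter W = 32    // operand width in bits
) (
    input  wire                  clk,
    input  wire                  rst,          // synchronous, restarts the sum
    input  wire                  en,
    input  wire [N*W-1:0]        in_flat,      // element i sits in bits [i*W +: W]
    input  wire [N*W-1:0]        weight_flat,  // same packing as in_flat
    output reg  signed [2*W-1:0] sum           // running dot product
);
    // Unpack the flat vectors into internal arrays of signed words.
    wire signed [W-1:0] in_arr [0:N-1];
    wire signed [W-1:0] w_arr  [0:N-1];
    genvar i;
    generate
        for (i = 0; i < N; i = i + 1) begin : g_unpack
            assign in_arr[i] = in_flat[i*W +: W];
            assign w_arr[i]  = weight_flat[i*W +: W];
        end
    endgenerate

    reg [7:0] idx;  // index of the element being accumulated
    always @(posedge clk) begin
        if (rst) begin
            idx <= 8'd0;
            sum <= {(2*W){1'b0}};
        end else if (en && idx < N) begin
            sum <= sum + in_arr[idx] * w_arr[idx];  // one multiply per cycle
            idx <= idx + 8'd1;
        end
    end
endmodule
```

This needs only two data ports per neuron instead of 400, although selecting in_arr[idx] this way still implies a wide multiplexer, so storing the inputs and weights in a memory and reading one pair per cycle may be the cheaper option.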
Sorry about a weird question, but I am new to the whole synthesis concept and wanna know how these tools go about instantiating modules?
Thanks,
Faisal.