
I have an interview test where I have to implement a fast matrix multiplication with a given matrix multiplication algorithm.

I have to implement it on any platform with any compiler I want. The task says:

• The PC implementation should be ready for SIMD optimization.
• Design a rational interface to the data-processing module.
• Write portable ANSI C code where it doesn't degrade efficiency. Don't use assembler.
• Think about the number of operations and their complexity. Care about things like function-call overhead, loop overhead, memory access time and cache performance.
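One way to read the "rational interface" requirement is a small, self-contained header for the matrix module. The names, the struct layout, and the row-major flat-array convention below are illustrative assumptions, not something the task specifies:

```c
/* matmul.h -- illustrative interface sketch (names are assumptions).
 * Matrices are flat, row-major double arrays: element (i,j) of an
 * r-by-c matrix lives at data[i*c + j].  This layout keeps rows
 * contiguous, which suits both cache lines and SIMD loads. */
#ifndef MATMUL_H
#define MATMUL_H

#include <stddef.h>

typedef struct {
    size_t rows;
    size_t cols;
    double *data;   /* rows*cols elements, row-major, caller-owned */
} matrix;

/* C = A * B; returns 0 on success, nonzero on dimension mismatch.
 * C must be pre-allocated with A->rows * B->cols elements. */
int mat_mul(const matrix *A, const matrix *B, matrix *C);

#endif /* MATMUL_H */
```

Keeping allocation with the caller avoids hidden malloc overhead inside the hot path, which matters given the task's emphasis on function-call and memory overhead.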

Should I implement this on a platform like a Raspberry Pi? Or on a CPU+DSP, ARM+NEON, or CPU+GPU simulator? Or should I just submit the code?

Thank you

Isn't this better asked of the interviewer? - Scott Hunter

1 Answer


There is a whole body of theory, covering instruction-level parallelism, thread-level parallelism, cache utilization and more, behind speeding up matrix multiplication.

First, learn how the CPU cache works: when a block is loaded into the cache, how it is mapped to a cache index, when a block is evicted, and so on. Consult a book on computer architecture, or Wikipedia.

Then look at the blocked (tiled) matrix multiplication algorithm.

And finally there is the BLAS specification, with OpenBLAS among the fastest open implementations for CPUs.