I have an interview test where I have to implement a fast matrix multiplication with a given matrix multiplication algorithm.
I have to implement it on any platform with any compiler I want. The task says:
•PC implementation should be ready for SIMD optimization. • Design a rational interface to the data processing module. • Write portable ANSIC code where it doesn't degrade the efficiency. Don’t use assembler. • Think about the number of operations, complexity of the operations. Care about things like function call overhead, loop overhead, memory access time and cache performance
Should I implement this on a platform like raspberry pi? Or on a CPU+DSP or ARM+NEON or CPU+GPU simulator? Or just give the code?
Thank you