17
votes

I've used x86 SIMD instructions (SSE through SSE4) in the form of intrinsics quite a lot lately. What I find frustrating is that the SSE ISA has several simple instructions that are available only for floats or only for integers, even though in theory they should perform equally well on both. For example, both float and double vectors have instructions to load the upper 64 bits of a 128-bit vector from memory (movhps, movhpd), but there's no such instruction for integer vectors.

My question:

Are there any reasons to expect a performance hit when using floating-point instructions on integer vectors, e.g. using movhps to load data into an integer vector?
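To make the case concrete, here is a minimal sketch of what I mean by using movhps on an integer vector. The helper name load_high_epi64 is made up for illustration; the casts are free (they only reinterpret the register type), and _mm_loadh_pi is the intrinsic that normally maps to movhps, although the compiler is not obliged to emit exactly that instruction.

```c
#include <emmintrin.h>  /* SSE2: __m128i, cast intrinsics; pulls in SSE1 as well */
#include <stdint.h>

/* Hypothetical helper: replace the upper 64 bits of an integer vector with
 * 64 bits loaded from p, keeping the lower 64 bits of v unchanged.
 * The load itself goes through the float intrinsic _mm_loadh_pi (movhps). */
static inline __m128i load_high_epi64(__m128i v, const void *p)
{
    __m128 f = _mm_castsi128_ps(v);         /* reinterpret only, no instruction */
    f = _mm_loadh_pi(f, (const __m64 *)p);  /* load 64 bits into the upper half */
    return _mm_castps_si128(f);             /* reinterpret back to integer type */
}
```

The pointer does not need to be 16-byte aligned, since only 64 bits are read.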

I wrote several tests to check this, but I doubt their results are reliable. It's really hard to write a correct test that covers all the corner cases for something like this, especially since instruction scheduling is most probably involved.

Related question:

Other trivial operations also have several instructions that do essentially the same thing. For example, I can do a bitwise OR with por, orps or orpd. Can anyone explain the purpose of these additional instructions? I guess it might be related to different scheduling algorithms being applied to each instruction.
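For reference, this is roughly how the three OR variants look as intrinsics. The wrapper names (or_int, or_ps, or_pd) are mine, just for illustration. All three produce bit-identical results; note that the compiler is free to pick a different encoding than the one the intrinsic name suggests, so the generated assembly should be checked before drawing conclusions.

```c
#include <emmintrin.h>  /* SSE2 */

/* Integer-domain OR: normally compiles to por. */
static inline __m128i or_int(__m128i a, __m128i b)
{
    return _mm_or_si128(a, b);
}

/* Single-precision-domain OR on the same bits: normally orps. */
static inline __m128i or_ps(__m128i a, __m128i b)
{
    return _mm_castps_si128(
        _mm_or_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b)));
}

/* Double-precision-domain OR on the same bits: nominally orpd. */
static inline __m128i or_pd(__m128i a, __m128i b)
{
    return _mm_castpd_si128(
        _mm_or_pd(_mm_castsi128_pd(a), _mm_castsi128_pd(b)));
}
```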

1
I don't think there has been an issue with this since the very early days of MMX/SSE. Any reasonably modern x86 CPU (e.g. from the last 5 years or so) should not have any such limitations. It's just a legacy from the days when MMX/SSE were just kluges bolted onto the FPU. – Paul R
@Paul R: I agree with that. However, the SSE ISA has such oddities in its newer parts too, e.g. SSE3 introduced the 'movddup' instruction, which is only available for doubles. This is actually what confuses me: the limitations shouldn't be there, but Intel seems to imply otherwise. – user283145
Well, the whole optimisation process, particularly where SIMD is concerned, involves a lot of experimentation: try out ideas, collect timing/profiling data, repeat ad nauseam. So probably the best idea is to just take an empirical approach: try everything and see what makes a difference. – Paul R
@Paul R: Unless I get an answer from an expert who knows the inner workings of x86 SIMD, this is most probably the approach I will take. – user283145
Even if you get a definitive answer for one particular generation of x86, it's liable to be a different story in the next generation. Nothing really remains static, so you have to keep re-evaluating, experimenting and benchmarking if you need absolute maximum SIMD performance. – Paul R

1 Answer

28
votes

From an expert (obviously not me :P): http://www.agner.org/optimize/optimizing_assembly.pdf [13.2 Using vector instructions with other types of data than they are intended for (pages 118-119)]:

There is a penalty for using the wrong type of instructions on some processors. This is because the processor may have different data buses or different execution units for integer and floating point data. Moving data between the integer and floating point units can take one or more clock cycles depending on the processor, as listed in table 13.2.

Processor                       Bypass delay, clock cycles 
  Intel Core 2 and earlier        1 
  Intel Nehalem                   2 
  Intel Sandy Bridge and later    0-1 
  Intel Atom                      0 
  AMD                             2 
  VIA Nano                        2-3 
Table 13.2. Data bypass delays between integer and floating point execution units
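If you want to check the bypass delay on your own machine, one way is to time a dependent chain that alternates an integer add with an OR, once with por and once with orps. The following is only a rough sketch under several assumptions: GCC or Clang on x86 (it uses GNU inline asm so the compiler cannot substitute a different OR encoding), the function names are made up for illustration, and __rdtsc counts reference cycles rather than core clock cycles, so only the relative difference between the two chains is meaningful.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc (GCC/Clang) */

/* Dependent chain staying in the integer domain: paddd -> por -> paddd ... */
static __m128i chain_por(__m128i v, long iters)
{
    const __m128i one = _mm_set1_epi32(1);
    const __m128i zero = _mm_setzero_si128();
    for (long i = 0; i < iters; ++i) {
        __asm__ volatile("paddd %1, %0\n\t"
                         "por   %2, %0"
                         : "+x"(v) : "x"(one), "x"(zero));
    }
    return v;
}

/* Same chain, but the OR is the float-domain orps, so on CPUs with a
 * bypass delay every iteration crosses domains twice. */
static __m128i chain_orps(__m128i v, long iters)
{
    const __m128i one = _mm_set1_epi32(1);
    const __m128i zero = _mm_setzero_si128();
    for (long i = 0; i < iters; ++i) {
        __asm__ volatile("paddd %1, %0\n\t"
                         "orps  %2, %0"
                         : "+x"(v) : "x"(one), "x"(zero));
    }
    return v;
}

/* Returns elapsed TSC ticks for one run of fn; the result vector is
 * written to *out so the two chains can be compared for equality. */
static uint64_t time_chain(__m128i (*fn)(__m128i, long), long iters,
                           __m128i *out)
{
    uint64_t t0 = __rdtsc();
    *out = fn(_mm_setzero_si128(), iters);
    return __rdtsc() - t0;
}
```

Call time_chain with chain_por and chain_orps for a large iteration count, repeat each measurement several times and take the minimum. If the orps chain is consistently slower by about a fixed amount per iteration, that is the bypass delay showing up; if both take the same time, your CPU has none for this instruction pair.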