The data-shift with each transmission can happen when the SPI slave recognizes an (unexpected) additional clock pulse. Looking at the SSI protocol description on Wikipedia this actually makes sense:
In order to transmit N bits of data the master emits N clock cycles, followed by another clock pulse to signal the end of the transfer (so-called "Monoflop Time" - referring to the original hardware implementation of the SSI interface). Since the SPI protocol / SPI slave does not know about this additional clock pulse, it begins to output the first bit of the next data byte, which is in turn not recognized by the SSI master. As a result this leads to a shift in the data bits recognized by the SSI master on the next SSI frame.
Unfortunately, it is not easy to handle the Monoflop time correctly with the SPI slave. In order to deal with the additional clock pulse, we could try to set the SPI frame size to 25 bits on the slave side. Since the STM32 hardware only supports SPI frame sizes between 4 bit and 16 bit, the only choice is to set it to 5 bit. This is not very convenient, since we need to convert the 3 byte (24 bit) output data into 5 blocks of 5 bit (24 bit output data + 1 bit dummy data), but it should work for a "normal" transfer.
Things get more complicated though, if we also want to handle the cases "Multiple transmissions" and "Interrupting transmission" correctly. We need to monitor the clock signal to be able to detect the monoflop timeout. This can be done using a STM32 hardware timer with an external trigger. When the timer expires, we need to reset the SPI unit (in order to handle an interrupted transmission) and update the output value. This "simple" task can be quite challenging since it requires a couple of instructions - requiring a fast MCU depending on the SSI clock frequency.
Alternatively the SSI protocol can be implemented using a software-only "bit banging" solution. But this requires a fast MCU as well in order to handle a fast SSI clock correctly.
IMHO the best solution is to use a small (inexpensive) FPGA to implement the SSI slave and let the MCU feed it with data over a traditional SPI interface.