7
votes

Given a number in a register (a binary integer), how to convert it to a string of hexadecimal ASCII digits? (i.e. serialize it into a text format.)

Digits can be stored in memory or printed on the fly, but storing in memory and printing all at once is usually more efficient. (You can modify a loop that stores to instead print one at a time.)

Can we efficiently handle all the nibbles in parallel with SIMD? (SSE2 or later?)

3
This is intended to be a decent canonical duplicate target for int->hex questions. All the functions in my answer were tested before posting. Part of the reason for deciding to write obsolete 32-bit code instead of x86-64 is to justify presenting a scalar loop version. SSE2 is baseline for x86-64, so you should always use it from int->hex unless you want a variable-width result without leading zeros. (Even then, you can probably use pcmpeqb / pmovmskb / bsf to find the position of the first non-0 digit easily.)Peter Cordes
See also github.com/zbjornson/fast-hex for binary->hex and hex->binary, for large buffers.Peter Cordes

3 Answers

17
votes

related: 16-bit version that converts 1 byte to 2 hex digits which you could print or store to a buffer. And Converting bin to hex in assembly has another 16-bit version with plenty of text explanation in the half of the answer that covers the int -> hex-string part of the problem.

If optimizing for code-size instead of speed, there's a hack using DAS that saves a few bytes.


16 is a power of 2. Unlike decimal or other bases that aren't a power of 2, we don't need division, and we can extract the most-significant digit first (i.e. in printing order). Otherwise we can only get the least-significant digit first (and its value depends on all bits of the number) and we have to go backwards: see How do I print an integer in Assembly Level Programming without printf from the c library? for non-power-of-2 bases.

Each 4-bit group of bits maps to one hex digit. We can use shifts or rotates, and AND masks, to extract each 4-bit chunk of the input as a 4-bit integer.

Unfortunately the 0..9 a..f hex digits are not contiguous in the ASCII character set (http://www.asciitable.com/). We either need conditional behaviour (a branch or cmov) or we can use a lookup table.

A lookup table is typically the most efficient for instruction count and performance since we're doing this repeatedly; modern CPUs have very fast L1d caches that make repeated loads of nearby bytes very cheap. Pipelined / out-of-order execution hides the ~5 cycle latency of an L1d cache load.

;; NASM syntax, i386 System V calling convention
global itohex      ; inputs: char* output,  unsigned number
itohex:
    push   edi           ; save a call-preserved register for scratch space
    mov    edi, [esp+8]  ; out pointer
    mov    eax, [esp+12] ; number

    mov    ecx, 8        ; 8 hex digits, fixed width zero-padded
.digit_loop:             ; do {
    rol    eax, 4          ; rotate the high 4 bits to the bottom

    mov    edx, eax
    and    edx, 0x0f       ; and isolate 4-bit integer in EDX

    movzx  edx, byte [hex_lut + edx]
    mov    [edi], dl       ; copy a character from the lookup table
    inc    edi             ; loop forward in the output buffer

    dec    ecx
    jnz    .digit_loop   ; }while(--ecx)

    pop    edi
    ret

section .rodata
    hex_lut:  db  "0123456789abcdef"

To adapt for x86-64, the calling convention will pass args in registers instead of the stack, e.g. RDI and ESI for x86-64 System V (non-Windows). Simply remove the part that loads from the stack, and change the loop to use ESI instead of EAX. (And make the addressing modes 64-bit. You may need to LEA the hex_lut address into a register outside the loop; see this and this).

This version converts to hex with leading zeros. If you want to drop them, bit_scan(input)/4 like lzcnt or __builtin_clz on the input, or SIMD compare -> pmovmksb -> tzcnt on the output ASCII string will tell you how many 0 digits you have (and thus you can print or copy starting at the first non-zero). Or convert starting with the low nibble and work backwards, stopping when a right shift makes the value zero, as shown in the second version that uses cmov instead of a lookup table.

Until BMI2 (shrx / rorx), x86 lacks a copy-and-shift instruction, so rotating in-place and then copy/AND is hard to beat1. Modern x86 (Intel and AMD) has 1-cycle latency for rotates (https://agner.org/optimize/ and https://uops.info/), so this loop-carried dependency chain doesn't become a bottleneck. (There are too many instructions in the loop for it to run at even 1 cycle per iteration even on 5-wide Ryzen.)

I used mov ecx,8 and dec ecx/jnz for for human readability; lea ecx, [edi+8] at the top and cmp edi, ecx / jb .digit_loop as the loop branch is smaller overall machine code size, and more efficient on more CPUs. dec/jcc macro-fusion into a single uop only happens on Intel Sandybridge-family; AMD only fuses jcc with cmp or test. This optimization would get it down to 7 uops for the front-end on Ryzen, same as Intel, which is still more than it can issue in 1 cycle.

Footnote 1: We might use SWAR (SIMD within a register) to do the AND before shifting: x & 0x0f0f0f0f low nibbles, and shr(x,4) & 0x0f0f0f0f high nibbles, then effectively unroll by alternating processing a byte from each register. (Without any efficient way to do an equivalent of punpcklbw or mapping integers to the non-contiguous ASCII codes, we do still just have to do each byte separately. But we might unroll the byte-extraction and read AH then AL (with movzx) to save shift instructions. Reading high-8 registers can add latency, but I think it doesn't cost extra uops on current CPUs. Writing high-8 registers is usually not good on Intel CPUs: it costs an extra merging uop to read the full register, with a front-end delay to insert it. So getting wider stores by shuffling registers is probably not good. In kernel code where you can't use XMM regs, but could use BMI2 if available, pdep could expand nibbles to bytes but this is probably worse than just masking 2 ways.)

Test program:

// hex.c   converts argv[1] to integer and passes it to itohex
#include <stdio.h>
#include <stdlib.h>

void itohex(char buf[8], unsigned num);

int main(int argc, char**argv) {
    unsigned num = strtoul(argv[1], NULL, 0);  // allow any base
    char buf[9] = {0};
    itohex(buf, num);   // writes the first 8 bytes of the buffer, leaving a 0-terminated C string
    puts(buf);
}

compile with:

nasm -felf32 -g -Fdwarf itohex.asm
gcc -g -fno-pie -no-pie -O3 -m32 hex.c itohex.o

test runs:

$ ./a.out 12315
0000301b
$ ./a.out 12315123
00bbe9f3
$ ./a.out 999999999
3b9ac9ff
$ ./a.out 9999999999   # apparently glibc strtoul saturates on overflow
ffffffff
$ ./a.out 0x12345678   # strtoul with base=0 can parse hex input, too
12345678

Alternate implementations:

Conditional instead of lookup-table: takes several more instructions, and will probably be slower. But it doesn't need any static data.

It could be done with branching instead of cmov, but that would be even slower most of the time. (It won't predict well, assuming a random mix of 0..9 and a..f digits.) https://codegolf.stackexchange.com/questions/193793/little-endian-number-to-string-conversion/193842#193842 shows a version optimized for code-size. (Other than a bswap at the start, it's a normal uint32_t -> hex with zero padding.)

Just for fun, this version starts at the end of the buffer and decrements a pointer. (And the loop condition uses a pointer-compare.) You could have it stop once EDX becomes zero, and use EDI+1 as the start of the number, if you don't want leading zeros.

Using a cmp eax,9 / ja instead of cmov is left as an exercise for the reader. A 16-bit version of this could use different registers (like maybe BX as a temporary) to still allow lea cx, [bx + 'a'-10] copy-and-add. Or just add/cmp and jcc, if you want to avoid cmov for compat with ancient CPUs that don't support P6 extensions.

;; NASM syntax, i386 System V calling convention
itohex:   ; inputs: char* output,  unsigned number
itohex_conditional:
    push   edi             ; save a call-preserved register for scratch space
    push   ebx
    mov    edx, [esp+16]   ; number
    mov    ebx, [esp+12]   ; out pointer

    lea    edi, [ebx + 7]   ; First output digit will be written at buf+7, then we count backwards
.digit_loop:                ; do {
    mov    eax, edx
    and    eax, 0x0f            ; isolate the low 4 bits in EAX
    lea    ecx, [eax + 'a'-10]  ; possible a..f value
    add    eax, '0'             ; possible 0..9 value
    cmp    ecx, 'a'
    cmovae eax, ecx             ; use the a..f value if it's in range.
                                ; for better ILP, another scratch register would let us compare before 2x LEA,
                                ;  instead of having the compare depend on an LEA or ADD result.

    mov    [edi], al        ; *ptr-- = c;
    dec    edi

    shr    edx, 4

    cmp    edi, ebx         ; alternative:  jnz on flags from EDX to not write leading zeros.
    jae    .digit_loop      ; }while(ptr >= buf)

    pop    ebx
    pop    edi
    ret

We could expose even more ILP within each iteration using 2x lea + cmp/cmov. cmp and both LEAs only depend on the nibble value, with cmov consuming all 3 of those results. But there's lots of ILP across iterations with only the shr edx,4 and the pointer decrement as loop-carried dependencies. I could have saved 1 byte of code-size by arranging so I could use cmp al, 'a' or something. And/or add al,'0' if I didn't care about CPUs that rename AL separately from EAX.

Testcase that checks for off-by-1 errors by using a number that has both 9 and a in its hex digits:

$ nasm -felf32 -g -Fdwarf itohex.asm && gcc -g -fno-pie -no-pie -O3 -m32 hex.c itohex.o && ./a.out 0x19a2d0fb
19a2d0fb

SIMD with SSE2, SSSE3, AVX2 or AVX512F, and ~2 instructions with AVX512VBMI

With SSSE3 and later, it's best to use a byte shuffle as a nibble lookup table.

Most of these SIMD versions could be used with two packed 32-bit integers as input, with the low and high 8 bytes of the result vector containing separate results that you can store separately with movq and movhps. Depending on your shuffle control, this is exactly like using it for one 64-bit integer.

SSSE3 pshufb parallel lookup table. No need to mess around with loops, we can do this with a few SIMD operations, on CPUs that have pshufb. (SSSE3 is not baseline even for x86-64; it was new with Intel Core2 and AMD Bulldozer).

pshufb is a byte shuffle that's controlled by a vector, not an immediate (unlike all earlier SSE1/SSE2/SSE3 shuffles). With a fixed destination and a variable shuffle-control, we can use it as a parallel lookup table to do 16x lookups in parallel (from a 16 entry table of bytes in a vector).

So we load the whole integer into a vector register, and unpack its nibbles to bytes with a bit-shift and punpcklbw. Then use a pshufb to map those nibbles to hex digits.

That leaves us with the ASCII digits an XMM register with the least significant digit as the lowest byte of the register. Since x86 is little-endian, there's no free way to store them to memory in the opposite order, with the MSB first.

We can use an extra pshufb to reorder the ASCII bytes into printing order, or use bswap on the input in an integer register (and reverse the nibble -> byte unpacking). If the integer is coming from memory, going through an integer register for bswap kinda sucks (especially for AMD Bulldozer-family), but if you have the integer in a GP register in the first place it's pretty good.

;; NASM syntax, i386 System V calling convention

section .rodata
 align 16
    hex_lut:  db  "0123456789abcdef"
    low_nibble_mask: times 16 db 0x0f
    reverse_8B: db 7,6,5,4,3,2,1,0,   15,14,13,12,11,10,9,8
    ;reverse_16B: db 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0

section .text

global itohex_ssse3    ; tested, works
itohex_ssse3:
    mov    eax,  [esp+4]    ; out pointer
    movd   xmm1, [esp+8]    ; number

    movdqa xmm0, xmm1
    psrld  xmm1, 4          ; right shift: high nibble -> low  (with garbage shifted in)
    punpcklbw xmm0, xmm1    ; interleave low/high nibbles of each byte into a pair of bytes
    pand   xmm0, [low_nibble_mask]   ; zero the high 4 bits of each byte (for pshufb)
    ; unpacked to 8 bytes, each holding a 4-bit integer

    movdqa xmm1, [hex_lut]
    pshufb xmm1, xmm0       ; select bytes from the LUT based on the low nibble of each byte in xmm0

    pshufb xmm1, [reverse_8B]  ; printing order is MSB-first

    movq   [eax], xmm1      ; store 8 bytes of ASCII characters
    ret
;; The same function for 64-bit integers would be identical with a movq load and a movdqu store.
;; but you'd need reverse_16B instead of reverse_8B to reverse the whole reg instead of each 8B half

It's possible to pack the AND mask and the pshufb control into one 16-byte vector, similar to itohex_AVX512F below.

AND_shuffle_mask: times 8 db 0x0f       ; low half: 8-byte AND mask
                   db 7,6,5,4,3,2,1,0   ; high half: shuffle constant that will grab the low 8 bytes in reverse order

Load it into a vector register and use it as an AND mask, then use it as a pshufb control to grab the low 8 bytes in reverse order, leaving them in the high 8. Your final result (8 ASCII hex digits) will be in the top half of an XMM register, so use movhps [eax], xmm1. On Intel CPUs, this is still only 1 fused-domain uop, so it's just as cheap as movq. But on Ryzen, it costs a shuffle on top of a store. Plus, this trick is useless if you want to convert two integers in parallel, or a 64-bit integer.

SSE2, guaranteed available in x86-64:

Without SSSE3 pshufb, we need to rely on scalar bswap to put the bytes in printing right order, and punpcklbw the other way to interleave with the high nibble of each pair first.

Instead of a table lookup, we simply add '0', and add another 'a' - ('0'+10) for digits greater than 9 (to put them into the 'a'..'f' range). SSE2 has a packed byte compare for greater-than, pcmpgtb. Along with a bitwise AND, that's all we need to conditionally add something.

itohex:             ; tested, works.
global itohex_sse2
itohex_sse2:
    mov    edx,  [esp+8]    ; number
    mov    ecx,  [esp+4]    ; out pointer
    ;; or enter here for fastcall arg passing.  Or rdi, esi for x86-64 System V.  SSE2 is baseline for x86-64
    bswap  edx
    movd   xmm0, edx

    movdqa xmm1, xmm0
    psrld  xmm1, 4          ; right shift: high nibble -> low  (with garbage shifted in)
    punpcklbw xmm1, xmm0    ; interleave high/low nibble of each byte into a pair of bytes
    pand   xmm1, [low_nibble_mask]   ; zero the high 4 bits of each byte
    ; unpacked to 8 bytes, each holding a 4-bit integer, in printing order

    movdqa  xmm0, xmm1
    pcmpgtb xmm1, [vec_9]
    pand    xmm1, [vec_af_add] ; digit>9 ?  'a'-('0'+10)  :  0
    
    paddb   xmm0, [vec_ASCII_zero]
    paddb   xmm0, xmm1      ; conditional add for digits that were outside the 0..9 range, bringing them to 'a'..'f'

    movq   [ecx], xmm0      ; store 8 bytes of ASCII characters
    ret
    ;; would work for 64-bit integers with 64-bit bswap, just using movq + movdqu instead of movd + movq


section .rodata
align 16
    vec_ASCII_zero: times 16 db '0'
    vec_9:          times 16 db 9
    vec_af_add:     times 16 db 'a'-('0'+10)
    ; 'a' - ('0'+10) = 39 = '0'-9, so we could generate this from the other two constants, if we were loading ahead of a loop
    ; 'A'-('0'+10) = 7 = 0xf >> 1.  So we could generate this on the fly from an AND.  But there's no byte-element right shift.

    low_nibble_mask: times 16 db 0x0f

This version needs more vector constants than most others. 4x 16 bytes is 64 bytes, which fits in one cache line. You might want to align 64 before the first vector instead of just align 16, so they all come from the same cache line.

This could even be implemented with only MMX, using only 8-byte constants, but then you'd need an emms so it would probably only be a good idea on very old CPUs which don't have SSE2, or which split 128-bit operations into 64-bit halves (e.g. Pentium-M or K8). On modern CPUs with mov-elimination for vector registers (like Bulldozer and IvyBrige), it only works on XMM registers, not MMX. I did arrange the register usage so the 2nd movdqa is off the critical path, but I didn't do that for the first.


AVX can save a movdqa, but more interesting is with AVX2 we can potentially produce 32 bytes of hex digits at a time from large inputs. 2x 64-bit integers or 4x 32-bit integers; use a 128->256-bit broadcast load to replicate the input data into each lane. From there, in-lane vpshufb ymm with a control vector that read from the low or high half of each 128-bit lane should set you up with the nibbles for the low 64 bits of input unpacked in the low lane, and the nibbles for the high 64 bits of input unpacked in the high lane.

Or if the input numbers come from different sources, maybe vinserti128 the high one might be worth it on some CPUs, vs. just doing separate 128-bit operations.


AVX512VBMI (Cannonlake/IceLake, not present in Skylake-X) has a 2-register byte shuffle vpermt2b that could combine the puncklbw interleaving with byte-reversing. Or even better, we have VPMULTISHIFTQB which can extract 8 unaligned 8-bit bitfields from each qword of the source.

We can use this to extract the nibbles we want into the order we want directly, avoiding a separate right-shift instruction. (It still comes with garbage bits, but vpermb ignores high garbage.)

To use this for 64-bit integers, use a broadcast source and a multishift control that unpacks the high 32 bits of the input qword in the bottom of the vector, and the low 32 bits in the top of the vector. (Assuming little-endian input)

To use this for more than 64 bits of input, use vpmovzxdq to zero-extend each input dword into a qword, setting up for vpmultishiftqb with the same 28,24,...,4,0 control pattern in each qword. (e.g. producing a zmm vector of output from a 256-bit vector of input, or four dwords -> a ymm reg to avoid clock-speed limits and other effects of actually running a 512-bit AVX512 instruction.)

Beware that wider vpermb uses 5 or 6 bits of each control byte, meaning you'll need to broadcast the hexLUT to a ymm or zmm register, or repeat it in memory.

itohex_AVX512VBMI:                         ;  Tested with SDE
    vmovq          xmm1, [multishift_control]
    vpmultishiftqb xmm0, xmm1, qword [esp+8]{1to2}    ; number, plus 4 bytes of garbage.  Or a 64-bit number
    mov    ecx,  [esp+4]            ; out pointer
   
     ;; VPERMB ignores high bits of the selector byte, unlike pshufb which zeroes if the high bit is set
     ;; and it takes the bytes to be shuffled as the optionally-memory operand, not the control
    vpermb  xmm1, xmm0, [hex_lut]   ; use the low 4 bits of each byte as a selector

    vmovq   [ecx], xmm1     ; store 8 bytes of ASCII characters
    ret
    ;; For 64-bit integers: vmovdqa load [multishift_control], and use a vmovdqu store.

section .rodata
align 16
    hex_lut:  db  "0123456789abcdef"
    multishift_control: db 28, 24, 20, 16, 12, 8, 4, 0
    ; 2nd qword only needed for 64-bit integers
                        db 60, 56, 52, 48, 44, 40, 36, 32
# I don't have an AVX512 CPU, so I used Intel's Software Development Emulator
$ /opt/sde-external-8.4.0-2017-05-23-lin/sde -- ./a.out 0x1235fbac
1235fbac

vpermb xmm is not lane-crossing because there's only one lane involved (unlike vpermb ymm or zmm). But unfortunately on CannonLake (according to instlatx64 results), it still has 3-cycle latency so pshufb would be better for latency. But pshufb conditionally zeros based on the high bit so it requires masking the control vector. That makes it worse for throughput, assuming vpermb xmm is only 1 uop. In a loop where we can keep the vector constants in registers (instead of memory operands), it only saves 1 instruction instead of 2.

(Update: yes, https://uops.info/ confirms vpermb is 1 uop with 3c latency, 1c throughput on Cannon Lake and Ice Lake. ICL has 0.5c throughput for vpshufb xmm/ymm)


AVX2 variable-shift or AVX512F merge-masking to save an interleave

With AVX512F, we can use merge-masking to right-shift one dword while leaving the other unmodified, after broadcasting the number into an XMM register.

Or we could use an AVX2 variable-shift vpsrlvd to do exactly the same thing, with a shift-count vector of [4, 0, 0, 0]. Intel Skylake and later has single-uop vpsrlvd; Haswell/Broadwell take multiple uops (2p0 + p5). Ryzen's vpsrlvd xmm is 1 uop, 3c latency, 1 per 2 clock throughput. (Worse than immediate shifts).

Then we only need a single-register byte shuffle, vpshufb, to interleave nibbles and byte-reverse. But then you need a constant in a mask register which takes a couple instructions to create. It would be a bigger win in a loop converting multiple integers to hex.

For a non-looping stand-alone version of the function, I used two halves of one 16-byte constant for different things: set1_epi8(0x0f) in the top half, and 8 bytes of pshufb control vector in the low half. This doesn't save a lot because EVEX broadcast memory operands allow vpandd xmm0, xmm0, dword [AND_mask]{1to4}, only requiring 4 bytes of space for a constant.

itohex_AVX512F:       ;; Saves a punpcklbw.  tested with SDE
    vpbroadcastd  xmm0, [esp+8]    ; number.  can't use a broadcast memory operand for vpsrld because we need merge-masking into the old value
    mov     edx, 1<<3             ; element #3
    kmovd   k1, edx
    vpsrld  xmm0{k1}, xmm0, 4      ; top half:  low dword: low nibbles unmodified (merge masking).  2nd dword: high nibbles >> 4
      ; alternatively, AVX2 vpsrlvd with a [4,0,0,0] count vector.  Still doesn't let the data come from a memory source operand.

    vmovdqa xmm2, [nibble_interleave_AND_mask]
    vpand   xmm0, xmm0, xmm2     ; zero the high 4 bits of each byte (for pshufb), in the top half
    vpshufb xmm0, xmm0, xmm2     ; interleave nibbles from the high two dwords into the low qword of the vector

    vmovdqa xmm1, [hex_lut]
    vpshufb xmm1, xmm1, xmm0       ; select bytes from the LUT based on the low nibble of each byte in xmm0

    mov      ecx,  [esp+4]    ; out pointer
    vmovq   [ecx], xmm1       ; store 8 bytes of ASCII characters
    ret

section .rodata
align 16
    hex_lut:  db  "0123456789abcdef"
    nibble_interleave_AND_mask: db 15,11, 14,10, 13,9, 12,8  ; shuffle constant that will interleave nibbles from the high half
                      times 8 db 0x0f              ; high half: 8-byte AND mask
2
votes

With AVX2 or AVX-512 Intrinsics

As requested, porting some versions of my asm answer to C (which I wrote to also be valid C++). Godbolt compiler-explorer link. They compile back to asm almost as good as my hand-written asm. (And I checked that the vector constants in the compiler-generated asm match my db directives. Definitely something to check on when translating asm to intrinsics, especially if you use _mm_set_ instead of setr for constants that may seem more "natural" in highest-first order. setr uses memory order, same as asm.)

Unlike my 32-bit asm, these are optimizing for their input number being in a register, not assuming it has to get loaded from memory anyway. (So we don't assume the broadcast is free.) But TODO: explore using bswap instead of a SIMD shuffle to get bytes into printing order. Especially for 32-bit integers where bswap is only 1 uop (vs. 2 on Intel for 64-bit registers, unlike AMD).

These print the whole number in MSD-first printing order. Tweak the multishift constant or shuffle controls for little-endian memory-order output, like people apparently want for hex output of a large hash. Or for the SSSE3 version, simply remove the pshufb byte-reverse.)

AVX2 / 512 also allow wider versions that operate on 16 or 32 bytes of input at a time, producing 32 or 64 bytes of hex output. Probably by shuffling to repeat each 64 bits within a 128-bit lane, in a vector of twice the width, e.g. with vpermq like _mm256_permutex_epi64(_mm256_castsi128_si256(v), _MM_SHUFFLE(?,?,?,?)).

AVX512VBMI (Ice Lake and newer)

#include <immintrin.h>
#include <stdint.h>

#if defined(__AVX512VBMI__) || defined(_MSC_VER)
// AVX512VBMI was new in Icelake
//template<typename T>   // also works for uint64_t, storing 16 or 8 bytes.
void itohex_AVX512VBMI(char *str, uint32_t input_num)
{
    __m128i  v;
    if (sizeof(input_num) <= 4) {
        v = _mm_cvtsi32_si128(input_num); // only low qword needed
    } else {
        v = _mm_set1_epi64x(input_num);   // bcast to both halves actually needed
    }
    __m128i multishift_control = _mm_set_epi8(32, 36, 40, 44, 48, 52, 56, 60,   // high qword takes high 32 bits.  (Unused for 32-bit input)
                                               0,  4,  8, 12, 16, 20, 24, 28);  // low qword takes low 32 bits
    v = _mm_multishift_epi64_epi8(multishift_control, v);
    // bottom nibble of each byte is valid, top holds garbage. (So we can't use _mm_shuffle_epi8)
    __m128i hex_lut = _mm_setr_epi8('0', '1', '2', '3', '4', '5', '6', '7',
                                    '8', '9', 'a', 'b', 'c', 'd', 'e', 'f');
    v = _mm_permutexvar_epi8(v, hex_lut);

    if (sizeof(input_num) <= 4)
        _mm_storel_epi64((__m128i*)str, v);  // 8 ASCII hex digits (u32)
    else
        _mm_storeu_si128((__m128i*)str, v);  // 16 ASCII hex digits (u64)
}
#endif

My asm version used a 64-bit broadcast load of its stack arg from memory even for a u32 arg. But that was only so I could fold the load into a memory source operand for vpmultishiftqb. There's no way to tell the compiler that it can use a 64-bit broadcast memory source operand with the upper 32 bits being "don't care", if the value was coming from memory anyway (and known not be at the end of a page before an unmapped page, e.g. a 32-bit mode stack arg). So that minor optimization isn't available in C. And usually after inlining your vars will be in registers, and if you have a pointer you won't know if it's at the end of a page or not. The uint64_t version does need to broadcast, but since the object in memory is a uint64_t the compiler can use a {1to2} broadcast memory source operand. (At least clang and ICC are smart enough to with -m32 -march=icelake-client, or in 64-bit mode with a reference instead of value arg.)

clang -O3 -m32 actually compiles identically to what my hand-written asm, except for vmovdqa load of the constant, not vmovq, because it's actually all needed in that case. Compilers aren't smart enough to only use vmovq loads and omit the 0 bytes from .rodata when the top 8 bytes of the constant are 0. Also note that the multishift constant in asm output matches, so the _mm_set_epi8 is right; .


AVX2

This takes advantage of the input being a 32-bit integer; the strategy doesn't work for 64-bit (because it needs a bit-shift twice as wide).

// Untested, and different strategy from any tested asm version.

// requires AVX2, can take advantage of AVX-512
// Avoids a broadcast, which costs extra without AVX-512, unless the value is coming from mem.
// With AVX-512, this just saves a mask or variable-shift constant.  (vpbroadcastd xmm, reg is as cheap as vmovd, except for code size)
void itohex_AVX2(char *str, uint32_t input_num)
{
    __m128i  v = _mm_cvtsi32_si128(input_num);
    __m128i hi = _mm_slli_epi64(v, 32-4);  // input_num >> 4 in the 2nd dword
    // This trick to avoid a shuffle only works for 32-bit integers
#ifdef __AVX512VL__
                                          // UNTESTED, TODO: check this constant
    v = _mm_ternarylogic_epi32(v, hi, _mm_set1_epi8(0x0f), 0b10'10'10'00);  // IDK why compilers don't do this for us
#else
    v = _mm_or_si128(v, hi);              // the overlaping 4 bits will be masked away anyway, don't need _mm_blend_epi32
    v = _mm_and_si128(v, _mm_set1_epi8(0x0f));     // isolate the nibbles because vpermb isn't available
#endif
    __m128i nibble_interleave = _mm_setr_epi8(7,3, 6,2, 5,1, 4,0,
                                              0,0,0,0,  0,0,0,0);
    v = _mm_shuffle_epi8(v, nibble_interleave);  // and put them in order into the low qword
    __m128i hex_lut = _mm_setr_epi8('0', '1', '2', '3', '4', '5', '6', '7',
                                    '8', '9', 'a', 'b', 'c', 'd', 'e', 'f');
    v = _mm_shuffle_epi8(hex_lut, v);

    _mm_storel_epi64((__m128i*)str, v);  // movq 8 ASCII hex digits (u32)
}

The above is I think better, especially on Haswell, but also on Zen where variable-shift vpsrlvd has lower throughput and higher latency even though it's only a single uop. It's better for back-end port bottlenecks even on Skylake: 3 instructions that run only on port 5, vs. 4 (including vmovd xmm, reg, vpbroadcastd xmm,xmm, and 2x vpshufb) for the version below, but same number of front-end uops (assuming micro-fusion of the vector constants as memory source operands). It also needs 1 fewer vector constant, which is always nice, especially if this isn't in a loop.

AVX-512 can use a merge-masked shift instead of a variable-count shift, saving one vector constant at the cost of needing to set up a mask register. This saves space in .rodata but doesn't eliminate all constants, so a cache miss will still stall this. And mov r,imm / kmov k,r is 2 uops instead of 1 outside whatever loop you use this with.

also AVX2: port of the itohex_AVX512F asm version with the vpsrlvd idea I added later.

// combining shuffle and AND masks into a single constant only works for uint32_t
// uint64_t would need separate 16-byte constants.
// clang and GCC wastefully replicate into 2 constants anyway!?!

// Requires AVX2, can take advantage of AVX512 (for cheaper broadcast, and alternate shift strategy)
void itohex_AVX2_slrv(char *str, uint32_t input_num)
{
    __m128i  v = _mm_set1_epi32(input_num);
#ifdef __AVX512VL__
    // save a vector constant, at the cost of a mask constant which takes a couple instructions to create
    v = _mm_mask_srli_epi32(v, 1<<3, v, 4);  // high nibbles in the top 4 bytes, low nibbles unchanged.
#else
    v = _mm_srlv_epi32(v, _mm_setr_epi32(0,0,0,4));  // high nibbles in the top 4 bytes, low nibbles unchanged.
#endif

    __m128i nibble_interleave_AND_mask = _mm_setr_epi8(15,11, 14,10, 13,9, 12,8,     // for PSHUFB
                                    0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f); // for PAND
    v = _mm_and_si128(v, nibble_interleave_AND_mask);     // isolate the nibbles because vpermb isn't available
    v = _mm_shuffle_epi8(v, nibble_interleave_AND_mask);  // and put them in order into the low qword
    __m128i hex_lut = _mm_setr_epi8('0', '1', '2', '3', '4', '5', '6', '7',
                                    '8', '9', 'a', 'b', 'c', 'd', 'e', 'f');
    v = _mm_shuffle_epi8(hex_lut, v);

    _mm_storel_epi64((__m128i*)str, v);  // movq 8 ASCII hex digits (u32)
}

Compared to the the SSSE3 version, this saves a vpunpcklbw by using vpsrlvd (or masked shift) to get the bytes of num>>4 and num into the same XMM register to set up for a 1-register byte shuffle. vpsrlvd is single-uop on Skylake and later, and on Zen 1 / Zen 2. On Zen it's higher latency, though, and not fully pipelined according to https://uops.info/ (2c throughput instead of the 1c you'd expect from it being a single uop for one port.) But at least it doesn't compete for the same port as vpshufb and vpbroadcastd xmm,xmm on those CPUs. (On Haswell, it's 2 uops including one for p5, so there it does compete and this is strictly worse than the SSSE3 version because it requires an extra constant.)

A good option for Haswell might be _mm_slli_epi64(v, 32-4) / _mm_blend_epi32 - vpblendd runs on any port, not needing the shuffle port. Or maybe even in general, since that only needs a vmovd setup, not vmovd + vpbroadcastd

This function needs 2 other vector constants (hex lut, and a combined AND and shuffle mask). GCC and clang foolishly "optimize" the 2 uses of one mask into 2 separate mask constants, which is really dumb. (But in a loop, only costs setup overhead and a register, no extra per-conversion cost.) You'd need 2 separate 16-byte constants anyway for a uint64_t version of this, but my hand-written asm version was being clever by using 2 halves of one 16-byte constant.

MSVC avoids that problem: it compiles intrinsics more literally and doesn't try to optimize them (which is often a bad thing, but here it avoids that problem.) But MSVC misses out on using AVX-512 GP-register-source vpbroadcastd xmm0, esi for _mm_set1_epi32 with -arch:AVX512. With -arch:AVX2 (so the broadcast has to be done with 2 separate instructions) it uses that vector constant as a memory source operand twice (for vpand and vpshufb) instead of loading into a register, which is pretty questionable but probably ok and actually saves front-end uops. IDK what it would do in a loop where hoisting the load is more obviously good.


Writing hex_lut more compactly:

hex_lut = _mm_loadu_si128((const __m128i*)"0123456789abcdef"); compiles fully efficiently with GCC and Clang (they effectively optimize away the string literal with its terminating 0, and just emit an aligned vector constant). But MSVC unfortunately keeps the actual string in .rdata, without aligning it. So I used the longer, less nice to read, _mm_setr_epi8('0', '1', ..., 'f');

-1
votes

shotly it is

section .data
msg resb 8
db 10
hex_nums db '0123456789ABCDEF'
xx dd 0FF0FEFCEh
length dw 4

section .text
global main

main:
    mov rcx, 0
    mov rbx, 0
sw:
    mov ah, [rcx + xx]
    mov bl, ah
    shr bl, 0x04
    mov al, [rbx + hex_nums]
    mov [rcx*2 + msg], al
    and ah, 0x0F
    mov bl, ah
    mov ah, [rbx + hex_nums]
    mov [rcx*2 + msg + 1], ah
    inc cx
    cmp cx, [length]
    jl  sw

    mov rax, 1
    mov rdi, 1
    mov rsi, msg
    mov rdx, 9   ;8 + 1
    syscall

    mov rax, 60
    mov rdi, 0
    syscall

nasm -f elf64 x.asm -o t.o
gcc -no-pie t.o -o t