propagate carry bit between two ASM GCC inline block

Question

Dear Assembly/C++ dev,

The question is: Does propagate the carry (or any flag) between two ASM block is realistic or totally insane, even if it works ?

A few years ago I developed an integer library for large arithmetic lower than 512 bits (at compile time). I did not use GMP at this time because for this scale, GMP become slower due to the memory allocation and the model choose for the binary representation bench.

I must confess I created my ASM ( the string block) using BOOST_PP, it is not very glorious (for curious have a look on it vli). The library was working well.

However I remark at this time it was impossible to propagate the carry flag of the state register between two ASM inline block. This is logical because for any mnemonic generated by the compiler between two blocks, the register is reset ( except for the mov instruction (from my assembly knowledge)).

Yesterday I get an idea to propagate the carry between two ASM block a bit tricky (using a recursive algo). It is working but I think I am lucky.

#include <iostream>
#include <array>
#include <cassert>
#include <algorithm>

//forward declaration
template<std::size_t NumBits>
struct integer;


//helper using object function, partial specialization  is forbiden on functions
template <std::size_t NumBits, std::size_t W, bool K = W == integer<NumBits>::numwords>
struct helper {
    static inline void add(integer<NumBits> &a, const integer<NumBits> &b){
        helper<NumBits, integer<NumBits>::numwords>::add(a,b);
    }
};

// first addition (call first)
template<std::size_t NumBits, std::size_t W>
struct helper<NumBits, W, 1> {
    static inline void add(integer<NumBits> &a, const integer<NumBits> &b){
        __asm__ (
                              "movq %1, %%rax \n"
                              "addq %%rax, %0 \n"
                              : "+m"(a[0]) // output
                              : "m" (b[0]) // input only
                              : "rax", "cc", "memory");
        helper<NumBits,W-1>::add(a,b);
    }
};

//second and more propagate the carry (call next)
template<std::size_t NumBits, std::size_t W>
struct helper<NumBits, W, 0> {
    static inline void add(integer<NumBits> &a, const integer<NumBits> &b){
        __asm__ (
                              "movq %1, %%rax \n"
                              "adcq %%rax, %0 \n"
                              : "+m"(a[integer<NumBits>::numwords-W])
                              : "m" (b[integer<NumBits>::numwords-W])
                              : "rax", "cc", "memory");
        helper<NumBits,W-1>::add(a,b);
    }
};

//nothing end reccursive process (call last)
template<std::size_t NumBits>
struct helper<NumBits, 0, 0> {
    static inline void add(integer<NumBits> &a, const integer<NumBits> &b){};
};

// tiny integer class
template<std::size_t NumBits>
struct integer{
    typedef uint64_t      value_type;
    static const std::size_t numbits = NumBits;
    static const std::size_t numwords = (NumBits+std::numeric_limits<value_type>::digits-1)/std::numeric_limits<value_type>::digits;
    using container = std::array<uint64_t, numwords>;

    typedef typename container::iterator             iterator;

    iterator begin() { return data_.begin();}
    iterator end() { return data_.end();}

    explicit integer(value_type num = value_type()){
        assert( -1l >> 1 == -1l );
        std::fill(begin(),end(),value_type());
        data_[0] = num;
    }

    inline value_type& operator[](std::size_t n){ return data_[n];}
    inline const value_type& operator[](std::size_t n) const { return data_[n];}

    integer& operator+=(const integer& a){
        helper<numbits,numwords>::add(*this,a);
        return *this;
    }

    integer& operator~(){
        std::transform(begin(),end(),begin(),std::bit_not<value_type>());
        return *this;
    }

    void print_raw(std::ostream& os) const{
        os << "(" ;
        for(std::size_t i = numwords-1; i > 0; --i)
            os << data_[i]<<" ";
        os << data_[0];
        os << ")";
    }

    void print(std::ostream& os) const{
        assert(false && " TO DO ! \n");
    }

private:
    container data_;
};

template <std::size_t NumBits>
std::ostream& operator<< (std::ostream& os, integer<NumBits> const& i){
    if(os.flags() & std::ios_base::hex)
        i.print_raw(os);
    else
        i.print(os);
    return os;
}

int main(int argc, const char * argv[]) {
    integer<256> a; // 0
    integer<256> b(1);

    ~a; //all the 0 become 1

    std::cout << " a: " << std::hex << a << std::endl;
    std::cout << " ref: (ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff) " <<  std::endl;

    a += b; // should propagate the carry

    std::cout << " a+=b: " << a << std::endl;
    std::cout << " ref: (0 0 0 0) " <<  std::endl; // it works but ...

    return 0;
}

I get a correct result (it must be compiled in release -O2 or -O3!) and the ASM is right (on my Mac with clang++: Apple LLVM version 9.0.0 (clang-900.0.39.2))

    movq    -96(%rbp), %rax
    addq    %rax, -64(%rbp)

    ## InlineAsm End
    ## InlineAsm Start
    movq    -88(%rbp), %rax
    adcq    %rax, -56(%rbp)

    ## InlineAsm End
    ## InlineAsm Start
    movq    -80(%rbp), %rax
    adcq    %rax, -48(%rbp)

    ## InlineAsm End
    ## InlineAsm Start
    movq    -72(%rbp), %rax
    adcq    %rax, -40(%rbp)

I am conscient it is working because during the optimization the compiler remove all useless instruction between the ASM block (in debug mode it failed).

What do you think ? Definitively unsafe ? Does a compiler guy know how much it will be stable ?

To conclude: I am just doing that for fun :) Yes, GMP is the solution for large arithmetic !

Dietrich Epp Dietrich Epp · Accepted Answer · 2018-02-15T16:44:50

The use of __volatile__ is an abuse.

The purpose of __volatile__ is to force the compiler to emit the assembly code at the written location, rather than relying on data flow analysis to figure this out. If you are doing ordinary manipulations of data in user space, usually you should not be using __volatile__, and if you need __volatile__ to get your code to work it almost always means that your operands are specified incorrectly.

And yes, the operands are specified incorrectly. Let's look at the first block.

__asm__ __volatile__ (
                      "movq %1, %%rax \n"
                      "addq %%rax, %0 \n"
                      : "=m"(a[0]) // output
                      : "m" (b[0]) // input only
                      : "rax", "memory");

There are two errors here.

The constraint on the output "=m"(a[0]) is incorrect. Recall that the destination for addq is both an input and an output, so the correct constraint is +, so use "+m"(a[0]). If you tell the compiler that a[0] is output only, the compiler may arrange a[0] to contain a garbage value (through dead store elimination), which is not what you want.
The flags are missing from the assembly specification. Without telling the compiler that the flags are modified, the compiler may assume that the flags are preserved across the assembly block, which will cause the compiler to generate incorrect code elsewhere.

Unfortunately, the flags are only available as output or clobber operands to assembly blocks, and are not available as inputs. So after all that fuss over specifying operands correctly so you don't use __volatile__... it turns out that there isn't a good way to specify your operands anyway!

So the recommendation here is that you should at least fix the operands you can fix, and specify "cc" as a clobber. But there are a couple better solutions that do not require __volatile__ at all...

Solution #1: Use GMP.

The mpn_ functions for addition do not allocate memory. The mpz_ functions are wrappers around the mpn_ functions with some extra logic and memory allocations.

Solution #2: Write everything in one assembly block.

If you write the entire loop in one assembly block, you don't have to worry about preserving flags between blocks. You can use assembly macros to do this. Excuse the mess, I'm not much of an assembly programmer:

template <int N>
void add(unsigned long long *dest, unsigned long long *src) {
  __asm__(
      "movq (%1), %%rax"
      "\n\taddq %%rax, (%0)"
      "\n.local add_offset"
      "\n.set add_offset,0"
      "\n.rept %P2" // %P0 means %0 but without the $ in front
      "\n.set add_offset,add_offset+8"
      "\n\tmovq add_offset(%1), %%rax"
      "\n\tadcq %%rax, add_offset(%0)"
      "\n.endr"
      :
      : "r"(dest), "r"(src), "n"(N-1)
      : "cc", "memory", "rax");
}

What this does is evaluate the loop using the .rept assembly directive. You will ultimately get 1 copy of addq and N-1 copies of adcq, although if you look at the assembly output of GCC with -S you will only see one of each. The assembler itself will create the copies, unwinding the loop.

See Gist: https://gist.github.com/depp/966fc1f4d535e31d9725cc71d97daf91

propagate carry bit between two ASM GCC inline block

1 Answers

Solution #1: Use GMP.

Solution #2: Write everything in one assembly block.