14
votes

I realize that LLVM has a long way to go, but theoretically, can the optimizations that are in GCC/ICC/etc. for individual languages be applied to LLVM byte code? If so, does this mean that any language that compiles to LLVM byte code has the potential to be equally as fast? Or are language specific optimizations (before the LLVM bytecode stage) going to always play a large part in optimizing any specific program.

I don't know much about compilers or optimizations (only enough to be dangerous), so I apologize if this question isn't well defined.

7
LLVM is doing quite well actually; starting with version 2.8 it was producing better/faster code than GCC 4.x (for the applications I tested it with). That was using it as a cross compiler, not a runtime virtual machine. The number of optimization combinations with LLVM as a cross compiler grows exponentially relative to what GCC can offer. You can optimize any individual file, or combine the bytecode in any combination and optimize that combination; optimize from C/C++ to bytecode, or wait until after the bytecode has been linked, etc. – old_timer
My question really revolved around leveling the playing field between the languages that get lots of attention in optimizing compilers (C, C++) and the languages that don't yet have the interest or manpower (Haskell, D, OCaml, etc.) to do the level of optimization needed to bring them to the same level as C and C++. The hope was that LLVM would give all these languages similar performance, which would make choosing a language a more open choice (for my particular field, which is numerical simulations and modeling, and needs the speed). – Andrew Spott
I would think the concept of turning the language into an internal representation (bytecode, icode, or what have you), then applying a list of optimizations that are not really specific to the language, should be doable; then, when going from the internal code to the target, you have more optimizations there, which again are not tied to the original language. If these other languages can build front ends for LLVM, for example, or GCC for that matter, they should be able to take advantage of the existing optimizers. – old_timer
@dwelch: That was my original thought. However, it looks like, in order to get the best performance, you would need to do language-specific optimizations (look at the answer by Dietrich), then LLVM bytecode-specific optimizations. So, while the general performance of languages will improve (because there will be a set of "LLVM optimizations" that every language can take advantage of), the languages with the manpower behind them will ultimately be faster, because there are more people and more interest in writing language-specific optimizations. – Andrew Spott
Definitely agree with that. You need the talent and manpower. And you need the demand/interest: if nobody wants to use it, then you lose the early adopters and beta testers, and eventually the talent/manpower loses interest and moves on. Good, bad, or otherwise, this is what keeps C/C++ going. – old_timer

7 Answers

21
votes

In general, no.

For example, in Haskell a common optimization is strictness analysis, which allows the compiler to determine which arguments are always evaluated to head-normal form, so they can be forced eagerly and unboxed without changing program semantics. This is not possible at the LLVM level.

Explanation: In Haskell, a function (Int, Int) -> Int is more or less equivalent to the type in C:

typedef int (*returns_int)();
struct pair { returns_int first, second; };
typedef struct pair *(*returns_pair)();

int function(returns_pair arg);

The compiler can analyze function and determine that it always evaluates its argument and always extracts the contents, transforming the function into this:

int function(int x, int y); // note that it takes *two* arguments now

This is well beyond the capability of LLVM. Maybe, in the future, with some really heavy interprocedural optimization... but realistically speaking, this will not happen in the foreseeable future.

Example 2: There are Java VMs which can transform virtual function calls into direct function calls. However, this isn't something that LLVM can do — because this transformation must be undone dynamically if another class is loaded which implements the same interface.

In general, when you compile a program to LLVM you lose much of the semantic information about the original program. LLVM bytecode is capable of representing any code, but its type system is fairly limited — and your choice of type system affects what optimizations you can do.

10
votes

In addition to Dietrich's excellent answer, I think it is important to appreciate that it is not just the compiler that determines how fast programming languages are. Beyond the various optimizations that a given language may allow or disallow, there is also the matter of how you do certain tasks in various programming languages and what the language lets you do.

For example, it is relatively easy to optimize C code to maximize cache efficiency (reducing slow reads from memory) while this is much harder in Haskell. Pointer hacks are impossible in Java. As is the strategy of allocating a huge chunk of memory and parceling it out by hand.

Thus, some languages will always be slower simply because they don't allow for the same level of optimization. Note that I'm not necessarily saying that this is a bad thing because with that slowness come extremely powerful constructs.

I think that a better way to look at it is that LLVM will allow for a certain set of optimizations to be applied to all languages that compile down to it. Thus, while it will make such languages faster, it will not make them equally fast.

Edit: Pointer hacks in Haskell. So many possibilities ...

5
votes

It is an interesting question, but I am afraid that you are lacking some notions of what compilers do.

There will always be several stages of optimizations within a compiler.

  • Some optimizations depend on language rules; for example, in C++ you may optimize away some copy/move constructor calls.
  • Some optimizations are broadly available (loop transforms, tail calls)
  • Some optimizations depend on the hardware (SSE, MMX)

The LLVM compiler applies all 3... but starting from LLVM IR, not from C or Java or whatever.

It is the job of the front-end to provide a suitable LLVM IR.

For example, as noted by @Dietrich Epp, the IR is not really suitable for Functional Languages. A lot of Haskell optimizations must therefore be performed before the representation is lowered to the IR.

Another source of non-optimization is that a specific runtime may come with the language. Haskell has a complicated runtime with spark pools, lightweight threads, preemption before system calls, work stealing, etc. The IR is not suitable to represent this rich environment, and no optimization is made at this level of organization.

2
votes

LLVM has come a long way and is a promising alternative to GCC (it even succeeds in compiling the Linux kernel, with some patches). It is also in many cases faster than GCC (both the compiler itself and the generated executables) and has a structure that makes it quite easy to write frontends for arbitrary languages.

But to allow extensive optimizations, the frontend is equally important, since it knows many more details about the program being compiled: things the LLVM stack cannot easily find out on its own. Therefore, to generate efficient programs, the frontend also has to optimize the code.

1
votes

Coming back to your question "is it theoretically possible?", let's imagine/assume, just for the sake of discussion, the following:

  • we have bytecode, or even machine code
  • we know it was produced with a given compiler X
  • the source language was L
  • the input program had finite size.

IMHO, from a computer science point of view, it's mostly a question of resources to apply almost anything.

Now let's try the following:

function machinecode_to_sourcecode( given_machinecode ){
  it = all_possible_strings_iterator()
  while (true) {
    possible_sourcecode = it->get_string()
    machine_code = compile possible_sourcecode with X
    if (compilation succeeded and machine_code == given_machinecode)
        return possible_sourcecode
    it = it->next_string()   // always advance, even when compilation succeeded but the code differs
  }
}

So we try all possible strings as input for compiler X; when a compilation succeeds and its result equals the given machine code, we have the source.

Once we have the source, we have as much information about the program as we can get, so we are able to apply all possible optimizations.

All this looks very costly, but since you asked "theoretically", I will say

  • it will take a finite amount of time

because

  • the input source code has finite length

so

  • enumerating candidate strings shortest-first, we only have to try the finitely many strings no longer than the source, which takes a finite amount of time
  • I assumed that the compiler processes any source code in finite time (that's what we usually assume when we think about a compiler ;) ).

So the whole operation is going to take a finite amount of computation time.

0
votes

I don't know any details about the bytecode format used by LLVM, but I think the answer to your question is no.

Just consider the following: dynamic vs. static typing. A dynamically typed programming language will probably be slower than a statically typed one, because the majority of type checking is performed at run time.

There can also be some other differences between programming languages which may affect performance.

0
votes

Here I've found a report with a beginner-friendly overview of the GCC vs. LLVM differences: Clang/LLVM Maturity Evaluation Report by Dominic Fandrey.