Bytecode Generation Engines?

alcouch65 · March 27, 2020, 8:06am

Hey everyone! I’ve recently decided to move away from abstract syntax trees and work with bytecode instead. I’m following suit with the Rust compiler and its multiple layers of bytecode. However, I’m having trouble writing reliable, robust, maintainable bytecode generators, and I’m wondering if anybody knows of any libraries for this. I doubt there are any, because not enough people are working with bytecode in Java or Kotlin to warrant some kind of bytecode API. If there isn’t one, I’m thinking about making a Kotlin DSL for this, as it would be important for Beagle’s compiler.

I’m thinking it could have an API for designing the bytecode in a way that lets you decorate its metadata in the engine, as well as a means of creating “plugins” for input and output. The input would be “what are we using as a source or frame of reference for generating bytecode”. The output would be “what are we doing with the generated bytecode”. I am wondering if this would be a worthy prospect for the Strumenta community. I am still working out the details, but I have created an example of what I am imagining. I’ll make a proof of concept because I kinda need it ASAP lol.
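To make the input/output "plugin" idea above a bit more concrete, here is a minimal Kotlin sketch. Everything in it is an assumption made for illustration: the names `Instr`, `BytecodeSource`, `BytecodeSink` and `Engine` are hypothetical and do not come from any existing library or from the proof of concept mentioned in the post.

```kotlin
// Hypothetical sketch: none of these types exist in a real library.

/** One instruction plus free-form metadata that the engine can decorate. */
data class Instr(
    val opcode: String,
    val operands: List<String> = emptyList(),
    val meta: Map<String, String> = emptyMap()
)

/** Input plugin: "what are we using as a source for generating bytecode?" */
fun interface BytecodeSource<in I> {
    fun generate(input: I): List<Instr>
}

/** Output plugin: "what are we doing with the generated bytecode?" */
fun interface BytecodeSink<out O> {
    fun consume(code: List<Instr>): O
}

/** Wires a source to a sink and applies metadata decorators in between. */
class Engine<I, O>(
    private val source: BytecodeSource<I>,
    private val sink: BytecodeSink<O>,
    private val decorators: List<(Instr) -> Instr> = emptyList()
) {
    fun run(input: I): O {
        val decorated = source.generate(input).map { instr ->
            decorators.fold(instr) { acc, decorate -> decorate(acc) }
        }
        return sink.consume(decorated)
    }
}

fun demo(): String {
    // Source plugin: one instruction per line, e.g. "push 1".
    val source = BytecodeSource<String> { text ->
        text.split('\n').filter { it.isNotBlank() }.map { line ->
            val parts = line.trim().split(' ')
            Instr(parts.first(), parts.drop(1))
        }
    }
    // Sink plugin: render the instructions back to text, including the metadata.
    val sink = BytecodeSink<String> { code ->
        code.joinToString("\n") { instr ->
            val text = (listOf(instr.opcode) + instr.operands).joinToString(" ")
            "$text ; layer=${instr.meta["layer"]}"
        }
    }
    // Decorator: tag every instruction with the layer it belongs to.
    val tagLayer: (Instr) -> Instr = { it.copy(meta = it.meta + ("layer" to "HIR")) }
    return Engine(source, sink, listOf(tagLayer)).run("push 1\npush 2\nadd")
}

fun main() {
    println(demo())
}
```

The point of the split is that a new front end or back end only has to implement one small interface, while the engine in the middle stays untouched.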

The point of this is for more efficient code generation and code evaluation. This would be extremely useful for linters and language servers, virtual machines, complex compilers, and much more. I think a teamwide effort to make this would be an amazing prospect but I am wondering how realistic or in-demand something like this currently is. Even if the demand isn’t necessarily high right now, I’ll still make a proof of concept.

This would help me with my bytecode layers: so far I have made two layers for my experimental compiler, and they were hell to debug. Still, I was able to produce the following output for the experimental compiler. What you are about to see is the first layer of bytecode, “stringified” into a textual representation. This is how I was able to check that the HIR generator was working for the most part.

Source:

let hello = "Hello";

fn add(x: Int, y: Int): Int{
    return x + y;
};

fn sub(x: Int, y: Int): Int{
    return x - y;
};

fn mul(x: Int, y: Int): Int{
    return x * y;
};

fn div(x: Int, y: Int): Int{
    return x / y;
};

fn main(args: Int): Int{
    printf(hello);
    return 0;
};

HIR Bytecode Textual Format:

file 'llvm@C:\projects\kotlinx-llvm\toylang\run\files\test.toy'
global.var hello = string "Hello" 
global.fun add(param.x type.Int, param.y type.Int) type.Int{
    term.return op.add ref x, ref y
}

global.fun sub(param.x type.Int, param.y type.Int) type.Int{
    term.return op.sub ref x, ref y
}

global.fun mul(param.x type.Int, param.y type.Int) type.Int{
    term.return op.mul ref x, ref y
}

global.fun div(param.x type.Int, param.y type.Int) type.Int{
    term.return op.div ref x, ref y
}

global.fun main(param.args type.Int) type.Int{
    call.printf(ref hello)
    term.return int 0
}
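For readers wondering how a textual dump like the one above could be produced, here is a small Kotlin sketch of a data model and printer. It is only a guess at the shape of such a printer, not the actual generator from the experimental compiler; all class and function names (`Ref`, `BinOp`, `Fun`, `render`, and so on) are hypothetical.

```kotlin
// Hypothetical HIR model; the real compiler's classes are not shown in the post.
sealed class Value
data class Ref(val name: String) : Value()
data class IntConst(val value: Int) : Value()
data class BinOp(val op: String, val lhs: Value, val rhs: Value) : Value()

sealed class Stmt
data class Return(val value: Value) : Stmt()
data class Call(val callee: String, val args: List<Value>) : Stmt()

data class Param(val name: String, val type: String)
data class Fun(
    val name: String,
    val params: List<Param>,
    val returnType: String,
    val body: List<Stmt>
)

fun render(v: Value): String = when (v) {
    is Ref -> "ref ${v.name}"
    is IntConst -> "int ${v.value}"
    is BinOp -> "op.${v.op} ${render(v.lhs)}, ${render(v.rhs)}"
}

fun render(s: Stmt): String = when (s) {
    is Return -> "term.return ${render(s.value)}"
    is Call -> "call.${s.callee}(${s.args.joinToString(", ") { render(it) }})"
}

fun render(f: Fun): String {
    val params = f.params.joinToString(", ") { "param.${it.name} type.${it.type}" }
    val body = f.body.joinToString("\n") { "    " + render(it) }
    return "global.fun ${f.name}($params) type.${f.returnType}{\n$body\n}"
}

/** Builds the `add` function from the source listing. */
fun sampleAdd(): Fun = Fun(
    name = "add",
    params = listOf(Param("x", "Int"), Param("y", "Int")),
    returnType = "Int",
    body = listOf(Return(BinOp("add", Ref("x"), Ref("y"))))
)

/** Builds the `main` function from the source listing. */
fun sampleMain(): Fun = Fun(
    name = "main",
    params = listOf(Param("args", "Int")),
    returnType = "Int",
    body = listOf(Call("printf", listOf(Ref("hello"))), Return(IntConst(0)))
)

fun main() {
    println(render(sampleAdd()))
    println()
    println(render(sampleMain()))
}
```

Keeping the printer as plain `when` expressions over sealed classes means the compiler flags any instruction kind that the textual format does not yet cover.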

The code for these generators is honestly really hard for me to look at, and I feel like it should be easier. I have a feeling I’m gonna be doing this kinda stuff a lot, and I really encourage other lang devs to follow suit: bytecode is linear in memory and more concise, whereas an AST is naturally scattered across memory. An AST is also really hard to work with in a shared environment, such as a language server. I really think this could be a great endeavour.
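As a rough illustration of the "linear in memory" point, the sketch below flattens a tiny expression tree into a flat list of integers and evaluates it with a stack. The opcode numbers and the names `Expr`, `flatten` and `eval` are made up for this example and are not taken from the compiler discussed here.

```kotlin
// Hypothetical opcodes for a tiny stack machine.
const val OP_PUSH = 0
const val OP_ADD = 1
const val OP_SUB = 2
const val OP_MUL = 3
const val OP_DIV = 4

// Tree form: every node is a separate heap object.
sealed class Expr
data class Num(val value: Int) : Expr()
data class Bin(val op: Char, val left: Expr, val right: Expr) : Expr()

/** Post-order walk that emits one flat, contiguous instruction list. */
fun flatten(e: Expr, out: MutableList<Int> = mutableListOf()): List<Int> {
    when (e) {
        is Num -> {
            out.add(OP_PUSH)
            out.add(e.value)
        }
        is Bin -> {
            flatten(e.left, out)
            flatten(e.right, out)
            out.add(
                when (e.op) {
                    '+' -> OP_ADD
                    '-' -> OP_SUB
                    '*' -> OP_MUL
                    '/' -> OP_DIV
                    else -> error("unknown operator ${e.op}")
                }
            )
        }
    }
    return out
}

/** Evaluates the flat code with an explicit operand stack. */
fun eval(code: List<Int>): Int {
    val stack = ArrayList<Int>()
    var pc = 0
    while (pc < code.size) {
        val op = code[pc++]
        if (op == OP_PUSH) {
            stack.add(code[pc++])
        } else {
            val r = stack.removeAt(stack.size - 1)
            val l = stack.removeAt(stack.size - 1)
            stack.add(
                when (op) {
                    OP_ADD -> l + r
                    OP_SUB -> l - r
                    OP_MUL -> l * r
                    OP_DIV -> l / r
                    else -> error("unknown opcode $op")
                }
            )
        }
    }
    return stack.last()
}

fun main() {
    // (1 + 2) * 3
    val tree = Bin('*', Bin('+', Num(1), Num(2)), Num(3))
    val code = flatten(tree)
    println(code)
    println(eval(code))
}
```

The flat list can be copied, hashed or shared between threads as a single value, which is much of the appeal over a pointer-linked tree.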

cristian.vasile · March 27, 2020, 9:15am

Check this presentation

The library was designed with x86 in mind, but you can go with WASM (which is portable) or Clang IR, though in that case you have a dependency on Clang.

alcouch65 · March 27, 2020, 9:20am

I’m sorry, what I meant by bytecode was custom bytecode for virtual machines, language servers/linters, and compilers. All my VMs and compilers will have their own bytecode, and I don’t wanna have to rewrite all the spaghetti code every time. I’m already working on a proof of concept.

cristian.vasile · March 27, 2020, 9:32am

OK.
I remember that Haxe can run on the HashLink VM
HashLink is a virtual machine for Haxe

Take a look, it might be inspirational. For example, your compiler infrastructure could emit HashLink bytecode.

neomatrix369 · March 27, 2020, 3:32pm

Hi @alcouch65

Welcome to this forum.

I skimmed through your introduction and a few things came to mind:

check out awesome-java on GitHub
check out awesome-jvm on GitHub
GraalVM and Truffle also deal with an internal graph that is different from the AST (although derived from it)

Have you considered working with the Truffle framework, or at least looking into it? It might help resolve some pain points.

ASM, Byte Buddy and others come to mind as examples of this. But I’m not sure what your actual needs are, so it would be good to hear your feedback as well.

cristian.vasile · March 27, 2020, 9:47pm

I remember ZetaVM, and on the https://pointersgonewild.com/ blog, in an article about a JIT compiler for JavaScript, you can read:
As part of my PhD, I developed Higgs, an experimental JIT compiler for JavaScript written using the D programming language.
As part of Higgs, I had written my own x86 machine code generator, which enabled it to do machine code pirouettes LLVM couldn’t.

alcouch65 · March 27, 2020, 11:39pm

The Truffle framework would unfortunately not help me. I guess what I’m looking for is something along the lines of Ktor.io, but for customized bytecode. My compiler uses its own intermediate layers of bytecode derived from the AST, and my struggles are with the bit-by-bit assembly and disassembly of bytecode between layers. Also, I’m currently using kotlinx.serialization for the bytecode assembly but not using its deserialization for anything, which is a bit of a waste in my opinion. I also cannot be limited to Java. I would like a Kotlin Multiplatform API for customizing the bytecode and the input and output handlers, all streamlined in a way that would let me easily create a layer of bytecode analysis, assembly, conversion, etc. Not Java bytecode or anything. I don’t wanna be tethered to the JVM at all.
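The assembly/disassembly pain described above is essentially a round trip between typed instructions and raw bytes. Here is a minimal sketch of that round trip using only the Kotlin standard library (no JVM-specific classes, so it should in principle fit a multiplatform setup). Note that it hand-rolls the encoding instead of using kotlinx.serialization, and the instruction set, tag values and function names are all hypothetical.

```kotlin
// Hypothetical instruction set for illustration only.
sealed class Op {
    data class PushInt(val value: Int) : Op()
    object Add : Op()
    object Return : Op()
}

// Hypothetical one-byte tags identifying each instruction kind.
private const val TAG_PUSH_INT: Byte = 0x01
private const val TAG_ADD: Byte = 0x02
private const val TAG_RETURN: Byte = 0x03

/** Encodes instructions as: tag byte, then (for PushInt) a big-endian 32-bit operand. */
fun assemble(ops: List<Op>): ByteArray {
    val out = ArrayList<Byte>()
    for (op in ops) {
        when (op) {
            is Op.PushInt -> {
                out.add(TAG_PUSH_INT)
                for (shift in intArrayOf(24, 16, 8, 0)) {
                    out.add((op.value shr shift).toByte())
                }
            }
            Op.Add -> out.add(TAG_ADD)
            Op.Return -> out.add(TAG_RETURN)
        }
    }
    return out.toByteArray()
}

/** Decodes the byte layout written by [assemble] back into instructions. */
fun disassemble(bytes: ByteArray): List<Op> {
    val ops = mutableListOf<Op>()
    var i = 0
    while (i < bytes.size) {
        val tag = bytes[i++]
        when (tag) {
            TAG_PUSH_INT -> {
                require(i + 4 <= bytes.size) { "truncated PushInt operand" }
                var value = 0
                repeat(4) {
                    // Mask to 0..255 so negative bytes do not sign-extend.
                    value = (value shl 8) or (bytes[i++].toInt() and 0xFF)
                }
                ops.add(Op.PushInt(value))
            }
            TAG_ADD -> ops.add(Op.Add)
            TAG_RETURN -> ops.add(Op.Return)
            else -> error("unknown opcode tag $tag at offset ${i - 1}")
        }
    }
    return ops
}

fun main() {
    val program = listOf(Op.PushInt(2), Op.PushInt(-40), Op.Add, Op.Return)
    val bytes = assemble(program)
    println(bytes.size)
    println(disassemble(bytes) == program)
}
```

In a reusable library the tag table and operand layout would presumably be declared once and both directions derived from it, so the encoder and decoder cannot drift apart.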