MJr

Intermediate representation

The compiler does not write code in the target language directly; instead, it outputs an intermediate representation (IR) which can be translated into a variety of target languages. The IR is intended to be an abstract syntax tree for a “lowest common denominator” imperative language with the following features:

Translating via this IR also creates some opportunities for additional optimisations which may improve the performance of the program and/or reduce the size of the generated code.

Types

The IR has its own types, not all of which are shared by the MJr language’s type system:

Note that the target language only needs to support integers of at least the required sizes; for example, nothing will go wrong if byte and int are mapped to the same type in the target language, or if an array is capable of holding integers much larger than its domain size, other than more memory being used. Likewise, the target language does not need to enforce immutability of arrays or dicts, since the generated code won’t attempt to mutate these anyway. These distinctions are only made to potentially allow more optimised output code.

Additionally, it’s OK for the mutable IR.INT32_ARRAY_TYPE (which has a domain size of 2 ** 32) to use signed 32-bit integers, since in practice these arrays are either used as bitmasks (so the signs don’t matter) or otherwise the elements will not exceed 2 ** 31 - 1.

Simplifications and optimisations

The IR namespace has a number of factory functions for IR nodes; while most of these simply create a new node object with the given parameters and return it, there are some opportunities for simplifying or optimising the IR tree, in order to produce smaller and/or more efficient code output. For example, stmt.switch statements will group multiple cases together when their blocks are equal.

This requires being able to tell when two IR nodes are equal. Currently this is implemented using the IR.key function, which simply calls JSON.stringify to transform an IR node into a primitive value, in a way which preserves “equality”. For this to work consistently,

Other notes

‘Break’ and ‘continue’ statements

The IR’s stmt.break and stmt.continue statements are only used in stmt.for.range and stmt.while loops, not (directly) in stmt.switch statements. Additionally, break and continue statements are not used inside switch statements to control loops outside of the switch. This is to avoid the need for labelled breaks (which some languages don’t have), whether or not the language’s switch construct interacts with break statements.

‘For range’ statements

The IR’s stmt.for.range loop statements iterate over a range of consecutive integer values, either forwards or in reverse. The range is determined by two expressions, low (inclusive) and high (exclusive), both of which must remain constant for the duration of the loop. A forwards range is equivalent to the following C-style for loop:

for(let i = low; i < high; ++i) { ... }

A reversed range is equivalent to the following C-style for loop:

for(let i = high - 1; i >= low; --i) { ... }

‘Switch’ statements

The IR’s stmt.switch statements are always used with an integer-valued variable in some range starting at zero, and never use fall-throughs between different cases. This is so that they can be compiled in different ways depending on what the target language supports:

Variable declaration statements

Some variable declarations in the IR do not have initialisers; this allows slightly more efficient code to be generated in some cases. However, a target language might not allow an uninitialised variable to be declared, particularly when the target language’s static checker cannot determine that the variable will not be used before it is assigned.

In this case, the code generator class may need to emit a default initialiser of the appropriate type. (For TypeScript output, the code generator instead emits a “definitely assigned” assertion.)

‘yield’ statements

The animate flag is currently implemented by emitting stmt.yield statements; this means the target language must support generator function. In principle, it would be possible later to support target languages without generator functions, by outputting a class which encapsulates the program’s state and has a method to advance the program by one step.

Bitwise operators

The IR supports the bitwise operators int_and, int_or, int_xor, int_not, int_lshift and int_rshift. These are mostly self-explanatory, but have a few subtleties:

‘Loose’ integer operators

The integer operators defined by the ASG are required to have the correct behaviour for signed 32-bit integers, even when the target language does not natively have a signed 32-bit integer type. This means that, for example:

Additionally, the int_floordiv and int_mod operators have different behaviour for negative numbers compared to the similar native operators in some target languages.

Therefore, the IR supports the additional “loose” integer operators loose_int_plus, loose_int_minus, loose_int_mult, loose_int_floordiv and loose_int_mod. These operators are not required to have correct behaviour for overflows or negative operands, since the compiler must only emit them with operands that are non-negative and sufficiently small that the result cannot overflow. This allows for simpler code output in some cases.