MJr

Tokenizer

The tokenizer converts raw MJr source code into a sequence of tokens. In addition to tokens that cover source text, it emits “empty” (zero-width) tokens marking each increase or decrease in the indentation level; the parser uses these to determine where blocks start and end.
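As a rough illustration of how such indentation tokens can be produced, the sketch below compares each line's leading whitespace against a stack of enclosing indentation widths. It is not the MJr implementation: the names `IndentToken`, `INDENT`, `DEDENT` and `indentTokens` are hypothetical, and counting each space or tab as one unit of width is a simplifying assumption.

```typescript
// Hypothetical token shape; the real tokenizer's kinds and fields may differ.
type IndentToken = { kind: "INDENT" | "DEDENT"; line: number };

// Emits zero-width INDENT/DEDENT markers for a whole source string.
function indentTokens(src: string): IndentToken[] {
  const out: IndentToken[] = [];
  const stack: number[] = [0]; // indentation widths of the enclosing blocks
  const lines = src.split("\n");

  for (let i = 0; i < lines.length; i++) {
    const text = lines[i];
    if (text.trim() === "") continue; // blank lines do not change indentation

    // Assumption: every leading space or tab counts as width 1.
    const match = /^[ \t]*/.exec(text);
    const width = match ? match[0].length : 0;

    if (width > stack[stack.length - 1]) {
      // Deeper than the current block: one new block opens.
      stack.push(width);
      out.push({ kind: "INDENT", line: i + 1 });
      continue;
    }
    // Shallower: close blocks until the widths line up again.
    while (width < stack[stack.length - 1]) {
      stack.pop();
      out.push({ kind: "DEDENT", line: i + 1 });
    }
    if (width !== stack[stack.length - 1]) {
      throw new Error(`inconsistent indentation on line ${i + 1}`);
    }
  }

  // Close any blocks still open at the end of the input.
  while (stack.length > 1) {
    stack.pop();
    out.push({ kind: "DEDENT", line: lines.length });
  }
  return out;
}
```

With this sketch, a parser never has to inspect whitespace itself: a block is whatever sits between an `INDENT` marker and its matching `DEDENT`.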

The tokenizer is implemented as a state machine, where the current state is represented by:

Most of the work is done by the LineTokenizer class. Given the current state, its getNextToken method returns a token kind and the next state; a token of that kind is then emitted, spanning the old and new column positions on the current line. Newlines and indentation are handled separately, by a loop in the tokenize function that iterates over the input one line at a time.
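The getNextToken contract described above can be sketched as follows. This is a simplified stand-in, not the real LineTokenizer: the `State` fields (`col` and `mode`), the token kinds, the `RULES` table and the `tokenizeLine` driver are all assumptions made for illustration, and the actual state and kinds used by MJr may differ.

```typescript
// Hypothetical token kinds and state; only `getNextToken` is named in the text.
type Kind = "WHITESPACE" | "NAME" | "INT" | "QUOTE" | "TEXT" | "PUNCTUATION";
interface State { col: number; mode: "normal" | "quoted"; }
interface Token { kind: Kind; text: string; fromCol: number; toCol: number; }

// Rules tried in order while in "normal" mode.
const RULES: [Kind, RegExp][] = [
  ["WHITESPACE", /^[ \t]+/],
  ["NAME", /^[A-Za-z_][A-Za-z0-9_]*/],
  ["INT", /^[0-9]+/],
];

class SketchLineTokenizer {
  private line: string;
  constructor(line: string) { this.line = line; }

  // Length of the match of `re` starting at `col`, or 0 if it does not match.
  private scan(re: RegExp, col: number): number {
    const m = re.exec(this.line.slice(col));
    return m ? m[0].length : 0;
  }

  // Returns the kind of the token starting at `state.col`, plus the next state.
  getNextToken(state: State): [Kind, State] {
    const { col, mode } = state;
    if (this.line.charAt(col) === "'") {
      // A quote toggles between normal and quoted mode.
      return ["QUOTE", { col: col + 1, mode: mode === "normal" ? "quoted" : "normal" }];
    }
    if (mode === "quoted") {
      // Inside quotes, everything up to the next quote is one TEXT token.
      return ["TEXT", { col: col + this.scan(/^[^']+/, col), mode }];
    }
    for (const [kind, re] of RULES) {
      const n = this.scan(re, col);
      if (n > 0) return [kind, { col: col + n, mode }];
    }
    // Fallback: any other single character.
    return ["PUNCTUATION", { col: col + 1, mode }];
  }
}

// Driver: each token is bounded by the old and new column positions.
function tokenizeLine(line: string): Token[] {
  const tokenizer = new SketchLineTokenizer(line);
  const tokens: Token[] = [];
  let state: State = { col: 0, mode: "normal" };
  while (state.col < line.length) {
    const [kind, next] = tokenizer.getNextToken(state);
    tokens.push({ kind, text: line.slice(state.col, next.col), fromCol: state.col, toCol: next.col });
    state = next;
  }
  return tokens;
}
```

Because getNextToken never builds tokens itself, the driver is the only place that needs to know about column positions, and every branch advances `col` by at least one character, so the loop always terminates.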

Some further notes: