The tokenizer converts raw MJr source code into a sequence of tokens. Additionally, there are “empty” tokens representing increases or decreases in the indentation level, which are used by the parser to determine the start and end of blocks.
The tokenizer is implemented as a state machine, where the current state is one of NORMAL, PATTERN, CHARSET, QUOTE_STRING or DBLQUOTE_STRING.
Most of the work is done by the LineTokenizer class. Given a current state, the getNextToken method returns a token kind and the next state; a token of that kind is then emitted, spanning the old and new column positions on the current line. Newlines and indentation are handled by a separate loop in the tokenize function, which iterates over the input one line at a time.
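The step described above can be illustrated with a minimal sketch. It is not the actual implementation: only the five state names and the names LineTokenizer and getNextToken come from the text. The token kinds, the Token shape, the tokenizeLine helper and the scanning rules are assumptions made for illustration, and the PATTERN and CHARSET states are left out entirely.

```typescript
// Illustrative sketch only. State names are from the text; everything else
// (Kind, Token, tokenizeLine, the scanning rules) is hypothetical.
type State = "NORMAL" | "PATTERN" | "CHARSET" | "QUOTE_STRING" | "DBLQUOTE_STRING";
type Kind = "WHITESPACE" | "NAME" | "QUOTE" | "STRING_CHAR" | "OTHER";

interface Token { kind: Kind; text: string; line: number; col: number; }

class LineTokenizer {
  col = 0;
  readonly line: string;
  constructor(line: string) { this.line = line; }

  // Advance this.col while the predicate holds for the current character.
  private skipWhile(pred: (c: string) => boolean): void {
    while (this.col < this.line.length && pred(this.line.charAt(this.col))) this.col++;
  }

  // Consumes at least one character and returns [token kind, next state].
  getNextToken(state: State): [Kind, State] {
    const ch = this.line.charAt(this.col);
    if (state === "QUOTE_STRING" || state === "DBLQUOTE_STRING") {
      const closer = state === "QUOTE_STRING" ? "'" : '"';
      this.col++;
      // The matching quote returns to NORMAL; anything else stays in the string.
      return ch === closer ? ["QUOTE", "NORMAL"] : ["STRING_CHAR", state];
    }
    if (state !== "NORMAL") throw new Error(`${state} is not covered by this sketch`);
    if (ch === "'" || ch === '"') {
      this.col++;
      return ["QUOTE", ch === "'" ? "QUOTE_STRING" : "DBLQUOTE_STRING"];
    }
    const isSpace = (c: string) => c === " " || c === "\t";
    const isWord = (c: string) => /[A-Za-z0-9_]/.test(c);
    if (isSpace(ch)) { this.skipWhile(isSpace); return ["WHITESPACE", "NORMAL"]; }
    if (isWord(ch)) { this.skipWhile(isWord); return ["NAME", "NORMAL"]; }
    this.col++;
    return ["OTHER", "NORMAL"];
  }
}

// Drives the state machine over one line; each token spans the old and new columns.
function tokenizeLine(line: string, lineNo: number, state: State): { tokens: Token[]; state: State } {
  const lt = new LineTokenizer(line);
  const tokens: Token[] = [];
  while (lt.col < line.length) {
    const startCol = lt.col;
    const [kind, nextState] = lt.getNextToken(state);
    tokens.push({ kind, text: line.substring(startCol, lt.col), line: lineNo, col: startCol });
    state = nextState;
  }
  return { tokens, state };
}
```

Because tokenizeLine returns the final state, a caller iterating over lines can pass it into the next call. In this sketch an unterminated string simply leaves the machine in a string state, and escape sequences are not handled.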
Some further notes:
- C, to simplify the descriptions of which sequences of characters form which kinds of token.
- A depth counter is incremented on (, [ or {, and decremented on ), ] or }. The depth is used to ignore indentation and newlines within bracketed expressions and patterns.
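To make the role of the bracket depth concrete, here is a hedged sketch of a line-by-line loop that emits indentation and newline tokens only when the depth is zero. The function name layoutTokens, the LayoutToken shape and the token kind names INDENT, DEDENT, NEWLINE and TEXT are all hypothetical. The sketch also simplifies in ways the real tokenizer may not: each line's content becomes a single TEXT token, brackets inside strings are counted like any others, and inconsistent dedents are not reported as errors.

```typescript
// Illustrative sketch only: all names here are hypothetical, not taken from the source.
type LayoutToken = { kind: "INDENT" | "DEDENT" | "NEWLINE" | "TEXT"; line: number; text: string };

function layoutTokens(src: string): LayoutToken[] {
  const out: LayoutToken[] = [];
  const indents: number[] = [0]; // stack of currently open indentation widths
  const top = () => indents[indents.length - 1] ?? 0;
  let depth = 0; // number of unclosed (, [ or { brackets

  src.split("\n").forEach((line, i) => {
    const lineNo = i + 1;
    const width = line.search(/\S/); // column of the first non-space character, or -1
    if (width < 0) return; // blank lines do not affect layout

    // Indentation only matters outside of brackets.
    if (depth === 0) {
      if (width > top()) {
        indents.push(width);
        out.push({ kind: "INDENT", line: lineNo, text: "" });
      }
      while (width < top()) {
        indents.pop();
        out.push({ kind: "DEDENT", line: lineNo, text: "" });
      }
    }

    const body = line.trim();
    out.push({ kind: "TEXT", line: lineNo, text: body });

    // Update the bracket depth from the characters on this line.
    for (let j = 0; j < body.length; j++) {
      const ch = body.charAt(j);
      if (ch === "(" || ch === "[" || ch === "{") depth++;
      else if (ch === ")" || ch === "]" || ch === "}") depth = Math.max(0, depth - 1);
    }

    // A newline inside brackets is ignored, so the expression continues.
    if (depth === 0) out.push({ kind: "NEWLINE", line: lineNo, text: "" });
  });

  // Close any blocks still open at the end of the input.
  while (indents.length > 1) {
    indents.pop();
    out.push({ kind: "DEDENT", line: src.split("\n").length, text: "" });
  }
  return out;
}
```

For example, in an input where an opening bracket is left unclosed at the end of a line, the deeper indentation of the following lines produces no INDENT token and no NEWLINE is emitted until the line holding the closing bracket.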