[Parser] Build: PL By:JS

  • 2017-12-02
  • 33
  • 0

origin: http://lisperator.net/pltut/

Summary

var ast = parse(TokenStream(InputStream(code)));

Writing a parser is, depending on the language, a moderately complex task.

In essence, it must transform a piece of code (which we inspect by looking at the characters) into an “abstract syntax tree” (AST).

The AST is a structured in-memory representation of the program, and it's “abstract” in the sense that it does not care exactly what characters is the source code made of, but it faithfully represents the semantics of it.

For example, for the following program text:

sum = lambda(a, b) {
  a + b;
};
print(sum(1, 2));

our parser will generate the following AST, as a JavaScript object:

{
  type: "prog",
  prog: [
    // first line:
    {
      type: "assign",
      operator: "=",
      left: { type: "var", value: "sum" },
      right: {
        type: "lambda",
        vars: [ "a", "b" ],
        body: {
          // the body should be a "prog", but because
          // it contains a single expression, our parser
          // reduces it to the expression itself.
          type: "binary",
          operator: "+",
          left: { type: "var", value: "a" },
          right: { type: "var", value: "b" }
        }
      }
    },
    // second line:
    {
      type: "call",
      func: { type: "var", value: "print" },
      args: [{
        type: "call",
        func: { type: "var", value: "sum" },
        args: [ { type: "num", value: 1 },
                { type: "num", value: 2 } ]
      }]
    }
  ]
}

The main difficulty in writing a parser consists in a failure to properly organize the code.

The parser should operate at a higher level than reading characters from a string. A few advices on how to keep complexity manageable:

  • Write many functions and keep them small. In every function, do one thing and do it well.

  • Do not try to use regexps for parsing. They don't work. Regexps can be helpful in the lexer though, but I suggest to limit them to very simple things.

  • Don't attempt to guess. When unsure how to parse something, throw an error and make sure the message contains the error location (line/column).

To keep it simple I've split my code in three parts, which are further divided into many small functions:

  • The character input stream
  • The token input stream (lexer)
  • The parser

Input stream

We're going to create a “stream object” which provides operations to read characters from a string. A stream object has 4 methods:

  • peek() — returns the next value but without removing it from the stream.

  • next() — returns the next value and also discards it from the stream.

  • eof() — returns true if and only if there are no more values in the stream.

  • croak(msg) — does throw new Error(msg).

The reason why I'm including the last one is that the stream can easily keep track of the current location (i.e. line/column), which is important to display in the case of an error message.

Feel free to add more methods here, depending on your needs.

The character input stream deals with characters, so the values that next() / peek() return are chars (well, since JS doesn't have a char type, they're strings containing one character).

Here is the full code of this object, which I will call “InputStream”. It's small enough and you should have no problem to understand it:

function InputStream(input) {
    var pos = 0, line = 1, col = 0;
    return {
        next  : next,
        peek  : peek,
        eof   : eof,
        croak : croak,
    };
    function next() {
        var ch = input.charAt(pos++);
        if (ch == "\n") line++, col = 0; else col++;
        return ch;
    }
    function peek() {
        return input.charAt(pos);
    }
    function eof() {
        return peek() == "";
    }
    function croak(msg) {
        throw new Error(msg + " (" + line + ":" + col + ")");
    }
}

Note that it's not a standard object (the kind you create with new). You just do var stream = InputStream(string) to get a stream object.

Next we're going to write another abstraction on top of this object: the tokenizer.

Token Stream

The tokenizer (also called “lexer”) operates on a character input stream and returns a stream object with the same interface, but the values returned by peek() / next() will be tokens. A token is an object with two properties: type and value. Here are some examples with supported tokens:

{ type: "punc", value: "(" }           // punctuation: parens, comma, semicolon etc.
{ type: "num", value: 5 }              // numbers
{ type: "str", value: "Hello World!" } // strings
{ type: "kw", value: "lambda" }        // keywords
{ type: "var", value: "a" }            // identifiers
{ type: "op", value: "!=" }            // operators

Whitespace and comments are skipped over, no tokens are returned.

In order to write the tokenizer we need to look more closely into the syntax of our language. The idea is to notice that depending on the current character (as returned by input.peek()) we can decide what kind of token to read:

  • First off, skip over whitespace.
  • If input.eof() then return null.
  • If it's a sharp sign (#), skip comment (retry after - the end of line).
  • If it's a quote then read a string.
  • If it's a digit, then we proceed to read a number.
  • If it's a “letter”, then read an identifier or a keyword token.
  • If it's one of the punctuation characters, return a punctuation token.
  • If it's one of the operator characters, return an operator token.
  • If none of the above, error out with input.croak().

So here's the “read_next” function — the “core” of the tokenizer — which implements the above:

function read_next() {
    read_while(is_whitespace);
    if (input.eof()) return null;
    var ch = input.peek();
    if (ch == "#") {
        skip_comment();
        return read_next();
    }
    if (ch == '"') return read_string();
    if (is_digit(ch)) return read_number();
    if (is_id_start(ch)) return read_ident();
    if (is_punc(ch)) return {
        type  : "punc",
        value : input.next()
    };
    if (is_op_char(ch)) return {
        type  : "op",
        value : read_while(is_op_char)
    };
    input.croak("Can't handle character: " + ch);
}

This is a “dispatcher” function and it's what next() will call in order to fetch the next token.
Note it uses many utilities that are focused on particular token types, like read_string(), read_number() etc. There's no point to complicate the dispatcher with code from those functions, even if we never call them elsewhere.

Another thing to notice is that we don't consume all the input stream in one step. Each time the parser will call for next token, we read one token. In case of a parse error we don't even reach the end of the stream.


read_ident() will read characters as long as they are allowed as part of an identifier (is_id). Identifiers must start with a letter, or λ or _, and can contain further such characters, or digits, or one of the following: ?!-<>=. Therefore, foo-bar will not be read as three tokens but as a single identifier (a "var" token). The reason for this rule is that I'd like to be able to define functions named is-pair? or string>= (sorry, it's the Lisper in me).


Also, the read_ident() function will check the identifier against the list of known keywords, and if it's there it will return a "kw" token, instead of a "var" one.

I think the code pretty much speaks for itself now, so here is the complete tokenizer for our language. Couple of small other notes below.

function TokenStream(input) {
    var current = null;
    var keywords = " if then else lambda λ true false ";
    return {
        next  : next,
        peek  : peek,
        eof   : eof,
        croak : input.croak
    };
    function is_keyword(x) {
        return keywords.indexOf(" " + x + " ") >= 0;
    }
    function is_digit(ch) {
        return /[0-9]/i.test(ch);
    }
    function is_id_start(ch) {
        return /[a-zλ_]/i.test(ch);
    }
    function is_id(ch) {
        return is_id_start(ch) || "?!-<>=0123456789".indexOf(ch) >= 0;
    }
    function is_op_char(ch) {
        return "+-*/%=&|<>!".indexOf(ch) >= 0;
    }
    function is_punc(ch) {
        return ",;(){}[]".indexOf(ch) >= 0;
    }
    function is_whitespace(ch) {
        return " \t\n".indexOf(ch) >= 0;
    }
    function read_while(predicate) {
        var str = "";
        while (!input.eof() && predicate(input.peek()))
            str += input.next();
        return str;
    }
    function read_number() {
        var has_dot = false;
        var number = read_while(function(ch){
            if (ch == ".") {
                if (has_dot) return false;
                has_dot = true;
                return true;
            }
            return is_digit(ch);
        });
        return { type: "num", value: parseFloat(number) };
    }
    function read_ident() {
        var id = read_while(is_id);
        return {
            type  : is_keyword(id) ? "kw" : "var",
            value : id
        };
    }
    function read_escaped(end) {
        var escaped = false, str = "";
        input.next();
        while (!input.eof()) {
            var ch = input.next();
            if (escaped) {
                str += ch;
                escaped = false;
            } else if (ch == "\\") {
                escaped = true;
            } else if (ch == end) {
                break;
            } else {
                str += ch;
            }
        }
        return str;
    }
    function read_string() {
        return { type: "str", value: read_escaped('"') };
    }
    function skip_comment() {
        read_while(function(ch){ return ch != "\n" });
        input.next();
    }
    function read_next() {
        read_while(is_whitespace);
        if (input.eof()) return null;
        var ch = input.peek();
        if (ch == "#") {
            skip_comment();
            return read_next();
        }
        if (ch == '"') return read_string();
        if (is_digit(ch)) return read_number();
        if (is_id_start(ch)) return read_ident();
        if (is_punc(ch)) return {
            type  : "punc",
            value : input.next()
        };
        if (is_op_char(ch)) return {
            type  : "op",
            value : read_while(is_op_char)
        };
        input.croak("Can't handle character: " + ch);
    }
    function peek() {
        return current || (current = read_next());
    }
    function next() {
        var tok = current;
        current = null;
        return tok || read_next();
    }
    function eof() {
        return peek() == null;
    }
}
  • The next() function doesn't always call read_next(), because it might have been peeked before (in which case read_next() was already called and the stream advanced). Therefore we need a current variable which keeps track of the current token.

  • We only support decimal numbers with the usual notation (no 1E5 stuff, no hex, no octal). But if we ever need more, the changes go only in read_number() and are pretty easy to do.

  • Unlike JavaScript, the only characters that cannot appear unquoted in a string are the quote character itself and the backslash. You need to backslash them. Otherwise strings can contain hard newlines, tabs, and whatnot. We don't interpret the usual escapes like \n, \t etc. though again, the changes would be pretty trivial (in “read_string”).

We now have sufficiently powerful tools to easily write the parser, but first I'd recommend you to look at the description of the AST.

AST

num { type: "num", value: NUMBER }
str { type: "str", value: STRING }
bool { type: "bool", value: true or false }
var { type: "var", value: NAME }
lambda { type: "lambda", vars: [ NAME... ], body: AST }
call { type: "call", func: AST, args: [ AST... ] }
if { type: "if", cond: AST, then: AST, else: AST }
assign { type: "assign", operator: "=", left: AST, right: AST }
binary { type: "binary", operator: OPERATOR, left: AST, right: AST }
prog { type: "prog", prog: [ AST... ] }
let { type: "let", vars: [ VARS... ], body: AST }

Example:

let (a = 10, b = a * 10) {
  a + b;
}

To:

{
  "type": "let",
  "vars": [
    {
      "name": "a",
      "def": { "type": "num", "value": 10 }
    },
    {
      "name": "b",
      "def": {
        "type": "binary",
        "operator": "*",
        "left": { "type": "var", "value": "a" },
        "right": { "type": "num", "value": 10 }
      }
    }
  ],
  "body": {
    "type": "binary",
    "operator": "+",
    "left": { "type": "var", "value": "a" },
    "right": { "type": "var", "value": "b" }
  }
}

The parser

The parser creates AST nodes that are described in the AST section.

Thanks to the work we did in the tokenizer, the parser operates on a stream of tokens instead of dealing with individual characters. It still defines many helpers to keep complexity down. I'll discuss here the main functions that comprise the parser. Let's start with a high level one, the lambda parser:

function parse_lambda() {
    return {
        type: "lambda",
        vars: delimited("(", ")", ",", parse_varname),
        body: parse_expression()
    };
}

This function will be invoked when the lambda keyword has already been seen and “eaten” from the input, so all it cares for is to parse the argument names; but they're in parentheses and delimited by commas. Rather than placing that code in parse_lambda, I preferred to write a delimited function that takes these arguments: the start token, the end token, the separator, and a function which parses whatever must be between those start/end tokens. In this case, it's parse_varname, which takes care to throw an error if it encounters anything which doesn't look like a variable. The body of the function is an expression, so we get it with parse_expression.

delimited is a bit lower-level:

function delimited(start, stop, separator, parser) {
    var a = [], first = true;
    skip_punc(start);
    while (!input.eof()) {
        if (is_punc(stop)) break;
        if (first) first = false; else skip_punc(separator);
        if (is_punc(stop)) break; // the last separator can be missing
        a.push(parser());
    }
    skip_punc(stop);
    return a;
}

As you can see, it uses more utilities: is_punc and skip_punc. The former will return true if the current token is the given punctuation sign (without “eating” it), while skip_punc will ensure that the current token is that punctuation (throws an error otherwise) and will discard it from the input.

The function that parses the whole program is probably the simplest:

function parse_toplevel() {
    var prog = [];
    while (!input.eof()) {
        prog.push(parse_expression());
        if (!input.eof()) skip_punc(";");
    }
    return { type: "prog", prog: prog };
}

Since we have no statements, we simply call parse_expression() and read expressions until we get to the end of the input. Using skip_punc(";") we demand semicolons between these expressions.

Another simple example: parse_if():

function parse_if() {
    skip_kw("if");
    var cond = parse_expression();
    if (!is_punc("{")) skip_kw("then");
    var then = parse_expression();
    var ret = { type: "if", cond: cond, then: then };
    if (is_kw("else")) {
        input.next();
        ret.else = parse_expression();
    }
    return ret;
}

It skips over the if keyword with skip_kw (and this throws an error if the current token is not the given keyword), reads the condition using parse_expression(). Next, if the consequent branch doesn't start with a { we require the keyword then to be present (I feel like the syntax is too scarce without it). The branches are just expressions, so again we use parse_expression() for them. The else branch is optional so we need to check if the keyword is present before parsing it.

Having many small utilities helps a lot in keeping the code simple. We almost write the parser like we had a high level language dedicated for parsing. All these functions are “mutually recursive”, e.g.: there's a parse_atom() function which is the main dispatcher — based on the current token it calls other functions. One of them is parse_if() (called when the current token is if) and that in turn calls parse_expression(). But parse_expression() calls parse_atom(). The reason why there's no infinite loop is that at each step, one function or another will advance at least one token.

This kind of parser is called a “recursive descent parser” and it's probably the easiest kind to write manually.

**Lower level: parse_atom() and parse_expression() **

parse_atom() does the main dispatching job, depending on the current token:

function parse_atom() {
    return maybe_call(function(){
        if (is_punc("(")) {
            input.next();
            var exp = parse_expression();
            skip_punc(")");
            return exp;
        }
        if (is_punc("{")) return parse_prog();
        if (is_kw("if")) return parse_if();
        if (is_kw("true") || is_kw("false")) return parse_bool();
        if (is_kw("lambda") || is_kw("λ")) {
            input.next();
            return parse_lambda();
        }
        var tok = input.next();
        if (tok.type == "var" || tok.type == "num" || tok.type == "str")
            return tok;
        unexpected();
    });
}

If it sees an open paren, then it must be a parenthesized expression — thus, skip over paren, call parse_expression() and expect a closing paren. If it sees some keyword, it calls the appropriate parser function. If it sees a constant or an identifier, it's just returned as is. And if nothing works, unexpected() will throw an error.

When an atomic expression is expected and it sees {, it calls parse_prog to parse a sequence of expressions. That's defined below. It will do some minor optimization at this point — if the prog is empty, then it just returns FALSE. If it has a single expression, it is returned instead of a "prog" node. Otherwise it returns a "prog" node containing the expressions.

// we're going to use the FALSE node in various places,
// so I'm making it a global.
var FALSE = { type: "bool", value: false };

function parse_prog() {
    var prog = delimited("{", "}", ";", parse_expression);
    if (prog.length == 0) return FALSE;
    if (prog.length == 1) return prog[0];
    return { type: "prog", prog: prog };
}

Here's the parse_expression() function. Contrary to parse_atom(), this one will extend an expression as much as possible to the right using maybe_binary(), which is explained below.

function parse_expression() {
    return maybe_call(function(){
        return maybe_binary(parse_atom(), 0);
    });
}

The maybe_* functions

These functions check what follows after an expression in order to decide whether to wrap that expression in another node, or just return it as is.

maybe_call() is very simple. It receives a function that is expected to parse the current expression. If after that expression it sees a ( punctuation token, then it must be a "call" node, which is what parse_call() makes (included below). Notice again how delimited() comes in handy for reading the argument list.

function maybe_call(expr) {
    expr = expr();
    return is_punc("(") ? parse_call(expr) : expr;
}

function parse_call(func) {
    return {
        type: "call",
        func: func,
        args: delimited("(", ")", ",", parse_expression)
    };
}

Operator precedence

maybe_binary(left, my_prec) is used to compose binary expressions like 1 + 2 * 3. The trick to parse them properly is to correctly define the operator precedence, so we'll start with that:

var PRECEDENCE = {
    "=": 1,
    "||": 2,
    "&&": 3,
    "<": 7, ">": 7, "<=": 7, ">=": 7, "==": 7, "!=": 7,
    "+": 10, "-": 10,
    "*": 20, "/": 20, "%": 20,
};

This says that * binds tighter than +, so an expression like 1 + 2 * 3 must be read as (1 + (2 * 3)) instead of ((1 + 2) * 3), which would be the normal left-to-right order in which the parser operates.

The trick is to read an atomic expression (only the 1) and pass it to maybe_binary() (the left argument), along with the current precedence (the my_prec). maybe_binary will look at what follows. If it doesn't see an operator, or if it has a smaller priority, then left is returned as is.

If it's an operator that has a higher precedence than ours, then it wraps left in a new "binary" node, and for the right side it repeats the trick at the new precedence level (*):

function maybe_binary(left, my_prec) {
    var tok = is_op();
    if (tok) {
        var his_prec = PRECEDENCE[tok.value];
        if (his_prec > my_prec) {
            input.next();
            var right = maybe_binary(parse_atom(), his_prec) // (*);
            var binary = {
                type     : tok.value == "=" ? "assign" : "binary",
                operator : tok.value,
                left     : left,
                right    : right
            };
            return maybe_binary(binary, my_prec);
        }
    }
    return left;
}

Note that before returning the binary expression we must also call maybe_binary at the old precedence level (my_prec), in order to wrap the expression in another one, should an operator with a higher precedence follow. If all this is confusing, read the code again and again (perhaps try to execute it mentally on some input expressions) until you get it.

Finally, since my_prec is initially zero, any operator will trigger the building of a "binary" node (or "assign" when the operator is =).

There are a few more functions in the parser, so I'm including the whole parse function below. Click “Show code” to display it (about 150 lines).

var FALSE = { type: "bool", value: false };
function parse(input) {
    var PRECEDENCE = {
        "=": 1,
        "||": 2,
        "&&": 3,
        "<": 7, ">": 7, "<=": 7, ">=": 7, "==": 7, "!=": 7,
        "+": 10, "-": 10,
        "*": 20, "/": 20, "%": 20,
    };
    return parse_toplevel();
    function is_punc(ch) {
        var tok = input.peek();
        return tok && tok.type == "punc" && (!ch || tok.value == ch) && tok;
    }
    function is_kw(kw) {
        var tok = input.peek();
        return tok && tok.type == "kw" && (!kw || tok.value == kw) && tok;
    }
    function is_op(op) {
        var tok = input.peek();
        return tok && tok.type == "op" && (!op || tok.value == op) && tok;
    }
    function skip_punc(ch) {
        if (is_punc(ch)) input.next();
        else input.croak("Expecting punctuation: \"" + ch + "\"");
    }
    function skip_kw(kw) {
        if (is_kw(kw)) input.next();
        else input.croak("Expecting keyword: \"" + kw + "\"");
    }
    function skip_op(op) {
        if (is_op(op)) input.next();
        else input.croak("Expecting operator: \"" + op + "\"");
    }
    function unexpected() {
        input.croak("Unexpected token: " + JSON.stringify(input.peek()));
    }
    function maybe_binary(left, my_prec) {
        var tok = is_op();
        if (tok) {
            var his_prec = PRECEDENCE[tok.value];
            if (his_prec > my_prec) {
                input.next();
                return maybe_binary({
                    type     : tok.value == "=" ? "assign" : "binary",
                    operator : tok.value,
                    left     : left,
                    right    : maybe_binary(parse_atom(), his_prec)
                }, my_prec);
            }
        }
        return left;
    }
    function delimited(start, stop, separator, parser) {
        var a = [], first = true;
        skip_punc(start);
        while (!input.eof()) {
            if (is_punc(stop)) break;
            if (first) first = false; else skip_punc(separator);
            if (is_punc(stop)) break;
            a.push(parser());
        }
        skip_punc(stop);
        return a;
    }
    function parse_call(func) {
        return {
            type: "call",
            func: func,
            args: delimited("(", ")", ",", parse_expression),
        };
    }
    function parse_varname() {
        var name = input.next();
        if (name.type != "var") input.croak("Expecting variable name");
        return name.value;
    }
    function parse_if() {
        skip_kw("if");
        var cond = parse_expression();
        if (!is_punc("{")) skip_kw("then");
        var then = parse_expression();
        var ret = {
            type: "if",
            cond: cond,
            then: then,
        };
        if (is_kw("else")) {
            input.next();
            ret.else = parse_expression();
        }
        return ret;
    }
    function parse_lambda() {
        return {
            type: "lambda",
            vars: delimited("(", ")", ",", parse_varname),
            body: parse_expression()
        };
    }
    function parse_bool() {
        return {
            type  : "bool",
            value : input.next().value == "true"
        };
    }
    function maybe_call(expr) {
        expr = expr();
        return is_punc("(") ? parse_call(expr) : expr;
    }
    function parse_atom() {
        return maybe_call(function(){
            if (is_punc("(")) {
                input.next();
                var exp = parse_expression();
                skip_punc(")");
                return exp;
            }
            if (is_punc("{")) return parse_prog();
            if (is_kw("if")) return parse_if();
            if (is_kw("true") || is_kw("false")) return parse_bool();
            if (is_kw("lambda") || is_kw("λ")) {
                input.next();
                return parse_lambda();
            }
            var tok = input.next();
            if (tok.type == "var" || tok.type == "num" || tok.type == "str")
                return tok;
            unexpected();
        });
    }
    function parse_toplevel() {
        var prog = [];
        while (!input.eof()) {
            prog.push(parse_expression());
            if (!input.eof()) skip_punc(";");
        }
        return { type: "prog", prog: prog };
    }
    function parse_prog() {
        var prog = delimited("{", "}", ";", parse_expression);
        if (prog.length == 0) return FALSE;
        if (prog.length == 1) return prog[0];
        return { type: "prog", prog: prog };
    }
    function parse_expression() {
        return maybe_call(function(){
            return maybe_binary(parse_atom(), 0);
        });
    }
}

Credit

The moment I understood how to write a non-trivial parser occurred while studying Marijn Haverbeke's parse-js library (Common Lisp). The parser above, although for a much simpler language, is modeled after his code.

评论

还没有任何评论,你来说两句吧

正在获取,请稍候...
00:00/00:00