Implementing a Language in C# - Part 5: Parsing

This post is part of a series entitled Implementing a Language in C#. Click here to view the first post which covers some of the preliminary information on creating a language. Click here to view the last post in the series, which covers the Abstract Syntax Tree and error recovery in the parser. You can also view all of the posts in the series by clicking here.

As you may have noticed in Part 3, transforming well-formed EBNF into a recursive descent parser is actually quite easy. In this part, I will show you some of the pieces that were not as simple as the few examples I showed there.

First off: method calls. A method call is any expression followed by an invocation, that is, an argument list in parentheses. print() translates to a method call containing an identifier. console.print() is a reference inside of a method call. (console.print)() is equivalent to the previous example. This makes the language more flexible at the cost of parser clarity. This open-ended definition of a method call also means that we will need to break the "one rule per method" guideline. This is how we correctly transition from a reference to a method call:

Take(TokenKind.Dot);
if (_current == TokenKind.Identifier)
{
    var expr = ParseIdentiferExpression();
    references.Add(expr);
}

if (_current == TokenKind.LeftParenthesis)
{
    var expr = new ReferenceExpression(CreateSpan(hint), references);
    return ParseMethodCallExpression(expr);
}

As you can see, we finish the reference and then transition to a method call. This is in contrast to PineTree, where I would just add the method call to the list of references (here and here). The new approach will make building the interpreter easier because we will know that something is a method call before we need to look up the method. Here is some example pseudo-code to show what I mean:

Method call in reference (see PineTree):

if (reference.Last() is MethodCallExpression)
{
    // evaluate the rest of the reference and keep the value as referenceValue
    return EvaluateMethodCall(reference.Last(), referenceValue);
}
...
EvaluateMethodCall(MethodCallExpression expr, object reference)
{
    var arguments = EvaluateArguments(expr.Arguments);
    if (reference != null)
    {
        // the call belongs to whatever the reference evaluated to
        return reference.FindMethod(expr.MethodName).Call(arguments);
    }
    // else if ... more special cases
    else
    {
        return FindGlobalMethod(expr.MethodName).Call(arguments);
    }
}

Reference in method call:

var arguments = EvaluateArguments(methodCall.Arguments);
var method = null;

if(methodCall.Reference is ReferenceExpression) 
{
    method = Evaluate(methodCall.Reference);
}

if(method != null) 
{
    return method.Call(arguments);
}
else 
{
    throw new RuntimeException("No method!");
}

As you can see, putting the reference inside the method call removes the need to keep and pass around a reference value, and it makes the runtime code easier to understand overall. This does, however, come at the cost of some additional confusion during parsing.
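To make the "reference in method call" shape a little more concrete, here is a minimal, self-contained sketch of an evaluator built around it. It is not GlassScript's runtime: the node classes are stripped-down stand-ins, and SketchInterpreter, Define, and the flat method table keyed by dotted path are assumptions made purely for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, stripped-down AST nodes; the real classes carry spans and more data.
public abstract class Expression { }

public sealed class ConstantExpression : Expression
{
    public object Value { get; }
    public ConstantExpression(object value) => Value = value;
}

public sealed class ReferenceExpression : Expression
{
    public IReadOnlyList<string> Names { get; }
    public ReferenceExpression(params string[] names) => Names = names;
}

public sealed class MethodCallExpression : Expression
{
    public Expression Reference { get; }
    public IReadOnlyList<Expression> Arguments { get; }

    public MethodCallExpression(Expression reference, params Expression[] arguments)
    {
        Reference = reference;
        Arguments = arguments;
    }
}

public sealed class SketchInterpreter
{
    // Assumption: a flat table keyed by dotted path stands in for real scopes and objects.
    private readonly Dictionary<string, Func<object[], object>> _methods =
        new Dictionary<string, Func<object[], object>>();

    public void Define(string path, Func<object[], object> body) => _methods[path] = body;

    public object Evaluate(Expression expression)
    {
        switch (expression)
        {
            case ConstantExpression constant:
                return constant.Value;

            case ReferenceExpression reference:
                // Resolving a reference yields the callable itself, or null if it is unknown.
                return _methods.TryGetValue(string.Join(".", reference.Names), out var found)
                    ? found
                    : null;

            case MethodCallExpression call:
                var arguments = call.Arguments.Select(a => Evaluate(a)).ToArray();
                // The node already says "this is a call", so no reference value is threaded through.
                if (Evaluate(call.Reference) is Func<object[], object> method)
                {
                    return method(arguments);
                }
                throw new InvalidOperationException("No method!");

            default:
                throw new NotSupportedException(expression.GetType().Name);
        }
    }
}
```

With this shape, console.print(42) would be built as a MethodCallExpression wrapping a ReferenceExpression("console", "print"), and the evaluator never has to inspect the last element of a reference list to discover that it is dealing with a call.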


Lambdas Ain't Any Better!

Much more confusion comes from the ParseOverrideExpression method. This is because, following the strict EBNF, this method should be no more than four lines, but it's 18! Plus, it has this nightmare of an expression:

if (_current == TokenKind.Identifier 
&& (_next == TokenKind.Comma 
    || (_next == TokenKind.RightParenthesis && Peek(2) == TokenKind.FatArrow) 
    || _next == TokenKind.Colon
   ))

This one condition just checks whether the upcoming tokens match the start of a parameter list. There are ways to clean this up, and it is something we will explore in a later part.
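In the meantime, one possible cleanup (not necessarily the one the series will end up using) is to move the check into a named helper and give each case its own variable. The sketch below is self-contained and rests on a few assumptions: the TokenKind enum is cut down to the kinds the check needs, LookaheadSketch and IsParameterListStart are hypothetical names, Peek(0) is taken to mean the current token, and the comments showing lambda syntax are guesses based on the tokens involved.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical subset of token kinds, just enough for this check.
public enum TokenKind
{
    Identifier, Comma, Colon, LeftParenthesis, RightParenthesis, FatArrow, EndOfFile
}

public sealed class LookaheadSketch
{
    private readonly IReadOnlyList<TokenKind> _tokens;
    private readonly int _index;

    public LookaheadSketch(IReadOnlyList<TokenKind> tokens, int index = 0)
    {
        _tokens = tokens;
        _index = index;
    }

    // Assumption: Peek(0) is the current token, Peek(1) the next one, and so on.
    private TokenKind Peek(int ahead) =>
        _index + ahead < _tokens.Count ? _tokens[_index + ahead] : TokenKind.EndOfFile;

    // Same test as the inline condition, split into named cases.
    public bool IsParameterListStart()
    {
        if (Peek(0) != TokenKind.Identifier)
        {
            return false;
        }

        var next = Peek(1);
        bool moreParameters = next == TokenKind.Comma;            // e.g. a, b) => ...
        bool annotatedParameter = next == TokenKind.Colon;        // e.g. a: ... (assumed syntax)
        bool singleParameter = next == TokenKind.RightParenthesis
            && Peek(2) == TokenKind.FatArrow;                      // e.g. a) => ...

        return moreParameters || annotatedParameter || singleParameter;
    }
}
```

The logic is unchanged; the only gain is that the call site shrinks to a single readable condition such as if (IsParameterListStart()).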


Public APIs

The final thing that I will cover is the introduction of public APIs into your parser. It is a good idea to include some methods that serve as entry points into your parser. GlassScript exposes ParseStatement, ParseExpression, and ParseFile for outside use. Each of these accepts a SourceCode and the output of a lexer, and ParseFile and ParseStatement also have overloads that take a GlassScriptParserOptions object. The options object includes flags that change the behavior of the parser, covering features such as optional semicolons and root statements in a document.

All these methods do is initialize the parser with a new SourceCode, options, and set of tokens. They then call into the appropriate private Parse method and return the result.

public Expression ParseExpression(SourceCode sourceCode, IEnumerable<Token> tokens)
{
    InitializeParser(sourceCode, tokens, GlassScriptParserOptions.OptionalSemicolons);

    return ParseExpression();
}

public SourceDocument ParseFile(SourceCode sourceCode, IEnumerable<Token> tokens)
{
    return ParseFile(sourceCode, tokens, GlassScriptParserOptions.Default);
}

public SourceDocument ParseFile(SourceCode sourceCode, IEnumerable<Token> tokens, GlassScriptParserOptions options)
{
    InitializeParser(sourceCode, tokens, options);

    return ParseDocument();
}

public SyntaxNode ParseStatement(SourceCode sourceCode, IEnumerable<Token> tokens)
{
    return ParseStatement(sourceCode, tokens, GlassScriptParserOptions.Default);
}

public SyntaxNode ParseStatement(SourceCode sourceCode, IEnumerable<Token> tokens, GlassScriptParserOptions options)
{
    InitializeParser(sourceCode, tokens, options);

    return ParseStatement();
}
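For readers wondering how such an options object could work on the inside, here is a minimal sketch assuming it is a [Flags] enum. The post only shows the Default and OptionalSemicolons members being used, so everything else here is hypothetical: SketchParserOptions, SketchParser, the AllowRootStatements name, and the two query properties are invented for illustration.

```csharp
using System;

// Hypothetical stand-in for GlassScriptParserOptions; only the idea of flags comes from the post.
[Flags]
public enum SketchParserOptions
{
    None = 0,
    OptionalSemicolons = 1 << 0,
    AllowRootStatements = 1 << 1, // name assumed for illustration
}

public sealed class SketchParser
{
    private SketchParserOptions _options;

    // Mirrors the role of InitializeParser: store the options before any parsing starts.
    public void Initialize(SketchParserOptions options) => _options = options;

    // True when a statement may end without a semicolon.
    public bool SemicolonIsOptional =>
        (_options & SketchParserOptions.OptionalSemicolons) != 0;

    // True when statements may appear at the top level of a document.
    public bool RootStatementsAllowed =>
        (_options & SketchParserOptions.AllowRootStatements) != 0;
}
```

Because the flags combine with the | operator, a caller could enable several features at once, for example Initialize(SketchParserOptions.OptionalSemicolons | SketchParserOptions.AllowRootStatements).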

Conclusion

That concludes this short post. Next time, our language might actually become useful; we will be covering the creation of a runtime. As always, the source code for this post can be found on GitHub.