Implementing a Language in C# - Part 3: Parser Introduction

This post is part of a series entitled Implementing a Language in C#. Click here to view the first post which covers some of the preliminary information on creating a language. Click here to view the last post in the series, which covers building the Lexer. You can also view all of the posts in the series by clicking here.
I was originally going to cover the entire parser in one single post, but I found that there was just too much to cover without turning this into a 5000-word essay on the intricacies of the parsing process. This post will cover some of the considerations we must take into account when building our parser.
The Parser is one of the most complex parts of the compilation process. During parsing, the tokens that were generated by the Lexer are read and their meaning is understood. This stage is also the last of what is called the "front end" of a compiler.
A Tangent on Interpreted, Virtual Machine, and Compiled Languages
Now is where I should take a moment to explain the high-level stages of a compiler. The lexer and parser collectively are the front end -- the part of the compiler that deals with the user's input. The front end generates an Abstract Syntax Tree (AST), also known as a Parse Tree, which is passed on to the back end.
The back end takes the AST and converts it into machine code specific to the platform for which it is being compiled. That is, unless you are using an interpreted language, like Python, or a language that runs in a virtual machine, like C# or Java. With VM languages, the code is converted to bytecode and Just-in-Time compiled at runtime. With interpreted languages, the AST is passed to an interpreter which operates directly on the data.
Interpreted languages have the advantage of a quick compilation process, which makes them quick to iterate on. However, they have the disadvantage of needing to be lexed and parsed every time the interpreter is started, which significantly slows the startup process.
Compiled languages offer much faster runtime speed at the cost of a longer compilation time. This is because the compiler is able to make general optimizations to the user's code. Compiled languages, however, cannot make guesses about the exact environment on which the code will be running.
VM languages try to take the best of both worlds by compiling the program into a very simple instruction set and then compiling parts of code at runtime, just before they're needed. This is where the phrase "Just in Time Compiler" comes from. This results in a significantly improved compile time compared to compiled languages and a significantly reduced startup time compared to interpreted languages.
| Language Type | Advantage | Disadvantage |
|---|---|---|
| Compiled | Fast execution and startup times | Slow compile time, not portable |
| Interpreted | Fast compile time, able to quickly iterate versions, portable | Slow runtime |
| Virtual Machine | Faster compile time than compiled languages and faster startup than interpreted languages, able to make more specific code optimizations, portable | Slower startup than compiled code, higher memory footprint, slightly slower runtime than compiled code |
GlassScript is, decidedly, an interpreted language.
Parser Grammar
This is where the idea of building a language really becomes fun. Here you get the chance to make real choices about what your language will look like and what features it will include, because the Parser is where the structure of those features is implemented.
Because we will be writing the parser by hand, we will create it the same way the lexer was made: with branching logic. This yields what is called a Recursive Descent Parser. It's not terribly fancy, and it obscures the grammar of the language by scattering the rules over several methods, but it makes the actual code of the parser easier to understand.
The grammar for a parser can be expressed in the exact same way that we expressed the grammar for the lexer. The difference is that we should provide a starting point for parsing. Here is an example of my start node:
start = ( { class declaration | method declaration }, { statement } ) | { statement } ;
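To make the start rule concrete, here is a hypothetical sketch of how a parser could branch on it. The token shapes and method names below are assumptions for illustration, not GlassScript's real API: it loops over declarations while the current token introduces one, then falls through to free-standing statements.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch of the start rule (not GlassScript's actual parser).
class StartRuleSketch
{
    private readonly List<string> _tokens;
    private int _pos;

    public StartRuleSketch(List<string> tokens) { _tokens = tokens; }

    private string Current => _pos < _tokens.Count ? _tokens[_pos] : null;

    // start = ( { class declaration | method declaration }, { statement } )
    //       | { statement } ;
    public List<string> ParseStart()
    {
        var nodes = new List<string>();
        while (Current == "class" || Current == "method")
            nodes.Add(Current == "class" ? ParseClassDeclaration()
                                         : ParseMethodDeclaration());
        while (Current != null)
            nodes.Add(ParseStatement());
        return nodes;
    }

    // Stubs: a real parser would build AST nodes from the token stream.
    private string ParseClassDeclaration() { _pos += 2; return "ClassDeclaration"; }
    private string ParseMethodDeclaration() { _pos += 2; return "MethodDeclaration"; }
    private string ParseStatement() { _pos += 1; return "Statement"; }

    static void Main()
    {
        var parser = new StartRuleSketch(new List<string> { "class", "Foo", "print" });
        Console.WriteLine(string.Join(", ", parser.ParseStart())); // ClassDeclaration, Statement
    }
}
```

Notice that both alternatives of the rule fall out of the same code: if no declarations are present, the first loop simply never runs.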
Recursion in Parsers
Nearly every language today has some sort of recursive syntax. Expressions are one of the best examples. Consider the expression `n`. We can add more to this expression using operators: `n + 3 * 2`, `6 + n / 3`, `(10 - n) % 2`, `n == 3.1415926`, etc. These operations are all Binary Expressions -- expressions which take two arguments. `2 + 2` could be read as "add the constant `2` and the constant `2`". Implementing recursion is easy... to mess up.
Consider this recursive grammar:
expression = expression, ( "+" | "-" ), integer;
This makes `2 + 3` and `2 + 3 + 4 + 5...` both valid expressions. The problem comes when we try to convert this grammar into code. One possible implementation of the rule, following the literal structure of the grammar, could be:
```
ParseExpression()
{
    var left = ParseExpression(); // recurses before consuming any input
    var op = ParseOperator();
    var right = ParseInteger();
}
```
This rule will never resolve: `ParseExpression` calls itself as its very first step, so it recurses forever without ever consuming a token.
This is what is called left-recursion. Left-recursion is a problem in a recursive descent parser because of the nature of function evaluation. There are parsers which are able to parse left-recursive grammars, such as an LR(k) parser, but they are extremely difficult to write by hand. These are the kind of parsers that are generated by tools like GNU Bison.
Left-recursion can be eliminated by specifying an equivalent rule such as
expression = integer | integer, ( "+" | "-" ), expression ;
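One way to convince yourself the rewritten rule terminates is to implement it. Below is a minimal, self-contained sketch -- an illustration, not GlassScript's actual code -- of a left-recursion-free parser for integers joined by `+` and `-`. It uses a loop, which accepts the same language as the right-recursive rule; the key difference from the broken version is that every call consumes at least one token before looping.

```csharp
using System;

// Minimal left-recursion-free expression parser (illustrative sketch).
class ExprParser
{
    private readonly string _src;
    private int _pos;

    public ExprParser(string src) { _src = src; }

    // expression = integer, { ( "+" | "-" ), integer } ;
    public int ParseExpression()
    {
        int value = ParseInteger(); // consume a token FIRST, then look for operators
        while (_pos < _src.Length && (_src[_pos] == '+' || _src[_pos] == '-'))
        {
            char op = _src[_pos++];
            int right = ParseInteger();
            value = op == '+' ? value + right : value - right;
        }
        return value;
    }

    private int ParseInteger()
    {
        int start = _pos;
        while (_pos < _src.Length && char.IsDigit(_src[_pos])) _pos++;
        return int.Parse(_src.Substring(start, _pos - start));
    }

    static void Main()
    {
        Console.WriteLine(new ExprParser("2+3").ParseExpression());   // 5
        Console.WriteLine(new ExprParser("7-4+2").ParseExpression()); // 5
    }
}
```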
Operator Precedence
In Algebra, you have the Order of Operations, which says that you must do parentheses before exponents, exponents before multiplication and division, and those before addition and subtraction. In programming, you have the Order of Operations as well as all of the other operators, like `And` and `Or`, `Greater Than` and `Less Than`, and operators which compare equality, perform bitwise operations, or assign values. All of these operations require some sort of operator precedence. This is the solution I have implemented:
expression = assignment expression ;
assignment expression = logical expression,
[ assignment operator, assignment expression ] ;
logical expression = equality expression,
[ logical operator, logical expression ] ;
equality expression = relational expression,
[ equality operator, equality expression ] ;
relational expression = bitwise expression,
[ relational operator, relational expression ] ;
bitwise expression = shift expression,
[ bitwise operator, bitwise expression ] ;
shift expression = additive expression,
[ shift operator, shift expression ] ;
additive expression = multiplicative expression,
[ additive operator, additive expression ] ;
multiplicative expression = unary expression,
[ multiplicative operator, multiplicative expression ] ;
unary expression = (prefix unary operator, primary expression)
| (primary expression, suffix unary operator)
| primary expression ;
primary expression = identifier | constant expression
| method call expression | new expression
| array access expression | reference expression
| precedence override | lambda expression ;
In this grammar, each level of expression calls up to the next level of expression. Precedence can be overridden by enclosing an expression in parentheses, as such: `( expression )`. This allows operations like `(2 + 2) * 10` to be evaluated to `40` instead of `22`.
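To make the cascade concrete, here is a self-contained sketch with only two of the levels above -- additive calling down to multiplicative, and the parenthesized precedence override inside primary. The method names mirror the grammar, but this is an illustrative toy evaluator, not GlassScript's actual parser.

```csharp
using System;

// Two-level precedence cascade with a parenthesized override (sketch).
class PrecedenceSketch
{
    private readonly string _src;
    private int _pos;

    public PrecedenceSketch(string src) { _src = src; }

    // additive = multiplicative, { ( "+" | "-" ), multiplicative } ;
    public int ParseAdditive()
    {
        int left = ParseMultiplicative();
        while (Peek() == '+' || Peek() == '-')
        {
            char op = _src[_pos++];
            int right = ParseMultiplicative();
            left = op == '+' ? left + right : left - right;
        }
        return left;
    }

    // multiplicative = primary, { ( "*" | "/" ), primary } ;
    private int ParseMultiplicative()
    {
        int left = ParsePrimary();
        while (Peek() == '*' || Peek() == '/')
        {
            char op = _src[_pos++];
            int right = ParsePrimary();
            left = op == '*' ? left * right : left / right;
        }
        return left;
    }

    // primary = integer | "(", additive, ")" ;
    private int ParsePrimary()
    {
        if (Peek() == '(')
        {
            _pos++;                      // consume "("
            int inner = ParseAdditive(); // precedence override: restart at the top
            _pos++;                      // consume ")"
            return inner;
        }
        int start = _pos;
        while (_pos < _src.Length && char.IsDigit(_src[_pos])) _pos++;
        return int.Parse(_src.Substring(start, _pos - start));
    }

    private char Peek() => _pos < _src.Length ? _src[_pos] : '\0';

    static void Main()
    {
        Console.WriteLine(new PrecedenceSketch("(2+2)*10").ParseAdditive()); // 40
        Console.WriteLine(new PrecedenceSketch("2+2*10").ParseAdditive());   // 22
    }
}
```

Because multiplication binds tighter simply by sitting lower in the call chain, no explicit precedence table is needed.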
Operator Associativity
Mathematical operators in GlassScript are, by definition, left associative. That means that an expression is evaluated from the leftmost term to the rightmost. For example, the expression `1 + 3 - 2` is evaluated as `(1 + 3) - 2 = 2` and not `1 + (3 - 2) = 2`. In this case, both orders produce the same output, `2`, but consider `7 - 4 + 2` -- an expression that could have two outputs given different orders of evaluation. `(7 - 4) + 2 = 5` is not the same as `7 - (4 + 2) = 1`. This is why associativity is needed.
Defaulting all operations to left associativity is not a bad option until you consider the expression `x = y = 2 + 8`. Should the operation be evaluated as `((x = y) = 2) + 8`? No; the expression should be evaluated from right to left, as so: `x = (y = (2 + 8))`.
Choose associativity based on the needs of the individual expression type.
Expressing associativity is not possible in EBNF, aside from using a comment to specify that an operator has special associativity. It is possible, however, when we implement our parser. Here are two example expressions: one that is left associative and one that is right associative.
Left Associative Additive:
```
private Expression ParseAdditiveExpression()
{
    Expression left = ParseMultiplicativeExpression();
    while (IsAdditiveOperator())
    {
        var op = ParseBinaryOperator();
        var right = ParseMultiplicativeExpression();
        left = new BinaryExpression(left, right, op);
    }
    return left;
}
```
Right Associative Assignment:
```
private Expression ParseAssignmentExpression()
{
    Expression left = ParseLogicalExpression();
    if (IsAssignmentOperator())
    {
        var op = ParseBinaryOperator();
        Expression right = ParseAssignmentExpression();
        return new BinaryExpression(left, right, op);
    }
    return left;
}
```
The left associative example replaces the left term with a new one, while the right associative example recurses into itself.
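The difference in tree shape is easy to see if each result is rendered as a parenthesized string. The standalone toy below (not the real GlassScript expression types) folds with a loop like `ParseAdditiveExpression` and recurses like `ParseAssignmentExpression`, over a pre-split token array for simplicity.

```csharp
using System;

// Renders parse results as parenthesized strings to show associativity.
class AssocDemo
{
    private readonly string[] _tokens;
    private int _pos;

    public AssocDemo(string[] tokens) { _tokens = tokens; }

    private string Peek() => _pos < _tokens.Length ? _tokens[_pos] : null;

    // Left associative: a loop that folds each operator into the left operand.
    public string ParseLeft()
    {
        string left = _tokens[_pos++];
        while (Peek() == "+" || Peek() == "-")
        {
            string op = _tokens[_pos++];
            string right = _tokens[_pos++];
            left = $"({left} {op} {right})";
        }
        return left;
    }

    // Right associative: recursion into the right operand.
    public string ParseRight()
    {
        string left = _tokens[_pos++];
        if (Peek() == "=")
        {
            _pos++; // consume "="
            return $"({left} = {ParseRight()})";
        }
        return left;
    }

    static void Main()
    {
        Console.WriteLine(new AssocDemo(new[] { "7", "-", "4", "+", "2" }).ParseLeft());   // ((7 - 4) + 2)
        Console.WriteLine(new AssocDemo(new[] { "x", "=", "y", "=", "10" }).ParseRight()); // (x = (y = 10))
    }
}
```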
Conclusion
That is all for this post. I will be back on Monday or Tuesday with the actual code for the parser as well as some more conceptual information.
The code for the entire project can be found on GitHub.