Implementing a Language in C# - Part 1: Introduction

Creating a new programming language is never a task that should be undertaken lightly. Before even starting work on your new language, imagine how it will be used. Will you use it for a new game? For processing data? As a general-purpose language?

Next, consider the alternatives. Lua is an industry standard for scripting in game development. Functional languages like F# and Scala work well for data processing. Countless general-purpose languages already exist, and more appear all the time.

Still want to create a new language? Good.

Creating a language gives a lot of insight into what compiler designers must consider: grammar, use cases, standard libraries, and more. We will need to consider all of these as we implement a new language from the ground up.

Today, there are tools like ANTLR which help with creating a lexer and parser from a definition of a language grammar. In this project, I have opted to write the entire lexer and parser by hand.

Before we start thinking about implementation details, I'll give you a quick introduction to the internals of a compiler.

What is a Compiler?

A compiler, simply put, is a program that translates text (code) into machine instructions. A compiler contains many parts:

  • Lexical Analysis - Turning the text input into a stream of tokens. This process is also known as tokenization.
  • Parsing - Generating an Abstract Syntax Tree which represents the original input.
  • Semantic Analysis - Enforcing general rules of a language such as checking existence of variables, scope rules, etc.
  • Intermediate Code Generation - Translating the code into an intermediate format like MSIL, Java bytecode, LLVM Bitcode, x86 Assembly, etc. This makes the process of producing a final executable much simpler and more general.
  • Object Code Generation - Generation of machine code specific to the platform.
  • Linking - Combining all of the object code into a single unit.
  • Executable Generation - Generating a final executable which can run on the target platform.

Disclaimer: this information is super simplified and differs from compiler to compiler in that some steps may be combined or separated into more steps. For the purposes of this article, I am just attempting to provide an educational overview.

In the case of interpreted languages like Python or JavaScript, only the first two or three steps are performed before the output is passed to a runtime. Just-in-time compiled languages such as C# or Java perform the first four steps and split the linking stage into two steps: compile-time linking and run-time linking.
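To make the stages more concrete, here is a minimal sketch of an interpreter that handles nothing but integer addition. It is not GlassScript code: every name in it (Node, Number, Add, PipelineSketch) is invented for this illustration, and it assumes that tokens are separated by spaces.

```csharp
using System;

// A tiny syntax tree: a node is either a number or the sum of two nodes.
public abstract class Node
{
    public abstract int Evaluate();
}

public class Number : Node
{
    private readonly int _value;
    public Number(int value) { _value = value; }
    public override int Evaluate() => _value;
}

public class Add : Node
{
    private readonly Node _left;
    private readonly Node _right;
    public Add(Node left, Node right) { _left = left; _right = right; }
    public override int Evaluate() => _left.Evaluate() + _right.Evaluate();
}

public static class PipelineSketch
{
    // Lexical analysis: "1 + 2" becomes the tokens "1", "+", "2".
    // Assumption: tokens are separated by spaces.
    public static string[] Lex(string source)
        => source.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

    // Parsing: builds a left-leaning tree of Add nodes over Number leaves.
    // A dangling trailing "+" is ignored to keep the sketch short.
    public static Node Parse(string[] tokens)
    {
        Node tree = new Number(int.Parse(tokens[0]));
        for (int i = 1; i + 1 < tokens.Length; i += 2)
        {
            if (tokens[i] != "+")
            {
                throw new FormatException("Expected '+' but found '" + tokens[i] + "'");
            }
            tree = new Add(tree, new Number(int.Parse(tokens[i + 1])));
        }
        return tree;
    }

    // The "runtime": walks the tree to produce a value.
    public static int Run(string source) => Parse(Lex(source)).Evaluate();
}
```

Calling PipelineSketch.Run("1 + 2 + 39") lexes the text into five tokens, parses them into two nested Add nodes, and evaluates the tree to 42. A real language needs far more than this, but the division of labor stays the same.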

In this series of articles, we will be implementing an interpreted language in C# 6.

First Steps

Before we start implementing a language, we should set up our environment. This is the initial directory structure I will be using.

/
|-- Language
|   |-- Lexer
|   |-- Parser
|   \-- Syntax
|       |-- Expressions
|       \-- Statements
\-- Runtime

Right now is also a good time to choose a name for your language. From now on, I will be calling this project GlassScript. I chose this name for three reasons: (1) this is a scripting language, (2) there is a glass window in front of me, and (3) I have no creativity.

Now is also a good time to think about what functionality we can split outside of the core project. These are things like an interactive console, special hosting APIs (e.g. for ASP.NET), and other features that are not required for the language itself to function.

Another thing you may want to do is to create a file in your Lexer directory that defines your lexical grammar. This will help you to figure out your different lexical rules and get a handle on what your language will look like. This step will also help others to understand your lexer. This grammar does not need to be perfect, just enough for you to figure out some of your rules.

Boilerplate

Now that you have some idea of what your language will be, we can start writing a bit of boilerplate code that will be used in the lexer.

String Extensions

We are going to add one extension method, CharAt(int index), to the string class. This method will return the character at a given index or, if the given index is out of range, an ASCII NUL (\0) character.

StringExtensions.cs

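// Extension methods must be declared in a static, non-nested class.
public static class StringExtensions
{
    // Returns the character at index, or '\0' when index is out of range.
    public static char CharAt(this string str, int index)
    {
        if (index < 0 || index >= str.Length)
        {
            return '\0';
        }

        return str[index];
    }
}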

The use of this will become evident when we start creating the lexer.
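As a small preview, here is a sketch of why the \0 sentinel is convenient: a scanning loop can run past the end of the input without an explicit bounds check. CountLeadingDigits is a hypothetical helper invented for this example, and CharAt is restated in compact form so that the snippet compiles on its own.

```csharp
public static class StringExtensions
{
    // Restated from StringExtensions.cs so this sketch compiles on its own.
    public static char CharAt(this string str, int index)
        => (index < 0 || index >= str.Length) ? '\0' : str[index];
}

public static class CharAtDemo
{
    // Hypothetical helper: counts the digits at the start of a string.
    // The loop stops at the first non-digit, and '\0' is not a digit,
    // so the end of the string needs no special handling.
    public static int CountLeadingDigits(string source)
    {
        int index = 0;
        while (char.IsDigit(source.CharAt(index)))
        {
            index++;
        }
        return index;
    }
}
```

With the plain indexer, the same loop would throw an IndexOutOfRangeException on an input such as "123", because it reads one position past the last digit.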

Source Span and Locations

Next, we will be creating two structs, SourceLocation and SourceSpan, that will be used for error reporting and statistical information. SourceLocation will hold a position in a file: index, line, and column. SourceSpan will contain two SourceLocations, Start and End.

SourceLocation.cs

public struct SourceLocation
{
    private readonly int _column;
    private readonly int _index;
    private readonly int _line;

    public int Column => _column;

    public int Index => _index;

    public int Line => _line;

    public SourceLocation(int index, int line, int column)
    {
        _index = index;
        _line = line;
        _column = column;
    }
}

SourceSpan.cs

public struct SourceSpan
{
    private readonly SourceLocation _end;
    private readonly SourceLocation _start;

    public SourceLocation End => _end;

    public int Length => _end.Index - _start.Index;

    public SourceLocation Start => _start;

    public SourceSpan(SourceLocation start, SourceLocation end)
    {
        _start = start;
        _end = end;
    }
}
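Here is a sketch of how the two structs might be used together when reporting an error. The structs are restated compactly with C# 6 getter-only auto-properties so that the snippet compiles on its own. LocationOf and Describe are hypothetical helpers, and the sketch assumes 1-based lines and columns, '\n' line endings, and an End location that points just past the last character of the span.

```csharp
public struct SourceLocation
{
    public int Index { get; }
    public int Line { get; }
    public int Column { get; }

    public SourceLocation(int index, int line, int column)
    {
        Index = index;
        Line = line;
        Column = column;
    }
}

public struct SourceSpan
{
    public SourceLocation Start { get; }
    public SourceLocation End { get; }
    public int Length => End.Index - Start.Index;

    public SourceSpan(SourceLocation start, SourceLocation end)
    {
        Start = start;
        End = end;
    }
}

public static class SpanDemo
{
    // Hypothetical helper: walks the source to find the line and column of an index.
    // Assumes 1-based lines and columns and '\n' line endings.
    public static SourceLocation LocationOf(string source, int index)
    {
        int line = 1;
        int column = 1;
        for (int i = 0; i < index && i < source.Length; i++)
        {
            if (source[i] == '\n')
            {
                line++;
                column = 1;
            }
            else
            {
                column++;
            }
        }
        return new SourceLocation(index, line, column);
    }

    // Hypothetical helper: formats a span for an error message.
    public static string Describe(SourceSpan span)
    {
        return $"({span.Start.Line},{span.Start.Column}) to ({span.End.Line},{span.End.Column}), length {span.Length}";
    }
}
```

For the source "let x\nprint", the word print occupies indices 6 through 10, so a span from LocationOf(source, 6) to LocationOf(source, 11) is described as "(2,1) to (2,6), length 5".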

Lexer Utilities

The last thing we will cover in this post is a handful of lexer utilities. These methods reduce the clutter in an already complex file by taking over some of the manual bookkeeping.

The first of these is a Peek(int ahead) method. This one-liner just returns the character that is ahead places past _index.

Second, the shorthand _ch, _next, and _last properties return the current, next, and previous characters in the source code of the current file.

Third, the Consume() method adds the current character to the token and advances the current position of the stream by one.

Finally, the CreateToken(TokenKind kind) method terminates the current token and initializes the next.
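Pulling these four utilities together, a rough sketch could look like the following. This is not the GlassScript lexer, which is the subject of the next post: MiniLexer, Token, the two TokenKind values, and LexAll are all hypothetical names, and the StringBuilder used as the token buffer is an assumption of this sketch. CharAt is restated so that the snippet compiles on its own, and _last is defined for completeness but not used here.

```csharp
using System.Collections.Generic;
using System.Text;

public static class StringExtensions
{
    // Restated from StringExtensions.cs so this sketch compiles on its own.
    public static char CharAt(this string str, int index)
        => (index < 0 || index >= str.Length) ? '\0' : str[index];
}

// Hypothetical token kinds, chosen only for this sketch.
public enum TokenKind { Word, Symbol }

public class Token
{
    public TokenKind Kind { get; }
    public string Value { get; }

    public Token(TokenKind kind, string value)
    {
        Kind = kind;
        Value = value;
    }
}

public class MiniLexer
{
    private readonly string _source;
    private readonly StringBuilder _builder = new StringBuilder();
    private int _index;

    public MiniLexer(string source)
    {
        _source = source;
    }

    // Character that is `ahead` places from _index, or '\0' when out of range.
    private char Peek(int ahead) => _source.CharAt(_index + ahead);

    private char _ch => Peek(0);
    private char _next => Peek(1);
    private char _last => Peek(-1);

    // Adds the current character to the token being built, then advances by one.
    private void Consume()
    {
        _builder.Append(_ch);
        _index++;
    }

    // Finishes the current token and clears the buffer for the next one.
    private Token CreateToken(TokenKind kind)
    {
        var token = new Token(kind, _builder.ToString());
        _builder.Clear();
        return token;
    }

    public List<Token> LexAll()
    {
        var tokens = new List<Token>();
        while (_ch != '\0')
        {
            if (char.IsWhiteSpace(_ch))
            {
                _index++; // skip whitespace without adding it to a token
            }
            else if (char.IsLetterOrDigit(_ch))
            {
                while (char.IsLetterOrDigit(_ch))
                {
                    Consume();
                }
                tokens.Add(CreateToken(TokenKind.Word));
            }
            else
            {
                // One character of lookahead turns "==" into a single token.
                if (_ch == '=' && _next == '=')
                {
                    Consume();
                }
                Consume();
                tokens.Add(CreateToken(TokenKind.Symbol));
            }
        }
        return tokens;
    }
}
```

For the input "a == 1", LexAll produces three tokens: the word a, the symbol ==, and the word 1. The main loop never touches an index or a bounds check directly, which is the point of pushing that work into Peek, Consume, and CreateToken.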

Conclusion

That's all for this post. We got some of the fundamentals out of the way, such as boilerplate utilities and some of the helper objects.

In the next post, we will be implementing GlassScript's Lexer.

The entire code for GlassScript, including some things we did not cover in this post, can be found on GitHub.