Monday, November 3, 2014

Poor Man's Parser in C#

I am sure that anyone dreaming about writing a compiler will look at least once into parser-generators or things like it. But some may find it daunting and the worst of it, some of them need to have very hard to debug and setup.

So how to write your own parser - which is easy to maintain/write.

Ok, so first of all, please don't blame me if you care about performant parser engine or you want to to do the "correct way". Also, this strategy I found it to work with somewhat simple languages (let's say Pascal, MQL (MetaTrader's trading language), C, and things like it.

The parser has the duty to make a tree based on your tokens.

Steps:
- write a lexer that gives a list of tokens. You can start from this implementation. It should be very easy to change it for example that some IMatcher-s to accept specific words
- create a Node in a tree with arbitrary number of children as following:
    public class BlockAstNode {
        public List<BlockAstNode> Items = new List<BlockAstNode>();

        public TokenDef InnerToken { get; set; }
        public bool IsTerminal { get; set; }
        public BlockAstNode()
        {
        }
      
        public BlockAstNode(TokenDef token)
        {
            InnerToken = token;
            IsTerminal = true;
        }

        public BlockAstNode BuildNonTerminal(int startRange, int endRange)
        {
            var result = new BlockAstNode ();
            for (var i = startRange; i <= endRange; i++)
            {
                var node = Items[i];
                result.Items.Add(node);
            }
            Items.RemoveRange(startRange, endRange- startRange+1);
            Items.Insert (startRange, result);
            return result;
        }
  }
- fill a "tree" with one root and as children create AstNode children for every token (using the constructor with TokenDef, where TokenDef is your token class)
- make a bracing folding. Most major languages support a folding strategy. In Pascal is: "begin .. end", in C is { to } or for Ruby/VB: <class> to <end> or <def>..<end>. Every time when you find a block that starts with your begin brace token, you fold it using BuildNonTerminal method, and the result node is placed instead of the full node. Try for simplicity to find the smallest block that can be folded, and repeat until you cannot fold anymore.
- add more folding strategies for common cases: "fold all comments tokens"
- add fold for reserved words tokens: class, while, if, (interface/implementation) and they are separated iterations that visit the tree recursively and simplify the major instructions

Why this works better than regular parser generators?
- every strategy is very easy to debug: you add multiple iterations and if you have a crash in any of them, you will likely have a mismatched index (with +/- one, but no ugly generated code)

Enjoy writing your new language!

No comments:

Post a Comment