John Merrells

« Overload 46 Editorial - Returning to C++      Overload 47 Editorial - PRD »

Accent Vol 2 No 11 - ANTLR

Every couple of years or so I bump onto an engineering problem most suitably solved by a simple lexical analyzer and parser. Typical examples being a configuration file processor, or a URL processor. Each time, I dust off my old lex and yacc manuals and try to remember how a DFSA differs from an NDFSA and how a shift-reduce error differs from a reduce-reduce error. Invariably I give up and revert to the tried and true method of just hacking up some code myself. Surely there must be a better way?

There is, but first a brief introduction to some language processing terms.

A lexical analyzer, also known as a lexer or scanner, simply transforms an input stream of characters into an output stream of tokens. For example, a lexer might be programmed by a set of rules to transform a stream such as (10+20) into the token stream LP N PLUS N RP. Lexers specialize in grouping sequences of characters into larger constructs named tokens.

A parser takes a stream of tokens from a lexer and groups them into higher order constructions called productions. Following on from our example above, a parser might be programmed by a set of rules to recognize the token stream LP N PLUS N RP as the production named expression.

ANTLR, or ANother Language Recognizer, is a combined lexer and parser generator, with a number of features that make it much simpler and nicer to use than the traditional combination of lex and yacc, or the GNU alternative of flex and bison.

That’s enough terms for now, lets get down to some code. We’ll work through the canonical calculator example. What’s the result of the expression (1+1)? Well, first we’ll need a lexer.

// calc.g
options
{
language=”Cpp”;
}
class CalcLexer extends Lexer;
public
INTEGER: (DIGIT)+;
LPAREN: ‘(’;
RPAREN: ‘)’;
PLUS: ‘+’;
protected
DIGIT: ‘0′..’9′;

The definition code for the lexer is comprised of four sections; options, class derivation, public tokens and protected tokens. The options section contains parameters for the ANTLR tool. In this case the ‘language’ statement selects the generation of C++ code. ANTLR can also generate Java and Sather code, but for this article we’ll use C++. The class statement declares that the subsequent sections are defining a lexer, and allows the programmer to name their scanner. The next two sections provide the token definitions. The case of the token names is significant. The tool knows to treat identifiers that begin with an uppercase letter as tokens. In our example lexer a DIGIT is defined to be any character in the range ‘0′ through to ‘9′, and an INTEGER is a sequence of DIGITS. The definitions use the standard regular expression syntax. Here the ‘+’ in (DIGIT)+ means a sequence of one or more. Other important regular expression operators to keep in mind are (x)? which means x is optional, x | y which means x or y, and (x)* which means a sequence of zero or many x’s.

We need a parser to consume the token stream and recognize productions so that we can evaluate our expressions. The rules used to describe the parser are called production rules and collectively describe a grammar. The grammar for our calculator, in ANTLR notation, looks like this.

class CalcParser extends Parser;
options
{
buildAST=true;
}
expr: LPAREN INTEGER (PLUS INTEGER)* RPAREN;

These three sections of code can appear either before or after the lexer definition. The first section again declares the start of the rules that describe the parser. The options section can contain many statements, but for our simple demonstration we need only to turn on the automatic construction of an abstract syntax tree (AST). The AST is a tree of nodes that is built to represent the productions as they are recognized within the token stream. And the third section provides a definition of the expr production rule. This rule states that an expression begins with a left parenthesis followed by an integer and then followed by zero or more integer additions, followed by a closing right parenthesis.
The lexer and parser source code is generated by executing the following command line. (You’ll have to install ANTLR, the Java SDK, and have set up your class path appropriately.)

> java antlr.Tool calc.g

The files resulting from this execution will be the lexer code: CalcLexer.cpp, CalcLexer.hpp, CalcLexerTokenTypes.hpp; and the parser code: CalcParser.cpp, CalcParser.hpp, CalcParserTokenTypes.hpp. One of the nice things about ANTLR, over lex, flex, yacc, and bision, is that the code it generates is implemented as nested case statements, rather than tables of numbers. This makes the code much simpler to understand when implementing a complex grammar or whilst debugging.

Now we need some main() code to drive our lexer and parser.

//main.cpp
#include
#include “CalcLexer.hpp”
#include “CalcParser.hpp”

using namespace std;
using namespace antlr;

void main()
{
try {
CalcLexer lexer(cin);
CalcParser parser(lexer);
parser.expr();
RefAST t = parser.getAST();
cout < < t->toStringList() < < endl;
} catch(exception& e) {
cerr << "exception: " << e.what() << endl;
}
}

A stream of characters it taken from the standard input stream, passed into the lexer, from which a token stream is passed into the parser, from which productions are recognized, and an abstract syntax tree generated.

Compile main.cpp and the tool generated code, and link with the ANTLR C++ library. (You'll need to compile this library yourself, but instructions are provided.)

Here's a sample session showing a sample input string, and the resultant AST.

[c:\temp\test]test
(1+1)
( 1 + 1 )
[c:\temp\test]test
(1+2+3)
( 1 + 2 + 3 )
[c:\temp\test] test
(1*1)
exception: unexpected char: *

The next stage is clearly for us to actually calculate the resultant value of the expression. A common approach is to place actions with the production rules so that as the production is recognized the result of the expression in evaluated. This solution is cumbersome for large languages as the recognition and evaluation code is intermingled. We'd prefer to separate these issues, just as we separated the concerns of lexical analysis and parsing. Typically, for a complex language, the actions that adorn the grammar do no more than build the abstract syntax tree. But, ANTLR will do this for us.

The default AST being built isn't quite optimal. In the output above the tree includes nodes that represent the parenthesis. These are just there for syntactic sugar. Also, ,the root node in the tree is also the left parenthesis. We'd much rather it was the plus, as this is the operation we're trying to perform. The expression production should be changed thus:

expr: LPAREN! INTEGER (PLUS^ INTEGER)* RPAREN!;

Note the addition of the '!' and '^' notations. The '!' indicates that for AST generation the token should be ignored, and the '^' after the PLUS token signifies that the PLUS is to be the root of the tree. Our interactive session now produces this:

[c:\temp\test]test
(1)
1
[c:\temp\test]test
(1+1)
( + 1 1 )
[c:\temp\test]test
(1+2+3)
( + ( + 1 2 ) 3 )

The AST is flattened for printing. Each pair of parenthesis marks the children of the preceding node.

To actually calculate the result of the expression we must build a walker to walk the AST tree implementing the evaluation actions. Again, ANTLR will write this code for us. Here's the specification of our walker.

class CalcTreeWalker extends TreeParser;

expr returns [float r]
{
float a,b;
r=0;
}
: #(PLUS a=expr b=expr) {r = a+b;}
| i:INTEGER {r = atof(i->getText().c_str());}
;

This introduces quite a bit of new syntax, but the underlying structure is still the same. The walker takes an abstract syntax tree and recognizes productions within that tree. The rule above describes a production called expr, which has a return value of type float. The statements between braces, {’}, are regular C++ code, which I won’t drill into. The #(PLUS ‘) construction describes an AST node for the walker to match with. A PLUS node is expected to have an expression on the left and an expression on the right. The value returned from the left and right hand sides is assigned to the temporary variables a and b.

Our main code must be enhanced to set the walker off down the AST.

CalcLexer lexer(cin);
CalcParser parser(lexer);
parser.expr();
RefAST t = parser.getAST();
CalcTreeWalker walker;
float r = walker.expr(t);
cout < < t->toStringList() << " = " << r << endl;

And, our test session will now produce the answer to our troublesome (1+1) question.

[c:\temp\test]test
(1)
1 = 1
[c:\temp\test]test
(1+1)
( + 1 1 ) = 2
[c:\temp\test]test
(1+2+3)
( + ( + 1 2 ) 3 ) = 6

This is just a taste of some of the powerful features provided by ANTLR. A heartily recommended tool for both pleasure and profit.

John Merrells
merrells@acm.org

www.antlr.org - The home ANTLR.

One Response to “Accent Vol 2 No 11 - ANTLR”

  1. john Says:

    hi can you send me a lexical analyzer for operators in c++ thanks