Iguana

Running Iguana

Iguana is a grammar interpreter, as opposed to a parser generator. This means that Iguana directly interprets an in-memory representation of a grammar, and there is no need to first generate code from the grammar specification. For example, consider the following simple grammar written in Iguana syntax that encodes a list of a's:


A ::= A 'a' 
   |  'a'

You can simply save the above grammar in a text file, e.g., Simple.iggy and load it with Iguana:


val grammar = IggyParser.getGrammar(Input.fromPath("src/resources/grammars/Simple.iggy"))

To parse an input string using this grammar you need to do the following:


val input = Input.fromString("aaa")
val start = Nonterminal.withName("A")
val result = Iguana.parse(input, grammar, start)

Iguana is build on top of our version of the Generalized LL (GLL) parsing algorithm. GLL is a top-down parsing algorithm that supports all context-free grammars and produces a binarized SPPF. Binarized SPPFs, however, are part of the internal machinery of GLL, and are not meant for the end user. Iguana provides support for conversion of binarized SPPF to terms that correspond to our grammar model. We can, for example, visualize the SPPF and parse tree from a ParseResult as follows.


result match {
  case s:ParseSuccess => SPPFVisualization.generate(s.sppfNode, "graphs", "simple_sppf")
                         TermVisualization.generate(s.getTree, "graphs", "simple_terms")
  case e:ParseError   => println(e)
}

The SPPF and terms corresponding to the example above are shown below. More information about terms can be found here.

Examples

Now we give some examples of using Iguana. The source code of this examples is available here.

XML elements

XML is a popular data exchange format, and there are numerous XML parsers for different programming languages. These parsers, however, are handwritten, rather than being generated from a grammar. The main problem is that the syntax of XML is not context-free. In this example, we show how to write a data-dependent grammar for parsing XML.

The XML reference manual has a very straightforward grammar. For example, an XML element is defined by the following context-free rules:


Element ::= STag Content ETag
STag    ::= '<' Name Attribute* '>'
ETag    ::= '</' Name '>'

In this grammar Element defines an XML element with a start (STag) and end (ETag) tag. As can be seen, there is no constraint in this grammar to ensure that the start and end tags are the same. For example, according to this grammar, the following example is valid XML:


<note>
  <to>Bob</from>
  <from>Alice</to>
</note>

Now we show a data-dependent version of an XML element:


Element ::= s=STag Content ETag(s)
STag    ::= '<' n:Name Attribute* '>' { n.yield }
ETag(s) ::= '</' n:Name [n.yield == s] '>'

This grammar specifies that the string parsed by Name should be the same for the start and end tags. Data-dependent grammars have an intuitive semantics. One can consider their runtime semantics as a recursive-descent parser with a guard for left recursion, and cubic runtime bound for the worst case. In the data-dependent grammar of XML elements, we make the following changes:

In the Element rule, we introduce the variable s, which holds the value returned by STag. This value is then passed to the parametrized nonterminal ETag.
In the STag rule, we get the text associated with Name. For this, we label the use of nonterminal Name, n:Name, where n is a variable referring to the node represented by Name. The last element of the STag rule is the return expression. In this case, we return the string parsed by Name, accessing it via the yield property.
In the ETag rule, we compare the start tag, passed via the parameter s, with the matched end tag's string n.yield. If constraint n.yield == s evaluates to false, the parsing path terminates. This ensures that only balanced XML elements are matched.

Operator Precedence

In this example we use high-level constructs for declarative specification of operator precedence. More information about these constructs can be found in our Onward! 15 paper.

Expression grammars in their natural and concise form are ambiguous. It is common to rewrite an expression grammar to an unambiguous grammar that encodes operator precedence. However, this rewriting is not trivial for grammars of real programming languages, and leads to large grammars.

We advocate a declarative approach to operator precedence where an ambiguous grammar and a set of declarative constructs are used. Consider an excerpt of the OCaml grammar and its accompanying operator precedence and associativity table from its language specification:


expr ::= expr '.' field  
       | expr expr
       | '-' expr
       | expr '*' expr
       | expr '+' expr
       | expr '-' expr
       | 'if' expr 'then' expr 'else' expr
       | expr ';' expr
       | '(' expr ')'
       | num

Operator	Associativity
.	-
function application	left
- (unary)	-
*	left
+ -	left
if-then-else	-
;	right

We can encode this grammar in Iguana using high-level syntactic constructs >, left, and right as follows:


expr ::= expr '.' field  
       > expr expr                           left
       > '-' expr
       > expr '*' expr                       left
       > (expr '+' expr | expr '-' expr)     left
       > 'if' expr 'then' expr 'else' expr
       > expr ';' expr                       right
       | '(' expr ')'
       | num

As can be seen, operator precedence is defined by > between two alternatives, and it is decreasing, i.e., the first alternative has the highest precedence. Associativity is defined using left and right. For rules that have the same precedence, but are left-associative with respect to each other, e.g., the '+' and '-' rules, an associativity group is defined.

Indentation-sensitive languages

In this section we show how to define a simple Haskell-like indentation-sensitive language. In programming languages, such as Haskell and Python, whitespace is significant as it determines the meaning of programs. For example, consider the following Haskell function:


f x = case x of
      0 -> 1
      _ -> x + 2
           + 4

In Haskell, the keywords do, let, of and where signal the start of a block. All statements that belong to the same block should be aligned. In our example, two alternatives belong to the same block of the case expression as they are aligned with respect to each other. Haskell also applies the offside rule for each statement of a block. This means that all tokens inside a statement have to be to the right of its starting token. For example, in the second alternative of the case expression, the right-hand-side expression includes subexpression 4 as all the tokens, including + 4, are to the right of _.

In Haskell, the keywords do, let, of and where signal the start of a block, where all the statements should be aligned. In our example two cases, starting with 0 and _, are aligned. Inside each statements the tokens should be offsided. It means that that all the tokens have to be to the right of the starting token. For example, in the second alternative of the match, all tokens should be to the right of _.

Now we consider a slightly modified version of the example above, where we shift + 4 to the left, so that it will be at the same column as _:


f x = case x of
      0 -> 1
      _ -> x + 2
      + 4

In this version, + 4 does not belong to the second alternative anymore, as it is not to the right of its starting token. Rather it belongs to the outermost addition expression with the case expression and 4 as its subexpressions. Now, if we call f 0, it returns 5, while the previous version returns 1.

Indentation sensitivity in programming languages cannot be expressed by pure context-free grammars, and has often been implemented by hacks in the lexer. For example, Haskell deals with indentation sensitivity in the lexing phase, and the context-free part is written as if no indentation-sensitivity exists. In Haskell, the lexer translates indentation information into curly braces and semicolons, based on the rules specified by the L function in the Haskell language manual, and the parser is constructed using an LALR parser generator.

A block of declarations in Haskell is defined by the following context-free grammar (we use a simplified grammar of Haskell):


Decls ::= '{' Decl (';' Decl)* '}'

This grammar says that a Decls starts and ends with curly braces, and inside the braces, Decls are separated by simicolons. In Haskell, when curly braces are explicitly used, the indentation rules inside the braces are ignored. The indentation-sensitive version of this rule, in which a block is identified by a list of aligned Decl's and each Decl is offsided, is implicit and is not part of its reference grammar. To have a declarative, single-phase specification of layout-sensitive constructs, Iguana presents three high-level constructs: align, offside and ignore. Using these constructs, we can redefine the rule above and add the new one as follows:


Decls ::= ignore '{' Decl (';' Decl)* '}'
        | align (offside Decl)*

The use of ignore states that the indentation rules should be ignored when the explicit delimiters are used, align states that all Decl's inside the list should be aligned, and finally, offside states that each Decl should be offsided with regard to its first token. These definitions clearly and concisely captures the indentation rules.