Running Iguana
Iguana is a grammar interpreter, as opposed to a parser generator. This means that Iguana directly interprets an in-memory representation of a grammar, and there is no need to first generate code from the grammar specification. For example, consider the following simple grammar written in Iguana syntax that encodes a list of a's:
You can simply save the above grammar in a text file, e.g., Simple.iggy
and load it with Iguana:
To parse an input string using this grammar you need to do the following:
val input = Input.fromString("aaa")
val start = Nonterminal.withName("A")
val result = Iguana.parse(input, grammar, start)
Iguana is build on top of our version of
the Generalized LL (GLL) parsing algorithm. GLL is a
top-down parsing algorithm that supports all context-free grammars and produces
a binarized SPPF. Binarized SPPFs, however, are part of the internal machinery of GLL, and are not
meant for the end user. Iguana provides support for conversion of binarized SPPF
to terms that correspond to our grammar model. We can, for example, visualize
the SPPF and parse tree from a ParseResult
as follows.
result match {
case s:ParseSuccess => SPPFVisualization.generate(s.sppfNode, "graphs", "simple_sppf")
TermVisualization.generate(s.getTree, "graphs", "simple_terms")
case e:ParseError => println(e)
}
The SPPF and terms corresponding to the example above are shown below. More information about terms can be found here.
Examples
Now we give some examples of using Iguana. The source code of this examples is available here.XML elements
XML is a popular data exchange format, and there are numerous XML parsers for different programming languages. These parsers, however, are handwritten, rather than being generated from a grammar. The main problem is that the syntax of XML is not context-free. In this example, we show how to write a data-dependent grammar for parsing XML.
The XML reference manual has a very straightforward grammar. For example, an XML element is defined by the following context-free rules:
Element ::= STag Content ETag
STag ::= '<' Name Attribute* '>'
ETag ::= '</' Name '>'
In this grammar Element
defines an XML element with a
start (STag
) and end (ETag
) tag. As can be seen, there
is no constraint in this grammar to ensure that the start and end tags are the same.
For example, according to this grammar, the following example is valid XML:
Now we show a data-dependent version of an XML element:
Element ::= s=STag Content ETag(s)
STag ::= '<' n:Name Attribute* '>' { n.yield }
ETag(s) ::= '</' n:Name [n.yield == s] '>'
This grammar specifies
that the string parsed by Name
should be the same for the start and end
tags. Data-dependent grammars have an intuitive semantics. One can consider their
runtime semantics as a recursive-descent parser with a guard for left recursion,
and cubic runtime bound for the worst case. In the data-dependent grammar of XML
elements, we make the following changes:
- In the
Element
rule, we introduce the variables
, which holds the value returned bySTag
. This value is then passed to the parametrized nonterminalETag
. - In the
STag
rule, we get the text associated withName
. For this, we label the use of nonterminalName
,n:Name
, wheren
is a variable referring to the node represented byName
. The last element of theSTag
rule is the return expression. In this case, we return the string parsed byName
, accessing it via theyield
property. - In the
ETag
rule, we compare the start tag, passed via the parameters
, with the matched end tag's stringn.yield
. If constraintn.yield == s
evaluates to false, the parsing path terminates. This ensures that only balanced XML elements are matched.
Operator Precedence
In this example we use high-level constructs for declarative specification of operator precedence. More information about these constructs can be found in our Onward! 15 paper.
Expression grammars in their natural and concise form are ambiguous. It is common to rewrite an expression grammar to an unambiguous grammar that encodes operator precedence. However, this rewriting is not trivial for grammars of real programming languages, and leads to large grammars.
We advocate a declarative approach to operator precedence where an ambiguous grammar and a set of declarative constructs are used. Consider an excerpt of the OCaml grammar and its accompanying operator precedence and associativity table from its language specification:
We can encode this grammar in Iguana using high-level syntactic constructs
>
, left
, and right
as follows:
expr ::= expr '.' field
> expr expr left
> '-' expr
> expr '*' expr left
> (expr '+' expr | expr '-' expr) left
> 'if' expr 'then' expr 'else' expr
> expr ';' expr right
| '(' expr ')'
| num
As can be seen, operator precedence is defined by >
between
two alternatives, and it is decreasing, i.e., the first alternative has the highest
precedence. Associativity is defined using left
and right
.
For rules that have the same precedence, but are left-associative with respect to
each other, e.g., the '+'
and '-'
rules, an associativity
group is defined.
Indentation-sensitive languages
In this section we show how to define a simple Haskell-like indentation-sensitive language. In programming languages, such as Haskell and Python, whitespace is significant as it determines the meaning of programs. For example, consider the following Haskell function:
f x = case x of
0 -> 1
_ -> x + 2
+ 4
In Haskell, the keywords do
, let
, of
and where
signal the start of a block. All statements that belong to
the same block should be aligned. In our example, two alternatives belong to the
same block of the case
expression as they are aligned with respect
to each other. Haskell also applies the offside rule for each statement of a block.
This means that all tokens inside a statement
have to be to the right of its starting token. For example, in the second
alternative of the case
expression, the right-hand-side expression
includes subexpression 4
as all the tokens, including + 4
,
are to the right of _
.
In Haskell, the keywords do, let, of and where signal the start of a block, where all the statements should be aligned. In our example two cases, starting with 0 and _, are aligned. Inside each statements the tokens should be offsided. It means that that all the tokens have to be to the right of the starting token. For example, in the second alternative of the match, all tokens should be to the right of _.
Now we consider a slightly modified version of the example above, where we
shift + 4
to the left, so that it will be at the same column as
_
:
f x = case x of
0 -> 1
_ -> x + 2
+ 4
In this version, + 4
does not belong to the second alternative anymore, as
it is not to the right of its starting token. Rather it belongs to the outermost
addition expression with the case
expression and 4
as
its subexpressions. Now, if we call f 0
, it returns 5, while the previous
version returns 1.
Indentation sensitivity in programming languages cannot be expressed by pure context-free grammars, and has often been implemented by hacks in the lexer. For example, Haskell deals with indentation sensitivity in the lexing phase, and the context-free part is written as if no indentation-sensitivity exists. In Haskell, the lexer translates indentation information into curly braces and semicolons, based on the rules specified by the L function in the Haskell language manual, and the parser is constructed using an LALR parser generator.
A block of declarations in Haskell is defined by the following context-free grammar (we use a simplified grammar of Haskell):
Decls ::= '{' Decl (';' Decl)* '}'
This grammar says that a Decls
starts and ends with curly braces, and
inside the braces, Decl
s are separated by simicolons. In Haskell, when curly
braces are explicitly used, the indentation rules inside the braces are ignored.
The indentation-sensitive version of this
rule, in which a block is identified by a list of aligned Decl
's and
each Decl
is offsided, is implicit and is not part of its reference
grammar. To have a declarative, single-phase specification of layout-sensitive
constructs, Iguana presents three high-level constructs: align
, offside
and ignore
.
Using these constructs, we can redefine the rule above and add the new one as follows:
ignore
states that the indentation rules should
be ignored when the explicit delimiters are used, align
states that
all Decl
's inside the list should be aligned, and finally, offside
states that each Decl
should be offsided with regard to its first token. These definitions
clearly and concisely captures the indentation rules.