Syntax

In this section, we discuss the syntax of Iguana.

Context-free grammars

Iguana supports all context-free grammars in a single-phase parsing mode. This means that in Iguana we use a single specification to define the context-free and lexical syntax, and there is no separate lexing phase before parsing. Grammars can be defined to the level of characters, conforming to Scannerless parsing or using regular expressions, conforming to Context-aware scanning. Context-aware scanning basically means that we use a separate lexer during parser to match tokens. You can find more information about benefits of single-phase parsing in our Onward'15 paper.

A grammar is a set of rules, where each rule has a head (a nonterminal) and a body (a list of terminal and nonterminals). Nonterminals are indentifiers, while terminals are characters or regular expressions. Consider the following simple grammar of statements:

Stmt 
  : 'if' Expr 'then' Stmt 'else' Stmt
  | 'if' Expr 'then' Stmt
  |  Id

Expr
  : 'true'
  | 'false'

regex Id
  : [a-z]+  

As can be seen, each rule definition starts with a nonterminal followed by : followed by a list of alternatives which are separated by |. Moreover, rules must be preceded by a new line character, except for the first rule in the file. Terminals can be defined using either single or double quotes. To define a regular expression, we use the regex keyword. The syntax of regular expressions is explained here.

We also support the following EBNF operators:

  • A*: Zero or more occurrence of A.
  • A+: One or more occurrence of A.
  • A?: Zero or one occurrence of A.
  • (α): A group of non-empty list of symbols α.
  • A|B: A or B. This operator, in contrast to the normal alternatives in context-free rules, can appear in the body of a rule.

Note that * and + are not greedy in context-free rules.

Data-dependent grammars

Iguana supports data-dependent grammars at its core. Data-dependent grammars are an extension of context-free grammar that support arbitrary computation, parametrized nonterminals, variable binding and constraints. These powerful features allow the user to simulate hand-written parsers, and also implement various disambiguation constructs.

Data-dependent grammars provide the following features:

  • Parametrized nonterminals of the form A(p1, p2, ..., pn).
  • Bindings of the form {v=e}, where e is an arbitrary expression.
  • Bindings of the form {v=A(p1, p2, ..., pn)}, where v holds to the return value of A(p1, p2, ..., pn).
  • Labeled symbols of the form l:x, where l is the label and x is a symbol (terminal or nonterminal). A label provides access to the node corresponding to x in the parse forest, and has three properties: l, r, and yield, which provide access to the left extent, right extent and the string parsed by l. One can access these properties using a dot notation, e.g., l.r for the right extent.
  • Return values of the form {e}, where {e} is the last symbol of an alternative.
  • Conditional selections of the form e ? α : β, where e is an expression and α and β are non-empty list of terminals and nonterminals. If e evaluates to true, α is selected, otherwise β.

To demonstrate the concept of data-dependent grammars we use an example from the IMAP protocol. In network protocol messages it is common to send the length of data before the actual data. In IMAP, these messages are called literals, and are described by the following (simplified) context-free rule:

L8 : '~{' Number '}' Octets

Here Octets recognizes a list of octet (any 8-bit) values. An example of L8 is ~{6}aaaaaa. As can be seen, there is no data dependency in this context-free grammar, but the IMAP specification says that the number of Octets is determined by the value parsed by Number. Using data-dependent grammars, we can specify such a data-dependency as:

L8 
  : '~{' nm:Number { n=toInt(nm.yield) } '}' Octets(n) 

Octets(n) 
  : [n > 0] Octets(n - 1) Octet
  | [n == 0]                    // Empty rule

In the data-dependent version, nm provides access to the value parsed by Number. We retrieve the substring of the input parsed by Number via nm.yield which is converted to integer using toInt. This integer value is bound to variable n, and is passed to Octets. Octets takes an argument that specifies the number of iterations. Conditions [n > 0] and [n == 0] specify which alternative is selected at each iteration.

Regular expressions

In our grammar formalism, terminals can be written using regular expressions. Although the regular expression operators in this section are similar to the ones of the EBNF in context-free grammars, their implementation is very different. Regular expressions are only recognizers, i.e., do don't capture the structure, and cannot be recursive. I addition, the repetition operators, * and +, are only greedy. We do not support non-greedy quantifiers such as possessive or reluctant. For such uses, use a context-free rule. The regular expressions are implemented as DFAs, and at the moment no advanced features such as lookahead and lookbehind, commonly found in backtracking regular expressions, are supported.

A regular expression is defined as follows:

  • c: the character c.
  • [c1-c2]: a character range from c1 to c2.
  • [^c1-c2]: negation (any character not in the range)
  • r*: (Kleene star) zero or more repetition of r.
  • r+: one or more repetition of r.
  • r?: zero or more occurrence of r.
  • r1|r2: r1 or r2.
  • (r1r2...rn): a group of r1 r2 ... rn.

Regular expressions are defined using the regex. The nested definitions of the regular expressions are flattened. For example, consider the following grammar definition of float and integer numbers.

@regex 
float
  : integer '.' integer

@regex  
integer
  : [1-9][0-9]*

When float is used in a grammar rule as a terminal, its definitions are flattened, i.e., treated as [1-9][0-9]*[.][1-9][0-9]*.

Layout

In almost all programming languages, whitespace and comment, which we refer to as layout, are optional and can appear anywhere in the program. In a traditional two-phase parsing setting (lexing before parsing), layout is usually thrown away by the lexer and the context-free part is written as if no layout exists. In single-phase parsing, layout should be defined as part of the syntax, like any other grammar symbol. Because it is tedious to manually insert layout into the grammar, and also the fact that it may hinder readability and maintainability, we support automatic layout insertion in Iguana.

Our layout insertion is inspired by SDF, where layout is inserted between symbols in body of rules. More precisely, if S ::= A1 A2 ... An is a context-free rule, and L is a layout definition, the grammar after the layout insertion will be S ::= A1 L A2 L ... L An.

In Iguana, layout is defined using the @Layout annotation. For example, the common layout definition in a language such as Java is as follows:

@Layout
Layout ::= [\ \t \f \r \n]*

There can be only one layout definition per grammar definition, and by default, layout is inserted in all the rules unless they are explicitly marked with @NoLayout. The @NoLayout annotation is useful in cases where the structure of a lexical definition is needed. For example, a float number can be defined as:

@NoLayout 
Float ::= Number '.' Number

In this case, no whitespace between the symbols of Float is allowed.

Disambiguation

Lexical

Operator Precedence

Parse Trees

Parse Trees