Who wrote Aurochs?
Aurochs has been developed at Exalead SA for our internal
parsing needs. Its developer is Berke Durak.
You can reach him at obdurak at free fr. Some contributions to the Java binding have been made by Maxime Van Noppen.
What language is Aurochs written in?
Aurochs is written in Objective Caml and C.
What programming languages can I use Aurochs with?
You can easily use Aurochs in Ocaml, Java or C, or interface it with your own language.
How do I use Aurochs?
- You write a PEG grammar describing the parsing task
- You can then:
- Use Aurochs directly to parse an input text using that grammar and display the parse tree in XML.
- Use Aurochs to compile the PEG grammar into an automaton.
- Run the compiled automaton from Ocaml, Java, C using the existing bindings. This gives you a parse tree.
- Write your own C binding for your favorite language.
- Generate a native C or Ocaml parser.
What does a PEG grammar look like?
A PEG grammar, in Aurochs notation, looks like a BNF grammar.
hex_integer ::= "0x" [0-9a-fA-F]+;
decimal_integer ::= [0-9]+;
integer ::= hex_integer | decimal_integer;
space ::= ' ';
start ::= space* integer (space+ integer)* space* EOF;
Of course such a simple language could also be parsed with regular expressions.
What is that PEG or Packrat parsing technique I keep hearing about?
A kind of left recursive-descent parsing with memoization and backtracking.
Do PEG grammars provide things not available in CFGs?
Yes.
What operators does Aurochs provide?
Here is a summary of PEG operators.
- Character litterals: 'c' for the single ASCII character 'c'
- String litterals: "hello" for the string
- Beginning-of-file BOF and end-of-file EOF
- Any character sigma
- The empty string epsilon
- Character ranges [a-zA-Z0-9] or their complements [^\n\t ]
- Concatenation: a b parses a then b. Note that there exist
operators, such as infixing, which change the meaning of concatenation.
- Ordered disjunction a | b. Here, Aurochs attempts to parse a at the
current position. If it manages, it will never try to parse b at the current position..
If it fails, it will never try to parse a again at that position, and will try to
parse b. If b fails, the whole production fails.
- Negation: ~a attempts to parse a and succeeds, without consuming any input,
if a fails. If a succeeds, it fails (and thus doesn't consume its inputs anyway.)
- Conjunction: &a. Parses a, but without consuming it. Thus
"big" &"ger" "gerbil" will parse "big", then check that "ger" is parseable but without
consuming the input (or advancing the cursor, if you prefer), then parse "gerbil". Thus it will accept "biggerbil",
but not "biggergerbil".
- Expressions can be grouped using parentheses.
- Repetition:
a* parses as many occurences of a as possible, including zero.
a+ parses at least one occurences of a.
- Option: a? parses zero or one occurences of a.
In what language do I specify the actions of the grammar?
Traditional parser generators mix grammar specifications and source code. This makes the grammar file, and the parse
generator, dependent on the language used. That is why you have Yacc clones for C, Yacc clones for Java, Yacc clones for SML, and so on.
Aurochs uses a different approach. Productions of the grammar are marked with XML-like tags. When they are derived, a tree is constructed.
Actually, tags are emitted as CREATE_NODE, ADD_CHILDREN and ADD_ATTRIBUTE instructions in the automaton
code. The interpreter calls the appropriate hooks upon encountering these instructions.
hex_integer ::= "0x" value:[0-9a-fA-F]+;
decimal_integer ::= value:[0-9]+;
integer ::= <Hex>hex_integer</Hex> | <Dec>decimal_integer </Dec>;
space ::= ' ';
start ::= space* <IntegerList>integer (space+ integer)*</IntegerList> space* EOF;
Can you describe the Aurochs parse trees?
Aurochs basically generates a parse tree whose nodes have the following attributes.
Each node has a name, a number of named attributes, and an ordered sequence of children.
Each children is either another node, or a token. A token is a segment of the input.
This is basically the XML model, with the name "token" being substituted for
"PCDATA."
What construction operators are available for specifying how the tree should be built?
Here is a summary of construction operators.
- Each node of the parse tree has the foll
- Construct a single node : <name> x </name> will build a single
node with the given name, if the expression x is parsable. The nodes
constructed by the expression x will be appended as children to name,
as will be any attributes and tokens not caught by other children.
It is possible to abbreviate <name> </name> as <name/>. -
- Add an attribute: attr:x adds an attribute named attr to the current node. The
value of the attribute will be the segment of the input spanned by the whole expression x, regardless
of construction or attribute directives in x itself.
- Add a token: { x } adds the segment of the input spanned by the expression x as a token
to the current node.
- Positions: the constructor position can be used in attributes to encode the current position, in bytes. Thus
one can write <name> start:position x y z end:position </name> to get the beginning and end of the expression
x y z.
Is it possible to modify the meaning of, say, concatenation, to automatically insert spaces, for instance?
Yes, Aurochs has also modifiers, which act on a delimited portion of the grammar, and can be used to
do that.
The following modifiers take an expression
x as parameter and transform each production
a ::= b as follows:
- appending x do ... done transforms them into a ::= a x
- prepending x do ... done transforms them into a ::= x a
- surrounding x do ... done transforms them into a ::= x a x
The following modifiers transform each concatenation a b c:
- outfixing x do ... done transforms them into x a x b x c x
- suffixing x do ... done transforms them into a x b x c x
- prefixing x do ... done transforms them into x a x b x c
Note that these modifiers also act within implicit concatenations produced by the operators
'*' and '+'.
Wouldn't a backtracking parsing algorithm have worst-case exponential time complexity?
In a conventional context-free grammar, as those used in parser generators of the Yacc family,
alternatives for a production combine multiplicatively leading to combinatorial explosion and ambiguity.
No linear or nearly-linear algorithm is known today for parsing general context-free grammars.
Parser generators such as Yacc or Menhir restrict the input grammar to sub-classes such as LALR(1)
or GLR to have an efficient parsing algorithm and resolve ambiguities.
Alternatives in parse-expression grammars are ordered. This resolves all ambiguity and prunes
the search space, giving a simple, linear-time parsing algorithm with the help of basic
memoization.
OK it's linear time, but is it fast in practice?
Quite sufficient for parsing Java and Javascript source code, database queries, regular expressions and
JSON data.
How much memory does it use?
Aurochs now uses memory-efficient data-structure for the memoization tables, which
tend to be rather sparse.
The C NOG interpreter used to use a simple array for its memoization tables. This
has changed in v1.0.93 (released 2009-01-21). An array of small hash tables is
now used, which reduces the memory overhead by a factor of 10 to 20, depending on
grammar.
The overhead depends on the grammar and the input. Aurochs will
use memory in O(mn) where m is the size of your grammar and n is the size of
your input. The practical factor depends on the complexity of your grammar
and your input. In practice, a rather complex full Javascript grammar
with about 300 expanded productions has an overhead of about 260 bytes
per input byte, or slightly less than one byte per production per input byte.
| Grammar | Input size | Memory consumption | CPU time | Memory overhead (64-bit) |
| Javascript | 96kb | 25Mb | 1s | 260 |
Is there room for improvement, or is the approach stretched to its limits?
Yes. As the automaton is sufficiently simple, a sufficiently smart interpreter can make it run as fast as you like.
More practically, we have quite decent performance with no optimization at all, and it should be possible to get
10 to 50 times better performance with grammar optimization, automaton optimization and JIT compilation.
Are there any serious uses of Aurochs or PEG parsing?
At Exalead, we have been parsing real-world complex Javascript
using Aurochs-generated parsers, daily for six months, using Jsure.
We are also implementing Java parser.
Why should I use packrat parsing in general and Aurochs in particular?
-
If you cannot split your language into a lexer and a parser, packrat parsing is for you.
-
Your are in the process of designing your language and you want a quick parse-fix cycle.
-
Because you want a parsing technology that is easy to understand and that you could re-implement
in your own.
-
If you have multiple grammars with incompatible lexical structures, you cannot combine them
using traditional parser generators, because they have a lexing phase. A hand-crafter
parser or packrat parsing is your only solution.
-
If you want to use a given grammar definition with more than one programming language.
-
If your language is not CFG, maybe packrat parsing can do the job.
-
Because you might find that resolving ambiguities by operator precedence rules is hard to understand
and not very modular.
-
If your patterns are such that regular expressions don't cut the job or get awfully complex while
grammars seem overkill, and your input strings are small (less than 10k), then Aurochs is a very
attractive alternative.
Why should I not use Aurochs or packrat parsing?
You cannot wait for all the input before starting to parse.
That is because the entire input must be loaded before parsing can take place,
i.e., the algorithm is not really on-line. But nothing prevents you from splitting
your input into small chunks, such as lines, and parse each line by itself,
if the grammar lends itself.
Your memory budget is low. If you cannot afford 20 to 150 bytes per input
character, it is better to use something else.
You need more speed.
You need to split the parsing in lexing and parsing phases.
Is Aurochs open-source software?
Yes. Aurochs is released under the GNU Lesser General Public License, version 3.
Where can I get Aurochs? And what do I need to build it?
Note: Aurochs is now in a git repository.
You can clone it with +git clone git://github.com/berke/aurochs.git+.
The old browse the Mercurial repository is here but I won't be pushing to it.
To build Aurochs, you need Ocaml version 3.10 or later
and a C compiler. The compilation process uses ocamlbuild
which is included in Ocaml.
Here are more detailed installation instructions:
- Install Objective Caml 3.10 or later:
If you are using Debian or a recent Ubuntu, run apt-get install ocaml-core
Otherwise, go to caml.inria.fr and follow the instructions there.
- Get the Aurochs source code:
and unpack it.
Or, if you have Mercurial, type hg clone http://aurochs.fr/hg/aurochs aurochs to get a local copy
of the latest version.
Type PREFIX=/usr/local sudo make install to install the aurochs binary in /usr/local/bin;
the Ocaml libraries will be installed in Ocaml's library path.
If you want to use Aurochs from Java, look into the java directory, which contains the required bindings
and a simple calculator example.