kuniga.me > NP-Incompleteness > An Introduction to the Parsec Library
21 Jan 2014
The process of compiling or interpreting programs requires, as one of its steps, parsing the source code and structuring them. More specifically, we have a step to convert a string into a set of tokens, called lexical analysis, which is carried out by a lexer or tokenizer.
After that, we structure those tokens in a such a way that encodes meaning of these tokens, such as abstract syntax trees (AST). This step is called parsing and is out by a parser.
The parsec library
Parse combinators are high-order functions that combine smaller combinators (parsers) to build more complex ones. This resembles the idea of context free grammars (CFG), which we talked about in a previous post, where we have productions of form
\[S \rightarrow A \mid B\]Here, \(S\) can be formed by either from \(A\) or \(B\), which in turn, can be other productions or terminals.
The parsec library is an implementation of a parser combinator in Haskell. We talked about combinators in Haskell previously (in portuguese).
Parsec vs. Bison/Yacc/Antlr
All Bison, Yacc and Antlr are not actual parsers, but rather parsers generators. They take a grammar file and generate parsers for the languages that can be described by those grammars.
Parsec, on the other hand, is a parser that you write yourself.
In this post we’ll go through several basic concepts of the parsec library, using as the main reference, the book Real World Haskell, Chapter 16.
The source code for all the examples to follow can be found on the blog’s github repository.
The basic idea of a parser is that it takes an input (for example, a string), it consumes the characters of this string until it’s done with what it’s supposed to parse and then pass the remaining (unparsed) string along, which might be used as input to subsequent parsers.
One of the simples parsers is the one that only consumes a single specific character. There’s a parser named char
at Data.Char
library, so let’s write one that parses the letter a
:
To test our parser with an input, we use the parse
function form Text.Parsec
This function takes as input a parser, a source name and the actual input. The source name parameter is not important for us now, so we can pass the empty string. Let’s write a simple wrapper to avoid boilerplate:
We can now test our parser with same sample inputs:
It extracts the first character of the input string if it’s the 'a'
character, otherwise it throws an error. If we want to match any char, there’s also the function anyChar
. Running it with the same examples:
Note that it doesn’t fail for strings starting with 'b'
. So far our parsers only match one character, so for example, the string "ab"
, it only returns the first character.
We can use a parser for the string too. There’s the string
combinator but let’s develop our own and show how we can combine combinators to form new ones. There’s the many
combinator that applies the combinator passed as argument until it fails.
Thus, we can write a string parser as many anyChar
:
Now let’s try it with the string "ab"
:
More useful than matching all characters is matching all except some, so we know when to stop parsing. For that, we can use noneOf
instead of anyChar
. It takes a list of characters as parameter and matches any character that is not on that list.
Let’s now write a wordParser
, which keeps parsing all characters until it finds an whitespace:
Let’s try it on the most classical string example:
Note that our parsers are throwing away all the unparsed strings. How can we parse the remaining, unparsed string?
We’ve talked about Functors and Monads before, but not about Applicatives functors.
Intuitively, they are a structure in between Functors and Monads, that is, they’re more complex and general than Functors but less than Monads. We can also make the analogy of wrappers that we did for monads.
Originally, the Parsec library was written with Monads in mind, but Applicative functors were introduced after that and using them to write parsers usually leads to more clear syntax. So, in this post, we’ll use the applicative flavor to write our parsers.
Here, we’ll only provide an overview of some of the main applicative operators. For further details, the book Learn You a Haskell for Great Good has a nice introduction to Applicatives.
Operators Cheat Sheet. We can use the Maybe
typeclass to illustrate the main applicative operators, since it implements an applicative functor.
(<*>)
Unwrap the contents of both sides, combine them and wrap again
(*>)
Unwrap the contents of both sides, but discard the result on the left
(<*)
Unwrap the contents of both sides, but discard the result on the right.
(<$>)
Unwrap the contents of the right, combine the left and right arguments and return
(<$)
Unwrap the contents of the right, but only wrap the one to the left
This analogy of wrappers applied to parsers is not as natural though. In this case, we can think of unwrapping as executing the parser, by consuming the input and wrapping the result as getting the parsed token. The unparsed string is always carried over from parser to parser.
Hopefully with the next examples this will become clearer:
Parsing the second word
If we are to get the token from the second parser instead of the first, we need to execute both parsers but ignore the result of the first. Thus, we can use the operator (*>)
to obtain something like
This won’t quite work because the first parser doesn’t consume the whitespace, so the second parser will stop before consuming anything. We can fix that by consuming the whitespace:
So let’s write:
and now we can test:
Parsing two words
We can also return both tokens if we use the operator (<*>)
and then combine them into a list:
Parsing multiple words
Generalizing, we can parse multiple words with the aid of the many
combinator:
We could actually write this using the sepBy1
parser, which parses a list of tokens separated by a separator and requires the list to have at least one element:
Simple CSV parser
With what we’ve seen so far, we can write a very basic CSV parser in 4 lines of code.
Note that it doesn't handle some corner cases like escaped commas within cells. For a full example, refer to either [1] or [2].
Recall that in Context Free Grammars, we can have production rules of the type:
\[S \rightarrow A \mid B\]which means that S can be generated either from \(A\) or \(B\). In Parsec, we can express this option using the (<|>)
operator. Let’s write a simple parser that parses either the "cat"
or "dog"
strings:
Testing on some inputs:
Let’s write another example with different animal names:
and try again with the input "cat"
:
The parser failed because the strings have common prefix. It started matching the camel parser, but it also consumed the "ca"
characters and then it failed to match the cat parser.
The try combinator
To avoid this, problem, there’s the try
combinator, which will make a parser to not consume its input if it fails to match:
which works as expected:
We can see that it’s straightforward to convert a standard context free grammar into a haskell program using parsec.
So far we our parsers have only returned strings and list of strings. We can use data types to structure our parsed data in a way that is easier to evaluate later.
For our example, we’ll build a very simple parser for expressions that only contain +
and -
binary operators, terminals are all integers and all binaries are surrounded by parenthesis so we don’t have to handle precedence. Examples of valid expressions are "12"
, "(1+2)"
, "((3+4)-5)"
, whereas "1+2"
is invalid (no parenthesis).
The first thing we want to do is to define our data types. Our number type, TNumber
, is just an alias to Int
. Our operator type, TOperator
can be one of addition (TAdd
) or subtraction (TSubtract
). Finally, the expression is either binary (TNode
) or a number (TTerminal
).
From what we’ve seen so far, it’s not very complicated to write parsers for TNumber
and TOperator
:
For the expression we have two choices. Either we parse another expression enclosed in parenthesis or we parse a terminal. In the first case, we call the binaryExpressionParser
which looks for the left expression, the operator and then the right expression.
And that’s it! We can now run an example with a valid expression:
The advantage of having this AST is that it’s now very simple to evaluate:
And the final test:
It works! We implemented a simple parser and interpreter for a very limited arithmetic expression. There are much better tools to do expression parsing (see * [5] for a tutorial), but it’s out of the scope of this post.
We’ve learned the basics of the Parsec library and built some non-trivial parsers gluing together basic parsers using combinators. We even started scratching the parsing of programming languages by writing a parser for arithmetic expressions.
The Parsec applications presented in the Real World Haskell book are great. I felt that the content was a bit hard to follow, but writing helped me get a better understanding of the subject.