kuniga.me > NP-Incompleteness > An Introduction to the Parsec Library

An Introduction to the Parsec Library

21 Jan 2014

Introduction

The process of compiling or interpreting programs requires, as one of its steps, parsing the source code and structuring them. More specifically, we have a step to convert a string into a set of tokens, called lexical analysis, which is carried out by a lexer or tokenizer.

After that, we structure those tokens in a such a way that encodes meaning of these tokens, such as abstract syntax trees (AST). This step is called parsing and is out by a parser.

The parsec library

Parse combinators are high-order functions that combine smaller combinators (parsers) to build more complex ones. This resembles the idea of context free grammars (CFG), which we talked about in a previous post, where we have productions of form

\[S \rightarrow A \mid B\]

Here, $S$ can be formed by either from $A$ or $B$, which in turn, can be other productions or terminals.

The parsec library is an implementation of a parser combinator in Haskell. We talked about combinators in Haskell previously (in portuguese).

Parsec vs. Bison/Yacc/Antlr

All Bison, Yacc and Antlr are not actual parsers, but rather parsers generators. They take a grammar file and generate parsers for the languages that can be described by those grammars.

Parsec, on the other hand, is a parser that you write yourself.

In this post we’ll go through several basic concepts of the parsec library, using as the main reference, the book Real World Haskell, Chapter 16.

The source code for all the examples to follow can be found on the blog’s github repository.

Basic combinators

The basic idea of a parser is that it takes an input (for example, a string), it consumes the characters of this string until it’s done with what it’s supposed to parse and then pass the remaining (unparsed) string along, which might be used as input to subsequent parsers.

One of the simples parsers is the one that only consumes a single specific character. There’s a parser named char at Data.Char library, so let’s write one that parses the letter a:

import Data.Char (char)
charAParser = (char 'a')

To test our parser with an input, we use the parse function form Text.Parsec

import Text.Parsec
-- parsedInput = parse someParser "source name" "some input"

This function takes as input a parser, a source name and the actual input. The source name parameter is not important for us now, so we can pass the empty string. Let’s write a simple wrapper to avoid boilerplate:

test p = parse p ""

We can now test our parser with same sample inputs:

> test charAParser "a"
Right 'a'
> test charAParser "ab"
Right 'a'
> test charAParser "ba"
Left (line 1, column 1):
unexpected "b"
expecting "a"

It extracts the first character of the input string if it’s the 'a' character, otherwise it throws an error. If we want to match any char, there’s also the function anyChar. Running it with the same examples:

> test anyChar "a"
Right 'a'
> test anyChar "ab"
Right 'a'
> test anyChar "ba"
Right 'b'

Note that it doesn’t fail for strings starting with 'b'. So far our parsers only match one character, so for example, the string "ab", it only returns the first character.

We can use a parser for the string too. There’s the string combinator but let’s develop our own and show how we can combine combinators to form new ones. There’s the many combinator that applies the combinator passed as argument until it fails.

Thus, we can write a string parser as many anyChar:

stringParser:: Parsec String st String
stringParser = many anyChar

Now let’s try it with the string "ab":

> test stringParser "ab"
Right "ab"

More useful than matching all characters is matching all except some, so we know when to stop parsing. For that, we can use noneOf instead of anyChar. It takes a list of characters as parameter and matches any character that is not on that list.

Let’s now write a wordParser, which keeps parsing all characters until it finds an whitespace:

wordParser:: Parsec String st String
wordParser = many $ noneOf [' ']

Let’s try it on the most classical string example:

> test wordParser "hello world"
Right "hello"

Note that our parsers are throwing away all the unparsed strings. How can we parse the remaining, unparsed string?

The two flavors of the Parsec library: Monads and Applicatives Functors

We’ve talked about Functors and Monads before, but not about Applicatives functors.

Intuitively, they are a structure in between Functors and Monads, that is, they’re more complex and general than Functors but less than Monads. We can also make the analogy of wrappers that we did for monads.

Originally, the Parsec library was written with Monads in mind, but Applicative functors were introduced after that and using them to write parsers usually leads to more clear syntax. So, in this post, we’ll use the applicative flavor to write our parsers.

Here, we’ll only provide an overview of some of the main applicative operators. For further details, the book Learn You a Haskell for Great Good has a nice introduction to Applicatives.

Operators Cheat Sheet. We can use the Maybe typeclass to illustrate the main applicative operators, since it implements an applicative functor.

(<*>) Unwrap the contents of both sides, combine them and wrap again

> Just (+3) <*> Just 9
Just 12

(*>) Unwrap the contents of both sides, but discard the result on the left

> Just (+3) *> Just 9
Just 9
> Just 7 *> Just 8
Just 8

(<*) Unwrap the contents of both sides, but discard the result on the right.

> Just 7 <* Just 8
Just 7

(<$>) Unwrap the contents of the right, combine the left and right arguments and return

> (+3) <$> Just 9
Just 12

(<$) Unwrap the contents of the right, but only wrap the one to the left

> 3 <$ Just 9
Just 3

This analogy of wrappers applied to parsers is not as natural though. In this case, we can think of unwrapping as executing the parser, by consuming the input and wrapping the result as getting the parsed token. The unparsed string is always carried over from parser to parser.

Hopefully with the next examples this will become clearer:

Parsing the second word

If we are to get the token from the second parser instead of the first, we need to execute both parsers but ignore the result of the first. Thus, we can use the operator (*>) to obtain something like

wordParser *> wordParser

This won’t quite work because the first parser doesn’t consume the whitespace, so the second parser will stop before consuming anything. We can fix that by consuming the whitespace:

wordParser *> (char ' ') *> wordParser

So let’s write:

secondWordParser:: Parsec String st String
secondWordParser = wordParser *> (char ' ')  *> wordParser

and now we can test:

> test secondWordParser "ab cd"
> cd

Parsing two words

We can also return both tokens if we use the operator (<*>) and then combine them into a list:

twoWordsParser:: Parsec String st [String]
twoWordsParser = listfy <$> wordParser <*> ((char ' ') *> wordParser)
                   where listfy a b = [a, b]

Parsing multiple words

Generalizing, we can parse multiple words with the aid of the many combinator:

wordsParser:: Parsec String st [String]
wordsParser = (:) <$> wordParser <*> many ((char ' ') *> wordParser)

We could actually write this using the sepBy1 parser, which parses a list of tokens separated by a separator and requires the list to have at least one element:

wordsParserAlt:: Parsec String st [String]
wordsParserAlt = sepBy1 (char ' ')

Simple CSV parser

With what we’ve seen so far, we can write a very basic CSV parser in 4 lines of code.

csvParser:: Parsec String st [[String]]
csvParser = lineParser `endBy` newline <* eof
              where lineParser = cellParser `sepBy` (char &#039;,&#039;)
                    cellParser = many $ noneOf ",\n"

Note that it doesn't handle some corner cases like escaped commas within cells. For a full example, refer to either [1] or [2].

Choosing combinators

Recall that in Context Free Grammars, we can have production rules of the type:

\[S \rightarrow A \mid B\]

which means that S can be generated either from $A$ or $B$. In Parsec, we can express this option using the (<|>) operator. Let’s write a simple parser that parses either the "cat" or "dog" strings:

dogCatParser:: Parsec String st String
dogCatParser = (string "dog") <|> (string "cat")

Testing on some inputs:

> test dogCatParser "dog"
Right "dog"
> test dogCatParser "cat"
Right "cat"
> test dogCatParser "elephant"
Left (line 1, column 1):
unexpected "e"
expecting "cat" or "dog"

Let’s write another example with different animal names:

camelCatParser:: Parsec String st String
camelCatParser = (string "camel") <|> (string "cat")

and try again with the input "cat":

> test camelCatParser "cat"
Left (line 1, column 1):
unexpected "t"
expecting "camel"

The parser failed because the strings have common prefix. It started matching the camel parser, but it also consumed the "ca" characters and then it failed to match the cat parser.

The try combinator

To avoid this, problem, there’s the try combinator, which will make a parser to not consume its input if it fails to match:

camelCatTryParser:: Parsec String st String
camelCatTryParser = try (string "camel") <|> (string "cat")

which works as expected:

> test camelCatTryParser "cat"
Right "cat"

We can see that it’s straightforward to convert a standard context free grammar into a haskell program using parsec.

Simple Expression Parser

So far we our parsers have only returned strings and list of strings. We can use data types to structure our parsed data in a way that is easier to evaluate later.

For our example, we’ll build a very simple parser for expressions that only contain + and - binary operators, terminals are all integers and all binaries are surrounded by parenthesis so we don’t have to handle precedence. Examples of valid expressions are "12", "(1+2)", "((3+4)-5)", whereas "1+2" is invalid (no parenthesis).

The first thing we want to do is to define our data types. Our number type, TNumber, is just an alias to Int. Our operator type, TOperator can be one of addition (TAdd) or subtraction (TSubtract). Finally, the expression is either binary (TNode) or a number (TTerminal).

type TNumber = Int

data TOperator = TAdd
               | TSubtract
                 deriving (Eq, Ord, Show)

data TExpression = TNode (TExpression) TOperator (TExpression)
                 | TTerminal TNumber
                   deriving (Show)

From what we’ve seen so far, it’s not very complicated to write parsers for TNumber and TOperator:

numberParser:: Parsec String st TNumber
numberParser = read <$> (many $ oneOf "0123456789")

operatorParser:: Parsec String st TOperator
operatorParser = chooseOp <$> (oneOf "+-")
                   where chooseOp '+' = TAdd
                         chooseOp '-' = TSubtract

For the expression we have two choices. Either we parse another expression enclosed in parenthesis or we parse a terminal. In the first case, we call the binaryExpressionParser which looks for the left expression, the operator and then the right expression.

expressionParser:: Parsec String st TExpression
expressionParser = (between (char '(') (char ')') binaryExpressionParser) <|>
                   (TTerminal <$> numberParser)

binaryExpressionParser:: Parsec String st TExpression
binaryExpressionParser = TNode <$> expressionParser <*> operatorParser <*> expressionParser

And that’s it! We can now run an example with a valid expression:

> test expressionParser "(123+(324-456))"
Right (TNode (TTerminal 123) TAdd (TNode (TTerminal 324) TSubtract (TTerminal 456)))

The advantage of having this AST is that it’s now very simple to evaluate:

evaluate:: TExpression -> TNumber
evaluate (TNode exp1 TAdd exp2)      = (evaluate exp1) + (evaluate exp2)
evaluate (TNode exp1 TSubtract exp2) = (evaluate exp1) - (evaluate exp2)
evaluate (TTerminal v)               = v

And the final test:

> let Right x = expressionParser "(123+(324-456))"
> evaluate x
-9

It works! We implemented a simple parser and interpreter for a very limited arithmetic expression. There are much better tools to do expression parsing (see * [5] for a tutorial), but it’s out of the scope of this post.

Conclusion

We’ve learned the basics of the Parsec library and built some non-trivial parsers gluing together basic parsers using combinators. We even started scratching the parsing of programming languages by writing a parser for arithmetic expressions.

The Parsec applications presented in the Real World Haskell book are great. I felt that the content was a bit hard to follow, but writing helped me get a better understanding of the subject.

References

[1] Real World Haskell - Chapter 16
[2] A gentle introduction to Parsec
[3] StackOverflow - Parsec vs Yacc/Bison/Antlr: Why and when to use Parsec?
[4] Wikipedia - Parser Combinator
[5] Haskell Wiki - Parsing expressions and statements

| Tags: automata, compilers, haskell