Tutorial¶

Suppose we need to parse a list of numbers separated by commas without an external lexer. This means that we should build our parser from the simplest ones.

Parsing numbers¶

First, we need to recognize digits. For this we will use the reparsec.sequence.satisfy() parser. It is parameterized with a predicate to test the input token.

>>> from reparsec.sequence import satisfy
>>> digit = satisfy(str.isdigit)

Let’s try it in action. We can use the reparsec.Parser.parse() method of our freshly created parser to parse a string. It returns either a result of successful parse or an error. You can get the actual value or exception with an reparsec.ParseResult.unwrap() method:

>>> digit.parse("123").unwrap()
'1'
>>> digit.parse("a").unwrap()
Traceback (most recent call last):
  ...
reparsec.types.ParseError: at 0: unexpected input

So far, so good. Next, we want to parse numbers. For simplicity let’s assume that a number is a sequence of one or more digits:

>>> digits = digit + digit.many()

We use method reparsec.Parser.many() to construct parser that tries to apply original parser zero or more times, and operator + to sequentially apply two parsers.

>>> digits.parse("123").unwrap()
('1', ['2', '3'])

The output doesn’t looks like a number yet. We need reparsec.Parser.fmap() to convert it to a number:

>>> number = digits.fmap(lambda v: int(v[0] + "".join(v[1])))
>>> number.parse("123").unwrap()
123

Parsing lists¶

Now we are ready to parse the list. The list is just a sequence of numbers separated by commas. To parse a single comma we will use the reparsec.sequence.sym() parser, which is parameterized with expected character. Parsers for sequences with separators are usually constructed using the reparsec.Parser.sep_by() combinator:

>>> from reparsec.sequence import sym
>>> list_parser = number.sep_by(sym(","))
>>> list_parser.parse("12,34,56").unwrap()
[12, 34, 56]

Success!

Allowing whitespace¶

What if we want to allow whitespace around numbers? Let’s extend the parser to accept such inputs:

>>> space = satisfy(str.isspace)
>>> spaces = space.many()
>>> number = digits.fmap(lambda v: int(v[0] + "".join(v[1]))) << spaces
>>> comma = sym(",") << spaces
>>> list_parser = spaces >> number.sep_by(comma)
>>> list_parser.parse(" 1 , 2 ").unwrap()
[1, 2]

The << and >> operators used here are similar to +, but return only the value of left or right parser, respectively.

Parsing incorrect inputs¶

Until before we focused on parsing valid inputs. But what if we have a string with unexpected characters in it?

>>> list_parser.parse("1,a").unwrap()
Traceback (most recent call last):
  ...
reparsec.types.ParseError: at 2: unexpected input

The parser reported an error and provided a brief description of what was wrong with the input.

>>> list_parser.parse("1a").unwrap()
[1]

Ouch! While reporting errors in general, in some cases our parser silently ignores the rest of the input. Let’s fix this by requiring input to end right after the list using the reparsec.sequence.eof() parser:

>>> from reparsec.sequence import eof
>>> list_parser = spaces >> number.sep_by(comma) << eof()
>>> list_parser.parse("1a").unwrap()
Traceback (most recent call last):
  ...
reparsec.types.ParseError: at 1: expected ',' or end of file

Much better.

Improving error reporting¶

Let’s take a closer look at the errors messages:

>>> list_parser.parse("1 2").unwrap()
Traceback (most recent call last):
  ...
reparsec.types.ParseError: at 2: expected ',' or end of file

Seems informative.

>>> list_parser.parse("1,").unwrap()
Traceback (most recent call last):
  ...
reparsec.types.ParseError: at 2: unexpected input

This message is not very helpful. This is because the reparsec.sequence.satisfy() parser has no idea about the expected token. Let’s add some labels to help it with reparsec.Parser.label() combinator:

>>> digit = satisfy(str.isdigit).label("digit")
>>> digits = digit + digit.many()
>>> number = digits.fmap(
...     lambda v: int(v[0] + "".join(v[1]))
... ).label("number") << spaces
>>> list_parser = spaces >> number.sep_by(comma) << eof()
>>> list_parser.parse("1,").unwrap()
Traceback (most recent call last):
  ...
reparsec.types.ParseError: at 2: expected number

Recovering from errors¶

And now for something completely different:

>>> list_parser.parse("1 2", recover=True).unwrap(recover=True)
[1]

The parser recovered from the error and produced a partial result. Pretty useful. However, reparsec.satisfy() again doesn’t know how to fix input besides ignoring some parts of the input:

>>> list_parser.parse("1,", recover=True).unwrap(recover=True)
Traceback (most recent call last):
  ...
reparsec.types.ParseError: at 2: expected number

We can use reparsec.Parser.recover_with() to return some value during error recovery:

>>> list_parser = spaces >> number.recover_with(0).sep_by(comma) << eof()
>>> list_parser.parse("1,", recover=True).unwrap(recover=True)
[1, 0]

The parser is even capable of fixing multiple errors in the input:

>>> list_parser.parse("1,,,2 3", recover=True).unwrap(recover=True)
[1, 0, 0, 2]

And what if we want to show them to user?

>>> list_parser.parse("1,,,2 3", recover=True).unwrap()
Traceback (most recent call last):
  ...
reparsec.types.ParseError: at 2: expected number (inserted 0),
at 3: expected number (inserted 0),
at 6: expected ',' or end of file (skipped 1 token)

Line and column tracking¶

Error reporting still needs another improvement. All of the messages in the previous examples contains indexes in the input string as error positions, but it is more convenient to show line and column numbers instead. To achieve this, we will use reparsec.scannerless.parse(). This is a wrapper around reparsec.Parser.parse() that enables position tracking for parsers with string inputs:

>>> from reparsec.scannerless import parse
>>> src = """\
... 1,,
...  ,2
... 3
... """
>>> parse(list_parser, src, recover=True).unwrap()
Traceback (most recent call last):
  ...
reparsec.types.ParseError: at 1:3: expected number (inserted 0),
at 2:2: expected number (inserted 0),
at 3:1: expected ',' or end of file (skipped 2 tokens)

As a finishing touch, let’s write a helper function so that users of our parser don’t have to think about how to properly invoke the parser:

>>> from typing import List
>>> def parse_list(src: str) -> List[int]:
...     return parse(list_parser, src, recover=True).unwrap()
>>> parse_list("1, 2, 3")
[1, 2, 3]
>>> parse_list("1, ,2 3")
Traceback (most recent call last):
  ...
reparsec.types.ParseError: at 1:4: expected number (inserted 0),
at 1:7: expected ',' or end of file (skipped 1 token)

Conclusion¶

The final parser definition should look like this:

from typing import List

from reparsec.scannerless import parse
from reparsec.sequence import eof, satisfy, sym

spaces = satisfy(str.isspace).many()

digit = satisfy(str.isdigit).label("digit")
digits = digit + digit.many()

number = digits.fmap(
    lambda v: int(v[0] + "".join(v[1]))
).label("number") << spaces

comma = sym(",") << spaces

list_parser = spaces >> number.recover_with(0).sep_by(comma) << eof()

def parse_list(src: str) -> List[int]:
    return parse(list_parser, src, recover=True).unwrap()