Frequently Asked Questions

Some pyparsing questions just keep popping up. This page is open for public updates, so feel free to add your own 2 cents.

When should I use pyparsing and when should I use regex?


If I have a task that a regular expression could do fairly easily, should I consider it, or just use pyparsing?

If the regex is something you can grasp easily so that you can maintain it over time, then use the regex. If you are doing some kind of regex+pyparsing hybrid, then use a Regex object in pyparsing. I often do this for matching decimal numbers, defining them as
Regex(r"\d+\.\d*([Ee][+-]?\d+)?") 

instead of

Combine(Word(nums) + "." + Optional(Word(nums)) + Optional(CaselessLiteral("E") + Optional(oneOf("+ -")) + Word(nums)))

There is a substantial speed improvement, especially when this expression is used frequently in the parser, or as an operand in infixNotation (formerly operatorPrecedence). And I don't see a huge loss of maintainability here - this regex is pretty easy to break up and understand.
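
For example, here is a minimal sketch (the names realnum and arith are just for illustration) of dropping such a Regex terminal into a grammar, converting the matched text to a float with a parse action, and then using it as an operand in an infixNotation expression:

from pyparsing import Regex, infixNotation, opAssoc, oneOf

# regex-based terminal for decimal numbers, converted to float on match
realnum = Regex(r"\d+\.\d*([Ee][+-]?\d+)?")
realnum.setParseAction(lambda t: float(t[0]))

# use the terminal as the operand in a small arithmetic grammar
arith = infixNotation(realnum,
    [
        (oneOf("* /"), 2, opAssoc.LEFT),
        (oneOf("+ -"), 2, opAssoc.LEFT),
    ])

print arith.parseString("1.5 + 2.0e1 * 3.25")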

If you can use ordinary string methods, you should use those. People often overlook startswith and endswith for matching leading or trailing strings. Or they will write re.match(r".*BLAH.*", astring), instead of just using "BLAH" in astring, or astring.index("BLAH").
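
For instance, any of these plain-Python checks will do the job without a regex or a parser (astring here is just a made-up sample string):

astring = "some text with BLAH in the middle"

print "BLAH" in astring            # membership test
print astring.index("BLAH")        # position of the substring
print astring.startswith("some")   # leading match
print astring.endswith("middle")   # trailing match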

So unfortunately, the answer is "it depends". Is this an expression that will likely evolve over time? Then use pyparsing. Is it a short, easily-grasped regex? Use the regex. Is the regex full of \s* elements (to skip over potential whitespace)? Consider using pyparsing. Are you parsing HTML? Definitely consider pyparsing over regex.
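
To illustrate the whitespace point, here is a small sketch (the key=value format is just an invented example) contrasting the explicit \s* markers a regex needs with pyparsing's automatic whitespace skipping:

import re
from pyparsing import Word, Suppress, alphas, alphanums

data = "  key =   value42  "

# the regex must spell out every place whitespace may appear
print re.match(r"\s*(\w+)\s*=\s*(\w+)", data).groups()

# pyparsing skips whitespace between expressions automatically
assignment = Word(alphas) + Suppress("=") + Word(alphanums)
print assignment.parseString(data)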

The biggest factor is probably your own comfort level with using regex. When I wrote pyparsing, I was not very strong in using regex, so I would use pyparsing for almost everything. But since then, my regex skills have improved, and so I'll at least consider them when I don't want to have to add a dependency on pyparsing.


How do I get pyparsing to parse the entire input string?


I have a problem with pyparsing seemingly silently failing to parse the rest of the input text when it encounters an error inside a block, if that block is part of a ZeroOrMore or OneOrMore expression and at least one such block has already been matched successfully.

from pyparsing import *

alphaWord = Word(alphas)
p = OneOrMore(alphaWord)
print p.parseString("abc def ghi 123 jkl")
This prints
['abc', 'def', 'ghi']
but raises no exception.

To have pyparsing parse the entire input string, have the grammar end with a StringEnd() expression:

p = OneOrMore(alphaWord) + StringEnd()
print p.parseString("abc def ghi 123 jkl")

raises this exception:
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "c:\python24\lib\site-packages\pyparsing.py", line 703, in parseString
    loc, tokens = self._parse( instring.expandtabs(), 0 )
  File "c:\python24\lib\site-packages\pyparsing.py", line 596, in _parseNoCache
    loc,tokens = self.parseImpl( instring, loc, doActions )
  File "c:\python24\lib\site-packages\pyparsing.py", line 1741, in parseImpl
    loc, exprtokens = e._parse( instring, loc, doActions )
  File "c:\python24\lib\site-packages\pyparsing.py", line 600, in _parseNoCache
    loc,tokens = self.parseImpl( instring, loc, doActions )
  File "c:\python24\lib\site-packages\pyparsing.py", line 1619, in parseImpl
    raise exc
pyparsing.ParseException: Expected end of text (at char 12), (line:1, col:13)

Note: as of pyparsing 1.5.0, you can also call parseString with the optional argument
parseAll=True, without having to explicitly add StringEnd() to your grammar. This allows you to write:
p = OneOrMore(alphaWord)
print p.parseString("abc def ghi 123 jkl", parseAll=True)
and get the same behavior as shown above.

How do I get the line number of a matching token?


By default, a ParseResults object includes the matched text as token strings and, optionally, keyword associations for fields marked with results names, but no additional data (such as the parse location or line number). We can expand the results using a parse action. Parse actions can return any kind of object in place of the original parsed strings, so we can construct a tuple containing additional contextual information.

Let's look for all the forms of the word "dedicate" in the Gettysburg address (sort of our own mini-google of the speech's text).

from pyparsing import *
 
gettysburgAddress = """Four-score and seven years ago, our forefathers brought forth upon
this continent a new nation, conceived in liberty, and dedicated to
the proposition that all men are created equal.
 
Now we are engaged in a great civil war, testing whether that nation,
or any nation, so conceived and so dedicated, can long endure.  We
are met on a great battlefield of that war.  We have come to dedicate
a portion of that field, as a final resting place for those who here
gave their lives that that nation might live.  It is altogether
fitting and proper that we should do this.
 
But, in a larger sense, we cannot dedicate - we cannot consecrate -
we cannot hallow this ground.  The brave men, living and dead, who
struggled here, have consecrated it, far above our poor power to add
or detract.  The world will little note, nor long remember what we say
here, but it can never forget what they did here.  It is for us the
living, rather, to be dedicated here to the unfinished work which they
who fought here have thus far so nobly advanced.  It is rather for us
to be here dedicated to the great task remaining before us - that from
these honored dead we take increased devotion to that cause for which
they gave the last full measure of devotion - that we here highly
resolve that these dead shall not have died in vain - that this
nation, under God, shall have a new birth of freedom - and that
government of the people, by the people, for the people, shall not
perish from the earth."""
 
# define the search target - word forms of "dedicate"
searchWords = oneOf("dedicate dedicated dedicating dedication",
                    caseless=True)
 
# define and attach parse action, returning a tuple of the matched
# word, and the line and column of the match (using pyparsing's lineno
# and col functions)
searchWords.setParseAction(
        lambda origString,loc,tokens:
            ( tokens[0], lineno(loc,origString), col(loc,origString) )
        )
 
# scan the string looking for matching words
for toks,startLoc,endLoc in searchWords.scanString( gettysburgAddress ):
    print "%s: line %d, col %d" % toks[0]
 
    # scanString also returns the parsing location - we'll use it here
    # to print the entire line of text
    print line(startLoc, gettysburgAddress)
 
    print

This prints the following output.

dedicated: line 2, col 56
this continent a new nation, conceived in liberty, and dedicated to
 
dedicated: line 6, col 36
or any nation, so conceived and so dedicated, can long endure.  We
 
dedicate: line 7, col 62
are met on a great battlefield of that war.  We have come to dedicate
 
dedicate: line 12, col 35
But, in a larger sense, we cannot dedicate - we cannot consecrate -
 
dedicated: line 17, col 23
living, rather, to be dedicated here to the unfinished work which they
 
dedicated: line 19, col 12
to be here dedicated to the great task remaining before us - that from
 
There is also a similar example that ships with pyparsing.

What the heck is "packrat parsing"?



Packrat parsing was first suggested to me in mid-2005 by an MIT'er, Chris Lesniewski-Laas, as a way to speed up pyparsing performance. As it happens, pyparsing's architecture made packratting almost a natural fit.

In a recursive descent (RD) parser, the grammar is traversed while walking through the input string. As paths through the grammar are walked, matches succeed and the parser advances forward along the input, or matches fail and the parser reverts to the last known good spot with an alternative grammar expression to try. Sometimes the same expression is evaluated multiple times at the same location, being common to several alternatives.

Probably every RD parser written in Python would have some method like
newLocation = parseExpressionAt(expr, inputString, currentParseLocation)
Adding packrat parsing to such a method merely requires including some mechanism for memoizing the return values for the previously seen (expr, inputString, currentParseLocation) tuples, so that if that expression is tried again later, on the same input string at the same location, there is no need to invoke the expr's own parsing function, you can just return the previously computed values from the memoizing cache.
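
Here is a rough, purely illustrative sketch of that idea - not pyparsing's actual code. The tryParseAt method is a made-up stand-in for whatever the parser's real per-expression parse routine is, and a real implementation would also need to cache parse failures, not just successes:

_packratCache = {}

def parseExpressionAt(expr, inputString, currentParseLocation):
    # key on the expression's identity plus the input and parse position
    key = (id(expr), inputString, currentParseLocation)
    if key in _packratCache:
        # this expression was already tried here - reuse the saved result
        return _packratCache[key]
    result = expr.tryParseAt(inputString, currentParseLocation)  # hypothetical method
    _packratCache[key] = result
    return result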

Pyparsing has had some mixed success with packrat parsing - mostly good results, really - but sometimes a parse action attached to an expression really must run in the proper matching context, and not just the first time the expression is seen at a given location. Most academic discussions of packrat parsing do not consider this behavior, assuming the parser's sole role is to build a structure of matched tokens for subsequent semantic processing. Many of my unit tests and performance tests did very well with packratting, but some unit tests failed - mostly those with parse actions that have global side effects. Since I was adding packratting to an existing module, I took the conservative path and shipped pyparsing with packratting disabled, to be consciously enabled in the calling grammar.

Internally, pyparsing has two methods defined at the base ParserElement class: _parseCache and _parseNoCache, with the static attribute _parse pointing to _parseNoCache by default. If you enable packratting, all that happens is that _parse is changed to point to _parseCache - all the rest of the code references the _parse method, so this single assignment is sufficient to enable packratting.
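
To turn packratting on in your own program, make a single call to ParserElement.enablePackrat() right after importing pyparsing, before defining your grammar:

from pyparsing import ParserElement

# switch _parse from _parseNoCache to the memoizing _parseCache
ParserElement.enablePackrat()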

You can read the full dissertation by Bryan Ford of MIT, and other packrat-related articles, at http://pdos.csail.mit.edu/~baford/packrat/.

Could someone please give an example of assigning results names? I have trouble using the parsed results.

Here is a snippet from the SimpleSQL.py example on the Examples page:
selectStmt << ( selectToken +
              ( '*' | columnNameList ).setResultsName( "columns" ) +
              fromToken +
              tableNameList.setResultsName( "tables" ) +
              Optional( Group( CaselessLiteral("where") + whereExpression ),
                                                   "" ).setResultsName("where") )

To get the data from an SQL string, save the results to a variable:
sqlData = selectStmt.parseString("SELECT * from EMPLOYEES where Dept='Sales'")

You can then get the data fields with the results names as attributes of sqlData:
print sqlData.columns, sqlData.tables
print "Selecting columns %(columns)s from table %(tables)s" % sqlData

If you prefer, you can use dict-style access:
print sqlData["columns"]

For debugging, you can quickly view all the tokens in nested list form, followed by the list of key names and their values using dump():
print sqlData.dump()
which prints:
['select', '*', 'from', ['EMPLOYEES'], ['where', ['DEPT', '=', "'Sales'"]]]
- columns: *
- tables: ['EMPLOYEES']
- where: [['where', ['DEPT', '=', "'Sales'"]]]

In pyparsing 1.4.8, I introduced a new syntax for setting results names. The above code can be written more simply as:
selectStmt << ( selectToken +
              ( '*' | columnNameList )( "columns" ) +
              fromToken +
              tableNameList( "tables" ) +
              Optional( Group( CaselessLiteral("where") + whereExpression ), "" )("where") )
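
And as a quick standalone illustration of the call-style syntax (the expression and the names "first", "last", and "age" are invented for this example):

from pyparsing import Word, alphas, nums

person = Word(alphas)("first") + Word(alphas)("last") + Word(nums)("age")

rec = person.parseString("John Doe 42")
print rec.first, rec.last, rec.age   # prints: John Doe 42
print rec.dump()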


What versions of Python does pyparsing support? Does pyparsing support Python 3?

Pyparsing version 1.5.7 works best with Python versions 2.4 and 2.5. I have tried to maintain compatibility with Python 2.3, but this does not always work. This version is kept for compatibility with these older versions of Python, but is no longer being actively maintained.

Pyparsing version 2.0 and above works with Python 2.6, 2.7, and Python 3. This is the branch that is still being actively supported.