FAQs

=Frequently Asked Questions=
Some pyparsing questions just keep popping up. This page is open for public updates, so feel free to add your own 2 cents.

When should I use pyparsing and when should I use regex?
//If I have a task that a regular expression could do fairly easily, should I consider it, or just use pyparsing? //

If the regex is something you can grasp easily so that you can maintain it over time, then use the regex. If you are doing some kind of regex+pyparsing hybrid, then use a Regex object in pyparsing. (I often do this for matching decimal numbers, defining them as
code
Regex(r"\d+\.\d*([Ee][+-]?\d+)?")
code
instead of
code
Combine(Word(nums) + "." + Optional(Word(nums)) + Optional(CaselessLiteral("E") + Optional(oneOf("+ -")) + Word(nums)))
code
There is a substantial speed improvement, especially when this expression is used frequently in the parser, or as an operand in infixNotation (formerly operatorPrecedence). And I don't see a huge loss of maintainability here: this regex is pretty easy to break up and understand.)

If you can use ordinary string methods, you should use those. People often overlook startswith and endswith for matching leading or trailing strings. Or they will write re.match(r".*BLAH.*", astring), instead of just using "BLAH" in astring, or astring.index("BLAH").

So unfortunately, the answer is "it depends". Is this an expression that will likely evolve over time? Then use pyparsing. Is it a short, easily-grasped regex? Use the regex. Is the regex full of ``\s*`` elements (to skip over potential whitespace)? Consider using pyparsing. Are you parsing HTML? Definitely consider pyparsing over regex.

The biggest factor is probably your own comfort level with using regex. When I wrote pyparsing, I was not very strong in using regex, so I would use pyparsing for almost everything. But since then, my regex skills have improved, and so I'll at least consider them when I don't want to have to add a dependency on pyparsing.
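As a rough illustration of the plain-string and plain-regex options mentioned above, here is a small sketch that uses only the standard library. The sample strings and variable names are made up for the example; only the decimal-number pattern comes from the answer above.

```python
import re

astring = "prefix BLAH suffix"   # made-up sample text

# Substring test: no regex needed
has_blah = "BLAH" in astring

# Leading and trailing matches: startswith and endswith
starts_ok = astring.startswith("prefix")
ends_ok = astring.endswith("suffix")

# The decimal-number pattern from above, used directly with the re module
real_number = re.compile(r"\d+\.\d*([Ee][+-]?\d+)?")
found = [m.group() for m in real_number.finditer("6.02E23 and 3.14 but not 42")]

print(has_blah, starts_ok, ends_ok, found)
```

Note that the pattern requires a decimal point, so a bare integer such as 42 is not matched.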

How to get pyparsing to parse the entire input string?
//I have a problem with pyparsing seemingly silently failing to parse the remaining part of the text it is to parse if it encounters an error in a block if this block is part of a ZeroOrMore or OneOrMore section and it already successfully has acquired one such block.//

code
alphaWord = Word(alphas)
p = OneOrMore(alphaWord)
print(p.parseString("abc def ghi 123 jkl"))
code
//This prints ['abc', 'def', 'ghi'] but raises no exception.//

To have pyparsing parse the entire input string, have the grammar end with a StringEnd expression:

code
p = OneOrMore(alphaWord) + StringEnd()
print(p.parseString("abc def ghi 123 jkl"))
code

This raises the following exception:
code
Traceback (most recent call last):
  File " ", line 1, in ?
  File "c:\python24\lib\site-packages\pyparsing.py", line 703, in parseString
    loc, tokens = self._parse( instring.expandtabs(), 0 )
  File "c:\python24\lib\site-packages\pyparsing.py", line 596, in _parseNoCache
    loc,tokens = self.parseImpl( instring, loc, doActions )
  File "c:\python24\lib\site-packages\pyparsing.py", line 1741, in parseImpl
    loc, exprtokens = e._parse( instring, loc, doActions )
  File "c:\python24\lib\site-packages\pyparsing.py", line 600, in _parseNoCache
    loc,tokens = self.parseImpl( instring, loc, doActions )
  File "c:\python24\lib\site-packages\pyparsing.py", line 1619, in parseImpl
    raise exc
pyparsing.ParseException: Expected end of text (at char 12), (line:1, col:13)
code

Note: as of pyparsing 1.5.0, you can also call parseString with the optional argument parseAll=True, without having to explicitly add StringEnd to your grammar. This allows you to write:
code
p = OneOrMore(alphaWord)
print(p.parseString("abc def ghi 123 jkl", parseAll=True))
code
and get the same behavior as shown above.
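If you would rather handle the failure than let the traceback escape, something along these lines should work. This is only a sketch: it assumes a pyparsing version that supports parseAll, and the variable names partial and error_location are made up for the example.

```python
from pyparsing import Word, alphas, OneOrMore, ParseException

alphaWord = Word(alphas)
p = OneOrMore(alphaWord)

# Without parseAll, parsing stops quietly at the first non-matching text
partial = p.parseString("abc def ghi 123 jkl").asList()
print(partial)

# With parseAll=True, leftover input raises ParseException
try:
    p.parseString("abc def ghi 123 jkl", parseAll=True)
    error_location = None
except ParseException as pe:
    # pe.loc is the character offset where parsing could not continue
    error_location = pe.loc
    print("parse failed at char %d: %s" % (pe.loc, pe.msg))
```

The exception also carries lineno and col attributes, which are handy when reporting errors in multi-line input.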

How do I get the line number of a matching token?
By default ParseResults include the matched text as token strings, and optionally, keyword associations for fields marked with results names, but no additional data (such as the parsing location, or line number). We can expand the results using a parse action. Parse actions can return any kind of object in place of the original parsed strings, so we can construct a tuple containing additional contextual information.

Let's look for all the forms of the word "dedicate" in the Gettysburg address (sort of our own mini-google of the speech's text).

code from pyparsing import *

gettysburgAddress = """Four-score and seven years ago, our forefathers brought forth upon this continent a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal.

Now we are engaged in a great civil war, testing whether that nation, or any nation, so conceived and so dedicated, can long endure. We are met on a great battlefield of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this.

But, in a larger sense, we cannot dedicate - we cannot consecrate - we cannot hallow this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us - that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion - that we here highly resolve that these dead shall not have died in vain - that this nation, under God, shall have a new birth of freedom - and that government of the people, by the people, for the people, shall not perish from the earth."""

# define the search target - word forms of "dedicate"
searchWords = oneOf("dedicate dedicated dedicating dedication", caseless=True)

# define and attach parse action, returning a tuple of the matched
# word, and the line and column of the match (using pyparsing's lineno
# and col functions)
searchWords.setParseAction(
    lambda origString, loc, tokens:
        (tokens[0], lineno(loc, origString), col(loc, origString))
)

# scan the string looking for matching words
for toks, startLoc, endLoc in searchWords.scanString(gettysburgAddress):
    print("%s: line %d, col %d" % toks[0])

    # scanString also returns the parsing location - we'll use it here
    # to print the entire line of text
    print(line(startLoc, gettysburgAddress))

    print()
code

This prints the following output.

code
dedicated: line 2, col 56
this continent a new nation, conceived in liberty, and dedicated to

dedicated: line 6, col 36
or any nation, so conceived and so dedicated, can long endure. We

dedicate: line 7, col 62
are met on a great battlefield of that war. We have come to dedicate

dedicate: line 12, col 35
But, in a larger sense, we cannot dedicate - we cannot consecrate -

dedicated: line 17, col 23
living, rather, to be dedicated here to the unfinished work which they

dedicated: line 19, col 12
to be here dedicated to the great task remaining before us - that from
code

There is also an example that ships with pyparsing, unimaginatively named.

What the heck is "packrat parsing"?
Packrat parsing was first suggested to me in mid-2005 by an MIT'er, Chris Lesniewski-Laas, as a way to speed up pyparsing performance. As it happens, pyparsing's architecture made packratting almost a natural fit.

In a recursive descent (RD) parser, the grammar is traversed while walking through the input string. As paths through the grammar are walked, matches succeed and the parser advances forward along the input, or matches fail and the parser reverts to the last known good spot with an alternative grammar expression to try. Sometimes the same expression is evaluated multiple times at the same location, being common to several alternatives.

Probably every RD parser written in Python would have some method like
code
newLocation = parseExpressionAt(expr, inputString, currentParseLocation)
code
Adding packrat parsing to such a method merely requires some mechanism for memoizing the return values for previously seen (expr, inputString, currentParseLocation) tuples. If that expression is tried again later, on the same input string at the same location, there is no need to invoke the expr's own parsing function; you can just return the previously computed values from the memoizing cache.

Pyparsing has had mixed success with packrat parsing. The results are mostly good, but sometimes a parse action attached to an expression really must run in the proper matching context, and not just the first time an expression might be seen at a location. Most academic discussions of packrat parsing do not consider this behavior, assuming the parser's sole role is to build a structure of matched tokens for subsequent semantic processing. Many of my unit tests and performance tests did very well with packratting, but some unit tests failed, mostly those with parse actions that have global side effects. Since I was adding packratting to an existing module, I took the conservative path and shipped pyparsing with packratting disabled, to be consciously enabled in the calling grammar.

Internally, pyparsing has two methods defined at the base ParserElement class: _parseCache and _parseNoCache, with the static attribute _parse pointing to _parseNoCache by default. If you enable packratting, all that happens is that _parse is changed to point to _parseCache - all the rest of the code references the _parse method, so this single assignment is sufficient to enable packratting.
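A minimal sketch of turning packratting on, assuming a pyparsing version that provides the ParserElement.enablePackrat() static method. The tiny grammar and the names word_or_number and tokens are made up for illustration.

```python
from pyparsing import ParserElement, Word, alphas, nums, OneOrMore

# Enable packratting up front, before the grammar is built or used
ParserElement.enablePackrat()

word_or_number = Word(alphas) | Word(nums)
tokens = OneOrMore(word_or_number).parseString("abc 123 def").asList()
print(tokens)
```

The parsed results should be the same with or without packratting for a grammar like this one, which has no parse actions with side effects.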

You can read the full dissertation by Bryan Ford of MIT, and other packrat-related articles, at http://pdos.csail.mit.edu/~baford/packrat/

Could someone please give an example of assigning names? I have trouble using parsing results.
Here is a snippet from the SimpleSQL.py example on the Examples page:
code
selectStmt << ( selectToken +
                ( '*' | columnNameList ).setResultsName( "columns" ) +
                fromToken +
                tableNameList.setResultsName( "tables" ) +
                Optional( Group( CaselessLiteral("where") + whereExpression ), "" ).setResultsName("where") )
code

To get the data from an SQL string, save the results to a variable:
code
sqlData = selectStmt.parseString("SELECT * from EMPLOYEES where Dept='Sales'")
code

You can then get the data fields with the results names as attributes of sqlData:
code
print(sqlData.columns, sqlData.tables)
print("Selecting columns %(columns)s from table %(tables)s" % sqlData)
code

If you prefer, you can use dict-style access:
code
print(sqlData["columns"])
code

For debugging, you can quickly view all the tokens in nested list form, followed by the list of key names and their values, using dump():
code
print(sqlData.dump())
code
which prints:
code
['select', '*', 'from', ['EMPLOYEES'], ['where', ['DEPT', '=', "'Sales'"]]]
- columns: *
- tables: ['EMPLOYEES']
- where: 'where', ['DEPT', '=', "'Sales'"]
code

In pyparsing 1.4.8, I introduced a new syntax for setting results names. The above code can be written more simply as:
code
selectStmt << ( selectToken +
                ( '*' | columnNameList )( "columns" ) +
                fromToken +
                tableNameList( "tables" ) +
                Optional( Group( CaselessLiteral("where") + whereExpression ), "" )("where") )
code
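The SimpleSQL snippet depends on definitions that are not shown here, so the following is a smaller, self-contained sketch of the same idea. The key = value mini-grammar and the names assignment and result are hypothetical, invented just for this example.

```python
from pyparsing import Word, alphas, nums, Suppress

# Hypothetical mini-grammar for input such as "width = 42";
# the call syntax expr("name") attaches a results name
assignment = Word(alphas)("key") + Suppress("=") + Word(nums)("value")

result = assignment.parseString("width = 42")

print(result.key, result.value)   # attribute-style access
print(result["key"])              # dict-style access
print(result.dump())              # tokens, then named fields, for debugging
```

Names that were never matched come back as an empty string when read as attributes, which is convenient for optional parts of a grammar.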

Other examples that use results names are:

What versions of Python does pyparsing support? Does pyparsing support Python 3?
Pyparsing version 1.5.7 works best with Python versions 2.4 and 2.5. I have tried to maintain compatibility with Python 2.3, but this does not always work. This version is kept for compatibility with these older versions of Python, but is no longer being actively maintained.

Pyparsing version 2.0 and above works with Python 2.6, 2.7, and Python 3. This version is the branch that is still being actively supported.