Comments and Feedback

Behaviour change between 1.5.6 and 1.5.2

Pyparsing 1.5.6 transforms certain expressions more aggressively than 1.5.2 did, in particular when setResultsName is involved. Consider the following example with a stripped-down grammar that is manipulated to produce a tree in a separate function (this is meant to illustrate why this bites me):
 
def getGrammar():
    from pyparsing import (Forward, CaselessKeyword, Word, alphas,
        ParserElement, ZeroOrMore, Literal, Optional)
 
    tableName = Word(alphas)
    joinedTable = Word("+-")
    tableReference = (joinedTable
        | tableName)
    fromClause = (CaselessKeyword("FROM")
        + tableReference)("fromClause")
 
    return dict((k, v) for k, v in locals().iteritems()
        if isinstance(v, ParserElement))
 
def enableTree(syms):
    def makeAction(name):
        def action(s, pos, toks):
            return [name, toks]
        return action
    for name in syms:
        syms[name].addParseAction(makeAction(name))
 
if __name__=="__main__":
    import pprint
    syms = getGrammar()
    enableTree(syms)
    pprint.pprint(syms["fromClause"].parseString("FROM ab").asList())
 
With pyparsing 1.5.2, this would print
 
['fromClause', ['FROM', 'tableReference', ['tableName', ['ab']]]]
 
whereas 1.5.6 folds the subexpressions into fromClause (but only when fromClause carries a results name); with this particular scheme, that has user-visible consequences in that the program prints
 
['fromClause', ['FROM', 'ab']]
 
I admit I've not really traced this yet, but since I suspected that streamline is now being called more liberally, I tried to inhibit its actions by calling setName on the RHS symbols in the fromClause rule before use. Alas, to no avail.
So -- is there anything more sensible I can do to get the 1.5.2 results from 1.5.6? Or do you consider my scheme of adding actions long after the symbols have been defined too harebrained?
— Markus

Python 3 version throws exceptions. Unusable.

Python 2 version works. (refer to http://pypi.python.org/pypi/pyparsing/1.5.6)


(edit 2)

I used the installers at http://pypi.python.org/pypi/pyparsing/1.5.6.
I tried the Python 3 version. It does not work for Python 3, and running the 2to3 tool did not help.
I tried the version for Python 2.7, and this works.

Some ... seems to have just taken the pyparsing code unchanged, created installers for Python 3, and published them untested on the PyPI website.
So for Python 3, check out LEPL (edit 2: and Modgrammar) instead of pyparsing.

However, thanks for pyparsing for Python 2, which is a nice, working parser.

Parse Result Object Behaves in Unexpected ways . . .

Running getattr() on a ParseResults object returns an empty string for any attribute name that is not defined. This can lead to very confusing behavior, for example:
result.aslist()

raises TypeError: 'str' object is not callable, where one would normally expect an AttributeError. This could let a lot of problems pass quietly . . .
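A short reproduction of this behaviour (the misspellings are deliberate; "value" is just an illustrative results name):

from pyparsing import Word, nums

result = Word(nums)("value").parseString("42")
print(repr(result.value))    # '42'  -- defined results name
print(repr(result.vlaue))    # ''    -- misspelled name silently returns an empty string
# result.aslist()            # raises TypeError: 'str' object is not callable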

-Matt G. (meawoppl at some google mail service)

Simultaneous rules, and why tabs are special by default

1.

I need to check that a message A fulfills a fairly complex BNF grammar and that message A's length is not greater than 'l'.
Of course I can parse the message and then check that the length limitations are not violated.
The problem is that this message A is a part of a bigger grammar B, and that there might be multiple instances of message A inside B.

Going further, is it possible to easily make an element that has two or more rules which must be valid at the same time?

For the length example above, this could mean, for example, something like
elementWithMaxLength = complexElement & Regex('.{1,%d}' % l, flags=re.S | re.M).suppress()
.. with setParseAction, perhaps?
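One way to sketch this with a parse action (complexElement and MAX_LEN below are placeholders for the real message-A expression and the limit 'l'; raising ParseException from inside a parse action rejects an otherwise valid match):

from pyparsing import Word, alphas, ParseException

MAX_LEN = 10                   # placeholder for the limit 'l'
complexElement = Word(alphas)  # placeholder for the real message-A grammar

def checkLength(s, loc, toks):
    # reject the match if the matched tokens are longer than the limit
    if len("".join(toks)) > MAX_LEN:
        raise ParseException(s, loc, "element longer than %d characters" % MAX_LEN)

elementWithMaxLength = complexElement.copy().addParseAction(checkLength)

print(elementWithMaxLength.parseString("short").asList())      # -> ['short']
# elementWithMaxLength.parseString("muchtoolongforthelimit")   # raises ParseException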
2.

Could the default value for 'keepTabs' be made True? It was annoying to find out that tabs are treated specially.

-- kummahiih


Extending ParseResults class

Hi, I've tried to extend this class into a new one, "CodeItem", that automatically tracks the code location for itself and its sub-items.
I think it's convenient for reporting semantic errors. There is also a simple class, ParsingError, that takes a message and the problem CodeItems as parameters.

The biggest difficulty I had is that ParseResults changes its "appearance" depending on whether the instance is named or not, so my re-implementations of getName() and __getitem__() may be a little sketchy.

Does it make sense for a parser to remember each item's offset automatically? It might cost some speed - I'm not sure how critical that loss would be, though.

I've also created a wrapper decorator function that adapts parseAction handler functions to use CodeItems instead of the (s, loc, tok) arguments.
-Evgeny.
import functools

from pyparsing import ParseResults, col, lineno

class ParsingError(Exception):
    def __init__(self,msg,*items):
        self.msg = msg
        self.items = items
    def __str__(self):
        out = [t.info() for t in self.items]
        return 'parsing error: %s\nproblem item(s):\n%s' \
            % (self.msg,'\n'.join(out))
 
def tok2str(tok):
    if isinstance(tok, ParseResults):
        return tok2str(tok.asList())
    elif isinstance(tok, list):
        out = ''
        for item in tok:
            out = out + tok2str(item)
        return out
    elif isinstance(tok, str):
        return tok
    else:
        raise Exception('internal error type=%s' % type(tok))
 
def toklen(tok):
    s = tok2str(tok)
    return len(s)
 
class CodeItem(ParseResults):
    def __init__(self,s,loc,t):
 
        if isinstance(t,ParseResults):
            name = t.getName()
            tlist = t.asList()
        else:
            name = None
            tlist = t
 
        ParseResults.__init__(self,tlist)
 
        self.__ci_name = name
        self.__ci_source = s
        self.__ci_loc = loc
 
    def getName(self):
        return self.__ci_name
 
    def __str__(self):
        tok = ParseResults.__str__(self)
        lineno = self.lineno()
        col = self.col()
        return 'line=%d col=%d tokens=%s' % (lineno,col,tok)
 
    def __repr__(self):
        return ParseResults.__str__(self)
 
    def info(self):
        src = self.source()
        line = self.lineno()
        col = self.col()
        return '(line=%3d col=%3d) %s' % (line,col,src)
 
    def col(self):
        return col(self.__ci_loc,self.__ci_source)
 
    def lineno(self):
        return lineno(self.__ci_loc,self.__ci_source)
 
    def source(self):
        return tok2str(self)
 
    def __getitem__(self,i):
        #this function could be made faster by optimizing toklen function()
 
        t = ParseResults.__getitem__(self,i)
 
        #here I might want to support slicing as well
        if isinstance(i,int):
            offset = 0
            for j in range(i):
                tj = ParseResults.__getitem__(self,j)
                offset += toklen(tj)
            loc = self.__ci_loc + offset
            s = self.__ci_source
            return CodeItem(s,loc,t)
        else:
            #here is a hole, maybe CodeItem should be constructed instead
            #I had problems with named ParseResults
            return t
 
Below is a wrapper function that converts tokens into CodeItems and can be used as a decorator for parseAction functions as defined in pyparsing.py.
It picks out only the named parse results and reconfigures the parseAction to take a list of CodeItems instead of the (s, loc, tok) arguments.
def wrap_named_tokens(f):
    """filters named tokens
    """
    def wrapper(self, s, loc, tok):
        code_items = []
        cloc = loc
        for t in tok:
            inc = toklen(t)
            if isinstance(t, ParseResults) and t.getName() is not None:
                code_items.append(CodeItem(s, cloc, t))
            # advance the location past every token, named or not
            cloc = cloc + inc
        # call the wrapped parse action once, with the collected CodeItems
        f(self, code_items)
    return functools.update_wrapper(wrapper, f)
 
Example of parseAction definition:
@wrap_named_tokens
def some_parse_action(self, code_items):
    for item in code_items:
        if is_not_good(item):
            raise ParsingError('this code has error',item)
        else:
            do_something_with(item)
The error handler will print line and column numbers automatically.

Setting whitespace characters after defaulting


I am trying to parse a language that has one-line statements separated by newlines and possibly blank lines. To parse it, I tried setting the default whitespace chars to " \t" and specific whitespace chars of " \t\n" for the document parser, but I'm not getting the desired effect. Here's an example:
from pyparsing import *
 
ParserElement.setDefaultWhitespaceChars(" \t")
 
statement = Literal("foobar") | Word(nums)
 
statements = ZeroOrMore(statement)
statements.setWhitespaceChars(" \t\n")
 
document = StringStart() + statements + StringEnd()
document.setWhitespaceChars(" \t\n")
 
test = "5498\n foobar"
print test, "->", document.parseString(test)
 
This raises an error when it hits the newline character.
However, if I manually set the whitespace chars for all items, it works as expected:
from pyparsing import *
 
statement = Literal("foobar") | Word(nums)
statement.setWhitespaceChars(" \t")
 
statements = ZeroOrMore(statement)
statements.setWhitespaceChars(" \t\n")
 
document = StringStart() + statements + StringEnd()
document.setWhitespaceChars(" \t\n")
 
test = "5498\n foobar"
print test, "->", document.parseString(test)
 
produces:
5498
 foobar -> ['5498', 'foobar']
 

Am I misunderstanding these commands, or is there a better way to do this?

-Shawn

[reply from Paul]
Shawn -

Well, there is a little confusion on your part, but there is also a subtle bug in pyparsing that prevents you from doing this the actually correct way. Here is the code as I imagine it should be written:
from pyparsing import *
 
ParserElement.setDefaultWhitespaceChars(" \t")
 
statement = (Literal("foobar") | Word(nums)) + LineEnd().suppress()
statements = ZeroOrMore(statement)
document = StringStart() + statements + StringEnd()
 
test = "5498\n foobar"
print test, "->", document.parseString(test)

Only a single call to setDefaultWhitespaceChars is needed; there is no need to set whitespace chars on individual parse expressions. However, there is a bug in StringEnd that raises an exception when reading both a LineEnd and a StringEnd at the end of the input string (which I will have fixed in the online CVS code in a few minutes). Note that in your original code that did not work, there was no place for the line breaks to be either parsed or skipped over. The setting of whitespace chars only affects the skipping of whitespace at the beginning of an expression, so setting whitespace to " \t\n" for document only skips those characters at the very beginning of the input, not before each immediate child element of document.
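To illustrate that last point with a tiny made-up example (not your grammar): setWhitespaceChars only affects the expression it is called on, so the inner Word still refuses to skip the newline:

from pyparsing import ParserElement, Word, nums, OneOrMore

ParserElement.setDefaultWhitespaceChars(" \t")

number = Word(nums)
numbers = OneOrMore(number)
numbers.setWhitespaceChars(" \t\n")   # only affects leading whitespace of 'numbers' itself

print(numbers.parseString("1 2\n3").asList())   # -> ['1', '2'] -- parsing stops at the newline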

I resolved the newline processing question by leaving in your call to setDefaultWhitespaceChars, and then adding an explicit parse expression to read newlines at the end of each statement, since this is the only place where you want to see newlines.

-- Paul


Can setParseAction be used deeper in the parse hierarchy?


/* dfadsfasdfasdfasdf */
comment = startcomment + SkipTo(endcomment, include=True)
comment.setParseAction(replaceWith("COMMENT"))

grammar = OneOrMore( comment | command1 | command2 )

result = grammar.transformString( inputstring )


setParseAction can be attached anywhere in the hierarchy. Your example should work OK.
As far as handling comments goes, you can also look at using ignore:
grammar = OneOrMore( command1 | command2 )
grammar.ignore( comment )
result = grammar.transformString( inputstring )
The reason this is important is that comments can appear even in the middle of a command.
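Here is a runnable sketch of the ignore() approach (command1, command2, and the input are made up for illustration; cStyleComment is pyparsing's built-in C-style comment expression):

from pyparsing import CaselessKeyword, OneOrMore, cStyleComment

command1 = CaselessKeyword("start")
command2 = CaselessKeyword("stop")

grammar = OneOrMore(command1 | command2)
grammar.ignore(cStyleComment)   # ignored expressions are skipped wherever they occur between tokens

print(grammar.parseString("start /* dfadsfasdfasdfasdf */ stop").asList())
# -> ['start', 'stop']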


Is there a best practice for parsing mixed content?

Currently I'm having a hard time using parseString() to analyze a wiki paragraph containing mixed content. For example:
Wiki paragraphs can contain [[links]] as well as **bold** and //italic// text.
What would the rules for this paragraph look like, such that the text between the wiki markup is also preserved? Are there any examples I could have a look at?



For this kind of parsing, parseString is not the best method to use. Just for review, there are now 4 different ways to invoke a pyparsing grammar:
- parseString - parses input string from the beginning, until a mismatch is found or the end of the grammar
- scanString - a generator for partial string matching; returns the matched tokens, and start and end locations of the match
- transformString - wrapper around scanString to apply parse actions to transform the input string
- searchString - wrapper around scanString to return a list of the matched tokens

As you have found, parseString is suitable only if you have a grammar that completely defines the content of the input text. scanString is able to "scan" through the input text, looking for matches - this is closer to what you want, since it only requires definition of pyparsing expressions for that which you are scanning for. transformString and searchString are simple wrappers around scanString, for the most common applications of scanString: converting expressions based on parseActions, and searching for matches and returning a list of matches. So for a wiki markup processor, I'd say transformString is the best fit. In fact, there is a new example on the Examples page titled simpleWiki.py. The one complication is when you get markup nested within markup, but with a little diligence, I hope you can get it worked out.
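As a very small sketch of the transformString approach (this is not the simpleWiki.py example, just the **bold** markup handled by itself):

from pyparsing import QuotedString

boldText = QuotedString("**")
boldText.setParseAction(lambda t: "<b>%s</b>" % t[0])

text = "Wiki paragraphs can contain **bold** text."
print(boldText.transformString(text))
# -> Wiki paragraphs can contain <b>bold</b> text.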



PyParsing Support Added to Utility Mill

I thought you guys might be interested. You can now make web-based utilities using the pyparsing module. As an example, I implemented the chemical formula parser example here.

I think this could be very useful for making a quick utility where you want a user to enter some string to be parsed, and to easily use pyparsing to do the work. Let me know what you think.



Unicode issues

When parsing Unicode strings, pyparsing returns a mixture of unicode and str objects in the results (ASCII-only strings are converted to str; others are left intact). This probably should not happen; intermixing byte strings with Unicode strings is usually not a good idea. I suggest the following patch:
--- pyparsing.py.orig   2008-04-21 23:18:59.000000000 +0600
+++ pyparsing.py        2008-04-21 23:21:53.000000000 +0600
@@ -87,6 +87,11 @@
        str(obj). If that fails with a UnicodeEncodeError, then it tries unicode(obj). It
        then < returns the unicode object | encodes it with the default encoding | ... >.
     """
+
+    # Do not convert unicode to str
+    if isinstance(obj, unicode):
+        return obj
+
     try:
         # If this works, then _ustr(obj) has the same behaviour as str(obj), so
         # it won't break any existing code.
 



equality / equivalency between grammars


I might be the first person to ever equality-test pyparsing grammars, but I need to for pyparsing_helper to work right, and it looks like ParserElement.__eq__ wasn't written to support that (as of 1.5.1).

In [41]: Literal('a') == "a"
Out[41]: True
 
In [42]: Literal('a') == Literal('a')
Out[42]: False

I've submitted a patch, but in the meantime, here's a monkeypatch.

import pyparsing
# _ustr is pyparsing's internal str/unicode helper that this patch relies on
from pyparsing import ParserElement, ParseBaseException, StringEnd, _ustr

def _eq_monkeypatch(self, other):
    if isinstance(other, pyparsing.ParserElement):
        return self.__dict__ == other.__dict__
    elif isinstance(other, basestring):
        try:
            (self + StringEnd()).parseString(_ustr(other))
            return True
        except ParseBaseException:
            return False
    else:
        return super(ParserElement,self)==other
 
pyparsing.ParserElement.__eq__ = _eq_monkeypatch



This was fixed in pyparsing 1.5.2. -- Paul


Generating EBNF-like things from pyparsing grammars?


I'd like to generate some variant of EBNF -- it doesn't need to be too strict -- from a pyparsing grammar. Has anyone tried to do such a thing?

-- Markus

[ reply from Paul ]
Pyparsing's expressions are already self-describing in a quasi-BNF format. For example, here are some of my typical examples (a server name that could be a host name or an IP address), and how they look if printed out:
>>> integer = Word(nums)
>>> print integer
W:(0123...)
 
>>> hostname= Word(alphas, alphanums)
>>> print hostname
W:(abcd...,abcd...)

Since hostname uses different sets of characters for its initial vs. body character, it displays a two-argument format. Unfortunately, the truncation feature clips the significant difference (that the body can contain numeric digits in addition to alpha characters).

Now if we assemble these base expressions into an IP address, we see a couple of other problems:
>>> ip_addr = integer + '.' + integer + '.' + integer + '.' + integer
>>> print ip_addr
{{{{{{W:(0123...) "."} W:(0123...)} "."} W:(0123...)} "."} W:(0123...)}

We really don't want ip_addr to resolve any deeper than its component expressions. For readability's sake, pyparsing allows you to attach a name to an expression (this is not the same as setResultsName):
>>> integer.setName("integer")
integer

Now if we rebuild our ip_addr expression and print out its representation, things are a little better:
>>> ip_addr = integer + '.' + integer + '.' + integer + '.' + integer
>>> print ip_addr
{{{{{{integer "."} integer} "."} integer} "."} integer}

Hmmm, still some room for improvement. What we are seeing is the intermediate form that gets created by the '+' operator, which calls ParserElement.__add__(a,b), and returns And([a,b]). Since __add__ can only see two elements at a time, an expression like "a + b + c" returns the nested And([And([a,b]),c]). This is where pyparsing has to do some reshuffling, since the user did not really add any such structure, and would just like things to be processed like And([a,b,c]). So pyparsing has an internal method named streamline() that tries to clean things up a bit. streamline() looks at expressions of like type and tries to collapse unnecessary nesting (while still preserving things like results names, grouping, etc.). If we call it, we can see the results:
>>> ip_addr.streamline()
{integer "." integer "." integer "." integer}

Now this is a lot cleaner! (User code rarely needs to call streamline, it is automatically called as part of the logic in parseString.)

But this is only helpful for showing the top-level expression. If we want to drill down into the parser, we'll need to peel away the names we gave the sub-expressions. See how this is done in the attached little script:
from pyparsing import *
 
integer = Word(nums).setName("integer")
ip_addr = integer + '.' + integer + '.' + integer + '.' + integer
 
hostname = Word(alphas, alphanums+'_').setName("hostname")
 
hostref = hostname | ip_addr
 
# internal pyparsing method, rarely called in user code
hostref.streamline()
 
for exprname in "hostref hostname integer".split():
    expr = locals()[exprname]
    e = expr.copy()
    if hasattr(e,"name"): del e.name
    print exprname,'::',e

Prints:
hostref :: {hostname | {integer "." integer "." integer "." integer}}
hostname :: W:(abcd...,abcd...)
integer :: W:(0123...)

This isn't a complete solution, but maybe it will give you some ideas on how to approach your problem.
-- Paul

[Markus again]

Thanks, Paul. I should really learn to control my coding habit: of course I got impatient while offline and coded something that could have made really good use of streamline(). Anyway, there are quite a few subtleties I'd probably have encountered even with streamline. If someone needs something like this: http://www.tfiu.de/homepage/hacks/#pyparsingToEBNF (warning: much more verbose than Paul's suggestions) -- I'll gladly expand it if someone actually uses it.

[Ben Liles]

Would it be possible to remove the download URL from the PyPI record so that easy_install will download the tar.gz uploaded to PyPI? That way it won't have to read from wikispaces. I'm using buildout and cannot specify the full URL to get it from.

Try it now. - Paul

Error installing

I see in the README that python 2.3.2 or later is required. I am running 2.3.4 on RedHat, and got this error when I tried to install:
[root@host pyparsing-1.5.0]# python setup.py install
Traceback (most recent call last):
  File "setup.py", line 6, in ?
    from pyparsing import __version__
  File "/var/tmp/pyparsing-1.5.0/pyparsing.py", line 2506
    matchOrder += list(e for e in self.exprs if isinstance(e,Optional) and e.expr in tmpOpt)
                           ^
SyntaxError: invalid syntax
Should I upgrade?
[Mark]

Another syntax error raised during install (pyparsing_py3.py, line 2470)

This is my sys.version:
2.5.1 (r251:54863, Feb 6 2009, 19:02:12)
[GCC 4.0.1 (Apple Inc. build 5465)]

The line with the raised syntax error:
except ParseException as err:
The log:
$ python setup.py install
running install
running build
running build_py
creating build
creating build/lib
copying pyparsing.py -> build/lib
copying pyparsing_py3.py -> build/lib
running install_lib
copying build/lib/pyparsing.py -> /Library/Python/2.5/site-packages
copying build/lib/pyparsing_py3.py -> /Library/Python/2.5/site-packages
byte-compiling /Library/Python/2.5/site-packages/pyparsing_py3.py to pyparsing_py3.pyc
  File "/Library/Python/2.5/site-packages/pyparsing_py3.py", line 2470
    except ParseException as err:
                           ^
SyntaxError: invalid syntax
 
running install_egg_info
Writing /Library/Python/2.5/site-packages/pyparsing-1.5.2-py2.5.egg-info

Anyway, the greeting.py example does work.

[2009-10-06]

2010/05/15: Same problem on Cygwin with Python 2.5. Replace "as" with ",". Or just ignore the error, because that module is intended for Python 3 only.
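For reference, the Python 2 spelling of that construct (only relevant if pyparsing_py3.py gets byte-compiled under Python 2 at all) looks like this:

from pyparsing import Word, nums, ParseException

try:
    Word(nums).parseString("abc")
except ParseException, err:   # Python 2 syntax; Python 3 writes "except ParseException as err:"
    print(err)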


I know this is random, but could we have a better page where people can comment or bring up ideas for the owners to look at? Or can I email the owners about a new idea?

Post it to the Discussion tab on the Pyparsing Wiki home page. (Anyone can post discussion comments.)

alphas is locale-dependent

The documentation claims that "alphas" is 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghikjlmnopqrstuvwxyz'. But in reality this is not the case! It is constructed from string.uppercase and string.lowercase, which are locale-dependent -- and on my system are full of accented characters! It obviously doesn't make sense to have a programming language whose legal identifiers vary from system to system, so why not just use the literal string 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghikjlmnopqrstuvwxyz'? - Kef Schecter
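In the meantime, a locale-independent workaround (just a sketch; ascii_identifier is an example expression) is to build character sets from Python's string.ascii_letters instead of pyparsing's alphas:

import string
from pyparsing import Word

# string.ascii_letters is always 'abc...zABC...Z', regardless of locale
ascii_alphas = string.ascii_letters
ascii_identifier = Word(ascii_alphas + "_", ascii_alphas + string.digits + "_")

print(ascii_identifier.parseString("legal_name42").asList())   # -> ['legal_name42']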