Python regular expression surprise

I recently came across this regular expression anomaly in Python. The original regular expression was more complex, but this example illustrates the idea.

Try to predict the outcome (should work with Python 2.5 – Python 3.1)

import re

abcd = "abcd"
ad = "ad"

cre = re.compile(r"^(a)(bc)?(d)$")

mos1 = cre.match(abcd)
print(cre.sub(r"\1, \2, \3", abcd))

mos2 = cre.match(ad)
print(cre.sub(r"\1, \2, \3", ad))

This produces:

a, bc, d
Traceback (most recent call last): 
  File "<filename>", line <linenumber>, in <module>
    print(cre.sub(r"\1, \2, \3", ad))
  File "C:\programs\Python25\lib\re.py", line 266, in filter
    return sre_parse.expand_template(template, match)
  File "C:\programs\Python25\lib\sre_parse.py", line 793, in expand_template
    raise error, "unmatched group"
sre_constants.error: unmatched group

(Line numbers vary with different versions of Python – this is Python 2.5)

Adding the following code illustrates the problem:

print(mos1.group(1), mos1.group(2), mos1.group(3) )
print(mos2.group(1), mos2.group(2), mos2.group(3) )

For Python 2.5, this outputs

('a', 'bc', 'd')
('a', None, 'd')
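
As a quick check of what the match object reports, here is a minimal sketch using the same pattern. The variable name mo is only for illustration. An optional group that took no part in the match comes back as None, and groups() accepts a default value to use in its place.

```python
import re

cre = re.compile(r"^(a)(bc)?(d)$")
mo = cre.match("ad")

# Group 2 is optional and did not take part in this match, so it is None.
print(mo.groups())      # ('a', None, 'd')

# groups() takes a default that is returned for such groups instead of None.
print(mo.groups(""))    # ('a', '', 'd')
```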

So now it is clear why it fails: the optional group (bc)? took no part in the match, so group 2 is None and \2 has nothing to expand. What is the solution? My first fix was:

cre2 = re.compile(r"^(a)((bc)?)(d)$")
print(cre2.sub(r"\1, \2, \4", ad))

This fix works, but it has side-effects – the last group number has changed. So a better solution might be:

cre3 = re.compile(r"^(a)((?:bc)?)(d)$")
print(cre3.sub(r"\1, \2, \3", ad))

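Using the non-capturing group notation, (?:…), the problem is fixed without side effects: group 2 now always takes part in the match, and simply matches the empty string when bc is absent.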
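
To see the difference between the two fixes, the following sketch prints the groups each pattern produces for "ad". It reuses the cre2 and cre3 patterns from above; nothing else is assumed.

```python
import re

cre2 = re.compile(r"^(a)((bc)?)(d)$")
cre3 = re.compile(r"^(a)((?:bc)?)(d)$")

# cre2: the outer group 2 matches the empty string, the inner group 3
# is still unmatched, and (d) has moved to group 4.
print(cre2.match("ad").groups())         # ('a', '', None, 'd')

# cre3: the inner group no longer captures, so (d) stays at group 3.
print(cre3.match("ad").groups())         # ('a', '', 'd')

print(cre3.sub(r"\1, \2, \3", "abcd"))   # a, bc, d
print(cre3.sub(r"\1, \2, \3", "ad"))     # a, , d
```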

I’m not entirely happy that the problem occurs in the first place, but it does.
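
If changing the pattern is not an option, another possibility is to pass a function to sub instead of a template string. This is only a sketch: join_groups is a hypothetical helper name, and it assumes the desired output is all groups joined with ", " as in the examples above. Note also that from Python 3.5 onwards, sub replaces unmatched groups in a template with an empty string, so the original code no longer raises there.

```python
import re

cre = re.compile(r"^(a)(bc)?(d)$")

def join_groups(mo):
    # Hypothetical helper: groups("") substitutes an empty string for
    # any group that did not take part in the match.
    return ", ".join(mo.groups(""))

print(cre.sub(join_groups, "abcd"))   # a, bc, d
print(cre.sub(join_groups, "ad"))     # a, , d
```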


1 Response to Python regular expression surprise

  1. Nagesh says:

    Thanks, was looking out for this! 🙂