I came across this regular expression anomaly recently in Python. The regular expression was more complex, but this example illustrates the idea.
Try to predict the outcome (this should behave the same in Python 2.5 through Python 3.1):
import re
abcd = "abcd"
ad = "ad"
cre = re.compile(r"^(a)(bc)?(d)$")
mos1 = cre.match(abcd)
print(cre.sub(r"\1, \2, \3", abcd))
mos2 = cre.match(ad)
print(cre.sub(r"\1, \2, \3", ad))
This produces:
a, bc, d
Traceback (most recent call last):
  File "<filename>", line <linenumber>, in <module>
    print(cre.sub(r"\1, \2, \3", ad))
  File "C:\programs\Python25\lib\re.py", line 266, in filter
    return sre_parse.expand_template(template, match)
  File "C:\programs\Python25\lib\sre_parse.py", line 793, in expand_template
    raise error, "unmatched group"
sre_constants.error: unmatched group
(Line numbers vary with different versions of Python – this is Python 2.5)
Adding the following code illustrates the problem:
print(mos1.group(1), mos1.group(2), mos1.group(3))
print(mos2.group(1), mos2.group(2), mos2.group(3))
For Python 2.5, this outputs:
('a', 'bc', 'd')
('a', None, 'd')
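The same thing can be seen through the other accessors on the match object. This is only a small sketch using the pattern from above; the variable name mo is mine, and the values in the comments are what I would expect from the documented behaviour of groups() and span().

```python
import re

cre = re.compile(r"^(a)(bc)?(d)$")
mo = cre.match("ad")

# Group 2 is optional and took no part in this match, so it reports None.
print(mo.group(2))            # None
print(mo.groups())            # ('a', None, 'd')

# groups() accepts a default that stands in for groups that did not match.
print(mo.groups(default=""))  # ('a', '', 'd')

# The span of a group that did not match is (-1, -1).
print(mo.span(2))             # (-1, -1)
```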
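So now it is clear why it fails: group 2 is optional and did not take part in the match, so its value is None and the replacement template cannot expand \2. What is the solution? My first fix was:
cre2 = re.compile(r"^(a)((bc)?)(d)$")
print(cre2.sub(r"\1, \2, \4", ad))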
This fix works, but it has a side effect: the last group is now number 4 instead of 3. So a better solution might be:
cre3 = re.compile(r"^(a)((?:bc)?)(d)$")
print(cre3.sub(r"\1, \2, \3", ad))
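Using the non-capturing group notation, (?:…), the problem is fixed without side effects: the outer group always takes part in the match (matching the empty string if need be), and the inner group no longer uses up a group number.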
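Another option, if the pattern itself cannot be changed, is to pass a function to sub() instead of a template string. The sketch below keeps the original pattern; join_groups is a hypothetical helper name of mine, and it relies on groups(default="") to turn the unmatched group into an empty string.

```python
import re

cre = re.compile(r"^(a)(bc)?(d)$")

def join_groups(match):
    # Hypothetical helper: groups(default="") replaces None with "",
    # so an optional group that did not match cannot raise an error.
    return ", ".join(match.groups(default=""))

print(cre.sub(join_groups, "abcd"))  # a, bc, d
print(cre.sub(join_groups, "ad"))    # a, , d
```

As far as I know, from Python 3.5 onward sub() itself replaces unmatched groups in a template with an empty string, so the original call no longer raises there. The function approach gives the same result on old and new versions.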
I’m not entirely happy that the problem occurs in the first place, but it does.
Thanks, was looking out for this!