Regular expression help

**Fredrik Lundh** · Jul 18 '05, 12:37 AM

Re: Regular expression help

David Lees wrote:

> I forget how to find multiple instances of stuff between tags using
> regular expressions. Specifically I want to find all the text between a
> series of begin/end pairs in a multiline file.
>
> I tried:
> >>> p = 'begin(.*)end'
> >>> m = re.search(p,s,re.DOTALL)
>
> and got everything between the first begin and last end. I guess
> because of a greedy match. What I want to do is a list where each
> element is the text between another begin/end pair.

people will tell you to use non-greedy matches, but that's often a
bad idea in cases like this: the RE engine has to store lots of back-
tracking information, and your program will consume a lot more
memory than it has to (and may run out of stack and/or memory).

a better approach is to do two searches: first search for a "begin",
and once you've found that, look for an "end"

    import re

    pos = 0

    START = re.compile("begin")
    END = re.compile("end")

    while 1:
        m = START.search(text, pos)
        if not m:
            break
        start = m.end()
        m = END.search(text, start)
        if not m:
            break
        end = m.start()
        process(text[start:end])
        pos = m.end() # move forward

at this point, it's also obvious that you don't really have to use
regular expressions:

    pos = 0

    while 1:
        start = text.find("begin", pos)
        if start < 0:
            break
        start += 5
        end = text.find("end", start)
        if end < 0:
            break
        process(text[start:end])
        pos = end # move forward

</F>

<!-- (the eff-bot guide to) the python standard library (redux):

http://effbot.org/zone/librarybook-index.htm

-->
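
For readers who want to reuse the string-method loop above, here is a minimal sketch that wraps it in a generator. It is written for Python 3, and the name `iter_blocks` and its `begin`/`end` parameters are hypothetical, chosen for illustration rather than taken from the thread. It uses `len(begin)` in place of the hard-coded 5 so that other markers can be passed in.

```python
def iter_blocks(text, begin="begin", end="end"):
    """Yield the text between each begin/end pair, scanning left to right.

    Hypothetical helper: the markers are matched as plain substrings,
    exactly as text.find() does in the loop above.
    """
    pos = 0
    while True:
        start = text.find(begin, pos)
        if start < 0:
            return
        start += len(begin)          # skip past the opening marker
        stop = text.find(end, start)
        if stop < 0:
            return                   # unterminated block: stop quietly
        yield text[start:stop]
        pos = stop + len(end)        # resume after the closing marker


sample = "begin first end\nbegin\nsecond\nend\n"
blocks = list(iter_blocks(sample))
print(blocks)
```

Because the function is a generator, the caller decides whether to build a list or to process blocks one at a time on a large file.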

**Bengt Richter** · Jul 18 '05, 12:38 AM

Re: Regular expression help

On Thu, 17 Jul 2003 04:27:23 GMT, David Lees wrote:

>I forget how to find multiple instances of stuff between tags using
>regular expressions. Specifically I want to find all the text between a
>series of begin/end pairs in a multiline file.
>
>I tried:
> >>> p = 'begin(.*)end'
> >>> m = re.search(p,s,re.DOTALL)
>
>and got everything between the first begin and last end. I guess
>because of a greedy match. What I want to do is a list where each
>element is the text between another begin/end pair.

You were close. For non-greedy add the question mark after the greedy expression:

    >>> import re
    >>> s = """
    ... begin first end
    ... begin
    ... second
    ... end
    ... begin problem begin nested end end
    ... begin last end
    ... """
    >>> p = 'begin(.*?)end'
    >>> rx = re.compile(p, re.DOTALL)
    >>> rx.findall(s)
    [' first ', '\nsecond\n', ' problem begin nested ', ' last ']

Notice what happened with the nested begin-ends. If you have nesting, you
will need more than a simple regex approach.

Regards,
Bengt Richter
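
To illustrate the remark about nesting, here is one possible sketch (Python 3) of going beyond a single regex: scan the `begin` and `end` tokens in order and keep a depth counter, emitting a block only when the depth returns to zero. The function name `outermost_blocks` is hypothetical, and the sketch assumes the markers are matched as plain substrings (so a word such as "blend" would count as an `end`), just as in the examples in this thread.

```python
import re

# Matches either marker; tokens are plain substrings (an assumption).
TOKEN = re.compile(r"begin|end")


def outermost_blocks(text):
    """Return the text of each outermost begin/end block.

    Nested pairs are kept inside the block that encloses them.
    """
    blocks = []
    depth = 0
    start = None
    for m in TOKEN.finditer(text):
        if m.group() == "begin":
            if depth == 0:
                start = m.end()      # content starts after the marker
            depth += 1
        elif depth > 0:              # ignore an "end" with no open block
            depth -= 1
            if depth == 0:
                blocks.append(text[start:m.start()])
    return blocks


s = """
begin first end
begin
second
end
begin problem begin nested end end
begin last end
"""
result = outermost_blocks(s)
print(result)
```

On the sample string from the post above, the third element comes out as `' problem begin nested end '`, so the inner pair stays intact instead of being cut at the first `end`.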

**yaipa h.** · Jul 18 '05, 12:38 AM

Re: Regular expression help

Fredrik,

Not sure about the original poster, but I can use that. Thanks!

--Alan

"Fredrik Lundh" <fredrik@pythonware.com> wrote:

> David Lees wrote:
>
> > I forget how to find multiple instances of stuff between tags using
> > regular expressions. Specifically I want to find all the text between a
> > series of begin/end pairs in a multiline file.
> >
> > I tried:
> > >>> p = 'begin(.*)end'
> > >>> m = re.search(p,s,re.DOTALL)
> >
> > and got everything between the first begin and last end. I guess
> > because of a greedy match. What I want to do is a list where each
> > element is the text between another begin/end pair.
>
> people will tell you to use non-greedy matches, but that's often a
> bad idea in cases like this: the RE engine has to store lots of back-
> tracking information, and your program will consume a lot more
> memory than it has to (and may run out of stack and/or memory).
>
> a better approach is to do two searches: first search for a "begin",
> and once you've found that, look for an "end"
>
>     import re
>
>     pos = 0
>
>     START = re.compile("begin")
>     END = re.compile("end")
>
>     while 1:
>         m = START.search(text, pos)
>         if not m:
>             break
>         start = m.end()
>         m = END.search(text, start)
>         if not m:
>             break
>         end = m.start()
>         process(text[start:end])
>         pos = m.end() # move forward
>
> at this point, it's also obvious that you don't really have to use
> regular expressions:
>
>     pos = 0
>
>     while 1:
>         start = text.find("begin", pos)
>         if start < 0:
>             break
>         start += 5
>         end = text.find("end", start)
>         if end < 0:
>             break
>         process(text[start:end])
>         pos = end # move forward
>
> </F>

**Bengt Richter** · Jul 18 '05, 12:38 AM

Re: Regular expression help

On Thu, 17 Jul 2003 08:44:50 +0200, "Fredrik Lundh" <fredrik@pythonware.com> wrote:

>David Lees wrote:
>
>> I forget how to find multiple instances of stuff between tags using
>> regular expressions. Specifically I want to find all the text between a
>> series of begin/end pairs in a multiline file.
>>
>> I tried:
>> >>> p = 'begin(.*)end'
>> >>> m = re.search(p,s,re.DOTALL)
>>
>> and got everything between the first begin and last end. I guess
>> because of a greedy match. What I want to do is a list where each
>> element is the text between another begin/end pair.
>
>people will tell you to use non-greedy matches, but that's often a
>bad idea in cases like this: the RE engine has to store lots of back-

Would you say that applies to this case? If not, how similar to this case
would a problem have to be?

>tracking information, and your program will consume a lot more
>memory than it has to (and may run out of stack and/or memory).

For the above case, wouldn't the regex compile to a state machine
that just has a few states: recognize 'e' coming out of .*, revert to .*
if the next character is not 'n', and if it is, look for 'd' in the same way,
reverting to .* on a mismatch or finishing on a match? For a short
terminating match, that would seem relatively cheap?

>at this point, it's also obvious that you don't really have to use
>regular expressions:
>
>    pos = 0
>
>    while 1:
>        start = text.find("begin", pos)
>        if start < 0:
>            break
>        start += 5
>        end = text.find("end", start)
>        if end < 0:
>            break
>        process(text[start:end])
>        pos = end # move forward
>
></F>

Or breaking your loop with an exception instead of tests:

    >>> text = """begin s1 end
    ... sdfsdf
    ... begin s2 end
    ... """
    >>> def process(s): print 'processing(%r)' % s
    ...
    >>> try:
    ...     end = 0 # end of previous search
    ...     while 1:
    ...         start = text.index("begin", end) + 5
    ...         end = text.index("end", start)
    ...         process(text[start:end])
    ... except ValueError:
    ...     pass
    ...
    processing(' s1 ')
    processing(' s2 ')
Or if you're guaranteed that every begin has an end, you could also write

    >>> for begxxx in text.split('begin')[1:]:
    ...     process(begxxx.split('end')[0])
    ...
    processing(' s1 ')
    processing(' s2 ')

Regards,
Bengt Richter
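
A small sketch (Python 3) of why the split idiom above needs the guarantee that every begin has an end. The helper name `blocks_by_split` and the `truncated` sample are made up for illustration; the helper collects results into a list instead of calling `process()`.

```python
def blocks_by_split(text):
    """The split idiom from the post above, returning a list of blocks."""
    return [chunk.split("end")[0] for chunk in text.split("begin")[1:]]


complete = "begin s1 end\nsdfsdf\nbegin s2 end\n"
truncated = "begin s1 end\nsdfsdf\nbegin s2"   # last block never closes

print(blocks_by_split(complete))    # [' s1 ', ' s2 ']
print(blocks_by_split(truncated))   # [' s1 ', ' s2']
```

With the truncated input the unterminated tail `' s2'` is still returned as if it were a complete block, whereas the `find`- and `index`-based loops stop without processing it.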

**David Lees** · Jul 18 '05, 12:39 AM

Re: Regular expression help

Andrew Bennetts wrote:

> On Thu, Jul 17, 2003 at 04:27:23AM +0000, David Lees wrote:
>
>> I forget how to find multiple instances of stuff between tags using
>> regular expressions. Specifically I want to find all the text between a
>                                              ^^^^^^^^
>
> How about re.findall?
>
> E.g.:
>
> >>> re.findall('BEGIN(.*?)END', 'BEGIN foo END BEGIN bar END')
> [' foo ', ' bar ']
>
> -Andrew.

Actually this fails with the multi-line type of file I was asking about.

    >>> re.findall('BEGIN(.*?)END', 'BEGIN foo\nmumble END BEGIN bar END')
    [' bar ']

**Bengt Richter** · Jul 18 '05, 12:39 AM

Re: Regular expression help

On Fri, 18 Jul 2003 04:31:32 GMT, David Lees wrote:

>Andrew Bennetts wrote:
>> On Thu, Jul 17, 2003 at 04:27:23AM +0000, David Lees wrote:
>>
>>> I forget how to find multiple instances of stuff between tags using
>>> regular expressions. Specifically I want to find all the text between a
>>                                              ^^^^^^^^
>>
>> How about re.findall?
>>
>> E.g.:
>>
>> >>> re.findall('BEGIN(.*?)END', 'BEGIN foo END BEGIN bar END')
>> [' foo ', ' bar ']
>>
>> -Andrew.
>
>Actually this fails with the multi-line type of file I was asking about.
>
> >>> re.findall('BEGIN(.*?)END', 'BEGIN foo\nmumble END BEGIN bar END')
>[' bar ']

It works if you include the DOTALL flag (?s) at the beginning, which makes
'.' also match \n (BTW, (?si) would make it case-insensitive as well):

    >>> import re
    >>> re.findall('(?s)BEGIN(.*?)END', 'BEGIN foo\nmumble END BEGIN bar END')
    [' foo\nmumble ', ' bar ']
Regards,
Bengt Richter
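
As a short illustration of the flags mentioned above (Python 3), the inline `(?s)` prefix and the `re.DOTALL` argument should give the same result, and `(?si)` additionally ignores case. The sample string is made up for this sketch, with the second pair written in lower case to show the difference.

```python
import re

text = "BEGIN foo\nmumble END begin bar end"

# Inline flag at the start of the pattern.
inline = re.findall(r"(?s)BEGIN(.*?)END", text)

# Same flag passed as the third argument instead.
flagged = re.findall(r"BEGIN(.*?)END", text, re.DOTALL)

# DOTALL plus IGNORECASE: the lower-case pair now matches too.
either_case = re.findall(r"(?si)BEGIN(.*?)END", text)

print(inline)
print(flagged)
print(either_case)
```

The first two calls find only the upper-case pair; the third also picks up `begin bar end`.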

**David Lees** · Jul 18 '05, 12:39 AM

Re: Regular expression help

Bengt Richter wrote:

> On Fri, 18 Jul 2003 04:31:32 GMT, David Lees wrote:
>
>> Andrew Bennetts wrote:
>>
>>> On Thu, Jul 17, 2003 at 04:27:23AM +0000, David Lees wrote:
>>>
>>>> I forget how to find multiple instances of stuff between tags using
>>>> regular expressions. Specifically I want to find all the text between a
>>>                                              ^^^^^^^^
>>>
>>> How about re.findall?
>>>
>>> E.g.:
>>>
>>> >>> re.findall('BEGIN(.*?)END', 'BEGIN foo END BEGIN bar END')
>>> [' foo ', ' bar ']
>>>
>>> -Andrew.
>>
>> Actually this fails with the multi-line type of file I was asking about.
>>
>> >>> re.findall('BEGIN(.*?)END', 'BEGIN foo\nmumble END BEGIN bar END')
>> [' bar ']
>
> It works if you include the DOTALL flag (?s) at the beginning, which makes
> . also match \n: (BTW, (?si) would make it case-insensitive).
>
> >>> import re
> >>> re.findall('(?s)BEGIN(.*?)END', 'BEGIN foo\nmumble END BEGIN bar END')
> [' foo\nmumble ', ' bar ']
>
> Regards,
> Bengt Richter

I just tried to benchmark both Fredrik's suggestions along with Bengt's
using the same input file. The results (looping 200 times over the 400k
file) are:
Fredrik, regex = 1.74003930667
Fredrik, no regex = 0.434207978947
Bengt, regex = 1.45420158149

Interesting how much faster the non-regex approach is.

Thanks again.

David Lees

The code (which I have not carefully checked) is:

    import re, time

    def timeBengt(s,N):
        p = 'begin msc(.*?)end msc'
        rx = re.compile(p, re.DOTALL)
        t0 = time.clock()
        for i in xrange(N):
            x = rx.findall(s)
        t1 = time.clock()
        return t1-t0

    def timeFredrik1(text,N):
        t0 = time.clock()
        for i in xrange(N):
            pos = 0

            START = re.compile("begin")
            END = re.compile("end")

            while 1:
                m = START.search(text, pos)
                if not m:
                    break
                start = m.end()
                m = END.search(text, start)
                if not m:
                    break
                end = m.start()
                pass
                pos = m.end() # move forward
        t1 = time.clock()
        return t1-t0

    def timeFredrik(text,N):
        t0 = time.clock()
        for i in xrange(N):
            pos = 0
            while 1:
                start = text.find("begin msc", pos)
                if start < 0:
                    break
                start += 9
                end = text.find("end msc", start)
                if end < 0:
                    break
                pass
                pos = end # move forward

        t1 = time.clock()
        return t1-t0

    fh = open('scu.cfg', 'rb')
    s = fh.read()
    fh.close()

    N = 200
    print 'Fredrik, regex = ',timeFredrik1(s,N)
    print 'Fredrik, no regex = ',timeFredrik(s,N)
    print 'Bengt, regex = ',timeBengt(s,N)
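
The script above is Python 2 (`print` statement, `xrange`, `time.clock()`), and `time.clock()` was removed in Python 3.8. The following is a rough sketch of how a similar comparison might look in Python 3 using `timeit`. It is not the original benchmark: the file `scu.cfg` is not available, so the input is a synthetic stand-in, the function names are made up, and both variants search for the same `begin msc` / `end msc` markers. No timings are shown because they depend on the machine and the Python version.

```python
import re
import timeit

BEGIN, END = "begin msc", "end msc"
PATTERN = re.compile(re.escape(BEGIN) + "(.*?)" + re.escape(END), re.DOTALL)


def blocks_regex(text):
    """Non-greedy regex with DOTALL, as in the findall suggestion."""
    return PATTERN.findall(text)


def blocks_find(text):
    """Plain str.find loop, as in the no-regex suggestion."""
    out = []
    pos = 0
    while True:
        start = text.find(BEGIN, pos)
        if start < 0:
            break
        start += len(BEGIN)
        stop = text.find(END, start)
        if stop < 0:
            break
        out.append(text[start:stop])
        pos = stop + len(END)
    return out


# Synthetic stand-in for the real config file (an assumption).
text = "header\n" + "begin msc\n  key = value\nend msc\nfiller line\n" * 2000

# Both approaches should agree before their speed is compared.
assert blocks_regex(text) == blocks_find(text)

for fn in (blocks_regex, blocks_find):
    seconds = timeit.timeit(lambda: fn(text), number=20)
    print(f"{fn.__name__}: {seconds:.4f} s")
```

Checking that both functions return the same blocks before timing them guards against comparing code paths that do different amounts of work.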
