Regular expression help

  • David Lees

    Regular expression help

    I forget how to find multiple instances of stuff between tags using
    regular expressions. Specifically I want to find all the text between a
    series of begin/end pairs in a multiline file.

    I tried:
    >>> p = 'begin(.*)end'
    >>> m = re.search(p, s, re.DOTALL)

    and got everything between the first begin and last end. I guess
    because of a greedy match. What I want to do is a list where each
    element is the text between another begin/end pair.

    TIA

    David Lees

  • Fredrik Lundh

    #2
    Re: Regular expression help

    David Lees wrote:
    > I forget how to find multiple instances of stuff between tags using
    > regular expressions. Specifically I want to find all the text between a
    > series of begin/end pairs in a multiline file.
    >
    > I tried:
    > >>> p = 'begin(.*)end'
    > >>> m = re.search(p, s, re.DOTALL)
    >
    > and got everything between the first begin and last end. I guess
    > because of a greedy match. What I want to do is a list where each
    > element is the text between another begin/end pair.

    people will tell you to use non-greedy matches, but that's often a
    bad idea in cases like this: the RE engine has to store lots of back-
    tracking information, and your program will consume a lot more
    memory than it has to (and may run out of stack and/or memory).

    a better approach is to do two searches: first search for a "begin",
    and once you've found that, look for an "end"

    import re

    pos = 0

    START = re.compile("begin")
    END = re.compile("end")

    while 1:
        m = START.search(text, pos)
        if not m:
            break
        start = m.end()
        m = END.search(text, start)
        if not m:
            break
        end = m.start()
        process(text[start:end])
        pos = m.end() # move forward

    at this point, it's also obvious that you don't really have to use
    regular expressions:

    pos = 0

    while 1:
        start = text.find("begin", pos)
        if start < 0:
            break
        start += 5
        end = text.find("end", start)
        if end < 0:
            break
        process(text[start:end])
        pos = end # move forward

    </F>

    <!-- (the eff-bot guide to) the python standard library (redux):
    http://effbot.org/zone/librarybook-index.htm
    -->
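
The original question asked for a list rather than a call to `process()`. A minimal sketch of the find-based loop above, wrapped as a Python 3 generator. The name `iter_between` and its parameters are illustrative and do not come from the thread:

```python
def iter_between(text, begin="begin", end="end"):
    """Yield each chunk of text found between a begin marker and the next end marker."""
    pos = 0
    while True:
        start = text.find(begin, pos)
        if start < 0:
            return
        start += len(begin)  # skip past the marker itself
        stop = text.find(end, start)
        if stop < 0:
            return  # unterminated block: ignore it
        yield text[start:stop]
        pos = stop + len(end)  # resume after the end marker


sample = "begin first end\nbegin\nsecond\nend\n"
chunks = list(iter_between(sample))
print(chunks)
```

Because it is a generator, the caller can either collect everything with `list(...)` or process blocks one at a time without holding them all in memory.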

    • Bengt Richter

      #3
      Re: Regular expression help

      On Thu, 17 Jul 2003 04:27:23 GMT, David Lees wrote:

      >I forget how to find multiple instances of stuff between tags using
      >regular expressions. Specifically I want to find all the text between a
      >series of begin/end pairs in a multiline file.
      >
      >I tried:
      > >>> p = 'begin(.*)end'
      > >>> m = re.search(p, s, re.DOTALL)
      >
      >and got everything between the first begin and last end. I guess
      >because of a greedy match. What I want to do is a list where each
      >element is the text between another begin/end pair.
      You were close. For non-greedy add the question mark after the greedy expression:

      >>> import re
      >>> s = """
      ... begin first end
      ... begin
      ... second
      ... end
      ... begin problem begin nested end end
      ... begin last end
      ... """
      >>> p = 'begin(.*?)end'
      >>> rx = re.compile(p, re.DOTALL)
      >>> rx.findall(s)
      [' first ', '\nsecond\n', ' problem begin nested ', ' last ']

      Notice what happened with the nested begin-ends. If you have nesting, you
      will need more than a simple regex approach.

      Regards,
      Bengt Richter
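
For input where begin/end pairs can nest, one option is to scan for both markers and keep a depth counter, returning only the outermost blocks. A sketch under that assumption (Python 3; `outermost_blocks` is a hypothetical helper, and unbalanced markers are simply ignored):

```python
import re

TOKEN = re.compile(r"begin|end")


def outermost_blocks(text):
    """Return the text of each outermost begin/end block, nested pairs included."""
    blocks = []
    depth = 0
    start = None
    for m in TOKEN.finditer(text):
        if m.group() == "begin":
            if depth == 0:
                start = m.end()  # content starts after the outer "begin"
            depth += 1
        elif depth > 0:  # an "end" with no open "begin" is skipped
            depth -= 1
            if depth == 0:
                blocks.append(text[start:m.start()])
    return blocks


s = "begin first end\nbegin problem begin nested end end\nbegin last end\n"
nested_result = outermost_blocks(s)
print(nested_result)
```

Note that the plain `begin|end` pattern also matches inside longer words such as "send"; adding `\b` word boundaries is one way to tighten it if the markers are whole words in the real data.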


      • yaipa h.

        #4
        Re: Regular expression help

        Fredrik,

        Not sure about the original poster, but I can use that. Thanks!

        --Alan


        • Bengt Richter

          #5
          Re: Regular expression help

          On Thu, 17 Jul 2003 08:44:50 +0200, "Fredrik Lundh" <fredrik@pythonware.com> wrote:

          >David Lees wrote:
          >
          >> I forget how to find multiple instances of stuff between tags using
          >> regular expressions. Specifically I want to find all the text between a
          >> series of begin/end pairs in a multiline file.
          >>
          >> I tried:
          >> >>> p = 'begin(.*)end'
          >> >>> m = re.search(p, s, re.DOTALL)
          >>
          >> and got everything between the first begin and last end. I guess
          >> because of a greedy match. What I want to do is a list where each
          >> element is the text between another begin/end pair.
          >
          >people will tell you to use non-greedy matches, but that's often a
          >bad idea in cases like this: the RE engine has to store lots of back-
          Would you say so for this case? Or how similar to this case would it have to be?

          >tracking information, and your program will consume a lot more
          >memory than it has to (and may run out of stack and/or memory).
          For the above case, wouldn't the regex compile to a state machine
          that just has a few states: recognize 'e' out of .*, revert to .* if
          the next character is not 'n', otherwise look for 'd' similarly, and
          then either revert to .* or finish? For a short terminating match, it
          would seem relatively cheap?

          >at this point, it's also obvious that you don't really have to use
          >regular expressions:
          >
          > pos = 0
          >
          > while 1:
          >     start = text.find("begin", pos)
          >     if start < 0:
          >         break
          >     start += 5
          >     end = text.find("end", start)
          >     if end < 0:
          >         break
          >     process(text[start:end])
          >     pos = end # move forward
          >
          ></F>

          Or breaking your loop with an exception instead of tests:

          >>> text = """begin s1 end
          ... sdfsdf
          ... begin s2 end
          ... """
          >>> def process(s): print 'processing(%r)' % s
          ...
          >>> try:
          ...     end = 0 # end of previous search
          ...     while 1:
          ...         start = text.index("begin", end) + 5
          ...         end = text.index("end", start)
          ...         process(text[start:end])
          ... except ValueError:
          ...     pass
          ...
          processing(' s1 ')
          processing(' s2 ')

          Or if you're guaranteed that every begin has an end, you could also write

          >>> for begxxx in text.split('begin')[1:]:
          ...     process(begxxx.split('end')[0])
          ...
          processing(' s1 ')
          processing(' s2 ')


          Regards,
          Bengt Richter
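
The split-based version relies on every begin having a matching end. If that guarantee does not hold, `str.partition` reports whether the end marker was actually found, so unterminated blocks can be skipped instead of being processed by mistake. A small Python 3 sketch; `split_blocks` is an illustrative name, not something from the thread:

```python
def split_blocks(text, begin="begin", end="end"):
    """Collect text between markers; skip any begin that has no matching end."""
    blocks = []
    for piece in text.split(begin)[1:]:
        body, sep, _rest = piece.partition(end)
        if sep:  # partition returns an empty separator when `end` is missing
            blocks.append(body)
    return blocks


split_result = split_blocks("begin s1 end\nsdfsdf\nbegin s2 end\nbegin s3")
print(split_result)
```

With the plain `split('end')[0]` form, the trailing unterminated `begin s3` would have been passed along as `' s3'`.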


          • David Lees

            #6
            Re: Regular expression help

            Andrew Bennetts wrote:
            > On Thu, Jul 17, 2003 at 04:27:23AM +0000, David Lees wrote:
            >
            >>I forget how to find multiple instances of stuff between tags using
            >>regular expressions. Specifically I want to find all the text between a
            >                                             ^^^^^^^^
            >
            > How about re.findall?
            >
            > E.g.:
            >
            > >>> re.findall('BEGIN(.*?)END', 'BEGIN foo END BEGIN bar END')
            > [' foo ', ' bar ']
            >
            > -Andrew.

            Actually this fails with the multi-line type of file I was asking about.

            >>> re.findall('BEGIN(.*?)END', 'BEGIN foo\nmumble END BEGIN bar END')
            [' bar ']


            • Bengt Richter

              #7
              Re: Regular expression help

              On Fri, 18 Jul 2003 04:31:32 GMT, David Lees wrote:

              >Andrew Bennetts wrote:
              >> On Thu, Jul 17, 2003 at 04:27:23AM +0000, David Lees wrote:
              >>
              >>>I forget how to find multiple instances of stuff between tags using
              >>>regular expressions. Specifically I want to find all the text between a
              >>                                             ^^^^^^^^
              >>
              >> How about re.findall?
              >>
              >> E.g.:
              >>
              >> >>> re.findall('BEGIN(.*?)END', 'BEGIN foo END BEGIN bar END')
              >> [' foo ', ' bar ']
              >>
              >> -Andrew.
              >
              >Actually this fails with the multi-line type of file I was asking about.
              >
              > >>> re.findall('BEGIN(.*?)END', 'BEGIN foo\nmumble END BEGIN bar END')
              >[' bar ']
              It works if you include the DOTALL flag (?s) at the beginning, which makes
              . also match \n (BTW, (?si) would also make it case-insensitive):

              >>> import re
              >>> re.findall('(?s)BEGIN(.*?)END', 'BEGIN foo\nmumble END BEGIN bar END')
              [' foo\nmumble ', ' bar ']

              Regards,
              Bengt Richter
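
For comparison, a short Python 3 sketch of the equivalent ways to let the match cross newlines: the inline `(?s)` flag, the `re.DOTALL` argument, and a `[\s\S]` character class that needs no flag at all. The variable names are illustrative only:

```python
import re

text = "BEGIN foo\nmumble END BEGIN bar END"

# Three spellings that allow the captured group to span line breaks.
inline = re.findall(r"(?s)BEGIN(.*?)END", text)
flagged = re.findall(r"BEGIN(.*?)END", text, re.DOTALL)
char_class = re.findall(r"BEGIN([\s\S]*?)END", text)

# Without any of them, "." stops at "\n" and the first block is missed.
no_flag = re.findall(r"BEGIN(.*?)END", text)

print(inline)
print(no_flag)
```

The `re.DOTALL` argument is handy when the pattern is compiled once and reused, while the inline form keeps the flag attached to the pattern string itself.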


              • David Lees

                #8
                Re: Regular expression help

                I just tried to benchmark both of Fredrik's suggestions along with Bengt's,
                using the same input file. The results, in seconds (looping 200 times
                over the 400k file), are:
                Fredrik, regex = 1.74003930667
                Fredrik, no regex = 0.434207978947
                Bengt, regex = 1.45420158149

                Interesting how much faster the non-regex approach is.

                Thanks again.

                David Lees

                The code (which I have not carefully checked) is:

                import re, time

                def timeBengt(s,N):
                    p = 'begin msc(.*?)end msc'
                    rx = re.compile(p, re.DOTALL)
                    t0 = time.clock()
                    for i in xrange(N):
                        x = rx.findall(s)
                    t1 = time.clock()
                    return t1-t0

                def timeFredrik1(text,N):
                    t0 = time.clock()
                    for i in xrange(N):
                        pos = 0

                        START = re.compile("begin")
                        END = re.compile("end")

                        while 1:
                            m = START.search(text, pos)
                            if not m:
                                break
                            start = m.end()
                            m = END.search(text, start)
                            if not m:
                                break
                            end = m.start()
                            pass
                            pos = m.end() # move forward
                    t1 = time.clock()
                    return t1-t0


                def timeFredrik(text,N):
                    t0 = time.clock()
                    for i in xrange(N):
                        pos = 0
                        while 1:
                            start = text.find("begin msc", pos)
                            if start < 0:
                                break
                            start += 9
                            end = text.find("end msc", start)
                            if end < 0:
                                break
                            pass
                            pos = end # move forward

                    t1 = time.clock()
                    return t1-t0

                fh = open('scu.cfg', 'rb')
                s = fh.read()
                fh.close()

                N = 200
                print 'Fredrik, regex = ',timeFredrik1(s,N)
                print 'Fredrik, no regex = ',timeFredrik(s,N)
                print 'Bengt, regex = ',timeBengt(s,N)
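
The posted script is Python 2 (`xrange`, `print` statements) and relies on `time.clock`, which was removed in Python 3.8. Below is a rough Python 3 sketch of the same comparison using `timeit`. The function names and the synthetic sample text are assumptions standing in for the original `scu.cfg`, which is not available, and all three variants here use the same `begin msc`/`end msc` markers, so the timings are not comparable with the numbers reported above:

```python
import re
import timeit

BEGIN, END = "begin msc", "end msc"
RX = re.compile(re.escape(BEGIN) + "(.*?)" + re.escape(END), re.DOTALL)
START_RX = re.compile(re.escape(BEGIN))
END_RX = re.compile(re.escape(END))


def with_findall(text):
    """Single non-greedy regex, as in Bengt's suggestion."""
    return RX.findall(text)


def with_two_searches(text):
    """Two compiled regex searches per block, as in Fredrik's first loop."""
    found, pos = [], 0
    while True:
        m = START_RX.search(text, pos)
        if not m:
            break
        start = m.end()
        m = END_RX.search(text, start)
        if not m:
            break
        found.append(text[start:m.start()])
        pos = m.end()
    return found


def with_str_find(text):
    """Plain str.find, as in Fredrik's second loop."""
    found, pos = [], 0
    while True:
        start = text.find(BEGIN, pos)
        if start < 0:
            break
        start += len(BEGIN)
        end = text.find(END, start)
        if end < 0:
            break
        found.append(text[start:end])
        pos = end + len(END)
    return found


# Synthetic stand-in for the poster's 400k config file (an assumption).
sample = "header\n" + "begin msc\n  key = value\nend msc\nfiller line\n" * 2000

variants = (with_findall, with_two_searches, with_str_find)
results = {f.__name__: f(sample) for f in variants}

for func in variants:
    seconds = timeit.timeit(lambda: func(sample), number=20)
    print(f"{func.__name__}: {seconds:.4f} s")
```

Unlike the original timing functions, each variant here also collects the extracted blocks, which makes it possible to check that all three return the same list before comparing their speed.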
