Match beginning of two strings

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • Ravi

    Match beginning of two strings

    Hi,

    I have about 200GB of data that I need to go through and extract the
    common first part of a line. Something like this.
    [color=blue][color=green][color=darkred]
    >>>a = "abcdefghijklmn opqrstuvwxyz"
    >>>b = "abcdefghijklmn opBHLHT"
    >>>c = extract(a,b)
    >>>print c[/color][/color][/color]
    "abcdefghijklmn op"

    Here I want to extract the common string "abcdefghijklmn op". Basically I
    need a fast way to do that for any two given strings. For my situation,
    the common string will always be at the beginning of both strings. I can
    use regular expressions to do this, but from what I understand there is
    a lot of overhead. New data is being generated at the rate of about 1GB
    per hour, so this needs to be reasonably fast while leaving CPU time for
    other processes.

    Thanks
    Ravi

  • Ravi

    #2
    Re: Match beginning of two strings

    >>[color=blue]
    > While you can be forgiven for not have guessed, os.path is the place to
    > look:
    > import os.path
    > a = "abcdefghijklmn opqrstuvwxyz"
    > b = "abcdefghijklmn opBHLHT"
    > print os.path.commonp refix([a,b])
    >
    > -Scott David Daniels
    > Scott.Daniels@A cm.Org
    >[/color]

    Certainly not where I was expecting it, Thanks

    Ravi

    Comment

    • Jim Richardson

      #3
      Re: Match beginning of two strings

      -----BEGIN PGP SIGNED MESSAGE-----
      Hash: SHA1

      On Sat, 02 Aug 2003 17:39:26 -0400,
      Ravi <[email protected] u> wrote:[color=blue]
      > Hi,
      >
      > I have about 200GB of data that I need to go through and extract the
      > common first part of a line. Something like this.
      >[color=green][color=darkred]
      > >>>a = "abcdefghijklmn opqrstuvwxyz"
      > >>>b = "abcdefghijklmn opBHLHT"
      > >>>c = extract(a,b)
      > >>>print c[/color][/color]
      > "abcdefghijklmn op"
      >
      > Here I want to extract the common string "abcdefghijklmn op". Basically I
      > need a fast way to do that for any two given strings. For my situation,
      > the common string will always be at the beginning of both strings. I can
      > use regular expressions to do this, but from what I understand there is
      > a lot of overhead. New data is being generated at the rate of about 1GB
      > per hour, so this needs to be reasonably fast while leaving CPU time for
      > other processes.
      >
      > Thanks
      > Ravi
      >[/color]

      Are you trying to match any to any strings? or only a pair as above?


      -----BEGIN PGP SIGNATURE-----
      Version: GnuPG v1.2.2 (GNU/Linux)

      iD8DBQE/LENWd90bcYOAWPY RAtWhAJ4ozTD1G3 xLzVkeuJvPDJTsL bkcBQCfX4E0
      YR/+zWSPDwX0uUf8y0 QkxJs=
      =sGTb
      -----END PGP SIGNATURE-----

      --
      Jim Richardson http://www.eskimo.com/~warlock

      Linux, because eventually, you grow up enough to be trusted with a fork()

      Comment

      • Ravi

        #4
        Re: Match beginning of two strings

        [color=blue]
        > Are you trying to match any to any strings? or only a pair as above?
        >[/color]

        Just a pair at a time, and I only want the first N characters that are
        common to both strings. The os.path.commonp refix works nicely. Thanks
        for your help.

        Ravi

        Comment

        • John Machin

          #5
          Re: Match beginning of two strings

          On Sat, 02 Aug 2003 21:18:32 -0700, Scott David Daniels
          <Scott.Daniels@ Acm.Org> wrote:
          [color=blue]
          >Ravi wrote:[color=green]
          >> Hi,
          >>
          >> I have about 200GB of data that I need to go through and extract the
          >> common first part of a line. Something like this.
          >>[color=darkred]
          >> >>>a = "abcdefghijklmn opqrstuvwxyz"
          >> >>>b = "abcdefghijklmn opBHLHT"
          >> >>>c = extract(a,b)
          >> >>>print c[/color]
          >> "abcdefghijklmn op"
          >>
          >> Here I want to extract the common string "abcdefghijklmn op". Basically I
          >> need a fast way to do that for any two given strings. For my situation,
          >> the common string will always be at the beginning of both strings. I can
          >> use regular expressions to do this, but from what I understand there is
          >> a lot of overhead. New data is being generated at the rate of about 1GB
          >> per hour, so this needs to be reasonably fast while leaving CPU time for
          >> other processes.
          >>
          >> Thanks
          >> Ravi
          >>[/color]
          >While you can be forgiven for not have guessed, os.path is the place to
          >look:
          > import os.path
          > a = "abcdefghijklmn opqrstuvwxyz"
          > b = "abcdefghijklmn opBHLHT"
          > print os.path.commonp refix([a,b])
          >[/color]

          I don't think that os.path.commonp refix was designed with 200Gb of
          data in mind. Inspection of Lib/*path.py gives one the impression that
          it spends a lot of time discovering that the first element is a prefix
          of itself.

          Ravi, You may need to drop down to C to get the speed you want for
          your requirement to find the longest common prefix of two strings. Two
          things puzzling me: (1) how you would do this with regular expressions
          (2) you have 200Gb now, new data arriving at the rate of 1Gb per hour,
          after a year you have almost 9000Gb; where are you putting it all?
          BTW, I do hope your algorithm is not O(N**2) ...

          Cheers,
          John

          Comment

          • Ravi

            #6
            Re: Match beginning of two strings

            > I don't think that os.path.commonp refix was designed with 200Gb of[color=blue]
            > data in mind. Inspection of Lib/*path.py gives one the impression that
            > it spends a lot of time discovering that the first element is a prefix
            > of itself.
            >
            > Ravi, You may need to drop down to C to get the speed you want for
            > your requirement to find the longest common prefix of two strings. Two
            > things puzzling me: (1) how you would do this with regular expressions
            > (2) you have 200Gb now, new data arriving at the rate of 1Gb per hour,
            > after a year you have almost 9000Gb; where are you putting it all?
            > BTW, I do hope your algorithm is not O(N**2) ...
            >
            > Cheers,
            > John
            >[/color]

            Well, I timed os.path.commonp refix with some typical data and it's
            pulling about 40usec per loop. So I did what I hated and coded a little
            function in C. Goes something like this. My reasoning is that usually
            the point where the strings start to differ is in the 30 - 50 character
            range. Basically the idea is the same as a binary search on a sorted
            array. Divide and conquer by going halfway each time.

            Read in both strings.
            Check to see if the first character matches.
            If yes:
            Check halfway through the string and see if that character matches
            Repeatedly check halfway until the difference point is found.
            Go back through from the difference point backwards and make sure
            the characters match from the start to the difference point.

            I timed it, and it seems to be doing about 3.5usec per loop. Using
            pipes, I can feed it directly into the processing program. Good enough
            for me.

            As for the data, it's data from a radio telescope that's being recorded.
            We do pattern analysis and reduce it to strings. By examining these
            strings (more analysis than just the first common bit of course), we can
            determine what data should be looked at further and what data is
            garbage. The 9000GB problem isn't all that bad, the stuff compresses
            extremely well, down to about 700GB for a year's worth. A couple of RAID
            arrays makes quick work of that.

            Thanks,

            Ravi

            Comment

            • Jeff Epler

              #7
              Re: Match beginning of two strings

              This is a naive implementation of the 'extract' function.
              def extract(a, b):
              m = min(len(a), len(b))
              for i in range(m):
              if a[i] != b[i]:
              return a[:i]
              return a[:m]

              Here's one that uses the new zip() function:
              def extract2(a, b):
              m = min(len(a), len(b))
              for i, ai, bi in (range(m), a, b):
              if ai != bi: return a[:i]
              return a[:m]
              ... unfortunately, it seems to be slower than the first method. On my
              machine (800MHz PIII):
              $ python timeit.py -s 'import ravi' \
              'ravi.extract(" abcdefghijklmno pqrstuvwxyz","a bcdefghijklmnop BHLHT")'
              10000 loops, best of 3: 32.7 usec per loop

              If your goal is actually something slightly different---for instance,
              find the string from a list with the largest shared prefix to a given
              string---then you probably need to research an efficient algorithm.*
              Otherwise, you may be stuck with the above. If you have 200GB, and each
              line is 80 chars, you have about 2.7 billion lines. If you call extract()
              once per line, and it takes 32 usec, you're looking at 24 hours to run.

              Writing extract as a C extension may be wise, you're likely to be able to
              cut those 32 usec down to little more than Python function call overhead,
              or <2usec per loop. That makes your 2.7x10^9 calls only take 70 minutes.

              Jeff
              * The efficient algorithm for this problem might involve arranging the
              list of strings as a tree, which makes finding the right prefix for
              a given string against the list take only about lg(len(l)) steps,
              instead of len(l) steps. This isn't so much a Python problem as a
              computer science problem, though.

              Comment

              • Jim Richardson

                #8
                Re: Match beginning of two strings

                -----BEGIN PGP SIGNED MESSAGE-----
                Hash: SHA1

                On Sat, 02 Aug 2003 23:14:10 -0400,
                Ravi <[email protected] u> wrote:[color=blue]
                >[color=green]
                >> Are you trying to match any to any strings? or only a pair as above?
                >>[/color]
                >
                > Just a pair at a time, and I only want the first N characters that are
                > common to both strings. The os.path.commonp refix works nicely. Thanks
                > for your help.
                >
                > Ravi
                >[/color]

                Well, it wasn't me, but you're welcome anyway :)




                -----BEGIN PGP SIGNATURE-----
                Version: GnuPG v1.2.2 (GNU/Linux)

                iD8DBQE/LJrDd90bcYOAWPY RAlD9AJ9mqKMhx5 if6eZ+i9vMuVkAs mfLugCgs2RE
                A6o4gDz8qa/YhpjSoHdTwJo=
                =ZAWM
                -----END PGP SIGNATURE-----

                --
                Jim Richardson http://www.eskimo.com/~warlock

                Linux, because eventually, you grow up enough to be trusted with a fork()

                Comment

                • Andrew Dalke

                  #9
                  Re: Match beginning of two strings

                  Jim Richardson:[color=blue]
                  > Why bother finding out which one is the shorter? if you try the compare,
                  > and you run out of the other to compare to, then by default, it's not
                  > the same :)[/color]

                  Because I wasn't sure if the strings had embedded NULs in them.
                  Python strings allow those.

                  Otherwise something like this would work

                  char *s1, *s2 = ... the strings from Python
                  char *s = s1;
                  while ( *s1 && (*s1++ == *s2++))
                  ;
                  return the string from s->s1, or just the size s1-s.

                  Andrew
                  dalke@dalkescie ntific.com


                  Comment

                  • Bengt Richter

                    #10
                    Re: Match beginning of two strings

                    On Sat, 2 Aug 2003 22:23:12 -0500, Jeff Epler <jepler@unpytho nic.net> wrote:
                    [color=blue]
                    >This is a naive implementation of the 'extract' function.
                    > def extract(a, b):
                    > m = min(len(a), len(b))
                    > for i in range(m):
                    > if a[i] != b[i]:
                    > return a[:i]
                    > return a[:m]
                    >
                    >Here's one that uses the new zip() function:[/color]
                    I don't see "zip" ;-)
                    [color=blue]
                    > def extract2(a, b):
                    > m = min(len(a), len(b))
                    > for i, ai, bi in (range(m), a, b):
                    > if ai != bi: return a[:i]
                    > return a[:m][/color]
                    [color=blue]
                    >.. unfortunately, it seems to be slower than the first method. On my
                    >machine (800MHz PIII):
                    >$ python timeit.py -s 'import ravi' \
                    > 'ravi.extract(" abcdefghijklmno pqrstuvwxyz","a bcdefghijklmnop BHLHT")'
                    >10000 loops, best of 3: 32.7 usec per loop
                    >[/color]

                    My timing harness (I seem to need a new getopt for timeit.py under 2.3)
                    shows a slight (15-22% less time) improvement for this 2.3 alternative:

                    def commonprefix(s1 , s2): # very little tested!
                    try:
                    for i, c in enumerate(s1):
                    if c != s2[i]: return s1[:i]
                    except IndexError:
                    return s1[:i]
                    return s1


                    [12:39] C:\pywk\clp>tim efuns ravi -c extract -s 'abcdefghijklmn opqrstuvwxyz' -s 'abcdefghijklmn o
                    pBHLHT' -c commonprefix -s 'abcdefghijklmn opqrstuvwxyz' -s 'abcdefghijklmn opBHLHT'
                    timing oh: 0.000007 ratio
                    extract: 0.000088 1.00
                    commonprefix: 0.000074 0.85

                    [12:39] C:\pywk\clp>tim efuns ravi -c extract -s 'abcdefghijklmn opqrstuvwxyz' -s 'abcdefghijklmn o
                    pBHLHT' -c commonprefix -s 'abcdefghijklmn opqrstuvwxyz' -s 'abcdefghijklmn opBHLHT'
                    timing oh: 0.000007 ratio
                    extract: 0.000091 1.00
                    commonprefix: 0.000071 0.78

                    [12:40] C:\pywk\clp>tim efuns ravi -c extract -s 'abcdefghijklmn opqrstuvwxyz' -s 'abcdefghijklmn o
                    pBHLHT' -c commonprefix -s 'abcdefghijklmn opqrstuvwxyz' -s 'abcdefghijklmn opBHLHT'
                    timing oh: 0.000007 ratio
                    extract: 0.000091 1.00
                    commonprefix: 0.000071 0.78

                    [12:40] C:\pywk\clp>tim efuns ravi -c extract -s 'abcdefghijklmn opqrstuvwxyz' -s 'abcdefghijklmn o
                    pBHLHT' -c commonprefix -s 'abcdefghijklmn opqrstuvwxyz' -s 'abcdefghijklmn opBHLHT'
                    timing oh: 0.000007 ratio
                    extract: 0.000088 1.00
                    commonprefix: 0.000071 0.81

                    Regards,
                    Bengt Richter

                    Comment

                    • Jim Richardson

                      #11
                      Re: Match beginning of two strings

                      -----BEGIN PGP SIGNED MESSAGE-----
                      Hash: SHA1

                      On Sun, 3 Aug 2003 12:44:59 -0600,
                      Andrew Dalke <adalke@mindspr ing.com> wrote:[color=blue]
                      > Jim Richardson:[color=green]
                      >> Why bother finding out which one is the shorter? if you try the compare,
                      >> and you run out of the other to compare to, then by default, it's not
                      >> the same :)[/color]
                      >
                      > Because I wasn't sure if the strings had embedded NULs in them.
                      > Python strings allow those.
                      >[/color]

                      Ah, I hadn't considered that.
                      [color=blue]
                      > Otherwise something like this would work
                      >
                      > char *s1, *s2 = ... the strings from Python
                      > char *s = s1;
                      > while ( *s1 && (*s1++ == *s2++))
                      > ;
                      > return the string from s->s1, or just the size s1-s.
                      >
                      > Andrew
                      > dalke@dalkescie ntific.com
                      >
                      >[/color]


                      -----BEGIN PGP SIGNATURE-----
                      Version: GnuPG v1.2.2 (GNU/Linux)

                      iD8DBQE/La14d90bcYOAWPY RApsqAJ9OW1jCVi H/ytTMwpMv/TtTo9Q0BwCg2/Hn
                      R41XPXwJUhQYvCB nrqRMqLY=
                      =arqM
                      -----END PGP SIGNATURE-----

                      --
                      Jim Richardson http://www.eskimo.com/~warlock

                      Linux, because eventually, you grow up enough to be trusted with a fork()

                      Comment

                      • Alex Martelli

                        #12
                        Re: Match beginning of two strings

                        Jeff Epler wrote:
                        ...[color=blue]
                        > .. unfortunately, it seems to be slower than the first method. On my
                        > machine (800MHz PIII):
                        > $ python timeit.py -s 'import ravi' \
                        > 'ravi.extract(" abcdefghijklmno pqrstuvwxyz","a bcdefghijklmnop BHLHT")'
                        > 10000 loops, best of 3: 32.7 usec per loop[/color]

                        Here's my proposal...:

                        import sys

                        def extract(a, b):
                        m = min(len(a), len(b))
                        for i in range(m):
                        if a[i] != b[i]:
                        return a[:i]
                        return a[:m]

                        def extract2(a, b):
                        for i, ai, bi in zip(xrange(sys. maxint), a, b):
                        if ai != bi: return a[:i]
                        return a[:m]

                        def extract3(a, b):
                        for i, ai in enumerate(a):
                        if ai != b[i:i+1]:
                        return a[:i]
                        return a

                        [alex@lancelot python2.3]$ python -O timeit.py -s 'import exa'
                        'exa.extract("a bcdefghijklmnop qrstuvwxyz","ab cdefghijklmnopB HLHT")'
                        100000 loops, best of 3: 13.9 usec per loop
                        [alex@lancelot python2.3]$ python -O timeit.py -s 'import exa'
                        'exa.extract2(" abcdefghijklmno pqrstuvwxyz","a bcdefghijklmnop BHLHT")'
                        10000 loops, best of 3: 19.7 usec per loop
                        [alex@lancelot python2.3]$ python -O timeit.py -s 'import exa'
                        'exa.extract3(" abcdefghijklmno pqrstuvwxyz","a bcdefghijklmnop BHLHT")'
                        100000 loops, best of 3: 15.7 usec per loop

                        Now add after the "import sys" two lines:

                        import psyco
                        psyco.full()

                        and run again:

                        [alex@lancelot python2.3]$ python -O timeit.py -s 'import exa'
                        'exa.extract("a bcdefghijklmnop qrstuvwxyz","ab cdefghijklmnopB HLHT")'
                        1000000 loops, best of 3: 0.771 usec per loop

                        [alex@lancelot python2.3]$ python -O timeit.py -s 'import exa'
                        'exa.extract3(" abcdefghijklmno pqrstuvwxyz","a bcdefghijklmnop BHLHT")'
                        100000 loops, best of 3: 3.37 usec per loop

                        However, extract2 doesn't run correctly with psyco (gets a MemoryError).

                        Still, the 18-times acceleration that psyco is able to effect on
                        the naive extract IS typical of psyco's effect on functions coded
                        in simple, elementary terms. When you really need speed, assuming
                        that your processor is Intel-compatible, consider psyco (of course,
                        you'll generally use psyco.profile, or something more selective
                        still, rather than psyco.full) -- orders-of-magnitude improvements
                        on low-level bottleneck functions are anything but surprising...


                        Alex

                        Comment

                        • Alex Martelli

                          #13
                          Re: Match beginning of two strings

                          Ravi wrote:
                          [color=blue]
                          > Hi,
                          >
                          > I have about 200GB of data that I need to go through and extract the
                          > common first part of a line. Something like this.
                          >[color=green][color=darkred]
                          > >>>a = "abcdefghijklmn opqrstuvwxyz"
                          > >>>b = "abcdefghijklmn opBHLHT"
                          > >>>c = extract(a,b)
                          > >>>print c[/color][/color]
                          > "abcdefghijklmn op"
                          >
                          > Here I want to extract the common string "abcdefghijklmn op". Basically I
                          > need a fast way to do that for any two given strings. For my situation,
                          > the common string will always be at the beginning of both strings. I can[/color]

                          Here's my latest study on this:

                          *** pexa.py:

                          import sys

                          import psyco
                          psyco.full()

                          import cexa
                          import exa

                          def extract(a, b):
                          m = min(len(a), len(b))
                          for i in range(m):
                          if a[i] != b[i]:
                          return a[:i]
                          return a[:m]

                          def extract2(a, b):
                          for i, ai, bi in zip(xrange(len( a)), a, b):
                          if ai != bi: return a[:i]
                          return a[:m]

                          def extract3(a, b):
                          for i, ai in enumerate(a):
                          if ai != b[i:i+1]:
                          return a[:i]
                          return a

                          extract_pyrex = exa.exa

                          extract_c = cexa.cexa

                          *** exa.pyx:

                          def exa(a, b):
                          cdef int la
                          cdef int lb
                          la = len(a)
                          lb = len(b)
                          cdef int lmin
                          lmin = min(la, lb)
                          cdef int i
                          i = 0
                          while i < lmin:
                          if a[i] != b[i]:
                          return a[:i]
                          i = i + 1
                          if lmin == la:
                          return a
                          else:
                          return b

                          *** cexa.c:

                          #include <Python.h>

                          static PyObject*
                          cexa(PyObject* self, PyObject* args)
                          {
                          char *a, *b;
                          int la, lb;
                          int lmin, i;
                          if(!PyArg_Parse Tuple(args, "s#s#", &a, &la, &b, &lb))
                          return 0;

                          lmin = la;
                          if(lmin<lb) lmin = lb;

                          for(i=0; i<lmin; i++)
                          if(a[i] != b[i])
                          break;

                          return Py_BuildValue(" s#", a, i);
                          }

                          static PyMethodDef cexaMethods[] = {
                          {"cexa", cexa, METH_VARARGS, "Extract common prefix"},
                          {0}
                          };

                          void
                          initcexa(void)
                          {
                          Py_InitModule(" cexa", cexaMethods);
                          }

                          I've built the pyrex-coded extension with:

                          from distutils.core import setup
                          from distutils.exten sion import Extension
                          from Pyrex.Distutils import build_ext

                          setup(
                          name = "exa",
                          ext_modules=[
                          Extension("exa" , ["exa.pyx"])
                          ],
                          cmdclass = {'build_ext': build_ext}
                          )

                          and the C-coded one with:

                          from distutils.core import setup
                          from distutils.exten sion import Extension

                          setup(
                          name = "cexa",
                          ext_modules=[
                          Extension("cexa ", ["cexa.c"])
                          ],
                          )

                          and my measurements give me:

                          [alex@lancelot exi]$ python -O timeit.py -s 'import pexa' \[color=blue]
                          > 'pexa.extract(" abcdefghijklmon pKOU", "abcdefghijklmo npZE")'[/color]
                          100000 loops, best of 3: 2.39 usec per loop
                          [alex@lancelot exi]$ python -O timeit.py -s 'import pexa'
                          'pexa.extract(" abcdefghijklmon pKOU", "abcdefghijklmo npZE")'
                          100000 loops, best of 3: 2.14 usec per loop
                          [alex@lancelot exi]$ python -O timeit.py -s 'import pexa'
                          'pexa.extract2( "abcdefghijklmo npKOU", "abcdefghijklmo npZE")'
                          10000 loops, best of 3: 30.2 usec per loop
                          [alex@lancelot exi]$ python -O timeit.py -s 'import pexa'
                          'pexa.extract3( "abcdefghijklmo npKOU", "abcdefghijklmo npZE")'
                          100000 loops, best of 3: 9.59 usec per loop
                          [alex@lancelot exi]$ python -O timeit.py -s 'import pexa'
                          'pexa.extract_p yrex("abcdefghi jklmonpKOU", "abcdefghijklmo npZE")'
                          10000 loops, best of 3: 21.8 usec per loop
                          [alex@lancelot exi]$ python -O timeit.py -s 'import pexa'
                          'pexa.extract_c ("abcdefghijklm onpKOU", "abcdefghijklmo npZE")'
                          100000 loops, best of 3: 1.88 usec per loop
                          [alex@lancelot exi]$

                          So, it seems you can still get a tiny drop of extra speed
                          with a C-coded extension, though it's doubtful whether it's
                          worth the bother wrt the pyro-optimized simple Python code
                          in function 'extract'. I'm not sure where I went wrong in
                          the Pyrex coding (it doesn't seem to be performing anywhere
                          as well as I thought it might) and I'll be happy for real
                          Pyrex expert to show me the way.

                          Of course, as others have pointed out, it's unclear from
                          your problem description that doing such operations pairwise
                          on a lot of pairs of strings is actually what you need. It
                          IS quite possible that what you're doing could often be
                          better modeled, e.g., by repeated "prefix extractions"
                          between ONE fixed string and several other candidate strings;
                          or "prefix extraction" between a set of more than two strings.
                          In each case, it's likely that you can get much better
                          performance by more sophisticated algorithms. However,
                          which algorithms those might be is unclear unless you can
                          provide mode details on what you're doing.


                          Alex

                          Comment

                          • John Machin

                            #14
                            Re: Match beginning of two strings

                            On Mon, 04 Aug 2003 11:56:04 GMT, Alex Martelli <[email protected] >
                            wrote:
                            [color=blue]
                            > I'm not sure where I went wrong in
                            >the Pyrex coding (it doesn't seem to be performing anywhere
                            >as well as I thought it might) and I'll be happy for real
                            >Pyrex expert to show me the way.[/color]

                            I don't call myself an expert, but here's my best shot:

                            If you look at the generated C code, you'll see lots of conversion
                            between C and Python types. The trick is to get your args into C, stay
                            in C as much as possible, and ship back a Python return value.

                            It's made harder with strings as there is not (yet) any way of hinting
                            to Pyrex to use the "s#" gadget, you have to DIY, see below.

                            cdef extern from "Python.h":
                            int PyString_Size(o bject s)

                            def exa2(arga, argb):
                            cdef int la, lb, lmin, i
                            cdef char *a, *b

                            a = arga
                            b = argb
                            la = PyString_Size(a rga)
                            lb = PyString_Size(a rgb)
                            # living dangerously, not testing for error;
                            # Easy to eyeball for correctness in this case,
                            # but ...
                            if la <= lb:
                            lmin = la
                            else:
                            lmin = lb
                            i = 0
                            while i < lmin:
                            if a[i] != b[i]:
                            return arga[:i]
                            i = i + 1
                            if lmin == la:
                            return arga
                            else:
                            return argb

                            Comment

                            • Richie Hindle

                              #15
                              Re: Match beginning of two strings


                              [Alex][color=blue]
                              > I'm not sure where I went wrong in
                              > the Pyrex coding (it doesn't seem to be performing anywhere
                              > as well as I thought it might) and I'll be happy for real
                              > Pyrex expert to show me the way.[/color]

                              I'm no an expert, but I can see a few easily-fixed problems. The line:

                              if a[i] != b[i]:

                              is working with Python strings when it could be working with C
                              strings. Here's the original code and its output on my machine:

                              def exa(a, b):
                              cdef int la
                              cdef int lb
                              la = len(a)
                              lb = len(b)
                              cdef int lmin
                              lmin = min(la, lb)
                              cdef int i
                              i = 0
                              while i < lmin:
                              if a[i] != b[i]:
                              return a[:i]
                              i = i + 1
                              if lmin == la:
                              return a
                              else:
                              return b

                              100000 loops, best of 3: 9.11 usec per loop

                              Here's a modified version of the code comparing C strings:

                              def exa(a, b):
                              cdef char* c_a # `a` as a C string
                              cdef char* c_b # `b` as a C string
                              cdef int la
                              cdef int lb

                              c_a = a
                              c_b = b
                              la = len(a)
                              lb = len(b)
                              cdef int lmin
                              lmin = min(la, lb)
                              cdef int i
                              i = 0
                              while i < lmin:
                              if c_a[i] != c_b[i]:
                              return a[:i]
                              i = i + 1
                              if lmin == la:
                              return a
                              else:
                              return b

                              100000 loops, best of 3: 5.79 usec per loop

                              Almost twice as fast. Looking at the generated C is always
                              worthwhile when optimising Pyrex code - here's the code that does
                              the comparison against Python strings:

                              /* "D:\src\tests\p yrex\exa.pyx":1 1 */
                              __pyx_3 = PyInt_FromLong( __pyx_v_i); if (!__pyx_3) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 11; goto __pyx_L1;}
                              __pyx_1 = PyObject_GetIte m(__pyx_v_a, __pyx_3); if (!__pyx_1) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 11; goto __pyx_L1;}
                              Py_DECREF(__pyx _3); __pyx_3 = 0;
                              __pyx_5 = PyInt_FromLong( __pyx_v_i); if (!__pyx_5) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 11; goto __pyx_L1;}
                              __pyx_2 = PyObject_GetIte m(__pyx_v_b, __pyx_5); if (!__pyx_2) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 11; goto __pyx_L1;}
                              Py_DECREF(__pyx _5); __pyx_5 = 0;
                              if (PyObject_Cmp(_ _pyx_1, __pyx_2, &__pyx_4) < 0) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 11; goto __pyx_L1;}
                              __pyx_4 = __pyx_4 != 0;
                              Py_DECREF(__pyx _1); __pyx_1 = 0;
                              Py_DECREF(__pyx _2); __pyx_2 = 0;
                              if (__pyx_4) {

                              vs.

                              /* "D:\src\tests\p yrex\exa.pyx":1 6 */
                              __pyx_5 = ((__pyx_v_c_a[__pyx_v_i]) != (__pyx_v_c_b[__pyx_v_i]));
                              if (__pyx_5) {

                              for C strings. There's another similar optimisation that the C
                              output leads you to: you can use strlen rather than Python's len:

                              cdef extern from "string.h":
                              int strlen(char*)

                              def exa(a, b):
                              cdef char* c_a # `a` as a C string
                              cdef char* c_b # `b` as a C string
                              cdef int la
                              cdef int lb

                              c_a = a
                              c_b = b
                              la = strlen(c_a)
                              lb = strlen(c_b)
                              cdef int lmin
                              lmin = min(la, lb)
                              cdef int i
                              i = 0
                              while i < lmin:
                              if c_a[i] != c_b[i]:
                              return a[:i]
                              i = i + 1
                              if lmin == la:
                              return a
                              else:
                              return b

                              100000 loops, best of 3: 3.58 usec per loop

                              That replaces:

                              /* "D:\src\tests\p yrex\exa.pyx":4 */
                              __pyx_1 = __Pyx_GetName(_ _pyx_b, "len"); if (!__pyx_1) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 4; goto __pyx_L1;}
                              __pyx_2 = PyTuple_New(1); if (!__pyx_2) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 4; goto __pyx_L1;}
                              Py_INCREF(__pyx _v_a);
                              PyTuple_SET_ITE M(__pyx_2, 0, __pyx_v_a);
                              __pyx_3 = PyObject_CallOb ject(__pyx_1, __pyx_2); if (!__pyx_3) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 4; goto __pyx_L1;}
                              Py_DECREF(__pyx _1); __pyx_1 = 0;
                              Py_DECREF(__pyx _2); __pyx_2 = 0;
                              __pyx_4 = PyInt_AsLong(__ pyx_3); if (PyErr_Occurred ()) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 4; goto __pyx_L1;}
                              Py_DECREF(__pyx _3); __pyx_3 = 0;
                              __pyx_v_la = __pyx_4;

                              /* "D:\src\tests\p yrex\exa.pyx":5 */
                              __pyx_1 = __Pyx_GetName(_ _pyx_b, "len"); if (!__pyx_1) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 5; goto __pyx_L1;}
                              __pyx_2 = PyTuple_New(1); if (!__pyx_2) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 5; goto __pyx_L1;}
                              Py_INCREF(__pyx _v_b);
                              PyTuple_SET_ITE M(__pyx_2, 0, __pyx_v_b);
                              __pyx_3 = PyObject_CallOb ject(__pyx_1, __pyx_2); if (!__pyx_3) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 5; goto __pyx_L1;}
                              Py_DECREF(__pyx _1); __pyx_1 = 0;
                              Py_DECREF(__pyx _2); __pyx_2 = 0;
                              __pyx_4 = PyInt_AsLong(__ pyx_3); if (PyErr_Occurred ()) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 5; goto __pyx_L1;}
                              Py_DECREF(__pyx _3); __pyx_3 = 0;
                              __pyx_v_lb = __pyx_4;

                              with :

                              /* "D:\src\tests\p yrex\exa.pyx":1 2 */
                              __pyx_v_la = strlen(__pyx_v_ c_a);

                              /* "D:\src\tests\p yrex\exa.pyx":1 3 */
                              __pyx_v_lb = strlen(__pyx_v_ c_b);

                              and leaves the call to 'min' as the only remaining huge block of C.
                              The final version looks like this, eliminating 'min' (Greg, can we have
                              the terary operator in Pyrex please? <good_mood ? wink : frown>)

                              cdef extern from "string.h":
                              int strlen(char*)

                              def exa(a, b):
                              cdef char* c_a # `a` as a C string
                              cdef char* c_b # `b` as a C string
                              cdef int la
                              cdef int lb

                              c_a = a
                              c_b = b
                              la = strlen(c_a)
                              lb = strlen(c_b)
                              cdef int lmin
                              if la < lb:
                              lmin = la
                              else:
                              lmin = lb
                              cdef int i
                              i = 0
                              while i < lmin:
                              if c_a[i] != c_b[i]:
                              return a[:i]
                              i = i + 1
                              if lmin == la:
                              return a
                              else:
                              return b


                              1000000 loops, best of 3: 0.803 usec per loop

                              Over ten times quicker than the original, for the sake of a couple of
                              small tweaks driven by looking at the C output. Although the C still
                              looks very verbose at first glance, it's now substantially the same as
                              Alex's cexa.c.

                              Hope that helps,

                              --
                              Richie Hindle
                              richie@entrian. com


                              Comment

                              Working...