My Big Dict.

  • Xavier

    My Big Dict.

    Greetings,

    (do excuse the possibly comical subject text)

    I need advice on how I can convert a text db into a dict. Here is an
    example of what I need done.

    some example data lines in the text db goes as follows:

    CODE1!DATA1 DATA2, DATA3
    CODE2!DATA1, DATA2 DATA3

    As you can see, the lines are dynamic and the data are not alike, they
    change in permission values (but that's obvious in any similar situation)

    Any idea on how I can convert 20,000+ lines of the above into the following
    protocol for use in my code?:

    TXTDB = {'CODE1': 'DATA1 DATA2, DATA3', 'CODE2': 'DATA1, DATA2 DATA3'}

    I was thinking of using AWK or something to the similar liking but I just
    wanted to check up with the list for any faster/sufficient hacks in python
    to do such a task.

    Thanks.

    -- Xavier.

    oderint dum mutuant



  • Christophe Delord

    #2
    Re: My Big Dict.

    Hello,

    On Wed, 2 Jul 2003 00:13:26 -0400, Xavier wrote:
    > Greetings,
    >
    > (do excuse the possibly comical subject text)
    >
    > I need advice on how I can convert a text db into a dict. Here is an
    > example of what I need done.
    >
    > some example data lines in the text db goes as follows:
    >
    > CODE1!DATA1 DATA2, DATA3
    > CODE2!DATA1, DATA2 DATA3
    >
    > As you can see, the lines are dynamic and the data are not alike, they
    > change in permission values (but that's obvious in any similar
    > situation)
    >
    > Any idea on how I can convert 20,000+ lines of the above into the
    > following protocol for use in my code?:
    >
    > TXTDB = {'CODE1': 'DATA1 DATA2, DATA3', 'CODE2': 'DATA1, DATA2 DATA3'}
    >
    > I was thinking of using AWK or something to the similar liking but I
    > just wanted to check up with the list for any faster/sufficient hacks
> in python to do such a task.

    If your data is in a string you can use a regular expression to parse
    each line, then the findall method returns a list of tuples containing
    the key and the value of each item. Finally the dict class can turn this
    list into a dict. For example:

data_re = re.compile(r"^(\w+)!(.*)", re.MULTILINE)

bigdict = dict(data_re.findall(data))

On my computer the second line takes between 7 and 8 seconds to parse
100000 lines.

    Try this:

    ------------------------------
import re
import time

N = 100000

print "Initialisation..."
data = "".join(["CODE%d!DATA%d_1, DATA%d_2, DATA%d_3\n" % (i, i, i, i)
                for i in range(N)])

data_re = re.compile(r"^(\w+)!(.*)", re.MULTILINE)

print "Parsing..."
start = time.time()
bigdict = dict(data_re.findall(data))
stop = time.time()

print "%s items parsed in %s seconds" % (len(bigdict), stop - start)
    ------------------------------
    >
    > Thanks.
    >
    > -- Xavier.
    >
    > oderint dum mutuant
    >
    >
>


    --

(o_ Christophe Delord __o
//\ http://christophe.delord.free.fr/ _`\<,_
V_/_ mailto:[email protected] (_)/ (_)


    • John Hunter

      #3
      Re: My Big Dict.

>>>>> "Russell" == Russell Reagan <[email protected]> writes:

drs> f = open('somefile.txt')
drs> d = {}
drs> l = f.readlines()
drs> for i in l:
drs>     a,b = i.split('!')
drs>     d[a] = b.strip()


      I would make one minor modification of this. If the file were *really
      long*, you could run into troubles trying to hold it in memory. I
      find the following a little cleaner (with python 2.2), and doesn't
      require putting the whole file in memory. A file instance is an
      iterator (http://www.python.org/doc/2.2.1/whatsnew/node4.html) which
      will call readline as needed:

d = {}
for line in file('sometext.dat'):
    key, val = line.split('!')
    d[key] = val.strip()

      Or if you are not worried about putting it in memory, you can use list
      comprehensions for speed

d = dict([ line.split('!') for line in file('somefile.text')])

      Russell> I have just started learning Python, and I have never
      Russell> used dictionaries in Python, and despite the fact that
      Russell> you used mostly non-descriptive variable names, I can
      Russell> still read your code perfectly and know exactly what it
      Russell> does. I think I could use dictionaries now, just from
      Russell> looking at your code snippet. Python rules :-)

      Truly.

      JDH
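John's streaming approach can be sketched as a small helper (the filename is hypothetical, and `open()` stands in for the `file()` built-in, which later Python versions removed):

```python
def load_txtdb(path):
    """Build the dict one line at a time; the file object yields lines
    lazily, so only one line is held in memory at once."""
    d = {}
    with open(path) as f:
        for line in f:
            key, val = line.split('!')   # assumes exactly one '!' per line
            d[key] = val.strip()
    return d
```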





      • Aurélien Géron

        #4
        Re: My Big Dict.

> "Christophe Delord" <[email protected]> wrote in message
> news:[email protected]...
        > > Hello,
        > >
        > > On Wed, 2 Jul 2003 00:13:26 -0400, Xavier wrote:
> >
        > > > Greetings,
        > > >
        > > > (do excuse the possibly comical subject text)
        > > >
        > > > I need advice on how I can convert a text db into a dict. Here is an
        > > > example of what I need done.
        > > >
        > > > some example data lines in the text db goes as follows:
        > > >
        > > > CODE1!DATA1 DATA2, DATA3
        > > > CODE2!DATA1, DATA2 DATA3
        > > >
        > > > As you can see, the lines are dynamic and the data are not alike, they
        > > > change in permission values (but that's obvious in any similar
        > > > situation)
        > > >
        > > > Any idea on how I can convert 20,000+ lines of the above into the
        > > > following protocol for use in my code?:
        > > >
        > > > TXTDB = {'CODE1': 'DATA1 DATA2, DATA3', 'CODE2': 'DATA1, DATA2 DATA3'}
> > >
        > >
        > > If your data is in a string you can use a regular expression to parse
        > > each line, then the findall method returns a list of tuples containing
        > > the key and the value of each item. Finally the dict class can turn this
> > list into a dict. For example:
        >
        > and you can kill a fly with a sledgehammer. why not
        >
> f = open('somefile.txt')
        > d = {}
        > l = f.readlines()
> for i in l:
>     a,b = i.split('!')
>     d[a] = b.strip()
        >
> or am i missing something obvious? (b/t/w the above parsed 20000+ lines on a
> celeron 500 in less than a second.)

Your code looks good Christophe. Just two little things to be aware of:
1) if you use split like this, then each line must contain one and only one
'!', which means (in particular) that empty lines will bomb, and also data
must not contain any '!' or else you'll get an exception such as
"ValueError: unpack list of wrong size". If your data may contain '!',
then consider slicing up each line in a different way.
        2) if your file is really huge, then you may want to fill up your dictionary
        as you're reading the file, instead of reading everything in a list and then
        building your dictionary (hence using up twice the memory).

        But apart from these details, I agree with Christophe that this is the way
        to go.

        Aurélien
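A minimal sketch of both points together, assuming a hypothetical filename and using `str.partition` (added in Python 2.5, after this thread): it fills the dict while streaming the file, skips lines without a '!', and splits on the first '!' only, so the data may itself contain '!':

```python
def load_txtdb(path):
    d = {}
    with open(path) as f:          # stream the file; no readlines() list
        for line in f:
            # partition never raises: it returns ('', '', line) if no '!'
            head, sep, tail = line.partition('!')
            if sep:                # keep only lines that contained a '!'
                d[head] = tail.strip()
    return d
```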



        • Paul Simmonds

          #5
          Re: My Big Dict.

"Aurélien Géron" <[email protected]> wrote in message news:<bdua4i$[email protected]>...
> "drs" wrote...
> > "Christophe Delord" <[email protected]> wrote in message
> > news:[email protected]...
          > > > Hello,
          > > >
> > > On Wed, 2 Jul 2003 00:13:26 -0400, Xavier wrote:
<snip>
          > > > > I need advice on how I can convert a text db into a dict. Here is an
          > > > > example of what I need done.
          > > > >
          > > > > some example data lines in the text db goes as follows:
          > > > >
          > > > > CODE1!DATA1 DATA2, DATA3
> > > > CODE2!DATA1, DATA2 DATA3
<snip>
          > > > > Any idea on how I can convert 20,000+ lines of the above into the
          > > > > following protocol for use in my code?:
          > > > >
          > > > > TXTDB = {'CODE1': 'DATA1 DATA2, DATA3', 'CODE2': 'DATA1, DATA2 DATA3'}
          > > > >
          > > >
          > > > If your data is in a string you can use a regular expression to parse
          > > > each line, then the findall method returns a list of tuples containing
          > > > the key and the value of each item. Finally the dict class can turn this
> > > list into a dict. For example:
<example snipped>
          > >
          > > and you can kill a fly with a sledgehammer. why not
          > >
> > f = open('somefile.txt')
          > > d = {}
          > > l = f.readlines()
> > for i in l:
> >     a,b = i.split('!')
> >     d[a] = b.strip()
<snip>
> Your code looks good Christophe. Just two little things to be aware of:

I think I'm right in saying Christophe's approach was the one using the 're'
module, which has been snipped, whereas the approach above using split
was by "drs".
> 1) if you use split like this, then each line must contain one and only one
> '!', which means (in particular) that empty lines will bomb, and also data
> must not contain any '!' or else you'll get an exception such as
> "ValueError: unpack list of wrong size". If your data may contain '!',
> then consider slicing up each line in a different way.

          If this is a problem, use a combination of count and index methods to
          find the first, and use slices. For example, if you don't mind
          two-lined list comps:

d = dict([(l[:l.index('!')], l[l.index('!')+1:-1])
          for l in file('test.txt') if l.count('!')])
          > 2) if your file is really huge, then you may want to fill up your dictionary
          > as you're reading the file, instead of reading everything in a list and then
> building your dictionary (hence using up twice the memory).
          Agreed.

The above list comprehension has the disadvantages that it counts the
'!' characters on every line, and it reads the whole file in at
once. Assuming there are going to be more data lines than not, this is
much faster:

d = {}
for l in file("test.txt"):
    try: i = l.index('!')
    except ValueError: continue
    d[l[:i]] = l[i+1:]

          It's often much faster to ask forgiveness than permission. I measure
          it about twice as fast as the 're' method, and about four times as
          fast as the list comp above.
          HTH,
          Paul
          >
          > But apart from these details, I agree with Christophe that this is the way
          > to go.
          >
> Aurélien




              • Christian Tismer

                #8
                Re: My Big Dict.

                Paul Simmonds wrote:
                ....

                I'm not trying to intrude this thread, but was just
                struck by the list comprehension below, so this is
                about readability.
                > If this is a problem, use a combination of count and index methods to
                > find the first, and use slices. For example, if you don't mind
                > two-lined list comps:
                >
> d = dict([(l[:l.index('!')], l[l.index('!')+1:-1])
>           for l in file('test.txt') if l.count('!')])

With every respect, this looks pretty much like another
P-language. The pure existence of list comprehensions
does not force you to use them everywhere :-)

                ....

                compared to this:
                ....
> d = {}
> for l in file("test.txt"):
>     try: i = l.index('!')
>     except ValueError: continue
>     d[l[:i]] = l[i+1:]

                which is both faster in this case and easier to read.

                About speed: I'm not sure with the current Python
                version, but it might be worth trying to go without
                the exception:

d = {}
for l in file("test.txt"):
    i = l.find('!')
    if i >= 0:
        d[l[:i]] = l[i+1:]

                and then you might even consider to split on the first
                "!", but I didn't do any timings:

d = {}
for l in file("test.txt"):
    try:
        key, value = l.split("!", 1)
    except ValueError: continue
    d[key] = value


                cheers -- chris

                --
Christian Tismer :^) <mailto:[email protected]>
                Mission Impossible 5oftware : Have a break! Take a ride on Python's
                Johannes-Niemeyer-Weg 9a : *Starship* http://starship.python.net/
                14109 Berlin : PGP key -> http://wwwkeys.pgp.net/
                work +49 30 89 09 53 34 home +49 30 802 86 56 pager +49 173 24 18 776
                PGP 0x57F3BF04 9064 F4E1 D754 C2FF 1619 305B C09C 5A3B 57F3 BF04
                whom do you want to sponsor today? http://www.stackless.com/




                  • Paul Simmonds

                    #10
                    Re: My Big Dict.

Christian Tismer <[email protected]> wrote in message news:<mailman.1057162092.18394.python-list@python.org>...
                    > Paul Simmonds wrote:
                    > ...
                    > I'm not trying to intrude this thread, but was just
                    > struck by the list comprehension below, so this is
                    > about readability.[/color]
<snipped>
                    > >
> > d = dict([(l[:l.index('!')], l[l.index('!')+1:-1])
> >           for l in file('test.txt') if l.count('!')])
                    >
                    > With every respect, this looks pretty much like another
                    > P-language. The pure existance of list comprehensions
                    > does not try to force you to use it everywhere :-)
>

                    Quite right. I think that mutation came from the fact that I was
                    thinking in C all day. Still, I don't even write C like that...it
                    should be put to sleep ASAP.

<snip>
> > d = {}
> > for l in file("test.txt"):
> >     try: i = l.index('!')
> >     except ValueError: continue
> >     d[l[:i]] = l[i+1:]
                    >
                    > About speed: I'm not sure with the current Python
                    > version, but it might be worth trying to go without
                    > the exception:
                    >
> d = {}
> for l in file("test.txt"):
>     i = l.find('!')
>     if i >= 0:
>         d[l[:i]] = l[i+1:]
                    >
                    > and then you might even consider to split on the first
                    > "!", but I didn't do any timings:
                    >
> d = {}
> for l in file("test.txt"):
>     try:
>         key, value = l.split("!", 1)
>     except ValueError: continue
>     d[key] = value
>
                    Just when you think you know a language, an optional argument you've
                    never used pops up to make your life easier. Thanks for pointing that
                    out.

                    I've done some timings on the functions above, here are the results:

Python 2.2.1, 200000-line file (all data lines)
                    try/except with split: 3.08s
                    if with slicing: 2.32s
                    try/except with slicing: 2.34s

                    So slicing seems quicker than split, and using if instead of
                    try/except appears to speed it up a little more. I don't know how much
                    faster the current version of the interpreter would be, but I doubt
                    the ranking would change much.

                    Paul
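A harness in the spirit of these timings might look like the sketch below (function names and the line count are illustrative; absolute times depend on the machine and Python version, as the thread's own numbers show):

```python
import timeit

# Synthetic data matching the thread's CODE!DATA format.
lines = ["CODE%d!DATA%d_1, DATA%d_2, DATA%d_3\n" % (i, i, i, i)
         for i in range(200000)]

def with_split():
    # split on the first '!' only; skip malformed lines via try/except
    d = {}
    for l in lines:
        try:
            key, val = l.split('!', 1)
        except ValueError:
            continue
        d[key] = val
    return d

def with_find():
    # locate the first '!' and slice; skip lines without one
    d = {}
    for l in lines:
        i = l.find('!')
        if i >= 0:
            d[l[:i]] = l[i + 1:]
    return d

for fn in (with_split, with_find):
    t = timeit.timeit(fn, number=1)
    print("%s: %.3fs" % (fn.__name__, t))
```

Both functions build the same dict, so the comparison isolates the cost of the parsing strategy itself.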



                      • Christian Tismer

                        #12
                        Re: My Big Dict.

                        Paul Simmonds wrote:

                        [some alternative implementations]
                        > I've done some timings on the functions above, here are the results:
                        >
                        > Python2.2.1, 200000 line file(all data lines)
                        > try/except with split: 3.08s
                        > if with slicing: 2.32s
                        > try/except with slicing: 2.34s
                        >
                        > So slicing seems quicker than split, and using if instead of
                        > try/except appears to speed it up a little more. I don't know how much
                        > faster the current version of the interpreter would be, but I doubt
> the ranking would change much.

Interesting. I doubt that split() itself is slow; instead I believe
that the pure fact that you are calling a function
instead of using a syntactic construct makes things slower,
since method lookup is not so cheap. Unfortunately, split()
cannot be cached in a local variable, since it is obtained
as a new method of the line, all the time. On the other hand,
the same holds for the find method...

                         Well, I wrote a test program and figured out that the test
                         results were very dependent on the order in which the
                         functions were called! This means the results are not
                         independent, probably due to memory usage.
                         Here are some results on Win32, testing repeatedly...

                         D:\slpdev\src\2.2\src\PCbuild> python -i \python22\py\testlines.py
                         >>> test()
                         function test_index for 200000 lines took 1.064 seconds.
                         function test_find for 200000 lines took 1.402 seconds.
                         function test_split for 200000 lines took 1.560 seconds.
                         >>> test()
                         function test_index for 200000 lines took 1.395 seconds.
                         function test_find for 200000 lines took 1.502 seconds.
                         function test_split for 200000 lines took 1.888 seconds.
                         >>> test()
                         function test_index for 200000 lines took 1.416 seconds.
                         function test_find for 200000 lines took 1.655 seconds.
                         function test_split for 200000 lines took 1.755 seconds.
                         >>>

                        For that reason, I added a command line mode for testing
                        single functions, with these results:

                         D:\slpdev\src\2.2\src\PCbuild> python \python22\py\testlines.py index
                         function test_index for 200000 lines took 1.056 seconds.

                         D:\slpdev\src\2.2\src\PCbuild> python \python22\py\testlines.py find
                         function test_find for 200000 lines took 1.092 seconds.

                         D:\slpdev\src\2.2\src\PCbuild> python \python22\py\testlines.py split
                         function test_split for 200000 lines took 1.255 seconds.

                        The results look much more reasonable; the index thing still
                        seems to be optimum.
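                         [Editorial note: on current Pythons, the same per-function isolation can be had in a single process. timeit.repeat times a callable in fresh measurement loops, and taking the minimum of several repeats damps the warm-up and call-order effects described above. A minimal sketch, not from the original thread; the parser and its name parse_index are illustrative stand-ins for the thread's test functions, rewritten for Python 3:]

```python
import timeit

# A stand-in parser in the style of the thread's test_index,
# rewritten for current Python 3 (the thread used Python 2.2).
def parse_index(data):
    d = {}
    for l in data:
        try:
            i = l.index('!')
        except ValueError:
            continue
        d[l[:i]] = l[i + 1:]
    return d

data = ["silly key %d!silly value" % i for i in range(10000)]

# timeit.repeat runs the callable in fresh loops; taking the minimum
# of several repeats damps warm-up and ordering effects without
# needing a separate interpreter process per function.
best = min(timeit.repeat(lambda: parse_index(data), number=5, repeat=3))
print("best of 3 repeats: %.3f seconds" % best)
```

                         [The minimum, rather than the mean, is the conventional choice here: background noise can only slow a run down, so the fastest repeat is the best estimate of the code's cost.]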

                        Then I added another test, using an unbound str.index function,
                        which was again a bit faster.
                         Finally, I moved the try..except clause out of the game by
                         using an explicit, restartable iterator; see the attached program.

                         D:\slpdev\src\2.2\src\PCbuild> python \python22\py\testlines.py index3
                         function test_index3 for 200000 lines took 0.997 seconds.

                        As a side result, split seems to be unnecessarily slow.

                        cheers - chris
                        --
                         Christian Tismer :^) <mailto:tismer@tismer.com>
                        Mission Impossible 5oftware : Have a break! Take a ride on Python's
                        Johannes-Niemeyer-Weg 9a : *Starship* http://starship.python.net/
                        14109 Berlin : PGP key -> http://wwwkeys.pgp.net/
                        work +49 30 89 09 53 34 home +49 30 802 86 56 pager +49 173 24 18 776
                        PGP 0x57F3BF04 9064 F4E1 D754 C2FF 1619 305B C09C 5A3B 57F3 BF04
                        whom do you want to sponsor today? http://www.stackless.com/


                         import sys, time

                         def test_index(data):
                             d={}
                             for l in data:
                                 try: i=l.index('!')
                                 except ValueError: continue
                                 d[l[:i]]=l[i+1:]
                             return d

                         def test_find(data):
                             d={}
                             for l in data:
                                 i=l.find('!')
                                 if i >= 0:
                                     d[l[:i]]=l[i+1:]
                             return d

                         def test_split(data):
                             d={}
                             for l in data:
                                 try:
                                     key, value = l.split("!", 1)
                                 except ValueError: continue
                                 d[key] = value
                             return d

                         def test_index2(data):
                             d={}
                             idx = str.index
                             for l in data:
                                 try: i=idx(l, '!')
                                 except ValueError: continue
                                 d[l[:i]]=l[i+1:]
                             return d

                         def test_index3(data):
                             d={}
                             idx = str.index
                             it = iter(data)
                             while 1:
                                 try:
                                     for l in it:
                                         i=idx(l, '!')
                                         d[l[:i]]=l[i+1:]
                                     else:
                                         return d
                                 except ValueError: continue


                         def make_data(n=200000):
                             return [ "this is some silly key %d!and that some silly value" % i for i in xrange(n) ]

                         def test(funcnames, n=200000):
                             if sys.platform == "win32":
                                 default_timer = time.clock
                             else:
                                 default_timer = time.time

                             data = make_data(n)
                             for name in funcnames.split():
                                 fname = "test_"+name
                                 f = globals()[fname]
                                 t = default_timer()
                                 f(data)
                                 t = default_timer() - t
                                 print "function %-10s for %d lines took %0.3f seconds." % (fname, n, t)

                         if __name__ == "__main__":
                             funcnames = "index find split index2 index3"
                             if len(sys.argv) > 1:
                                 funcnames = " ".join(sys.argv[1:])
                             test(funcnames)
