Unicode problem.... as always

  • Todd Jenista

    Unicode problem.... as always

    I have a parser I am building with Python and, unfortunately, people
    have decided to put Unicode characters in the files I am parsing.
    The parser seems to have a fit when I search for one \uXXXX symbol
    and there is another Unicode symbol in the file. In this case, a
    search and replace of © with µ in the file causes the infamous
    "ordinal not in range(128)" error.
    My quick fix, because they have good context, is to change them both
    to "UTF8", and then attempt to replace the "UTF8" at the end with the
    original µ. The problem is that I am getting a µ when I try to
    re-insert using \u00b5, which is the UTF8 code.
    Words of wisdom would be greatly appreciated.
  • Thomas Güttler

    #2
    Re: Unicode problem.... as always

    Todd Jenista wrote:
    > I have a parser I am building with Python and, unfortunately, people
    > have decided to put Unicode characters in the files I am parsing.

    Maybe this helps you. It converts a Latin-1 byte string to unicode
    and then converts it to UTF-8:

    >>> s = "ä"
    >>> s_u = unicode(s, "latin1")     # decode the bytes to a unicode object
    >>> s_utf8 = s_u.encode("utf8")    # encode the unicode object as UTF-8 bytes

    You need to know the encoding of the input (e.g. UTF-8 or UTF-16).

    thomas
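
A minimal sketch of the decode, replace, encode round trip suggested above, applied to the © to µ replacement from the question. It is written in Python 3 syntax, whereas the `unicode()` call in the thread is Python 2. The function name `replace_symbol`, the sample bytes, and the Latin-1 default are assumptions for illustration only; the real input encoding has to be known or detected separately.

```python
# Sketch only: replace_symbol and the sample data are hypothetical,
# and Latin-1 is an assumed input encoding, not a detected one.

def replace_symbol(raw: bytes, old: str, new: str, encoding: str = "latin-1") -> bytes:
    """Decode bytes to text, replace on the text, then encode back to bytes."""
    text = raw.decode(encoding)      # bytes -> str, using the known input encoding
    text = text.replace(old, new)    # both operands are str, so nothing is coerced via ASCII
    return text.encode(encoding)     # str -> bytes, in the same encoding as the input


# Stand-in for bytes read from a file opened in binary mode.
raw = "Size 5 \u00a9m, \u00e4".encode("latin-1")
fixed = replace_symbol(raw, "\u00a9", "\u00b5")   # U+00A9 is ©, U+00B5 is µ

# "\u00b5" names the code point U+00B5; the bytes depend on the encoding chosen.
micro_utf8 = "\u00b5".encode("utf-8")       # two bytes: 0xC2 0xB5
micro_latin1 = "\u00b5".encode("latin-1")   # one byte: 0xB5
```

If the two UTF-8 bytes are later displayed or decoded as Latin-1, they show up as two characters (Â followed by µ) rather than a single µ, which may be the kind of corruption described in the question. Keeping everything as text between one decode at input and one encode at output avoids mixing byte strings and text in the same replace call.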
