xml processing and sys.setdefaultencoding

  • christof hoeke

    xml processing and sys.setdefaultencoding

    hi,
    i wrote a small application which extracts javadoc-like documentation
    for xslt stylesheets using python, xslt and pyana.
    using non-ascii characters was a problem. so i set the default encoding to
    UTF-8 and now everything works (at least it seems so, need to do more
    testing though).

    it may not be the most elegant solution (according to python in a nutshell)
    but it almost seems that when doing xml processing it is mandatory to set
    the default encoding. xml processing should work almost only with unicode
    strings, and this seems the easiest solution.

    any comments on this? better ways to work?

    thanks
    chris


  • Alan Kennedy

    #2
    Re: xml processing and sys.setdefaultencoding

    christof hoeke wrote:

    > i wrote a small application which extracts javadoc-like
    > documentation for xslt stylesheets using python, xslt and pyana.
    > using non-ascii characters was a problem.

    That's odd. Did your stylesheets contain non-ascii characters? If yes,
    did you declare the character encoding at the beginning of the
    document, e.g.

    <?xml version="1.0" encoding="iso-8859-1"?>

    > so i set the [python] default encoding to
    > UTF-8 and now everything works (at least it seems so, need to do more
    > testing though).

    If you don't put an encoding declaration in your XML documents
    (including XSLT style/transform sheets), then an XML parser will by
    default treat the document content as UTF-8 (or as UTF-16 if it starts
    with a byte order mark), as the XML standard mandates.
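
    As a rough illustration of that default, here is a minimal sketch using
    the standard library's xml.etree.ElementTree in current Python 3 (not
    pyana, which the original post uses). The tiny <doc> documents are made
    up for the example.

```python
import xml.etree.ElementTree as ET

# A document that declares iso-8859-1 and really is stored as iso-8859-1.
declared = (
    '<?xml version="1.0" encoding="iso-8859-1"?><doc>caf\u00e9</doc>'
).encode("iso-8859-1")

# A document with no encoding declaration, stored as UTF-8.
undeclared = "<doc>caf\u00e9</doc>".encode("utf-8")

# The parser honours the declaration in the first case and falls back to
# UTF-8 in the second, so both yield the same unicode text.
declared_text = ET.fromstring(declared).text
undeclared_text = ET.fromstring(undeclared).text
```

    The byte sequences differ, but the parsed text is the same, because in
    each case the parser was told (or correctly assumed) how the bytes were
    encoded.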

    Are you working from XML documents which are stored as strings inside
    a python module? In that case, your special characters will actually
    be encoded in whatever encoding your python module is stored in. So you
    might need to put an encoding declaration in your python module:

    This PEP proposes to introduce a syntax to declare the encoding of a Python source file. The encoding information is then used by the Python parser to interpret the file using the given encoding. Most notably this enhances the interpretation of Unicode ...
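
    A minimal sketch of what such a module might look like, assuming the
    file is saved as UTF-8. The STYLESHEET constant and its <doc> content
    are hypothetical, and the parsing uses the standard library rather than
    pyana. Note that Python 3 already assumes UTF-8 for source files, so the
    declaration matters mostly for other encodings or older interpreters.

```python
# -*- coding: utf-8 -*-
# The declaration above follows the PEP's syntax and has to sit on the first
# or second line of the module file; it tells Python how the file is stored.
import xml.etree.ElementTree as ET

# Hypothetical XML fragment kept inside the module as a string.
STYLESHEET = '<?xml version="1.0" encoding="utf-8"?><doc>café</doc>'

# Encode explicitly to the encoding named in the XML declaration, so the
# bytes handed to the parser match what the document claims about itself.
root = ET.fromstring(STYLESHEET.encode("utf-8"))
label = root.text
```

    The point is that two declarations are in play: one for the python
    source file and one for the XML document, and each has to match the
    bytes it describes.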

    > it may not be the most elegant solution (according to python in a
    > nutshell)
    > but it almost seems that when doing xml processing it is mandatory to
    > set the default encoding. xml processing should work almost only with
    > unicode strings, and this seems the easiest solution.

    It is always recommended to explicitly state the encoding on your XML
    documents. If you don't, then the parser assumes UTF-(8|16). If your
    documents aren't really UTF-(8|16), then you will get seemingly random
    mapping of characters to other characters.
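
    To make the mismatch concrete, here is a small sketch, again with the
    standard library's xml.etree.ElementTree in Python 3 and made-up <doc>
    content. How a mismatch shows up depends on the parser and the bytes
    involved: sometimes you get wrong characters, sometimes an outright
    parse error.

```python
import xml.etree.ElementTree as ET

utf8_bytes = "<doc>caf\u00e9</doc>".encode("utf-8")
latin1_bytes = "<doc>caf\u00e9</doc>".encode("iso-8859-1")

# Case 1: UTF-8 bytes wrongly labelled as iso-8859-1 parse without error,
# but each byte of the two-byte UTF-8 sequence becomes its own character.
mislabelled = b'<?xml version="1.0" encoding="iso-8859-1"?>' + utf8_bytes
garbled = ET.fromstring(mislabelled).text

# Case 2: iso-8859-1 bytes with no declaration are read as UTF-8. The lone
# 0xE9 byte is not valid UTF-8 here, so the parser rejects the document.
try:
    ET.fromstring(latin1_bytes)
    rejected = False
except ET.ParseError:
    rejected = True
```

    In the first case the single accented character comes back as two
    unrelated characters; in the second the document does not parse at all.
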

    > any comments on this? better ways to work?

    If you're not dealing specifically with ASCII, then declare your
    encodings, in both your python modules and your xml documents. Find
    out what is the default character set used by your text editor. Find
    out how to change which character set is in use.
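
    On the output side, one way to keep the declaration and the bytes in
    agreement is to let the XML library write both. A minimal sketch with
    the standard library in Python 3 (an assumption; pyana has its own
    API), using a made-up <doc> element:

```python
import xml.etree.ElementTree as ET

root = ET.Element("doc")
root.text = "caf\u00e9"

# Serialising with an explicit non-UTF-8 encoding makes ElementTree emit a
# matching XML declaration in front of the encoded bytes.
data = ET.tostring(root, encoding="iso-8859-1")

# Parsing the result needs no extra hints: the declaration travels with it.
round_tripped = ET.fromstring(data).text
```

    Because the declaration is generated from the same encoding argument
    that produced the bytes, the two cannot drift apart.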

    If you create, sell or maintain text editing or processing software,
    make it easy for your users to find out what character encodings are
    in effect.

    HTH,

    --
    alan kennedy
    -----------------------------------------------------
    check http headers here: http://xhaus.com/headers
    email alan: http://xhaus.com/mailto/alan
