XML as input format #29822
Open
Labels: comp-formats, feature
Description
Use case
- Stack Exchange archives: https://archive.org/details/stackexchange
- Wikipedia dumps.
- OpenStreetMap dumps.
Describe the solution you'd like
Add support for XML as input format.
The user should specify one of the three flavours with a setting:
elements:
<table>
    <row>
        <column1_name>value</column1_name>
        <column2_name>value</column2_name>
        ...
    </row>
    ...
</table>
attributes:
<table>
    <row column1_name="value" column2_name="value" ... />
    ...
</table>
cells:
<table>
    <row>
        <cell name="column1_name">value</cell>
        <cell name="column2_name">value</cell>
        ...
    </row>
    ...
</table>
Further settings should give the path to the table (like /table), the name of the row element, the name of the cell element, and the name of the attribute holding the column name (for the cells variant).
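To make the three flavours concrete, here is a rough Python sketch of how each one could map to rows. It is only an illustration: the function `read_rows` and the parameter names `flavour`, `table_path`, `row_name`, `cell_name` and `name_attr` are hypothetical, not existing ClickHouse settings, and it uses the standard-library `xml.etree.ElementTree` parser for brevity even though the proposal asks for a lighter parser.

```python
import xml.etree.ElementTree as ET


def read_rows(xml_text, flavour, table_path="/table", row_name="row",
              cell_name="cell", name_attr="name"):
    """Return a list of {column_name: value} dicts for one of the three flavours.

    All parameter names are hypothetical stand-ins for the proposed settings.
    """
    if flavour not in ("elements", "attributes", "cells"):
        raise ValueError(f"unknown flavour: {flavour!r}")

    root = ET.fromstring(xml_text)

    # Walk from the root down the table path, e.g. "/table" or "/dump/table".
    parts = [p for p in table_path.strip("/").split("/") if p]
    if not parts or root.tag != parts[0]:
        raise ValueError(f"root element does not match path {table_path!r}")
    table = root
    for part in parts[1:]:
        table = table.find(part)
        if table is None:
            raise ValueError(f"path {table_path!r} not found")

    rows = []
    for row in table.findall(row_name):
        if flavour == "elements":
            # <row><col>value</col>...</row>: child tag is the column name.
            rows.append({child.tag: child.text or "" for child in row})
        elif flavour == "attributes":
            # <row col="value" ... />: attribute name is the column name.
            rows.append(dict(row.attrib))
        else:
            # <row><cell name="col">value</cell>...</row>
            rows.append({cell.get(name_attr): cell.text or ""
                         for cell in row.findall(cell_name)})
    return rows
```

For example, `read_rows('<table><row id="1" title="x"/></table>', "attributes")` yields a single row with columns `id` and `title`. All values stay strings here; type conversion would be left to the table schema.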
The format should not use a full-featured XML parser, but should support:
- decoding of XML entities;
- optional BOM at the beginning of the file, including UTF-8 BOM;
- UTF-16 and UTF-32 BE/LE encodings;
- CDATA;
- attributes in single and double quotes;
- self-closing tags or separate closing tags;
- skipping XML header and DOCTYPE;
- processing of invalid (unescaped) characters.
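A few of these requirements can be handled without a full XML parser. The Python sketch below shows one possible approach to BOM-based encoding detection, entity decoding, and CDATA handling. The helper names (`decode_with_bom`, `decode_entities`, `decode_text`) are made up for illustration, and leaving stray `&` characters and unknown entities untouched is an assumption about what "processing of invalid characters" could mean.

```python
import codecs
import re

# Check UTF-32 before UTF-16: the UTF-32 LE BOM begins with the UTF-16 LE BOM bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_with_bom(data, default="utf-8"):
    """Strip an optional BOM and decode the bytes with the encoding it implies."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)
    return data.decode(default)


_NAMED = {"lt": "<", "gt": ">", "amp": "&", "quot": '"', "apos": "'"}
_ENTITY = re.compile(r"&(#x[0-9A-Fa-f]+|#[0-9]+|[A-Za-z]+);")


def _replace_entity(match):
    body = match.group(1)
    try:
        if body.startswith("#x"):
            return chr(int(body[2:], 16))
        if body.startswith("#"):
            return chr(int(body[1:]))
    except (ValueError, OverflowError):
        # Code point out of range: keep the original text (lenient mode).
        return match.group(0)
    # Unknown named entities are also passed through unchanged.
    return _NAMED.get(body, match.group(0))


def decode_entities(text):
    """Decode the five predefined XML entities and numeric character references."""
    return _ENTITY.sub(_replace_entity, text)


_CDATA = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)


def decode_text(raw):
    """Decode element content: CDATA sections verbatim, entities everywhere else."""
    out = []
    pos = 0
    for match in _CDATA.finditer(raw):
        out.append(decode_entities(raw[pos:match.start()]))
        out.append(match.group(1))
        pos = match.end()
    out.append(decode_entities(raw[pos:]))
    return "".join(out)
```

With this sketch, an unescaped ampersand such as the one in `AT&T` passes through as-is instead of raising an error, which is the lenient behaviour a strict parser would not give.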
Additional context
Let these settings also control XML output format (that we already have).