A function to extract text from HTML and XML #18454

@alexey-milovidov

Use case

We have a web crawl database with HTML content in a column.
We need to extract the text content to do some analysis.

Describe the solution you'd like

It does not have to be 100% correct, but it must be fast.

For HTML and XHTML:

  • remove script and style elements (and maybe meta) with all their content (assuming </script> is properly escaped in JS string literals as expected, counterexample: <script>var x = "</script>"</script>);
  • unwrap CDATA;
  • remove tags (assuming entities are properly escaped in attributes, counterexample: <test test=">">);
  • collapse whitespace.

For XML:

  • unwrap CDATA;
  • remove tags (assuming entities are properly escaped in attributes, counterexample: <test test=">">);
  • collapse whitespace.

(everything should be done in a single pass, but the logical order matters)

We will not support custom XML entities. We will not decode HTML and XML entities (there will be a separate function for that). We will not process the meta charset declaration... Whether comments should be processed is still an open question.
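To make the proposed rules concrete, here is a minimal single-pass sketch in Python. It is an illustration only, not the intended implementation (which would presumably be native code), and every name in it (`extract_text`, `is_html`, the helpers) is hypothetical. It rests on several assumptions that the proposal leaves open: comments are dropped, `meta` is not treated specially, CDATA content is copied as text without being re-scanned for tags, and leading and trailing whitespace is trimmed. Entities are not decoded, as stated above.

```python
CDATA_OPEN = "<![CDATA["
CDATA_CLOSE = "]]>"
SKIPPED_ELEMENTS = ("script", "style")  # "meta" left out: the proposal only says "maybe"


def _is_tag_boundary(s: str, pos: int) -> bool:
    """True if a tag name ends at pos (end of input, whitespace, '/' or '>')."""
    return pos >= len(s) or s[pos].isspace() or s[pos] in "/>"


def _skipped_element_at(s: str, lt: int):
    """Return 'script' or 'style' if such an opening tag starts at s[lt], else None."""
    for name in SKIPPED_ELEMENTS:
        end = lt + 1 + len(name)
        if s[lt + 1:end].lower() == name and _is_tag_boundary(s, end):
            return name
    return None


def _skip_element(s: str, lt: int, name: str) -> int:
    """Return the index just past the element that opens at s[lt]."""
    gt = s.find(">", lt)
    if gt == -1:
        return len(s)
    if s[gt - 1] == "/":  # self-closing form, e.g. <script/>
        return gt + 1
    j = gt + 1
    while True:
        j = s.find("</", j)
        if j == -1:
            return len(s)  # unterminated element: drop the rest
        end = j + 2 + len(name)
        if s[j + 2:end].lower() == name and _is_tag_boundary(s, end):
            close = s.find(">", end)
            return len(s) if close == -1 else close + 1
        j += 2


def extract_text(markup: str, is_html: bool = True) -> str:
    """Strip markup in one left-to-right scan and collapse whitespace."""
    out = []
    pending_space = False

    def emit(chunk: str) -> None:
        nonlocal pending_space
        for ch in chunk:
            if ch.isspace():
                pending_space = True  # remember the run, emit at most one space
            else:
                if pending_space and out:
                    out.append(" ")
                pending_space = False
                out.append(ch)

    i, n = 0, len(markup)
    while i < n:
        if markup[i] != "<":
            emit(markup[i])
            i += 1
        elif markup.startswith(CDATA_OPEN, i):
            # Assumption: CDATA content is text and is not re-scanned for tags.
            start = i + len(CDATA_OPEN)
            end = markup.find(CDATA_CLOSE, start)
            emit(markup[start:n if end == -1 else end])
            i = n if end == -1 else end + len(CDATA_CLOSE)
        elif markup.startswith("<!--", i):
            # Assumption: comments are dropped (left open in the proposal).
            end = markup.find("-->", i + 4)
            i = n if end == -1 else end + 3
        else:
            name = _skipped_element_at(markup, i) if is_html else None
            if name is not None:
                i = _skip_element(markup, i, name)
            else:
                end = markup.find(">", i + 1)  # ordinary tag: skip to the first '>'
                i = n if end == -1 else end + 1
    return "".join(out)


print(extract_text("<p>Hello,   <b>world</b>!</p>"))  # Hello, world!
print(extract_text("<a><![CDATA[1 < 2]]></a>"))       # 1 < 2
```

The sketch inherits both limitations named in the list: `<script>var x = "</script>"</script>ok` yields `"ok`, and `<test test=">">x` yields `">x`, because the scan trusts the first `</script>` and the first `>` it meets.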

Describe alternatives you've considered

Parse HTML with regular expressions:

replaceRegexpAll(replaceRegexpAll(content, '(?s)<(script|style)[^>]*>.*?</(script|style)>', ''), '<[^>]+>', '')
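For experimenting with this alternative outside the database, the same two patterns can be tried with Python's `re` module. This is only a rough equivalent under the assumption that the patterns behave the same in both regex engines for these inputs; the function name `strip_tags_with_regex` is made up for the example.

```python
import re

# Same patterns as in the replaceRegexpAll expression above.
SCRIPT_OR_STYLE = re.compile(r"(?s)<(script|style)[^>]*>.*?</(script|style)>")
ANY_TAG = re.compile(r"<[^>]+>")


def strip_tags_with_regex(content: str) -> str:
    """Drop script/style elements first, then every remaining tag."""
    without_scripts = SCRIPT_OR_STYLE.sub("", content)
    return ANY_TAG.sub("", without_scripts)


print(strip_tags_with_regex("<p>Hi <b>there</b></p><script>var a = 1;</script>"))
# Hi there
```

Compared with the requested function, this approach makes two passes over the data, does not collapse whitespace, matches `script` and `style` case-sensitively (so `<SCRIPT>x</SCRIPT>` leaves `x` behind), and mangles CDATA: `<![CDATA[a > b]]>` becomes ` b]]>`.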
