A function to extract text from HTML and XML #18454
Closed
Description
Use case
We have a web crawl database with HTML content in a column.
We need to extract the text content from it for analysis.
Describe the solution you'd like
It does not have to be 100% correct, but it must be fast.
For HTML and XHTML:
- remove `script` and `style` elements (and maybe `meta`) with all their content (assuming `</script>` is properly escaped in JS string literals as expected; counterexample: `<script>var x = "</script>"</script>`);
- unwrap CDATA;
- remove tags (assuming entities are properly escaped in attributes; counterexample: `<test test=">">`);
- collapse whitespace.
For XML:
- unwrap CDATA;
- remove tags (assuming entities are properly escaped in attributes; counterexample: `<test test=">">`);
- collapse whitespace.
(Everything should be done in a single pass, but the logical order of the steps matters.)
We will not support custom XML entities. We will not decode HTML and XML entities (there will be a separate function for that). We will not process the meta charset declaration... Whether comments should be processed is still an open question.
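To make the proposed rules concrete, here is a minimal Python sketch of the steps listed above. It is only an illustration, not the intended implementation: the function name `extract_text`, the `html` flag, and the choice to treat every removed tag as a word separator are all assumptions made for this example. It scans the input once; whitespace is collapsed at the end for brevity. Entities are left undecoded and comments get no special handling, as described above.

```python
import re

# Hypothetical helper patterns (not part of any existing API).
# Opening tag of an element that is dropped together with its content.
_OPEN_SKIPPED = re.compile(r"<(script|style)(?=[\s/>])", re.IGNORECASE)
# Matching closing tags for those elements.
_CLOSE_SKIPPED = {
    "script": re.compile(r"</script\s*>", re.IGNORECASE),
    "style": re.compile(r"</style\s*>", re.IGNORECASE),
}


def extract_text(markup: str, html: bool = True) -> str:
    """Sketch of the proposed extraction; html=False applies the XML rules."""
    pieces = []
    i, n = 0, len(markup)
    while i < n:
        if markup.startswith("<![CDATA[", i):
            # Unwrap CDATA: keep the payload verbatim, tags inside included.
            end = markup.find("]]>", i + 9)
            end = n if end == -1 else end
            pieces.append(markup[i + 9:end])
            i = end + 3
        elif markup[i] == "<":
            opened = _OPEN_SKIPPED.match(markup, i) if html else None
            if opened:
                # Drop the element with all its content, up to the FIRST
                # closing tag (assumes "</script>" is escaped inside JS).
                name = opened.group(1).lower()
                closed = _CLOSE_SKIPPED[name].search(markup, opened.end())
                i = closed.end() if closed else n
            else:
                # Drop one tag, up to the FIRST '>' (assumes '>' is
                # escaped inside attribute values).
                end = markup.find(">", i)
                i = n if end == -1 else end + 1
            # Assumption: a removed tag separates words.
            pieces.append(" ")
        else:
            # Plain text up to the next '<'.
            nxt = markup.find("<", i)
            nxt = n if nxt == -1 else nxt
            pieces.append(markup[i:nxt])
            i = nxt
    # Collapse whitespace runs and trim both ends.
    return " ".join("".join(pieces).split())
```

Under these assumptions, `extract_text('<p>Hello,   <b>world</b></p><script>var x = 1;</script>')` yields `Hello, world`. The two counterexamples from the list behave as predicted: the unescaped `</script>` ends the element early and the unescaped `>` ends the tag early, so part of the markup leaks into the output.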
Describe alternatives you've considered
Parse HTML with regular expressions:
replaceRegexpAll(replaceRegexpAll(content, '(?s)<(script|style)[^>]*>.*?</(script|style)>', ''), '<[^>]+>', '')
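For comparison, the same two patterns can be tried outside the database. The Python transcription below is only a sketch for experimenting with the regular expressions: `strip_tags_regex` is a hypothetical name, and Python's `re` module is used here in place of the database's regex engine, which may differ in details.

```python
import re

# The two patterns from the expression above, applied in the same order.
_SCRIPT_STYLE = re.compile(r"(?s)<(script|style)[^>]*>.*?</(script|style)>")
_TAG = re.compile(r"<[^>]+>")


def strip_tags_regex(content: str) -> str:
    """Remove script/style elements first, then every remaining tag."""
    without_scripts = _SCRIPT_STYLE.sub("", content)
    return _TAG.sub("", without_scripts)
```

Trying it on a few inputs shows where this alternative falls short of the proposed rules: it takes two passes over the data, matches `script` and `style` case-sensitively, does not collapse whitespace, and discards CDATA sections entirely because `<![CDATA[...]]>` itself matches the tag pattern.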