Add a function htmlOrXmlCoarseParse to extract content from html or xml format string. #19600

Merged
abyss7 merged 20 commits into ClickHouse:master from zlx19950903:master
Feb 18, 2021

@zlx19950903
Contributor

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

  • Related issue: "A function to extract text from HTML and XML" (#18454).
  • Add the function htmlOrXmlCoarseParse.
  • Support parsing of <script></script>.
  • Support parsing of <style></style>.
  • Support parsing of <![CDATA[]]>.
  • Support whitespace collapsing.
  • Support parsing of any <content> tag format.
  • Use Hyperscan for SIMD-accelerated matching.
  • Everything is done in a single pass.

Usage:

SELECT htmlOrXmlCoarseParse('<script>hello </script><![CDATA[world]]>')
world
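
To make the listed behaviour concrete, here is a minimal sketch of a coarse text extractor written as a plain forward scan. It is not the code from this PR: it does not use Hyperscan, and the names coarseExtractText, skipElement, startsWithAt and findNoCase are hypothetical. It assumes that the content of <script> and <style> elements is dropped, that CDATA content is kept, that all other tags are removed without inserting a separator, and that whitespace is collapsed everywhere, including inside CDATA.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>

constexpr std::size_t npos = std::string_view::npos;

// Case-insensitive check that `prefix` occurs in `s` at position `pos`.
static bool startsWithAt(std::string_view s, std::size_t pos, std::string_view prefix)
{
    if (pos + prefix.size() > s.size())
        return false;
    for (std::size_t i = 0; i < prefix.size(); ++i)
        if (std::tolower(static_cast<unsigned char>(s[pos + i]))
            != std::tolower(static_cast<unsigned char>(prefix[i])))
            return false;
    return true;
}

// Case-insensitive forward search for `needle` starting at `pos`.
static std::size_t findNoCase(std::string_view s, std::string_view needle, std::size_t pos)
{
    for (; pos + needle.size() <= s.size(); ++pos)
        if (startsWithAt(s, pos, needle))
            return pos;
    return npos;
}

// If a `<name ...>` element opens at `pos`, return the index just past its
// closing `</name>` tag (or s.size() if it is unterminated); otherwise npos.
static std::size_t skipElement(std::string_view s, std::size_t pos, std::string_view name)
{
    const std::string open = "<" + std::string(name);
    if (!startsWithAt(s, pos, open))
        return npos;
    const std::size_t after = pos + open.size();
    if (after < s.size() && std::isalnum(static_cast<unsigned char>(s[after])))
        return npos; // a different tag that merely shares the prefix
    const std::size_t end = findNoCase(s, "</" + std::string(name), after);
    if (end == npos)
        return s.size();
    const std::size_t gt = s.find('>', end);
    return gt == npos ? s.size() : gt + 1;
}

// Hypothetical stand-in for the idea behind htmlOrXmlCoarseParse.
std::string coarseExtractText(std::string_view src)
{
    std::string out;
    bool pending_space = false;

    // Append text, collapsing whitespace runs and trimming both ends.
    auto append = [&](std::string_view text)
    {
        for (unsigned char c : text)
        {
            if (std::isspace(c))
            {
                pending_space = true;
                continue;
            }
            if (pending_space && !out.empty())
                out.push_back(' ');
            pending_space = false;
            out.push_back(static_cast<char>(c));
        }
    };

    std::size_t i = 0;
    while (i < src.size())
    {
        if (src[i] != '<')
        {
            // Plain text up to the next tag.
            std::size_t next = src.find('<', i);
            if (next == npos)
                next = src.size();
            append(src.substr(i, next - i));
            i = next;
        }
        else if (src.substr(i, 9) == "<![CDATA[")
        {
            // Keep CDATA content, drop the delimiters.
            const std::size_t start = i + 9;
            const std::size_t end = src.find("]]>", start);
            append(src.substr(start, end == npos ? npos : end - start));
            i = end == npos ? src.size() : end + 3;
        }
        else
        {
            std::size_t past = skipElement(src, i, "script");
            if (past == npos)
                past = skipElement(src, i, "style");
            if (past == npos)
            {
                // Any other tag: skip to its closing '>'.
                const std::size_t gt = src.find('>', i);
                past = gt == npos ? src.size() : gt + 1;
            }
            i = past;
        }
    }
    return out;
}
```

With these assumptions, the input from the usage example above yields world, matching the output shown in the PR description.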

@robot-clickhouse added the doc-alert and pr-feature (Pull request with new product feature) labels on Jan 26, 2021
@abyss7 self-assigned this on Jan 26, 2021
@zlx19950903
Contributor Author

I have a question about the fast test. My source code depends on a third-party library, hyperscan. When I build locally, USE_HYPERSCAN is set to 1 by default, but in the fast test it is set to 0. Why does this happen? This is a picture of the cmake_log.txt output by the fast test:
[image]
And this is the result of running cmake locally:
[image]

@abyss7
Contributor

abyss7 commented Jan 27, 2021

@zlx19950903 because hyperscan is an optional dependency in our code, which means that you have to conditionally exclude your code if hyperscan is not available (i.e. checking the related macro). Your tests and code should compile and run fine in other checks.
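
A minimal sketch of what "checking the related macro" can look like, assuming the build system defines USE_HYPERSCAN as 0 or 1 (the macro name comes from this thread). The function names hyperscanAvailable and coarseParseOrThrow are hypothetical, and the enabled branch is only a placeholder for the real Hyperscan-based code; this is not how the PR itself is structured.

```cpp
#include <stdexcept>
#include <string>

// Normally provided by the build system; default to "not available" here
// so that the sketch compiles on its own.
#ifndef USE_HYPERSCAN
#    define USE_HYPERSCAN 0
#endif

// Lets callers (and tests) ask whether the optional dependency was compiled in.
constexpr bool hyperscanAvailable()
{
    return USE_HYPERSCAN != 0;
}

// Hypothetical entry point: the hyperscan-dependent body is compiled only
// when the macro is set; otherwise the function reports that it is unavailable.
std::string coarseParseOrThrow(const std::string & input)
{
#if USE_HYPERSCAN
    // Placeholder: the real implementation would call into hyperscan here.
    return input;
#else
    (void)input; // unused in this configuration
    throw std::runtime_error("function is unavailable: built without hyperscan");
#endif
}
```

Either way the translation unit compiles, so checks that build without hyperscan keep passing; only the tests that exercise the function need to be skipped there.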

@zlx19950903
Contributor Author

> @zlx19950903 because hyperscan is an optional dependency in our code, which means that you have to conditionally exclude your code if hyperscan is not available (i.e. checking the related macro). Your tests and code should compile and run fine in other checks.

If I exclude hyperscan from my source code, my function does not work. What about the query test? Should I also delete the unit test for my function?


DataTypePtr getReturnTypeImpl(const DataTypes & arguments) const override
{
if(!isString(arguments[0]))
Collaborator

Style error, need a space between if and (.

@abyss7
Contributor

abyss7 commented Jan 29, 2021

> > @zlx19950903 because hyperscan is an optional dependency in our code, which means that you have to conditionally exclude your code if hyperscan is not available (i.e. checking the related macro). Your tests and code should compile and run fine in other checks.
>
> If I exclude hyperscan from my source code, my function does not work. What about the query test? Should I also delete the unit test for my function?

It's expected not to work. There is an array of skipped tests inside docker/test/fasttest/run.sh; add your test there.

@abyss7 changed the title from "Add a function htmlOrCoarseParse to extract content from html or xml format string." to "Add a function htmlOrXmlCoarseParse to extract content from html or xml format string." on Feb 9, 2021
@abyss7 merged commit c92e613 into ClickHouse:master on Feb 18, 2021
traceon added a commit to traceon/ClickHouse that referenced this pull request Feb 19, 2021
* master: (160 commits)
  Make Poco HTTP Server zero-copy again (ClickHouse#19516)
  Fixed documentation
  ccache 4.2+ does not requires any quirks for SOURCE_DATE_EPOCH
  Add a function `htmlOrXmlCoarseParse` to extract content from html or xml format string. (ClickHouse#19600)
  Reinterpret function added Decimal, DateTim64 support
  Add test
  Update InterpreterSelectQuery.cpp
  Improved serialization for data types combined of Arrays and Tuples. Improved matching enum data types to protobuf enum type. Fixed serialization of the Map data type. Omitted values are now set by default.
  Log stdout and stderr when failed to start docker in integration tests.
  Added comment
  Don't backport base commit of branch in the same branch (ClickHouse#20628)
  Fix fasttest retry for failed tests
  Dictionary create source with functions crash fix
  Added error reinterpretation tests
  Update run.sh
  Updated documentation
  fix subquery with limit
  Rename untyped function reinterpretAs into reinterpret
  ignore data store files
  Support vhost
  ...
traceon added a commit to traceon/ClickHouse that referenced this pull request Feb 19, 2021
* master: (153 commits)
  Add gdb to fasttest image
  Make Poco HTTP Server zero-copy again (ClickHouse#19516)
  Use fixed version for aerospike
  Fixed documentation
  ccache 4.2+ does not requires any quirks for SOURCE_DATE_EPOCH
  Add a function `htmlOrXmlCoarseParse` to extract content from html or xml format string. (ClickHouse#19600)
  Reinterpret function added Decimal, DateTim64 support
  test/stress: fix permissions for clickhouse directories
  test/stress: improve backtrace catching on server failures
  test/stress: use clickhouse builtin start/stop to run server from the same user
  Add test
  Update InterpreterSelectQuery.cpp
  Improved serialization for data types combined of Arrays and Tuples. Improved matching enum data types to protobuf enum type. Fixed serialization of the Map data type. Omitted values are now set by default.
  Log stdout and stderr when failed to start docker in integration tests.
  Added comment
  Don't backport base commit of branch in the same branch (ClickHouse#20628)
  Fix fasttest retry for failed tests
  Dictionary create source with functions crash fix
  Added error reinterpretation tests
  Update run.sh
  ...
@alexey-milovidov
Member

I believe this function can run faster.

This is on an 80 vCPU machine:

:) SELECT count() FROM minicrawl WHERE NOT ignore(htmlOrXmlCoarseParse(content))

SELECT count()
FROM minicrawl
WHERE NOT ignore(htmlOrXmlCoarseParse(content))

→ Progress: 1.58 million rows, 143.44 GB (19.86 thousand rows/s., 1.80 GB/s.)
  18.86%  clickhouse              [.] runSheng
  14.56%  clickhouse              [.] roseRunProgram
  14.15%  clickhouse              [.] fdr_exec_teddy_msks4_pck
  11.27%  clickhouse              [.] roseCatchUpNfas
   6.28%  clickhouse              [.] handleSomInternal
   3.87%  clickhouse              [.] roseCatchUpAll
   3.75%  clickhouse              [.] nfaQueueExecToMatch
   3.70%  clickhouse              [.] roseFloatingCallback
   3.32%  clickhouse              [.] std::__1::vector<DB::(anonymous namespace)::HxCoarseParseImpl::SpanInfo, std::__1::allocator<DB::(anonymous namespace)::HxCoarseParseImpl::SpanInfo> >::push_back
   2.99%  clickhouse              [.] ZSTD_decompressSequences_bmi2
   2.77%  clickhouse              [.] DB::(anonymous namespace)::HxCoarseParseImpl::spanCollect
   1.15%  clickhouse              [.] roseNfaBlastAdaptor
   1.00%  clickhouse              [.] memcpy
   0.97%  clickhouse              [.] run_accel
   0.88%  clickhouse              [.] roseRunProgram_l
   0.85%  clickhouse              [.] roseNfaAdaptor
   0.83%  clickhouse              [.] nfaQueueExec
   0.78%  clickhouse              [.] shuftiExec
   0.73%  clickhouse              [.] handleSomExternal
   0.55%  clickhouse              [.] DB::deserializeBinarySSE2<4>
   0.51%  clickhouse              [.] DB::(anonymous namespace)::FunctionHtmlOrXmlCoarseParse::executeImpl
   0.45%  clickhouse              [.] nfaExecSheng_Q2

For max_threads = 1, the speed is just about 50 MB/sec. Looks obnoxiously slow.
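
To put the figures above in the same units, here is a small arithmetic sketch. It assumes the progress line uses decimal units (1 GB = 1000 MB = 10^9 bytes), and the helper names perVcpuMBps and avgRowKB are hypothetical.

```cpp
// Aggregate throughput divided evenly across vCPUs, in MB/s.
constexpr double perVcpuMBps(double total_gb_per_s, int vcpus)
{
    return total_gb_per_s * 1000.0 / vcpus;
}

// Average row size in KB, from total gigabytes read and millions of rows.
constexpr double avgRowKB(double total_gb, double million_rows)
{
    const double bytes = total_gb * 1e9;
    const double rows = million_rows * 1e6;
    return bytes / rows / 1e3;
}
```

With the numbers from the progress line, perVcpuMBps(1.80, 80) is 22.5 MB/s per vCPU and avgRowKB(143.44, 1.58) is roughly 90.8 KB per row, which can be set against the roughly 50 MB/sec measured with max_threads = 1.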

6 participants