
Trim all xml_text results from a nodeset at the same time #386

@WerthPADOH

xml_text() can take a long time on a large nodeset when trim is TRUE, and most of that time is spent calling sub() twice for each node. In one case, getting the text of a nodeset with about 4.5 million nodes, Rprof() showed that calls to sub() made up 72% of the nearly two minutes spent in xml_text().
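For context, here is the pattern that causes the slowdown as I understand it (a paraphrase for illustration, not the actual xml2 source; xml_text_sub_inside is a made-up name):

# Paraphrase for illustration, NOT the actual xml2 source: trimming
# happens inside the per-node loop, so sub() runs 2 * length(x) times
xml_text_sub_inside <- function(x, trim = FALSE) {
  vapply(x, function(node) {
    res <- xml_text(node)
    if (isTRUE(trim)) {
      res <- sub("^[[:space:]]+", "", res)
      res <- sub("[[:space:]]+$", "", res)
    }
    res
  }, FUN.VALUE = character(1))
}

Since sub() is vectorized, hoisting the two calls out of the loop lets each regex run once over the full character vector instead of once per node.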

I'll happily make a pull request if you'd like.

Here's a brief demo showing time saved:

library(xml2)
library(microbenchmark)

blob <- paste0(c("<x>", rep("<y> Hi </y>", 1000), "</x>"), collapse = "")
tree <- read_xml(blob)
y_tags <- xml_find_all(tree, "//y")

xml_text_sub_after <- function(x, trim = FALSE) {
  # Extract the text of every node first, untrimmed
  res <- vapply(x, xml_text, trim = FALSE, FUN.VALUE = character(1))
  # Then trim the whole vector at once: two sub() calls in total
  # instead of two per node
  if (isTRUE(trim)) {
    res <- sub("^[[:space:]]+", "", res)
    res <- sub("[[:space:]]+$", "", res)
  }
  res
}

microbenchmark(
  as_is = xml_text(y_tags, trim = TRUE),
  proposed = xml_text_sub_after(y_tags, trim = TRUE),
  check = "identical"
)
# Unit: milliseconds
#      expr     min      lq      mean   median       uq     max neval
#     as_is 35.2209 35.9433 37.787917 36.64865 38.15000 58.4016   100
#  proposed  6.7719  6.9522  7.625734  7.16715  7.40195 16.2116   100
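As a side note, if matching the exact [[:space:]] class isn't essential, base R's trimws() expresses the same vectorized trim in one call; its default whitespace class "[ \t\r\n]" is slightly narrower than [[:space:]]. (The name xml_text_trimws is just for illustration.)

# Same idea using base R's trimws(); note its default whitespace
# class "[ \t\r\n]" is slightly narrower than [[:space:]]
xml_text_trimws <- function(x, trim = FALSE) {
  res <- vapply(x, xml_text, FUN.VALUE = character(1))
  if (isTRUE(trim)) {
    res <- trimws(res, which = "both")
  }
  res
}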
