
Trim all xml_text results from a nodeset at the same time #386

@WerthPADOH

xml_text() can take a long time on a large nodeset when trim is TRUE, and most of that time is spent calling sub() twice for each node. In one case, getting the text of a nodeset with about 4.5 million nodes, Rprof() showed that calls to sub() made up 72% of the nearly two minutes spent in xml_text().
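For context, here is the pattern that causes the slowdown as I understand it (a paraphrase for illustration, not the actual xml2 source; xml_text_sub_inside is a made-up name):

# Paraphrase for illustration, NOT the actual xml2 source: trimming
# happens inside the per-node loop, so sub() runs 2 * length(x) times
xml_text_sub_inside <- function(x, trim = FALSE) {
  vapply(x, function(node) {
    res <- xml_text(node)
    if (isTRUE(trim)) {
      res <- sub("^[[:space:]]+", "", res)
      res <- sub("[[:space:]]+$", "", res)
    }
    res
  }, FUN.VALUE = character(1))
}

Since sub() is vectorized, hoisting the two calls out of the loop lets each regex run once over the full character vector instead of once per node.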

I'll happily make a pull request if you'd like.

Here's a brief demo showing time saved:

library(xml2)
library(microbenchmark)

blob <- paste0(c("<x>", rep("<y> Hi </y>", 1000), "</x>"), collapse = "")
tree <- read_xml(blob)
y_tags <- xml_find_all(tree, "//y")

xml_text_sub_after <- function(x, trim = FALSE) {
  # Extract the text of every node first, untrimmed
  res <- vapply(x, xml_text, trim = FALSE, FUN.VALUE = character(1))
  # Then trim the whole vector at once: two sub() calls in total
  # instead of two per node
  if (isTRUE(trim)) {
    res <- sub("^[[:space:]]+", "", res)
    res <- sub("[[:space:]]+$", "", res)
  }
  res
}

microbenchmark(
  as_is = xml_text(y_tags, trim = TRUE),
  proposed = xml_text_sub_after(y_tags, trim = TRUE),
  check = "identical"
)
# Unit: milliseconds
#      expr     min      lq      mean   median       uq     max neval
#     as_is 35.2209 35.9433 37.787917 36.64865 38.15000 58.4016   100
#  proposed  6.7719  6.9522  7.625734  7.16715  7.40195 16.2116   100
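As a side note, if matching the exact [[:space:]] class isn't essential, base R's trimws() expresses the same vectorized trim in one call; its default whitespace class "[ \t\r\n]" is slightly narrower than [[:space:]]. (The name xml_text_trimws is just for illustration.)

# Same idea using base R's trimws(); note its default whitespace
# class "[ \t\r\n]" is slightly narrower than [[:space:]]
xml_text_trimws <- function(x, trim = FALSE) {
  res <- vapply(x, xml_text, FUN.VALUE = character(1))
  if (isTRUE(trim)) {
    res <- trimws(res, which = "both")
  }
  res
}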
