Allow parallelization of getSheetNames #262

@sgrote

Is your feature request related to a problem? Please describe.
It does not seem to be possible to run openxlsx::getSheetNames in parallel. Here's my code:

# Wrapper around openxlsx::getSheetNames() for a vector of file paths.
# Returns a list named by file path; missing files yield NULL.
getSheetNames <- function(excel_files, parallel = FALSE, BPPARAM = BiocParallel::bpparam()) {
  if (parallel) {
    # One task per file, distributed over the BiocParallel backend
    sheet_list <- BiocParallel::bplapply(excel_files,
      FUN = function(excel_file) {
        if (!file.exists(excel_file)) {
          return(NULL)
        }
        openxlsx::getSheetNames(excel_file)
      },
      BPPARAM = BPPARAM
    )
  } else {
    # Sequential fallback; additionally warns about missing files
    sheet_list <- lapply(excel_files,
      FUN = function(excel_file) {
        if (!file.exists(excel_file)) {
          warning("File '", excel_file, "' could not be found.")
          return(NULL)
        }
        openxlsx::getSheetNames(excel_file)
      }
    )
  }
  names(sheet_list) <- excel_files
  sheet_list
}

Now, getSheetNames(excel_files, parallel = TRUE) sometimes throws an error:

Error: BiocParallel errors
  element index: 1
  first error: cannot open file '/tmp/Rtmpxttndw/_excelXMLRead/[Content_Types].xml': No such file or directory 

or

Error: BiocParallel errors
  element index: 1
  first error: cannot open the connection 

Describe the solution you'd like
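My guess is that all parallel tasks write to the same temporary location (the first error points at '/tmp/Rtmpxttndw/_excelXMLRead/'), so one task can remove or overwrite files that another task is still reading.
Extracting into a directory or file whose name is generated by tempfile() might solve this.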
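
To illustrate the idea, here is a minimal sketch of a per-call scratch directory in base R. The helper name with_private_xml_dir and the prefix "_excelXMLRead_" are hypothetical and chosen only for illustration; this is not how openxlsx is implemented, and I have not verified that it removes the collision for every BiocParallel backend.

```r
# Hypothetical helper (not part of openxlsx): run `fun` with a scratch
# directory that is created for this call only and removed afterwards.
with_private_xml_dir <- function(fun, prefix = "_excelXMLRead_") {
  # tempfile() only proposes a path that does not exist yet; we create it
  xml_dir <- tempfile(pattern = prefix)
  dir.create(xml_dir)
  # Clean up even if `fun` throws an error
  on.exit(unlink(xml_dir, recursive = TRUE), add = TRUE)
  fun(xml_dir)
}

# Example: each call writes and reads "[Content_Types].xml" in its own directory
read_back <- function(content) {
  with_private_xml_dir(function(xml_dir) {
    xml_file <- file.path(xml_dir, "[Content_Types].xml")
    writeLines(content, xml_file)
    readLines(xml_file)
  })
}

results <- lapply(c("<Types a='1'/>", "<Types a='2'/>"), read_back)
```

With something like this, the extracted XML of one task would no longer live in a fixed '_excelXMLRead' folder that another task could delete or overwrite.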

Describe alternatives you've considered
Parallelization does work with readxl::excel_sheets, but that is much slower than openxlsx::getSheetNames.
Another alternative is to not run it in parallel, but this becomes increasingly time-consuming as the number of Excel files grows.
