Skip to content

Conversation

@ColeMiller1
Copy link
Contributor

@ColeMiller1 ColeMiller1 commented Aug 1, 2020

Closes #4646
Closes #4658
Closes #4424

library(data.table)

set.seed(123L)
n = 500L
n_nested = 40L
dt = data.table(ordered_id = seq_len(n),
                unordered_id = sample(n),
                value = replicate(n, data.table(val1 = sample(n_nested)), simplify = FALSE))

bench::mark(
    dt[, value[[1L]], by = ordered_id]
    , dt[, value[[1L]], by = unordered_id]
    , check = FALSE
    , time_unit = "ms" 
)

## pre-regression (1.12.8)

##  expression                             min median `itr/sec` mem_alloc
##  <bch:expr>                           <dbl>  <dbl>     <dbl> <bch:byt>
##1 dt[, value[[1L]], by = ordered_id]   0.755  0.838     1131.     401KB
##2 dt[, value[[1L]], by = unordered_id] 0.796  0.892     1033.     409KB

## Post-regression (CRAN 1.13.0)

#> # A tibble: 2 x 6
#>   expression                               min  median `itr/sec` mem_alloc
#>   <bch:expr>                             <dbl>   <dbl>     <dbl> <bch:byt>
#> 1 dt[, value[[1L]], by = ordered_id]   102.    126.         8.02    53.7MB
#> 2 dt[, value[[1L]], by = unordered_id]   0.630   0.669   1261.     409.2KB
#> # ... with 1 more variable: `gc/sec` <dbl>

## after patch

##  expression                             min median `itr/sec` mem_alloc `gc/sec`
##  <bch:expr>                           <dbl>  <dbl>     <dbl> <bch:byt>    <dbl>
##1 dt[, value[[1L]], by = ordered_id]   0.634  0.691     1358.     401KB     15.0
##2 dt[, value[[1L]], by = unordered_id] 0.651  0.778     1187.     409KB     12.8

The root cause appears to be in memrecycle which is in assign.c. I can get similar performance by commenting out the duplicate line but then there are many errors during the testing. I'm including this in case someone is able to move forward with fixing the root cause.

data.table/src/assign.c

Lines 707 to 723 in 9d3b920

if (isNewList(source)) {
// A list() column; i.e. target is a column of pointers to SEXPs rather than the more common case of numbers in an atomic vector.
// If any item within the list is NAMED then take a fresh copy. So far this has occurred from dogroups.c when
// j returns .BY or similar specials as-is within a list(). Those specials are static inside
// dogroups so if we don't copy now the last value written to them by dogroups becomes repeated in the result;
// i.e. the wrong result.
// If source is itself recycled later (many list() column items pointing to the same object) we are ok with that
// since we now have a fresh copy and := will not assign with a list() column's cell value; := only changes the
// SEXP pointed to.
// If source is already not named (because j already created a fresh unnamed vector within a list()) we don't want to
// duplicate unnecessarily, hence checking for named rather than duplicating always.
// See #481, #1270 and tests 1341.* fail without this copy.
// ********** This might go away now that we copy properly in dogroups.c **********
if (anyNamed(source)) {
source = PROTECT(copyAsPlain(source)); protecti++;
}
}

@ColeMiller1

This comment has been minimized.

@mattdowle mattdowle added this to the 1.13.1 milestone Aug 4, 2020
@codecov
Copy link

codecov bot commented Aug 4, 2020

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.44%. Comparing base (be6c1fc) to head (ec041f7).
Report is 1554 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4655      +/-   ##
==========================================
- Coverage   99.61%   99.44%   -0.18%     
==========================================
  Files          73       73              
  Lines       14250    14536     +286     
==========================================
+ Hits        14195    14455     +260     
- Misses         55       81      +26     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@mattdowle
Copy link
Member

mattdowle commented Sep 24, 2020

I followed Cole's first inclination; i.e. removed the copy from memrecycle as he did which seems right to me too.
That fixes the slowdown problem in subsetting but leaves us with a different problem to solve in a different way. All the test fails were to do with static vectors not being copied which is why that copy was there in memrecycle. So I moved that copy into dogroups.c to be close to where it's relevant. dogroups now uses negative truelength to mark the specials (.SD, .BY, .I). Just before assigning jval to a list column, it checks to see if any specials are present and copies them if so. Although it's still relatively complex, it's cleaner in the sense that the complexity is now focused and localized to dogroups.c. This should avoid unexpected effects like the slowdown occurring due to the copy in the central place (memrecycle) which is now gone.
Extensive comments and example added to new anySpecialStatic() at the top of dogroups.c which replaces the anyNamed() in assign.c.
Other tidy ups:
Removed origSDnrow as it was equal to maxGrpSize, adding a check that that's true.
origIlen could also be removed because I was allocated with maxGrpSize anyway.

@jangorecki
Copy link
Member

this will also resolve #4424

@jangorecki jangorecki linked an issue Sep 24, 2020 that may be closed by this pull request
@mattdowle
Copy link
Member

mattdowle commented Sep 24, 2020

You mean .NGRP? That's constant across groups so doesn't need to be copied. But tested anyway (2153.3) and referred to in comment in anySpecialStatic.

@MichaelChirico MichaelChirico added the atime Requests related to adding/improving/monitoring performance regression tests via atime. label Jul 8, 2025
@jangorecki jangorecki deleted the extract_performance branch September 27, 2025 09:18
@MichaelChirico MichaelChirico restored the extract_performance branch October 20, 2025 16:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

atime Requests related to adding/improving/monitoring performance regression tests via atime.

Projects

None yet

4 participants