Skip to content

Conversation

@OfekShilon
Copy link
Contributor

Per @MichaelChirico 's request, this improvement is separated to its own PR.

@codecov
Copy link

codecov bot commented Oct 28, 2022

Codecov Report

Merging #5501 (bb08e14) into master (19b7866) will decrease coverage by 1.18%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #5501      +/-   ##
==========================================
- Coverage   99.51%   98.32%   -1.19%     
==========================================
  Files          80       80              
  Lines       14774    14801      +27     
==========================================
- Hits        14702    14553     -149     
- Misses         72      248     +176     
Impacted Files Coverage Δ
R/utils.R 100.00% <100.00%> (ø)
R/xts.R 0.00% <0.00%> (-100.00%) ⬇️
R/last.R 47.91% <0.00%> (-52.09%) ⬇️
R/fread.R 69.32% <0.00%> (-30.68%) ⬇️
R/fwrite.R 79.48% <0.00%> (-20.52%) ⬇️
R/test.data.table.R 94.56% <0.00%> (-5.44%) ⬇️
src/fwrite.c 92.37% <0.00%> (-4.75%) ⬇️
src/between.c 96.24% <0.00%> (-3.76%) ⬇️
R/tables.R 97.77% <0.00%> (-2.23%) ⬇️
src/fread.c 99.41% <0.00%> (-0.01%) ⬇️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@MichaelChirico
Copy link
Member

could you add a test where this matters? probably worth a NEWS item as well, this could potentially break some folks

@OfekShilon
Copy link
Contributor Author

OfekShilon commented Oct 28, 2022

could you add a test where this matters? probably worth a NEWS item as well, this could potentially break some folks

I'm not aware of any case where this change makes an observable difference in behavior. The best I can suggest is a test documenting the performance improvement in an awkward call of the sort transform used to make:

t1 <- system.time(dt1 <- data.table(a=runif(1e5), b=runif(1e5)))
t2 <- system.time(dt2 <- do.call(data.table, list(dt1)))

test(2239.1, all.equal(dt1,dt2), TRUE)
test(2239.2, t2["elapsed"]/t1["elapsed"]<0.5, TRUE)   # On my system this is >30 before the fix, <0.2 after

Do you think this adds value?

@MichaelChirico
Copy link
Member

MichaelChirico commented Oct 28, 2022

It matters here:

asodijfoaisjdfoiajsdofijasodifjoaisdjfoiajsdoifjaosidjfoiajsdofijaosidjfoiasjdofijasdoifjaosidjfoiasjdfiuajsfuasiudfhoiauhjzxclkvjbhlszkjxhcliuasdhfgoiuahsdoufhlijxchlkzjhvkjhaljshdfiluhcubvjdfbvjkahsdfliuhawpioedhflaijsdh = function() 1L
data.table(c(
 
 asodijfoaisjdfoiajsdofijasodifjoaisdjfoiajsdoifjaosidjfoiajsdofijaosidjfoiasjdofijasdoifjaosidjfoiasjdfiuajsfuasiudfhoiauhjzxclkvjbhlszkjxhcliuasdhfgoiuahsdoufhlijxchlkzjhvkjhaljshdfiluhcubvjdfbvjkahsdfliuhawpioedhflaijsdh(),
 
 asodijfoaisjdfoiajsdofijasodifjoaisdjfoiajsdoifjaosidjfoiajsdofijaosidjfoiasjdofijasdoifjaosidjfoiasjdfiuajsfuasiudfhoiauhjzxclkvjbhlszkjxhcliuasdhfgoiuahsdoufhlijxchlkzjhvkjhaljshdfiluhcubvjdfbvjkahsdfliuhawpioedhflaijsdh(),
 
 asodijfoaisjdfoiajsdofijasodifjoaisdjfoiajsdoifjaosidjfoiajsdofijaosidjfoiasjdofijasdoifjaosidjfoiasjdfiuajsfuasiudfhoiauhjzxclkvjbhlszkjxhcliuasdhfgoiuahsdoufhlijxchlkzjhvkjhaljshdfiluhcubvjdfbvjkahsdfliuhawpioedhflaijsdh(),
 
 asodijfoaisjdfoiajsdofijasodifjoaisdjfoiajsdoifjaosidjfoiajsdofijaosidjfoiasjdofijasdoifjaosidjfoiasjdfiuajsfuasiudfhoiauhjzxclkvjbhlszkjxhcliuasdhfgoiuahsdoufhlijxchlkzjhvkjhaljshdfiluhcubvjdfbvjkahsdfliuhawpioedhflaijsdh()
))

There should be a more minimal version of this but hopefully illustrates the point

@MichaelChirico
Copy link
Member

Do you think this adds value?

We shy away from timing tests in our main suite -- you never know when some system interference / sun spots can cause things to go haywire and now we have to deal with the threat of CRAN removal...

OTOH, there is this non-CRAN suite, couldn't hurt to throw a test like that in there:

https://github.com/Rdatatable/data.table/blob/master/inst/tests/benchmark.Rraw

I'd suggest a threshold of maybe 2 or 5 to be conservative, too, .2 vs .5 seems to tight and likely not to replicate across platforms / with high statistical certainty

@OfekShilon
Copy link
Contributor Author

There should be a more minimal version of this but hopefully illustrates the point

Nice! deparse has a default width.cutoff = 60L, so these can be trimmed to 61-char names.
I'll do that - and add a conservative timing test to benchmark.Rraw - tomorrow. Thanks.

@OfekShilon
Copy link
Contributor Author

@MichaelChirico I checked now and think your example also doesn't have an observable impact on behavior. With and without nlines=1, the generated name of the column is "V1". Notice the line after the deparse:

      tmp = if (syms[i]) as.character(dot_sub[[i]]) else deparse(dot_sub[[i]])[1L]
      if (tmp == make.names(tmp)) vnames[i]=tmp

make.names converts newlines, brackets and others to dots, and so does change pretty much anything that slips into this code path - except proper variable names - and causes the deparse to go unused. With and without nlines=1, including your example and anything I could come up with.
This is kind of good news - I probably won't be able to add a test, but the chances of breaking anyone are virtually zero.

I am adding the news item and benchmark test.

@mattdowle mattdowle added this to the 1.14.5 milestone Nov 11, 2022
Copy link
Member

@mattdowle mattdowle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great find. Just on the news item, I try to imagine myself as user reading news, and this is the thought I would have.

53. `as.data.frame(DT, row.names=)` no longer silently ignores `row.names`, [#5319](https://github.com/Rdatatable/data.table/issues/5319). Thanks to @dereckdemezquita for the fix and PR, and @ben-schwen for guidance.

54. data.table construction uses `deparse` to try and automatically name un-named cols. The attempt is now limited to 1- line which gives substantial speedup in [some edge cases](https://github.com/Rdatatable/data.table/pull/5501). Thanks @OfekShilon for the report and fix, @michaelchirico for guidance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

data.table construction

Does that mean a data.table(...) call with arguments passed to it? If so, could "data.table construction" be replaced with data.table(...) ?

The attempt is now limited to 1- line

Does that mean calls to data.table(...) split over several lines (which I guess is very common) will lose this feature in the 2nd line onwards? i.e.

data.table( A = 1:5,
            mean(my_vec)      # column name will be different now?
)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refined the phrasing a bit (see the commit) - it is now both more accurate and less understandable.. I'd be thankful for any suggestion, and am totally fine with omitting the NEWS item altogether. AFAICT this change has no user-facing consequences, beyond some speedup.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Ok I played with it a bit and tweaked the news item. It's not just 'some' speedup, but it scales with the data size. It's a tremendous finding certainly deserving of a news item and example!

@mattdowle mattdowle added the High label Nov 11, 2022
OfekShilon and others added 6 commits November 12, 2022 11:08
…sible, and played with it preparing for the news item
…n imagine what might be similar, provide example with the timings and make the example more plausible by changing the input to be a data.frame
…w; this benchmark test file isn't in use yet and using 2239 here might make us think in future that these tests were moved from tests.Rraw which is happening currently but doesn't apply to these new ones
@mattdowle mattdowle merged commit fe8623c into Rdatatable:master Nov 13, 2022
@mattdowle mattdowle mentioned this pull request Nov 13, 2022
@OfekShilon OfekShilon deleted the LimitDeparse branch November 13, 2022 09:36
@jangorecki jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants