Skip to content

Conversation

@jimhester
Copy link
Contributor

Fixes #3292

@jimhester
Copy link
Contributor Author

Before this PR you can see the memory leak with the increase in Vcells

dummy <- rep("1\t2\t3\t4\t5", 10000000)
writeLines(dummy, "out.tsv")

invisible(data.table::fread("out.tsv"))
for (i in 1:10) {
  data.table::fread("out.tsv")
  print(gc())
}
#>            used  (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
#> Ncells   539635  28.9     918942  49.1         NA   850004  45.4
#> Vcells 16041623 122.4   41582216 317.3      32768 41041748 313.2
#>            used  (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
#> Ncells   539797  28.9     918942  49.1         NA   850004  45.4
#> Vcells 18540198 141.5   54339092 414.6      32768 43540348 332.2
#>            used  (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
#> Ncells   539799  28.9     918942  49.1         NA   850004  45.4
#> Vcells 21040504 160.6   54339092 414.6      32768 46043083 351.3
#>            used  (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
#> Ncells   539801  28.9     918942  49.1         NA   850004  45.4
#> Vcells 23540814 179.7   54339092 414.6      32768 48543401 370.4
#>            used  (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
#> Ncells   539803  28.9     918942  49.1         NA   850004  45.4
#> Vcells 26041116 198.7   54339092 414.6      32768 51043727 389.5
#>            used  (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
#> Ncells   539805  28.9     918942  49.1         NA   850004  45.4
#> Vcells 28541418 217.8   54339092 414.6      32768 53544029 408.6
#>            used  (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
#> Ncells   539807  28.9     918942  49.1         NA   850004  45.4
#> Vcells 31041722 236.9   69340920 529.1      32768 56041872 427.6
#>            used  (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
#> Ncells   539809  28.9     918942  49.1         NA   850004  45.4
#> Vcells 33542027 256.0   69340920 529.1      32768 58544647 446.7
#>            used  (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
#> Ncells   539811  28.9     918942  49.1         NA   850004  45.4
#> Vcells 36042332 275.0   69340920 529.1      32768 61044961 465.8
#>            used  (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
#> Ncells   539813  28.9     918942  49.1         NA   850004  45.4
#> Vcells 38542637 294.1   69340920 529.1      32768 63545275 484.9
print(gc())
#>            used  (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
#> Ncells   539779  28.9     918942  49.1         NA   850004  45.4
#> Vcells 38540380 294.1   69340920 529.1      32768 63545275 484.9

Created on 2020-09-18 by the reprex package (v0.3.0)

Afterwards the memory leak is gone and the Vcells are basically constant.

dummy <- rep("1\t2\t3\t4\t5", 10000000)
writeLines(dummy, "out.tsv")

invisible(data.table::fread("out.tsv"))
for (i in 1:10) {
  data.table::fread("out.tsv")
  print(gc())
}
#>            used (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
#> Ncells   539635 28.9     918942  49.1         NA   850004  45.4
#> Vcells 11041081 84.3   52078987 397.4      32768 63993871 488.3
#>            used (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
#> Ncells   539797 28.9     918942  49.1         NA   850004  45.4
#> Vcells 11039362 84.3   41663190 317.9      32768 63993871 488.3
#>            used (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
#> Ncells   539799 28.9     918942  49.1         NA   850004  45.4
#> Vcells 11039382 84.3   52228899 398.5      32768 63993871 488.3
#>            used (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
#> Ncells   539801 28.9     918942  49.1         NA   850004  45.4
#> Vcells 11039406 84.3   41783120 318.8      32768 63993871 488.3
#>            used (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
#> Ncells   539803 28.9     918942  49.1         NA   850004  45.4
#> Vcells 11039422 84.3   52228944 398.5      32768 63993871 488.3
#>            used (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
#> Ncells   539805 28.9     918942  49.1         NA   850004  45.4
#> Vcells 11039438 84.3   41783156 318.8      32768 63993871 488.3
#>            used (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
#> Ncells   539807 28.9     918942  49.1         NA   850004  45.4
#> Vcells 11039456 84.3   52228984 398.5      32768 63993871 488.3
#>            used (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
#> Ncells   539809 28.9     918942  49.1         NA   850004  45.4
#> Vcells 11039475 84.3   41783188 318.8      32768 63993871 488.3
#>            used (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
#> Ncells   539811 28.9     918942  49.1         NA   850004  45.4
#> Vcells 11039494 84.3   52229028 398.5      32768 63993871 488.3
#>            used (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
#> Ncells   539813 28.9     918942  49.1         NA   850004  45.4
#> Vcells 11039513 84.3   50203867 383.1      32768 63993871 488.3
print(gc())
#>            used (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
#> Ncells   539779 28.9     918942  49.1         NA   850004  45.4
#> Vcells 11037405 84.3   40163094 306.5      32768 63993871 488.3

Created on 2020-09-18 by the reprex package (v0.3.0)

This change does potentially perform a reallocation, whereas the previous code does not, but I am not sure there is any way around it.

@jimhester
Copy link
Contributor Author

An alternative implementation that also fixes the memory leak and avoids the reallocation would be to set the growable bit and the true length to be the allocated length. e.g. this diff from the current PR.

diff --git a/src/freadR.c b/src/freadR.c
index 8293586b..de260369 100644
--- a/src/freadR.c
+++ b/src/freadR.c
@@ -527,7 +527,9 @@ void setFinalNrow(size_t nrow) {
     if (nrow == dtnrows)
       return;
     for (int i=0; i<LENGTH(DT); i++) {
-      SET_VECTOR_ELT(DT, i, Rf_lengthgets(VECTOR_ELT(DT, i), nrow));
+      SETLENGTH(VECTOR_ELT(DT,i), nrow);
+      SET_TRUELENGTH(VECTOR_ELT(DT,i), dtnrow);
+      SET_GROWABLE_BIT(VECTOR_ELT(DT,i));
     }
   }
   R_FlushConsole(); // # 2481. Just a convenient place; nothing per se to do with setFinalNrow()

SET_GROWABLE_BIT is only available in R 3.4+, so this would need to be conditionally disabled for earlier R versions.

@mattdowle mattdowle added this to the 1.14.5 milestone Nov 15, 2022
@codecov
Copy link

codecov bot commented Nov 15, 2022

Codecov Report

Merging #4710 (44f5bbb) into master (dd2134e) will decrease coverage by 0.01%.
The diff coverage is 71.42%.

❗ Current head 44f5bbb differs from pull request most recent head 84afbaf. Consider uploading reports for the commit 84afbaf to get more accurate results

@@            Coverage Diff             @@
##           master    #4710      +/-   ##
==========================================
- Coverage   98.31%   98.30%   -0.02%     
==========================================
  Files          80       80              
  Lines       14791    14795       +4     
==========================================
+ Hits        14542    14544       +2     
- Misses        249      251       +2     
Impacted Files Coverage Δ
src/init.c 97.95% <0.00%> (-2.05%) ⬇️
src/freadR.c 100.00% <100.00%> (ø)

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@mattdowle mattdowle merged commit 5affced into Rdatatable:master Nov 15, 2022
mattdowle added a commit that referenced this pull request Nov 15, 2022
…ble.h thanks Jan, plotting in dev memtest need not require imports of standard functions (strange)
@mattdowle mattdowle modified the milestones: 1.14.7, 1.14.6 Nov 16, 2022
mattdowle pushed a commit that referenced this pull request Nov 16, 2022
@aitap
Copy link
Member

aitap commented Nov 15, 2024

Dear @jimhester,

I am currently researching data.table's use of non-API entry points and would like to figure out the role of SET_GROWABLE_BIT. Do you possibly remember how did the memory leak work and why does setting the growable bit on the over-allocated vectors inside the data.table prevent it?

The only use for the growable bit inside the garbage collector is in getVecSizeInVEC where the length of a growable vector gets reset to its TRUELENGTH, and with PROTECTCHECK disabled (which is normally is) the only use for getVecSizeInVEC is in ReleaseLargeFreeVectors where the vector size is used for size reporting, but the vector is still freed as a whole. (These parts of code are unchanged since the growable bit was introduced in 3.4.)

You've reproduced the leak with a purely numeric vector; where does R forget to release memory for a REALSXP with an artificially shortened XLENGTH?

With

#include <R.h>
#include <Rinternals.h>
SEXP foo(void) {
	R_xlen_t len = 1024*1024*42;
	SEXP ret = PROTECT(allocVector(REALSXP, len));
	double * pret = REAL(ret);
	for (R_xlen_t i = 0; i < xlength(ret); ++i) pret[i] = i;
	SETLENGTH(ret, 100);
	SET_TRUELENGTH(ret, len);
	UNPROTECT(1);
	return ret;
}

and dyn.load('foo.so'); for (x in 1:1000) { .Call('foo'); print(gc()) } I definitely see the effect on the values reported by gc(), so absolutely not disputing the need for the SET_GROWABLE_BIT here, but they easily reach more virtual memory than my computer could ever muster (336004.6 Mb used) and the actual memory usage doesn't seem to grow (despite the arrays always being initialised and therefore paged in), neither in R-devel nor in R-3.3.0. How do I make it actually leak memory?

@jimhester
Copy link
Contributor Author

I believe part that avoids the memory leak is actually setting the true length to dtnrows rather than nrows. Setting the growable bit is an optimization to let R use the extra memory.

@aitap
Copy link
Member

aitap commented Nov 19, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fread: memory not being freed

5 participants