fix variable with melt(measure.vars=list), na.rm=T/F consistency (#4723)

tdhock · web-flow · commit ebc5bc37ff4f · 2021-05-07T15:44:57.000-06:00
diff --git a/NEWS.md b/NEWS.md
@@ -80,6 +80,8 @@
 
 12. `as.data.table(table(NULL))` now returns `data.table(NULL)` rather than error `attempt to set an attribute on NULL`, [#4179](https://github.com/Rdatatable/data.table/issues/4179). The result differs slightly to `as.data.frame(table(NULL))` (0-row, 1-column) because 0-column works better with other `data.table` functions like `rbindlist()`. Thanks to Michael Chirico for the report and fix.
 
+13. `melt` with a list for `measure.vars` would output `variable` inconsistently between `na.rm=TRUE` and `FALSE`, [#4455](https://github.com/Rdatatable/data.table/issues/4455). Thanks to @tdhock for reporting and fixing.
+
 ## NOTES
 
 1. New feature 29 in v1.12.4 (Oct 2019) introduced zero-copy coercion. Our thinking is that requiring you to get the type right in the case of `0` (type double) vs `0L` (type integer) is too inconvenient for you the user. So such coercions happen in `data.table` automatically without warning. Thanks to zero-copy coercion there is no speed penalty, even when calling `set()` many times in a loop, so there's no speed penalty to warn you about either. However, we believe that assigning a character value such as `"2"` into an integer column is more likely to be a user mistake that you would like to be warned about. The type difference (character vs integer) may be the only clue that you have selected the wrong column, or typed the wrong variable to be assigned to that column. For this reason we view character to numeric-like coercion differently and will warn about it. If it is correct, then the warning is intended to nudge you to wrap the RHS with `as.<type>()` so that it is clear to readers of your code that a coercion from character to that type is intended. For example :
diff --git a/inst/tests/tests.Rraw b/inst/tests/tests.Rraw
@@ -3014,6 +3014,9 @@ test(1034, as.data.table(x<-as.character(sample(letters, 5))), data.table(V1=x))
        error="Unknown 'id.vars' type raw")
   test(1035.012, melt(DT, id.vars=1:3, measure.vars=as.raw(0)),
        error="Unknown 'measure.vars' type raw")
+  test(1035.013, melt(data.table(a=1, b=1), id.vars=c(1,1)), data.table(a=1, a.1=1, variable=factor("b"), value=1))
+  test(1035.014, melt(data.table(a1=1, b1=1, b2=2), na.rm=TRUE, measure.vars=list(a="a1", b=c("b1","b2"))), data.table(variable=factor(1,c("1","2")), a=1, b=1))
+  test(1035.015, melt(data.table(a=1+2i, b=1), id.vars="a"), error="Unknown column type 'complex' for column 'a' in 'data'")
 
   ans1 = cbind(DT[, c(1,2,8), with=FALSE], variable=factor("l_1"))
   ans1[, value := DT$l_1]
@@ -3175,9 +3178,9 @@ Sep,33.5,19.4,15.7,11.9,0,100.8,100.8,0,12.7,12.7,0,174.1")
   x[, c("y1","z1"):=NA]
   test(1037.405, dim(melt(x, measure.vars=patterns("^y", "^z"))), INT(4,5))
   test(1037.406, dim(ans<-melt(x, measure.vars=patterns("^y", "^z"), na.rm=TRUE)), INT(2,5))
-  test(1037.407, ans$variable, factor(c("1","1")))
+  test(1037.407, ans$variable, factor(c("2","2"), c("1", "2")))
   test(1037.408, dim(ans<-melt(x, measure.vars=patterns("^y", "^z"), na.rm=TRUE, variable.factor=FALSE)), INT(2,5))
-  test(1037.409, ans$variable, c("1","1"))
+  test(1037.409, ans$variable, c("2","2"))
 
   test(1037.410, melt(data.table(NULL), verbose=TRUE), data.table(NULL),
        output="ncol(data) is 0. Nothing to melt")
diff --git a/man/melt.data.table.Rd b/man/melt.data.table.Rd
@@ -31,7 +31,7 @@ non-measure columns will be assigned to it. If integer, must be positive; see De
   }
 
     For convenience/clarity in the case of multiple \code{melt}ed columns, resulting column names can be supplied as names to the elements \code{measure.vars} (in the \code{list} and \code{patterns} usages). See also \code{Examples}. }
-\item{variable.name}{name for the measured variable names column. The default name is \code{'variable'}.}
+\item{variable.name}{name (default \code{'variable'}) of output column containing information about which input column(s) were melted. If \code{measure.vars} is an integer/character vector, then each entry of this column contains the name of a melted column from \code{data}. If \code{measure.vars} is a list of integer/character vectors, then each entry of this column contains an integer indicating an index/position in each of those vectors.}
 \item{value.name}{name for the molten data values column(s). The default name is \code{'value'}. Multiple names can be provided here for the case when \code{measure.vars} is a \code{list}, though note well that the names provided in \code{measure.vars} take precedence. }
 \item{na.rm}{If \code{TRUE}, \code{NA} values will be removed from the molten
 data.}
diff --git a/src/fmelt.c b/src/fmelt.c
@@ -538,19 +538,18 @@ SEXP getvarcols(SEXP DT, SEXP dtnames, Rboolean varfactor, Rboolean verbose, str
     } else {
       for (int j=0, ansloc=0, level=1; j<data->lmax; ++j) {
         const int thislen = data->narm ? length(VECTOR_ELT(data->naidx, j)) : data->nrow;
-        if (thislen==0) continue;  // so as not to bump level
         char buff[20];
         snprintf(buff, 20, "%d", level++);
         SEXP str = PROTECT(mkChar(buff));
         for (int k=0; k<thislen; ++k) SET_STRING_ELT(target, ansloc++, str);
         UNPROTECT(1);
       }
     }
-  } else {
+  } else {// varfactor==TRUE
     SET_VECTOR_ELT(ansvars, 0, target=allocVector(INTSXP, data->totlen));
     SEXP levels;
     int *td = INTEGER(target);
-    if (data->lvalues == 1) {
+    if (data->lvalues == 1) {//single output column.
       SEXP thisvaluecols = VECTOR_ELT(data->valuecols, 0);
       int len = length(thisvaluecols);
       levels = PROTECT(allocVector(STRSXP, len)); protecti++;
@@ -573,24 +572,16 @@ SEXP getvarcols(SEXP DT, SEXP dtnames, Rboolean varfactor, Rboolean verbose, str
         const int thislen = data->narm ? length(VECTOR_ELT(data->naidx, j)) : data->nrow;
         for (int k=0; k<thislen; ++k) td[ansloc++] = md[j];
       }
-    } else {
+    } else {//multiple output columns.
       int nlevel=0;
       levels = PROTECT(allocVector(STRSXP, data->lmax)); protecti++;
       for (int j=0, ansloc=0; j<data->lmax; ++j) {
         const int thislen = data->narm ? length(VECTOR_ELT(data->naidx, j)) : data->nrow;
-        if (thislen==0) continue;  // so as not to bump level
         char buff[20];
         snprintf(buff, 20, "%d", nlevel+1);
         SET_STRING_ELT(levels, nlevel++, mkChar(buff));  // generate levels = 1:nlevels
         for (int k=0; k<thislen; ++k) td[ansloc++] = nlevel;
       }
-      if (nlevel < data->lmax) {
-        // data->narm is true and there are some all-NA items causing at least one 'if (thislen==0) continue' above
-        // shrink the levels
-        SEXP newlevels = PROTECT(allocVector(STRSXP, nlevel)); protecti++;
-        for (int i=0; i<nlevel; ++i) SET_STRING_ELT(newlevels, i, STRING_ELT(levels, i));
-        levels = newlevels;
-      }
     }
     setAttrib(target, R_LevelsSymbol, levels);
     setAttrib(target, R_ClassSymbol, ScalarString(char_factor));

Original file line number	Diff line number	Diff line change
`@@ -31,7 +31,7 @@ non-measure columns will be assigned to it. If integer, must be positive; see De`
`31`	`31`	`}`
`32`	`32`
`33`	`33`	`For convenience/clarity in the case of multiple \code{melt}ed columns, resulting column names can be supplied as names to the elements \code{measure.vars} (in the \code{list} and \code{patterns} usages). See also \code{Examples}. }`
`34`		`-\item{variable.name}{name for the measured variable names column. The default name is \code{'variable'}.}`
	`34`	`+\item{variable.name}{name (default \code{'variable'}) of output column containing information about which input column(s) were melted. If \code{measure.vars} is an integer/character vector, then each entry of this column contains the name of a melted column from \code{data}. If \code{measure.vars} is a list of integer/character vectors, then each entry of this column contains an integer indicating an index/position in each of those vectors.}`
`35`	`35`	`\item{value.name}{name for the molten data values column(s). The default name is \code{'value'}. Multiple names can be provided here for the case when \code{measure.vars} is a \code{list}, though note well that the names provided in \code{measure.vars} take precedence. }`
`36`	`36`	`\item{na.rm}{If \code{TRUE}, \code{NA} values will be removed from the molten`
`37`	`37`	`data.}`