
Commit e198080

Merge commit: 2 parents f0ebcb7 + 9c03c56

351 files changed (+5690 / -3080 lines)


.github/PULL_REQUEST_TEMPLATE (1 addition & 1 deletion)

@@ -7,4 +7,4 @@
 (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
 (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

-Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.
+Please review http://spark.apache.org/contributing.html before opening a pull request.

CONTRIBUTING.md (2 additions & 2 deletions)

@@ -1,12 +1,12 @@
 ## Contributing to Spark

 *Before opening a pull request*, review the
-[Contributing to Spark wiki](https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark).
+[Contributing to Spark guide](http://spark.apache.org/contributing.html).
 It lists steps that are required before creating a PR. In particular, consider:

 - Is the change important and ready enough to ask the community to spend time reviewing?
 - Have you searched for existing, related JIRAs and pull requests?
-- Is this a new feature that can stand alone as a [third party project](https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects) ?
+- Is this a new feature that can stand alone as a [third party project](http://spark.apache.org/third-party-projects.html) ?
 - Is the change being proposed clearly explained and motivated?

 When you contribute code, you affirm that the contribution is your original work and that you

R/README.md (1 addition & 1 deletion)

@@ -51,7 +51,7 @@ sparkR.session()

 #### Making changes to SparkR

-The [instructions](https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark) for making contributions to Spark also apply to SparkR.
+The [instructions](http://spark.apache.org/contributing.html) for making contributions to Spark also apply to SparkR.
 If you only make R file changes (i.e. no Scala changes) then you can just re-install the R package using `R/install-dev.sh` and test your changes.
 Once you have made your changes, please include unit tests for them and run existing unit tests using the `R/run-tests.sh` script as described below.

R/pkg/DESCRIPTION (1 addition & 1 deletion)

@@ -11,7 +11,7 @@ Authors@R: c(person("Shivaram", "Venkataraman", role = c("aut", "cre"),
                     email = "[email protected]"),
              person(family = "The Apache Software Foundation", role = c("aut", "cph")))
 URL: http://www.apache.org/ http://spark.apache.org/
-BugReports: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-ContributingBugReports
+BugReports: http://spark.apache.org/contributing.html
 Depends:
     R (>= 3.0),
     methods

R/pkg/R/DataFrame.R (4 additions & 2 deletions)

@@ -2541,7 +2541,8 @@ generateAliasesForIntersectedCols <- function (x, intersectedColNames, suffix) {
 #'
 #' Return a new SparkDataFrame containing the union of rows in this SparkDataFrame
 #' and another SparkDataFrame. This is equivalent to \code{UNION ALL} in SQL.
-#' Note that this does not remove duplicate rows across the two SparkDataFrames.
+#'
+#' Note: This does not remove duplicate rows across the two SparkDataFrames.
 #'
 #' @param x A SparkDataFrame
 #' @param y A SparkDataFrame
@@ -2584,7 +2585,8 @@ setMethod("unionAll",
 #' Union two or more SparkDataFrames
 #'
 #' Union two or more SparkDataFrames. This is equivalent to \code{UNION ALL} in SQL.
-#' Note that this does not remove duplicate rows across the two SparkDataFrames.
+#'
+#' Note: This does not remove duplicate rows across the two SparkDataFrames.
 #'
 #' @param x a SparkDataFrame.
 #' @param ... additional SparkDataFrame(s).
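The note these hunks reword is easy to demonstrate. A minimal sketch of the duplicate-preserving UNION ALL semantics, assuming an active SparkR session; the data frames df1 and df2 are made up for illustration:

library(SparkR)
sparkR.session()

# Two one-column SparkDataFrames that share the row x = 2.
df1 <- createDataFrame(data.frame(x = c(1, 2)))
df2 <- createDataFrame(data.frame(x = c(2, 3)))

# UNION ALL semantics: the shared row is kept twice, so 4 rows in total.
u <- unionAll(df1, df2)
count(u)            # 4

# Deduplication has to be requested explicitly.
count(distinct(u))  # 3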

R/pkg/R/functions.R (4 additions & 3 deletions)

@@ -2296,7 +2296,7 @@ setMethod("n", signature(x = "Column"),
 #' A pattern could be for instance \preformatted{dd.MM.yyyy} and could return a string like '18.03.1993'. All
 #' pattern letters of \code{java.text.SimpleDateFormat} can be used.
 #'
-#' NOTE: Use when ever possible specialized functions like \code{year}. These benefit from a
+#' Note: Use when ever possible specialized functions like \code{year}. These benefit from a
 #' specialized implementation.
 #'
 #' @param y Column to compute on.
@@ -2341,7 +2341,7 @@ setMethod("from_utc_timestamp", signature(y = "Column", x = "character"),
 #' Locate the position of the first occurrence of substr column in the given string.
 #' Returns null if either of the arguments are null.
 #'
-#' NOTE: The position is not zero based, but 1 based index. Returns 0 if substr
+#' Note: The position is not zero based, but 1 based index. Returns 0 if substr
 #' could not be found in str.
 #'
 #' @param y column to check
@@ -2779,7 +2779,8 @@ setMethod("window", signature(x = "Column"),
 #' locate
 #'
 #' Locate the position of the first occurrence of substr.
-#' NOTE: The position is not zero based, but 1 based index. Returns 0 if substr
+#'
+#' Note: The position is not zero based, but 1 based index. Returns 0 if substr
 #' could not be found in str.
 #'
 #' @param substr a character string to be matched.
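The 1-based indexing called out in the instr and locate notes looks like this in practice. A small sketch, again assuming an active SparkR session; the printed column label and spacing are approximate:

df <- createDataFrame(data.frame(s = c("spark", "scala")))

# instr returns a 1-based position: "p" is the 2nd character of "spark".
# A miss ("p" does not occur in "scala") yields 0, not -1 or NA.
head(select(df, instr(df$s, "p")))
#   instr(s, p)
# 1           2
# 2           0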

R/pkg/R/mllib.R (16 additions & 5 deletions)

@@ -278,8 +278,10 @@ setMethod("glm", signature(formula = "formula", family = "ANY", data = "SparkDat
 #' @param object a fitted generalized linear model.
 #' @return \code{summary} returns a summary object of the fitted model, a list of components
-#'         including at least the coefficients, null/residual deviance, null/residual degrees
-#'         of freedom, AIC and number of iterations IRLS takes.
+#'         including at least the coefficients matrix (which includes coefficients, standard error
+#'         of coefficients, t value and p value), null/residual deviance, null/residual degrees of
+#'         freedom, AIC and number of iterations IRLS takes. If there are collinear columns
+#'         in you data, the coefficients matrix only provides coefficients.
 #'
 #' @rdname spark.glm
 #' @export
@@ -303,9 +305,18 @@ setMethod("summary", signature(object = "GeneralizedLinearRegressionModel"),
             } else {
               dataFrame(callJMethod(jobj, "rDevianceResiduals"))
             }
-            coefficients <- matrix(coefficients, ncol = 4)
-            colnames(coefficients) <- c("Estimate", "Std. Error", "t value", "Pr(>|t|)")
-            rownames(coefficients) <- unlist(features)
+            # If the underlying WeightedLeastSquares using "normal" solver, we can provide
+            # coefficients, standard error of coefficients, t value and p value. Otherwise,
+            # it will be fitted by local "l-bfgs", we can only provide coefficients.
+            if (length(features) == length(coefficients)) {
+              coefficients <- matrix(coefficients, ncol = 1)
+              colnames(coefficients) <- c("Estimate")
+              rownames(coefficients) <- unlist(features)
+            } else {
+              coefficients <- matrix(coefficients, ncol = 4)
+              colnames(coefficients) <- c("Estimate", "Std. Error", "t value", "Pr(>|t|)")
+              rownames(coefficients) <- unlist(features)
+            }
             ans <- list(deviance.resid = deviance.resid, coefficients = coefficients,
                         dispersion = dispersion, null.deviance = null.deviance,
                         deviance = deviance, df.null = df.null, df.residual = df.residual,
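The branch added here picks the matrix shape by comparing lengths: per the in-diff comment, the "normal" WLS solver hands back four statistics per feature, while the l-bfgs fallback hands back only the estimates, so length(features) == length(coefficients) identifies the single-column case. The same reshaping in isolation, with made-up numbers:

# Hypothetical solver output for two features: l-bfgs returns one value per
# feature, so the vector length equals the feature count.
features <- c("V1", "V2")
coefficients <- c(0.5, 0.25)

if (length(features) == length(coefficients)) {
  # Estimates only; no standard errors, t values or p values are available.
  m <- matrix(coefficients, ncol = 1,
              dimnames = list(unlist(features), "Estimate"))
} else {
  # Normal solver: 4 statistics per feature, filled column-major.
  m <- matrix(coefficients, ncol = 4,
              dimnames = list(unlist(features),
                              c("Estimate", "Std. Error", "t value", "Pr(>|t|)")))
}
m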

R/pkg/R/sparkR.R (14 additions & 6 deletions)

@@ -373,8 +373,13 @@ sparkR.session <- function(
     overrideEnvs(sparkConfigMap, paramMap)
   }

+  deployMode <- ""
+  if (exists("spark.submit.deployMode", envir = sparkConfigMap)) {
+    deployMode <- sparkConfigMap[["spark.submit.deployMode"]]
+  }
+
   if (!exists(".sparkRjsc", envir = .sparkREnv)) {
-    retHome <- sparkCheckInstall(sparkHome, master)
+    retHome <- sparkCheckInstall(sparkHome, master, deployMode)
     if (!is.null(retHome)) sparkHome <- retHome
     sparkExecutorEnvMap <- new.env()
     sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, sparkExecutorEnvMap,
@@ -550,24 +555,27 @@ processSparkPackages <- function(packages) {
 #
 # @param sparkHome directory to find Spark package.
 # @param master the Spark master URL, used to check local or remote mode.
+# @param deployMode whether to deploy your driver on the worker nodes (cluster)
+#        or locally as an external client (client).
 # @return NULL if no need to update sparkHome, and new sparkHome otherwise.
-sparkCheckInstall <- function(sparkHome, master) {
+sparkCheckInstall <- function(sparkHome, master, deployMode) {
   if (!isSparkRShell()) {
     if (!is.na(file.info(sparkHome)$isdir)) {
       msg <- paste0("Spark package found in SPARK_HOME: ", sparkHome)
       message(msg)
       NULL
     } else {
-      if (!nzchar(master) || isMasterLocal(master)) {
-        msg <- paste0("Spark not found in SPARK_HOME: ",
-                      sparkHome)
+      if (isMasterLocal(master)) {
+        msg <- paste0("Spark not found in SPARK_HOME: ", sparkHome)
         message(msg)
         packageLocalDir <- install.spark()
         packageLocalDir
-      } else {
+      } else if (isClientMode(master) || deployMode == "client") {
         msg <- paste0("Spark not found in SPARK_HOME: ",
                       sparkHome, "\n", installInstruction("remote"))
         stop(msg)
+      } else {
+        NULL
       }
     }
   } else {
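Taken together with the utils.R change below, sparkCheckInstall now resolves a missing SPARK_HOME in three ways: auto-install via install.spark() for local masters, a hard stop with manual-install instructions for client mode, and a silent NULL for cluster mode, where the driver runs remotely and no local Spark is needed. A sketch of how the deploy mode reaches the check from a user session; the master URL is illustrative:

# Either an explicit "*-client" master or this config key marks client mode,
# even when the master URL alone is ambiguous.
sparkR.session(master = "spark://host:7077",
               sparkConfig = list(spark.submit.deployMode = "client"))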

R/pkg/R/utils.R (4 additions & 0 deletions)

@@ -777,6 +777,10 @@ isMasterLocal <- function(master) {
   grepl("^local(\\[([0-9]+|\\*)\\])?$", master, perl = TRUE)
 }

+isClientMode <- function(master) {
+  grepl("([a-z]+)-client$", master, perl = TRUE)
+}
+
 isSparkRShell <- function() {
   grepl(".*shell\\.R$", Sys.getenv("R_PROFILE_USER"), perl = TRUE)
 }
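The new helper keys off an explicit "-client" suffix in the master string; masters without the suffix fall through to the deployMode argument checked in sparkCheckInstall. A few illustrative inputs:

isClientMode <- function(master) {
  grepl("([a-z]+)-client$", master, perl = TRUE)
}

isClientMode("yarn-client")        # TRUE
isClientMode("mesos-client")       # TRUE, any lowercase word before "-client" matches
isClientMode("yarn")               # FALSE, mode must come from spark.submit.deployMode
isClientMode("spark://host:7077")  # FALSE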

R/pkg/inst/tests/testthat/test_mllib.R (9 additions & 0 deletions)

@@ -169,6 +169,15 @@ test_that("spark.glm summary", {
   df <- suppressWarnings(createDataFrame(data))
   regStats <- summary(spark.glm(df, b ~ a1 + a2, regParam = 1.0))
   expect_equal(regStats$aic, 14.00976, tolerance = 1e-4) # 14.00976 is from summary() result
+
+  # Test spark.glm works on collinear data
+  A <- matrix(c(1, 2, 3, 4, 2, 4, 6, 8), 4, 2)
+  b <- c(1, 2, 3, 4)
+  data <- as.data.frame(cbind(A, b))
+  df <- createDataFrame(data)
+  stats <- summary(spark.glm(df, b ~ . - 1))
+  coefs <- unlist(stats$coefficients)
+  expect_true(all(abs(c(0.5, 0.25) - coefs) < 1e-4))
 })

 test_that("spark.glm save/load", {
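The fixture is collinear by construction: the second column of A is exactly twice the first, which forces the estimates-only summary path added in mllib.R above. The expected coefficients also check out by hand, since 0.5 * V1 + 0.25 * V2 reproduces b in every row:

# Checking the fixture arithmetic with plain base R:
A <- matrix(c(1, 2, 3, 4, 2, 4, 6, 8), 4, 2)
b <- c(1, 2, 3, 4)

all(A[, 2] == 2 * A[, 1])     # TRUE: the columns are perfectly collinear
all(A %*% c(0.5, 0.25) == b)  # TRUE: 0.5 * V1 + 0.25 * V2 reproduces b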
