Spark / SPARK-19714

Clarify Bucketizer handling of invalid input


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.1.0
    • Fix Version/s: 3.0.0
    • Component/s: ML, MLlib
    • Labels: None

    Description

      import org.apache.spark.ml.feature.Bucketizer

      val contDF = spark.range(500).selectExpr("cast(id as double) as id")

      val splits = Array(5.0, 10.0, 250.0, 500.0)

      val bucketer = new Bucketizer()
        .setSplits(splits)
        .setInputCol("id")
        .setHandleInvalid("skip")

      bucketer.transform(contDF).show()
      

      You would expect this to skip the invalid (out-of-bounds) inputs. However, it fails:

      Caused by: org.apache.spark.SparkException: Feature value 0.0 out of Bucketizer bounds [5.0, 500.0].  Check your features, or loosen the lower/upper bound constraints.
      

      It seems strange that handleInvalid doesn't actually handle invalid inputs.

      Thoughts anyone?
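      For context: in the affected version, Bucketizer's handleInvalid option only applies to NaN feature values; any value outside the first/last split is treated as a hard error regardless of the setting. A common workaround is to make the outer splits unbounded so every finite value lands in a bucket. A minimal sketch (the session setup, app name, and "bucket" output column are illustrative, not from the report):

      import org.apache.spark.ml.feature.Bucketizer
      import org.apache.spark.sql.SparkSession

      // Illustrative local session; in spark-shell, `spark` already exists.
      val spark = SparkSession.builder()
        .master("local[*]")
        .appName("bucketizer-demo")
        .getOrCreate()

      val contDF = spark.range(500).selectExpr("cast(id as double) as id")

      // Unbounded outer splits: values below 5.0 or at/above 500.0 now fall
      // into the first/last bucket instead of raising a SparkException, so
      // handleInvalid only has to deal with NaN.
      val splits = Array(Double.NegativeInfinity, 5.0, 10.0, 250.0, 500.0,
        Double.PositiveInfinity)

      val bucketer = new Bucketizer()
        .setSplits(splits)
        .setInputCol("id")
        .setOutputCol("bucket")
        .setHandleInvalid("skip")

      bucketer.transform(contDF).show()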

      Attachments

        Activity

          People

            Assignee: wojtek-szymanski Wojciech Szymanski
            Reporter: bill_chambers Bill Chambers
            Votes: 0
            Watchers: 6

            Dates

              Created:
              Updated:
              Resolved: