Spark / SPARK-19714

Clarify Bucketizer handling of invalid input


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.1.0
    • Fix Version/s: 3.0.0
    • Component/s: ML, MLlib
    • Labels: None

    Description

      import org.apache.spark.ml.feature.Bucketizer

      val contDF = spark.range(500).selectExpr("cast(id as double) as id")

      val splits = Array(5.0, 10.0, 250.0, 500.0)

      val bucketer = new Bucketizer()
        .setSplits(splits)
        .setInputCol("id")
        .setHandleInvalid("skip")

      bucketer.transform(contDF).show()
      

      You would expect this to skip the invalid (out-of-bounds) inputs. However, it fails:

      Caused by: org.apache.spark.SparkException: Feature value 0.0 out of Bucketizer bounds [5.0, 500.0].  Check your features, or loosen the lower/upper bound constraints.
      

      It seems strange that handleInvalid doesn't actually handle invalid inputs.

      Thoughts anyone?
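      For context: in the affected version, Bucketizer's handleInvalid option only applies to NaN feature values; any value outside the first/last split is treated as a hard error regardless of the setting. A common workaround is to make the outer splits unbounded so every finite value lands in a bucket. A minimal sketch (the session setup, app name, and "bucket" output column are illustrative, not from the report):

      import org.apache.spark.ml.feature.Bucketizer
      import org.apache.spark.sql.SparkSession

      // Illustrative local session; in spark-shell, `spark` already exists.
      val spark = SparkSession.builder()
        .master("local[*]")
        .appName("bucketizer-demo")
        .getOrCreate()

      val contDF = spark.range(500).selectExpr("cast(id as double) as id")

      // Unbounded outer splits: values below 5.0 or at/above 500.0 now fall
      // into the first/last bucket instead of raising a SparkException, so
      // handleInvalid only has to deal with NaN.
      val splits = Array(Double.NegativeInfinity, 5.0, 10.0, 250.0, 500.0,
        Double.PositiveInfinity)

      val bucketer = new Bucketizer()
        .setSplits(splits)
        .setInputCol("id")
        .setOutputCol("bucket")
        .setHandleInvalid("skip")

      bucketer.transform(contDF).show()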

      Attachments

        Activity

          People

            Assignee: wojtek-szymanski Wojciech Szymanski
            Reporter: bill_chambers Bill Chambers
            Votes: 0
            Watchers: 6

            Dates

              Created:
              Updated:
              Resolved: