Ensure Sanitized Column Names are Unique in AutoML #3902
Closed
Labels: AutoML.NET, P1, classification, command-line, image
When creating sanitized column names, we have to ensure the column names are distinct.
Error
We generate non-compilable C# code when there are naming collisions.
Example build error:
/private/tmp/out/CivicHonesty/CivicHonesty.Model/DataModels/ModelInput.cs(23,23): Error CS0102: The type 'ModelInput' already contains a definition for 'Country' (CS0102) (CivicHonesty.Model)
Repro
In the Civic Honesty dataset (CSV), we cause a naming collision in the generated C# code: the generated ModelInput class ends up with two variables both called Country. This comes from the dataset using both "country" and "Country" as column names.
Dataset: https://dataverse.harvard.edu/api/access/datafile/3451248?format=original&gbrecs=true
CLI command:
mlnet auto-train --dataset "behavioral data (csv file).csv" --label-column-name response --mltask multiclass-classification --ignore-columns responsetime,id --max-exploration-time 60 --output-path /tmp/out/ --name CivicHonesty
Cause
Currently, we have no detection for duplicate column names.
The "country" vs. "Country" case above is a rather direct example. Collisions will arise less directly, and more often, from our name sanitization, which maps, for example, "$ spent" and "% spent" both to "__spent":
machinelearning/src/mlnet/Utilities/Utils.cs, lines 67 to 83 in 227da9d
machinelearning/src/mlnet/Utilities/Utils.cs, lines 47 to 50 in 227da9d
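One possible way to make the sanitized names distinct is sketched below. This is a minimal illustration, not the fix that was applied in the repository. The `Sanitize` method here is a hypothetical stand-in that only reproduces the mapping described above (each character that is not a letter or digit becomes `_`); the class name `ColumnNameDeduplicator`, the method `SanitizeAndMakeUnique`, and the `_2`, `_3`, ... suffix scheme are likewise assumptions made for this example. Names are compared case-insensitively on the assumption that a later step may change casing, as the "country" vs. "Country" collision suggests.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ColumnNameDeduplicator
{
    // Hypothetical stand-in for the real sanitizer: every character
    // that is not a letter or digit is replaced with an underscore.
    public static string Sanitize(string name)
    {
        return new string(name.Select(c => char.IsLetterOrDigit(c) ? c : '_').ToArray());
    }

    // Sanitizes each raw column name, then appends a numeric suffix
    // (_2, _3, ...) whenever the result collides with a name already
    // produced. The comparison ignores case (an assumption, see above).
    public static IReadOnlyList<string> SanitizeAndMakeUnique(IEnumerable<string> rawNames)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var result = new List<string>();

        foreach (var raw in rawNames)
        {
            var baseName = Sanitize(raw);
            var candidate = baseName;
            var suffix = 1;

            // HashSet.Add returns false when the name is already taken,
            // so keep trying the next suffix until one is free.
            while (!seen.Add(candidate))
            {
                suffix++;
                candidate = baseName + "_" + suffix;
            }

            result.Add(candidate);
        }

        return result;
    }
}
```

With the inputs from this issue, `SanitizeAndMakeUnique(new[] { "country", "Country", "$ spent", "% spent" })` would yield `country`, `Country_2`, `__spent`, `__spent_2`. A suffix scheme like this keeps the first occurrence unchanged, so datasets without collisions produce the same generated code as before.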
Work
Todo: