Skip to content

Ensure Sanitized Column Names are Unique in AutoML #3902

@justinormont

Description

@justinormont

When creating sanitized column names, we have to ensure the column names are distinct.

Error

We generate non-compilable C# code when there are naming collisions.

Example build error:
/private/tmp/out/CivicHonesty/CivicHonesty.Model/DataModels/ModelInput.cs(23,23): Error CS0102: The type 'ModelInput' already contains a definition for 'Country' (CS0102) (CivicHonesty.Model)

Repro

In the Civic Honesty dataset (CSV), we cause a naming collision in the generated C# code:

namespace CivicHonesty.Model.DataModels
{
    public class ModelInput
    {
        [ColumnName("id"), LoadColumn(0)]
        public float Id { get; set; }

        [ColumnName("country"), LoadColumn(1)]
        public float Country { get; set; }

        [ColumnName("Country"), LoadColumn(2)]
        public string Country { get; set; }

        [ColumnName("city"), LoadColumn(3)]
        public float City { get; set; }

You'll note the two variables both called Country.

This comes from the dataset using "country" and "Country":

image

Dataset: https://dataverse.harvard.edu/api/access/datafile/3451248?format=original&gbrecs=true

CLI command:
mlnet auto-train --dataset "behavioral data (csv file).csv" --label-column-name response --mltask multiclass-classification --ignore-columns responsetime,id --max-exploration-time 60 --output-path /tmp/out/ --name CivicHonesty

Cause

Currently, we have no detection for duplicate column names.

The above "country" vs. "Country" is a rather direct example. This will occur less directly and more often due to our name sanitization where we map, for example, "$ spent" and "% spent" both to "__spent":

internal static string Normalize(string input)
{
//check if first character is int
if (!string.IsNullOrEmpty(input) && int.TryParse(input.Substring(0, 1), out int val))
{
input = "Col" + input;
return input;
}
switch (input)
{
case null: throw new ArgumentNullException(nameof(input));
case "": throw new ArgumentException($"{nameof(input)} cannot be empty", nameof(input));
default:
var sanitizedInput = Sanitize(input);
return sanitizedInput.First().ToString().ToUpper() + input.Substring(1);
}
}

internal static string Sanitize(string name)
{
return string.Join("", name.Select(x => Char.IsLetterOrDigit(x) ? x : '_'));
}

Work

Todo:

  • Check that sanitized column names are unique
  • Check that when converted to C# variable names, the sanitized column names will be unique and valid C# variables

Metadata

Metadata

Assignees

Labels

AutoML.NETAutomating various steps of the machine learning processP1Priority of the issue for triage purpose: Needs to be fixed soon.classificationBugs related classification taskscommand-lineIssues pertaining to the command-line interfaceimageBugs related image datatype tasks

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions