Ensure Sanitized Column Names are Unique in AutoML (#3902)

When creating sanitized column names, we have to ensure the column names are distinct.

### Error
We generate non-compilable C# code when there are naming collisions.

Example build error:
`/private/tmp/out/CivicHonesty/CivicHonesty.Model/DataModels/ModelInput.cs(23,23): Error CS0102: The type 'ModelInput' already contains a definition for 'Country' (CS0102) (CivicHonesty.Model)`

### Repro
Running the CLI on the [Civic Honesty dataset](https://dataverse.harvard.edu/api/access/datafile/3451248?format=original&gbrecs=true) (CSV) produces a naming collision in the generated C# code:
```C#
namespace CivicHonesty.Model.DataModels
{
    public class ModelInput
    {
        [ColumnName("id"), LoadColumn(0)]
        public float Id { get; set; }

        [ColumnName("country"), LoadColumn(1)]
        public float Country { get; set; }

        [ColumnName("Country"), LoadColumn(2)]
        public string Country { get; set; }

        [ColumnName("city"), LoadColumn(3)]
        public float City { get; set; }
```

Note the two properties that are both named `Country`.

This comes from the dataset using "country" and "Country":
> ![image](https://user-images.githubusercontent.com/4080826/59969829-60f00680-950c-11e9-8eeb-141981ae43c3.png)

Dataset: https://dataverse.harvard.edu/api/access/datafile/3451248?format=original&gbrecs=true

CLI command:
`mlnet auto-train --dataset "behavioral data (csv file).csv"  --label-column-name response --mltask multiclass-classification --ignore-columns responsetime,id --max-exploration-time 60 --output-path /tmp/out/ --name CivicHonesty`

### Cause
Currently, we have no detection for duplicate column names. 

The "country" vs. "Country" collision above is a direct example. Collisions will also arise less directly, and more often, from our name sanitization, which maps, for example, both "$ spent" and "% spent" to "__spent":
https://github.com/dotnet/machinelearning/blob/227da9d7db2ce80b073cc64bfd067b04e6189de1/src/mlnet/Utilities/Utils.cs#L67-L83

https://github.com/dotnet/machinelearning/blob/227da9d7db2ce80b073cc64bfd067b04e6189de1/src/mlnet/Utilities/Utils.cs#L47-L50
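To make the sanitization collision concrete, here is a minimal sketch that groups original column names by their sanitized form and reports any group with more than one member. `Sanitize` repeats the logic of the linked method; the `CollisionDemo` class and the `FindCollisions` helper are hypothetical names introduced only for illustration and are not part of the mlnet code base.

```C#
using System;
using System.Collections.Generic;
using System.Linq;

public static class CollisionDemo
{
    // Same logic as the linked Sanitize method: every character that is
    // not a letter or digit becomes an underscore.
    public static string Sanitize(string name)
    {
        return string.Join("", name.Select(x => char.IsLetterOrDigit(x) ? x : '_'));
    }

    // Hypothetical helper: maps each sanitized name that is shared by
    // two or more original column names to those original names.
    public static Dictionary<string, List<string>> FindCollisions(IEnumerable<string> columnNames)
    {
        return columnNames
            .GroupBy(name => Sanitize(name))
            .Where(group => group.Count() > 1)
            .ToDictionary(group => group.Key, group => group.ToList());
    }
}
```

Calling `CollisionDemo.FindCollisions(new[] { "$ spent", "% spent", "id" })` yields a single entry keyed by `__spent` that lists both original names. Note that this only catches collisions introduced by `Sanitize`; the "country" vs. "Country" case collides later, when the first letter is capitalized for the property name.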



### Work
Todo:
* Check that sanitized column names are unique
* Check that when converted to C# variable names, the sanitized column names will be unique and valid C# variables
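One possible way to approach the two todo items is sketched below, under stated assumptions: names are compared case-sensitively (as C# identifiers are), and a duplicate is resolved by appending a numeric suffix. The suffix scheme, the `UniqueNames` class, and both helper methods are hypothetical; they are not the fix adopted in the repository. The identifier check is deliberately conservative and does not detect C# keywords.

```C#
using System;
using System.Collections.Generic;

public static class UniqueNames
{
    // Hypothetical helper: returns one name per input, in the same order,
    // appending "_1", "_2", ... to any name already taken by an earlier column.
    public static List<string> MakeUnique(IEnumerable<string> candidateNames)
    {
        var used = new HashSet<string>(StringComparer.Ordinal);
        var result = new List<string>();
        foreach (var candidate in candidateNames)
        {
            var name = candidate;
            var suffix = 1;
            // HashSet.Add returns false when the name is already present,
            // so keep trying suffixed variants until one is free.
            while (!used.Add(name))
            {
                name = candidate + "_" + suffix;
                suffix++;
            }
            result.Add(name);
        }
        return result;
    }

    // Conservative check (assumption: letters, digits, and underscores only,
    // not starting with a digit). It does NOT reject reserved keywords.
    public static bool LooksLikeIdentifier(string name)
    {
        if (string.IsNullOrEmpty(name)) return false;
        if (!(char.IsLetter(name[0]) || name[0] == '_')) return false;
        foreach (var c in name)
        {
            if (!(char.IsLetterOrDigit(c) || c == '_')) return false;
        }
        return true;
    }
}
```

For the repro above, `UniqueNames.MakeUnique(new[] { "Id", "Country", "Country", "City" })` returns `Id`, `Country`, `Country_1`, `City`, which would avoid the CS0102 error. Whether a suffix, a prefix, or an outright error is the right behavior is a design choice left open here.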

`Normalize` (`Utils.cs#L67-L83`, linked under Cause):

    internal static string Normalize(string input)
    {
        //check if first character is int
        if (!string.IsNullOrEmpty(input) && int.TryParse(input.Substring(0, 1), out int val))
        {
            input = "Col" + input;
            return input;
        }
        switch (input)
        {
            case null: throw new ArgumentNullException(nameof(input));
            case "": throw new ArgumentException($"{nameof(input)} cannot be empty", nameof(input));
            default:
                var sanitizedInput = Sanitize(input);
                return sanitizedInput.First().ToString().ToUpper() + input.Substring(1);
        }
    }


`Sanitize` (`Utils.cs#L47-L50`, linked under Cause):

    internal static string Sanitize(string name)
    {
        return string.Join("", name.Select(x => Char.IsLetterOrDigit(x) ? x : '_'));
    }
