Unify file encoding when a cmdlet creates a file #4119
JamesWTruher wants to merge 14 commits into PowerShell:master from
Conversation
We can use a static cache for UTF8Encoding(false) here and in Line 142 too - the cache is in Line 311.
Please fix the conflict and rebase.
Force-pushed from a013795 to 9f77970
If the new test batch covers all the old encoding variants, it would be good to merge the tests before the PR - then we'll be sure to see that the PR has full backward compatibility.
Use legacyEncodingMap.TryGetValue instead.
Do we need to care about WindowsLegacyEncoding here?
Yes - because the provider cmdlets leave the encoding to the underlying provider as a dynamic parameter, it needs to be handled via a different path than checking against the name of the cmdlet.
Delete the commented block?
Fixed - I missed that one.
Can you check the code coverage of this file with the newly added tests? Then we can add tests now if any are missing.
Now that we can, I will :)
Why aren't you using a ValidateSet with the FileEncoding values?
We could use #3784 (dynamic ValidateSet): don't expose FileEncoding, and allow an enhancement to use legacy charsets.
I don't want to use ValidateSet here at all. Especially since this is so pervasive in so many areas, I get the same result more easily with an enum.
As above: why aren't you using a ValidateSet with the FileEncoding values?
Same as above.
The default should be FileEncoding.UTF8. See BeginProcessing.
I want to be sure that I discover the encoding. If I set the default to a value that is likely to be provided, I won't be able to determine whether the user explicitly set it. By setting it to Unknown, I'll calculate the appropriate value and can tell that it wasn't set by the user.
Suggest checking for encoding != FileEncoding.Unknown first; none of the other code paths are executed in that case.
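The sentinel pattern under discussion - defaulting the parameter to Unknown so the cmdlet can tell whether the user set it explicitly - can be sketched as follows. The PR's code is C#; this is a Python illustration, and the names `resolve_encoding` and `preference_default` are hypothetical, not the PR's API.

```python
from enum import Enum

class FileEncoding(Enum):
    Unknown = 0   # sentinel: the user did not pass -Encoding
    UTF8 = 1
    UTF8BOM = 2
    Ascii = 3

def resolve_encoding(requested, preference_default=FileEncoding.UTF8):
    # Check the sentinel first; no other code path runs in that case.
    if requested != FileEncoding.Unknown:
        return requested           # user explicitly chose an encoding
    return preference_default      # otherwise compute the default

print(resolve_encoding(FileEncoding.Unknown))  # FileEncoding.UTF8
print(resolve_encoding(FileEncoding.Ascii))    # FileEncoding.Ascii
```

Because Unknown can never be produced by a user-supplied value, the cmdlet can unambiguously distinguish "not set" from any explicit choice.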
Use initialBytes.Length instead of 100.
Do you really need to read 100 bytes?
This was part of a refactor; the code has been around for a long time and I didn't really want to change it. I'm certainly open to reducing the size, but it's not too much, not too little, and any number here would be a guess anyway, right?
If we look beyond the preamble, a page size seems fine because we'll read that much from disk anyway, so 100 is totally fine.
Suggest naming this GetFileEncoding for clarity.
I think this should probably throw, not return Default.
Not necessarily - the file could be created later, so I need to pass back something. This code was refactored from another location, so I wanted to do as little as possible.
I suspect the string building/lookups are overkill. Consider testing the bytes directly?
This was code refactored from another location; I didn't really want to change it.
Please do fix it, or open an issue to get it fixed. Testing the bytes directly is much better - no extra allocations, no static initialization. It's simple code too - here is my PowerShell version: https://gist.github.com/lzybkr/f040f18d1fbfff9eaf3f4533e24126fe#file-fix_trailing_ws-ps1-L14 - note that my version handles UTF7, which this version does not, but mine misses the big-endian encodings.
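The byte-prefix approach the reviewer is advocating can be sketched as below. The PR's C# code instead joins the bytes into strings like "239-187-191" (the decimal form of the EF BB BF UTF-8 BOM, visible in the diff further down) and looks them up in a map; this Python sketch tests the bytes directly. `detect_bom` is an illustrative name, not the PR's method.

```python
# Direct byte tests for common BOMs. Longer prefixes are checked first so
# UTF-32LE (FF FE 00 00) is not mistaken for UTF-16LE (FF FE).
BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf",     "utf-8-bom"),   # 239-187-191 in decimal
    (b"\xfe\xff",         "utf-16-be"),
    (b"\xff\xfe",         "utf-16-le"),
]

def detect_bom(initial_bytes, default="utf-8"):
    for bom, name in BOMS:
        if initial_bytes.startswith(bom):
            return name
    return default  # no BOM found: fall back to the configured default

print(detect_bom(b"\xef\xbb\xbfhello"))  # utf-8-bom
print(detect_bom(b"plain text"))         # utf-8
```

No strings are built and no static lookup table of joined byte values is needed; a handful of `startswith` checks over the first few bytes is enough.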
This appears to be the only external call to the Encoding overload of PathUtils.MasterStreamOpen. Suggest using FileEncoding.Ascii here and merging the two MasterStreamOpen overloads.
Yeah - I was worried about removing the overload since it was a public interface, and thought to do less harm just in case someone else was using it somehow.
MasterStreamOpen is internal.
I'll investigate.
The Encoding overload is used by the CSV cmdlets, which do their own determination of the encoding (in the append case) and pass that along to the open. I'd rather not refactor all that code if I don't need to.
I can find no references to this field other than setting the session state variable. How is it used, and how does it control behavior? Currently, it appears to be a NOP.
It's used down at line 4915, where it essentially lays claim to the variable.
Is this actually needed? The only other change to this file is the removal of the GetEncoding logic.
I think the CORECLR branches could use comments in both GetDefaultEncoding and GetOEMEncoding. The reasoning behind GetACP and GetOEMCP is opaque.
I'll add comments about this; we can change it later after we have live-fire feedback.
Suggest merging this with the Encoding overload. There appears to be only one caller of the latter overload, and it does so by converting a FileEncoding to an Encoding.
Suggest using a dictionary for these cascading 'if' tests.
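The suggested change - replacing a cascade of `if` tests with one table lookup - can be sketched as follows. This is a Python illustration, not the PR's C# code; the function names are hypothetical, though the encoding names shown (ascii, unicode = UTF-16LE, bigendianunicode = UTF-16BE) are PowerShell's historical ones.

```python
# Before: cascading 'if' tests.
def encoding_for_name_if(name):
    if name == "ascii":
        return "us-ascii"
    elif name == "unicode":
        return "utf-16-le"
    elif name == "bigendianunicode":
        return "utf-16-be"
    else:
        return "utf-8"

# After: a single dictionary lookup with an explicit default.
ENCODING_MAP = {
    "ascii": "us-ascii",
    "unicode": "utf-16-le",
    "bigendianunicode": "utf-16-be",
}

def encoding_for_name(name):
    return ENCODING_MAP.get(name, "utf-8")  # default when not mapped
```

The table version is easier to extend (add a key) and makes the default case visible in one place instead of at the end of a chain.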
This function is not referenced.
@PowerShell/powershell-committee reviewed this, and we should have an RFC specific to the new public APIs (and generally, any new public APIs should go through an RFC going forward).
Seems like this should be GetEncodingFromFile to keep the naming consistent.
Return Unspecified, not Unknown.
Maybe this API would be better if we passed a SessionState instead of a Cmdlet.
Create a new class, PowerShellEncoding, and an enum, FileEncoding, to unify cmdlet and provider code for file encoding. Created the PowerShellEncoding class and FileEncoding enum and removed ClrFacade.GetDefaultEncoding. The PSDefaultFileEncoding preference variable can now set file encoding across all cmdlets; setting PSDefaultFileEncoding to WindowsLegacy selects the historic PowerShell 5 encodings.
Some tests were failing on Windows because the newline is different; calculate the bytes in the newline rather than hardcoding them.
ClrFacade retains some of its functionality but now relies on the PowerShellEncoding class to know what the default encoding is. The encoding methods which call native methods are retained.
The WindowsLegacy behavior for New-ModuleManifest should get the correct number of bytes, which will change depending on how many bytes are encoded for [Environment]::NewLine.
Update calls which had been creating a new instance of the UTF-8 encoding without a BOM to return the available static.
Only use hardcoded bytes when it's custom file generation (like Export-CliXml or New-ModuleManifest) or when we're looking at a set of partial results; also remove an unused function.
…ing files; improve code coverage for the PowerShellEncoding class and don't duplicate byte representations when they're not needed.
Remove a couple of extraneous using statements. Move encoding code from PathUtils.cs to Encoding.cs. Move and rename the GetEncoding method in Utils.cs to GetFileEncodingFromFile in Encoding.cs. Expand some explanatory comments with more details.
…oser to what is really happening. Update RedirectionOperator tests to compare bytes in a more sensible manner. Remove PSDefaultFileEncoding from the special variable collection so it doesn't show up in script cmdlets. Miscellaneous code cleanup.
Made the file-encoding probe method internal.
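The commit messages above describe two knobs: an explicit per-cmdlet encoding and the $PSDefaultFileEncoding preference variable, with UTF-8 without BOM as the unified fallback. A minimal Python sketch of how that precedence could resolve - the function and value names here are illustrative, not the PR's actual API:

```python
def effective_encoding(explicit=None, preference=None):
    # Precedence sketch: an explicit -Encoding argument wins, then the
    # $PSDefaultFileEncoding preference variable, then the new default.
    if explicit is not None:
        return explicit
    if preference == "WindowsLegacy":
        return "legacy-per-cmdlet"   # historic PowerShell 5 encodings
    if preference is not None:
        return preference
    return "utf-8-no-bom"            # unified cross-platform default

print(effective_encoding())                            # utf-8-no-bom
print(effective_encoding(preference="WindowsLegacy"))  # legacy-per-cmdlet
```

Under WindowsLegacy the actual encoding still varies per cmdlet (each keeps its PowerShell 5 behavior), which is why the sketch returns a marker rather than a single encoding name.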
Force-pushed from 62ba583 to ff24e22
{ "239-187-191", FileEncoding.UTF8BOM },
};

internal static char[] nonPrintableCharacters = {
Strictly speaking, 11 (vertical tab) and 12 (form feed) are also non-printable characters; given their rarity, I assume they were left out intentionally; if so, perhaps a comment would appease drive-by pedants like me.
#else // PowerShell Core on Windows, which needs provider registration
EncodingRegisterProvider();

uint oemCp = NativeMethods.GetOEMCP();
Looks like this line is no longer needed.
// The OEM code pages are sometimes used by Win32 console applications, and
// on non-Windows platforms they still may have uses (if installed) and
// could be used if desired.
// On non-windows platforms, they have more limited uses, and probably won't
This sentence is a bit confusing. Doesn't the previous one say everything that needs to be said?
{ "microsoft.powershell.commands.setcontentcommand", Encoding.ASCII },
// Providers are handled here
{ "microsoft.powershell.commands.filesystemprovider", Encoding.ASCII },
};
Unfortunately, that doesn't appear to be correct.
As for the writing cmdlets:
Add-Content and Set-Content - despite what the documentation claims - have always used "ANSI" encoding, not ASCII - see MicrosoftDocs/PowerShell-Docs#1483.
What about Send-MailMessage? The help topics claim that ASCII encoding is the default.
New-Item -Type File -Value currently creates BOM-less(!) UTF-8.
As an aside: While Export-Csv without -Append indeed creates ASCII files, Export-Csv -Append currently blindly appends UTF-8 if the existing file's encoding is any of ASCII/UTF-8/"ANSI", but correctly matches UTF-16LE and UTF-16BE.
As for the reading cmdlets:
Get-Content currently defaults to "ANSI" in the absence of a BOM, as does Import-PowerShellDataFile.
Import-Csv currently actually assumes UTF-8 in the absence of a BOM, as do Import-CliXml and Select-String.
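The reviewer's corrections above amount to a table of actual legacy defaults that differ from the ASCII entries in the PR's map. Summarized as data - a Python sketch recording only claims made in the review itself, not an exhaustive or authoritative list:

```python
# Actual legacy write defaults per the review above (not the map's ASCII claims).
LEGACY_WRITE_DEFAULTS = {
    "Add-Content": "ansi",                     # despite docs claiming ASCII
    "Set-Content": "ansi",                     # same
    "New-Item -Type File -Value": "utf-8 (no BOM)",
    "Export-Csv (no -Append)": "us-ascii",
}

# Legacy read defaults when no BOM is present, per the review above.
LEGACY_READ_DEFAULTS = {
    "Get-Content": "ansi",
    "Import-PowerShellDataFile": "ansi",
    "Import-Csv": "utf-8",
    "Import-CliXml": "utf-8",
    "Select-String": "utf-8",
}
```

This is exactly the kind of per-cmdlet divergence the WindowsLegacy mode needs to reproduce faithfully, which is why getting the map right matters.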
I am going to take a different route for this change - closing this.
Implementation of https://github.com/PowerShell/PowerShell-RFC/blob/master/1-Draft/RFC0020-DefaultFileEncoding.md
Cmdlets now create files with consistent encoding (UTF8 without BOM) on all platforms.
A new preference variable, PSDefaultFileEncoding, is now available to enable users to set the encoding for cmdlets. By setting $PSDefaultFileEncoding = "WindowsLegacy", users can select the encoding that existed in PowerShell 5 for each specific cmdlet. Provider and cmdlet encoding methods have been centralized and are now common.