Unify file encoding when a cmdlet creates a file #4119
JamesWTruher wants to merge 14 commits into PowerShell:master from
Conversation
We can use a static cache for UTF8Encoding(false) here and in Line 142 too - the cache is in Line 311.
Please fix the conflict and rebase.
Force-pushed from a013795 to 9f77970
If the new test batch covers all the old encoding variants, it would be good to merge the tests before the PR - then we'll be sure to see that the PR has full backward compatibility.
Use legacyEncodingMap.TryGetValue instead.
Do we need to care about WindowsLegacyEncoding here?
Yes - because the provider cmdlets leave the encoding to the underlying provider as a dynamic parameter, it needs to be handled via a different path than checking against the name of the cmdlet.
Delete the commented block?
Fixed - I missed that one.
Can you check the code coverage of this file with the newly added tests? Then we can add tests now if any are missing.
Now that we can, I will :)
Why aren't you using a ValidateSet with the FileEncoding values?
We could use #3784 (dynamic ValidateSet): don't expose FileEncoding, and allow an enhancement to use legacy charsets.
I don't want to use ValidateSet here at all. Especially since this is so pervasive in so many areas, I get the same result more easily with an enum.
As above: why aren't you using a ValidateSet with the FileEncoding values?
Same as above.
The default should be FileEncoding.UTF8. See BeginProcessing.
I want to be sure that I discover the encoding. If I set the default to a value that is likely to be provided, I won't be able to determine whether the user explicitly set it. By setting it to Unknown, I'll calculate the appropriate value and can tell that it wasn't set by the user.
Suggest checking for encoding != FileEncoding.Unknown first; none of the other code paths are executed in that case.
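The sentinel pattern under discussion - defaulting the parameter to Unknown so the cmdlet can tell whether the user set it explicitly - can be sketched as follows. The PR's code is C#; this is a Python illustration, and the names `resolve_encoding` and `preference_default` are hypothetical, not the PR's API.

```python
from enum import Enum

class FileEncoding(Enum):
    Unknown = 0   # sentinel: the user did not pass -Encoding
    UTF8 = 1
    UTF8BOM = 2
    Ascii = 3

def resolve_encoding(requested, preference_default=FileEncoding.UTF8):
    # Check the sentinel first; no other code path runs in that case.
    if requested != FileEncoding.Unknown:
        return requested           # user explicitly chose an encoding
    return preference_default      # otherwise compute the default

print(resolve_encoding(FileEncoding.Unknown))  # FileEncoding.UTF8
print(resolve_encoding(FileEncoding.Ascii))    # FileEncoding.Ascii
```

Because Unknown can never be produced by a user-supplied value, the cmdlet can unambiguously distinguish "not set" from any explicit choice.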
Use initialBytes.Length instead of 100.
Do you really need to read 100 bytes?
This was part of a refactor; the code has been around for a long time and I didn't really want to change it. I'm certainly open to reducing the size, but it's not too much, not too little, and any number here would be a guess anyway, right?
If we look beyond the preamble, a page size seems fine because we'll read that much from disk anyway, so 100 is totally fine.
Suggest naming this GetFileEncoding for clarity.
I think this should probably throw, not return Default.
Not necessarily - the file could be created later, so I need to pass back something. This code was refactored from another location, so I wanted to do as little as possible.
I suspect the string building/lookups are overkill. Consider testing the bytes directly?
This was code refactored from another location; I didn't really want to change it.
Please do fix it, or open an issue to get it fixed. Testing the bytes directly is much better - no extra allocations, no static initialization. It's simple code too - here is my PowerShell version: https://gist.github.com/lzybkr/f040f18d1fbfff9eaf3f4533e24126fe#file-fix_trailing_ws-ps1-L14 - note that my version handles UTF7, which this version does not, but mine misses the big-endian encodings.
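The byte-prefix approach the reviewer is advocating can be sketched as below. The PR's C# code instead joins the bytes into strings like "239-187-191" (the decimal form of the EF BB BF UTF-8 BOM, visible in the diff further down) and looks them up in a map; this Python sketch tests the bytes directly. `detect_bom` is an illustrative name, not the PR's method.

```python
# Direct byte tests for common BOMs. Longer prefixes are checked first so
# UTF-32LE (FF FE 00 00) is not mistaken for UTF-16LE (FF FE).
BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf",     "utf-8-bom"),   # 239-187-191 in decimal
    (b"\xfe\xff",         "utf-16-be"),
    (b"\xff\xfe",         "utf-16-le"),
]

def detect_bom(initial_bytes, default="utf-8"):
    for bom, name in BOMS:
        if initial_bytes.startswith(bom):
            return name
    return default  # no BOM found: fall back to the configured default

print(detect_bom(b"\xef\xbb\xbfhello"))  # utf-8-bom
print(detect_bom(b"plain text"))         # utf-8
```

No strings are built and no static lookup table of joined byte values is needed; a handful of `startswith` checks over the first few bytes is enough.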
This appears to be the only external call to the Encoding overload of PathUtils.MasterStreamOpen. Suggest using FileEncoding.Ascii here and merging the two MasterStreamOpen overloads.
Yeah - I was worried about removing the overload since it was a public interface, and thought to do less harm just in case someone else was using it somehow.
MasterStreamOpen is internal.
I'll investigate.
The Encoding overload is used by the CSV cmdlets, which do their own determination of the encoding (in the append case) and pass that along to the open. I'd rather not refactor all that code if I don't need to.
I can find no references to this field other than setting the session state variable. How is it used, and how does it control behavior? Currently, it appears to be a NOP.
It's used down at line 4915, where it essentially lays claim to the variable.
Is this actually needed? The only other change to this file is the removal of the GetEncoding logic.
I think the CORECLR branches could use comments in both GetDefaultEncoding and GetOEMEncoding. The reasoning behind GetACP and GetOEMCP is opaque.
I'll add comments about this; we can change it later after we have live-fire feedback.
Suggest merging this with the Encoding overload. There appears to be only one caller of the latter overload, and it does so by converting a FileEncoding to an Encoding.
Suggest using a dictionary for these cascading 'if' tests.
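The suggested change - replacing a cascade of `if` tests with one table lookup - can be sketched as follows. This is a Python illustration, not the PR's C# code; the function names are hypothetical, though the encoding names shown (ascii, unicode = UTF-16LE, bigendianunicode = UTF-16BE) are PowerShell's historical ones.

```python
# Before: cascading 'if' tests.
def encoding_for_name_if(name):
    if name == "ascii":
        return "us-ascii"
    elif name == "unicode":
        return "utf-16-le"
    elif name == "bigendianunicode":
        return "utf-16-be"
    else:
        return "utf-8"

# After: a single dictionary lookup with an explicit default.
ENCODING_MAP = {
    "ascii": "us-ascii",
    "unicode": "utf-16-le",
    "bigendianunicode": "utf-16-be",
}

def encoding_for_name(name):
    return ENCODING_MAP.get(name, "utf-8")  # default when not mapped
```

The table version is easier to extend (add a key) and makes the default case visible in one place instead of at the end of a chain.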
This function is not referenced.
@PowerShell/powershell-committee reviewed this, and we should have an RFC specific to the new public APIs (and generally, any new public APIs should go through an RFC going forward).
Seems like this should be GetEncodingFromFile to keep the naming consistent.
Return Unspecified, not Unknown.
Maybe this API would be better if we passed a SessionState instead of a Cmdlet.
Create a new class, PowerShellEncoding, and an enum, FileEncoding, to unify cmdlet and provider code for file encoding. Created the PowerShellEncoding class and FileEncoding enum and removed ClrFacade.GetDefaultEncoding. The PSDefaultFileEncoding preference variable can now set file encoding across all cmdlets; setting PSDefaultFileEncoding to WindowsLegacy selects the historic PowerShell 5 encodings.
Some tests were failing on Windows because the newline is different; calculate the bytes in the newline rather than hardcoding them.
ClrFacade retains some of its functionality but now relies on the PowerShellEncoding class to know what the default encoding is. The encoding methods which call native methods are retained.
The WindowsLegacy behavior for New-ModuleManifest should get the correct number of bytes, which will change depending on how many bytes are encoded for [Environment]::NewLine.
Update calls which had been creating a new instance of the UTF-8 encoding without a BOM to return the available static.
Only use hardcoded bytes when it's custom file generation (like Export-CliXml or New-ModuleManifest) or when we're looking at a set of partial results; also remove an unused function.
…ing files; improve code coverage for the PowerShellEncoding class and don't duplicate byte representations when they're not needed.
Remove a couple of extraneous using statements. Move encoding code from PathUtils.cs to Encoding.cs. Move and rename the GetEncoding method in Utils.cs to GetFileEncodingFromFile in Encoding.cs. Expand some explanatory comments with more details.
…oser to what is really happening. Update RedirectionOperator tests to compare bytes in a more sensible manner. Remove PSDefaultFileEncoding from the special variable collection so it doesn't show up in script cmdlets. Miscellaneous code cleanup.
Made the file-encoding probe method internal.
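The commit messages above describe two knobs: an explicit per-cmdlet encoding and the $PSDefaultFileEncoding preference variable, with UTF-8 without BOM as the unified fallback. A minimal Python sketch of how that precedence could resolve - the function and value names here are illustrative, not the PR's actual API:

```python
def effective_encoding(explicit=None, preference=None):
    # Precedence sketch: an explicit -Encoding argument wins, then the
    # $PSDefaultFileEncoding preference variable, then the new default.
    if explicit is not None:
        return explicit
    if preference == "WindowsLegacy":
        return "legacy-per-cmdlet"   # historic PowerShell 5 encodings
    if preference is not None:
        return preference
    return "utf-8-no-bom"            # unified cross-platform default

print(effective_encoding())                            # utf-8-no-bom
print(effective_encoding(preference="WindowsLegacy"))  # legacy-per-cmdlet
```

Under WindowsLegacy the actual encoding still varies per cmdlet (each keeps its PowerShell 5 behavior), which is why the sketch returns a marker rather than a single encoding name.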
Force-pushed from 62ba583 to ff24e22
{ "239-187-191", FileEncoding.UTF8BOM },
};

internal static char[] nonPrintableCharacters = {
Strictly speaking, 11 (vertical tab) and 12 (form feed) are also non-printable characters; given their rarity, I assume they were left out intentionally; if so, perhaps a comment would appease drive-by pedants like me.
#else // PowerShell Core on Windows, which needs provider registration
EncodingRegisterProvider();

uint oemCp = NativeMethods.GetOEMCP();
Looks like this line is no longer needed.
// The OEM code pages are sometimes used by Win32 console applications, and
// on non-Windows platforms they still may have uses (if installed) and
// could be used if desired.
// On non-windows platforms, they have more limited uses, and probably won't
This sentence is a bit confusing. Doesn't the previous one say everything that needs to be said?
{ "microsoft.powershell.commands.setcontentcommand", Encoding.ASCII },
// Providers are handled here
{ "microsoft.powershell.commands.filesystemprovider", Encoding.ASCII },
};
Unfortunately, that doesn't appear to be correct.
As for the writing cmdlets:
Add-Content and Set-Content - despite what the documentation claims - have always used "ANSI" encoding, not ASCII - see MicrosoftDocs/PowerShell-Docs#1483.
What about Send-MailMessage? The help topics claim that ASCII encoding is the default.
New-Item -Type File -Value currently creates BOM-less(!) UTF-8.
As an aside: While Export-Csv without -Append indeed creates ASCII files, Export-Csv -Append currently blindly appends UTF-8 if the existing file's encoding is any of ASCII/UTF-8/"ANSI", but correctly matches UTF-16LE and UTF-16BE.
As for the reading cmdlets:
Get-Content currently defaults to "ANSI" in the absence of a BOM, as does Import-PowerShellDataFile.
Import-Csv currently actually assumes UTF-8 in the absence of a BOM, as do Import-CliXml and Select-String.
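The reviewer's corrections above amount to a table of actual legacy defaults that differ from the ASCII entries in the PR's map. Summarized as data - a Python sketch recording only claims made in the review itself, not an exhaustive or authoritative list:

```python
# Actual legacy write defaults per the review above (not the map's ASCII claims).
LEGACY_WRITE_DEFAULTS = {
    "Add-Content": "ansi",                     # despite docs claiming ASCII
    "Set-Content": "ansi",                     # same
    "New-Item -Type File -Value": "utf-8 (no BOM)",
    "Export-Csv (no -Append)": "us-ascii",
}

# Legacy read defaults when no BOM is present, per the review above.
LEGACY_READ_DEFAULTS = {
    "Get-Content": "ansi",
    "Import-PowerShellDataFile": "ansi",
    "Import-Csv": "utf-8",
    "Import-CliXml": "utf-8",
    "Select-String": "utf-8",
}
```

This is exactly the kind of per-cmdlet divergence the WindowsLegacy mode needs to reproduce faithfully, which is why getting the map right matters.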
I am going to take a different route for this change - closing this.
Implementation of https://github.com/PowerShell/PowerShell-RFC/blob/master/1-Draft/RFC0020-DefaultFileEncoding.md
Cmdlets now create files with consistent encoding (UTF8 without BOM) on all platforms.
A new preference variable, PSDefaultFileEncoding, is now available to enable users to set the encoding for cmdlets. By setting $PSDefaultFileEncoding = "WindowsLegacy", users can select the encoding that existed in PowerShell 5 for each specific cmdlet. Provider and cmdlet encoding methods have been centralized and are now common.