Re-encoding wrong filenames: answers

【Question title】: Re-Encoding of wrong filenames
【Posted】: 2020-03-01 15:03:55
【Question】:

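So I read the Spolsky Article twice, and this question several times as well. Now here I am.

I created a tarball of a directory structure on a Linux machine with locale ISO-8859-1 and unpacked it on Windows using 7zip. As a result, the filenames are scrambled when I look at them in Windows Explorer (and in my C# program): where I expect the German umlaut ü, I see a ³. No wonder, since the filenames were written into the tar file using the ISO-8859-1 code page, and Windows evidently does not know that.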
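The mismatch can be reproduced on a single byte. The sketch below assumes, as the solution further down does, that the unpacking side interpreted the names with OEM code page 850; that is an assumption about this setup, not something the question establishes. It also assumes .NET Core 3.0+ or .NET 5+, where code page 850 is only available after registering `CodePagesEncodingProvider`.

```csharp
using System;
using System.Text;

// Assumption: running on .NET Core / .NET 5+, where legacy code pages
// such as 850 must be registered before use.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// 0xFC is the single byte that ISO-8859-1 uses for "ü".
byte[] raw = { 0xFC };

string asLatin1 = Encoding.GetEncoding("ISO-8859-1").GetString(raw);
string asCp850 = Encoding.GetEncoding(850).GetString(raw);

// Print code points rather than glyphs, so the console encoding does not matter.
Console.WriteLine($"ISO-8859-1: U+{(int)asLatin1[0]:X4}"); // U+00FC (ü)
Console.WriteLine($"CP850:      U+{(int)asCp850[0]:X4}");  // U+00B3 (³)
```

The byte is the same in both cases; only the table used to read it differs.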

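I want to fix this by renaming the files to their correct names. So I think I have to tell the program: "read the filename, treat it as ISO-8859-1, and return every character as a UTF-16 character."

My code to find the correct filename: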

void Main()
{
    string[] files = Directory.GetFiles(@"C:\test", @"*", SearchOption.AllDirectories);
    var e1 = Encoding.GetEncoding("ISO-8859-1");
    var e2 = Encoding.GetEncoding("UTF-16");
    foreach (var f in files)
    {
        Console.WriteLine($"Source: {f}");
        var source = e1.GetBytes(f);
        var dest = Encoding.Convert(e1, e2, source);
        Console.WriteLine($"Result: {e2.GetString(dest)}");
    }
}

The result: nothing happens.

Source: C:\test\Brief-mrl³.odt
Result: C:\test\Brief-mrl³.odt

Expected result:

Source: C:\test\Brief-mrl³.odt
Result: C:\test\Brief-mrlü.odt

When I swap e1 and e2, I get weird results. My head hurts. What am I not getting?
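One likely reading of why the code above changes nothing: it encodes the string with ISO-8859-1, decodes those bytes with ISO-8859-1 again (inside `Encoding.Convert`), and then passes through UTF-16 and back. No step ever reads the bytes under a different code page, so the whole chain is a lossless round trip. A minimal sketch reduced to that round trip, using the file name from the question:

```csharp
using System;
using System.Text;

var latin1 = Encoding.GetEncoding("ISO-8859-1");

// The garbled name as .NET sees it; \u00B3 is the "³" character.
string name = "Brief-mrl\u00B3.odt";

// Encode and decode with the SAME encoding: nothing is reinterpreted.
byte[] bytes = latin1.GetBytes(name);
string roundTripped = latin1.GetString(bytes);

Console.WriteLine(roundTripped == name); // True
```

To change the text, the bytes have to be produced with one encoding and read back with a different one.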

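Edit: I know the mistake was made earlier in the chain, but now I have wrong filenames on a Windows machine and need to correct them. It may not be solvable with the Encoding class, though. I found this blog post, in which the author says

It turns out this is not an encoding problem at all; rather, the same character addresses mean different things in different character sets.

In short, he wrote a method that replaces the characters between 130 and 173 with specific other characters. That does not look straightforward to me, but could it be the only way? Can anyone comment?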
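A replacement table of that kind does not necessarily have to be typed by hand. As a rough sketch, and assuming the two character sets involved are OEM code page 850 (how the names were displayed) and ISO-8859-1 (how they were written), the table can be derived from the encodings themselves. This is not the blog author's method; the variable names are illustrative, and the provider registration assumes .NET Core 3.0+ or .NET 5+.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Assumption: .NET Core / .NET 5+, where code page 850 must be registered.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

var cp850 = Encoding.GetEncoding(850);
var latin1 = Encoding.GetEncoding("ISO-8859-1");

// For every byte in the upper half, map the character as displayed
// (CP850 reading) to the character that was intended (ISO-8859-1 reading).
var table = new Dictionary<char, char>();
for (int b = 0x80; b <= 0xFF; b++)
{
    byte[] raw = { (byte)b };
    table[cp850.GetString(raw)[0]] = latin1.GetString(raw)[0];
}

char intended = table['\u00B3']; // look up "³"
Console.WriteLine($"U+{(int)intended:X4}"); // U+00FC (ü)
```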

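【Comments】:

Your terminal may not support UTF-16 encoding.
Which terminal? I don't think that's the point. tar shouldn't care which terminal I use, should it?
After converting, write those strings to a file (using UTF-8 by default, which should have been used in the first place) and see what you get. Also try CodePage 1252 instead of ISO-8859-1.
When using Console.WriteLine, the terminal obviously won't display UTF-16, and the same goes for Windows File Explorer. 7zip's explorer, on the other hand, displays them correctly.
@Tal No, 7zip's explorer on Windows also shows the wrong names, same as Windows Explorer.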

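Tags: c# linux windows character-encoding codepages

【Solution 1】:

After some more reading, I found the solution myself. This excellent article helped. The key point: once the wrong encoding has been used, you can only guess (or have to know) what exactly went wrong. If you do know, you can reverse the whole thing in code.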

void Main()
{
    // We get the source string e.g. reading files from a directory. We see a "³" when 
    // we expect a German umlaut "ü". The reason can be a poorly configured smb share
    // on a Linux server or other problems.
    string source = "M³nch";

    // We are in a .NET program, so the source string (here in the 
    // program) is Unicode in UTF-16 encoding. I.e., the codepoints 
    // M, ³, n, c and h are encoded in UTF-16.

    byte[] bytesFromSource = Encoding.Unicode.GetBytes(source);
    // The source encoding is UTF-16, hence we get two bytes per character.

    // We accidentally worked with the OEM850 codepage, so we now have to look up the bytes of
    // the codepoints on the OEM850 codepage: we convert our bytesFromSource to the wrong codepage.
    // Note: on .NET Core / .NET 5+, codepage 850 is only available after calling
    // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
    byte[] bytesInWrongCodepage = Encoding.Convert(Encoding.Unicode, Encoding.GetEncoding(850), bytesFromSource);

    // Here's the trick: although we converted to OEM850, we now treat the bytes as ISO-8859-1
    // and convert them from ISO-8859-1 to Unicode.
    byte[] bytesFromCorrectCodepage = Encoding.Convert(Encoding.GetEncoding("ISO-8859-1"), Encoding.Unicode, bytesInWrongCodepage);

    // And finally we get the right character.
    string result = Encoding.Unicode.GetString(bytesFromCorrectCodepage);

    Console.WriteLine(result); // Münch
}

CAVEAT: Do not run this method on its own result. Doing so may produce unprintable characters or other garbage.
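The caveat can be made concrete. In the sketch below, `FixName` is a hypothetical helper that condenses the conversion above into two calls; it assumes the same CP850 / ISO-8859-1 pairing and, for the provider registration, .NET Core 3.0+ or .NET 5+. Applying it a second time turns the already correct ü (byte 0x81 in CP850) into U+0081, a control character.

```csharp
using System;
using System.Linq;
using System.Text;

// Assumption: .NET Core / .NET 5+, where code page 850 must be registered.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// Hypothetical helper: undo "ISO-8859-1 bytes displayed as CP850".
static string FixName(string garbled)
{
    byte[] raw = Encoding.GetEncoding(850).GetBytes(garbled);
    return Encoding.GetEncoding("ISO-8859-1").GetString(raw);
}

string once = FixName("M\u00B3nch"); // "Münch"
string twice = FixName(once);        // "ü" is now read as U+0081

Console.WriteLine(once == "M\u00FCnch");      // True
Console.WriteLine(twice.Any(char.IsControl)); // True
```

When renaming real files, one cautious option is to pass only the file name (not the whole path) through such a helper and to skip any result that contains control characters; that check is a heuristic, not a guarantee that a name was garbled in the first place.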

【Discussion】:

@stackoverflow.com/users/423780/steve-mcgill You pointed me to the solution. I think it is worth comparing it with your blog post.