C++ UTF-16 to char conversion (Linux/Ubuntu): Answers

【Question title】: C++ UTF-16 to char conversion (Linux/Ubuntu)
【Posted】: 2013-09-19 18:50:39
【Question description】:

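I am trying to help a friend with a project that was supposed to take an hour and has now taken three days. Needless to say, I feel very frustrated and angry ;-) ooooouuuu... I breathe.

The program, written in C++, just reads a bunch of files and processes them. The problem is that the files it reads are encoded in UTF-16 (because they contain words written in different languages), and a simple use of ifstream does not seem to work (it reads and outputs garbage). It took me a while to realise that this was because the files were in UTF-16.

I have now spent the entire afternoon on the web looking for information about reading UTF-16 files and converting the content of a UTF-16 line to char. I just can't seem to manage it. It's a nightmare. I tried to learn about <locale> and <codecvt>, wstring, etc., which I have never used before (I specialise in graphics applications, not desktop applications). I just can't get it.

This is what I have done so far (but it doesn't work):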

std::wifstream file2(fileFullPath);
std::locale loc (std::locale(), new std::codecvt_utf16<char32_t>);
std::cout.imbue(loc);
while (!file2.eof()) {
    std::wstring line;
    std::getline(file2, line);
    std::wcout << line << std::endl;
}

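This is the most I could come up with, and it doesn't even work. It doesn't do anything better. The real problem is that I don't understand what I am doing in the first place.

So please, HELP! It's crazy that I can't even read a G*** D*** text file.

On top of that, my friend uses Ubuntu (I use clang++), and this code needs -stdlib=libc++, which doesn't seem to be supported by gcc on his side (even though he uses a fairly recent version of gcc, 4.6.3 I believe). So I am not even sure that using codecvt and locale is a good idea (as in "possible"). Would there be a better (another) option?

If I just convert all the files to UTF-8 from the command line (using a Linux command), am I going to lose information?

Thanks a lot, I will be forever grateful if you can help me.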

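【Comments】:

You will not lose any information converting UTF-16 to UTF-8. I think your mistake is in thinking that C++ will do this for you. I'm not completely sure about this, but I don't believe it will. In any case, I would just write the UTF-16 to UTF-8 conversion by hand. It's simple and would certainly take less than three days.
Well, the problem is that instead of reading about UTF-16, I stupidly tried to brute-force a solution by copying and pasting code from the web that I didn't fully understand... ;-( So you are sure that converting from 16 to 8 won't cause any loss of information? Then the question is why UTF-16 was used for foreign languages in the first place. I thought it was necessary because some alphabets have more characters than you can encode with UTF-8?
UTF-16 and UTF-8 are both complete encodings of Unicode. I'm confident you won't lose any information.
UTF-16 was probably used because the files come from a Java/.NET background. No one on Unix would consider using UTF-16 for anything. (UTF-8 can actually represent more characters than UTF-16.)
gcc does not yet support C++11's Unicode conversions; if you don't want to write them by hand, you need a library such as boost.locale to do it portably.

Tags: c++ ubuntu utf-8 ifstream utf-16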

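【Answer 1】:

UTF-8 is capable of representing every valid Unicode character (code point), which is better than UTF-16 (which covers the first 1.1 million code points, up to U+10FFFF). [Although, as the comments explain, there are no valid Unicode code points beyond the 1.1 million value, so UTF-16 is "safe" for all currently available code points, and probably will be for a long time to come, unless we get extraterrestrial visitors with a very complex written language...]

It does this by using more than one byte/word, when necessary, to store a single code point (what we'd call a character). In UTF-8 this is marked by the highest bit being set: in the first byte of a "multibyte" character the top two bits are set, and in each following byte the top bit is set and the next bit from the top is zero.

To convert an arbitrary code point to UTF-8, you can use the code in a previous answer of mine. (Yes, that question is about the reverse of what you are asking for, but the code in my answer covers both directions of conversion.)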
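
For readers without that earlier answer to hand, here is a minimal sketch of the code point to UTF-8 direction, following the bit layout described above. The function name encode_utf8 is made up for this illustration and is not taken from the linked answer; it throws rather than exiting on bad input, which is a design choice of this sketch only.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from the original answer): encode one Unicode
// code point as a UTF-8 byte sequence.
std::string encode_utf8(std::uint32_t cp)
{
    // Surrogates (D800-DFFF) and values above U+10FFFF are not encodable.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");

    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                         // 0xxxxxxx
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));           // 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));         // 10xxxxxx
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));          // 1110xxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));  // 10xxxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));         // 10xxxxxx
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));          // 11110xxx
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F)); // 10xxxxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));  // 10xxxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));         // 10xxxxxx
    }
    return out;
}
```

For example, U+20AC (the euro sign) falls in the three-byte range and comes out as the bytes E2 82 AC.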

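Converting from UTF-16 to an "integer" is a similar exercise, except for the length of the input. If you are lucky, you may even get away without doing that...

UTF-16 uses the range D800-DBFF for the first part, which holds 10 bits of data, followed by a second item in the range DC00-DFFF, which holds the next 10 bits of data.

Code for 16-bit to follow...

Code for 16-bit to 32-bit conversion (I have only tested it a little, but it appears to work OK):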

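#include <cstdlib>
#include <iostream>
#include <vector>

// Encode one code point as one or two UTF-16 code units.
std::vector<int> utf32_to_utf16(int charcode)
{
    std::vector<int> r;
    if (charcode < 0x10000)
    {
        // D800-DFFF is reserved for surrogates and is not a valid code point.
        // The parentheses matter: == binds tighter than &.
        if ((charcode & 0xF800) == 0xD800)
        {
            std::cerr << "Error bad character code" << std::endl;
            std::exit(1);
        }
        r.push_back(charcode);
        return r;
    }
    charcode -= 0x10000;
    if (charcode > 0xFFFFF)
    {
        std::cerr << "Error bad character code" << std::endl;
        std::exit(1);
    }
    // High surrogate carries the top 10 bits, low surrogate the bottom 10.
    int coded = 0xD800 | ((charcode >> 10) & 0x3FF);
    r.push_back(coded);
    coded = 0xDC00 | (charcode & 0x3FF);
    r.push_back(coded);
    return r;
}


// Decode the code point that starts at coded[0].
// If coded[0] is a high surrogate, coded[1] must hold the low surrogate.
int utf16_to_utf32(std::vector<int> &coded)
{
    int t = coded[0];
    if ((t & 0xFC00) != 0xD800)
    {
        return t;
    }
    int charcode = (coded[1] & 0x3FF) | ((t & 0x3FF) << 10);
    charcode += 0x10000;
    return charcode;
}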
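
To connect the helpers above to the actual contents of a file, here is a rough sketch that takes the raw bytes of a UTF-16 file and produces a list of code points. The name decode_utf16_bytes is hypothetical, and two things are assumptions of this sketch rather than anything stated in the question: a leading byte order mark selects the byte order, and little-endian is used when no mark is present.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: decode raw UTF-16 file bytes into code points.
// Assumption: a BOM (FE FF or FF FE) picks the byte order; without one,
// little-endian is assumed.
std::vector<std::uint32_t> decode_utf16_bytes(const std::string& bytes)
{
    if (bytes.size() % 2 != 0)
        throw std::runtime_error("odd number of bytes in UTF-16 input");

    auto byte_at = [&](std::size_t i) {
        return static_cast<std::uint32_t>(static_cast<unsigned char>(bytes[i]));
    };

    bool big_endian = false;
    std::size_t pos = 0;
    if (bytes.size() >= 2) {
        if (byte_at(0) == 0xFE && byte_at(1) == 0xFF) { big_endian = true; pos = 2; }
        else if (byte_at(0) == 0xFF && byte_at(1) == 0xFE) { pos = 2; }
    }

    // Assemble one 16-bit code unit from two bytes in the chosen order.
    auto unit_at = [&](std::size_t i) {
        return big_endian ? (byte_at(i) << 8) | byte_at(i + 1)
                          : (byte_at(i + 1) << 8) | byte_at(i);
    };

    std::vector<std::uint32_t> out;
    while (pos < bytes.size()) {
        std::uint32_t unit = unit_at(pos);
        pos += 2;
        if ((unit & 0xFC00) == 0xD800) {            // high surrogate
            if (pos >= bytes.size())
                throw std::runtime_error("truncated surrogate pair");
            std::uint32_t low = unit_at(pos);
            if ((low & 0xFC00) != 0xDC00)
                throw std::runtime_error("unpaired high surrogate");
            pos += 2;
            out.push_back(0x10000 + (((unit & 0x3FF) << 10) | (low & 0x3FF)));
        } else if ((unit & 0xFC00) == 0xDC00) {     // low surrogate on its own
            throw std::runtime_error("unpaired low surrogate");
        } else {
            out.push_back(unit);                    // ordinary BMP code unit
        }
    }
    return out;
}
```

The file would need to be read in binary mode (for instance into a std::string) before being passed in, so that no newline translation disturbs the two-byte units.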

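【Comments】:

Thanks. For now I am converting the files with iconv through a system call in the program. It seems to work. Not ideal, but I will learn about UTF-16 later...
Unicode is limited to what UTF-16 can handle. That decision was made because they don't expect those million code points to be used up within the next millennium.
Thanks to Mats and Bames53 for the very interesting answers and the great effort they put in.
Can someone explain why this answer was downvoted? If there's something wrong with it, I'd like to know...
@Mats Petersson, I just read and adapted your impressive solution. I found a bug in it. We can discuss it whenever is convenient for you.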

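【Answer 2】:

If I just convert all the files to UTF-8 from the command line (using a Linux command), am I going to lose information?

No, all UTF-16 data can be losslessly converted to UTF-8. This is probably the best thing to do.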
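
On the command line this is typically done with something like iconv -f UTF-16 -t UTF-8. If you would rather do the same conversion inside the program, the POSIX iconv(3) interface is one option. The following is only a sketch: the wrapper name utf16_to_utf8_iconv is made up here, and it assumes glibc (where iconv is part of libc) and input that begins with a byte order mark so that the "UTF-16" converter can determine the byte order.

```cpp
#include <iconv.h>

#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical wrapper: convert UTF-16 bytes to UTF-8 using POSIX iconv(3).
// Assumption: the input starts with a BOM that tells iconv the byte order.
std::string utf16_to_utf8_iconv(const std::string& input)
{
    iconv_t cd = iconv_open("UTF-8", "UTF-16");  // arguments are (to, from)
    if (cd == reinterpret_cast<iconv_t>(-1))
        throw std::runtime_error("iconv_open failed");

    std::string in = input;          // iconv needs a non-const char**
    char* in_ptr = &in[0];
    std::size_t in_left = in.size();

    std::string out;
    char buffer[4096];
    while (in_left > 0) {
        char* out_ptr = buffer;
        std::size_t out_left = sizeof buffer;
        std::size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
        int err = errno;             // save before anything else can change it
        out.append(buffer, sizeof buffer - out_left);
        // E2BIG only means the output buffer filled up: loop and continue.
        if (rc == static_cast<std::size_t>(-1) && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("invalid or truncated UTF-16 input");
        }
    }
    iconv_close(cd);
    return out;
}
```

On platforms where iconv lives in a separate library, linking with -liconv may be needed.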

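When wide characters were introduced, they were intended to be a text representation used exclusively inside a program, never written to disk as wide characters. The wide streams reflect this by converting the wide characters you write out to narrow characters in the output file, and by converting narrow characters in a file to wide characters in memory when reading.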

std::wofstream wout("output.txt");
wout << L"Hello"; // the output file will just be ASCII (assuming the platform uses ASCII).

std::wifstream win("ascii.txt");
std::wstring s;
win >> s; // the ascii in the file is converted to wide characters.

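Of course, the actual encoding depends on the codecvt facet in the stream's imbued locale, but what the stream does is use that facet to convert from wchar_t to char when writing and from char to wchar_t when reading.

However, since some people started writing files out in UTF-16, other people have had to deal with it. The way they do that with C++ streams is by creating codecvt facets that treat char as holding half a UTF-16 code unit, which is what codecvt_utf16 does.

So with that explanation, here are the problems with your code: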

std::wifstream file2(fileFullPath); // UTF-16 has to be read in binary mode
std::locale loc (std::locale(), new std::codecvt_utf16<char32_t>); // do you really want char32_t data? or do you want wchar_t?
std::cout.imbue(loc); // You're not even using cout, so why are you imbuing it?
// You need to imbue file2 here, not cout.
while (!file2.eof()) { // Aside from your UTF-16 question, this isn't the usual way to write a getline loop, and it doesn't behave quite correctly
    std::wstring line;
    std::getline(file2, line);
    std::wcout << line << std::endl; // wcout is not imbued with a locale that will correctly display the original UTF-16 data
}

Here is one way to rewrite the above:

// when reading UTF-16 you must use binary mode
std::wifstream file2(fileFullPath, std::ios::binary);

// ensure that wchar_t is large enough for UCS-4/UTF-32 (It is on Linux)
static_assert(WCHAR_MAX >= 0x10FFFF, "wchar_t not large enough");

// imbue file2 so that it will convert a UTF-16 file into wchar_t data.
// If the UTF-16 files are generated on Windows then you probably want to
// consume the BOM Windows uses
std::locale loc(
    std::locale(),
    new std::codecvt_utf16<wchar_t, 0x10FFFF, std::consume_header>);
file2.imbue(loc);

// imbue wcout so that wchar_t data printed will be converted to the system's
// encoding (which is probably UTF-8).
std::wcout.imbue(std::locale(""));

// Note that the above is doing something that one should not do, strictly
// speaking. The wchar_t data is in the wide encoding used by `codecvt_utf16`,
// UCS-4/UTF-32. This is not necessarily compatible with the wchar_t encoding
// used in other locales such as std::locale(""). Fortunately locales that use
// UTF-8 as the narrow encoding will generally also use UTF-32 as the wide
// encoding, coincidentally making this code work

std::wstring line;
while (std::getline(file2, line)) {
  std::wcout << line << std::endl;
}
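
If the file contents are already in memory, the same codecvt_utf16 facet can be used without a stream. The sketch below re-encodes UTF-16 bytes as UTF-8 through std::wstring_convert. The function name utf16_bytes_to_utf8 is hypothetical. Note the caveats: wstring_convert and <codecvt> are deprecated since C++17, and libstdc++ only ships <codecvt> from GCC 5 onwards, so this would not build with gcc 4.6.3.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: re-encode UTF-16 bytes as UTF-8 entirely in memory.
// Caveat: std::wstring_convert and <codecvt> are deprecated since C++17.
std::string utf16_bytes_to_utf8(const std::string& bytes)
{
    // Step 1: bytes -> UTF-32 code points. consume_header makes the facet
    // honour a leading BOM; without a BOM it reads big-endian by default.
    std::wstring_convert<
        std::codecvt_utf16<char32_t, 0x10FFFF, std::consume_header>,
        char32_t> from_utf16;
    std::u32string code_points = from_utf16.from_bytes(bytes);

    // Step 2: UTF-32 code points -> UTF-8 bytes.
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> to_utf8;
    return to_utf8.to_bytes(code_points);
}
```

Both conversions throw std::range_error on malformed input, so a caller would normally wrap the call in a try block.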

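【Comments】:

This is a very helpful answer. Great explanation, long, with complete code. Thank you so much. It shows me that I knew nothing about how this particular part of C++ works and what it does. Although I find it "geeky" and advanced, it is still very useful to know it exists, and I feel I need to take the time to study it, learn it and digest it. Thanks again. Much appreciated.
I find Unicode and encodings very interesting, which I suppose is fortunate, because it is hard to know how to handle them in C++ without understanding the details. Unless you are really doing more serious text processing, the simplest thing is to use UTF-8 everywhere.
I don't know whether you'll find it helpful, but here is an explanation of why wchar_t is not as useful as one might hope: stackoverflow.com/a/11107667/365496

【Answer 3】:

I have adapted, corrected and tested Mats Petersson's impressive solution.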

int utf16_to_utf32(std::vector<int> &coded)
{
    int t = coded[0];
    // Parenthesised because != binds tighter than &.
    if ((t & 0xFC00) != 0xD800)
    {
        return t;
    }
    int charcode = (coded[1] & 0x3FF); // | ((t & 0x3FF) << 10);
    charcode += 0x10000;
    return charcode;
}



#ifdef __cplusplus    // If used by C++ code,
extern "C" {          // we need to export the C interface
#endif
void convert_utf16_to_utf32(UTF16 *input,
                            size_t input_size,
                            UTF32 *output)
{
     const UTF16 * const end = input + 1 * input_size;
     while (input < end){
       const UTF16 uc = *input++;
       std::vector<int> vec; // endianness
       vec.push_back(U16_LEAD(uc) & 0xFF);
       printf("LEAD + %.4x\n",U16_LEAD(uc) & 0x00FF);
       vec.push_back(U16_TRAIL(uc) & 0xFF);
       printf("TRAIL + %.4x\n",U16_TRAIL(uc) & 0x00FF);
       *output++ = utf16_to_utf32(vec);
     }
}
#ifdef __cplusplus
}
#endif

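【Comments】:

Your "fix" is clearly not correct. I'm not saying that my code is right, but your fix is clearly wrong, since encoding 10 bits into 16 bits and then discarding the other 10 bits would be completely meaningless.
@Mats Petersson, I simply kept 16 bits in UTF-16 as you suggested, and it works fine. How should I correctly convert or marshal an array of the C++ struct CC_STR32 { wchar_t szString[32] ;} to a C# IntPtr or StringBuilder on Ubuntu Linux 15.10 with Mono version 4.2.1? Thanks.
@Mats Petersson, thanks for your comment. What I meant to ask is: how should I correctly convert or marshal an array of the C++ struct CC_STR32 { wchar_t szString[32] ;} to a C# array of IntPtr on Ubuntu Linux 15.10 with Mono version 4.2.1?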