Unicode to UTF-8 in C++

【Question title】: Unicode to UTF-8 in C++
【Posted】: 2012-07-21 03:35:48
【Question】:

I have searched a lot but could not find anything:

unsigned int unicodeChar = 0x5e9;
unsigned int utf8Char;
uni2utf8(unicodeChar, utf8Char);
assert(utf8Char == 0xd7a9);

Is there a library (preferably Boost) that implements something like uni2utf8?

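【Question comments】:

For the new C++11 Unicode string literals, see stackoverflow.com/questions/6796157/…
What you are asking for makes no sense and cannot work. There is no such thing as a UTF-8 character. There are UTF-8 code units, which are 8-bit values that, when decoded correctly, form Unicode code points. But UTF-8 code units are not stored in 32-bit unsigned ints. Each code unit is 8 bits in size, so the way to store a Unicode code point in UTF-8 is as a sequence of code units: a string, not an integer.
1. UTF-8 is Unicode. 2. Use nowide.
UTF-8 is not Unicode; UTF-8 is a way of representing numbers. Unicode, on the other hand, is a mapping between symbols and numbers: abstract numbers, not their representation.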
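
To make the point in the comments above concrete, here is a minimal hand-written sketch of the encoding step. The helper names `code_point_to_utf8` and `pack_big_endian` are hypothetical and do not come from any library; the sketch assumes the input is a single Unicode scalar value. It returns the code units as a string, and the second helper only exists to show how the 0xd7a9 integer in the question relates to the two bytes.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from any library): encodes one Unicode code point
// as a sequence of UTF-8 code units, stored in a std::string.
std::string code_point_to_utf8(std::uint32_t cp) {
    // Surrogates and values above U+10FFFF are not valid scalar values.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");

    std::string out;
    if (cp < 0x80) {                      // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {              // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                              // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: packs up to four code units into an integer,
// most significant byte first. Only meant to illustrate the question's
// 0xd7a9 value; a string is the natural container for UTF-8.
std::uint32_t pack_big_endian(const std::string& bytes) {
    std::uint32_t packed = 0;
    for (unsigned char b : bytes) packed = (packed << 8) | b;
    return packed;
}
```

With these helpers, `code_point_to_utf8(0x5e9)` yields the two bytes D7 A9, and packing them gives the integer the question's assert expects.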

Tags: c++ boost unicode utf-8

【Solution 1】:

Unicode conversion is part of C++11:

#include <codecvt>
#include <locale>
#include <string>
#include <cassert>

int main() {
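  // Note: std::wstring_convert and <codecvt> are deprecated since C++17.
  std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> convert;
  // Convert a single code point (U+05E9) to its UTF-8 byte sequence.
  std::string utf8 = convert.to_bytes(0x5e9);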
  assert(utf8.length() == 2);
  assert(utf8[0] == '\xD7');
  assert(utf8[1] == '\xA9');
}

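【Discussion】:

Is there a Boost equivalent? (For those who cannot write C++11.)
@Ezra Yes, there is Boost.Locale; I have added another answer for it.
You don't need codecvt_utf8. codecvt<char32_t,char,std::mbstate_t> converts between UTF-32 and UTF-8, and codecvt<char16_t,char,std::mbstate_t> converts between UTF-16 and UTF-8.
@bames53: I strongly suspect that only works when char itself is UTF-8, e.g. on Linux but not on Windows.
@bames53 Three reasons to choose codecvt_utf8 (at least with wstring_convert): 1. It contains the word utf8, so it is clearer to the reader what is going on. 2. It is shorter (fewer template arguments). 3. codecvt has a protected destructor, so it cannot be used as a drop-in replacement for codecvt_utf8. If you are using wstring_convert you need C++11 anyway, so feel free to use codecvt_utf8. I don't see much value in using codecvt here.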
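
For readers wondering what the protected-destructor point means in practice, the following sketch shows the usual workaround: wrapping the facet in a small derived class with a public destructor. The names `deletable_facet`, `utf32_facet` and `utf32_to_utf8` are made up for this illustration. Note the assumptions: `std::wstring_convert` is deprecated since C++17 and the `char32_t`/`char` specialization of `std::codecvt` is deprecated since C++20, so a compiler may emit warnings.

```cpp
#include <cassert>
#include <cwchar>   // std::mbstate_t
#include <locale>
#include <string>
#include <utility>

// Illustrative wrapper: std::codecvt has a protected destructor, but
// std::wstring_convert needs to delete the facet it owns, so the derived
// class exposes a public one.
template <class Facet>
struct deletable_facet : Facet {
    template <class... Args>
    deletable_facet(Args&&... args) : Facet(std::forward<Args>(args)...) {}
    ~deletable_facet() {}
};

// The standard facet that converts between UTF-32 and UTF-8.
using utf32_facet =
    deletable_facet<std::codecvt<char32_t, char, std::mbstate_t>>;

// Hypothetical helper: converts one code point to its UTF-8 bytes.
std::string utf32_to_utf8(char32_t cp) {
    std::wstring_convert<utf32_facet, char32_t> convert;
    return convert.to_bytes(cp);
}
```

The extra wrapper is the cost the last comment refers to; with codecvt_utf8 none of it is needed.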

【Solution 2】:

Boost.Locale also provides encoding conversion functions:

#include <boost/locale.hpp>
#include <cassert>
#include <string>

int main() {
  unsigned int point = 0x5e9;
  std::string utf8 = boost::locale::conv::utf_to_utf<char>(&point, &point + 1);
  assert(utf8.length() == 2);
  assert(utf8[0] == '\xD7');
  assert(utf8[1] == '\xA9');
}

【Discussion】:

【Solution 3】:

You may want to try the UTF8-CPP library. Encoding a Unicode character with it looks like this:

std::wstring unicodeChar(L"\u05e9");
std::string utf8Char;
encode_utf8(unicodeChar, utf8Char);

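std::string is used here only as a container for the UTF-8 bytes.

【Discussion】:

Doesn't this assume that your unicodeChar is encoded in UTF-32? As far as I know, "wide strings" in C and C++ have an unspecified, opaque "system encoding" that could be anything. You would first need to convert the wide string to UTF-32 with something like iconv.
@KerrekSB Do you think I am using raw C wide strings on their own, or in combination with a platform-specific std::wstring implementation?
@KerrekSB Did I forget to "cook" that raw wide string with std::wstring, which knows exactly how such strings should be handled on the current platform/compiler?
What do you think wstring is? It is just a container of wchar_ts, which you can initialize from a bog-standard wide string literal. Where is the "cooking"?
This code indeed won't work on Windows, where wchar_t is UCS-2/UTF-16 (at least 16 bits), so it cannot convert U+10000 to UTF-8.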
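
The last comment is about code points above U+FFFF, which UTF-16 stores as a surrogate pair of two 16-bit units. As a rough sketch of what a portable conversion has to do, the hypothetical function below (`utf16_to_utf8`, not part of UTF8-CPP or any other library) takes a `std::u16string` instead of a `std::wstring` so that the input encoding does not depend on the platform, joins surrogate pairs, and then encodes each code point as UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts UTF-16 code units to UTF-8 bytes,
// combining surrogate pairs into a single code point first.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: it must be followed by a low surrogate.
            if (i + 1 >= in.size() || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF)
                throw std::invalid_argument("unpaired high surrogate");
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;  // the low surrogate has been consumed
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::invalid_argument("unpaired low surrogate");
        }

        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Called with `u"\U00010000"`, the input holds the two units D800 DC00 and the result is the four bytes F0 90 80 80.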

【Solution 4】:

Use sprintf. (:

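// cstring must be a sufficiently large char buffer; the output uses the current locale's multibyte encoding (UTF-8 only under a UTF-8 locale).
sprintf(cstring, "%ls", unicodestring);  // %ls is the standard conversion for wchar_t strings; %S is non-standard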

【Discussion】: