How to use UTF-8 and Unicode in C++? How big is C++20 char8_t?

【Question title】: How to use UTF-8 and Unicode in C++? How big is C++20 char8_t?
【Posted】: 2020-09-10 15:08:14
【Question】:

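Suppose I want to store a single Unicode character in C++ (not in a std::string). How do I do that? char8_t was introduced in C++20, but it looks like it is just a typedef for unsigned char and can hold at most 1 byte of information. Some characters (especially more exotic ones such as emoji) can take up to 4 bytes each.

Example of code that doesn't work:

char8_t smth = "😀";

Interestingly, for the following declaration sizeof() says it is 8 bytes big, which I doubt:

const char* smth = "😀";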

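【Comments】:

char32_t can store any Unicode character. Try char32_t smth = U'😀'; (note the U prefix and the single quotes). If you want UTF-8 encoding, you need to store such a character as a string of 8-bit characters.
sizeof gives the size of the pointer, which can of course be 8 bytes. If you want the length of the string in bytes, use strlen. Be careful though: UTF-8 encodes U+0000 as a single nul byte, so a string with an embedded nul character will fool strlen.
const char* is a pointer, which is likely 8 bytes on a 64-bit machine, so that is correct.
@IgorTandetnik I thought the point of char8_t was to have a type that can represent every Unicode character using the UTF-8 encoding.
@Xrey274 That is not what char8_t is for. It represents a single UTF-8 code unit within the sequence of code units that encodes a single Unicode code point. (The same goes for char16_t with UTF-16 and char32_t with UTF-32.) In UTF-8, 😀 is encoded as 4 code units (bytes): F0 9F 98 80. You can get that number with const char8_t smth[] = u8"😀"; int size = sizeof(smth) - 1;. Some emoji need several code points and therefore take more than 4 bytes.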

Tags: c++ unicode utf-8 char c++20

【Solution 1】:

Unicode vs UTF-8 vs UTF-32 vs char8_t vs char32_t

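Unicode is a standard representation of characters in which each character is identified by an unsigned integer, its code point. Code points range from 0 to 0x10FFFF, so any of them fits in a 32-bit unsigned integer. By abuse of language we also say "Unicode" when talking about a code point. For example, the Unicode (code point) of 😀 is 0x1F600.

UTF-32 is the straightforward encoding of a Unicode code point on 4 bytes (32 bits). It is simple because you can store the code point directly as a 32-bit unsigned integer.

UTF-8 is an encoding of Unicode code points that stores each of them in 1 to 4 chunks of 8 bits. This is possible because Unicode code points do not use all 32 bits, so the most common characters (ASCII) can be represented on 1 byte (8 bits) and the less common ones on 2 to 4 bytes.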
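To make the 1 to 4 byte range concrete, here is a minimal sketch that checks the encoded length of a few sample characters at compile time. The helper name utf8_length is made up for this illustration, and the characters are written as universal character names so the result does not depend on how the source file is saved.

```cpp
#include <cstddef>

// Hypothetical helper: number of UTF-8 code units (bytes) in a u8 string
// literal, excluding the terminating 0x00. It works on the array type, so it
// must be given the literal itself (or an array), not a pointer.
template <typename Char, std::size_t N>
constexpr std::size_t utf8_length(Char const (&)[N]) noexcept
{
    return N - 1;
}

static_assert(utf8_length(u8"A") == 1);          // U+0041, ASCII
static_assert(utf8_length(u8"\u00D6") == 2);     // U+00D6, Latin capital O with diaeresis
static_assert(utf8_length(u8"\u20AC") == 3);     // U+20AC, euro sign
static_assert(utf8_length(u8"\U0001F600") == 4); // U+1F600, grinning face emoji
```

The template parameter Char is there only so the sketch accepts both char (u8 literals before C++20) and char8_t (C++20 and later).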

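char8_t is roughly an 8-bit unsigned integer. I say "roughly" for (at least) two reasons. First, the C++ standard gives it the same size as unsigned char, which is at least 8 bits but may be more if the compiler/system decides so. Second, it is a distinct type of its own and is not exactly the same as std::uint8_t (although converting from one to the other is trivial).

char32_t is similar to char8_t, except that it provides at least 32 bits (so it is roughly equivalent to std::uint32_t). This is convenient because you can use it to store one Unicode code point.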

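The case of char(8_t) const*

In C++ you should be careful with C strings (char(8_t) const*). They do not behave like an object but like a pointer, so querying their size returns the size of a pointer (8 on a 64-bit processor). It looks even sillier in the following code: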

char8_t const* str = u8"Hello";
sizeof(str); // == 8 (the size of the pointer)
sizeof(u8"Hello"); // == 6 (5 letters + trailing 0x00)

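Use the appropriate string literal

Be careful when using char (or char const* or std::string). It is not meant to store UTF-8 encoded strings but extended ASCII: an ordinary literal is encoded in whatever character set the compiler uses for char, which is not necessarily UTF-8. Your compiler therefore does not know what you are trying to write and may not do what you expect.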

char c0 = '😀';             // = '?' on Visual Studio (with 3 warnings)
char8_t c1 = u8'😀';        // Compilation error: trying to store 4 char8_t in 1
char32_t c2 = U'😀';        // = 😀 (or 128512)

char const* s0 = "😀";      // = "??" on Visual Studio (with 1 warning)
char8_t const* s1 = u8"😀"; // = "😀" stored on 4 bytes (0xf0, 0x9f, 0x98, 0x80), shown as "ðŸ˜€" if misread as Windows-1252
char32_t const* s2 = U"😀"; // = "😀" stored as the 4-byte unsigned integer 128512

sizeof("😀");               // = 3 on Visual Studio: consistent with the "??" above (2 bytes) + 1 byte for 0x00
sizeof(u8"😀");             // = 5: 4 bytes for 😀 + 1 byte for 0x00
sizeof(U"😀");              // = 8: 4 bytes for 😀 + 4 bytes for 0x00

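Storing a single Unicode character

As Igor said, storing 1 Unicode character can be done with char32_t. However, if you want to store the code itself (the integer), you can use a std::uint32_t. The two representations are different for the compiler and semantically, so be careful! Most of the time char32_t is more explicit and less error-prone.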

char32_t c = U'😀';
std::uint32_t u = 0x1F600u; // it's funny because 'u' stands for unsigned here..

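Storing a string of Unicode characters

If you want to store a string of Unicode characters, you have several options. The first thing to find out is what the constraints of your program are, which other systems it interacts with, and so on.

Using char32_t

If you constantly need to add/remove characters or inspect code points (for example to draw the characters on screen from a font) and, very importantly, if you have no strong memory constraint and do not interact with (old) libraries that store UTF-8 characters in ordinary strings, you can use the UTF-32 representation through char32_t: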

std::size_t size = sizeof(U"😀Ö"); // = 12 -> 4 bytes for each character including trailing 0x00

char32_t const* cString = U"😀Ö"; // sizeof(...) = 8 -> the size of a pointer

std::u32string string{ U"😀Ö" }; // .size() = 2

std::u32string_view stringView{ U"😀Ö" }; // .size() = 2
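The practical benefit of UTF-32 is that element i of the string is code point i, so indexing and slicing need no decoding. A small sketch of that property; the names text and storage_bytes are chosen for this illustration, and the characters are the same emoji and "Ö" written as universal character names:

```cpp
#include <cstddef>
#include <string_view>

// U+1F600 (grinning face) followed by U+00D6 (O with diaeresis).
constexpr std::u32string_view text = U"\U0001F600\u00D6";

// Hypothetical helper: memory used by the characters, without the 0x00.
constexpr std::size_t storage_bytes(std::u32string_view s) noexcept
{
    return s.size() * sizeof(char32_t);
}

static_assert(text.size() == 2);                          // 2 code points
static_assert(text[0] == U'\U0001F600');                  // direct access by index
static_assert(text[1] == U'\u00D6');
static_assert(text.substr(1) == U"\u00D6");               // slicing cannot split a character
static_assert(storage_bytes(text) == 2 * sizeof(char32_t));
```

The price is memory: every code point takes a full char32_t, including plain ASCII letters.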

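Using char8_t

If you are memory-constrained and cannot afford 32 bits for each code point (knowing that most of the time it will be an ASCII character, which UTF-8 encodes on only 8 bits), or if you need to interact with libraries that use, for example, char const*/std::string to store UTF-8 encoded characters, you can decide to store your strings UTF-8 encoded by using char8_t: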

std::size_t size = sizeof(u8"😀Ö");
// = 7 -> 4 bytes for the emoji (they are pretty uncommon so UTF-8 encodes them on 4 bytes)
//   + 2 bytes for the "Ö" (not as uncommon but not a -very common- ASCII)
//   + 1 byte for the trailing 0x00

char8_t const* cString = u8"😀Ö"; // sizeof(...) = 8 -> the size of a pointer

std::u8string string{ u8"😀Ö" }; // .size() = 6 (string's size method doesn't count the 0x00)

std::u8string_view stringView{ u8"😀Ö" }; // .size() = 6

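The trick with char8_t is that, technically, your computer does not know the data is UTF-8 encoded (well, your compiler does, and it encodes "😀Ö" for you). It only knows that you are storing 8-bit things that represent characters, which is why it does not return 2 when you ask for the size of these strings. If you need to know how many code points are represented (or how many characters you have to draw on screen), you need to decode the encoding. There are probably fancy libraries that can do it for you, but here is what I personally use (written from the UTF-8 specification):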

// How many char8_t of this string you need to read to get 1 Unicode. The trick here 
// is that it can be done using only the first char8_t of the string because of how
// UTF-8 encoding works. However this won't check for following bytes that could be
// erroneous.
constexpr std::size_t code_size(std::u8string_view a_string) noexcept
{
    auto const h0 = a_string[0] & 0b11110000;
    return h0 < 0b10000000 ? 1 : (h0 < 0b11100000 ? 2 : (h0 < 0b11110000 ? 3 : 4));
}

// How many char8_t you need to add to a string to encode this Unicode with UTF-8.
constexpr std::size_t code_size(char32_t const a_code) noexcept
{
    return a_code <= 0x007f ? 1 : (a_code <= 0x07ff ? 2 : (a_code <= 0xffff ? 3 : 4));
}

// How many Unicode code points are stored in this UTF-8 encoded string.
// Returns std::u8string_view::npos if the string ends in the middle of a character.
constexpr std::size_t string_size(std::u8string_view a_string) noexcept
{
    std::size_t size = 0;
    while (!a_string.empty())
    {
        auto const codeSize = code_size(a_string);
        if (codeSize > a_string.size())
        {
            return std::u8string_view::npos; // Error: this is not a valid UTF-8 encoded string.
        }
        ++size; // One more code point, whatever its encoded length.
        a_string = a_string.substr(codeSize);
    }
    return size;
}

// Append the UTF-8 encoding of a code to an u8string.
template<typename TAllocator>
constexpr std::size_t write(
    char32_t a_code,
    std::basic_string<char8_t, std::char_traits<char8_t>, TAllocator>& a_output) noexcept
{
    if (a_code <= 0x007f)
    {
        a_output += static_cast<char8_t>(a_code);
        return 1;
    }
    else if (a_code <= 0x07ff)
    {
        a_output += static_cast<char8_t>(0b11000000 | ((a_code >> 6) & 0b00011111));
        a_output += static_cast<char8_t>(0b10000000 | (a_code & 0b00111111));
        return 2;
    }
    else if (a_code <= 0xffff)
    {
        a_output += static_cast<char8_t>(0b11100000 | ((a_code >> 12) & 0b00001111));
        a_output += static_cast<char8_t>(0b10000000 | ((a_code >> 6) & 0b00111111));
        a_output += static_cast<char8_t>(0b10000000 | (a_code & 0b00111111));
        return 3;
    }
    else
    {
        a_output += static_cast<char8_t>(0b11110000 | ((a_code >> 18) & 0b00000111));
        a_output += static_cast<char8_t>(0b10000000 | ((a_code >> 12) & 0b00111111));
        a_output += static_cast<char8_t>(0b10000000 | ((a_code >> 6) & 0b00111111));
        a_output += static_cast<char8_t>(0b10000000 | (a_code & 0b00111111));
        return 4;
    }
}

// Read one code point from a UTF-8 encoded string view, removing the bytes read from the view.
constexpr char32_t read(std::u8string_view& a_string)
{
    if (a_string.empty())
    {
        return 0x0000; // Null character
    }

    auto const codeSize = code_size(a_string);
    if (codeSize > a_string.size())
    {
        return 0xffff; // Invalid unicode
    }

    char8_t mask0 = codeSize < 2 ?
        0b1111111 : (codeSize < 3 ? 0b11111 : (codeSize < 4 ? 0b1111 : 0b111));
    char32_t unicode = mask0 & a_string[0];
    a_string = a_string.substr(1);

    constexpr char8_t mask = 0b00111111;
    for (auto i = 1u; i < codeSize; ++i)
    {
        if ((a_string[0] & ~mask) != 0b10000000)
        {
            return 0xffff; // Invalid unicode
        }
        unicode = (unicode << 6) | (mask & a_string[0]);
        a_string = a_string.substr(1);
    }
    
    return unicode;
}

【Comments】: