How to solve this UTF-8 encoding C problem?

[Question title]: How to solve this UTF-8 encoding C problem?
[Posted]: 2020-07-23 02:40:58
[Question description]:

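We were given this problem in class, and I don't know how to solve it.

"The program below counts the number of characters in a file, assuming the file is encoded as ASCII. Modify the program so that it counts the number of characters in a file encoded as UTF-8."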

#include <stdbool.h>
#include <stdio.h>
typedef unsigned char BYTE;
int main(int argc, char *argv[])
{
    if (argc != 2)
    {
        printf("Usage: ./count INPUT\n");
        return 1;
    }
    FILE *file = fopen(argv[1], "r");
    if (!file)
    {
        printf("Could not open file.\n");
        return 1;
    }
    int count = 0;
    while (true)
    {
        BYTE b;
        fread(&b, 1, 1, file);
        if (feof(file))
        {
            break;
        }
        count++;
    }
    printf("Number of characters: %i\n", count);
}

Can anyone help me solve this?

[Comments]:

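Do you know how UTF-8 works? It looks easy enough to identify a start byte and then skip ahead to the next start byte.
UTF-8 was designed to make this trivial. There is a property that all continuation bytes (the ones you want to ignore) have in common, and that is found only in continuation bytes. What is it?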

Tags: c unicode byte ascii

[Solution 1]:

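UTF-8 was designed to make this trivial. There is a property that all continuation bytes (the ones you want to ignore) have in common, and that is found only in continuation bytes. What is it?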

First     Last      Number of
Code      Code      bytes in   Byte 1    Byte 2    Byte 3    Byte 4
Point     Point     encoding 
--------  --------  ---------  --------  --------  --------  --------
U+000000  U+00007F          1  0xxxxxxx
U+000080  U+0007FF          2  110xxxxx  10xxxxxx
U+000800  U+00FFFF          3  1110xxxx  10xxxxxx  10xxxxxx
U+010000  U+10FFFF          4  11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
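
As a rough illustration of how the Byte 1 column can be read, the sketch below maps a lead byte to the length of its sequence. The helper name utf8_sequence_length is hypothetical and not part of the original answer. It only matches the bit patterns listed in the table and does not reject overlong or out-of-range encodings.

```c
/* Hypothetical helper: returns the number of bytes in the sequence that
   starts with `lead`, based only on the bit patterns in the table.
   Returns 0 if `lead` cannot start a sequence (a continuation byte
   10xxxxxx, or a pattern such as 11111xxx that the table does not list). */
static int utf8_sequence_length(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1; /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx */
    return 0;
}
```

Each mask keeps just enough leading bits to tell one row of the table from the others.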

From there, it is just a matter of a little arithmetic. Bitwise AND can be used to isolate the bits you want to check, and C has an operator for that: &.
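
One possible way to apply this, sketched under the assumption that the input is valid UTF-8: every continuation byte has the form 10xxxxxx, so counting only the bytes that do not match that pattern gives the number of code points. The helper names is_continuation_byte and count_utf8_code_points are hypothetical and work on an in-memory buffer rather than a file.

```c
#include <stddef.h>

/* A continuation byte has the form 10xxxxxx. Masking with 0xC0 (11000000)
   keeps the top two bits, which must then equal 0x80 (10000000). */
static int is_continuation_byte(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: counts code points in a buffer by counting every
   byte that is NOT a continuation byte. Assumes valid UTF-8 input; no
   validation is performed. */
static size_t count_utf8_code_points(const unsigned char *buf, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++)
    {
        if (!is_continuation_byte(buf[i]))
        {
            count++;
        }
    }
    return count;
}
```

In the program from the question, the same check would amount to replacing the unconditional count++ with: if ((b & 0xC0) != 0x80) count++;. Note that this counts code points, which is presumably what the assignment means by "characters"; a user-perceived character made of several code points (for example a letter followed by a combining accent) would be counted more than once.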

[Comments]: