wchar_t variables only store half of an Urdu character in C

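【Question title】: wchar_t variables only store half of an Urdu character in C
【Posted】: 2016-11-19 22:51:21
【Question】:

I am trying to read and manipulate Urdu text from a file. However, it seems that a character is not read in whole into a wchar_t variable. Here is my code, which reads the text and prints each character on a new line: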

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

void main(int argc, char* argv[]) {
    setlocale(LC_ALL, "");
    printf("This program tests Urdu reading:\n");
    wchar_t c;
    FILE *f = fopen("urdu.txt", "r");
    while ((c = fgetwc(f)) != WEOF) {
        wprintf(L"%lc\n", c);
    }
    fclose(f);
}

Here is my sample text:

میرا نام ابراھیم ھے۔

میں وینڈربلٹ یونیورسٹی میں پڑھتا ھوں۔

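However, the number of characters printed seems to be twice the number of letters in the text. I understand that wide or multi-byte characters use multiple bytes, but I thought the wchar_t type would store all the bytes corresponding to one letter of the alphabet together.

How can I read the text so that a whole character is stored in a variable at any one time?

Details about my environment:
gcc: (x86_64-posix-seh-rev0, Built by MinGW-W64 project) 5.3.0
OS: Windows 10 64-bit
Text file encoding: UTF-8

This is what my text looks like in hexadecimal: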

d9 85 db 8c d8 b1 d8 a7 20 d9 86 d8 a7 d9 85 20 d8 a7 d8 a8 d8 b1 d8 a7 da be db 8c d9 85 20 da be db 92 db 94 ad 98 5d b8 cd ab a2 0d 98 8d b8 cd 98 6d a8 8d 8b 1d 8a 8d 98 4d 9b 92 0d b8 cd 98 8d 98 6d b8 cd 98 8d 8b 1d 8b 3d 9b 9d b8 c2 0d 98 5d b8 cd ab a2 0d 9b ed a9 1d ab ed 8a ad 8a 72 0d ab ed 98 8d ab ad b9 4a

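【Comments】:

You need to provide more details, such as the encoding of the text file, the compiler you are using, and the operating system
Possible duplicate of Light C Unicode Library
Not all types are created equal
@Amd That may be useful information, but it is certainly not a duplicate
To help understand what is happening, you could output the character code of each character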
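
The last suggestion can be sketched as follows. The helper name format_code is hypothetical and is not part of the question's code; it formats a value returned by fgetwc as U+XXXX so it can be printed instead of the character itself. This is only a diagnostic aid under that assumption, not a fix.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper (not from the question): writes the numeric value of
   c into out in the form U+XXXX, at least four hex digits.
   Returns the value returned by snprintf. */
int format_code(wint_t c, char *out, size_t outsz)
{
    return snprintf(out, outsz, "U+%04lX", (unsigned long)c);
}
```

Urdu letters lie in the Arabic block (U+0600 to U+06FF). If the values printed for the sample text fall outside that range, the file is probably not being decoded as UTF-8.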

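Tags: c file-io wchar non-latin

【Solution 1】:

UTF-8 is a Unicode encoding in which each character takes 1 to 4 bytes. I was able to store each Unicode character in a uint32_t (or u_int32_t on some UNIX platforms) variable. The library I used is (utf8.h | utf8.c). It provides some conversion and manipulation functions for UTF-8 strings.

So if a file is n bytes long in UTF-8, it contains at most n Unicode characters. This means I need 4*n bytes of memory (4 bytes per u_int32_t variable) to store the file's contents.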

#include <stdlib.h>   // calloc
#include "utf8.h"

// here read contents of file into a char* => buff
// keep count of # of bytes read => N

// N + 1 elements: at most N characters plus a terminating 0
u_int32_t *ubuff = calloc(N + 1, sizeof(u_int32_t));  // calloc initializes to 0
u8_toucs(ubuff, N + 1, buff, N);

// ubuff now is an array of 4-byte integers representing
// a Unicode character each
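
For readers without the library, the conversion step can be sketched in plain C. This is not the implementation from utf8.h; decode_utf8 is a hypothetical helper that turns 1 to 4 byte sequences into 32-bit code points. It assumes reasonably well-formed input, stops at the first invalid or truncated sequence, and does not reject overlong encodings or surrogates.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of utf8.h: decodes UTF-8 from src[0..srcsz)
   into dest, writing at most destsz code points. Returns the number of code
   points written. Decoding stops at the first invalid or truncated sequence. */
size_t decode_utf8(uint32_t *dest, size_t destsz,
                   const unsigned char *src, size_t srcsz)
{
    size_t i = 0, n = 0;
    while (i < srcsz && n < destsz) {
        unsigned char b = src[i];
        uint32_t cp;
        size_t extra;                     /* continuation bytes expected */
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else break;                       /* stray continuation or invalid lead byte */
        if (extra > srcsz - i - 1) break; /* sequence runs past the end of input */
        for (size_t k = 1; k <= extra; k++) {
            unsigned char t = src[i + k];
            if ((t & 0xC0) != 0x80) return n;  /* not a continuation byte */
            cp = (cp << 6) | (t & 0x3F);       /* append the low 6 bits */
        }
        dest[n++] = cp;
        i += extra + 1;
    }
    return n;
}
```

Applied to the first bytes of the hex dump in the question (d9 85 db 8c), this yields U+0645 and U+06CC, one array element per letter.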

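Of course, if multiple bytes represent a single character, the file may well contain fewer than n Unicode characters. That means the 4*n allocation is larger than necessary, and in that case the tail of ubuff will be 0 (the Unicode null character). So I simply scan the array and reallocate memory as needed: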

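u_int32_t *original = ubuff;
size_t sz = 0;
// count the characters before the first 0
while (*ubuff != 0) {
    ubuff++;
    sz++;
}
// shrink to sz characters plus the terminating 0
ubuff = realloc(original, sizeof(*original) * (sz + 1));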

Note: if you get a type error about u_int32_t, put typedef uint32_t u_int32_t; at the beginning of your code.
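
An alternative to allocating 4*n bytes and shrinking afterwards is to count the characters first. The sketch below is an addition, not part of the answer's library: utf8_count is a hypothetical helper that counts every byte that is not a continuation byte (bit pattern 10xxxxxx), which equals the number of characters as long as the input is well-formed UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper: returns the number of code points in the UTF-8 data
   s[0..n), assuming the data is well-formed. Every character contributes
   exactly one byte that is not of the form 10xxxxxx. */
size_t utf8_count(const unsigned char *s, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

For the first line of the question's sample text, which is 37 bytes in the hex dump, this gives 20, so 21 array elements (including the terminating 0) would be enough.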

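【Comments】:

【Solution 2】:

Windows support for Unicode is largely proprietary, and it is not possible to write portable software that uses UTF-8 and works on Windows with the Windows native libraries. If you are willing to consider a non-portable solution, here is one: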

#include <stdio.h>
#include <wchar.h>
#include <locale.h>
#include <fcntl.h>   // _O_U16TEXT
#include <io.h>      // _setmode

int main(int argc, char* argv[]) {
    setlocale(LC_ALL, "");

    // Next line is needed to output wchar_t data to the console. Note that 
    // Urdu characters are not supported by standard console fonts. You may
    // have to install appropriate fonts to see Urdu on the console.
    // Failing that, redirecting to a file and opening with a text editor
    // should show Urdu characters.

    _setmode(_fileno(stdout), _O_U16TEXT);

    // Mixing wide-character and narrow-character output to stdout is not
    // a good idea. Using wprintf throughout. (Not Windows-specific)

    wprintf(L"This program tests UTF-8 reading:\n");

    // WEOF is not guaranteed to fit into wchar_t. It is necessary
    // to use wint_t to keep a result of fgetwc, or to print with
    // %lc. (Not Windows-specific)

    wint_t c;

    // Next line has a non-standard parameter passed to fopen, ccs=...
    // This is a Windows way to support different file encodings.
    // There are no UTF-8 locales in Windows. 

    FILE *f = fopen("urdu.txt", "r,ccs=UTF-8");

    while ((c = fgetwc(f)) != WEOF) {
        wprintf(L"%lc", c);
    }
    fclose(f);
}

OTOH with glibc (e.g. using cygwin) none of these Windows extensions are needed, because glibc handles these things internally.
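
A minimal sketch of that case, assuming a system where a UTF-8 locale is available (locale names such as "en_US.UTF-8" or "C.UTF-8" vary between systems and are assumptions here). The function name count_wide_chars is hypothetical. It reads a stream opened with a plain fopen(path, "r") and counts the wide characters returned by fgetwc.

```c
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: reads wide characters from f until end of file or a
   read error and returns how many were read. The caller is assumed to have
   selected a UTF-8 locale with setlocale before opening the stream. */
size_t count_wide_chars(FILE *f)
{
    size_t count = 0;
    wint_t c;  /* wint_t rather than wchar_t, so WEOF can be represented */
    while ((c = fgetwc(f)) != WEOF) {
        count++;
    }
    return count;
}
```

With setlocale(LC_ALL, "") under a UTF-8 environment, each Urdu letter in the file should then count as one character. If fgetwc meets bytes it cannot decode in the current locale, it returns WEOF and the loop stops early.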

【Comments】:

What, there is no UTF-8 locale even in Windows 10?
@ddbug AFAICT Windows locales do not specify any encoding.