【Question title】: Printing a wchar_t as part of a wchar_t* string does not terminate
【Posted】: 2021-11-21 10:14:23
【Question description】:

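So, I found a bug in glibc that I would like to report. The problem is that printf() computes the wrong width for the grouping character in the no_NO.utf8 locale, so it does not leave enough padding on the left side of the string. I first noticed this in the shell utility printf, but it seems to originate in the underlying printf function in libc, which I have verified with a small test program.

I have not touched C since university, so I am a bit rusty when it comes to creating the test case. So far my only problem is that when this grouping character is used as part of a string (a wchar_t array), the string does not terminate, and I am not sure what I am doing wrong.

Here is the output of my little test driver: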

$ gcc printf-test.c && ./a.out 
Using locale nb_NO.utf8
<1 234> (length 7 according to strlen)
<1 234> (length -1 according to wcswidth)

Using locale en_US.utf8
<  1,234> (length 7 according to strlen)
<  1,234> (length 7 according to wcswidth)

Width of character e280af: -1

Width of s0  4: (ABCD)
Width of s1  4: (ABCD)
Width of s2 -1: (

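Clearly something goes wrong when the final string is printed, and it is related to how I try to print a string containing the multibyte grouping character used in the nb_NO locale.

Full source: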

#define _XOPEN_SOURCE       /* See feature_test_macros(7) */
#include <wchar.h>
#include <stdio.h>
#include <locale.h>
#include <string.h>


void print_num(char *locale){ 
    printf("Using locale %s", locale);
    setlocale(LC_NUMERIC, locale);
    char buf[40];
    sprintf(buf,"%'7d", 1234);
    printf("\n<%s> (length %d according to strlen)\n", buf, (int) strlen(buf));

    wchar_t wbuf[40];
    swprintf(wbuf, 40, L"%'7d", 1234); 
    int wide_width = wcswidth (wbuf, 40);
    printf("<%s> (length %d according to wcswidth)\n", buf, wide_width);
    puts("");
}

int main(){
    print_num("nb_NO.utf8");
    print_num("en_US.utf8");

    // just trying to understand
    wchar_t wc = (wchar_t) 0xe280af; // is this a correct way of specifying the char e2 80 af?
    int width = wcwidth (wc);
    printf("Width of character %x: %d\n", (int) wc, width);

    wchar_t s0[] = L"ABCD";
    wchar_t s1[] = {'A','B','C', 'D', '\0'};
    wchar_t s2[] = {'A',wc,'B', '\0'}; // something fishy
    int widthOfS0 = wcswidth (s0, 4);
    int widthOfS1 = wcswidth (s1, 4);
    int widthOfS2 = wcswidth (s2, 4);
    printf("\nWidth of s0  %d: (%ls)", widthOfS0, s0);
    printf("\nWidth of s1  %d: (%ls)", widthOfS1, s1);
    printf("\nWidth of s2 %d: (%ls)", widthOfS2, s2); // this does not terminate the string

    return 0;
}

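【Question discussion】:

Unrelated to your problem, but instead of casting the result of strlen, why not use the correct format specifier for size_t, %zu?
Also note that wcslen returns the length of the wide string, while wcswidth returns the number of columns. These may not be the same.
So what encoding is 0xe280af? Is it supposed to be UTF-8? Then you need an array of three bytes, otherwise the encoding will be wrong on a little-endian system (the bytes will be reversed).
@Someprogrammerdude I did not know about the size_t issue; my linter did not warn me :) Regarding wcslen versus wcswidth, that is a good point, but I am a bit lost as to what a column is. I could not find a reference, so I assumed that a column equals "one printable character on screen", which is what matters here, whereas the length (including non-printable characters) is not that interesting (other than for comparison).
Regarding the encoding of 0xe280af, I assumed it should be UTF-8, since nb_NO.utf8 is the locale I had set when the bug occurred.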

Tags: c wchar-t wchar multibyte-characters

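【Solution 1】:

Perhaps too obvious, but you need to use wprintf() to print wchar_t. Any string literal you write is terminated automatically, but an array that you fill character by character is not, so you have to add the terminator yourself. Also, a cast only changes the size and type the value is treated as, to make it "fit"; it does not convert the value from one encoding to another.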

#include <wchar.h>
#include <stdio.h>

#ifndef __STDC_ISO_10646__
    #pragma warning() // 16 bit wchar
#endif

int main(void){

    int ret;
    wchar_t W [] = {                  // 0x80AF
        U'\x42', (wchar_t)0x43, (wchar_t)0xE280AF 
    };

    // Use wprintf() for all output: a stream's orientation is fixed by its
    // first use, so printf() and wprintf() must not be mixed on stdout.
    wprintf(L"Num cast %X -> %X \n", 0xE280AFu, (unsigned)(wchar_t)0xE280AF);

    // Each array is terminated explicitly with a trailing 0.
    wchar_t S1[] = {'A', W[0], 'C',  0};
    wchar_t S2[] = {'A', 'B',  W[1], 0};
    wchar_t S3[] = {'A', W[2], 'C',  0};

    // wprintf() returns the number of wide characters written, or a negative value on error.
    ret = wprintf(L"wstr S1 -> (%ls)", S1);
    wprintf(L" / %i xchars printed \n", ret);

    ret = wprintf(L"wstr S2 -> (%ls)", S2); 
    wprintf(L" / %i xchars printed \n", ret);

    ret = wprintf(L"wstr S3 -> (%ls)", S3);
    wprintf(L" / %i xchars printed \n", ret);

    return 0;
}

【Discussion】:

Fifteen years after I last dabbled in C, nothing is obvious :)