Problem reading accented characters from a txt file (answers)

【Question title】: Problem reading accented characters from a txt file
【Posted】: 2021-09-09 05:58:25
【Question】:

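I have an assignment where I have to find the frequency of each character in a text file. The problem is that my first language is Spanish, so the text file has accented characters such as "á", and I have to count "á" as "a". My code is: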

#include <stdio.h>

int main(){
    int c;
    FILE *file;
    file = fopen("prueba.txt", "r");
    int letters[27] = {0}; // a-z in slots 0..25, ñ in slot 26
    if (file){
        while ((c=getc(file)) !=EOF  )
        {
            if( ((c-65) >=0 && (c-65) <= 25)){ // 'A'..'Z'
                letters[c-65]++;
            }
            else if( (c-97) >=0 && (c-97) <= 25){ // 'a'..'z'
                letters[c-97]++;
            }
            else if( c ==181 || c== 160){ //a
                letters[0]++;
            }
            else if( c == 130 || c== 144){//e 
                letters[4]++;
            }
            else if(c ==161 || c==214){//i
                letters[8]++;
            }
            else if(c == 162 || c ==224){//o
                letters[14]++;
            }
            else if(c ==163 || c == 233){//u
                letters[20]++;
            }
            else if( c==164 || c== 165){//ñ
                letters[26]++;
            }
        }
        fclose(file);
    }
}

But I found that my code reads "á" as a multi-byte character, so c takes three values, 195, 161, 10, instead of 160. What can I do?

【Comments】:

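Please don't use magic numbers! If by e.g. 65 you mean the ASCII-encoded value of 'A', then it is better to say 'A' explicitly (even though what you are doing isn't portable anyway).
Also note that ASCII is actually a seven-bit encoding, and that the "extended" characters (with values above 127) depend on the operating system and its settings.
The á character is encoded as UTF-8 in two bytes (195 and 161), which decode to the value 225. The third byte is just a newline. Converting UTF-8 to a decimal number is easy, but I don't know how you are supposed to find all the Unicode code points that are variants of a.
Maybe use a Unicode library to convert the text to NFD form, then look only at the base characters and ignore the combining characters?
Does this answer your question? How to Read/Write UTF8 text files in C?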

Tags: arrays c character-encoding text-files non-ascii-characters

【Solution 1】:

What can I do?

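Commenters have already noted that your text file is encoded in UTF-8 (rather than extended ASCII) and have provided links on how to read multi-byte characters. Now, in order to sum up the occurrences of each letter regardless of diacritics, we can make use of a locale in which the variants of a letter share the same collation position, for example a Spanish locale. Coincidentally, since you said your first language is Spanish, this may already be your environment, but you can also explicitly use es_ES.UTF-8 or similar. This way we can identify which letters belong together without the tedious task of searching code tables. Here is an accordingly modified version of your program: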

#include <stdio.h>
#include <wchar.h>
#include <wctype.h>
#include <locale.h>
#include <assert.h>
#include <string.h>
int main()
{
    // "": locale from environment; "es_ES.UTF-8": specific locale
    if (!setlocale(LC_ALL, "")) return 1;
    FILE *file = fopen("prueba.txt", "r");
    if (!file) return perror("prueba.txt"), 1;
    // the letters you want to count:
    wchar_t alphabet[] = L"abcdefghijklmnopqrstuvwxyzñ";
    size_t n = sizeof alphabet / sizeof *alphabet;  // includes the terminating L'\0'
    int letters[n]; // the letter counters; the last one collects all other characters
    memset(letters, 0, sizeof letters);
    wchar_t collate[n];     // representatives for the letters
    // call wcsxfrm outside of assert so it still runs when NDEBUG is defined
    size_t len = wcsxfrm(collate, alphabet, n);
    assert(len < n);  // to be sure
    (void)len;
    wint_t c;
    while (c = towlower(getwc(file)), c != WEOF)
    {
        wchar_t s[2];  // representative of the current character
        wcsxfrm(s, (wchar_t [2]){c}, 2);
        // find letter, otherwise last element
        ++letters[wcscspn(collate, s)];
    }
    fclose(file);
    long t = 0;
    for (size_t i = 0; i < n; ++i)    // print the counters
        t += letters[i],
        printf("%lc: %d\n", (wint_t)alphabet[i], letters[i]);
    printf("total: %ld\n", t);
}

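It prints the count of every letter, the count of all other characters, and the total count. Note that if there are multi-byte characters in the file, the total is smaller than the file length in bytes.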

【Comments】: