UTF8 string to int: answers

【Question title】: UTF8 string to int
【Posted】: 2016-07-03 18:31:03
【Question description】:

I have been struggling to extract an int from a UTF8 file:

#include <iostream>
#include <fstream>
#include <sstream>

using namespace std;

int main()
{
    ifstream file("UTF8.txt");
    if(file.is_open())
    {
        string line;
        getline(file, line);
        istringstream ss(line);
        int a;
        ss >> a;
        if(ss.fail())
        {
            cout << "Error parsing" << endl;
            ss.clear();
        }
        getline(file, line);
        cout << a << endl << line << endl;
        file.close();
    }
}

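The file contains 2 lines, "42" and "è_é", and was saved as UTF8 in Notepad. The code above works when the file is saved as ANSI, but fails when it is saved as Unicode. I have tried many things, the most promising being setting the locale, but I want the program to be independent of the computer's locale (i.e. it should read Chinese characters even on a US PC). Honestly, I am out of ideas. If possible, I would like to avoid using CStrings from Qt.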

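Update

The following prints "0" and "Error parsing", because there is a strange character at the very beginning of the file. Putting an empty line before the number, and discarding it when reading, makes it work, but I cannot change the file in the final program. Accented characters are not displayed correctly in the console, but everything is fine when I write the output to a file, which is all I need. So the only remaining problem is that character at the start of the file!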

#include <fstream>
#include <iostream>
#include <string>
#include <locale>
#include <codecvt>
#include <sstream>

int main()
{
    std::ifstream file("UTF8.srt");
    file.imbue(std::locale(file.getloc(),
        new std::codecvt_utf8<wchar_t,0x10ffff,std::consume_header>));
    if (file.is_open()) {
        std::string line;
        std::getline(file,line);
        std::istringstream ss{line};
        int a;
        ss >> a;
        if (ss.fail()) {
            std::cout << "Error parsing" << std::endl;
            ss.clear();
        }
        getline(file,line);
        std::cout << a << std::endl << line << std::endl;
        file.close();
    }
}
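
One way to confirm what the "strange character" is, without a hex editor, is to dump the first few bytes of the file. The following is only a minimal sketch: `first_bytes_hex` and `starts_with_utf8_bom` are hypothetical helper names, not part of the original post. A file saved as UTF8 by Notepad is expected to start with the bytes EF BB BF.

```cpp
#include <cstddef>
#include <fstream>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: returns the first `count` bytes of a file as
// lowercase hex text separated by spaces, e.g. "ef bb bf 34 32".
// Returns an empty string if the file cannot be opened or is empty.
std::string first_bytes_hex(const std::string& path, std::size_t count = 8)
{
    std::ifstream in(path, std::ios::binary);
    std::ostringstream out;
    char c;
    for (std::size_t i = 0; i < count && in.get(c); ++i) {
        if (i != 0) out << ' ';
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        out << std::hex << std::setw(2) << std::setfill('0')
            << static_cast<unsigned>(static_cast<unsigned char>(c));
    }
    return out.str();
}

// Hypothetical helper: true if the file starts with the UTF-8 BOM (EF BB BF).
bool starts_with_utf8_bom(const std::string& path)
{
    return first_bytes_hex(path, 3) == "ef bb bf";
}
```

Calling `first_bytes_hex("UTF8.txt")` on the file from the question and printing the result should show whether those three bytes precede the digits.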

Solution

The following works, with an input file that contains:

5
bla bla é_è

6
truc è_é

Code:

#include <cstddef>   // std::size_t
#include <iostream>
#include <fstream>
#include <sstream>
#include <string>

// Do not get used to it:
// using namespace std;

inline const char* skip_utf8_bom(const char* s, std::size_t size)
{
    if(3 <= size && s[0] == char(0xEF) && s[1] == char(0xBB) && s[2] == char(0xBF))
        s += 3;
    return s;
}

int main()
{
    std::ifstream file("UTF8.txt");
    std::ofstream fileO("UTF8_copy.txt");
    if(!file || !fileO) {
        std::cout << "Error opening files" << std::endl;
    }
    else {
        std::string line;

        //Parse the first number
        std::getline(file, line);
        {
            const char* linePtr = skip_utf8_bom(line.c_str(), line.size());
            std::istringstream input(linePtr);
            int a = -1;
            input >> a;
            if( ! input) {
                std::cout << "Error parsing" << std::endl;
            }
            std::cout << "Number 1: " << a << std::endl;
            fileO << a << std::endl;
        }

        //Copy the following line as is
        std::getline(file, line);
        fileO << line << std::endl;

        //Discard empty line, copy it in the output file
        std::getline(file, line);
        fileO << std::endl;

        //Parse the second number
        std::getline(file, line);
        {
            const char* linePtr = skip_utf8_bom(line.c_str(), line.size());
            std::istringstream input(linePtr);
            int a = -1;
            input >> a;
            if( ! input) {
                std::cout << "Error parsing" << std::endl;
            }
            std::cout << "Number 2: " << a << std::endl;
            fileO << a << std::endl;
        }

        //Copy the following line as is
        std::getline(file, line);
        fileO << line << std::endl;

        file.close();
        fileO.close();
    }

    return 0;
}

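【Question comments】:

Open the file in a hex editor: it probably has a UTF8 BOM.
What do you mean? What is a BOM?
Byte order mark: en.wikipedia.org/wiki/Byte_order_mark
Check out the notepad++ text editor (it can easily check and convert to any format, a great editor overall); UTF-8 text files on Windows tend to contain a BOM, as @MisterMystère said.
To fix the parsing error in the updated code, change ifstream to wifstream, string to wstring, and istringstream to wistringstream.

Tags: c++ stl utf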

【Solution 1】:

Read the file with std::codecvt_mode.

The example from the std::codecvt_mode reference page:

#include <fstream>
#include <iostream>
#include <string>
#include <locale>
#include <codecvt>

int main()
{
    // UTF-8 data with BOM
    std::ofstream("text.txt") << u8"\ufeffz\u6c34\U0001d10b";
    // read the UTF8 file, skipping the BOM
    std::wifstream fin("text.txt");
    fin.imbue(std::locale(fin.getloc(),
                          new std::codecvt_utf8<wchar_t, 0x10ffff, std::consume_header>));
    for (wchar_t c; fin.get(c); )
        std::cout << std::hex << std::showbase << c << '\n';
}

Note the std::consume_header setting.

Adapted to your problem, this could be:

#include <fstream>
#include <iostream>
#include <string>
#include <locale>
#include <codecvt>
#include <sstream>

int main()
{
    std::ifstream file("UTF8.txt");
    file.imbue(std::locale(file.getloc(),
        new std::codecvt_utf8<char,0x10ffff,std::consume_header>));
    if (file.is_open()) {
        std::string line;
        std::getline(file,line);
        std::istringstream ss{line};
        int a;
        ss >> a;
        if (ss.fail()) {
            std::cout << "Error parsing" << std::endl;
            ss.clear();
        }
        getline(file,line);
        std::cout << a << std::endl << line << std::endl;
        file.close();
    }
}

Or with wchar_t:

#include <fstream>
#include <iostream>
#include <string>
#include <locale>
#include <codecvt>
#include <sstream>

int main()
{
    std::wifstream file("UTF8.txt");
    file.imbue(std::locale(file.getloc(),
        new std::codecvt_utf8<wchar_t,0x10ffff,std::consume_header>));
    if (file.is_open()) {
        std::wstring line;
        std::getline(file,line);
        std::wistringstream ss{line};
        int a;
        ss >> a;
        if (ss.fail()) {
            std::wcout << L"Error parsing" << std::endl;
            ss.clear();
        }
        std::getline(file,line);
        std::wcout << a << std::endl << line << std::endl;
        file.close();
    }
}

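【Comments】:

Thanks, but I can't compile it; it says codecvt: no such file or directory. I'm using CodeBlocks and MinGW on 64-bit Windows.
@MisterMystère Then your compiler/MinGW is misconfigured.
Interesting, I didn't know about this option. What I find puzzling is that you open the file as a wchar_t stream. Doesn't it work the same with char (keeping in mind that UTF-8 everywhere is recommended)?
@KonradRudolph: Everything else works, it's only codecvt. I added the C++11 flag. I've been stuck on this for 2 days and am really starting to get desperate.
@MisterMystère Good point, it looks like MinGW is outdated; the header was only recently added to libstdc++: stackoverflow.com/questions/15615136/… You may have to switch to MinGW-64 or Cygwin, since MinGW seems to be severely understaffed. As far as I know, they won't ship a recent GCC in the foreseeable future.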

【Solution 2】:

Simply skip the leading BOM (byte order mark):

#include <cstddef>   // std::size_t
#include <iostream>
#include <fstream>
#include <sstream>
#include <string>

// Do not get used to it:
// using namespace std;

inline const char* skip_utf8_bom(const char* s, std::size_t size)
{
    if(3 <= size && s[0] == char(0xEF) && s[1] == char(0xBB) && s[2] == char(0xBF))
        s += 3;
    return s;
}


int main()
{
    std::istringstream file(u8"\xEF\xBB\xBF""42\n\u00E8_\u00E9\n");
    std::string line;
    getline(file, line);
    const char* linePtr = skip_utf8_bom(line.c_str(), line.size());
    std::istringstream input(linePtr);
    int a = -1;
    input >> a;
    if( ! input) {
        std::cout << "Error parsing" << std::endl;
    }
    getline(file, line);
    std::cout << a << std::endl << line << std::endl;
}
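
A possible variation, sketched here under the assumption that the input stream is seekable (a file or string stream), is to consume the BOM on the stream itself before the first getline, so each line needs no pointer adjustment afterwards. The names `consume_utf8_bom` and `read_first_int` are hypothetical and do not come from the answer above.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: consumes a leading UTF-8 BOM (EF BB BF) if present.
// If the stream does not start with a BOM, it is rewound to where it was.
// Assumes the stream supports tellg/seekg. Returns true if a BOM was skipped.
bool consume_utf8_bom(std::istream& in)
{
    const std::istream::pos_type start = in.tellg();
    char bom[3] = {0, 0, 0};
    in.read(bom, 3);
    if (in.gcount() == 3 &&
        bom[0] == '\xEF' && bom[1] == '\xBB' && bom[2] == '\xBF') {
        return true;
    }
    in.clear();       // a short read sets eofbit/failbit, so reset the state
    in.seekg(start);  // put back whatever was read
    return false;
}

// Hypothetical helper: skips an optional BOM, reads the first line and
// parses it as an int. Returns false if reading or parsing fails.
bool read_first_int(std::istream& in, int& value)
{
    consume_utf8_bom(in);
    std::string line;
    if (!std::getline(in, line)) return false;
    std::istringstream parser(line);
    return static_cast<bool>(parser >> value);
}
```

With a std::ifstream opened on the file from the question, `read_first_int(file, a)` would be called once, and the remaining lines can then be read with std::getline as before.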

【Comments】:

Perfect, it works (I've updated my post)! Thank you very much!