Buffer size for reading a UTF-8-encoded file using ICU (ICU4C)

【Question title】: Buffer size for reading a UTF-8-encoded file using ICU (ICU4C)
【Posted】: 2013-07-07 11:35:54
【Question】:

I am trying to read a UTF-8-encoded file using ICU4C on Windows with msvc11. I need to determine the size of the buffer in order to build a UnicodeString. Since the ICU4C API has no fseek-like function, I thought I could use the underlying C file:

#include <unicode/ustdio.h>
#include <stdio.h>
/*...*/
UFILE *in = u_fopen("utfICUfseek.txt", "r", NULL, "UTF-8");
FILE* inFile = u_fgetfile(in);
fseek(inFile,  0, SEEK_END); /* Access violation here */
int size = ftell(inFile);
auto uChArr = new UChar[size];

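There are two problems with this code:

For some reason it "throws" an access violation at the fseek() line (Unhandled exception at 0x000007FC5451AB00 (ntdll.dll) in test.exe: 0xC0000005: Access violation writing location 0x0000000000000024.)
The size returned by ftell is not the size I want, because UTF-8 can use up to 4 bytes per code point (the u8"tю" string has a length of 3).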
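To make the second point concrete, here is a minimal sketch that counts how many UTF-16 code units a UTF-8 byte sequence needs. The function name utf16_units_for_utf8 is hypothetical (it is not part of ICU), and the sketch assumes well-formed UTF-8. Every UTF-8 sequence of 1 to 3 bytes maps to one UTF-16 code unit and every 4-byte sequence maps to two, so the byte count reported by ftell is always a safe upper bound for the number of UChar elements, even though it is usually larger than needed.

```cpp
#include <cstddef>
#include <string>

// Counts the UTF-16 code units needed to hold a UTF-8 byte sequence.
// Assumption: the input is well-formed UTF-8; no validation is performed.
std::size_t utf16_units_for_utf8(const std::string& utf8)
{
    std::size_t units = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) == 0x80) {
            continue; // continuation byte: already accounted for by its lead byte
        }
        // A lead byte of the form 11110xxx starts a 4-byte sequence (a code point
        // above U+FFFF), which needs a surrogate pair. Every other sequence fits
        // in a single UTF-16 code unit.
        units += ((byte & 0xF8) == 0xF0) ? 2 : 1;
    }
    return units;
}
```

For the "tю" example, the input is 3 bytes long and the function returns 2.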

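So the questions are:

How do I determine the buffer size for a UnicodeString if I know the input file is UTF-8 encoded?
Is there a portable way to use iostream/fstream to read and write ICU's UnicodeStrings?

Edit: here is a possible solution based on the first answer and the C++11 standard (tested on msvc11 and gcc 4.8.1). A few things from ISO IEC 14882 2011:

"The fundamental storage unit in the C++ memory model is the byte. A byte is at least large enough to contain any member of the basic execution character set (2.3) and the eight-bit code units of the Unicode UTF-8 encoding form..."
"The basic source character set consists of 96 characters...", - 7 bits are already needed
"The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set..."
"Objects declared as characters (char) shall be large enough to store any member of the implementation's basic character set."

So, to stay portable across platforms where the implementation-defined size of char is 1 byte = 8 bits (I don't know of any where this isn't true), we can read the Unicode characters into chars using an unformatted input operation: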

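std::ifstream is;
is.open("utfICUfSeek.txt", std::ios::binary); // binary mode: no newline translation, so tellg() matches the bytes read
is.seekg(0, is.end);
std::streamsize strSize = is.tellg();
auto inputCStr = new char[strSize + 1];
inputCStr[strSize] = '\0'; // add a null character at the end
is.seekg(0, is.beg);
is.read(inputCStr, strSize);
UnicodeString uStr = UnicodeString::fromUTF8(inputCStr);
delete[] inputCStr; // fromUTF8 copies the data, so the temporary buffer can be freed
is.close();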

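What bothers me is that I have to create an additional buffer for chars and only then convert them to the required UnicodeString.

【Comments】:

Tags: c++ unicode c++11 fstream icu

【Answer 1】:

Here is an alternative to using ICU.

Using the standard std::fstream, you can read the whole file, or part of it, into a standard std::string and then iterate over it with a Unicode-aware iterator. http://code.google.com/p/utf-iter/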

std::string get_file_contents(const char *filename)
{
    std::ifstream in(filename, std::ios::in | std::ios::binary);
    if (in)
    {
        std::string contents;
        in.seekg(0, std::ios::end);
        contents.reserve(in.tellg());
        in.seekg(0, std::ios::beg);
        contents.assign((std::istreambuf_iterator<char>(in)), std::istreambuf_iterator<char>());
        in.close();
        return(contents);
    }
    throw(errno);
}

Then in your code

std::string myString = get_file_contents( "foobar" );
unicode::iterator< std::string, unicode::utf8 /* or utf16/32 */ > iter = myString.begin();

while ( iter != myString.end() )
{
    ...
    ++iter;
}

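【Comments】:

Thanks for your answer. get_file_contents is what I was looking for, but I don't know which is faster: the assign(...) function, which has linear complexity (en.cppreference.com/w/cpp/string/basic_string/assign, overload (7)), or the read function given the result of tellg() (see the edit). The iterator solution is interesting and I will explore the resource, but I will probably also need ICU's collation and locales, so I may not be able to drop the library.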
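For comparison, the read-based variant mentioned in this comment could look roughly like the sketch below. The function name read_file_bytes is hypothetical, and no claim is made here about which variant is faster; that would need to be measured. The sketch sizes the string from tellg() and fills it with a single read() call, which avoids the manual new[]/delete[] buffer. The resulting std::string can then, presumably, be passed on to UnicodeString::fromUTF8.

```cpp
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>

// Reads a whole file into a std::string with one read() call.
// The file is opened in binary mode so that the size reported by tellg()
// matches the number of bytes actually read.
std::string read_file_bytes(const char* filename)
{
    std::ifstream in(filename, std::ios::in | std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open file");
    }
    in.seekg(0, std::ios::end);
    const std::streamoff size = in.tellg();
    if (size < 0) {
        throw std::runtime_error("cannot determine file size");
    }
    in.seekg(0, std::ios::beg);

    std::string contents(static_cast<std::size_t>(size), '\0');
    if (size > 0) {
        in.read(&contents[0], size);
        // Keep only the bytes that were really read, in case the read was short.
        contents.resize(static_cast<std::size_t>(in.gcount()));
    }
    return contents;
}
```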

【Answer 2】:

Well, either you want to read the whole file in one go for some kind of post-processing, in which case icu::UnicodeString is not really the best container...

#include <iostream>
#include <fstream>
#include <sstream>

int main()
{
    std::ifstream in( "utfICUfSeek.txt" );
    std::stringstream buffer;
    buffer << in.rdbuf();
    in.close();
    // ...
    return 0;
}

...or what you really want is to read into an icu::UnicodeString just as you would into any other string object, but you took the long way around...

#include <iostream>
#include <fstream>

#include <unicode/unistr.h>
#include <unicode/ustream.h>

int main()
{
    std::ifstream in( "utfICUfSeek.txt" );
    icu::UnicodeString uStr;
    in >> uStr;
    // ...
    in.close();
    return 0;
}

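...or I completely missed what your question was really about. ;)

【Comments】:

This is out of date. I guess the point is to avoid creating a separate buffer for chars, because you end up with both a UTF-8 string (in that char array) and a UTF-16 string (internally, in a UnicodeString) in RAM. So one solution is to implement a function that reads UTF-8 code units and infers code points from them one at a time. It converts each group of UTF-8 code units (forming one valid code point) into UTF-16 code units and "loads" them into a UnicodeString, then tries to read the next group of code units sufficient to form a code point, and so on.
This latter suggestion needs at most 4 bytes of RAM for the UTF-8 code units and at most TotalCodePoints * 4 bytes for the UTF-16 buffer. But you would have to load the UChar buffer (UChar is just a typedef for a 16-bit integer) into a UnicodeString without copying the UTF-16 string (this tinyurl.com/o2hxfd3 would be suitable if it frees the buffer once the string is deleted).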