Inserting a narrow character string into std::basic_ostream<wchar_t>: answers

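【Question title】: Inserting narrow character string to std::basic_ostream<wchar_t>
【Posted】: 2015-12-30 17:05:36
【Question】:

According to cppref, std::basic_ostream<wchar_t> has an operator << overload that accepts const char*. It seems that the conversion simply widens each char into a wchar_t. That is, the number of wide characters converted (inserted) equals the number of narrow characters. So here comes the problem. The narrow string might be encoding international characters, say Chinese characters in GB2312. Further assume that sizeof(wchar_t) is 2 and that UTF-16 encoding is used. How is this naive character-by-character conversion supposed to work then?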

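【Comments】:

I'd say it doesn't work. If you need to convert between different encodings and character widths, you should look at a library that handles it, such as ICU.
@JoachimPileborg Then how does wide character logging work in Boost.Log? Please see boost.org/doc/libs/1_59_0/libs/log/doc/html/log/tutorial/…
I can't say anything about Boost.Log, but it probably just does a proper conversion somewhere?
@JoachimPileborg I don't think so. It just imbues a customized locale. Look for the operator << overload for severity_level in the linked page.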

Tags: c++ internationalization locale c++-standard-library

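【Answer 1】:

I have just checked in Visual Studio 2015 and you are right. The chars are only widened to wchar_ts, without any conversion. It seems to me that you have to convert the narrow string to a wide string yourself. There are several ways to do that, and some have already been suggested.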

Here I suggest doing it with pure C++ facilities, assuming that your C++ compiler and standard library are complete enough (Visual Studio, or GCC on Linux (and only there)):

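#include <cstddef>
#include <cstring>
#include <cwchar>
#include <locale>
#include <string>
#include <vector>

void clear_mbstate (std::mbstate_t & mbs);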

void
towstring_internal (std::wstring & outstr, const char * src, std::size_t size,
    std::locale const & loc)
{
    if (size == 0)
    {
        outstr.clear ();
        return;
    }

    typedef std::codecvt<wchar_t, char, std::mbstate_t> CodeCvt;
    const CodeCvt & cdcvt = std::use_facet<CodeCvt>(loc);
    std::mbstate_t state;
    clear_mbstate (state);

    char const * from_first = src;
    std::size_t const from_size = size;
    char const * const from_last = from_first + from_size;
    char const * from_next = from_first;

    std::vector<wchar_t> dest (from_size);

    wchar_t * to_first = &dest.front ();
    std::size_t to_size = dest.size ();
    wchar_t * to_last = to_first + to_size;
    wchar_t * to_next = to_first;

    CodeCvt::result result;
    std::size_t converted = 0;
    while (true)
    {
        result = cdcvt.in (
            state, from_first, from_last,
            from_next, to_first, to_last,
            to_next);
        // XXX: Even if only half of the input has been converted the
        // in() method returns CodeCvt::ok. I think it should return
        // CodeCvt::partial.
        if ((result == CodeCvt::partial || result == CodeCvt::ok)
            && from_next != from_last)
        {
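            // Record the output offset before resize() invalidates the
            // pointers, then resume where the previous call stopped
            // instead of starting over from the beginning.
            converted = to_next - &dest.front ();
            to_size = dest.size () * 2;
            dest.resize (to_size);
            to_last = &dest.front () + to_size;
            to_next = &dest.front () + converted;
            to_first = to_next;
            from_first = from_next;
            continue;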
        }
        else if (result == CodeCvt::ok && from_next == from_last)
            break;
        else if (result == CodeCvt::error
            && to_next != to_last && from_next != from_last)
        {
            clear_mbstate (state);
            ++from_next;
            from_first = from_next;
            *to_next = L'?';
            ++to_next;
            to_first = to_next;
        }
        else
            break;
    }
    converted = to_next - &dest[0];

    outstr.assign (dest.begin (), dest.begin () + converted);
}

void
clear_mbstate (std::mbstate_t & mbs)
{
    // Initialize/clear mbstate_t type.
    // XXX: This is just a hack that works. The shape of mbstate_t varies
    // from single unsigned to char[128]. Without some sort of initialization
    // the codecvt::in/out methods randomly fail because the initial state is
    // random/invalid.
    std::memset (&mbs, 0, sizeof (std::mbstate_t));
}

This function is part of the log4cplus library and it works. It uses the codecvt facet to do the conversion. You have to give it an appropriately set up locale.

Visual Studio might have trouble giving you a properly set up GB2312 locale. You might have to use _setmbcp() to make it work. See "double byte character sequence conversion issue in Visual Studio 2015" for details.

【Comments】: