Convert a char array of escaped UTF-8 octets to a string in C++: answers

【Question title】: Convert a char array of escaped UTF-8 octets to a string in C++
【Posted】: 2018-03-02 06:24:05
【Question】:

I have a char array that contains some UTF-8 encoded Turkish characters, in the form of escaped octets. So if I run this code in C++11:

void foo(char* utf8_encoded) { 

    cout << utf8_encoded << endl;

}

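It prints \xc4\xb0-\xc3\x87-\xc3\x9c-\xc4\x9e. I want to convert this char[] to a std::string so that it contains the UTF-8 decoded values İ-Ç-Ü-Ğ. I have converted the char[] to a wstring, but it still prints as \xc4\xb0-\xc3\x87-\xc3\x9c-\xc4\x9e. How can I do that?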

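EDIT: I am not the one constructing this char[]. It is one of the fixed-length parameters of a callback function that is called by a private library. The callback function is as follows: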

void some_callback_function (INFO *info) { 

    cout << info->some_char_array << endl;
    cout << "*****" << endl;

    for(int i=0; i<64; i++) {
        cout << "-" << info->some_char_array[i];
    }
    cout << "*****" << endl;

    char bar[65] = "\xc4\xb0-\xc3\x87-\xc3\x9c-\xc4\x9e";
    cout << bar << endl;
}

where the INFO struct is:

typedef struct {
    char some_char_array[65];
} INFO;

So when my callback function is called, the output is as follows:

\xc4\xb0-\xc3\x87-\xc3\x9c-\xc4\x9e
*****
-\-x-c-4-\-x-b-0---\-x-c-3-\-x-8-7---\-x-c-3-\-x-9-c---\-x-c-4-\-x-9-e-----------------------------
*****
İ-Ç-Ü-Ğ

So my current question is that I do not understand the difference between the info->some_char_array and bar char arrays. What I want is to edit info->some_char_array so that it prints İ-Ç-Ü-Ğ.

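【Comments】:

MultiByteToWideChar(CP_UTF8, 0, src, srclen, dst, dstlen); ?
Cannot reproduce: ideone.com/kOKJBK Either your string is not what you think it is (make sure to print the hex values byte by byte), or your console locale is not UTF-8.
I am very confused by this question. \xc4\xb0-\xc3\x87-\xc3\x9c-\xc4\x9e is the UTF-8 sequence for the characters İ-Ç-Ü-Ğ...?!? What "conversion" are you looking for? What do you mean by "UTF-8 decoded"? (And, to everyone, what is with the wstring? That is not the issue, UTF-8 is...) Do you get \xc4\xb0-\xc3\x87-\xc3\x9c-\xc4\x9e verbatim, i.e. the character sequence '\', 'x', 'c', and so on? Then your calling code (which you do not show) does something very weird. Borrowing from xskxzr, we need a minimal reproducible example, with input, observed output, and expected output.
Or are you looking to convert { '\', 'x', 'c', '4', ... } to a proper UTF-8 string?
I think your terminology is wrong. By "UTF-8 decoded" you mean "UTF-8 encoded", and by "UTF-8 encoded" you mean "encoded in ASCII as C-like character escape sequences representing the byte values that encode the Unicode code points". Amirite?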

Tags: c++ string c++11 utf-8 char
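
One of the comments suggests printing the hex value of every byte to see what the buffer really holds. A minimal sketch of that check follows; `hex_dump` is a hypothetical helper name that does not appear in the question, and it assumes the caller passes the number of bytes to inspect.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper (not from the question): render each byte of a
// buffer as two lowercase hex digits, separated by single spaces.
std::string hex_dump(const char* data, std::size_t len)
{
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < len; ++i) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        unsigned value = static_cast<unsigned char>(data[i]);
        std::snprintf(buf, sizeof buf, "%02x", value);
        if (i != 0) {
            out += ' ';
        }
        out += buf;
    }
    return out;
}
```

If the buffer holds the escape text itself, a call such as `hex_dump(info->some_char_array, std::strlen(info->some_char_array))` should begin with `5c 78 63 34` (backslash, x, c, 4). For `bar`, where the compiler already turned the escapes into bytes, it should begin with `c4 b0`.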

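【Solution 1】:

OK, this is a bit much, and it is extracted from a larger parser I am using. But "a bit much" is the nature of Boost.Spirit. ;-)

The parser handles not only hexadecimal escapes, but also octal (\123) and "standard" escapes (\n). Provided under CC0, so you can do with it whatever you please. ;-)

Boost.Spirit is a "header-only" part of Boost, so you do not need to link any library code. However, the rather complex "magic" that the Spirit headers perform so that a grammar can be expressed this way in C++ source is somewhat hard on compilation times.

But it works, and it works well.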

#define BOOST_SPIRIT_USE_PHOENIX_V3

#include "boost/spirit/include/qi.hpp"
#include "boost/spirit/include/phoenix.hpp"

#include <iostream>
#include <string>
#include <cstring>
#include <sstream>
#include <stdexcept>

namespace
{

// Helper function: Turn on_error positional parameters into error message.
template< typename Iterator >
std::string make_error_message( boost::spirit::info const & info, Iterator first, Iterator last )
{
    std::ostringstream oss;
    oss << "Invalid sequence. Expecting " << info << " here: \"" << std::string( first, last ) << "\"";
    return oss.str();
}

}

// Wrap helper function with Boost.Phoenix boilerplate, so the function
// can be called from within a parser's [].
BOOST_PHOENIX_ADAPT_FUNCTION( std::string, make_error_message_, make_error_message, 3 )

// Supports various escape sequences:
// - Character escapes ( \a \b \f \n \r \t \v \" \\ )
// - Octal escapes ( \n \nn \nnn )
// - Hexadecimal escapes ( \xnn ) (*)
//
// (*): In C/C++, a hexadecimal escape runs until the first non-hexdigit
//      is encountered, which is not very helpful. This one takes exactly
//      two hexdigits.

// Declaring a grammar that works given any kind of iterator,
// and results in a std::string object.
template < typename Iterator >
class EscapedString : public boost::spirit::qi::grammar< Iterator, std::string() >
{
    public:
        // Constructor
        EscapedString() : EscapedString::base_type( escaped_string )
        {
            // An escaped string is a sequence of
            // characters that are not '\', or
            // an escape sequence
            escaped_string = *( +( boost::spirit::ascii::char_ - '\\' ) | escapes );

            // An escape sequence begins with '\', followed by
            // an escaped character (e.g. "\n"), or
            // an 'x' and 2..2 hexadecimal digits, or
            // 1..3 octal digits.
            escapes = '\\' > ( escaped_character
                               | ( "x" > boost::spirit::qi::uint_parser< char, 16, 2, 2 >() )
                               | boost::spirit::qi::uint_parser< char, 8, 1, 3 >() );

            // The list of special "escape" characters
            escaped_character.add
            ( "a", 0x07 )  // alert
            ( "b", 0x08 )  // backspace
            ( "f", 0x0c )  // form feed
            ( "n", 0x0a )  // new line
            ( "r", 0x0d )  // carriage return
            ( "t", 0x09 )  // horizontal tab
            ( "v", 0x0b )  // vertical tab
            ( "\"", 0x22 ) // literal quotation mark
            ( "\\", 0x5c ) // literal backslash
            ;

            // Error handling
            boost::spirit::qi::on_error< boost::spirit::qi::fail >
            (
                escapes,
                // backslash not followed by a valid sequence
                boost::phoenix::throw_(
                    boost::phoenix::construct< std::runtime_error >( make_error_message_( boost::spirit::_4, boost::spirit::_3, boost::spirit::_2 ) )
                )
            );
        }

    private:
        // Qi Rule member
        boost::spirit::qi::rule< Iterator, std::string() > escaped_string;

        // Helpers
        boost::spirit::qi::rule< Iterator, std::string() > escapes;
        boost::spirit::qi::symbols< char const, char > escaped_character;
};


int main()
{
    // Need to escape the backslashes, or "\xc4" would give *one*
    // byte of output (0xc4, decimal 196). I understood the input
    // to be the FOUR character hex char literal,
    // backslash, x, c, 4 in this case,
    // which is what this string literal does.
    char const * some_char_array = "\\xc4\\xb0-\\xc3\\x87-\\xc3\\x9c-\\xc4\\x9e";

    std::cout << "Input: '" << some_char_array << "'\n";

    // result object
    std::string s;

    // Create an instance of the grammar with "char const *"
    // as the iterator type (a string literal cannot be bound
    // to a plain "char *" in C++11).
    EscapedString< char const * > es;

    // start, end, parsing grammar, result object
    boost::spirit::qi::parse( some_char_array,
                              some_char_array + std::strlen( some_char_array ),
                              es,
                              s );

    std::cout << "Output: '" << s << "'\n";

    return 0;
}

This gives:

Input: '\xc4\xb0-\xc3\x87-\xc3\x9c-\xc4\x9e'
Output: 'İ-Ç-Ü-Ğ'
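
If only the two-digit hexadecimal escapes matter and Boost is not available, the same transformation can be sketched with the standard library alone. This is not the answer's parser: `decode_hex_escapes` and `hex_value` are hypothetical names, octal and character escapes are deliberately left out, and any backslash not followed by `x` and exactly two hex digits is treated as an error.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

namespace
{

// Value of one hexadecimal digit, or -1 if c is not a hex digit.
int hex_value( char c )
{
    if ( c >= '0' && c <= '9' ) return c - '0';
    if ( c >= 'a' && c <= 'f' ) return c - 'a' + 10;
    if ( c >= 'A' && c <= 'F' ) return c - 'A' + 10;
    return -1;
}

}

// Hypothetical helper: replace every "\xNN" (exactly two hex digits)
// with the single byte 0xNN and copy all other characters unchanged.
std::string decode_hex_escapes( std::string const & in )
{
    std::string out;
    out.reserve( in.size() );

    for ( std::size_t i = 0; i < in.size(); ++i )
    {
        if ( in[i] != '\\' )
        {
            out += in[i];
            continue;
        }

        // Need 'x' plus two digits after the backslash.
        if ( i + 3 >= in.size() || in[i + 1] != 'x' )
        {
            throw std::runtime_error( "Invalid escape sequence" );
        }

        int const hi = hex_value( in[i + 2] );
        int const lo = hex_value( in[i + 3] );

        if ( hi < 0 || lo < 0 )
        {
            throw std::runtime_error( "Invalid hex digit in escape sequence" );
        }

        out += static_cast< char >( hi * 16 + lo );
        i += 3; // skip 'x' and both digits; the loop skips the backslash
    }

    return out;
}
```

In the callback from the question, something like `std::string decoded = decode_hex_escapes( info->some_char_array );` would produce the UTF-8 bytes. The decoded text is never longer than the input, so it could also be copied back into the 65-byte array if the buffer really has to be edited in place.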

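【Comments】:

The original code also handles Unicode escapes (\uxxxx and \Uxxxxxxxx) using ICU, but I edited those out because you did not say whether they are needed, or whether ICU is "allowed" in your environment.
Thanks for your answer. I get a "'UChar32' was not declared in this scope" error when compiling. Adding "#include " still gives the same error.
Bugger, forgot to take that out as well... give me a minute.
@cagrias: After I took out the ICU stuff, that should be char. (The original code works with UTF-32 internally and eventually converts the string to UTF-16. Some artifacts remained after I cut that out and made it work with UTF-8. Sorry.)
@cagrias: Did this work for you? Should I strip the non-hex parts from my answer to make it more "streamlined" / acceptable? Did I miss anything?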
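
The comments mention Unicode escapes (\uxxxx and \Uxxxxxxxx) that were removed from the answer together with ICU. That removed code is not shown anywhere, so the following is only a rough sketch of the one step such escapes would need without ICU: turning a code point into UTF-8 bytes. `encode_utf8` is a hypothetical name, and parsing the escape digits is left out.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Rejects surrogates (U+D800..U+DFFF) and values above U+10FFFF.
std::string encode_utf8( char32_t cp )
{
    std::string out;

    if ( cp <= 0x7F )
    {
        // 1 byte: 0xxxxxxx
        out += static_cast< char >( cp );
    }
    else if ( cp <= 0x7FF )
    {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast< char >( 0xC0 | ( cp >> 6 ) );
        out += static_cast< char >( 0x80 | ( cp & 0x3F ) );
    }
    else if ( cp <= 0xFFFF )
    {
        if ( cp >= 0xD800 && cp <= 0xDFFF )
        {
            throw std::invalid_argument( "Surrogate code point" );
        }

        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast< char >( 0xE0 | ( cp >> 12 ) );
        out += static_cast< char >( 0x80 | ( ( cp >> 6 ) & 0x3F ) );
        out += static_cast< char >( 0x80 | ( cp & 0x3F ) );
    }
    else if ( cp <= 0x10FFFF )
    {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast< char >( 0xF0 | ( cp >> 18 ) );
        out += static_cast< char >( 0x80 | ( ( cp >> 12 ) & 0x3F ) );
        out += static_cast< char >( 0x80 | ( ( cp >> 6 ) & 0x3F ) );
        out += static_cast< char >( 0x80 | ( cp & 0x3F ) );
    }
    else
    {
        throw std::invalid_argument( "Code point out of range" );
    }

    return out;
}
```

For example, `encode_utf8( 0x0130 )` yields the two bytes 0xc4 0xb0, which is the İ from the question.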