使用 UTF-8 或 Latin1 编码将 QString 转换为 QByteArray答案

【问题标题】：Convert QString into QByteArray with either UTF-8 or Latin1 encoding使用 UTF-8 或 Latin1 编码将 QString 转换为 QByteArray
【发布时间】：2011-07-14 10:08:19
【问题描述】：

我想将 QString 转换为 utf8 或 latin1 QByteArray，但今天我得到的一切都是utf8。

我正在用 latin1 的较高段中高于 0x7f 的一些字符对此进行测试，德语 ü 就是一个很好的例子。

如果我这样做：

QString name("\u00fc"); // U+00FC = ü
QByteArray utf8;
utf8.append(name);
qDebug() << "utf8" << name << utf8.toHex();

QByteArray latin1;
latin1.append(name.toLatin1());
qDebug() << "Latin1" << name << latin1.toHex();

QTextCodec *codec = QTextCodec::codecForName("ISO 8859-1");
QByteArray encodedString = codec->fromUnicode(name);
qDebug() << "ISO 8859-1" << name << encodedString.toHex();

我得到以下输出。

utf8 "ü" "c3bc" 
Latin1 "ü" "c3bc" 
ISO 8859-1 "ü" "c3bc"

如您所见，我在任何地方都得到了 unicode 0xc3bc，我希望在第 2 步和第 3 步得到 Latin1 0xfc。

我的猜测是我应该得到这样的东西：

utf8 "ü" "c3bc" 
Latin1 "ü" "fc" 
ISO 8859-1 "ü" "fc"

这是怎么回事？

/谢谢

一些字符表的链接：

此代码是在基于 Ubuntu 10.04 的系统上构建和执行的。

$> uname -a
Linux frog 2.6.32-28-generic-pae #55-Ubuntu SMP Mon Jan 10 22:34:08 UTC 2011 i686 GNU/Linux
$> env | grep LANG
LANG=en_US.utf8

如果我尝试使用

utf8.append(name.toUtf8());

我得到这个输出

utf8 "ü" "c383c2bc" 
Latin1 "ü" "c3bc" 
ISO 8859-1 "ü" "c3bc"

所以 latin1 是 unicode 而 utf8 是双重编码...

这一定取决于某些系统设置？

如果我运行它（无法构建 .name()）

qDebug() << "system name:"      << QLocale::system().name();
qDebug() << "codecForCStrings:" << QTextCodec::codecForCStrings();
qDebug() << "codecForLocale:"   << QTextCodec::codecForLocale()->name();

然后我明白了：

system name: "en_US" 
codecForCStrings: 0x0 
codecForLocale: "System"

解决方案

如果我指定它是 UTF-8 我正在使用所以不同的类知道这一点，然后就可以了。

QTextCodec::setCodecForLocale(QTextCodec::codecForName("UTF-8"));
QTextCodec::setCodecForCStrings(QTextCodec::codecForName("UTF-8"));

qDebug() << "system name:"      << QLocale::system().name();
qDebug() << "codecForCStrings:" << QTextCodec::codecForCStrings()->name();
qDebug() << "codecForLocale:"   << QTextCodec::codecForLocale()->name();

QString name("\u00fc"); 
QByteArray utf8;
utf8.append(name);
qDebug() << "utf8" << name << utf8.toHex();

QByteArray latin1;
latin1.append(name.toLatin1());
qDebug() << "Latin1" << name << latin1.toHex();

QTextCodec *codec = QTextCodec::codecForName("latin1");
QByteArray encodedString = codec->fromUnicode(name);
qDebug() << "ISO 8859-1" << name << encodedString.toHex();

然后我得到这个输出：

system name: "en_US" 
codecForCStrings: "UTF-8" 
codecForLocale: "UTF-8" 
utf8 "ü" "c3bc" 
Latin1 "ü" "fc" 
ISO 8859-1 "ü" "fc"

看起来应该这样。

【问题讨论】：

标签： c++ qt utf-8 latin1 qbytearray

【解决方案1】：

须知：

执行字符页面

在 C++ 标准中有一个称为 execution 字符集的东西，该术语描述了字符串和字符文字的输出将在编译器生成的二进制文件中的内容。您可以在http://gcc.gnu.org 站点的C 预处理器手册 的1 概述 部分的1.1 Character sets 小节中阅读它。

问题：
"\u00fc" 字符串文字会产生什么结果？

答案：
这取决于执行字符集是什么。对于 gcc（这是您正在使用的），默认情况下是 UTF-8，除非您使用 -fexec-charset 选项指定不同的内容。您可以在http://gcc.gnu.org 站点的GCC 手册 中3 GCC 命令选项 部分的3.11 Options Controlling the Preprocessor 小节中阅读有关此选项和其他控制预处理阶段的选项。现在，当我们知道执行字符集是 UTF-8 时，我们知道 "\u00fc" 将被转换为 U+00FC Unicode 代码点的 UTF-8 编码，即两个字节的序列 0xc3 0xbc。

QString::QString ( const char * str ) 和 QByteArray & QByteArray::append ( const QString & str ) 依赖于全局状态

QString 的构造函数采用char * 调用QString QString::fromAscii ( const char * str, int size = -1 )，它使用void QTextCodec::setCodecForCStrings ( QTextCodec * codec ) 设置的编解码器（如果已设置任何编解码器）或与QString QString::fromLatin1 ( const char * str, int size = -1 ) 执行相同的操作（以防未设置编解码器）。

问题：
QString 的构造函数将使用什么编解码器来解码它得到的两个字节序列（0xc3 0xbc）？

答案：
默认情况下，QTextCodec::setCodecForCStrings() 没有设置编解码器，这就是为什么 Latin1 将用于解码字节序列的原因。由于0xc3 和0xbc 在拉丁语1 中都有效，分别代表Ã 和¼（这应该已经很熟悉了，因为它直接取自this 对您之前问题的回答）我们得到了带有这两个字符的QString .

qDebug() 不是 8 位干净的

您不应该使用QDebug 类来输出ASCII 之外的任何内容。你无法保证得到什么。

测试程序：

#include <QtCore>

void dbg(char const * rawInput, QString s) {

    QString codepoints;
    foreach(QChar chr, s) {
        codepoints.append(QString::number(chr.unicode(), 16)).append(" ");
    }

    qDebug() << "Input: " << rawInput
             << ", "
             << "Unicode codepoints: " << codepoints;
}

int main(int argc, char *argv[])
{
    QCoreApplication app(argc, argv);

    qDebug() << "system name:"
             << QLocale::system().name();

    for (int i = 1; i <= 5; ++i) {

        switch(i) {

        case 1:
            qDebug() << "\nWithout codecForCStrings (default is Latin1)\n";
            break;
        case 2:
            qDebug() << "\nWith codecForCStrings set to UTF-8\n";
            QTextCodec::setCodecForCStrings(QTextCodec::codecForName("UTF-8"));
            break;
        case 3:
            qDebug() << "\nWithout codecForCStrings (default is Latin1), with codecForLocale set to UTF-8\n";
            QTextCodec::setCodecForCStrings(0);
            QTextCodec::setCodecForLocale(QTextCodec::codecForName("UTF-8"));
            break;
        case 4:
            qDebug() << "\nWithout codecForCStrings (default is Latin1), with codecForLocale set to Latin1\n";
            QTextCodec::setCodecForCStrings(0);
            QTextCodec::setCodecForLocale(QTextCodec::codecForName("Latin1"));
            break;
        }

        qDebug() << "codecForCStrings:" << (QTextCodec::codecForCStrings()
                                           ? QTextCodec::codecForCStrings()->name()
                                           : "NOT SET");
        qDebug() << "codecForLocale:"   << (QTextCodec::codecForLocale()
                                           ? QTextCodec::codecForLocale()->name()
                                           : "NOT SET");

        qDebug() << "\n1. Using QString::QString(char const *)";
        dbg("\\u00fc", QString("\u00fc"));
        dbg("\\xc3\\xbc", QString("\xc3\xbc"));
        dbg("LATIN SMALL LETTER U WITH DIAERESIS", QString("ü"));

        qDebug() << "\n2. Using QString::fromUtf8(char const *)";
        dbg("\\u00fc", QString::fromUtf8("\u00fc"));
        dbg("\\xc3\\xbc", QString::fromUtf8("\xc3\xbc"));
        dbg("LATIN SMALL LETTER U WITH DIAERESIS", QString::fromUtf8("ü"));

        qDebug() << "\n3. Using QString::fromLocal8Bit(char const *)";
        dbg("\\u00fc", QString::fromLocal8Bit("\u00fc"));
        dbg("\\xc3\\xbc", QString::fromLocal8Bit("\xc3\xbc"));
        dbg("LATIN SMALL LETTER U WITH DIAERESIS", QString::fromLocal8Bit("ü"));
    }

    return app.exec();
}

在 Windows XP 上的 mingw 4.4.0 上的输出：

system name: "pl_PL"

Without codecForCStrings (default is Latin1)

codecForCStrings: "NOT SET"
codecForLocale: "System"

1. Using QString::QString(char const *)
Input:  \u00fc ,  Unicode codepoints:  "c3 bc "
Input:  \xc3\xbc ,  Unicode codepoints:  "c3 bc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fc "

2. Using QString::fromUtf8(char const *)
Input:  \u00fc ,  Unicode codepoints:  "fc "
Input:  \xc3\xbc ,  Unicode codepoints:  "fc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fffd "

3. Using QString::fromLocal8Bit(char const *)
Input:  \u00fc ,  Unicode codepoints:  "102 13d "
Input:  \xc3\xbc ,  Unicode codepoints:  "102 13d "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fc "

With codecForCStrings set to UTF-8

codecForCStrings: "UTF-8"
codecForLocale: "System"

1. Using QString::QString(char const *)
Input:  \u00fc ,  Unicode codepoints:  "fc "
Input:  \xc3\xbc ,  Unicode codepoints:  "fc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fffd "

2. Using QString::fromUtf8(char const *)
Input:  \u00fc ,  Unicode codepoints:  "fc "
Input:  \xc3\xbc ,  Unicode codepoints:  "fc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fffd "

3. Using QString::fromLocal8Bit(char const *)
Input:  \u00fc ,  Unicode codepoints:  "102 13d "
Input:  \xc3\xbc ,  Unicode codepoints:  "102 13d "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fc "

Without codecForCStrings (default is Latin1), with codecForLocale set to UTF-8

codecForCStrings: "NOT SET"
codecForLocale: "UTF-8"

1. Using QString::QString(char const *)
Input:  \u00fc ,  Unicode codepoints:  "c3 bc "
Input:  \xc3\xbc ,  Unicode codepoints:  "c3 bc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fc "

2. Using QString::fromUtf8(char const *)
Input:  \u00fc ,  Unicode codepoints:  "fc "
Input:  \xc3\xbc ,  Unicode codepoints:  "fc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fffd "

3. Using QString::fromLocal8Bit(char const *)
Input:  \u00fc ,  Unicode codepoints:  "fc "
Input:  \xc3\xbc ,  Unicode codepoints:  "fc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fffd "

Without codecForCStrings (default is Latin1), with codecForLocale set to Latin1

codecForCStrings: "NOT SET"
codecForLocale: "ISO-8859-1"

1. Using QString::QString(char const *)
Input:  \u00fc ,  Unicode codepoints:  "c3 bc "
Input:  \xc3\xbc ,  Unicode codepoints:  "c3 bc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fc "

2. Using QString::fromUtf8(char const *)
Input:  \u00fc ,  Unicode codepoints:  "fc "
Input:  \xc3\xbc ,  Unicode codepoints:  "fc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fffd "

3. Using QString::fromLocal8Bit(char const *)
Input:  \u00fc ,  Unicode codepoints:  "c3 bc "
Input:  \xc3\xbc ,  Unicode codepoints:  "c3 bc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fc "
codecForCStrings: "NOT SET"
codecForLocale: "ISO-8859-1"

1. Using QString::QString(char const *)
Input:  \u00fc ,  Unicode codepoints:  "c3 bc "
Input:  \xc3\xbc ,  Unicode codepoints:  "c3 bc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fc "

2. Using QString::fromUtf8(char const *)
Input:  \u00fc ,  Unicode codepoints:  "fc "
Input:  \xc3\xbc ,  Unicode codepoints:  "fc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fffd "

3. Using QString::fromLocal8Bit(char const *)
Input:  \u00fc ,  Unicode codepoints:  "c3 bc "
Input:  \xc3\xbc ,  Unicode codepoints:  "c3 bc "
Input:  LATIN SMALL LETTER U WITH DIAERESIS ,  Unicode codepoints:  "fc "

我要感谢 #qt freenode.org 的 thiago、cbreak、peppe 和 heinz IRC 频道，用于展示和帮助我了解此处涉及的问题。

【讨论】：

我使用的是基于 Ubuntu Linux 的系统，我认为他使用的是 utf8 作为默认字符集。
@Johan 我不太明白你的意思 但 utf8 是唯一正确的吗？
在我的第一个代码中，ü 被转换为 0xc3bc，这是正确的。但是作为一个 latin1 应该转换成 0xfc 。就像你输出显示的那样。所以在我的例子中，utf8 是正确的，而 latin1 不是。
@Johan qDebug() 做什么
我需要同时设置 codecForCStrings 和 setCodecForLocale，否则它不能正常工作...