须知:
在 C++ 标准中有一个称为 execution 字符集的东西,该术语描述了字符串和字符文字的输出将在编译器生成的二进制文件中的内容。您可以在http://gcc.gnu.org 站点的C 预处理器手册 的1 概述 部分的1.1 Character sets 小节中阅读它。
问题:
"\u00fc" 字符串文字会产生什么结果?
答案:
这取决于执行字符集是什么。对于 gcc(这是您正在使用的),默认情况下是 UTF-8,除非您使用 -fexec-charset 选项指定不同的内容。您可以在http://gcc.gnu.org 站点的GCC 手册 中3 GCC 命令选项 部分的3.11 Options Controlling the Preprocessor 小节中阅读有关此选项和其他控制预处理阶段的选项。现在,当我们知道执行字符集是 UTF-8 时,我们知道 "\u00fc" 将被转换为 U+00FC Unicode 代码点的 UTF-8 编码,即两个字节的序列 0xc3 0xbc。
QString 的构造函数采用char * 调用QString QString::fromAscii ( const char * str, int size = -1 ),它使用void QTextCodec::setCodecForCStrings ( QTextCodec * codec ) 设置的编解码器(如果已设置任何编解码器)或与QString QString::fromLatin1 ( const char * str, int size = -1 ) 执行相同的操作(以防未设置编解码器)。
问题:
QString 的构造函数将使用什么编解码器来解码它得到的两个字节序列(0xc3 0xbc)?
答案:
默认情况下,QTextCodec::setCodecForCStrings() 没有设置编解码器,这就是为什么 Latin1 将用于解码字节序列的原因。由于0xc3 和0xbc 在拉丁语1 中都有效,分别代表Ã 和¼(这应该已经很熟悉了,因为它直接取自this 对您之前问题的回答)我们得到了带有这两个字符的QString .
您不应该使用QDebug 类来输出ASCII 之外的任何内容。你无法保证得到什么。
测试程序:
#include <QtCore>
void dbg(char const * rawInput, QString s) {
QString codepoints;
foreach(QChar chr, s) {
codepoints.append(QString::number(chr.unicode(), 16)).append(" ");
}
qDebug() << "Input: " << rawInput
<< ", "
<< "Unicode codepoints: " << codepoints;
}
int main(int argc, char *argv[])
{
QCoreApplication app(argc, argv);
qDebug() << "system name:"
<< QLocale::system().name();
for (int i = 1; i <= 5; ++i) {
switch(i) {
case 1:
qDebug() << "\nWithout codecForCStrings (default is Latin1)\n";
break;
case 2:
qDebug() << "\nWith codecForCStrings set to UTF-8\n";
QTextCodec::setCodecForCStrings(QTextCodec::codecForName("UTF-8"));
break;
case 3:
qDebug() << "\nWithout codecForCStrings (default is Latin1), with codecForLocale set to UTF-8\n";
QTextCodec::setCodecForCStrings(0);
QTextCodec::setCodecForLocale(QTextCodec::codecForName("UTF-8"));
break;
case 4:
qDebug() << "\nWithout codecForCStrings (default is Latin1), with codecForLocale set to Latin1\n";
QTextCodec::setCodecForCStrings(0);
QTextCodec::setCodecForLocale(QTextCodec::codecForName("Latin1"));
break;
}
qDebug() << "codecForCStrings:" << (QTextCodec::codecForCStrings()
? QTextCodec::codecForCStrings()->name()
: "NOT SET");
qDebug() << "codecForLocale:" << (QTextCodec::codecForLocale()
? QTextCodec::codecForLocale()->name()
: "NOT SET");
qDebug() << "\n1. Using QString::QString(char const *)";
dbg("\\u00fc", QString("\u00fc"));
dbg("\\xc3\\xbc", QString("\xc3\xbc"));
dbg("LATIN SMALL LETTER U WITH DIAERESIS", QString("ü"));
qDebug() << "\n2. Using QString::fromUtf8(char const *)";
dbg("\\u00fc", QString::fromUtf8("\u00fc"));
dbg("\\xc3\\xbc", QString::fromUtf8("\xc3\xbc"));
dbg("LATIN SMALL LETTER U WITH DIAERESIS", QString::fromUtf8("ü"));
qDebug() << "\n3. Using QString::fromLocal8Bit(char const *)";
dbg("\\u00fc", QString::fromLocal8Bit("\u00fc"));
dbg("\\xc3\\xbc", QString::fromLocal8Bit("\xc3\xbc"));
dbg("LATIN SMALL LETTER U WITH DIAERESIS", QString::fromLocal8Bit("ü"));
}
return app.exec();
}
在 Windows XP 上的 mingw 4.4.0 上的输出:
system name: "pl_PL"
Without codecForCStrings (default is Latin1)
codecForCStrings: "NOT SET"
codecForLocale: "System"
1. Using QString::QString(char const *)
Input: \u00fc , Unicode codepoints: "c3 bc "
Input: \xc3\xbc , Unicode codepoints: "c3 bc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "
2. Using QString::fromUtf8(char const *)
Input: \u00fc , Unicode codepoints: "fc "
Input: \xc3\xbc , Unicode codepoints: "fc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "
3. Using QString::fromLocal8Bit(char const *)
Input: \u00fc , Unicode codepoints: "102 13d "
Input: \xc3\xbc , Unicode codepoints: "102 13d "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "
With codecForCStrings set to UTF-8
codecForCStrings: "UTF-8"
codecForLocale: "System"
1. Using QString::QString(char const *)
Input: \u00fc , Unicode codepoints: "fc "
Input: \xc3\xbc , Unicode codepoints: "fc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "
2. Using QString::fromUtf8(char const *)
Input: \u00fc , Unicode codepoints: "fc "
Input: \xc3\xbc , Unicode codepoints: "fc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "
3. Using QString::fromLocal8Bit(char const *)
Input: \u00fc , Unicode codepoints: "102 13d "
Input: \xc3\xbc , Unicode codepoints: "102 13d "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "
Without codecForCStrings (default is Latin1), with codecForLocale set to UTF-8
codecForCStrings: "NOT SET"
codecForLocale: "UTF-8"
1. Using QString::QString(char const *)
Input: \u00fc , Unicode codepoints: "c3 bc "
Input: \xc3\xbc , Unicode codepoints: "c3 bc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "
2. Using QString::fromUtf8(char const *)
Input: \u00fc , Unicode codepoints: "fc "
Input: \xc3\xbc , Unicode codepoints: "fc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "
3. Using QString::fromLocal8Bit(char const *)
Input: \u00fc , Unicode codepoints: "fc "
Input: \xc3\xbc , Unicode codepoints: "fc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "
Without codecForCStrings (default is Latin1), with codecForLocale set to Latin1
codecForCStrings: "NOT SET"
codecForLocale: "ISO-8859-1"
1. Using QString::QString(char const *)
Input: \u00fc , Unicode codepoints: "c3 bc "
Input: \xc3\xbc , Unicode codepoints: "c3 bc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "
2. Using QString::fromUtf8(char const *)
Input: \u00fc , Unicode codepoints: "fc "
Input: \xc3\xbc , Unicode codepoints: "fc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "
3. Using QString::fromLocal8Bit(char const *)
Input: \u00fc , Unicode codepoints: "c3 bc "
Input: \xc3\xbc , Unicode codepoints: "c3 bc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "
codecForCStrings: "NOT SET"
codecForLocale: "ISO-8859-1"
1. Using QString::QString(char const *)
Input: \u00fc , Unicode codepoints: "c3 bc "
Input: \xc3\xbc , Unicode codepoints: "c3 bc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "
2. Using QString::fromUtf8(char const *)
Input: \u00fc , Unicode codepoints: "fc "
Input: \xc3\xbc , Unicode codepoints: "fc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "
3. Using QString::fromLocal8Bit(char const *)
Input: \u00fc , Unicode codepoints: "c3 bc "
Input: \xc3\xbc , Unicode codepoints: "c3 bc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "
我要感谢 #qt freenode.org 的 thiago、cbreak、peppe 和 heinz IRC 频道,用于展示和帮助我了解此处涉及的问题。