【Question title】: Algorithm behaves differently on different operating systems when working with UTF-8
【Posted】: 2017-04-02 20:44:24
【Question】:

Simple code for the algorithm:

#include <iostream>
#include <string>

std::string::size_type GetLengthWithUTF(std::string &sValue);

int main()
{
    std::string sTestValueUTF8 = "\xD0\xB6\xD0\xB6\xD0\xB6";
    std::string sTestValueASCII = "\x67\x67\x67";
    std::string sTestValueMIX = "\x67\x67\x67\xD0\xB6\xD0\xB6\xD0\xB6";
    std::string::size_type iFuncResult = 0;

    std::cout << "=========== START TEST ==========\n\n";

    std::cout << "+TEST UTF8 STRING\n";
    std::cout << "+----+Bytes of string (sTestValueUTF8.length()) = " << sTestValueUTF8.length() << "\n";
    iFuncResult = GetLengthWithUTF(sTestValueUTF8);
    std::cout << "+----+Function result (GetLengthWithUTF(\"" << sTestValueUTF8 << "\")) = " << iFuncResult<< "\n\n";

    std::cout << "+TEST ASCII STRING\n";
    std::cout << "+----+Bytes of string (sTestValueASCII.length()) = " << sTestValueASCII.length() << "\n";
    iFuncResult = GetLengthWithUTF(sTestValueASCII);
    std::cout << "+----+Function result (GetLengthWithUTF(\"" << sTestValueASCII << "\")) = " << iFuncResult<< "\n\n";

    std::cout << "+TEST MIX STRING\n";
    std::cout << "+----+Bytes of string (sTestValueMIX.length()) = " << sTestValueMIX.length() << "\n";
    iFuncResult = GetLengthWithUTF(sTestValueMIX);
    std::cout << "+----+Function result (GetLengthWithUTF(\"" << sTestValueMIX << "\")) = " << iFuncResult<< "\n\n";

    std::cout << "\n===========  END TEST  ==========\n\n";
}

std::string::size_type GetLengthWithUTF(std::string &sValue)
{
    std::cout << "     +----+START GetLengthWithUTF\n";
    std::cout << "          +Input string is: " << sValue << "\n";
    std::string::size_type i;
    std::cout << "          +Start cycle\n";
    int iCountUTF8characters = 0;
    for (i = 0; i < sValue.length(); i++)
    {
        std::cout << "          +----+Iteration N " << i << "\n";
        std::cout << "               +Current character is: " << sValue[i] << ", integer value = " << (int)sValue[i] << "\n";
        if (sValue[i] > 127)
        {
            iCountUTF8characters++;
            std::cout << "               +----+If statement (sValue[i] > 127) is true, value of iCountUTF8characters is: " << iCountUTF8characters << "\n";
        }
        else
        {
            std::cout << "               +----+If statement (sValue[i] > 127) is false.\n";
        }
    }

    std::cout << "          +End cycle\n";
    iCountUTF8characters = iCountUTF8characters / 2;
    std::cout << "          +Return sValue.length() - (iCountUTF8characters / 2) ---> " << sValue.length() << " - (" << iCountUTF8characters << " / 2) = " << (sValue.length() - (std::string::size_type)iCountUTF8characters) <<"\n";
    std::cout << "     +----+ASCIID GetLengthWithUTF\n";
    return (sValue.length() - (std::string::size_type)iCountUTF8characters);
}

Compile commands (console):

AIX 6

g++ -o test test.cpp

RHEL Server 6.7 Santiago

g++ -o test test.cpp

Microsoft Windows v10.0.14393

cl /EHsc test.cpp



Results:

AIX 6
=========== START TEST ==========

+TEST UTF8 STRING
+----+Bytes of string (sTestValueUTF8.length()) = 6
     +----+START GetLengthWithUTF
          +Input string is: жжж
          +Start cycle
          +----+Iteration N 0
               +Current character is: Ь integer value = 208
               +----+If statement (sValue[i] > 127) is true, value of iCountUTF8characters is: 1
          +----+Iteration N 1
               +Current character is: ֬ integer value = 182
               +----+If statement (sValue[i] > 127) is true, value of iCountUTF8characters is: 2
          +----+Iteration N 2
               +Current character is: Ь integer value = 208
               +----+If statement (sValue[i] > 127) is true, value of iCountUTF8characters is: 3
          +----+Iteration N 3
               +Current character is: ֬ integer value = 182
               +----+If statement (sValue[i] > 127) is true, value of iCountUTF8characters is: 4
          +----+Iteration N 4
               +Current character is: Ь integer value = 208
               +----+If statement (sValue[i] > 127) is true, value of iCountUTF8characters is: 5
          +----+Iteration N 5
               +Current character is: ֬ integer value = 182
               +----+If statement (sValue[i] > 127) is true, value of iCountUTF8characters is: 6
          +End cycle
          +Return sValue.length() - (iCountUTF8characters / 2) ---> 6 - (3 / 2) = 3
     +----+ASCIID GetLengthWithUTF
+----+Function result (GetLengthWithUTF("жжж")) = 3

+TEST ASCII STRING
+----+Bytes of string (sTestValueASCII.length()) = 3
     +----+START GetLengthWithUTF
          +Input string is: ggg
          +Start cycle
          +----+Iteration N 0
               +Current character is: g, integer value = 103
               +----+If statement (sValue[i] > 127) is false.
          +----+Iteration N 1
               +Current character is: g, integer value = 103
               +----+If statement (sValue[i] > 127) is false.
          +----+Iteration N 2
               +Current character is: g, integer value = 103
               +----+If statement (sValue[i] > 127) is false.
          +End cycle
          +Return sValue.length() - (iCountUTF8characters / 2) ---> 3 - (0 / 2) = 3
     +----+ASCIID GetLengthWithUTF
+----+Function result (GetLengthWithUTF("ggg")) = 3

+TEST MIX STRING
+----+Bytes of string (sTestValueMIX.length()) = 9
     +----+START GetLengthWithUTF
          +Input string is: gggжжж
          +Start cycle
          +----+Iteration N 0
               +Current character is: g, integer value = 103
               +----+If statement (sValue[i] > 127) is false.
          +----+Iteration N 1
               +Current character is: g, integer value = 103
               +----+If statement (sValue[i] > 127) is false.
          +----+Iteration N 2
               +Current character is: g, integer value = 103
               +----+If statement (sValue[i] > 127) is false.
          +----+Iteration N 3
               +Current character is: Ь integer value = 208
               +----+If statement (sValue[i] > 127) is true, value of iCountUTF8characters is: 1
          +----+Iteration N 4
               +Current character is: ֬ integer value = 182
               +----+If statement (sValue[i] > 127) is true, value of iCountUTF8characters is: 2
          +----+Iteration N 5
               +Current character is: Ь integer value = 208
               +----+If statement (sValue[i] > 127) is true, value of iCountUTF8characters is: 3
          +----+Iteration N 6
               +Current character is: ֬ integer value = 182
               +----+If statement (sValue[i] > 127) is true, value of iCountUTF8characters is: 4
          +----+Iteration N 7
               +Current character is: Ь integer value = 208
               +----+If statement (sValue[i] > 127) is true, value of iCountUTF8characters is: 5
          +----+Iteration N 8
               +Current character is: ֬ integer value = 182
               +----+If statement (sValue[i] > 127) is true, value of iCountUTF8characters is: 6
          +End cycle
          +Return sValue.length() - (iCountUTF8characters / 2) ---> 9 - (3 / 2) = 6
     +----+ASCIID GetLengthWithUTF
+----+Function result (GetLengthWithUTF("gggжжж")) = 6


===========  END TEST  ==========

RHEL Server 6.7 Santiago

=========== START TEST ==========

+TEST UTF8 STRING
+----+Bytes of string (sTestValueUTF8.length()) = 6
     +----+START GetLengthWithUTF
          +Input string is: жжж
          +Start cycle
          +----+Iteration N 0
               +Current character is: Ь integer value = -48
               +----+If statement (sValue[i] > 127) is false.
          +----+Iteration N 1
               +Current character is: ֬ integer value = -74
               +----+If statement (sValue[i] > 127) is false.
          +----+Iteration N 2
               +Current character is: Ь integer value = -48
               +----+If statement (sValue[i] > 127) is false.
          +----+Iteration N 3
               +Current character is: ֬ integer value = -74
               +----+If statement (sValue[i] > 127) is false.
          +----+Iteration N 4
               +Current character is: Ь integer value = -48
               +----+If statement (sValue[i] > 127) is false.
          +----+Iteration N 5
               +Current character is: ֬ integer value = -74
               +----+If statement (sValue[i] > 127) is false.
          +End cycle
          +Return sValue.length() - (iCountUTF8characters / 2) ---> 6 - (0 / 2) = 6
     +----+ASCIID GetLengthWithUTF
+----+Function result (GetLengthWithUTF("жжж")) = 6

+TEST ASCII STRING
+----+Bytes of string (sTestValueASCII.length()) = 3
     +----+START GetLengthWithUTF
          +Input string is: ggg
          +Start cycle
          +----+Iteration N 0
               +Current character is: g, integer value = 103
               +----+If statement (sValue[i] > 127) is false.
          +----+Iteration N 1
               +Current character is: g, integer value = 103
               +----+If statement (sValue[i] > 127) is false.
          +----+Iteration N 2
               +Current character is: g, integer value = 103
               +----+If statement (sValue[i] > 127) is false.
          +End cycle
          +Return sValue.length() - (iCountUTF8characters / 2) ---> 3 - (0 / 2) = 3
     +----+ASCIID GetLengthWithUTF
+----+Function result (GetLengthWithUTF("ggg")) = 3

+TEST MIX STRING
+----+Bytes of string (sTestValueMIX.length()) = 9
     +----+START GetLengthWithUTF
          +Input string is: gggжжж
          +Start cycle
          +----+Iteration N 0
               +Current character is: g, integer value = 103
               +----+If statement (sValue[i] > 127) is false.
          +----+Iteration N 1
               +Current character is: g, integer value = 103
               +----+If statement (sValue[i] > 127) is false.
          +----+Iteration N 2
               +Current character is: g, integer value = 103
               +----+If statement (sValue[i] > 127) is false.
          +----+Iteration N 3
               +Current character is: Ь integer value = -48
               +----+If statement (sValue[i] > 127) is false.
          +----+Iteration N 4
               +Current character is: ֬ integer value = -74
               +----+If statement (sValue[i] > 127) is false.
          +----+Iteration N 5
               +Current character is: Ь integer value = -48
               +----+If statement (sValue[i] > 127) is false.
          +----+Iteration N 6
               +Current character is: ֬ integer value = -74
               +----+If statement (sValue[i] > 127) is false.
          +----+Iteration N 7
               +Current character is: Ь integer value = -48
               +----+If statement (sValue[i] > 127) is false.
          +----+Iteration N 8
               +Current character is: ֬ integer value = -74
               +----+If statement (sValue[i] > 127) is false.
          +End cycle
          +Return sValue.length() - (iCountUTF8characters / 2) ---> 9 - (0 / 2) = 9
     +----+ASCIID GetLengthWithUTF
+----+Function result (GetLengthWithUTF("gggжжж")) = 9


===========  END TEST  ==========

Microsoft Windows v10.0.14393

=========== START TEST ==========

+TEST UTF8 STRING
+----+Bytes of string (sTestValueUTF8.length()) = 6
     +----+START GetLengthWithUTF
          +Input string is: жжж
          +Start cycle
          +----+Iteration N 0
               +Current character is: Ь integer value = -48
               +----+If statement (sValue[i] > 127) is false.
          +----+Iteration N 1
               +Current character is: ֬ integer value = -74
               +----+If statement (sValue[i] > 127) is false.
          +----+Iteration N 2
               +Current character is: Ь integer value = -48
               +----+If statement (sValue[i] > 127) is false.
          +----+Iteration N 3
               +Current character is: ֬ integer value = -74
               +----+If statement (sValue[i] > 127) is false.
          +----+Iteration N 4
               +Current character is: Ь integer value = -48
               +----+If statement (sValue[i] > 127) is false.
          +----+Iteration N 5
               +Current character is: ֬ integer value = -74
               +----+If statement (sValue[i] > 127) is false.
          +End cycle
          +Return sValue.length() - (iCountUTF8characters / 2) ---> 6 - (0 / 2) = 6
     +----+ASCIID GetLengthWithUTF
+----+Function result (GetLengthWithUTF("жжж")) = 6

+TEST ASCII STRING
+----+Bytes of string (sTestValueASCII.length()) = 3
     +----+START GetLengthWithUTF
          +Input string is: ggg
          +Start cycle
          +----+Iteration N 0
               +Current character is: g, integer value = 103
               +----+If statement (sValue[i] > 127) is false.
          +----+Iteration N 1
               +Current character is: g, integer value = 103
               +----+If statement (sValue[i] > 127) is false.
          +----+Iteration N 2
               +Current character is: g, integer value = 103
               +----+If statement (sValue[i] > 127) is false.
          +End cycle
          +Return sValue.length() - (iCountUTF8characters / 2) ---> 3 - (0 / 2) = 3
     +----+ASCIID GetLengthWithUTF
+----+Function result (GetLengthWithUTF("ggg")) = 3

+TEST MIX STRING
+----+Bytes of string (sTestValueMIX.length()) = 9
     +----+START GetLengthWithUTF
          +Input string is: gggжжж
          +Start cycle
          +----+Iteration N 0
               +Current character is: g, integer value = 103
               +----+If statement (sValue[i] > 127) is false.
          +----+Iteration N 1
               +Current character is: g, integer value = 103
               +----+If statement (sValue[i] > 127) is false.
          +----+Iteration N 2
               +Current character is: g, integer value = 103
               +----+If statement (sValue[i] > 127) is false.
          +----+Iteration N 3
               +Current character is: Ь integer value = -48
               +----+If statement (sValue[i] > 127) is false.
          +----+Iteration N 4
               +Current character is: ֬ integer value = -74
               +----+If statement (sValue[i] > 127) is false.
          +----+Iteration N 5
               +Current character is: Ь integer value = -48
               +----+If statement (sValue[i] > 127) is false.
          +----+Iteration N 6
               +Current character is: ֬ integer value = -74
               +----+If statement (sValue[i] > 127) is false.
          +----+Iteration N 7
               +Current character is: Ь integer value = -48
               +----+If statement (sValue[i] > 127) is false.
          +----+Iteration N 8
               +Current character is: ֬ integer value = -74
               +----+If statement (sValue[i] > 127) is false.
          +End cycle
          +Return sValue.length() - (iCountUTF8characters / 2) ---> 9 - (0 / 2) = 9
     +----+ASCIID GetLengthWithUTF
+----+Function result (GetLengthWithUTF("gggжжж")) = 9


===========  END TEST  ==========

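The algorithm is supposed to count the number of characters in a string. As the test results show, it only works correctly on AIX.

I would be glad if someone could help me understand why this algorithm, which looks absurd to me, behaves differently on different operating systems. It was written on AIX. After migrating from AIX to Linux we found that it was broken, so I ran the broader tests whose results you see above. My main question is how this damned algorithm manages to work on AIX at all. I cannot explain it in any logical way.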

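【Comments】:

  • That algorithm is incorrect; it only works for a small subset of Unicode. A better algorithm is to count the bytes for which (ch & 0xC0) != 0x80, which excludes only the non-initial (continuation) bytes, those in the range 0x80-0xBF.
  • Yes, you are right. This algorithm is legacy, very old, and only checks strings shorter than 200 characters. It has since been replaced with the algorithm you describe above. I was just curious about the cause of this problem.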
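
For reference, the counting rule suggested in the first comment could be sketched roughly as follows. The name CountCodePointsUTF8 is made up for this illustration and is not from the original code, and the sketch assumes the input is already well-formed UTF-8 (it does no validation). It counts code points, which is not always the same as user-visible characters.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the original post): counts UTF-8 code
// points by counting every byte that is NOT a continuation byte (10xxxxxx).
// Assumption: sValue holds well-formed UTF-8; nothing is validated here.
std::size_t CountCodePointsUTF8(const std::string &sValue)
{
    std::size_t count = 0;
    for (std::string::size_type i = 0; i < sValue.length(); ++i)
    {
        // Convert to unsigned char first so the test does not depend on
        // whether plain char is signed on the current platform.
        const unsigned char byte = static_cast<unsigned char>(sValue[i]);
        if ((byte & 0xC0) != 0x80)
        {
            ++count; // ASCII byte or lead byte of a multi-byte sequence
        }
    }
    return count;
}
```

With the three test strings from the question this should give 3, 3 and 6, and unlike the "divide by 2" approach it is not limited to two-byte sequences.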

Tags: c++ linux algorithm aix cyrillic


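【Answer 1】:

It looks like the systems differ in how they treat the signedness of char, which the standard allows. Your AIX compiler treats char as unsigned, while the other two systems treat it as signed.

On a system with unsigned chars, the condition sValue[i] > 127 behaves exactly as one would expect. On a system with signed chars, however, the same expression can never be true, because an 8-bit signed char cannot hold a value above 127.

That is why you get negative numbers for characters with codes 128 and above. For example, 208 becomes -48 when it is interpreted as a single-byte signed value (208 - 256 = -48).

You can fix this by casting to unsigned or by using a bitmask to check the eighth bit: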
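
If you want to confirm which case a given compiler falls into, a minimal check along these lines should do. The function names are made up for this illustration; std::numeric_limits and CHAR_MAX are standard.

```cpp
#include <climits>
#include <limits>

// Hypothetical helper: true if plain char is a signed type with this
// compiler and these compiler settings.
bool IsPlainCharSigned()
{
    return std::numeric_limits<char>::is_signed;
}

// Hypothetical helper: the largest value a plain char can hold
// (127 for signed 8-bit char, 255 for unsigned 8-bit char).
int LargestPlainCharValue()
{
    return CHAR_MAX;
}
```

From the output in the question, one would expect IsPlainCharSigned() to be false on the AIX build and true on the RHEL and Windows builds. GCC also documents the -funsigned-char and -fsigned-char options for changing this default, although relying on them hides the portability problem rather than fixing it.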
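
To make the arithmetic concrete, here is a small sketch of the two interpretations of the same byte. Both function names are hypothetical, and the sketch assumes 8-bit chars with two's complement representation, which holds on all three platforms in the question.

```cpp
#include <cstring>

// Hypothetical helper: reads one byte as a signed 8-bit value, which is
// what happens to sValue[i] on a platform where plain char is signed.
int ByteAsSignedChar(unsigned char byte)
{
    signed char s;
    std::memcpy(&s, &byte, 1); // copy the bit pattern unchanged
    return s;                  // 0xD0 (208) comes back as 208 - 256 = -48
}

// Hypothetical helper: recovers the 0..255 byte value from a plain char,
// regardless of whether plain char is signed.
int ByteAsUnsignedChar(char c)
{
    return static_cast<unsigned char>(c);
}
```

This matches the logs above: the bytes 0xD0 and 0xB6 show up as 208 and 182 on AIX but as -48 and -74 on RHEL and Windows.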

if (sValue[i] & 128) {
    ... // MSB is set
}
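
Applied to the function from the question, the cast variant might look roughly like this. It is only a sketch with the logging removed, and GetLengthWithUTFPortable is a made-up name. It keeps the legacy assumption that every non-ASCII character takes exactly two bytes, so it is still only valid for text such as the Cyrillic test strings.

```cpp
#include <string>

// Hypothetical portable variant of GetLengthWithUTF from the question.
// Assumption carried over from the legacy code: every non-ASCII character
// is encoded as exactly two bytes.
std::string::size_type GetLengthWithUTFPortable(const std::string &sValue)
{
    std::string::size_type nonAsciiBytes = 0;
    for (std::string::size_type i = 0; i < sValue.length(); ++i)
    {
        // The cast makes the comparison independent of char signedness.
        if (static_cast<unsigned char>(sValue[i]) > 127)
        {
            ++nonAsciiBytes;
        }
    }
    // Each two-byte character contributed two bytes, so subtract one per pair.
    return sValue.length() - nonAsciiBytes / 2;
}
```

With this change the three test strings should yield 3, 3 and 6 on every platform, matching the AIX results.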

【Comments】:

  • Damn! That's it! Thank you very much!!! if (unsigned(sValue[i]) > 127) is the slower version, but it is more readable for some developers ;)