【问题标题】:Using Encode::encode with "utf8"将 Encode::encode 与“utf8”一起使用
【发布时间】:2018-08-08 20:42:06
【问题描述】:

所以您可能知道,在 Perl 中,“utf8”意味着 Perl 对 UTF-8 的更松散的理解,它允许在技术上不是 UTF-8 中有效代码点的字符。相比之下,“UTF-8”(或“utf-8”)是 Perl 对 UTF-8 更严格的理解,它不允许无效的代码点。

我有几个与此区别相关的使用问题:

  1. 默认情况下,Encode::encode 会将无效字符替换为替换字符。即使您将较宽松的“utf8”作为编码传递,这是真的吗?

  2. 当您使用“UTF-8”读写open'd 文件时会发生什么?字符替换会发生在坏字符上还是会发生其他情况?

  3. open 与 '>:utf8' 之类的层和 '>:encoding(utf8)' 之类的层一起使用有什么区别?两种方法都可以与“utf8”和“UTF-8”一起使用吗?

【问题讨论】:

    标签: perl


    【解决方案1】:
    On Read,
    Invalid encoding other
    than sequence length
    On Read,
    Outside of Unicode,
    Unicode nonchar, or
    Unicode surrogate
    On Write,
    Outside of Unicode,
    Unicode nonchar, or
    Unicode surrogate
    :encoding(UTF-8) Warns and Replaces Warns and Replaces Warns and Replaces
    :encoding(utf8) Warns and Replaces Accepts Warns and Outputs
    :utf8 Corrupt scalar Accepts Warns and Outputs

    (这是 Perl 5.26 中的状态。)

    请注意,:encoding(UTF-8) 实际上使用 utf8 进行解码,然后检查生成的字符是否在可接受的范围内。这减少了错误输入的错误消息数量,所以很好。

    (编码名称不区分大小写。)


    用于生成上表的测试:

    读取时

    • :encoding(UTF-8)

        $ printf "\xC3\xA9\n\xEF\xBF\xBF\n\xED\xA0\x80\n\xF8\x88\x80\x80\x80\n\x80\n" |
           perl -MB -nle'
              use open ":std", ":encoding(UTF-8)";
              my $sv = B::svref_2object(\$_);
              printf "%vX%s (internal: %vX, UTF8=%d)\n", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_);
           '
        utf8 "\xFFFF" does not map to Unicode.
        utf8 "\xD800" does not map to Unicode.
        utf8 "\x200000" does not map to Unicode.
        utf8 "\x80" does not map to Unicode.
        E9 (internal: C3.A9, UTF8=1)
        5C.78.7B.46.46.46.46.7D = \x{FFFF} (internal: 5C.78.7B.46.46.46.46.7D, UTF8=1)
        5C.78.7B.44.38.30.30.7D = \x{D800} (internal: 5C.78.7B.44.38.30.30.7D, UTF8=1)
        5C.78.7B.32.30.30.30.30.30.7D = \x{200000} (internal: 5C.78.7B.32.30.30.30.30.30.7D, UTF8=1)
        5C.78.38.30 = \x80 (internal: 5C.78.38.30, UTF8=1)
      
    • :encoding(utf8)

        $ printf "\xC3\xA9\n\xEF\xBF\xBF\n\xED\xA0\x80\n\xF8\x88\x80\x80\x80\n\x80\n" |
           perl -MB -nle'
              use open ":std", ":encoding(utf8)";
              my $sv = B::svref_2object(\$_);
              printf "%vX%s (internal: %vX, UTF8=%d)\n", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_);
           '
        utf8 "\x80" does not map to Unicode.
        E9 (internal: C3.A9, UTF8=1)
        FFFF (internal: EF.BF.BF, UTF8=1)
        D800 (internal: ED.A0.80, UTF8=1)
        200000 (internal: F8.88.80.80.80, UTF8=1)
        5C.78.38.30 = \x80 (internal: 5C.78.38.30, UTF8=1)
      
    • :utf8

        $ printf "\xC3\xA9\n\xEF\xBF\xBF\n\xED\xA0\x80\n\xF8\x88\x80\x80\x80\n\x80\n" |
           perl -MB -nle'
              use open ":std", ":utf8";
              my $sv = B::svref_2object(\$_);
              printf "%vX%s (internal: %vX, UTF8=%d)\n", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_);
           '
        E9 (internal: C3.A9, UTF8=1)
        FFFF (internal: EF.BF.BF, UTF8=1)
        D800 (internal: ED.A0.80, UTF8=1)
        200000 (internal: F8.88.80.80.80, UTF8=1)
        Malformed UTF-8 character: \x80 (unexpected continuation byte 0x80, with no preceding start byte) in printf at -e line 4, <> line 5.
        0 (internal: 80, UTF8=1)
      

    写入时

    • :encoding(UTF-8)

        $ perl -e'
           use open ":std", ":encoding(UTF-8)";
           print "\x{E9}\n";
           print "\x{FFFF}\n";
           print "\x{D800}\n";
           print "\x{20_0000}\n";
        ' >a
        Unicode non-character U+FFFF is not recommended for open interchange in print at -e line 4.
        Unicode surrogate U+D800 is illegal in UTF-8 at -e line 5.
        Code point 0x200000 is not Unicode, may not be portable in print at -e line 6.
        "\x{ffff}" does not map to utf8.
        "\x{d800}" does not map to utf8.
        "\x{200000}" does not map to utf8.
      
        $ od -t c a
        0000000 303 251  \n   \   x   {   F   F   F   F   }  \n   \   x   {   D
        0000020   8   0   0   }  \n   \   x   {   2   0   0   0   0   0   }  \n
        0000040
      
        $ cat a
        é
        \x{FFFF}
        \x{D800}
        \x{200000}
      
    • :encoding(utf8)

        $ perl -e'
           use open ":std", ":encoding(utf8)";
           print "\x{E9}\n";
           print "\x{FFFF}\n";
           print "\x{D800}\n";
           print "\x{20_0000}\n";
        ' >a
        Unicode surrogate U+D800 is illegal in UTF-8 at -e line 4.
        Code point 0x200000 is not Unicode, may not be portable in print at -e line 5.
      
        $ od -t c a
        0000000 303 251  \n 355 240 200  \n 370 210 200 200 200  \n
        0000015
      
        $ cat a
        é
        ▒
        ▒
      
    • :utf8

      结果与:encoding(utf8) 相同。

    使用 Perl 5.26 测试。


    Encode::encode 默认会用替换字符替换无效字符。即使您将较松散的“utf8”作为编码传递也是如此吗?

    Perl 字符串是 32 位或 64 位字符的字符串,具体取决于构建。 utf8 可以编码任何 72 位整数。因此它能够编码所有可以被要求编码的字符。

    【讨论】:

      猜你喜欢
      • 2016-10-12
      • 2011-03-18
      • 1970-01-01
      • 2017-07-21
      • 1970-01-01
      • 2011-06-28
      • 2019-04-08
      • 1970-01-01
      相关资源
      最近更新 更多