Perl-generated UTF8 strings in JSON corrupted on client side

【Question title】: Perl-generated UTF8 strings in JSON corrupted on client side
【Posted】: 2019-07-09 08:11:07
【Question】:

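I have a Perl CGI script that reads Thai UTF-8 strings from a PostgreSQL database and returns them to a web-based front end as JSON. The strings are fine when I fetch them from the database and after I encode them as JSON (judging by what I write to a log file). However, when the client receives them they are corrupted, for example:

featurename "à¹\u0082à¸£à¸\u0087à¹\u0080à¸£à¸µà¸¢à¸\u0099à¸§à¸±à¸\u0094à¸à¸²à¸©à¸µ"

Clearly some of the characters are being converted to Unicode escape sequences, but not all of them.

I could really use some suggestions on how to solve this.

A simplified code snippet follows. I am using 'utf8' and 'utf8::all', as well as 'JSON'.

Thanks in advance for any help you can provide.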

my $dataId = $cgi->param('dataid');
my $table = "uploadpoints";
my $sqlcommand = "select id,featurename from $table where dataid=$dataId;";
my $stmt = $gDbh->prepare($sqlcommand);
my $numrows = $stmt->execute;
# print JSON header
print <<EOM;
Content-type: application/json; charset="UTF-8"


EOM
my @retarray;
for (my $i = 0; ($i < $numrows); $i=$i+1)
{
    my $hashref = $stmt->fetchrow_hashref("NAME_lc");
    #my $featurename = $hashref->{'featurename'};
    #logentry("Point $i feature name is: $featurename\n");
    push @retarray,$hashref;
}
my $json = encode_json (\@retarray);
logentry("JSON\n $json");
print $json;

I have modified and simplified the example, and now run it locally rather than calling it through a browser:

my $dataId = 5; 
my $table = "uploadpoints";
my $sqlcommand = "select id,featurename from $table where dataid=$dataId and id=75;";
my $stmt = $gDbh->prepare($sqlcommand);
my $numrows = $stmt->execute;
my @retarray;
for (my $i = 0; ($i < $numrows); $i=$i+1)
{
    my $hashref = $stmt->fetchrow_hashref("NAME_lc");
    my $featurename = $hashref->{'featurename'};
    print "featurename $featurename\n";
    push @retarray,$hashref;
}
my $json = encode_json (\@retarray);
print $json;

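Using hexdump as in Stefan's example, I have determined that the data read from the database are already in UTF-8. It looks as though they are being re-encoded in the JSON encode method. But why?

The data in the JSON use exactly twice as many bytes as the original UTF-8.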

 perl testcase.pl | hexdump -C
00000000  66 65 61 74 75 72 65 6e  61 6d 65 20 e0 b9 82 e0  |featurename ....|
00000010  b8 a3 e0 b8 87 e0 b9 80  e0 b8 a3 e0 b8 b5 e0 b8  |................|
00000020  a2 e0 b8 99 e0 b9 81 e0  b8 88 e0 b9 88 e0 b8 a1  |................|
00000030  e0 b8 88 e0 b8 b1 e0 b8  99 e0 b8 97 e0 b8 a3 e0  |................|
00000040  b9 8c 0a 5b 7b 22 66 65  61 74 75 72 65 6e 61 6d  |...[{"featurenam|
00000050  65 22 3a 22 c3 a0 c2 b9  c2 82 c3 a0 c2 b8 c2 a3  |e":"............|
00000060  c3 a0 c2 b8 c2 87 c3 a0  c2 b9 c2 80 c3 a0 c2 b8  |................|
00000070  c2 a3 c3 a0 c2 b8 c2 b5  c3 a0 c2 b8 c2 a2 c3 a0  |................|
00000080  c2 b8 c2 99 c3 a0 c2 b9  c2 81 c3 a0 c2 b8 c2 88  |................|
00000090  c3 a0 c2 b9 c2 88 c3 a0  c2 b8 c2 a1 c3 a0 c2 b8  |................|
000000a0  c2 88 c3 a0 c2 b8 c2 b1  c3 a0 c2 b8 c2 99 c3 a0  |................|
000000b0  c2 b8 c2 97 c3 a0 c2 b8  c2 a3 c3 a0 c2 b9 c2 8c  |................|
000000c0  22 2c 22 69 64 22 3a 37  35 7d 5d                 |","id":75}]|
000000cb
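The doubling in the dump is what one would expect if already-encoded UTF-8 octets are encoded a second time. The following is a minimal sketch of that effect, not the asker's code: it takes the first three octets of the feature name from the dump and runs them through Encode's `encode`, which treats each octet as a Latin-1 character. The variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# First Thai character of the feature name (U+0E42) as raw UTF-8 octets,
# copied from the first part of the hexdump.
my $octets = "\xE0\xB9\x82";

# Encoding again treats every octet as a Latin-1 character, so each
# octet >= 0x80 turns into a two-octet UTF-8 sequence.
my $twice = encode('UTF-8', $octets);

print unpack('H*', $octets), "\n";   # e0b982
print unpack('H*', $twice),  "\n";   # c3a0c2b9c282
```

The second line matches the start of the JSON value in the dump (`c3 a0 c2 b9 c2 82`), which is consistent with the value being UTF-8 octets that the JSON encoder encoded once more.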

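Any further suggestions? I tried calling decode on the UTF-8 string but got errors about wide characters.

I did read the recommended answer from Tom Christiansen, as well as his Unicode tutorials, but I confess much of it was over my head. Also, my problem seems considerably more constrained.

I did wonder whether retrieving the hash value and assigning it to a normal variable performs some sort of automatic decoding or encoding. I really do not understand when Perl uses its internal character format and when it keeps the external encoding.

UPDATE WITH SOLUTION

It turns out that since the strings retrieved from the database are already UTF-8, I needed to use 'to_json' rather than 'encode_json'. This fixed the problem. I did learn a lot about Perl Unicode handling in the process, though...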
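A small sketch of why this helps, under some assumptions: it uses the core module JSON::PP rather than JSON so that it runs without extra installs, and it uses `JSON::PP->new->encode` as a stand-in for `to_json`, which the JSON documentation describes as equivalent to `JSON->new->encode` (no `utf8` option). The row contents are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use JSON::PP;
use Encode qw(decode);

# Hypothetical row: the value holds raw UTF-8 octets (U+0E42),
# as the hexdump in the question suggests.
my $row = { featurename => "\xE0\xB9\x82" };

# encode_json expects characters and UTF-8-encodes its output,
# so the octets end up encoded twice.
my $double = encode_json([$row]);

# Without the utf8 option the encoder does not encode the string,
# so the octets pass through unchanged.
my $passthrough = JSON::PP->new->encode([$row]);

# Alternative: decode once on input, then let encode_json encode once.
my $decoded = { featurename => decode('UTF-8', $row->{featurename}) };
my $once    = encode_json([$decoded]);

print unpack('H*', $double),      "\n";
print unpack('H*', $passthrough), "\n";
print unpack('H*', $once),        "\n";
```

The pass-through variant only gives correct output if the handle it is printed to does not add an encoding layer of its own. Decoding on input and encoding once on output is usually the less fragile arrangement.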


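【Comments】:

Use JSON->new->utf8->encode(\@retarray) instead?
I have withdrawn my answer. Without the exact input string it is obviously impossible to write test code that shows the correct behaviour. I.e. your question is missing the exact contents of $hashref->{featurename}. Is it an octet string holding encoded UTF-8? Is it a string in the internal Perl representation, i.e. UTF-8 decoded? Note that this may change depending on whether an operation is executed under use utf8; or no utf8;.
Just stumbled over Tom Christiansen's answer. Probably a recommended primer if you are doing anything with Unicode/UTF-8 in Perl.
Did you configure your database connection? SET client_encoding TO 'UTF-8'
Depending on your DBD::Pg version, you may also need to set $dh->{pg_enable_utf8}=1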

Tags: json perl unicode utf-8

【Solution 1】:

NOTE: you should probably also read this answer, which makes my answer sub-par in comparison :-)

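The problem is that you have to be sure which format each string is in, otherwise you will get incorrect conversions. When handling UTF-8, a string can be in one of two formats:

a raw UTF-8 encoded octet string, i.e. \x{100} represented as the two octets 0xC4 0x80
the internal Perl string representation, i.e. a single Unicode character \x{100} (U+0100 Ā LATIN CAPITAL LETTER A WITH MACRON)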
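The two formats can be told apart by looking at length and code points. A minimal sketch, with variable names chosen only for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Raw UTF-8 octets for U+0100: two "characters", each one octet.
my $octets = "\xC4\x80";

# Decode a copy in place into Perl's internal character representation.
my $chars = $octets;
utf8::decode($chars) or die "input is not valid UTF-8\n";

print length($octets), "\n";          # 2 (octets)
print length($chars),  "\n";          # 1 (character)
printf "U+%04X\n", ord($chars);       # U+0100
```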

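If I/O is involved you also need to know whether the I/O layer does UTF-8 decoding/encoding. For terminal I/O you also have to consider whether the terminal understands UTF-8. Taken together, these can make it difficult to get meaningful debugging printouts from your code.

If your Perl code needs to process UTF-8 strings after reading them from the source, you must make sure they are in the internal Perl format. Otherwise you will get surprising results when you call code that expects Perl character strings rather than raw octet strings.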
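One example of such a surprise, sketched under the assumption that the input arrives as raw UTF-8 octets: `substr` on octets cuts through the middle of a character, while the same call on a decoded string returns a whole character. The input value and variable names are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: UTF-8 octets for the two characters U+0E42 U+0E23.
my $input = "\xE0\xB9\x82\xE0\xB8\xA3";

# On the octet string, substr returns a single octet, which is only
# a fragment of the first character.
my $broken = substr($input, 0, 1);

# Decode once on input, work on characters, encode once on output.
my $text   = decode('UTF-8', $input);
my $first  = substr($text, 0, 1);        # the character U+0E42
my $output = encode('UTF-8', $first);    # back to octets for output

print unpack('H*', $broken), "\n";   # e0
print unpack('H*', $output), "\n";   # e0b982
```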

I try to show this in my example code:

#!/usr/bin/perl
use warnings;
use strict;

use JSON;

open(my $utf8_stdout, '>& :encoding(UTF-8)', \*STDOUT)
    or die "can't reopen STDOUT as utf-8 file handle: $!\n";

my $hex = "C480";
print "${hex}\n";

my $raw = pack('H*', $hex);
print STDOUT       "${raw}\n";
print $utf8_stdout "${raw}\n";

my $decoded;
utf8::decode($decoded = $raw);
print STDOUT       ord($decoded), "\n";
print STDOUT       "${decoded}\n"; # Wide character in print at...
print $utf8_stdout "${decoded}\n";

my $json = JSON->new->encode([$decoded]);
print STDOUT       "${json}\n"; # Wide character in print at...
print $utf8_stdout "${json}\n";

$json = JSON->new->utf8->encode([$decoded]);
print STDOUT       "${json}\n";
print $utf8_stdout "${json}\n";

exit 0;

Copied and pasted from my terminal (which supports UTF-8). Look closely at the differences between the lines:

$ perl dummy.pl
C480
Ā
Ä
256
Wide character in print at dummy.pl line 21.
Ā
Ā
Wide character in print at dummy.pl line 25.
["Ā"]
["Ā"]
["Ā"]
["Ä"]

But compare that with the following, where STDOUT is not a terminal but is piped to another program. The hex dump always shows "c4 80", i.e. the UTF-8 encoding.

$ perl dummy.pl | hexdump -C
Wide character in print at dummy.pl line 21.
Wide character in print at dummy.pl line 22.
Wide character in print at dummy.pl line 25.
Wide character in print at dummy.pl line 26.
00000000  43 34 38 30 0a c4 80 0a  c4 80 0a 5b 22 c4 80 22  |C480.......[".."|
00000010  5d 0a 5b 22 c4 80 22 5d  0a 43 34 38 30 0a c4 80  |].[".."].C480...|
00000020  0a 32 35 36 0a c4 80 0a  5b 22 c4 80 22 5d 0a 5b  |.256....[".."].[|
00000030  22 c4 80 22 5d 0a                                 |".."].|
00000036

【Comments】: