Ubuntu: Perl is misreading filenames with Cyrillic characters

【Question title】: Ubuntu: Perl is misreading filenames with Cyrillic characters
【Posted】: 2021-06-09 14:33:54
【Question】:

I have many files with Cyrillic characters in their names, for example Deceasedя0я0.25я3.xgboost.json

I read these files in with a function:

use Devel::Confess 'color';
use utf8;
use autodie ':all';
use open ':std', ':encoding(UTF-8)';

sub json_file_to_ref {
    my $json_filename = shift;
    open my $fh, '<:raw', $json_filename; # Read it unmangled
    local $/;                     # Read whole file
    my $json = <$fh>;             # This is UTF-8
    my $ref = decode_json($json); # This produces decoded text
    return $ref;                  # Return the ref rather than the keys and values.
}

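which I got from "perl & python writing out non-ASCII characters into JSON differently".

The problem is that Perl reads the filenames as something like DeceasedÑ0Ñ0.2Ñ3.xgboost.json, i.e. it turns я into Ñ, which means these files do not show up when I do a regex search.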
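The Ñ is not an arbitrary substitution. In UTF-8, я (U+044F) is stored as the two bytes 0xD1 0x8F; treated as Latin-1, 0xD1 is Ñ and 0x8F is an invisible control character. The following is a minimal sketch of that effect using only the core Encode module; the variable names are illustrative and not taken from the script above.

```perl
use strict;
use warnings;
use utf8;                                  # string literals below are UTF-8 source text
use Encode qw(encode_utf8 decode_utf8);

my $text  = 'я';                           # one character, U+044F
my $bytes = encode_utf8($text);            # two bytes, the form stored in the filename

# Show the individual byte values of the encoded form.
my @codes = map { sprintf '%02X', ord } split //, $bytes;
print "bytes: @codes\n";                   # prints "bytes: D1 8F"

# A character-level pattern does not match the undecoded byte string.
my $raw_matches = ($bytes =~ /я/) ? 'yes' : 'no';

# Decoding turns the two bytes back into the single Cyrillic character.
my $decoded         = decode_utf8($bytes);
my $decoded_matches = ($decoded =~ /я/) ? 'yes' : 'no';

print "raw matches: $raw_matches, decoded matches: $decoded_matches\n";
printf "length before decode: %d, after: %d\n", length $bytes, length $decoded;
```

This is why a regex written under use utf8 finds nothing in the undecoded names that readdir returns.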

The filenames are read like this:

sub list_regex_files {
    my $regex = shift;
    my $directory = '.';
    if (defined $_[0]) {
        $directory = shift
    }
    my @files;
    opendir (my $dh, $directory);
    $regex = qr/$regex/;
    while (my $file = readdir $dh) {
        if ($file !~ $regex) {
            next
        }
        if ($file =~ m/^\.{1,2}$/) {
            next
        }
        my $f = "$directory/$file";
        if (-f $f) {
            if ($directory eq '.') {
                push @files, $file
            } else {
                push @files, $f
            }
        }
    }
    @files
}

However, I can get the files to show up in the regex search if I comment out

use utf8;
use open ':std', ':encoding(UTF-8)';

but then I get an error when I try to read the files (the error below is for a different file):

Use of uninitialized value $/ in string eq at /home/con/perl5/perlbrew/perls/perl-5.32.1/lib/5.32.1/Carp.pm line 605, <$_[...]> chunk 1.
Use of uninitialized value $/ in string eq at /home/con/perl5/perlbrew/perls/perl-5.32.1/lib/5.32.1/Carp.pm line 605, <$_[...]> chunk 1.
Use of uninitialized value $/ in string eq at 4.best.params.pl line 32, <$_[...]> chunk 1.
    main::json_file_to_ref("data/Deceased\x{d1}\x{8f}0\x{d1}\x{8f}0.15\x{d1}\x{8f}3.xgboost.json") called at 4.best.params.pl line 140

I have looked at similar posts, such as "How do I write a file whose *filename* contains utf8 characters in Perl?" and "Perl newbie first experience with Unicode (in filename, -e operator, open operator, and cmd window)", but I am not using Windows.

I have also tried use feature 'unicode_strings', to no avail.

I have also tried

use Encode 'decode_utf8';
sub json_file_to_ref {
    my $json_filename = shift;
    open my $fh, '<:raw', decode_utf8($json_filename); # Read it unmangled
    local $/;                     # Read whole file
    my $json = <$fh>;             # This is UTF-8
    my $ref = decode_json($json); # This produces decoded text
    return $ref;                  # Return the ref rather than the keys and values.
}

but this produces the same error message.

I have also tried

use open ':std', ':encoding(cp866)';
use open IO => ':encoding(cp1251)';

as suggested in "Reading Cyrillic characters from file in perl",

but this failed as well.

How can I get Perl on Linux to read in the filenames as written, via that subroutine?

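【Comments】:

Where is the code that reads the filenames?
@stark I have edited the question to include how the filenames are read.
readdir returns a byte string, which is the form you must use to open the file. To compare it against a text string, you should make a decoded copy to use with your regex. See perlmonks.org/?node_id=583736
If you 'use utf8;', does print decode('utf8', $filename); at least print the filename with the Cyrillic characters? I think the filename as read from readdir() is what you need to pass to open() and the other file-based functions. I think you should decode() the filename from 'utf8' before trying to match it against your regex. Basically, I think you need two different filename strings, the one from readdir() and the 'utf8'-decoded version, and you use whichever is appropriate for the operation. See perlmonks.org/?node_id=583752
See also "In what encoding does readdir return a filename?"

Tags: linux perl ubuntu unicode utf-8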

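【Solution 1】:

As @Ed Sabol pointed out, the problem lies in the characters in the filenames and in how the filenames are read.

The key change is from readdir $dh to decode_utf8(readdir $dh), which lets Perl handle the non-Latin (Cyrillic) filenames. The Encode library also has to be loaded: use Encode 'decode_utf8';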

#!/usr/bin/env perl

use strict;
use warnings FATAL => 'all';
use autodie ':all';
use Devel::Confess 'color';
use feature 'say';
use JSON 'decode_json';
use utf8;
use DDP;
use Encode 'decode_utf8'; # necessary for Cyrillic characters
use open ':std', ':encoding(UTF-8)';    # For say to STDOUT.  Also default for open()

sub json_file_to_ref {
    my $json_filename = shift;
    open my $fh, '<:raw', $json_filename; # Read it unmangled
    local $/;                     # Read whole file
    my $json = <$fh>;             # This is UTF-8
    my $ref = decode_json($json); # This produces decoded text
    return $ref;                  # Return the ref rather than the keys and values.
}

sub list_regex_files {
    my $regex = shift;
    my $directory = '.';
    if (defined $_[0]) {
        $directory = shift
    }
    my @files;
    opendir (my $dh, $directory);
    $regex = qr/$regex/;
    while (my $file = decode_utf8(readdir $dh)) {
        if ($file !~ $regex) {
            next
        }
        if ($file =~ m/^\.{1,2}$/) {
            next
        }
        my $f = "$directory/$file";
        if (-f $f) {
            if ($directory eq '.') {
                push @files, $file
            } else {
                push @files, $f
            }
        }
    }
    @files
}
my @files = list_regex_files('я.json$');
p @files;

my $data = json_file_to_ref('я.json');
p $data;
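
The comments on the question suggest keeping two forms of each name: the bytes from readdir for filesystem calls, and a decoded copy for matching. The following is a sketch of that variant, not the code from this answer. The name list_regex_files_bytes is hypothetical, and it assumes the filenames are UTF-8 encoded and that $directory is itself a byte string (plain ASCII paths qualify).

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Hypothetical variant of list_regex_files: the regex is applied to a decoded
# copy of each name, but the value returned is the original byte string.
sub list_regex_files_bytes {
    my ($regex, $directory) = @_;
    $directory //= '.';
    $regex = qr/$regex/;

    opendir(my $dh, $directory) or die "Cannot open directory $directory: $!";
    my @files;
    while (defined(my $bytes = readdir $dh)) {
        next if $bytes eq '.' or $bytes eq '..';

        my $text = decode_utf8($bytes);      # character copy, used only for matching
        next unless $text =~ $regex;

        # Assumption: $directory is a byte string, so the joined path stays bytes.
        my $path = $directory eq '.' ? $bytes : "$directory/$bytes";
        push @files, $path if -f $path;      # the file test receives the raw bytes
    }
    closedir $dh;
    return @files;
}
```

The returned names can be passed to open unchanged; decode them with decode_utf8 only at the point where they are printed or compared against text.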

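As an aside, with Perl 7 on the way, handling non-Latin characters seems like a sensible default, and the current behavior ought to be changed.

【Comments】:

The core problem is that a filesystem can store filenames any way it likes. It would be nice if that were not so, but Perl has no way of knowing. UTF-8, UTF-16, EBCDIC, who knows. The filesystem driver can likewise accept who knows what. Welcome to interoperability.
You can always use the core B module's function B::perlstring() to have Perl show you what is actually in a string, so you can check whether it matches what you think it is. You can't just use print(), because your terminal may interpret the raw bytes and show you what looks like valid Unicode text. perlstring() shows an unambiguous representation that uses escapes for non-ASCII.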
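
A short sketch of the B::perlstring() check described in the last comment, comparing a name as raw bytes with its decoded form. The sample filename is made up for illustration.

```perl
use strict;
use warnings;
use B ();
use Encode qw(decode_utf8);

my $bytes = "\xD1\x8F.json";          # raw UTF-8 bytes, the form readdir returns
my $text  = decode_utf8($bytes);      # decoded character string

# perlstring() returns a double-quoted Perl literal with non-ASCII escaped,
# so the two forms can be told apart regardless of what the terminal renders.
my $bytes_repr = B::perlstring($bytes);
my $text_repr  = B::perlstring($text);

print "bytes: $bytes_repr\n";
print "text:  $text_repr\n";
```

If the two lines print the same escapes for a name you believed was decoded, the decode step did not happen where you expected.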