【Question title】: Sed not working with special characters
【Posted】: 2012-09-09 12:10:24
【Question】:

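I have a sample text file in which some digits are encoded as non-ASCII characters. I have the character map that was used to encode the file, but when I use sed to replace each character, I get unexpected results.

The text looks like this: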

 ¤»¤ ¡  1 3

3ô1ô ôôôôô1ô
ôôôô
                       ôôôôô¤ôôôôô»ôôôôô¤ôôôôôô ô¡ ô 1 3ô

These are the commands I have tried:

sed -r 's/`echo ô`/5/g' new.txt
sed -r 's/\ô/5/g' new.txt

I also tried perl:

perl -pe 's/\ô/5/g' < new.txt

I need help. Thanks.

【Comments】:

    Tags: linux ubuntu sed non-ascii-characters


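    【Solution 1】:

    I think the way to solve this is to first get the characters into an unambiguous form (in both files). Then loop through the mapping file, adding each unambiguous character to a hash together with its stated value. Finally, loop through the unambiguous sample characters (each unambiguous character is 16 characters long), replacing each one with its hash value. This will probably break if the sample file contains ASCII characters, because their unambiguous form is not 16 characters long. You may need to adapt this to your input, but if your sample text is representative of your actual file, you should not have any problems. Let me know if the result is not what you expect.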
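
    To illustrate what "unambiguous form" means here, the following is a minimal sketch of what sed's `l` command prints for a short line. The input bytes are made up for the example (a UTF-8 byte order mark, a UTF-8 encoded ô, the digit 1 and a Windows line ending); they are not taken from the actual files. Plain `l` is used instead of `l0` because the line is too short to be wrapped.

```shell
# Hypothetical input: a UTF-8 byte order mark (octal 357 273 277), then "ô"
# encoded as UTF-8 (octal 303 264), then the ASCII digit 1 and a Windows
# line ending. printf builds the bytes from octal escapes, so the example
# does not depend on the terminal encoding.
unambiguous=$(printf '\357\273\277\303\2641\r\n' | LC_ALL=C sed -n 'l')

# Every non-ASCII byte becomes a four-character octal escape, the carriage
# return becomes \r, and the end of the line is marked with $.
printf '%s\n' "$unambiguous"
# prints: \357\273\277\303\2641\r$
```

    The leading `\357\273\277`, the trailing `$` and the `\r` are the pieces that the `sanitize` subroutine in the script strips before the lookup.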

    Run it like this:

    ./translate.pl CharMap.txt sample.txt
    

    Contents of translate.pl:

    #!/usr/bin/perl
    use strict;
    use warnings;
    
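    # open the files up for reading.
    # ARGV[0] points to the first file listed, 'CharMap.txt'
    # ARGV[1] points to the second file listed, 'sample.txt'
    # the handles are not read from below (sed reads the files itself);
    # opening them makes the script die early if a file is missing.
    open(CHARMAP, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
    open(SAMPLE, '<', $ARGV[1]) or die "Cannot open $ARGV[1]: $!";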
    
    # execute `sed -n 'l0'` on each file and capture the output into two arrays
    # the '-n' flag suppresses automatic printing of the pattern space
    # the 'l' command prints the pattern space in an unambiguous form;
    # the 0 (a GNU sed extension) turns off line wrapping
    my @charmap = `sed -n 'l0' $ARGV[0]`;
    my @sample = `sed -n 'l0' $ARGV[1]`;
    
    # declare a hash
    my %charhash;
    
    # loop through the array of character mappings
    for (@charmap) {
        # use a subroutine to sanitize each element
        $_ = sanitize($_);
        # add each unambiguous character to a hash with its mapping pair
        $charhash{ substr $_, 2 } = substr $_, 0, 1;
    }
    
    # now loop through the unambiguous sample data
    # in your sample file there is only a single element so the loop is unnecessary
    for (@sample) {
        # use a subroutine to sanitize each element
        $_ = sanitize($_);
        # each unambiguous character is 16 readable characters long,
        # so we step through the line 16 characters at a time, capturing each chunk in $1.
        # then we ask the hash 'what is the value of the element $1?'
        # and print that value.
        print $charhash{$1} while $_ =~ /(.{16})/g;
    
        # print a newline char to replace the chomped input
        print "\n";
    }
    
    close CHARMAP;
    close SAMPLE;
    
    sub sanitize {
    
        # read in the element passed to the subroutine
        my $line = shift;
    
        # remove newline endings
        chomp $line;
    
        # both files start with a UTF-8 byte order mark (bytes EF BB BF), which
        # sed's l command prints as the 12 characters \357\273\277. I only
        # found it on the first line, but for convenience I simply remove it
        # from every line.
        $line =~ s/^\\357\\273\\277//;
    
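        # trim off the trailing $ that sed's l command uses to mark the end of a line
        $line =~ s/\$$//;
    
        # trim off a trailing carriage return (Windows line ending), which
        # sed's l command prints as the two characters \r
        $line =~ s/\\r$//;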
    
        return $line;
    }
    

    Result:

    3177191281013,997,094
    

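    More information about sed's l0 command can be found in the sed manual.

    【Discussion】:

    • This works fine, but I don't know Perl, so could you please walk me through the code?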
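
    As a side note on the first command in the question: the backticks sit inside single quotes, so the shell never runs `echo` and sed receives the backtick text literally. The sketch below shows one way to pass a non-ASCII character to sed as raw bytes. It assumes the input is plain UTF-8, so that ô is the two bytes 0xC3 0xB4; the 16-character unambiguous forms above suggest the actual file may be encoded differently, in which case the byte values would need adjusting. The variable names and the sample line are invented for illustration.

```shell
# Assumption: the input is plain UTF-8, so "ô" is the two bytes 0xC3 0xB4
# (octal 303 264). The sample line below is made up for illustration.
o_circ=$(printf '\303\264')

# Double quotes let the shell expand $o_circ before sed sees the script;
# LC_ALL=C makes sed treat the input as raw bytes.
replaced=$(printf '3\303\2641\303\264\n' | LC_ALL=C sed "s/$o_circ/5/g")

printf '%s\n' "$replaced"
# prints: 3515
```

    For a full character map this needs one substitution per character, which is why the hash lookup in translate.pl scales better.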