How to find duplicate filenames in a directory in unix [closed]: answers

【Question title】: How to find duplicate filenames in a directory in unix [closed]
【Posted】: 2014-08-07 09:41:02
【Question description】:

Below are a few of the files in my directory.

$ pwd
/opt/offline/

1  -rw-r--r--. 1 root root  40513 Aug  7 10:02 TN_DAY0OFFER8047_07082014100213_processed
2  -rw-r--r--. 1 root root  32335 Aug  7 10:02 TN_DAY0OFFER8204_07082014100217_processed
3  -rw-r--r--. 1 root root  20126 Aug  7 10:02 TN_DAY0OFFER8047_07082014100221_processed
4  -rw-r--r--. 1 root root 205175 Aug  7 10:02 TN_DAY0OFFER7027_07082014100225_locked
5  -rw-r--r--. 1 root root  15776 Aug  7 10:02 TN_DAY0OFFER7020_07082014100229_locked
6  -rw-r--r--. 1 root root      0 Aug  7 10:02 TN_DAY0OFFER7020_07082014100233_locked

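The 1st and 3rd files have the same name (ignoring the timestamp), and so do the 5th and 6th. I want to take the duplicate files (the 3rd and 6th) and append them to the 1st and 5th respectively, so that no duplicates remain and no data is lost (preferably using perl or shell).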

【Question discussion】:

Tags: perl shell unix

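【Solution 1】:

Here is a Perl script that does what you want. It looks for files in the current directory whose names start with "TN" and builds a hash of arrays, grouping together files whose names differ only in the timestamp. It then walks the hash, appends the later files of each group to the first one, and deletes the files that were appended.

Needless to say, back up your original files before using this script!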

use strict;
use warnings;

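my %merges;
for my $file (glob "TN*") {
    # Group by the name with its "_<digits>_" timestamp field removed.
    if ($file =~ /(.*)_\d+_(.*)/) {
        # Single-quote each name because the list is passed to the shell below.
        push @{$merges{"$1$2"}}, "'$file'";
    }
}

for (sort keys %merges) {
    my @files = @{$merges{$_}};
    my $target = shift @files;    # first file of the group receives the rest
    if (@files) {
        print "concatenating @files to $target\n";
        system("cat @files >> $target && rm @files") == 0
            or warn "merge into $target failed\n";
    }
}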
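
Before running anything that deletes files, it can help to preview which names would be grouped together. The following is a rough shell dry run, not part of the answer above. It strips the first `_<digits>_` field with sed (the Perl regex matches the last such field, which is the same thing for the names in the question) and prints every key that occurs more than once. The function names `strip_timestamp` and `list_duplicate_keys` are made up for illustration, and file names are assumed not to contain newlines.

```shell
#!/bin/sh
# Dry run: report which de-timestamped names occur more than once.

# Replace the first "_<digits>_" field with "_" (hypothetical helper).
strip_timestamp() {
    printf '%s\n' "$1" | sed 's/_[0-9][0-9]*_/_/'
}

# Print each key shared by two or more regular files in directory $1.
# The body runs in a subshell, so the caller's working directory is untouched.
list_duplicate_keys() (
    cd "$1" || exit 1
    for f in TN*; do
        [ -f "$f" ] || continue
        strip_timestamp "$f"
    done | LC_ALL=C sort | uniq -d
)
```

For the listing in the question, `list_duplicate_keys /opt/offline` should report `TN_DAY0OFFER7020_locked` and `TN_DAY0OFFER8047_processed`. Nothing is modified, so it is safe to run against the real directory.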

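【Discussion】:

Thank you so much ... !!!! It worked!!!
+1 for declaring my @parts and then never using it. The poor OP will be stuck with that line forever! :P
@jaypal Thanks for spotting that. Originally I was splitting the filename on /_/, but I changed my mind.
@TomFenech No worries, I figured as much. :) I'm sure your answer would be more convincing if you used a worse way of concatenating the files and removing them, but TIMTOWTDI I guess...
@jaypal The idea did cross my mind, but I came to the conclusion that on a Unix-based system, cat and rm are already very good at doing those jobs! Never mind...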

【Solution 2】:

Using Bash 4.0.

#!/bin/bash

error_exit() {
    echo "$1" >&2
    exit 1
}

[ -n "$BASH_VERSION" ] && [[ BASH_VERSINFO -ge 4 ]] || error_exit "Script requires Bash 4.0."

[[ -z $1 || ! -d $1 ]] && error_exit "Directory not specified or doesn't exist: $1"

pushd "$1" || error_exit "Unable to change directory to $1."

declare -A MAP

shopt -s nullglob

for F in *_*_*_*; do
    [[ -f $F ]] || continue
    IFS=_ read -r A B C D __ <<< "$F"
    BASE=${MAP["$A|$B|$D"]}
    if [[ -n $BASE ]]; then
        cat "$F" >> "$BASE"
        rm -f -- "$F"
    else
        MAP["$A|$B|$D"]=$F
    fi
done

Usage:

bash script.sh dir

Note: if you don't want files to be deleted or changed by mistake, test with a copy of the directory first.

cp -a dir /tmp/dir.copy
bash script.sh /tmp/dir.copy

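When it comes to manipulating files, the shell is the better fit. This could also be done with awk, but awk would still depend on /bin/sh, and sanitizing the arguments is sometimes difficult or tedious.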
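
The script above needs Bash 4.0 for its associative array (`declare -A`). For older shells, the following is a rough POSIX sh alternative, not a drop-in replacement: instead of a map, it writes `key<TAB>name` lines, sorts them so that files with the same key are adjacent, and appends each later file to the first of its group. The function name `merge_duplicates` is made up for illustration, and the sketch assumes names of the form `A_B_TIMESTAMP_D` that contain no tabs or newlines.

```shell
#!/bin/sh
# Merge files whose names differ only in the third "_"-separated field.
# Assumption: names look like A_B_TIMESTAMP_D and contain no tabs or newlines.

merge_duplicates() (
    cd "$1" || exit 1
    tab=$(printf '\t')
    for f in *_*_*_*; do
        [ -f "$f" ] || continue
        # Emit "key<TAB>name"; the key leaves out the timestamp field.
        key=$(printf '%s\n' "$f" | awk -F_ '{ print $1 "|" $2 "|" $4 }')
        printf '%s\t%s\n' "$key" "$f"
    done | LC_ALL=C sort | {
        prev= target=
        while IFS=$tab read -r key f; do
            if [ "$key" = "$prev" ]; then
                # Same group: append to its first file, then remove the duplicate.
                cat -- "$f" >> "$target" && rm -f -- "$f"
            else
                # New group: this file becomes the merge target.
                prev=$key target=$f
            fi
        done
    }
)
```

It would be called as `merge_duplicates dir`. Because the lines are sorted by key and then by name, the target of each group is the file whose name sorts first. As with the Bash script, try it on a copy of the directory before touching the originals.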

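【Discussion】:

Thanks a lot for your effort, but BASH 4.0 is not supported here.... :( Could you please make it simpler, since we don't need anything this complex?
@user3916993 How about Ruby?
No... I need it in Shell/Perl...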

【Solution 3】:

Using Perl:

#!/usr/bin/env perl
use strict;
use warnings;
my $dir = $ARGV[0];
die "No argument was passed." if not defined $dir;
die "Argument is not a directory: $dir" if not -d $dir;
chdir $dir or die "Unable to change directory to $dir: $!";
my @files = <*_*_*_*>;
my $map = {};
foreach my $f (@files) {
    next if not -f $f;
    # Key on every field except the third (the timestamp).
    my ($prefix, $offer, undef, $status) = split(/_/, $f);
    my $key = "$prefix|$offer|$status";
    my $base = $map->{$key};
    if (defined $base) {
        # Duplicate: append its contents to the first file seen for this key.
        open(my $out, '>>', $base) or die "Unable to open file $base for appending: $!";
        open(my $in, '<', $f) or die "Unable to open file $f for reading: $!";
        while (my $line = <$in>) {
            print {$out} $line;
        }
        close($in);
        close($out) or die "Unable to finish writing $base: $!";
        unlink $f or warn "Unable to remove $f: $!";
    }
    else {
        # First file with this key becomes the merge target.
        $map->{$key} = $f;
    }
}

Usage:

perl script.pl dir

【Discussion】:

【Solution 4】:

I think there is some using a hammer to crack a nut going on here...

#! /bin/sh -
# Concatenate files sharing a common prefix (before '_').
# The files are concatenated to a file named by the prefix.

curr=XXX

ls *_* | sort | while read -r fn
do
    pfx=`expr "$fn" : '\([^_]*\).*'`
    if test "$pfx" = "$curr"; then
        # another in this group of files, sharing a prefix
        cat "$fn" >> "$pfx"
    else
        # new group of files with prefix $pfx
        cp "$fn" "$pfx"
        curr=$pfx
    fi
done

This doesn't do exactly what you asked for, but it seems to fit your requirements (and doesn't involve *shudder* Perl).
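
To see where this departs from the question, it may help to look at what the `expr` pattern actually extracts. The small sketch below wraps the same expression in a function (the name `prefix_of` is made up for illustration) and applies it to two names from the question.

```shell
#!/bin/sh
# Show the group name that the expr pattern derives from a file name.
prefix_of() {
    # Capture everything before the first underscore.
    expr "$1" : '\([^_]*\).*'
}

prefix_of TN_DAY0OFFER8047_07082014100213_processed   # prints TN
prefix_of TN_DAY0OFFER7020_07082014100229_locked      # prints TN
```

Since every name in the question starts with `TN_`, they all share the prefix `TN`, so the script would gather all of them into a single file named `TN` and leave the originals in place.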

【Discussion】: