【Question title】: Find similar text files
【Posted】: 2014-01-14 07:14:52
【Question】:

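Does anyone have a particularly elegant command-line (Linux, OS X) way to identify files in a given directory that are "textually similar"?

By "textually similar" I mean that the files should differ in only N lines.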

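【Comments】:

  • "Differ in N lines": does order count? Case? Whitespace? The question is unclear. Please give an example.
  • Maybe you should use a version control system such as git ...
  • @Kent: I am talking about simply counting the lines reported by a regular diff command.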

Tags: linux macos sed awk diff


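【Solution 1】:

Working with Terraform means that many files are copied from other files with only a few changes. When you want to see what is special about a file, it is frustrating to work out where it was copied from. I made a tool called similarities.sh to help me determine how similar one file is to each file in a set.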

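#!/bin/bash
# Usage: similarities.sh FILE_A FILE_B...
# Reports how many lines differ between FILE_A and each FILE_B.

fileA="$1"
shift
for fileB in "$@"; do
    # Run diff once and count both sides from the saved output. Piping through
    # tee into two greps that share stderr can deliver the counts in either order.
    delta=$(diff -- "$fileA" "$fileB")
    diffA=$(grep -c '^< ' <<<"$delta")    # lines only in fileA
    diffB=$(grep -c '^> ' <<<"$delta")    # lines only in fileB
    printf 'The files %s and %s have %s:%s diffs out of %s:%s lines.\n' \
        "$fileA" "$fileB" "$diffA" "$diffB" \
        "$(wc -l < "$fileA")" "$(wc -l < "$fileB")"
done | column -t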

Here it is in action:

$ similarities.sh terraform.tfvars ../*/terraform.tfvars
The  files  terraform.tfvars  and  ../api_proxy/terraform.tfvars                   have  3:3   diffs  out  of  51:51  lines.
The  files  terraform.tfvars  and  ../cf-ip-location-lookup/terraform.tfvars       have  4:12  diffs  out  of  51:59  lines.
The  files  terraform.tfvars  and  ../cf-region-cookie-setter/terraform.tfvars     have  4:8   diffs  out  of  51:55  lines.
The  files  terraform.tfvars  and  ../cf-switch-region-origin/terraform.tfvars     have  4:10  diffs  out  of  51:57  lines.
The  files  terraform.tfvars  and  ../reformat_devops_alerts/terraform.tfvars      have  0:0   diffs  out  of  51:51  lines.
The  files  terraform.tfvars  and  ../restart_location/terraform.tfvars            have  17:3  diffs  out  of  51:37  lines.
The  files  terraform.tfvars  and  ../warehouse-availability-etl/terraform.tfvars  have  3:3   diffs  out  of  51:51  lines.
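
The same diff-and-count idea can be extended to compare every pair of files in a directory and keep only the pairs within N lines of each other, which is closer to what the question asks. This is a minimal sketch, assuming bash and plain text files; similar_pairs and its two arguments are hypothetical names and are not part of similarities.sh.

```shell
#!/bin/bash
# Hypothetical helper: similar_pairs DIR N
# Prints each pair of regular files in DIR whose diff shows at most N lines
# on either side, followed by the two counts as REMOVED:ADDED.
similar_pairs() {
    local dir="$1" max="$2"
    local files=() f i j delta removed added
    # Collect regular files only; the glob expands in sorted order.
    for f in "$dir"/*; do
        [ -f "$f" ] && files+=("$f")
    done
    for ((i = 0; i < ${#files[@]}; i++)); do
        for ((j = i + 1; j < ${#files[@]}; j++)); do
            # diff exits 1 when the files differ; that is not an error here.
            delta=$(diff -- "${files[i]}" "${files[j]}" || true)
            # grep -c exits 1 when the count is 0, hence the "|| true".
            removed=$(grep -c '^<' <<<"$delta" || true)   # lines only in the first file
            added=$(grep -c '^>' <<<"$delta" || true)     # lines only in the second file
            if [ "$removed" -le "$max" ] && [ "$added" -le "$max" ]; then
                printf '%s %s %s:%s\n' "${files[i]}" "${files[j]}" "$removed" "$added"
            fi
        done
    done
}
```

After sourcing the function, calling it as `similar_pairs . 3` would list the pairs in the current directory that differ by at most three lines on each side. It runs one diff per pair, so the cost grows quadratically with the number of files.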

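【Comments】:

    【Solution 2】:

    Perhaps PMD is what you are looking for: https://pmd.github.io

    It is well maintained and easy to use.

    You probably want its duplicate code detection: https://pmd.github.io/pmd-5.5.5/usage/cpd-usage.html (Your question does not make clear whether you are targeting code or plain text, but I don't see why it wouldn't work in both cases.)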

    【Comments】:

      【Solution 3】:

      Using awk:

      diff file1 file2 | awk '
          # Hunk header (any line not starting with <, > or -, e.g. "3d2"):
          # use it as the key and reset both counters for this hunk.
          !/^<|^>|^-/ { a = $0; lt[a] = 0; gt[a] = 0; next }
          /^</ { lt[a]++ }    # line present only in file1
          /^>/ { gt[a]++ }    # line present only in file2
          END {
              # For each hunk take the larger of the two counts, then add them up.
              for (i in lt)
                  sum += (lt[i] > gt[i]) ? lt[i] : gt[i]
              printf "Files have differs in %d lines\n", sum
              if (sum < 3) { print "So files are similar" }
              else { print "So files are not similar" }
          }'
      

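      You can choose the threshold yourself. In my command it is the "if (sum<3)" test, which treats the files as similar when at most two lines differ.

      Test results: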

      $ cat file1
      a
      b
      a
      d
      b
      c
      c
      
      $ cat file2
      a
      b
      d
      b
      d
      c
      d
      f
      
      $ diff file1 file2
      3d2
      < a
      5a5
      > d
      7,8c7,8
      < c
      <
      ---
      > d
      > f
      
      $ diff file1 file2 | awk '!/^<|^>|^-/{a=$0;lt[a]=0;gt[a]=0;next}/^</{lt[a]++}/^>/{gt[a]++}END{for (i in lt) sum+=lt[i]>gt[i]?lt[i]:gt[i];printf "Files have differs in %d lines\n",sum;if (sum<3) {print "So files are similar" }else{print "So files are not similar"}}'
      
      Files have differs in 4 lines
      So files are not similar
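
The per-hunk counting above can be wrapped in a function so that the threshold N becomes an argument instead of a hard-coded 3. This is only a sketch under a few assumptions: changed_lines and is_similar are hypothetical names, hunks are numbered rather than keyed by their header text, and diff's "\ No newline at end of file" marker lines are skipped so they are not mistaken for hunk headers.

```shell
#!/bin/bash
# changed_lines FILE1 FILE2
# Prints, summed over all hunks, the larger of the "<" and ">" line counts.
changed_lines() {
    diff -- "$1" "$2" | awk '
        /^\\/     { next }             # "\ No newline at end of file" marker
        !/^[<>-]/ { hunk++; next }     # hunk header such as "3d2" or "7,8c7,8"
        /^</      { lt[hunk]++ }       # line present only in FILE1
        /^>/      { gt[hunk]++ }       # line present only in FILE2
        END {
            for (h = 1; h <= hunk; h++)
                sum += (lt[h] > gt[h]) ? lt[h] : gt[h]
            print sum + 0
        }'
}

# is_similar FILE1 FILE2 N
# Succeeds (exit status 0) when at most N lines differ.
is_similar() {
    [ "$(changed_lines "$1" "$2")" -le "$3" ]
}
```

With the functions sourced, `is_similar file1 file2 2 && echo similar` would print "similar" only when the two files differ in at most two lines.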
      

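      【Comments】:

        【Solution 4】:

        Here is a crude approach that uses a unified diff and wc to count the number of differing lines. grep filters out the file and hunk header lines of the diff: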

        diff -U 0 file1 file2 | grep -v '^@' | grep -v '^---' | grep -v '^+++' | wc -l
        

        【Comments】:

        • If two lines differ, your command will output 4, not 2.
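
The doubling mentioned in the comment comes from a unified diff listing each changed line twice: once with "-" for the old text and once with "+" for the new text. One way around it is to count the two sides separately. This is a minimal sketch, assuming a diff that accepts -U 0; unified_counts is a hypothetical name.

```shell
#!/bin/bash
# unified_counts FILE1 FILE2
# Prints "REMOVED ADDED" as counted from a zero-context unified diff.
unified_counts() {
    diff -U 0 -- "$1" "$2" | awk '
        NR <= 2 { next }        # the "--- file1" and "+++ file2" header lines
        /^-/    { removed++ }   # line present only in FILE1
        /^\+/   { added++ }     # line present only in FILE2
        END     { print removed + 0, added + 0 }'
}
```

Skipping the two header lines by position, rather than with grep -v '^---', also keeps removed lines whose own text begins with "--" from being dropped.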