【Title】: How to combine files like 'diff --unified' does?
【Posted】: 2014-01-26 00:52:04
【Question】:

I am looking for a way to combine two or more input files into a single output file. It should work exactly like 'diff -U 999999 file1.txt file2.txt > output.txt', but without the difference indicators.
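For illustration (this pipeline is not from the question, just a sketch of the idea): a unified diff with a huge context radius already contains every line of both files, so stripping GNU diff's three header lines and the one-character indicator column yields the combined file:

```shell
# Two hypothetical files with a common middle section.
printf 'AAA\nBBB\n' > file1.txt
printf 'BBB\nCCC\n' > file2.txt
# diff -U prints '---'/'+++'/'@@' headers, then every line prefixed
# with ' ' (common), '-' (only file1) or '+' (only file2).
# tail -n +4 drops the headers; cut -c2- drops the indicator column.
combined=$(diff -U 999999 file1.txt file2.txt | tail -n +4 | cut -c2-)
printf '%s\n' "$combined"
```

Note that `diff` exits with status 1 when the files differ, which matters under `set -e` unless the call is part of a pipeline as above.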

【Comments】:

  • Do file1.txt and file2.txt overlap? (E.g. file1.txt starts with some unique content followed by content identical to file2.txt, while file2.txt starts with that common part followed by unique content.) I ask because if so, I think I have a solution.
  • Hi hlovdal, yes, the two files may share a common part. I actually want to merge two log files, and any overlapping portion should appear only once.

Tags: merge diff


【Solution 1】:

Here is a script I wrote a while ago to merge a bunch of log files. I started out doing it by hand with kdiff3, which worked fine for small files, but as the accumulated logs grew it became painful and eventually unusable...

Our logs frequently contain the output of printf("time(NULL) = %d\n", time(NULL)); if yours do not, you will have to adapt the script to look for some other monotonically increasing synchronization marker.

#!/usr/bin/perl 
use strict;
use warnings;

# This program takes two overlapping log files and combines
# them into one, e.g.
#
#          INPUT:                    OUTPUT:
#
#   file1        file2              combined
#    AAA                               AAA
#    AAA                               AAA
#    AAA                               AAA
#    BBB          BBB                  BBB
#    BBB          BBB                  BBB
#    BBB          BBB                  BBB
#                 CCC                  CCC
#                 CCC                  CCC
#                 CCC                  CCC
#                 CCC                  CCC
#

# This program uses the "time(NULL) = <...time...>" lines in the
# logs to match where the logs start overlapping.

# Example line matched with this function:
# time(NULL) = 1388772638
sub get_first_time_NULL {
    my $filename = shift;
    my $ret = undef;
    open(FILE, '<', $filename) or die "Cannot open $filename: $!";
    while (my $line = <FILE>) {
        if ($line =~ /^time\(NULL\) = (\d+)/) {
            $ret = $1;
            last;
        }
    }
    close(FILE);
    return $ret;
}

my $F1_first_time = get_first_time_NULL($ARGV[0]);
my $F2_first_time = get_first_time_NULL($ARGV[1]);
die "No 'time(NULL) = ' line found in one of the input files\n"
    unless defined $F1_first_time && defined $F2_first_time;

my $oldest_file;
my $newest_file;
my $newest_file_first_time;

if ($F1_first_time <= $F2_first_time) {
    $oldest_file = $ARGV[0];
    $newest_file = $ARGV[1];
    $newest_file_first_time = $F2_first_time;
} else {
    $oldest_file = $ARGV[1];
    $newest_file = $ARGV[0];
    $newest_file_first_time = $F1_first_time;
}

# Print the "AAA" part
open(FILE, '<', $oldest_file) or die "Cannot open $oldest_file: $!";
while (my $line = <FILE>) {
    print $line;
    last if ($line =~ /^time\(NULL\) = $newest_file_first_time/);
}
close(FILE);

# Print the "BBB" and "CCC" parts
my $do_print = 0;
open(FILE, '<', $newest_file) or die "Cannot open $newest_file: $!";
while (my $line = <FILE>) {
    print $line if $do_print;
    $do_print = 1 if ($line =~ /^time\(NULL\) = $newest_file_first_time/);
}
close(FILE);
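The merge logic above can be sketched in a couple of awk one-liners (a hypothetical illustration with made-up sample logs, not part of the original script): print the older log up to and including the newer log's first sync line, then the newer log after that line:

```shell
# Hypothetical overlapping sample logs.
cat > old.log <<'EOF'
AAA
time(NULL) = 100
BBB
EOF
cat > new.log <<'EOF'
time(NULL) = 100
BBB
CCC
EOF
# First sync marker of the newer log (third field of the matching line).
sync=$(grep -m1 '^time(NULL) = ' new.log | awk '{print $3}')
# Older log up to and including the sync line...
awk -v s="$sync" '{print} $0 == "time(NULL) = " s {exit}' old.log
# ...then the newer log strictly after the sync line.
awk -v s="$sync" 'p; $0 == "time(NULL) = " s {p=1}' new.log
```

This mirrors the "AAA" loop and the "$do_print" loop of the Perl script, but without its file-ordering logic, so it assumes you already know which log is older.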

The perl script above only handles two files, so I wrote the following shell script to process all the log files in one go:

#!/bin/sh

# This script combines several overlapping logfiles into one
# continuous one. See merge_log_files.pl for details on how
# the logs are merged; this script is only glue to process
# multiple files in one operation.

set -e

MERGE_RESULT="$1"
shift

echo "Processing $1..."
cp "$1" MeRgE.TeMp.1
shift

while [ -n "$1" ]
do
    if [ ! -s "$1" ]
    then
        echo "Skipping empty file $1..."
        shift
        continue
    fi
    echo "Processing $1..."
    perl "$(echo "$0" | sed 's/\.sh$/.pl/')" MeRgE.TeMp.1 "$1" > MeRgE.TeMp.2 && mv MeRgE.TeMp.2 MeRgE.TeMp.1
    shift
done

mv MeRgE.TeMp.1 "$MERGE_RESULT"
echo "Done"
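The wrapper locates the Perl script by rewriting its own name ($0), so the two scripts must share a basename and directory, e.g. merge_log_files.sh next to merge_log_files.pl, invoked as something like `./merge_log_files.sh combined.log day1.log day2.log day3.log` (file names are illustrative). The sed substitution it relies on:

```shell
# Derive the .pl name from the .sh name, as the wrapper does with $0.
echo merge_log_files.sh | sed 's/\.sh$/.pl/'   # prints merge_log_files.pl
```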

【Discussion】:
