Comparing two files with awk and printing the non-matching records

【Question title】: Comparison of two files using awk and print non matched records
【Posted】: 2020-04-16 11:53:04
【Question】:

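I am comparing two files, file1 and file2, and I need to print the records in file2 that are updated relative to file1. I need both the records in file2 whose data changed and the newly added records.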

File 1:

1|footbal|play1
2|cricket|play2
3|tennis|play3
5|golf|play5

File 2:

1|footbal|play1
2|cricket|play2
3|tennis|play3
4|soccer|play4
5|golf|play6

Expected output:

4|soccer|play4
5|golf|play6

I have tried the solution below, but it does not produce the expected output:

awk -F'|' 'FNR == NR { a[$3] = $3; a[$1]=$1; next; } { if ( !($3 in a) && !($1 in a) ) { print $0; } }' file1.txt file2.txt

I compared column 1 and column 3 of the two files.

【Comments】:

Are the records sorted? If so, you can use comm to get the differences.
The actual ids are random numbers.

Tags: linux awk

【Solution 1】:

Could you please try the following.

awk 'BEGIN{FS="|"}FNR==NR{a[$1,$3];next} !(($1,$3) in a)' Input_file1  Input_file2

Or the same solution in non-one-liner form.

awk '
BEGIN{
  FS="|"
}
FNR==NR{
  a[$1,$3]
  next
}
!(($1,$3) in a)
'  Input_file1  Input_file2

Explanation: a detailed, line-by-line explanation of the code above.

awk '               ##Starting awk program from here.
BEGIN{              ##Starting BEGIN section of this program from here.
  FS="|"            ##Setting FS as pipe here as per Input_file(s).
}                   ##Closing BEGIN block for this awk code here.
FNR==NR{            ##Condition FNR==NR is TRUE only while the 1st file, Input_file1, is being read.
  a[$1,$3]          ##Creating an entry in array a with index $1,$3 of the current line.
  next              ##next skips all further statements for this line and moves on to the next line.
}                   ##Closing the FNR==NR block here.
!(($1,$3) in a)     ##If the pair $1,$3 is NOT present in array a, print that line from Input_file2.
'  Input_file1  Input_file2     ##mentioning Input_file names here.

The output is as follows.

4|soccer|play4
5|golf|play6
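
To see why the attempt in the question misses the changed record, the two approaches can be run side by side on the sample data. This is only a minimal sketch: the temporary directory and the variable names separate and composite are assumptions made for illustration and are not part of the original answer. With $1 and $3 stored as separate keys, 5|golf|play6 is rejected because 5 already exists as a key; with the composite key ($1,$3), the pair 5 and play6 counts as new.

```shell
# Sample data copied from the question, written to a temporary directory (assumed location).
dir=$(mktemp -d)
printf '%s\n' '1|footbal|play1' '2|cricket|play2' '3|tennis|play3' '5|golf|play5' > "$dir/file1"
printf '%s\n' '1|footbal|play1' '2|cricket|play2' '3|tennis|play3' '4|soccer|play4' '5|golf|play6' > "$dir/file2"

# Attempt from the question: $1 and $3 are stored as two independent keys,
# so a line is printed only when BOTH its id and its third field are unseen.
separate=$(awk -F'|' 'FNR == NR { a[$3]; a[$1]; next } !($3 in a) && !($1 in a)' "$dir/file1" "$dir/file2")

# Answer: $1 and $3 form one composite key, so a changed third field
# under an existing id is reported as well.
composite=$(awk 'BEGIN{FS="|"} FNR==NR{a[$1,$3];next} !(($1,$3) in a)' "$dir/file1" "$dir/file2")

printf 'separate keys:\n%s\n' "$separate"
printf 'composite key:\n%s\n' "$composite"
```

On this data the separate-key version reports only the new record, while the composite-key version reports both the new record and the changed one.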

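【Discussion】:

One last question: are we comparing column by column rather than reading the complete line? Just to clarify, because I need to compare 10 columns in file1 against 10 columns in file2 across 4 million records. If it reads line by line, that is a performance concern.
@rakeshkandukuri, yes, it works column by column; it uses $1,$3, so you can put whichever field numbers you need there. Also, if you want to compare the whole line, use $0 itself. If anything is unclear, please read through my explanation.
I am using the command awk 'NR==FNR{exclude[$0];next} !($0 in exclude)' file1 file2 for a full comparison of the files, and awk 'BEGIN{FS="|"}FNR==NR{a[$1,$3];next} !(($1,$3) in a)' Input_file1 Input_file2 to compare columns of the two files. My question is: will adding more columns to the second command affect performance? I am currently comparing 30 columns of the two files, and it takes longer than the first command. Any ideas?
@rakeshkandukuri, do you mean you will check 30 columns against the other file? Hmm, we would need to test that, but there is no alternative unless you want to compare the whole line itself, which is easier for awk and for us to maintain.
Yes, I am comparing 30 columns in file1 against 30 columns in file2. I think this is a performance problem. If I use $0, will it compare line by line?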
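
For the many-column case raised in these comments, one possible way to avoid typing out 30 field references is to pass the field numbers in a variable and build the key in a function. This is a sketch under assumptions: the variable cols, the function key and the array seen are illustrative names that do not appear in the thread, and nothing here claims it is faster or slower than the $0 version, which would have to be measured on the real data. When $0 is the key, the whole record is the key, so a difference in any field marks the line as changed.

```shell
# Sample data copied from the question, written to a temporary directory (assumed location).
dir=$(mktemp -d)
printf '%s\n' '1|footbal|play1' '2|cricket|play2' '3|tennis|play3' '5|golf|play5' > "$dir/file1"
printf '%s\n' '1|footbal|play1' '2|cricket|play2' '3|tennis|play3' '4|soccer|play4' '5|golf|play6' > "$dir/file2"

# Column comparison: cols (assumed name) lists the field numbers to compare.
by_cols=$(awk -F'|' -v cols='1,3' '
function key(   i, k) {                 # i and k are local variables
  k = ""
  for (i = 1; i <= n; i++)
    k = k SUBSEP $(c[i])                # append each selected field to the key
  return k
}
BEGIN { n = split(cols, c, ",") }       # c[1..n] holds the field numbers
FNR == NR { seen[key()]; next }         # first file: remember its keys
!(key() in seen)                        # second file: print lines with an unseen key
' "$dir/file1" "$dir/file2")

# Whole-line comparison: the complete record $0 is the key.
whole=$(awk 'NR == FNR { seen[$0]; next } !($0 in seen)' "$dir/file1" "$dir/file2")

printf 'by columns:\n%s\n' "$by_cols"
printf 'whole line:\n%s\n' "$whole"
```

On the sample data both commands print the same two records, because the only changed field (column 3) is among the compared columns.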