bash - compare two columns of one file with one column of second file and print matches

【Question title】: bash - compare two columns of one file with one column of second file and print matches
【Posted】: 2020-08-20 00:28:02
【Question】:

I have two different files, each about 1000 lines long, structured as follows:

file1: (First Name;Last Name;Address)

Mike;Tyson;First Street 2
Tom;Boyden;Second Street 6
Tom;Cruise;Third Street 9
Mike;Myers;Second Street 4

file2: (First Name Last Name;E-Mail;ID) OR (Last Name First Name;E-Mail;ID)

Mike Tyson;mike@tyson.com;45753
Cruise Tom;tom@cruise.com;23562
Jennifer Lopez;jennifer@lopez.com;92746
Brady Tom;tom@brady.com;27583

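I want to compare the first two columns of file1 with the entire first column of file2. If both entries from a file1 line are present in the first column of file2 (in either order), I want to print the matching line of file1. Then move on to the second line of file1, compare it against the whole column of file2 again, and so on.

In file2 the order can be (First Name Last Name) or (Last Name First Name), and I want to print the matching line in both cases.

Expected output: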

Mike;Tyson;First Street 2
Tom;Cruise;Third Street 9

I'm happy with a solution using awk, grep, or anything else.

I tried a solution from a similar question, but the output is empty:

awk -F';' 'NR==FNR{c[$1$2]++;next};c[$1$2] > 0' file1 file2

Thanks

【Comments】:

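1. Replace the first ; in file1 with a space. 2. join the files on the first field; specify the -o output format to print only the fields from file1. 3. Replace the first space with ;.
What have you tried?
@Ankush I updated the post with the solution I have tried so far.
Don't simply concatenate strings to try to create a unique string: a bc -> abc and ab c -> abc. You need to include a separator to make the result unique. Look up SUBSEP in the awk man page.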
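
The last comment's point about concatenation can be checked directly. A minimal sketch, assuming two made-up input lines (a;bc and ab;c) and the hypothetical array names plain and safe: it counts how many distinct keys each indexing style produces.

```shell
# Two different pairs that concatenate to the same string "abc".
counts=$(printf 'a;bc\nab;c\n' | awk -F';' '
    {
        plain[$1 $2]     # plain concatenation: both lines give "abc"
        safe[$1, $2]     # comma form: awk joins the parts with SUBSEP
    }
    END {
        for (k in plain) np++
        for (k in safe)  ns++
        print np, ns     # number of distinct keys in each array
    }')
echo "$counts"
```

With plain concatenation the two lines collapse into one key, while the SUBSEP-joined form keeps them apart, so this prints `1 2`.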

Tags: awk grep

【Solution 1】:

$ awk -F'[ ;]' '
    { key=($1 > $2 ? $1 FS $2 : $2 FS $1) }
    NR==FNR { a[key]; next }
    key in a
' file1 file2
Mike Tyson;mike@tyson.com;45753
Cruise Tom;tom@cruise.com;23562

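The above uses a common, idiomatic approach to produce a consistent key regardless of the order in which the key components appear: sort the components before concatenating them to create the key value. With only 2 components, as in this case, a simple comparison is all it takes.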
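
The command above prints the matching lines of file2. A sketch of the same idea with the files in the opposite order, which yields the file1 lines the question asks for, assuming the sample data from the question; the array name seen, the temporary directory, and the use of SUBSEP as the joiner are choices made here, not part of the answer:

```shell
# Sample data copied from the question, written to a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
Mike;Tyson;First Street 2
Tom;Boyden;Second Street 6
Tom;Cruise;Third Street 9
Mike;Myers;Second Street 4
EOF
cat > "$tmp/file2" <<'EOF'
Mike Tyson;mike@tyson.com;45753
Cruise Tom;tom@cruise.com;23562
Jennifer Lopez;jennifer@lopez.com;92746
Brady Tom;tom@brady.com;27583
EOF

result=$(awk -F'[ ;]' '
    # Order the two name parts so "Tom Cruise" and "Cruise Tom" give one key.
    { key = ($1 > $2 ? $1 SUBSEP $2 : $2 SUBSEP $1) }
    NR==FNR { seen[key]; next }   # first file (file2): remember each key
    key in seen                   # second file (file1): print matching lines
' "$tmp/file2" "$tmp/file1")
printf '%s\n' "$result"
rm -rf "$tmp"
```

On this data it prints the two lines listed under "Expected output" in the question.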

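Here is why sorting the key components is the right approach. Imagine you had 3 components, $1, $2, and $3, instead of just 2. With the approach of testing every combination you would need this: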

NR==FNR { a[$1,$2,$3]; next }
($1,$2,$3) in a || ($1,$3,$2) in a || ($2,$1,$3) in a ||
($2,$3,$1) in a || ($3,$1,$2) in a || ($3,$2,$1) in a

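Try writing that condition for $1 through $4 :-).

By contrast, if you use the approach of sorting the components you need (using GNU awk for its built-in sort functions, for convenience), it is much harder to get wrong (e.g. by forgetting a combination in the comparison):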

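{
    # Build the key for every line of both files (assumes FS is a single
    # literal character, not a regex such as [ ;]).
    split($1 FS $2 FS $3,flds)
    asort(flds)
    key = flds[1]
    for (i=2; i in flds; i++) {
        key = key FS flds[i]
    }
}
NR==FNR { a[key]; next }
key in a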

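Now imagine you wanted to use $1 through $10 in any order. The "test every combination of components" approach becomes an untenable nightmare, while the "sort the components to create the key" approach just means adding the fields to the list in the first split() argument.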
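
asort() is a GNU awk extension. Where only a POSIX awk is available, a small hand-written sort can stand in. A minimal sketch, assuming three semicolon-separated components per line and made-up input; sorted_key is a hypothetical helper, not part of the answer. It prints every line whose components were already seen in some other order.

```shell
result=$(printf 'c;a;b\nb;c;a\nx;y;z\n' | awk -F';' '
    # sorted_key (hypothetical helper): join the first n fields in sorted order.
    function sorted_key(n,    i, j, tmp, v, key) {
        for (i = 1; i <= n; i++) v[i] = $i ""   # append "" to force string comparison
        # insertion sort, fine for a handful of fields
        for (i = 2; i <= n; i++) {
            tmp = v[i]
            for (j = i - 1; j >= 1 && v[j] > tmp; j--) v[j + 1] = v[j]
            v[j + 1] = tmp
        }
        key = v[1]
        for (i = 2; i <= n; i++) key = key SUBSEP v[i]
        return key
    }
    # Print a line only when its sorted key has been seen before.
    seen[sorted_key(3)]++
')
echo "$result"
```

Here `c;a;b` and `b;c;a` reduce to the same key, so the second of them is printed. Covering more fields only means changing the count passed to the helper.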

【Solution 2】:

Could you please try the following.

awk '
FNR==NR{
  array[$1,$2]
  next
}
(($1,$2) in array) || (($2,$1) in array)
' FS="[ ;]"  file2  FS=";" file1

Explanation: a detailed explanation of the solution above.

awk '                                       ##Starting awk program from here.
FNR==NR{                                    ##Checking condition if FNR==NR which will be true when file2 is being read.
  array[$1,$2]                              ##Creating array with index $1,$2 here.
  next                                      ##next will skip all further statement from here.
}
(($1,$2) in array) || (($2,$1) in array)    ##Checking condition if $1,$2 OR $2,$1 is present in array then it will print the line from Input_file1.
' FS="[ ;]"  file2  FS=";" file1            ##Set field separator space or semi-colon for file2 AND set field separator as ; for file1 here.
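
The per-file field separators rely on awk processing a var=value operand at the point where it appears among the file names, so each assignment takes effect before the next file is read. A minimal sketch of just that mechanism, assuming two scratch files (hypothetical names one and two) with identical content:

```shell
dir=$(mktemp -d)
printf 'a b;c\n' > "$dir/one"
printf 'a b;c\n' > "$dir/two"

# FS is reassigned between the two file operands, so the same line
# splits into 3 fields in the first file and 2 fields in the second.
fields=$(awk '{ print NF }' FS='[ ;]' "$dir/one" FS=';' "$dir/two")
printf '%s\n' "$fields"
rm -rf "$dir"
```

This prints 3 and then 2, one per line.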
