【问题标题】:Find common lines between two files查找两个文件之间的共同点
【发布时间】:2014-07-26 10:44:12
【问题描述】:

文件 1:

6
9219045
71608707
105853666
106000373
106000464
106000814
106001204
106001483
106002054

文件 2:

6,rO0ABXNyADljb20uYW1hem9uLnBvaW50c3BsYXRmb3JtLnV0aWwuUG9pbnRzUGxhdGZvcm1DcnlwdE1lc3NhZ2Xio1+sC+m4CAIABFsACGNpcGhlcklWdAACW0JbAApjaXBoZXJUZXh0cQB+AAFMAAxtYXRlcmlhbE5hbWV0ABJMamF2YS9sYW5nL1N0cmluZztMAA5tYXRlcmlhbFNlcmlhbHQAEExqYXZhL2xhbmcvTG9uZzt4cHVyAAJbQqzzF/gGCFTgAgAAeHAAAAAQufMrUK+8A4e0iJV4ktLQgXVxAH4ABQAAAEBNoyuUZLYRLaBqLvsvzHxxv63pO+4UPsRqpp/oHURcBdT6NES2G5H6+Kc3yjZOXDIIhHN1efAxyM/iWD0qDev9dAAwY29tLmFtYXpvbi5wb2ludHMuZW5jcnlwdGlvbi5rZXkuYWNjb3VudHNzZXJ2aWNlc3IADmphdmEubGFuZy5Mb25nO4vkkMyPI98CAAFKAAV2YWx1ZXhyABBqYXZhLmxhbmcuTnVtYmVyhqyVHQuU4IsCAAB4cAAAAAAAAAAB,jp-points
55555,rO0ABXNyADljb20uYW1hem9uLnBvaW50c3BsYXRmb3JtLnV0aWwuUG9pbnRzUGxhdGZvcm1DcnlwdE1lc3NhZ2Xio1+sC+m4CAIABFsACGNpcGhlcklWdAACW0JbAApjaXBoZXJUZXh0cQB+AAFMAAxtYXRlcmlhbE5hbWV0ABJMamF2YS9sYW5nL1N0cmluZztMAA5tYXRlcmlhbFNlcmlhbHQAEExqYXZhL2xhbmcvTG9uZzt4cHVyAAJbQqzzF/gGCFTgAgAAeHAAAAAQ5C9LG75v8+ENmmteRa/bBHVxAH4ABQAAAFBgXjgKk6KvTg4FiPfWF/7Ittzk/MpmlBecYkc9Bc+3mAV7R58rcl1hGkFdk3MagFXjUsunbE0qcV+Gy+DwhUWpBYDpA3p9q9oO8zwDJfFqCHQAMGNvbS5hbWF6b24ucG9pbnRzLmVuY3J5cHRpb24ua2V5LmFjY291bnRzc2VydmljZXNyAA5qYXZhLmxhbmcuTG9uZzuL5JDMjyPfAgABSgAFdmFsdWV4cgAQamF2YS5sYW5nLk51bWJlcoaslR0LlOCLAgAAeHAAAAAAAAAAAQ==,jp-points
74292,rO0ABXNyADljb20uYW1hem9uLnBvaW50c3BsYXRmb3JtLnV0aWwuUG9pbnRzUGxhdGZvcm1DcnlwdE1lc3NhZ2Xio1+sC+m4CAIABFsACGNpcGhlcklWdAACW0JbAApjaXBoZXJUZXh0cQB+AAFMAAxtYXRlcmlhbE5hbWV0ABJMamF2YS9sYW5nL1N0cmluZztMAA5tYXRlcmlhbFNlcmlhbHQAEExqYXZhL2xhbmcvTG9uZzt4cHVyAAJbQqzzF/gGCFTgAgAAeHAAAAAQPxjL0KWZoaYxWY7clP57tnVxAH4ABQAAAFB6WiMY05SU2WiYqaC7CzwMP2kQ51ec9mkIPh7R4fz2LPwfT8VNpAwH0QLM3I497D2JLfK13S6S90dxpU1ny2VBwaU4imxVchwo7YrcvwvEZXQAMGNvbS5hbWF6b24ucG9pbnRzLmVuY3J5cHRpb24ua2V5LmFjY291bnRzc2VydmljZXNyAA5qYXZhLmxhbmcuTG9uZzuL5JDMjyPfAgABSgAFdmFsdWV4cgAQamF2YS5sYW5nLk51bWJlcoaslR0LlOCLAgAAeHAAAAAAAAAAAQ==,jp-points

文件 1 只有一列,我正在使用命令 sort -n file1 对文件进行排序

文件 2 有三列,我正在使用命令 sort -t "," -k 1n,1 file2 对文件进行排序,该命令基于 ist 列进行排序。

现在,我想在 file2 中查找从 file1 中的行开始的行

我尝试过的命令:

grep -w -f file1 file2

join -t "," -1 1 -2 1 -o 2.2 file1 file2

但是,我没有得到想要的结果。请为我提供替代方法。文件 1 有 7124458 行,文件 2 有 42987432 行。

【问题讨论】:

  • 试试split - 我相信你可以查到

标签: linux sorting join grep


【解决方案1】:

使用 awk:

awk -F, 'FNR == NR { ++a[$0]; next } $1 in a' file1 file2

输出:

6,rO0ABXNyADljb20uYW1hem9uLnBvaW50c3BsYXRmb3JtLnV0aWwuUG9pbnRzUGxhdGZvcm1DcnlwdE1lc3NhZ2Xio1+sC+m4CAIABFsACGNpcGhlcklWdAACW0JbAApjaXBoZXJUZXh0cQB+AAFMAAxtYXRlcmlhbE5hbWV0ABJMamF2YS9sYW5nL1N0cmluZztMAA5tYXRlcmlhbFNlcmlhbHQAEExqYXZhL2xhbmcvTG9uZzt4cHVyAAJbQqzzF/gGCFTgAgAAeHAAAAAQufMrUK+8A4e0iJV4ktLQgXVxAH4ABQAAAEBNoyuUZLYRLaBqLvsvzHxxv63pO+4UPsRqpp/oHURcBdT6NES2G5H6+Kc3yjZOXDIIhHN1efAxyM/iWD0qDev9dAAwY29tLmFtYXpvbi5wb2ludHMuZW5jcnlwdGlvbi5rZXkuYWNjb3VudHNzZXJ2aWNlc3IADmphdmEubGFuZy5Mb25nO4vkkMyPI98CAAFKAAV2YWx1ZXhyABBqYXZhLmxhbmcuTnVtYmVyhqyVHQuU4IsCAAB4cAAAAAAAAAAB,jp-point

【讨论】:

    【解决方案2】:

    join(1) 假定两个文件都在连接字段上按字母顺序排序。尝试在没有 -n 的情况下对输入进行排序。

    (更准确地说,它取决于 LC_COLLATE 设置。如果您为了两个程序相互通信的利益而进行排序,则为 join 和 sort 设置 LC_ALL=C 可能更可靠,以避免任何故障由于区域设置。)

    【讨论】:

      猜你喜欢
      • 1970-01-01
      • 1970-01-01
      • 2015-10-27
      • 2017-07-06
      • 2011-04-01
      • 1970-01-01
      • 2019-03-28
      • 1970-01-01
      • 1970-01-01
      相关资源
      最近更新 更多