Filter each subdomain in list based on unique value: answers

【Question title】: Filter each subdomain in list based on unique value
【Posted】: 2021-10-17 22:24:43
【Question】:

I have two lists of URLs. The first, listofdomains.txt, contains the following:

http://example.com
https://www.example.com
https://abc-test.example.com

The second, urls_params.txt, contains the following:

http://example.com/?param1=123
http://example.com/?param1=123&param2=456
https://www.example.com/?param1=123
https://www.example.com/?param1=123&param2=456
https://abc-test.example.com/?param1=123
https://abc-test.example.com/?param1=123&param2=456

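I need to loop over the two lists, grep from urls_params.txt all the URLs that belong to each subdomain, and save them as subdomain name.txt

For example, the desired output is a file named example.com that contains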

http://example.com/?param1=123
http://example.com/?param1=123&param2=456

and so on for the other subdomains.

My solution, which does not work, was to filter the listofdomains.txt list down to just

example.com
www.example.com
abc-test.example.com

and save that in a file named list, then run the following command:

while read -r url; do $(cat urls_params.txt | awk -v u="$url" '{print u}') ; done < list

But the output is wrong:

example.com: command not found
www.example.com: command not found
abc-test.example.com: command not found

Thanks

【Comments】:

The command substitution $(...) around your command is useless, take it out. Oh, and lose the useless cat.
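The "command not found" errors appear because the shell takes whatever awk prints inside $(...) and tries to run it as a command. What follows is a minimal sketch of the loop with only the two fixes from the comment applied (no command substitution, no cat). The scratch directory, the two-line sample data and the index() match on "://host/" are assumptions for illustration and are not from the question. It still starts one awk per domain, so treat it as a demonstration of the error rather than a recommended design.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Assumed sample data, shortened from the question.
printf '%s\n' example.com www.example.com > list
printf '%s\n' \
    'http://example.com/?param1=123' \
    'https://www.example.com/?param1=123' > urls_params.txt

while IFS= read -r url; do
    # awk reads the file itself (no cat), and its output is redirected to a
    # file instead of being executed by the shell (no $(...) around it).
    # index() does a literal substring match on "://<host>/".
    awk -v u="$url" 'index($0, "://" u "/")' urls_params.txt > "$url.txt"
done < list
```

Matching on "://" plus the host plus "/" keeps example.com from also matching the www.example.com lines.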

Tags: shell awk grep

【Solution 1】:

Input (from the question):

$ ls
listofdomains.txt  tst.awk  urls_params.txt

The script:

$ cat tst.awk
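# Runs for every line of both files: reduce the URL to its host name.
{
    dom = $0
    sub("https?://","",dom)    # strip the scheme
    sub("/.*","",dom)          # strip the path and query string
}
# First file (urls_params.txt): collect the URLs seen for each domain.
NR==FNR {
    dom2urls[dom] = dom2urls[dom] $0 ORS
    next
}
# Second file (listofdomains.txt): switch output file when the domain changes.
dom != prev {
    close(out)
    out = dir "/" dom
    prev = dom
}
{ printf "%s", dom2urls[dom] > out }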

Executing it:

$ awk -v dir="$PWD" -f tst.awk urls_params.txt listofdomains.txt

Output:

$ ls
abc-test.example.com  example.com  listofdomains.txt  tst.awk  urls_params.txt  www.example.com

$ head *.com
==> abc-test.example.com <==
https://abc-test.example.com/?param1=123
https://abc-test.example.com/?param1=123&param2=456

==> example.com <==
http://example.com/?param1=123
http://example.com/?param1=123&param2=456

==> www.example.com <==
https://www.example.com/?param1=123
https://www.example.com/?param1=123&param2=456

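You don't actually need listofdomains.txt unless you want to exclude some domains from the output, or you want to get empty output files for domains that do not appear in urls_params.txt.

If you only want to create output files for domains that have entries in urls_params.txt (i.e. no empty output files), then just change: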

{ printf "%s", dom2urls[dom] > out }

to:

dom in dom2urls { printf "%s", dom2urls[dom] > out }
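To see why that condition suppresses the empty files, here is a small sketch of awk's `in` operator on its own. It is true only for keys that were actually stored in the array. The array contents and the domain missing.example.com are made up for illustration and stand in for what the script builds from urls_params.txt.

```shell
result=$(awk 'BEGIN {
    # Hypothetical stand-in for the array built from urls_params.txt.
    dom2urls["example.com"] = "http://example.com/?param1=123"

    # One domain that has URLs and one that does not.
    n = split("example.com missing.example.com", doms, " ")
    for (i = 1; i <= n; i++) {
        if (doms[i] in dom2urls)
            print doms[i] ": write file"
        else
            print doms[i] ": skip"
    }
}')
printf '%s\n' "$result"
```

With the condition in place, the printf that opens the output file never runs for a domain that has no stored URLs, so no file is created for it.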

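【Comments】:

Thx @Ed, this really helped. One last thing: I am using it in a bash script and I have a global variable that holds a specific directory location. How would that be defined in the awk script? In other words, I need something like awk -f tst.awk urls_params.txt listofdomains.txt output_dir/
I updated my answer. Just change -v dir="$PWD" to -v dir="whatever_you_want"
Thanks Ed, appreciated :)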
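On the directory question in the comments above, a sketch of passing a shell variable through -v dir=... might look like the following. The variable name output_dir, the scratch directory and the shortened sample data are assumptions. tst.awk is the script from the answer, written out with a here-document only so that the example is self-contained. Note that awk does not create the directory, so the sketch creates it first.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# The script from the answer, unchanged.
cat > tst.awk <<'EOF'
{
    dom = $0
    sub("https?://","",dom)
    sub("/.*","",dom)
}
NR==FNR {
    dom2urls[dom] = dom2urls[dom] $0 ORS
    next
}
dom != prev {
    close(out)
    out = dir "/" dom
    prev = dom
}
{ printf "%s", dom2urls[dom] > out }
EOF

# Assumed sample data, shortened from the question.
printf '%s\n' 'http://example.com' 'https://www.example.com' > listofdomains.txt
printf '%s\n' \
    'http://example.com/?param1=123' \
    'https://www.example.com/?param1=123' > urls_params.txt

# Hypothetical shell variable holding the target directory.
output_dir="$tmp/output_dir"
mkdir -p "$output_dir"    # awk will not create the directory itself

awk -v dir="$output_dir" -f tst.awk urls_params.txt listofdomains.txt
```

One caveat: awk processes backslash escape sequences in -v assignments, so a directory name that contains backslashes would need different handling.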

【Solution 2】:

Found it

while read -r url ; do cat urls_params.txt | grep -E "$url" | tee $url.txt ; done < list

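【Comments】:

Don't do that, it's the wrong approach. If you copy/paste it into shellcheck.net it will tell you about some of its problems. You can learn about the others at unix.stackexchange.com/questions/169716/…
@EdMorton That link changed how I think about rebuilding my shell script, as it stands now, in a less noisy and less resource-consuming way. I really appreciate it. Once again awk solves the problem professionally.
@EdMorton One more thing about the solution provided: it also creates files for domains even if they have no URLs inside urls_params.txt. How can I avoid that?
I assume you are asking about my answer, so I edited it to answer your question. If you have any other questions about my answer, please post them as comments under my answer.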