Using curl and xargs to get individual sitemaps

【Question title】: Using curl and xargs to get individual sitemaps
【Posted】: 2021-07-27 06:11:51
【Question】:

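I am trying to use this curl command to download a batch of compressed XML sitemaps that contain product URLs.

By default it goes to the robots.txt file, finds the sitemap index file that lists the URLs of all the individual sitemaps, decompresses those sitemaps, and then extracts from them the URLs of all the individual products.

What I want to do is download each individual sitemap (there are more than 400) into its own file, and then work on those sitemaps on my local machine.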

curl -N https://www.example.com/robots.txt |
    sed -n 's/^Sitemap: \(.*\)$/\1/p' |
    sed 's/\r$//g' |
    xargs -n1 curl -N |
    grep -oP '<loc>\K[^<]*' |
    xargs -n1 curl -N |
    gunzip |
    grep -oP '<loc>\K[^<]*' |
    gzip > \
    somefile.txt.gz

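Right now it puts all the data into a single file, which is far too big. I have tried a few things along these lines and eventually came up with this: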

curl -N https://www.example.com/robots.txt |
    sed -n 's/^Sitemap: \(.*\)$/\1/p' |
    xargs -n1 curl -N |
    grep -oP '<loc>\K[^<]*' |
    sort > carid-list-of-compressed-sitemaps.txt

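It works well and gives me a list of the gzip-compressed XML sitemaps, but I can't quite figure out how to get the individual uncompressed sitemaps with the product URLs in them.

So basically I want to download every individual product sitemap, each containing its own product URLs.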

【Comments on the question】:

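Can you show 5 lines of carid-list-of-compressed-sitemaps.txt? And why doesn't it work to loop over this file with ... curl .. | gunzip .. | grep .. | gzip?
@WalterA - I'm not sure I understand your question... The robots.txt file points to a sitemap index file that contains more than 400 xml.gz URLs. I want to be able to download each sitemap into its own local file. The original script does work; it just puts all the URLs into one big file.
You said your second command works, so it produces carid-list-of-compressed-sitemaps.txt. The next step would be something like while IFS= read -r sitegz; do ... done < carid-list-of-compressed-sitemaps.txt, processing each sitemap and changing the output file for each one inside the loop.
@WalterA - OK, now I'm starting to get it... I'm new to bash. If you can put that in an answer, I'll accept it.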

Tags: curl sed grep sitemap xargs

【Solution 1】:

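Use two steps. I removed the $ from the first sed command, because .* already matches to the end of the line.
I also removed the gzip steps, which my test site did not need.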
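
As a quick check of the first step, the sed expression can be tried on sample input without touching the network. This is only a minimal sketch: the function name extract_sitemaps and the sample robots.txt content are made up for illustration. The tr call puts back the carriage-return stripping that the question's first pipeline did with a separate sed, which matters only if the server sends CRLF line endings.

```shell
# Hypothetical helper: print the URL of every "Sitemap:" line read from stdin.
# tr removes carriage returns first (assumption: robots.txt may use CRLF).
# No trailing $ is needed in the pattern: .* is greedy and already consumes
# the rest of the line.
extract_sitemaps() {
    tr -d '\r' | sed -n 's/^Sitemap: \(.*\)/\1/p'
}

# Sample run on made-up robots.txt content with CRLF line endings.
printf 'User-agent: *\r\nSitemap: https://www.example.com/sitemap_index.xml\r\n' |
    extract_sitemaps
```

Against a real site the same function would sit directly behind curl, for example curl -sN https://www.example.com/robots.txt | extract_sitemaps.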

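# File that holds the URLs of the individual (compressed) sitemaps.
caridlist="carid-list-of-compressed-sitemaps.txt"

# Step 1: robots.txt -> sitemap index -> list of sitemap URLs.
curl -sN https://www.example.com/robots.txt |
    sed -n 's/^Sitemap: \(.*\)/\1/p' |
    xargs -n1 curl -sN |
    grep -oP '<loc>\K[^<]*' > "${caridlist}"

# Step 2: fetch each sitemap and write its product URLs to numbered files,
# starting a new file after every 9 sitemaps.
filenumber=1
urlinfile=1
while IFS= read -r site_url; do
    # If the sitemaps are gzip-compressed, add "gunzip |" before grep.
    # ">>" appends, so the sitemaps in one group do not overwrite each other;
    # remove old somefile_*.txt files before running this again.
    curl -sN "${site_url}" |
        grep -oP '<loc>\K[^<]*' >> "somefile_${filenumber}.txt"
    ((urlinfile++))
    if ((urlinfile == 10)); then
        ((filenumber++))
        urlinfile=1
    fi
done < "${caridlist}"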
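
The loop above shares one numbered output file among several sitemaps. If each sitemap should instead end up in its own file, as the question asks, a variant could look like the following sketch. The function names (sitemap_outfile, extract_locs, fetch_sitemaps) are hypothetical. The sketch assumes that every sitemap URL ends in a unique file name such as products_1.xml.gz and that the sitemaps really are gzip-compressed. It uses plain grep -o plus sed instead of grep -oP, so it does not depend on PCRE support in grep.

```shell
# Hypothetical helper: derive a local file name from a sitemap URL, e.g.
# https://www.example.com/sitemaps/products_1.xml.gz -> products_1.txt
sitemap_outfile() {
    local name
    name="${1##*/}"      # drop everything up to the last slash
    name="${name%.gz}"   # drop a trailing .gz, if present
    name="${name%.xml}"  # drop a trailing .xml, if present
    printf '%s.txt\n' "$name"
}

# Print the text of every <loc> element found on stdin, one per line.
extract_locs() {
    grep -o '<loc>[^<]*' | sed 's/^<loc>//'
}

# Download every sitemap listed in file $1 and write its product URLs
# to a separate file inside directory $2.
fetch_sitemaps() {
    local list outdir site_url
    list="$1"
    outdir="$2"
    mkdir -p "$outdir"
    while IFS= read -r site_url; do
        [ -n "$site_url" ] || continue   # skip empty lines
        # Assumption: the sitemap is gzip-compressed; drop "gzip -dc |" if not.
        curl -sN "$site_url" | gzip -dc | extract_locs \
            > "$outdir/$(sitemap_outfile "$site_url")"
    done < "$list"
}
```

It would be called as fetch_sitemaps carid-list-of-compressed-sitemaps.txt sitemaps, where sitemaps is an arbitrary output directory name. Two sitemap URLs that end in the same file name would overwrite each other, so the naming scheme may need adjusting for a given site.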

【Comments】: