Parse HTML with CURL in Shell Script

【Question title】: Parse HTML with CURL in Shell Script
【Posted】: 2016-03-22 14:27:04
【Question】:

I am trying to parse specific content from a web page in a shell script.

I need to grep the content inside a <div> tag.

<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>
</div>

If I use grep -E -m 1 -o '<div class="tracklistInfo">', the result is only <div class="tracklistInfo">.

How can I get the artist (Diplo - Justin Bieber - Skrillex) and the title (Where Are U Now)?

【Comments】:

Tags: html shell curl

【Solution 1】:

Use xmllint:

a='<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>
</div>'

xmllint --html --xpath 'concat(//div[@class="tracklistInfo"]/p[1]/text(), "#", //div[@class="tracklistInfo"]/p[2]/text())' <<<"$a"

This outputs:

Diplo - Justin Bieber - Skrillex#Where Are U Now

That is easy to split on the # separator.
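
As a minimal sketch of that split, assuming the xmllint output has already been captured into a string (the variable names are illustrative and not part of the answer):

```python
# Output of the xmllint command, captured as a string (assumed for this sketch).
line = "Diplo - Justin Bieber - Skrillex#Where Are U Now"

# Split on the first "#" only, so a "#" inside the title would be preserved.
artist, title = line.split("#", 1)

print(artist)
print(title)
```

If an artist name could itself contain "#", a different separator in the concat() call would be the safer choice.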

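【Comments】:

Awesome, I didn't know about xmllint.
It will fail on most real-world websites; anything xmllint considers "invalid" makes it choke.

【Solution 2】:

Don't. Use an HTML parser. For example, Python's BeautifulSoup is easy to use and handles this with very little code.

That said, remember that grep works on lines. The pattern is matched against each line, not against the whole string.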
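
The line-by-line behaviour can be illustrated with a small Python sketch. It only mimics grep by testing the pattern against each line in turn; the variable names are illustrative:

```python
import re

html = """<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>
</div>"""

pattern = re.compile(r'<div class="tracklistInfo">')

# Like grep, test the pattern against one line at a time.
matching_lines = [ln for ln in html.splitlines() if pattern.search(ln)]

# Only the line holding the opening tag matches; the <p> lines are separate lines.
print(matching_lines)
```

This is why the command in the question returns only the opening tag: the artist and title sit on the following lines.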

You can use -A to print lines after the match:

grep -A2 -E -m 1 '<div class="tracklistInfo">'

This should output:

<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>

You can then pipe that to tail to get the last or second-to-last line:

$ grep -A2 -E -m 1 '<div class="tracklistInfo">' | tail -n1
<p>Where Are U Now</p>

$ grep -A2 -E -m 1 '<div class="tracklistInfo">' |  tail -n2 | head -n1
<p class="artist">Diplo - Justin Bieber - Skrillex</p>

And strip the HTML tags with sed:

$ grep -A2 -E -m 1 '<div class="tracklistInfo">' | tail -n1 | sed 's/<[^>]*>//g'
Where Are U Now

$ grep -A2 -E -m 1 '<div class="tracklistInfo">' |  tail -n2 | head -n1 | sed 's/<[^>]*>//g'
Diplo - Justin Bieber - Skrillex
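
For comparison, the same line-based idea can be sketched in Python. This is only an illustration of what the pipeline does (first match, two following lines, tags stripped), with illustrative variable names, and it is just as dependent on the exact line layout as the shell version:

```python
import re

html = """<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>
</div>"""

lines = html.splitlines()

# Index of the first line containing the opening div (like grep -m 1).
start = next(i for i, ln in enumerate(lines)
             if '<div class="tracklistInfo">' in ln)

# The two lines after the match (like -A2), with tags removed (like the sed expression).
artist, title = (re.sub(r"<[^>]*>", "", ln) for ln in lines[start + 1:start + 3])

print(artist)
print(title)
```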

But as said, this is fickle, likely to break, and not very pretty. Here is the same thing with BeautifulSoup, by the way:

html = '''<body>
<p>Blah text</p>
<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>
</div>
</body>'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

for track in soup.find_all(class_='tracklistInfo'):
    print(track.find_all('p')[0].text)
    print(track.find_all('p')[1].text)

This also works for multiple tracklistInfo blocks; adding that to the shell command would take more work ;-)
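
If BeautifulSoup is not installed, a similar result can be sketched with Python's standard-library html.parser. This is an illustration rather than part of the answer: the TracklistParser class and its attributes are made-up names, and it assumes that tracklistInfo divs contain no nested divs.

```python
from html.parser import HTMLParser


class TracklistParser(HTMLParser):
    """Collect the text of each <p> inside <div class="tracklistInfo">."""

    def __init__(self):
        super().__init__()
        self.tracks = []      # one list of <p> texts per matching div
        self._in_div = False  # currently inside a tracklistInfo div?
        self._in_p = False    # currently inside a <p> of such a div?

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "tracklistInfo" in classes:
            self._in_div = True
            self.tracks.append([])
        elif tag == "p" and self._in_div:
            self._in_p = True
            self.tracks[-1].append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False
        elif tag == "div":
            # Assumption: no nested divs inside a tracklistInfo div.
            self._in_div = False

    def handle_data(self, data):
        if self._in_p:
            self.tracks[-1][-1] += data


html = """<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>
</div><div class="tracklistInfo">
<p class="artist">toto</p>
<p>tata</p>
</div>"""

parser = TracklistParser()
parser.feed(html)
parser.close()

tracks = [tuple(p) for p in parser.tracks]
for artist, title in tracks:
    print(f"{artist} | {title}")
```

The trade-off is more code than the BeautifulSoup version in exchange for no third-party dependency.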

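【Comments】:

Thanks a lot. Now I get the following result: Flo Rida Turn Around (5,4,3,2,1). That's perfect, but how can I remove the leading space? And can I use UTF-8? The text contains a special character that doesn't work for me, for example: Enrique Iglesias - Nicky Jam El Perdón
@Fabian Yes, this is why you don't use curl/grep/sed but an HTML parser ;-)
Oh, okay, then I'll try BeautifulSoup. Thanks.
"Nothing works" is not something I can give meaningful input on, other than "too bad" ;-)
@Fabian It looks like you are running it as a shell script rather than as a Python script. Python is an entirely different programming language... Use python test.py (or python test.sh)...

【Solution 3】: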

cat - > file.html << EOF
<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>
</div><div class="tracklistInfo">
<p class="artist">toto</p>
<p>tata</p>
</div>
EOF


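# Join all lines, put each </div> on its own line, then capture artist and title.
# The \n in the sed replacement text relies on GNU sed.
tr -d '\n' < file.html | sed -e "s/<\/div>/<\/div>\n/g" | sed -n 's/^.*class="artist">\([^<]*\)<\/p> *<p>\([^<]*\)<.*$/artist : \1\ntitle : \2\n/p'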

【Comments】:

【Solution 4】:

Your title starts with "Parse HTML with CURL", but curl is not an HTML parser. If you want a command-line tool, use xidel instead.

xidel -s "<url>" -e '//div[@class="tracklistInfo"]/p'
Diplo - Justin Bieber - Skrillex
Where Are U Now

xidel -s "<url>" -e '//div[@class="tracklistInfo"]/join(p," | ")'
Diplo - Justin Bieber - Skrillex | Where Are U Now

【Comments】: