Scraping hidden values from HTML with Python using Requests and bs4: answers

【Question title】: Scraping hidden values from HTML with Python Using Requests and bs4 Lib
【Posted】: 2023-03-26 15:50:01
【Question】:

I am trying to scrape a captcha code from an HTML source that contains the code in the following format.

<div id="Custom"><!-- test: vdfnhu --></div>

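The captcha changes with each refresh. My intention is to capture the captcha and its validation code so that I can post them to a form.

My code so far is: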

import requests
from bs4 import BeautifulSoup

url = input("Enter the URL: ")
r = requests.get(url)
c = r.content
# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(c, 'html.parser')
div = soup.find('div', id='Custom')
# The comment is the first child of the div.
comment = next(div.children)
# Keep only the text after the colon, e.g. 'vdfnhu'.
test = comment.partition(':')[-1].strip()
print(test)

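【Comments】:

As a side note, what site are you scraping that uses a captcha but includes the answer in the source? That kind of defeats the whole purpose; it annoys users without even slowing down bots...
It's a lab I'm working on for my master's in cyber security.
I, for one, welcome our new cyber security masters. :)
Well, I still have many, many classes to go. Life would be much easier if I could write all my code in C#. Learning Python isn't hard, but learning all of the libraries is a beast...
Have you looked at IronPython? Python language, .NET libraries... sounds like you might like it.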

Tags: python html web-scraping beautifulsoup captcha

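【Solution 1】:

As the documentation explains, BeautifulSoup has NavigableString and Comment objects that, just like Tag objects, can be children, siblings, and so on. The "Comments and other special strings" section of the documentation has more details.

So, you want to find the div 'Custom':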

div = soup.find('div', id='Custom')

Then you want to find its Comment child:

comment = next(child for child in div.children if isinstance(child, bs4.Comment))

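If the format is as fixed and invariant as you've presented it, though, you may want to simplify that to next(div.children). On the other hand, if it is more variable, you may want to iterate over all Comment nodes rather than just take the first one.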
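For the more variable case, one possible way to collect every Comment inside the div is sketched below. The sample markup (extra whitespace, a span, a second comment) and the names comments and values are made up for illustration; only find, find_all(string=...) and bs4.Comment come from BeautifulSoup.

```python
import bs4

# Hypothetical markup: the div holds whitespace, a tag, and two comments.
html = '''<div id="Custom">
  <!-- test: vdfnhu -->
  <span>noise</span>
  <!-- other: 12345 -->
</div>'''

soup = bs4.BeautifulSoup(html, 'html.parser')
div = soup.find('div', id='Custom')

# find_all(string=...) accepts a function; keep only the Comment nodes.
comments = div.find_all(string=lambda s: isinstance(s, bs4.Comment))

# Turn each "key: value" comment into a dictionary entry.
values = {}
for c in comments:
    key, sep, value = c.partition(':')
    if sep:  # skip comments that contain no colon
        values[key.strip()] = value.strip()

print(values)
```

Here next(div.children) would return the leading whitespace string rather than the comment, which is why the isinstance filter matters once the markup is less tidy.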

And, since a Comment is basically just a string (that is, it supports all of the str methods):

test = comment.partition(':')[-1].strip()
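To make the partition step concrete, here is a small sketch on a plain str standing in for the Comment (the sample strings are illustrative). It also shows what happens when the colon is missing.

```python
comment = ' test: vdfnhu '

# partition splits on the first separator only and always returns a 3-tuple.
parts = comment.partition(':')
print(parts)               # (' test', ':', ' vdfnhu ')
print(parts[-1].strip())   # vdfnhu

# Without a separator, the last element is an empty string,
# so a missing ':' gives '' instead of raising an exception.
missing = ' no separator here '.partition(':')[-1].strip()
print(repr(missing))       # ''
```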

Putting it all together:

>>> import bs4
>>> html = '''<html><head></head>
...           <body><div id="Custom"><!-- test: vdfnhu --></div>\n</body></html>'''
>>> soup = bs4.BeautifulSoup(html, 'html.parser')
>>> div = soup.find('div', id='Custom')
>>> comment = next(div.children)
>>> test = comment.partition(':')[-1].strip()
>>> test
'vdfnhu'
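Since the stated goal is to post the value to a form, and the code changes on every refresh, the value presumably has to be submitted on the same session that fetched the page. The following is only a sketch under those assumptions: extract_code and fetch_and_submit are hypothetical helper names, the form field name 'captcha' is a placeholder, and posting back to the same URL is a guess. The real field name and target need to be read from the form's HTML.

```python
import bs4


def extract_code(html, div_id='Custom'):
    """Return the text after ':' in the first Comment inside the div, or None."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    div = soup.find('div', id=div_id)
    if div is None:
        return None
    comment = next(
        (c for c in div.children if isinstance(c, bs4.Comment)), None)
    if comment is None:
        return None
    return comment.partition(':')[-1].strip()


def fetch_and_submit(session, url, field_name='captcha'):
    """GET the page, extract the code, and POST it on the same session.

    field_name is a hypothetical form field; posting to the same URL
    is also an assumption about the target site.
    """
    page = session.get(url)
    code = extract_code(page.text)
    if code is None:
        raise ValueError('no comment found inside the div')
    return session.post(url, data={field_name: code})
```

With Requests, this could be called as fetch_and_submit(requests.Session(), url), where url is the page being scraped; a Session keeps cookies between the GET and the POST.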

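【Discussion】:

Works great with next(div.children). Thanks! I just couldn't wrap my head around the comment; for some reason it had me stuck in a mental loop...
@Phil: BeautifulSoup's docs are very complete and well written... but they are not always organized in a way that's easy to navigate if you don't already know what to search for.
I agree, they are well written, but as you said, the organization is hard to digest for someone just learning the library.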