Search for a particular string in the entire HTML using Beautiful Soup in Scrapy: answers

【Question title】: Search particular string in entire html using Beautiful Soup in Scrapy
【Posted】: 2018-05-02 12:45:14
【Question description】:

I want to search a scraped HTML page for a particular string and perform some action if the string is present.

find = soup.find('word')
print(find)

But this gives None even though word is in the page. I also tried:

find = soup.find_all('word')
print(find)

It only gives [].

【Question discussion】:

Tags: python-3.x beautifulsoup scrapy

【Solution 1】:

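The find method searches for tags. So when you call soup.find('word'), you are asking BeautifulSoup to look for a <word></word> tag, and find_all('word') looks for all such tags. I don't think that is what you want.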

There are several ways to do what you are asking. You can use the re module to search with a regular expression like this:

import re

is_present = bool(re.search('word', response.text))
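If the search term is a literal string rather than a regular expression, a plain substring test or re.escape avoids surprises from regex metacharacters. A minimal sketch, assuming the page text is already a str; html_text and contains_literal are hypothetical names used only for illustration:

```python
import re

# Hypothetical sample standing in for response.text
html_text = "<html><body><p>Price: 5 * 3 = 15 (word)</p></body></html>"


def contains_literal(term, text):
    """Return True if `term` occurs in `text`, treating `term` literally."""
    # re.escape neutralises metacharacters such as * + ? ( )
    return re.search(re.escape(term), text) is not None


# For a literal term, the `in` operator gives the same answer without regex
print(contains_literal("word", html_text))
print("word" in html_text)

# A term starting with * would be an invalid pattern if left unescaped
print(contains_literal("* 3", html_text))
```

When the term comes from user input or a config file, escaping it (or using `in`) is usually the safer default.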

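However, you can avoid importing an extra module: since you are using Scrapy, it has built-in methods for working with regular expressions. Just use the re method on a selector: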

is_present = bool(response.xpath('//body').re('word'))

【Discussion】:

Thanks @Stas. find = bool(re.search('word', content)) gives me "source.tell() - here + len(this)) sre_constants.error: nothing to repeat at position 0", and find = bool(content.xpath('//body').re(content)) gives "AttributeError: 'bytes' object has no attribute 'xpath'". Here, content is the response.body that I pass to the parser. What am I doing wrong?
find = bool(content.xpath('//body').re('word'))
If you are using Scrapy, you should have a Response object, which is passed to your callback function as the first argument.
Yes @Stas, I pass response.body from the callback function and receive it as content in the function being called.
OK, then try this: is_present = bool(re.search('word', str(content)))
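The AttributeError quoted above arises because response.body is bytes rather than a response object. Calling str() on bytes produces the b'...' representation instead of decoded text, so non-ASCII search terms appear as escape sequences; decoding first is usually safer. A minimal sketch, assuming the page is UTF-8 encoded; content here is a hypothetical sample, not data from the thread:

```python
import re

# Hypothetical bytes payload standing in for response.body (UTF-8 assumed)
content = "<html><body><p>café word</p></body></html>".encode("utf-8")

# str() on bytes yields the repr, e.g. "b'<html>...'", with non-ASCII escaped
as_repr = str(content)

# Decoding gives the real text; errors="replace" tolerates invalid bytes
as_text = content.decode("utf-8", errors="replace")

print(bool(re.search("word", as_repr)))  # ASCII term is found either way
print(bool(re.search("café", as_repr)))  # not found in the repr
print(bool(re.search("café", as_text)))  # found after decoding
```

Inside a Scrapy callback, response.text already provides the decoded string, so passing that instead of response.body avoids the conversion altogether.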

【Solution 2】:

Try find = soup.findAll(text="word")
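Note that a plain string argument matches only text nodes whose entire content equals word, while a compiled regular expression matches text nodes that merely contain it. A minimal sketch, assuming Beautiful Soup 4.4.0 or later (where string= is the current name for the older text= argument) and the built-in html.parser; the sample HTML is hypothetical:

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample page
html = "<html><body><p>word</p><p>a word in a sentence</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Exact match: only text nodes equal to "word"
exact = soup.find_all(string="word")

# Regex match: any text node containing "word"
partial = soup.find_all(string=re.compile("word"))

print(len(exact))
print(len(partial))

# Act only when the string is present somewhere in the page text
if partial:
    print("found")
```

This searches text nodes only; a term that appears solely inside an attribute value or a tag name would not be matched this way.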

【Discussion】:

This works fine for me. Thanks.