BeautifulSoup: how to scrape meta tag description content

【Question title】: BeautifulSoup: how to scrape meta tag description content
【Posted】: 2022-03-09 18:13:49
【Question】:

I am trying to scrape the content of a website's meta description.

Example:

<meta name="description" content="This is the home page meta description.">

The output I am looking for is: "This is the home page meta description."

My code is:

raw_html = simple_get(companyUrl)
html = BeautifulSoup(raw_html, 'html.parser')
x = html.select('meta', {'name' : 'description'})  ## this line errors out

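Can anyone point me in the right direction?

(Also, is it my imagination, or are the BeautifulSoup tutorials and docs not up to the level of those for other languages and applications?)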

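【Comments】:

You are confusing .select() with .find_all(). Either use find_all or change your selector.
@t.m.adam - I guess that is what I am asking: how do I format my selector to capture tag=meta and name=description?
OK - I have brute-forced a solution, but I was hoping there was a more elegant way to do it.
Why don't you use the selector from Martin's answer below?
OK, I see what you mean. You can use tag[attribute] to get the value of an attribute. For example: content = html.select_one("meta[name='description']")['content'], all in one line.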

Tags: python beautifulsoup

【Solution 1】:

You have to use a CSS selector, like this:

x = html.select('meta[name="description"]')
print(x[0].attrs["content"])

Read more about CSS selectors here.
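
To make the difference between the two lookup styles concrete, here is a minimal sketch that runs both against the question's sample tag. The wrapping head element and the variable names are assumptions added for illustration. select() and select_one() take a CSS selector string, while find_all() takes a tag name plus an attribute dict, which is the form the question's code was passing to select().

```python
from bs4 import BeautifulSoup

# Sample markup from the question, wrapped in <head> for illustration.
html = '<head><meta name="description" content="This is the home page meta description."></head>'
soup = BeautifulSoup(html, "html.parser")

# select() expects a CSS selector string and returns a list of matches.
by_select = soup.select('meta[name="description"]')

# find_all() expects a tag name and, optionally, a dict of attributes.
by_find_all = soup.find_all("meta", {"name": "description"})

# select_one() returns the first match, or None when nothing matches.
tag = soup.select_one('meta[name="description"]')
description = tag["content"] if tag is not None else None
print(description)
```

Both calls find the same single tag here; which one to use is mostly a matter of whether a CSS selector or a name-plus-attributes lookup reads better in the surrounding code.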


【Solution 2】:

Using BeautifulSoup

from bs4 import BeautifulSoup

html = """<meta name="description" content="This is the home page meta description.">"""

soup = BeautifulSoup(html, 'html.parser')
content = soup.find('meta', {'name':'description'}).get('content')
print(content)  # STDOUT: This is the home page meta description.
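
One caveat with the one-liner above: find() returns None when the page has no matching tag, so chaining .get() onto it would raise an AttributeError. A possible guard is sketched below; get_meta_description is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup


def get_meta_description(html):
    """Hypothetical helper: return the meta description, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("meta", attrs={"name": "description"})
    if tag is None:
        # No <meta name="description"> in this document.
        return None
    # .get() returns None if the tag has no content attribute.
    return tag.get("content")


print(get_meta_description('<meta name="description" content="Home page.">'))
print(get_meta_description("<title>No description here</title>"))
```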

Another approach, using regex

import re

content = re.findall(r"content=\"(.*?)\"", html)

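Note: a regex can be faster than building a full parse tree, but findall returns a list with the value of every content attribute in the given html, not only the one on the description tag.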
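
The sketch below illustrates that behaviour on a two-tag sample (the keywords tag is made up for the example), followed by a narrower pattern that targets only the description tag. The narrower pattern assumes that name comes before content and that both attributes use double quotes, so it is fragile compared with a real HTML parser.

```python
import re

# Two meta tags; the second one is a made-up addition for illustration.
html = (
    '<meta name="description" content="This is the home page meta description.">'
    '<meta name="keywords" content="home, page">'
)

# The broad pattern captures every content attribute in the document.
all_contents = re.findall(r'content="(.*?)"', html)
print(all_contents)

# A narrower pattern, assuming name precedes content and double quotes are used.
match = re.search(r'<meta\s+name="description"\s+content="(.*?)"', html)
description = match.group(1) if match else None
print(description)
```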
