Scraping a webpage using BeautifulSoup4: answers

【Question title】: Scraping a webpage using BeautifulSoup4
【Posted】: 2018-02-06 09:49:20
【Question description】:

I am trying to print the content of a news article using BeautifulSoup4.

The URL is: Link

My current code is below, and it gives the desired output:

import requests
from bs4 import BeautifulSoup

page = requests.get('http://www.thehindu.com/news/national/People-showing-monumental-patience-queuing-up-for-a-better-India-says-Venkaiah/article16447029.ece')
soup = BeautifulSoup(page.content, 'html.parser')

# Collect the text of the <div> that holds the article body
article_text = ""
table = soup.find_all("div", {"id": "content-body-14266949-16447029"})

for element in table:
    article_text += ''.join(element.find_all(text=True)) + "\n\n"

print(article_text)

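The problem is that I want to scrape multiple pages, and each page has a different content-body number of the form xxxxxxxx-xxxxxxxx (two blocks of 8 digits).

I tried replacing the soup.find_all call with a regular expression: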

table = soup.find_all(text=re.compile("content-body-........-........"))

But this raises an error:

AttributeError: 'NavigableString' object has no attribute 'find_all'

Can anyone tell me what I need to do?

Thanks.

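【Question comments】:

Do you want all the href links printed as output?
No. I am trying to print the text of the article. soup.find_text() gives me the whole text, whereas the content I need is embedded in multiple <p> elements inside a <div> whose id is content-body-xxxxxxxx-xxxxxxxx.

Tags: python web-scraping beautifulsoup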

【Solution 1】:

Another approach is to use CSS selectors, which are concise and readable. You could give this a try. Just set "url" to the link you want to scrape.

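import requests
from bs4 import BeautifulSoup

# Replace with the link you want to scrape
url = 'http://www.thehindu.com/news/national/People-showing-monumental-patience-queuing-up-for-a-better-India-says-Venkaiah/article16447029.ece'

res = requests.get(url).text
soup = BeautifulSoup(res, "html.parser")

# Every <p> inside a <div> whose id starts with "content-body-"
for item in soup.select('div[id^="content-body-"] p'):
    print(item.text)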
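
To see the prefix selector at work without a network request, here is a minimal sketch run against an inline HTML snippet. The snippet and the names SAMPLE_HTML and paragraphs are made up for illustration; only the id prefix "content-body-" comes from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for an article page; only the id prefix is taken from the question.
SAMPLE_HTML = """
<div id="content-body-14266949-16447029">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
<div id="sidebar"><p>Unrelated text.</p></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# [id^="content-body-"] matches any id that starts with the given prefix,
# so the eight-digit blocks do not need to be known in advance.
paragraphs = [p.get_text(strip=True) for p in soup.select('div[id^="content-body-"] p')]
print(paragraphs)
```

The sidebar paragraph is skipped because its parent div has a different id.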

【Comments】:

【Solution 2】:

A regular expression should work fine! Try

table = soup.find_all("div", {"id": re.compile(r"^content-body-")})
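
The difference between the failing attempt in the question and this answer can be sketched as follows. This is an illustration on made-up HTML, not the actual page; SAMPLE_HTML, ID_PATTERN and the other variable names are assumptions. Matching on the id attribute yields Tag objects, while matching on text yields NavigableString objects, which are plain pieces of text.

```python
import re
from bs4 import BeautifulSoup, NavigableString, Tag

# Hypothetical markup; only the id format is taken from the question.
SAMPLE_HTML = """
<div id="content-body-14266949-16447029"><p>Article text.</p></div>
<div id="content-footer"><p>Footer text.</p></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Two blocks of eight digits, as described in the question.
ID_PATTERN = re.compile(r"^content-body-\d{8}-\d{8}$")

# Filtering on the id attribute returns Tag objects, which support find_all().
divs = soup.find_all("div", id=ID_PATTERN)
matched_ids = [div["id"] for div in divs]
texts = [p.get_text() for div in divs for p in div.find_all("p")]

# Filtering on text (string=, formerly text=) returns NavigableString objects
# instead, which is where the AttributeError in the question comes from.
strings = soup.find_all(string=re.compile("Article"))

print(matched_ids, texts, [type(s).__name__ for s in strings])
```

Anchoring the pattern with ^ and $ keeps ids such as content-footer, or longer ids that merely contain the prefix, out of the result.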

【Comments】:

Hey, thanks. soup.find('div', attrs={"id": re.compile("content-body-........-........")}).find_all("p") worked.
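
Since the goal is to handle many pages, the working expression can be wrapped in a small helper. The following is only a sketch: extract_article_text, ID_PATTERN and the PAGES list are hypothetical names and data, and the HTML strings stand in for pages that would normally be downloaded with requests.

```python
import re
from bs4 import BeautifulSoup

# Two blocks of eight digits, as described in the question.
ID_PATTERN = re.compile(r"^content-body-\d{8}-\d{8}$")


def extract_article_text(html):
    """Return the article paragraphs joined by blank lines, or "" if no body div exists."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find("div", id=ID_PATTERN)
    if body is None:  # find() returns None when nothing matches
        return ""
    return "\n\n".join(p.get_text(strip=True) for p in body.find_all("p"))


# Hypothetical pages standing in for downloaded articles.
PAGES = [
    '<div id="content-body-14266949-16447029"><p>One.</p><p>Two.</p></div>',
    '<div id="content-body-11111111-22222222"><p>Three.</p></div>',
    '<div id="other"><p>Ignored.</p></div>',
]

results = [extract_article_text(html) for html in PAGES]
print(results)
```

With real pages, the same helper could be called as extract_article_text(requests.get(url).text) for each URL. Checking for None avoids an AttributeError on pages that lack the expected div.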

【Solution 3】:

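You can use lxml to extract the content. The lxml library lets you query HTML with XPath.

from lxml import etree

# pageText is the page's HTML source as a string
selector = etree.HTML(pageText)
article_text = selector.xpath('//div[@class="article-block-multiple live-snippet"]/div[1]')[0].text

I don't use BeautifulSoup myself, but I think you could do something like this with it:

table = soup.find_all("div", {"class": "article-block-multiple live-snippet"})

Then search its child elements and take the first div element.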
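
A possible BeautifulSoup version of that idea is sketched below. The markup in SAMPLE_HTML is invented to mirror the structure the XPath implies (an outer div with both classes and a first child div holding the text), so the real page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mirroring the structure implied by the XPath above.
SAMPLE_HTML = """
<div class="article-block-multiple live-snippet">
  <div>Article body text.</div>
  <div>Related links.</div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Chained class selectors match the div regardless of the order of its classes.
block = soup.select_one("div.article-block-multiple.live-snippet")

# recursive=False limits the search to direct children, like /div[1] in the XPath.
first_div = block.find("div", recursive=False)
article_text = first_div.get_text(strip=True)
print(article_text)
```

Unlike the .text attribute in lxml, which only returns the text before the first child element, get_text() gathers the text of all descendants.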

【Comments】: