Newspaper API for scraping articles: answers

【Question title】: Newspaper API for scraping articles
【Posted】: 2020-12-15 23:34:51
【Question description】:

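I am using the newspaper3k API in Python to scrape articles. It does not work fully for Times of India articles: the publish date in the response comes back empty, while articles from other sites return the correct data.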

from newspaper import Article

# url holds the address of the article to scrape
article = Article(url)
article.download()
article.parse()
result = vars(article)
print(result['publish_date'])

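【Comments】:

Could you show the code you tried, the error message, and what you expected to happen?
All other articles return the correct date, but articles on the Times of India (TOI) domain return a null publish date. Could TOI be blocking part of the response?
Of course, the publisher of an API has full control over what is returned and can choose to implement only part of a specification.
Can you share the article URL and the response?
@Shakeel For example, take this article URL - timesofindia.indiatimes.com/business/india-business/… or any TOI article; the publish date in the returned object is null.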

Tags: python-3.x python-newspaper newspaper3k

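【Solution 1】:

The current version of Newspaper cannot extract the publish date from the Times of India HTML because the date sits inside a script tag. You can extract this date with requests and BeautifulSoup; the latter is bundled with Newspaper. I also noticed that the keywords are in a meta tag, so Newspaper does not extract them either. I have added some code to extract the keywords as well. Hopefully the code below helps with your queries for Times of India articles. Let me know if you have any questions.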

import requests
import re as regex
from newspaper import Article
from newspaper.utils import BeautifulSoup

base_url = 'https://timesofindia.indiatimes.com/business/india-business/govt-working-to-reduce-e-vehicle-tax-niti-aayog-ceo/articleshow/78210495.cms'

raw_html = requests.get(base_url)
soup = BeautifulSoup(raw_html.text, 'html.parser')

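# parse date published
# this code reads the date from the second <script> tag on the page
data = soup.find_all('script')[1]
find_date = regex.search(r'datePublished.{3}\d{4}-\d{2}-\d{2}', data.string)
# split the match on double quotes; the date is the third piece
date_published = find_date.group().split('"')[2]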

# parse other elements using Newspaper
# (the HTML is already fetched, so it is passed in rather than downloaded again)
article = Article('')
article.download(input_html=raw_html.content)
article.parse()
article_tags = article.tags
article_content = article.text
article_title = article.title

# parse keywords
article_meta_data = article.meta_data
article_keywords = sorted({value for (key, value) in article_meta_data.items() if key == 'keywords'})
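
The script above depends on the date being in the second script tag and on the exact quoting around datePublished. As a rough sketch of a less position-dependent variant, the code below scans every JSON-LD script block for a datePublished field and splits a comma-separated keywords value into individual terms. It assumes the page exposes the date as JSON-LD and that the keywords meta value is a comma-separated string, neither of which is confirmed by the answer. The helper names extract_date_published and split_keywords and the sample HTML are hypothetical, and only the standard library is used so the sketch runs on its own.

```python
import json
import re

# Matches the body of every <script type="application/ld+json"> element.
LD_JSON_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.IGNORECASE | re.DOTALL,
)


def extract_date_published(html):
    """Return the first 'datePublished' value found in JSON-LD, else None."""
    for body in LD_JSON_RE.findall(html):
        try:
            data = json.loads(body)
        except ValueError:
            continue  # skip script bodies that are not valid JSON
        # JSON-LD may hold a single object or a list of objects.
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and item.get('datePublished'):
                return item['datePublished']
    return None


def split_keywords(meta_data):
    """Split a comma-separated 'keywords' value into a sorted, de-duplicated list."""
    raw = meta_data.get('keywords', '')
    if not isinstance(raw, str):
        return []
    return sorted({kw.strip() for kw in raw.split(',') if kw.strip()})


# Hypothetical sample input, not the real Times of India markup.
sample_html = '''
<html><head>
<script>var x = 1;</script>
<script type="application/ld+json">
{"@type": "NewsArticle", "datePublished": "2020-09-20T10:15:00+05:30"}
</script>
</head></html>
'''

print(extract_date_published(sample_html))
print(split_keywords({'keywords': 'tax, NITI Aayog, tax, e-vehicle'}))
```

With the script above, extract_date_published(raw_html.text) and split_keywords(article.meta_data) would be the corresponding calls. Returning None when no date is found lets the caller decide how to handle pages with a different layout instead of failing on a missing regex match.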

【Comments】: