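[Posted]: 2020-12-11 18:42:35
[Question]:
I need to scrape the author and publication date from news articles, but I can't access some of the information stored in the meta tags.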
import requests, random, re, os
from bs4 import BeautifulSoup as bs
import urllib.parse
import time
from newspaper import Article

url = [
    'https://www.wsj.com/articles/covid-19-is-dividing-the-american-worker-11598068859?mod=hp_lead_pos7',  # WALL STREET JOURNAL
]

for link in url:
    # Try 1
    # Get the published date -- this is where I have problems.
    webpage = requests.get(link)
    soup = bs(webpage.text, "html.parser")
    date = soup.find("meta", {"name": "article.published"})
    print(date)

    # Try 2
    # Access date from the <time> tag instead
    for tag in soup.find_all('time', {"class": "timestamp article__timestamp flexbox__flex--1"}):
        date = tag.text
        print(date)

    # Get the author name -- this part works
    article = Article(link, language='en')
    article.download()
    article.parse()
    # print(article.html)
    author = article.authors
    date = article.publish_date
    author = author[0]
    day_month = str("Check Date")
    print(day_month + "," + "," + "," + str(author))
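When I print the soup, I can see the meta tags in the output, so I know they are there, but I can't seem to access them with either method.
This is the output I currently get:
None
Check Date,,,Christopher Mims
Any ideas?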
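For reference, here is a minimal sketch of reading the date from a meta tag's content attribute, falling back to the <time> element when the meta tag is missing. It runs against a hard-coded HTML snippet rather than the live page, so SAMPLE_HTML, its values, and the helper name extract_published are illustrative assumptions, not what wsj.com actually returns. If the same lookup gives None on the live page, it may be worth checking whether the HTML that requests received really contains the tag.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded article page.
# The tag names mirror the question; the values are made up for illustration.
SAMPLE_HTML = """
<html><head>
<meta name="article.published" content="2020-08-22T04:01:00.000Z">
</head><body>
<time class="timestamp article__timestamp flexbox__flex--1">
  Aug. 22, 2020 12:01 am ET
</time>
</body></html>
"""


def extract_published(html):
    """Return the published date string from the meta tag, else the <time> text, else None."""
    soup = BeautifulSoup(html, "html.parser")

    # "name" is also find()'s first parameter, so the attribute filter
    # has to be passed through the attrs dictionary.
    meta = soup.find("meta", attrs={"name": "article.published"})
    if meta is not None and meta.get("content"):
        # The date lives in the content attribute, not in the tag's text.
        return meta["content"].strip()

    # Fallback: match on a single class with a CSS selector instead of
    # the full multi-class string.
    time_tag = soup.select_one("time.timestamp")
    if time_tag is not None:
        return time_tag.get_text(strip=True)

    return None


print(extract_published(SAMPLE_HTML))
```

Matching on one class (time.timestamp) rather than the whole class string keeps the fallback working if the site adds or reorders classes on that element.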
[Comments]:
-
Does this answer your question? Scraping wsj.com
-
Did my answer help you? Have you also looked at my overview document on newspaper?
-
@Lifeiscomplex, yes, I just marked it as answered. Thanks!
Tags: web-scraping beautifulsoup newspaper3k