【Posted】:2026-02-01 03:00:02
【Question】:
I'm trying to scrape stories from nbcnews.com. I currently have the following code:
import urllib2
from bs4 import BeautifulSoup

# The page that I'm getting stories from
url = 'http://www.nbcnews.com/'
data = urllib2.urlopen(url)
soup = BeautifulSoup(data, 'html.parser')

# This is the tag and class that Chrome told me "top stories" are stored in
this = soup.find_all('div', attrs={"class": "col-sm-6 col-md-8 col-lg-9"})

# Get the a tags inside the previous tag (this is the part that returns FAR too many links)
link = [a for i in this for a in i.find_all('a')]

# Get the titles (this works)
title = [a.get_text() for i in link for a in i.find_all('h3')]

# Strip all newlines and tabs from the title names, skipping duplicates
newtitle = []
for i in title:
    s = ' '.join(i.split())
    if s not in newtitle:
        newtitle.append(s)

print len(link)
print len(title)
When I run the script, the 'title' list is (mostly) correct, but the headline names differ slightly from what's shown on the site (when the names are close enough, that's not a problem).
My problem is that the 'link' list seems to contain links from all over the page. Can someone help me figure this out?
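One way to narrow that down might be to keep only the anchors that actually wrap an <h3> headline, rather than every <a> inside the container. A minimal sketch along those lines, reusing the same page URL and container class from the snippet above (the variable names here are just for illustration):
# Sketch: keep only anchors that contain an <h3> headline, so unrelated
# navigation/footer links inside the same container are skipped.
import urllib2
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://www.nbcnews.com/'), 'html.parser')

stories = []
for container in soup.find_all('div', attrs={"class": "col-sm-6 col-md-8 col-lg-9"}):
    for a in container.find_all('a'):
        h3 = a.find('h3')
        if h3 is None:
            continue  # not a headline link, skip it
        text = ' '.join(h3.get_text().split())  # normalize whitespace
        href = a.get('href')
        if href and (text, href) not in stories:
            stories.append((text, href))

for text, href in stories:
    print text, '->', href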
Or, if possible, is there some kind of API for this? I'd really rather not reinvent the wheel just to fetch news articles if I can avoid it.
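On the API side: if NBC News publishes an RSS feed for the section you want (many news sites do), a library like feedparser would avoid HTML scraping entirely. A rough sketch follows; the feed URL is a hypothetical placeholder, not a confirmed endpoint:
# Hedged sketch: read headlines from an RSS feed instead of scraping markup.
# FEED_URL is a hypothetical placeholder; substitute whatever feed the site
# actually exposes for the section you care about.
import feedparser  # third-party: pip install feedparser

FEED_URL = 'http://www.nbcnews.com/path/to/top-stories.rss'  # placeholder, not verified
feed = feedparser.parse(FEED_URL)

for entry in feed.entries:
    print ' '.join(entry.title.split())
    print entry.link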
Edit: fixed typos in the variable names.
【Discussion】:
Tags: python beautifulsoup