[Posted]: 2015-09-28 10:50:59
[Question]:
I want to extract every href and src inside all the divs on the page that have class='news_item'.
The HTML looks like this:
<div class="col">
  <div class="group">
    <h4>News</h4>
    <div class="news_item">
      <a href="www.link.com">
        <h2 class="link">
          here is a link-heading
        </h2>
        <div class="Img">
          <img border="0" src="/image/link" />
        </div>
        <p></p>
      </a>
    </div>
What I want to extract from this is:
www.link.com, here is a link-heading, and /image/link
My code is:
def scrape_a(url):
    news_links = soup.select("div.news_item [href]")
    for links in news_links:
        if news_links:
            return 'http://www.web.com' + news_links['href']

def scrape_headings(url):
    for news_headings in soup.select("h2.link"):
        return str(news_headings.string.strip())

def scrape_images(url):
    images = soup.select("div.Img[src]")
    for image in images:
        if images:
            return 'http://www.web.com' + news_links['src']

def top_stories():
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    link = scrape_a(soup)
    heading = scrape_headings(soup)
    image = scrape_images(soup)
    message = {'heading': heading, 'link': link, 'image': image}
    print message
The problem is that it gives me this error:
**TypeError: 'NoneType' object is not callable**
Here is the traceback:
Traceback (most recent call last):
  File "web_parser.py", line 40, in <module>
    top_stories()
  File "web_parser.py", line 32, in top_stories
    link = scrape_a('www.link.com')
  File "web_parser.py", line 10, in scrape_a
    news_links = soup.select_all("div.news_item [href]")
[Comments]:
- Please paste the stack traceback.
- @hjpotter92 Done, please look at the post again.
- What is div.news_item [href] supposed to match/find?
- I mean you hit the website three times, but you only need to once: soup = BeautifulSoup(r.content), then use soup everywhere.
- Besides the fact that there is no news_item div on web.com... you need something like soup_news_item = soup.select("div.news_item") and then soup_news_item.find_all('a', href=True)
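Pulling the comments' suggestions together, here is a minimal corrected sketch: parse the HTML once, then pull the href, the heading text, and the img src out of each news_item div. The selectors, class names, and the 'http://www.web.com' prefix come from the question itself; the function name and dict layout are just illustrative assumptions.

```python
from bs4 import BeautifulSoup

def top_stories(html):
    # Parse the page once and reuse the same soup for every lookup.
    soup = BeautifulSoup(html, 'html.parser')
    stories = []
    for item in soup.select('div.news_item'):
        # Only match tags that actually carry the attribute we need.
        link = item.find('a', href=True)
        heading = item.find('h2', class_='link')
        image = item.find('img', src=True)
        stories.append({
            'link': 'http://www.web.com' + link['href'] if link else None,
            'heading': heading.get_text(strip=True) if heading else None,
            'image': 'http://www.web.com' + image['src'] if image else None,
        })
    return stories
```

Note that in the original code the [src] attribute selector was attached to div.Img, but the src attribute lives on the img tag, not the div; searching for the img itself avoids that mismatch.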
Tags: python parsing beautifulsoup