URL parsing error [BeautifulSoup]

【Question title】: URL parsing error [BeautifulSoup]
【Posted】: 2011-07-03 05:31:37
【Question】:

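I am trying to get a list of href links from a web page, but my code does not work correctly. It appends links to urlList when it should not, and it also duplicates the href links.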

import urllib2
from BeautifulSoup import BeautifulSoup

response = urllib2.urlopen("http://www.gamefaqs.com")
html = response.read()
soup = BeautifulSoup(html)

doNotProcessList = ["gamespot.com", "cnet.com", "gamefaqs.com"]

urlList = []

for link in soup.findAll('a'):
    for bad in doNotProcessList:
        if bad not in link['href']:
            urlList.append(link['href'])

print urlList

Example of the incorrect output:

[u'http://cbsiprivacy.custhelp.com/app/answers/detail/a_id/1272/',
 u'http://cbsiprivacy.custhelp.com/app/answers/detail/a_id/1272/',
 u'http://www.cbsinteractive.com/terms_of_use.php?tag=ft',
 u'http://www.cbsinteractive.com/terms_of_use.php?tag=ft',
 u'http://www.cbsinteractive.com/terms_of_use.php?tag=ft',
 u'http://m.gamefaqs.com/?mob_on=1',
 u'http://m.gamefaqs.com/?mob_on=1']

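The error is related to the "not" in the if statement, because removing the "not" results in only the bad items being stored in the list, as shown below: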

[u'http://membership.gamefaqs.com/1328-4-46.html',
 u'http://www.gamefaqs.com/user/register.html',
 u'http://www.gamespot.com/6316274',
 u'http://www.gamespot.com/6316274',
 u'http://www.gamespot.com/6316489',
 u'http://www.gamespot.com/6316489',
 u'http://www.gamespot.com/6316225',
 u'http://www.gamespot.com/6316225',
 u'http://www.gamespot.com/features/index.html',
 u'http://www.gamespot.com/news/6322016.html',
 u'http://www.gamespot.com/news/6322019.html',
 u'http://www.gamespot.com/news/6322017.html',
 u'http://www.gamespot.com/news/6322010.html',
 u'http://www.gamespot.com/news/6321996.html',
 u'http://www.gamespot.com/news/index.html',
 u'http://www.gamespot.com/features/6314339/index.html',
 u'http://www.gamespot.com/features/6313939/index.html',
 u'http://www.gamespot.com/features/6309202/index.html',
 u'http://www.gamespot.com/features/6320393/index.html',
 u'http://www.gamespot.com/features/6162248/index.html',
 u'http://www.gamespot.com/gameguides.html',
 u'http://www.gamespot.com/downloads/index.html',
 u'http://www.gamespot.com/news/index.html',
 u'http://www.gamespot.com/pc/index.html',
 u'http://www.gamespot.com/xbox360/index.html',
 u'http://www.gamespot.com/wii/index.html',
 u'http://www.gamespot.com/ps3/index.html',
 u'http://www.gamespot.com/psp/index.html',
 u'http://www.gamespot.com/ds/index.html',
 u'http://www.gamespot.com/ps2/index.html',
 u'http://www.gamespot.com/gba/index.html',
 u'http://www.gamespot.com/mobile/index.html',
 u'http://www.gamespot.com/cheats.html',
 u'http://www.gamespot.com/forums/index.html',
 u'http://www.gamespot.com/',
 u'http://www.gamefaqs.com/features/help/',
 u'http://sitemap.gamefaqs.com/',
 u'http://www.gamefaqs.com/features/aboutus.html',
 u'http://reviews.cnet.com/Music/2001-6450_7-0.html',
 u'http://reviews.cnet.com/Cell_phones/2001-3504_7-0.html',
 u'http://reviews.cnet.com/Digital_cameras/2001-6501_7-0.html',
 u'http://reviews.cnet.com/Notebooks/2001-3121_7-0.html',
 u'http://reviews.cnet.com/Handhelds/2001-3127_7-0.html',
 u'http://reviews.cnet.com/4521-6531_7-5021436-3.html',
 u'http://reviews.cnet.com/Web_hosting/2001-6540_7-0.html',
 u'http://clearance.cnet.com',
 u'http://shopper.cnet.com/4520-5-6276184.html',
 u'http://www.cnet.com',
 u'http://www.gamespot.com',
 u'http://www.gamespot.com/cheats.html',
 u'http://www.cnet.com/apple-iphone.html',
 u'http://www.gamespot.com/reviews.html',
 u'http://reviews.cnet.com/laptops',
 u'http://download.cnet.com/windows/antivirus-software/',
 u'http://m.gamefaqs.com/?mob_on=1']

【Question comments】:

Tags: python duplicates beautifulsoup

【Solution 1】:

List comprehension FTW:

[link['href'] for link in soup.findAll('a') 
 if not any(bad in link['href'] for bad in doNotProcessList)]

And, for readability...

def condition(x):
    return not any((bad in x) for bad in doNotProcessList)

[link['href'] for link in soup.findAll('a') if condition(link['href'])]
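
As a rough illustration of why the nested loop in the question misbehaves: it appends an href once for every blocked domain the href does not contain, so a clean link is added three times and a blocked link is still added twice. The sketch below uses a made-up hrefs list in place of the soup.findAll('a') results (an assumption for the example, nothing is fetched from the site) and compares that loop with the any()-based filter.

```python
# Hypothetical sample hrefs standing in for the links found by soup.findAll('a').
hrefs = [
    "http://www.gamespot.com/cheats.html",
    "http://www.cbsinteractive.com/terms_of_use.php?tag=ft",
    "http://m.gamefaqs.com/?mob_on=1",
]
doNotProcessList = ["gamespot.com", "cnet.com", "gamefaqs.com"]

# Logic from the question: one append per blocked domain that is NOT in the href.
nested = []
for href in hrefs:
    for bad in doNotProcessList:
        if bad not in href:
            nested.append(href)

# Logic from the answer: keep an href only if no blocked domain occurs in it.
filtered = [href for href in hrefs
            if not any(bad in href for bad in doNotProcessList)]

print(len(nested))  # 7 entries: 2 + 3 + 2
print(filtered)     # only the cbsinteractive.com link
```

Note that this only removes the duplicates created by the loop; if the page itself links to the same URL twice, both copies are kept.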

【Comments】:

That works, but where can I find a good tutorial on list comprehensions? Python.org is very lacking.
A list comprehension is [modify(x) for x in iterable if condition(x)], and it produces a list...bogotobogo.com/python/python_list_comprehension.html
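
To make the general form in the comment above concrete, here is a small sketch showing a comprehension next to the loop it is shorthand for. The modify and condition functions are placeholders invented for this example, not anything from the question.

```python
def modify(x):
    # Placeholder transformation: multiply by ten.
    return x * 10

def condition(x):
    # Placeholder filter: keep even numbers only.
    return x % 2 == 0

iterable = range(6)

# Comprehension form.
comprehension = [modify(x) for x in iterable if condition(x)]

# Equivalent explicit loop.
loop_result = []
for x in iterable:
    if condition(x):
        loop_result.append(modify(x))

print(comprehension)  # [0, 20, 40]
```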