Python Web Scraping - urlopen error [Errno -2] Name or service not known

【Question title】: Python Web Scraping - urlopen error [Errno -2] Name or service not known
【Posted】: 2012-07-23 02:02:44
【Question】:

I am trying to extract data from the Civic Commons Apps link for my project. I am able to obtain the links of the pages I need. But when I try to open those links, I get "urlopen error [Errno -2] Name or service not known".

The web scraping Python code:

from bs4 import BeautifulSoup
from urlparse import urlparse, parse_qs
import re
import urllib2
import pdb

base_url = "http://civiccommons.org"
url = "http://civiccommons.org/apps"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

list_of_links = [] 

for link_tag in soup.findAll('a', href=re.compile('^/civic-function.*')):
   string_temp_link = base_url+link_tag.get('href')
   list_of_links.append(string_temp_link)

list_of_links = list(set(list_of_links)) 

list_of_next_pages = []
for categorized_apps_url in list_of_links:
   categorized_apps_page = urllib2.urlopen(categorized_apps_url)
   categorized_apps_soup = BeautifulSoup(categorized_apps_page.read())

   last_page_tag = categorized_apps_soup.find('a', title="Go to last page")
   if last_page_tag:
      last_page_url = base_url+last_page_tag.get('href')
      index_value = last_page_url.find("page=") + 5
      base_url_for_next_page = last_page_url[:index_value]
      for pageno in xrange(0, int(parse_qs(urlparse(last_page_url).query)['page'][0]) + 1):
         list_of_next_pages.append(base_url_for_next_page+str(pageno))
      
   else:
      list_of_next_pages.append(categorized_apps_url)

I get the following error:

urllib2.urlopen(categorized_apps_url)
  File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 400, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 418, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1207, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1177, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error [Errno -2] Name or service not known>

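Is there anything specific I should watch out for when calling urlopen? I don't see anything wrong with the http links I am getting.

[EDIT] On a second run, I got the following error: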

 File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 400, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 418, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1207, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1177, in do_open
    raise URLError(err)

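The same code runs fine on my friend's Mac, but fails on my Ubuntu 12.04.

I also tried running the code in ScraperWiki, and it completed successfully. But a few URLs were missing (compared with the Mac). Is there any reason for these behaviors?

【Comments】:

What is the value of categorized_apps_url when the error occurs?
Also, I know this kind of comment is generally considered annoying, but you might find life easier if you used httplib2 or requests instead of urllib2. They provide a more complete set of features for working with HTTP.
Your script runs fine on my computer. I'm running on a Mac with Python 2.7, and tried it with both BeautifulSoup 3.2 and 4.0; in both cases it returned a list of 69 main links and 117 next-page links. I suspect something on your system is blocking Python. Have you tried pinging the URLs directly? Maybe your antivirus software is blocking your script?
@kojiro: I couldn't find the exact link at which I get the error. It breaks at different values. And thanks for the suggestion. I'm just trying out web scraping, so your comments are welcome. :)
@MarkGemmill: I checked, and the URLs are valid. I also tried the same code on my friend's Mac, and it worked fine. Is there any reason it fails on my Ubuntu 12.04?

Tags: python web-scraping beautifulsoup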

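【Answer 1】:

The code works on my Mac and on your friend's Mac. It runs fine from a virtual machine instance of Ubuntu 12.04 server. There is obviously something in your particular environment - your OS (Ubuntu Desktop?) or network - that is causing it to fail. For example, my home router's default setting throttles the number of calls to the same domain in x seconds, and could cause this kind of issue if I didn't turn it off. It could be a number of things.

At this stage, I'd suggest refactoring your code to catch URLError and set aside the problematic URLs for a retry. Also log/print the error if they still fail after several retries. Maybe even add some code to time the calls between errors. That's better than having your script fail outright, and you'll get feedback on whether it's just particular URLs causing the problem or a timing issue (i.e., does it fail after x urlopen calls, or after x urlopen calls within x micro/seconds). If it's a timing issue, a simple time.sleep(1) inserted into your loops might do the trick.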
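
The retry idea above could be sketched roughly as follows. This is an illustration under stated assumptions, not the answerer's own code: the helper name fetch_with_retry and its parameters (opener, retries, delay, sleep) are hypothetical. The import tries urllib2 first, as in the question, and falls back to the Python 3 module names.

```python
import time

try:
    from urllib2 import urlopen, URLError          # Python 2, as in the question
except ImportError:
    from urllib.request import urlopen             # Python 3 equivalents
    from urllib.error import URLError


def fetch_with_retry(url, opener=urlopen, retries=3, delay=1.0, sleep=time.sleep):
    """Try to fetch url up to `retries` times (hypothetical helper).

    Returns (body, errors). body is None if every attempt failed;
    errors is a list of (attempt_number, timestamp, reason) tuples, so the
    time between failures can be inspected afterwards.
    """
    errors = []
    for attempt in range(1, retries + 1):
        try:
            return opener(url).read(), errors
        except URLError as err:
            # Record when the failure happened instead of crashing the script.
            errors.append((attempt, time.time(), err.reason))
            print("attempt %d failed for %s: %s" % (attempt, url, err.reason))
            if attempt < retries:
                sleep(delay)  # pause before retrying, in case it is a rate issue
    return None, errors
```

In the question's loop, urllib2.urlopen(categorized_apps_url) would be replaced by a call to this helper, and any URL whose body comes back as None could be collected in a list and reported at the end rather than aborting the run.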

【Comments】:

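【Answer 2】:

SyncMaster,

I hit this same problem recently after hopping onto an old Ubuntu box I hadn't played with in a while. The problem is actually caused by the DNS settings on your machine. I strongly recommend checking your DNS settings (/etc/resolv.conf, adding nameserver 8.8.8.8) and then retrying; you should have success.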
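
One way to confirm that name resolution, rather than the scraper itself, is failing is to resolve the host directly from Python. The sketch below is only an illustration: the helper name check_dns and its resolver parameter are hypothetical. It relies on socket.getaddrinfo raising socket.gaierror when a lookup fails, which on Linux is where the "[Errno -2] Name or service not known" message comes from.

```python
import socket


def check_dns(hostname, resolver=socket.getaddrinfo):
    """Return (True, [addresses]) if hostname resolves, else (False, message).

    Hypothetical diagnostic helper; `resolver` is injectable for testing.
    """
    try:
        infos = resolver(hostname, 80)
    except socket.gaierror as err:
        # Failed lookup: this is the error urlopen wraps in URLError.
        return False, str(err)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address.
    return True, sorted(set(info[4][0] for info in infos))


if __name__ == "__main__":
    print(check_dns("civiccommons.org"))
```

If this returns False intermittently for a host that resolves fine from another machine, that points to the local resolver configuration, which is consistent with the /etc/resolv.conf suggestion above.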

【Comments】: