urllib2 isn't able to open a site

【Question title】: urllib2 isn't able to open a site
【Posted】: 2013-12-18 01:24:29
【Question】:

When I try to open this link (http://-travka-.tokobagus.com/)

urllib2 gives me this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 422, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1214, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1184, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error [Errno 2] No such file or directory>

I think the leading hyphen/dash is the problem. How can I open a URL like this with urllib2?

Full code:

import urllib
import urllib2
from bs4 import BeautifulSoup 

url = 'http://-travka-.tokobagus.com/'
#url = 'http://www.google.com'
data = urllib2.urlopen(url)
#soup = BeautifulSoup(data)

As you can see, when I switch to google.com it works fine. Could this be a version-related bug?

My versions:

Python - 2.7.4
Ubuntu - 13.04

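【Comments】:

It's not BS that is giving you the error, it's urllib2. Could you show the relevant code?
@aIKid Code added. Please check.
Your code works fine for me. What platform is this on?
@alko That is also my result on Python 2.6.8 / urllib2 2.6. I believe this is the normal error code for a failed DNS lookup. Maybe DNS for this subdomain is flaky?
That is not a valid URL. RFC 1035 (tools.ietf.org/html/rfc1035), at the end of section 2.3.1 (along with other documents), states that a hyphen may only appear inside a name, not as its first or last character. I imagine that trying to look up an invalid name could well cause trouble!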
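
The hyphen rule mentioned in the last comment can be checked locally before any request is made. The following is only a minimal sketch: the helper name `invalid_labels` is hypothetical, and the pattern assumes the commonly used relaxed form of the rule (letters, digits and hyphens, 1 to 63 characters per label, no hyphen at either end), which also accepts a leading digit even though RFC 1035 itself asks for a leading letter.

```python
import re

try:
    from urlparse import urlsplit  # Python 2, as used in this thread
except ImportError:
    from urllib.parse import urlsplit  # Python 3

# One label: letters, digits and hyphens, 1-63 characters,
# with no hyphen in the first or last position (assumed relaxed rule).
_LABEL = re.compile(r'(?!-)[A-Za-z0-9-]{1,63}(?<!-)\Z')


def invalid_labels(url):
    """Return the host labels of `url` that break the rule above."""
    host = urlsplit(url).hostname or ''
    return [label for label in host.split('.') if not _LABEL.match(label)]


print(invalid_labels('http://-travka-.tokobagus.com/'))  # ['-travka-']
print(invalid_labels('http://www.google.com/'))          # []
```

A non-empty result does not prove that the lookup will fail (the second answer shows the site opening on another machine), only that the name is outside what strict resolvers are required to accept.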

Tags: python web-scraping beautifulsoup urllib2 urllib

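【Solution 1】:

Adding this request for information as an answer, because it would be unreadable in a comment. @user3037901, could you add the traceback for the following commands: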

import httplib
import urllib2
req = urllib2.Request('http://-travka-.tokobagus.com/')
h = httplib.HTTPConnection(req.get_host())
h.request(req.get_method(), req.get_selector(), req.data, {})

【Discussion】:

I get an error: socket.error: [Errno 2] No such file or directory
@user3037901 I meant the full traceback. At least the last three lines.
  File "/usr/lib/python2.7/httplib.py", line 791, in send
    self.connect()
  File "/usr/lib/python2.7/httplib.py", line 772, in connect
    self.timeout, self.source_address)
  File "/usr/lib/python2.7/socket.py", line 553, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
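@user3037901 This is an error raised by the C extension during address lookup. I wonder whether our results differ because of the Ubuntu version (mine is 12.x), so you are probably at a dead end trying to reach a badly named domain, that is, one with a leading hyphen.
@user3037901 If you don't mind, I will move the information from this discussion into your question and delete my answer.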
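
Since the traceback ends inside `getaddrinfo`, the name lookup can be tested on its own, without urllib2 or httplib in the way. This is a rough sketch rather than a fix: the helper name `lookup` and its `resolver` parameter are hypothetical and exist only so that the function can be exercised without network access.

```python
import socket


def lookup(host, port=80, resolver=socket.getaddrinfo):
    """Return (addresses, error) for `host`; exactly one of them is None."""
    try:
        results = resolver(host, port, 0, socket.SOCK_STREAM)
    except socket.error as exc:  # socket.gaierror is a subclass
        return None, exc
    # Each result is (family, socktype, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address as a string.
    return sorted(set(r[4][0] for r in results)), None
```

Calling `lookup('-travka-.tokobagus.com')` and `lookup('www.google.com')` on the affected machine should show whether only the hyphenated name fails at the resolver level. If it does, the problem lies in the system's name resolution, not in urllib2.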

【Solution 2】:

It works for me. Here is the result:

Python 2.7.5 (v2.7.5:ab05e7dd2788, May 13 2013, 13:18:45) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib2
>>> from bs4 import BeautifulSoup
>>> stream = urllib2.urlopen("http://-travka-.tokobagus.com/")
>>> response = stream.read()
>>> soup = BeautifulSoup(response)
>>> soup.prettify()
u'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">\n<html xmlns="http://www.w3.org/1999/xhtml">\n <head>\n  <meta content="text/html;charset=utf-8" http-equiv="Content-Type"/>\n  <title>\n   HERRY FIRDAUS NST | TOKOBAGUS.COM\n  </title>\n  <link href="http://-travka-.tokobagus.com" rel="canonical"/>\n  <meta content="-travka- telah menjadi member Tokobagus sejak 01-05-2013. Lihat profil -travka- selengkapnya di Tokobagus." name="description"/>\n  <meta content="index,follow" name="robots"/>\n  <link href="http://as.tokobagus.biz/v6/global/images/favicon-13.ico" rel="shortcut icon" type="image/ico"/>\n  <link href="http://as.tokobagus.biz/v6/global/css/global.min.1.0.18.css" media="screen" rel="stylesheet" type="text/css"/>\n  <link href="http://as.tokobagus.biz/v6/skins/default/css/tbl.min.1.0.10.css" media="screen,print" rel="stylesheet" type="text/css"/>\n  <link href="http://as.tokobagus.biz/v6/skins/d

【Discussion】: