【Question Title】: What have I done wrong parsing HTML with Python urllib2 and BeautifulSoup?
【Posted】: 2014-05-17 09:07:32
【Question】:

I'm trying to scrape some links from Google while learning Python.

import urllib2
from bs4 import BeautifulSoup

response = urllib2.urlopen('http://www.google.com.au/search?q=python')
html = response.read()
print html
response.close()

What am I doing wrong? I get the following error:

---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
<ipython-input-4-d990999e71f4> in <module>()
      9 
     10 import urllib2
---> 11 response = urllib2.urlopen('http://www.google.com.au/search?q=python')
     12 html = response.read()
     13 print html

C:\Python27\lib\urllib2.pyc in urlopen(url, data, timeout)
    124     if _opener is None:
    125         _opener = build_opener()
--> 126     return _opener.open(url, data, timeout)
    127 
    128 def install_opener(opener):

C:\Python27\lib\urllib2.pyc in open(self, fullurl, data, timeout)
    395         for processor in self.process_response.get(protocol, []):
    396             meth = getattr(processor, meth_name)
--> 397             response = meth(req, response)
    398 
    399         return response

C:\Python27\lib\urllib2.pyc in http_response(self, request, response)
    508         if not (200 <= code < 300):
    509             response = self.parent.error(
--> 510                 'http', request, response, code, msg, hdrs)
    511 
    512         return response

C:\Python27\lib\urllib2.pyc in error(self, proto, *args)
    433         if http_err:
    434             args = (dict, 'default', 'http_error_default') + orig_args
--> 435             return self._call_chain(*args)
    436 
    437 # XXX probably also want an abstract factory that knows when it makes

C:\Python27\lib\urllib2.pyc in _call_chain(self, chain, kind, meth_name, *args)
    367             func = getattr(handler, meth_name)
    368 
--> 369             result = func(*args)
    370             if result is not None:
    371                 return result

C:\Python27\lib\urllib2.pyc in http_error_default(self, req, fp, code, msg, hdrs)
    516 class HTTPDefaultErrorHandler(BaseHandler):
    517     def http_error_default(self, req, fp, code, msg, hdrs):
--> 518         raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
    519 
    520 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 403: Forbidden

【Question Discussion】:

Tags: python html beautifulsoup


【Solution 1】:

It looks like Google does not allow this type of request: the default urllib2 User-Agent gets rejected with 403 Forbidden.

Try Requests or mechanize instead.

With either library you can easily manipulate your request headers (User-Agent, etc.). Check which one is easier and fits you better.
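A minimal sketch of the header trick described above, shown with Python 3's urllib.request (into which urllib2 was merged; under Python 2 the same names live in urllib2). The browser-like User-Agent string is just an illustrative value, and no request is actually sent here; the Request object is only built and inspected:

```python
from urllib.request import Request  # Python 2: from urllib2 import Request

# Google rejects the default urllib User-Agent with 403, so present a
# browser-like one instead (any common browser UA string will do).
req = Request(
    'http://www.google.com.au/search?q=python',
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36'},
)

# urlopen(req) would now send the request with the custom header attached;
# here we only confirm the header is in place (urllib stores keys capitalized).
print(req.get_header('User-agent'))
```

Passing the prepared Request object to urlopen, instead of a bare URL string, is what makes the custom headers take effect.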

【Discussion】:

  • What if I need to fetch the HTML of a page that sits behind an intranet and requires authentication with a username and password? Can mechanize and/or BeautifulSoup/urllib2 do that?
  • Sure. Read the mechanize docs, or look at Requests. I created a script that logs into a page and then does some things; I'll edit my post and add the Requests link.
  • Can you share the login script? I have the code below, but it doesn't work for the page I want to scrape — maybe it doesn't handle HTTPS, or something else is blocking it?

    import urllib2
    url = 'somesite'
    username = 'usr'
    password = 'pass'
    p = urllib2.HTTPPasswordMgrWithDefaultRealm()
    p.add_password(None, url, username, password)
    handler = urllib2.HTTPBasicAuthHandler(p)
    opener = urllib2.build_opener(handler)
    urllib2.install_opener(opener)
    page = urllib2.urlopen(url).read()
    print page

  • I can't share the whole script, but this is useful to get started:

    payload = {'Username': 'myusername', 'Password': 'mypassword'}
    s = requests.session()
    s.post(login_url, payload)

    For more, read the Requests documentation — it really is simple :)
  • Do you think that's better than mechanize or BeautifulSoup? I've already used BeautifulSoup to parse links and strings, so I'm new to mechanize and Requests. Should I keep using BS for the link handling, or is it easy enough with mechanize and Requests that BeautifulSoup isn't needed?
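The basic-auth snippet quoted in the discussion above, as a runnable sketch with Python 3 names (Python 2 keeps the same classes in urllib2). The URL and credentials are placeholders carried over from the comment; the opener is built and installed but nothing is fetched:

```python
import urllib.request  # Python 2: import urllib2 and use the same names there

url = 'http://somesite.example/'  # placeholder, as in the comment above
username = 'usr'
password = 'pass'

# Register the credentials for this URL (realm=None matches any realm).
p = urllib.request.HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, url, username, password)

# Build an opener that answers HTTP 401 challenges with basic auth,
# and install it so plain urlopen() calls use it too.
handler = urllib.request.HTTPBasicAuthHandler(p)
opener = urllib.request.build_opener(handler)
urllib.request.install_opener(opener)

# urllib.request.urlopen(url).read() would now authenticate automatically.
# Note: this only covers HTTP basic auth. A form-based login page needs a
# POST of the form fields instead (as in the requests.session() comment above).
```

One likely reason the commenter's version failed: HTTPBasicAuthHandler only responds to 401 challenges, so a site that uses an HTML login form rather than basic auth will never trigger it.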