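【Posted】: 2017-03-11 17:34:15
【Question】:
I am trying to read all the comments for a given product, both to learn Python and for a project. To simplify the task, I picked a product at random to code against.
The link I want to read is on Amazon, and I use urllib to open it: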
amazon = urllib.request.urlopen('http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers/dp/B014CZA8P0/ref=pd_rhf_se_s_qp_1?_encoding=UTF8&pd_rd_i=B014CZA8P0&pd_rd_r=04RP223D4SF9BW7S2NP1&pd_rd_w=ZgGL6&pd_rd_wg=0PSZe&refRID=04RP223D4SF9BW7S2NP1')
After reading the link into the "amazon" variable, printing amazon gives me the following message:
print(amazon)
<http.client.HTTPResponse object at 0x000000DDB3796A20>
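So I read online and found that I need to use the read command to get the page source, but sometimes it gives me a web-page-like result and sometimes it does not: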
print(amazon.read())
b''
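How do I read the page and pass it to Beautiful Soup?
Edit 1
I did use requests.get, and when I checked the text of the retrieved page, I found the following, which does not match the page at the link.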
print(a2)
<html>
<head>
<title>503 - Service Unavailable Error</title>
</head>
<body bgcolor="#FFFFFF" text="#000000">
<!--
To discuss automated access to Amazon data please contact api-services-support@amazon.com.
For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.in/ref=rm_5_sv, or our Product Advertising API at https://affiliate-program.amazon.in/gp/advertising/api/detail/main.html/ref=rm_5_ac for advertising use cases.
-->
<center>
<a href="http://www.amazon.in/ref=cs_503_logo/">
<img src="https://images-eu.ssl-images-amazon.com/images/G/31/x-locale/communities/people/logo.gif" width=200 height=45 alt="Amazon.in" border=0></a>
<p align=center>
<font face="Verdana,Arial,Helvetica">
<font size="+2" color="#CC6600"><b>Oops!</b></font><br>
<b>It's rush hour and traffic is piling up on that page. Please try again in a short while.<br>If you were trying to place an order, it will not have been processed at this time.</b><p>
<img src="https://images-eu.ssl-images-amazon.com/images/G/02/x-locale/common/orange-arrow.gif" width=10 height=9 border=0 alt="*">
<b><a href="http://www.amazon.in/ref=cs_503_link/">Go to the Amazon.in home page to continue shopping</a></b>
</font>
</center>
</body>
</html>
【Comments】:
-
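You are getting that error because, most likely, you need to pass additional headers with your request. Look up how to set headers in urllib. You need to pass a User-Agent and other attributes so that the request looks like it comes from a person using a browser.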
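A minimal sketch of that suggestion, assuming Python 3 and the third-party beautifulsoup4 package for the parsing step. The helper names (build_request, fetch_html, parse_html) and the exact User-Agent string are illustrative assumptions, not anything Amazon or urllib prescribes. Note also that read() consumes the response stream, so a second call on the same response returns b''; read once and keep the result.

```python
import urllib.request

# Assumption: an illustrative browser-like User-Agent; any current browser's value would do.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}


def build_request(url, headers=None):
    """Return a urllib Request that carries browser-like headers (hypothetical helper)."""
    return urllib.request.Request(url, headers=headers or BROWSER_HEADERS)


def fetch_html(url, timeout=10):
    """Fetch the page once and return its body decoded as text."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        # read() drains the stream; calling it again would return b''.
        raw = response.read()
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


def parse_html(html):
    """Parse HTML text with Beautiful Soup (requires: pip install beautifulsoup4)."""
    from bs4 import BeautifulSoup  # imported here so the fetch helpers work without bs4

    return BeautifulSoup(html, "html.parser")


# Usage (performs a real network request, so it is left commented out):
# html = fetch_html("http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers/dp/B014CZA8P0/")
# soup = parse_html(html)
# print(soup.title)
```

Even with these headers the site may still answer with the 503 page shown in the question, since it can throttle or block automated access; the comment inside that page points to Amazon's own APIs as the supported route.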
Tags: python web-scraping urllib