【Question Title】: How to Scrape Amazon using Python 3
【Posted】: 2017-03-11 17:34:15
【Question】:

I'm trying to read all the comments for a given product, both to learn Python and for a project. To simplify the task, I picked a product at random to code against.

The link I want to read is on Amazon, and I open it with urllib:

amazon = urllib.request.urlopen('http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers/dp/B014CZA8P0/ref=pd_rhf_se_s_qp_1?_encoding=UTF8&pd_rd_i=B014CZA8P0&pd_rd_r=04RP223D4SF9BW7S2NP1&pd_rd_w=ZgGL6&pd_rd_wg=0PSZe&refRID=04RP223D4SF9BW7S2NP1')

After loading the link into the `amazon` variable, I get the following when I print it:

print(amazon)
<http.client.HTTPResponse object at 0x000000DDB3796A20>

So I read online that I need to call `read()` to get the page source, but sometimes it gives me web-page-like output and sometimes it doesn't:

print(amazon.read())
b''

How do I read the page and pass it to Beautiful Soup?

Edit 1

I did use `requests.get`, and when I inspected the text of the retrieved page I found the following, which does not match the page at that link:

print(a2)
<html>
<head>
<title>503 - Service Unavailable Error</title>
</head>
<body bgcolor="#FFFFFF" text="#000000">

<!--
        To discuss automated access to Amazon data please contact api-services-support@amazon.com.
        For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.in/ref=rm_5_sv, or our Product Advertising API at https://affiliate-program.amazon.in/gp/advertising/api/detail/main.html/ref=rm_5_ac for advertising use cases.
-->

<center>
<a href="http://www.amazon.in/ref=cs_503_logo/">
<img src="https://images-eu.ssl-images-amazon.com/images/G/31/x-locale/communities/people/logo.gif" width=200 height=45 alt="Amazon.in" border=0></a>
<p align=center>
<font face="Verdana,Arial,Helvetica">
<font size="+2" color="#CC6600"><b>Oops!</b></font><br>
<b>It's rush hour and traffic is piling up on that page. Please try again in a short while.<br>If you were trying to place an order, it will not have been processed at this time.</b><p>

<img src="https://images-eu.ssl-images-amazon.com/images/G/02/x-locale/common/orange-arrow.gif" width=10 height=9 border=0 alt="*">
<b><a href="http://www.amazon.in/ref=cs_503_link/">Go to the Amazon.in home page to continue shopping</a></b>
</font>

</center>
</body>
</html>

【Comments】:

  • You're getting that error; you most likely need to pass extra headers with your request. Look up setting headers with urllib. You need to act like a person in a browser by passing a User-Agent and other attributes.
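The comment above can be sketched with urllib alone. This is a minimal sketch, not tested against Amazon: the shortened URL and the User-Agent string here are illustrative placeholders, and the actual network call is left commented out.

```python
import urllib.request

# Build a Request that carries a browser-like User-Agent header, as the
# comment suggests. URL and header value are illustrative examples.
url = 'http://www.amazon.in/dp/B014CZA8P0'
req = urllib.request.Request(url, headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
})
# html = urllib.request.urlopen(req).read()  # network call, commented out
```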

Tags: python web-scraping urllib


【Solution 1】:

Personally, I would use the requests library for this instead of urllib. Requests has more features.

import requests

From there, something like:

resp = requests.get(url)  # you can also split your parameters and pass base_url & params here if you have multiple products to deal with
soup = BeautifulSoup(resp.text, 'html.parser')

That should do it, since this is a fairly simple HTTP request.

Edit: Based on your error, you will have to research which parameters to pass to make your request look right. In general a request looks like this (with values you discover yourself, obviously; check your browser's debug/developer tools to inspect your network traffic and see what you send to Amazon when using the browser):

url = "https://www.base.url.here"
params = {
    'param1': 'value1'
     .....
}
resp = requests.get(url, params=params)
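For reference, this sketch shows what requests does with that `params` dict: it URL-encodes the pairs and appends them to the URL as a query string. The parameter values below are illustrative, taken from the query string in the question's URL.

```python
from urllib.parse import urlencode

# Reproduce the query-string encoding that requests performs internally.
base_url = 'http://www.amazon.in/dp/B014CZA8P0'
params = {'_encoding': 'UTF8', 'refRID': '04RP223D4SF9BW7S2NP1'}
full_url = base_url + '?' + urlencode(params)
print(full_url)
# http://www.amazon.in/dp/B014CZA8P0?_encoding=UTF8&refRID=04RP223D4SF9BW7S2NP1
```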

【Comments】:

    【Solution 2】:

    With your current library, urllib, here is what you can do! Use `.read()` to get the HTML, then pass it to BeautifulSoup like this. Keep in mind that Amazon is a heavily anti-scraping site. The reason you get different results may be that the HTML is rendered by JavaScript, in which case you may have to use Selenium or Dryscrape. You may also need to pass headers/cookies and extra attributes with your request.

    import urllib.request
    from bs4 import BeautifulSoup

    amazon = urllib.request.urlopen('http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers/dp/B014CZA8P0/ref=pd_rhf_se_s_qp_1?_encoding=UTF8&pd_rd_i=B014CZA8P0&pd_rd_r=04RP223D4SF9BW7S2NP1&pd_rd_w=ZgGL6&pd_rd_wg=0PSZe&refRID=04RP223D4SF9BW7S2NP1')
    html = amazon.read()
    soup = BeautifulSoup(html, 'html.parser')
    

    EDIT ---- Turns out you're using requests now. With requests and headers passed like this, I can get a 200 response.

    import requests
    from bs4 import BeautifulSoup
    
    headers = {
        'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'
    }
    response = requests.get('http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers/dp/B014CZA8P0/ref=pd_rhf_se_s_qp_1?_encoding=UTF8&pd_rd_i=B014CZA8P0&pd_rd_r=04RP223D4SF9BW7S2NP1&pd_rd_w=ZgGL6&pd_rd_wg=0PSZe&refRID=04RP223D4SF9BW7S2NP1',headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    print(response.status_code)  # 200
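Since the 503 block page quoted in the question is easy to recognize, a small helper (a hypothetical one of our own, not part of any library) can distinguish it from real content before you bother parsing:

```python
def looks_blocked(status_code, body):
    """Heuristic check for Amazon's bot-block page (helper name is ours).

    The marker string comes from the HTML comment in the 503 page shown
    in the question.
    """
    return status_code == 503 or 'To discuss automated access' in body

print(looks_blocked(503, ''))                            # True
print(looks_blocked(200, '<html>product page</html>'))   # False
```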
    

    --- 使用 Dryscrape

    import dryscrape
    from bs4 import BeautifulSoup
    
    sess = dryscrape.Session(base_url='http://www.amazon.in')
    sess.set_header('user-agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36')
    sess.visit('http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers/dp/B014CZA8P0/ref=pd_rhf_se_s_qp_1?_encoding=UTF8&pd_rd_i=B014CZA8P0&pd_rd_r=04RP223D4SF9BW7S2NP1&pd_rd_w=ZgGL6&pd_rd_wg=0PSZe&refRID=04RP223D4SF9BW7S2NP1')
    html = sess.body()
    soup = BeautifulSoup(html, 'html.parser')
    print(soup)
    
    ## Should give you all the Amazon HTML attributes now! Keep in mind I haven't tested this code. Please refer to the dryscrape documentation for installation: https://dryscrape.readthedocs.io/en/latest/apidoc.html
    

    【Comments】:

    • I did use your requests version but couldn't retrieve anything, maybe because, as you said, Amazon is an anti-scraping site. Were you able to run the code?
    • Hey, I'll update the thread to use python-requests. That should work!
    • Check out the edited version!
    • @lollerskates It does. That's why I mentioned Selenium/Dryscrape at the start of this thread.
    • Oops, my bad. I may have replied to the wrong comment.