【Question Title】: Beautifulsoup for webscraping is not working?
【Posted】: 2017-11-25 15:06:03
【Question Description】:

I am trying to scrape some data from a website. It is in HTML format. I want to scrape the string "No description for 632930413867".

HTML code:

<div class="col-xs-6 col-sm-6 col-md-6 col-lg-6">
  <table class="table product_info_table">
    <tbody>
      <tr>
        <td>GS1 Address</td>
        <td>R.R. 1, Box 2, Malmo, NE 68040</td>
      </tr>
      <tr>
        <td>Description</td>
        <td>
          <div id="read_desc">
            No description for 632930413867
          </div>
        </td>
      </tr>
    </tbody>
  </table>
</div>

And the image src from this HTML:

  <div class="centered_image header_image">
<img src="https://images-na.ssl-images-amazon.com/images/I/416EuOE5kIL._SL160_.jpg" title="UPC 632930413867" alt="UPC 632930413867">

So I am using this code:

import requests
import time
from bs4 import BeautifulSoup as soup

Baseurl = "https://www.buycott.com/upc/632930413867"
uClient = ''
while uClient == '':
    try:
        uClient = requests.get(Baseurl)
        print("Relax we are getting the data...")

    except:
        print("Connection refused by the server..")
        print("Let me sleep for 7 seconds")
        time.sleep(7)
        print("Was a nice sleep, now let me continue...")
        continue


page_html = uClient.content

uClient.close()
page_soup = soup(page_html, "html.parser")

Productcontainer = page_soup.find_all("div", {"class": "row"})
link = page_soup.find(itemprop="image")

print(Productcontainer)

for item in Productcontainer:
    print(link)
    productdescription = Productcontainer.find("div", {"class": "product_info_table"})
    print(productdescription)

When I run this code, no data is displayed. How do I get the description and the img src?

【Question Discussion】:

    Tags: python html beautifulsoup


    【Solution 1】:

    There is only one instance of each (the image and the product description) on the page, so you can access them directly with find(); there is no need for find_all() in this case:

    import requests
    import time
    from bs4 import BeautifulSoup as soup
    
    Baseurl = "https://www.buycott.com/upc/632930413867"
    uClient = ''
    while uClient == '':
        try:
            uClient = requests.get(Baseurl)
            print("Relax we are getting the data...")
    
        except:
            print("Connection refused by the server..")
            print("Let me sleep for 7 seconds")
            time.sleep(7)
            print("Was a nice sleep, now let me continue...")
            continue
    
    page_html = uClient.content
    uClient.close()
    
    page_soup = soup(page_html, "html.parser")
    productdescription = page_soup.find("div", {"id": "read_desc"}).text
    link = page_soup.find("div", {"class": "centered_image header_image"}).find("img")['src']
    print (productdescription)
    print (link)
    

    Output:

    Relax we are getting the data...
    
    No description for 632930413867
    
    https://images-na.ssl-images-amazon.com/images/I/416EuOE5kIL._SL160_.jpg
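A side note (not part of the original answer): find() returns None when no tag matches, so chaining .text or ['src'] straight onto it raises an AttributeError on pages that lack those elements. A minimal offline sketch with guards, reusing the HTML fragment from the question so it runs without a network call:

```python
from bs4 import BeautifulSoup

# Offline sketch: the HTML fragment from the question is embedded here
# so the example runs without hitting the live site.
html = """
<div class="centered_image header_image">
  <img src="https://images-na.ssl-images-amazon.com/images/I/416EuOE5kIL._SL160_.jpg"
       title="UPC 632930413867" alt="UPC 632930413867">
</div>
<div id="read_desc">No description for 632930413867</div>
"""

page_soup = BeautifulSoup(html, "html.parser")

# find() returns None when no tag matches, so guard before touching
# .text or ['src']; a missing element would otherwise raise AttributeError.
desc_div = page_soup.find("div", {"id": "read_desc"})
description = desc_div.text.strip() if desc_div else None

container = page_soup.find("div", {"class": "centered_image"})
img = container.find("img") if container else None
link = img["src"] if img else None

print(description)  # No description for 632930413867
print(link)
```

Note that bs4 treats class as a multi-valued attribute, so matching on "centered_image" alone finds the div even though its full class is "centered_image header_image".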
    

    【Discussion】:

      【Solution 2】:

      You just need to inspect the HTML and identify the tags that contain the data you want to scrape.
      In this case, the image is div.centered_image.header_image img and the description is div#read_desc.
      For example, using bs4 CSS selectors:

      import requests
      from bs4 import BeautifulSoup 
      
      baseurl = "https://www.buycott.com/upc/632930413867"
      page_html = requests.get(baseurl).content
      soup = BeautifulSoup(page_html, "html.parser")
      image = soup.select_one('div.centered_image.header_image img')['src']
      description = soup.select_one('div#read_desc').text.strip()
      
      print(image)
      print(description)
      

      https://images-na.ssl-images-amazon.com/images/I/416EuOE5kIL._SL160_.jpg
      No description for 632930413867

      【Discussion】:

        【Solution 3】:

        You can also do it this way:

        import requests
        from bs4 import BeautifulSoup
        
        soup = BeautifulSoup(requests.get("https://www.buycott.com/upc/632930413867").text, "lxml")
        desc = soup.select("#read_desc")[0].text.strip()
        link = soup.select(".centered_image img")[0]['src'].strip()
        print("{}\n{}".format(desc,link))
        

        Output:

        No description for 632930413867
        https://images-na.ssl-images-amazon.com/images/I/416EuOE5kIL._SL160_.jpg
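One caveat with this answer's snippet (an added note, not from the original answer): BeautifulSoup(..., "lxml") requires the third-party lxml package and raises bs4.FeatureNotFound when it is missing. A hypothetical helper that falls back to the stdlib-backed html.parser:

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup):
    """Hypothetical helper: prefer the faster lxml parser, but fall back
    to the bundled html.parser when lxml is not installed."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        return BeautifulSoup(markup, "html.parser")

# Either parser handles the selectors used in this answer.
soup = make_soup('<div id="read_desc">No description for 632930413867</div>')
print(soup.select("#read_desc")[0].text)  # No description for 632930413867
```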
        

        【Discussion】:
