【Question Title】: web scraping, get whole page using mechanize
【Posted】: 2016-03-17 00:01:47
【Question】:

My goal is to pull every item from the page, but I only get the first 10 of 25. I think it has something to do with the table — it's some kind of widget, maybe? I'm a beginner and still learning the basics.

import mechanize,time
from bs4 import BeautifulSoup

br = mechanize.Browser()  
br.set_handle_robots(False)  
br.addheaders = [("User-agent", "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.13) Gecko/20101206 Ubuntu/10.10 (maverick) Firefox/3.6.13")]  

sign_in = br.open('https://sellercentral.amazon.com/gp/homepage.html?')  

br.select_form(name="signinWidget")  
br["username"] = 'spam' 
br["password"] = 'eggs'
logged_in = br.submit() 

orders_html = br.open("https://sellercentral.amazon.com/hz/inventory/ref=ag_invmgr_dnav_xx_?tbla_myitable=sort:{%22sortOrder%22%3A%22DESCENDING%22%2C%22sortedColumnId%22%3A%22date%22};search:;pagination:1;")

print('Login complete...')
time.sleep(5)

soup = BeautifulSoup(orders_html,'html.parser')
partNums = soup.find_all('span', {'class': 'mt-text-content mt-table-main'})

print(partNums)

for part in partNums:
    print(part.text)

print('Process Complete.')

【Comments】:

    Tags: python-3.x web-scraping beautifulsoup mechanize


    【Solution 1】:

    You can get the page's HTML via br.response().read():

    orders_page = br.open("https://sellercentral.amazon.com/hz/inventory/ref=ag_invmgr_dnav_xx_?tbla_myitable=sort:{%22sortOrder%22%3A%22DESCENDING%22%2C%22sortedColumnId%22%3A%22date%22};search:;pagination:1;")  # loads page
    orders_html = br.response().read()  # saves page source
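    Since BeautifulSoup accepts either the raw HTML string or a file-like object, the string returned by br.response().read() can be fed straight into the parser. A minimal, self-contained sketch (with sample HTML standing in for the real Seller Central response, and hypothetical part numbers) showing that the question's find_all call works on that string:

    ```python
    from bs4 import BeautifulSoup

    # Sample markup mimicking the span class used in the question;
    # the part numbers here are made up for illustration.
    html = """
    <table>
      <tr><td><span class="mt-text-content mt-table-main">PART-001</span></td></tr>
      <tr><td><span class="mt-text-content mt-table-main">PART-002</span></td></tr>
    </table>
    """

    soup = BeautifulSoup(html, 'html.parser')
    part_nums = [span.text for span in
                 soup.find_all('span', {'class': 'mt-text-content mt-table-main'})]
    print(part_nums)  # ['PART-001', 'PART-002']
    ```

    Note that passing a class string containing a space matches the exact value of the class attribute, which is what the question's markup uses.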
    

    【Discussion】:

    • This returned the same result. I pulled the whole page and searched for the last part number in the table, but it isn't there — it's as if I'm only capturing what is visible when the page first loads, not the full inventory list.
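    • That behavior suggests the rest of the table is filled in by JavaScript after the initial load, which mechanize never executes. The URL in the question already carries a `pagination:1;` token, though, so one hypothetical workaround is to request each page in turn by varying that number. This is only a sketch, assuming the server honors the pagination token for non-JS clients; the page count of 3 is a guess based on 25 items at 10 per page:

    ```python
    def inventory_url(page):
        """Build the Seller Central inventory URL for a 1-based page number.

        The sort/search query string is copied verbatim from the question;
        only the pagination number is substituted.
        """
        sort = ("tbla_myitable=sort:{%22sortOrder%22%3A%22DESCENDING%22%2C"
                "%22sortedColumnId%22%3A%22date%22}")
        return ("https://sellercentral.amazon.com/hz/inventory/"
                "ref=ag_invmgr_dnav_xx_?" + sort
                + ";search:;pagination:%d;" % page)

    # With a logged-in mechanize browser (br) this loop would collect all pages
    # (commented out because it needs a live Seller Central session):
    # all_parts = []
    # for page in range(1, 4):  # guessed: 25 items / 10 per page -> 3 pages
    #     br.open(inventory_url(page))
    #     soup = BeautifulSoup(br.response().read(), 'html.parser')
    #     all_parts += [s.text for s in soup.find_all(
    #         'span', {'class': 'mt-text-content mt-table-main'})]
    ```

    If the server ignores the token without JavaScript, a browser-driving tool such as Selenium would be the fallback.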