【Question Title】: How to Scrape Product Pages using Python grequests and BeautifulSoup
【Posted】: 2021-09-30 15:48:07
【Question】:
import grequests  # import first: gevent's monkey-patching should happen before other network/SSL imports
from bs4 import BeautifulSoup
import pandas as pd
    
# STEP 1: Create List of URLs from main archive page
def get_urls():
    urls = []
    for x in range(1,3):
        urls.append(f'http://books.toscrape.com/catalogue/page-{x}.html')
        print(f'Getting page url: {x}', urls)
    return urls

# STEP 2: Async Load HTML Content from page range in step 1
def get_data(urls):
    reqs = [grequests.get(link) for link in urls]
    print('AsyncRequest object > reqs:', reqs)
    resp = grequests.map(reqs)
    print('Status Code > resp (info on page):', resp, '\n')
    return resp

# Step 3: Extract title, author, date, url, thumb from asynch variable resp containing html elements of all scraped pages.
def parse(resp):
    productlist = []

    for r in resp:
        #print(r.request.url)
        sp = BeautifulSoup(r.text, 'lxml')
        items = sp.find_all('article', {'class': 'product_pod'})
        #print('Items:\n', items)

        for item in items:
            product = {
            'title' : item.find('h3').a['title'],  # the link text is truncated with '...'; the full title is in the title attribute
            'price': item.find('p', {'class': 'price_color'}).text.strip(),
            'single_url': 'https://books.toscrape.com/catalogue/' + item.find(('a')).attrs['href'],
            'thumbnail': 'https://books.toscrape.com/' + item.find('img', {'class': 'thumbnail'}).attrs['src'].replace('../', ''),  # src is relative ("../media/..."), so drop the "../"
            }
            productlist.append(product)
            print('Added: ', product)
            
    return productlist

urls = get_urls() # (Step 1)
resp = get_data(urls) # (Step 2)
df = pd.DataFrame(parse(resp)) # (Step 3)
df.to_csv('books.csv', index=False)

The script above asynchronously scrapes the archive pages of https://books.toscrape.com/ using grequests and BeautifulSoup.

From each archive page it extracts the following book information:

  • Title
  • Price
  • Single product URL
  • Thumbnail URL

Problem

I need a way to additionally extract information from each single product page, such as the UPC, and associate that information back to the main productlist array.

Example of a single product page: https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
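In data terms, "associating back" just means extending each dict in productlist with the per-page fields, keyed by its single_url. A minimal network-free sketch: fetch_details here is a hypothetical stand-in for a second grequests pass plus a BeautifulSoup parse, and the sample values are taken from the example page above.

```python
# One archive-page row, as produced by parse() above (sample data).
productlist = [
    {'title': 'A Light in the Attic', 'price': '£51.77',
     'single_url': 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'},
]

def fetch_details(url):
    # Hypothetical stand-in: the real version would GET `url`
    # (e.g. via a second grequests.map pass) and parse the
    # Product Information table with BeautifulSoup.
    return {'UPC': 'a897fe39b1053632', 'reviews': '0'}

# Merge the per-page fields into each archive row in place.
for product in productlist:
    product.update(fetch_details(product['single_url']))

print(productlist[0]['UPC'])  # a897fe39b1053632
```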

【Comments】:

    Tags: python web-scraping beautifulsoup grequests


    【Solution 1】:

    The single-page information you need (UPC, Product Type, reviews, etc.):

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.3"
    }
    r = requests.get("https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", headers=headers)
    soup = BeautifulSoup(r.content, "lxml")
    table = soup.find("article", class_="product_page")
    
    header = [th.get_text(strip=True) for th in table.tr.select("th")][1:]
    header.insert(0, 'S.No')
    
    all_data = []
    for row in table.select("tr:has(td)"):
        tds = [td.get_text(strip=True) for td in row.select("td")]
        all_data.append(tds)
    
    df = pd.DataFrame(all_data, columns=header)
    print(df)
    
    output:
                          S.No
    0         a897fe39b1053632
    1                    Books
    2                   £51.77
    3                   £51.77
    4                    £0.00
    5  In stock (22 available)
    6                        0
    

    【Discussion】:

    • Thanks for your help. Unfortunately, this doesn't solve my problem. Your script just scrapes a page (and slowly). My script already does the same thing, only faster, using async scraping. My problem is that I don't know how to scrape the single book pages and return those results into the main productlist array.
    • What information do you need from the single book page?
    • The UPC and the number of reviews, thanks
    • OK, I'll update the code
    • I have updated the code, please check it
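    If both stages already produce lists of dicts, pandas can also do the association as a join on single_url instead of mutating the dicts in place; a sketch with sample rows (the detail values are the ones shown in the answer's output):

```python
import pandas as pd

# Archive-page rows from the question's step 3 (sample data).
archive = pd.DataFrame([
    {'title': 'A Light in the Attic', 'price': '£51.77',
     'single_url': 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'},
])

# Per-page details keyed by the same URL (sample data).
details = pd.DataFrame([
    {'single_url': 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
     'UPC': 'a897fe39b1053632', 'reviews': '0'},
])

# Left join keeps every archive row even if a detail fetch failed.
df = archive.merge(details, on='single_url', how='left')
print(df.loc[0, 'UPC'])  # a897fe39b1053632
```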