【Question title】: How to scrape the product information from the page
【Posted】: 2021-10-09 10:04:35
【Question description】:

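I am trying to scrape the technical details table from the product information, but my code gives me an empty list. The page I am trying to scrape the table from is https://www.amazon.com/Hammermill-Letter-Bright-Sheets-113640C/dp/B072FVQNWM/ref=sr_1_6?dchild=1&qid=1633771276&s=office-products&sr=1-6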

import requests
from bs4 import BeautifulSoup
import pandas as pd
from urllib.parse import urljoin
base_url='https://www.amazon.com'
productlinks=[]
results = [] 
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36','session':'141-2320098-4829807'}
cookies= {'session': '17ab96bd8ffbe8ca58a78657a918558'}
cookies=cookies
r = requests.get('https://www.amazon.com/s?rh=n%3A1069242&fs=true&ref=lp_1069242_sar', headers = headers)
soup = BeautifulSoup(r.content, 'lxml')
for link in soup.find_all('a',class_="a-link-normal s-underline-text s-underline-link-text a-text-normal",href=True):
    p=link['href']
    l=urljoin(base_url,p)
    productlinks.append(l)
    
results = []    
for link in productlinks:
        r =requests.get(link,headers=headers)
        soup=BeautifulSoup(r.content, 'html.parser')
        try:
            for tr in soup.find('table', id='productDetails_techSpec_section_1').find_all('tr') :
                print(tr.text.strip())
                results.append(tr.text.strip())
        except:
            continue
print(results)

【Comments】:

  • Works fine for me!!
  • Please share your output
  • This line: cookies=cookies is unnecessary.

Tags: python web-scraping beautifulsoup


【Solution 1】:

This is the output I get:

['ManufacturerAmazon Basics', 'BrandAmazon Basics', 'Item Weight41.6 pounds', 'Product Dimensions18 x 11.8 x 9 inches', 'Item model numberAMZN8RM', 'ColorWhite', 
'Material TypePaper', 'Number of Items8', 'Size8 Reams | 4000 Sheets', 'Sheet Size8.5-x-11-inch', 'Brightness Rating92', 'Paper Weight20', 'Paper FinishSmooth', 'Manufacturer Part NumberAMZN8RM', 'ManufacturerZebra Pen Corporation', 'BrandZebra Pen', 'Item Weight0.336 ounces', 'Product Dimensions1.1 x 6.5 x 7.5 inches', 'Item model number22218', 'Is Discontinued By ManufacturerNo', 'ColorBlack', 'ClosureRetractable', 'Grip TypeRubber', 'Material TypePlastic, Metal, Rubber', 'Number of Items18', 'Size18-Pack', 'Point TypeMedium', 'Line Size1.00 Pen', 'Ink ColorBlack', 'Manufacturer Part Number22218', 'Manufacturer3M Office Products', 'BrandScotch', 'Item Weight3.68 ounces', 'Product Dimensions7.8 x 7.1 x 3 inches', 'Item model number142-6', 'Is Discontinued By ManufacturerNo', 'ColorClear', 'Material TypeSynthetic Rubber Resin', 'Number of Items1', 'Size6 Count', 'Manufacturer Part Number142-6', 'National Stock Number6520-01-356-3964, 5970-01-137-7860, 7530-00-598-7711', 'ManufacturerInternational Paper (Office)', 'BrandHammermill', 'Item Weight40 pounds', 'Product Dimensions17.25 x 11.75 x 8.25 inches', 'Item model number113640C', 'Is Discontinued By ManufacturerNo', 'Color8 Ream | 4000 Sheets', 'Cover MaterialPaper', 'Material TypePaper', 'Number of Items8', 'Size8 Ream | 4000 Sheets', 'Sheet Size8.5 x 11', 'Brightness Rating92', 'Paper Weight20', 'Paper FinishSmooth', 'Manufacturer Part Number113640C', 'ManufacturerNewell Rubbermaid Office', 'BrandEXPO', 'Item Weight2.4 ounces', 'Product Dimensions5.5 x 6.25 x 4.02 inches', 'Item model number1884309', 'Is Discontinued By ManufacturerNo', 'ColorAssorted', 'Grip TypeThumb', 'Material TypePlastic', 'Number of Items1', 'Size8-Count', 'Point TypeUltra Fine', 'Line Size0.5mm millimeters', 'Ink ColorMulticolor', 'Tip TypeFine point', 'Manufacturer Part Number1884309', 'Manufacturer3M Office Products', 'BrandScotch', 'Item Weight3.06 pounds', 'Product Dimensions0.75 
x 8.9 x 11.4 inches', 'Item model numberTP3854-100', 'Is Discontinued By ManufacturerNo', 'ColorClear', 'Material TypeLaminate', 'Number of Items1', 'PackagingRetail', 'Size100-Pack', 'Paper FinishGlossy', 'Manufacturer Part NumberTP3854-100', 'ManufacturerScotch', 'BrandScotch', 'Item Weight10.6 ounces', 'Product Dimensions4.2 x 6.4 x 3.05 inches', 'Item model number6122', 'Is Discontinued By ManufacturerNo', 'ColorTransparent', 'Material TypePlastic', 'Number of Items1', 'Size6 Rolls', 'Manufacturer Part Number6122', 'Manufacturer\tGorilla Glue', 'Part Number\t7700104', 'Item Weight1.5 ounces', 'Product Dimensions1.25 x 3.38 x 6.63 inches', 'Item model number7700104', 'Is Discontinued By ManufacturerNo', 'Size1 Pack', 'ColorClear', 'Style1 - Pack', 'PatternSuper Glue', 'Item Package Quantity1', 'Included Components1 bottle glue', 'Batteries Included?No', 'Batteries Required?No', 'Warranty DescriptionNo', 'Manufacturer0', 'BrandSHARPIE', 'Item Weight3.2 ounces', 'Product Dimensions1 x 1 x 1 inches', 'Item model number30001', 'Is Discontinued By ManufacturerNo', 'ColorBlack (Box)', 'Material TypeAluminum', 'Number of Items1', 'Size12-Count', 'Point TypeFine', 'Line Size0.3mm', 'Ink ColorBlack', 'Tip TypeFine', 'Manufacturer Part NumberSAN30001', 'National Stock Number7520-00-904-1265', 'Manufacturer0', 'BrandSHARPIE', 'Item Weight3.2 ounces', 'Product Dimensions1 x 1 x 1 inches', 'Item model number30001', 'Is Discontinued By ManufacturerNo', 'ColorBlack (Box)', 'Material TypeAluminum', 'Number of Items1', 'Size12-Count', 'Point TypeFine', 'Line Size0.3mm', 'Ink ColorBlack', 'Tip TypeFine', 'Manufacturer Part NumberSAN30001', 'National Stock Number7520-00-904-1265', 'ManufacturerAimoh', 'BrandAimoh', 'Item Weight1.4 pounds', 'Product Dimensions9.7 x 4.3 x 2.2 inches', 'Item model number34100', 'Is Discontinued By ManufacturerNo', 'ColorWhite', 'ClosureSelf-Seal', 'Material TypePaper', 'Size100 Ct.', 'Sheet Size4.125-x-9.5-inch', 'Paper Weight24', 'Paper FinishWove', 
'Manufacturer Part Number34100', 'ManufacturerHP Papers', 'BrandHP Papers', 'Item Weight15 pounds', 'Product Dimensions11 x 8.5 x 6.25 inches', 'Item model number112090', 'Is Discontinued By ManufacturerNo', 'Material TypePaper', 'Number of Items1', 'Size3 Ream | 1500 Sheets', 'Brightness Rating92', 'Paper Weight20', 'Paper FinishSmooth', 'Manufacturer Part Number112090', 'ManufactureriBayam', 'BrandIBayam', 'Item Weight3.84 ounces', 'Product Dimensions6.6 x 6.2 x 0.6 inches', 'Item model number18 Pack', 'Is Discontinued By ManufacturerNo', 'ColorBlack, Grey, Red, Blue, Magenta, Pink, 
Purple, Violet, Pale Yellow, Yellow, Orange, Raw Sienna, Sap Green, C Green, O Green, Lake Blue, Burnt Sienna, Crimson', 'ClosurePush Button', 'Grip TypeContoured', 'Material TypePlastic', 'Number of Items18', 'Size18 Unique Colors', 'Point TypeFine', 'Manufacturer Part Number61', 'ManufacturerAmazon Basics', 'BrandAmazon 
Basics', 'Item Weight6.7 ounces', 'Product Dimensions7.4 x 0.3 x 0.3 inches', 'Item model numberPHB-30', 'ColorYellow', 'Pencil Lead Degree (Hardness)HB', 'Material TypeWood', 'Number of Items30', 'Size30 Count (Pack of 1)', 'Point TypeMedium', 'Manufacturer Part NumberPHB-30', 'ManufacturerInternational Paper (Office)', 'BrandHammermill', 'Item Weight15 pounds', 'Product Dimensions11.25 x 8.75 x 6.25 inches', 'Item model number113620', 'Is Discontinued By ManufacturerNo', 'Material TypePaper', 'Number of Items3', 'Size3 Ream | 1500 Sheets', 'Sheet Size8.5 x 11', 'Brightness Rating92', 'Paper Weight20', 'Paper FinishSmooth', 'Manufacturer Part Number113620', 'Manufacturer\tiBayam', 'Part Number\t5234', 'Item Weight1.44 ounces', 'Product Dimensions4 x 3 x 0.6 inches', 'Item model number2 Pack', 'ColorPink & Black', 'MaterialFiberglass', 'Item Package Quantity1', 'Plug ProfileSewing', 'Batteries Included?No', 'Batteries Required?No', 'ManufacturerHewlett Packard SOHO Consumables', 'BrandHP Papers', 'Item Weight6 pounds', 'Product Dimensions11 x 8.5 x 12 inches', 'Item model number203000', 'Is Discontinued By ManufacturerNo', 'ColorWhite', 'Number of Items1', 'Size1 Ream | 500 Sheets', 'Sheet Size8.5 x 11 inch', 'Brightness Rating97', 'Paper Weight24', 'Paper FinishMatte', 'Manufacturer Part Number203000']

I simply appended all the data to the results list and printed it, and put the for loop that reads all the trs inside a try & except, because for some of the links in productlinks there is no tr.

[...]
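results = []
for link in productlinks:
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    try:
        # soup.find() returns None when the table is missing, so .find_all() raises AttributeError
        for tr in soup.find('table', id='productDetails_techSpec_section_1').find_all('tr'):
            # Remove the newline / left-to-right-mark separator between the label and the value
            res = "".join(tr.text.strip().split("\n\n\n\u200e"))
            print(res)
            results.append(res)
    except AttributeError:
        # No technical details table on this page; skip it
        continue

print(results)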
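
The joined strings above run the label and the value together ('ManufacturerAmazon Basics'), and the '\u200e' character is the one named in the UnicodeEncodeError reported in the comments. As a minimal sketch, the row text could instead be split into a (label, value) pair with the mark removed. The helper name split_row is hypothetical, and the sketch assumes that the label and the value sit on separate lines of the row text, as the "\n\n\n\u200e" separator in the answer suggests.

```python
# U+200E LEFT-TO-RIGHT MARK, the character named in the reported error message
LRM = "\u200e"


def split_row(row_text):
    """Split the text of one table row into a (label, value) tuple.

    Hypothetical helper: assumes the label comes first and the value
    follows on a later line, possibly prefixed with a left-to-right mark.
    """
    # Drop the mark and surrounding whitespace from every line of the row text
    parts = [part.replace(LRM, "").strip() for part in row_text.splitlines()]
    # Discard the empty lines that separated the label from the value
    parts = [part for part in parts if part]
    if not parts:
        return ("", "")
    return (parts[0], " ".join(parts[1:]))


# Sample row text shaped like the separator used in the answer's code
sample = "Manufacturer\n\n\n\u200eAmazon Basics"
label, value = split_row(sample)
print(f"{label}: {value}")
```

With the mark stripped, the value contains only plain ASCII here, so printing it should not depend on the console encoding being able to represent U+200E.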

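【Comments】:

  • Did you make any changes to my code??
  • When I run the code I get this error: UnicodeEncodeError: 'charmap' codec can't encode character '\u200e' in position 18: character maps to <undefined>
  • Just added the code to my answer. @GXMentor
  • It gives me empty brackets
  • That's strange, it works for me; I just showed you the result @GXMentor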
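【Solution 2】:

I have provided a working solution below.

I locate the table element by its id attribute (you can inspect the HTML with the Chrome developer tools).

Once the table is found, we iterate over all of its rows. The data for the first column is in the th tag and the data for the second column is in the td tag. We extract the text and strip any newlines. I then save the contents of the table in a dictionary named results.

Finally, we iterate over the results dictionary to list the requested technical details.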

from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36','session':'141-2320098-4829807'}
# Provided URL
url = 'https://www.amazon.com/dp/B072FVQNWM'
HTMLpage = requests.get(url, headers=headers)

# Parse the page
soup = BeautifulSoup(HTMLpage.content, 'html.parser')
# Find the technical details table
tech_table = soup.find('table', id='productDetails_techSpec_section_1')
# All rows in the table
rows = tech_table.find_all('tr')
results = {"id": [], "val": []}
for r in rows:
    # Access the th tag and take its text (the label)
    key = r.th.text.strip('\n')
    # Access the td tag and take its text (the value)
    val = r.td.text.strip('\n')
    # Save to results
    results['id'].append(key)
    results['val'].append(val)

# Print output
for key, val in zip(results['id'], results['val']):
    print(f'{key}: {val}')
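
The th/td approach described above can also be tried offline against a small, static HTML snippet, which separates parsing problems from network problems. The following is only a sketch: SAMPLE_HTML is a hypothetical, simplified stand-in for the real page (whose markup may differ), and parse_tech_table is an illustrative helper name, not part of the original answer. It returns an empty dict when the table is absent instead of raising, which may help when a response does not contain the expected table.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for a product page; the real markup may differ.
SAMPLE_HTML = """
<table id="productDetails_techSpec_section_1">
  <tr><th> Manufacturer </th><td> \u200eInternational Paper (Office) </td></tr>
  <tr><th> Brand </th><td> \u200eHammermill </td></tr>
</table>
"""


def parse_tech_table(html, table_id="productDetails_techSpec_section_1"):
    """Return {label: value} for the table with the given id, or {} if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        # The response did not contain the table; report nothing instead of raising
        return {}
    details = {}
    for row in table.find_all("tr"):
        # Skip rows that lack either a header cell or a data cell
        if row.th is None or row.td is None:
            continue
        label = row.th.get_text(strip=True)
        # Remove the left-to-right mark (U+200E) that precedes the value
        value = row.td.get_text(strip=True).replace("\u200e", "").strip()
        details[label] = value
    return details


print(parse_tech_table(SAMPLE_HTML))
print(parse_tech_table("<html><body>no table here</body></html>"))
```

A dict keyed by label keeps each label next to its value, at the cost of overwriting duplicates if a table repeated a label; the two parallel lists in the answer's results dictionary do not have that limitation.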

【Comments】:
