【Question Title】: Web Crawling/Web Scraping
【Posted】: 2020-06-18 18:28:34
【Question Description】:

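I am trying to learn web crawling/web scraping and need some help. I am currently scraping the following website: http://books.toscrape.com/. However, I am having trouble scraping the price, rating, and cover URL from that site. Can anyone help? The code I tried is listed below.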

for i in data.xpath("//article[@class='product_pod']"):
    title = i.xpath("h3/a/@title")
    price = i.xpath("//p[@class='price_color']/text()")
    rating = i.xpath("//p[@class='star-rating']/@class")
    coverurl = i.xpath("a/img/@src")
    moreinfo = i.xpath("h3/a/@href")
    print(title, price, rating, coverurl, moreinfo)
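
Judging from the XPath expressions and output elsewhere in this thread, three things probably go wrong in the loop above: an XPath that starts with `//` searches the whole document even when it is called on `i`; the `class` attribute holds two words (for example `star-rating Three`), so an exact match on `star-rating` finds nothing; and the cover `img` sits inside a `div`, not directly under the `article`. Below is a minimal sketch of a per-book loop that uses relative paths instead. `SAMPLE_HTML` and `parse_books` are hypothetical names, and the inline markup is a cut-down assumption about the page structure, not a copy of the real page.

```python
from lxml import html

# Cut-down, assumed markup with two books (not the real page source).
SAMPLE_HTML = """
<html><body>
<article class="product_pod">
  <div class="image_container">
    <a href="catalogue/book-a/index.html"><img src="media/cache/a.jpg"></a>
  </div>
  <p class="star-rating Three"></p>
  <h3><a href="catalogue/book-a/index.html" title="Book A">Book A</a></h3>
  <div class="product_price"><p class="price_color">£51.77</p></div>
</article>
<article class="product_pod">
  <div class="image_container">
    <a href="catalogue/book-b/index.html"><img src="media/cache/b.jpg"></a>
  </div>
  <p class="star-rating One"></p>
  <h3><a href="catalogue/book-b/index.html" title="Book B">Book B</a></h3>
  <div class="product_price"><p class="price_color">£53.74</p></div>
</article>
</body></html>
"""


def parse_books(page_source):
    """Return one dict per <article class="product_pod"> element."""
    tree = html.fromstring(page_source)
    books = []
    for article in tree.xpath("//article[@class='product_pod']"):
        # Paths without a leading "//" (or with ".//") stay inside this article.
        books.append({
            "title": article.xpath("string(h3/a/@title)"),
            "price": article.xpath("string(.//p[@class='price_color'])"),
            # contains() matches the two-word class "star-rating Three".
            "rating": article.xpath(
                "string(.//p[contains(@class, 'star-rating')]/@class)"
            ),
            "cover_url": article.xpath("string(.//img/@src)"),
            "more_info": article.xpath("string(h3/a/@href)"),
        })
    return books


books = parse_books(SAMPLE_HTML)
for book in books:
    print(book)
```

Wrapping each expression in `string(...)` returns a single string per field instead of a one-element list. To run this against the live site, the page source could be fetched with `requests.get(...)` and its `.content` passed to `parse_books`.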

【Discussion】:

    Tags: python web-scraping web-crawler


    【Solution 1】:

    Try the code below:

    from lxml import html
    import requests

    # Download the listing page and parse it into an element tree.
    page = requests.get('http://books.toscrape.com/')
    tree = html.fromstring(page.content)

    # Each XPath returns one value per book, in page order.
    product_name = tree.xpath('//article[@class="product_pod"]/h3/a/text()')
    # Select the price paragraph directly. De-duplicating a mixed result list
    # would drop a book whenever two books share the same price.
    product_price = tree.xpath('//div[@class="product_price"]/p[@class="price_color"]/text()')
    cover_image = tree.xpath('//div[@class="image_container"]/a/img/@src')
    rating = tree.xpath('//article[@class="product_pod"]/p/@class')

    # Combine the parallel lists into one tuple per book.
    final = zip(product_name, product_price, cover_image, rating)
    for i in final:
        print(i)
    
    Output:
    
    ('A Light in the ...', '£51.77', 'media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg', 'star-rating Three')
    ('Tipping the Velvet', '£53.74', 'media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg', 'star-rating One')
    ('Soumission', '£50.10', 'media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg', 'star-rating One')
    ('Sharp Objects', '£47.82', 'media/cache/32/51/3251cf3a3412f53f339e42cac2134093.jpg', 'star-rating Four')
    ('Sapiens: A Brief History ...', '£54.23', 'media/cache/be/a5/bea5697f2534a2f86a3ef27b5a8c12a6.jpg', 'star-rating Five')
    ('The Requiem Red', '£22.65', 'media/cache/68/33/68339b4c9bc034267e1da611ab3b34f8.jpg', 'star-rating One')
    ('The Dirty Little Secrets ...', '£33.34', 'media/cache/92/27/92274a95b7c251fea59a2b8a78275ab4.jpg', 'star-rating Four')
    ('The Coming Woman: A ...', '£17.93', 'media/cache/3d/54/3d54940e57e662c4dd1f3ff00c78cc64.jpg', 'star-rating Three')
    ('The Boys in the ...', '£22.60', 'media/cache/66/88/66883b91f6804b2323c8369331cb7dd1.jpg', 'star-rating Four')
    ('The Black Maria', '£52.15', 'media/cache/58/46/5846057e28022268153beff6d352b06c.jpg', 'star-rating One')
    ('Starving Hearts (Triangular Trade ...', '£13.99', 'media/cache/be/f4/bef44da28c98f905a3ebec0b87be8530.jpg', 'star-rating Two')
    ("Shakespeare's Sonnets", '£20.66', 'media/cache/10/48/1048f63d3b5061cd2f424d20b3f9b666.jpg', 'star-rating Four')
    ('Set Me Free', '£17.46', 'media/cache/5b/88/5b88c52633f53cacf162c15f4f823153.jpg', 'star-rating Five')
    ("Scott Pilgrim's Precious Little ...", '£52.29', 'media/cache/94/b1/94b1b8b244bce9677c2f29ccc890d4d2.jpg', 'star-rating Five')
    ('Rip it Up and ...', '£35.02', 'media/cache/81/c4/81c4a973364e17d01f217e1188253d5e.jpg', 'star-rating Five')
    ('Our Band Could Be ...', '£57.25', 'media/cache/54/60/54607fe8945897cdcced0044103b10b6.jpg', 'star-rating Three')
    ('Olio', '£23.88', 'media/cache/55/33/553310a7162dfbc2c6d19a84da0df9e1.jpg', 'star-rating One')
    ('Mesaerion: The Best Science ...', '£37.59', 'media/cache/09/a3/09a3aef48557576e1a85ba7efea8ecb7.jpg', 'star-rating One')
    ('Libertarianism for Beginners', '£51.33', 'media/cache/0b/bc/0bbcd0a6f4bcd81ccb1049a52736406e.jpg', 'star-rating Two')
    ("It's Only the Himalayas", '£45.17', 'media/cache/27/a5/27a53d0bb95bdd88288eaf66c9230d7e.jpg', 'star-rating Two')
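
The rows above still carry a relative cover path and a raw class string. As a rough sketch of one way to post-process a row, the example below joins the cover path onto the site root with `urllib.parse.urljoin` and maps the rating word to a number. `BASE_URL`, `RATING_WORDS`, and `normalize_row` are hypothetical names, and the code assumes the cover paths are relative to the site's front page.

```python
from urllib.parse import urljoin

# Assumption: cover paths are relative to the page that was scraped.
BASE_URL = "http://books.toscrape.com/"
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def normalize_row(row):
    """Turn (title, price, cover, rating_class) into a cleaned-up tuple."""
    title, price, cover, rating_class = row
    # "star-rating Three" -> "Three"; an empty class string yields None.
    words = rating_class.split()
    stars = RATING_WORDS.get(words[-1]) if words else None
    return (title, price, urljoin(BASE_URL, cover), stars)


row = (
    "Soumission",
    "£50.10",
    "media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg",
    "star-rating One",
)
cleaned = normalize_row(row)
print(cleaned)
```

An unrecognized rating word also yields `None` rather than raising, which keeps a single odd row from stopping the whole run.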
    
    # New addition: one nested list per book, with each field wrapped in its own list
    list(map(list, zip([[el] for el in product_name], [[el] for el in product_price], [[el] for el in cover_image], [[el] for el in rating])))
    
    Output:
    [[['A Light in the ...'],
      ['£51.77'],
      ['media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg'],
      ['star-rating Three']],
     [['Tipping the Velvet'],
      ['£53.74'],
      ['media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg'],
      ['star-rating One']],
     [['Soumission'],
      ['£50.10'],
      ['media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg'],
      ['star-rating One']],...]
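
If a single flat list per book is preferred over the nested one-element lists shown above, a plain `zip` with a list comprehension may be enough. The sketch below uses two sample rows copied from the output in this answer; `books` is a hypothetical variable name.

```python
# Sample values taken from the first two rows of the output shown earlier.
product_name = ["A Light in the ...", "Tipping the Velvet"]
product_price = ["£51.77", "£53.74"]
cover_image = [
    "media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg",
    "media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg",
]
rating = ["star-rating Three", "star-rating One"]

# zip pairs up the i-th item of every list; list(...) turns each tuple
# into one flat list that holds everything about a single book.
books = [list(row) for row in zip(product_name, product_price, cover_image, rating)]

for book in books:
    print(book)
```

Note that `zip` stops at the shortest input, so a missing value in any one list would silently shift or drop rows; extracting the fields per `article` element avoids that risk.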
    
    

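    【Comments】:

    • It worked, but I am trying to get one bracket per book that holds all of that book's information. For example, I want one bracket that contains only the book A Light in the Attic.
    • Sorry, do you know how to put each book's information in single brackets? For example, [Soumission], [$50.10], [Star-One Rating], and so on. Thank you very much for your help.
    • Please mark the answer as accepted, since it will serve as a reference for others. Thanks.
    • I think I just marked it as correct. I am not sure, though. I am new to this site.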