Answers to: Web scraping using Python, beginner level

【Question title】: Web scraping using python-beginner level
【Posted】: 2020-11-10 12:05:36
【Question】:

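Hi, I am new to Python and am practicing web scraping on some demo sites. I am trying to scrape http://books.toscrape.com/ and want to extract:

href
name/title
star rating/star-rating
price/price_color
in-stock availability/instock availability

I have written basic code that gets down to the level of each book.

But after that I have no idea how to extract this information.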

import requests
from csv import reader,writer
from bs4 import BeautifulSoup


base_url= "http://books.toscrape.com/"

r = requests.get(base_url)

htmlContent = r.content

soup = BeautifulSoup(htmlContent,'html.parser')

for article in soup.find_all('article'):
    pass  # stuck here: how to extract the fields from each article?

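【Comments】:

You have to look at the page source of the website (right click -> view page source, or similarly named) and find the tags that hold the information you want. Then you can use soup.find_all() with the relevant tags to extract it.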
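
As a minimal sketch of that advice, the snippet below parses a small HTML fragment modeled on one book entry of the demo site and pulls out the five fields by tag and class name. The fragment itself, the helper name `parse_book`, and the class names (`product_pod`, `star-rating`, `price_color`, `instock availability`) are assumptions about the page markup; check them against the page source before relying on them.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on one book entry; the live markup may differ.
SAMPLE_HTML = """
<article class="product_pod">
  <div class="image_container">
    <a href="catalogue/a-light-in-the-attic_1000/index.html"><img src="cover.jpg" alt="A Light in the Attic"></a>
  </div>
  <p class="star-rating Three"></p>
  <h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
  <div class="product_price">
    <p class="price_color">£51.77</p>
    <p class="instock availability">
      In stock
    </p>
  </div>
</article>
"""


def parse_book(article):
    """Return the fields of one <article class="product_pod"> as a dict."""
    link = article.find("h3").find("a")
    rating_tag = article.find("p", class_="star-rating")
    # The rating is stored as a second class name, e.g. "star-rating Three".
    rating = [c for c in rating_tag["class"] if c != "star-rating"][0]
    return {
        "href": link["href"],
        "title": link["title"],  # full title; the link text is truncated
        "rating": rating,
        "price": article.find("p", class_="price_color").get_text(strip=True),
        "availability": article.find("p", class_="availability").get_text(strip=True),
    }


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
books = [parse_book(a) for a in soup.find_all("article", class_="product_pod")]
print(books[0])
```

Looking elements up by class name rather than by position keeps the code working if whitespace or sibling order in the markup changes.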

Tags: python web-scraping

【Solution 1】:

This will find the href and name of every book for you. You can also extract other information if needed.

import requests
from csv import reader,writer
from bs4 import BeautifulSoup

base_url= "http://books.toscrape.com/"

r = requests.get(base_url)
soup = BeautifulSoup(r.content,'html.parser')

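def extract_info(soup):
    href = []
    name = []
    # Only the title link of each book carries a title attribute, so this
    # skips the sidebar category links and the pagination link, and keeps
    # the two lists aligned.
    for a in soup.find_all('a', title=True):
        href.append(a['href'])
        name.append(a['title'])  # full title; a.text is truncated with "..."

    return href, name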

href, name = extract_info(soup)

print(href[0], name[0])

The output will be the href and name of the first book.
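
The hrefs collected this way are relative to the page they were found on. Here is a small sketch of turning one into an absolute URL with the standard library's `urljoin`; the sample href is illustrative, not taken from a live run.

```python
from urllib.parse import urljoin

base_url = "http://books.toscrape.com/"
# Illustrative relative link of the kind found on the listing page.
relative_href = "catalogue/a-light-in-the-attic_1000/index.html"

# Resolve the relative link against the URL of the page it came from.
absolute_url = urljoin(base_url, relative_href)
print(absolute_url)
```

The first argument should be the URL of the page that was actually fetched, since relative links are resolved against it.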

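【Comments】:

【Solution 2】:

Try the approach below, using Python with requests and BeautifulSoup. I took the page URL from the site itself after inspecting the Network section > Doc tab in Google Chrome.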

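What exactly the script below does:

First, it builds the page URL from the page number parameter and performs a GET request.
The URL is dynamic and is rebuilt on every iteration. You will notice that the PAGE_NO parameter is incremented after each iteration.
After fetching the data, the script parses the HTML using html.parser.
Finally, it iterates over the list of books fetched in each iteration (page) and prints, for each one, the title, hyperlink, price, in-stock availability and rating.

The site has 50 pages and 1,000 results; the script below extracts all book details one page per iteration.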

    import requests
    from urllib3.exceptions import InsecureRequestWarning
    from bs4 import BeautifulSoup as bs

    requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

    def scrap_books_data():
        PAGE_NO = 1  # page number, incremented after every iteration

        while True:
            print('Creating URL to scrape books data for page', PAGE_NO)

            # dynamic URL, rebuilt on every iteration
            URL = 'http://books.toscrape.com/catalogue/page-' + str(PAGE_NO) + '.html'
            response = requests.get(URL, verify=False)  # GET request to fetch data from the site
            soup = bs(response.text, 'html.parser')  # parse the HTML with 'html.parser'

            # find all article tags in which the book details are nested
            extracted_books_data = soup.find_all('article', class_='product_pod')

            if len(extracted_books_data) == 0:  # stop when there is no more data to process
                break

            for book in extracted_books_data:  # iterate over the books on this page
                print('-' * 100)
                print('Title : ', book.contents[5].contents[0].attrs['title'])
                print('Link : ', book.contents[5].contents[0].attrs['href'])
                print('Rating : ', book.contents[3].attrs['class'][1])
                print('Price : ', book.contents[7].contents[1].text.replace('Â', ''))
                print('Availability : ', book.contents[7].contents[3].text.replace('\n', '').strip())
                print('-' * 100)
            PAGE_NO += 1  # move on to the next page

    scrap_books_data()
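
The `.replace('Â', '')` call in the script works around a decoding mismatch: when a server sends UTF-8 text without declaring a charset, `response.text` can fall back to ISO-8859-1, which turns `£` into `Â£`. The sketch below reproduces that effect on a plain byte string, with no network access; the price value is only an example.

```python
price = "£51.77"  # example value, not taken from a live run
raw = price.encode("utf-8")        # bytes as a UTF-8 server would send them

wrong = raw.decode("iso-8859-1")   # decoded with the wrong charset
right = raw.decode("utf-8")        # decoded with the right charset

print(repr(wrong))
print(repr(right))
```

Assuming the page really is UTF-8, passing `response.content` to BeautifulSoup, or setting `response.encoding = 'utf-8'` before reading `response.text`, should avoid the stray character without a string replace.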

【Comments】: