【Question Title】: Web Scraping & BeautifulSoup - Next Page parsing
【Posted on】: 2021-11-15 18:45:20
【Question】:

I'm just learning web scraping and want to output the results from this site to a CSV file: https://www.avbuyer.com/aircraft/private-jets

However, I'm struggling to parse the next page. Here is my code (written with help from Amen Aziz), and it only gives me the first page.
I'm using Chrome, so I'm not sure whether that makes any difference. I'm running Python 3.8.12.
Thank you in advance.

import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get('https://www.avbuyer.com/aircraft/private-jets', headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
postings = soup.find_all('div', class_ = 'listing-item premium')
temp=[]
for post in postings:
    link = post.find('a', class_ = 'more-info').get('href')
    link_full = 'https://www.avbuyer.com'+ link
    plane = post.find('h2', class_ = 'item-title').text
    price = post.find('div', class_ = 'price').text
    location = post.find('div', class_ = 'list-item-location').text
    desc = post.find('div', class_ = 'list-item-para').text
    try:
        tag = post.find('div', class_ = 'list-viewing-date').text
    except AttributeError:  # not every listing has a viewing-date div
        tag = 'N/A'
    updated = post.find('div', class_ = 'list-update').text
    t=post.find_all('div',class_='list-other-dtl')
    for i in t:
        data=[tup.text for tup in i.find_all('li')]
        years=data[0]
        s=data[1]
        total_time=data[2]

        temp.append([plane,price,location,years,s,total_time,desc,tag,updated,link_full])

df=pd.DataFrame(temp,columns=["plane","price","location","Year","S/N","Totaltime","Description","Tag","Last Updated","link"])


# Attempt at pagination: this finds the next-page link and fetches it,
# but nothing loops back over the parsing code above, so only page 1
# ever makes it into the DataFrame.
next_page = soup.find('a', {'rel':'next'}).get('href')
next_page_full = 'https://www.avbuyer.com' + next_page

page = requests.get(next_page_full)
soup = BeautifulSoup(page.text, 'lxml')

df.to_csv('/Users/xxx/avbuyer.csv')
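One common way to fix the structure above is to wrap the fetch-and-parse steps in a loop that keeps following the `rel="next"` link until it disappears. The helper below (`next_url` is a name invented here for illustration, not part of the original code) shows just the link-extraction half of that idea, demonstrated on a tiny in-memory snippet rather than a live request to avbuyer.com:

```python
from bs4 import BeautifulSoup

BASE = 'https://www.avbuyer.com'

def next_url(soup):
    """Return the absolute URL of the next results page, or None on the
    last page, where no rel="next" anchor exists."""
    link = soup.find('a', {'rel': 'next'})
    return BASE + link.get('href') if link else None

# Demonstrated on an in-memory snippet instead of a live request:
snippet = '<a rel="next" href="/aircraft/private-jets/page-2">Next</a>'
print(next_url(BeautifulSoup(snippet, 'html.parser')))
# → https://www.avbuyer.com/aircraft/private-jets/page-2
```

In the full scraper, a `url = start; while url:` loop would fetch each page, run the parsing code, then set `url = next_url(soup)`; the loop stops naturally when the last page returns None.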

【Question Discussion】:

    Tags: python web-scraping beautifulsoup


    【Solution 1】:

    Try this. If you want a CSV file, add df.to_csv("prod.csv") after the print(df) line; I have written the code to collect the data for the CSV file.

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    headers = {'User-Agent': 'Mozilla/5.0'}
    temp=[]
    for page in range(1, 20):
        response = requests.get("https://www.avbuyer.com/aircraft/private-jets/page-{page}".format(page=page),headers=headers,)
        soup = BeautifulSoup(response.content, 'html.parser')
        postings = soup.find_all('div', class_='grid-x list-content')
        for post in postings:
            plane = post.find('h2', class_='item-title').text
            try:
                price = post.find('div', class_='price').text
            except AttributeError:  # sold/POA listings may have no price div
                price = " "
            location = post.find('div', class_='list-item-location').text
            t=post.find_all('div',class_='list-other-dtl')
            for i in t:
                data=[tup.text for tup in i.find_all('li')]
                years=data[0]
                s=data[1]
                total_time=data[2]
                temp.append([plane,price,location,years,s,total_time])
    
    df=pd.DataFrame(temp,columns=["plane","price","location","Years","S/N","Totaltime"])
    print(df)
    

    Output:

                          plane         price  ...             S/N         Totaltime
    0            Gulfstream G280     Make offer  ...        S/N 2007   Total Time 2528
    1    Dassault Falcon 2000LXS     Make offer  ...         S/N 377     Total Time 33
    2       Cirrus Vision SF50 G1  Please call   ...        S/N 0080    Total Time 615
    3              Gulfstream IV     Make offer  ...        S/N 1148   Total Time 6425
    4            Gulfstream G280     Make offer  ...        S/N 2072   Total Time 1918
    ..                        ...           ...  ...             ...               ...
    342       Embraer Phenom 100       Now Sold  ...    S/N 50000035   Total Time 3417
    343          Gulfstream G200       Now Sold  ...         S/N 152   Total Time 7209
    344     Cessna Citation XLS+       Now Sold  ...           S/N -      Total Time -
    345    Cessna Citation Ultra       Now Sold  ...    S/N 560-0393  Total Time 12947
    346    Cessna Citation Excel       Now Sold  ...  S/N 560XL-5253   Total Time 4850
    
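The df.to_csv("prod.csv") step the answer mentions can be sketched as follows; the one-row DataFrame is made up for illustration, and index=False is an extra option (not in the original code) that drops pandas' row numbers so the file starts straight at the header row:

```python
import pandas as pd

# A one-row DataFrame standing in for the scraped listings.
df = pd.DataFrame(
    [["Gulfstream G280", "Make offer", "2007"]],
    columns=["plane", "price", "Years"],
)

# index=False keeps pandas' integer index out of the CSV.
df.to_csv("prod.csv", index=False)
print(open("prod.csv").read())
```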

    【Discussion】:

    • Thanks Amen, but this only gives me the first page's data, the same 20 items as before, and I need all 392 aircraft across the following pages. I mentioned in the title that I need multiple pages, and I tried it in my code above, but it doesn't move on to the next page...
    • I also noticed something interesting: with postings = soup.find_all('div', class_='listing-item premium') I had to add 'listing-item' as well, because premium listings have a box around them but normal listings don't, so I have to include both 'listing-item premium' and 'listing-item', which share the same class. The normal listings start in the middle of page 6. Thanks so much for your help!
    • Did you try my code? What output did it give you?
    • Hi, your code only gives me 80 rows, like you showed above, but there are 399 aircraft in total. I also noticed that to get all the data I need I would have to combine the CSS selectors below, but I don't know how: postings = soup.find_all('div', class_ = 'listing-item premium') postings = soup.find_all('div', class_ = 'listing-item') postings = soup.find_all('div', class_ = 'listing-item wanted-item')
    • I have modified the code, please check it.
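On the compound-selector question in the comments: BeautifulSoup matches CSS class selectors (and the class_ keyword) against each class token individually, so a single div.listing-item selector already covers all three variants listed. A minimal sketch on an in-memory snippet, not the live site's markup:

```python
from bs4 import BeautifulSoup

# Three divs mimicking the variants mentioned in the comments:
# "listing-item", "listing-item premium", "listing-item wanted-item".
html = """
<div class="listing-item">plain</div>
<div class="listing-item premium">premium</div>
<div class="listing-item wanted-item">wanted</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# div.listing-item matches any div whose class list contains that token,
# so one selector covers all three listing types.
postings = soup.select('div.listing-item')
print([p.text for p in postings])
# → ['plain', 'premium', 'wanted']
```

The equivalent with find_all is soup.find_all('div', class_='listing-item'), since class_ also matches against individual class tokens.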