【Question Title】: Web Scraping & BeautifulSoup - Next Page parsing
【Posted on】: 2021-11-15 18:45:20
【Question】:

I'm just learning web scraping and want to output the results from this site to a CSV file: https://www.avbuyer.com/aircraft/private-jets

However, I'm struggling to parse the next page. Here is my code (written with help from Amen Aziz), and it only gives me the first page.
I'm using Chrome, so I'm not sure whether that makes any difference. I'm running Python 3.8.12.
Thank you in advance.

import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get('https://www.avbuyer.com/aircraft/private-jets', headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
postings = soup.find_all('div', class_ = 'listing-item premium')
temp=[]
for post in postings:
    link = post.find('a', class_ = 'more-info').get('href')
    link_full = 'https://www.avbuyer.com'+ link
    plane = post.find('h2', class_ = 'item-title').text
    price = post.find('div', class_ = 'price').text
    location = post.find('div', class_ = 'list-item-location').text
    desc = post.find('div', class_ = 'list-item-para').text
    try:
        tag = post.find('div', class_ = 'list-viewing-date').text
    except AttributeError:  # not every listing has a viewing-date div
        tag = 'N/A'
    updated = post.find('div', class_ = 'list-update').text
    t=post.find_all('div',class_='list-other-dtl')
    for i in t:
        data=[tup.text for tup in i.find_all('li')]
        years=data[0]
        s=data[1]
        total_time=data[2]

        temp.append([plane,price,location,years,s,total_time,desc,tag,updated,link_full])

df=pd.DataFrame(temp,columns=["plane","price","location","Year","S/N","Totaltime","Description","Tag","Last Updated","link"])


# Attempt at pagination: this finds the next-page link and fetches it,
# but nothing loops back over the parsing code above, so only page 1
# ever makes it into the DataFrame.
next_page = soup.find('a', {'rel':'next'}).get('href')
next_page_full = 'https://www.avbuyer.com' + next_page

page = requests.get(next_page_full)
soup = BeautifulSoup(page.text, 'lxml')

df.to_csv('/Users/xxx/avbuyer.csv')
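One common way to fix the structure above is to wrap the fetch-and-parse steps in a loop that keeps following the `rel="next"` link until it disappears. The helper below (`next_url` is a name invented here for illustration, not part of the original code) shows just the link-extraction half of that idea, demonstrated on a tiny in-memory snippet rather than a live request to avbuyer.com:

```python
from bs4 import BeautifulSoup

BASE = 'https://www.avbuyer.com'

def next_url(soup):
    """Return the absolute URL of the next results page, or None on the
    last page, where no rel="next" anchor exists."""
    link = soup.find('a', {'rel': 'next'})
    return BASE + link.get('href') if link else None

# Demonstrated on an in-memory snippet instead of a live request:
snippet = '<a rel="next" href="/aircraft/private-jets/page-2">Next</a>'
print(next_url(BeautifulSoup(snippet, 'html.parser')))
# → https://www.avbuyer.com/aircraft/private-jets/page-2
```

In the full scraper, a `url = start; while url:` loop would fetch each page, run the parsing code, then set `url = next_url(soup)`; the loop stops naturally when the last page returns None.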

【Question Discussion】:

    Tags: python web-scraping beautifulsoup


    【Solution 1】:

    Try this. If you want a CSV file, add df.to_csv("prod.csv") after the print(df) line; I have written the code to collect the data for the CSV file.

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    headers = {'User-Agent': 'Mozilla/5.0'}
    temp=[]
    for page in range(1, 20):
        response = requests.get("https://www.avbuyer.com/aircraft/private-jets/page-{page}".format(page=page),headers=headers,)
        soup = BeautifulSoup(response.content, 'html.parser')
        postings = soup.find_all('div', class_='grid-x list-content')
        for post in postings:
            plane = post.find('h2', class_='item-title').text
            try:
                price = post.find('div', class_='price').text
            except AttributeError:  # sold/POA listings may have no price div
                price = " "
            location = post.find('div', class_='list-item-location').text
            t=post.find_all('div',class_='list-other-dtl')
            for i in t:
                data=[tup.text for tup in i.find_all('li')]
                years=data[0]
                s=data[1]
                total_time=data[2]
                temp.append([plane,price,location,years,s,total_time])
    
    df=pd.DataFrame(temp,columns=["plane","price","location","Years","S/N","Totaltime"])
    print(df)
    

    Output:

                          plane         price  ...             S/N         Totaltime
    0            Gulfstream G280     Make offer  ...        S/N 2007   Total Time 2528
    1    Dassault Falcon 2000LXS     Make offer  ...         S/N 377     Total Time 33
    2       Cirrus Vision SF50 G1  Please call   ...        S/N 0080    Total Time 615
    3              Gulfstream IV     Make offer  ...        S/N 1148   Total Time 6425
    4            Gulfstream G280     Make offer  ...        S/N 2072   Total Time 1918
    ..                        ...           ...  ...             ...               ...
    342       Embraer Phenom 100       Now Sold  ...    S/N 50000035   Total Time 3417
    343          Gulfstream G200       Now Sold  ...         S/N 152   Total Time 7209
    344     Cessna Citation XLS+       Now Sold  ...           S/N -      Total Time -
    345    Cessna Citation Ultra       Now Sold  ...    S/N 560-0393  Total Time 12947
    346    Cessna Citation Excel       Now Sold  ...  S/N 560XL-5253   Total Time 4850
    
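The df.to_csv("prod.csv") step the answer mentions can be sketched as follows; the one-row DataFrame is made up for illustration, and index=False is an extra option (not in the original code) that drops pandas' row numbers so the file starts straight at the header row:

```python
import pandas as pd

# A one-row DataFrame standing in for the scraped listings.
df = pd.DataFrame(
    [["Gulfstream G280", "Make offer", "2007"]],
    columns=["plane", "price", "Years"],
)

# index=False keeps pandas' integer index out of the CSV.
df.to_csv("prod.csv", index=False)
print(open("prod.csv").read())
```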

    【Discussion】:

    • Thanks Amen, but this only gives me the first page's data, the same 20 items as before, and I need all 392 aircraft across the following pages. I mentioned in the title that I need multiple pages, and I tried it in my code above, but it doesn't move on to the next page...
    • I also noticed something interesting: with postings = soup.find_all('div', class_='listing-item premium') I had to add 'listing-item' as well, because premium listings have a box around them but normal listings don't, so I have to include both 'listing-item premium' and 'listing-item', which share the same class. The normal listings start in the middle of page 6. Thanks so much for your help!
    • Did you try my code? What output did it give you?
    • Hi, your code only gives me 80 rows, like you showed above, but there are 399 aircraft in total. I also noticed that to get all the data I need I would have to combine the CSS selectors below, but I don't know how: postings = soup.find_all('div', class_ = 'listing-item premium') postings = soup.find_all('div', class_ = 'listing-item') postings = soup.find_all('div', class_ = 'listing-item wanted-item')
    • I have modified the code, please check it.
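On the compound-selector question in the comments: BeautifulSoup matches CSS class selectors (and the class_ keyword) against each class token individually, so a single div.listing-item selector already covers all three variants listed. A minimal sketch on an in-memory snippet, not the live site's markup:

```python
from bs4 import BeautifulSoup

# Three divs mimicking the variants mentioned in the comments:
# "listing-item", "listing-item premium", "listing-item wanted-item".
html = """
<div class="listing-item">plain</div>
<div class="listing-item premium">premium</div>
<div class="listing-item wanted-item">wanted</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# div.listing-item matches any div whose class list contains that token,
# so one selector covers all three listing types.
postings = soup.select('div.listing-item')
print([p.text for p in postings])
# → ['plain', 'premium', 'wanted']
```

The equivalent with find_all is soup.find_all('div', class_='listing-item'), since class_ also matches against individual class tokens.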