【Question Title】: Looping pages for scraping with BeautifulSoup
【Posted】: 2020-11-15 06:09:27
【Question】:

我的单页刮刀:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.cvbankas.lt/?padalinys%5B0%5D=76&page=1'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
for h3 in soup.select('h3.list_h3'):
    job_title = h3.get_text(strip=True)
    company = h3.find_next(class_="heading_secondary").get_text(strip=True)
    salary = h3.find_next(class_="salary_amount").get_text(strip=True)
    location = h3.find_next(class_="list_city").get_text(strip=True)
    print('{:<50} {:<15} {:<15} {}'.format(company, salary, location, job_title))

    all_data.append({
        'Job Title': job_title,
        'Company': company,
        'Salary': salary,
        'Location': location
    })

df = pd.DataFrame(all_data)
df.to_csv('data.csv')

#tips = sns.load_dataset('data.csv')
#print(tips)

This gives me a csv file, but with only 50 rows. I want to scrape all the pages. I thought of using the element with 'class': 'prev_next' in the HTML, but the BACK and FORWARD links share that class and differ only in their href. So I decided to loop over a range and change the page number with it:

import requests
import pandas as pd
from bs4 import BeautifulSoup

#url = 'https://www.cvbankas.lt/?padalinys%5B0%5D=76&page=1'
#soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
for i in range(1, 9):
    url = 'https://www.cvbankas.lt/?padalinys%5B0%5D=76&page='+str(i)
    print(url)
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    for h3 in soup.select('h3.list_h3'):
        try:
            job_title = h3.get_text(strip=True)
            company = h3.find_next(class_="heading_secondary").get_text(strip=True)
            salary = h3.find_next(class_="salary_amount").get_text(strip=True)
            location = h3.find_next(class_="list_city").get_text(strip=True)
            print('{:<50} {:<15} {:<15} {}'.format(company, salary, location, job_title))
        except AttributeError:
            
            all_data.append({
                    'Job Title': job_title,
                    'Company': company,
                    'Salary': salary,
                    'Location': location
                })
        
df = pd.DataFrame(all_data)
df.to_csv('data.csv')

After running this code it saves only 5 rows, which is 10 times fewer than the single-page scraper.

How do I loop over the pages? They run from 1 to 8.

Also, how can I clean the Salary values? They come as strings containing either Nuo 2700 or Iki 2500, or with two numbers like @9​​87654329@. I want to use the Salary column as integers so I can do some plotting with Seaborn.

【Discussion】:

    Tags: python pandas web-scraping beautifulsoup data-cleaning


    【Solution 1】:

    You have indented the append to the all_data list inside the except block, so control only reaches it when an exception occurs. Running the script below yields about 365 rows in the csv file:

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup
    
    #url = 'https://www.cvbankas.lt/?padalinys%5B0%5D=76&page=1'
    #soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    all_data = []
    for i in range(1, 9):
        url = 'https://www.cvbankas.lt/?padalinys%5B0%5D=76&page='+str(i)
        print(url)
        soup = BeautifulSoup(requests.get(url).content, 'html.parser')
        for h3 in soup.select('h3.list_h3'):
            try:
                job_title = h3.get_text(strip=True)
                company = h3.find_next(class_="heading_secondary").get_text(strip=True)
                salary = h3.find_next(class_="salary_amount").get_text(strip=True)
                location = h3.find_next(class_="list_city").get_text(strip=True)
                print('{:<50} {:<15} {:<15} {}'.format(company, salary, location, job_title))
                all_data.append({
                        'Job Title': job_title,
                        'Company': company,
                        'Salary': salary,
                        'Location': location
                    })
            except AttributeError:
                pass
                
            
    df = pd.DataFrame(all_data)
    df.to_csv('data.csv')
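    The fix above restores the row count, but the question also asked how to avoid hardcoding the page range and how to turn the Salary strings into numbers. A rough sketch, not tested against the live site: the selectors are copied from the question, the salary formats are assumed to be only 'Nuo 2700', 'Iki 2500', or two-number ranges, and the loop assumes the site returns an empty listing past the last page (if it redirects instead, this would loop forever).

    ```python
    import re

    def parse_salary(text):
        """Pull the numbers out of strings like 'Nuo 2700', 'Iki 2500'
        or '2700-3300' and return one value (the midpoint for ranges),
        or None when no number is present."""
        nums = [int(n) for n in re.findall(r'\d+', text or '')]
        if not nums:
            return None
        return sum(nums) / len(nums)

    def scrape_all_pages():
        # Instead of range(1, 9), keep requesting pages until one comes
        # back empty -- soup.select('h3.list_h3') returns [] when a page
        # has no job listings.
        import requests
        from bs4 import BeautifulSoup

        all_data, page = [], 1
        while True:
            url = 'https://www.cvbankas.lt/?padalinys%5B0%5D=76&page=' + str(page)
            soup = BeautifulSoup(requests.get(url).content, 'html.parser')
            h3s = soup.select('h3.list_h3')
            if not h3s:  # no listings -> we are past the last page
                break
            for h3 in h3s:
                try:
                    all_data.append({
                        'Job Title': h3.get_text(strip=True),
                        'Company': h3.find_next(class_="heading_secondary").get_text(strip=True),
                        'Salary': parse_salary(h3.find_next(class_="salary_amount").get_text(strip=True)),
                        'Location': h3.find_next(class_="list_city").get_text(strip=True),
                    })
                except AttributeError:
                    pass
            page += 1
        return all_data
    ```

    With Salary numeric, the column plots directly in Seaborn, e.g. sns.histplot(df['Salary'].dropna()).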
    

    【Discussion】:
