【Title】: Beautiful Soup to Scrape Data from Static Webpages
【Posted】: 2021-12-06 23:40:33
【Question】:

I am trying to grab values from tables spread across several static webpages. The site provides conjugation data for Korean verbs: https://koreanverb.app/

My Python script uses Beautiful Soup. The goal is to grab all of the conjugations from multiple input URLs and output the data to a CSV file.

The conjugations are stored on each page in a table with class "table-responsive", under table rows with class "conjugation-row". There are multiple "conjugation-row" table rows on each page, but my script only scrapes the first one.

Why doesn't the for loop scrape all of the td elements with class "conjugation-row"? I would appreciate a solution that grabs every tr with class "conjugation-row". I tried using job_elements = results.find("tr", class_="conjugation-row"), but got the following error:

AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

Also, when I grab the data and write it to the CSV file, the data lands in separate rows as expected, but with blanks left behind: the rows for the second URL are placed in the index after all of the rows for the first URL. See a sample of the output here:

See the code here:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv

# create csv file
outfile = open("scrape.csv","w",newline='')
writer = csv.writer(outfile)

## define first URL to grab conjugation names
url1 = 'https://koreanverb.app/?search=%ED%95%98%EB%8B%A4'

# define dataframe columns
df = pd.DataFrame(columns=['conjugation name'])

# get URL content
response = requests.get(url1)
soup = BeautifulSoup(response.content, 'html.parser')
    
# get table with all verb conjugations
results = soup.find("div", class_="table-responsive")


##### GET CONJUGATIONS AND APPEND TO CSV

# define URLs
urls = ['https://koreanverb.app/?search=%ED%95%98%EB%8B%A4', 
        'https://koreanverb.app/?search=%EB%A8%B9%EB%8B%A4',
        'https://koreanverb.app/?search=%EB%A7%88%EC%8B%9C%EB%8B%A4']

# loop to get data
for url in urls:
    response = requests.get(url)
    soup2 = BeautifulSoup(response.content, 'html.parser')
    
    # get table with all verb conjugations
    results2 = soup2.find("div", class_="table-responsive")
    
    # get dictionary form of verb/adjective
    verb_results = soup2.find('dl', class_='dl-horizontal')
    verb_title = verb_results.find('dd')
    verb_title_text = verb_title.text

    job_elements = results2.find_all("tr", class_="conjugation-row")
    for job_element in job_elements:
        conjugation_name = job_element.find("td", class_="conjugation-name")
        conjugation_korean = conjugation_name.find_next_sibling("td")
        conjugation_name_text = conjugation_name.text
        conjugation_korean_text = conjugation_korean.text
        data_column = pd.DataFrame({    'conjugation name': [conjugation_name_text],
                                        verb_title_text: [conjugation_korean_text],

        })
        #data_column = pd.DataFrame({verb_title_text: [conjugation_korean_text]})        
        df = df.append(data_column, ignore_index = True)
        
# save to csv
df.to_csv('scrape.csv')
outfile.close()
print('Verb Conjugations Collected and Appended to CSV, one per column')

【Comments】:

  • You can use find_all(), which returns a list, instead of find(); then you can write a for loop to iterate over it and grab the data.

Tags: python csv beautifulsoup


【Solution 1】:

Use find_all() to get all of the job_elements, since find() only returns the first occurrence, and iterate over them in a for loop as shown below.

job_elements = results.find_all("tr", class_="conjugation-row")
for job_element in job_elements:
    conjugation_name = job_element.find("td", class_="conjugation-name")
    conjugation_korean = conjugation_name.find_next_sibling("td")
    conjugation_name_text = conjugation_name.text
    conjugation_korean_text = conjugation_korean.text

    # append element to data
    df2 = pd.DataFrame([[conjugation_name_text,conjugation_korean_text]],columns=['conjugation_name','conjugation_korean'])
    df = df.append(df2)

The error occurred because you tried to call find() on a variable of type list (a ResultSet).
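A minimal, self-contained illustration of the difference (the HTML here is a made-up sample mimicking the conjugation table, not the live page):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with the same structure as the conjugation table.
html = """
<div class="table-responsive"><table>
  <tr class="conjugation-row"><td class="conjugation-name">present</td><td>해</td></tr>
  <tr class="conjugation-row"><td class="conjugation-name">past</td><td>했어</td></tr>
</table></div>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a ResultSet (list-like) containing every matching row...
rows = soup.find_all("tr", class_="conjugation-row")
print(len(rows))  # 2

# ...while find() returns only the first matching Tag, which does support .find()
first = soup.find("tr", class_="conjugation-row")
print(first.find("td").text)  # present

# Calling rows.find(...) on the ResultSet itself raises the AttributeError above.
```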

Since your script is getting larger, I made some modifications, such as using a get_conjugations() function and some names that are easier to follow. The conjugation_names and conjugation_korean_names lists are added to the pandas DataFrame columns first, and then the remaining columns (korean0, korean1, ...) are added.

import requests
from bs4 import BeautifulSoup
import pandas as pd

# function to parse the html data & get conjugations
def get_conjugations(url):
    #set return lists
    conjugation_names = []
    conjugation_korean_names = []
    #get html text
    html = requests.get(url).text
    #parse the html text
    soup = BeautifulSoup(html, 'html.parser')
    #get table
    table = soup.find("div", class_="table-responsive")
    table_rows = table.find_all("tr", class_="conjugation-row")
    for row in table_rows:
        conjugation_name = row.find("td", class_="conjugation-name")
        conjugation_korean = conjugation_name.find_next_sibling("td")
        conjugation_names.append(conjugation_name.text)
        conjugation_korean_names.append(conjugation_korean.text)
    #return both lists
    return conjugation_names, conjugation_korean_names

# create csv file
outfile = open("scrape.csv", "w", newline='')

urls = ['https://koreanverb.app/?search=%ED%95%98%EB%8B%A4',
        'https://koreanverb.app/?search=%EB%A8%B9%EB%8B%A4',
        'https://koreanverb.app/?search=%EB%A7%88%EC%8B%9C%EB%8B%A4']

# define dataframe columns
df = pd.DataFrame(columns=['conjugation_name', 'conjugation_korean', 'korean0', 'korean1'])

conjugation_names, conjugation_korean_names = get_conjugations(urls[0])
df['conjugation_name'] = conjugation_names
df['conjugation_korean'] = conjugation_korean_names

for index, url in enumerate(urls[1:]):
    conjugation_names, conjugation_korean_names = get_conjugations(url)
    #set column name
    column_name = 'korean' + str(index)
    df[column_name] = conjugation_korean_names

#save to csv
df.to_csv('scrape.csv')
outfile.close()

# Print DONE
print('Export to CSV Complete')

Output

,conjugation_name,conjugation_korean,korean0,korean1
0,declarative present informal low,해,먹어,마셔
1,declarative present informal high,해요,먹어요,마셔요
2,declarative present formal low,한다,먹는다,마신다
3,declarative present formal high,합니다,먹습니다,마십니다
...

Note: this assumes the elements appear in the same order across the different URLs.
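If the row order could differ between pages, one order-independent alternative (a sketch with made-up sample values, not output from the live site) is to merge each URL's results on the conjugation name instead of assigning columns positionally:

```python
import pandas as pd

# Hypothetical per-URL results; in practice each frame would be built from
# the (conjugation_names, conjugation_korean_names) lists returned by
# get_conjugations(url).
base = pd.DataFrame({
    "conjugation_name": ["present", "past"],
    "conjugation_korean": ["해", "했어"],
})
other = pd.DataFrame({
    "conjugation_name": ["past", "present"],  # note: reversed row order
    "korean0": ["먹었어", "먹어"],
})

# merge() aligns rows by the key column, so the order on each page no longer matters.
df = base.merge(other, on="conjugation_name", how="outer")
print(df)
```

With this approach each korean column is matched to the right conjugation row even if a page lists its conjugations in a different order.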

【Discussion】:

  • This works well - thanks. I have updated the code above accordingly. The values for each URL are placed in their own columns, just as I wanted. However, the values are not placed in the same rows. How can I make sure the values in each column land in the correct row, i.e. the correct index, so that they line up?
  • @matt I didn't quite get what you are asking. Maybe you could add a sample dataset to the question showing what the output should look like.
  • I added an example of the CSV output to give you an idea.
  • Works great - now I understand find() and find_all() much better. Thanks.