Python Web Scraping: Output to CSV

【Question title】: Python Web Scraping: Output to csv
【Posted】: 2020-09-19 16:58:00
【Question description】:

I have made some progress with web scraping, but I still need help with a few steps:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'http://fcf.cat/equip/1920/1i/sant-ildefons-ue-b'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')

out = []

for tr in soup.select('.col-md-4 tbody tr'):
    pass  # loop body not written yet

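I know there are 3 tables under the col-md-4 class. I want to generate a csv with three values per row: the first name, the last name, and, as the last value, the header name of the table the row came from.

first name, last name, table header

Any help would be greatly appreciated.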

【Question comments】:

See if this helps: stackoverflow.com/questions/39710903/…
Thanks for the link, but that uses pandas and I want to use BeautifulSoup.

Tags: python python-3.x beautifulsoup python-requests
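
For the BeautifulSoup-only route mentioned in the comment above, one option is to pair BeautifulSoup with the standard-library csv module instead of pandas. The following is a minimal sketch, not the method from any answer in this thread. The inline HTML is a hypothetical stand-in for the page (it assumes each .col-md-4 block holds a table with one thead th title and one "Surname, Name" cell per row), and the helper names tables_to_rows and write_csv are illustrative.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup: assumed to resemble the tables on the real page.
SAMPLE_HTML = """
<div class="col-md-4">
  <table>
    <thead><tr><th>Table A</th></tr></thead>
    <tbody>
      <tr><td> DOE, JANE </td></tr>
      <tr><td>ROE, RICHARD</td></tr>
    </tbody>
  </table>
</div>
<div class="col-md-4">
  <table>
    <thead><tr><th>Table B</th></tr></thead>
    <tbody>
      <tr><td>SMITH, JOHN</td></tr>
    </tbody>
  </table>
</div>
"""


def tables_to_rows(html):
    """Return [first_name, last_name, table_header] for every table row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for table in soup.select(".col-md-4 table"):
        header = table.select_one("thead th").get_text(strip=True)
        for tr in table.select("tbody tr"):
            # Cells are assumed to read "Surname, Name".
            last_name, _, first_name = tr.get_text(strip=True).partition(",")
            rows.append([first_name.strip(), last_name.strip(), header])
    return rows


def write_csv(rows, handle):
    """Write a header line plus the data rows to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(["first_name", "last_name", "table_name"])
    writer.writerows(rows)


buffer = io.StringIO()
write_csv(tables_to_rows(SAMPLE_HTML), buffer)
print(buffer.getvalue())
```

To write a real file, pass a handle from open(filename, "w", newline="", encoding="utf-8") instead of the in-memory buffer, and feed tables_to_rows the content returned by requests.get(url).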

【Solution 1】:

This is what I came up with myself:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'http://fcf.cat/equip/1920/1i/sant-ildefons-ue-b'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Name the CSV after the last path segment of the URL
filename = url.rsplit('/', 1)[1] + '.csv'

tables = soup.select('.col-md-4 table')
rows = []

# Each table becomes one row: its text fragments, split on '|'
for table in tables:
    t = table.get_text(strip=True, separator='|').split('|')
    rows.append(t)

# Build and write the DataFrame once, after all tables are collected
df = pd.DataFrame(rows)
print(df)
df.to_csv(filename)

Thanks,

【Comments】:

【Solution 2】:

This might work:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'http://fcf.cat/equip/1920/1i/sant-ildefons-ue-b'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
tables = soup.select('.col-md-4 table')
rows = []

for table in tables:
    cleaned = list(table.stripped_strings)
    header, names = cleaned[0], cleaned[1:]
    data = [name.split(', ') + [header] for name in names]
    rows.extend(data)

result = pd.DataFrame.from_records(rows, columns=['surname', 'name', 'table'])

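【Comments】:

Thanks for your help. I pasted the code into Visual Studio but got an error: SyntaxError: 'return' outside function
I have edited the answer; you will now get the desired result in the result variable.
Hi Milan, thanks for your support. I tried the code again but I am still having a problem. An exception occurred: TypeError 'generator' object is not subscriptable, File "plantillasfcf.py", line 30, at header, names = cleaned[0], cleaned[1:]
Sorry. I have edited the answer: the output of stripped_strings needs to be wrapped in list. Try again?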
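
The error discussed in these comments can be reproduced in isolation. The sketch below uses a hypothetical one-table snippet rather than the real page: stripped_strings is a generator, so indexing it raises TypeError, while wrapping it in list makes indexing and slicing work.

```python
from bs4 import BeautifulSoup

# Hypothetical one-table snippet, used only to show the behaviour.
html = "<table><tr><th> Table A </th></tr><tr><td> DOE, JANE </td></tr></table>"
table = BeautifulSoup(html, "html.parser").table

strings = table.stripped_strings  # a generator: iterable, but not indexable
try:
    strings[0]
    indexing_failed = False
except TypeError:
    indexing_failed = True

cleaned = list(table.stripped_strings)  # a list supports indexing and slicing
header, names = cleaned[0], cleaned[1:]
print(indexing_failed, header, names)
```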

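【Solution 3】:

You need to first iterate over each table you want to scrape, then get the header and the data rows for each table. For each row of data, you parse out the first name and last name (and attach the table's header).

Here is a detailed working example: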

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'http://fcf.cat/equip/1920/1i/sant-ildefons-ue-b'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')

out = []

# Iterate through each of the three tables
for table in soup.select(".col-md-4 table"):

    # Grab the header and rows from the table
    header = table.select("thead th")[0].text.strip()
    rows = [s.text.strip() for s in table.select("tbody tr")]

    t = []  # This list will contain the rows of data for this table

    # Iterate through rows in this table
    for row in rows:

        # Split by comma (last_name, first_name)
        split = row.split(",")

        last_name = split[0].strip()
        first_name = split[1].strip()

        # Create the row of data
        t.append([first_name, last_name, header])

    # Convert list of rows to a DataFrame
    df = pd.DataFrame(t, columns=["first_name", "last_name", "table_name"])

    # Append to list of DataFrames
    out.append(df)

# Write to CSVs...
out[0].to_csv("first_table.csv", index=False)  # etc...
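
Since the question asks for a single csv, one option is to concatenate the per-table DataFrames before writing rather than saving each one separately. This is a minimal sketch under that assumption: the two small frames are hypothetical stand-ins for the `out` list built above, and the values in them are placeholders.

```python
import pandas as pd

columns = ["first_name", "last_name", "table_name"]

# Hypothetical stand-ins for the per-table DataFrames collected in `out`.
out = [
    pd.DataFrame([["JANE", "DOE", "Table A"]], columns=columns),
    pd.DataFrame([["JOHN", "SMITH", "Table B"]], columns=columns),
]

# Stack the frames and renumber the index so it runs from 0 upward.
combined = pd.concat(out, ignore_index=True)

# With no path given, to_csv returns the CSV text; index=False omits the index.
csv_text = combined.to_csv(index=False)
print(csv_text)
```

Passing a path as the first argument, for example combined.to_csv("all_tables.csv", index=False), writes the same content to a file instead of returning it.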

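Whenever you are web scraping, I strongly recommend calling strip() on all of the text you parse, so that no stray whitespace ends up in your data.

I hope this helps!

【Comments】: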
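
To illustrate the point about strip(), here is a small sketch of splitting one scraped cell into names. The helper split_name is hypothetical and the input strings are made-up placeholders. It uses partition instead of split, an assumption made here so that a cell without a comma yields an empty first name rather than an IndexError.

```python
def split_name(raw):
    """Split 'Surname, Name' into (first_name, last_name), trimming whitespace."""
    last_name, _, first_name = raw.partition(",")
    return first_name.strip(), last_name.strip()


# Leading/trailing spaces and newlines from the HTML are removed.
print(split_name("  DOE ,\n   JANE  "))
# A cell with no comma still parses; the first name is left empty.
print(split_name("DOE"))
```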