【Question title】: Python + BeautifulSoup Exporting to CSV
【Posted】: 2014-03-07 04:28:31
【Question】:

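I'm having some trouble automatically scraping data from a table in a Wikipedia article. First, I was getting an encoding error. I specified UTF-8 and the error went away, but many characters in the scraped data are not displayed correctly. You will be able to tell from the code that I am a complete newbie: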

from bs4 import BeautifulSoup
import urllib2

wiki = "http://en.wikipedia.org/wiki/Anderson_Silva"
header = {'User-Agent': 'Mozilla/5.0'} #Needed to prevent 403 error on Wikipedia
req = urllib2.Request(wiki,headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

Result = ""
Record = ""
Opponent = ""
Method = ""
Event = ""
Date = ""
Round = ""
Time = ""
Location = ""
Notes = ""

table = soup.find("table", { "class" : "wikitable sortable" })

f = open('output.csv', 'w')

for row in table.findAll("tr"):
    cells = row.findAll("td")
    #For each "tr", assign each "td" to a variable.
    if len(cells) == 10:
        Result = cells[0].find(text=True)
        Record = cells[1].find(text=True)
        Opponent = cells[2].find(text=True)
        Method = cells[3].find(text=True)
        Event = cells[4].find(text=True)
        Date = cells[5].find(text=True)
        Round = cells[6].find(text=True)
        Time = cells[7].find(text=True)
        Location = cells[8].find(text=True)
        Notes = cells[9].find(text=True)

        write_to_file = Result + "," + Record + "," + Opponent + "," + Method + "," + Event + "," + Date + "," + Round + "," + Time + "," + Location + "\n"
        write_to_unicode = write_to_file.encode('utf-8')
        print write_to_unicode
        f.write(write_to_unicode)

f.close()

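【Comments】:

  • Have you tried using the csv module (docs.python.org/2/library/csv.html)? It handles quoting and so on. The docs also point you in the right direction for writing out text in different encodings. As for your specific problem, though: which characters does UTF-8 fail to display correctly? According to the meta tag on that page, the charset is UTF-8.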
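To illustrate the quoting point in the comment above, here is a minimal sketch in Python 3 using only the standard library. The row values are made up for illustration and do not come from the Wikipedia table; the idea is just that a field containing commas breaks a hand-built line but survives a round trip through csv.writer.

```python
import csv
import io

# Hypothetical row: the last field contains commas, as a location often does.
row = ["Win", "Some Opponent", "Las Vegas, Nevada, United States"]

# Manual concatenation: the embedded commas are indistinguishable from separators.
naive_line = ",".join(row)

# csv.writer wraps the field that contains the delimiter in quotes.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
csv_line = buffer.getvalue()

# Parse both lines back to compare how many fields a reader sees.
naive_fields = next(csv.reader([naive_line]))
csv_fields = next(csv.reader(io.StringIO(csv_line)))

print(len(naive_fields), len(csv_fields))
```

The hand-built line is read back with more fields than were written, while the csv.writer line is read back as the original three.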

Tags: python csv beautifulsoup


【Solution 1】:

As pswaminathan pointed out, using the csv module will help a lot. Here is how I would do it:

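import csv

# `soup` is the BeautifulSoup object built in the question's code.
table = soup.find('table', {'class': 'wikitable sortable'})
# Python 2: the csv module expects the file to be opened in binary mode.
with open('out2.csv', 'wb') as f:
    csvwriter = csv.writer(f)
    for row in table.findAll('tr'):
        # Encode each cell's text as UTF-8 bytes before writing.
        cells = [c.text.encode('utf-8') for c in row.findAll('td')]
        if len(cells) == 10:
            csvwriter.writerow(cells)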

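Discussion

  • Using the csv module, I create a csvwriter object that is connected to my output file.
  • By using the with statement, I don't have to worry about closing the output file when I'm done: it is closed automatically after the with block.
  • In my code, cells is a list of UTF-8 encoded text extracted from the td tags inside a tr tag.
  • I used the construct c.text, which is more concise than c.find(text=True).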
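The answer's code targets Python 2. As a rough sketch of how the same writing step might look in Python 3, the file can be opened in text mode with an explicit encoding, so the cells are written as str without calling encode. The helper name write_rows and the sample rows are hypothetical; in the real script each inner list would come from something like [c.text for c in row.findAll('td')].

```python
import csv
import os
import tempfile


def write_rows(path, rows, expected_columns=10):
    """Write rows that have exactly `expected_columns` cells; return how many were written."""
    written = 0
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        for cells in rows:
            if len(cells) == expected_columns:
                writer.writerow(cells)
                written += 1
    return written


# Hypothetical stand-in for the scraped cells (not real data from the page).
sample_rows = [
    [],  # a header row made of th cells yields no td cells and is skipped
    ["Win", "1-0", "José Example", "KO", "Event 1", "2014",
     "1", "0:30", "São Paulo, Brazil", ""],
]

path = os.path.join(tempfile.mkdtemp(), "out.csv")
count = write_rows(path, sample_rows)

# Read the file back to check that non-ASCII text and commas survive.
with open(path, encoding="utf-8", newline="") as f:
    round_trip = list(csv.reader(f))
```

If the characters only look wrong when the file is opened in a spreadsheet program, passing encoding="utf-8-sig" to open may help, since it writes a byte order mark that some programs use to detect UTF-8. That is a guess at the cause; the question does not say where the characters are displayed incorrectly.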
