Collecting TRs from an HTML table to CSV using Python and BeautifulSoup: answers

【Question title】: Collect TRs from an HTML table to CSV, using Python and BeautifulSoup
【Posted】: 2020-08-05 03:46:09
【Question description】:

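I have tried many approaches. I want to collect the outage table from the Hydro website and then store it as a table in a CSV file.

I checked that a normal tr has 3 tds (except the header, which raised an error, so I added an if that checks whether the tr's td count equals 3). But somehow BeautifulSoup only detects 1 td. After that, I want to put the rows into a CSV like this: a,b,c d,e,f...

Code: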

import requests
import numpy as np
import pandas as pd
import csv
from bs4 import BeautifulSoup

URL = 'http://poweroutages.hydroquebec.com/poweroutages/service-interruption-report/#bis'
page = requests.get(URL)

soup = BeautifulSoup(page.content, 'html.parser')

table = soup.findAll('table')[0].findAll('tr')
print(table)
rates = {}
for tr in soup('tr'):

    if len(tr('td')) == 3:
        region_td, interruptions_td, cx_td = tr('td')
        print('hello')
        region = print(region_td)('i')[0]['title']
        interruptions = float(interruptions_td.text)
        cx = print(cx_td)('i')[0]['title']
        rates[region] = [interruptions, cx]

Another approach I tried works, but it only puts everything into a single flat array, and the footer has 2 tds instead of 3. Code 2:

import requests
import numpy as np

import pandas as pd
from bs4 import BeautifulSoup

URL = 'http://poweroutages.hydroquebec.com/poweroutages/service-interruption-report/#bis'
page = requests.get(URL)

soup = BeautifulSoup(page.content, 'html.parser')

n_interruptions = [i.text for i in soup.findAll('td')]

outageqc = pd.DataFrame({
    "n_interruptions": n_interruptions

})

outageqc.set_index('n_interruptions', inplace=True)

print(outageqc)
x = len(outageqc) / 3

Output:

Index: [Abitibi-Témiscamingue, 0, 0 customers out of 81 377, Bas-Saint-Laurent, 0, 0 customers out of 123 929, Capitale-Nationale, 6, 2 589 customers out of 424 080, Centre-du-Québec, 28, 3 547 customers out of 140 391, Chaudière-Appalaches, 14, 3 384 customers out of 244 321, Côte-Nord, 0, 0 customers out of 48 101, Estrie, 44, 1 684 customers out of 90 947, Gaspésie - Îles-de-la-Madeleine, 0, 0 customers out of 57 355, Lanaudière, 9, 240 customers out of 255 267, Laurentides, 3, 51 customers out of 350 118, Laval, 5, 939 customers out of 193 619, Mauricie, 19, 4 251 customers out of 165 191, Montréal, 4, 1 067 customers out of 1 058 896, Montérégie, 61, 9 380 customers out of 786 889, Nord-du-Québec, 0, 0 customers out of 22 127, Outaouais, 6, 71 customers out of 219 701, Saguenay - Lac-Saint-Jean, 0, 0 customers out of 130 952, 199, 27 203 customers out of 4 393 261]

This version misses the first column of the footer.
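
Both symptoms described above (everything ending up in one flat list, and the footer losing its first column) can be avoided by walking the table row by row and collecting `th` and `td` cells together. The following is only a minimal sketch: `SAMPLE_HTML` is made-up markup, and the idea that the footer's first cell is a `th` is an assumption about the page, not something confirmed here.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical sample markup (not the real page). The footer row uses <th>
# for its first cell, which would explain a td-only search finding 2 cells.
SAMPLE_HTML = """
<table>
  <tr><th>Region</th><th>Interruptions</th><th>Customers</th></tr>
  <tr><td>Laval</td><td>5</td><td>939 customers out of 193 619</td></tr>
  <tr><td>Estrie</td><td>44</td><td>1 684 customers out of 90 947</td></tr>
  <tr><th>Total</th><td>49</td><td>2 623 customers out of 284 566</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

rows = []
for tr in soup.find("table").find_all("tr"):
    # Collect header and data cells together so no column is skipped.
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    if len(cells) == 3:  # keep only complete three-column rows
        rows.append(cells)

# Write the rows as CSV text; an in-memory buffer keeps the sketch self-contained.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead, the same `csv.writer` call can be pointed at `open("outages.csv", "w", newline="", encoding="utf-8")`, where `outages.csv` is just a placeholder name.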

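【Comments on the question】:

Have you tried pd.read_html? It parses the tables and returns DataFrames.
Thanks Sushanth, I'm new to Python coding :) I just tried it and it isn't that complicated. It took me a little while, but I got the CSV done too. Thanks for the tip ;)

Tags: python csv beautifulsoup tr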

【Solution 1】:

Thanks to Sushanth for the tip. Fixed:

import pandas as pd

URL = 'http://poweroutages.hydroquebec.com/poweroutages/service-interruption-report/#bis'

# read_html returns a list of DataFrames, one per <table> found on the page
tables = pd.read_html(URL)

# Write each table to its own file so later tables do not overwrite earlier ones
for i, table in enumerate(tables):
    table.to_csv(f'test_{i}.csv', sep=',')
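
Since `pd.read_html` returns a list of DataFrames rather than a single one, it can help to inspect the result before writing anything. Below is a minimal sketch that runs on an inline snippet instead of the live page. `SAMPLE_HTML` and its rows are made-up sample data, the real page may be structured differently, and `read_html` needs an HTML parser such as lxml installed.

```python
import io

import pandas as pd

# Hypothetical sample markup; the real page's structure may differ.
SAMPLE_HTML = """
<table>
  <thead>
    <tr><th>Region</th><th>Interruptions</th><th>Customers</th></tr>
  </thead>
  <tbody>
    <tr><td>Laval</td><td>5</td><td>939 customers out of 193 619</td></tr>
    <tr><td>Estrie</td><td>44</td><td>1 684 customers out of 90 947</td></tr>
  </tbody>
</table>
"""

# Wrap the markup in StringIO so read_html treats it as file-like input.
tables = pd.read_html(io.StringIO(SAMPLE_HTML))
print(len(tables))  # number of <table> elements that were parsed

# Pick the table of interest and serialise it without the numeric row index.
outages = tables[0]
csv_text = outages.to_csv(index=False)
print(csv_text)
```

Passing a file name as the first argument, for example `outages.to_csv("outages.csv", index=False)`, writes the same content to disk. The file name here is a placeholder.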

【Discussion】: