【Question title】: Properly formatting a table after scraping it with BeautifulSoup
【Posted】: 2021-03-18 06:40:17
【Question description】:

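I am new to Python.

I have been trying to scrape a table from http://www.phc4.org/reports/utilization/inpatient/CountyReport20192C001.htm. The target table is titled "Utilization by Body System".

I was able to capture the table with BeautifulSoup; however, the scraped data is driving me crazy, and I cannot find a way to fix it.

My code: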

import re
import bs4
import urllib.request

source = urllib.request.urlopen('http://www.phc4.org/reports/utilization/inpatient/CountyReport20192C001.htm').read()
soup = bs4.BeautifulSoup(source, 'lxml')
# Find the county utilization table by MDC:
# locate the title text, then climb up to the table that contains it
table_mdc = soup.find(text=re.compile("Utilization by Body System")).findParent('table')
# print(table_mdc)
# Print the text of every cell, one cell per print call
for row in table_mdc.find_all('tr'):
    for cell in row.find_all('td'):
        print(cell.text)
# Write the text of every cell to a file (no separators are added between cells)
with open('utilization.txt', 'w') as r:
    for row in table_mdc.find_all('tr'):
        for cell in row.find_all('td'):
            r.write(cell.text)

For example, the scraped data prints as:

Utilization by Body System 
MDC Description
Total Cases
Number
Percent
Total Charges
% of Charges
Avg. Charge
Total Days
% of Total Days
Avg. LOS

Total

 
2,594
 
 
100.0%
 
 
$101,757,824
 
 
100.0%
 
 
$39,228
 
 
11,972
 
 
100.0%
 
 
4.6

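Both the printed output and the txt file contain a lot of line breaks. The ideal txt file should look like this:

(with no "Total Cases" in the header)

What should I do to overcome these problems?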

【Question discussion】:

  • FYI for all web scrapers: the past tense of "scrape" is "scraped", not "scrapped".

Tags: python beautifulsoup


【Solution 1】:
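import pandas as pd

# read_html returns a list of DataFrames; attrs selects the table whose id is
# "dgBodySystem", and header=0 uses the first row as the column names.
df = pd.read_html(
    "http://www.phc4.org/reports/utilization/inpatient/CountyReport20192C001.htm",
    attrs={"id": "dgBodySystem"},
    header=0,
)[0]

print(df)
df.to_csv("data.csv", index=False)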

Output:
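
For readers who would rather stay with BeautifulSoup, the extra line breaks in the question most likely come from whitespace inside each cell, which `cell.text` returns unchanged. The sketch below is one possible way to handle it, not the answerer's method: it strips each cell with `get_text(strip=True)`, joins the cells of a row with tabs, and drops labels that have no matching data column. `SAMPLE_HTML`, `table_rows`, and `drop_labels` are hypothetical names, and the markup is invented to imitate the output shown in the question; the real page's structure may differ.

```python
import csv
import io

from bs4 import BeautifulSoup

# Invented stand-in for the real page (an assumption): a title row, a header
# row with the extra "Total Cases" label, and whitespace-padded data cells.
SAMPLE_HTML = """
<table id="dgBodySystem">
  <tr><td>Utilization by Body System </td></tr>
  <tr><td>MDC Description</td><td>Total Cases</td><td>Number</td><td>Percent</td></tr>
  <tr><td>
Total
</td><td>
 2,594
 </td><td>
 100.0%
 </td></tr>
</table>
"""


def table_rows(table, drop_labels=()):
    """Return one list of cleaned cell strings per non-empty table row."""
    rows = []
    for tr in table.find_all("tr"):
        # strip=True removes the leading/trailing whitespace and newlines
        cells = [cell.get_text(" ", strip=True) for cell in tr.find_all(["td", "th"])]
        # Drop unwanted labels such as the title or a group header
        cells = [c for c in cells if c not in drop_labels]
        if any(cells):  # skip rows that end up empty
            rows.append(cells)
    return rows


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", id="dgBodySystem")
rows = table_rows(table, drop_labels={"Utilization by Body System", "Total Cases"})

# Write one tab-separated line per row; a file opened with
# open("utilization.txt", "w", newline="") can replace the StringIO buffer.
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
text = buffer.getvalue()
print(text)
```

Dropping "Total Cases" lines the header up with the data because, in the output shown in the question, the header has ten labels but each data row has only nine values.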

【Discussion】:
