【Question title】:How to iterate through a whole HTML table and convert it to JSON data?
【Posted】:2019-11-11 01:37:34
【Question】:

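I'm working in Python 3. I have converted an HTML table to a JSON object, but the code does not iterate over the whole table; it only outputs the first row. Here is my code: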

html_source= """<div><table cellspacing="0" cellpadding="4" 
rules="all" border="2" id="ctl00_ContentPlaceHolder1_GridView1" 
style="background-color:White;border-color:#3366CC;border- 
width:2px;border-style:Solid;font-size:Medium;font-weight:bold;border- 
collapse:collapse;">
<tr style="color:#CCCCFF;background-color:#003399;font-weight:bold;">
<th scope="col">AC NO</th><th scope="col">PART NO</th><th 
 scope="col">SR NO</th><th scope="col">Voter Name</th><th 
scope="col">ID CARD NO</th><th scope="col">GENDER</th><th 
scope="col">AGE</th><th scope="col">&nbsp;</th><th scope="col">&nbsp; 
</th>
</tr><tr style="color:#003399;background-color:White;">
<td>211</td><td>396</td><td>294</td><td>name 1</td><td>UVP7645302</td> 
<td>M</td><td>28</td><td><input type="button" value="Polling Station 
Address"onclick="javascript:__doPostBack(&#39;ctl00$ContentPlaceHolder1$GridView1&#39;,&#39;View Details$0&#39;)" style="width:150px;" /></td><td><input type="button" value="Family" onclick="javascript:__doPostBack(&#39;ctl00$ContentPlaceHolder1$GridView1& 
#39;,&#39;Family$0&#39;)" /></td>
</tr><th scope="col">AC NO</th><th scope="col">PART NO</th><th 
scope="col">SR NO</th><th scope="col">Voter Name</th><th 
 scope="col">ID CARD NO</th><th scope="col">GENDER</th><th 
 scope="col">AGE</th><th scope="col">&nbsp;</th><th scope="col">&nbsp; 
</th>
</tr><tr style="color:#003399;background-color:White;">
<td>211</td><td>396</td><td>295</td><td>name 2</td><td>UVP7645302</td> 
<td>M</td><td>28</td><td><input type="button" value="Polling Station>"""

from bs4 import BeautifulSoup
import json

soup = BeautifulSoup(html_source,'html.parser')


for table in soup.find_all('table'):
    keys = [th.get_text(strip=True) for th in table.find_all('th')]
    values = [td.get_text(strip=True) for td in table.find_all('td')]
    d = dict(zip(keys,values))
    #print(d)
    mydict = json.dumps(d)

# drop entries whose value is empty
empty = {k: v for k, v in d.items() if not v}
for k in empty:
    del d[k]
print(json.dumps(d,ensure_ascii=False))

My expected output:

{"AC NO": "211", "PART NO": "396", "SR NO": "294", "Voter Name": "name 1", "ID CARD NO": "UVP7645302", "GENDER": "M", "AGE": "28"}, {"AC NO": "211", "PART NO": "396", "SR NO": "294", "Voter Name": "name 2", "ID CARD NO": "UVP7645302", "GENDER": "M", "AGE": "28"}

Actual output:

{"AC NO": "211", "PART NO": "396", "SR NO": "294", "Voter Name": "name 1", "ID CARD NO": "UVP7645302", "GENDER": "M", "AGE": "28"}

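【Comments】:

  • Hi, it would help a lot if you could format the HTML string, and the expected and actual output, to be more readable. Because they are so long, it is hard to see exactly what is missing.
  • One problem is that you are not collecting the dictionaries you build. Declare a list to hold the data and append each dictionary to it. In the specific case of your HTML there is only one table, so it doesn't really make a difference.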
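
The suggestion in the last comment can be sketched roughly as follows. Since a single dict built from `zip(keys, values)` can hold only one value per header name, one dict per table cannot represent several rows; the sketch below instead builds one dict per `<tr>` and appends it to a list. The shortened, well-formed `html_source` and the names `records` and `headers` are assumptions made for illustration, not part of the original post.

```python
import json
from bs4 import BeautifulSoup

# Simplified, well-formed stand-in for the table in the question (assumption).
html_source = """
<table id="ctl00_ContentPlaceHolder1_GridView1">
<tr><th>AC NO</th><th>PART NO</th><th>SR NO</th><th>Voter Name</th><th>&nbsp;</th></tr>
<tr><td>211</td><td>396</td><td>294</td><td>name 1</td><td><input type="button" value="Family" /></td></tr>
<tr><td>211</td><td>396</td><td>295</td><td>name 2</td><td><input type="button" value="Family" /></td></tr>
</table>
"""

soup = BeautifulSoup(html_source, "html.parser")
records = []  # one dict per data row

for table in soup.find_all("table"):
    headers = []
    for tr in table.find_all("tr"):
        ths = tr.find_all("th")
        if ths:
            # Header row: remember the column names for the rows that follow.
            headers = [th.get_text(strip=True) for th in ths]
            continue
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        # Pair each cell with its header; skip blank headers and empty cells.
        row = {k: v for k, v in zip(headers, cells) if k and v}
        if row:
            records.append(row)

print(json.dumps(records, ensure_ascii=False))
```

Filtering on `if k and v` plays the same role as the `empty` cleanup in the question: the button columns have no text, so they are left out of each row's dict.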

Tags: python json html-table beautifulsoup


【Solution 1】:

Use the pandas library:

from bs4 import BeautifulSoup
import pandas as pd 

html_source= """<div><table cellspacing="0" cellpadding="4" 
rules="all" border="2" id="ctl00_ContentPlaceHolder1_GridView1" 
style="background-color:White;border-color:#3366CC;border- 
width:2px;border-style:Solid;font-size:Medium;font-weight:bold;border- 
collapse:collapse;">
<tr style="color:#CCCCFF;background-color:#003399;font-weight:bold;">
<th scope="col">AC NO</th><th scope="col">PART NO</th><th 
 scope="col">SR NO</th><th scope="col">Voter Name</th><th 
scope="col">ID CARD NO</th><th scope="col">GENDER</th><th 
scope="col">AGE</th><th scope="col">&nbsp;</th><th scope="col">&nbsp; 
</th>
</tr><tr style="color:#003399;background-color:White;">
<td>211</td><td>396</td><td>294</td><td>name 1</td><td>UVP7645302</td> 
<td>M</td><td>28</td><td><input type="button" value="Polling Station 
Address"onclick="javascript:__doPostBack(&#39;ctl00$ContentPlaceHolder1$GridView1&#39;,&#39;View Details$0&#39;)" style="width:150px;" /></td><td><input type="button" value="Family" onclick="javascript:__doPostBack(&#39;ctl00$ContentPlaceHolder1$GridView1& 
#39;,&#39;Family$0&#39;)" /></td>
</tr><th scope="col">AC NO</th><th scope="col">PART NO</th><th 
scope="col">SR NO</th><th scope="col">Voter Name</th><th 
 scope="col">ID CARD NO</th><th scope="col">GENDER</th><th 
 scope="col">AGE</th><th scope="col">&nbsp;</th><th scope="col">&nbsp; 
</th>
</tr><tr style="color:#003399;background-color:White;">
<td>211</td><td>396</td><td>295</td><td>name 2</td><td>UVP7645302</td> 
<td>M</td><td>28</td><td><input type="button" value="Polling Station>"""

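# Recent pandas versions deprecate passing a literal HTML string,
# so wrap it in a file-like object.
from io import StringIO

table = pd.read_html(StringIO(html_source))[0]
print(table.to_dict('records'))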

Output:

[{'AC NO': 211, 'PART NO': 396, 'SR NO': 294, 'Voter Name': 'name 1', 'ID CARD NO': 'UVP7645302', 'GENDER': 'M', 'AGE': 28, 'Unnamed: 7': nan, 'Unnamed: 8': nan}, {'AC NO': 211, 'PART NO': 396, 'SR NO': 295, 'Voter Name': 'name 2', 'ID CARD NO': 'UVP7645302', 'GENDER': 'M', 'AGE': 28, 'Unnamed: 7': nan, 'Unnamed: 8': nan}]

If you want to drop the Unnamed columns from the dictionaries, add this line before the print(table.to_dict('records')) statement:

table = table.loc[:,~table.columns.str.startswith('Unnamed')]

Output:

[{'AC NO': 211, 'PART NO': 396, 'SR NO': 294, 'Voter Name': 'name 1', 'ID CARD NO': 'UVP7645302', 'GENDER': 'M', 'AGE': 28}, {'AC NO': 211, 'PART NO': 396, 'SR NO': 295, 'Voter Name': 'name 2', 'ID CARD NO': 'UVP7645302', 'GENDER': 'M', 'AGE': 28}]
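
The output above is a list of Python dicts rather than JSON text. A minimal sketch of getting a JSON string with string values, as in the question's expected output, might look like this. The hand-built `table` below is a hypothetical stand-in for the DataFrame returned by `pd.read_html`, and `json_text` is an illustrative name.

```python
import json
import pandas as pd

# Hypothetical stand-in for the DataFrame that pd.read_html(...)[0] returns.
table = pd.DataFrame(
    [
        {"AC NO": 211, "PART NO": 396, "SR NO": 294, "Voter Name": "name 1",
         "AGE": 28, "Unnamed: 7": float("nan")},
        {"AC NO": 211, "PART NO": 396, "SR NO": 295, "Voter Name": "name 2",
         "AGE": 28, "Unnamed: 7": float("nan")},
    ]
)

# Drop the placeholder columns, as in the answer.
table = table.loc[:, ~table.columns.str.startswith("Unnamed")]

# Cast every value to str, then serialize one JSON object per row.
json_text = table.astype(str).to_json(orient="records")
print(json_text)

# Round-trip through the json module to confirm the text is valid JSON.
parsed = json.loads(json_text)
```

Leaving out `.astype(str)` keeps the numeric columns as JSON numbers instead of strings.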

【Comments】:
