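[Posted]: 2019-01-07 15:43:50
[Question]:
I want to scrape data from a table: if a row contains <td BGCOLOR="#D42A2A">, I want to take the whole <tr> row.
The HTML looks like this (there are more than 2 rows):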
<tr bgcolor="#f4f4f4">
<td height="25" nowrap="NOWRAP"> ITEM_1 </td>
<td height="25" nowrap="NOWRAP"> 07:58:46 </td>
<td height="25" nowrap="NOWRAP"> Connected </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 07:58:00 </td>
<td height="25" nowrap="NOWRAP" bgcolor="#55aa2a"> --:--:-- </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 01:25:00 </td>
<td height="25" nowrap="NOWRAP"> 22:00:00 </td>
</tr>
<tr bgcolor="#ffffff">
<td height="25" nowrap="NOWRAP"> ITEM_2 </td>
<td height="25" nowrap="NOWRAP"> 07:58:46 </td>
<td height="25" nowrap="NOWRAP"> Connected </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 191 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 07:58:01 </td>
<td height="25" nowrap="NOWRAP" bgcolor="#55aa2a"> --:--:-- </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 01:25:00 </td>
<td height="25" nowrap="NOWRAP"> 22:00:00 </td>
</tr>
<tr bgcolor="#ffffff">
<td height="25" nowrap="NOWRAP"> ITEM_3 </td>
<td height="25" nowrap="NOWRAP"> 07:59:02 </td>
<td height="25" nowrap="NOWRAP"> Connected </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 36 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 36 </td>
<td height="25" nowrap="NOWRAP"> 07:58:01 </td>
<td height="25" nowrap="NOWRAP" bgcolor="#d42a2a"> --:--:-- </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 03:10:00 </td>
<td height="25" nowrap="NOWRAP"> 22:00:00 </td>
</tr>
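I have used this, but the answers there return every row in the table, not just the rows that contain the required attribute.
So far my code looks like this: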
data = []
rows = table_body.find_all('tr')
for row in rows:
    # Collect the stripped text of every cell in this row
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    # Keep only the non-empty cell values
    data.append([ele for ele in cols if ele])
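I then scrape the site a second time to find the bgcolor attribute, add it to a list, append the list to the data frame, and drop any rows that do not have the right bgcolor.
This is all very inefficient.
How can I scrape the HTML so that I only get a row from the table when bgcolor is present in the td.attrs of that row?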
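To make the goal concrete, here is a minimal sketch of the filtering being asked about, not a confirmed solution. It assumes BeautifulSoup 4 with the built-in "html.parser", and it uses a shortened two-row stand-in for the table above. The names `TARGET` and `is_target_cell` are hypothetical helpers introduced only for this illustration. The colour comparison is lower-cased because the question writes the value as "#D42A2A" while the markup has "#d42a2a".

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the table in the question (assumption: same structure).
html = """
<table>
<tr bgcolor="#ffffff"><td> ITEM_2 </td><td bgcolor="#55aa2a"> --:--:-- </td></tr>
<tr bgcolor="#ffffff"><td> ITEM_3 </td><td bgcolor="#D42A2A"> --:--:-- </td></tr>
</table>
"""

TARGET = "#d42a2a"  # colour to look for, compared in lower case


def is_target_cell(tag):
    """True for a <td> whose bgcolor equals TARGET, ignoring case."""
    return tag.name == "td" and tag.get("bgcolor", "").lower() == TARGET


soup = BeautifulSoup(html, "html.parser")
data = []
for row in soup.find_all("tr"):
    # Skip rows that contain no cell with the wanted background colour.
    if row.find(is_target_cell) is None:
        continue
    data.append([td.get_text(strip=True) for td in row.find_all("td")])

print(data)
```

Each row is checked once, in the same loop that extracts the text, so no second scrape and no later clean-up of the frame is needed.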
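Edit: after applying the solution below to the full HTML, the script returns an empty list (my fault for not including more of the HTML). The HTML below is a more complete version with more of the tags.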
<html><head><title></title><style type="text/css">
BODY {
font-family: Tahoma, Verdana, Geneva, Arial, Helvetica, sans-serif;
font-size: 11px;
background-color: #FFFFFF
;}TABLE {
font-family: Tahoma, Verdana, Geneva, Arial, Helvetica, sans-serif;
font-size: 11px;
background-color: #FFFFFF;}
DIV.boldText {
font-size: 11px;font-weight: bold;
}
</style>
<meta http-equiv="REFRESH" content="10">
</head><body>
<form name="DataViewChooser">
<hr width="95%" align="CENTER" color="#55aa2a">
<table width="95%" align="CENTER">
<tbody><tr><td width="40" height="65" title="(c) ITEMS"><img
src="/icons/geneos_logo.png"></td>
<td width="25" align="LEFT">
<img title="Refresh" style="cursor: hand;" onclick="reloadPage()"
src="/icons/refresh.png"></td>
<td width="25" title="Show Fail and Warning Only" align="LEFT"><img
style="cursor: hand;" onclick="userContractView()" src="/icons/minimise.png"></td>
<td width="25" align="LEFT"><img title="Home" style="cursor: hand;" onclick="goHome()" src="/icons/up.png"></td>
<td align="RIGHT" nowrap="NOWRAP"><img src="/icons/hostgreen.gif">
<div class="boldText"> DASHBOARD-CV_AMER_Dashboard</div> [GROUP]
</td>
</tr></tbody></table><hr width="95%" align="CENTER" color="#55aa2a"></form>
<br><table width="95%" align="CENTER"><tbody><tr><td><table>
<tbody><tr><th height="20" align="LEFT" nowrap="NOWRAP"> AMER
</th>
<td nowrap="NOWRAP" bgcolor="#55aa2a"> </td></tr>
</tbody></table></td></tr></tbody></table>
<br><table width="99%" align="CENTER">
<tbody><tr bgcolor="#c0c0c0">
<th height="20" align="LEFT" nowrap="NOWRAP"> RowName </th>
<th height="20" align="LEFT" nowrap="NOWRAP"> Gateway_updatetime
</th>
<th height="20" align="LEFT" nowrap="NOWRAP"> Gateway_state </th>
<th height="20" align="LEFT" nowrap="NOWRAP"> OrdersCleared </th>
<th height="20" align="LEFT" nowrap="NOWRAP"> Ticketsread </th>
<th height="20" align="LEFT" nowrap="NOWRAP"> OrdersNotCleared
</th>
<th height="20" align="LEFT" nowrap="NOWRAP"> TicketsNotCleared
</th>
<th height="20" align="LEFT" nowrap="NOWRAP"> LastReadingtime
</th>
<th height="20" align="LEFT" nowrap="NOWRAP"> LastClearingtime
</th>
<th height="20" align="LEFT" nowrap="NOWRAP"> ClearingInProgress
</th>
<th height="20" align="LEFT" nowrap="NOWRAP"> YestVolumes </th>
<th height="20" align="LEFT" nowrap="NOWRAP"> Starttime </th>
<th height="20" align="LEFT" nowrap="NOWRAP"> Stoptime </th>
</tr><tr bgcolor="#f4f4f4">
<td height="25" nowrap="NOWRAP"> ITEM_4 </td>
<td height="25" nowrap="NOWRAP"> 07:58:46 </td>
<td height="25" nowrap="NOWRAP"> Connected </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 07:58:00 </td>
<td height="25" nowrap="NOWRAP" bgcolor="#d42a2a"> --:--:-- </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 01:25:00 </td>
<td height="25" nowrap="NOWRAP"> 22:00:00 </td>
</tr>
<tr bgcolor="#ffffff">
<td height="25" nowrap="NOWRAP"> ITEM_5 </td>
<td height="25" nowrap="NOWRAP"> 07:58:46 </td>
<td height="25" nowrap="NOWRAP"> Connected </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 191 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 07:58:01 </td>
<td height="25" nowrap="NOWRAP" bgcolor="#55aa2a"> --:--:-- </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 0 </td>
<td height="25" nowrap="NOWRAP"> 01:25:00 </td>
<td height="25" nowrap="NOWRAP"> 22:00:00 </td>
</tr>
</tbody></table><script language="JavaScript" src="/cookie.js"></script>
</body></html>
Also worth noting: I open the URL with urllib.request and then parse it with BS.
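For the fuller page, one possible approach is to start from the coloured cells and walk up to their rows, so the result does not depend on which of the several tables is selected first. This is only a sketch under assumptions: `fetch_html` and `rows_with_cell_color` are hypothetical names, the URL is supplied by the caller, and `sample` is a trimmed stand-in for the page above. The post does not say why the earlier attempt returned an empty list, so this is not claimed to address that cause.

```python
import urllib.request

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page with urllib.request and return its raw bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def rows_with_cell_color(html, color="#d42a2a"):
    """Return the cell texts of every <tr> that has a <td> with this bgcolor."""
    soup = BeautifulSoup(html, "html.parser")
    wanted = color.lower()
    seen = set()  # ids of rows already collected, so each row appears once
    result = []
    for cell in soup.find_all("td", bgcolor=True):
        if cell["bgcolor"].lower() != wanted:
            continue
        row = cell.find_parent("tr")
        if row is None or id(row) in seen:
            continue
        seen.add(id(row))
        # recursive=False keeps only the row's own cells, not nested tables.
        cells = row.find_all("td", recursive=False)
        result.append([td.get_text(strip=True) for td in cells])
    return result


# Trimmed stand-in for the full page: a nested table plus the data table.
sample = b"""<html><body>
<table><tbody><tr><td><table><tbody><tr><th> AMER </th>
<td nowrap="NOWRAP" bgcolor="#55aa2a"> </td></tr></tbody></table></td></tr></tbody></table>
<table><tbody><tr bgcolor="#c0c0c0"><th> RowName </th><th> LastClearingtime </th></tr>
<tr bgcolor="#f4f4f4"><td> ITEM_4 </td><td bgcolor="#d42a2a"> --:--:-- </td></tr>
<tr bgcolor="#ffffff"><td> ITEM_5 </td><td bgcolor="#55aa2a"> --:--:-- </td></tr>
</tbody></table></body></html>"""

print(rows_with_cell_color(sample))
```

With a live page the call would look like `rows_with_cell_color(fetch_html(url))`, and the returned list of lists can be passed to `pandas.DataFrame` directly.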
[Comments]:
Tags: python pandas beautifulsoup