【Question title】: Scrape <tr> tag if <td> tag has attribute
【Posted】: 2019-01-07 15:43:50
【Question】:

I want to scrape data from a table: if a row contains <td BGCOLOR="#D42A2A">, I want to take the entire <tr>.

The HTML looks like this (there are more than 2 rows):

<tr bgcolor="#f4f4f4">
<td height="25" nowrap="NOWRAP">&nbsp;ITEM_1&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;07:58:46&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;Connected&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;07:58:00&nbsp;</td>
<td height="25" nowrap="NOWRAP" bgcolor="#55aa2a">&nbsp;--:--:--&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;01:25:00 &nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp; 22:00:00&nbsp;</td>
</tr>
<tr bgcolor="#ffffff">
<td height="25" nowrap="NOWRAP">&nbsp;ITEM_2&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;07:58:46&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;Connected&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;191&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;07:58:01&nbsp;</td>
<td height="25" nowrap="NOWRAP" bgcolor="#55aa2a">&nbsp;--:--:--&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;01:25:00 &nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp; 22:00:00&nbsp;</td>
</tr>
<tr bgcolor="#ffffff">
<td height="25" nowrap="NOWRAP">&nbsp;ITEM_3&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;07:59:02&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;Connected&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;36&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;36&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;07:58:01&nbsp;</td>
<td height="25" nowrap="NOWRAP" bgcolor="#d42a2a">&nbsp;--:--:--&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;03:10:00 &nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp; 22:00:00&nbsp;</td>
</tr>

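I tried this, but the answer there returns every row in the table, not just the rows that contain the required attribute.

So far my code looks like this: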

data = []

rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])

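I then scrape the site a second time to find the bgcolor attributes, add them to a list, append the list to the data frame, and drop every row that does not have the right bgcolor.

This is all very inefficient.

How can I scrape the HTML so that I only get the rows whose td.attrs contain the bgcolor I am looking for?

Edit: after applying the solutions below to the whole HTML, the script returns an empty list (my fault for not including more of the HTML). The HTML below is a more complete version with more of the surrounding tags.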

<html><head><title></title><style type="text/css">
BODY {
font-family: Tahoma, Verdana, Geneva, Arial, Helvetica, sans-serif;
font-size: 11px;
background-color: #FFFFFF
;}TABLE {
font-family: Tahoma, Verdana, Geneva, Arial, Helvetica, sans-serif;
font-size: 11px;
background-color: #FFFFFF;}
DIV.boldText {
font-size: 11px;font-weight: bold;
}
</style>
<meta http-equiv="REFRESH" content="10">
</head><body>
<form name="DataViewChooser">
<hr width="95%" align="CENTER" color="#55aa2a">
<table width="95%" align="CENTER">
<tbody><tr><td width="40" height="65" title="(c) ITEMS"><img 
src="/icons/geneos_logo.png"></td>
<td width="25" align="LEFT">
<img title="Refresh" style="cursor: hand;" onclick="reloadPage()" 
src="/icons/refresh.png"></td>
<td width="25" title="Show Fail and Warning Only" align="LEFT"><img 
style="cursor: hand;" onclick="userContractView()" src="/icons/minimise.png"></td>
<td width="25" align="LEFT"><img title="Home" style="cursor: hand;" onclick="goHome()" src="/icons/up.png"></td>
<td align="RIGHT" nowrap="NOWRAP"><img src="/icons/hostgreen.gif">
<div class="boldText">&nbsp;DASHBOARD-CV_AMER_Dashboard</div>&nbsp; [GROUP]
</td>
</tr></tbody></table><hr width="95%" align="CENTER" color="#55aa2a"></form>
<br><table width="95%" align="CENTER"><tbody><tr><td><table>
<tbody><tr><th height="20" align="LEFT" nowrap="NOWRAP">&nbsp;AMER&nbsp; 
</th>
<td nowrap="NOWRAP" bgcolor="#55aa2a">&nbsp;&nbsp;</td></tr>
</tbody></table></td></tr></tbody></table>
<br><table width="99%" align="CENTER">
<tbody><tr bgcolor="#c0c0c0">
<th height="20" align="LEFT" nowrap="NOWRAP">&nbsp;RowName&nbsp;</th>
<th height="20" align="LEFT" nowrap="NOWRAP">&nbsp;Gateway_updatetime&nbsp; 
</th>
<th height="20" align="LEFT" nowrap="NOWRAP">&nbsp;Gateway_state&nbsp;</th>
<th height="20" align="LEFT" nowrap="NOWRAP">&nbsp;OrdersCleared&nbsp;</th>
<th height="20" align="LEFT" nowrap="NOWRAP">&nbsp;Ticketsread&nbsp;</th>
<th height="20" align="LEFT" nowrap="NOWRAP">&nbsp;OrdersNotCleared&nbsp; 
</th>
<th height="20" align="LEFT" nowrap="NOWRAP">&nbsp;TicketsNotCleared&nbsp; 
</th>
<th height="20" align="LEFT" nowrap="NOWRAP">&nbsp;LastReadingtime&nbsp; 
</th>
<th height="20" align="LEFT" nowrap="NOWRAP">&nbsp;LastClearingtime&nbsp; 
</th>
<th height="20" align="LEFT" nowrap="NOWRAP">&nbsp;ClearingInProgress&nbsp; 
</th>
<th height="20" align="LEFT" nowrap="NOWRAP">&nbsp;YestVolumes&nbsp;</th>
<th height="20" align="LEFT" nowrap="NOWRAP">&nbsp;Starttime&nbsp;</th>
<th height="20" align="LEFT" nowrap="NOWRAP">&nbsp;Stoptime&nbsp;</th>
</tr><tr bgcolor="#f4f4f4">
<td height="25" nowrap="NOWRAP">&nbsp;ITEM_4&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;07:58:46&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;Connected&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;07:58:00&nbsp;</td>
<td height="25" nowrap="NOWRAP" bgcolor="#d42a2a">&nbsp;--:--:--&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;01:25:00 &nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp; 22:00:00&nbsp;</td>
</tr>
<tr bgcolor="#ffffff">
<td height="25" nowrap="NOWRAP">&nbsp;ITEM_5&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;07:58:46&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;Connected&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;191&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;07:58:01&nbsp;</td>
<td height="25" nowrap="NOWRAP" bgcolor="#55aa2a">&nbsp;--:--:--&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;0&nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp;01:25:00 &nbsp;</td>
<td height="25" nowrap="NOWRAP">&nbsp; 22:00:00&nbsp;</td>
</tr>
</tbody></table><script language="JavaScript" src="/cookie.js"></script>
</body></html>

Also worth noting: I open the URL with urllib.request and then parse the result with BeautifulSoup.
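For orientation, here is a minimal sketch of the fetch-then-parse workflow described above, applied to a page with more than one table. The helper names `fetch_page` and `parse_page` and the shortened `SAMPLE` markup are assumptions made for illustration; they are not taken from the original script. The filter runs over every `tr` in the document, so it does not depend on which table `table_body` happens to point at.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def parse_page(html):
    """Parse raw HTML (bytes or str) with Python's built-in parser."""
    return BeautifulSoup(html, "html.parser")


def fetch_page(url, timeout=10):
    """Download a page with urllib.request and return the parsed soup."""
    with urlopen(url, timeout=timeout) as response:
        return parse_page(response.read())


# Shortened stand-in for the dashboard page: a status table plus the data table.
SAMPLE = b"""
<table><tbody><tr><td><table>
<tbody><tr><th>AMER</th><td bgcolor="#55aa2a"></td></tr></tbody>
</table></td></tr></tbody></table>
<table><tbody>
<tr bgcolor="#c0c0c0"><th>RowName</th><th>LastClearingtime</th></tr>
<tr bgcolor="#f4f4f4"><td>ITEM_4</td><td bgcolor="#d42a2a">--:--:--</td></tr>
<tr bgcolor="#ffffff"><td>ITEM_5</td><td bgcolor="#55aa2a">--:--:--</td></tr>
</tbody></table>
"""

soup = parse_page(SAMPLE)  # with a live page: soup = fetch_page(url)

# Keep only the rows that contain a red cell; the value is matched as written
# in the markup, which is lowercase here.
red_rows = [tr for tr in soup.find_all("tr") if tr.find("td", bgcolor="#d42a2a")]
print([tr.td.get_text(strip=True) for tr in red_rows])
```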

【Comments】:

    Tags: python pandas beautifulsoup


    【Solution 1】:

    You can apply a searching function that checks whether the tag name is tr and whether the row contains a td element with bgcolor="#D42A2A":

    def rows_with_desired_bgcolor(elm):
        return elm.name == 'tr' and elm.find('td', bgcolor="#D42A2A")
    
    table_body.find_all(rows_with_desired_bgcolor)
    

    Of course, you can do the same check directly in a list comprehension:

    [tr for tr in table_body('tr') if tr.find('td', bgcolor="#D42A2A")]
    

    where table_body('tr') is a shortcut for table_body.find_all('tr').
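One caveat: a string passed as an attribute filter is compared exactly, and the sample markup in the question stores the colour in lowercase (#d42a2a) while the filter above uses uppercase. The sketch below is one possible way to make the comparison case-insensitive by passing a function as the attribute filter. The helper `has_bgcolor` and the shortened `HTML` sample (including the invented ITEM_6 row with an uppercase value) are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Shortened sample; ITEM_6 is a made-up row that uses an uppercase colour value.
HTML = """
<table><tbody>
<tr><td>ITEM_1</td><td bgcolor="#55aa2a">--:--:--</td></tr>
<tr><td>ITEM_3</td><td bgcolor="#d42a2a">--:--:--</td></tr>
<tr><td>ITEM_6</td><td bgcolor="#D42A2A">--:--:--</td></tr>
</tbody></table>
"""


def has_bgcolor(wanted):
    """Build an attribute filter that ignores the case of the colour value."""
    wanted = wanted.lower()

    def match(value):
        # The filter may be called with None when the attribute is missing.
        return value is not None and value.lower() == wanted

    return match


table_body = BeautifulSoup(HTML, "html.parser").tbody

rows = [tr for tr in table_body("tr") if tr.find("td", bgcolor=has_bgcolor("#D42A2A"))]
names = [tr.td.get_text(strip=True) for tr in rows]
print(names)
```

Both the lowercase and the uppercase row are selected, so the same call works whichever way the page writes the colour.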

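    【Discussion】:

    • I like your list comprehension, but it produces an empty list.
    • @swagless_monk OK, have you tried lowercasing the bgcolor value?
    • Is there a way to get only the text? I am using tr.text, but I need something like tr.strip.
    • @swagless_monk You can use .get_text(strip=True), I guess.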
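As a rough illustration of the .get_text(strip=True) suggestion: calling it on the whole row concatenates the cell texts with no separator, so it is usually more convenient to call it per cell, or to pass a separator. The shortened `HTML` sample below is an assumption for illustration; the &nbsp; padding mirrors the markup in the question.

```python
from bs4 import BeautifulSoup

HTML = """
<table><tbody>
<tr><td>&nbsp;ITEM_2&nbsp;</td><td>&nbsp;191&nbsp;</td><td bgcolor="#55aa2a">&nbsp;--:--:--&nbsp;</td></tr>
<tr><td>&nbsp;ITEM_3&nbsp;</td><td>&nbsp;36&nbsp;</td><td bgcolor="#d42a2a">&nbsp;--:--:--&nbsp;</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(HTML, "html.parser")
red_rows = [tr for tr in soup.find_all("tr") if tr.find("td", bgcolor="#d42a2a")]

# One list of clean cell strings per matching row; strip=True also removes
# the non-breaking spaces produced by &nbsp;.
data = [[td.get_text(strip=True) for td in tr.find_all("td")] for tr in red_rows]

# Alternatively, one string per row with a separator between the cells.
lines = [tr.get_text(" ", strip=True) for tr in red_rows]

print(data)
print(lines)
```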
    【Solution 2】:

    You can use any:

    from bs4 import BeautifulSoup as soup
    d = soup(content, 'html.parser')
    results = [i for i in d.find_all('tr') if any(c.attrs.get('bgcolor') == "#d42a2a" for c in i.find_all('td'))]
    

    Output:

    [<tr bgcolor="#ffffff">
      <td height="25" nowrap="NOWRAP"> ITEM_3 </td>
      <td height="25" nowrap="NOWRAP"> 07:59:02 </td>
      <td height="25" nowrap="NOWRAP"> Connected </td>
      <td height="25" nowrap="NOWRAP"> 0 </td>
      <td height="25" nowrap="NOWRAP"> 36 </td>
      <td height="25" nowrap="NOWRAP"> 0 </td>
      <td height="25" nowrap="NOWRAP"> 36 </td>
      <td height="25" nowrap="NOWRAP"> 07:58:01 </td>
      <td bgcolor="#d42a2a" height="25" nowrap="NOWRAP"> --:--:-- </td>
      <td height="25" nowrap="NOWRAP"> 0 </td>
      <td height="25" nowrap="NOWRAP"> 0 </td>
      <td height="25" nowrap="NOWRAP"> 03:10:00  </td>
      <td height="25" nowrap="NOWRAP">  22:00:00 </td>
     </tr>]
    

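    【Discussion】:

    • You can avoid find_all() + any(), since you can do the .find() check directly.
    • This works perfectly! I chose the other one as the answer because it is a one-liner.
    • @swagless_monk Glad to help! Mind you, this is a one-liner too. In alecxe's answer, table_body is a BeautifulSoup object, e.g. soup(content, 'html.parser'), which has to be initialized on a separate line. If you really want a single-line solution, replace d with soup(content, 'html.parser') in the comprehension.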
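To make the first comment concrete, the sketch below runs the any() version and the .find() version side by side on a small sample and checks that they select the same rows. The shortened `HTML` sample and the names `RED`, `with_any` and `with_find` are assumptions for illustration. One difference worth knowing: .find() searches all descendants of the row, whereas the any() version as written also descends via find_all(), so both behave the same here.

```python
from bs4 import BeautifulSoup

HTML = """
<table><tbody>
<tr bgcolor="#ffffff"><td>ITEM_2</td><td bgcolor="#55aa2a">--:--:--</td></tr>
<tr bgcolor="#ffffff"><td>ITEM_3</td><td bgcolor="#d42a2a">--:--:--</td></tr>
</tbody></table>
"""

RED = "#d42a2a"
soup = BeautifulSoup(HTML, "html.parser")

# Version from the answer: inspect every cell and stop at the first red one.
with_any = [
    tr for tr in soup.find_all("tr")
    if any(td.attrs.get("bgcolor") == RED for td in tr.find_all("td"))
]

# Version from the comment: let .find() do the attribute check.
with_find = [tr for tr in soup.find_all("tr") if tr.find("td", bgcolor=RED)]

same = len(with_any) == len(with_find) and all(
    a is b for a, b in zip(with_any, with_find)
)
print(same)
```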
    【Solution 3】:

    Find every td that has bgcolor="#d42a2a", then take its .parent:

    cells = table_body.find_all('td', bgcolor="#d42a2a")
    for cell in cells:
        print(cell.parent) 
        # <tr>...<td bgcolor="#d42a2a">...</tr>
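If a row can contain more than one red cell, the loop above visits that row once per cell. The following sketch is one possible way to collect each row only once; the shortened `HTML` sample with two red cells in the same row is an assumption for illustration. It uses find_parent('tr') instead of .parent so that it still reaches the row if a cell's content is wrapped in extra tags, and it compares rows by identity because two distinct rows with identical markup compare as equal.

```python
from bs4 import BeautifulSoup

# Made-up sample: the first row has two red cells.
HTML = """
<table><tbody>
<tr><td>ITEM_3</td><td bgcolor="#d42a2a">--:--:--</td><td bgcolor="#d42a2a">0</td></tr>
<tr><td>ITEM_5</td><td bgcolor="#55aa2a">--:--:--</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(HTML, "html.parser")

rows = []
for cell in soup.find_all("td", bgcolor="#d42a2a"):
    row = cell.find_parent("tr")
    # Skip rows already collected; identity check avoids merging look-alike rows.
    if row is not None and not any(row is seen for seen in rows):
        rows.append(row)

print([row.td.get_text(strip=True) for row in rows])
```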
    

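    【Discussion】:

    • Did you test it, and is that the result you get?
    • I am sure the problem is not your code but my environment. Let me edit a bit more.
    • Works great, thanks! I chose @alecxe's as the answer because it is a one-liner.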