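【Question title】: Python Webpage Scraping of HTML Tag
【Posted】: 2019-12-04 01:08:45
【Question】:

I'm new to Python and am trying to scrape a table from a web page, but I'm not getting the value of any column. Below is a sample of the td tags from a single tr.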

<td class="Column" style="width:200px;"><span id="ctl00_MainContent_Value_ctl1543_Row_Name">email</span></td>
<td class="Column" style="width:125px"><input class="sf d" data="38_4255_1" id="ctl00_MainContent_Value_ctl1543_Row_SF_ctl00_SFV" maxlength="10" name="ctl00$MainContent$Value$ctl1543$Row$SF$ctl00$SFV" type="text" value="0.3500"/></td>
<td class="Column" style="width:125px"><input class="sf d" data="38_4255_276" id="ctl00_MainContent_Value_ctl1543_Row_SF_ctl01_SFV" maxlength="10" name="ctl00$MainContent$Value$ctl1543$Row$SF$ctl01$SFV" type="text" value="0.3500"/></td>
<td class="Column" style="width:125px"><input class="sf d" data="38_4255_2" id="ctl00_MainContent_Value_ctl1543_Row_SF_ctl02_SFV" maxlength="10" name="ctl00$MainContent$Value$ctl1543$Row$SF$ctl02$SFV" type="text" value="0.3500"/></td>
<td class="Column" style="width:125px"><input class="sf d" data="38_4255_5" id="ctl00_MainContent_Value_ctl1543_Row_SF_ctl03_SFV" maxlength="10" name="ctl00$MainContent$Value$ctl1543$Row$SF$ctl03$SFV" type="text" value="0.3500"/></td>
<td class="Column" style="width:125px"><input class="sf d" data="38_4255_3" id="ctl00_MainContent_Value_ctl1543_Row_SF_ctl04_SFV" maxlength="10" name="ctl00$MainContent$Value$ctl1543$Row$SF$ctl04$SFV" type="text" value="0.3500"/></td>
<td class="Column" style="width:125px"><input class="sf d" data="38_4255_7" id="ctl00_MainContent_Value_ctl1543_Row_SF_ctl05_SFV" maxlength="10" name="ctl00$MainContent$Value$ctl1543$Row$SF$ctl05$SFV" type="text" value="0.3500"/></td>
<td class="Column" style="width:125px"><input class="sf d" data="38_4255_4" id="ctl00_MainContent_Value_ctl1543_Row_SF_ctl06_SFV" maxlength="10" name="ctl00$MainContent$Value$ctl1543$Row$SF$ctl06$SFV" type="text" value="0.3500"/></td>
<td class="Column" style="width:125px"><input class="sf d" data="38_4255_6" id="ctl00_MainContent_Value_ctl1543_Row_SF_ctl07_SFV" maxlength="10" name="ctl00$MainContent$Value$ctl1543$Row$SF$ctl07$SFV" type="text" value="0.3500"/></td>
soup = bs(html_content, 'html.parser')

table_rows = soup.find_all('tr')

for tr in table_rows:
    td = tr.find_all('td')
    value = td.find('value')
    row = [i.value for i in td]
    print(row)

I've tried many different approaches, but I can't figure out how to extract the information from the value attribute.

【Comments】:

  • Note: <input> tags do not use and do not need a closing slash, and never have in HTML.

Tags: html python-3.x web-scraping html-table


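【Solution 1】:

You are trying to read the value attribute from the <td> tag, but the attribute belongs to the <input> tag inside it. Select the <input> tags rather than the <td> tags.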
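
To make the difference concrete, here is a minimal sketch of why the question's lookups come back empty. The one-cell HTML string is a hypothetical sample modelled on the question's markup, and the variable names (cells, cell, field) are mine, not the asker's.

```python
from bs4 import BeautifulSoup

# Hypothetical one-cell sample modelled on the question's markup.
html = ('<table><tr><td class="Column">'
        '<input data="38_4255_1" type="text" value="0.3500"/>'
        '</td></tr></table>')
soup = BeautifulSoup(html, 'html.parser')

cells = soup.find_all('td')   # a list-like ResultSet, not a single tag
cell = cells[0]
field = cell.find('input')

# A ResultSet has no find(); calling it raises AttributeError.
try:
    cells.find('value')
except AttributeError as exc:
    print('find() on a ResultSet fails:', type(exc).__name__)

print(cell.get('value'))      # None: the <td> itself has no value attribute
print(field.value)            # None: dot access searches for a child <value> tag
print(field['value'])         # 0.3500: subscripting reads the attribute
```

In short, tag.name looks for a child tag, while tag['name'] or tag.get('name') reads an attribute.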

This script selects every <input> tag inside a <td> and prints the contents of its data and value attributes:

html_content = '''<td class="Column" style="width:200px;"><span id="ctl00_MainContent_Value_ctl1543_Row_Name">email</span></td>
<td class="Column" style="width:125px"><input class="sf d" data="38_4255_1" id="ctl00_MainContent_Value_ctl1543_Row_SF_ctl00_SFV" maxlength="10" name="ctl00$MainContent$Value$ctl1543$Row$SF$ctl00$SFV" type="text" value="0.3500"/></td>
<td class="Column" style="width:125px"><input class="sf d" data="38_4255_276" id="ctl00_MainContent_Value_ctl1543_Row_SF_ctl01_SFV" maxlength="10" name="ctl00$MainContent$Value$ctl1543$Row$SF$ctl01$SFV" type="text" value="0.3500"/></td>
<td class="Column" style="width:125px"><input class="sf d" data="38_4255_2" id="ctl00_MainContent_Value_ctl1543_Row_SF_ctl02_SFV" maxlength="10" name="ctl00$MainContent$Value$ctl1543$Row$SF$ctl02$SFV" type="text" value="0.3500"/></td>
<td class="Column" style="width:125px"><input class="sf d" data="38_4255_5" id="ctl00_MainContent_Value_ctl1543_Row_SF_ctl03_SFV" maxlength="10" name="ctl00$MainContent$Value$ctl1543$Row$SF$ctl03$SFV" type="text" value="0.3500"/></td>
<td class="Column" style="width:125px"><input class="sf d" data="38_4255_3" id="ctl00_MainContent_Value_ctl1543_Row_SF_ctl04_SFV" maxlength="10" name="ctl00$MainContent$Value$ctl1543$Row$SF$ctl04$SFV" type="text" value="0.3500"/></td>
<td class="Column" style="width:125px"><input class="sf d" data="38_4255_7" id="ctl00_MainContent_Value_ctl1543_Row_SF_ctl05_SFV" maxlength="10" name="ctl00$MainContent$Value$ctl1543$Row$SF$ctl05$SFV" type="text" value="0.3500"/></td>
<td class="Column" style="width:125px"><input class="sf d" data="38_4255_4" id="ctl00_MainContent_Value_ctl1543_Row_SF_ctl06_SFV" maxlength="10" name="ctl00$MainContent$Value$ctl1543$Row$SF$ctl06$SFV" type="text" value="0.3500"/></td>
<td class="Column" style="width:125px"><input class="sf d" data="38_4255_6" id="ctl00_MainContent_Value_ctl1543_Row_SF_ctl07_SFV" maxlength="10" name="ctl00$MainContent$Value$ctl1543$Row$SF$ctl07$SFV" type="text" value="0.3500"/></td>'''

from bs4 import BeautifulSoup as bs

soup = bs(html_content, 'html.parser')

for i in soup.select('td input'):
    print(i['data'], i['value'])

Prints:

38_4255_1 0.3500
38_4255_276 0.3500
38_4255_2 0.3500
38_4255_5 0.3500
38_4255_3 0.3500
38_4255_7 0.3500
38_4255_4 0.3500
38_4255_6 0.3500
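
If you need the numbers rather than printed strings, one possible follow-up is to collect them into a dict keyed by the data attribute. This is only a sketch: the shortened HTML and the third, attribute-less <input> are hypothetical, added to show that .get() returns None instead of raising KeyError the way field['value'] would.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical sample; the last <input> deliberately lacks
# the data and value attributes.
html = '''<table><tr>
<td><input data="38_4255_1" type="text" value="0.3500"/></td>
<td><input data="38_4255_276" type="text" value="0.3500"/></td>
<td><input type="text"/></td>
</tr></table>'''
soup = BeautifulSoup(html, 'html.parser')

values = {}
for field in soup.select('td input'):
    key = field.get('data')    # None when the attribute is missing
    raw = field.get('value')
    if key is None or raw is None:
        continue               # skip inputs that carry no usable data
    values[key] = float(raw)   # attribute values are always strings

print(values)
```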

Edit: to select the column name as well:

html_content = '''<tr><td class="Column" style="width:200px;"><span id="ctl00_MainContent_Value_ctl1543_Row_Name">email</span></td>
<td class="Column" style="width:125px"><input class="sf d" data="38_4255_1" id="ctl00_MainContent_Value_ctl1543_Row_SF_ctl00_SFV" maxlength="10" name="ctl00$MainContent$Value$ctl1543$Row$SF$ctl00$SFV" type="text" value="0.3500"/></td>
<td class="Column" style="width:125px"><input class="sf d" data="38_4255_276" id="ctl00_MainContent_Value_ctl1543_Row_SF_ctl01_SFV" maxlength="10" name="ctl00$MainContent$Value$ctl1543$Row$SF$ctl01$SFV" type="text" value="0.3500"/></td>
<td class="Column" style="width:125px"><input class="sf d" data="38_4255_2" id="ctl00_MainContent_Value_ctl1543_Row_SF_ctl02_SFV" maxlength="10" name="ctl00$MainContent$Value$ctl1543$Row$SF$ctl02$SFV" type="text" value="0.3500"/></td>
<td class="Column" style="width:125px"><input class="sf d" data="38_4255_5" id="ctl00_MainContent_Value_ctl1543_Row_SF_ctl03_SFV" maxlength="10" name="ctl00$MainContent$Value$ctl1543$Row$SF$ctl03$SFV" type="text" value="0.3500"/></td>
<td class="Column" style="width:125px"><input class="sf d" data="38_4255_3" id="ctl00_MainContent_Value_ctl1543_Row_SF_ctl04_SFV" maxlength="10" name="ctl00$MainContent$Value$ctl1543$Row$SF$ctl04$SFV" type="text" value="0.3500"/></td>
<td class="Column" style="width:125px"><input class="sf d" data="38_4255_7" id="ctl00_MainContent_Value_ctl1543_Row_SF_ctl05_SFV" maxlength="10" name="ctl00$MainContent$Value$ctl1543$Row$SF$ctl05$SFV" type="text" value="0.3500"/></td>
<td class="Column" style="width:125px"><input class="sf d" data="38_4255_4" id="ctl00_MainContent_Value_ctl1543_Row_SF_ctl06_SFV" maxlength="10" name="ctl00$MainContent$Value$ctl1543$Row$SF$ctl06$SFV" type="text" value="0.3500"/></td>
<td class="Column" style="width:125px"><input class="sf d" data="38_4255_6" id="ctl00_MainContent_Value_ctl1543_Row_SF_ctl07_SFV" maxlength="10" name="ctl00$MainContent$Value$ctl1543$Row$SF$ctl07$SFV" type="text" value="0.3500"/></td></tr>'''

from bs4 import BeautifulSoup as bs

soup = bs(html_content, 'html.parser')

for row in soup.select('tr'):
    header = row.select_one('td').text
    print(header)
    for i in row.select('input'):
        print(i['data'], i['value'])

Prints:

email
38_4255_1 0.3500
38_4255_276 0.3500
38_4255_2 0.3500
38_4255_5 0.3500
38_4255_3 0.3500
38_4255_7 0.3500
38_4255_4 0.3500
38_4255_6 0.3500
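
The question was building a row list per tr, so the same idea can be sketched that way too. The two-row table below is hypothetical (the second row and its values are invented for illustration), and I'm assuming the first td of each row always holds the name.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table; only the first row resembles the question's data.
html = '''<table>
<tr><td><span>email</span></td>
<td><input data="38_4255_1" value="0.3500"/></td>
<td><input data="38_4255_276" value="0.3500"/></td></tr>
<tr><td><span>phone</span></td>
<td><input data="38_4256_1" value="0.1250"/></td>
<td><input data="38_4256_276" value="0.2000"/></td></tr>
</table>'''
soup = BeautifulSoup(html, 'html.parser')

rows = []
for tr in soup.find_all('tr'):
    cells = tr.find_all('td')
    if not cells:
        continue                                   # e.g. a header row of <th> only
    name = cells[0].get_text(strip=True)           # assumed: first cell is the name
    values = [field['value'] for field in tr.select('input[value]')]
    rows.append([name] + values)

for row in rows:
    print(row)
```

The input[value] selector only matches inputs that actually have a value attribute, so the subscript lookup cannot raise KeyError here.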
