【Question title】: Get href Attribute Link from td tag BeautifulSoup Python
【Posted】: 2013-05-24 10:40:52
【Question】:

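I'm new to Python, and someone suggested I use Beautiful Soup for scraping. I've run into a problem: I want to get the href attribute from the td tag in the 2nd column, based on the year in the 4th column.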

<table class="tableFile2" summary="Results">
         <tr>
            <th width="7%" scope="col">Filings</th>
            <th width="10%" scope="col">Format</th>
            <th scope="col">Description</th>
            <th width="10%" scope="col">Filing Date</th>
            <th width="15%" scope="col">File/Film Number</th>
         </tr>
<tr>
<td nowrap="nowrap">8-K</td>
<td nowrap="nowrap"><a href="/Archives/edgar/data/320193/000119312513199324/0001193125-13-199324-index.htm" id="documentsbutton">&nbsp;Documents</a></td>
<td class="small" >Current report, items 8.01 and 9.01
<br />Acc-no: 0001193125</td>
            <td>2013-05-03</td>
            <td nowrap="nowrap"><a href="/cgi-bin/browse-edgar?action=getcompany&amp;filenum=000-10030&amp;owner=include&amp;count=40">000-10030</a><br>13813281         </td>
         </tr>
<tr class="blueRow">
<td nowrap="nowrap">424B2</td>
<td nowrap="nowrap"><a href="/Archives/edgar/data/320193/000119312513191849/0001193125-13-191849-index.htm" id="documentsbutton">&nbsp;Documents</a></td>
<td class="small" >Prospectus [Rule 424(b)(2)]<br />Acc-no: 0001193125</td>
            <td>2013-05-01</td>
            <td nowrap="nowrap"><a href="/cgi-bin/browse-edgar?action=getcompany&amp;filenum=333-188191&amp;owner=include&amp;count=40">333-188191</a><br>13802405         </td>
         </tr>
<tr>
<td nowrap="nowrap">FWP</td>
<td nowrap="nowrap"><a href="/Archives/edgar/data/320193/000119312513189053/0001193125-13-189053-index.htm" id="documentsbutton">&nbsp;Documents</a></td>
<td class="small" >Filing under Securities Act Rules 163/433 of free writing prospectuses<br />Acc-no: 0001193125-13-189053&nbsp;(34 Act)&nbsp; Size: 52 KB            </td>
            <td>2013-05-01</td>
            <td nowrap="nowrap"><a href="/cgi-bin/browse-edgar?action=getcompany&amp;filenum=333-188191&amp;owner=include&amp;count=40">333-188191</a><br>13800170         </td>
         </tr>
</table>



table = soup.find('table', class="tableFile2")

rows = table.findAll('tr')
for tr in rows:
  cols = tr.findAll('td')
  if "2013" in cols[3]
    link = cols[1].find('a').get('href')
  print

【Comments】:

  • So you want the data from the Format and Filing Date columns?

Tags: python beautifulsoup


【Answer 1】:

This works for me in Python 2.7:

table = soup.find('table', {'class': 'tableFile2'})
rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    if len(cols) >= 4 and "2013" in cols[3].text:
        link = cols[1].find('a').get('href')
        print link

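Your earlier code had a few problems:

  1. class is a reserved word in Python, so soup.find('table', class="tableFile2") is a syntax error. Pass a dictionary of attributes instead (for example, {'class': 'tableFile2'}).
  2. Not every cols list has at least 4 entries (the header row contains only th cells, so its cols list is empty), so you need to check the length before indexing cols[3].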
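
For readers on Python 3, the same idea can be sketched with the bs4 package. This is a minimal sketch, not the answerer's code: it assumes bs4 is installed, the function name links_for_year is hypothetical, and the HTML is a trimmed copy of two rows from the question. It uses class_ (the keyword Beautiful Soup 4 provides to sidestep the reserved word) and compares against the cell's text rather than the tag object.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question (header row plus two data rows).
HTML = """
<table class="tableFile2" summary="Results">
  <tr>
    <th>Filings</th><th>Format</th><th>Description</th>
    <th>Filing Date</th><th>File/Film Number</th>
  </tr>
  <tr>
    <td>8-K</td>
    <td><a href="/Archives/edgar/data/320193/000119312513199324/0001193125-13-199324-index.htm">&nbsp;Documents</a></td>
    <td>Current report, items 8.01 and 9.01</td>
    <td>2013-05-03</td>
    <td>000-10030</td>
  </tr>
  <tr class="blueRow">
    <td>424B2</td>
    <td><a href="/Archives/edgar/data/320193/000119312513191849/0001193125-13-191849-index.htm">&nbsp;Documents</a></td>
    <td>Prospectus [Rule 424(b)(2)]</td>
    <td>2013-05-01</td>
    <td>333-188191</td>
  </tr>
</table>
"""


def links_for_year(html, year):
    """Return the 2nd-column hrefs of rows whose 4th-column date starts with `year`.

    Hypothetical helper for illustration; not part of Beautiful Soup.
    """
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="tableFile2")
    links = []
    for tr in table.find_all("tr"):
        cols = tr.find_all("td")
        # The header row has only <th> cells, so cols is empty there.
        if len(cols) < 4:
            continue
        # Compare against the cell's text, not the Tag object itself.
        if cols[3].get_text(strip=True).startswith(year):
            anchor = cols[1].find("a")
            if anchor is not None and anchor.get("href"):
                links.append(anchor["href"])
    return links


print(links_for_year(HTML, "2013"))
```

Matching with startswith instead of a substring test is a design choice of this sketch: it avoids picking up a row whose date merely contains the digits elsewhere. The returned hrefs are relative paths, exactly as they appear in the page.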

【Comments】:
