【Question title】: Beautifulsoup span id tags to pandas
【Posted】: 2016-03-09 00:25:08
【Question】:

I have the following HTML:

</tr><tr>
<td>
<span id="Grid_exdate_43">2/15/2005</span>
</td><td>Cash</td><td>
<span id="Grid_CashAmount_43">0.08</span>
</td><td>
<span id="Grid_DeclDate_43">--</span>
</td><td>
<span id="Grid_RecDate_43">2/17/2005</span>
</td><td>
<span id="Grid_PayDate_43">3/10/2005</span>
</td>
</tr><tr>
<td>
<span id="Grid_exdate_44">11/15/2004</span>
</td><td>Cash</td><td>
<span id="Grid_CashAmount_44">3.08</span>
</td><td>
<span id="Grid_DeclDate_44">--</span>
</td><td>
<span id="Grid_RecDate_44">11/17/2004</span>
</td><td>
<span id="Grid_PayDate_44">12/2/2004</span>
</td>
</tr><tr>

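Each section has the same five items: Grid_exdate, Grid_CashAmount, Grid_DeclDate, Grid_RecDate, and Grid_PayDate. Every id in a section ends with an integer that increments from one section to the next. The example above shows sections 43 and 44.

I need to save each section as one row of a pandas DataFrame. The DataFrame should look like this: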

Grid_exdate   Grid_CashAmount   Grid_DeclDate   Grid_RecDate   Grid_PayDate
2/15/2005     0.08              --              2/17/2005      3/10/2005
11/15/2004    3.08              --              11/17/2004     12/2/2004

I don't know how to do this.

Edit:

OK, I have worked out something that should work:

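import re
import pandas as pd

def get_exdate(id):
    # Match any id attribute that contains "Grid_exdate_"
    return id and re.compile("Grid_exdate_").search(id)

df = pd.DataFrame()
exdate_list = []
# soup is the BeautifulSoup object built from the page
for link in soup.find_all(id=get_exdate):
    exdate_list.append(link.string)

df['Grid_exdate'] = exdate_list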

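So the code above uses a regular expression to find every Grid_exdate_ value, appends the results to a list, and then adds that list to the DataFrame as a column.

I will just create five of these, one per field. If anyone has a better solution, please let me know (this is probably not a very efficient approach). Otherwise this should do the job.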

【Comments】:

  • Why doesn't read_html solve your problem?

Tags: python pandas beautifulsoup


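【Solution 1】:

You can use pandas read_html. From the documentation:

This function searches for <table> elements and only for <tr> and <th> rows and <td> elements within each <tr> or <th> element in the table. <td> stands for "table data".

So before using your file, you need to wrap it in <table> tags: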

<table>
your html
</table>

Then take the first element, because read_html reads the tables in the HTML into a list:

df = pd.read_html('file.html')

In [444]: df[0]
Out[444]:
            0     1     2   3           4          5
0   2/15/2005  Cash  0.08  --   2/17/2005  3/10/2005
1  11/15/2004  Cash  3.08  --  11/17/2004  12/2/2004

Edit

If you want to rename the columns:

df1 = df[0]
df1.columns = ["Grid_exdate", "Cash", "Grid_CashAmount", "Grid_DeclDate", "Grid_RecDate", "Grid_PayDate"]

You get a 'Cash' column because it is a separate table cell in your HTML:

In [494]: df1
Out[494]:
  Grid_exdate  Cash  Grid_CashAmount Grid_DeclDate Grid_RecDate Grid_PayDate
0   2/15/2005  Cash             0.08            --    2/17/2005    3/10/2005
1  11/15/2004  Cash             3.08            --   11/17/2004    12/2/2004

You can then drop the 'Cash' column or edit your original table:

In [496]: df1.drop('Cash', axis=1)
Out[496]:
  Grid_exdate  Grid_CashAmount Grid_DeclDate Grid_RecDate Grid_PayDate
0   2/15/2005             0.08            --    2/17/2005    3/10/2005
1  11/15/2004             3.08            --   11/17/2004    12/2/2004
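
If editing the file on disk is inconvenient, the same wrapping can be done in memory. The following is a minimal sketch, not a definitive implementation. It assumes the fragment consists of complete <tr> rows and that a parser backend for read_html (lxml, or bs4 with html5lib) is installed. The `fragment` string is illustrative sample data copied from the question.

```python
from io import StringIO

import pandas as pd

# Illustrative fragment: two complete rows with no surrounding <table>.
fragment = """
<tr>
<td><span id="Grid_exdate_43">2/15/2005</span></td><td>Cash</td>
<td><span id="Grid_CashAmount_43">0.08</span></td>
<td><span id="Grid_DeclDate_43">--</span></td>
<td><span id="Grid_RecDate_43">2/17/2005</span></td>
<td><span id="Grid_PayDate_43">3/10/2005</span></td>
</tr>
<tr>
<td><span id="Grid_exdate_44">11/15/2004</span></td><td>Cash</td>
<td><span id="Grid_CashAmount_44">3.08</span></td>
<td><span id="Grid_DeclDate_44">--</span></td>
<td><span id="Grid_RecDate_44">11/17/2004</span></td>
<td><span id="Grid_PayDate_44">12/2/2004</span></td>
</tr>
"""

# read_html only looks inside <table> elements, so wrap the rows first.
html = "<table>" + fragment + "</table>"

# read_html returns a list of DataFrames, one per table found.
tables = pd.read_html(StringIO(html))
df = tables[0]

# Name the six cells, then drop the plain-text 'Cash' cell.
df.columns = ["Grid_exdate", "Cash", "Grid_CashAmount",
              "Grid_DeclDate", "Grid_RecDate", "Grid_PayDate"]
df = df.drop(columns="Cash")
print(df)
```

Passing the markup through StringIO avoids writing a temporary file; the dates stay as strings here because read_html does not parse dates unless asked to.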


    【Solution 2】:

    If you don't want to use pandas read_html, you can parse the HTML yourself, which takes more work:

    import pandas as pd
    from bs4 import BeautifulSoup
    
    table = BeautifulSoup(open('test.html', 'r').read(), 'html.parser')
    
    # Collect the span ids of every row; the first non-empty row becomes the header
    h = [[td.span.get('id') for td in row.select('td') if td.span is not None]
         for row in table.find_all('tr')]
    # Remove empty lists
    h = [x for x in h if x != []]
    header = h[0]
    print(header)
    ['Grid_exdate_43', 'Grid_CashAmount_43', 'Grid_DeclDate_43', 'Grid_RecDate_43', 'Grid_PayDate_43']
    
    #if generating header is problematic, you can specify them
    #header = ['Grid_exdate', 'Grid_CashAmount', 'Grid_DeclDate', 'Grid_RecDate', 'Grid_PayDate' ]
    
    # Get the table content, skipping any td whose text is 'Cash'
    body = [[td.text.strip() for td in row.select('td') if td.text.strip() != 'Cash']
            for row in table.find_all('tr')]
    #remove empty lists
    body = [x for x in body if x != []]                 
    
    cols = zip(*body)
    
    tbl_d  = {name:col for name, col in zip(header,cols)}
    
    df = pd.DataFrame(tbl_d, columns = header)
    
    print(df)
      Grid_exdate_43 Grid_CashAmount_43 Grid_DeclDate_43 Grid_RecDate_43  \
    0      2/15/2005               0.08               --       2/17/2005   
    1     11/15/2004               3.08               --      11/17/2004   
    
      Grid_PayDate_43  
    0       3/10/2005  
    1       12/2/2004  
    
    #remove last 3 chars of column name
    #more rename info:
    #http://stackoverflow.com/questions/11346283/renaming-columns-in-pandas
    df.rename(columns=lambda x: x[:-3], inplace=True)
    #convert columns to datetime columns
    df['Grid_exdate'] = pd.to_datetime(df['Grid_exdate'])
    df['Grid_RecDate'] = pd.to_datetime(df['Grid_RecDate'])
    df['Grid_PayDate'] = pd.to_datetime(df['Grid_PayDate'])
    
    print(df)
    
      Grid_exdate Grid_CashAmount Grid_DeclDate Grid_RecDate Grid_PayDate
    0  2005-02-15            0.08            --   2005-02-17   2005-03-10
    1  2004-11-15            3.08            --   2004-11-17   2004-12-02
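
Stripping the last three characters only works while the numeric suffix has exactly two digits, and the Grid_DeclDate column is left unconverted because it holds "--". The sketch below shows one possible way to handle both cases, assuming the suffix always has the form "_<digits>" and the dates are month/day/year. The sample frame is illustrative data rebuilt from the output above.

```python
import re

import pandas as pd

# Illustrative frame with the suffixed column names from the parsing step.
df = pd.DataFrame({
    "Grid_exdate_43": ["2/15/2005", "11/15/2004"],
    "Grid_CashAmount_43": ["0.08", "3.08"],
    "Grid_DeclDate_43": ["--", "--"],
    "Grid_RecDate_43": ["2/17/2005", "11/17/2004"],
    "Grid_PayDate_43": ["3/10/2005", "12/2/2004"],
})

# Remove a trailing "_<digits>" of any length instead of a fixed 3 characters.
df = df.rename(columns=lambda name: re.sub(r"_\d+$", "", name))

# errors="coerce" turns values that are not dates, such as "--", into NaT.
date_cols = ["Grid_exdate", "Grid_DeclDate", "Grid_RecDate", "Grid_PayDate"]
for col in date_cols:
    df[col] = pd.to_datetime(df[col], format="%m/%d/%Y", errors="coerce")

# The scraped amounts are strings; convert them to floats.
df["Grid_CashAmount"] = pd.to_numeric(df["Grid_CashAmount"])

print(df.dtypes)
```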
    


      【Solution 3】:

      Thanks everyone for the suggestions. In the end I went with the following, which seemed to be the simplest solution:

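      import re
      import pandas as pd

      def get_exdate(id):
          # Match any id attribute that contains "Grid_exdate_"
          return id and re.compile("Grid_exdate_").search(id)

      df = pd.DataFrame()
      exdate_list = []
      # soup is the BeautifulSoup object built from the page
      for link in soup.find_all(id=get_exdate):
          exdate_list.append(link.string)

      df['Grid_exdate'] = exdate_list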
      

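      This uses re.compile to search the html/soup for every id that matches Grid_exdate_, then adds the results to the DataFrame. I simply created one re.compile search for each required field and added them all to the DataFrame under the correct column headers.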
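
The per-field searches can also be folded into a single pass over the spans. This is a rough sketch under stated assumptions rather than the poster's actual code: it assumes every span id has the form Grid_<field>_<number>, and the names `html`, `ID_PATTERN`, and `rows` are illustrative. Grouping by the numeric suffix keeps the values of one section in the same row even if a span is missing from some section.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup

# Illustrative sample markup copied from the question.
html = """
<tr>
<td><span id="Grid_exdate_43">2/15/2005</span></td><td>Cash</td>
<td><span id="Grid_CashAmount_43">0.08</span></td>
<td><span id="Grid_DeclDate_43">--</span></td>
<td><span id="Grid_RecDate_43">2/17/2005</span></td>
<td><span id="Grid_PayDate_43">3/10/2005</span></td>
</tr>
<tr>
<td><span id="Grid_exdate_44">11/15/2004</span></td><td>Cash</td>
<td><span id="Grid_CashAmount_44">3.08</span></td>
<td><span id="Grid_DeclDate_44">--</span></td>
<td><span id="Grid_RecDate_44">11/17/2004</span></td>
<td><span id="Grid_PayDate_44">12/2/2004</span></td>
</tr>
"""

# Assumed id layout: field name, underscore, section number.
ID_PATTERN = re.compile(r"^(Grid_[A-Za-z]+)_(\d+)$")

soup = BeautifulSoup(html, "html.parser")

# rows maps section number -> {field name: cell text}.
rows = {}
for span in soup.find_all("span", id=ID_PATTERN):
    field, number = ID_PATTERN.match(span["id"]).groups()
    rows.setdefault(int(number), {})[field] = span.get_text(strip=True)

# One DataFrame row per section, indexed by the section number.
df = pd.DataFrame.from_dict(rows, orient="index").sort_index()
print(df)
```

The 'Cash' cells are skipped automatically because they are not wrapped in a span with a matching id.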
