pandas read_html: clean up before or after reading

【Question title】: pandas read_html clean up before or after read
【Posted】: 2018-12-23 06:10:04
【Question description】:

I am trying to get the last table in this HTML into a data table.

Here is the code:

import pandas as pd

# read_html returns a list with one DataFrame per <table> found on the page
a = pd.read_html('https://www.sec.gov/Archives/edgar/data/1303652/000130365218000016/a991-01q12018.htm')
print(a[23])

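As you can see, it does read the table, but the result needs cleaning up. My question is for anyone with experience using this function: is it better to read the table and clean it up afterwards, or to clean it up before reading? If anyone knows how to do this, please post some code. Thanks.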

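【Question discussion】:

Can you share your desired output?
Don't you think a parser like BeautifulSoup would be more appropriate? You could parse the content into objects rather than a DataFrame, which would make it easier to get the result you want.
@GRipepi. The desired output is the table you see in the HTML, the last table on the page.
@iMad I have something using BeautifulSoup that only gets part of the data. I thought I could try it this way with pandas.

Tags: python html pandas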

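【Solution 1】:

It is always better to clean the raw data, because any processing step can introduce artifacts. Your HTML table is built with cell spans, which is why there is no general way to extract the data if you clean the DataFrame after the HTML has been parsed. So I suggest installing a small module made specifically for extracting data out of HTML tables: html-table-extractor. Run this on the command line: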

pip install html-table-extractor

After fetching the page's raw HTML (you will also need requests), process the table and remove the duplicate entries:

import requests
import pandas as pd
from collections import OrderedDict
from html_table_extractor.extractor import Extractor

pd.set_option('display.width', 400)
pd.set_option('display.max_colwidth', 100)
pd.set_option('display.max_rows', 30)
pd.set_option('display.max_columns', None)

# get raw html
resp = requests.get('https://www.sec.gov/Archives/edgar/data/1303652/000130365218000016/a991-01q12018.htm')

# find last table
beg = resp.text.rfind('<table')
end = resp.text.rfind('</table')
html = resp.text[beg:end+8]

# process table
ex = Extractor(html)
ex.parse()
list_of_lines = ex.return_list()

# now you have some columns with recurrent values
df_dirty = pd.DataFrame(list_of_lines)
# print(df_dirty)

## we need to consolidate some columns

# find column names
names_line = 2
col_names = OrderedDict()
# for each column find repetitions
for el in list_of_lines[names_line]:
    col_names[el] = [i for i, x in enumerate(list_of_lines[names_line]) if x == el]

# now consolidate repetitive values
storage = OrderedDict() # this will contain columns
for k in col_names:
    res = []
    for line in list_of_lines[names_line+1:]:  # first 2 lines are empty, third is column names
        joined = [] # <- this list will accumulate *unique* values to become a single cell
        for idx in col_names[k]:
            el = line[idx]
            if joined and joined[-1]==el:   # if value already exist, skip
                continue
            joined.append(el)   # add unique value to cell
        res.append(''.join(joined))   # add cell to column
    storage[k] = res   # add column to storage
df = pd.DataFrame(storage)
print(df)

This produces the following result, which is very close to the original:

                                                                                                        Q1`17                   Q2`17                   Q3`17                   Q4`17                 FY 2017                   Q1`18
0                                                                                      (Dollars in thousands)  (Dollars in thousands)  (Dollars in thousands)  (Dollars in thousands)  (Dollars in thousands)  (Dollars in thousands)
1                                                                                                 (Unaudited)             (Unaudited)             (Unaudited)             (Unaudited)             (Unaudited)             (Unaudited)
2                                                                    Customer metrics                                                                                                                                                
3                                                               Customer accounts (1)                 57,000+                 61,000+                 65,000+                 70,000+                 70,000+                 74,000+
4                                               Customer accounts added in period (1)                  3,300+                  4,000+                  4,100+                  4,700+                 16,100+                  3,900+
5                                                     Deals greater than $100,000 (2)                     294                     372                     337                     590                   1,593                     301
6   Customer accounts that purchased greater than $1 million during the quarter (1,2)                      10                      15                      13                      27                                              13
7                                                                                                                                                                                                                                    
8                                                    Annual recurring revenue metrics                                                                                                                                                
9                                                  Total annual recurring revenue (3)                $439,001                $483,578                $526,211                $596,244                $596,244                $641,946
10                                          Subscription annual recurring revenue (4)                 $71,950                $103,538                $139,210                $195,488                $195,488                $237,533
11                                                                                                                                                                                                                                   
12                                               Geographic revenue metrics - ASC 606                                                                                                                                                
13                                                           United States and Canada                       —                       —                       —                       —                       —                $167,799
14                                                                      International                       —                       —                       —                       —                       —                 $78,408
..                                                                                ...                     ...                     ...                     ...                     ...                     ...                     ...
23                                                                                                                                                                                                                                   
24                                               Additional revenue metrics - ASC 606                                                                                                                                                
25                                              Remaining performance obligations (5)                       —                       —                       —                       —                 $99,580                $114,523
26                                                                                                                                                                                                                                   
27                                               Additional revenue metrics - ASC 605                                                                                                                                                
28                                          Ratable revenue as % of total revenue (6)                     54%                     56%                     63%                     60%                     59%                     72%
29                          Ratable license revenue as % of total license revenue (7)                     19%                     23%                     34%                     34%                     28%                     54%
30                   Services revenues as a % of maintenance and services revenue (8)                     12%                     13%                     12%                     13%                     13%                     11%
31                                                                                                                                                                                                                                   
32                                                         Bookings metrics - ASC 605                                                                                                                                                
33                                        Ratable bookings as % of total bookings (2)                     55%                     61%                     65%                     70%                     64%                     72%
34                        Ratable license bookings as % of total license bookings (2)                     26%                     37%                     45%                     51%                     41%                     59%
35                                                                                                                                                                                                                                   
36                                                                      Other metrics                                                                                                                                                
37                                                                Worldwide employees                   3,193                   3,305                   3,418                   3,489                   3,489                   3,663
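
To see why a span-based layout leaves repeated values behind, here is a minimal sketch, using only the standard library, of a parser that copies each cell once for every column it spans. It is not how html-table-extractor is implemented; the class name SpanExpander and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class SpanExpander(HTMLParser):
    """Collect table rows, repeating each cell once per spanned column."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # row currently being built
        self._cell = None   # text fragments of the current cell
        self._span = 1      # colspan of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
            # a missing colspan attribute means the cell covers one column
            self._span = int(dict(attrs).get("colspan") or 1)

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            text = "".join(self._cell).strip()
            self._row.extend([text] * self._span)  # one copy per column
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical two-row table; the real SEC page is far larger.
SAMPLE_HTML = """
<table>
  <tr><td></td><td colspan="2">Q1`17</td><td colspan="2">Q2`17</td></tr>
  <tr><td>Deals</td><td colspan="2">294</td><td>372</td><td></td></tr>
</table>
"""

parser = SpanExpander()
parser.feed(SAMPLE_HTML)
rows = parser.rows
for row in rows:
    print(row)
```

Every row ends up with the same number of entries, but a value that spanned two columns now appears twice, which is the kind of repetition the consolidation loop in the answer removes.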

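【Discussion】:

This is great. I was trying to use pandas because your solution seems very specific and I was hoping for something more generic. But maybe such a thing doesn't exist.
On the contrary, the only parameter in my solution is the row number of the column names. Everything else is generic. I will update the code to emphasize that.
pandas is great, but it is not the right tool in this case. The visual layout of the HTML does not match the table structure. The DataFrame can be cleaned up, but each type of row would need to be handled separately.
Awesome! Can you explain the constant names_line? And the number 8, is that just the last table? Are there 9 tables?
names_line is the index of the line that contains the column labels. If you inspect list_of_lines, you will see the parsed table content line by line.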
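
The role of names_line can be illustrated on a toy input. The sketch below assumes a made-up list_of_lines (two empty lines, then the header at index 2) and applies the same merge rule as the answer's code: positions that share a header label are grouped, and within a group consecutive identical values collapse into one while differing fragments are joined. The variable names groups and table are illustrative only.

```python
from collections import OrderedDict

# Made-up parsed rows: two empty lines, then the header line at index 2.
list_of_lines = [
    ["", "", "", "", ""],
    ["", "", "", "", ""],
    ["", "Q1`17", "Q1`17", "Q2`17", "Q2`17"],
    ["Deals", "294", "294", "372", "372"],
    ["Ratable revenue", "54", "%", "56", "%"],
]
names_line = 2  # index of the line holding the column labels

# Group column positions by the header label they carry.
groups = OrderedDict()
for idx, name in enumerate(list_of_lines[names_line]):
    groups.setdefault(name, []).append(idx)

# Build one output column per label.
table = OrderedDict()
for name, indices in groups.items():
    column = []
    for line in list_of_lines[names_line + 1:]:
        parts = []
        for idx in indices:
            if parts and parts[-1] == line[idx]:
                continue  # same value repeated by a span: keep one copy
            parts.append(line[idx])
        column.append("".join(parts))  # fragments such as "54" + "%" are joined
    table[name] = column

print(table)
```

For a different table, the only value to change in this sketch is names_line, which matches the claim that the header row index is the only table-specific parameter.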

【Solution 2】:

The code below uses pd.read_html() to extract the tables from the website. The other parameters can be tuned further depending on the table format.

# Import libraries
import pandas as pd

# Read table
link = 'https://www.sec.gov/Archives/edgar/data/1303652/000130365218000016/a991-01q12018.htm'
a=pd.read_html(link, header=None, skiprows=1)

# Save the dataframe
df = a[23]

# Remove NaN rows/columns
col_list = df.iloc[1]
df = df.loc[4:,[0,1,3,5,7,9,11]] # adjusted column names 
df.columns =  col_list[:len(df.columns)]
df.head(7)

Note: empty cells in the original table are replaced with NaN.
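
As an alternative to picking rows and columns by position, entirely empty rows and columns can be dropped by content. This is a minimal sketch on a made-up DataFrame shaped like a read_html result with spacer columns; it is not the answer's code, and whether it is enough for the SEC table is an assumption that would need checking against the real page.

```python
import pandas as pd

nan = float("nan")

# Made-up frame: columns 2 and 4 and row 1 are spacers that hold only NaN.
raw = pd.DataFrame([
    ["Deals", 294.0, nan, 372.0, nan],
    [nan, nan, nan, nan, nan],
    ["Employees", 3193.0, nan, 3305.0, nan],
])

cleaned = (
    raw.dropna(axis=0, how="all")   # drop rows where every cell is NaN
       .dropna(axis=1, how="all")   # drop columns where every cell is NaN
       .reset_index(drop=True)
)
print(cleaned)
```

Using how="all" keeps rows that are only partly empty, such as section headings with a label but no values.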

(Image not preserved: the first few rows of the original table on the website.)

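【Discussion】:

My apologies. I have corrected the code. The values now match the original table added to the post.
It still drops a lot of values from the table.
@jason: I will see whether I can come back to this and update the code further.
Sorry, man, where do you get the 'data' variable from? I also can't make sense of the 'a' variable. Could you be more explicit? I think your code could help me a lot. Thanks.
Sorry, I updated the code above and changed 'data' to 'df'.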