清理原始数据总是更好,因为任何处理都可能引入伪影。您的 HTML 表格是使用 span 功能创建的,这就是为什么如果您在 HTML 解析后清理 DataFrame 就无法以通用方式提取数据的原因。所以我建议你安装一个专门用于此目的的小模块:extracting data out of HTML tables。在命令行中运行
pip install html-table-extractor
在获取页面的原始 HTML 之后(您还需要 requests),处理表格并清除重复条目:
import requests
import pandas as pd
from collections import OrderedDict
from html_table_extractor.extractor import Extractor
pd.set_option('display.width', 400)
pd.set_option('display.max_colwidth', 100)
pd.set_option('display.max_rows', 30)
pd.set_option('display.max_columns', None)
# get raw html
resp = requests.get('https://www.sec.gov/Archives/edgar/data/1303652/000130365218000016/a991-01q12018.htm')
# find last table
beg = resp.text.rfind('<table')
end = resp.text.rfind('</table')
html = resp.text[beg:end+8]
# process table
ex = Extractor(html)
ex.parse()
list_of_lines = ex.return_list()
# now you have some columns with recurrent values
df_dirty = pd.DataFrame(list_of_lines)
# print(df_dirty)
## we need to consolidate some columns
# find column names
names_line = 2
col_names = OrderedDict()
# for each column find repetitions
for el in list_of_lines[names_line]:
col_names[el] = [i for i, x in enumerate(list_of_lines[names_line]) if x == el]
# now consolidate repetitive values
storage = OrderedDict() # this will contain columns
for k in col_names:
res = []
for line in list_of_lines[names_line+1:]: # first 2 lines are empty, third is column names
joined = [] # <- this list will accumulate *unique* values to become a single cell
for idx in col_names[k]:
el = line[idx]
if joined and joined[-1]==el: # if value already exist, skip
continue
joined.append(el) # add unique value to cell
res.append(''.join(joined)) # add cell to column
storage[k] = res # add column to storage
df = pd.DataFrame(storage)
print(df)
这将产生以下结果,非常接近原始结果:
Q1`17 Q2`17 Q3`17 Q4`17 FY 2017 Q1`18
0 (Dollars in thousands) (Dollars in thousands) (Dollars in thousands) (Dollars in thousands) (Dollars in thousands) (Dollars in thousands)
1 (Unaudited) (Unaudited) (Unaudited) (Unaudited) (Unaudited) (Unaudited)
2 Customer metrics
3 Customer accounts (1) 57,000+ 61,000+ 65,000+ 70,000+ 70,000+ 74,000+
4 Customer accounts added in period (1) 3,300+ 4,000+ 4,100+ 4,700+ 16,100+ 3,900+
5 Deals greater than $100,000 (2) 294 372 337 590 1,593 301
6 Customer accounts that purchased greater than $1 million during the quarter (1,2) 10 15 13 27 13
7
8 Annual recurring revenue metrics
9 Total annual recurring revenue (3) $439,001 $483,578 $526,211 $596,244 $596,244 $641,946
10 Subscription annual recurring revenue (4) $71,950 $103,538 $139,210 $195,488 $195,488 $237,533
11
12 Geographic revenue metrics - ASC 606
13 United States and Canada — — — — — $167,799
14 International — — — — — $78,408
.. ... ... ... ... ... ... ...
23
24 Additional revenue metrics - ASC 606
25 Remaining performance obligations (5) — — — — $99,580 $114,523
26
27 Additional revenue metrics - ASC 605
28 Ratable revenue as % of total revenue (6) 54% 56% 63% 60% 59% 72%
29 Ratable license revenue as % of total license revenue (7) 19% 23% 34% 34% 28% 54%
30 Services revenues as a % of maintenance and services revenue (8) 12% 13% 12% 13% 13% 11%
31
32 Bookings metrics - ASC 605
33 Ratable bookings as % of total bookings (2) 55% 61% 65% 70% 64% 72%
34 Ratable license bookings as % of total license bookings (2) 26% 37% 45% 51% 41% 59%
35
36 Other metrics
37 Worldwide employees 3,193 3,305 3,418 3,489 3,489 3,663