【Posted】: 2021-11-26 15:00:20
【Question】:
Consider the following example, which contains DataFrame headers derived from a table I scraped:
headers = [
    '0 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan Name and Principal Position|',
    '1 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan nan',
    '2 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan Year|',
    '3 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan Year|',
    '4 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan nan',
    '5 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan Salary| ($)|',
    '6 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan Salary| ($)|',
    '7 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan nan',
    '8 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan Option Awards| ($)|',
    '9 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan Option Awards| ($)|',
    '10 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan nan',
    '11 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan Non-Equity Incentive Plan Compensation| ($)|',
    '12 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan Non-Equity Incentive Plan Compensation| ($)|',
    '13 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan nan',
    '14 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan Change in Pension Value and Nonqualified Deferred Compensation Earnings| ($)|',
    '15 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan Change in Pension Value and Nonqualified Deferred Compensation Earnings| ($)|',
    '16 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan nan',
    '17 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan All Other Compensation| ($)|',
    '18 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan All Other Compensation| ($)|',
    '19 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan nan',
    '20 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan Total| ($)|',
    '21 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan Total| ($)|',
]
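In this form the headers are very messy, so I want to standardize them. I am only interested in columns that contain certain keywords, so as a first step I filter for those with the following code: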
df = df.filter(regex='Name|Year|Salary|Bonus|Period')
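As a rough illustration of what this filter keeps: `DataFrame.filter(regex=...)` retains the labels for which `re.search` finds a match. The sketch below mimics that check in plain Python on a handful of hypothetical, already-clean column labels (these labels are assumptions for illustration, not the scraped headers).

```python
import re

# Same pattern as in the filter call above.
pattern = re.compile('Name|Year|Salary|Bonus|Period')

# Hypothetical clean labels, assumed only for this illustration.
labels = [
    "Name and Principal Position",
    "Year",
    "Salary ($)",
    "Option Awards ($)",
    "Total ($)",
]

# Keep a label if the pattern matches anywhere inside it.
kept = [label for label in labels if pattern.search(label)]
print(kept)  # ['Name and Principal Position', 'Year', 'Salary ($)']
```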
I then use this code to rename the strings in the header list based on the keywords and set them as the new headers:
headers = df.columns.values.tolist()
headers = ["Name" if "Name" in ele else ele for ele in headers]
headers = ["Year" if "Year" in ele else ele for ele in headers]
headers = ["Period" if "Period" in ele else ele for ele in headers]
headers = ["Salary" if "Salary" in ele else ele for ele in headers]
headers = ["Bonus" if "Bonus" in ele else ele for ele in headers]
df.columns = headers
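So whenever a header string contains the string "Year", it is simply renamed to "Year".
The code works fine as long as only one of "Name", "Year", "Period", "Salary", or "Bonus" occurs in a given header string.
However, in the example header list posted above, the keyword "Year" occurs in every string (every header contains "Fiscal Year End"), so my code would rename practically every column to "Year".
If a header contains more than one keyword, such as the following, which contains both "Year" and "Salary":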
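To make the failure concrete, here is a small reproduction on three of the headers from the list in the question, using the same chain of replacements written as a loop. The variable names `sample` and `renamed` are only for this illustration.

```python
# Three of the scraped headers from the question, used as a small sample.
sample = [
    '0 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan Name and Principal Position|',
    '6 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan Salary| ($)|',
    '21 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan Total| ($)|',
]

# Same order of checks as the renaming code in the question.
renamed = sample
for keyword in ["Name", "Year", "Period", "Salary", "Bonus"]:
    renamed = [keyword if keyword in ele else ele for ele in renamed]

print(renamed)  # ['Name', 'Year', 'Year']
```

The first header is caught by the earlier "Name" check, but the "Salary" and "Total" columns both collapse to "Year" because of the "Fiscal Year End" prefix.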
'6 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan Salary| ($)|'
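I want to check which terms have already been set as headers. If "Year" has already been set as a header, then "Salary" should be the new header for that column.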
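One possible reading of this rule, as a minimal sketch rather than a finished solution: collect the keywords found in a header, and if there are several, pick the one that has not been used yet. The helper `choose_header`, the `already_set` set, and the two shortened demo headers are hypothetical names and data introduced for illustration. Raising an error when the choice is ambiguous is an assumption about the intended behavior.

```python
KEYWORDS = ["Name", "Year", "Period", "Salary", "Bonus"]

def choose_header(header, already_set):
    """Return the keyword to use for `header`, preferring one not yet used."""
    hits = [kw for kw in KEYWORDS if kw in header]
    if not hits:
        return header          # no keyword found: leave the header unchanged
    if len(hits) == 1:
        return hits[0]         # unambiguous: exactly one keyword
    unused = [kw for kw in hits if kw not in already_set]
    if len(unused) == 1:
        return unused[0]       # the other keyword(s) are already taken
    raise ValueError(f"ambiguous header: {header!r}")

# Hypothetical, shortened headers (assumed for this demo only).
demo_headers = ["Year|", "Fiscal Year End| Salary| ($)|"]

already_set = set()
new_headers = []
for h in demo_headers:
    name = choose_header(h, already_set)
    already_set.add(name)
    new_headers.append(name)

print(new_headers)  # ['Year', 'Salary']
```

Under this reading, a header that contains two keywords before either has been assigned, such as the first header in the question ("Name" plus "Year"), raises `ValueError`. Columns that repeat a keyword already taken, such as the duplicated "Salary" columns, are not handled by this sketch either.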
【Discussion】:
-
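Do you mean that the last item in your replacement list should always be applied when a header has multiple hits? So in your example, if you had a header containing all 5 words, Bonus should be the replacement value because it is the last value you check for?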
-
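Thanks for your question, and sorry for the confusion: in almost all cases, the problematic headers should contain no more than two keywords, e.g. "Year" and "Salary", or "Year" and "Bonus". In that case the code should check whether one of the terms has already been set as a header on its own and treat the other term as the correct header. If neither term has been set as a header yet, I would like to receive an error message, since most likely one of the first columns contains only a single term (as in the example I posted in the question).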
-
Your first example has 2 potential headers, "Name" and "Year". You would get an error message right away. What about this edge case?
Tags: python pandas web-scraping