【Question title】: Rename strings in a Python list using string matching based on existing strings
【Posted】: 2021-11-26 15:00:20
【Question】:

Consider the following example, which contains dataframe headers based on a table I scraped:

headers = [
    '0 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan Name  and Principal Position|',
    '1 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan nan',
    '2 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan Year|',
    '3 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan Year|',
    '4 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan nan',
    '5 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan Salary| ($)|',
    '6 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan Salary| ($)|',
    '7 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan nan',
    '8 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan Option  Awards| ($)|',
    '9 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan Option  Awards| ($)|',
    '10 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan nan',
    '11 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan Non-Equity  Incentive Plan Compensation| ($)|',
    '12 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan Non-Equity  Incentive Plan Compensation| ($)|',
    '13 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan nan',
    '14 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan Change  in Pension Value and Nonqualified Deferred Compensation  Earnings| ($)|',
    '15 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan Change  in Pension Value and Nonqualified Deferred Compensation  Earnings| ($)|',
    '16 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan nan',
    '17 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan All  Other Compensation| ($)|',
    '18 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan All  Other Compensation| ($)|',
    '19 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan nan',
    '20 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan Total| ($)|',
    '21 Summary  Compensation Table| for  Fiscal Year End December 31, 2006| nan Total| ($)|',
]

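In this form the headers are very messy, so I would like to standardise them. I am only interested in columns that contain certain keywords, so as a first step I filter for them with the following code: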

df = df.filter(regex='Name|Year|Salary|Bonus|Period')

I am using this code to rename the strings in the header list based on keywords and set them as the new headers:

headers = df.columns.values.tolist()
headers = ["Name" if "Name" in ele else ele for ele in headers]
headers = ["Year" if "Year" in ele else ele for ele in headers]
headers = ["Period" if "Period" in ele else ele for ele in headers]
headers = ["Salary" if "Salary" in ele else ele for ele in headers]
headers = ["Bonus" if "Bonus" in ele else ele for ele in headers]
df.columns = headers

So whenever a header string contains the string "Year", it is simply renamed to "Year".

The code works fine as long as only one of "Name", "Year", "Period", "Salary" or "Bonus" occurs in a given header string.

However, in the example header list posted above, the keyword "Year" appears in every string (in every header), so my code would rename every string to "Year".
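A minimal reproduction of the problem, using three headers taken from the list above with the doubled spaces collapsed for brevity. The loop is just a compact form of the chained list comprehensions; the names `keywords` and `renamed` are only for illustration. The Salary column ends up labelled "Year" because the table title contains "Fiscal Year":

```python
headers = [
    "0 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan Name and Principal Position|",
    "2 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan Year|",
    "6 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan Salary| ($)|",
]

# Same checks, in the same order, as the chained list comprehensions.
keywords = ["Name", "Year", "Period", "Salary", "Bonus"]

renamed = list(headers)
for keyword in keywords:
    renamed = [keyword if keyword in ele else ele for ele in renamed]

print(renamed)  # ['Name', 'Year', 'Year'] -> the Salary column is mislabelled
```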

If a header contains several keywords, such as the following one, which contains both "Year" and "Salary":

'6 Summary Compensation Table| for Fiscal Year End December 31, 2006| nan Salary| ($)|'

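I would like to check which terms have already been set as headers. If "Year" has already been set as a header, then "Salary" should be the new header for that column.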

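【Comments】:

  • Do you mean that the last item in your replacement list should always be applied when a header has multiple hits? So in your example, if you had a header containing all 5 words, Bonus should be the replacement value because it is the last value you check?
  • Thanks for your question and sorry for the confusion: in almost all cases, the problematic headers should contain no more than two keywords, e.g. "Year" and "Salary", or "Year" and "Bonus". In that case, the code should check whether one of the terms has already been set as a header on its own and treat the other term as the correct header. If neither term has been set as a header yet, I would expect to get an error message, since most likely one of the first columns contains only a single term (as in the example I posted in the question).
  • Your first example has 2 potential headers, "Name" and "Year". You would immediately get an error message. What about this edge case?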

Tags: python pandas web-scraping


【Solution 1】:

You also have the case where both keywords have already been used as headers.

Try this:

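headers = df.columns.values.tolist()
# Keywords must match the capitalisation used in the headers ("Year", not "year").
words = ['Name', 'Year', 'Salary', 'Bonus', 'Period']
for i, header in enumerate(headers):
    for word in words:
        if word in header:
            headers[i] = word
            # Each keyword may be assigned only once.
            words.remove(word)
            break  # stop before iterating over the modified list
df.columns = headers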
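  • If a header contains 2 keywords and neither has been used as a header yet, the first one according to the order in words is chosen.
  • If both have already been used as headers, the header is left unchanged, so it still contains both words.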
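To see both cases in action without a dataframe, here is a small sketch of the same idea wrapped in a function. The function name `rename_headers` and the shortened sample headers are made up for illustration, and the keywords are assumed to be capitalised the way they appear in the headers:

```python
def rename_headers(headers, keywords):
    """Replace each header by the first still-unused keyword it contains."""
    remaining = list(keywords)  # copy, so the caller's list is not modified
    result = []
    for header in headers:
        for word in remaining:
            if word in header:
                result.append(word)
                remaining.remove(word)  # each keyword is used at most once
                break
        else:
            # No unused keyword matched: keep the header as it is.
            result.append(header)
    return result


# Hypothetical, shortened headers; every one contains "Year" via the title.
sample = [
    "Fiscal Year 2006| Name and Principal Position",
    "Fiscal Year 2006| Year",
    "Fiscal Year 2006| Salary ($)",
    "Fiscal Year 2006| Salary ($)",
]

renamed = rename_headers(sample, ["Name", "Year", "Salary", "Bonus", "Period"])
print(renamed)
# ['Name', 'Year', 'Salary', 'Fiscal Year 2006| Salary ($)']
```

The first header contains both "Name" and "Year" and gets "Name" because it comes first in the keyword list. The last header contains "Year" and "Salary", both already used, so it is left untouched and can be handled or dropped separately.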

【Comments】:
