【发布时间】:2021-05-27 00:25:17
【问题描述】:
我有一个相当大的数据集,包含超过 100k 行,其中包含许多重复项和一些缺失或错误的值。尝试在下面的 sn-p 中简化问题。
sampleData = {
'BI Business Name' : ['AAA', 'BBB', 'CCC', 'DDD','DDD'],
'BId Postcode' : ['NW1 8NZ', 'NW1 8NZ', 'WC2N 4AA', 'CV7 9JY', 'CV7 9JY'],
'BI Website' : ['www@1', 'www@1', 'www@2', 'www@3', np.nan],
'BI Telephone' : ['999', '999', '666', np.nan, '12345']
}
df = pd.DataFrame(sampleData)
我正在尝试根据重复行更改值,因此如果任何三个字段匹配,那么第四个字段也应该匹配。我应该得到这样的结果:
result = {
'BI Business Name' : ['AAA', 'AAA', 'CCC', 'DDD','DDD'],
'BId Postcode' : ['NW1 8NZ', 'NW1 8NZ', 'WC2N 4AA', 'CV7 9JY', 'CV7 9JY'],
'BI Website' : ['www@1', 'www@1', 'www@2', 'www@3', 'www@3'],
'BI Telephone' : ['999', '999', '666', '12345', '12345']
}
df = pd.DataFrame(result)
我找到了非常冗长的方法 - 这里只显示更改名称的部分。
df['Phone_code_web'] = df['BId Postcode'] + df['BI Website'] + df['BI Telephone']
reference_name = df[['BI Business Name', 'BI Telephone', 'BId Postcode','BI Website']]
reference_name = reference_name.dropna()
reference_name['Phone_code_web'] = reference_name['BId Postcode'] + reference_name['BI Website'] +
reference_name['BI Telephone']
duplicate_ref = reference_name[reference_name['Phone_code_web'].duplicated()]
reference_name = pd.concat([reference_name,duplicate_ref]).drop_duplicates(keep=False)
reference_name
def replace_name(row):
try:
old_name = row['BI Business Name']
reference = row['Phone_code_web']
new_name = reference_name[reference_name['Phone_code_web']==reference].iloc[0,0]
print(new_name)
return new_name
except Exception as e:
return old_name
df['BI Business Name']=df.apply(replace_name, axis=1)
df
有没有更简单的方法?
【问题讨论】:
标签: python-3.x dataframe duplicates user-defined-functions data-cleaning