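[Posted]: 2019-10-24 02:51:51
[Question]:
I'm having a hard time coming up with the code needed to do the following. There is a similar question (this one), but I couldn't work out how to adapt its code to my specific needs.
I have a pandas DataFrame with more than 100,000 rows. Here is the current format of the addresses and apartment numbers: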
Current DF:
import pandas as pd
temp = {'col1': ['220 CENTRAL STREET, 50', '165 EAST 66TH ST, RESI', '106 SPRUCE STREET, 1', '14 EAST 67TH STREET', '1131 OGEN AVENUE', '200 EAST 1ST STREET, RU', '520 PARK LANE', '520 PARK LANE', '80 BAY STREET LANDING, 1A', '520 PARK SOUTH, DPH54', '520 PARK LANE', '62 VEST STREET', '256 FLARIN AVENUE'], 'col2':['50', 'RESI', 'nan', 'nan', 'nan', '2A', 'DPH60', 'DPH56', '1A', 'DPH54', 'DPH52', '21F', 'nan']}
data = pd.DataFrame(temp)
data
    col1                       col2
0   220 CENTRAL STREET, 50     50
1   165 EAST 66TH ST, RESI     RESI
2   106 SPRUCE STREET, 1       nan
3   14 EAST 67TH STREET        nan
4   1131 OGEN AVENUE           nan
5   200 EAST 1ST STREET, RU    2A
6   520 PARK LANE              DPH60
7   520 PARK LANE              DPH56
8   80 BAY STREET LANDING, 1A  1A
9   520 PARK SOUTH, DPH54      DPH54
10  520 PARK LANE              DPH52
11  62 VEST STREET             21F
12  256 FLARIN AVENUE          nan
Desired DF (data1), which adds 3 new columns so that different levels of granularity are available later:
temp1 = {'col1': ['220 CENTRAL STREET, 50', '165 EAST 66TH ST, RESI', '106 SPRUCE STREET, 1', '14 EAST 67TH STREET', '1131 OGEN AVENUE', '200 EAST 1ST STREET, RU', '520 PARK LANE', '520 PARK LANE', '80 BAY STREET LANDING, 1A', '520 PARK SOUTH, DPH54', '520 PARK LANE', '62 VEST STREET', '256 FLARIN AVENUE'],
'col2':['50', 'RESI', 'nan', 'nan', 'nan', '2A', 'DPH60', 'DPH56', '1A', 'DPH54', 'DPH52', '21F', 'nan'],
'building_address':['220 CENTRAL STREET', '165 EAST 66TH ST', '106 SPRUCE STREET', '14 EAST 67TH STREET', '1131 OGEN AVENUE', '200 EAST 1ST STREET', '520 PARK LANE', '520 PARK LANE', '80 BAY STREET LANDING', '520 PARK SOUTH', '520 PARK LANE', '62 VEST STREET', '256 FLARIN AVENUE'],
'apt_no': ['50', 'RESI', '1', 'nan', 'nan', '2A', 'DPH60', 'DPH56', '1A', 'DPH54', 'DPH52', '21F', 'nan'],
'full_address':['220 CENTRAL STREET, 50', '165 EAST 66TH ST, RESI', '106 SPRUCE STREET, 1', '14 EAST 67TH STREET', '1131 OGEN AVENUE', '200 EAST 1ST STREET, 2A', '520 PARK LANE, DPH60', '520 PARK LANE, DPH56', '80 BAY STREET LANDING, 1A', '520 PARK SOUTH, DPH54', '520 PARK LANE, DPH52', '62 VEST STREET, 21F', '256 FLARIN AVENUE']}
data1 = pd.DataFrame(temp1)
data1
    col1                       col2   building_address       apt_no  \
0   220 CENTRAL STREET, 50     50     220 CENTRAL STREET     50
1   165 EAST 66TH ST, RESI     RESI   165 EAST 66TH ST       RESI
2   106 SPRUCE STREET, 1       nan    106 SPRUCE STREET      1
3   14 EAST 67TH STREET        nan    14 EAST 67TH STREET    nan
4   1131 OGEN AVENUE           nan    1131 OGEN AVENUE       nan
5   200 EAST 1ST STREET, RU    2A     200 EAST 1ST STREET    2A
6   520 PARK LANE              DPH60  520 PARK LANE          DPH60
7   520 PARK LANE              DPH56  520 PARK LANE          DPH56
8   80 BAY STREET LANDING, 1A  1A     80 BAY STREET LANDING  1A
9   520 PARK SOUTH, DPH54      DPH54  520 PARK SOUTH         DPH54
10  520 PARK LANE              DPH52  520 PARK LANE          DPH52
11  62 VEST STREET             21F    62 VEST STREET         21F
12  256 FLARIN AVENUE          nan    256 FLARIN AVENUE      nan

    full_address
0   220 CENTRAL STREET, 50
1   165 EAST 66TH ST, RESI
2   106 SPRUCE STREET, 1
3   14 EAST 67TH STREET
4   1131 OGEN AVENUE
5   200 EAST 1ST STREET, 2A
6   520 PARK LANE, DPH60
7   520 PARK LANE, DPH56
8   80 BAY STREET LANDING, 1A
9   520 PARK SOUTH, DPH54
10  520 PARK LANE, DPH52
11  62 VEST STREET, 21F
12  256 FLARIN AVENUE
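In the existing DF (data), col1 is the street address, which may or may not include an apartment number. To keep things simple, I assume that a value in col1 contains an apartment number whenever it contains a comma; the part after the comma can be treated as the apartment number.
col2 contains only the apartment number, and some of its entries are nan. In some cases, such as row 5, the apartment number in col2 ('2A') does not match the part after the comma in col1 ('RU'). In other cases, such as row 2, col2 is nan but col1 has an apartment number after the comma.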
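To make those two cases concrete, here is a minimal sketch that flags them on a small sample. It assumes the missing values in col2 are the literal string 'nan' (as in the dictionaries above) or real NaN; the names `suffix`, `col2_missing`, `mismatch` and `only_in_col1` are made up for illustration.

```python
import pandas as pd

# Small sample taken from the rows shown in the question.
sample = pd.DataFrame({
    'col1': ['220 CENTRAL STREET, 50', '106 SPRUCE STREET, 1',
             '200 EAST 1ST STREET, RU', '520 PARK LANE'],
    'col2': ['50', 'nan', '2A', 'DPH60'],
})

# Text after the first comma in col1; NaN when col1 has no comma.
suffix = sample['col1'].str.extract(r',\s*(.+)$', expand=False).str.strip()

# Assumption: a missing col2 is either real NaN or the string 'nan'.
col2_missing = sample['col2'].isna() | sample['col2'].eq('nan')

# col1 and col2 both carry an apartment number, but they disagree (like row 5).
mismatch = suffix.notna() & ~col2_missing & suffix.ne(sample['col2'])

# Only col1 carries an apartment number (like row 2).
only_in_col1 = suffix.notna() & col2_missing

print(sample[mismatch])
print(sample[only_in_col1])
```

Counting these two masks on the full 100k+ rows would show how often each case actually occurs before deciding on the fill rule.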
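What I want to do is add 3 new columns (shown in the desired DF, data1):
['building_address'] contains only what comes before the comma, so it would read '220 CENTRAL STREET' where col1 reads '220 CENTRAL STREET, 50'.
['apt_no'] first checks whether col2 is nan. If it is, it looks in col1 for a value after the comma and, if one is found, uses that value. For example, in row 2 of data1, apt_no takes the value '1', which comes from the part after the comma in col1. If col1 has a part after the comma and col2 also has a value, and the two differ, the value from col2 wins. For example, in row 5, apt_no is '2A', taken from col2, even though col1 shows 'RU' after the comma. Finally, if col1 has no comma and col2 is nan, 'apt_no' stays nan.
['full_address'] Finally, 'full_address' joins ['building_address'] and ['apt_no'] into one string in the format 'building_address, apt_no' (as shown above). If 'apt_no' is nan, 'full_address' is the same as 'col1'.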
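The three rules above could be sketched roughly as follows. This is only one possible reading of the requirements, not a tested solution for the full data set: the function name `add_address_columns` is hypothetical, missing values are assumed to be the string 'nan' or real NaN, and col1 is split on the first comma only.

```python
import numpy as np
import pandas as pd

def add_address_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with building_address, apt_no and full_address."""
    out = df.copy()

    # Split col1 on the first comma: left part is the building, right part (if any) the apartment.
    parts = out['col1'].str.split(',', n=1, expand=True)
    out['building_address'] = parts[0].str.strip()
    if 1 in parts.columns:
        from_col1 = parts[1].str.strip()
    else:
        # No row contained a comma at all.
        from_col1 = pd.Series(np.nan, index=out.index, dtype=object)

    # Assumption: the string 'nan' in col2 means "missing".
    from_col2 = out['col2'].mask(out['col2'].eq('nan'))

    # col2 wins when present; otherwise fall back to the part after the comma in col1.
    apt = from_col2.fillna(from_col1)

    # Keep col1 unchanged when there is no apartment, else build "building, apt".
    out['full_address'] = out['col1'].where(
        apt.isna(), out['building_address'] + ', ' + apt
    )
    # Write missing apartments back as the string 'nan' to match the desired output.
    out['apt_no'] = apt.fillna('nan')
    return out

sample = pd.DataFrame({
    'col1': ['220 CENTRAL STREET, 50', '106 SPRUCE STREET, 1', '14 EAST 67TH STREET',
             '200 EAST 1ST STREET, RU', '520 PARK LANE'],
    'col2': ['50', 'nan', 'nan', '2A', 'DPH60'],
})
result = add_address_columns(sample)
print(result[['building_address', 'apt_no', 'full_address']])
```

Everything here is vectorized string work, so it should be reasonable for 100k+ rows, although I have not timed it. Dropping the final `fillna('nan')` would leave real NaN values in apt_no instead of the string.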
I've been struggling with this for hours and still haven't found a way to do it. Thanks for looking.
[Comments]:
Tags: python regex pandas street-address