【Question title】: Matching a list against a column and extracting the matched value from the column
【Posted】: 2020-01-12 23:40:51
【Question description】:

I am having trouble matching a list against a DataFrame column and extracting the specific matched value from that column.

Dataset:

    address
0   58 Chatham Street, Chatham, New Jersey, 07928
1   3420 W. MacArthur Blvd. Ste. C, Santa Ana, California
2   2016 Chalk Rd, Wake Forest, North Carolina, 27587

I have a list containing state names:

state = ['New York','New Jersey','California',...]

Desired result:

    address                                                   State
0   58 Chatham Street, Chatham, New Jersey, 07928             New Jersey
1   3420 W. MacArthur Blvd. Ste. C, Santa Ana, California     California
2   2016 Chalk Rd, Wake Forest, North Carolina, 27587         North Carolina

The code I tried:

for i in state:
    ship_add['state'] = ship_add['address'].str.strip(i)
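
For context, a likely reason the attempt above does not return the state: `str.strip` treats its argument as a set of characters to trim from both ends, not as a substring to search for, and the loop overwrites the `state` column on every iteration. A minimal illustration with a plain Python string taken from the dataset (as far as I know, pandas' `Series.str.strip` applies the same behaviour element-wise):

```python
# str.strip(chars) trims any of the given characters from both ends;
# it does not look for the substring "California".
address = "3420 W. MacArthur Blvd. Ste. C, Santa Ana, California"
stripped = address.strip("California")

# The state name is removed from the end instead of being extracted.
print(repr(stripped))
```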

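【Comments on the question】:

  • You could split the values into new columns on the comma, since the position of the state is not fixed from row to row: df['address'].str.split(', ', expand=True)
  • What if you extract the last field that is not all digits? .str.extract(r'(\w[^,]*)(?:,\s*\d+)?$', expand=False)?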
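
The two suggestions in the comments above can be tried on the sample data as follows. This is only a sketch: the DataFrame is rebuilt from the three addresses in the question, and the variable names `parts` and `last_field` are made up for illustration. Note that the second suggestion does not consult the `state` list at all; it simply returns the last comma-separated field, skipping a trailing all-digit field such as a ZIP code.

```python
import pandas as pd

df = pd.DataFrame({
    "address": [
        "58 Chatham Street, Chatham, New Jersey, 07928",
        "3420 W. MacArthur Blvd. Ste. C, Santa Ana, California",
        "2016 Chalk Rd, Wake Forest, North Carolina, 27587",
    ]
})

# Suggestion 1: split each address on ", " into separate columns.
parts = df["address"].str.split(", ", expand=True)

# Suggestion 2: capture the last field, ignoring an optional trailing
# field that consists only of digits.
last_field = df["address"].str.extract(r"(\w[^,]*)(?:,\s*\d+)?$", expand=False)

print(parts)
print(last_field)
```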

Tags: python regex string pandas


【Solution 1】:

Try:

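state = ['New York','New Jersey','California','North Carolina']

def search_states(row):
    # With axis=1, apply passes each row as a Series, so row['address'] is a string.
    for i in state:
        if i in row['address']:
            row['states'] = i  # keep the first state found in the address
            break
    return row

df = df.apply(search_states, axis=1)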

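This approach is also said to be faster on larger data, though a row-wise `apply` is usually slower than vectorized string methods, so it is worth timing on your own data.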
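
A column-wise variant of the same idea is sketched below. It is not the answerer's code: the helper `first_state` and the extra unmatched address are hypothetical and added only for illustration. Applying the function to the `address` column alone avoids rebuilding every row and makes the no-match case explicit.

```python
import pandas as pd

state = ["New York", "New Jersey", "California", "North Carolina"]

def first_state(address, states=state):
    """Return the first listed state found in the address, or None."""
    for name in states:
        if name in address:
            return name
    return None

df = pd.DataFrame({
    "address": [
        "58 Chatham Street, Chatham, New Jersey, 07928",
        "3420 W. MacArthur Blvd. Ste. C, Santa Ana, California",
        "2016 Chalk Rd, Wake Forest, North Carolina, 27587",
        "1 Unknown Road, Nowhere",  # made-up row with no listed state
    ]
})

# Apply the helper to each address string in the column.
df["states"] = df["address"].apply(first_state)
print(df)
```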

【Comments】:

    【Solution 2】:

    Use:

    state = ['New York','New Jersey','California','North Carolina']
    
    # match whole words only (word boundaries)
    pat = '|'.join(r"\b{}\b".format(x) for x in state)
    # if word boundaries are not needed:
    # pat = '|'.join(state)
    df['State'] = df['address'].str.extract('('+ pat + ')', expand=False)
    print (df)
                                                 address           State
    0      58 Chatham Street, Chatham, New Jersey, 07928      New Jersey
    1  3420 W. MacArthur Blvd. Ste. C, Santa Ana, Cal...      California
    2  2016 Chalk Rd, Wake Forest, North Carolina, 27587  North Carolina
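
A slightly more defensive sketch of the same pattern is shown below. It assumes the list might one day contain names with regex metacharacters, so each name is passed through `re.escape` first; the fourth address is a made-up row with no listed state, included to show that `str.extract` leaves a missing value when nothing matches.

```python
import re
import pandas as pd

state = ["New York", "New Jersey", "California", "North Carolina"]

# Escape each name so it is matched literally, then require word boundaries.
pat = "|".join(r"\b{}\b".format(re.escape(x)) for x in state)

df = pd.DataFrame({
    "address": [
        "58 Chatham Street, Chatham, New Jersey, 07928",
        "3420 W. MacArthur Blvd. Ste. C, Santa Ana, California",
        "2016 Chalk Rd, Wake Forest, North Carolina, 27587",
        "1 Unknown Road, Nowhere",  # made-up row with no listed state
    ]
})

# The single capture group returns the matched state name per row.
df["State"] = df["address"].str.extract("(" + pat + ")", expand=False)
print(df)
```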
    

    If matching against the split values:

    state = ['New York','New Jersey','California','North Carolina']
    
    df1 = df['address'].str.split(', ', expand=True)
    df['State'] = df1.where(df1.isin(state)).ffill(axis=1).iloc[:, -1]
    print (df)
                                                 address           State
    0      58 Chatham Street, Chatham, New Jersey, 07928      New Jersey
    1  3420 W. MacArthur Blvd. Ste. C, Santa Ana, Cal...      California
    2  2016 Chalk Rd, Wake Forest, North Carolina, 27587  North Carolina
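
The one-liner above can be unpacked into steps to see what each call contributes. This is a sketch on the three sample addresses; the intermediate names `parts`, `mask` and `only_states` are introduced here for illustration only.

```python
import pandas as pd

state = ["New York", "New Jersey", "California", "North Carolina"]

df = pd.DataFrame({
    "address": [
        "58 Chatham Street, Chatham, New Jersey, 07928",
        "3420 W. MacArthur Blvd. Ste. C, Santa Ana, California",
        "2016 Chalk Rd, Wake Forest, North Carolina, 27587",
    ]
})

# 1. One column per comma-separated field; shorter rows are padded with missing values.
parts = df["address"].str.split(", ", expand=True)

# 2. True only where a field is exactly one of the listed state names.
mask = parts.isin(state)

# 3. Keep the matching fields and blank out everything else.
only_states = parts.where(mask)

# 4. Carry the match rightwards along each row, then read the last column.
df["State"] = only_states.ffill(axis=1).iloc[:, -1]
print(df)
```

Because the comparison is exact, a field with extra whitespace or a different spelling would not match and the row would end up with a missing value.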
    

    【Comments】:

    • Hi @jezrael, it gives me NaN values.
    • Hi @jezrael. My bad. I was getting NaN because there was a problem with my dataset. Yes, it works now. Thanks.
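
When NaN values show up as in the comments above, one way to investigate is to list the rows where nothing matched. The thread does not say what was wrong with the commenter's dataset, so the abbreviated and missing addresses below are made-up examples, and `unmatched` is a name chosen for this sketch.

```python
import pandas as pd

state = ["New York", "New Jersey"]

df = pd.DataFrame({
    "address": [
        "58 Chatham Street, Chatham, New Jersey, 07928",
        "58 Chatham Street, Chatham, NJ, 07928",  # hypothetical: abbreviation not in the list
        None,                                     # hypothetical: missing address
    ]
})

pat = "|".join(state)
df["State"] = df["address"].str.extract("(" + pat + ")", expand=False)

# Rows where no state was found, for manual inspection.
unmatched = df[df["State"].isna()]
print(unmatched)
```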