【Question title】: Use dictionary to replace a string within a string in Pandas columns
【Posted】: 2018-03-02 17:18:37
【Question】:

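I am trying to use a dictionary to replace strings in a pandas column: each key found in the text should be replaced with its value. However, each cell in the column contains a full sentence. I therefore have to tokenize the sentence first, check whether any word in it corresponds to a key in my dictionary, and then replace that word with the corresponding value.

However, the result I keep getting is None. Is there a better, more Pythonic way to solve this?

Here is my MVCE so far. In the comments, I point out where the problem occurs.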

import pandas as pd

data = {'Categories': ['animal','plant','object'],
    'Type': ['tree','dog','rock'],
        'Comment': ['The NYC tree is very big','The cat from the UK is small','The rock was found in LA.']
}

ids = {'Id':['NYC','LA','UK'],
      'City':['New York City','Los Angeles','United Kingdom']}


df = pd.DataFrame(data)
ids = pd.DataFrame(ids)

def col2dict(ids):
    data = ids[['Id', 'City']]
    idDict = data.set_index('Id').to_dict()['City']
    return idDict

def replaceIds(data,idDict):
    ids = idDict.keys()
    types = idDict.values()
    data['commentTest'] = data['Comment']
    words = data['commentTest'].apply(lambda x: x.split())
    for (i,word) in enumerate(words):
        #Here we can see that the words appear
        print word
        print ids
        if word in ids:
        #Here we can see that they are not being recognized. What happened?
            print ids
            print word
            words[i] = idDict[word]
            data['commentTest'] = ' '.apply(lambda x: ''.join(x))
    return data

idDict = col2dict(ids)
results = replaceIds(df, idDict)

Result:

None

I am using Python 2.7, and when I print the dict, the strings are shown with the u' Unicode prefix.

My expected result is:

  Categories                       Comment  Type                               commentTest
0     animal      The NYC tree is very big  tree        The New York City tree is very big
1      plant  The cat from the UK is small   dog  The cat from the United Kingdom is small
2     object     The rock was found in LA.  rock        The rock was found in Los Angeles.

【Question comments】:

    Tags: python pandas dictionary dataframe replace


    【Solution 1】:

    You can create a dictionary and then call replace with regex=True:

    ids = {'Id':['NYC','LA','UK'],
          'City':['New York City','Los Angeles','United Kingdom']}
    
    ids = dict(zip(ids['Id'], ids['City']))
    print (ids)
    {'UK': 'United Kingdom', 'LA': 'Los Angeles', 'NYC': 'New York City'}
    
    df['commentTest'] = df['Comment'].replace(ids, regex=True)
    print (df)
      Categories                       Comment  Type  \
    0     animal      The NYC tree is very big  tree   
    1      plant  The cat from the UK is small   dog   
    2     object     The rock was found in LA.  rock   
    
                                    commentTest  
    0        The New York City tree is very big  
    1  The cat from the United Kingdom is small  
    2        The rock was found in Los Angeles.  
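
One caveat with regex=True: each key is treated as a regular expression and matches anywhere in the text, including inside longer words. Since the question asks for whole-word matches, a possible variant is sketched below. It assumes the keys should be matched literally and only at word boundaries; the names `pattern` and `naive`, and the extra fourth row, are illustrative additions, not part of the original answer.

```python
import re
import pandas as pd

df = pd.DataFrame({'Comment': ['The NYC tree is very big',
                               'The cat from the UK is small',
                               'The rock was found in LA.',
                               'LAND near the UKULELE shop']})  # extra row: keys inside longer words
ids = {'NYC': 'New York City', 'LA': 'Los Angeles', 'UK': 'United Kingdom'}

# Plain substring replacement also rewrites "LA" inside "LAND".
naive = df['Comment'].replace(ids, regex=True)

# One alternation of escaped keys, wrapped in word boundaries.
# Longer keys go first so that a longer key wins over a shorter key it starts with.
keys = sorted(ids, key=len, reverse=True)
pattern = r'\b(?:' + '|'.join(re.escape(k) for k in keys) + r')\b'

# A callable replacement looks up each matched key in the dictionary.
df['commentTest'] = df['Comment'].str.replace(
    pattern, lambda m: ids[m.group(0)], regex=True)

print(df['commentTest'].tolist())
```

With this pattern the fourth row is left unchanged, while the first three rows are replaced as in the expected result from the question.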
    

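    【Comments】:

    • Why regex=True? From the docs I thought it should be False: "Whether to interpret to_replace and/or value as regular expressions. If this is True then to_replace must be a string. Alternatively, this could be a regular expression or a list, dict, or array of regular expressions in which case to_replace must be None."
    • @pceccon - I think the docs should point out that regex=True is more commonly used for replacing substrings; as they stand, that is not clear from the docs at all.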
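
To make the point from the comments concrete, here is a minimal sketch of the difference. Without regex=True, replace() with a dictionary only swaps cells whose entire value equals a key; with regex=True it also replaces matches inside longer strings. The Series `s` and the names `whole_cell` and `substring` are made up for this illustration.

```python
import pandas as pd

s = pd.Series(['The NYC tree is very big', 'NYC'])
ids = {'NYC': 'New York City'}

# Default (regex=False): only a cell that is exactly 'NYC' is replaced.
whole_cell = s.replace(ids)

# regex=True: the key is used as a pattern, so substrings are replaced too.
substring = s.replace(ids, regex=True)

print(whole_cell.tolist())
print(substring.tolist())
```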
    【Solution 2】:

    Using str.replace() is actually much faster than using replace(), even though str.replace() requires a loop:

    ids = {'NYC': 'New York City', 'LA': 'Los Angeles', 'UK': 'United Kingdom'}
    
    for old, new in ids.items():
        df['Comment'] = df['Comment'].str.replace(old, new, regex=False)
    
    #   Categories  Type                                   Comment
    # 0     animal  tree        The New York City tree is very big
    # 1      plant   dog  The cat from the United Kingdom is small
    # 2     object  rock        The rock was found in Los Angeles.
    

    The only case where replace() outperforms the str.replace() loop is with small dataframes.

    Timing functions for reference:

    def Series_replace(df):
        df['Comment'] = df['Comment'].replace(ids, regex=True)
        return df
    
    def Series_str_replace(df):
        for old, new in ids.items():
            df['Comment'] = df['Comment'].str.replace(old, new, regex=False)
        return df
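
A rough way to check this on your own data is sketched below with the standard timeit module. It is only a sketch under some assumptions: the functions return a new Series instead of overwriting df['Comment'] (so that repeated runs still have something to replace), and the helper `time_both`, its parameters, and the row counts are hypothetical. Actual timings depend on the pandas version and the data, so none are shown here.

```python
import timeit
import pandas as pd

ids = {'NYC': 'New York City', 'LA': 'Los Angeles', 'UK': 'United Kingdom'}
base = pd.DataFrame({'Comment': ['The NYC tree is very big',
                                 'The cat from the UK is small',
                                 'The rock was found in LA.']})

def series_replace(df):
    # Dictionary-based regex replacement in a single call.
    return df['Comment'].replace(ids, regex=True)

def series_str_replace(df):
    # One literal str.replace() pass per dictionary entry.
    out = df['Comment']
    for old, new in ids.items():
        out = out.str.replace(old, new, regex=False)
    return out

def time_both(n_rows, repeat=3):
    # Build a frame of roughly n_rows rows by repeating the sample rows.
    df = pd.concat([base] * max(1, n_rows // len(base)), ignore_index=True)
    t_replace = min(timeit.repeat(lambda: series_replace(df), number=1, repeat=repeat))
    t_str = min(timeit.repeat(lambda: series_str_replace(df), number=1, repeat=repeat))
    return t_replace, t_str

for n in (30, 3000, 30000):
    print(n, time_both(n))
```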
    

    Note that if ids is a dataframe instead of a dictionary, you can get the same performance with itertuples():

    ids = pd.DataFrame({'Id': ['NYC', 'LA', 'UK'], 'City': ['New York City', 'Los Angeles', 'United Kingdom']})
    
    for row in ids.itertuples():
        df['Comment'] = df['Comment'].str.replace(row.Id, row.City, regex=False)
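
If a plain dictionary is preferred, for example to reuse the loop over ids.items(), the lookup dataframe can be converted first. The sketch below shows two equivalent ways; the names `ids_df`, `id_map` and `id_map_alt` are illustrative, and it assumes the Id values are unique.

```python
import pandas as pd

ids_df = pd.DataFrame({'Id': ['NYC', 'LA', 'UK'],
                       'City': ['New York City', 'Los Angeles', 'United Kingdom']})

# Pair the two columns up directly.
id_map = dict(zip(ids_df['Id'], ids_df['City']))

# Or index by Id and export the City column as a dict.
id_map_alt = ids_df.set_index('Id')['City'].to_dict()

print(id_map)
```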
    

    【Comments】:
