How to delete words from a dataframe column that are present in a dictionary in Pandas

【Question title】: How to delete words from a dataframe column that are present in dictionary in Pandas
【Posted】: 2018-02-08 20:03:54
【Question】:

Extension to: Removing list of words from a string

I have the following dataframe, and I want to delete the frequently occurring words from the df.name column:

df :

name
Bill Hayden
Rock Clinton
Bill Gates
Vishal James
James Cameroon
Micky James
Michael Clark
Tony Waugh  
Tom Clark
Tom Bill
Avinash Clinton
Shreyas Clinton
Ramesh Clinton
Adam Clark

I am using the following code to create a new dataframe of words and their frequencies:

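import pandas as pd

# Split names into words, stack them into one Series, and count each word.
df = pd.DataFrame(data.name.str.split(expand=True).stack().value_counts())
df.reset_index(level=0, inplace=True)
df.columns = ['word', 'freq']
# Keep only the words that occur at least 3 times.
df = df[df['freq'] >= 3]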

This results in

df2:

word    freq
Clinton 4
Bill    3
James   3
Clark   3

Then I convert it into a dictionary with the following code snippet:

    d = dict(zip(df['word'], df['freq']))

Now, to remove the words in d (the dictionary of word: freq) from df.name, I use the following code snippet:

def check_thresh_word(merc,d):
    m = merc.split(' ')
    for i in range(len(m)):
        if m[i] in d.keys():
            return False
    else:
        return True

def rm_freq_occurences(merc,d):
    if check_thresh_word(merc,d) == False:
        nwords = merc.split(' ')
        rwords = [word for word in nwords if word not in d.keys()]
        m = ' '.join(rwords)
    else:
        m=merc
    return m

df['new_name'] = df['name'].apply(lambda x: rm_freq_occurences(x,d))

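But my actual dataframe (df) contains nearly 240k rows, and I have to use a threshold greater than 100 (the threshold in the example above is 3). Because of the complex search, the code above takes a long time to run. Is there an efficient way to make it faster?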

Here is the desired output:

name
Hayden
Rock
Gates
Vishal
Cameroon
Micky
Michael
Tony Waugh
Tom
Tom
Avinash
Shreyas
Ramesh
Adam

Thanks in advance!

【Comments】:

Tags: python-2.7 pandas dataframe

【Solution 1】:

Use replace with a regex created by joining all values of the column word, then strip the trailing whitespace:

data.name = data.name.replace('|'.join(df['word']), '', regex=True).str.strip()
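One caveat with a bare alternation is that it also matches inside longer words, so a hypothetical name such as "Billy" would lose its "Bill" prefix. A minimal sketch of a word-boundary variant follows, using only the standard re module. The helper name strip_words and the sample names are assumptions for illustration, and the sketch assumes the words contain no regex metacharacters:

```python
import re

def strip_words(name, words):
    """Remove whole-word occurrences of `words` from `name` (hypothetical helper)."""
    # \b anchors every alternative at word boundaries,
    # so "Bill" is not matched inside "Billy".
    pattern = re.compile(r'\b(?:' + '|'.join(words) + r')\b')
    # Split and re-join to collapse the whitespace left behind by the removal.
    return ' '.join(pattern.sub('', name).split())

words = ['Clinton', 'Bill', 'James', 'Clark']
first = strip_words('Bill Hayden', words)    # 'Hayden'
second = strip_words('Billy Clark', words)   # 'Billy'
print(first)
print(second)
```

With pandas, the same pattern string could be passed to data.name.str.replace(..., '', regex=True) instead of calling the helper row by row.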

Another solution is to add \s* to match zero or more whitespace characters:

pat = '|'.join([r'\s*{}\s*'.format(x) for x in df['word']])
print (pat)
\s*Clinton\s*|\s*James\s*|\s*Bill\s*|\s*Clark\s*

data.name = data.name.replace(pat, '', regex=True)

print (data)
          name
0       Hayden
1         Rock
2        Gates
3       Vishal
4     Cameroon
5        Micky
6      Michael
7   Tony Waugh
8          Tom
9          Tom
10     Avinash
11     Shreyas
12      Ramesh
13        Adam
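As a regex-free alternative (not part of the original answer), the words can be filtered against a set, where each membership test takes constant time on average. This is a minimal sketch in plain Python. The helper name drop_frequent is an assumption, and the set is hard-coded here where the real code would build it from df['word']:

```python
# Hypothetical stand-in for set(df['word']).
frequent = {'Clinton', 'Bill', 'James', 'Clark'}

def drop_frequent(name, frequent_words):
    """Keep only the words of `name` that are not in `frequent_words`."""
    # split() without arguments also discards leading/trailing whitespace.
    return ' '.join(w for w in name.split() if w not in frequent_words)

names = [
    'Bill Hayden', 'Rock Clinton', 'Bill Gates', 'Vishal James',
    'James Cameroon', 'Micky James', 'Michael Clark', 'Tony Waugh  ',
    'Tom Clark', 'Tom Bill', 'Avinash Clinton', 'Shreyas Clinton',
    'Ramesh Clinton', 'Adam Clark',
]
cleaned = [drop_frequent(n, frequent) for n in names]
print(cleaned)
```

On a dataframe this could be applied in the same style as the question, for example data['name'].apply(lambda x: drop_frequent(x, frequent)). Whether it beats the regex on 240k rows would need to be measured.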

【Discussion】:

The solution looks great! But the name column contains some UTF-8 encoded unicode data, so it raises this error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)
Can you try '|'.join(x.decode('utf-8') for x in df['word'])?
data.factual_name.replace(u'|'.join(df['word']).encode('utf-8').strip(), '', regex=True).str.strip()
If I use this, it gives "nothing to repeat".
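"nothing to repeat" is the message Python's re module raises when a quantifier such as * or + has nothing in front of it. One possible cause, offered as an assumption because the thread does not show the offending data, is a word that itself begins with such a character. The sketch below reproduces the error with made-up words and avoids it by passing each word through re.escape before joining:

```python
import re

# Hypothetical words; '*Star' begins with a regex quantifier.
words = [u'Clinton', u'*Star', u'Caf\xe9']

error_message = None
try:
    # 'Clinton|*Star|...' is invalid: '*' directly follows '|'.
    re.compile(u'|'.join(words))
except re.error as exc:
    error_message = str(exc)
    print(error_message)

# Escaping each word makes every character match literally.
safe_pattern = re.compile(u'|'.join(re.escape(w) for w in words))
cleaned = safe_pattern.sub(u'', u'Caf\xe9 Rock').strip()
print(cleaned)
```

The escaped pattern string, u'|'.join(re.escape(w) for w in words), can be used with replace(..., regex=True) in the same way as the unescaped one.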