Pandas data frame: replacing international currency signs (answers)

【Question title】: Pandas data frame replace international currency sign
【Posted】: 2018-05-03 23:21:04
【Question】:

I am working with an Excel file that has international currency signs in several columns. The file also contains text in several languages.

Example: Paying £40.50 doesn't make any sense for a one-hour parking. 
Example: Produkty są zbyt drogie (Polish)
Example: 15% de la population féminine n'obtient pas de bons emplois (French)

The following steps have been taken as a cleanup process:

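# Patterns are raw strings; regex=True is explicit because newer pandas versions default to literal matching.
df = df.apply(lambda x: x.str.replace(r'\r', ' ', regex=True))
df = df.apply(lambda x: x.str.replace(r'\n', ' ', regex=True))
df = df.apply(lambda x: x.str.replace(r'\.+', '', regex=True))
df = df.apply(lambda x: x.str.replace('-', '', regex=True))
df = df.apply(lambda x: x.str.replace('&', '', regex=True))
df = df.apply(lambda x: x.str.replace(r"[\"\',]", '', regex=True))
df = df.apply(lambda x: x.str.replace(r'[%*]', '', regex=True), axis=1)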

(If there is a more efficient way to do this, suggestions are very welcome.)
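One possible way to shorten the chain above is to fold the seven substitutions into two compiled patterns: one for the characters that become a space and one for the characters that are dropped. The following is only a sketch, assuming Python 3; the names `basic_clean`, `_TO_SPACE` and `_TO_DROP` are made up for illustration and are not part of the original post.

```python
import re

# Characters the original chain replaces with a space: \r and \n.
_TO_SPACE = re.compile(r"[\r\n]")
# Characters the original chain deletes: . - & " ' , % *
_TO_DROP = re.compile(r"[.\-&\"',%*]")


def basic_clean(text):
    """Apply the same substitutions as the seven chained replace calls."""
    text = _TO_SPACE.sub(" ", text)
    return _TO_DROP.sub("", text)


print(basic_clean("Rock & roll,\r\n100%... \"fun\""))
```

On a data frame whose cells are all strings, something like `df.applymap(basic_clean)` should give the same result as the chain in a single pass per cell.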

In addition, a method was written to remove stop words:

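from nltk.corpus import stopwords

def cleanup(row):
    # Lower-case the text and drop English stop words
    stops = set(stopwords.words('english'))
    removedStopWords = " ".join([str(i) for i in row.lower().split() if i not in stops])
    return removedStopWords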

This method is applied to every column of the data frame that contains the examples above:

df = df.applymap(self._row_cleaner)['ComplainColumns']

But UnicodeEncodeError has been the biggest problem. One of the first places it is raised is on the pound sign.

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 646: ordinal not in range(128)

I tried the following: df = df.apply(lambda x: x.unicode.replace(u'\xa3', '')) but it did not work.

The goal is to replace all non-alphabetic characters with '' or ' '.
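For context, the error above is typical of Python 2 implicitly encoding a unicode string to ASCII, which is what `str(i)` does inside `cleanup` when `i` contains `£`. The sketch below assumes Python 3, where `str` is already Unicode, and shows two ways the stated goal might be approached with the standard library only. The function names `strip_currency` and `keep_letters` are hypothetical.

```python
import unicodedata


def strip_currency(text, replacement=""):
    """Replace every Unicode currency symbol (category 'Sc') in text."""
    return "".join(
        replacement if unicodedata.category(ch) == "Sc" else ch
        for ch in text
    )


def keep_letters(text, replacement=" "):
    """Replace every character that is neither a letter nor whitespace."""
    return "".join(
        ch if ch.isalpha() or ch.isspace() else replacement
        for ch in text
    )


print(strip_currency("Paying £40.50 for parking"))
print(keep_letters("Produkty są zbyt drogie"))
```

Because `str.isalpha` is Unicode-aware, accented letters in the Polish and French examples are kept, while digits, punctuation and currency signs are replaced.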

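【Comments】:

Does df = df.replace('[^\w\s]','',regex=True) help?
No, it replaced everything with w
I think it works, with one small problem... AttributeError: ("'float' object has no attribute 'lower'", u'occurred at index Positive') inside the cleanup method, at removedStopWords = " ".join([str(i) for i in row.lower().split() if i not in stops])
Maybe you need to use df.astype(str) and then apply; there may be NaNs in the data frame.
I know, I just wanted to make sure you get more credit than a comment would give :)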

Tags: python-2.7 pandas unicode utf-8

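【Solution 1】:

If you want to remove every character that is not a word character (roughly [A-Za-z0-9_]) or whitespace, you can use a regex replace, i.e.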

df = df.replace(r'[^\w\s]', '', regex=True)
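To see what this pattern does to the sample sentences from the question, here is a small sketch using the standard `re` module directly, assuming Python 3. The helper name `strip_non_word` is made up for illustration. In Python 3, `\w` matches Unicode letters for `str` patterns, so accented characters survive unless `re.ASCII` is passed.

```python
import re


def strip_non_word(text, ascii_only=False):
    """Remove everything except word characters and whitespace."""
    flags = re.ASCII if ascii_only else 0
    return re.sub(r"[^\w\s]", "", text, flags=flags)


print(strip_non_word("Paying £40.50 doesn't make any sense"))
print(strip_non_word("Produkty są zbyt drogie"))
print(strip_non_word("Produkty są zbyt drogie", ascii_only=True))
```

Note that digits and underscores count as word characters, so this pattern alone does not reach the "letters only" goal stated in the question.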

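The data frame may contain missing data, so you may need astype(str): because you are using a list comprehension with .lower(), a NaN value is treated as a float and has no .lower() method.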

df.astype(str).applymap(cleanup)
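One side effect worth knowing: astype(str) turns a missing cell into the literal string 'nan', which then passes through the cleanup as an ordinary word. An alternative, sketched below under the assumption of Python 3, is to leave non-string cells untouched. The name `cleanup_safe` and the tiny `STOPS` set are hypothetical; `STOPS` merely stands in for `set(stopwords.words('english'))` so the example runs without NLTK.

```python
# Hypothetical stand-in for set(stopwords.words('english')) from NLTK.
STOPS = {"a", "for", "the", "any", "is", "not"}


def cleanup_safe(value, stops=STOPS):
    """Lower-case and drop stop words; return non-string cells unchanged."""
    if not isinstance(value, str):
        return value  # e.g. float('nan') from an empty Excel cell
    return " ".join(w for w in value.lower().split() if w not in stops)


print(cleanup_safe("The products are not cheap"))
print(cleanup_safe(float("nan")))
```

Applied cell by cell, for example with `df.applymap(cleanup_safe)`, this should avoid the 'float' object has no attribute 'lower' error without converting missing values to text.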

【Discussion】: