Remove non-ASCII characters from string columns in pandas (answers)

【Question title】: Remove non-ASCII characters from string columns in pandas
【Posted】: 2018-07-30 04:36:59
【Question description】:

I have a pandas DataFrame with several columns in which values are mixed with unwanted characters.

columnA        columnB    columnC        ColumnD
\x00A\X00B     NULL       \x00C\x00D        123
\x00E\X00F     NULL       NULL              456

What I want is to turn this DataFrame into the following.

columnA  columnB  columnC   ColumnD
AB        NULL       CD        123
EF        NULL       NULL      456

With the code below I can remove '\x00' from columnA, but columnC is tricky because it is mixed with NULL in some rows.

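col_names = cols_to_clean
# Translation table that maps the NUL code point (0x00) to an empty string
fixer = dict.fromkeys([0x00], u'')
for i in col_names:
    # Skip columns that contain missing values, since None has no .translate()
    if not df[i].isnull().any():
        # Skip integer columns; only strings can be translated
        if df[i].dtype != np.int64:
            df[i] = df[i].map(lambda x: x.translate(fixer))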

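Is there an efficient way to remove the unwanted characters from columnC?

【Question comments】:

What is NULL? Is it None? Or the string "NULL"?
Would something like .map(lambda x: x.translate(fixer) if x != "NULL" else x) help?
Dyz, I think NULL is equivalent to "None".

Tags: python string pandas dataframe

【Solution 1】:

In general, to remove non-ASCII characters, use str.encode with errors='ignore':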

df['col'] = df['col'].str.encode('ascii', 'ignore').str.decode('ascii')
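
As a minimal sketch of what this round trip does, assuming a made-up Series named `s` (the sample values are not from the question): characters that cannot be encoded as ASCII are dropped, and missing values stay missing.

```python
import pandas as pd

# Hypothetical sample data; the accented and CJK characters are non-ASCII.
s = pd.Series(["café", "naïve 日本", None])

# encode(..., "ignore") drops anything that cannot be encoded as ASCII;
# decode turns the resulting bytes back into str.
cleaned = s.str.encode("ascii", "ignore").str.decode("ascii")

print(cleaned.tolist())
```

The first two values come back as "caf" and "nave " (the trailing space survives because it is ASCII), while the missing entry is left as a missing value.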

To do this for multiple string columns, use

u = df.select_dtypes(object)
df[u.columns] = u.apply(
    lambda x: x.str.encode('ascii', 'ignore').str.decode('ascii'))
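
The same pattern on a small frame might look like the sketch below. The column names and values are invented for illustration, and `dtype=object` is spelled out only so the sketch behaves the same on pandas versions where strings may not default to object dtype.

```python
import pandas as pd

# Hypothetical frame: one string column (object dtype) and one integer column.
df = pd.DataFrame({
    "name": pd.Series(["Zoë", "Łukasz"], dtype=object),
    "n": [1, 2],
})

# Pick out the object-dtype columns, clean each one, and write them back.
u = df.select_dtypes(object)
df[u.columns] = u.apply(
    lambda x: x.str.encode("ascii", "ignore").str.decode("ascii"))

print(df["name"].tolist())  # ['Zo', 'ukasz']
print(df["n"].tolist())     # [1, 2]
```

Selecting by dtype first keeps numeric columns out of the `.str` accessor, which would otherwise raise on them.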

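This still won't handle the null characters in your columns, though, because \x00 is itself an ASCII character. For that, you can replace them with a regular expression (note that \W+ strips every non-word character, including spaces and punctuation):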

df2 = df.replace(r'\W+', '', regex=True)
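
If only the NUL character should go and everything else should stay, a narrower pattern is one option. The sketch below is an assumption-laden illustration, not the answerer's code: the frame mirrors the question's layout, and None stands in for the question's NULL.

```python
import pandas as pd

# Hypothetical sample mirroring the question's data.
df = pd.DataFrame({
    "columnA": ["\x00A\x00B", "\x00E\x00F"],
    "columnC": ["\x00C\x00D", None],
    "ColumnD": [123, 456],
})

# NUL is ASCII, so the encode/decode round trip leaves it in place.
roundtrip = df["columnA"].str.encode("ascii", "ignore").str.decode("ascii")

# Replacing just "\x00" removes the NULs; missing values and the
# numeric column are left alone.
cleaned = df.replace("\x00", "", regex=True)

print(cleaned["columnA"].tolist())  # ['AB', 'EF']
```

Unlike `\W+`, this pattern would leave hyphens, spaces, and other punctuation in place.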

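【Comments】:

Thanks coldspeed, this is a very simple and great solution. May I ask what the first line of the code means?
@JoohunLee It's an efficient way to get the column names of the string columns.
Thanks. By the way, is there a way to remove unwanted characters while keeping certain special characters? For example, if I have \x00A\x00-\x00B, applying your code returns "AB" instead of "A-B".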
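@JoohunLee If you want to keep ASCII characters, see this link: stackoverflow.com/a/20078869/4909087 Otherwise you can use: x.str.replace(r'[^\w-]+', '', regex=True) and add more characters to the character class as needed.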
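
A short sketch of the negated character class from the last comment, using the commenter's example string plus one made-up value. `regex=True` is passed explicitly because recent pandas versions treat the pattern as a literal string by default.

```python
import pandas as pd

s = pd.Series(["\x00A\x00-\x00B", "\x00C\x00D"])

# [^\w-]+ matches runs of characters that are neither word characters
# nor a hyphen, so the hyphen is kept while the NULs are removed.
kept = s.str.replace(r"[^\w-]+", "", regex=True)

print(kept.tolist())  # ['A-B', 'CD']
```

Further characters to keep (for example a space or a dot) can be added inside the brackets, with the hyphen left last so it is not read as a range.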

【Solution 2】:

What exactly is NULL here?
If you want to replace the string 'NULL' with a real NaN, use replace:

df.replace('NULL', np.nan, inplace=True)
print(df.isnull())

Output:

   columnA  columnB  columnC  columnD
0    False     True    False    False
1    False     True     True    False
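
A self-contained sketch of this replacement, assuming the NULs have already been removed and that NULL is the literal string 'NULL'. The frame below is a made-up subset of the question's columns.

```python
import numpy as np
import pandas as pd

# Hypothetical frame where NULL is stored as the literal string 'NULL'.
df = pd.DataFrame({
    "columnA": ["AB", "EF"],
    "columnC": ["CD", "NULL"],
    "columnD": [123, 456],
})

# Only cells that are exactly 'NULL' are swapped for NaN.
df = df.replace("NULL", np.nan)
mask = df.isnull()

print(mask["columnC"].tolist())  # [False, True]
```

Once the placeholders are real NaN values, the usual tools such as `isnull`, `dropna`, and `fillna` apply to them.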

Or, if you need to replace 'NULL' with an empty string, use a regex in str.replace:

df = df.apply(lambda col: col.str.replace(
               r"\x00|NULL", "", regex=True) if col.dtype == object else col)

print (df.isnull())
print (df.values)

Output:

   columnA  columnB  columnC  columnD
0    False    False    False    False
1    False    False    False    False
[['AB' '' 'CD' 123]
 ['EF' '' '' 456]]
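
One regex detail worth checking when writing a pattern like this: square brackets define a character class, so `[\x00|NULL]` matches single characters (NUL, `|`, `N`, `U`, `L`) rather than the word NULL. A minimal sketch with the standard `re` module and a made-up sample string:

```python
import re

# Character class: any ONE of NUL, "|", "N", "U", "L".
char_class = re.compile(r"[\x00|NULL]")
# Alternation: either a NUL character or the literal word NULL.
alternation = re.compile(r"\x00|NULL")

sample = "\x00LUNA|NULL"

print(repr(char_class.sub("", sample)))   # 'A'
print(repr(alternation.sub("", sample)))  # 'LUNA|'
```

With the character class, legitimate values that happen to contain the letters N, U, or L lose those letters, so the alternation form is the safer choice for real data.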

【Comments】: