How do I clean data using pandas? (Answers)

【Question title】: How do I clean data using pandas?
【Posted】: 2021-08-14 07:28:51
【Question】:

I need to replace ' \\n, *, ' with '\n *'. I tried df['Course_content']=df['Course_content'].replace(' \\n, *, ','\n *',regex=True), but it does not work for me:

>>> df['Course_content'][0]
'The syllabus for this course will cover the following:, \\n, *,  The nature and purpose of cost and management accounting, \\n, *,  Source documents and coding, \\n, *,  Cost classification and measuring, \\n, *,  Recording costs, \\n, *,  Spreadsheets'
>>> df['Course_content']=df['Course_content'].replace(' \\n, *,  ','\n *',regex=True)
>>> df['Course_content'][0]
'The syllabus for this course will cover the following:, \\n, *,  The nature and purpose of cost and management accounting, \\n, *,  Source documents and coding, \\n, *,  Cost classification and measuring, \\n, *,  Recording costs, \\n, *,  Spreadsheets'
>>>

I also tried the following code, but it does not work either:

d = {
'Not Mentioned':'',
"\r\n": "\n",
"\\r": "\n",
'\u00a0':' ',
' \\n, *,':  "\n * ",
' \\n,':'\n',
}
df=df.replace(d.keys(),d.values(),regex=True)

【Comments】:

Tags: python regex pandas dataframe data-cleaning

【Solution 1】:

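Write both arguments as raw strings (r-strings) and add a \ before the * in the first argument. This is necessary because \ and * are special metacharacters in regular expressions: in your pattern, \n is read as a newline and ", *" as a comma followed by zero or more spaces, so nothing in the data matches. You have to escape these characters with an extra \ (and use an r-string so Python passes the backslashes through unchanged) for them to match as literal characters.

You can use: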

df['Course_content'] = df['Course_content'].replace(r' \\n, \*,  ', r'\n *', regex=True)
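To see why the escaping matters, here is a minimal sketch using Python's built-in re module, whose pattern syntax is what pandas applies when regex=True. The shortened sample text and the variable names are made up for illustration; re.escape is shown as one possible way to build the escaped pattern automatically.

```python
import re

# Made-up sample: the text holds a literal backslash followed by "n",
# not an actual newline character.
text = 'Topics:, \\n, *,  Recording costs, \\n, *,  Spreadsheets'

# Pattern from the question: the regex engine reads "\n" as a newline and
# ", *" as "a comma followed by zero or more spaces", so nothing matches.
unescaped = ' \\n, *,  '
print(re.search(unescaped, text))  # None

# Escaped pattern: "\\" matches one literal backslash, "\*" a literal star.
escaped = r' \\n, \*,  '
cleaned = re.sub(escaped, r'\n *', text)
print(repr(cleaned))

# re.escape derives an equivalent pattern from the literal text.
auto = re.escape(' \\n, *,  ')
print(re.sub(auto, r'\n *', text) == cleaned)  # True
```

In the replacement argument, \n is interpreted as a real newline, so the literal backslash-n in the data becomes an actual line break.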

Demo:

import pandas as pd

data = {'Course_content': ['The syllabus for this course will cover the following:, \\n, *,  The nature and purpose of cost and management accounting, \\n, *,  Source documents and coding, \\n, *,  Cost classification and measuring, \\n, *,  Recording costs, \\n, *,  Spreadsheets']}
df = pd.DataFrame(data)

df['Course_content'] = df['Course_content'].replace(r' \\n, \*,  ', r'\n *', regex=True)

Result (shown as the string's repr, so each inserted newline appears as \n):

print(repr(df['Course_content'][0]))

'The syllabus for this course will cover the following:,\n *The nature and purpose of cost and management accounting,\n *Source documents and coding,\n *Cost classification and measuring,\n *Recording costs,\n *Spreadsheets'
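
Since the text being replaced is a fixed string rather than a pattern, a literal (non-regex) replacement may be simpler. The sketch below assumes the column holds plain strings and uses Series.str.replace with regex=False; the shortened sample text is made up for illustration.

```python
import pandas as pd

# Shortened, made-up sample in the same shape as the question's data.
df = pd.DataFrame({'Course_content': [
    'Topics:, \\n, *,  Recording costs, \\n, *,  Spreadsheets',
]})

# regex=False treats the search text literally, so neither the backslash
# nor the * needs escaping; '\n' in the replacement is a real newline.
df['Course_content'] = df['Course_content'].str.replace(
    ' \\n, *,  ', '\n *', regex=False
)

print(df['Course_content'][0])
```

Note that Series.replace without regex=True only matches entire cell values, which is why the .str.replace accessor is used here for substring replacement.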

【Comments】: