Pandas: Large CSV file data manipulation - Answers

【Question title】: Pandas: Large CSV file data manipulation
【Posted】: 2021-11-25 18:28:15
【Question】:

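I have a large dataset from a CSV file. It has two columns: the first is a date/time in the form hh:mm:ss:ms, and the other is a pressure given as a number. At random places the pressure value is not numeric (e.g. 150+AA42BB43). These values appear at random among the file's 50,000 rows, and they are not all the same.

I need a way to turn these pressure values into numbers so that I can perform data manipulation on them.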

df_cleaned = df['Pressure'].loc[~df['Pressure'].map(lambda x: isinstance(x, float) | isinstance(x, int))]

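I tried this, but it dropped my date/time values, did not clean all of the pressure values, and also removed my header.

I was wondering if anyone has suggestions for how I can easily clean the data in the second column while keeping the date/time values in the first column accurate.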

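【Comments】:

You should use df_cleaned = df.loc[....] (or even df_cleaned = df[....]) instead of df['Pressure'].loc[...]
With df_cleaned = df['Pressure']... you get only one column (Pressure) and skip the other columns, which is why you have no Date/Time. And because it is a single column, it is returned as a Series instead of a DataFrame. That removes your header, because a Series (a single column) doesn't need a header.
You can do isinstance(x, (float, int))
How exactly do you want to clean the pressure values? Just get rid of the non-numeric characters and convert to float?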

Tags: python pandas csv data-cleaning

【Solution 1】:

If all of your non-numeric values are strings, I think I have an answer.

Have you tried using pandas replace()? For example:

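df['Pressure'] = df['Pressure'].replace(to_replace=r'.+', value=0, regex=True)

I used a regular expression that matches "any string". Assigning the result back to the column updates the existing DataFrame. (Calling replace(..., inplace=True) on df['Pressure'] is less reliable, since in recent pandas versions it may modify a temporary copy instead of df.)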

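Here, the function replaces any string with the given integer. I wasn't sure which integer you wanted, so I used zero as an example. Note that this only helps when the valid readings are already stored as numbers: if the whole column was read as text, numeric strings such as "78" match the pattern too. If you want a different integer for each string, you can use a mapping as described in this answer.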
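As a rough sketch of the "different integer for each string" idea, one option is a plain dictionary lookup. The dictionary name replacements, its contents, and the sample rows are all made up for illustration, and this is not necessarily the approach from the linked answer:

```python
import pandas as pd

# Sample data (assumed): valid readings are real numbers, bad readings are strings.
df = pd.DataFrame({
    "DateTime": ["00:00:01:000", "00:00:02:000", "00:00:03:000"],
    "Pressure": [78, "150+AA42BB43", 23],
})

# Hypothetical mapping: each bad string -> the integer that should replace it.
replacements = {"150+AA42BB43": 150}

# Look each value up in the mapping; values that are not listed stay unchanged.
df["Pressure"] = df["Pressure"].map(lambda x: replacements.get(x, x))

print(df)
```

Any bad string that is missing from the dictionary is left as it is, so this only suits cases where the set of bad values is small and known in advance.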

【Comments】:

【Solution 2】:

Your problem is that you use

df_cleaned = df['Pressure']

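This gets only one column (Pressure) and skips the other columns. When you select a single column you get a Series instead of a DataFrame, and a Series holds only one column, so it doesn't need a header to select columns.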
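The difference can be checked directly. The sketch below uses a small made-up DataFrame; the variable names mask, only_column and whole_rows are only for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "DateTime": ["2021.10.04", "2021.10.05", "2021.10.06"],
    "Pressure": [78, "150+AA42BB43", 23],
})

# Boolean mask: True where Pressure is a real float/int.
mask = df["Pressure"].map(lambda x: isinstance(x, (float, int)))

only_column = df["Pressure"].loc[mask]  # selects from one column -> Series
whole_rows = df.loc[mask]               # selects from the frame -> DataFrame

print(type(only_column).__name__)  # Series
print(type(whole_rows).__name__)   # DataFrame
print(list(whole_rows.columns))    # ['DateTime', 'Pressure']
```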

You should run it without ['Pressure']

df_cleaned = df.loc[ ~df['Pressure'].map(...) ]

or even

df_cleaned = df[ ~df['Pressure'].map(...) ]

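By the way, isinstance(x, (float, int)) is shorter. Also note that the leading ~ inverts the mask, so the lines above keep the rows whose Pressure is not a float/int; drop the ~ to keep the numeric rows.

However, if your float/int values are stored as strings, isinstance may not work, because isinstance("123", (float, int)) gives False. You would have to try converting with float("123") or int("123") and catch the error.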

import pandas as pd

data = {
    'DateTime': ['2021.10.04', '2021.10.05', '2021.10.06'],
    'Pressure': [78, '150+AA42BB43', 23],
}

df = pd.DataFrame(data)

# keep whole rows (all columns) whose Pressure is a real float/int
df_cleaned = df[df['Pressure'].map(lambda x: isinstance(x, (float, int)))]

print(df_cleaned)

Result:

     DateTime Pressure
0  2021.10.04       78
2  2021.10.06       23
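For the case mentioned above where the numbers are stored as strings, the "try to convert and catch the error" idea might look roughly like this. The helper name is_number is hypothetical and the sample rows are made up:

```python
import pandas as pd

def is_number(value):
    """Return True if float() accepts the value (hypothetical helper)."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        # ValueError: text such as '150+AA42BB43'; TypeError: e.g. None
        return False

df = pd.DataFrame({
    "DateTime": ["2021.10.04", "2021.10.05", "2021.10.06"],
    "Pressure": ["78.2", "150+AA42BB43", "23"],
})

# Keep whole rows whose Pressure text can be parsed as a number.
df_cleaned = df[df["Pressure"].map(is_number)]

print(df_cleaned)
```

The values kept this way are still strings, and float() also accepts text such as "nan" or "inf", so a conversion step is still needed afterwards.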

EDIT:

If your values are stored as strings, you can use to_numeric to convert them; any value that cannot be converted becomes NaN

df['Pressure'] = pd.to_numeric(df['Pressure'], errors='coerce')

Then you can filter it with isna()

df_cleaned = df[ ~df['Pressure'].isna() ]

import pandas as pd

data = {
    'DateTime': ['2021.10.04', '2021.10.05', '2021.10.06'],
    'Pressure': ['78.2', '150+AA42BB43', '23'],
}

df = pd.DataFrame(data)

# convert strings to numbers; values that cannot be parsed become NaN
df['Pressure'] = pd.to_numeric(df['Pressure'], errors='coerce')
print(df)

# keep only the rows where Pressure is not NaN
df_cleaned = df[~df['Pressure'].isna()]
print(df_cleaned)
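Applied to data read from a CSV file, the same steps might look like the sketch below. The column names Time and Pressure and the three sample rows are assumptions, and io.StringIO stands in for the real file path:

```python
import io

import pandas as pd

# Stand-in for the real file; column names and rows are assumed.
csv_text = """Time,Pressure
00:00:01:000,78.2
00:00:02:000,150+AA42BB43
00:00:03:000,23
"""

df = pd.read_csv(io.StringIO(csv_text))

# Non-numeric entries become NaN; the Time column is not touched.
df["Pressure"] = pd.to_numeric(df["Pressure"], errors="coerce")

# Drop only the rows whose pressure could not be parsed.
df_cleaned = df.dropna(subset=["Pressure"])

print(df_cleaned)
```

Here dropna(subset=["Pressure"]) does the same job as the isna() filter above. To keep every timestamp, skip that step and leave the NaN values in place.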

【Comments】: