Checking strange items in a dataset

【Question title】: Checking strange items in a dataset
【Posted】: 2019-05-02 02:34:31
【Question description】:

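I want to check a data frame in Python for strange categorical items, using fewer lines of code.

I tried the following code to display the strange items.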

for i in range(data.shape[1]):
  if data[data.columns[i]].dtype == "object":
    print(data[data.columns[i]].value_counts())

Is there a way to check the categorical data in fewer lines?

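【Question discussion】:

Can you explain what you mean by "strange"? Are you trying to remove outliers from the data?
No. For example, the "Sex" column should contain only "M" and "F", but it may also contain "Boy" or "Girl". I want to display all of them and clean them up. In R, I would use str(df) to show these factor objects, but I don't know how to do the same in Python.

Tags: python data-science data-cleaning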

【Solution 1】:

If you want to print all the unique entries of a column, I suggest the unique method (docs).

>>> import pandas as pd
>>> a = pd.DataFrame({'sex':['m','f','m','m','m', 'f', 'booooy']})
>>> a.loc[:,'sex'].unique()
Out[1]: array(['m', 'f', 'booooy'], dtype=object)
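To cover every text column at once, which is what the question asks for, something along these lines might work. This is only a sketch: the sample data is made up, and it selects columns with select_dtypes(exclude="number") instead of the question's dtype == "object" check, so it would also pick up boolean or datetime columns if the frame has any.

```python
import pandas as pd

# Made-up sample data, for illustration only.
data = pd.DataFrame({
    "sex": ["m", "f", "m", "Boy", "f", "Girl"],
    "city": ["Oslo", "Oslo", "Bergen", "oslo", "Bergen", "Oslo"],
    "age": [31, 25, 47, 8, 52, 9],
})

# Keep the non-numeric columns and collect each one's unique values,
# in order of first appearance.
uniques = {
    col: data[col].unique().tolist()
    for col in data.select_dtypes(exclude="number").columns
}

for col, values in uniques.items():
    print(col, values)
```

Swapping unique() for value_counts() in the same comprehension would also show how often each stray label occurs.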

To change the booooy entry to m, you can use the re.sub function (docs).

>>> import re
>>> a.loc[:,'sex'].apply(lambda x: re.sub(r'booooy','m', x))
Out[2]: 
0    m
1    f
2    m
3    m
4    m
5    f
6    m
Name: sex, dtype: object
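Two caveats are worth a quick sketch. re.sub replaces the pattern wherever it occurs inside a string, and apply returns a new Series rather than changing a in place. Anchoring the pattern with ^ and $ and assigning the result back is one way to handle both. The 'booooyish' value below is made up purely to show the difference.

```python
import re

import pandas as pd

a = pd.DataFrame({"sex": ["m", "f", "booooy", "booooyish"]})

# ^ and $ restrict the substitution to cells that are exactly 'booooy';
# assigning back stores the cleaned values in the frame.
a["sex"] = a["sex"].apply(lambda x: re.sub(r"^booooy$", "m", x))

print(a["sex"].tolist())
```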

If you have many re.sub calls, you can put them in a function and apply that function instead.

>>> def filter_text(x):
...    x = re.sub(r'booooy','m',x)
...    x = re.sub(r'girl','f',x)
...    # . . . . . .
...    return x
>>> a.loc[:,'sex'].apply(filter_text)
Out[3]: 
0    m
1    f
2    m
3    m
4    m
5    f
6    m
Name: sex, dtype: object
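When the stray labels are known exactly, a dictionary passed to Series.replace may be simpler than a chain of re.sub calls. This is an alternative sketch rather than the method above, and the fixes mapping is a hypothetical example.

```python
import pandas as pd

a = pd.DataFrame({"sex": ["m", "f", "m", "m", "m", "f", "booooy"]})

# Hypothetical mapping from stray labels to the canonical ones.
fixes = {"booooy": "m", "girl": "f"}

# With a plain dict, replace() matches whole cell values and returns
# a new Series, so assign the result back to keep the cleaned column.
a["sex"] = a["sex"].replace(fixes)

print(a["sex"].unique())
```

Because the match is on whole values, a cell such as 'girlfriend' would be left untouched, unlike with an unanchored re.sub.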

Hope that helps!
