Clean the noisy data with pandas drop row (answers)

【Question title】: Clean the noisy data with pandas drop row
【Posted】: 2017-05-21 13:03:44
【Question】:

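I am trying to reduce the noise in a large dataset by using grammatical keywords. Is there a way to trim the dataset horizontally (that is, drop whole rows) based on a particular set of keywords?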

Input: 

id1, id2, keyword, freq, gp1, gps2 
222, 111, #paris, 100, loc1, loc2 
444, 234, have, 1000, loc3, loc4
434, 134, #USA, 30, loc5, loc6
234, 234, she, 600, loc1, loc2
523, 5234,mobile, 900, loc3, loc4

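From this I need to remove common keywords such as have, she, and, did, which are not useful to me. I want to eliminate the entire row for each such keyword, so that the dataset is free of this noise for future analysis.

What is a simple way to eliminate such rows using a chosen set of keywords?

Suggestions appreciated, thanks in advance!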

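【Question discussion】:

You mentioned two things in the comments: (a) you are reading the data from a CSV file; (b) it is really big (2 GB). That may make other solutions a better fit than a Pandas DataFrame, since this is a very simple operation on a huge dataset. Are you on Windows or Unix?
Windows, Anaconda 3.x, with updated packages.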

Tags: python windows pandas dataframe anaconda

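【Solution 1】:

Assuming you have a dataframe df... Use isin to find which rows do or do not have a keyword that appears in a list or set of words. Then use boolean indexing to filter the dataframe.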

list_of_words = ['she', 'have', 'did', 'and']
df[~df.keyword.isin(list_of_words)]
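To make the two lines above concrete, here is a minimal, self-contained sketch that applies them to the sample data from the question. It assumes Python 3 and pandas. The `skipinitialspace=True` option is my addition to cope with the spaces after the commas in the sample, and the names `sample_csv` and `cleaned` are only illustrative.

```python
import io

import pandas as pd

# Sample data copied from the question (note the spaces after the commas).
sample_csv = """id1, id2, keyword, freq, gp1, gps2
222, 111, #paris, 100, loc1, loc2
444, 234, have, 1000, loc3, loc4
434, 134, #USA, 30, loc5, loc6
234, 234, she, 600, loc1, loc2
523, 5234,mobile, 900, loc3, loc4
"""

# skipinitialspace=True strips the blank that follows each comma, so the
# column is named "keyword" rather than " keyword".
df = pd.read_csv(io.StringIO(sample_csv), skipinitialspace=True)

list_of_words = ["she", "have", "did", "and"]

# isin() marks the rows whose keyword is in the list; ~ inverts the mask,
# so only rows that are NOT in the list are kept. All columns survive.
cleaned = df[~df["keyword"].isin(list_of_words)]

print(cleaned)
```

Note that isin performs an exact, case-sensitive comparison, so a keyword such as "Have" would not be removed unless the column is lower-cased first.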

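【Discussion】:

I recommend this over my answer. Building the filter as a Numpy array may still be useful for other problems.
When I changed list_of_words = ['she', 'have', 'did', 'and'] to list_of_words = set(['she', 'have', 'did', 'and']), I got a small (5 - 15 %) speedup. Sets are optimized for fast membership lookups.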
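The list-versus-set comparison in the comment above can be checked with a rough timing sketch like the following. The synthetic data, its size, and the repeat count are arbitrary assumptions of mine, and the measured difference depends on the machine and the pandas version, so the speedup quoted above may not reproduce.

```python
import timeit

import pandas as pd

# Synthetic keyword column; the size and the words are arbitrary assumptions.
keywords = pd.Series(["#paris", "have", "#USA", "she", "mobile"] * 20_000)

words_as_list = ["she", "have", "did", "and"]
words_as_set = set(words_as_list)

# Both containers must select exactly the same rows.
mask_from_list = keywords.isin(words_as_list)
mask_from_set = keywords.isin(words_as_set)
assert mask_from_list.equals(mask_from_set)

# Time each variant; the numbers vary between machines and versions.
time_list = timeit.timeit(lambda: keywords.isin(words_as_list), number=20)
time_set = timeit.timeit(lambda: keywords.isin(words_as_set), number=20)
print(f"list: {time_list:.4f} s, set: {time_set:.4f} s")
```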

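【Solution 2】:

A new take, given the memory requirements. I am adding this as a new answer because the old one is still useful for small files. This one reads the input file line by line instead of loading the whole file into memory.

Save the program as filterbigcsv.py, then run it with python filterbigcsv.py big.csv clean.csv to read big.csv and write clean.csv. For a 1.6 GB test file, this takes about a minute on my system. Memory usage stays at 3 MB.

This script can handle files of any size; larger files simply take longer to finish.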

import sys


input_filename = sys.argv[1]
output_filename = sys.argv[2]


blacklist = set("""
have she and did
""".strip().split())


blacklist_column_index = 2 # Third column, zero indexed


with open(input_filename, "r") as fin, \
     open(output_filename, "w") as fout:
    for line in fin:
        if line.split(",")[blacklist_column_index].strip(", ") in blacklist:
            pass # Don't pass through
        else:
            fout.write(line) # Print line as it was, with its original line ending
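If you would rather stay within pandas for a file that does not fit in memory, a chunked read is one possible alternative to the script above. The following is only a sketch under assumptions: the function name `filter_csv_in_chunks`, the `chunksize` default, and the temporary demo files are all hypothetical, and the sample rows stand in for the real big.csv.

```python
import os
import tempfile

import pandas as pd


def filter_csv_in_chunks(input_path, output_path, blacklist,
                         column="keyword", chunksize=100_000):
    """Copy input_path to output_path, dropping rows whose column is blacklisted."""
    reader = pd.read_csv(input_path, skipinitialspace=True, chunksize=chunksize)
    for i, chunk in enumerate(reader):
        kept = chunk[~chunk[column].isin(blacklist)]
        # Write the header with the first chunk, then append the rest.
        kept.to_csv(output_path, mode="w" if i == 0 else "a",
                    header=(i == 0), index=False)


# Small demonstration on the sample rows from the question.
sample = """id1, id2, keyword, freq, gp1, gps2
222, 111, #paris, 100, loc1, loc2
444, 234, have, 1000, loc3, loc4
434, 134, #USA, 30, loc5, loc6
234, 234, she, 600, loc1, loc2
523, 5234,mobile, 900, loc3, loc4
"""

with tempfile.TemporaryDirectory() as tmp:
    big = os.path.join(tmp, "big.csv")
    clean = os.path.join(tmp, "clean.csv")
    with open(big, "w") as f:
        f.write(sample)
    # A tiny chunksize forces several chunks even on this small sample.
    filter_csv_in_chunks(big, clean, {"have", "she", "and", "did"}, chunksize=2)
    with open(clean) as f:
        result = f.read()

print(result)
```

Unlike the line-by-line script, this version parses and re-serializes every row, so the output is normalized (for example, the spaces after the commas disappear) rather than copied byte for byte, and it is likely to be slower.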

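【Discussion】:

Awesome! You are a life saver. Thanks a lot, this is what I was looking for. With files this large, IDEs just freeze the screen and process forever.
Glad to help :)
Thanks again. If you have time, could you also take a look at this question? There, too, I am mainly trying to reduce the dataset. stackoverflow.com/questions/44077739/…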

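【Solution 3】:

I did something similar a while ago. I was pleasantly surprised by how well Pandas and Numpy work together, and by the speed you get from sticking to vectorized operations.

The example below needs no files other than the source itself. Modify the table to suit your needs.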

from io import StringIO  # Python 3; in Python 2 this lived in the StringIO module

import pandas as pd
import numpy as np

src = """id1, id2, keyword, freq, gp1, gps2
222, 111, #paris, 100, loc1, loc2
444, 234, have, 1000, loc3, loc4
434, 134, #USA, 30, loc5, loc6
234, 234, she, 600, loc1, loc2
523, 5234,mobile, 900, loc3, loc4
"""

src_handle = StringIO(src)

blacklist_words = """
have she and did
""".split()

# Separate by comma and remove whitespace. A regex separator requires the
# Python parsing engine.
table = pd.read_csv(src_handle, sep=r",\s*", engine="python")

# You can create a single filter by straight-out comparison
filter_have = table["keyword"] == "have"

# Which you can use as a key directly
print(table[filter_have])

# We'll solve this by building the filter you need and applying it.

def filter_on_blacklisted_words(keyword, blacklist_words, dataframe):
    """Filter a Pandas dataframe by removing any rows that have column {keyword}
    in blacklist. Try to keep things vectorized for performance.
    """

    # In the beginning, no row is blacklisted: one False per row of the
    # dataframe we're using.
    blacklist_filter = np.zeros(len(dataframe), dtype=bool)

    for word in blacklist_words:
        # OR in the rows that match this word.
        blacklist_filter = np.logical_or(
            blacklist_filter, (dataframe[keyword] == word).to_numpy())
    return dataframe[np.logical_not(blacklist_filter)]

print(filter_on_blacklisted_words("keyword", blacklist_words, table))

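【Discussion】:

This looks great. My doubt is about loading a CSV of about 2 GB and making the same changes. Can this handle it?
I'd just try it. If it doesn't work, you can load the blacklist into memory and iterate over the 2 GB CSV line by line. Either way, the bottleneck here will be reading the file.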
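The fallback described in the last comment (blacklist in memory, CSV streamed row by row) could look roughly like the sketch below. It uses the standard csv module instead of splitting on commas, which is my own choice so that quoted fields containing commas are handled. The function name `filter_rows` and the in-memory demo streams are hypothetical.

```python
import csv
import io

blacklist = {"have", "she", "and", "did"}


def filter_rows(fin, fout, blacklist, column_index=2):
    """Stream rows from fin to fout, skipping rows with a blacklisted keyword."""
    reader = csv.reader(fin, skipinitialspace=True)
    writer = csv.writer(fout, lineterminator="\n")
    for row in reader:
        # Rows too short to have the keyword column are passed through.
        if len(row) > column_index and row[column_index].strip() in blacklist:
            continue
        writer.writerow(row)


# Demonstration on the sample rows, using in-memory streams instead of files.
sample = """id1, id2, keyword, freq, gp1, gps2
222, 111, #paris, 100, loc1, loc2
444, 234, have, 1000, loc3, loc4
434, 134, #USA, 30, loc5, loc6
234, 234, she, 600, loc1, loc2
523, 5234,mobile, 900, loc3, loc4
"""

source = io.StringIO(sample)
target = io.StringIO()
filter_rows(source, target, blacklist)
print(target.getvalue())
```

For real files, open both with newline="" as the csv module documentation recommends, for example open("big.csv", newline="") and open("clean.csv", "w", newline=""). Only one row is held in memory at a time, so the file size does not matter.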

【Solution 4】:

Given the data:

import pandas as pd

df = pd.DataFrame({
    'keyword': ['#paris', 'have', '#USA', 'she', 'mobile']
})
stopwords = set(['have', 'she', 'and', 'did'])

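The following approach tests whether a stopword is part of the keyword. This is a substring match, so a keyword such as 'brand' would also be dropped because it contains 'and':

df = df[~df['keyword'].str.contains('|'.join(stopwords))]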

Output:

  keyword
0  #paris
2    #USA
4  mobile
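Because str.contains treats the pattern as a regular expression and matches substrings, it can remove more rows than intended. The sketch below illustrates this with an extra, made-up keyword 'brand' that is not in the original data. The use of re.escape and of \b word boundaries is my suggestion, not part of the answer.

```python
import re

import pandas as pd

# 'brand' is a hypothetical extra keyword added to show the pitfall.
df = pd.DataFrame(
    {"keyword": ["#paris", "have", "#USA", "she", "brand", "mobile"]}
)
stopwords = ["have", "she", "and", "did"]

# Plain alternation matches substrings, so "brand" is dropped because it
# contains "and". re.escape guards against regex metacharacters in the words.
substring_pattern = "|".join(re.escape(word) for word in stopwords)
substring_result = df[~df["keyword"].str.contains(substring_pattern)]

# \b anchors make each stopword match only as a whole word.
whole_word_pattern = r"\b(?:" + substring_pattern + r")\b"
whole_word_result = df[~df["keyword"].str.contains(whole_word_pattern)]

print(substring_result["keyword"].tolist())
print(whole_word_result["keyword"].tolist())
```

If the keyword column may contain missing values, passing na=False to str.contains avoids NaN entries in the mask.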

The next approach tests whether a stopword matches the keyword exactly (1:1):

df = df.drop(df[df['keyword'].map(lambda word: word in stopwords)].index)

Output:

  keyword
0  #paris
2    #USA
4  mobile

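【Discussion】:

Thanks for the answer. I also need all the columns, as mentioned above. I have a big dataset and need a simple way to remove only the rows with these keywords and keep the remaining data as it is.
Just add the missing columns to the given data. It works.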
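As a sketch of what the last reply suggests, here is the exact-match approach applied to a dataframe that carries all six columns from the question. The dataframe is typed in by hand from the sample rows, and the name `is_stopword` is only illustrative.

```python
import pandas as pd

# All columns from the question's sample, entered by hand.
df = pd.DataFrame({
    "id1": [222, 444, 434, 234, 523],
    "id2": [111, 234, 134, 234, 5234],
    "keyword": ["#paris", "have", "#USA", "she", "mobile"],
    "freq": [100, 1000, 30, 600, 900],
    "gp1": ["loc1", "loc3", "loc5", "loc1", "loc3"],
    "gps2": ["loc2", "loc4", "loc6", "loc2", "loc4"],
})
stopwords = {"have", "she", "and", "did"}

# Boolean mask: True where the keyword is exactly one of the stopwords.
is_stopword = df["keyword"].map(lambda word: word in stopwords)

# drop() removes those rows by index label; every column is kept.
df = df.drop(df[is_stopword].index)

print(df)
```

The surviving rows keep their original index labels (0, 2, 4); call df.reset_index(drop=True) afterwards if a contiguous index is needed.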