How to filter a dataframe using partial matches from another dataframe

【Question title】: How to filter a dataframe using partial matches from another dataframe
【Posted】: 2015-10-19 18:24:18
【Question】:

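I have two dataframes, and I want to use one of them to filter the other and create a new dataframe. The two dataframes share a column with similar information, but the values are not exact matches. I have been trying to use str.contains, but every attempt so far raises TypeError: 'Series' objects are mutable, thus they cannot be hashed. Here is a sample of my dataframes and the code I tried.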

promoter = pd.read_csv('promoter_coordinate.csv')
print(promoter.head())

AssociatedGeneName            B      C    D E                                   F
            plexB_1  NC_004353.3  64381  - Drosophila melanogaster (Fruit fly)  region 
               ci_1  NC_004353.3  76925  - Drosophila melanogaster (Fruit fly)  region   
             RS3A_1  NC_004353.3  87829  - Drosophila melanogaster (Fruit fly)  region   
              pan_1  NC_004353.3  89986  + Drosophila melanogaster (Fruit fly)  region  
              pan_2  NC_004353.3  90281  + Drosophila melanogaster (Fruit fly)  region   

data = pd.read_csv('FBgn with gene name.csv')
print(data.head())
Gene AssociatedGeneName   FBgn Number     timepoint
CG10002        fkh        FBgn0000659          2   
CG10002        fkh        FBgn0000659          2   
CG10002        fkh        FBgn0000659          2   
CG10002        fkh        FBgn0000659          2   
CG10006    CG10006        FBgn0036461          2   

x = promoter[promoter['AssociatedGeneName'].str.contains(data['AssociatedGeneName'])]

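Neither of the heads shown above contains a match, but the ideal result would look something like the following, where the two columns named "AssociatedGeneName" are compared.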

AssociatedGeneName            B      C    D  E                                    F    
             fkh_1  NT_033777.2  24410805 -  Drosophila melanogaster (Fruit fly)  region

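Basically, I want a dataframe containing all rows of promoter whose value partially matches a value in data['AssociatedGeneName']. I would appreciate it if someone could point me in the right direction. I am relatively new to coding; I have been using Python and pandas and would prefer to keep using Python for this. This is the error I keep getting.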

x = promoter[promoter['AssociatedGeneName'].str.contains(data['AssociatedGeneName'])]

Traceback (most recent call last):
  File "<pyshell#15>", line 1, in <module>
    x = promoter[promoter['AssociatedGeneName'].str.contains(data['Associated Gene Name'])]
  File "C:\Python34\lib\site-packages\pandas\core\strings.py", line 1226, in contains
na=na, regex=regex)
  File "C:\Python34\lib\site-packages\pandas\core\strings.py", line 203, in str_contains
regex = re.compile(pat, flags=flags)
  File "C:\Python34\lib\re.py", line 219, in compile
return _compile(pattern, flags)
  File "C:\Python34\lib\re.py", line 278, in _compile
return _cache[type(pattern), pattern, flags]
  File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 663, in __hash__
    ' hashed'.format(self.__class__.__name__))
TypeError: 'Series' objects are mutable, thus they cannot be hashed

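【Comments】:

Is data supposed to contain duplicate rows? Or does each row in promoter have only one partial match?
Yes, there should be duplicate rows.
Do you want to merge the two dataframes?
No, I only want the rows of promoter for which data contains a partial match. I will try the solutions provided and see whether they work for me!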

Tags: python csv pandas filtering dataframe

【Solution 1】:

First, create a function that checks whether a value from promoter contains any value from data as a substring. It tests every value in data.

def contain_partial(x, y=data.AssociatedGeneName):
    # For one promoter name x, record whether each name z from data is a substring of it
    res = []
    for z in y:
        res.append(z in x)
    return res

Applying the function to every promoter name gives a Series of boolean lists:

contains = promoter.AssociatedGeneName.apply(contain_partial)

Finally, check whether at least one value in each list is True, and use that result to filter promoter:

promoter[contains.apply(any)]
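The steps above can be tried end to end on toy data. This is only a sketch: the two small frames are invented for illustration (only the column name AssociatedGeneName comes from the question), and the gene names are de-duplicated first, which the answer itself does not do.

```python
import pandas as pd

# Toy stand-ins for the question's frames (values are illustrative only).
promoter = pd.DataFrame({"AssociatedGeneName": ["plexB_1", "fkh_1", "pan_1", "pan_2"]})
data = pd.DataFrame({"AssociatedGeneName": ["fkh", "fkh", "pan", "CG10006"]})

# Unique, non-missing gene names, so duplicated rows in `data` are tested once.
gene_names = data["AssociatedGeneName"].dropna().unique()


def contain_partial(name, candidates=gene_names):
    """Return one boolean per candidate: is the candidate a substring of `name`?"""
    return [candidate in name for candidate in candidates]


# Series of boolean lists, one list per promoter row.
contains = promoter["AssociatedGeneName"].apply(contain_partial)

# Keep the rows where at least one candidate matched.
filtered = promoter[contains.apply(any)]
print(filtered)
```

Note that this is plain substring matching, so a short gene name such as "ci" would also match any longer promoter name that happens to contain those letters.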

【Discussion】:

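【Solution 2】:

str.contains takes a single string as its argument and checks whether that string is contained in each promoter.AssociatedGeneName entry, returning True or False for each index (row).

However, when you pass data.AssociatedGeneName to str.contains, you are passing a pandas.Series, which is why you get the error. As the traceback shows, the pattern is handed to re.compile, which tries to hash it for its cache lookup, and a Series cannot be hashed.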
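Since str.contains expects one pattern, one possible workaround is to combine all the gene names into a single regular expression joined with |. This is a sketch of that idea, not something either answer proposes; the toy frames are invented for illustration, and re.escape is used on the assumption that names should be matched literally even if they contain regex metacharacters.

```python
import re

import pandas as pd

# Toy stand-ins for the question's frames (values are illustrative only).
promoter = pd.DataFrame({"AssociatedGeneName": ["plexB_1", "fkh_1", "pan_1", "RS3A_1"]})
data = pd.DataFrame({"AssociatedGeneName": ["fkh", "fkh", "pan", "CG10006"]})

# De-duplicate the names and escape them so each one is matched literally.
names = data["AssociatedGeneName"].dropna().unique()
pattern = "|".join(re.escape(name) for name in names)

# One vectorised call: True where any of the names occurs in the promoter name.
mask = promoter["AssociatedGeneName"].str.contains(pattern, regex=True, na=False)
x = promoter[mask]
print(x)
```

With a very large number of names the combined pattern can become long, so it may be worth comparing this against the loop-based approaches on real data.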

If you only want the rows of promoter that have a partial match, you can do:

from numpy import where
where_inds_par = [where(promoter.AssociatedGeneName.str.contains(partial))[0] for partial in data.AssociatedGeneName]

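Each element of where_inds_par is itself an array of indices with length >= 0. Because your data.AssociatedGeneName column contains duplicates, the result will contain some redundancy, but you can filter it out with set and a flattening comprehension: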

inds_par = list(set(i for sublist in where_inds_par for i in sublist))  # set keeps only the unique positions
promoter_par = promoter.loc[promoter.index[inds_par]]  # .loc replaces .ix, which was removed in pandas 1.0
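For reference, here is a self-contained sketch of the same index-collecting approach on toy data. The frames are invented for illustration, and three details are assumptions added here rather than part of the answer: regex=False so names are matched literally, na=False so missing values count as non-matches, and sorting the positions so the filtered rows keep their original order.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the question's frames (values are illustrative only).
promoter = pd.DataFrame({"AssociatedGeneName": ["plexB_1", "fkh_1", "pan_1", "pan_2"]})
data = pd.DataFrame({"AssociatedGeneName": ["fkh", "fkh", "pan", "CG10006"]})

# One array of matching row positions per gene name in `data`.
where_inds_par = [
    np.where(
        promoter["AssociatedGeneName"]
        .str.contains(partial, regex=False, na=False)
        .to_numpy()
    )[0]
    for partial in data["AssociatedGeneName"]
]

# Flatten, de-duplicate, and sort the positions.
inds_par = sorted({int(i) for sublist in where_inds_par for i in sublist})

# The positions are integer offsets, so select the rows with iloc.
promoter_par = promoter.iloc[inds_par]
print(promoter_par)
```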

【Discussion】: