【Posted】: 2015-10-19 18:24:18
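【Question】:
I have two data frames, and I want to use one of them to filter the other and create a new data frame. The two data frames share a column with similar, but not exactly matching, information. I have been trying to use str.contains, but every attempt so far raises TypeError: 'Series' objects are mutable, thus they cannot be hashed. Here are samples of my data frames and the code I have tried.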
promoter = pd.read_csv('promoter_coordinate.csv')
print(promoter.head())
AssociatedGeneName B C D E F
plexB_1 NC_004353.3 64381 - Drosophila melanogaster (Fruit fly) region
ci_1 NC_004353.3 76925 - Drosophila melanogaster (Fruit fly) region
RS3A_1 NC_004353.3 87829 - Drosophila melanogaster (Fruit fly) region
pan_1 NC_004353.3 89986 + Drosophila melanogaster (Fruit fly) region
pan_2 NC_004353.3 90281 + Drosophila melanogaster (Fruit fly) region
data = pd.read_csv('FBgn with gene name.csv')
print(data.head())
Gene AssociatedGeneName FBgn Number timepoint
CG10002 fkh FBgn0000659 2
CG10002 fkh FBgn0000659 2
CG10002 fkh FBgn0000659 2
CG10002 fkh FBgn0000659 2
CG10006 CG10006 FBgn0036461 2
x = promoter[promoter['AssociatedGeneName'].str.contains(data['AssociatedGeneName'])]
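There are no matches between the heads of the two data frames, but ideally the result would look something like the following, where the two columns named "AssociatedGeneName" are compared.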
AssociatedGeneName B C D E F
fkh_1 NT_033777.2 24410805 - Drosophila melanogaster (Fruit fly) region
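Basically, I want a data frame containing all rows of promoter whose AssociatedGeneName partially matches a value in data['AssociatedGeneName']. I would appreciate it if someone could point me in the right direction. I am relatively new to coding; I have been using Python and pandas, and I would prefer to stay with Python for this problem. This is the error I keep getting.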
x = promoter[promoter['AssociatedGeneName'].str.contains(data['AssociatedGeneName'])]
Traceback (most recent call last):
File "<pyshell#15>", line 1, in <module>
x = promoter[promoter['AssociatedGeneName'].str.contains(data['Associated Gene Name'])]
File "C:\Python34\lib\site-packages\pandas\core\strings.py", line 1226, in contains
na=na, regex=regex)
File "C:\Python34\lib\site-packages\pandas\core\strings.py", line 203, in str_contains
regex = re.compile(pat, flags=flags)
File "C:\Python34\lib\re.py", line 219, in compile
return _compile(pattern, flags)
File "C:\Python34\lib\re.py", line 278, in _compile
return _cache[type(pattern), pattern, flags]
File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 663, in __hash__
' hashed'.format(self.__class__.__name__))
TypeError: 'Series' objects are mutable, thus they cannot be hashed
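The traceback shows where this goes wrong: str.contains hands its argument to re.compile, which expects a single pattern string, and the whole Series is passed instead. One possible way around it, sketched below, is to collapse the names in data into one regular expression joined with "|". The small data frames here are made-up stand-ins for the two CSV files, not the real data, and the variable names names, pattern and mask are only illustrative.

```python
import re

import pandas as pd

# Toy stand-ins for the two CSV files (assumed values, for illustration only).
promoter = pd.DataFrame({
    "AssociatedGeneName": ["plexB_1", "ci_1", "fkh_1", "pan_1", "pan_2"],
    "C": [64381, 76925, 24410805, 89986, 90281],
})
data = pd.DataFrame({
    "AssociatedGeneName": ["fkh", "fkh", "pan", "CG10006"],
})

# Drop missing values and duplicates, then escape each name so that any
# regex metacharacters in a gene name are matched literally.
names = data["AssociatedGeneName"].dropna().unique()
pattern = "|".join(re.escape(str(name)) for name in names)

# A single pattern string is what str.contains expects; na=False treats
# missing values in promoter as non-matches.
mask = promoter["AssociatedGeneName"].str.contains(pattern, na=False)
x = promoter[mask]
print(x)
```

This is plain substring matching, so a short name can also match inside a longer, unrelated one. If the promoter names always follow a "<gene>_<number>" convention (an assumption based on the sample rows), anchoring the pattern as "^(?:" + pattern + ")_" would probably be stricter.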
【Comments】:
-
Should there be duplicate rows in data? Or does each row in promoter have only one partial match?
-
Yes, there should be duplicate rows.
-
Do you want to merge the two data frames?
-
No, I only want the rows in promoter that have a partial match in data. I will try the solutions provided and see whether they work for me!
Tags: python csv pandas filtering dataframe