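【Question title】: Python, Pandas to match data frame and indicate findings from a list
【Posted】: 2019-01-06 08:51:42
【Question description】:

I have a list of fruits, and I want to check whether any of them appear in the data frame (in any column), identify which ones, and mark them.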

import pandas as pd

Fruits = ["Avocado", "Blackberry", "Black Sapote", "Fingered Citron", "Crab Apples", "Custard Apple", "Chico Fruit", "Coconut", "Damson", "Elderberry", "Goji Berry", "Grape", "Guava", "Huckleberry"]

data = {'ID': ["488", "14805", "23591", "470995", "56251", "85964", "5268", "322624", "342225", "380689", "480562", "5623"], 
'Content' : ["Kalo Beruin", "this is Blackberry", "Khara Beruin", "guava and coconut", "Lapha", "Loha Sura", "Matichak", "Miniket Rice", "Mou Beruin", "Moulata", "oh Goji Berry", "purple Grape"],
'Content_1' : ["Jook-sing noodles", "grape", "Lai fun", "Damson", "Liangpi", "Custard Apple and Crab apples", "Misua", "nana Coconut Berry", "Damson", "Paomo", "Ramen", "Rice vermicelli"]}

df = pd.DataFrame(data)
df = df[['ID', 'Content', 'Content_1']]

s = pd.Series(data['Content'])
s_1 = pd.Series(data['Content_1'])

df["found_content"] = s[s.str.contains('|'.join(Fruits))]
df["found_content_1"] = s_1[s_1.str.contains('|'.join(Fruits))]

# Use the writer as a context manager so the file is saved on exit
with pd.ExcelWriter('C:\\TEM\\22522.xlsx') as writer:
    df.to_excel(writer, sheet_name='Sheet1', index=False)

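The problems with the code are:

  1. It does not show the fruit but the whole content. For example, in row 14805 it should be just "Blackberry" rather than the entire original content.
  2. It is case-sensitive, so some findings are missed, e.g. in row 14805.
  3. I want to separate multiple findings with ";", as in row 85964.

How can I achieve this? Thank you.

Here is a screenshot of the current output and the desired output.

【Question comments】:

  • This is a bit long-winded; could you simplify the example if possible?
  • @coldspeed, thanks for your comment. It was meant to provide more samples for testing. I'll keep that in mind next time.

Tags: python pandas dataframe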


【Solution 1】:

Use str.findall with re.I to ignore case, then join the resulting lists with str.join:

import re
# \b marks a word boundary; the non-capturing group (?:...) makes it apply to every name, not just the first and last
pat = r'\b(?:{})\b'.format('|'.join(Fruits))
df["found_content"] = df['Content'].str.findall(pat, re.I).str.join(';')
df["found_content_1"] = df['Content_1'].str.findall(pat, re.I).str.join(';')
print (df)
        ID             Content                      Content_1  found_content  \
0      488         Kalo Beruin              Jook-sing noodles                  
1    14805  this is Blackberry                          grape     Blackberry   
2    23591        Khara Beruin                        Lai fun                  
3   470995   guava and coconut                         Damson  guava;coconut   
4    56251               Lapha                        Liangpi                  
5    85964           Loha Sura  Custard Apple and Crab apples                  
6     5268            Matichak                          Misua                  
7   322624        Miniket Rice             nana Coconut Berry                  
8   342225          Mou Beruin                         Damson                  
9   380689             Moulata                          Paomo                  
10  480562       oh Goji Berry                          Ramen     Goji Berry   
11    5623        purple Grape                Rice vermicelli          Grape   

              found_content_1  
0                              
1                       grape  
2                              
3                      Damson  
4                              
5   Custard Apple;Crab apples  
6                              
7                     Coconut  
8                      Damson  
9                              
10                             
11         
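
A note on the pattern: in a regular expression, alternation binds more loosely than \b, so boundaries placed around an ungrouped '|'.join(...) apply only to the first and last names. The following is a minimal sketch, using only the standard re module and a shortened, hypothetical fruit list, that compares an ungrouped pattern with one wrapped in a non-capturing group. The re.escape call is an added precaution and is not part of the original answer.

```python
import re

# Shortened, hypothetical list for illustration only
fruits = ["Grape", "Guava", "Coconut"]

# Ungrouped: reads as (\bGrape | Guava | Coconut\b)
ungrouped = re.compile(r'(\b{}\b)'.format('|'.join(fruits)), re.I)

# Grouped: \b applies before and after every alternative
grouped = re.compile(
    r'\b(?:{})\b'.format('|'.join(re.escape(f) for f in fruits)), re.I
)

text = "Guavas and Coconuts and grapefruit"
ungrouped_hits = ungrouped.findall(text)  # partial words slip through
grouped_hits = grouped.findall(text)      # whole words only

print(ungrouped_hits)
print(grouped_hits)
print(grouped.findall("guava and coconut"))
```

With whole-word input such as the sample data in the question, both patterns find the same names; the difference only shows up when a fruit name is embedded in a longer word.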

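Another solution is to use str.title instead of re.I. This relies on every name in Fruits already being title-cased, and the results are returned in title case rather than as originally typed: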

pat = r'\b(?:{})\b'.format('|'.join(Fruits))
df["found_content"] = df['Content'].str.title().str.findall(pat).str.join(';')
df["found_content_1"] = df['Content_1'].str.title().str.findall(pat).str.join(';')
print (df)
        ID             Content                      Content_1  found_content  \
0      488         Kalo Beruin              Jook-sing noodles                  
1    14805  this is Blackberry                          grape     Blackberry   
2    23591        Khara Beruin                        Lai fun                  
3   470995   guava and coconut                         Damson  Guava;Coconut   
4    56251               Lapha                        Liangpi                  
5    85964           Loha Sura  Custard Apple and Crab apples                  
6     5268            Matichak                          Misua                  
7   322624        Miniket Rice             nana Coconut Berry                  
8   342225          Mou Beruin                         Damson                  
9   380689             Moulata                          Paomo                  
10  480562       oh Goji Berry                          Ramen     Goji Berry   
11    5623        purple Grape                Rice vermicelli          Grape   

              found_content_1  
0                              
1                       Grape  
2                              
3                      Damson  
4                              
5   Custard Apple;Crab Apples  
6                              
7                     Coconut  
8                      Damson  
9                              
10                             
11                 
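
If the list may contain names that are not title-cased, one possible variation is to match case-insensitively and then map each hit back to the spelling used in the list. This is only a sketch under assumptions: the helper find_fruits, the lookup dictionary canonical, and the shortened sample data are hypothetical and do not come from the original answer.

```python
import re
import pandas as pd

# Shortened, hypothetical list and data for illustration only
fruits = ["Crab Apples", "Custard Apple", "Coconut", "Grape", "Guava"]

# Lookup from a case-folded name to the spelling used in the list
canonical = {name.casefold(): name for name in fruits}

pat = r'\b(?:{})\b'.format('|'.join(re.escape(f) for f in fruits))

def find_fruits(text):
    """Return the list spellings of all fruits found in text, joined by ';'."""
    hits = re.findall(pat, text, flags=re.I)
    return ';'.join(canonical[h.casefold()] for h in hits)

s = pd.Series(["guava and coconut", "Custard Apple and Crab apples", "grape", "Lapha"])
result = s.map(find_fruits)
print(result.tolist())
```

Because the text itself is never altered, this approach does not depend on how str.title capitalizes words.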

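【Comments】:

  • Thank you, sir! However, when I apply the code to another file/data frame, the results show as NaN. I'm looking into it.
  • Sir, could you advise in what case there is a match but "NaN" is shown? (I applied the code to a workbook. It has findings, but they all show NaN.)
  • Hard to say; it seems to be a data problem. Is the real data confidential?
  • @MarkK - is it python2?
  • @MarkK - It's an encoding problem, which is really not easy to help with because it depends on the data. But one idea: how does this or this work?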
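
The thread does not confirm what caused the NaN values. One possible cause, offered here only as an assumption, is that a column read from a workbook holds non-string cells (numbers or empty cells), for which the .str methods return NaN instead of a list. The sketch below uses a made-up column to show that case and one way to normalize the cells to strings first.

```python
import re
import pandas as pd

fruits = ["Grape", "Damson"]
pat = r'\b(?:{})\b'.format('|'.join(map(re.escape, fruits)))

# Hypothetical column as it might come from a workbook:
# one text cell, one numeric cell, one empty cell
col = pd.Series(["purple Grape", 42, None], dtype=object)

# Non-string cells are not searched; they come back as NaN
raw = col.str.findall(pat, flags=re.I).str.join(';')

# Assumption: treating every cell as text is acceptable for this column
cleaned = col.fillna('').astype(str).str.findall(pat, flags=re.I).str.join(';')

print(raw.isna().tolist())
print(cleaned.tolist())
```

If the real data looks different (for example, an encoding problem as suggested above), this sketch would not apply.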

【Solution 2】:

Maybe something like this:

import pandas as pd

Fruits = ["Avocado", "Blackberry", "Black Sapote", "Fingered Citron", "Crab Apples", "Custard Apple", "Chico Fruit", "Coconut", "Damson", "Elderberry", "Goji Berry", "Grape", "Guava", "Huckleberry"]

data = {'ID': ["488", "14805", "23591", "470995", "56251", "85964", "5268", "322624", "342225", "380689", "480562", "5623"], 
'Content' : ["Kalo Beruin", "this is Blackberry", "Khara Beruin", "guava and coconut", "Lapha", "Loha Sura", "Matichak", "Miniket Rice", "Mou Beruin", "Moulata", "oh Goji Berry", "purple Grape"],
'Content_1' : ["Jook-sing noodles", "grape", "Lai fun", "Damson", "Liangpi", "Custard Apple and Crab apples", "Misua", "nana Coconut Berry", "Damson", "Paomo", "Ramen", "Rice vermicelli"]}

df = pd.DataFrame(data)
df["found_content"] = df['Content'].str.extract('(?P<Fruits>{})'.format("|".join(Fruits)), expand=True).fillna('')
df["found_content_1"] = df['Content_1'].str.extract('(?P<Fruits>{})'.format("|".join(Fruits)), expand=True).fillna('')

# Use the writer as a context manager so the file is saved on exit
with pd.ExcelWriter('filename.xlsx') as writer:
    df.to_excel(writer, sheet_name='Sheet1', index=False)

【Comments】:

  • Thanks for your help. When the code is applied to the sample, it picks only 1 result instead of all results.
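
The behavior described in this comment follows from how str.extract works: it returns only the first match in each string, whereas str.findall returns every match. A minimal sketch with a shortened, hypothetical fruit list and two made-up cells illustrates the difference:

```python
import re
import pandas as pd

# Shortened, hypothetical list and data for illustration only
fruits = ["Coconut", "Guava"]
pat = r'\b(?:{})\b'.format('|'.join(map(re.escape, fruits)))
s = pd.Series(["guava and coconut", "Lapha"])

# str.extract needs a capture group and keeps only the first match per cell
first_only = s.str.extract('({})'.format(pat), flags=re.I, expand=False).fillna('')

# str.findall collects every match per cell; str.join turns each list into text
all_hits = s.str.findall(pat, flags=re.I).str.join(';')

print(first_only.tolist())
print(all_hits.tolist())
```

For the requirement in the question (all findings, separated by ";"), str.findall followed by str.join is therefore the closer fit.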