Append values to one dataframe based on their frequency in another dataframe: answers

【Question title】: Append values to one dataframe based on their frequency in another dataframe
【Posted】: 2019-10-14 05:12:52
【Question description】:

I have two dataframes. df1 is the result of a groupby, i.e. df.groupby('keyword'):

df1

keyword     string

   A        "This is a test string for the example" 
            "This is also a test string based on the other string"
            "This string is a test string based on the other strings"
   B        "You can probably guess that this is also a test string"
            "Yet again, another test string"
            "This is also a test"

and df2,

which is an empty dataframe. I also have a list of specific values:

keyword_list = ['string', 'test']

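Basically, I want to count how often each word in keyword_list occurs in df1, per keyword, and then write the most frequent word for each keyword into a specific column of the new dataframe. So 'A' in df2 gets the word that occurs most often in df1's string column for keyword A.

Ideally, since 'string' is the most frequent word in the string column for keyword A in df1, A is assigned 'string', and so on.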

df2

keyword    High_freq_word

   A         "string"
   B         "test"

Let me know if you need any clarification or if this makes sense!

Update:

@anky_91 provided some great code, but the output is a bit awkward:

df['matches'] = df.description.str.findall('|'.join(keyword_list))

df.groupby(odf.Type.ffill()).matches.apply(lambda x: ''.join(mode(list(chain.from_iterable(x)))[0]))

which gives you

df1

keyword     string                                                     

   A        "This is a test string for the example" 
            "This is also a test string based on the other string"
            "This string is a test string based on the other strings"
   B        "You can probably guess that this is also a test string"
            "Yet again, another test string"
            "This is also a test"

but it also adds a new column:

matches

['string','test']
['test', 'string', 'string']
[etc...]

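I can figure out a way to convert this to numbers and then assign that value to the column, but the bigger problem is getting this new column into the new dataframe.

Since this is a groupby with several repeated values, I'm trying to find a pythonic way to map the "most frequent word" to the keyword itself, rather than taking the mode of the whole keyword list.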

【Question discussion】:

Tags: python pandas

【Solution 1】:

As I understand it, you can do the following:

from itertools import chain
from scipy.stats import mode

keyword_list = ['string', 'test']
# find all keyword matches in each row
df['matches'] = df.string.str.findall('|'.join(keyword_list))
# flatten the matches per keyword group and take the most frequent one
df.groupby(df.keyword.ffill()).matches.apply(lambda x: ''.join(mode(list(chain.from_iterable(x)))[0]))

keyword
A    string
B      test
Name: matches, dtype: object
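
Newer SciPy releases may reject string input to scipy.stats.mode, so here is a minimal sketch of the same idea using collections.Counter instead. This is a swapped-in technique, not the code from the answer above. The sample data is reconstructed from the question, the keyword column is assumed to be filled only on the first row of each group (hence the ffill()), and most_common_word is a hypothetical helper name.

```python
from collections import Counter

import pandas as pd

# Sample data reconstructed from the question (assumption: the keyword is
# only present on the first row of each group, so it has to be forward-filled).
df = pd.DataFrame({
    "keyword": ["A", None, None, "B", None, None],
    "string": [
        "This is a test string for the example",
        "This is also a test string based on the other string",
        "This string is a test string based on the other strings",
        "You can probably guess that this is also a test string",
        "Yet again, another test string",
        "This is also a test",
    ],
})
keyword_list = ["string", "test"]

# One list of matched keywords per row; kept as a separate Series so that
# no new column is added to df.
matches = df["string"].str.findall("|".join(keyword_list))


def most_common_word(lists_of_words):
    """Hypothetical helper: most frequent word across all lists in a group."""
    counts = Counter(word for words in lists_of_words for word in words)
    return counts.most_common(1)[0][0] if counts else None


# Group the per-row matches by the forward-filled keyword.
high_freq = matches.groupby(df["keyword"].ffill()).apply(most_common_word)
print(high_freq)
```

Note that the pattern 'string|test' also matches inside longer words such as "strings". If a keyword_list entry should only count as a whole word, wrapping each entry in \b word boundaries would be one way to do that. On ties, Counter.most_common returns the word that was seen first.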

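【Discussion】:

Sorry for the late reply. Now I just need to append this to the new dataframe instead of creating a new column in the old one!
@SebastianGoslin You can call reset_index() on the output and assign the result to a new dataframe. :)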
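
A minimal sketch of that suggestion, assuming the groupby result is a Series indexed by keyword like the one printed in the answer. The Series here is a hand-built stand-in for that output, and High_freq_word is the column name taken from the question.

```python
import pandas as pd

# Stand-in for the groupby output shown in the answer (assumption:
# a Series named "matches" with the keyword as its index).
result = pd.Series(
    ["string", "test"],
    index=pd.Index(["A", "B"], name="keyword"),
    name="matches",
)

# reset_index turns the keyword index into a regular column; the name
# argument sets the label of the value column in the new dataframe.
df2 = result.reset_index(name="High_freq_word")
print(df2)
```

The original dataframe is left untouched this way, since df2 is built only from the aggregated result.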
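I was just about to ask, because I just got a KeyError!
OK, now when I call reset_index() I run into this: ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
Looks like a pandas issue. I'll try appending the values to a dictionary and then mapping that to a new dataframe, hopefully that works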
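
For what it's worth, the dictionary route mentioned in the last comment could look roughly like this. It is only a sketch under assumptions: the keyword column is assumed to be filled in on every row, the sample data is reconstructed from the question, and top_words is a hypothetical name. It avoids both scipy.stats.mode and reset_index().

```python
import re
from collections import Counter

import pandas as pd

# Sample data reconstructed from the question (assumption: keyword is
# present on every row).
df1 = pd.DataFrame({
    "keyword": ["A", "A", "A", "B", "B", "B"],
    "string": [
        "This is a test string for the example",
        "This is also a test string based on the other string",
        "This string is a test string based on the other strings",
        "You can probably guess that this is also a test string",
        "Yet again, another test string",
        "This is also a test",
    ],
})
keyword_list = ["string", "test"]

# Escape each entry so that regex metacharacters in a keyword are taken literally.
pattern = re.compile("|".join(map(re.escape, keyword_list)))

# Map each keyword to its most frequent word.
top_words = {}
for keyword, group in df1.groupby("keyword"):
    counts = Counter()
    for text in group["string"]:
        counts.update(pattern.findall(text))
    if counts:  # skip groups in which no word from keyword_list occurs
        top_words[keyword] = counts.most_common(1)[0][0]

# Build the new dataframe directly from the dictionary.
df2 = pd.DataFrame(list(top_words.items()), columns=["keyword", "High_freq_word"])
print(df2)
```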