Append series results from apply() to new DataFrame? Answer

【Question title】: Append series results from apply() to new DataFrame?
【Posted】: 2020-05-06 19:34:27
【Question】:

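I have a function, called through apply, that iterates over a list of indices, feeds each one into a scikit-learn KNN model, and returns two lists of length n (neighbor distances and neighbor indices). (Think of it as a movie recommendation system.)

I want to add these results to a new DataFrame.

For example, if my function iterates over 3 indices and the n_neighbors parameter is 5, I should get a DataFrame with 2 columns and 3x5=15 rows. At the moment, however, my script puts each entire list into a single row, as shown below.

Here is my code. movies is the DataFrame that holds the input indices.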

testDF = pd.DataFrame()

def get_distances_indices(index):

    distances, indices = model_knn.kneighbors(data[index], n_neighbors=6)

    distances = pd.Series(distances.flatten().tolist())
    indices = pd.Series(indices.flatten().tolist())

    return indices, distances

testDF[['index','distance']] = testDF.append(movies.apply(lambda row: get_distances_indices(row['index']), axis=1).apply(pd.Series),ignore_index=True)

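Any help is appreciated. I'm a beginner, and I read articles saying that using apply here would help speed up getting the neighbor lists.

For simplicity, here is a reproducible example. I just want the lists/Series to be laid out vertically rather than horizontally.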

import pandas as pd

testDF = pd.DataFrame()
moviesData = {'movie': ['The Big Whale', 'Stack Underflow'], 'index': [3, 99]}
movies = pd.DataFrame(data=moviesData)

def get_distances_indices(index):
    list1 = [51, 700, 999]
    list2 = [.2, .3, .4]
    df2 = pd.Series(list1)
    df3 = pd.Series(list2)

    return df2,df3

testDF[['index','distance']] = movies.apply(lambda row: get_distances_indices(row['index']), axis=1).apply(pd.Series)
testDF.head()

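【Comments】:

Please see How to make good reproducible pandas examples. We don't really care where the data comes from. We need small sample data structures that we can copy and paste into our interpreter, plus the desired output data structure.
@timgeb I've added a reproducible example; let me know if I should add anything else. Thanks.

Tags: python pandas lambda append apply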

【Solution 1】:

You could try something like this:

...

def get_distances_indices(index):
    list1 = [51, 700, 999]
    list2 = [.2, .3, .4]

    # return a dictionary
    return {'index':list1, 'distance':list2}

d = movies.apply(lambda row: get_distances_indices(row['index']), axis=1)

# flatten the resulting lists
l1 = [item for sublist in [x['index'] for x in d] for item in sublist]
l2 = [item for sublist in [x['distance'] for x in d] for item in sublist]

data_tuples = list(zip(l1,l2))
pd.DataFrame(data=data_tuples, columns=['index', 'distance'])

If I understood your question correctly, this should give you the result you want:

   index  distance
0     51       0.2
1    700       0.3
2    999       0.4
3     51       0.2
4    700       0.3
5    999       0.4
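
The same vertical layout can probably also be reached without flattening by hand. The following is only a sketch, not part of the original answer: it assumes the helper can be changed to return a small two-column DataFrame per input index, and it keeps the placeholder values from the question instead of a real KNN call. The per-index frames are then stacked with pd.concat.

```python
import pandas as pd

movies = pd.DataFrame({'movie': ['The Big Whale', 'Stack Underflow'],
                       'index': [3, 99]})

def get_distances_indices(index):
    # Placeholder values standing in for the KNN output, as in the question.
    # Assumption: returning one DataFrame per input index is acceptable.
    return pd.DataFrame({'index': [51, 700, 999],
                         'distance': [.2, .3, .4]})

# Build one small frame per input index, then stack them vertically.
frames = [get_distances_indices(i) for i in movies['index']]
result = pd.concat(frames, ignore_index=True)
print(result)
```

With ignore_index=True the stacked frame gets a fresh 0..n-1 row index, which matches the output shown above.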

【Comments】:

I believe this is what I was looking for, thank you.
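
As a follow-up to the thread, a variant that also records which movie each neighbor belongs to may be useful for a recommender. This is a hedged sketch rather than anything proposed in the thread: the column name neighbor_index is made up for illustration, the values are the question's placeholders, and exploding several columns at once assumes pandas 1.3 or newer.

```python
import pandas as pd

movies = pd.DataFrame({'movie': ['The Big Whale', 'Stack Underflow'],
                       'index': [3, 99]})

def get_distances_indices(index):
    # Placeholder values, as in the question's reproducible example.
    return [51, 700, 999], [.2, .3, .4]

# One (indices, distances) pair per input movie.
pairs = [get_distances_indices(i) for i in movies['index']]

# Wide form: one row per movie, each cell temporarily holds a whole list.
wide = pd.DataFrame({
    'movie': movies['movie'],
    'neighbor_index': [p[0] for p in pairs],  # hypothetical column name
    'distance': [p[1] for p in pairs],
})

# Long form: one row per (movie, neighbor) pair.
# Passing a list of columns to explode assumes pandas >= 1.3.
long_df = wide.explode(['neighbor_index', 'distance'], ignore_index=True)
print(long_df)
```

The exploded columns come back with object dtype, so a cast such as astype may be needed before doing arithmetic on them.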