Python: fuzzywuzzy, the output of the first value is correct, the others are NaN

【Question title】: Python: fuzzywuzzy, the output of the first value is correct, the others are NaN
【Posted】: 2021-10-11 07:04:36
【Question】:

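I have run into a really strange problem: I have two DataFrames, and I have to match the strings of one to the strings of the other by similarity. The target columns hold the names of television programs (program_name_1 and program_name_2). To make the matcher choose from a limited set of candidates, I also use the 'Channel' column as a filter.

The function applies a fuzzy matching algorithm and returns, for each element of program_name_1, the matching element of program_name_2 together with the similarity score between them.

What is really strange is that the output is correct only for the first channel and not for any of the following ones. The first column (scorer_test_1, which holds program_name_1) is always printed correctly, but the second column (scorer_test_2, which should hold program_name_2) and the similarity column are NaN.

I have checked the DataFrames thoroughly: I am sure the column names are the same as the names in the code, and the other channels contain all the data I am asking for.

The strangest part is that the first channel and all the others live in the same DataFrame, so there is no difference in the data between channels.

Here are some toy datasets so that you can understand the problem better: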

import pandas as pd

df1 = {'Channel': ['1','1','1','2','2','2','3','4'], 'program_name_1': ['party','animals','gucci','the simpson', 'cars', 'mathematics', 'bikes', 'chef']}
df2 = {'Channel': ['1','1','1','2','2','2','3','4'], 'program_name_2': ['parties','gucci_gucci','animal','simpsons', 'math', 'the car', 'bike', 'cooking']}
df1 = pd.DataFrame(df1, columns = ['Channel','program_name_1'])
df2 = pd.DataFrame(df2, columns = ['Channel','program_name_2'])

Printing df1 gives:

  Channel program_name_1
       1          party
       1        animals
       1          gucci
       2    the simpson
       2           cars
       2    mathematics
       3          bikes
       4           chef

And df2:

  Channel program_name_2
       1        parties
       1    gucci_gucci
       1         animal
       2       simpsons
       2           math
       2        the car
       3           bike
       4        cooking

Here is the code:

import numpy as np
import pandas as pd
from fuzzywuzzy import process

scorer_test_1 = df1.loc[df1['Channel'] == '1', 'program_name_1']
scorer_test_2 = df2.loc[df2['Channel'] == '1', 'program_name_2']

# creation of a function for the score
# (scorer_dict is defined elsewhere and is not shown here)
def scorer_tester_function(x):
    matching_list = []
    similarity = []
    # iterate over the rows
    for i in scorer_test_1:
        if pd.isnull(i):
            matching_list.append(np.nan)
            similarity.append(np.nan)
        else:
            ratio = process.extract(i, scorer_test_2, limit=5, scorer=scorer_dict[x])
            matching_list.append(ratio[0][0])
            similarity.append(ratio[0][1])
    my_df = pd.DataFrame()
    my_df['program_name_1'] = scorer_test_1
    my_df['program_name_2'] = pd.Series(matching_list)
    my_df['similarity'] = pd.Series(similarity)

    return my_df

print(scorer_tester_function('R').head())

This is the output I would like to get for every channel, but I only get it when I pass the first channel in the code:

For channel 1:

program_name_1 program_name_2 similarity
    party          parties        95
    animals        animal         95
    gucci        gucci_gucci      75

For channel 2:

  program_name_1 program_name_2 similarity
   the simpson     simpsons        85
      cars          the car        75
   mathematics       math          70

If I ask for channel 2 or any later one, this is what I actually get:

Code:

scorer_test_1 = df1.loc[df1['Channel'] == '2', 'program_name_1']
scorer_test_2 = df2.loc[df2['Channel'] == '2', 'program_name_2']

Output:

  Channel program_name_1 program_name_2 similarity
     2     the simpson        NaN           NaN
     2        cars            NaN           NaN
     2    mathematics         NaN           NaN

I hope someone can help me :)

Thanks!

【Comments】:

Tags: python pandas dataframe nan fuzzywuzzy

【Solution 1】:

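This happens because the indexes do not match. The channel '1' slice happens to carry the row labels 0, 1, 2, which are the same labels pd.Series(matching_list) gets by default, so the values line up. The slices for the other channels keep their original labels (3, 4, 5 for channel '2'), pandas aligns on those labels when a Series is assigned to a column, and every row without a matching label becomes NaN. Resetting the index after adding the first series does the job!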
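The index alignment behind this can be reproduced without any fuzzy matching at all. The following is a minimal sketch, assuming hand-written strings in place of the matcher's results; the names `left`, `right`, `misaligned` and `aligned` are illustrative and do not come from the original post.

```python
import pandas as pd

# A slice that keeps its original row labels, like the channel '2' rows (3, 4, 5).
left = pd.Series(['the simpson', 'cars', 'mathematics'], index=[3, 4, 5])

# A Series built from a plain list always gets the default labels 0, 1, 2.
right = pd.Series(['simpsons', 'the car', 'math'])

misaligned = pd.DataFrame()
misaligned['program_name_1'] = left
# Column assignment aligns on labels: 3, 4, 5 do not exist in `right`,
# so every cell of the new column is NaN.
misaligned['program_name_2'] = right

aligned = pd.DataFrame()
aligned['program_name_1'] = left.reset_index(drop=True)
# Both sides now share the labels 0, 1, 2, so the values line up row by row.
aligned['program_name_2'] = right

print(misaligned)
print(aligned)
```

Printing both frames side by side shows the same data producing NaN in one case and filled values in the other, with the row labels as the only difference.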

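import numpy as np
import pandas as pd
from fuzzywuzzy import process

def scorer_tester_function(x):
    matching_list = []
    similarity = []
    # iterate over the rows
    for i in scorer_test_1:
        if pd.isnull(i):
            matching_list.append(np.nan)
            similarity.append(np.nan)
        else:
            # the default scorer is used here; add scorer=scorer_dict[x] to pick one
            ratio = process.extract(i, scorer_test_2, limit=5)
            matching_list.append(ratio[0][0])
            similarity.append(ratio[0][1])
    my_df = pd.DataFrame()
    my_df['program_name_1'] = scorer_test_1
    print(my_df.index)  # original labels of the slice, e.g. 3, 4, 5 for channel '2'
    # drop=True discards the old labels instead of keeping them as an 'index' column
    my_df.reset_index(drop=True, inplace=True)
    print(my_df.index)  # now 0, 1, 2, the same labels as pd.Series(matching_list)
    my_df['program_name_2'] = pd.Series(matching_list)
    my_df['similarity'] = pd.Series(similarity)

    return my_df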
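Since the question asks for results across every channel, one possible variation is to build each result frame from plain Python rows, which sidesteps index alignment entirely, and to loop over the channels with groupby. This is only a sketch under stated assumptions: `simple_ratio` and `match_channel` are hypothetical helpers, and the scorer is a stand-in based on the standard library's difflib rather than fuzzywuzzy, so its scores will differ from the ones shown in the question. Any function that takes two strings and returns a number (fuzzywuzzy's `fuzz.ratio`, for instance) could be passed as `scorer` instead.

```python
from difflib import SequenceMatcher

import pandas as pd


def simple_ratio(a, b):
    """Stand-in scorer (difflib, not fuzzywuzzy): similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def match_channel(names_1, names_2, scorer=simple_ratio):
    """For each name in names_1, find the best-scoring name in names_2."""
    rows = []
    for name in names_1:
        if pd.isnull(name) or len(names_2) == 0:
            rows.append((name, None, None))
            continue
        best = max(names_2, key=lambda candidate: scorer(name, candidate))
        rows.append((name, best, scorer(name, best)))
    # A frame built from plain tuples gets a fresh 0..n-1 index,
    # so there are no old row labels left to misalign.
    return pd.DataFrame(rows, columns=['program_name_1', 'program_name_2', 'similarity'])


# Toy data from the question.
df1 = pd.DataFrame({
    'Channel': ['1', '1', '1', '2', '2', '2', '3', '4'],
    'program_name_1': ['party', 'animals', 'gucci', 'the simpson',
                       'cars', 'mathematics', 'bikes', 'chef'],
})
df2 = pd.DataFrame({
    'Channel': ['1', '1', '1', '2', '2', '2', '3', '4'],
    'program_name_2': ['parties', 'gucci_gucci', 'animal', 'simpsons',
                       'math', 'the car', 'bike', 'cooking'],
})

results = []
for channel, group_1 in df1.groupby('Channel'):
    # Candidates are restricted to the same channel, as in the question.
    candidates = df2.loc[df2['Channel'] == channel, 'program_name_2'].dropna().tolist()
    matched = match_channel(group_1['program_name_1'].tolist(), candidates)
    matched.insert(0, 'Channel', channel)
    results.append(matched)

all_matches = pd.concat(results, ignore_index=True)
print(all_matches)
```

Passing lists rather than Series to the matcher also keeps the candidates free of index labels, so the result does not depend on which rows of the original frame a channel occupies.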

【Comments】: