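【Question Title】:Similarity between 2 dataframe columns
【Posted】:2018-05-28 06:15:07
【Question】:

I have two dataframes, each with a column named Song. Sometimes, however, the same song is spelled differently in the two. How can I use difflib (or something similar) to bring one dataframe's spelling of the Song into a new column of the other dataframe?

For example: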

Dataframe1

Song           Artist

like a virgi   madonna


Dataframe2

Song          Rank

like a virgin  2


Result

Song            Artist    SongAlt

like a virgin   Madonna   like a virgi

【Question Comments】:

    Tags: python dataframe similarity sentence-similarity


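    【Solution 1】:

    Step 1: Merge whatever matches exactly (the session below assumes pandas is imported as pd and difflib is imported)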

    In [67]: df1
    Out[67]: 
               Song    Artist
    0        mysong  myartist
    1  like a virgi   madonna
    
    In [68]: df2
    Out[68]: 
                Song  Rank
    0         mysong     1
    1  like a virgin     2
    
    In [69]: merged = pd.merge(df1, df2, on='Song')
    
    In [70]: merged
    Out[70]: 
         Song    Artist  Rank
    0  mysong  myartist     1
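
For anyone who wants to reproduce Step 1 outside the transcript, here is a minimal sketch. The two frames are rebuilt by hand from the printed output above, so the construction code is an assumption rather than what the answerer actually ran.

```python
import pandas as pd

# Sample frames rebuilt from the transcript output (construction is assumed).
df1 = pd.DataFrame({"Song": ["mysong", "like a virgi"],
                    "Artist": ["myartist", "madonna"]})
df2 = pd.DataFrame({"Song": ["mysong", "like a virgin"],
                    "Rank": [1, 2]})

# pd.merge defaults to an inner join, so only rows whose Song
# is spelled identically in both frames survive.
merged = pd.merge(df1, df2, on="Song")
print(merged)
```

The misspelled row drops out here, which is why the later steps are needed.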
    

    Step 2: Find out what is left over

    In [71]: unmerged = df2[~df2.isin(merged)].dropna()
    
    In [72]: unmerged
    Out[72]: 
                Song  Rank
    1  like a virgin   2.0
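
One caveat about Step 2: when DataFrame.isin is given another DataFrame, it compares cells only where both the index and the column labels line up, so the expression above works here largely because the matching row sits at index 0 in both frames. The masking is also what turns Rank into a float (2.0), since the hidden cells become NaN before dropna removes them. A sketch of two alternatives that compare on the Song key only, using the same sample data (rebuilt by hand, so an assumption):

```python
import pandas as pd

df1 = pd.DataFrame({"Song": ["mysong", "like a virgi"],
                    "Artist": ["myartist", "madonna"]})
df2 = pd.DataFrame({"Song": ["mysong", "like a virgin"],
                    "Rank": [1, 2]})

# Left-join df2 against df1's Song values; the _merge column added by
# indicator=True records whether each row found a partner.
flagged = df2.merge(df1[["Song"]].drop_duplicates(),
                    on="Song", how="left", indicator=True)
unmerged = flagged.loc[flagged["_merge"] == "left_only", ["Song", "Rank"]]

# Shorter variant: Series.isin on the key column, independent of index order.
unmerged_alt = df2[~df2["Song"].isin(df1["Song"])]

print(unmerged)
print(unmerged_alt)
```

Both variants keep Rank as an integer because no NaN is ever introduced into that column.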
    

    Step 3: Use difflib's get_close_matches to get the closest match

    In [73]: songs = list(df1['Song'].unique())
    
    In [74]: def closest(a):
        ...:     try:
        ...:         return difflib.get_close_matches(a, songs)[0]
        ...:     except IndexError:
        ...:         return "Not Found"
    
    In [75]: unmerged['closest_song'] = unmerged.apply(lambda row: closest(row['Song']), axis=1)
    
    In [76]: unmerged
    Out[76]: 
                Song  Rank  closest_song
    1  like a virgin   2.0  like a virgi
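
difflib.get_close_matches also accepts n (maximum number of results) and cutoff (minimum similarity, 0.6 by default), which gives a way to control how loose a match may be. The sketch below is a variant of the closest helper above. Passing the candidate list as an argument and returning None instead of "Not Found" are my own choices, not part of the original answer.

```python
import difflib

songs = ["mysong", "like a virgi"]

def closest(title, candidates, cutoff=0.6):
    """Return the best match for title, or None if nothing scores >= cutoff."""
    matches = difflib.get_close_matches(title, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(closest("like a virgin", songs))  # close enough to "like a virgi"
print(closest("thriller", songs))       # nothing similar, so None
```

Raising cutoff toward 1.0 makes matching stricter, which may help when many titles are short or share common words.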
    

    Step 4: Get the similarity percentage, if needed

    In [77]: def similar(a, b):
        ...:     return difflib.SequenceMatcher(None, a, b).ratio()
    
    In [78]: unmerged['Similarity'] = unmerged.apply(lambda row: similar(row['closest_song'], row['Song']), axis=1)
    
    In [79]: unmerged
    Out[79]: 
                Song  Rank  closest_song  Similarity
    1  like a virgin   2.0  like a virgi        0.96
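
To see where 0.96 comes from: SequenceMatcher.ratio() is documented as 2*M/T, where M is the number of matched characters and T is the combined length of both strings. Here all 12 characters of "like a virgi" match and the lengths are 12 and 13, so the ratio is 24/25. A small sketch that recomputes it by hand:

```python
import difflib

a, b = "like a virgi", "like a virgin"
matcher = difflib.SequenceMatcher(None, a, b)

# M: total size of all matching blocks; T: combined length of both strings.
matched = sum(block.size for block in matcher.get_matching_blocks())
manual = 2 * matched / (len(a) + len(b))

print(matched, manual, matcher.ratio())
```

The difflib documentation notes that the result of ratio() can depend on argument order in some cases, so it is probably safest to pass the two columns in a consistent order.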
    

    Step 5: Merge using the closest value

    In [80]: unmerged.rename(columns={'Song': 'Old_Song', 'closest_song': 'Song'}, inplace=True)
    
    In [81]: new = unmerged.merge(df1, on='Song')[['Song', 'Artist', 'Rank']]
    
    In [82]: new
    Out[82]: 
               Song   Artist  Rank
    0  like a virgi  madonna   2.0
    
    In [83]: pd.concat([merged, new])
    Out[83]: 
               Song    Artist  Rank
    0        mysong  myartist   1.0
    0  like a virgi   madonna   2.0
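
The steps above end with df1's spelling in the Song column, while the question asked for df2's spelling as Song plus df1's spelling in a separate SongAlt column. Below is a rough sketch of one way to get that shape in a single pass. The function name attach_alt_spelling and its cutoff parameter are hypothetical, introduced only for illustration, and the sample frames are rebuilt by hand.

```python
import difflib
import pandas as pd

def attach_alt_spelling(df1, df2, cutoff=0.6):
    """Hypothetical helper: for each Song in df2, find the closest Song
    in df1 and bring along that row's Artist."""
    candidates = df1["Song"].unique().tolist()

    def closest(title):
        hits = difflib.get_close_matches(title, candidates, n=1, cutoff=cutoff)
        return hits[0] if hits else None

    out = df2.copy()
    out["SongAlt"] = out["Song"].map(closest)
    # Join on df1's spelling; rows without a close match keep NaN for Artist.
    lookup = df1.rename(columns={"Song": "SongAlt"})
    return out.merge(lookup, on="SongAlt", how="left")[["Song", "Artist", "SongAlt"]]

df1 = pd.DataFrame({"Song": ["mysong", "like a virgi"],
                    "Artist": ["myartist", "madonna"]})
df2 = pd.DataFrame({"Song": ["mysong", "like a virgin"],
                    "Rank": [1, 2]})

result = attach_alt_spelling(df1, df2)
print(result)
```

Exact matches score 1.0, so they flow through the same path and no separate exact merge is needed. If df1 contains the same Song more than once, the left join will duplicate rows, so deduplicating df1 first may be advisable.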
    

    【Comments】:
