【Question title】: Pandas match values and create new table without duplicates
【Posted】: 2020-07-24 10:45:05
【Question】:

I have the following two DataFrames:

sentences = pd.read_csv(
    'sentences and translations/SpaSentandEng2.csv', sep='\t')
print(sentences.head())

words = pd.read_csv(
    'sentences and translations/5kWords.csv', sep='\t', header=None)
print(words.head())

The output is as follows:

0                   Tengo que irme a dormir                   I have to go to sleep.
1               Simplemente no sé qué decir           I just don't know what to say.
2                 Yo estaba en las montañas                  I was in the mountains.
3                     No sé si tengo tiempo         I don't know if I have the time.
4  La educación en este mundo me decepciona  Education in this world disappoints me.
     0      1
0   de  17177
1   no  15397
2    a  14887
3   la  14653
4  que  14446

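The words DataFrame gives the frequency of each word in the "Spa" column of the sentences DataFrame.

I am trying to create a new DataFrame by matching each word with one sentence and its translation, for example: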

   spa                    eng                                 word
1  estoy de acuerdo       I agree                               de
2  no sé si tengo tiempo  I don't know if I have the time       sé
.
.
.

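The word should not be the first or last word of the sentence, and I want to avoid matching a word to a sentence that has already been matched to another word.

I can match a word to sentences with the following: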

sentences[sentences['Spa'].str.contains(
    " " + words[0][0] + " ", regex=False, case=False, na=False)]
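
The filter above can be wrapped in a small helper, shown here as a rough sketch. The helper name `interior_matches` and the three sample rows are assumptions made for illustration and are not from the real CSV file. Padding the word with a space on each side is what stops it from matching at the very start or end of a sentence.

```python
import pandas as pd

# Hypothetical sample rows standing in for the real CSV file
sentences = pd.DataFrame({
    "Spa": ["Tengo que irme a dormir", "No sé si tengo tiempo", "Que tengas un buen día"],
    "Eng": ["I have to go to sleep.", "I don't know if I have the time.", "Have a nice day."],
})

def interior_matches(df, word, col="Spa"):
    """Return rows whose sentence contains `word` with a space on both sides."""
    # regex=False treats the word literally; case=False ignores capitalisation
    mask = df[col].str.contains(" " + word + " ", regex=False, case=False, na=False)
    return df[mask]

result = interior_matches(sentences, "que")
print(result)
```

With this sample data only the first sentence is returned for "que": the third sentence starts with "Que", so there is no leading space to match.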

However, I don't know where to go from here. How should I proceed?


    Tags: python pandas


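    【Solution 1】:

    Another approach:

    1. The core idea is to use explode() so that words and sentences can be joined
    2. After splitting, use slicing to exclude the first and last word of each sentence
    3. Keep a list of the sentences already used so they can be excluded
    4. The join returns multiple records, so use iloc slicing to take only the first one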
    import re
    import pandas as pd

    trans = '''0                   Tengo que irme a dormir                   I have to go to sleep.
    1               Simplemente no sé qué decir           I just don't know what to say.
    2                 Yo estaba en las montañas                  I was in the mountains.
    3                     No sé si tengo tiempo         I don't know if I have the time.
    4  La educación en este mundo me decepciona  Education in this world disappoints me.'''
    wordst = '''0   de  17177
    1   no  15397
    2    a  14887
    3   la  14653
    4  en  14446'''
    # Rebuild the sample frames: columns are separated by runs of two or more spaces
    sentences = pd.DataFrame([re.split(r"\s{2,}", t.strip()) for t in trans.split("\n")],
                 columns=["ID","spa","eng"]).drop(columns="ID")
    words = pd.DataFrame([re.split(r"\s{2,}", t.strip()) for t in wordst.split("\n")],
                 columns=["ID","word","count"]).drop(columns="ID")

    # One row per interior word: strip off the first and last word, then explode
    sjoin = sentences.assign(word=sentences["spa"].apply(lambda s: s.split(" ")[1:-1]))\
        .explode("word")

    used = []
    df = pd.DataFrame()
    for word in words["word"].values:
        df = pd.concat([df,
                        words[words["word"]==word]  # match current word
                        .merge(sjoin[~sjoin["spa"].isin(used)])  # exclude previously matched sentences
                        .drop(columns="count").reindex(columns=["spa","eng","word"])  # cleanup
                        .iloc[0:1]])  # most importantly, take just the first sentence for this word
        used = df["spa"].values
    df
    
    

    Output:

                            spa                             eng word
     Simplemente no sé qué decir  I just don't know what to say.   no
         Tengo que irme a dormir          I have to go to sleep.    a
       Yo estaba en las montañas         I was in the mountains.   en
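
The same steps can also be sketched as a single function that compares words case-insensitively, as the `case=False` filter in the question does, and that collects the rows in a list before building the result once. This is a variation on the approach above, not the answer's own code; the function name `match_words` and the lower-casing are assumptions made for illustration.

```python
import pandas as pd

# Sample data copied from the question; lower-case column names are an assumption
sentences = pd.DataFrame({
    "spa": ["Tengo que irme a dormir", "Simplemente no sé qué decir",
            "Yo estaba en las montañas", "No sé si tengo tiempo",
            "La educación en este mundo me decepciona"],
    "eng": ["I have to go to sleep.", "I just don't know what to say.",
            "I was in the mountains.", "I don't know if I have the time.",
            "Education in this world disappoints me."],
})
words = pd.DataFrame({"word": ["de", "no", "a", "la", "en"]})

def match_words(sentences, words):
    """Pair each word with the first unused sentence that has it as an interior word."""
    # One row per (sentence, interior word); words are lower-cased for comparison
    interior = sentences.assign(
        word=sentences["spa"].str.lower().apply(lambda s: s.split()[1:-1])
    ).explode("word").dropna(subset=["word"])
    used = []   # index labels of sentences that already have a word
    rows = []
    for word in words["word"]:
        free = ~interior.index.isin(used)
        candidates = interior[(interior["word"] == word.lower()) & free]
        if not candidates.empty:
            first = candidates.iloc[0]
            used.append(first.name)
            rows.append({"spa": first["spa"], "eng": first["eng"], "word": word})
    return pd.DataFrame(rows, columns=["spa", "eng", "word"])

matched = match_words(sentences, words)
print(matched)
```

Words with no interior match ("de" and "la" in this sample) are simply skipped, and tracking the used sentences by index label rather than by text keeps two identical sentences from being treated as one.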
    
