【Posted】: 2020-11-27 22:51:15
【Question】:
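I have 2 DataFrames:
One acts as a dictionary, with the columns:
- "score"
- "translation"
- several columns holding different variants of a word
The other has a single column: "Sentence"
The goal is to:
- split each sentence into words
- look each word up in the dictionary (across the different variant columns) and return its score
- use the score of the highest-scoring word as the "sentence score"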
import pandas as pd

df_sentences = pd.DataFrame([["I run"],
                             ["he walks"],
                             ["we run and walk"]],
                            columns=['Sentence'])
df_dictionary = pd.DataFrame([[10, "I", "you", "he"],
                              [20, "running", "runs", "run"],
                              [30, "walking", "walk", "walks"]],
                             columns=['score', 'variantA', 'variantB', 'variantC'])
Out[1]:
   Sentence            Score
0  "I run"             30
1  "he walks"          40
2  "we run and walk"   "error 'and' not found"
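One possible vectorized sketch of the steps listed above, assuming exact (case-sensitive) matching and that each word appears at most once in the dictionary. The names `word_scores`, `words`, `scores`, and `result` are illustrative and not from the original post. Note that the stated goal (score of the highest-scoring word) gives 20 for "I run" and 30 for "he walks", whereas the sample output above shows 30 and 40, which match the sum of the word scores; the sketch computes both so that either reading can be used.

```python
import pandas as pd

df_sentences = pd.DataFrame(
    [["I run"], ["he walks"], ["we run and walk"]],
    columns=["Sentence"],
)
df_dictionary = pd.DataFrame(
    [[10, "I", "you", "he"],
     [20, "running", "runs", "run"],
     [30, "walking", "walk", "walks"]],
    columns=["score", "variantA", "variantB", "variantC"],
)

# Flatten all variant columns into a single word -> score lookup Series.
# Assumption: a word appears at most once in the dictionary.
word_scores = (
    df_dictionary.melt(id_vars="score", value_name="word")
    .set_index("word")["score"]
)

# One row per (sentence, word); the index still points at the sentence row.
words = df_sentences["Sentence"].str.split().explode()
scores = words.map(word_scores)  # NaN where a word is not in the dictionary

result = df_sentences.copy()
result["max_score"] = scores.groupby(level=0).max()
result["sum_score"] = scores.groupby(level=0).sum()
# Words without a dictionary entry, collected per sentence (NaN if none).
result["not_found"] = words[scores.isna().to_numpy()].groupby(level=0).agg(list)
```

Both aggregations skip missing words, so the `not_found` column is what flags a sentence such as "we run and walk" as incomplete rather than an error string in the score column.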
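I have come a long way using for loops and lists, but that is slow, so I am looking for an approach that lets me do all or most of the work inside pandas DataFrames.
This is how I do it with a for loop: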
not_found = []  # words that are missing from the dictionary
for sentence in textaslist[:1]:
    words, length = split_into_words(sentence)  # returns (list of words, number of words)
    if minsentencelength <= length <= maxsentencelength:  # filter out short and long sentences
        for word in words:
            score = LookupInDictionary.lookup(word, mydictionary)
            if score is not None:
                do_something()
            else:
                print(word, "not found in dictionary list")
                not_found.append(word)  # add word to the not-found list
print("The following words were not found in the dictionary:", not_found)
where `lookup` is defined as:
def lookup(word, df):
    if word in df.values:  # check whether the dictionary contains the word
        print(word, "was found in the dictionary")
        # scores of all rows in which any column equals the word
        lookupreturn = df.loc[(df == word).any(axis=1), 'score']
        score = lookupreturn.values[0]  # take only the first instance of the word in the dictionary
        return score
    return None  # word is not in the dictionary
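Because `lookup` scans the whole DataFrame on every call, one alternative is to flatten the dictionary once into a plain Python `dict` and look words up in that. A minimal sketch, assuming the score column is named `score` and every other column holds word variants; `build_word_index` and `word_index` are hypothetical names, not part of the original code:

```python
import pandas as pd

def build_word_index(df, score_col="score"):
    """Map every word in the variant columns to the score of its row."""
    variant_cols = [c for c in df.columns if c != score_col]
    index = {}
    # Walk the rows top to bottom so a repeated word keeps its first score,
    # mirroring the "first instance" rule of the original lookup.
    for score, *variants in df[[score_col, *variant_cols]].itertuples(index=False):
        for word in variants:
            index.setdefault(word, score)
    return index

mydictionary = pd.DataFrame(
    [[10, "I", "you", "he"],
     [20, "running", "runs", "run"],
     [30, "walking", "walk", "walks"]],
    columns=["score", "variantA", "variantB", "variantC"],
)

word_index = build_word_index(mydictionary)
run_score = word_index.get("run")  # score of the row containing "run"
and_score = word_index.get("and")  # None, because "and" is not in the dictionary
```

Inside the loop, `word_index.get(word)` could then stand in for the call to `lookup`, returning `None` for unknown words just as before.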
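The problem is that when I use the pandas `merge` function, I have to specify which column to look in via the `left_on` and `right_on` parameters, and I cannot find an efficient way to search the entire dictionary DataFrame and return the score of the first match.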
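One way around the single-key limitation of `merge` might be to reshape the dictionary to long form first, so that every variant ends up in one column that can serve as the join key. A sketch under that assumption; the column name `word` and the frames `long_dict` and `df_words` are introduced here purely for illustration:

```python
import pandas as pd

df_dictionary = pd.DataFrame(
    [[10, "I", "you", "he"],
     [20, "running", "runs", "run"],
     [30, "walking", "walk", "walks"]],
    columns=["score", "variantA", "variantB", "variantC"],
)

# Long form: one row per (word, score) instead of one column per variant.
long_dict = df_dictionary.melt(id_vars="score", value_name="word")[["word", "score"]]

# Hypothetical frame holding one word per row.
df_words = pd.DataFrame({"word": ["I", "run", "and"]})

# A left merge keeps every word; indicator=True adds a "_merge" column
# recording whether a match was found in the dictionary.
merged = df_words.merge(long_dict, on="word", how="left", indicator=True)
missing = merged.loc[merged["_merge"] == "left_only", "word"].tolist()
```

With a single `word` key there is no need to choose between the variant columns, and the `_merge` column identifies words that have no dictionary entry.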
【Comments】:
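- Please provide a small sample of your data as text that we can copy and paste, and include the corresponding desired result. See the guide on how to make good reproducible pandas examples.
- I have added some sample data, I hope it is clearer now :-)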