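【Posted】: 2019-08-27 19:28:29
【Question】:
I've got a rather odd data problem. I have a dictionary of keys and values, and I want to use it to check whether those keywords appear only at the start and/or end of a text, not in the middle of a sentence. Below I build a simple DataFrame to show the problem case and the Python code I've tried so far. How do I make it search only the start or end of the sentence? My current code matches substrings anywhere in the text.
Code: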
import pandas as pd

d = {'apple corp': 'Company', 'app': 'Application'}  # keyword -> label
l1 = [1, 2, 3, 4]
l2 = [
    "The word Apple is commonly confused with Apple Corp which is a business",
    "Apple Corp is a business they make computers",
    "Apple Corp also writes App",
    "The Apple Corp also writes App"
]
df = pd.DataFrame({'id': l1, 'text': l2})
df['text'] = df['text'].str.lower()  # lowercase so the lowercase keys can match
df
Original DataFrame:
id text
1 The word Apple is commonly confused with Apple Corp which is a business
2 Apple Corp is a business they make computers
3 Apple Corp also writes App
4 The Apple Corp also writes App
Attempted code:
def matcher(k):
    # keeps every dictionary key found anywhere in the text k
    x = (i for i in d if i in k)
    # i.startswith(k) getting error
    return ';'.join(map(d.get, x))

df['text_value'] = df['text'].map(matcher)
df
Errors:
TypeError: 'in <string>' requires string as left operand, not bool
when I use x = (i for i in d if i.startswith(k) in k)
I get empty values if I try x = (i for i in d if i.startswith(k) == True in k)
TypeError: sequence item 0: expected str instance, NoneType found
when I use x = (i.startswith(k) for i in d if i in k)
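A minimal sketch of what probably goes wrong in the three attempts above, assuming k is one lowercased row of text (the sample row is only an illustration). In every attempt the call is written as i.startswith(k), which asks whether the short key starts with the whole sentence; the check the question is after is more likely k.startswith(i).

```python
d = {'apple corp': 'Company', 'app': 'Application'}
k = "apple corp also writes app"  # one lowercased row, used for illustration

# Reversed call: does the short key start with the whole sentence? Never here.
reversed_check = [i.startswith(k) for i in d]
# Intended call: does the sentence start with the key?
intended_check = [k.startswith(i) for i in d]

# Attempt 1: `i.startswith(k) in k` evaluates `bool in str`, which raises TypeError.
try:
    False in k
    attempt1_error = None
except TypeError as exc:
    attempt1_error = str(exc)

# Attempt 2: `a == True in k` is a chained comparison, (a == True) and (True in k).
# With a == False it short-circuits, so nothing passes the filter and the join is empty.
attempt2 = ';'.join(map(d.get, (i for i in d if i.startswith(k) == True in k)))

# Attempt 3: the generator yields booleans, so d.get(False) returns None
# and ';'.join() rejects the None values.
try:
    ';'.join(map(d.get, (i.startswith(k) for i in d if i in k)))
    attempt3_error = None
except TypeError as exc:
    attempt3_error = str(exc)

print(reversed_check, intended_check, repr(attempt2))
print(attempt1_error)
print(attempt3_error)
```

Note that a plain k.startswith('app') is also true for a sentence beginning with "apple", so swapping the operands alone does not give whole-word matching.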
Result of the code above, which creates the new field 'text_value':
id text text_value
1 The word Apple is commonly confused with Apple Corp which is a business Company;Application
2 Apple Corp is a business they make computers Company;Application
3 Apple Corp also writes App Company;Application
4 The Apple Corp also writes App Company;Application
The final output I'm trying to get looks like this:
id text text_value
1 The word Apple is commonly confused with Apple Corp which is a business NaN
2 Apple Corp is a business they make computers Company
3 Apple Corp also writes App Company;Application
4 The Apple Corp also writes App Application
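One possible way to get the output above is a regular expression anchored to the start or end of the text with word boundaries. This is a sketch, not a definitive solution: the name edge_matcher is hypothetical, and it assumes every key begins and ends with a letter or digit so that \b behaves as a whole-word boundary.

```python
import math
import re

d = {'apple corp': 'Company', 'app': 'Application'}

def edge_matcher(text, mapping=d):
    """Return the labels of keys that open or close `text` as whole words."""
    labels = []
    for key, label in mapping.items():
        word = re.escape(key)
        # ^key\b matches the key at the very start, \bkey$ at the very end;
        # \b stops 'app' from matching the first letters of 'apple'.
        if re.search(rf'^{word}\b|\b{word}$', text):
            labels.append(label)
    # NaN marks rows with no match, as in the desired output
    return ';'.join(labels) if labels else math.nan

texts = [
    "the word apple is commonly confused with apple corp which is a business",
    "apple corp is a business they make computers",
    "apple corp also writes app",
    "the apple corp also writes app",
]
results = [edge_matcher(t) for t in texts]
print(results)
```

With the DataFrame from the question, the same function could be applied as df['text_value'] = df['text'].map(edge_matcher) after the text column has been lowercased.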
【Comments】:
-
Aren't your actual output and your desired output different?
-
No, they are not the same. I will add the original DataFrame to reduce the confusion.
-
Why does id 2 have "Application"? It ends with "computers", not "app".
-
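@BenoitDrogo. Yes, see the text_value in those sections. They are different. I was basically trying to show what I tried and what did not work. Good catch on Application for id 2. My typo. I fixed it.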
-
The biggest trouble here is that apple corp is two words, which means you cannot easily define the first "value".
Tags: python-3.x pandas dictionary startswith ends-with