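【Question title】: Python get first and last value from string using dictionary key values
【Posted】: 2019-08-27 19:28:29
【Question】:

I have some rather odd data. I have a dictionary of keys and values, and I want to use it to check whether those keywords appear only at the beginning and/or end of a text, not in the middle of a sentence. Below I build a simple DataFrame to show the problem case and the Python code I have tried so far. How do I make it search only the beginning or end of the sentence? My current code searches the whole text for substrings.

Code: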

import pandas as pd

d = {'apple corp':'Company','app':'Application'} #dictionary
l1 = [1, 2, 3,4]
l2 = [
    "The word Apple is commonly confused with Apple Corp which is a business",
    "Apple Corp is a business they make computers",
    "Apple Corp also writes App",
    "The Apple Corp also writes App"
]
df = pd.DataFrame({'id':l1,'text':l2})
df['text'] = df['text'].str.lower()  # lowercase so the text matches the lowercase dictionary keys
df

Original DataFrame:

id   text 
1    The word Apple is commonly confused with Apple Corp which is a business         
2    Apple Corp is a business they make computers                                    
3    Apple Corp also writes App                                                      
4    The Apple Corp also writes App                                                  

Attempted code:

def matcher(k):
    x = (i for i in d if i in k)
    # i.startswith(k) getting error
    return ';'.join(map(d.get, x))
df['text_value'] = df['text'].map(matcher)
df

Error: TypeError: 'in <string>' requires string as left operand, not bool, when I use x = (i for i in d if i.startswith(k) in k)

I get empty values if I try x = (i for i in d if i.startswith(k) == True in k)

TypeError: sequence item 0: expected str instance, NoneType found, when I use x = (i.startswith(k) for i in d if i in k)

Result of the attempted code above, which creates the new field 'text_value':

id   text                                                                            text_value
1    The word Apple is commonly confused with Apple Corp which is a business         Company;Application
2    Apple Corp is a business they make computers                                    Company;Application
3    Apple Corp also writes App                                                      Company;Application
4    The Apple Corp also writes App                                                  Company;Application

The final output I am trying to get:

id   text                                                                            text_value
1    The word Apple is commonly confused with Apple Corp which is a business         NaN
2    Apple Corp is a business they make computers                                    Company
3    Apple Corp also writes App                                                      Company;Application
4    The Apple Corp also writes App                                                  Application

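【Comments】:

  • Aren't your actual output and your desired output different?
  • No, they are not the same. I will add the original DataFrame to reduce the confusion.
  • Why does id 2 have "Application"? It ends with "computers", not "App".
  • @BenoitDrogo. Yes, look at text_value in those sections; they are different. I was basically trying to show what I tried and what did not work. Good catch on "Application" for id 2. That was my typo. I have fixed it.
  • The biggest trouble here is that apple corp is two words, which means you cannot easily define the first "value".

Tags: python-3.x pandas dictionary startswith ends-with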


【Solution 1】:

You need a matcher function that accepts a flag, and then call it twice to get the results for startswith and endswith.

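def matcher(s, flag="start"):
    # Return the label of the first dictionary key found at the chosen edge of s.
    if flag == "start":
        for i in d:
            if s.startswith(i):
                return d[i]
    else:
        for i in d:
            if s.endswith(i):
                return d[i]
    return None

df['st'] = df['text'].apply(matcher)              # label at the start, or None
df['ed'] = df['text'].apply(matcher, flag="end")  # label at the end, or None
# Join the non-missing labels row by row.
df['text_value'] = df[['st', 'ed']].apply(lambda x: ';'.join(x.dropna()), axis=1)
df = df[['id', 'text', 'text_value']]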

The text_value column looks like this:

0                       
1                Company
2    Company;Application
3            Application
Name: text_value, dtype: object
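
One caveat with the loop above: it returns the first key that matches, so the result depends on dictionary order, and startswith('app') is also true for text that begins with "apple corp" or "apply". Below is a minimal sketch of one way around this, assuming it is acceptable to try longer keys first and to require a non-alphanumeric character next to the match. The names keys_by_length, edge_label and edge_matcher are hypothetical and not part of the answer.

```python
# Keys deliberately listed in the "unlucky" order to show that order no longer matters.
d = {'app': 'Application', 'apple corp': 'Company'}

# Try longer keys first so 'apple corp' wins over its prefix 'app'.
keys_by_length = sorted(d, key=len, reverse=True)

def edge_label(s, at_start=True):
    """Return the label of the longest key at the start (or end) of s, else None."""
    s = s.lower()
    for key in keys_by_length:
        if at_start and s.startswith(key):
            rest = s[len(key):]
            # Require a word boundary so 'app' does not match 'apply'.
            if not rest or not rest[0].isalnum():
                return d[key]
        elif not at_start and s.endswith(key):
            rest = s[:-len(key)]
            if not rest or not rest[-1].isalnum():
                return d[key]
    return None

def edge_matcher(s):
    """Join the start and end labels with ';', or return None if neither is found."""
    labels = [edge_label(s, True), edge_label(s, False)]
    return ';'.join(x for x in labels if x) or None
```

With the DataFrame from the question, this could be applied as df['text_value'] = df['text'].map(edge_matcher); rows with no match then hold a missing value instead of an empty string.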

【Comments】:

【Solution 2】:
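joined = "|".join(d.keys())

pat = '(?i)^(?:the\\s*)?(' + joined + ')\\b.*?|.*\\b(' + joined + ')$'+'|.*'

# group(1): key at the start; group(2): key at the end; any other match maps to "".
get = lambda x: d.get(x.group(1),"") + (';' +d.get(x.group(2),"") if x.group(2) else '')

# A callable replacement needs regex=True (no longer the default in pandas 2.0+).
df.text.str.replace(pat, get, regex=True)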
    
    
0                       
1                Company
2    Company;Application
3    Company;Application
Name: text, dtype: object
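
The single pattern above packs the start match, the end match and a catch-all into one alternation, which can be hard to read. As a rough alternative sketch using only the standard re module, the two edges can be checked with separate compiled patterns. This assumes, like the answer, that an optional leading "the" should be skipped. The names alternation, start_re, end_re and regex_matcher are hypothetical.

```python
import re

d = {'apple corp': 'Company', 'app': 'Application'}

# Escape the keys and put longer ones first so 'apple corp' is preferred over 'app'.
alternation = '|'.join(re.escape(k) for k in sorted(d, key=len, reverse=True))

# A key at the very start, optionally preceded by "the", followed by a word boundary.
start_re = re.compile(r'^(?:the\s*)?(' + alternation + r')\b', re.IGNORECASE)
# A key at the very end, preceded by a word boundary.
end_re = re.compile(r'\b(' + alternation + r')$', re.IGNORECASE)

def regex_matcher(s):
    """Return the start and end labels joined by ';' (empty string if none)."""
    labels = []
    for m in (start_re.search(s), end_re.search(s)):
        if m:
            # Lowercase the matched text so it can be looked up in the dictionary.
            labels.append(d[m.group(1).lower()])
    return ';'.join(labels)
```

Because the lookup lowercases the matched text, this version does not depend on the text column having been lowercased first. It could be applied with df['text'].map(regex_matcher).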
    

【Comments】: