【Question Title】:String manipulation in Python using a list
【Posted】:2018-04-04 05:23:35
【Question】:

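I have some tweets that contain shorthand such as ur, bcz, etc., and I am using a dictionary to map each one to the correct word. I know that strings are immutable in Python, so after replacing a shorthand word I store the corrected copy in a new list. This works, but I run into a problem when a tweet contains more than one shorthand word.

My code replaces only one word at a time. How can I replace several words in a single string? Here is my code: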

# some sample tweets
tweet = ['stats is gr8', 'india is grt bcz it is colourfull', 'i like you','your movie is grt', 'i hate ur book of hatred' ]

short_text={
    "bcz" : "because",
    "ur" : "your",
    "grt" : "great",
    "gr8" : "great",
    "u" : "you"
        }

import re

def find_word(text,search):
    result = re.findall('\\b'+search+'\\b',text,flags=re.IGNORECASE)
    if len(result) > 0:
        return True
    else:
        return False


corrected_tweets=list()
for i in tweet:
    tweettoken=i.split()
    for short_word in short_text:
        print("current iteration")
        for tok in tweettoken:
            if(find_word(tok,short_word)):
                print(tok)
                print(i)
                newi = i.replace(tok,short_text[short_word])
                corrected_tweets.append(newi)       
            print(newi)

My output is:

['stats is great',
 'india is grt because it is colourfull',
 'india is great bcz it is colourfull',
 'your movie is great',
 'i hate your book of hatred']

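What I need is for tweets 2 and 3 in the output to be appended as a single entry with all corrections applied. I am new to Python, so any help would be great.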

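【Comments】:

  • Do you need one entry per corrected version? A tweet like 'india is grt bcz it is colourfull' is appended twice because it contains two shorthand words.
  • I need a single entry with all spellings corrected, so there should be only one entry with every correction applied at once.
  • Inside your if condition, whenever find_word() returns True you append to the corrected_tweets list. The append should happen only after all short_word corrections have been made.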
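
The last comment can be sketched as follows. This is a minimal illustration, not the asker's final code: it keeps a working copy of each tweet (the hypothetical name `fixed`), applies every dictionary entry to that copy, and appends only once per tweet. The use of `re.sub` with `re.escape` is an assumption made for the sketch, and the tweet list is shortened.

```python
import re

tweets = ['stats is gr8', 'india is grt bcz it is colourfull', 'i hate ur book of hatred']
short_text = {"bcz": "because", "ur": "your", "grt": "great", "gr8": "great", "u": "you"}

corrected_tweets = []
for original in tweets:
    fixed = original  # working copy that accumulates every correction
    for short_word, full_word in short_text.items():
        # re.escape guards against keys that contain regex metacharacters
        pattern = r"\b" + re.escape(short_word) + r"\b"
        fixed = re.sub(pattern, full_word, fixed, flags=re.IGNORECASE)
    corrected_tweets.append(fixed)  # append once, after all replacements

print(corrected_tweets)
```

Because the replacements run one after another, a replacement word that is itself a dictionary key would be expanded again by a later pass; that does not happen with this particular dictionary.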

Tags: python string list string-matching


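【Solution 1】:

Use a regular-expression substitution on word boundaries and look each matched word up in the dictionary, defaulting to the original word so that a word with no entry is returned unchanged: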

tweet = ['stats is gr8', 'india is grt bcz it is colourfull', 'i like you','your movie is grt', 'i hate ur book of hatred' ]

short_text={
    "bcz" : "because",
    "ur" : "your",
    "grt" : "great",
    "gr8" : "great",
    "u" : "you"
        }

import re

changed = [re.sub(r"\b(\w+)\b",lambda m:short_text.get(m.group(1),m.group(1)),x) for x in tweet]

Result:

['stats is great', 'india is great because it is colourfull', 'i like you', 'your movie is great', 'i hate your book of hatred']

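This approach is fast because each word needs only one dictionary lookup, which is O(1) on average and does not depend on the size of the dictionary.

The advantage of re with word boundaries over str.split is that it also works when words are separated by punctuation.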
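
To make the punctuation point concrete, here is a small comparison. The sample sentence and the helper name `expand` are made up for illustration; the two expressions otherwise follow the regex and split-based approaches in this thread.

```python
import re

short_text = {"bcz": "because", "ur": "your", "grt": "great", "gr8": "great", "u": "you"}
sample = "ur movie is grt, bcz u said so!"

def expand(match):
    # Return the dictionary entry for the matched word, or the word itself
    word = match.group(0)
    return short_text.get(word, word)

with_regex = re.sub(r"\b\w+\b", expand, sample)
with_split = " ".join(short_text.get(w, w) for w in sample.split())

print(with_regex)  # your movie is great, because you said so!
print(with_split)  # your movie is grt, because you said so!
```

With `str.split` the token is `'grt,'` (comma included), so the dictionary lookup misses it, whereas the regex matches `grt` on its own and leaves the comma in place.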


【Solution 2】:

You can use a list comprehension for this:

[' '.join(short_text.get(s, s) for s in new_str.split()) for new_str in tweet]


Result:

In [1]: tweet = ['stats is gr8', 'india is grt bcz it is colourfull', 'i like you','your movie is grt', 'i hate ur book of hatred' ]
   ...:

In [2]: short_text={
   ...:     "bcz" : "because",
   ...:     "ur" : "your",
   ...:     "grt" : "great",
   ...:     "gr8" : "great",
   ...:     "u" : "you"
   ...:         }

In [4]: [' '.join(short_text.get(s, s) for s in new_str.split()) for new_str in tweet]
Out[4]:
['stats is great',
 'india is great because it is colourfull',
 'i like you',
 'your movie is great',
 'i hate your book of hatred']
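
The question's code matched with re.IGNORECASE, while a plain dictionary lookup is case-sensitive. One possible adjustment, offered only as a sketch, is to look up the lower-cased token. The function name `expand_tweet` and the mixed-case sample tweets are hypothetical.

```python
short_text = {"bcz": "because", "ur": "your", "grt": "great", "gr8": "great", "u": "you"}

def expand_tweet(text):
    # Look up the lower-cased token; fall back to the original token unchanged
    return " ".join(short_text.get(tok.lower(), tok) for tok in text.split())

tweets = ['Stats is GR8', 'I hate Ur book']
expanded = [expand_tweet(t) for t in tweets]
print(expanded)  # ['Stats is great', 'I hate your book']
```

The replacement does not keep the capitalisation of the shorthand, and `split()` followed by `' '.join` collapses runs of whitespace into single spaces.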
    


【Solution 3】:

You can try this approach:

tweet = ['stats is gr8', 'india is grt bcz it is colourfull', 'i like you','your movie is grt', 'i hate ur book of hatred' ]

short_text={
    "bcz" : "because",
    "ur" : "your",
    "grt" : "great",
    "gr8" : "great",
    "u" : "you"
        }

for j,i in enumerate(tweet):
    data=i.split()
    # replace every token that has an entry in the dictionary
    for index_np,value in enumerate(data):
        if value in short_text:
            data[index_np]=short_text[value]

    # write the corrected tweet back into the original list
    tweet[j]=" ".join(data)

print(tweet)


Output:

['stats is great', 'india is great because it is colourfull', 'i like you', 'your movie is great', 'i hate your book of hatred']
      

