【Question title】: Stripping punctuation from text in Python [duplicate]
【Posted】: 2017-03-02 03:49:48
【Question】:

I am trying to extract the tokens (words) from a text file and strip all punctuation from them. This is what I have tried:

import re 

with open('hw.txt') as f:
    lines_after_254 = f.readlines()[254:]
    sent = [word for line in lines_after_254 for word in line.lower().split()]
    words = re.sub('[!#?,.:";]', '', sent)

I get the following error:

return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer

【Question comments】:

    Tags: python string mapreduce nlp special-characters


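    【Answer 1】:

    There are a couple of problems in your script. After the list comprehension, sent is a list of tokens, but re.sub expects a single string, so passing the list raises the TypeError. You are also trying to remove the special characters only after the text has already been split.

    A better approach is to read the input as one string, remove the special characters, and then tokenize the result.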

    import re

    # open the input text file and read its contents
    with open('hw.txt') as f:
        text = f.read()
    print(text)

    # remove the special characters from the text
    no_specials_string = re.sub('[!#?,.:";]', '', text)
    print(no_specials_string)

    # split the text and store the words in a list
    words = no_specials_string.split()
    print(words)
    

    Alternatively, if you want to split the text into tokens first and then remove the special characters, you can do this:

    import re

    # open the input text file and read its contents
    with open('hw.txt') as f:
        text = f.read()
    print(text)

    # split the text and store the words in a list
    words = text.split()
    print(words)

    # remove special characters from each word in words
    new_words = [re.sub('[!#?,.:";]', '', word) for word in words]
    print(new_words)
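
The two orderings described above do not always give the same list. The following is a minimal sketch, using a made-up sample string and the same character class as the answer, that shows where they differ: a token made up only of punctuation disappears when you strip first, but survives as an empty string when you split first. The names `PUNCT`, `strip_then_split`, and `split_then_strip` are illustrative and not part of the original answer.

```python
import re

PUNCT = re.compile(r'[!#?,.:";]')  # character class from the answer above

text = 'wait ... what ? "ok".'  # made-up sample input

# Approach 1: strip punctuation from the whole string, then split.
strip_then_split = PUNCT.sub('', text).split()

# Approach 2: split first, then strip punctuation from each token.
split_then_strip = [PUNCT.sub('', token) for token in text.split()]

# Optional clean-up for approach 2: discard tokens that became empty.
split_then_strip_nonempty = [token for token in split_then_strip if token]

print(strip_then_split)
print(split_then_strip)
print(split_then_strip_nonempty)
```

If empty tokens matter for later counting or indexing, filtering them out as in the last step is probably the simplest remedy.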
    

    【Comments】:

      【Answer 2】:

      re.sub operates on a string, not on a list!

      print(re.sub(pattern, '', sent))
      

      should be

      print([re.sub(pattern, '', s) for s in sent])
      

      Hope this helps!
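
As a rough illustration of the point above, the sketch below reproduces the TypeError by passing a list to re.sub and then applies the substitution per token. The sample sentence and the helper name `strip_punct_from_tokens` are assumptions made for the example, as is the choice to drop tokens that end up empty.

```python
import re

PUNCT_PATTERN = re.compile(r'[!#?,.:";]')  # same character class as in the question

def strip_punct_from_tokens(tokens):
    """Apply the substitution to each token; drop tokens that end up empty."""
    cleaned = (PUNCT_PATTERN.sub('', tok) for tok in tokens)
    return [tok for tok in cleaned if tok]

# Made-up sample standing in for the tokens read from the file.
sent = 'Hello, world! "Quoted": text; ok?'.lower().split()

error_message = None
try:
    re.sub(PUNCT_PATTERN, '', sent)  # passing the whole list, as in the question
except TypeError as exc:
    error_message = str(exc)  # exact wording depends on the Python version

words = strip_punct_from_tokens(sent)
print(error_message)
print(words)
```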

      【Comments】:

        【Answer 3】:

        Use the remove_puncts() function below (Python 3):

        import string
        translator = str.maketrans('', '', string.punctuation)
        def remove_puncts(input_string):
            return input_string.translate(translator)
        

        Example usage

        input_string = """"YH&W^(*D)#IU*DEO)#brhtr<><}{|_}vrthyb,.,''fehsvhrr;[vrht":"]`~!@#$%svbrxs"""
        remove_puncts(input_string)
        'YHWDIUDEObrhtrvrthybfehsvhrrvrhtsvbrxs'
        

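        Edit

        Speed comparison

        The timings below show the translator approach running faster than regex substitution. Note that the regex pattern covers only a subset of string.punctuation, so the two functions are not strictly equivalent.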

        import re, string
        from time import time  # time() is called directly below
        
        pattern = '[!#?,.:";]'
        def regex_sub(input_string):
            return re.sub(pattern, '', input_string)
        
        translator = str.maketrans('', '', string.punctuation)
        def string_translator(input_string):
            return input_string.translate(translator)
        
        input_string = """cwsx#?;.frvcdr"""
        string_translator(input_string)
        regex_sub(input_string)
        
        passes = 1000000
        t1 = time()
        for i in range(passes):
            a = string_translator(input_string)
        
        t2 = time()
        for i in range(passes):
            a = regex_sub(input_string)
        
        t3 = time()
        
        string_translator_time = t2 - t1
        regex_sub_time = t3 - t2
        
        print(string_translator_time) # 1.341651439666748
        print(regex_sub_time) # 3.44773268699646
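
For a comparison on equal footing, one option is to have both functions remove the same character set and to time them with the standard timeit module. The sketch below is only an assumption about how such a measurement might be set up: the pattern built from string.punctuation, the precompiled regex, and the reduced number of passes are choices made for this example, and no timings are claimed here because they vary by machine and Python version.

```python
import re
import string
import timeit

# Assumption: both approaches remove the SAME characters (all ASCII punctuation).
PUNCT_RE = re.compile('[' + re.escape(string.punctuation) + ']')
TRANSLATOR = str.maketrans('', '', string.punctuation)

def regex_sub(s):
    return PUNCT_RE.sub('', s)

def string_translator(s):
    return s.translate(TRANSLATOR)

sample = 'cwsx#?;.frvcdr'
# Sanity check: both functions must agree before their speed is compared.
assert regex_sub(sample) == string_translator(sample) == 'cwsxfrvcdr'

passes = 100_000  # hypothetical value, lower than in the answer to keep the run short
translate_time = timeit.timeit(lambda: string_translator(sample), number=passes)
regex_time = timeit.timeit(lambda: regex_sub(sample), number=passes)

print(f'translate: {translate_time:.3f}s')
print(f'regex:     {regex_time:.3f}s')
```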
        

        【Comments】:

          【Answer 4】:

          Nothing is being read into your list:

          In [14]: with open('data', 'r') as f:
              ...:     l=f.readlines()[254:]
              ...:     
          
          In [15]: l
          Out[15]: []
          

          Assuming you want a list of words, try this:

          import re

          with open('data', 'r') as f:
              lines = [line.strip() for line in f]

          # first 254 lines here; use lines[254:] for the lines after 254, as in the question
          sent = [w for line in lines[:254] for w in re.split(r'\s+', line)]

          find = '[!#?,.:";]'
          replace = ''

          words = [re.sub(find, replace, word) for word in sent]
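
If the goal is the question's original one (the words that come after line 254), a possible variant is to skip the leading lines with itertools.islice instead of reading everything and slicing. This is a sketch under assumptions: the helper name `words_after_line` is hypothetical, the temporary file merely stands in for the real input, and lowercasing and dropping empty tokens are choices made for the example.

```python
import os
import re
import tempfile
from itertools import islice

PUNCT = re.compile(r'[!#?,.:";]')  # same characters as in the answer above

def words_after_line(path, skip):
    """Return lowercase, punctuation-free words from `path`, ignoring the first `skip` lines."""
    words = []
    with open(path) as f:
        for line in islice(f, skip, None):  # streams the file; no readlines() needed
            for token in line.lower().split():
                cleaned = PUNCT.sub('', token)
                if cleaned:  # skip tokens that were pure punctuation
                    words.append(cleaned)
    return words

# Demo with a small temporary file standing in for the real input.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('Title: skip me\nHello, world!\n\n"Second" line; done.\n')
    demo_path = tmp.name

demo_words = words_after_line(demo_path, skip=1)
os.remove(demo_path)
print(demo_words)
```

For the question's file this would be called as words_after_line('hw.txt', 254); if the file has 254 lines or fewer, the result is simply an empty list.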
          

          Corrected the re.sub call, as pointed out by @Keerthana Prabhakaran.

          【Comments】:

          • This still raises the error!
          • The error is return _compile(pattern, flags).sub(repl, string, count); here sent is a list!!