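Converting a String to a List of Words? Answers

【Question title】: Converting a String to a List of Words?
【Posted】: 2011-09-05 02:56:39
【Question description】:

I am trying to convert a string to a list of words using Python. I want to take something like the following: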

string = 'This is a string, with words!'

and convert it to something like this:

list = ['This', 'is', 'a', 'string', 'with', 'words']

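Notice the omission of punctuation and spaces. What would be the fastest way of going about this?

【Question discussion】:

Tags: python string list words text-segmentation

【Solution 1】:

Try this: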

import re

mystr = 'This is a string, with words!'
wordList = re.sub(r"[^\w]", " ", mystr).split()  # raw string avoids an invalid-escape warning

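How it works:

From the docs:

re.sub(pattern, repl, string, count=0, flags=0)

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn't found, string is returned unchanged. repl can be a string or a function.

In our case:

pattern is any non-alphanumeric character.

[\w] means any alphanumeric character. For ASCII text it is equivalent to the character set [a-zA-Z0-9_]:

a to z, A to Z, 0 to 9, and underscore. (With Python 3 str patterns, \w also matches non-ASCII letters and digits unless the re.ASCII flag is set.)

So we match any non-alphanumeric character and replace it with a space.

Then we call split(), which splits the string on whitespace and converts it to a list.

So "hello-world" becomes "hello world" with re.sub, and then ['hello', 'world'] after split().

Let me know if you have any doubts.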
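The two steps described above (replace non-word characters with spaces, then split) can be wrapped in a small helper. This is only a sketch: the name to_words and the ascii_only parameter are made up for illustration, and the second and third calls show how \w behaves with and without re.ASCII.

```python
import re

def to_words(text, ascii_only=False):
    """Replace every non-word character with a space, then split on whitespace."""
    flags = re.ASCII if ascii_only else 0
    spaced = re.sub(r"[^\w]", " ", text, flags=flags)
    return spaced.split()

print(to_words("This is a string, with words!"))
# ['This', 'is', 'a', 'string', 'with', 'words']
print(to_words("caf\u00e9 au lait"))                   # the accented letter counts as a word character
print(to_words("caf\u00e9 au lait", ascii_only=True))  # with re.ASCII it is treated as a separator
```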

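【Discussion】:

Remember to handle apostrophes and hyphens as well, since they are not included in \w.
You may also want to handle typographic (curly) apostrophes and non-breaking hyphens.
string.split() is easier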
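The apostrophe and hyphen concern from the comments above could be handled along these lines. This is a rough sketch, not part of the original answer: the NORMALIZE table and the function name are illustrative, and the table only covers three typographic characters.

```python
import re

# Map a few typographic variants to their plain ASCII counterparts.
NORMALIZE = str.maketrans({
    "\u2019": "'",   # right single quotation mark (curly apostrophe)
    "\u2018": "'",   # left single quotation mark
    "\u2011": "-",   # non-breaking hyphen
})

def words_keeping_joiners(text):
    """Return runs of word characters, apostrophes, and hyphens."""
    return re.findall(r"[\w'-]+", text.translate(NORMALIZE))

print(words_keeping_joiners("I\u2019m a custom\u2011built sentence."))
# ["I'm", 'a', 'custom-built', 'sentence']
```

One limitation of this simple character class: a hyphen or apostrophe standing alone between spaces is also returned as a "word".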

【Solution 2】:

Given how late this reply is, I think this is the simplest way for anyone else who stumbles on this post:

>>> string = 'This is a string, with words!'
>>> string.split()
['This', 'is', 'a', 'string,', 'with', 'words!']

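【Discussion】:

You need to separate and remove the punctuation from the words (e.g., "string," and "words!"). As it stands, this does not meet the OP's requirements.

【Solution 3】:

Doing this properly is quite complex. For your research, it is known as word tokenization. Rather than starting from scratch, you should look at NLTK if you want to see what others have done: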

>>> import nltk
>>> paragraph = u"Hi, this is my first sentence. And this is my second."
>>> sentences = nltk.sent_tokenize(paragraph)
>>> for sentence in sentences:
...     nltk.word_tokenize(sentence)
[u'Hi', u',', u'this', u'is', u'my', u'first', u'sentence', u'.']
[u'And', u'this', u'is', u'my', u'second', u'.']

【Discussion】:

【Solution 4】:

The simplest way:

>>> import re
>>> string = 'This is a string, with words!'
>>> re.findall(r'\w+', string)
['This', 'is', 'a', 'string', 'with', 'words']

【Discussion】:

【Solution 5】:

For completeness, using string.punctuation:

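import re
import string

s = 'This is a string, with words!'
# re.escape keeps characters such as ] and \ literal inside the character class
x = re.sub('[' + re.escape(string.punctuation) + ']', '', s).split()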

This handles newlines as well.
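A quick check of the newline claim, as a sketch with a made-up function name. Newlines are handled because split() with no arguments splits on any run of whitespace. The second call shows a side effect worth knowing about: deleting punctuation outright, rather than replacing it with a space, glues hyphenated words together.

```python
import re
import string

def strip_punctuation_and_split(text):
    # Delete every ASCII punctuation character, then split on any whitespace.
    pattern = "[" + re.escape(string.punctuation) + "]"
    return re.sub(pattern, "", text).split()

print(strip_punctuation_and_split("First line,\nsecond line!\tThird."))
# ['First', 'line', 'second', 'line', 'Third']
print(strip_punctuation_and_split("a custom-built word"))
# ['a', 'custombuilt', 'word']
```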

【Discussion】:

【Solution 6】:

Well, you could use

import re
list = re.sub(r'[.!,;?]', ' ', string).split()

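Note that list is the name of a built-in type and string is the name of a standard-library module, so you probably don't want to use them as variable names.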
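To illustrate why shadowing matters, here is a small hypothetical demo (the function name is made up). The shadowing is kept inside a function so it does not affect the rest of a session.

```python
def shadowing_demo():
    list = ['This', 'is']        # this local name now hides the built-in list type
    try:
        return list('abc')       # fails: the name refers to a list object, not the type
    except TypeError as exc:
        return str(exc)

print(shadowing_demo())  # prints the TypeError message
```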

【Discussion】:

【Solution 7】:

Inspired by @mtrw's answer, but improved to strip punctuation at word boundaries only:

import re
import string

def extract_words(s):
    # Strip punctuation only at the start or end of each whitespace-separated token.
    punct = re.escape(string.punctuation)
    return [re.sub('^[{0}]+|[{0}]+$'.format(punct), '', w) for w in s.split()]

>>> text = 'This is a string, with words!'
>>> extract_words(text)
['This', 'is', 'a', 'string', 'with', 'words']

>>> text = '''I'm a custom-built sentence with "tricky" words like https://stackoverflow.com/.'''
>>> extract_words(text)
["I'm", 'a', 'custom-built', 'sentence', 'with', 'tricky', 'words', 'like', 'https://stackoverflow.com']
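The same boundary-only stripping can probably be done without a regex, since str.strip accepts a set of characters to trim from both ends. This is a sketch with an illustrative function name. Unlike the version above, it also drops tokens that consist only of punctuation.

```python
import string

def extract_words_strip(text):
    """Split on whitespace, then trim punctuation from both ends of each token."""
    stripped = (token.strip(string.punctuation) for token in text.split())
    return [word for word in stripped if word]   # drop tokens that were only punctuation

print(extract_words_strip('I\'m a custom-built sentence - with "tricky" words.'))
# ["I'm", 'a', 'custom-built', 'sentence', 'with', 'tricky', 'words']
```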

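【Discussion】:

【Solution 8】:

A regular expression for words would give you the most control. You would want to think carefully about how to handle words with dashes or apostrophes, such as "I'm".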
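One possible word pattern, offered as a sketch rather than a definitive rule: a run of word characters, optionally followed by more runs joined by a single apostrophe or hyphen. The names WORD and find_words are illustrative.

```python
import re

# A word is \w+ optionally extended by 'xxx or -xxx segments, e.g. "isn't", "custom-built".
WORD = re.compile(r"\w+(?:['-]\w+)*")

def find_words(text):
    return WORD.findall(text)

print(find_words("I'm a custom-built sentence - isn't it?"))
# ["I'm", 'a', 'custom-built', 'sentence', "isn't", 'it']
```

Because an apostrophe or hyphen only counts when it sits between word characters, a standalone dash and trailing quotes are dropped. Whether that matches your definition of a word is a design choice.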

【Discussion】:

【Solution 9】:

Personally, I think this is slightly cleaner than the answers provided so far

import re

def split_to_words(sentence):
    return [w for w in re.split(r'\W+', sentence) if w]  # Use sentence.lower(), if needed

【Discussion】:

【Solution 10】:

# Equivalent to mystr.split(" "); punctuation stays attached to the words.
word_list = mystr.split(" ", mystr.count(" "))

【Discussion】:

【Solution 11】:

This removes every character that is not a letter of the alphabet:

def wordsToList(strn):
    L = strn.split()
    cleanL = []
    abc = 'abcdefghijklmnopqrstuvwxyz'
    ABC = abc.upper()
    letters = abc + ABC
    for e in L:
        word = ''
        for c in e:
            if c in letters:
                word += c
        if word != '':
            cleanL.append(word)
    return cleanL

s = 'She loves you, yea yea yea! '
L = wordsToList(s)
print(L)  # ['She', 'loves', 'you', 'yea', 'yea', 'yea']

I am not sure whether this is fast, optimal, or even the right way to program it.
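The speed question can be checked rather than guessed at, using the standard timeit module. The sketch below compares three approaches that appear in this thread. The function names and test text are made up, and the timings depend on the machine and Python version, so no results are claimed here.

```python
import re
import timeit

TEXT = "This is a string, with words! " * 1000

def with_findall(text):
    return re.findall(r"\w+", text)

def with_sub_split(text):
    return re.sub(r"\W", " ", text).split()

def with_generator(text):
    # Regex-free variant: keep alphanumerics and underscores, blank out the rest.
    return "".join(c if c.isalnum() or c == "_" else " " for c in text).split()

for func in (with_findall, with_sub_split, with_generator):
    seconds = timeit.timeit(lambda: func(TEXT), number=20)
    print(f"{func.__name__}: {seconds:.4f} s for 20 runs")
```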

【Discussion】:

【Solution 12】:

def split_string(string):
    return string.split()

This function returns the list of words in the given string. In this case, if we call the function as follows,

string = 'This is a string, with words!'
split_string(string)

the function returns

['This', 'is', 'a', 'string,', 'with', 'words!']

【Discussion】:

【Solution 13】:

This is from my attempt at a coding challenge where regular expressions were not allowed:

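inputStr = "This is a string, with words!"
# split() without arguments avoids the empty strings that split(' ') leaves between adjacent spaces
outputList = "".join((c if c.isalnum() or c == "'" else ' ') for c in inputStr).split()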

The handling of the apostrophe seems interesting.

【Discussion】:

【Solution 14】:

Probably not very elegant, but at least you know what is going on.

my_str = "Simple sample, test! is, olny".lower()
my_lst = []
temp = ""
for ch in my_str:
    if ch in [',', '.', '!', '(', ')', ':', ';', '-']:
        # skip punctuation
        continue
    if ch == ' ':
        # keep only words longer than 3 characters
        if len(temp) > 3:
            my_lst.append(temp)
        # reset the buffer either way, so a short word does not leak into the next one
        temp = ""
    else:
        temp += ch
# flush the last word, applying the same length filter
if len(temp) > 3:
    my_lst.append(temp)
print(my_lst)

【Discussion】:

What is the point of this solution when more optimal solutions exist?

【Solution 15】:

You can try doing it like this:

text = "This is a string, with words!"
tryTrans = str.maketrans(",!", "  ")  # Python 3; on Python 2 this was string.maketrans
text = text.translate(tryTrans)
listOfWords = text.split()
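The translation-table idea could be extended from two characters to all ASCII punctuation. This is a sketch with illustrative names, assuming Python 3, where str.maketrans takes two equal-length strings.

```python
import string

# Map every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def words_via_translate(text):
    return text.translate(PUNCT_TO_SPACE).split()

print(words_via_translate("This is a string, with words!"))
# ['This', 'is', 'a', 'string', 'with', 'words']
```

Note that this also splits contractions and hyphenated words, so "I'm" becomes 'I' and 'm'.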

【Discussion】: