python split a text file function: answers

【Question title】: python split a text file function
【Posted】: 2015-02-04 00:38:13
【Question description】:

I wrote a tokenize function that reads a string and splits it into a list of words.

My code:

import re

def tokenize(document):
    x = document.lower()
    return re.findall(r'\w+', x)

My output:

tokenize("Hi there. What's going on? first-class")
['hi', 'there', 'what', 's', 'going', 'on', 'first', 'class']

Desired output:

['hi', 'there', "what's", 'going', 'on', 'first-class']

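Basically, I want words that contain apostrophes or hyphens to stay in the list as single words, along with the double quotes. How can I change my function to get the desired output?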

【Question discussion】:

Can you split on whitespace?

Tags: python regex list function split

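【Solution 1】:

\w+ matches one or more word characters (letters, digits, and the underscore); that does not include apostrophes or hyphens.

You need to use a character set here to tell Python exactly what you want to match: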
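A quick check makes this visible. The snippet below is only an illustration (the sample string is made up for this purpose); it shows `\w+` treating the apostrophe and the hyphen as separators:

```python
import re

text = "What's first-class"

# \w covers letters, digits and "_" only, so "'" and "-" end the current match.
tokens = re.findall(r"\w+", text)
print(tokens)  # ['What', 's', 'first', 'class']
```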

>>> import re
>>> def tokenize(document):
...     return re.findall("[A-Za-z'-]+", document)
...
>>> tokenize("Hi there. What's going on? first-class")
['Hi', 'there', "What's", 'going', 'on', 'first-class']
>>>

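You will also notice that I removed the x = document.lower() line. It is no longer needed for matching, because adding A-Z to the character set lets the pattern match uppercase characters directly. Note that the tokens then keep their original case; if you want lowercase tokens, as in the desired output, you still have to lowercase the string yourself.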
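If lowercase tokens are wanted, as in the question's desired output, one possible variant is to keep the lower() call and shrink the character set to a-z. This is a minimal sketch rather than part of the original answer, and the name tokenize_lower is made up for illustration:

```python
import re

def tokenize_lower(document):
    # Lowercase first, so the character set only needs a-z.
    # The hyphen comes last in the set so it is read literally, not as a range.
    return re.findall(r"[a-z'-]+", document.lower())

print(tokenize_lower("Hi there. What's going on? first-class"))
# ['hi', 'there', "what's", 'going', 'on', 'first-class']
```

One caveat of this character set: a hyphen or apostrophe standing on its own (for example in "a - b") is also returned as a token, so it may need filtering depending on the input.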

【Discussion】: