【Question title】: How to separate comma from word (tokenization)
【Posted】: 2021-01-28 16:35:22
【Question】:

I'm having some trouble with tokenization. The task is to split a sentence into words.

This is what I have so far.

def tokenize(s):

    d = []
    start = 0
    
    while start < len(s):
        while start < len(s) and s[start].isspace():
            start = start+1

        end = start
        while end < len(s) and not s[end].isspace():
            end = end+1

        d = d + [s[start:end]]
        start = end
            
    print(d)

Running the program:

>>> tokenize("He was walking, it was fun")
['He', 'was', 'walking,', 'it', 'was', 'fun']

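This works well, but as you can see, my program includes the comma in the word walking. I want to separate commas (and other "symbols") into "words" of their own.

Like this: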

['He', 'was', 'walking', ',', 'it', 'was', 'fun']

How can I modify my code to solve this?

Thanks in advance!

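【Comments】:

  • If your goal is to split a sentence into words, with every non-letter character becoming a word of its own, then the output for a sentence like "From 1969-2009, David Peters-Foster woke up between 9:15AM and 10:15AM to go for a jog around the cul-de-sac with his neighbor's husband." is going to be a mess.
  • You can use a regular expression in these cases, e.g. import re and then print(re.findall(r'[^\W_]+|[^\w\s]|_', text))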
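
The regular expression from the last comment can be tried out as below. This is only a sketch: the function name tokenize_regex is made up for illustration, while re.findall and the pattern itself are taken from the comment. The pattern matches either a run of letters and digits, a single punctuation character, or a single underscore.

```python
import re

def tokenize_regex(text):
    # [^\W_]+  : one or more word characters, excluding the underscore
    # [^\w\s]  : exactly one character that is neither word char nor whitespace
    # _        : a lone underscore, treated as a symbol
    return re.findall(r'[^\W_]+|[^\w\s]|_', text)

print(tokenize_regex("He was walking, it was fun"))
# ['He', 'was', 'walking', ',', 'it', 'was', 'fun']
```

Note that, as the first comment warns, this also breaks up hyphenated words and times such as 9:15AM.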

Tags: python tokenize


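【Solution 1】:

Here is one possible modification. It works for your specific example, but it will certainly fail on inputs like "How are you?!", because it only splits off a single trailing punctuation mark: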

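def tokenize(s):
    # Punctuation marks that are split off when they end a token.
    punctuation = ["!", ",", ".", ";", ":"]
    d = []
    start = 0

    while start < len(s):
        # Skip whitespace before the next token.
        while start < len(s) and s[start].isspace():
            start = start+1
        if start == len(s):
            break  # only trailing whitespace was left

        # Advance to the end of the current token.
        end = start
        while end < len(s) and not s[end].isspace():
            end = end+1

        if s[end-1] in punctuation:
            # Split one trailing punctuation mark into its own token.
            if end-1 > start:
                d = d + [s[start:(end-1)]]
            d = d + [s[end-1]]
        else:
            d = d + [s[start:end]]

        start = end

    print(d)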

tokenize("He was walking, it was fun!")
# ['He', 'was', 'walking', ',', 'it', 'was', 'fun', '!']
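
To get past the "How are you?!" limitation mentioned above, one option is to scan character by character instead of token by token. The following is a minimal sketch, not part of the original answer: the name tokenize_chars is hypothetical, and it assumes that "word" means a run of alphanumeric characters (str.isalnum) and that every other non-space character is its own token.

```python
def tokenize_chars(s):
    """Split s into runs of alphanumeric characters and single symbols."""
    tokens = []
    i = 0
    while i < len(s):
        ch = s[i]
        if ch.isspace():
            # Whitespace only separates tokens; it is never emitted.
            i += 1
        elif ch.isalnum():
            # Consume the whole run of letters and digits as one word.
            end = i
            while end < len(s) and s[end].isalnum():
                end += 1
            tokens.append(s[i:end])
            i = end
        else:
            # Any other character becomes a one-character token.
            tokens.append(ch)
            i += 1
    return tokens

print(tokenize_chars("He was walking, it was fun"))
# ['He', 'was', 'walking', ',', 'it', 'was', 'fun']
print(tokenize_chars("How are you?!"))
# ['How', 'are', 'you', '?', '!']
```

The trade-off is the one raised in the comments: hyphenated words, apostrophes and times are split into several tokens as well.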

【Discussion】:

    【Solution 2】:

    Another approach is to use the split function, as follows:

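    def tokenize(s):
        # Split on commas first, then split each piece on whitespace.
        d1 = s.split(",")
        d3 = []
        for d2 in d1:
            for d in d2.split():
                d3.append(d)
            # Put back the comma that followed this piece.
            d3.append(",")
        # Drop the extra comma appended after the last piece.
        d3.pop(-1)
        print(d3)

    tokenize("He was walking, it was fun")
    # ['He', 'was', 'walking', ',', 'it', 'was', 'fun']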
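
The split-based idea handles commas only. A possible way to extend it to other symbols is to pad each symbol with spaces and then let str.split do the rest. This is a sketch under assumptions: the function name tokenize_split and the default symbol set ",.;:!?" are chosen here for illustration and are not from the original answer.

```python
def tokenize_split(s, symbols=",.;:!?"):
    # Surround every symbol with spaces so that a plain whitespace
    # split separates it from the neighbouring word.
    for sym in symbols:
        s = s.replace(sym, f" {sym} ")
    # split() with no arguments collapses runs of whitespace.
    return s.split()

print(tokenize_split("He was walking, it was fun"))
# ['He', 'was', 'walking', ',', 'it', 'was', 'fun']
```

Unlike the comma-only version, this returns the list instead of printing it, which makes the result easier to reuse.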
    

    【Discussion】:
