【Question title】: How do I get characters that are non-letters and non-digits appended in a list?
【Posted】: 2020-11-28 12:04:03
【Question】:

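This is about a simple word count: collecting which words occur in a document and how often.

I am trying to write a function whose input is a list of text lines. I go through all the lines, split them into words, accumulate the recognized words, and finally return the complete list.

First, I have a while loop that goes through all the characters in a line, skipping whitespace. Inside this while loop I also try to recognize what kind of word I have. In this context there are three kinds of words:

  • those starting with a letter;
  • those starting with a digit;
  • and those that contain only a single character that is neither a letter nor a digit.

I have three branches (if, elif, else) that check what kind of character I have. Once I know what kind of word I have encountered, I try to extract the word itself. When the word starts with a letter or a digit, I take all consecutive characters of the same kind as part of the word.

However, in the third branch, where I handle the case in which the current character is neither a letter nor a digit, I run into problems.

When I give the input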

wordfreq.tokenize(['15,    delicious&   Tarts.'])

I expect the output to be

['15', ',', 'delicious', '&', 'tarts', '.']

When I test the function in the Python console, it looks like this:

PyDev console: starting.
Python 3.7.4 (v3.7.4:e09359112e, Jul  8 2019, 14:54:52) 
[Clang 6.0 (clang-600.0.57)] on darwin
import wordfreq
wordfreq.tokenize(['15,    delicious&   Tarts.'])
['15', 'delicious', 'tarts']

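The function ignores the comma, the & and the period! How do I fix this? See the code below.

(The lower() call is there because I want to ignore case, e.g. 'Tarts' and 'tarts' are the same word.)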

# wordfreq.py

def tokenize(lines):
    words = []
    for line in lines:
        start = 0
        while start < len(line):
            while line[start].isspace():
                start = start + 1
            if line[start].isalpha():
                end = start
                while line[end].isalpha():
                    end = end + 1
                word = line[start:end]
                words.append(word.lower())
                start = end
            elif line[start].isdigit():
                end = start
                while line[end].isdigit():
                    end = end + 1
                word = line[start:end]
                words.append(word)
                start = end
            else:
                words.append(line[start])
            start = start + 1
    return words

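【Comments】:

  • Please add an example of what a typical input to your function and the desired output look like.
  • See the example of input and desired output above.

Tags: python counting word identify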


【Solution 1】:

I found the problem. The line

start = start + 1

should be inside the final else branch.

So my code now looks like this, and gives me the desired output specified above:
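To see why the unconditional increment loses characters, here is a minimal sketch that mimics a single pass of the original loop body for a line that starts with digits. The helper name `trace_number_token` is hypothetical, and the bounds check in the inner loop is an addition that the original code does not have.

```python
def trace_number_token(line):
    """Mimic one pass of the original loop body for a line starting with digits."""
    start = 0
    end = start
    # Bounds check added for safety; the original inner loop has none.
    while end < len(line) and line[end].isdigit():
        end += 1
    token = line[start:end]
    start = end  # start now indexes the first unprocessed character
    skipped = line[start] if start < len(line) else None
    start = start + 1  # the unconditional increment steps over that character
    return token, skipped, start


print(trace_number_token("15," + " " * 4 + "delicious"))
```

After the digit run, `start = end` already points at the comma, so incrementing once more moves past it before the next iteration can look at it. Moving the increment into the else branch means it only runs when a single symbol has actually been consumed.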

def tokenize(lines):
    words = []
    for line in lines:
        start = 0
        while start < len(line):
            while line[start].isspace():
                start = start + 1
            end = start
            if line[start].isalpha():
                while line[end].isalpha():
                    end = end + 1
                word = line[start:end]
                word = word.lower()
                words.append(word)
                start = end
            elif line[start].isdigit():
                while line[end].isdigit():
                    end = end + 1
                word = line[start:end]
                words.append(word)
                start = end
            else:
                word = line[start]
                words.append(word)
                start = start + 1
    return words

However, when I use the test script below to make sure I have not missed any corner cases of the function tokenize, ...

import io
import sys
import importlib.util

def test(fun,x,y):
    global pass_tests, fail_tests
    if type(x) == tuple:
        z = fun(*x)
    else:
        z = fun(x)
    if y == z:
        pass_tests = pass_tests + 1
    else:
        if type(x) == tuple:
            s = repr(x)
        else:
            s = "("+repr(x)+")"
        print("Condition failed:")
        print("   "+fun.__name__+s+" == "+repr(y))
        print(fun.__name__+" returned/printed:")
        print(str(z))
        fail_tests = fail_tests + 1

def run(src_path=None):
    global pass_tests, fail_tests

    if src_path == None:
        import wordfreq
    else:
        spec = importlib.util.spec_from_file_location("wordfreq", src_path+"/wordfreq.py")
        wordfreq = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(wordfreq)

    pass_tests = 0
    fail_tests = 0
    fun_count  = 0

    def printTopMost(freq,n):
        saved = sys.stdout
        sys.stdout = io.StringIO()
        wordfreq.printTopMost(freq,n)
        out = sys.stdout.getvalue()
        sys.stdout = saved
        return out

    if hasattr(wordfreq, "tokenize"):
        fun_count = fun_count + 1
        test(wordfreq.tokenize, [], [])
        test(wordfreq.tokenize, [""], [])
        test(wordfreq.tokenize, ["   "], [])
        test(wordfreq.tokenize, ["This is a simple sentence"], ["this","is","a","simple","sentence"])
        test(wordfreq.tokenize, ["I told you!"], ["i","told","you","!"])
        test(wordfreq.tokenize, ["The 10 little chicks"], ["the","10","little","chicks"])
        test(wordfreq.tokenize, ["15th anniversary"], ["15","th","anniversary"])
        test(wordfreq.tokenize, ["He is in the room, she said."], ["he","is","in","the","room",",","she","said","."])
    else:
        print("tokenize is not implemented yet!")

    if hasattr(wordfreq, "countWords"):
        fun_count = fun_count + 1
        test(wordfreq.countWords, ([],[]), {})
        test(wordfreq.countWords, (["clean","water"],[]), {"clean":1,"water":1})
        test(wordfreq.countWords, (["clean","water","is","drinkable","water"],[]), {"clean":1,"water":2,"is":1,"drinkable":1})
        test(wordfreq.countWords, (["clean","water","is","drinkable","water"],["is"]), {"clean":1,"water":2,"drinkable":1})
    else:
        print("countWords is not implemented yet!")

    if hasattr(wordfreq, "printTopMost"):
        fun_count = fun_count + 1
        test(printTopMost,({},10),"")
        test(printTopMost,({"horror": 5, "happiness": 15},0),"")
        test(printTopMost,({"C": 3, "python": 5, "haskell": 2, "java": 1},3),"python                  5\nC                       3\nhaskell                 2\n")
    else:
        print("printTopMost is not implemented yet!")

    print(str(pass_tests)+" out of "+str(pass_tests+fail_tests)+" passed.")

    return (fun_count == 3 and fail_tests == 0)

if __name__ == "__main__":
    run()

... I get the following output:

/usr/local/bin/python3.7 "/Users/ericjohannesson/Documents/Fristående kurser/DAT455 – Introduktion till programmering med Python/lab1/Laborations/Laboration_1/test.py"
Traceback (most recent call last):
  File "/Users/ericjohannesson/Documents/Fristående kurser/DAT455 – Introduktion till programmering med Python/lab1/Laborations/Laboration_1/test.py", line 81, in <module>
    run()
  File "/Users/ericjohannesson/Documents/Fristående kurser/DAT455 – Introduktion till programmering med Python/lab1/Laborations/Laboration_1/test.py", line 50, in run
    test(wordfreq.tokenize, ["   "], [])
  File "/Users/ericjohannesson/Documents/Fristående kurser/DAT455 – Introduktion till programmering med Python/lab1/Laborations/Laboration_1/test.py", line 10, in test
    z = fun(x)
  File "/Users/ericjohannesson/Documents/Fristående kurser/DAT455 – Introduktion till programmering med Python/lab1/Laborations/Laboration_1/wordfreq.py", line 44, in tokenize
    while line[start].isspace():
IndexError: string index out of range

Why does it say the string index is out of range? How do I fix this?
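The traceback points at the whitespace-skipping loop. The outer condition `start < len(line)` is only checked once per outer iteration, while the inner loops keep incrementing the index without any check. For the input "   " the index reaches 3 and `line[3]` raises the IndexError; the inner isalpha and isdigit loops fail the same way whenever a line ends in a letter or a digit. One possible fix, offered as a sketch rather than the only way to do it, is to add a bounds check to every inner loop and to stop when only whitespace is left:

```python
def tokenize(lines):
    """Split each line into lowercase words, digit runs, and single symbols."""
    words = []
    for line in lines:
        start = 0
        n = len(line)
        while start < n:
            # Skip whitespace, but never index past the end of the line.
            while start < n and line[start].isspace():
                start += 1
            if start == n:
                break  # the rest of the line was only whitespace
            end = start
            if line[start].isalpha():
                while end < n and line[end].isalpha():
                    end += 1
                words.append(line[start:end].lower())
                start = end
            elif line[start].isdigit():
                while end < n and line[end].isdigit():
                    end += 1
                words.append(line[start:end])
                start = end
            else:
                # A single character that is neither a letter nor a digit.
                words.append(line[start])
                start += 1
    return words


print(tokenize(["15," + " " * 4 + "delicious&" + " " * 3 + "Tarts."]))
```

With these checks in place, the empty-string and whitespace-only cases from the test script return an empty list instead of raising.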

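【Comments】:

    【Solution 2】:

    itertools.groupby can simplify this considerably. Basically, you group the characters of the string by their category or type (letter, digit, or punctuation). In this example I define only these three categories, but you can define as many or as few as you need. Any character that matches none of the categories (in this case, whitespace) is ignored: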

    def get_tokens(string):
        from itertools import groupby
        from string import ascii_lowercase, ascii_uppercase, digits, punctuation as punct
        alpha = ascii_lowercase + ascii_uppercase
    
        yield from ("".join(group) for key, group in groupby(string, key=lambda char: next((category for category in (alpha, digits, punct) if char in category), "")) if key)
    
    print(list(get_tokens("15,    delicious&   Tarts.")))
    

    Output:

    ['15', ',', 'delicious', '&', 'Tarts', '.']
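The groupby answer above keeps the original case and would join consecutive symbols such as "!?" into one token, whereas the question asks for lowercase words and single-character symbol tokens. A possible adaptation, sketched under those assumptions, classifies characters with the str methods used in the question. The names `char_class` and `tokenize_groupby` are hypothetical.

```python
from itertools import groupby


def char_class(ch):
    """Classify a character the same way the question's if/elif/else does."""
    if ch.isspace():
        return "space"
    if ch.isalpha():
        return "alpha"
    if ch.isdigit():
        return "digit"
    return "other"


def tokenize_groupby(lines):
    """Tokenize a list of lines by grouping adjacent characters of one class."""
    words = []
    for line in lines:
        for kind, group in groupby(line, key=char_class):
            chunk = "".join(group)
            if kind == "alpha":
                words.append(chunk.lower())
            elif kind == "digit":
                words.append(chunk)
            elif kind == "other":
                words.extend(chunk)  # each symbol becomes its own token
            # runs of whitespace are dropped
    return words


print(tokenize_groupby(["15," + " " * 4 + "delicious&" + " " * 3 + "Tarts."]))
```

Because groupby only iterates over characters that exist, this version needs no index arithmetic and cannot run past the end of a line.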
    

    【Comments】:

      【Solution 3】:

      I don't know why you are dealing with upper and lower case, but you can do it like this:

      input = ['15,    delicious&   Tarts.']
      line = input[0]
      words = line.split(' ')
      words = [word for word in words if word]
      out:
      ['15,', 'delicious&', 'Tarts.']
      

      Edit: I see you edited your desired output. Skip this line to get that output:

          words = [word for word in words if word]
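A quick check of what `str.split(' ')` returns for the sample input, with and without the filtering line. The helper `split_on_spaces` and its `drop_empty` flag are hypothetical names used only for this illustration.

```python
def split_on_spaces(line, drop_empty):
    """Split on single spaces, optionally dropping the empty strings."""
    words = line.split(' ')
    if drop_empty:
        words = [word for word in words if word]
    return words


sample = "15," + " " * 4 + "delicious&" + " " * 3 + "Tarts."
print(split_on_spaces(sample, drop_empty=True))
print(split_on_spaces(sample, drop_empty=False))
```

Without the filter, every extra space between words shows up as an empty string in the result. In both variants the punctuation stays attached to the neighbouring word, so neither one yields the token list requested in the question.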
      

      【Comments】:
