
【Question title】: How do I get characters that are non-letters and non-digits appended in a list?
【Posted】: 2020-11-28 12:04:03
【Question description】:

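This is about a simple word count: collecting which words occur in a document and how often they occur.

I am trying to write a function whose input is a list of text lines. I go through all the lines, split them into words, accumulate the recognized words, and finally return the complete list.

First, I have a while loop that goes through all the characters in a line but ignores whitespace. Inside this while loop I also try to recognize what kind of word I have. In this context there are three kinds of words:

those starting with a letter;
those starting with a digit;
and those that contain just one character that is neither a letter nor a digit.

I have three if-statements that check what kind of character I have. Once I know what kind of word I have encountered, I try to extract the word itself. When the word starts with a letter or a digit, I take all consecutive characters of the same kind as part of the word.

But in the third if-statement, where I handle the case in which the current character is neither a letter nor a digit, I run into problems.

When I give the input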

wordfreq.tokenize(['15,    delicious&   Tarts.'])

I want the output to be

['15', ',', 'delicious', '&', 'tarts', '.']

When I test the function in the Python console, it looks like this:

PyDev console: starting.
Python 3.7.4 (v3.7.4:e09359112e, Jul  8 2019, 14:54:52) 
[Clang 6.0 (clang-600.0.57)] on darwin
import wordfreq
wordfreq.tokenize(['15,    delicious&   Tarts.'])
['15', 'delicious', 'tarts']

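The function does not pick up the comma, the ampersand, or the period! How do I fix this? See the code below.

(The lower() call is there because I want to ignore case, e.g. 'Tarts' and 'tarts' are the same word.)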

# wordfreq.py

def tokenize(lines):
    words = []
    for line in lines:
        start = 0
        while start < len(line):
            while line[start].isspace():
                start = start + 1
            if line[start].isalpha():
                end = start
                while line[end].isalpha():
                    end = end + 1
                word = line[start:end]
                words.append(word.lower())
                start = end
            elif line[start].isdigit():
                end = start
                while line[end].isdigit():
                    end = end + 1
                word = line[start:end]
                words.append(word)
                start = end
            else:
                words.append(line[start])
            start = start + 1
    return words

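【Question comments】:

Please add an example of what typical input to your function and the desired output look like.
See the example of the desired input/output above.

Tags: python counting word identify

【Solution 1】:

I found the problem. The line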

start = start + 1

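should be placed inside the final else branch. In the original it ran after every branch, so the character immediately following a word or number was skipped.

So my code now looks like this, and it gives me the desired output specified above: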

def tokenize(lines):
    words = []
    for line in lines:
        start = 0
        while start < len(line):
            while line[start].isspace():
                start = start + 1
            end = start
            if line[start].isalpha():
                while line[end].isalpha():
                    end = end + 1
                word = line[start:end]
                word = word.lower()
                words.append(word)
                start = end
            elif line[start].isdigit():
                while line[end].isdigit():
                    end = end + 1
                word = line[start:end]
                words.append(word)
                start = end
            else:
                word = line[start]
                words.append(word)
                start = start + 1
    return words

However, when I use the test script below to make sure that I have not missed any corner cases of the function "tokenize" ...

import io
import sys
import importlib.util

def test(fun,x,y):
    global pass_tests, fail_tests
    if type(x) == tuple:
        z = fun(*x)
    else:
        z = fun(x)
    if y == z:
        pass_tests = pass_tests + 1
    else:
        if type(x) == tuple:
            s = repr(x)
        else:
            s = "("+repr(x)+")"
        print("Condition failed:")
        print("   "+fun.__name__+s+" == "+repr(y))
        print(fun.__name__+" returned/printed:")
        print(str(z))
        fail_tests = fail_tests + 1

def run(src_path=None):
    global pass_tests, fail_tests

    if src_path == None:
        import wordfreq
    else:
        spec = importlib.util.spec_from_file_location("wordfreq", src_path+"/wordfreq.py")
        wordfreq = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(wordfreq)

    pass_tests = 0
    fail_tests = 0
    fun_count  = 0

    def printTopMost(freq,n):
        saved = sys.stdout
        sys.stdout = io.StringIO()
        wordfreq.printTopMost(freq,n)
        out = sys.stdout.getvalue()
        sys.stdout = saved
        return out

    if hasattr(wordfreq, "tokenize"):
        fun_count = fun_count + 1
        test(wordfreq.tokenize, [], [])
        test(wordfreq.tokenize, [""], [])
        test(wordfreq.tokenize, ["   "], [])
        test(wordfreq.tokenize, ["This is a simple sentence"], ["this","is","a","simple","sentence"])
        test(wordfreq.tokenize, ["I told you!"], ["i","told","you","!"])
        test(wordfreq.tokenize, ["The 10 little chicks"], ["the","10","little","chicks"])
        test(wordfreq.tokenize, ["15th anniversary"], ["15","th","anniversary"])
        test(wordfreq.tokenize, ["He is in the room, she said."], ["he","is","in","the","room",",","she","said","."])
    else:
        print("tokenize is not implemented yet!")

    if hasattr(wordfreq, "countWords"):
        fun_count = fun_count + 1
        test(wordfreq.countWords, ([],[]), {})
        test(wordfreq.countWords, (["clean","water"],[]), {"clean":1,"water":1})
        test(wordfreq.countWords, (["clean","water","is","drinkable","water"],[]), {"clean":1,"water":2,"is":1,"drinkable":1})
        test(wordfreq.countWords, (["clean","water","is","drinkable","water"],["is"]), {"clean":1,"water":2,"drinkable":1})
    else:
        print("countWords is not implemented yet!")

    if hasattr(wordfreq, "printTopMost"):
        fun_count = fun_count + 1
        test(printTopMost,({},10),"")
        test(printTopMost,({"horror": 5, "happiness": 15},0),"")
        test(printTopMost,({"C": 3, "python": 5, "haskell": 2, "java": 1},3),"python                  5\nC                       3\nhaskell                 2\n")
    else:
        print("printTopMost is not implemented yet!")

    print(str(pass_tests)+" out of "+str(pass_tests+fail_tests)+" passed.")

    return (fun_count == 3 and fail_tests == 0)

if __name__ == "__main__":
    run()

... I get the following output:

/usr/local/bin/python3.7 "/Users/ericjohannesson/Documents/Fristående kurser/DAT455 – Introduktion till programmering med Python/lab1/Laborations/Laboration_1/test.py"
Traceback (most recent call last):
  File "/Users/ericjohannesson/Documents/Fristående kurser/DAT455 – Introduktion till programmering med Python/lab1/Laborations/Laboration_1/test.py", line 81, in <module>
    run()
  File "/Users/ericjohannesson/Documents/Fristående kurser/DAT455 – Introduktion till programmering med Python/lab1/Laborations/Laboration_1/test.py", line 50, in run
    test(wordfreq.tokenize, ["   "], [])
  File "/Users/ericjohannesson/Documents/Fristående kurser/DAT455 – Introduktion till programmering med Python/lab1/Laborations/Laboration_1/test.py", line 10, in test
    z = fun(x)
  File "/Users/ericjohannesson/Documents/Fristående kurser/DAT455 – Introduktion till programmering med Python/lab1/Laborations/Laboration_1/wordfreq.py", line 44, in tokenize
    while line[start].isspace():
IndexError: string index out of range

Why does it say that the string index is out of range? How can I fix this?
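
One likely cause, offered as a sketch and not as the author's final fix: the inner while loops index line[start] and line[end] without first checking that the index is still smaller than len(line). For the input "   " the whitespace loop runs past the end of the string, and the same would happen for any line that ends in a letter or a digit. Adding a bounds check to every inner loop avoids the IndexError. The version below keeps the original structure and only adds those checks:

```python
def tokenize(lines):
    """Split each line into lower-cased words, digit runs and single symbols."""
    words = []
    for line in lines:
        start = 0
        n = len(line)
        while start < n:
            # Skip whitespace one character at a time; the outer loop
            # condition guarantees that start never passes the end.
            if line[start].isspace():
                start += 1
                continue
            end = start
            if line[start].isalpha():
                # Check the bound before indexing line[end].
                while end < n and line[end].isalpha():
                    end += 1
                words.append(line[start:end].lower())
            elif line[start].isdigit():
                while end < n and line[end].isdigit():
                    end += 1
                words.append(line[start:end])
            else:
                # Any other non-space character is a one-character token.
                end = start + 1
                words.append(line[start])
            start = end
    return words


print(tokenize(["15,    delicious&   Tarts."]))  # ['15', ',', 'delicious', '&', 'tarts', '.']
print(tokenize(["   "]))                         # []
```

With these checks, the tokenize cases listed in the test script above (empty list, empty string, whitespace-only line, and lines ending in a letter) should no longer raise.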

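【Discussion】:

【Solution 2】:

itertools.groupby can simplify this considerably. Basically, you group the characters of the string by their category or kind (letter, digit, or punctuation). In this example I define only those three categories, but you can define as many or as few as you need. Any character that matches none of the categories (here, whitespace) is ignored: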

def get_tokens(string):
    from itertools import groupby
    from string import ascii_lowercase, ascii_uppercase, digits, punctuation as punct
    alpha = ascii_lowercase + ascii_uppercase

    yield from ("".join(group) for key, group in groupby(string, key=lambda char: next((category for category in (alpha, digits, punct) if char in category), "")) if key)

print(list(get_tokens("15,    delicious&   Tarts.")))

Output:

['15', ',', 'delicious', '&', 'Tarts', '.']
>>>
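
Note that this output keeps 'Tarts' capitalized, whereas the question asks for lower-cased words, and that groupby would merge adjacent punctuation such as "!?" into a single token. A small variation, sketched here under the assumption that the question's rules apply (lower-case the words, one token per symbol), could look like this. The helper names char_kind and get_tokens_lower are made up for illustration:

```python
from itertools import groupby


def char_kind(char):
    """Classify a character; whitespace maps to None so it can be dropped."""
    if char.isalpha():
        return "alpha"
    if char.isdigit():
        return "digit"
    if char.isspace():
        return None
    return "symbol"


def get_tokens_lower(text):
    tokens = []
    for kind, group in groupby(text, key=char_kind):
        chunk = "".join(group)
        if kind is None:
            continue  # ignore runs of whitespace
        if kind == "symbol":
            # Emit every symbol separately, so "!?" becomes "!", "?".
            tokens.extend(chunk)
        else:
            tokens.append(chunk.lower())
    return tokens


print(get_tokens_lower("15,    delicious&   Tarts."))
# ['15', ',', 'delicious', '&', 'tarts', '.']
```

Using str.isalpha and str.isdigit instead of the ASCII constants from the string module also matches the character tests used in the question's own code.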

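【Discussion】:

【Solution 3】:

I am not sure why you are handling upper and lower case, but you can do it like this:

lines = ['15,    delicious&   Tarts.']
line = lines[0]
words = line.split(' ')
words = [word for word in words if word]
print(words)
# ['15,', 'delicious&', 'Tarts.']

Edit: I see you have edited your desired output. Skip this line to get that output: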

    words = [word for word in words if word]
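
Splitting on spaces alone leaves the punctuation attached to the words ('15,' instead of '15' and ','). One way to separate them, offered as an alternative sketch and not as part of this answer, is a regular expression that matches a run of letters, a run of digits, or any single non-space character. The names TOKEN_RE and tokenize_regex are hypothetical, and \d may differ slightly from str.isdigit for a few unusual Unicode characters:

```python
import re

# A run of letters (Unicode-aware), or a run of digits,
# or any single character that is not whitespace.
TOKEN_RE = re.compile(r"[^\W\d_]+|\d+|\S")


def tokenize_regex(lines):
    words = []
    for line in lines:
        # Lower-casing only affects the letter runs.
        words.extend(token.lower() for token in TOKEN_RE.findall(line))
    return words


print(tokenize_regex(["15,    delicious&   Tarts."]))
# ['15', ',', 'delicious', '&', 'tarts', '.']
```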

【Discussion】: