IndexError: string index out of range, even though the program gives the desired output

【Question title】: IndexError: string index out of range, even though the program gives the desired output
【Posted】: 2020-11-30 02:04:56
【Question】:

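I am trying to write a function "tokenize" that takes a list of text lines as input. I go through all the lines, split them into words, accumulate the recognized words, and finally return the complete list.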

The function "tokenize" looks like this:

def tokenize(lines):
    words = []
    for line in lines:
        start = 0
        while start < len(line):
            while line[start].isspace():
                start = start + 1
            end = start
            if line[start].isalpha():
                while line[end].isalpha():
                    end = end + 1
                word = line[start:end]
                word = word.lower()
                words.append(word)
                start = end
            elif line[start].isdigit():
                while line[end].isdigit():
                    end = end + 1
                word = line[start:end]
                words.append(word)
                start = end
            else:
                word = line[start]
                words.append(word)
                start = start + 1
    return words

When I give it the input

wordfreq.tokenize(['15,    delicious&   Tarts.'])

it gives the output

['15', ',', 'delicious', '&', 'tarts', '.']

which is the desired output, so nothing is wrong there.

However, when I use the test script below to make sure I have not missed any corner cases of the function "tokenize"...

import io
import sys
import importlib.util

def test(fun,x,y):
    global pass_tests, fail_tests
    if type(x) == tuple:
        z = fun(*x)
    else:
        z = fun(x)
    if y == z:
        pass_tests = pass_tests + 1
    else:
        if type(x) == tuple:
            s = repr(x)
        else:
            s = "("+repr(x)+")"
        print("Condition failed:")
        print("   "+fun.__name__+s+" == "+repr(y))
        print(fun.__name__+" returned/printed:")
        print(str(z))
        fail_tests = fail_tests + 1

def run(src_path=None):
    global pass_tests, fail_tests

    if src_path == None:
        import wordfreq
    else:
        spec = importlib.util.spec_from_file_location("wordfreq", src_path+"/wordfreq.py")
        wordfreq = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(wordfreq)

    pass_tests = 0
    fail_tests = 0
    fun_count  = 0

    def printTopMost(freq,n):
        saved = sys.stdout
        sys.stdout = io.StringIO()
        wordfreq.printTopMost(freq,n)
        out = sys.stdout.getvalue()
        sys.stdout = saved
        return out

    if hasattr(wordfreq, "tokenize"):
        fun_count = fun_count + 1
        test(wordfreq.tokenize, [], [])
        test(wordfreq.tokenize, [""], [])
        test(wordfreq.tokenize, ["   "], [])
        test(wordfreq.tokenize, ["This is a simple sentence"], ["this","is","a","simple","sentence"])
        test(wordfreq.tokenize, ["I told you!"], ["i","told","you","!"])
        test(wordfreq.tokenize, ["The 10 little chicks"], ["the","10","little","chicks"])
        test(wordfreq.tokenize, ["15th anniversary"], ["15","th","anniversary"])
        test(wordfreq.tokenize, ["He is in the room, she said."], ["he","is","in","the","room",",","she","said","."])
    else:
        print("tokenize is not implemented yet!")

    if hasattr(wordfreq, "countWords"):
        fun_count = fun_count + 1
        test(wordfreq.countWords, ([],[]), {})
        test(wordfreq.countWords, (["clean","water"],[]), {"clean":1,"water":1})
        test(wordfreq.countWords, (["clean","water","is","drinkable","water"],[]), {"clean":1,"water":2,"is":1,"drinkable":1})
        test(wordfreq.countWords, (["clean","water","is","drinkable","water"],["is"]), {"clean":1,"water":2,"drinkable":1})
    else:
        print("countWords is not implemented yet!")

    if hasattr(wordfreq, "printTopMost"):
        fun_count = fun_count + 1
        test(printTopMost,({},10),"")
        test(printTopMost,({"horror": 5, "happiness": 15},0),"")
        test(printTopMost,({"C": 3, "python": 5, "haskell": 2, "java": 1},3),"python                  5\nC                       3\nhaskell                 2\n")
    else:
        print("printTopMost is not implemented yet!")

    print(str(pass_tests)+" out of "+str(pass_tests+fail_tests)+" passed.")

    return (fun_count == 3 and fail_tests == 0)

if __name__ == "__main__":
    run()

...I get the following output:

/usr/local/bin/python3.7 "/Users/ericjohannesson/Documents/Fristående kurser/DAT455 – Introduktion till programmering med Python/lab1/Laborations/Laboration_1/test.py"
Traceback (most recent call last):
  File "/Users/ericjohannesson/Documents/Fristående kurser/DAT455 – Introduktion till programmering med Python/lab1/Laborations/Laboration_1/test.py", line 81, in <module>
    run()
  File "/Users/ericjohannesson/Documents/Fristående kurser/DAT455 – Introduktion till programmering med Python/lab1/Laborations/Laboration_1/test.py", line 50, in run
    test(wordfreq.tokenize, ["   "], [])
  File "/Users/ericjohannesson/Documents/Fristående kurser/DAT455 – Introduktion till programmering med Python/lab1/Laborations/Laboration_1/test.py", line 10, in test
    z = fun(x)
  File "/Users/ericjohannesson/Documents/Fristående kurser/DAT455 – Introduktion till programmering med Python/lab1/Laborations/Laboration_1/wordfreq.py", line 44, in tokenize
    while line[start].isspace():
IndexError: string index out of range

Why does it say the string index is out of range? I have debugged the 'tokenize' function and it looks fine to me, so why does it still complain?

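【Comments】:

"which is the desired output, so nothing is wrong there." OK; and if you test by hand with the example that fails? It fails, doesn't it? When you are working out what is wrong, the things your code does correctly are not interesting.
"I have debugged the 'tokenize' function and it looks fine" It is not clear to me what you think "debugging" actually involves. Your first step should be to look at the specific test case that fails, look at the line of code that fails, and work out what must be true to cause the IndexError. Then work backwards and try to explain how that came about.
@MisterMiyagi If my input is [] (i.e. an empty line), then words = []. Then start == 0 and len(line) == 0. That means start
Step through the failing test with a debugger, as Karl suggested.
Your line while line[start].isspace() does not perform this check. The preceding start < len(line) only verifies that something is left in the line, not that start is still a valid index after it has been advanced.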
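
The point made in the last comment can be reproduced in isolation. The sketch below is only an illustration: the helper name skip_spaces_unchecked is hypothetical and simply copies the question's inner whitespace loop. For a line that consists only of spaces, start keeps growing until it equals len(line), and the next line[start] raises the IndexError.

```python
def skip_spaces_unchecked(line, start=0):
    """Copy of the question's inner loop: reads line[start] with no bounds check."""
    while line[start].isspace():
        start += 1
    return start


for line in ["  ab", "   "]:
    try:
        print(repr(line), "->", skip_spaces_unchecked(line))
    except IndexError as exc:
        # start has reached len(line), so line[start] no longer exists
        print(repr(line), "-> IndexError:", exc)
```

The first line stops at index 2 because a non-space character follows; the second has nothing but spaces, so the loop runs off the end.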

Tags: python loops while-loop index-error

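【Solution 1】:

The test code fails both when the input is a list containing a single string that is several spaces long and when it is a list containing a single multi-character string. So extend the tokenize function in your wordfreq with an early return for empty input: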

if not lines or all(x.isspace() for x in lines):
    return words

and, inside the for loop, check the index against the length of the line before using it:

while end != len(line) and line[end].isalpha():
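
The same guard can be factored out so that every scanning loop uses it. This is a minimal sketch, not part of the answer: scan_while is a hypothetical helper name. It relies on the fact that Python's and short-circuits, so the length test has to come first and line[end] is only evaluated when end is a valid index.

```python
def scan_while(line, start, predicate):
    """Return the first index >= start where predicate fails or the line ends."""
    end = start
    # Length test first: `and` short-circuits, so line[end] is never
    # evaluated once end == len(line).
    while end < len(line) and predicate(line[end]):
        end += 1
    return end


print(scan_while("   ", 0, str.isspace))     # 3
print(scan_while("15th", 0, str.isdigit))    # 2
print(scan_while("Tarts", 0, str.isalpha))   # 5
```

Swapping the two operands of and would bring back the IndexError, because line[end] would then be evaluated before the length check.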

Full program, wordfreq.py:

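def tokenize(lines):
    words = []

    # Early return for empty input or input that is only whitespace.
    if not lines or all(x.isspace() for x in lines):
        return words

    for line in lines:
        start = 0
        while start < len(line):
            # Skip whitespace without indexing past the end of the line.
            while start < len(line) and line[start].isspace():
                start += 1
            if start == len(line):
                break  # only trailing whitespace was left
            end = start
            if line[start].isalpha():
                while end != len(line) and line[end].isalpha():
                    end += 1
                words.append(line[start:end].lower())
                start = end
            elif line[start].isdigit():
                # Same bounds check, so a line that ends in a digit is safe.
                while end != len(line) and line[end].isdigit():
                    end += 1
                words.append(line[start:end])
                start = end
            else:
                words.append(line[start])
                start += 1
    return words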


print(tokenize(['15,    delicious&   Tarts.']))
print(tokenize([]))
print(tokenize([""]))
print(tokenize(["   "]))
print(tokenize(["This is a simple sentence"]))
print(tokenize(["I told you!"]))
print(tokenize(["The 10 little chicks"]))
print(tokenize(["15th anniversary"]))
print(tokenize(["He is in the room, she said."]))

['15', ',', 'delicious', '&', 'tarts', '.']
[]
[]
[]
['this', 'is', 'a', 'simple', 'sentence']
['i', 'told', 'you', '!']
['the', '10', 'little', 'chicks']
['15', 'th', 'anniversary']
['he', 'is', 'in', 'the', 'room', ',', 'she', 'said', '.']
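
As an alternative to index arithmetic, the same tokens can be obtained with a regular expression, which avoids manual bounds checks altogether. This is a different technique from the answer's, shown only as a sketch: the names TOKEN_RE and tokenize_re are hypothetical, and the pattern is assumed to match the isalpha/isdigit behavior for ordinary ASCII text only (unusual Unicode digits may be classified differently).

```python
import re

# Alternation order matters: a run of letters first, then a run of digits,
# then any other single non-whitespace character.
TOKEN_RE = re.compile(r"[^\W\d_]+|\d+|\S")


def tokenize_re(lines):
    words = []
    for line in lines:
        # lower() leaves digits and punctuation unchanged
        words.extend(token.lower() for token in TOKEN_RE.findall(line))
    return words


print(tokenize_re(['15,    delicious&   Tarts.']))
print(tokenize_re(["   "]))
print(tokenize_re(["15th anniversary"]))
```

For these three calls the result matches the corresponding lines of the output listed above.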

【Comments】: