Get 10 words before a particular word in a file in Python

【Question title】: Get 10 words before a particular word in a file in python
【Posted】: 2023-01-27 23:06:43
【Question description】:

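I have a file that contains sentences line by line. I need to get the 10 words before a particular word (case-insensitive), but some of them may be on a previous line. For example, if I want the word ball and it is the fourth word of the second line, then I need the 3 words before it on that line plus 7 words from the previous line, or even from the line before that. I can't figure out how to get exactly 10 words when they span previous lines. This is what I have so far: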


for line in file:
            # reading each word        
            for words in line.split():
                y = 'myword'.lower
                if y = words.lower:
                    index = words.index(y)
                    i = 0, z = 0
                    for words in line[i]:
                        sentence += words
                        if str(len(sentence.split()) != 10:
                        i--
                    
                    print(sentence)

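【Question discussion】:

Do you need to keep track of sentence boundaries?
Doesn't the line if y = words.lower: raise an error when you try to run this code?
I strongly recommend working through the official Python tutorial or some other course to get a better grasp of Python syntax.

Tags: python nlp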

【Solution 1】:

Converting the whole file into a list of words works:

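with open('text.txt', 'r') as f:
    # split() with no argument splits on any whitespace, including newlines
    words_list = f.read().split()

target = 'even'
ret = ''
# enumerate gives the position of this occurrence;
# list.index() would always return the first occurrence
for i, word in enumerate(words_list):
    # case-insensitive comparison, as the question asks
    if word.lower() == target:
        # clamp at 0 so a match near the start does not wrap around
        start_index = max(0, i - 10)
        # the 10 preceding words plus the matched word itself
        ret = ' '.join(words_list[start_index:i + 1])

print(ret)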

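【Discussion】:

【Solution 2】:

Your code probably doesn't work because lower() is a method, not an attribute. Also, consider defining your word outside the loop so it isn't recreated on every iteration.

If your code still doesn't work, I wrote the following, which should work: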

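import re

myword = "myword"
sentence = ""

# assumption: the input file is named text.txt; s holds its full text
with open("text.txt", "r") as f:
    s = f.read()

# split() with no argument splits on any whitespace, including newlines
split_sentence = s.split()

for index, word in enumerate(split_sentence):
    # remove special characters before comparing
    if re.sub(r"[.!?,'@#$%^&*()]", "", word).lower() == myword:
        # make sure the start index is in bounds
        start_index = max(0, index - 10)
        # collect the (up to) 10 words immediately before the match
        for word_index in range(start_index, index):
            sentence += f"{split_sentence[word_index]} "

print(sentence)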

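This should build a sentence from the 10 words leading up to the word you are looking for, punctuation included. If you only want the words without the punctuation, this should do the trick: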

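import re

myword = "myword"
sentence = ""

# assumption: the input file is named text.txt; s holds its full text
with open("text.txt", "r") as f:
    s = f.read()

# remove special characters, then split on any whitespace (including newlines)
split_sentence = re.sub(r"[.!?,'@#$%^&*()]", "", s).split()

for index, word in enumerate(split_sentence):
    if word.lower() == myword:
        # make sure the start index is in bounds
        start_index = max(0, index - 10)
        # collect the (up to) 10 words immediately before the match
        for word_index in range(start_index, index):
            sentence += f"{split_sentence[word_index]} "

print(sentence)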

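【Discussion】:

"Your code probably doesn't work because lower() is a method, not an attribute." You're an optimist. There is also the use of = instead of == in a condition, the attempt to assign two variables on one line with i = 0, z = 0, wrapping len(sentence.split()) in str before comparing it with the int 10, the meaningless for words in line[i] right after setting i to 0, and the attempt to decrement i with i--, which isn't Python at all and would have no effect anyway because i is always reset to 0 before it is used. And I have probably missed one to twenty other issues.
Well, it's a start, right? :)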
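For reference, the issues listed in the comment above can be avoided with something along the lines of the sketch below. It is not the asker's or any answerer's code: the function name `words_before` and its parameters are made up for illustration, and it assumes that words are whitespace-separated and that leading or trailing punctuation should be ignored when matching. A `collections.deque` with `maxlen=10` keeps the most recent words, so the context carries over line boundaries:

```python
from collections import deque
import string


def words_before(lines, target, n=10):
    """Yield, for each match of target, the up-to-n words preceding it.

    `lines` is any iterable of strings (for example an open text file).
    Matching is case-insensitive and ignores leading/trailing punctuation.
    """
    target = target.lower()          # lower() is a method, so it must be called
    recent = deque(maxlen=n)         # holds at most the last n words seen
    for line in lines:
        for word in line.split():
            if word.strip(string.punctuation).lower() == target:
                yield list(recent)   # snapshot of the preceding words
            recent.append(word)


if __name__ == "__main__":
    sample = [
        "one two three four five six seven eight\n",
        "nine ten eleven Ball twelve\n",
    ]
    for context in words_before(sample, "ball"):
        # prints: two three four five six seven eight nine ten eleven
        print(" ".join(context))
```

Because the deque discards old words automatically, the file is read once, line by line, and never has to fit in memory as a single list.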

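【Solution 3】:

I don't know what your file looks like, so I used a string to simulate it. My version takes the 10 words before the target, or all the preceding words if there are fewer than 10, and returns a final list that holds, for every line containing the word, the words that precede it.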

def get_10_words(file, word_to_find):
    file_10_words_list = []
    cont = 0
    for line in file.lower().split('\n'):
        new_line = line.split(' ')
        # pad the front with 10 empty strings so the slice below never runs out of range
        for c in range(10):
            new_line.insert(0, '')
        try:
            word_index = new_line.index(word_to_find.lower())
        except ValueError:
            print(f"Line {cont + 1} hasn't got {word_to_find.title()}")
        else:
            words_before_list = [new_line[element + word_index] for element in range(-10, 0)]
            # drop the padding again
            words_before_list = [element for element in words_before_list if element != '']
            file_10_words_list.append(words_before_list)
        cont += 1
    return file_10_words_list

if __name__ == '__main__':
    words = get_10_words('This is the line one This is the line one This is the line one Haha\n'
                         'This is the line two This is the line two This is the line two How\n'
                         'This is the line tree Haha', 'Haha')

    print(words)

If anything in my code is unclear, you can ask me here!

【Discussion】:

A good way to simulate an open file is from io import StringIO; file = StringIO('some text'), or better, with StringIO('some text') as file:
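As a minimal sketch of that suggestion: `io.StringIO` wraps a string in a file-like object, so code written for an open file can be tried without touching the disk. The sample text and the helper name `count_lines_with` are invented for this illustration:

```python
from io import StringIO


def count_lines_with(file, word):
    """Count the lines of a file-like object that contain word (case-insensitive)."""
    word = word.lower()
    return sum(word in line.lower().split() for line in file)


if __name__ == "__main__":
    text = "This is line one Haha\nThis is line two How\nThis is line three Haha\n"
    # StringIO supports the context-manager protocol, just like a real file
    with StringIO(text) as file:
        print(count_lines_with(file, "haha"))  # prints 2
```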

【Solution 4】:

Since you tagged nlp, here is a proposal using spacy.

#pip install spacy
#python -m spacy download en_core_web_sm
import spacy

with open("file.txt", "r") as f:
    text = f.read()

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

searchedWord = "StackOverflow"

# token indices of every occurrence of the searched word
occu = [i for i, word in enumerate(doc) if word.text == searchedWord]

n_before = 4
out = []
for i in occu:
    # walk backwards from the match, skipping punctuation and whitespace tokens
    w = []
    j = i - 1
    while j >= 0 and len(w) < n_before:
        if not (doc[j].is_punct or doc[j].is_space):
            w.append(doc[j].text)
        j -= 1
    # the words were collected in reverse, so restore reading order
    out.append(w[::-1])

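Note: in this example we target the 4 words before the searched word (skipping punctuation and whitespace). The result is a nested list, to handle the case where the word occurs more than once in the text file. We are using the English model here, but many other languages are available; see the list of models in the spaCy documentation.

Output: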

print(out)

#[['A', 'question', 'from', 'Whichman'], ['An', 'answer', 'from', 'Timeless']]

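Input/text file used:

【Discussion】:

【Solution 5】:

If all you need is sequences, or chunks, of 10 words in which the last word satisfies some condition, it is usually easier and cleaner to create the chunks first and then check whether they match, rather than matching first and building the chunk afterwards. Since you don't seem to care about sentence boundaries, just treat the input as one continuous sequence of words via split.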

words = text.split()
chunk_size = 10

for i in range(len(words) - chunk_size + 1):
    chunk = words[i:i + chunk_size]
    if chunk[-1].lower() == "ball":
        print(chunk)

Applying this to the text of your question returns ['as', 'well.', 'For', 'eg:', 'if', 'I', 'want', 'the', 'word', 'ball'].

【Discussion】:
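The same chunk-first idea can be wrapped in a function. The sketch below is an illustration rather than part of the answer: the name `chunks_ending_with` is hypothetical, and it deliberately keeps the answer's behaviour of skipping matches that fall within the first `chunk_size - 1` words and of comparing words exactly as split (so `ball.` with a trailing period would not match).

```python
def chunks_ending_with(text, target, chunk_size=10):
    """Return every run of chunk_size consecutive words whose last word is target."""
    words = text.split()
    target = target.lower()
    return [
        words[i:i + chunk_size]
        for i in range(len(words) - chunk_size + 1)
        # only the last word of each candidate chunk needs to be inspected
        if words[i + chunk_size - 1].lower() == target
    ]


if __name__ == "__main__":
    text = "a b c d e f g h i ball\nj k ball l"
    for chunk in chunks_ending_with(text, "Ball"):
        print(chunk)
```

Each returned chunk holds the matched word plus the `chunk_size - 1` words before it, so a larger `chunk_size` is needed if 10 preceding words are wanted in addition to the match.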