Python - index in string to matching word: answers

【Question title】: Python - index in string to matching word
【Posted】: 2020-06-08 06:56:34
【Question】:

I am looking for an efficient way to convert an index in a string to the word that the index falls in.

For example, if this is my string:

This is a very stupid string

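and the index I am given is 10, the output should be very. Likewise, if the index is 11, 12 or 13, the output should be very.

You may assume that words are always separated by exactly 1 space. Doing this with a for loop or something similar is not hard; the question is whether there is a more efficient way (my text is very large and I have many indices to convert to words).

For example, let the indices be 10, 13, 16, so the output should be: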

10 very
13 very
16 stupid

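Any help would be appreciated!

【Comments】:

What's wrong with a loop? You just go straight to your position and move left and right until the current character is a space. In the worst case, where the string is a single word of length n, that is O(n).
What happens if index=4, which is not inside a word?
Is the large text already entirely in memory, or does it have to be processed from a stream?
@JonClements - good point. Assume it won't happen.
Could you build a list of the indices where the character is a space, and then bisect that?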
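
The two ideas from the comments (scanning outwards from the position, and bisecting a list of space positions) can be sketched as follows. This is only a minimal illustration using the standard library; the names word_at_scan, word_at_bisect and space_positions are made up for this example, and an index that lands on a space is arbitrarily mapped to the word before it.

```python
from bisect import bisect_left

text = "This is a very stupid string"

def word_at_scan(s, i):
    """Expand left and right from position i until a space or the string edge."""
    start = s.rfind(" ", 0, i) + 1   # rfind gives -1 when no space is found, so start is 0
    end = s.find(" ", i)             # first space at or after i
    if end == -1:
        end = len(s)
    return s[start:end]

# Built once, then reused for every lookup.
space_positions = [i for i, ch in enumerate(text) if ch == " "]
words = text.split(" ")

def word_at_bisect(i):
    """The number of spaces strictly before i is the number of the word at i."""
    return words[bisect_left(space_positions, i)]

for i in (10, 13, 16):
    print(i, word_at_scan(text, i), word_at_bisect(i))
```

The scan does work proportional to the length of the word on every call, while the bisect version pays one pass over the text up front and then does a binary search per index, which is likely the better trade when there are many indices.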

Tags: python arrays string pandas

【Solution 1】:

This is not very efficient because it uses regular expressions, but it is a way to solve the problem without writing any loops.

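import re

def stuff(pos):
    x = "This is a very stupid string"
    # Matches the rest of the word at (or the first word after) pos
    pattern = re.compile(r'\w+\b')
    # Captures the last whole word before a given end position
    pattern2 = re.compile(r'.*(\b\w+)')
    # Offset just past the end of the word that pos falls in
    end = pattern.search(x, pos=pos).span()[1]
    print(pattern2.search(x, endpos=end).groups()[0])

stuff(2)
stuff(10)
stuff(11)
stuff(16)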

Output:

This
very
very
stupid

【Comments】:

【Solution 2】:

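The following should perform quite well. First get the words in the string with split, then use enumerate and a list comprehension to find the indices where they start:

import numpy as np

s = "This is a very stupid string"
words = s.split()
# ['This', 'is', 'a', 'very', 'stupid', 'string']
# Indices where every word after the first begins (one past each space)
ix_start_word = [i+1 for i, ch in enumerate(s) if ch == ' ']
# [5, 8, 10, 15, 22]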

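Now you can use NumPy's np.searchsorted to get the word for a given index. Pass side='right' so that an index pointing at the first character of a word (for example 10 or 15) maps to that word rather than to the one before it:

words[np.searchsorted(ix_start_word, ix, side='right')]

Checking the example above:

words[np.searchsorted(ix_start_word, 11, side='right')]
# 'very'

words[np.searchsorted(ix_start_word, 13, side='right')]
# 'very'

words[np.searchsorted(ix_start_word, 16, side='right')]
# 'stupid'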
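
Since the question mentions many indices, it may help that np.searchsorted also accepts an array of query values, so all lookups can be done in one call. The following is a rough sketch under that assumption; the names starts, queries and word_ids are illustrative only. Here 0 is prepended to the start offsets and 1 is subtracted from the result, which is one way of making an index on a word's first character resolve to that word.

```python
import numpy as np

s = "This is a very stupid string"
words = np.array(s.split(" "))

# Start offset of every word: 0, plus the position right after each space.
starts = np.array([0] + [i + 1 for i, ch in enumerate(s) if ch == " "])
# [0, 5, 8, 10, 15, 22]

queries = np.array([10, 13, 16])
# side="right" counts the starts that are <= each query;
# subtracting 1 turns that count into the word number.
word_ids = np.searchsorted(starts, queries, side="right") - 1

for q, w in zip(queries, words[word_ids]):
    print(q, w)
# 10 very
# 13 very
# 16 stupid
```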

【Comments】:

Nice solution. You just need to add [0] to ix_start_word.
Would computing ix_start_word with a for loop over words be faster than enumerating through s?
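
For reference, the alternative raised in the last comment could look something like the sketch below, which derives the start offsets from the word lengths with itertools.accumulate instead of scanning every character. The variable names are made up for this example, and which version is actually faster would have to be measured (for instance with timeit) on the real text.

```python
from itertools import accumulate

s = "This is a very stupid string"
words = s.split(" ")

# Each word starts where the previous one ended, plus one separating space.
starts_from_words = [0] + list(accumulate(len(w) + 1 for w in words[:-1]))

# The same offsets obtained by scanning the characters, for comparison.
starts_from_chars = [0] + [i + 1 for i, ch in enumerate(s) if ch == " "]

print(starts_from_words)  # [0, 5, 8, 10, 15, 22]
print(starts_from_words == starts_from_chars)  # True
```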

【Solution 3】:

I am not particularly proud of how clean it is, but I think it does the job:

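from numpy import cumsum, array

sample = 'This is a very stupid string'

words = sample.split(' ')
# +1 so that each word also claims the space that follows it
# (the last word therefore gets one index past the end of the string)
lens = [len(w) + 1 for w in words]

# Half-open [start, end) range of indices owned by each word
ends = cumsum(lens)
starts = array([0] + list(ends[:-1]))

# Map every index to the word it belongs to
output = {}
for a, b, c in zip(starts, ends, words):
    for i in range(a, b):
        output[i] = c
for a, b in output.items():
    print(a, b)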

0 This
1 This
2 This
3 This
4 This
5 is
6 is
7 is
8 a
9 a
10 very
11 very
12 very
13 very
14 very
15 stupid
16 stupid
17 stupid
18 stupid
19 stupid
20 stupid
21 stupid
22 string
23 string
24 string
25 string
26 string
27 string
28 string
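
The same per-index table can also be held as a NumPy array instead of a dictionary. The sketch below is one possible variant, not part of the original answer: np.repeat repeats each word number once per character it owns, and word_id is a name chosen for this example. It costs one integer per character of the text, so it trades memory for constant-time lookups.

```python
import numpy as np

sample = "This is a very stupid string"
words = sample.split(" ")

# Every word owns len(word) characters plus one slot for the space after it.
lens = [len(w) + 1 for w in words]

# word_id[i] is the number of the word that position i belongs to.
word_id = np.repeat(np.arange(len(words)), lens)

for i in (10, 13, 16):
    print(i, words[word_id[i]])
# 10 very
# 13 very
# 16 stupid
```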

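【Comments】:

The loop that fills output could use a comprehension. I could not remember how to append to np arrays, so I went with a quick and dirty solution: convert them to a list and then back to an array.
If your intent is to build an index -> word lookup dictionary, you could simplify it to: output = {i: m.group() for m in re.finditer('[^ ]+', s) for i in range(*m.span())}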
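
Written out as a runnable snippet, the comprehension from the last comment might look like this (s is assumed to be the sample string, and index_to_word is an illustrative name). One difference from the loop in the answer: positions that hold a space are not keys at all, so a lookup with .get returns None for them.

```python
import re

s = "This is a very stupid string"

# One entry per character that is part of a word; spaces are skipped.
index_to_word = {
    i: m.group()
    for m in re.finditer(r"[^ ]+", s)
    for i in range(*m.span())
}

for i in (10, 13, 16):
    print(i, index_to_word[i])

print(index_to_word.get(4))  # None, because index 4 is a space
```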