Find the index of a sublist of strings within a list: answers

【Question title】: Find index of subset of list containing strings
【Posted】: 2021-04-20 08:37:21
【Question description】:

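I am using Python for NLP. I have converted an audio file to text and found the time offset of every word in the speech. The words are stored in wordlist and their times in timelist.

I have three lists: strlist, wordlist, and timelist. strlist contains a phrase, for example: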

strlist = ["in", "the", "family"]

wordlist contains a paragraph or, say, a sentence:

wordlist = ["there", "are", "few", "things", "to", "be", "in", "the", "family", "means"]

timelist contains a time value for each word stored in wordlist. Let us assume:

timelist=[2,3,4,5,7,4,8,9,5,3]

I want to know whether the phrase in strlist (made up of several words) occurs in wordlist. If it does, I want to look up the time values stored in timelist for those words.

from pathlib import Path
import io
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file('proven-mystery-310205-f04fb2ab3d69.json')
str='in my family'
strlist = list(str.split(" "))
timelist=[]
wordlist=[]
strlist.append("")
for i in strlist:
    print(i)
speech_file = Path("C:/Users/Tani/PycharmProjects/pythonProject/t.wav")
print("Start")

from google.cloud import speech_v1 as speech

print("checking credentials")

client = speech.SpeechClient(credentials=credentials)

print("Checked")
with io.open(speech_file, 'rb') as audio_file:
    content = audio_file.read()

print("audio file read")

audio = speech.RecognitionAudio(content=content)

print("config start")
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    language_code='en-US',
    audio_channel_count=2,
    enable_separate_recognition_per_channel=True,
    enable_word_time_offsets=True)
print("Recognizing:")
response = client.recognize(config=config,audio=audio)

print("Recognized")

for result in response.results:
    alternative = result.alternatives[0]
    #print('Transcript: {}'.format(alternative.transcript))

    # Collect every recognized word together with its start offset.
    for word_info in alternative.words:
        word = word_info.word
        start_time = word_info.start_time
        end_time = word_info.end_time
        wordlist.append(word)
        timelist.append(start_time.seconds)
print(str)
for a, b in zip(wordlist,timelist):
    print('Word: {}, time: {}'.format(a, b))
print("findout time")


for s in strlist:
    if s in wordlist:
        position = wordlist.index(s)
        time_s = timelist[position]
        print(f"Word: '{s}', Time: {time_s}")
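A side note on the values that end up in timelist: the sketch below assumes that start_time is a datetime.timedelta, which is how recent versions of the google-cloud-speech client appear to expose word offsets (older versions may differ). Under that assumption, .seconds keeps only whole seconds, while total_seconds() keeps the fractional part. The timedelta value here is a made-up stand-in, not real recognizer output.

```python
from datetime import timedelta

# Hypothetical stand-in for word_info.start_time (assumed to be a timedelta).
start_time = timedelta(seconds=8, milliseconds=700)

# .seconds is the whole-seconds component only; the fraction is dropped.
whole_seconds = start_time.seconds

# total_seconds() returns the full duration as a float.
exact_seconds = start_time.total_seconds()

print(whole_seconds)  # 8
print(exact_seconds)  # 8.7
```

If sub-second precision matters for the lookup, appending total_seconds() to timelist may be the safer choice.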

【Question discussion】:

Tags: python nlp timestamp nltk speech

【Solution 1】:

I have some code that does the job. It can certainly be improved:

strlist = ["in", "the", "family"]
wordlist = ["there", "are", "few", "things", "to", "be", "in", "the", "family", "means"]
timelist = [2, 3, 4, 5, 7, 4, 8, 9, 5, 3]

for s in strlist:
    if s in wordlist:
        position = wordlist.index(s)
        time_s = timelist[position]
        print(f"Word: '{s}', Time: {time_s}")

The output is:

Word: 'in', Time: 8
Word: 'the', Time: 9
Word: 'family', Time: 5

Here is another piece of code that produces the same result, but it only works if wordlist contains no repeated words:

strlist = ["in", "the", "family"]
wordlist = ["there", "are", "few", "things", "to", "be", "in", "the", "family", "means"]
timelist = [2, 3, 4, 5, 7, 4, 8, 9, 5, 3]

# Build a word -> time lookup table (named word_times to avoid shadowing the built-in map).
word_times = {word: time for word, time in zip(wordlist, timelist)}
for s in strlist:
    print(f"Word: '{s}', Time: {word_times[s]}")

Feel free to test both.
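To make the difference between the two snippets concrete, here is a small sketch with a repeated word. The lists are invented for illustration and are not the asker's data. list.index returns the position of the first match, whereas a dict built from zip keeps one value per key, so the last occurrence overwrites earlier ones.

```python
# Illustrative data with "in" appearing twice (not from the question).
wordlist = ["in", "a", "house", "in", "the", "family"]
timelist = [1, 2, 3, 8, 9, 10]

# list.index returns the position of the FIRST match only.
first_position = wordlist.index("in")
time_from_index = timelist[first_position]

# A dict holds one value per key; later pairs overwrite earlier ones,
# so the LAST occurrence wins.
word_times = {word: time for word, time in zip(wordlist, timelist)}
time_from_dict = word_times["in"]

print(time_from_index)  # 1
print(time_from_dict)   # 8
```

Neither approach checks that the words of the phrase appear next to each other, so both can report a time for an unrelated occurrence of a word.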

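【Discussion】:

The answer you provided does not work for me because it does not give the correct time.
This code returns 8, 9, and 5 for the words "in", "the", and "family". I do not see an error. Unless your lists are wrong, the code works fine. I edited the print line to show the information more clearly.
As I said in the description, the lists I gave are assumed, not real. In the real list, words can be repeated. For example, if the word "in" also appeared at index 2, the code would give the time of that "in" rather than the time at index 6, which is wrong.
Can you explain the second piece of code you provided? I do not understand how it works.
You must provide a consistent mapping rule between wordlist and timelist. Your description implies that if "in" is at index 6 in wordlist, then we must take index 6 in timelist. That makes sense. If that is not correct, you must state a different mapping rule in the description; otherwise the problem cannot be solved. In your example, what should the actual time for the word "in" be?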
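One possible reading of the mapping rule discussed above is that the phrase must appear as a contiguous run of words, and the times are taken from the same positions. The following is a minimal sketch under that assumption; find_phrase_times is a hypothetical helper name, and wordlist has been altered from the question so that "in" is repeated at index 1.

```python
def find_phrase_times(phrase, words, times):
    """Return the time values for each contiguous occurrence of phrase in words."""
    n = len(phrase)
    matches = []
    if n == 0:
        return matches
    # Slide a window of len(phrase) over words and compare slice by slice.
    for start in range(len(words) - n + 1):
        if words[start:start + n] == phrase:
            matches.append(times[start:start + n])
    return matches


strlist = ["in", "the", "family"]
# Assumed data: "in" also appears at index 1 to mimic a repeated word.
wordlist = ["there", "in", "few", "things", "to", "be", "in", "the", "family", "means"]
timelist = [2, 3, 4, 5, 7, 4, 8, 9, 5, 3]

result = find_phrase_times(strlist, wordlist, timelist)
print(result)  # [[8, 9, 5]]
```

Because the whole window is compared at once, the stray "in" at index 1 is ignored, and an empty list signals that the phrase does not occur.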