【Question title】: Find index of subset of list containing strings
【Posted】: 2021-04-20 08:37:21
【Question description】:

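I am using Python for NLP. I converted an audio file to text, found the time offset of each word in the speech, and stored the words in wordlist and the times in timelist.

I have three lists: strlist, wordlist, and timelist. strlist contains a phrase, for example: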

strlist = ["in", "the", "family"]

wordlist contains a paragraph or a sentence, for example:

wordlist = ["there", "are", "few", "things", "to", "be", "in", "the", "family", "means"]

timelist contains a time value for each word stored in wordlist, for example:

timelist=[2,3,4,5,7,4,8,9,5,3]

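I want to know whether the phrase in strlist (made up of several words) appears in wordlist. If it does, I want to look up the time values stored in timelist for those words.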

import io
from pathlib import Path

from google.cloud import speech_v1 as speech
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    'proven-mystery-310205-f04fb2ab3d69.json')

# Phrase to look for; named `phrase` so the built-in `str` is not shadowed.
phrase = 'in my family'
strlist = phrase.split(" ")
timelist = []
wordlist = []
for i in strlist:
    print(i)

speech_file = Path("C:/Users/Tani/PycharmProjects/pythonProject/t.wav")
print("Start")

print("checking credentials")
client = speech.SpeechClient(credentials=credentials)
print("Checked")

with io.open(speech_file, 'rb') as audio_file:
    content = audio_file.read()
print("audio file read")

audio = speech.RecognitionAudio(content=content)

print("config start")
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    language_code='en-US',
    audio_channel_count=2,
    enable_separate_recognition_per_channel=True,
    enable_word_time_offsets=True)

print("Recognizing:")
response = client.recognize(config=config, audio=audio)
print("Recognized")

# Collect every recognized word together with its start time (whole seconds).
for result in response.results:
    alternative = result.alternatives[0]
    # print('Transcript: {}'.format(alternative.transcript))
    for word_info in alternative.words:
        word = word_info.word
        start_time = word_info.start_time
        end_time = word_info.end_time
        wordlist.append(word)
        timelist.append(start_time.seconds)

print(phrase)
for a, b in zip(wordlist, timelist):
    print('Word: {}, time: {}'.format(a, b))

print("findout time")
# Looks up each word on its own; index() returns only the first occurrence.
for s in strlist:
    if s in wordlist:
        position = wordlist.index(s)
        time_s = timelist[position]
        print(f"Word: '{s}', Time: {time_s}")

【Question discussion】:

    Tags: python nlp timestamp nltk speech


    【Solution 1】:

    I have some code that does the job. It can certainly be improved:

    strlist = ["in", "the", "family"]
    wordlist = ["there", "are", "few", "things", "to", "be", "in", "the", "family", "means"]
    timelist=[2,3,4,5,7,4,8,9,5,3]
    
    for s in strlist:
        if s in wordlist:
            position = wordlist.index(s)
            time_s = timelist[position]
            print(f"Word: '{s}', Time: {time_s}")
    

    The output is:

    Word: 'in', Time: 8
    Word: 'the', Time: 9
    Word: 'family', Time: 5
    

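    Here is another snippet that produces the same result, but it only works if no word is repeated (with duplicates, the dictionary keeps only the time of the last occurrence):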

    strlist = ["in", "the", "family"]
    wordlist = ["there", "are", "few", "things", "to", "be", "in", "the", "family", "means"]
    timelist = [2, 3, 4, 5, 7, 4, 8, 9, 5, 3]

    # Map each word to its time; named word_times so the built-in map() is not shadowed.
    word_times = {word: time for word, time in zip(wordlist, timelist)}
    for s in strlist:
        print(f"Word: '{s}', Time: {word_times[s]}")
    

    Feel free to test both.
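
Both snippets above look up each word on its own, so a repeated word can return the time of the wrong occurrence. A minimal sketch of one way around that, assuming the phrase must appear as consecutive words in wordlist: slide a window over wordlist and compare it with strlist. The helper name `find_phrase` is hypothetical, and the sample wordlist is modified from the question to place an extra "in" at index 2.

```python
def find_phrase(phrase_words, words, times):
    """Return (start_index, times_of_phrase_words) for every contiguous match."""
    n = len(phrase_words)
    matches = []
    if n == 0:
        return matches
    # Compare each window of n consecutive words with the phrase.
    for start in range(len(words) - n + 1):
        if words[start:start + n] == phrase_words:
            matches.append((start, times[start:start + n]))
    return matches


strlist = ["in", "the", "family"]
# Assumption: sample data changed so that "in" also occurs at index 2.
wordlist = ["there", "are", "in", "things", "to", "be", "in", "the", "family", "means"]
timelist = [2, 3, 4, 5, 7, 4, 8, 9, 5, 3]

matches = find_phrase(strlist, wordlist, timelist)
for start, phrase_times in matches:
    print(f"Phrase starts at index {start}, times: {phrase_times}")
```

With this sample data the stray "in" at index 2 is skipped, because the words after it do not continue the phrase. The comparison is exact, so case or punctuation differences in a real transcript would need normalizing first.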

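    【Discussion】:

    • The answer you provided does not work for me because it does not give the correct times.
    • This code returns 8, 9, and 5 for the words "in", "the", and "family". I don't see an error. Unless your lists are wrong, the code works fine. I edited the print line to show the information more clearly.
    • As I said in the description, the lists are assumed examples, not real data, and in the real lists words can be repeated. For example, if the word "in" also appeared at index 2, the code would return the time of that "in" rather than the one at index 6, which is wrong.
    • Can you explain the second code you provided? I don't understand how it works.
    • You have to provide a consistent mapping rule between wordlist and timelist. Your description implicitly says that if "in" is at index 6 in wordlist, then we must take index 6 in timelist. That makes sense. If that is not correct, you must give a different mapping rule in the description, otherwise the problem cannot be solved. In your example, what should the actual time for the word "in" be?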
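
To illustrate the points raised in this discussion, here is a small sketch of how the second snippet's `zip` plus dictionary comprehension behaves and why repeated words matter. The sample wordlist is an assumption: it is the question's list with an extra "in" placed at index 2, as in the comment above. The names `pairs`, `last_time`, and `all_times` are illustrative only.

```python
from collections import defaultdict

# Assumption: question's sample data, with "in" repeated at index 2.
wordlist = ["there", "are", "in", "things", "to", "be", "in", "the", "family", "means"]
timelist = [2, 3, 4, 5, 7, 4, 8, 9, 5, 3]

# zip pairs the two lists position by position: ("there", 2), ("are", 3), ...
pairs = list(zip(wordlist, timelist))

# A dict holds one value per key, so a later duplicate overwrites an earlier one.
last_time = {word: time for word, time in pairs}

# list.index() goes the other way and returns only the first occurrence.
first_index = wordlist.index("in")

# Keeping every (index, time) pair per word avoids losing either occurrence.
all_times = defaultdict(list)
for index, (word, time) in enumerate(pairs):
    all_times[word].append((index, time))

print(first_index, timelist[first_index])  # first occurrence of "in"
print(last_time["in"])                     # time of the last occurrence
print(all_times["in"])                     # every occurrence with its time
```

Neither the first nor the last occurrence is right in general, so a lookup that has to pick one specific occurrence needs extra context, such as the neighbouring words of the phrase.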