Virtual Assistants and Speech Recognition
Speech recognition is useful for VR not only for simulating conversations with AI agents but also for letting the user communicate with any application that offers a large number of options. Typing out a response or command might be impractical, and overcrowding the application with buttons or other GUI elements could get confusing very fast. But anyone who is capable of speech can easily speak while they are in a VR experience.
Unity Labs’ virtual reality (VR) authoring platform Carte Blanche will have a personal assistant called U, with whom the user will be able to speak in order to easily perform certain actions. We at Labs have been researching speech recognition and analysis tools that could be used to implement these voice commands.
The first section of this article presents concepts and theory behind speech recognition. It serves as a primer introducing related concepts and links for the reader to get more information on this field. The second section presents a Unity Asset Store package and public repository we are making available that provides a wrapper for several speech-to-text solutions and a sample scene that compares text transcriptions from each API.
I. How do speech recognition and semantic analysis work?
Speech recognition is the transcription from speech to text by a program. Semantic analysis goes a step further by attempting to determine the intended meaning of this text. Even the best speech recognition and semantic analysis software today is far from perfect. Although we humans solve these tasks very intuitively and without much apparent effort, trying to get a program to perform both poses problems that are much more difficult to solve than one might think.
One vital component of today’s statistically-based speech recognition is acoustic modeling. This process involves starting with a waveform and from it determining the probabilities of distinct speech sounds, or phonemes (e.g. “s”, “p”, “ē”, and “CH” in “speech”), occurring at discrete moments in time. The acoustic model most commonly used is a hidden Markov model (HMM), which is a type of probability graph called a Bayesian network (Fig. 1). HMMs are so named because some of the states in the model are hidden – you only have the outputs of these states to work with, and the goal is to use these to determine what the hidden states might have been. For acoustic modeling, this means looking at the waveform output and trying to figure out the most probable input phonemes – what the speaker intended to say.
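To make the decoding step concrete, here is a toy sketch (not from any actual speech engine) of Viterbi decoding over an HMM: the hidden states are two candidate phonemes and the observations are coarse acoustic labels. All probabilities are invented purely for illustration.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[t][s] = (probability of the best path ending in state s at time t, previous state)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            # Pick the predecessor state that makes this state most likely.
            best_prev = max(states, key=lambda p: V[t - 1][p][0] * trans_p[p][s])
            prob = V[t - 1][best_prev][0] * trans_p[best_prev][s] * emit_p[s][obs[t]]
            V[t][s] = (prob, best_prev)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: V[-1][s][0])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        last = V[t][last][1]
        path.append(last)
    return path[::-1]

# Toy model: was each moment of audio an "s" or a "z" phoneme?
states = ("s", "z")
obs = ("hiss", "hiss", "buzz")  # coarse labels standing in for acoustic features
start_p = {"s": 0.6, "z": 0.4}
trans_p = {"s": {"s": 0.7, "z": 0.3}, "z": {"s": 0.4, "z": 0.6}}
emit_p = {"s": {"hiss": 0.9, "buzz": 0.1}, "z": {"hiss": 0.2, "buzz": 0.8}}
viterbi(obs, states, start_p, trans_p, emit_p)  # -> ["s", "s", "z"]
```

Real acoustic models work on spectral features rather than labels and have thousands of states, but the principle is the same: find the hidden phoneme sequence that best explains the observed audio.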
Once you have probabilities for granular sounds, you need to string these sounds into words. In order to do this, you need a language model, which will tell you how likely it is that a particular sequence of sounds corresponds to a particular word or sequence of words. For example, the phonemes “f”, “ō”, “n”, “ē”, and “m” in sequence quite clearly correspond to the English word “phoneme”. In some cases more context is needed – for instance, if you are given the phonetic word “T͟Her”, it could correspond to “there”, “their”, or “they’re”, and you must use the context of surrounding words to determine which one is most likely. The problem of language modeling is similar to the problem of acoustic modeling, but at a larger scale. So it is unsurprising that similar probabilistic AI systems such as HMMs and artificial neural networks are used to automate this task. Figuring out words from a person’s speech, even given a sequence of phonemes, is easier said than done because languages can be extremely complex. That’s not even accounting for all the different accents and intonations one might use. Even humans can have difficulty understanding each other, so it’s no wonder that this is such a difficult challenge for an AI agent.
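As a rough illustration of how a language model disambiguates homophones from context, the sketch below scores candidates with hypothetical bigram counts. The counts are made up; a real language model would derive them (or smoothed probabilities) from a large text corpus.

```python
# Hypothetical bigram counts, standing in for statistics from a corpus.
bigram_counts = {
    ("over", "there"): 12, ("over", "their"): 1,
    ("their", "house"): 9, ("there", "house"): 1,
    ("they're", "going"): 7, ("there", "going"): 2,
}

def pick_homophone(prev_word, candidates):
    # Choose the candidate that most often follows prev_word in the corpus.
    return max(candidates, key=lambda w: bigram_counts.get((prev_word, w), 0))

homophones = ["there", "their", "they're"]
pick_homophone("over", homophones)  # -> "there"
```

Production systems use longer context windows (n-grams or neural language models), but this is the core idea: the acoustic model proposes “T͟Her”, and the language model picks the spelling the surrounding words make most probable.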
In any case, at this point you have speech that has been transcribed into text, and now your agent needs to determine what that text means. This is where semantic analysis comes in. Humans practice semantic analysis all the time. For example, even before reading this sentence, you were probably pretty confident that you would see an example of how humans practice semantic analysis. (How’s that for meta?) That’s because you were able to use the context clues in previous sentences (e.g. “Humans practice semantic analysis all the time.”) to make a very good guess at what the next few sentences might include. So in order for a VR experience with simulated people to feel real, its AI needs to be skilled at analyzing your words and giving an appropriate response.
Good semantic analysis involves a constant learning process for the AI system. There are tons of different ways to express the same intent, and the programmer cannot possibly account for all of them in the initial set of phrases to watch out for. The AI needs to be good at using complex neural networks to connect words and phrases and determine the user’s intent. And sometimes it doesn’t just need to understand words and their meanings to do this – it needs to understand the user as well. If an application is designed to be used by a single person for an extended period of time, a good AI will pick up on their speech patterns and not only tailor its responses to this person, but also figure out what to expect them to say in specific situations.
It’s also worth noting that both semantic analysis and speech recognition can be improved if your AI is used for a very specific purpose. A limited number of concepts to worry about means a limited number of words, phrases, and intents to watch out for. But of course, if you want an AI to resemble a human as much as possible, it will have to naturally respond to anything the user might say to it, even if it does serve a specific purpose.
II. What speech-to-text tools are out there?
Labs’ initial research on speech recognition has involved the evaluation of existing speech-to-text solutions. We developed a package for the Asset Store that integrates several of these solutions as Unity C# scripts. The package includes a sample scene that compares the text transcriptions from each API side-by-side and also allows the user to select a sample phrase from a given list of phrases, speak that phrase, and see how quantitatively accurate each result is. The code is also available in a public repository.
The speech-to-text package interfaces with Windows dictation recognition, Google Cloud Speech, IBM Watson, and Wit.ai. All of these respond to background speech relatively well, but some of them, such as Windows and Wit.ai, will insert short words at the beginning and end of the recording, probably picking up on some of the beginning and ending background speech that is not obscured by foreground speech. Each solution has its own quirks and patterns and its own methods for dealing with phrases designed to provide challenges for speech recognition.
Windows dictation recognition was recently added to Unity (under UnityEngine.Windows.Speech). Because the asset package is specifically for speech-to-text transcriptions, it only uses this library’s DictationRecognizer, but the Windows Speech library also has a KeywordRecognizer and a GrammarRecognizer. Windows uses streaming speech-to-text, which means it collects small chunks of audio as they are recorded and returns results in real time. The interim results returned as the user is speaking are temporary – after a pause in speech, the recognizer will come up with a hard result based on the entire block of speech.
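The interim-versus-final result flow common to streaming recognizers can be sketched like this (illustrative Python, not the actual DictationRecognizer API – the class and method names here are invented):

```python
class StreamingRecognizer:
    """Sketch of the callback pattern a streaming speech-to-text service exposes."""

    def __init__(self, on_interim, on_final):
        self.on_interim = on_interim
        self.on_final = on_final
        self._hypothesis = []

    def feed(self, chunk_text):
        # Each audio chunk refines the running, temporary hypothesis.
        self._hypothesis.append(chunk_text)
        self.on_interim(" ".join(self._hypothesis))

    def pause_detected(self):
        # After a pause, the recognizer commits a hard result for the whole block.
        self.on_final(" ".join(self._hypothesis))
        self._hypothesis = []

finals = []
rec = StreamingRecognizer(on_interim=lambda text: None, on_final=finals.append)
rec.feed("open")
rec.feed("the menu")
rec.pause_detected()
# finals == ["open the menu"]
```

The practical consequence for application code is that interim results may be revised or discarded, so any UI driven by them should treat them as provisional until the final result arrives.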
We integrated Watson streaming and non-streaming (where the entire recording is sent at once) speech-to-text into the package as well – in fact, IBM has its own Watson SDK for Unity. Like Windows, it also has built-in keyword recognition. Watson currently supports US English, UK English, Japanese, Spanish, Brazilian Portuguese, Modern Standard Arabic, and Mandarin. An interesting feature we discovered about Watson is that it detects hesitations in speech such as “um” and “uh” and replaces them with %HESITATION. So far we haven’t seen any other types of replacement by Watson.
Google Cloud Speech also has support for both streaming and non-streaming recognition. From what we have tested, Google appears to have the widest vocabulary out of all of the four options – it even recognizes slang terms such as cuz. Google Cloud Speech also supports over 80 languages. It is currently in beta and open to anyone who has a Google Cloud Platform account.
Wit.ai not only includes streaming and non-streaming speech recognition, but it also has an easy-to-use conversational bot creation tool. All you need to do is specify several different ways to express each intent needed by your bot, create stories that describe a potential type of conversation with the bot, and then start feeding it data – the AI can learn from the inputs it receives. Wit.ai even includes a way to validate the text it receives against the entities (traits, keywords, free text) it observes, as well as a way to validate speech-to-text transcriptions. Our Asset Store package only includes non-streaming Wit.ai speech-to-text due to time constraints.
The sample scene in our speech-to-text package includes several test phrases – many of which were found on websites that listed good phrases to test speech recognition (one is a list of Harvard sentences and the other is an article about stress cases for speech recognition). For example, “Dimes showered down from all sides,” (a Harvard sentence) includes a wide range of phonemes, and “They’re gonna wanna tell me I can’t do it,” (a sentence we thought up ourselves) includes contractions and slang terms. The Windows speech-to-text solution seems to be the only one that has a hard time picking up on “they’re” instead of “there” or “their”, even though the context makes it clear which one is needed, and Windows does pick up on “we’re”. Most of the APIs usually preserve the terms “gonna” and “wanna” as they are, but Google interprets them as “going to” and “want to”, which is strange considering it also uses the term “cuz” (Wit.ai can also recognize “cuz”). A funny test phrase found in Will Styler’s article is “I’m gonna take a wok from the Chinese restaurant.” We never once got the word “wok” to appear – they all translated it as “walk” every time, which still makes perfect sense even given the context of the sentence. This kind of sentence is a huge stress test – even plenty of humans would need more clarification than just the context of that one sentence itself. For example, if you know that the “I” in the sentence is a thief, that would make “wok” much more likely than “walk”.
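A common quantitative accuracy measure for comparisons like this is word error rate: the word-level Levenshtein distance between the reference phrase and the transcription, normalized by the reference length. (This is our own sketch of the metric; the package’s exact scoring may differ.)

```python
def word_error_rate(reference, hypothesis):
    # Levenshtein (edit) distance over words, normalized by reference length.
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # cost of deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # cost of inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("wok" -> "walk") out of five reference words:
word_error_rate("I'm gonna take a wok", "I'm gonna take a walk")  # -> 0.2
```

A WER of 0 means a perfect transcription; values above 0 count substitutions, insertions, and deletions, so a recognizer that hears “walk” for “wok” is penalized exactly one word.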
The package we developed is meant to be an easy way to compare a few of the biggest speech-to-text options in Unity and integrate them into your projects. And if there are other APIs you would like to try out in Unity, this package should make it relatively easy to create a class that inherits from one of the base speech-to-text services and integrate it into the sample scene and/or widgets. In addition to the individual speech-to-text SDKs, the package includes several helper classes and functions (a recording manager, audio file creation and conversion, etc.) to facilitate the integration and comparing of more APIs.
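As an example of the kind of audio helper involved, here is a sketch of writing mono 16-bit PCM WAV – the format most speech APIs accept – using Python’s standard library. The package’s own helpers are Unity C# and will differ in detail; this is only to show what “audio file creation” entails.

```python
import math
import struct
import wave

def write_wav(path, samples, sample_rate=16000):
    """Write float samples in [-1, 1] as a mono 16-bit PCM WAV file."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)            # mono
        f.setsampwidth(2)            # 16-bit samples
        f.setframerate(sample_rate)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        f.writeframes(frames)

# One second of a 440 Hz tone as stand-in audio data.
tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
write_wav("test.wav", tone)
```

In a real recording pipeline the samples would come from the microphone (e.g. Unity’s Microphone class) rather than a synthesized tone, and the file or raw bytes would then be posted to the non-streaming recognition endpoints.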
Conclusion and future work
What makes speech recognition so difficult to get right is the sheer multitude of variables to look out for. For each language you want to recognize, you need a ton of data about all the words that exist (including slang terms and shortened forms), how those words are used in conjunction with each other, the range of tones and accents that can affect pronunciation, all the redundancies and contradictions inherent to human language, and much more.
Our Asset Store package currently integrates a few speech-to-text solutions – but these are enough to easily compare some of the biggest solutions out there and to see what general strengths and weaknesses exist among today’s speech recognition tools. It is a starting point for Unity developers to see what works for their specific needs and to add further functionality. You can integrate more speech-to-text tools, add a semantic analysis step to the architecture, and add whatever other layers are necessary for your Unity projects. Refer to this article for a review of several semantic analysis tools.
This research was motivated by Carte Blanche’s initial plan to integrate AI agent U to respond to voice commands. Accomplishing this involves speech-to-text transcription and keyword recognition. Another interesting yet difficult challenge would be creating an agent with whom the user can have a conversation. We humans often speak in sentences or sentence fragments and throw in “um”s and “ah”s and words that reflect our feelings. If an AI agent in a VR application can understand not just keywords but every part of a person’s conversational speech, then it will introduce a whole other level of immersion inside the VR environment.
The ability to have a natural conversation with something in VR (keyword “something” – even if we can get it to feel real to the user, in the end it’s still a user interface) is widely applicable to a variety of applications outside of just Carte Blanche – for example, virtual therapy such as SimSensei, the virtual therapist developed by the USC Institute for Creative Technologies (ICT). However, as we all know, there are a variety of different ways to express the same intent, even in just one language – so creating natural conversations is no easy task.
You can find the speech-to-text package on the Asset Store. The BitBucket repository for the package can be found here. Anyone is welcome to create their own forks. Drop us a note at [email protected] if you find it useful – we’d love to hear from you!
Images: Amy DiGiovanni
Amy DiGiovanni & Dioselin Gonzalez work at Unity Labs; Amy is a Software Engineer and Dio is a VR Principal Engineer.
Translated from: https://blogs.unity3d.com/2016/08/02/speech-recognition-and-vr/