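How to find the audio format of the selected SpeechSynthesizer voice: answers

【Question title】: How can I find the audio format of the selected voice of the SpeechSynthesizer
【Posted】: 2016-03-13 00:54:13
【Question】:

In a text-to-speech application written in C#, I use the SpeechSynthesizer class, which has an event named SpeakProgress that fires for every spoken word. For some voices, however, the parameter e.AudioPosition is not synchronized with the output audio stream, and the output wave file plays faster than this position indicates (see this related question).

In any case, I am trying to find exact information about the bit rate and the other properties of the selected voice. In my experience, if I can initialize the wave file with this information, the synchronization problem is resolved. However, when that information is missing from SupportedAudioFormats, I know of no other way to find it. For example, the "Microsoft David Desktop" voice lists no supported formats in its VoiceInfo, yet it appears to support PCM, 16000 Hz, 16-bit.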

var formats = CurVoice.VoiceInfo.SupportedAudioFormats;

if (formats.Count > 0)
{
    var format = formats[0];
    reader.SetOutputToWaveFile(CurAudioFile, format);
}
else
{
    var format = // How can I find it, if the voice hasn't provided it?
    reader.SetOutputToWaveFile(CurAudioFile, format);
}

【Question comments】:

Tags: c# audio sapi speechsynthesizer

【Answer 1】:

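Update: this answer has been edited following investigation. Initially I suggested, from memory, that SupportedAudioFormats probably just comes from (possibly misconfigured) registry data. Investigation has shown that for me, on Windows 7, this is definitely the case, and it is corroborated on Windows 8.

Issues with SupportedAudioFormats

System.Speech wraps the venerable COM speech API (SAPI). Some voices are 32-bit and others 64-bit, or they can be misconfigured (in the registry of a 64-bit machine, compare HKLM/Software/Microsoft/Speech/Voices with HKLM/Software/Wow6432Node/Microsoft/Speech/Voices).

I have pointed ILSpy at System.Speech and its VoiceInfo class, and I am fairly convinced that SupportedAudioFormats comes solely from registry data. Enumerating SupportedAudioFormats can therefore return zero results if your TTS engine is not properly registered for your application's platform target (x86, Any CPU, or 64-bit), or if the vendor simply does not provide this information in the registry.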
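
One way to check whether the two registry views disagree is to read both from the same process. The following is only a sketch: it assumes the voice tokens live under SOFTWARE\Microsoft\Speech\Voices\Tokens and that AudioFormats, when present, is a string value under each token's Attributes key (as the REG_SZ description in the comments below suggests). The class and method names are hypothetical.

```csharp
using System.Collections.Generic;
using Microsoft.Win32;

public static class VoiceRegistry
{
    // Assumed location of the SAPI voice tokens (taken from the paths quoted in this answer).
    private const string TokensPath = @"SOFTWARE\Microsoft\Speech\Voices\Tokens";

    // Returns token name -> AudioFormats string (null when the value is absent)
    // for the requested registry view (Registry32 or Registry64).
    public static Dictionary<string, string> ReadAudioFormats(RegistryView view)
    {
        var result = new Dictionary<string, string>();
        using (var hklm = RegistryKey.OpenBaseKey(RegistryHive.LocalMachine, view))
        using (var tokens = hklm.OpenSubKey(TokensPath))
        {
            if (tokens == null)
                return result; // no voices registered in this view

            foreach (var name in tokens.GetSubKeyNames())
            {
                using (var attributes = tokens.OpenSubKey(name + @"\Attributes"))
                {
                    // Assumption: AudioFormats is a string value under Attributes.
                    result[name] = attributes?.GetValue("AudioFormats") as string;
                }
            }
        }
        return result;
    }
}
```

Calling VoiceRegistry.ReadAudioFormats(RegistryView.Registry64) and VoiceRegistry.ReadAudioFormats(RegistryView.Registry32) and comparing the two dictionaries shows which voices are registered in only one view, or registered without any format data.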

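A voice may still support different, more, or fewer formats than listed, because that is up to the speech engine (code) rather than the registry (data). So it can be a shot in the dark. The standard Windows voices are usually more consistent in this regard than third-party voices, but even they do not necessarily provide a useful SupportedAudioFormats.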
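
To see what each installed voice actually advertises, the public API can simply be enumerated. This is a minimal sketch using only documented System.Speech members; the VoiceFormats class and its Describe method are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Speech.AudioFormat;
using System.Speech.Synthesis;

public static class VoiceFormats
{
    // Builds one line per advertised format, or a single "none" line per silent voice.
    public static List<string> Describe()
    {
        var lines = new List<string>();
        using (var synth = new SpeechSynthesizer())
        {
            foreach (InstalledVoice voice in synth.GetInstalledVoices())
            {
                VoiceInfo info = voice.VoiceInfo;
                if (info.SupportedAudioFormats.Count == 0)
                {
                    lines.Add(info.Name + ": no formats advertised");
                    continue;
                }
                foreach (SpeechAudioFormatInfo f in info.SupportedAudioFormats)
                {
                    lines.Add(info.Name + ": " + f.EncodingFormat + ", " +
                              f.SamplesPerSecond + " Hz, " + f.BitsPerSample + "-bit, " +
                              f.ChannelCount + " channel(s)");
                }
            }
        }
        return lines;
    }
}
```

A voice that prints "no formats advertised" here is the case the rest of this answer deals with.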

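Finding this information the hard way

I found that it is still possible to get the current format of the current voice, but this relies on reflection to reach into the internals of the System.Speech SAPI wrapper.

As a result, this is very fragile code, and I would not recommend using it in production.

Note: the code below requires that you have called Speak() once so that the voice is set up; without Speak(), more calls would be needed to force the setup. However, I can call Speak("") to say nothing, and that works just fine.

Implementation: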

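using System;
using System.Reflection;
using System.Runtime.InteropServices;
using System.Speech.Synthesis;

// Pack = 1 gives the native 18-byte layout; with default packing the struct is
// padded to 20 bytes, so marshalling could read past the end of an 18-byte array.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
struct WAVEFORMATEX
{
    public ushort wFormatTag;
    public ushort nChannels;
    public uint nSamplesPerSec;
    public uint nAvgBytesPerSec;
    public ushort nBlockAlign;
    public ushort wBitsPerSample;
    public ushort cbSize;
}

// Fragile: relies on non-public members of System.Speech that may change between versions.
WAVEFORMATEX GetCurrentWaveFormat(SpeechSynthesizer synthesizer)
{
    const BindingFlags flags = BindingFlags.Instance | BindingFlags.NonPublic;

    // Internal property on SpeechSynthesizer.
    var voiceSynthesis = synthesizer.GetType()
                                    .GetProperty("VoiceSynthesizer", flags)
                                    .GetValue(synthesizer, null);

    // Internal method returning the wrapper for the currently selected voice.
    var ttsVoice = voiceSynthesis.GetType()
                                 .GetMethod("CurrentVoice", flags)
                                 .Invoke(voiceSynthesis, new object[] { false });

    // Private field holding the raw WAVEFORMATEX bytes reported by the engine.
    var waveFormat = (byte[])ttsVoice.GetType()
                                     .GetField("_waveFormat", flags)
                                     .GetValue(ttsVoice);

    var pin = GCHandle.Alloc(waveFormat, GCHandleType.Pinned);
    try
    {
        return (WAVEFORMATEX)Marshal.PtrToStructure(pin.AddrOfPinnedObject(), typeof(WAVEFORMATEX));
    }
    finally
    {
        pin.Free(); // always unpin, even if marshalling throws
    }
}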

Usage:

SpeechSynthesizer s = new SpeechSynthesizer();
s.Speak("Hello");
var format = GetCurrentWaveFormat(s);
Debug.WriteLine($"{s.Voice.SupportedAudioFormats.Count} formats are claimed as supported.");
Debug.WriteLine($"Actual format: {format.nChannels} channel {format.nSamplesPerSec} Hz {format.wBitsPerSample} audio");

To test it, I renamed Microsoft Anna's AudioFormats registry key under HKLM/Software/Wow6432Node/Microsoft/Speech/Voices/Tokens/MS-Anna-1033-20-Dsk/Attributes, so that SpeechSynthesizer.Voice.SupportedAudioFormats has no elements when queried. This is the output in that situation:

0 formats are claimed as supported.
Actual format: 1 channel 16000 Hz 16 audio
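
With those numbers recovered, the else branch from the question can be filled in by building a SpeechAudioFormatInfo by hand. The following is a minimal sketch, assuming the voice produces 8-bit or 16-bit PCM in mono or stereo (the only values the AudioBitsPerSample and AudioChannel enums offer). RecoveredFormat and ToFormatInfo are hypothetical names.

```csharp
using System;
using System.Speech.AudioFormat;

public static class RecoveredFormat
{
    // Maps raw WAVEFORMATEX-style numbers onto the public PCM format type.
    public static SpeechAudioFormatInfo ToFormatInfo(uint samplesPerSec, ushort bitsPerSample, ushort channels)
    {
        AudioBitsPerSample bits;
        if (bitsPerSample == 8) bits = AudioBitsPerSample.Eight;
        else if (bitsPerSample == 16) bits = AudioBitsPerSample.Sixteen;
        else throw new NotSupportedException("Unsupported bit depth: " + bitsPerSample);

        AudioChannel channel;
        if (channels == 1) channel = AudioChannel.Mono;
        else if (channels == 2) channel = AudioChannel.Stereo;
        else throw new NotSupportedException("Unsupported channel count: " + channels);

        return new SpeechAudioFormatInfo((int)samplesPerSec, bits, channel);
    }
}
```

In the question's code this would be used roughly as follows: call reader.Speak("") once, read var format = GetCurrentWaveFormat(reader), and then call reader.SetOutputToWaveFile(CurAudioFile, RecoveredFormat.ToFormatInfo(format.nSamplesPerSec, format.wBitsPerSample, format.nChannels)).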

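【Comments】:

Thanks, but I noticed that the platform target is already "x86".
@Ahmad Interesting. What value do you have for HKEY_LOCAL_MACHINE/Software/Microsoft/Speech/Voices/Tokens/(chosen voice engine)/Attributes/AudioFormats? On this PC (Win7, Microsoft Anna) the default value is the REG_SZ string "18". If I rename the AudioFormats key, I get no formats when enumerating. It looks like a bitmask (although stored as REG_SZ), because I can tweak various bits, but some combinations are illegal. Also, is anything different under HKLM/Software/Wow6432Node/Microsoft/Speech/Voices/ and so on? I wonder whether this is an installer/registry/voice issue; it does not seem to be an API issue.
There is no "AudioFormats" attribute there. It seems Windows 8.1 has no such attribute.
@Ahmad Hmm. Then it should be possible to extract it via reflection. I will update my answer.
Thanks a lot, it works! So you extract the format after speaking has started?!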

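【Answer 2】:

You cannot get this information from code. You can only listen to all the formats (from poor ones such as 8 kHz to high-quality ones such as 48 kHz) and observe where it stops getting better, which I think is what you did.

Internally, the speech engine "asks" the voice for its original audio format only once. I believe this value is used only inside the speech engine, and the engine does not expose it in any way.

More information:

Suppose you are a voice company. You have recorded your computer voice at 16 kHz, 16-bit, mono.

The user can have your voice speak at 48 kHz, 32-bit, stereo. The speech engine performs this conversion. The speech engine does not care whether it actually sounds better; it just does the format conversion.

Suppose the user wants your voice to say something and asks for the file to be saved as 48 kHz, 16-bit, stereo.

SAPI / System.Speech calls your voice with this method: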

STDMETHODIMP SpeechEngine::GetOutputFormat(const GUID * pTargetFormatId, const WAVEFORMATEX * pTargetWaveFormatEx,
GUID * pDesiredFormatId, WAVEFORMATEX ** ppCoMemDesiredWaveFormatEx)
{
    HRESULT hr = S_OK;

    // Here we need to return which format our audio data will be that we pass to the speech engine.
    // Our format (16 kHz, 16 bit, mono) will be converted to the format that the user requested.
    // This will be done by the SAPI engine.

    // Here you tell the speech engine which format the data has that you will pass back.
    // This way the engine knows whether it should upsample or downsample your voice data
    // to match the format that the user requested.
    enum SPSTREAMFORMAT sample_rate_at_which_this_voice_was_recorded = SPSF_16kHz16BitMono;

    hr = SpConvertStreamFormatEnum(sample_rate_at_which_this_voice_was_recorded, pDesiredFormatId, ppCoMemDesiredWaveFormatEx);

    return hr;
}

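This is the only place where you have to "reveal" the recorded format of your voice.

All the "available formats" rather tell you which conversions your sound card / Windows can perform.

I hope I explained that clearly. As a voice vendor, you do not support any formats. You just tell the speech engine what format your audio data is in, so that it can do any further conversion.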
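
To put numbers on the two formats in the example (recorded at 16 kHz, 16-bit, mono; requested at 48 kHz, 16-bit, stereo), the data rate of uncompressed PCM follows directly from the three parameters. This is a small arithmetic sketch, not part of SAPI; PcmMath and its method names are hypothetical.

```csharp
using System;

public static class PcmMath
{
    // Bytes per sample frame (one sample for every channel).
    public static int BlockAlign(int channels, int bitsPerSample)
    {
        return channels * bitsPerSample / 8;
    }

    // Bytes of audio data per second of playback.
    public static int BytesPerSecond(int sampleRate, int channels, int bitsPerSample)
    {
        return sampleRate * BlockAlign(channels, bitsPerSample);
    }

    // Byte offset into the PCM data for a playback position, rounded down to a whole frame.
    public static long ByteOffset(TimeSpan position, int sampleRate, int channels, int bitsPerSample)
    {
        long frames = position.Ticks * sampleRate / TimeSpan.TicksPerSecond;
        return frames * BlockAlign(channels, bitsPerSample);
    }
}
```

For the recorded format, PcmMath.BytesPerSecond(16000, 1, 16) gives 32000 bytes per second; for the requested format, PcmMath.BytesPerSecond(48000, 2, 16) gives 192000, so the engine's conversion produces six times as much data for the same speech. The same arithmetic shows why a time position only maps to the right place in a wave file when the format is known: 1.5 seconds into 16 kHz, 16-bit, mono data is byte 48000.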

【Comments】: