Scipy, Numpy: Audio classifier, Voice/Speech Activity Detection (answers)

【Question title】: Scipy, Numpy: Audio classifier, Voice/Speech Activity Detection
【Posted】: 2015-08-05 05:37:37
【Question】:

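I am writing a program to automatically classify recorded phone call audio files (wav files) according to whether they contain at least some human voice or not (only DTMF, dial tones, ringtones, noise).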

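My first approach was to implement a simple VAD (voice activity detector) using ZCR (zero crossing rate) and computed energy, but both of these parameters confuse DTMF and dial tones with voice. This technique failed, so I implemented a trivial method that computes the variance of the FFT between 200 Hz and 300 Hz. My numpy code is as follows: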

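import numpy as np
from numpy.fft import fft

# The original snippet was a bare function body; the name below is illustrative.
def band_spectrum_std(frame, fs):
    # Magnitude spectrum of one frame
    wavefft = np.abs(fft(frame))
    n = len(frame)
    # Frequency in Hz of each FFT bin
    fx = np.arange(n) * float(fs) / n
    # Indices of the first bins at or above 200 Hz and 300 Hz
    stx = np.where(fx >= 200)[0][0]
    endx = np.where(fx >= 300)[0][0]
    # Square root of the variance (i.e. the standard deviation) in that band
    return np.sqrt(np.var(wavefft[stx:endx])) / 1000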

This resulted in 60% accuracy.
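For reference, the two features from the first attempt (zero crossing rate and energy) can be computed per frame roughly as follows. This is only a sketch: the function names, the 100 ms frame length, and the synthetic tone and noise signals are assumptions for illustration and are not taken from the question, and no decision thresholds are shown.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.sign(np.asarray(frame, dtype=float))
    signs[signs == 0] = 1  # treat exact zeros as positive
    return np.mean(signs[1:] != signs[:-1])

def short_time_energy(frame):
    """Mean squared amplitude of the frame."""
    x = np.asarray(frame, dtype=float)
    return np.mean(x ** 2)

# Synthetic test signals (assumed, not from the question's recordings)
fs = 8000
t = np.arange(fs // 10) / fs                 # one 100 ms frame
tone = np.sin(2 * np.pi * 440 * t)           # steady tone, similar in spirit to a dial tone
rng = np.random.default_rng(0)
noise = rng.normal(scale=0.1, size=t.size)   # low-level broadband noise

for name, frame in [("tone", tone), ("noise", noise)]:
    print(name, zero_crossing_rate(frame), short_time_energy(frame))
```

A steady tone has a low, stable ZCR and high energy, which is also roughly what voiced speech looks like on these two measures, so the confusion described above is not surprising.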

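Next, I tried a machine-learning approach using an SVM (support vector machine) and MFCCs (mel-frequency cepstral coefficients). The results were completely incorrect; almost all samples were mislabeled. How should one train an SVM with MFCC feature vectors? My rough code using scikit-learn is as follows: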

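from scipy.io import wavfile
from sklearn import svm

# MFCC is the asker's own feature extractor (not shown here); it is assumed
# to return one feature vector per frame.
[samplerate, sample] = wavfile.read('profiles/noise.wav')
noiseProfile = MFCC(samplerate, sample)
[samplerate, sample] = wavfile.read('profiles/ring.wav')
ringProfile = MFCC(samplerate, sample)  # loaded but not used for training below
[samplerate, sample] = wavfile.read('profiles/voice.wav')
voiceProfile = MFCC(samplerate, sample)

# Training matrix: noise frames first, then voice frames
machineData = []
for noise in noiseProfile:
    machineData.append(noise)

for voice in voiceProfile:
    machineData.append(voice)

# Labels in the same order: 0 = noise, 1 = voice
dataLabel = []
for i in range(0, len(noiseProfile)):
    dataLabel.append(0)
for i in range(0, len(voiceProfile)):
    dataLabel.append(1)

clf = svm.SVC()
clf.fit(machineData, dataLabel)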

I would like to know what alternative approaches I could implement.

【Comments】:

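You may need to tune the SVC parameters or use a different kernel; it is hard to say. You can run a grid search to look for the best parameters. Before training, I would also suggest shuffling machineData and dataLabel (with the same indices).
@imaluengo Thanks. Any ideas about alternative approaches that do not use machine learning?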
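The shuffle and grid-search suggestion from the first comment could look roughly like this with scikit-learn. This is a sketch under assumptions: the feature arrays are synthetic stand-ins for per-frame MFCC vectors (13 coefficients each), the parameter grid is an arbitrary starting point, and the StandardScaler step is an addition that the comment did not mention.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.utils import shuffle

# Synthetic stand-ins for per-frame MFCC vectors (assumed, not real audio features)
rng = np.random.default_rng(0)
noise_feats = rng.normal(loc=0.0, size=(100, 13))
voice_feats = rng.normal(loc=2.0, size=(100, 13))

X = np.vstack([noise_feats, voice_feats])
y = np.array([0] * len(noise_feats) + [1] * len(voice_feats))

# Shuffle features and labels together so every row keeps its own label
X, y = shuffle(X, y, random_state=0)

# Candidate kernels and hyperparameters (an arbitrary example grid)
param_grid = {
    "svc__kernel": ["linear", "rbf"],
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", 0.01],
}

# Scale the features, then fit an SVC, scoring each setting with 5-fold cross-validation
search = GridSearchCV(make_pipeline(StandardScaler(), SVC()), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

With real recordings, X and y would be built from the MFCC frames of each labeled file instead of the random arrays used here.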

Tags: numpy machine-learning scipy

【Solution 1】:

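If you don't have to use scipy/numpy, you could look at webrtcvad, a Python wrapper around Google's excellent WebRTC voice activity detection code. WebRTC uses Gaussian mixture models (GMMs), works well, and is very fast.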

Here is an example of how you might use it:

import webrtcvad

# Audio must be 16-bit mono PCM at 8, 16, 32 or 48 kHz.
def audio_contains_voice(audio, sample_rate, aggressiveness=0, threshold=0.5):
    # Frames must be 10, 20 or 30 ms.
    frame_duration_ms = 30

    # Assuming split_audio is a function that will split audio into
    # frames of the correct size.
    frames = split_audio(audio, sample_rate, frame_duration_ms)

    # aggressiveness tells the VAD how aggressively to filter out non-speech.
    # 0 will have the most false-positives for speech, 3 the least.
    vad = webrtcvad.Vad(aggressiveness)

    # Fraction of frames that the VAD classifies as speech
    num_voiced = len([f for f in frames if vad.is_speech(f, sample_rate)])
    return float(num_voiced) / len(frames) > threshold
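The answer leaves split_audio undefined. One possible sketch is shown below, assuming audio is a bytes object of raw 16-bit mono PCM and that a trailing partial frame can simply be dropped; the body and the parameter name frame_duration_ms are hypothetical and not part of webrtcvad.

```python
def split_audio(audio, sample_rate, frame_duration_ms):
    """Split raw 16-bit mono PCM bytes into equal-length frames.

    Any trailing partial frame is dropped, since a fixed frame
    length is required by the caller in the example above.
    """
    bytes_per_sample = 2  # 16-bit samples
    frame_bytes = int(sample_rate * frame_duration_ms / 1000) * bytes_per_sample
    last_start = len(audio) - frame_bytes
    return [audio[i:i + frame_bytes] for i in range(0, last_start + 1, frame_bytes)]

# One second of silence at 8 kHz, split into 30 ms frames
one_second = b"\x00\x00" * 8000
frames = split_audio(one_second, 8000, 30)
print(len(frames), len(frames[0]))
```

If the input is shorter than one frame, this returns an empty list, so a caller that divides by len(frames) would need to handle that case.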

【Discussion】: