librosa.util.exceptions.ParameterError: Invalid shape for monophonic audio: ndim=2, shape=(1025, 5341)

[Question title]: librosa.util.exceptions.ParameterError: Invalid shape for monophonic audio: ndim=2, shape=(1025, 5341)
[Posted]: 2019-01-16 03:31:21
[Question description]:

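I am trying to use Python to separate the voice from the background noise in an audio file and then extract MFCC features.

But I get the error "librosa.util.exceptions.ParameterError: Invalid shape for monophonic audio: ndim=2, shape=(1025, 5341)".

Here is the code: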

from __future__ import print_function
import numpy as np
import matplotlib.pyplot as plt
import librosa

import librosa.display

import scipy
from scipy.io.wavfile import write
import soundfile as sf
from sklearn.preprocessing import normalize
from scipy.io.wavfile import read, write
from scipy.fftpack import rfft, irfft

y, sr = librosa.load('/home/osboxes/Desktop/AccentReco1/audio-files/egyptiansong.mp3', duration=124)

y=rfft(y) 

# And compute the spectrogram magnitude and phase
S_full, phase = librosa.magphase(librosa.stft(y))


# We'll compare frames using cosine similarity, and aggregate similar frames
# by taking their (per-frequency) median value.
#
# To avoid being biased by local continuity, we constrain similar frames to be
# separated by at least 2 seconds.
#
# This suppresses sparse/non-repetitive deviations from the average spectrum,
# and works well to discard vocal elements.

S_filter = librosa.decompose.nn_filter(S_full,
                                       aggregate=np.median,
                                       metric='cosine',
                                       width=int(librosa.time_to_frames(2, sr=sr)))

# The output of the filter shouldn't be greater than the input
# if we assume signals are additive.  Taking the pointwise minimum
# with the input spectrum forces this.
S_filter = np.minimum(S_full, S_filter)

# We can also use a margin to reduce bleed between the vocals and instrumentation masks.
# Note: the margins need not be equal for foreground and background separation
margin_i, margin_v = 2, 10
power = 2

mask_i = librosa.util.softmask(S_filter,
                               margin_i * (S_full - S_filter),
                               power=power)

mask_v = librosa.util.softmask(S_full - S_filter,
                               margin_v * S_filter,
                               power=power)

# Once we have the masks, simply multiply them with the input spectrum
# to separate the components

S_foreground = mask_v * S_full
S_background = mask_i * S_full

# extract mfcc feature from data
mfccs = np.mean(librosa.feature.mfcc(y=S_foreground, sr=sr, n_mfcc=40).T,axis=0) 
print(mfccs)

Any ideas?

[Comments]:

Tags: python-3.x audio speech-recognition voice-recognition mfcc

[Solution 1]:

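You are trying to compute MFCCs from a spectrogram: S_foreground is a 2-D array, but the y argument of librosa.feature.mfcc expects a 1-D audio signal.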
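The shape in the error message can be checked by hand. The sketch below assumes librosa's defaults (22050 Hz resampling in librosa.load, n_fft=2048 and hop_length=n_fft // 4 with centered frames in librosa.stft) and that the clip really lasts the full 124 seconds; under those assumptions it reproduces the reported shape of an STFT matrix rather than a waveform.

```python
# Assumed librosa defaults; none of these values are set explicitly in the question.
sr = 22050               # default sampling rate of librosa.load
duration = 124           # seconds, from the librosa.load call in the question
n_fft = 2048             # default FFT size of librosa.stft
hop_length = n_fft // 4  # default hop length of librosa.stft

n_samples = sr * duration
n_bins = 1 + n_fft // 2                   # frequency bins per frame
n_frames = 1 + n_samples // hop_length    # frames when center=True

print((n_bins, n_frames))
```

So the array passed as y has one row per frequency bin and one column per frame, which is why librosa rejects it as monophonic audio.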

You must first convert it back to audio samples using the inverse STFT.

from librosa.core import istft
vocals = istft(S_foreground)

[Comments]: