【Posted on】: 2020-03-12 09:25:16
【Problem description】:
For the final project of my bachelor's degree I want to build a neural network that takes the first 13 MFCC coefficients of a WAV file and, given audio files from a group of speakers, returns which speaker is talking.
Please note:
- My audio files are text-independent, so they have different lengths and different words
- I trained the model on about 35 audio files from 10 speakers (the first speaker has about 15 files, the second about 10, the third and fourth about 5 each)
I defined:
X = mfcc(sound_voice)
Y = a zero vector with a 1 at the i-th position (where i = 0 for the first speaker, 1 for the second, 2 for the third, ...)
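The labeling scheme described above is standard one-hot encoding. A minimal sketch with NumPy (the speaker names here are placeholders, not from the question):

```python
import numpy as np

# Hypothetical speaker list; in the question the names come from the wav filenames.
names = ["speaker0", "speaker1", "speaker2"]

# Row i of the identity matrix is the one-hot label vector for speaker i.
one_hot = np.eye(len(names), dtype=int)

y_second = one_hot[names.index("speaker1")]
print(y_second)  # [0 1 0]
```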
Then I train the model and check its output on some held-out files...
That's what I did... but unfortunately the results look completely random...
Can you help me understand why?
Here is my Python code:
from sklearn.neural_network import MLPClassifier
import python_speech_features
import scipy.io.wavfile as wav
import numpy as np
from os import listdir
from os.path import isfile, join
from random import shuffle
import matplotlib.pyplot as plt
from tqdm import tqdm

winner = []  # counts how many correct predictions we get when testing the NN
for TestNum in tqdm(range(5)):  # each round: build X, Y, hold out 50 samples, train the NN on the rest
    X = []
    Y = []
    onlyfiles = [f for f in listdir("FinalAudios/") if isfile(join("FinalAudios/", f))]  # files in dir
    names = []  # names of the speakers
    for file in onlyfiles:  # extract the speaker name from each wav filename
        if " " not in file.split("_")[0]:
            names.append(file.split("_")[0])
        else:
            names.append(file.split("_")[0].split(" ")[0])
    names = list(dict.fromkeys(names))  # unique speaker names
    vector_names = []  # one-hot vector for each name
    i = 0
    vector_for_each_name = [0] * len(names)
    for name in names:
        vector_for_each_name[i] += 1
        vector_names.append(np.array(vector_for_each_name))
        vector_for_each_name[i] -= 1
        i += 1
    for f in onlyfiles:
        if " " not in f.split("_")[0]:
            f_speaker = f.split("_")[0]
        else:
            f_speaker = f.split("_")[0].split(" ")[0]
        (rate, sig) = wav.read("FinalAudios/" + f)  # read the file
        try:
            mfcc_feat = python_speech_features.mfcc(sig, rate, winlen=0.2, nfft=512)  # MFCC coefficients
            for index in range(len(mfcc_feat)):  # add each MFCC frame to X; if there are 50000 frames,
                # X will be [first frame, second, ..., 50000th frame] and Y will be [f_speaker_vector] * 50000
                X.append(np.array(mfcc_feat[index]))
                Y.append(np.array(vector_names[names.index(f_speaker)]))
        except IndexError:
            pass
    Z = list(zip(X, Y))
    shuffle(Z)  # shuffle X and Y together so the test split is random
    X, Y = zip(*Z)
    X = list(X)
    Y = list(Y)
    X = np.asarray(X)
    Y = np.asarray(Y)
    Y_test = Y[:50]  # choose 50 samples for test, the rest for training
    X_test = X[:50]
    X = X[50:]
    Y = Y[50:]
    clf = MLPClassifier(solver='lbfgs', alpha=1e-2, hidden_layer_sizes=(5, 3), random_state=2)  # create the NN
    clf.fit(X, Y)  # train it
    for sample in range(len(X_test)):  # append 1 to winner if the prediction is correct, 0 otherwise
        if list(clf.predict([X_test[sample]])[0]) == list(Y_test[sample]):
            winner.append(1)
        else:
            winner.append(0)

# plot the running accuracy over all test samples
plot_x = []
plot_y = []
for i in range(1, len(winner)):
    plot_y.append(sum(winner[0:i]) * 1.0 / len(winner[0:i]))
    plot_x.append(i)
plt.plot(plot_x, plot_y)
plt.xlabel('x - axis')
# naming the y axis
plt.ylabel('y - axis')
# giving a title to my graph
plt.title('My first graph!')
# function to show the plot
plt.show()
Here is a zip file with my code and audio files: https://ufile.io/eggjm1gw
【Discussion】:
-
Your pseudocode either leaves too much to interpretation or is incorrect. Can you share the code? Usually you would take short audio clips, compute MFCCs on them, and assign them to a speaker. Your data would be split into at least training, validation, and test sets. With speaker recognition you usually don't have to rely on very short snippets; you can use a second or two. The latter you would then cut into overlapping parts, equal in size to the parts you used during training, and classify all of them. This approach yields very high accuracy.
-
@LukaszTracewski Hi, thanks for your answer. I added my code and uploaded my audio files; I hope it helps you see where I went wrong.
-
@LukaszTracewski Do you think I should take smaller audio clips? Currently some of mine can be up to 20 seconds long...
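The overlapping-segment idea suggested in the first comment could be sketched roughly as follows; `seg_len` and `hop` are illustrative values, not taken from the question:

```python
import numpy as np

def frame_segments(mfcc_feat, seg_len=100, hop=50):
    """Cut an (n_frames, 13) MFCC matrix into overlapping fixed-length
    segments; each flattened segment becomes one classifier sample."""
    segments = []
    for start in range(0, len(mfcc_feat) - seg_len + 1, hop):
        segments.append(mfcc_feat[start:start + seg_len].flatten())
    return np.array(segments)

# e.g. 1000 frames of 13 coefficients -> 19 overlapping segments of 1300 features
demo = np.random.randn(1000, 13)
print(frame_segments(demo).shape)  # (19, 1300)
```

Classifying every segment of a file and aggregating the predictions gives one decision per file instead of one per 20 ms frame.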
Tags: machine-learning audio neural-network signal-processing voice-recognition