Analysis of the C# voice endpoint detection (VAD) implementation process
2022-06-29 08:53:00 【J ..】
Preface:
Most early methods are based on acoustic feature extraction. In the time domain, Rabiner et al. proposed a speech endpoint detection method based on short-time energy and zero-crossing rate in 1975; it was the first systematic and complete voice endpoint detection algorithm. The method uses three thresholds: the first two are a higher and a lower threshold on the short-time energy, used for a preliminary judgment of the endpoint positions, and the third is set on the short-time zero-crossing rate and used to finally determine the start and end points of the speech segment. The method requires little computation, meets real-time requirements, and performs well at high signal-to-noise ratios, so it has been widely used across speech signal processing. Later, many researchers improved on it: for example, the double-threshold detection algorithm based on absolute-value energy; Gu Yaqiang et al. proposed an endpoint detection method that exploits the difference in zero-crossing rate and short-time energy between speech and noise; and Chen Zhenbiao et al. proposed an algorithm that combines multi-subband energy features with an optimal edge-detection decision criterion.
In the frequency domain, Koba applied a fast Fourier transform to each frame of noisy speech and proposed an endpoint detection algorithm based on frequency-domain information; it was the first algorithm to analyze the signal in the frequency domain and then extract the corresponding features for endpoint detection. Many frequency-domain algorithms followed, such as endpoint detection based on information entropy and on the Mel cepstrum. Both time-domain and frequency-domain methods have their own strengths and weaknesses, so researchers tried to combine the two domains to exploit the advantages of each; Lin et al. proposed an enhanced time-frequency parameter that incorporates both time-domain and frequency-domain information. However, whether time-domain, frequency-domain, or combined, these methods only achieve high recognition rates at high signal-to-noise ratios, while real endpoint-detection environments are complex and varied and a high SNR is hard to guarantee.
Model-matching methods were therefore proposed. They usually consist of a training phase and a testing phase: in the training phase the model learns its parameters from a training corpus, much as a person learns knowledge; in the testing phase the speech to be segmented is compared against the trained model, which separates speech from non-speech. The hidden Markov model (HMM) first appeared in statistical papers and is now the most widely used model in speech recognition. In 1998, Zhu Jie et al. applied the HMM to endpoint detection of speech with background noise, the first voice endpoint detection algorithm based on an HMM. Experiments showed that its accuracy is clearly higher than that of the double-threshold method, and that it rarely misses bursts, nasals, weak fricatives, and similar sounds.
In addition, Dong Enqing et al. applied the support vector machine (SVM), which is grounded in statistical learning theory, to endpoint detection of noisy speech and verified its effectiveness. Because of the large amount of computation involved, SVM-based methods usually take a long time to train, although the actual classification is cheap and fast. These methods can still achieve high accuracy at low signal-to-noise ratios, but they have their own limitations: first, the models have few parameters, so it is hard for them to describe the data well; second, the parameters are estimated from limited data sets, so the available information may not be fully exploited. More recently, because neural-network-based methods can overcome the limitations of model matching and obtain better results, more and more people have used this kind of learning for speech endpoint detection, including deep belief networks, which has greatly broadened exploration in this field. However, such methods may fall into a local optimum depending on the choice of the initial point, which affects the endpoint detection result.
Analysis:
The current mainstream VAD algorithms are:
- Speech endpoint detection algorithm based on short-time energy and zero crossing rate
- Speech endpoint detection algorithm based on information entropy
- Speech endpoint detection algorithm based on short-time energy frequency value
- Speech endpoint detection based on neural network
From the analysis above, methods 1, 2, and 3 are suitable for high signal-to-noise-ratio (SNR) environments; at low SNR their detection performance is not ideal. The neural-network-based approach performs better than methods 1, 2, and 3 in low-SNR environments. (A minimal sketch of the classic dual-threshold idea behind method 1 is given below.)
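For reference, below is a minimal Python sketch of the classic dual-threshold idea behind method 1 (short-time energy plus zero-crossing rate). The frame sizes and threshold factors are illustrative assumptions, and the per-frame decision is a simplification of the original boundary-expansion search, not a tuned implementation.
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    # 25 ms frames with a 10 ms hop at 16 kHz (assumed values)
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])

def dual_threshold_vad(x):
    frames = frame_signal(np.asarray(x, dtype=np.float64))
    energy = np.sum(frames ** 2, axis=1)                                  # short-time energy per frame
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)   # zero-crossing rate per frame
    high, low = 0.5 * energy.max(), 0.1 * energy.max()                    # the two energy thresholds (assumed factors)
    zcr_th = 1.5 * np.mean(zcr[:10])                                      # ZCR threshold from the leading frames (assumed)
    # coarse decision with the high energy threshold, refined by the low energy and ZCR thresholds
    return (energy > high) | ((energy > low) & (zcr > zcr_th))            # True = speech frame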
Reference articles:
- https://www.docin.com/p-1537242532.html
- https://blog.csdn.net/ffmpeg4976/article/details/52349007
- https://blog.csdn.net/baienguon/article/details/80539296
Project background:
Since the company's project is developed in C#, this article uses C# to call a model trained in Python for voice endpoint detection.
Background knowledge (sharing some of the audio articles I learned from):
- Audio properties: https://blog.csdn.net/aoshilang2249/article/details/38469051
- MFCC explained in detail: https://blog.csdn.net/suan2014/article/details/82021324
- Fourier transform: https://blog.csdn.net/u011947630/article/details/81513075
- I have forgotten to bookmark some of the others. In short, whenever something is unclear, look it up.
Implementation idea:
- Collect positive and negative sample audio files (human voice and noise), extract their MFCC features in Python to train a model, and save the model file.
- Because I train with a Keras-based LSTM and Keras provides no C# interface, the model has to be converted into a .pb file, which is then loaded through TensorflowSharp to produce predictions.
- (C#) I set the collection window to 1 s and the shift between windows to 500 ms.
- First collect 1 s of audio and extract its MFCC features for prediction. If the prediction is human voice, this is the starting point of the sound and the audio is saved; if it is noise, it is discarded. Then move forward by 500 ms, extract the MFCC features of the next 1 s window, and predict again: if it is voice and the previous window was also voice, only the new 500 ms offset of audio is appended; if the previous window was noise, this window is a new starting point. Repeating this loop eventually detects the segments of vocal audio that match the model's characteristics (a rough code sketch of this loop follows after this list).
- That is the general idea of my VAD implementation. It may not be perfect, but it is the best solution I could come up with; my ability is limited after all, so if you have a better approach, your guidance is very welcome. Thank you very much!
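Below is a rough Python sketch of that loop. It assumes a hypothetical predict() callable that returns the class index of a 1 s frame and a voice_index marking the human-voice class; it only illustrates the windowing logic described above, not the actual project code (the real C# implementation appears later).
import numpy as np

FS = 16000        # sample rate
WIN = FS          # 1 s analysis window
HOP = FS // 2     # 500 ms hop

def segment_voice(samples, predict, voice_index):
    segments, current = [], []
    prev_was_voice = False
    for start in range(0, len(samples) - WIN + 1, HOP):
        frame = samples[start:start + WIN]
        is_voice = (predict(frame) == voice_index)
        if is_voice:
            # a voiced window after noise starts a new segment; after voice, only the new 500 ms is appended
            current.extend(frame if not prev_was_voice else frame[-HOP:])
        elif current:
            segments.append(np.array(current))   # noise after voice closes the current segment
            current = []
        prev_was_voice = is_voice
    if current:
        segments.append(np.array(current))
    return segments
In the C# code later, the segment-closing step is additionally guarded by a count of consecutive noise frames (it is left commented out there).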
Code:
1. Python: train the sample model (based on a Keras LSTM):
References: https://github.com/Renovamen/Speech-Emotion-Recognition
https://github.com/hcmlab/vadnet
Training sample audio: 16 kHz, 16-bit
Part of the main code (only the main implementation is posted because there is a lot of code):
from keras.utils import np_utils
from DNN_Model import LSTM_Model
from Utilities import get_data
DATA_PATH = 'DataSet/'
CLASS_LABELS = ("hg", "nd", "pz")
def train():
FLATTEN = False
NUM_LABELS = len(CLASS_LABELS)
SVM = False
x_train, x_test, y_train, y_test = get_data(DATA_PATH, class_labels=CLASS_LABELS, flatten=FLATTEN, _svm=SVM)
y_train = np_utils.to_categorical(y_train)
y_test_train = np_utils.to_categorical(y_test, len(CLASS_LABELS))
print('-------------------------------- LSTM Start --------------------------------')
model = LSTM_Model(input_shape=x_train[0].shape, num_classes=NUM_LABELS)
model.train(x_train, y_train, x_test, y_test_train, n_epochs=100)
model.evaluate(x_test, y_test)
model.save_model("LSTM1")
print('-------------------------------- LSTM End --------------------------------')
train()
Read data and extract MFCC features:
# Read data from a given folder and extract MFCC features
import os
import sys
from typing import Tuple
import numpy as np
import scipy.io.wavfile as wav
from sklearn.model_selection import train_test_split
from keras.models import model_from_json
from sklearn.externals import joblib
from MFCC_COM import get_mfcc
import scipy.io.wavfile
mean_signal_length = 16000
"""
get_feature(): Extract an audio MFCC Eigenvector
Input :
file_path(str): The audio path
mfcc_len(int): Every frame MFCC Characteristic number
flatten(bool): Whether dimension reduction data
Output :
numpy.ndarray: Of the audio MFCC Eigenvector
"""
def get_feature(file_path: str, mfcc_len: int = 39, flatten: bool = False):
# Some audio files raise an "Incomplete wav chunk" error when read with scipy.io.wavfile
# apparently because scipy only reads PCM and float data; other WAV encodings are not supported...
# fs, signal = wav.read(file_path)
# signal, fs = librosa.load(file_path, 16000)
fs, signal = scipy.io.wavfile.read(file_path)
s_len = len(signal)
# If the signal is shorter than mean_signal_length, pad it
if s_len < mean_signal_length:
pad_len = mean_signal_length - s_len
pad_rem = pad_len % 2
pad_len //= 2
signal = np.pad(signal, (pad_len, pad_len + pad_rem), 'constant', constant_values=0)
# Otherwise, trim it down to mean_signal_length
else:
pad_len = s_len - mean_signal_length
pad_len //= 2
signal = signal[pad_len:pad_len + mean_signal_length]
mel_coefficients = get_dll(signal, fs, mfcc_len)
# use SVM & MLP Dimension reduction data is required for modeling
if flatten:
mel_coefficients = np.ravel(mel_coefficients)
return mel_coefficients
def get_feature_result(signal, fs, mfcc_len: int = 39, flatten: bool = False):
# Some audio files raise an "Incomplete wav chunk" error when read with scipy.io.wavfile
# apparently because scipy only reads PCM and float data; other WAV encodings are not supported...
# fs, signal = wav.read(file_path)
# signal, fs = librosa.load(file_path)
s_len = len(signal)
# If the signal is shorter than mean_signal_length, pad it
if s_len < mean_signal_length:
pad_len = mean_signal_length - s_len
pad_rem = pad_len % 2
pad_len //= 2
signal = np.pad(signal, (pad_len, pad_len + pad_rem), 'constant', constant_values=0)
# Otherwise, trim it down to mean_signal_length
else:
pad_len = s_len - mean_signal_length
pad_len //= 2
signal = signal[pad_len:pad_len + mean_signal_length]
mel_coefficients = get_mfcc(signal, fs, mfcc_len)
# mel_coefficients = librosa.feature.mfcc(signal, fs, n_mfcc=39)
# use SVM & MLP Dimension reduction data is required for modeling
if flatten:
mel_coefficients = np.ravel(mel_coefficients)
return mel_coefficients
def get_dll(signal, fs, mfcc_len):
mfcc = get_mfcc(signal, fs, mfcc_len)
mfcc = mfcc.Array
array = []
for _ in range(mfcc.Length):
array.append(mfcc[_])
return np.array(array).reshape(-1, mfcc_len)
def get_data(data_path: str, mfcc_len: int = 39,
class_labels: Tuple = ("angry", "fear", "happy", "neutral", "sad", "surprise"), flatten: bool = False,
_svm: bool = False):
data = []
labels = []
cur_dir = os.getcwd()
sys.stderr.write('Curdir: %s\n' % cur_dir)
os.chdir(data_path)
# Traversal folder
for i, directory in enumerate(class_labels):
sys.stderr.write("Started reading folder %s\n" % directory)
os.chdir(directory)
# Read the audio in this folder
for filename in os.listdir('.'):
if not filename.endswith('wav'):
continue
filepath = os.getcwd() + '/' + filename
# Extract the feature vector of the audio
feature_vector = get_feature(file_path=filepath, mfcc_len=mfcc_len, flatten=flatten)
data.append(feature_vector)
labels.append(i)
sys.stderr.write("Ended reading folder %s\n" % directory)
os.chdir('..')
os.chdir(cur_dir)
# Divide the training set and the test set
x_train, x_test, y_train, y_test = train_test_split(np.array(data), np.array(labels), test_size=0.001,
random_state=42)
return np.array(x_train), np.array(x_test), np.array(y_train), np.array(y_test)
'''
load_model(): load a trained model
Input:
    model_name (str): the model name
    load_model (str): the model type (DNN / ML)
Output:
    model: the loaded model
'''
def load_model(model_name: str, load_model: str):
if load_model == 'DNN':
# load json
model_path = 'Models/' + model_name + '.h5'
model_json_path = 'Models/' + model_name + '.json'
json_file = open(model_json_path, 'r')
loaded_model_json = json_file.read()
json_file.close()
model = model_from_json(loaded_model_json)
# Load weights
model.load_weights(model_path)
elif load_model == 'ML':
model_path = 'Models/' + model_name + '.m'
model = joblib.load(model_path)
return model
Model:
# CNN & LSTM
import sys
import numpy as np
from keras import Sequential
from keras.layers import LSTM as KERAS_LSTM, Dense, Dropout, Conv2D, Flatten, BatchNormalization, Activation, \
MaxPooling2D
from Common_Model import Common_Model
# The CNN and LSTM classes inherit from this class (and implement the make_model method)
class DNN_Model(Common_Model):
'''
__init__(): initialize the neural network
Input:
    input_shape (tuple): shape of the input tensor
    num_classes (int): number of label classes
'''
def __init__(self, input_shape, num_classes, **params):
super(DNN_Model, self).__init__(**params)
self.input_shape = input_shape
self.model = Sequential()
self.make_model()
self.model.add(Dense(num_classes, activation='softmax'))
self.model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
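# The model is compiled with binary_crossentropy even though the output is a multi-class softmax; categorical_crossentropy is the more conventional pairing for that setup.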
print(self.model.summary(), file=sys.stderr)
'''
save_model(): save the model weights as model_name.h5 and the architecture as model_name.json under the /Models directory
'''
def save_model(self, model_name):
h5_save_path = 'Models/' + model_name + '.h5'
self.model.save_weights(h5_save_path)
save_json_path = 'Models/' + model_name + '.json'
with open(save_json_path, "w") as json_file:
json_file.write(self.model.to_json())
'''
train(): Train the model on a given training set
Input :
x_train (numpy.ndarray): Training set samples
y_train (numpy.ndarray): Training set label
x_val (numpy.ndarray): Test set samples
y_val (numpy.ndarray): Test set label
n_epochs (int): epoch Count
'''
def train(self, x_train, y_train, x_val=None, y_val=None, n_epochs=50):
best_acc = 0
if x_val is None or y_val is None:
x_val, y_val = x_train, y_train
for i in range(n_epochs):
# Shuffle the training data before every epoch
p = np.random.permutation(len(x_train))
x_train = x_train[p]
y_train = y_train[p]
'''
fit(x, y, batch_size=32, epochs=10, verbose=1, callbacks=None,
    validation_split=0.0, validation_data=None, shuffle=True,
    class_weight=None, sample_weight=None, initial_epoch=0)
x: input data; a numpy array if the model has a single input, or a list of numpy arrays (one per input) if it has several
y: labels, as a numpy array
batch_size: integer, number of samples per gradient-update batch; each batch produces one gradient-descent step on the objective
epochs: integer, the epoch at which training stops; if initial_epoch is not set this is the total number of training rounds, otherwise the total is epochs - initial_epoch
verbose: logging mode; 0 = no output, 1 = progress bar, 2 = one line per epoch
callbacks: list of keras.callbacks.Callback objects, called at the appropriate points during training
validation_split: float between 0 and 1, fraction of the training data held out as a validation set; it is not trained on, and loss, accuracy and other metrics are evaluated on it at the end of each epoch. Note that the split happens before shuffling, so if the data is ordered you must shuffle it yourself before using validation_split, otherwise the validation sample may be unrepresentative
validation_data: a (X, y) tuple specifying the validation set; overrides validation_split
shuffle: boolean or string, normally a boolean indicating whether to shuffle the sample order every epoch; the string "batch" is a special case for HDF5 data and shuffles within each batch
class_weight: dict mapping classes to weights, used to reweight the loss during training (training only)
sample_weight: numpy array of per-sample weights for the loss during training (training only); either a 1-D vector of the same length as the samples for one-to-one weighting, or, for temporal data, a (samples, sequence_length) array weighting every timestep; in the latter case compile the model with sample_weight_mode='temporal'
initial_epoch: epoch from which to start training, useful for resuming a previous run
fit returns a History object; its History.history attribute records the loss and metric values per epoch, including those of the validation set if one was given
'''
self.model.fit(x_train, y_train, batch_size=32, epochs=1)
# Evaluate the loss and accuracy on the validation set
loss, acc = self.model.evaluate(x_val, y_val)
if acc > best_acc:
best_acc = acc
print("TRAIN:%d / %d" % (i, n_epochs))
self.trained = True
'''
recognize_one(): predict the class of a single audio sample
Input:
    sample: the sample to predict
Output:
    the predicted class index and the confidence probabilities (int, numpy.ndarray)
'''
def recognize_one(self, sample):
# The model has not been trained or loaded
if not self.trained:
sys.stderr.write("No Model.")
sys.exit(-1)
return np.argmax(self.model.predict(np.array([sample]))), self.model.predict(np.array([sample]))[0]
def make_model(self):
raise NotImplementedError()
class CNN_Model(DNN_Model):
def __init__(self, **params):
params['name'] = 'CNN'
super(CNN_Model, self).__init__(**params)
def make_model(self):
self.model.add(Conv2D(8, (13, 13), input_shape=(self.input_shape[0], self.input_shape[1], 1)))
self.model.add(BatchNormalization(axis=-1))
self.model.add(Activation('relu'))
self.model.add(Conv2D(8, (13, 13)))
self.model.add(BatchNormalization(axis=-1))
self.model.add(Activation('relu'))
self.model.add(MaxPooling2D(pool_size=(2, 1)))
self.model.add(Conv2D(8, (13, 13)))
self.model.add(BatchNormalization(axis=-1))
self.model.add(Activation('relu'))
self.model.add(Conv2D(8, (2, 2)))
self.model.add(BatchNormalization(axis=-1))
self.model.add(Activation('relu'))
self.model.add(MaxPooling2D(pool_size=(2, 1)))
self.model.add(Flatten())
self.model.add(Dense(64))
self.model.add(BatchNormalization())
self.model.add(Activation('relu'))
self.model.add(Dropout(0.2))
class LSTM_Model(DNN_Model):
def __init__(self, **params):
params['name'] = 'LSTM'
super(LSTM_Model, self).__init__(**params)
def make_model(self):
#
self.model.add(KERAS_LSTM(128, input_shape=(self.input_shape[0], self.input_shape[1])))
self.model.add(Dropout(0.5))
self.model.add(Dense(32, activation='relu')) # Standard one-dimensional fully connected layer
self.model.add(Dense(16, activation='tanh'))
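Note that the final Dense(num_classes, activation='softmax') output layer is appended in the DNN_Model base-class __init__ shown above, so make_model only builds the hidden layers.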
2. Python can obtain MFCC features through librosa or scipy, but I found that the MFCC values computed in C# differ from those computed in Python, so the predictions were inaccurate (I don't know why; the formulas are complicated, and I would welcome guidance from anyone who understands this). My workaround is to have Python call a C# DLL to compute the MFCC features.
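For what it is worth, one common cause of such mismatches (an assumption here, not something verified against NWaves) is that different MFCC implementations use different defaults: pre-emphasis, window length and hop, FFT size, number of mel filters, the mel-scale formula, liftering, and whether the 0th coefficient is kept. Below is a hedged sketch that pins librosa's parameters explicitly so they can be compared one by one with the settings of the C# extractor; the values are placeholders, not the ones NWaves actually uses.
import numpy as np
import librosa

def mfcc_explicit(signal, fs=16000, n_mfcc=39):
    # every keyword is spelled out so it can be matched against the C# extractor's configuration
    return librosa.feature.mfcc(
        y=np.asarray(signal, dtype=np.float32), sr=fs, n_mfcc=n_mfcc,
        n_fft=512,        # FFT size (placeholder)
        win_length=400,   # 25 ms window (placeholder)
        hop_length=160,   # 10 ms hop (placeholder)
        n_mels=40,        # number of mel filters (placeholder)
        htk=True,         # HTK-style mel scale (placeholder)
        lifter=0,         # no liftering (placeholder)
    ).T                   # one row per frame, matching the layout used elsewhere in this project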
C#: compile it into a DLL
Reference: https://github.com/ar1st0crat/NWaves
using NumSharp;
using NWaves.FeatureExtractors;
using NWaves.FeatureExtractors.Base;
using System.Collections.Generic;
namespace MFCC
{
public class MYMFCC
{
public NDArray GetMFCC(float[] input, int sr, int mfcc_size)
{
var mfccExtractor = new MfccExtractor(sr, mfcc_size);
var mfccVectors = mfccExtractor.ComputeFrom(input);
List<float> result = new List<float>();
foreach (FeatureVector vector in mfccVectors)
{
foreach (float _ in vector.Features)
{
result.Add(_);
}
}
return np.array(result.ToArray());
}
}
}
Python (referencing the relevant DLLs):
import clr
clr.FindAssembly("MFCCSharp.dll")
clr.FindAssembly("NWaves.dll")
clr.FindAssembly("NumSharp.dll")
from MFCC import *
from NWaves import *
from NumSharp import *
instance = MYMFCC()
def get_mfcc(input, sr, mfcc_len):
instance = MYMFCC()
return instance.GetMFCC(input, sr, mfcc_len)
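Note: depending on the pythonnet version, it may also be necessary to add the DLL directory to sys.path and call clr.AddReference("MFCCSharp") (and likewise for NWaves and NumSharp) before the imports succeed, since clr.FindAssembly only locates an assembly. This is an assumption to check if the import fails, not something verified here.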
3. Because my Python model is trained with Keras and Keras provides no C# interface, the .h5 model file produced by Keras is converted into a .pb file, which C# then loads through TensorflowSharp.
Python (.h5 -> .pb):
Reference: https://blog.csdn.net/qq_25109263/article/details/81285952
from keras.models import load_model
import tensorflow as tf
from keras import backend as K, Sequential
from tensorflow.python.framework import graph_io
from keras.models import model_from_json
def freeze_session(session, keep_var_names=None, output_names=None, clear_devices=True):
from tensorflow.python.framework.graph_util import convert_variables_to_constants
graph = session.graph
with graph.as_default():
freeze_var_names = list(set(v.op.name for v in tf.global_variables()).difference(keep_var_names or []))
output_names = output_names or []
output_names += [v.op.name for v in tf.global_variables()]
input_graph_def = graph.as_graph_def()
if clear_devices:
for node in input_graph_def.node:
node.device = ""
frozen_graph = convert_variables_to_constants(session, input_graph_def,
output_names, freeze_var_names)
return frozen_graph
def load_model():
model_path = 'Models/LSTM1.h5'
model_json_path = 'Models/LSTM1.json'
json_file = open(model_json_path, 'r')
loaded_model_json = json_file.read()
json_file.close()
model = model_from_json(loaded_model_json)
# Load weights
model.load_weights(model_path)
return model
"""---------------------------------- Configuration path -----------------------------------"""
epochs = 20
h5_model_path = 'Models/LSTM1.h5'
output_path = 'PBModels'
pb_model_name = 'LSTM1.pb'
"""---------------------------------- Import keras Model ------------------------------"""
K.set_learning_phase(0)
net_model = load_model()
print('input is :', net_model.input.name)
print('output is:', net_model.output.name)
"""---------------------------------- Save as .pb Format ------------------------------"""
sess = K.get_session()
frozen_graph = freeze_session(K.get_session(), output_names=[net_model.output.op.name])
graph_io.write_graph(frozen_graph, output_path, pb_model_name, as_text=False)
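The input and output node names printed by this script (net_model.input.name and net_model.output.name) are exactly the names the C# side must pass to AddInput and Fetch; in this project they come out as lstm_1_input and dense_3/Softmax, as used in the code below.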
4. Use C# to call the Python-trained model for prediction.
Class libraries: TensorflowSharp (TensorFlow), NumSharp (NumPy), NWaves (MFCC extraction)
using NumSharp;
using NWaves.FeatureExtractors;
using NWaves.FeatureExtractors.Base;
using System;
using System.Collections.Generic;
using System.IO;
using TensorFlow;
namespace TensorflowSharpDemo
{
class VoiceTest
{
private static String[] CLASS_LABELS = new string[] { "anjian", "di", "huanjie", "mingdi", "pengzhuang", "qita", "shebeiyuyin", "voice" };
private static int VOICE_INDEX = 7;
private static int FRAME_RATE = 16000;
private static int FRAME_MOVE = 8000;
private static TFGraph graph;
private static TFSession session;
public static void test()
{
//CLASS_LABELS = ("anjian", "di", "huanjie", "mingdi", "pengzhuang", "qita", "shebeiyuyin", "voice")
graph = new TFGraph();
graph.Import(File.ReadAllBytes("../../model/LSTM1.pb"), "");
session = new TFSession(graph);
List<float> audios = new WAVReader().ReadWAVFile("../../test/123.wav");
NDArray input = audioToFrames(audios.ToArray(), 16000);
logits(input, "");
}
private static float[] get_feature_result(float[] input)
{
float[] mfcc = getMfcc(input);
return mfcc;
}
/// <summary>
/// Pad the audio with zeros so that its length is a multiple of the frame shift
/// </summary>
private static float[] padd(float[] input, int frameMove)
{
int con = input.Length % frameMove;
if (con == 0)
return input;
int dis = frameMove - con;
List<float> coll = new List<float>();
int b_len = dis / 2;
for (int i = 0; i < b_len; i++)
{
coll.Add(0);
}
for (int i = 0; i < input.Length; i++)
{
coll.Add(input[i]);
}
int e_len = dis - b_len;
for (int i = 0; i < b_len; i++)
{
coll.Add(0);
}
return coll.ToArray();
}
/// <summary>
///
/// </summary>
/// <param name="input"> Audio </param>
/// <param name="frameLen"> Frame length </param>
/// <param name="frameMove"> Frame shift </param>
/// <returns></returns>
private static float[] fragment(float[] input, int frameLen, int frameMove)
{
input = padd(input, frameMove);
List<float> frames = new List<float>();
int n_step = 0;
while (n_step * frameMove + frameLen <= input.Length)
{
for (int k = n_step * frameMove; k < n_step * frameMove + frameLen; k++)
{
frames.Add(input[k]);
}
n_step++;
}
return frames.ToArray();
}
private static NDArray audioToFrames(float[] input, int frame)
{
input = fragment(input, FRAME_RATE, FRAME_MOVE);
int row = input.Length / FRAME_RATE;
NDArray array = np.array(input).reshape(row, FRAME_RATE);
return array;
}
private static NDArray predict_model(float[] input, int sr)
{
var mfcc = get_feature_result(input);
int ax = mfcc.Length / 39;
var tensor = TFTensor.FromBuffer(new TFShape(1, ax, 39), mfcc, 0, mfcc.Length);
var runner = session.GetRunner();
runner.AddInput(graph["lstm_1_input"][0], tensor).Fetch(graph["dense_3/Softmax"][0]);
var output = runner.Run();
var result = output[0];
int result_count = ((float[][])result.GetValue(jagged: true)).Length;
List<float> resultColl = new List<float>();
for (int i = 0; i < result_count; i++)
{
float[] a = ((float[][])result.GetValue(jagged: true))[i];
string s = null;
for (int j = 0; j < a.Length; j++)
{
resultColl.Add(a[j]);
}
}
return np.array(resultColl.ToArray()).reshape(1, CLASS_LABELS.Length);
}
private static void logits(NDArray input, string outFileDir)
{
int length = input.shape[0];
int n_step = 0;
List<NDArray> label = new List<NDArray>();
while (n_step < length)
{
float[] temp = (float[])input[n_step].Array;
NDArray result = predict_model(temp, FRAME_RATE);
label.Add(result);
n_step++;
}
List<float[]> voice = new List<float[]>();
List<float> voice_temp = new List<float>();
int noiseCount = 0;
for (int i = 0; i < length; i++)
{
int index = np.argmax(label[i]);
float[] sound = (float[])input[i].Array;
if (index == VOICE_INDEX)
{
noiseCount = 0;
// The last frame was noise
if (voice_temp.Count == 0)
{
voice_temp = floatToAddList(sound, voice_temp);
continue;
}
// The last frame was the voice
voice_temp = floatToAddList(sound, voice_temp, FRAME_MOVE);
continue;
}
else
{
noiseCount++;
// If more than a chosen number of consecutive frames are all noise, a complete speech segment has been detected
//if (noiseCount >= FRAME_RATE / FRAME_MOVE)
//{
// if (voice_temp.Count > 0)
// voice.Add(voice_temp.ToArray());
// voice_temp = new List<float>();
// noiseCount = 0;
//}
}
}
if (voice_temp.Count > 0)
{
voice.Add(voice_temp.ToArray());
}
for (int j = 0; j < voice.Count; j++)
{
toSaveAudio(voice[j], "voice" + j);
}
}
private static List<float> floatToAddList(float[] audio, List<float> coll)
{
foreach (float _f in audio)
{
coll.Add(_f);
}
return coll;
}
private static List<float> floatToAddList(float[] audio, List<float> coll, int frameMove)
{
int index = 0;
foreach (float _f in audio)
{
if (index >= audio.Length - frameMove)
coll.Add(_f);
index++;
}
return coll;
}
private static void toSaveAudio(float[] audio, string name)
{
string audioFileName = @"D:\WorkSpace\VS\TensorflowSharpDemo\TensorflowSharpDemo\voice\" + name + ".wav";
WaveSaveHelper.Save(audioFileName, audio);
}
/// <summary>
/// obtain MFCC
/// </summary>
private static float[] getMfcc(float[] input)
{
var mfccExtractor = new MfccExtractor(16000, 39);
var mfccVectors = mfccExtractor.ComputeFrom(input);
List<float> result = new List<float>();
foreach (FeatureVector vector in mfccVectors)
{
foreach (float _ in vector.Features)
{
result.Add(_);
}
}
return result.ToArray();
}
}
}
Problems:
1. The MFCC features obtained in Python via librosa or scipy differ from those produced by the C# implementation, so the model called from C# predicted incorrect results. My solution is a C# DLL that computes the MFCC features (based on NWaves.dll, https://github.com/ar1st0crat/NWaves) and is then called from Python. (There was no other way; I puzzled over this for days, and maybe I made a mistake somewhere. If anyone understands this, please advise; thank you very much.)
2. Testing shows that the detected audio still contains some noise, but I feel the result is acceptable.
3. Still optimizing and testing...
Summary:
Through this recent period of study and effort, the endpoint detection function has been implemented in a preliminary form, but the results still need improvement. I will keep working on it and share any new findings in time. I have only just started working with audio and am still a beginner; there are many things I do not understand thoroughly, so some statements may be inaccurate. Please forgive me, and I hope you will offer your valuable opinions and corrections. Thank you very much!
Below is the demo I wrote; if you want to take a look, you can download it. The demo contains no audio files, so after downloading you need to put your own training audio into the corresponding folders, train the model file, convert it into a .pb file, and finally copy the .pb file into the C# project for C# to call. (My ability to express this is limited; if anything is unclear, feel free to leave a message.)
Demo address: https://download.csdn.net/download/haiyangyunbao813/11155896