Analysis of the C# voice endpoint detection (VAD) implementation process
2022-06-29 08:53:00 【J ..】
Preface:
Most early methods are based on acoustic feature extraction. In the time domain, Rabiner et al. proposed a speech endpoint detection method based on short-time energy and zero-crossing rate in 1975; it was the first systematic and complete voice endpoint detection algorithm. The method uses three thresholds: the first two are a higher and a lower threshold on the short-time energy, used for a preliminary judgment of the endpoint positions, and the third is set on the short-time zero-crossing rate and used to finally determine the start and end points of the speech segment. The method requires little computation, meets real-time requirements, and performs well at high signal-to-noise ratios, so it has been widely used across speech signal processing. Later, many researchers improved on it: for example, the double-threshold detection algorithm based on absolute-value energy; Gu Yaqiang et al. proposed an endpoint detection method that exploits the difference in zero-crossing rate and short-time energy between speech and noise; and Chen Zhenbiao et al. proposed an algorithm that combines multi-subband energy features with an optimal edge-detection decision criterion.
In the frequency domain, Koba applied a fast Fourier transform to each frame of noisy speech and proposed an endpoint detection algorithm based on frequency-domain information; it was the first algorithm to analyze the signal in the frequency domain and then extract the corresponding features for endpoint detection. Many frequency-domain algorithms followed, such as endpoint detection based on information entropy and on the Mel cepstrum. Both time-domain and frequency-domain methods have their own strengths and weaknesses, so researchers tried to combine the two domains to exploit the advantages of each; Lin et al. proposed an enhanced time-frequency parameter that incorporates both time-domain and frequency-domain information. However, whether time-domain, frequency-domain, or combined, these methods only achieve high recognition rates at high signal-to-noise ratios, while real endpoint-detection environments are complex and varied and a high SNR is hard to guarantee.
Model-matching methods were therefore proposed. They usually consist of a training phase and a testing phase: in the training phase the model learns its parameters from a training corpus, much as a person learns knowledge; in the testing phase the speech to be segmented is compared against the trained model, which separates speech from non-speech. The hidden Markov model (HMM) first appeared in statistical papers and is now the most widely used model in speech recognition. In 1998, Zhu Jie et al. applied the HMM to endpoint detection of speech with background noise, the first voice endpoint detection algorithm based on an HMM. Experiments showed that its accuracy is clearly higher than that of the double-threshold method, and that it rarely misses bursts, nasals, weak fricatives, and similar sounds.
In addition, Dong Enqing et al. applied the support vector machine (SVM), which is grounded in statistical learning theory, to endpoint detection of noisy speech and verified its effectiveness. Because of the large amount of computation involved, SVM-based methods usually take a long time to train, although the actual classification is cheap and fast. These methods can still achieve high accuracy at low signal-to-noise ratios, but they have their own limitations: first, the models have few parameters, so it is hard for them to describe the data well; second, the parameters are estimated from limited data sets, so the available information may not be fully exploited. More recently, because neural-network-based methods can overcome the limitations of model matching and obtain better results, more and more people have used this kind of learning for speech endpoint detection, including deep belief networks, which has greatly broadened exploration in this field. However, such methods may fall into a local optimum depending on the choice of the initial point, which affects the endpoint detection result.
Analysis:
The current mainstream VAD algorithms are:
- Speech endpoint detection algorithm based on short-time energy and zero crossing rate
- Speech endpoint detection algorithm based on information entropy
- Speech endpoint detection algorithm based on short-time energy frequency value
- Speech endpoint detection based on neural network
From the analysis above, methods 1, 2, and 3 are suitable for high signal-to-noise-ratio (SNR) environments; at low SNR their detection performance is not ideal. The neural-network-based approach performs better than methods 1, 2, and 3 in low-SNR environments. (A minimal sketch of the classic dual-threshold idea behind method 1 is given below.)
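For reference, below is a minimal Python sketch of the classic dual-threshold idea behind method 1 (short-time energy plus zero-crossing rate). The frame sizes and threshold factors are illustrative assumptions, and the per-frame decision is a simplification of the original boundary-expansion search, not a tuned implementation.
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    # 25 ms frames with a 10 ms hop at 16 kHz (assumed values)
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])

def dual_threshold_vad(x):
    frames = frame_signal(np.asarray(x, dtype=np.float64))
    energy = np.sum(frames ** 2, axis=1)                                  # short-time energy per frame
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)   # zero-crossing rate per frame
    high, low = 0.5 * energy.max(), 0.1 * energy.max()                    # the two energy thresholds (assumed factors)
    zcr_th = 1.5 * np.mean(zcr[:10])                                      # ZCR threshold from the leading frames (assumed)
    # coarse decision with the high energy threshold, refined by the low energy and ZCR thresholds
    return (energy > high) | ((energy > low) & (zcr > zcr_th))            # True = speech frame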
Reference articles:
- https://www.docin.com/p-1537242532.html
- https://blog.csdn.net/ffmpeg4976/article/details/52349007
- https://blog.csdn.net/baienguon/article/details/80539296
Project background:
Since the company's project is developed in C#, this article uses C# to call a model trained in Python for voice endpoint detection.
Background knowledge (sharing some of the audio articles I learned from):
- Audio properties: https://blog.csdn.net/aoshilang2249/article/details/38469051
- MFCC explained in detail: https://blog.csdn.net/suan2014/article/details/82021324
- Fourier transform: https://blog.csdn.net/u011947630/article/details/81513075
- I have forgotten to bookmark some of the others. In short, whenever something is unclear, look it up.
Implementation idea:
- Collect positive and negative sample audio files (human voice and noise), extract their MFCC features in Python to train a model, and save the model file.
- Because I train with a Keras-based LSTM and Keras provides no C# interface, the model has to be converted into a .pb file, which is then loaded through TensorflowSharp to produce predictions.
- (C#) I set the collection window to 1 s and the shift between windows to 500 ms.
- First collect 1 s of audio and extract its MFCC features for prediction. If the prediction is human voice, this is the starting point of the sound and the audio is saved; if it is noise, it is discarded. Then move forward by 500 ms, extract the MFCC features of the next 1 s window, and predict again: if it is voice and the previous window was also voice, only the new 500 ms offset of audio is appended; if the previous window was noise, this window is a new starting point. Repeating this loop eventually detects the segments of vocal audio that match the model's characteristics (a rough code sketch of this loop follows after this list).
- That is the general idea of my VAD implementation. It may not be perfect, but it is the best solution I could come up with; my ability is limited after all, so if you have a better approach, your guidance is very welcome. Thank you very much!
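Below is a rough Python sketch of that loop. It assumes a hypothetical predict() callable that returns the class index of a 1 s frame and a voice_index marking the human-voice class; it only illustrates the windowing logic described above, not the actual project code (the real C# implementation appears later).
import numpy as np

FS = 16000        # sample rate
WIN = FS          # 1 s analysis window
HOP = FS // 2     # 500 ms hop

def segment_voice(samples, predict, voice_index):
    segments, current = [], []
    prev_was_voice = False
    for start in range(0, len(samples) - WIN + 1, HOP):
        frame = samples[start:start + WIN]
        is_voice = (predict(frame) == voice_index)
        if is_voice:
            # a voiced window after noise starts a new segment; after voice, only the new 500 ms is appended
            current.extend(frame if not prev_was_voice else frame[-HOP:])
        elif current:
            segments.append(np.array(current))   # noise after voice closes the current segment
            current = []
        prev_was_voice = is_voice
    if current:
        segments.append(np.array(current))
    return segments
In the C# code later, the segment-closing step is additionally guarded by a count of consecutive noise frames (it is left commented out there).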
Code:
1. Python: train the sample model (based on a Keras LSTM):
References: https://github.com/Renovamen/Speech-Emotion-Recognition
https://github.com/hcmlab/vadnet
Training sample audio: 16 kHz, 16-bit
Part of the main code (only the main implementation is posted because there is a lot of code):
from keras.utils import np_utils
from DNN_Model import LSTM_Model
from Utilities import get_data
DATA_PATH = 'DataSet/'
CLASS_LABELS = ("hg", "nd", "pz")
def train():
FLATTEN = False
NUM_LABELS = len(CLASS_LABELS)
SVM = False
x_train, x_test, y_train, y_test = get_data(DATA_PATH, class_labels=CLASS_LABELS, flatten=FLATTEN, _svm=SVM)
y_train = np_utils.to_categorical(y_train)
y_test_train = np_utils.to_categorical(y_test, len(CLASS_LABELS))
print('-------------------------------- LSTM Start --------------------------------')
model = LSTM_Model(input_shape=x_train[0].shape, num_classes=NUM_LABELS)
model.train(x_train, y_train, x_test, y_test_train, n_epochs=100)
model.evaluate(x_test, y_test)
model.save_model("LSTM1")
print('-------------------------------- LSTM End --------------------------------')
train()
Read data and extract MFCC features:
# Read data from a given folder and extract MFCC features
import os
import sys
from typing import Tuple
import numpy as np
import scipy.io.wavfile as wav
from sklearn.model_selection import train_test_split
from keras.models import model_from_json
from sklearn.externals import joblib
from MFCC_COM import get_mfcc
import scipy.io.wavfile
mean_signal_length = 16000
"""
get_feature(): Extract an audio MFCC Eigenvector
Input :
file_path(str): The audio path
mfcc_len(int): Every frame MFCC Characteristic number
flatten(bool): Whether dimension reduction data
Output :
numpy.ndarray: Of the audio MFCC Eigenvector
"""
def get_feature(file_path: str, mfcc_len: int = 39, flatten: bool = False):
# Some audio files raise an "Incomplete wav chunk" error when read with scipy.io.wavfile
# apparently because scipy only reads PCM and float data; other WAV encodings are not supported...
# fs, signal = wav.read(file_path)
# signal, fs = librosa.load(file_path, 16000)
fs, signal = scipy.io.wavfile.read(file_path)
s_len = len(signal)
# If the signal is shorter than mean_signal_length, pad it
if s_len < mean_signal_length:
pad_len = mean_signal_length - s_len
pad_rem = pad_len % 2
pad_len //= 2
signal = np.pad(signal, (pad_len, pad_len + pad_rem), 'constant', constant_values=0)
# Otherwise, trim it down to mean_signal_length
else:
pad_len = s_len - mean_signal_length
pad_len //= 2
signal = signal[pad_len:pad_len + mean_signal_length]
mel_coefficients = get_dll(signal, fs, mfcc_len)
# use SVM & MLP Dimension reduction data is required for modeling
if flatten:
mel_coefficients = np.ravel(mel_coefficients)
return mel_coefficients
def get_feature_result(signal, fs, mfcc_len: int = 39, flatten: bool = False):
# Some audio files raise an "Incomplete wav chunk" error when read with scipy.io.wavfile
# apparently because scipy only reads PCM and float data; other WAV encodings are not supported...
# fs, signal = wav.read(file_path)
# signal, fs = librosa.load(file_path)
s_len = len(signal)
# If the signal is shorter than mean_signal_length, pad it
if s_len < mean_signal_length:
pad_len = mean_signal_length - s_len
pad_rem = pad_len % 2
pad_len //= 2
signal = np.pad(signal, (pad_len, pad_len + pad_rem), 'constant', constant_values=0)
# Otherwise, trim it down to mean_signal_length
else:
pad_len = s_len - mean_signal_length
pad_len //= 2
signal = signal[pad_len:pad_len + mean_signal_length]
mel_coefficients = get_mfcc(signal, fs, mfcc_len)
# mel_coefficients = librosa.feature.mfcc(signal, fs, n_mfcc=39)
# use SVM & MLP Dimension reduction data is required for modeling
if flatten:
mel_coefficients = np.ravel(mel_coefficients)
return mel_coefficients
def get_dll(signal, fs, mfcc_len):
mfcc = get_mfcc(signal, fs, mfcc_len)
mfcc = mfcc.Array
array = []
for _ in range(mfcc.Length):
array.append(mfcc[_])
return np.array(array).reshape(-1, mfcc_len)
def get_data(data_path: str, mfcc_len: int = 39,
class_labels: Tuple = ("angry", "fear", "happy", "neutral", "sad", "surprise"), flatten: bool = False,
_svm: bool = False):
data = []
labels = []
cur_dir = os.getcwd()
sys.stderr.write('Curdir: %s\n' % cur_dir)
os.chdir(data_path)
# Traversal folder
for i, directory in enumerate(class_labels):
sys.stderr.write("Started reading folder %s\n" % directory)
os.chdir(directory)
# Read the audio in this folder
for filename in os.listdir('.'):
if not filename.endswith('wav'):
continue
filepath = os.getcwd() + '/' + filename
# Extract the feature vector of the audio
feature_vector = get_feature(file_path=filepath, mfcc_len=mfcc_len, flatten=flatten)
data.append(feature_vector)
labels.append(i)
sys.stderr.write("Ended reading folder %s\n" % directory)
os.chdir('..')
os.chdir(cur_dir)
# Divide the training set and the test set
x_train, x_test, y_train, y_test = train_test_split(np.array(data), np.array(labels), test_size=0.001,
random_state=42)
return np.array(x_train), np.array(x_test), np.array(y_train), np.array(y_test)
'''
load_model(): load a trained model
Input:
    model_name (str): the model name
    load_model (str): the model type (DNN / ML)
Output:
    model: the loaded model
'''
def load_model(model_name: str, load_model: str):
if load_model == 'DNN':
# load json
model_path = 'Models/' + model_name + '.h5'
model_json_path = 'Models/' + model_name + '.json'
json_file = open(model_json_path, 'r')
loaded_model_json = json_file.read()
json_file.close()
model = model_from_json(loaded_model_json)
# Load weights
model.load_weights(model_path)
elif load_model == 'ML':
model_path = 'Models/' + model_name + '.m'
model = joblib.load(model_path)
return model
Model:
# CNN & LSTM
import sys
import numpy as np
from keras import Sequential
from keras.layers import LSTM as KERAS_LSTM, Dense, Dropout, Conv2D, Flatten, BatchNormalization, Activation, \
MaxPooling2D
from Common_Model import Common_Model
# The CNN and LSTM classes inherit from this class (and implement the make_model method)
class DNN_Model(Common_Model):
'''
__init__(): initialize the neural network
Input:
    input_shape (tuple): shape of the input tensor
    num_classes (int): number of label classes
'''
def __init__(self, input_shape, num_classes, **params):
super(DNN_Model, self).__init__(**params)
self.input_shape = input_shape
self.model = Sequential()
self.make_model()
self.model.add(Dense(num_classes, activation='softmax'))
self.model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
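# The model is compiled with binary_crossentropy even though the output is a multi-class softmax; categorical_crossentropy is the more conventional pairing for that setup.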
print(self.model.summary(), file=sys.stderr)
'''
save_model(): save the model weights as model_name.h5 and the architecture as model_name.json under the /Models directory
'''
def save_model(self, model_name):
h5_save_path = 'Models/' + model_name + '.h5'
self.model.save_weights(h5_save_path)
save_json_path = 'Models/' + model_name + '.json'
with open(save_json_path, "w") as json_file:
json_file.write(self.model.to_json())
'''
train(): Train the model on a given training set
Input :
x_train (numpy.ndarray): Training set samples
y_train (numpy.ndarray): Training set label
x_val (numpy.ndarray): Test set samples
y_val (numpy.ndarray): Test set label
n_epochs (int): epoch Count
'''
def train(self, x_train, y_train, x_val=None, y_val=None, n_epochs=50):
best_acc = 0
if x_val is None or y_val is None:
x_val, y_val = x_train, y_train
for i in range(n_epochs):
# Shuffle the training data before every epoch
p = np.random.permutation(len(x_train))
x_train = x_train[p]
y_train = y_train[p]
'''
fit(x, y, batch_size=32, epochs=10, verbose=1, callbacks=None,
    validation_split=0.0, validation_data=None, shuffle=True,
    class_weight=None, sample_weight=None, initial_epoch=0)
x: input data; a numpy array if the model has a single input, or a list of numpy arrays (one per input) if it has several
y: labels, as a numpy array
batch_size: integer, number of samples per gradient-update batch; each batch produces one gradient-descent step on the objective
epochs: integer, the epoch at which training stops; if initial_epoch is not set this is the total number of training rounds, otherwise the total is epochs - initial_epoch
verbose: logging mode; 0 = no output, 1 = progress bar, 2 = one line per epoch
callbacks: list of keras.callbacks.Callback objects, called at the appropriate points during training
validation_split: float between 0 and 1, fraction of the training data held out as a validation set; it is not trained on, and loss, accuracy and other metrics are evaluated on it at the end of each epoch. Note that the split happens before shuffling, so if the data is ordered you must shuffle it yourself before using validation_split, otherwise the validation sample may be unrepresentative
validation_data: a (X, y) tuple specifying the validation set; overrides validation_split
shuffle: boolean or string, normally a boolean indicating whether to shuffle the sample order every epoch; the string "batch" is a special case for HDF5 data and shuffles within each batch
class_weight: dict mapping classes to weights, used to reweight the loss during training (training only)
sample_weight: numpy array of per-sample weights for the loss during training (training only); either a 1-D vector of the same length as the samples for one-to-one weighting, or, for temporal data, a (samples, sequence_length) array weighting every timestep; in the latter case compile the model with sample_weight_mode='temporal'
initial_epoch: epoch from which to start training, useful for resuming a previous run
fit returns a History object; its History.history attribute records the loss and metric values per epoch, including those of the validation set if one was given
'''
self.model.fit(x_train, y_train, batch_size=32, epochs=1)
# Evaluate the loss and accuracy on the validation set
loss, acc = self.model.evaluate(x_val, y_val)
if acc > best_acc:
best_acc = acc
print("TRAIN:%d / %d" % (i, n_epochs))
self.trained = True
'''
recognize_one(): predict the class of a single audio sample
Input:
    sample: the sample to predict
Output:
    the predicted class index and the confidence probabilities (int, numpy.ndarray)
'''
def recognize_one(self, sample):
# The model has not been trained or loaded
if not self.trained:
sys.stderr.write("No Model.")
sys.exit(-1)
return np.argmax(self.model.predict(np.array([sample]))), self.model.predict(np.array([sample]))[0]
def make_model(self):
raise NotImplementedError()
class CNN_Model(DNN_Model):
def __init__(self, **params):
params['name'] = 'CNN'
super(CNN_Model, self).__init__(**params)
def make_model(self):
self.model.add(Conv2D(8, (13, 13), input_shape=(self.input_shape[0], self.input_shape[1], 1)))
self.model.add(BatchNormalization(axis=-1))
self.model.add(Activation('relu'))
self.model.add(Conv2D(8, (13, 13)))
self.model.add(BatchNormalization(axis=-1))
self.model.add(Activation('relu'))
self.model.add(MaxPooling2D(pool_size=(2, 1)))
self.model.add(Conv2D(8, (13, 13)))
self.model.add(BatchNormalization(axis=-1))
self.model.add(Activation('relu'))
self.model.add(Conv2D(8, (2, 2)))
self.model.add(BatchNormalization(axis=-1))
self.model.add(Activation('relu'))
self.model.add(MaxPooling2D(pool_size=(2, 1)))
self.model.add(Flatten())
self.model.add(Dense(64))
self.model.add(BatchNormalization())
self.model.add(Activation('relu'))
self.model.add(Dropout(0.2))
class LSTM_Model(DNN_Model):
def __init__(self, **params):
params['name'] = 'LSTM'
super(LSTM_Model, self).__init__(**params)
def make_model(self):
#
self.model.add(KERAS_LSTM(128, input_shape=(self.input_shape[0], self.input_shape[1])))
self.model.add(Dropout(0.5))
self.model.add(Dense(32, activation='relu')) # Standard one-dimensional fully connected layer
self.model.add(Dense(16, activation='tanh'))
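Note that the final Dense(num_classes, activation='softmax') output layer is appended in the DNN_Model base-class __init__ shown above, so make_model only builds the hidden layers.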
2. Python can obtain MFCC features through librosa or scipy, but I found that the MFCC values computed in C# differ from those computed in Python, so the predictions were inaccurate (I don't know why; the formulas are complicated, and I would welcome guidance from anyone who understands this). My workaround is to have Python call a C# DLL to compute the MFCC features.
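For what it is worth, one common cause of such mismatches (an assumption here, not something verified against NWaves) is that different MFCC implementations use different defaults: pre-emphasis, window length and hop, FFT size, number of mel filters, the mel-scale formula, liftering, and whether the 0th coefficient is kept. Below is a hedged sketch that pins librosa's parameters explicitly so they can be compared one by one with the settings of the C# extractor; the values are placeholders, not the ones NWaves actually uses.
import numpy as np
import librosa

def mfcc_explicit(signal, fs=16000, n_mfcc=39):
    # every keyword is spelled out so it can be matched against the C# extractor's configuration
    return librosa.feature.mfcc(
        y=np.asarray(signal, dtype=np.float32), sr=fs, n_mfcc=n_mfcc,
        n_fft=512,        # FFT size (placeholder)
        win_length=400,   # 25 ms window (placeholder)
        hop_length=160,   # 10 ms hop (placeholder)
        n_mels=40,        # number of mel filters (placeholder)
        htk=True,         # HTK-style mel scale (placeholder)
        lifter=0,         # no liftering (placeholder)
    ).T                   # one row per frame, matching the layout used elsewhere in this project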
C#: compile it into a DLL
Reference: https://github.com/ar1st0crat/NWaves
using NumSharp;
using NWaves.FeatureExtractors;
using NWaves.FeatureExtractors.Base;
using System.Collections.Generic;
namespace MFCC
{
public class MYMFCC
{
public NDArray GetMFCC(float[] input, int sr, int mfcc_size)
{
var mfccExtractor = new MfccExtractor(sr, mfcc_size);
var mfccVectors = mfccExtractor.ComputeFrom(input);
List<float> result = new List<float>();
foreach (FeatureVector vector in mfccVectors)
{
foreach (float _ in vector.Features)
{
result.Add(_);
}
}
return np.array(result.ToArray());
}
}
}
Python (referencing the relevant DLLs):
import clr
clr.FindAssembly("MFCCSharp.dll")
clr.FindAssembly("NWaves.dll")
clr.FindAssembly("NumSharp.dll")
from MFCC import *
from NWaves import *
from NumSharp import *
instance = MYMFCC()
def get_mfcc(input, sr, mfcc_len):
instance = MYMFCC()
return instance.GetMFCC(input, sr, mfcc_len)
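Note: depending on the pythonnet version, it may also be necessary to add the DLL directory to sys.path and call clr.AddReference("MFCCSharp") (and likewise for NWaves and NumSharp) before the imports succeed, since clr.FindAssembly only locates an assembly. This is an assumption to check if the import fails, not something verified here.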
3. Because my Python model is trained with Keras and Keras provides no C# interface, the .h5 model file produced by Keras is converted into a .pb file, which C# then loads through TensorflowSharp.
Python (.h5 -> .pb):
Reference: https://blog.csdn.net/qq_25109263/article/details/81285952
from keras.models import load_model
import tensorflow as tf
from keras import backend as K, Sequential
from tensorflow.python.framework import graph_io
from keras.models import model_from_json
def freeze_session(session, keep_var_names=None, output_names=None, clear_devices=True):
from tensorflow.python.framework.graph_util import convert_variables_to_constants
graph = session.graph
with graph.as_default():
freeze_var_names = list(set(v.op.name for v in tf.global_variables()).difference(keep_var_names or []))
output_names = output_names or []
output_names += [v.op.name for v in tf.global_variables()]
input_graph_def = graph.as_graph_def()
if clear_devices:
for node in input_graph_def.node:
node.device = ""
frozen_graph = convert_variables_to_constants(session, input_graph_def,
output_names, freeze_var_names)
return frozen_graph
def load_model():
model_path = 'Models/LSTM1.h5'
model_json_path = 'Models/LSTM1.json'
json_file = open(model_json_path, 'r')
loaded_model_json = json_file.read()
json_file.close()
model = model_from_json(loaded_model_json)
# Load weights
model.load_weights(model_path)
return model
"""---------------------------------- Configuration path -----------------------------------"""
epochs = 20
h5_model_path = 'Models/LSTM1.h5'
output_path = 'PBModels'
pb_model_name = 'LSTM1.pb'
"""---------------------------------- Import keras Model ------------------------------"""
K.set_learning_phase(0)
net_model = load_model()
print('input is :', net_model.input.name)
print('output is:', net_model.output.name)
"""---------------------------------- Save as .pb Format ------------------------------"""
sess = K.get_session()
frozen_graph = freeze_session(K.get_session(), output_names=[net_model.output.op.name])
graph_io.write_graph(frozen_graph, output_path, pb_model_name, as_text=False)
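The input and output node names printed by this script (net_model.input.name and net_model.output.name) are exactly the names the C# side must pass to AddInput and Fetch; in this project they come out as lstm_1_input and dense_3/Softmax, as used in the code below.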
4. Use C# to call the Python-trained model for prediction.
Class libraries: TensorflowSharp (TensorFlow), NumSharp (NumPy), NWaves (MFCC extraction)
using NumSharp;
using NWaves.FeatureExtractors;
using NWaves.FeatureExtractors.Base;
using System;
using System.Collections.Generic;
using System.IO;
using TensorFlow;
namespace TensorflowSharpDemo
{
class VoiceTest
{
private static String[] CLASS_LABELS = new string[] { "anjian", "di", "huanjie", "mingdi", "pengzhuang", "qita", "shebeiyuyin", "voice" };
private static int VOICE_INDEX = 7;
private static int FRAME_RATE = 16000;
private static int FRAME_MOVE = 8000;
private static TFGraph graph;
private static TFSession session;
public static void test()
{
//CLASS_LABELS = ("anjian", "di", "huanjie", "mingdi", "pengzhuang", "qita", "shebeiyuyin", "voice")
graph = new TFGraph();
graph.Import(File.ReadAllBytes("../../model/LSTM1.pb"), "");
session = new TFSession(graph);
List<float> audios = new WAVReader().ReadWAVFile("../../test/123.wav");
NDArray input = audioToFrames(audios.ToArray(), 16000);
logits(input, "");
}
private static float[] get_feature_result(float[] input)
{
float[] mfcc = getMfcc(input);
return mfcc;
}
/// <summary>
/// Pad the audio with zeros so that its length is a multiple of the frame shift
/// </summary>
private static float[] padd(float[] input, int frameMove)
{
int con = input.Length % frameMove;
if (con == 0)
return input;
int dis = frameMove - con;
List<float> coll = new List<float>();
int b_len = dis / 2;
for (int i = 0; i < b_len; i++)
{
coll.Add(0);
}
for (int i = 0; i < input.Length; i++)
{
coll.Add(input[i]);
}
int e_len = dis - b_len;
for (int i = 0; i < b_len; i++)
{
coll.Add(0);
}
return coll.ToArray();
}
/// <summary>
///
/// </summary>
/// <param name="input"> Audio </param>
/// <param name="frameLen"> Frame length </param>
/// <param name="frameMove"> Frame shift </param>
/// <returns></returns>
private static float[] fragment(float[] input, int frameLen, int frameMove)
{
input = padd(input, frameMove);
List<float> frames = new List<float>();
int n_step = 0;
while (n_step * frameMove + frameLen <= input.Length)
{
for (int k = n_step * frameMove; k < n_step * frameMove + frameLen; k++)
{
frames.Add(input[k]);
}
n_step++;
}
return frames.ToArray();
}
private static NDArray audioToFrames(float[] input, int frame)
{
input = fragment(input, FRAME_RATE, FRAME_MOVE);
int row = input.Length / FRAME_RATE;
NDArray array = np.array(input).reshape(row, FRAME_RATE);
return array;
}
private static NDArray predict_model(float[] input, int sr)
{
var mfcc = get_feature_result(input);
int ax = mfcc.Length / 39;
var tensor = TFTensor.FromBuffer(new TFShape(1, ax, 39), mfcc, 0, mfcc.Length);
var runner = session.GetRunner();
runner.AddInput(graph["lstm_1_input"][0], tensor).Fetch(graph["dense_3/Softmax"][0]);
var output = runner.Run();
var result = output[0];
int result_count = ((float[][])result.GetValue(jagged: true)).Length;
List<float> resultColl = new List<float>();
for (int i = 0; i < result_count; i++)
{
float[] a = ((float[][])result.GetValue(jagged: true))[i];
string s = null;
for (int j = 0; j < a.Length; j++)
{
resultColl.Add(a[j]);
}
}
return np.array(resultColl.ToArray()).reshape(1, CLASS_LABELS.Length);
}
private static void logits(NDArray input, string outFileDir)
{
int length = input.shape[0];
int n_step = 0;
List<NDArray> label = new List<NDArray>();
while (n_step < length)
{
float[] temp = (float[])input[n_step].Array;
NDArray result = predict_model(temp, FRAME_RATE);
label.Add(result);
n_step++;
}
List<float[]> voice = new List<float[]>();
List<float> voice_temp = new List<float>();
int noiseCount = 0;
for (int i = 0; i < length; i++)
{
int index = np.argmax(label[i]);
float[] sound = (float[])input[i].Array;
if (index == VOICE_INDEX)
{
noiseCount = 0;
// The last frame was noise
if (voice_temp.Count == 0)
{
voice_temp = floatToAddList(sound, voice_temp);
continue;
}
// The last frame was the voice
voice_temp = floatToAddList(sound, voice_temp, FRAME_MOVE);
continue;
}
else
{
noiseCount++;
// If more than a chosen number of consecutive frames are all noise, a complete speech segment has been detected
//if (noiseCount >= FRAME_RATE / FRAME_MOVE)
//{
// if (voice_temp.Count > 0)
// voice.Add(voice_temp.ToArray());
// voice_temp = new List<float>();
// noiseCount = 0;
//}
}
}
if (voice_temp.Count > 0)
{
voice.Add(voice_temp.ToArray());
}
for (int j = 0; j < voice.Count; j++)
{
toSaveAudio(voice[j], "voice" + j);
}
}
private static List<float> floatToAddList(float[] audio, List<float> coll)
{
foreach (float _f in audio)
{
coll.Add(_f);
}
return coll;
}
private static List<float> floatToAddList(float[] audio, List<float> coll, int frameMove)
{
int index = 0;
foreach (float _f in audio)
{
if (index >= audio.Length - frameMove)
coll.Add(_f);
index++;
}
return coll;
}
private static void toSaveAudio(float[] audio, string name)
{
string audioFileName = @"D:\WorkSpace\VS\TensorflowSharpDemo\TensorflowSharpDemo\voice\" + name + ".wav";
WaveSaveHelper.Save(audioFileName, audio);
}
/// <summary>
/// obtain MFCC
/// </summary>
private static float[] getMfcc(float[] input)
{
var mfccExtractor = new MfccExtractor(16000, 39);
var mfccVectors = mfccExtractor.ComputeFrom(input);
List<float> result = new List<float>();
foreach (FeatureVector vector in mfccVectors)
{
foreach (float _ in vector.Features)
{
result.Add(_);
}
}
return result.ToArray();
}
}
}
Problems:
1. The MFCC features obtained in Python via librosa or scipy differ from those produced by the C# implementation, so the model called from C# predicted incorrect results. My solution is a C# DLL that computes the MFCC features (based on NWaves.dll, https://github.com/ar1st0crat/NWaves) and is then called from Python. (There was no other way; I puzzled over this for days, and maybe I made a mistake somewhere. If anyone understands this, please advise; thank you very much.)
2. Testing shows that the detected audio still contains some noise, but I feel the result is acceptable.
3. Still optimizing and testing...
Summary:
Through this recent period of study and effort, the endpoint detection function has been implemented in a preliminary form, but the results still need improvement. I will keep working on it and share any new findings in time. I have only just started working with audio and am still a beginner; there are many things I do not understand thoroughly, so some statements may be inaccurate. Please forgive me, and I hope you will offer your valuable opinions and corrections. Thank you very much!
Below is the demo I wrote; if you want to take a look, you can download it. The demo contains no audio files, so after downloading you need to put your own training audio into the corresponding folders, train the model file, convert it into a .pb file, and finally copy the .pb file into the C# project for C# to call. (My ability to express this is limited; if anything is unclear, feel free to leave a message.)
Demo address: https://download.csdn.net/download/haiyangyunbao813/11155896