Speech Processing Project

Linear Predictive Coding Using a Voice-Excited Vocoder

ECE 5525

Osama Saraireh

Fall 2005

Dr. Veton Kepuska

The basic form of a pitch-excited LPC vocoder is shown below.

The speech signal is low-pass filtered to no more than one half the system sampling frequency, and then A/D conversion is performed. The speech is processed on a frame-by-frame basis, where the analysis frame length can be variable. For each frame, a pitch-period estimate is made along with a voicing decision. A linear predictive coefficient analysis is performed to obtain an inverse model of the speech spectrum, A(z). In addition, a gain parameter G, representing some function of the speech energy, is computed. An encoding procedure is then applied to transform the analyzed parameters into an efficient set of transmission parameters, with the goal of minimizing the degradation in the synthesized speech for a specified number of bits. Knowing the transmission frame rate and the number of bits used for each transmission parameter, one can compute the transmission bit rate over a noise-free channel.

At the receiver, the transmitted parameters are decoded into quantized versions of the coefficient-analysis and pitch-estimation parameters. An excitation signal for synthesis is then constructed from the transmitted pitch and voicing parameters. The excitation signal drives a synthesis filter 1/A(z) corresponding to the analysis model A(z). The digital samples s^(n) are then passed through a D/A converter and low-pass filtered to generate the synthetic speech s(t), using a filter similar to the one at the input of the system. Either before or after synthesis, the gain is used to match the synthetic speech energy to the actual speech energy.

Linear predictive coding (LPC) of speech

The linear predictive coding (LPC) method for speech analysis and synthesis is based on modeling the vocal tract as a linear all-pole (IIR) filter having the system transfer function:

H(z) = G / (1 - sum_{k=1}^{p} a[k] z^{-k})

where p is the number of poles, G is the filter gain, and a[k] are the parameters that determine the poles. There are two mutually exclusive excitation functions, used to model voiced and unvoiced speech sounds. On a short-time basis, voiced speech is considered periodic with a fundamental frequency F0 and a pitch period 1/F0, which depends on the speaker. Hence, voiced speech is generated by exciting the all-pole filter model with a periodic impulse train. On the other hand, unvoiced sounds are generated by exciting the all-pole filter with the output of a random-noise generator.
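As an illustrative sketch of these two excitation types (written here in Python rather than the MATLAB used later in this report; the frame length and pitch period are assumed values for the example):

```python
import random

def impulse_train(frame_len, pitch_period):
    # Periodic impulse train: the excitation used for voiced frames
    return [1.0 if n % pitch_period == 0 else 0.0 for n in range(frame_len)]

def noise_excitation(frame_len, seed=0):
    # White-noise excitation: used for unvoiced frames
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(frame_len)]

# 160 samples at 8 kHz is a 20 ms frame; a pitch period of 80 samples
# corresponds to F0 = 100 Hz (illustrative values only)
voiced = impulse_train(160, 80)
unvoiced = noise_excitation(160)
```

Either sequence is then passed through the all-pole filter 1/A(z) to produce synthetic speech.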

The fundamental difference between these two types of speech sounds comes from the way they are produced. The vibrations of the vocal cords produce voiced sounds, and the rate at which the vocal cords vibrate dictates the pitch of the sound. Unvoiced sounds, on the other hand, do not rely on vibration of the vocal cords; they are created by constrictions of the vocal tract. The vocal cords remain open and the constrictions of the vocal tract force air out to produce the unvoiced sounds.

Given a short segment of a speech signal, say about 20 ms or 160 samples at a sampling rate of 8 kHz, the speech encoder at the transmitter must determine the proper excitation function, the pitch period for voiced speech, the gain, and the coefficients a[k]. The block diagram below describes the encoder/decoder for linear predictive coding. The parameters of the model are determined adaptively from the data, encoded into a binary sequence, and transmitted to the receiver. At the receiver, the speech signal is then synthesized from the model and the excitation signal.

The parameters of the all-pole filter model are determined from the speech samples by means of linear prediction. Specifically, the output of the linear prediction filter is

s^(n) = sum_{k=1}^{p} a[k] s(n-k)

and the corresponding error between the observed sample s(n) and the predicted value s^(n) is

e(n) = s(n) - sum_{k=1}^{p} a[k] s(n-k)

By minimizing the sum of the squared errors we can determine the pole parameters of the model. Differentiating the sum above with respect to each of the parameters and equating the result to zero yields a set of p linear equations

sum_{k=1}^{p} a[k] R(m-k) = R(m),  m = 1, 2, ..., p

where R(m) represents the autocorrelation of the sequence s(n), defined as

R(m) = sum_{n} s(n) s(n+m)

The equations above can be expressed in matrix form as

R a = r

where R is a p x p autocorrelation matrix, r is a p x 1 autocorrelation vector, and a is a p x 1 vector of model parameters.
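These normal equations can be solved efficiently with the Levinson-Durbin recursion (the report's MATLAB version appears below). As a self-contained sketch, here is a pure-Python version; the function name and the AR(1) test values are assumptions for illustration, and the sign convention follows the predictor s^(n) = sum a[k] s(n-k) above:

```python
def levinson_durbin(R, p):
    # Solve sum_{k=1..p} a[k] R(m-k) = R(m) for m = 1..p, given R[0..p].
    # Returns predictor coefficients a[1..p] and the final residual energy.
    a = []                      # a[i-1] holds model coefficient a[i]
    err = R[0]                  # initial prediction-error energy E0 = R(0)
    for m in range(1, p + 1):
        # reflection (PARCOR) coefficient for order m
        acc = R[m] - sum(a[i] * R[m - 1 - i] for i in range(len(a)))
        k = acc / err
        # order update of the coefficient vector
        a = [a[i] - k * a[m - 2 - i] for i in range(m - 1)] + [k]
        err *= (1.0 - k * k)
    return a, err

# Autocorrelation of a first-order AR process s(n) = 0.5 s(n-1) + w(n)
# with unit-variance w(n): R(m) = (4/3) * 0.5**m
R = [4.0 / 3.0, 2.0 / 3.0, 1.0 / 3.0]
a, err = levinson_durbin(R, 2)   # recovers a near [0.5, 0.0], err near 1.0
```

Note that the MATLAB listing later in this report stores its coefficient vector A with the opposite sign (it inverse-filters with [1 A']), so A there appears to correspond to -a here; the residual energy err matches the quantity used there for the gain, G = sqrt(err).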

[row col] = size(data);

if col==1, data=data'; end

nframe = 0;

msfr = round(sr/1000*fr); % Convert ms to samples

msfs = round(sr/1000*fs); % Convert ms to samples

duration = length(data);

speech = filter([1 -preemp], 1, data)'; % Preemphasize speech

msoverlap = msfs - msfr;

ramp = [0:1/(msoverlap-1):1]'; % Compute part of window

for frameIndex=1:msfr:duration-msfs+1 % frame rate=20ms

frameData = speech(frameIndex:(frameIndex+msfs-1)); % frame size=30ms

nframe = nframe+1;

autoCor = xcorr(frameData); % Compute the autocorrelation

autoCorVec = autoCor(msfs+[0:L]);

These equations can be solved in MATLAB by using the Levinson-Durbin algorithm.

% Levinson's method

err(1) = autoCorVec(1);

k(1) = 0;

A = [];

for index=1:L

numerator = [1 A.']*autoCorVec(index+1:-1:2);

denominator = -1*err(index);

k(index) = numerator/denominator; % PARCOR coeffs

A = [A+k(index)*flipud(A); k(index)];

err(index+1) = (1-k(index)^2)*err(index);

end

The gain parameter of the filter can be obtained from the input-output relationship

G x(n) = s(n) - sum_{k=1}^{p} a[k] s(n-k)

where x(n) represents the input sequence.

In terms of the error sequence, this is simply

G x(n) = e(n)

and summing the squares over the frame gives

G^2 sum_{n} x^2(n) = sum_{n} e^2(n)

If the input excitation is normalized to unit energy by design, then

G^2 = sum_{n} e^2(n)

where G^2 is set equal to the residual energy resulting from the least-squares optimization.
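This gain computation can be sketched in Python (an illustrative sketch, not the report's MATLAB; the frame values and the one-tap predictor are assumptions for the example): inverse-filtering the frame with the predictor gives the residual, and G^2 is its energy.

```python
def residual(s, a):
    # Inverse filter: e(n) = s(n) - sum_{k=1..p} a[k] s(n-k),
    # assuming s(n) = 0 for n < 0
    p = len(a)
    return [s[n] - sum(a[k] * s[n - 1 - k] for k in range(min(p, n)))
            for n in range(len(s))]

frame = [1.0, 0.5, 0.25, 0.125]   # hypothetical frame: s(n) = 0.5**n
a = [0.5]                         # hypothetical first-order predictor
e = residual(frame, a)            # e = [1.0, 0.0, 0.0, 0.0]
G = sum(x * x for x in e) ** 0.5  # G^2 = residual energy, so G = 1.0
```

Because the predictor matches the frame exactly here, all the residual energy sits in the first sample.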

% filter response

if 0

gain=0;

cft=0:(1/255):1;

for index=1:L

gain = gain + aCoeff(index,nframe)*exp(-i*2*pi*cft).^index;

end

gain = abs(1./gain);

spec(:,nframe) = 20*log10(gain(1:128))';

plot(20*log10(gain));

title(nframe);

drawnow;

end

if 0

impulseResponse = filter(1, aCoeff(:,nframe), [1 zeros(1,255)]);

freqResp = 20*log10(abs(fft(impulseResponse)));

plot(freqResp);

end

Once the LPC coefficients are computed, we can determine whether the input speech frame is voiced and, if it is indeed voiced, what the pitch is. The pitch can be determined by computing the autocorrelation sequence of the prediction error (residual),

r_e(k) = sum_{n} e(n) e(n+k)

and normalizing it by the residual energy r_e(0). The pitch is detected by finding the peak of the normalized sequence r_e(k)/r_e(0) over the time interval corresponding to 3 to 15 ms within the 20 ms analysis frame. If the value of this peak is at least 0.25, the frame of speech is considered voiced, with a pitch period equal to the lag at which the peak occurs.

If the peak value is less than 0.25, the frame of speech is considered unvoiced and the pitch is set to zero.
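This decision rule can be sketched in Python (the lag range and 0.25 threshold come from the text above; the function name and test signals are assumptions for illustration):

```python
def detect_pitch(resid, sr=8000, lo_ms=3, hi_ms=15, threshold=0.25):
    # Search the normalized residual autocorrelation r(k)/r(0) over lags
    # corresponding to 3-15 ms; a peak >= 0.25 means voiced, with the
    # peak lag as the pitch period. Otherwise the frame is unvoiced.
    n = len(resid)
    r0 = sum(x * x for x in resid)
    if r0 == 0:
        return 0
    lo = int(sr * lo_ms / 1000)
    hi = min(int(sr * hi_ms / 1000), n - 1)
    best_lag, best_val = 0, 0.0
    for lag in range(lo, hi + 1):
        r = sum(resid[i] * resid[i + lag] for i in range(n - lag)) / r0
        if r > best_val:
            best_lag, best_val = lag, r
    return best_lag if best_val >= threshold else 0

# A residual with impulses every 80 samples (100 Hz at 8 kHz) is voiced;
# a single isolated impulse has no periodicity in the search range.
train = [1.0 if i % 80 == 0 else 0.0 for i in range(160)]
click = [1.0] + [0.0] * 159
```

At sr = 8000, the 3-15 ms range corresponds to searching lags of 24 to 120 samples.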

errSig = filter([1 A'],1,frameData); % find excitation noise

G(nframe) = sqrt(err(L+1)); % gain

autoCorErr = xcorr(errSig); % calculate pitch & voicing information

[B,I] = sort(autoCorErr);

num = length(I);

if B(num-1) > .01*B(num)

pitch(nframe) = abs(I(num) - I(num-1));

else

pitch(nframe) = 0;

end

The values of the LPC coefficients, the pitch period, and the type of excitation are then transmitted to the receiver. The decoder synthesizes the speech signal by passing the proper excitation through the all-pole filter model of the vocal tract.

Typically the pitch period requires 6 bits, the gain parameter can be represented in 5 bits after its dynamic range is compressed logarithmically, and the prediction coefficients normally require 8-10 bits each for accuracy reasons. Precision is very important in LPC because small changes in the prediction coefficients can result in large changes in the pole positions of the filter model, which can cause instability in the model. This is overcome by quantizing the PARCOR (reflection) coefficients instead.
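The logarithmic compression of the gain can be sketched as a uniform quantizer in the log domain; the 5-bit width comes from the text above, while the dynamic range [G_MIN, G_MAX] is an assumption of this example, not a value from the report:

```python
import math

BITS = 5
G_MIN, G_MAX = 1e-3, 10.0   # assumed dynamic range for the example

def quantize_gain(g, bits=BITS, g_min=G_MIN, g_max=G_MAX):
    # Uniformly quantize log(g) over the assumed range into 0..2**bits-1
    g = max(min(g, g_max), g_min)
    x = (math.log(g) - math.log(g_min)) / (math.log(g_max) - math.log(g_min))
    return round(x * (2 ** bits - 1))

def dequantize_gain(code, bits=BITS, g_min=G_MIN, g_max=G_MAX):
    # Invert the mapping: code -> point on the logarithmic grid
    x = code / (2 ** bits - 1)
    return math.exp(math.log(g_min) + x * (math.log(g_max) - math.log(g_min)))
```

With 5 bits over this assumed four-decade (80 dB) range, each quantizer step spans roughly 2.6 dB.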

Is the speech frame voiced or unvoiced?

Once the LPC coefficients are computed, we can determine whether the input speech frame is voiced and, if so, what the pitch is.

If the speech frame is decided to be voiced, an impulse train is employed to represent it, with nonzero taps occurring every pitch period. A pitch-detection algorithm is used to determine the correct pitch period/frequency; the autocorrelation function is used to estimate the pitch period. If the frame is unvoiced, however, white noise is used to represent it and a pitch period of T = 0 is transmitted. Therefore, either white noise or an impulse train becomes the excitation of the LPC synthesis filter.

Two types of LPC vocoders were implemented in MATLAB.

The plain LPC vocoder diagram is shown below:

%LPC vocoder

function [ outspeech ] = speechcoder1( inspeech )


% Parameters:

% inspeech : wave data with sampling rate Fs

% (Fs can be changed underneath if necessary)

% Returns:

% outspeech : wave data with sampling rate Fs

% (coded and resynthesized)

if ( nargin ~= 1)

error('argument check failed');

end;

Fs = 16000; % sampling rate in Hertz (Hz)

Order = 10; % order of the model used by LPC

% encoded the speech using LPC

[aCoeff, resid, pitch, G, parcor, stream] = proclpc(inspeech, Fs, Order);

% decode/synthesize speech using LPC and impulse-trains as excitation

outspeech = synlpc(aCoeff, pitch, Fs, G);

Results:

Residual plot:

Voice-excited LPC vocoder (utilizing the DCT for a high compression rate / low bit rate)

The input speech signal in each frame is filtered with the inverse transfer function estimated by the LPC analyzer. This filtered signal is called the residual.

To achieve a high compression rate, the discrete cosine transform (DCT) of the residual signal can be employed. The DCT concentrates most of the energy of the signal in the first few coefficients. Thus one way to compress the signal is to transmit only the coefficients that contain most of the energy.
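The energy-compaction idea can be sketched in Python with a naive orthonormal DCT-II (the same transform MATLAB's dct computes); the smooth test signal standing in for a residual frame is an assumption for illustration:

```python
import math

def dct2(x):
    # Naive orthonormal DCT-II
    N = len(x)
    out = []
    for k in range(N):
        s = sum(x[n] * math.cos(math.pi * (n + 0.5) * k / N) for n in range(N))
        out.append(s * math.sqrt((1.0 if k == 0 else 2.0) / N))
    return out

# A smooth, low-frequency signal standing in for a residual frame
x = [math.cos(2 * math.pi * 2 * n / 64) for n in range(64)]
X = dct2(x)

total = sum(c * c for c in X)      # equals the time-domain energy (Parseval)
kept = sum(c * c for c in X[:8])   # energy held by the first 8 coefficients
# kept / total is close to 1, so truncating the tail loses little energy
```

This is why keeping only the first 50 of the 480 DCT coefficients in speechcoder2 below still preserves most of the residual's energy.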

function [ outspeech ] = speechcoder2( inspeech )

% Parameters:

% inspeech : wave data with sampling rate Fs

% (Fs can be changed underneath if necessary)

% Returns:

% outspeech : wave data with sampling rate Fs

% (coded and resynthesized)

if ( nargin ~= 1)

error('argument check failed');

end;

Fs = 16000; % sampling rate in Hertz (Hz)

Order = 10; % order of the model used by LPC

% encoded the speech using LPC

[aCoeff, resid, pitch, G, parcor, stream] = proclpc(inspeech, Fs, Order);

% perform a discrete cosine transform on the residual

resid = dct(resid);

[a,b] = size(resid);

% only use the first 50 DCT-coefficients this can be done

% because most of the energy of the signal is conserved in these coeffs

resid = [ resid(1:50,:); zeros(430,b) ];

% quantize the data

resid = uencode(resid,4);

resid = udecode(resid,4);

% perform an inverse DCT

resid = idct(resid);

% add some noise to the signal to make it sound better

noise = [ zeros(50,b); 0.01*randn(430,b) ];

resid = resid + noise;

% decode/synthesize speech using LPC and the compressed residual as excitation

outspeech = synlpc2(aCoeff, resid, Fs, G);

Results:


MATLAB files:

clear all;

%osama saraireh

% speech processing

%Dr. Veton Kepuska

%FIT FAll 2005

a = input('please load the speech signal as a .wav file ', 's');

Inputsoundfile = a ;

[inspeech, Fs, bits] = wavread(Inputsoundfile); % read the wavefile

outspeech1 = speechcoder1(inspeech); % plain LPC vocoder

outspeech2 = speechcoder2(inspeech); % voice-excited LPC vocoder

% plot results

figure(1);

subplot(3,1,1);

plot(inspeech);

grid;

subplot(3,1,2);

plot(outspeech1);

grid;

subplot(3,1,3);

plot(outspeech2);

grid;

disp('Press any key to play the original sound file');

pause;

soundsc(inspeech, Fs);

disp('Press any key to play the LPC compressed file!');

pause;

soundsc(outspeech1, Fs);

disp('Press a key to play the voice-excited LPC compressed sound!');

pause;

soundsc(outspeech2, Fs);

function [aCoeff,resid,pitch,G,parcor,stream] = proclpc(data,sr,L,fr,fs,preemp)

% L - The order of the analysis. Defaults to 10.

% fr - Frame time increment, in ms. Defaults to 20.

% fs - Frame size, in ms. Defaults to 30.

% preemp - Preemphasis coefficient. Defaults to 0.9378.

% aCoeff - The LPC analysis results,

% resid - The LPC residual,

% pitch - calculated by finding the peak in the residual's autocorrelation

%for each frame.

% G - The LPC gain for each frame.

% parcor - The parcor coefficients.

% stream - The LPC analysis' residual or excitation signal as one long vector.

if (nargin<3), L = 10; end

if (nargin<4), fr = 20; end

if (nargin<5), fs = 30; end

if (nargin<6), preemp = .9378; end

[row col] = size(data);

if col==1, data=data'; end

nframe = 0;

msfr = round(sr/1000*fr); % Convert ms to samples

msfs = round(sr/1000*fs); % Convert ms to samples

duration = length(data);

speech = filter([1 -preemp], 1, data)'; % Preemphasize speech

msoverlap = msfs - msfr;

ramp = [0:1/(msoverlap-1):1]'; % Compute part of window

for frameIndex=1:msfr:duration-msfs+1 % frame rate=20ms

frameData = speech(frameIndex:(frameIndex+msfs-1)); % frame size=30ms

nframe = nframe+1;

autoCor = xcorr(frameData); % Compute the autocorrelation

autoCorVec = autoCor(msfs+[0:L]);

% Levinson's method

err(1) = autoCorVec(1);

k(1) = 0;

A = [];

for index=1:L

numerator = [1 A.']*autoCorVec(index+1:-1:2);

denominator = -1*err(index);

k(index) = numerator/denominator; % PARCOR coeffs

A = [A+k(index)*flipud(A); k(index)];

err(index+1) = (1-k(index)^2)*err(index);

end

aCoeff(:,nframe) = [1; A];

parcor(:,nframe) = k';

% filter response

if 0

gain=0;

cft=0:(1/255):1;

for index=1:L

gain = gain + aCoeff(index,nframe)*exp(-i*2*pi*cft).^index;

end

gain = abs(1./gain);

spec(:,nframe) = 20*log10(gain(1:128))';

plot(20*log10(gain));

title(nframe);

drawnow;

end

% Calculate the filter response

% from the filter's impulse

% response (to check above).

if 0

impulseResponse = filter(1, aCoeff(:,nframe), [1 zeros(1,255)]);

freqResponse = 20*log10(abs(fft(impulseResponse)));

plot(freqResponse);

end

errSig = filter([1 A'],1,frameData); % find excitation noise

G(nframe) = sqrt(err(L+1)); % gain

autoCorErr = xcorr(errSig); % calculate pitch & voicing information

[B,I] = sort(autoCorErr);

num = length(I);

if B(num-1) > .01*B(num)

pitch(nframe) = abs(I(num) - I(num-1));

else

pitch(nframe) = 0;

end

resid(:,nframe) = errSig/G(nframe); % save the gain-normalized residual

if(frameIndex==1) % add residual frames using a trapezoidal window

stream = resid(1:msfr,nframe);

else

stream = [stream; ...

overlap+resid(1:msoverlap,nframe).*ramp; ...

resid(msoverlap+1:msfr,nframe)];

end

if(frameIndex+msfr+msfs-1 > duration)

stream = [stream; resid(msfr+1:msfs,nframe)];

else

overlap = resid(msfr+1:msfs,nframe).*flipud(ramp);

end

end

stream = filter(1, [1 -preemp], stream)';

Speech model one: LPC vocoder

function [ outspeech ] = speechcoder1( inspeech )

% Parameters:

% inspeech : wave data with sampling rate Fs

% outputs:

% outspeech : wave data with sampling rate Fs

% (coded and resynthesized)

if ( nargin ~= 1)

error('argument check failed');

end;

Fs = 8000; % sampling rate in Hertz (Hz)

Order = 10; % order of the model used by LPC

% encoded the speech using LPC

[aCoeff, resid, pitch, G, parcor, stream] = proclpc(inspeech, Fs, Order);

% decode/synthesize speech using LPC and impulse-trains as excitation

outspeech = synlpc(aCoeff, pitch, Fs, G);

% Voice-excited LPC vocoder

function [ outspeech ] = speechcoder2( inspeech )

% Parameters:

% inspeech : wave data with sampling rate Fs

% (Fs can be changed underneath if necessary)

% output:

% outspeech : wave data with sampling rate Fs

% (coded and resynthesized)

if ( nargin ~= 1)

error('argument check failed');

end;

Fs = 16000; % sampling rate in Hertz (Hz)

Order = 10; % order of the model used by LPC

% encoded the speech using LPC

[aCoeff, resid, pitch, G, parcor, stream] = proclpc(inspeech, Fs, Order);

% perform a discrete cosine transform on the residual

resid = dct(resid);

[a,b] = size(resid);

% only use the first 50 DCT-coefficients this can be done

% because most of the energy of the signal is conserved in these coeffs

resid = [ resid(1:50,:); zeros(430,b) ];

% quantize the data

resid = uencode(resid,4);

resid = udecode(resid,4);

% perform an inverse DCT

resid = idct(resid);

% add some noise to the signal to make it sound better

noise = [ zeros(50,b); 0.01*randn(430,b) ];

resid = resid + noise;

% decode/synthesize speech using LPC and the compressed residual as excitation

outspeech = synlpc2(aCoeff, resid, Fs, G);

References

J.D. Markel and A.H. Gray, Jr., Linear Prediction of Speech, pages 10-96, 190-158.

A.V. Oppenheim and R.W. Schafer, Digital Signal Processing.

V.K. Ingle and J.G. Proakis, Digital Signal Processing Using MATLAB.