Speech Processing Project
Linear Predictive coding using Voice excited
Vocoder
ECE 5525
Osama Saraireh
Fall 2005
Dr. Veton Kepuska
The basic form of a pitch-excited LPC vocoder is shown below.
The speech signal is low-pass filtered to no more than one half the system sampling frequency, and then A/D conversion is performed. The speech is processed on a frame-by-frame basis, where the analysis frame length can be variable. For each frame, a pitch-period estimate is made along with a voicing decision. A linear predictive coefficient analysis is performed to obtain an inverse model of the speech spectrum, A(z). In addition, a gain parameter G, representing some function of the speech energy, is computed. An encoding procedure is then applied to transform the analyzed parameters into an efficient set of transmission parameters, with the goal of minimizing the degradation in the synthesized speech for a specified number of bits. Knowing the transmission frame rate and the number of bits used for each transmission parameter, one can compute a noise-free channel transmission bit rate.
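As an illustrative calculation, using the bit allocations quoted later in this report (6 bits for the pitch period, 5 bits for the gain, and 8 bits for each of 10 prediction coefficients) at a 20 ms frame rate:

frameRate = 1/0.020; % 50 frames per second for 20 ms frames
bitsPerFrame = 6 + 5 + 10*8; % pitch + gain + 10 coefficients = 91 bits
bitRate = frameRate * bitsPerFrame % = 4550 bits/s over a noise-free channel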
At the receiver, the transmitted parameters are decoded into quantized versions of the coefficient-analysis and pitch-estimation parameters. An excitation signal for synthesis is then constructed from the transmitted pitch and voicing parameters. The excitation signal drives a synthesis filter 1/A(z) corresponding to the analysis model A(z). Either before or after synthesis, the gain is used to match the synthetic speech energy to the actual speech energy. The digital samples ŝ(n) are then passed through a D/A converter and low-pass filtered, with a filter similar to the one at the input of the system, to generate the synthetic speech s(t).
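As a preview of the synthesis step, here is a minimal per-frame sketch (the names are illustrative: aCoeff is the prediction polynomial [1; A] for the frame, G its gain, exc a unit-energy excitation segment built from the pitch and voicing decision, and preemp the pre-emphasis coefficient applied at the transmitter):

synth = G * filter(1, aCoeff, exc); % drive the synthesis filter 1/A(z)
synth = filter(1, [1 -preemp], synth); % undo the transmitter's pre-emphasis
% successive frames are overlap-added, D/A converted, and low-pass filtered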
Linear predictive coding (LPC) of speech
The linear predictive coding (LPC) method for speech analysis and synthesis is based on modeling the vocal tract as a linear all-pole (IIR) filter having the system transfer function:
$$H(z) = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}}$$

[Figure: simple speech-production model]
where p is the number of poles, G is the filter gain, and a[k] are the parameters that determine the poles. Two mutually exclusive excitation functions are used to model voiced and unvoiced speech sounds. On a short-time basis, voiced speech is considered periodic with a fundamental frequency F0 and a pitch period 1/F0, which depend on the speaker. Hence, voiced speech is generated by exciting the all-pole filter model with a periodic impulse train. Unvoiced sounds, on the other hand, are generated by exciting the all-pole filter with the output of a random-noise generator.
The fundamental difference between these two types of speech sounds comes from the way they are produced. The vibration of the vocal cords produces voiced sounds, and the rate at which the vocal cords vibrate dictates the pitch of the sound. Unvoiced sounds, on the other hand, do not rely on vibration of the vocal cords: the vocal cords remain open, and constrictions of the vocal tract force air out to produce the sound.
Given a short segment of a speech signal, say about 20 ms or 160 samples at a sampling rate of 8 kHz, the speech encoder at the transmitter must determine the proper excitation function, the pitch period for voiced speech, the gain, and the coefficients a[k]. The block diagram below describes the encoder/decoder for linear predictive coding. The parameters of the model are determined adaptively from the data, encoded into a binary sequence, and transmitted to the receiver. At the receiving end, the speech signal is then synthesized from the model and the excitation signal.
The parameters of the all-pole filter model are determined from the speech samples by means of linear prediction. Specifically, the output of the linear prediction filter is

$$\hat{s}(n) = \sum_{k=1}^{p} a_k\, s(n-k)$$

and the corresponding error between the observed sample s(n) and the predicted value is

$$e(n) = s(n) - \hat{s}(n) = s(n) - \sum_{k=1}^{p} a_k\, s(n-k).$$
By minimizing the sum of the squared errors we can determine the pole parameters of the model. Differentiating the sum with respect to each of the parameters and equating the result to zero yields a set of p linear equations

$$\sum_{k=1}^{p} a_k\, r(m-k) = r(m), \qquad m = 1, 2, \ldots, p,$$

where r(m) is the autocorrelation of the sequence s(n), defined as

$$r(m) = \sum_{n} s(n)\, s(n+m).$$

The equations above can be expressed in matrix form as

$$\mathbf{R}\,\mathbf{a} = \mathbf{r},$$

where R is a p×p autocorrelation matrix, r is a p×1 autocorrelation vector, and a is a p×1 vector of model parameters.
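Because R is a symmetric Toeplitz matrix built from the autocorrelation lags, the system can be solved directly in MATLAB. A minimal sketch, assuming s is one analysis frame and p the model order (xcorr requires the Signal Processing Toolbox):

r = xcorr(s, p); % autocorrelation at lags -p..p
r = r(p+1:end); % keep lags 0..p
a = toeplitz(r(1:p)) \ r(2:p+1); % solve R*a = r for the model parameters

In practice the Toeplitz structure is exploited by the Levinson-Durbin recursion, which solves the same system in O(p^2) operations, as in the project code below.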
The project code (excerpted from proclpc, listed in full later) pre-emphasizes the speech, splits it into overlapping frames, and computes the autocorrelation vector:
[row col] = size(data);
if col==1 data=data'; end
nframe = 0;
msfr = round(sr/1000*fr); % Convert ms to samples
msfs = round(sr/1000*fs); % Convert ms to samples
duration = length(data);
speech = filter([1 -preemp], 1, data)'; % Preemphasize speech
msoverlap = msfs - msfr;
ramp = [0:1/(msoverlap-1):1]'; % Compute part of window
for frameIndex=1:msfr:duration-msfs+1 % frame rate=20ms
frameData = speech(frameIndex:(frameIndex+msfs-1)); % frame size=30ms
nframe = nframe+1;
autoCor = xcorr(frameData); % Compute the autocorrelation of the frame
autoCorVec = autoCor(msfs+[0:L]);
These equations can be solved in MATLAB by using the Levinson-Durbin algorithm:
% Levinson-Durbin recursion
err(1) = autoCorVec(1);
k(1) = 0;
A = [];
for index=1:L
numerator = [1 A.']*autoCorVec(index+1:-1:2);
denominator = -1*err(index);
k(index) = numerator/denominator; % PARCOR (reflection) coeffs
A = [A+k(index)*flipud(A); k(index)];
err(index+1) = (1-k(index)^2)*err(index);
end
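Equivalently, the Signal Processing Toolbox provides this recursion as a built-in; a quick sketch (note that these routines return the full prediction polynomial A(z) = 1 + a(2)z^-1 + ... + a(L+1)z^-L, so the sign convention may differ from the hand-coded loop above):

[Apoly, finalErr, refl] = levinson(autoCorVec, L); % Levinson-Durbin recursion
Apoly2 = lpc(frameData, L); % autocorrelation + recursion in a single call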
The gain parameter of the filter can be obtained from the input-output relationship

$$s(n) = \sum_{k=1}^{p} a_k\, s(n-k) + G\, x(n),$$

where x(n) represents the input (excitation) sequence. Expressing this in terms of the error sequence, we have

$$G\, x(n) = s(n) - \sum_{k=1}^{p} a_k\, s(n-k) = e(n),$$

and therefore

$$G^2 \sum_{n} x^2(n) = \sum_{n} e^2(n).$$

If the input excitation is normalized to unit energy by design, then

$$G^2 = \sum_{n} e^2(n),$$

i.e., G^2 is set equal to the residual energy resulting from the least-squares optimization.
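This relation is easy to verify numerically; a small illustrative check using the variables from the recursion above:

errSig = filter([1 A.'], 1, frameData); % inverse-filter the frame to get e(n)
G2 = err(L+1); % residual energy from the Levinson-Durbin recursion
% sum(errSig.^2) should closely match G2, apart from frame-edge effects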
Once the LPC coefficients are computed, we can determine whether the input speech frame is voiced and, if it is indeed voiced, what the pitch is. The pitch can be determined by computing the sequence

$$r_e(n) = r_a(n) \ast r(n),$$

where $r_a(n)$ is the autocorrelation sequence of the prediction coefficients and $r(n)$ is the autocorrelation of the speech frame; $r_e(n)$ is therefore the autocorrelation of the prediction error (residual). The pitch is detected by finding the peak of the normalized sequence $r_e(n)/r_e(0)$ in the time interval corresponding to 3 to 15 ms of the 20 ms frame. If the value of this peak is at least 0.25, the frame is considered voiced, with a pitch period equal to the lag $n_{max}$ at which the peak occurs.
If the peak value is less than 0.25, the frame is considered unvoiced and the pitch is set to zero.
errSig = filter([1 A'],1,frameData); % find excitation noise
G(nframe) = sqrt(err(L+1)); % gain
autoCorErr = xcorr(errSig); % calculate pitch & voicing information
[B,I] = sort(autoCorErr);
num = length(I);
if B(num-1) > .01*B(num)
pitch(nframe) = abs(I(num) - I(num-1));
else
pitch(nframe) = 0;
end
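The excerpt above uses a simplified voicing heuristic, comparing the two largest values of the residual autocorrelation. A minimal sketch of the 3-15 ms / 0.25 rule described earlier (names are illustrative; sr is the sampling rate in samples per second):

rE = xcorr(errSig, 'coeff'); % normalized autocorrelation, zero lag = 1
mid = (length(rE)+1)/2; % index of the zero-lag term
lagMin = round(0.003*sr); % 3 ms expressed in samples
lagMax = round(0.015*sr); % 15 ms expressed in samples
[peakVal, rel] = max(rE(mid+lagMin : mid+lagMax));
if peakVal >= 0.25
pitchPeriod = lagMin + rel - 1; % voiced: pitch period in samples
else
pitchPeriod = 0; % unvoiced
end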
The values of the LPC coefficients, the pitch period, and the type of excitation are then transmitted to the receiver. The decoder synthesizes the speech signal by passing the proper excitation through the all-pole filter model of the vocal tract.
Typically the pitch period requires 6 bits, the gain parameter is represented with 5 bits after its dynamic range is logarithmically compressed, and the prediction coefficients require 8-10 bits each for accuracy reasons. This precision is important in LPC because small changes in the prediction coefficients can produce large changes in the pole positions of the filter model, which can make the model unstable. This is overcome by quantizing the PARCOR (reflection) coefficients instead.
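A sketch of this idea (not the project's exact quantizer): the reflection (PARCOR) coefficients produced by the Levinson-Durbin recursion satisfy |k| < 1 for a stable model, and that property survives uniform quantization, so the reconstructed synthesis filter 1/A(z) stays stable:

kq = udecode(uencode(k(:), 8), 8); % 8-bit uniform quantization over [-1, 1]
% every |kq| < 1 still holds, so the rebuilt filter remains stable
aq = rc2poly(kq); % predictor polynomial from the quantized PARCORs
% (rc2poly is in the Signal Processing Toolbox; its reflection-coefficient
% sign convention may differ from the hand-coded recursion above)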
Is the speech frame voiced or unvoiced?
Once the LPC coefficients are computed, we can determine whether the input speech frame is voiced and, if so, what the pitch is.
If the speech frame is decided to be voiced, an impulse train is employed to represent it, with nonzero taps occurring every pitch period. A pitch-detection algorithm based on the autocorrelation function, as described above, is used to determine the correct pitch period/frequency. However, if the frame is unvoiced, then white noise is used to represent it and a pitch period of T = 0 is transmitted. Therefore, either white noise or an impulse train becomes the excitation of the LPC synthesis filter, as sketched below.
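A minimal sketch of this excitation construction for one frame, assuming frameLen and pitchPeriod are given in samples:

if pitchPeriod > 0 % voiced: impulse train with taps every pitch period
exc = zeros(frameLen, 1);
exc(1:pitchPeriod:frameLen) = 1;
else % unvoiced: white noise
exc = randn(frameLen, 1);
end
exc = exc / sqrt(sum(exc.^2)); % normalize to unit energy so the gain G applies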
Two types of LPC vocoders were implemented in MATLAB.
The plain LPC vocoder diagram is shown below:
%LPC vocoder
function [ outspeech ] = speechcoder1( inspeech )
% Parameters:
% inspeech : wave data with sampling rate Fs
% (Fs can be changed underneath if necessary)
% Returns:
% outspeech : wave data with sampling rate Fs
% (coded and resynthesized)
if ( nargin ~= 1)
error('argument check failed');
end;
Fs = 16000; % sampling rate in Hertz (Hz)
Order = 10; % order of the model used by LPC
% encode the speech using LPC
[aCoeff, resid, pitch, G, parcor, stream] = proclpc(inspeech, Fs, Order);
% decode/synthesize speech using LPC and impulse-trains as excitation
outspeech = synlpc(aCoeff, pitch, Fs, G);
Results: [residual plot figure omitted]
Voice-excited LPC vocoder (using the DCT for a high compression rate / low bit count)
The input speech signal in each frame is filtered with the estimated transfer function of the LPC analyzer; this filtered signal is called the residual.
To achieve a high compression rate, the discrete cosine transform (DCT) of the residual signal can be employed. The DCT concentrates most of the energy of the signal in the first few coefficients; thus one way to compress the signal is to transmit only the coefficients that contain most of the energy.
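The energy-compaction claim is easy to check numerically. An illustrative sketch, using a smooth synthetic signal as a stand-in for one 480-sample residual frame:

x = filter(1, [1 -0.9], randn(480, 1)); % smooth (low-pass) test signal
X = dct(x);
cumEnergy = cumsum(X.^2) / sum(X.^2);
fprintf('%.1f%% of the energy lies in the first 50 DCT coefficients\n', ...
100*cumEnergy(50));

The voice-excited coder below applies exactly this idea to the residual of each frame: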
function [ outspeech ] = speechcoder2( inspeech )
% Parameters:
% inspeech : wave data with sampling rate Fs
% (Fs can be changed underneath if necessary)
% Returns:
% outspeech : wave data with sampling rate Fs
% (coded and resynthesized)
if ( nargin ~= 1)
error('argument check failed');
end;
Fs = 16000; % sampling rate in Hertz (Hz)
Order = 10; % order of the model used by LPC
% encode the speech using LPC
[aCoeff, resid, pitch, G, parcor, stream] = proclpc(inspeech, Fs, Order);
% perform a discrete cosine transform on the residual
resid = dct(resid);
[a,b] = size(resid);
% keep only the first 50 DCT coefficients; this can be done because
% most of the energy of the signal is concentrated in these coefficients
% (50 kept + 430 zeroed = 480 samples, one 30 ms frame at Fs = 16 kHz)
resid = [ resid(1:50,:); zeros(430,b) ];
% quantize the data
resid = uencode(resid,4);
resid = udecode(resid,4);
% perform an inverse DCT
resid = idct(resid);
% replace the zeroed high-frequency coefficients with low-level noise
% to make the resynthesized signal sound more natural
noise = [ zeros(50,b); 0.01*randn(430,b) ];
resid = resid + noise;
% decode/synthesize speech using LPC and the compressed residual as excitation
outspeech = synlpc2(aCoeff, resid, Fs, G);
Results: [figures omitted]
MATLAB files:
clear all;
%osama saraireh
% speech processing
%Dr. Veton Kepuska
% FIT Fall 2005
a = input('Enter the name of the speech signal .wav file: ', 's');
Inputsoundfile = a;
[inspeech, Fs, bits] = wavread(Inputsoundfile); % read the wavefile
outspeech1 = speechcoder1(inspeech); % plain LPC vocoder
outspeech2 = speechcoder2(inspeech); % voice-excited LPC vocoder
% plot results
figure(1);
subplot(3,1,1);
plot(inspeech);
grid;
subplot(3,1,2);
plot(outspeech1);
grid;
subplot(3,1,3);
plot(outspeech2);
grid;
disp('Press any key to play the original sound file');
pause;
soundsc(inspeech, Fs);
disp('Press any key to play the LPC compressed file!');
pause;
soundsc(outspeech1, Fs);
disp('Press a key to play the voice-excited LPC compressed sound!');
pause;
soundsc(outspeech2, Fs);
function [aCoeff,resid,pitch,G,parcor,stream] = proclpc(data,sr,L,fr,fs,preemp)
% L - The order of the LPC analysis. Defaults to 10.
% fr - Frame time increment, in ms. Defaults to 20 ms.
% fs - Frame size, in ms. Defaults to 30 ms.
% preemp - Pre-emphasis coefficient. Defaults to 0.9378.
% aCoeff - The LPC analysis results: one column of coefficients [1; A]
% per frame.
% resid - The LPC residual, normalized by the gain, one column per frame.
% pitch - Pitch estimate, calculated by finding the peak in the residual's
% autocorrelation for each frame.
% G - The LPC gain for each frame.
% parcor - The PARCOR (reflection) coefficients for each frame.
% stream - The LPC analysis' residual or excitation signal as one long vector.
if (nargin<3), L = 10; end
if (nargin<4), fr = 20; end
if (nargin<5), fs = 30; end
if (nargin<6), preemp = .9378; end
[row col] = size(data);
if col==1 data=data'; end
nframe = 0;
msfr = round(sr/1000*fr); % Convert ms to samples
msfs = round(sr/1000*fs); % Convert ms to samples
duration = length(data);
speech = filter([1 -preemp], 1, data)'; % Preemphasize speech
msoverlap = msfs - msfr;
ramp = [0:1/(msoverlap-1):1]'; % Compute part of window
for frameIndex=1:msfr:duration-msfs+1 % frame rate=20ms
frameData = speech(frameIndex:(frameIndex+msfs-1)); % frame size=30ms
nframe = nframe+1;
autoCor = xcorr(frameData); % Compute the autocorrelation of the frame
autoCorVec = autoCor(msfs+[0:L]);
% Levinson's method
err(1) = autoCorVec(1);
k(1) = 0;
A = [];
for index=1:L
numerator = [1 A.']*autoCorVec(index+1:-1:2);
denominator = -1*err(index);
k(index) = numerator/denominator; % PARCOR coeffs
A = [A+k(index)*flipud(A); k(index)];
err(index+1) = (1-k(index)^2)*err(index);
end
aCoeff(:,nframe) = [1; A];
parcor(:,nframe) = k';
% filter response
if 0
gain=0;
cft=0:(1/255):1;
for index=1:L
gain = gain + aCoeff(index,nframe)*exp(-i*2*pi*cft).^index;
end
gain = abs(1./gain);
spec(:,nframe) = 20*log10(gain(1:128))';
plot(20*log10(gain));
title(nframe);
drawnow;
end
% Calculate the filter response
% from the filter's impulse
% response (to check above).
if 0
impulseResponse = filter(1, aCoeff(:,nframe), [1 zeros(1,255)]);
freqResponse = 20*log10(abs(fft(impulseResponse)));
plot(freqResponse);
end
errSig = filter([1 A'],1,frameData); % find excitation noise
G(nframe) = sqrt(err(L+1)); % gain
autoCorErr = xcorr(errSig); % calculate pitch & voicing information
[B,I] = sort(autoCorErr);
num = length(I);
if B(num-1) > .01*B(num)
pitch(nframe) = abs(I(num) - I(num-1));
else
pitch(nframe) = 0;
end
resid(:,nframe) = errSig/G(nframe); % normalize the residual by the frame gain
if(frameIndex==1) % add residual frames using a trapezoidal window
stream = resid(1:msfr,nframe);
else
stream = [stream;
overlap + resid(1:msoverlap,nframe).*ramp;
resid(msoverlap+1:msfr,nframe)];
end
if(frameIndex+msfr+msfs-1 > duration)
stream = [stream; resid(msfr+1:msfs,nframe)];
else
overlap = resid(msfr+1:msfs,nframe).*flipud(ramp);
end
end
stream = filter(1, [1 -preemp], stream)';
Speech model 1: plain LPC vocoder
function [ outspeech ] = speechcoder1( inspeech )
% Parameters:
% inspeech : wave data with sampling rate Fs
% outputs:
% outspeech : wave data with sampling rate Fs
% (coded and resynthesized)
if ( nargin ~= 1)
error('argument check failed');
end;
Fs = 8000; % sampling rate in Hertz (Hz)
Order = 10; % order of the model used by LPC
% encode the speech using LPC
[aCoeff, resid, pitch, G, parcor, stream] = proclpc(inspeech, Fs, Order);
% decode/synthesize speech using LPC and impulse-trains as excitation
outspeech = synlpc(aCoeff, pitch, Fs, G);
% Voice-excited LPC vocoder
function [ outspeech ] = speechcoder2( inspeech )
% Parameters:
% inspeech : wave data with sampling rate Fs
% (Fs can be changed underneath if necessary)
% output:
% outspeech : wave data with sampling rate Fs
% (coded and resynthesized)
if ( nargin ~= 1)
error('argument check failed');
end;
Fs = 16000; % sampling rate in Hertz (Hz)
Order = 10; % order of the model used by LPC
% encode the speech using LPC
[aCoeff, resid, pitch, G, parcor, stream] = proclpc(inspeech, Fs, Order);
% perform a discrete cosine transform on the residual
resid = dct(resid);
[a,b] = size(resid);
% keep only the first 50 DCT coefficients; this can be done because
% most of the energy of the signal is concentrated in these coefficients
% (50 kept + 430 zeroed = 480 samples, one 30 ms frame at Fs = 16 kHz)
resid = [ resid(1:50,:); zeros(430,b) ];
% quantize the data
resid = uencode(resid,4);
resid = udecode(resid,4);
% perform an inverse DCT
resid = idct(resid);
% replace the zeroed high-frequency coefficients with low-level noise
% to make the resynthesized signal sound more natural
noise = [ zeros(50,b); 0.01*randn(430,b) ];
resid = resid + noise;
% decode/synthesize speech using LPC and the compressed residual as excitation
outspeech = synlpc2(aCoeff, resid, Fs, G);