Time-Scale Modification of Music Signals

S. Grofit and Y. Lavner

Tel-Hai Academic College

1. Introduction

Time-Scale Modification (TSM) of audio signals is the process of modifying the rate of audio signals such as speech or music, while maintaining other parameters (pitch, timbre) unchanged. It is a subject of major theoretical and practical interest.

In this study, a new algorithm for time-scale modification of music signals is presented. Present techniques for TSM of audio signals are used in many applications, for example in recording studios, for synchronization between different sounds and between the audio and video components. To meet the requirement of high-quality time scaling of speech signals, a number of algorithms have been proposed in the past decade. Unfortunately, applying these algorithms to music signals does not yield satisfactory results. The proposed algorithm is related to the PSOLA-like algorithms, which are based on the similarity of the short-time Fourier transforms of the original and the time-scaled signals. The basic assumption of these algorithms is that the spectral characteristics of the signal are constant over short durations, and that the signal is quasi-periodic in the time domain. The signal in PSOLA is divided into short-time overlapping frames, which are used for constructing the time-scaled synthesis signal while maintaining the original spectral parameters and their relative locations.

It is well known that the information contained in the temporal envelope of audio signals is important for perceptual quality. This information is not preserved when the overlap-and-add technique is applied to the original music signal, which causes reverberations and degrades the quality of the time-scaled music, especially at edges such as attacks and decays. An attempt to prevent the degradation by leaving these edges untouched improves the quality in some cases. However, the short-time energy function used for detecting these edges is not sensitive enough in many situations, for example in very fast music, or when many musical instruments are playing together. Therefore, in the presented algorithm, the problematic sections are detected using a Mel-scaled filter-bank with a time-variant threshold function. This enables detection of the important edges, leaving these sections intact, while time-scaling the steady-state sections. In addition, the normalized correlation function, which is calculated in the PSOLA part of the algorithm, is used to detect frames whose frequency content is dissimilar above a given threshold.

2. PSOLA-like algorithms

Most non-parametric algorithms for time-scale modification of audio signals are based on minimizing the distance between short-time Fourier transforms of the original and the time-scaled signals, in corresponding neighborhoods, according to the mapping function [1]:

1)

Where are the short-time Fourier transforms (STFT) of , respectively, and defined as:

2)

and w(m) is a symmetric, low-pass window function of finite duration [2], such as a Hamming window. The overlap-and-add (OLA) equation minimizes the distance if there exists , such that :

3)

Unfortunately, the OLA equation destroys the original phase relationships and does not maintain the quasi-periodic structure of the original signal. The WSOLA algorithm [3] avoids the resulting discontinuities by allowing a local adjustment of in the selection of analysis frames in the input signal:

4) ,

choosing so that the support of the window is and weighting it such that

If is the weighted window, the WSOLA equation will be:

5) .

In the study presented here, was selected according to the maximal normalized correlation between overlapping windowed frames:

6)

where in order to prevent time-reversal in the input signal, and, for synchronization between adjacent pitch periods, has to be at least about half the maximal pitch period. The algorithm was implemented so that, for each step index k, the sum of two overlapping half-frames is added to the output time-scaled signal under construction. This facilitates the incorporation of untouched sections in the output signal: after finding the desired , the frames for the OLA operation and the corresponding normalized correlation are known for the current step. If the decision is not to overlap the frames, samples from the input signal are copied to the output signal, and the procedure restarts from the next frame.
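The frame selection by maximal normalized correlation can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `normalized_correlation` and `best_shift`, the exhaustive search over integer shifts, and the toy sinusoid are all assumptions made for illustration, and windowing is omitted for brevity.

```python
import math

def normalized_correlation(a, b):
    """Normalized cross-correlation of two equal-length sequences (range [-1, 1])."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den > 0.0 else 0.0

def best_shift(signal, start, template, max_shift):
    """Search shifts in [-max_shift, max_shift] around `start` (hypothetical helper).

    Returns (shift, correlation) for the candidate frame of `signal` that is
    most similar to `template`; candidates falling outside the signal are skipped.
    """
    n = len(template)
    best = (0, -2.0)  # -2.0 is below any achievable correlation
    for shift in range(-max_shift, max_shift + 1):
        lo = start + shift
        if lo < 0 or lo + n > len(signal):
            continue
        c = normalized_correlation(signal[lo:lo + n], template)
        if c > best[1]:
            best = (shift, c)
    return best

# Toy example: a sinusoid with a period of 20 samples.
period = 20
signal = [math.sin(2 * math.pi * i / period) for i in range(200)]
template = signal[40:80]          # frame the next segment should continue from
shift, corr = best_shift(signal, 45, template, 8)
```

The returned correlation is the quantity that can later be compared against a threshold when deciding whether to overlap the frames or to copy input samples unmodified.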

Figure 1: Technique for avoiding non-stationary section overlap. Notice that may be much larger than the non-stationary section.

3. Spectral non-stationarity detection and untouched copy decision

Copying sections from the input signal to the output signal without applying the OLA operation modifies the mapping function. Defining long sections as spectrally significant could create undesirable distortion effects. Therefore, local thresholds have to be used, adaptive both to the input signal and to the required mapping function.

3.1 Spectral distortion measure based on human auditory perception

This study proposes a technique for detecting events of significant spectral change [4], such as “attacks” and “decays” of musical instruments, based on characteristics of the human hearing system. Time-scaling of these sections can deteriorate the music signal by stretching (scaling factor > 1) or eliminating (scaling factor < 1) the important sections, causing reverberations and distortions; this is why the regular PSOLA-like algorithms do not operate properly on music signals. To detect such events, the Mel-Frequency Cepstrum (MFC) coefficients are computed in successive analysis frames of the input signal. Let , be coefficient l in a frame centered around , where is the distance (in samples) between adjacent frames. The measure for spectral non-stationarity in the section is defined as

7)

where is the number of MFC coefficients.
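As a rough illustration of a frame-to-frame measure of this kind, the sketch below uses the squared Euclidean distance between the coefficient vectors of adjacent frames. This particular distance is an assumption chosen for the example and is not necessarily the measure defined above; the function name `nonstationarity` and the toy coefficient values are likewise hypothetical.

```python
def nonstationarity(coeffs):
    """One value per pair of adjacent frames (assumed distance, for illustration).

    `coeffs` is a list of frames, each a list of the same number of
    cepstral coefficients; the result has len(coeffs) - 1 entries.
    """
    return [
        sum((cur_l - prev_l) ** 2 for prev_l, cur_l in zip(prev, cur))
        for prev, cur in zip(coeffs, coeffs[1:])
    ]

# Toy input: two steady frames, an abrupt change, then two steady frames.
frames = [[1.0, 0.5], [1.0, 0.5], [3.0, 1.5], [3.0, 1.5]]
d = nonstationarity(frames)
```

The measure is zero between identical frames and peaks at the abrupt change, which is the behaviour needed to mark an attack or a decay.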

3.2 Threshold function for the spectral non-stationarity measure

The duration of an untouched section from the input signal depends on the length of the common support of the overlapping adjacent half-frames. The average of this length will be:

8)

The threshold values are set so that only a desired percentage of the original signal is copied without modification; therefore they are based on the following function:

9)

The final threshold value is set according to two threshold values: a global value and a local value.

The local threshold value is chosen to be the value within the percentile from in a neighborhood of samples:

10)

The local threshold is intended to select the most important sections for unmodified copying in a local interval, thus avoiding long unmodified sections which may drastically change the mapping function and produce rate distortions.

The global threshold is aimed at preventing the copying of unnecessary sections in spectrally stationary signals. Consequently:

11)

where .
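The interplay of the two thresholds can be sketched as follows. The sketch assumes a nearest-rank percentile over a symmetric neighborhood and assumes that the final threshold is the larger of the local and global values; both choices, as well as the names `percentile`, `local_thresholds`, `final_thresholds` and `flag` and all numeric values, are illustrative assumptions rather than the exact definitions used in this study.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile q (0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def local_thresholds(measure, q, half_width):
    """Percentile of the measure in a neighborhood of each position."""
    out = []
    for i in range(len(measure)):
        lo = max(0, i - half_width)
        hi = min(len(measure), i + half_width + 1)
        out.append(percentile(measure[lo:hi], q))
    return out

def final_thresholds(measure, q, half_width, global_threshold):
    """Assumed combination: the global value acts as a floor on the local one."""
    return [max(t, global_threshold)
            for t in local_thresholds(measure, q, half_width)]

def flag(measure, thresholds):
    """Indices whose measure reaches the threshold (candidates for untouched copy)."""
    return [i for i, (m, t) in enumerate(zip(measure, thresholds)) if m >= t]

# A signal with two clear spectral events, and a nearly stationary one.
busy = [0.1, 0.2, 5.0, 0.3, 0.1, 0.2, 0.1, 4.0, 0.2, 0.1]
quiet = [0.1, 0.2, 0.1, 0.2, 0.1]
busy_flags = flag(busy, final_thresholds(busy, 90, 2, 1.0))
quiet_flags = flag(quiet, final_thresholds(quiet, 90, 2, 1.0))
quiet_local_only = flag(quiet, local_thresholds(quiet, 90, 2))
```

In the nearly stationary example the local threshold alone would still select the small local maxima, whereas the global floor prevents any section from being copied, which is the purpose described for the global threshold.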

Figure 2: A) Blue line - original signal. Red line - untouched section marked.

B) Mel-Scale Spectral coefficients over time.

C) Blue line - . Green line - . Red line - .

D) Energy distance [dB].

3.3 Normalized correlation: motivation and thresholds

A closer look at WSOLA reveals a built-in mechanism for flagging the overlap of non-stationary segments. For each step index k, the analysis-frame shift is selected according to the maximal normalized correlation between overlapping windowed frames. The quality of an output frame can therefore be characterized by the normalized correlation achieved. This measure requires no additional computation, and is based on the relation between the input and the constructed output signal. On the other hand, it does not take the properties of the human ear into account, and it tends to ignore high frequencies when low frequencies are present.

Although the normalized correlation has a constant range , an adaptive threshold function was found necessary for controlling the copy rate and avoiding tempo distortions. A normalized correlation function is obtained by running WSOLA. The local threshold function and the global threshold are then calculated using the percentile method described above.

4. Correcting the mapping function to compensate for unscaled intervals

A method for time-scale modification with detection of spectrally significant events was presented above. The method necessarily modifies the required mapping function. For example, assume a constant mapping function with , and suppose that 10% of the signal is selected for untouched replication. The output signal will then follow a scaling ratio of . Unfortunately, the total duration of the untouched sections cannot be determined in advance, and hence a constant mapping function that provides the required mapping cannot be evaluated. The mapping function is realized in the ratio between and , so that changing either or will modify the mapping function. Here we chose to change and to avoid modifying the window function. In each frame, was calculated so that the difference between the desired and the actual scaling is compensated within samples, ignoring :

12)

where and are the indices of the former frame in the input and output signals, respectively.
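The effect of untouched sections on the overall scaling can be checked with simple arithmetic. The sketch below assumes that the scaling ratio is output length divided by input length, that untouched sections are copied at a ratio of 1, and that the untouched fraction is known in advance; the last assumption does not hold in practice, which is why a per-frame correction is used instead. The function names and the values 2.0 and 0.1 are hypothetical.

```python
def effective_ratio(beta, untouched_fraction):
    """Overall ratio when a fraction of the input is copied unscaled."""
    return (1.0 - untouched_fraction) * beta + untouched_fraction

def compensated_ratio(beta, untouched_fraction):
    """Ratio to apply to the scaled part so that the overall ratio equals beta."""
    return (beta - untouched_fraction) / (1.0 - untouched_fraction)

beta = 2.0            # hypothetical requested constant scaling ratio
p = 0.1               # 10% of the input copied untouched
actual = effective_ratio(beta, p)            # about 1.9 instead of 2.0
needed = compensated_ratio(beta, p)          # about 2.11 on the remaining 90%
restored = effective_ratio(needed, p)        # back to about 2.0
```

The same bookkeeping, applied frame by frame to the accumulated input and output positions, is what allows the deviation from the requested mapping to be absorbed gradually.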

Figure 3: Blue line - original signal. Red line - the mapping-function over time.

Requested constant mapping-function: . Unscaled ratio (of input length): 9.03%

5. Accurate Time-Scale Modification

The WSOLA algorithm as presented in [3] does not guarantee accurate time-scaling, which is an important requirement in some applications. Given the limited scope of the present paper, only a brief and general outline of a technique for accurate time-scaling that replicates events containing spectral non-stationarities is presented here. The technique, which is a variation of WSOLA, enables time-scaling of a given input signal of length to an output signal of length , with an accuracy of up to a few samples. Suppose M sections have been chosen for untouched replication, according to the method described in Sections 3.1 and 3.2. Let us denote these sections as an ascending sequence of non-overlapping sections: , where and are the left and right delimiters, respectively. The preferred locations of these sections in the output signal will be a weighted average according to their respective locations in the input signal with respect to the left and right delimiters:

13)

unless or . In this case, the untouched sections will be replicated to the left or right ends of the current output signal. In frames that are not chosen for replication, time-scaling will be performed precisely as described above.

The technique is not problem-free, since it increases the length of the replicated sections, but it does meet the time-scaling requirement.

6. Conclusion

In this study, a method for time-scale modification of music signals is presented. The method is based on detection of spectral non-stationarities in music signals, events that are perceptually important. The other, stationary sections are time-scaled using an improved version of the WSOLA algorithm. Informal listening tests indicated that the proposed algorithm produces better results than other algorithms such as SOLA, WSOLA, and EDSOLA.

A drawback of the algorithm is its high computational complexity.

References:

[1] Griffin D.W. and Lim J.S., “Signal Estimation from Modified Short-Time Fourier Transform.” IEEE Trans. on ASSP 32(2) (1984) 236-243.

[2] Moulines E. and Laroche J., “Non-Parametric techniques for pitch-scale and time-scale modification of speech.” Speech Communication 16 (1995) 175-205.

[3] Verhelst W. and Roelands M., “An Overlap-Add Technique based on waveform similarity (WSOLA) for high quality time-scale modification of speech.” ICASSP-93 (1993) 554-557.

[4] Kapilow D., Stylianou Y., and Schroeter J., “Detection of non-stationarities in speech signals and its application in time-scaling.” Eurospeech 99 (1999).

This study was partly supported by a Guastella Fellowship of the Sacta-Rashi Foundation, and the JAFI project to promote higher education in the Eastern Galilee.