Abstract
This paper deals with temporal segmentation of acoustic
signals and feature extraction. Segmentation and feature
extraction are intended as a first step toward sound
signal representation, coding, transformation, and multimedia applications.
Three interdependent levels of segmentation are defined. They
correspond to different levels of signal attributes. The
Source level distinguishes speech, singing voice, instrumental
parts, and other sounds such as street sounds and machine
noise. The Feature level deals with characteristics such
as silence/sound, transitory/steady, voiced/unvoiced, harmonicity,
vibrato, and so forth. The last level is the segmentation into
Notes and Phones.
A large set of features is first computed: the derivative and relative
derivative of f0 and energy, a voicing coefficient, a measure of
the inharmonicity of the partials, spectral centroid, spectral flux,
high-order statistics, energy modulation, etc. A decision function
built on this feature set provides the segmentation marks; it also
depends on the current application and the required result. As an
example, in the case of the singing voice, segmentation
according to pitch is different from segmentation into phones. A
graphical interface allows visualization of these features, the results
of the decisions, and the final result.
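The decision function itself is application-dependent and not detailed here. As a minimal illustrative sketch (the function name and the use of a single binary voiced/unvoiced attribute are our own assumptions, not the paper's method), segmentation marks can be placed wherever a frame-level attribute changes value:

```python
def segmentation_marks(attribute):
    # Emit a mark at every frame index where the frame-level
    # attribute (e.g. voiced/unvoiced flags) changes value.
    return [i for i in range(1, len(attribute))
            if attribute[i] != attribute[i - 1]]
```

A real decision function would combine several such attributes and thresholds, chosen according to the target application.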
For the Source level, some features are predominant: spectral
centroid, spectral flux, energy modulation, and their variances,
computed over a sound segment of one second or more.
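As an illustration, the spectral centroid and spectral flux mentioned above can be computed per analysis frame as follows. This is a minimal sketch, assuming Hanning-windowed frames and a Euclidean flux definition; frame length and function names are illustrative, not taken from the paper.

```python
import numpy as np

def spectral_centroid(frame, sr):
    # Centre of gravity of the magnitude spectrum, in Hz.
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12)

def spectral_flux(prev_spectrum, spectrum):
    # Euclidean distance between consecutive magnitude spectra.
    return float(np.sqrt(np.sum((spectrum - prev_spectrum) ** 2)))
```

A pure 1 kHz tone yields a centroid close to 1 kHz, and a frame identical to its predecessor yields zero flux; the per-segment variances would then be taken over all frames in a one-second window.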
Segmentation starts with the Source level, but the three levels
are not independent. Therefore, information obtained at a given level
is propagated towards the other levels. For example, in the case of
instrumental music and the singing voice, if vibrato is detected at
the Feature level, the amplitude and frequency of the vibrato are
estimated and taken into account at the Notes and Phones
level. The vibrato is removed from the f0 trajectory, and the
high frequencies of the signal are not used in spectral flux
computation.
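The paper does not spell out the vibrato-removal procedure. One simple sketch, assuming the vibrato rate has already been estimated as described above, is to smooth the f0 trajectory with a moving average spanning exactly one vibrato period, which cancels the sinusoidal modulation:

```python
import numpy as np

def remove_vibrato(f0, frame_rate, vibrato_rate):
    # Moving average over one vibrato period: a sinusoid of that
    # period averages to (nearly) zero, leaving the underlying
    # f0 trajectory. Edges are only approximate with mode="same".
    win = max(1, int(round(frame_rate / vibrato_rate)))
    kernel = np.ones(win) / win
    return np.convolve(f0, kernel, mode="same")
```

Subtracting the smoothed trajectory from the original f0 also gives the modulation residual, from which the vibrato amplitude can be read off.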
A complete segmentation and feature extraction system is demonstrated.
Applications and results on various examples, such as a movie
sound track, are presented.