Source Localization and Source Separation
Figure 1: Target extraction using multi-channel signal processing
The figure above shows a sample multi-channel target enhancement scenario. Two competing speakers are
recorded by a 5-channel microphone array in a noisy and reverberant environment (T60 = 0.6 s; source and interferer
are each 1.0 m from the center of the array, outside the critical distance of the room (≈ 0.84 m)). The target
localisation and subsequent extraction/enhancement using the approaches developed in [1, 2] are presented below for this
example. Both speaker signals have the same power. The background noise is diffuse white noise mixed at 10 dB below
the signal power. Note that the noise is correlated across the microphones.
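The level relations described above can be sketched for a single channel as follows. This is only an illustration under stated assumptions: the inputs are random stand-in signals rather than the recordings, the helper names (`scale_to_power`, `mix_scene`, `power_ratio_db`) are hypothetical, and "10 dB below the signal power" is read here as 10 dB below the power of one speaker. Reverberation and the spatial correlation of the diffuse noise are not modelled.

```python
import numpy as np

def scale_to_power(x, target_power):
    """Scale x so that its mean power equals target_power."""
    return x * np.sqrt(target_power / np.mean(x ** 2))

def mix_scene(target, interferer, noise, noise_level_db=-10.0):
    """Mix two equal-power sources with noise at noise_level_db relative to one source."""
    target = scale_to_power(target, 1.0)
    interferer = scale_to_power(interferer, 1.0)
    noise = scale_to_power(noise, 10.0 ** (noise_level_db / 10.0))
    return target + interferer + noise, target, interferer, noise

def power_ratio_db(a, b):
    """Ratio of the mean powers of a and b in dB."""
    return 10.0 * np.log10(np.mean(a ** 2) / np.mean(b ** 2))

# Stand-in signals (assumption): random sequences in place of speech and noise.
rng = np.random.default_rng(0)
speech_a = rng.standard_normal(16000)
speech_b = 3.0 * rng.standard_normal(16000)
white_noise = rng.standard_normal(16000)

mixture, target, interferer, noise = mix_scene(speech_a, speech_b, white_noise)
sir_db = power_ratio_db(target, interferer)  # equal-power speakers
snr_db = power_ratio_db(target, noise)       # noise 10 dB below one speaker
```

With two equal-power speakers, the target-to-interferer ratio is 0 dB before any processing, and the target-to-noise ratio is 10 dB under the reading assumed here.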
As expected, the delay-and-sum beamformer (DSB) offers some enhancement, but because it does not actively cancel interference and noise, the enhancement is limited; in particular, the interferer is still clearly audible. The mask-based approach offers good
noise and interference suppression. The quality of the target speech is also rather good, owing to the use of a soft mask.
However, some artefacts are audible in the output. There is also one point in the signal (towards the latter part of the
sentence) where target distortion is audible: the target has low energy in these frames and is not well localised, which
leads to a sudden dip in the voice. This may impact intelligibility. The smoothed masks improve
the target speech quality, but at the cost of reduced noise and interference suppression. The PEG
approach offers good all-round performance, both in cancelling interference and noise and in preserving the
target signal. It sounds the most natural.
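To illustrate why the DSB leaves the interferer audible, the sketch below evaluates the response of a delay-and-sum beamformer towards the look direction and towards a competing direction. Everything about the geometry is an assumption made for the illustration, not taken from the recordings: a uniform linear array of 5 microphones with 5 cm spacing, far-field free-field propagation, the target at 0° (broadside) and a hypothetical interferer at 40°.

```python
import numpy as np

C = 343.0  # speed of sound in m/s

# Assumed geometry: 5 microphones, 5 cm apart, centred on the origin.
MIC_POSITIONS_M = (np.arange(5) - 2) * 0.05

def steering_vector(freq_hz, angle_deg, mic_positions_m=MIC_POSITIONS_M):
    """Far-field steering vector of a linear array; angle is measured from broadside."""
    delays = mic_positions_m * np.sin(np.deg2rad(angle_deg)) / C
    return np.exp(-2j * np.pi * freq_hz * delays)

def dsb_gain_db(freq_hz, look_deg, source_deg):
    """Gain in dB of a DSB steered to look_deg for a plane wave from source_deg."""
    w = steering_vector(freq_hz, look_deg) / len(MIC_POSITIONS_M)
    d = steering_vector(freq_hz, source_deg)
    # np.vdot conjugates its first argument, so this is w^H d.
    return 20.0 * np.log10(np.abs(np.vdot(w, d)) + 1e-12)

target_gain_db = dsb_gain_db(500.0, look_deg=0.0, source_deg=0.0)
interferer_gain_db = dsb_gain_db(500.0, look_deg=0.0, source_deg=40.0)
```

Under these assumptions the target passes with unit gain, while at 500 Hz the interferer is attenuated by less than 1 dB: the DSB only aligns and averages the channels, it places no null on the interferer, and at low frequencies its main lobe is wide.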
[1] N. Madhu, “Acoustic source localization: Algorithms, applications and extensions to source separation”, Dissertation,
Ruhr-Universität Bochum.
[2] N. Madhu, C. Breithaupt and R. Martin, “Temporal smoothing of spectral masks in the cepstral domain for speech
separation”, Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2008.
Figure 2: Extraction of 3 speakers from 2 microphones using the parsimoniously excited GSC (PEG). Source 2 is at broadside; sources 1 and 3 are 30° to the left and right of source 2, respectively.
This example demonstrates the extraction of sources from under-determined mixtures using the PEG approach. The source placements are as shown above; the microphones are 8 cm apart. The recordings were made in a reverberant room (T60 = 0.6 s; sources at 1 m from the array center, outside the critical distance of the room (dcrit = 0.84 m)). Each source signal consists of two sentences, one spoken by a male and one by a female speaker (not necessarily in that order).
Each source has the same power. Since each target therefore competes with two interferers of equal power, the effective mixing SIR is -3 dB.
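The -3 dB figure follows directly from the power ratio: with three equal-power sources, each target of power P faces interference of power 2P, so the SIR is 10·log10(P / 2P). A small sketch of this arithmetic (the function name `mixing_sir_db` is ours, for illustration only):

```python
import math

def mixing_sir_db(num_sources):
    """SIR in dB seen by any one source when num_sources equal-power sources are mixed."""
    if num_sources < 2:
        raise ValueError("need at least one interferer")
    # Target power P against (num_sources - 1) interferers of power P each.
    return -10.0 * math.log10(num_sources - 1)

sir_three_sources = mixing_sir_db(3)  # roughly -3 dB
sir_two_sources = mixing_sir_db(2)    # 0 dB, as in the two-speaker example
```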
The microphone signal: x1(n)
The clean sources as perceived at microphone 1:
1. s1(n)
2. s2(n)
3. s3(n)
Output of the PEG:
1. y1(n)
2. y2(n)
3. y3(n)
After a short adaptation period, the target sources are extracted quite well from the under-determined mixture, and they are much clearer after processing by the PEG. The source at broadside shows the least improvement, which is to be expected: separation is achieved by steering a null towards the interferers. For the source at broadside, this requires steering nulls both to the left and to the right of the source, which is difficult with only one degree of freedom. The system therefore nulls the strongest interferer (evident for the duration of the second sentence, where the strong female interferer is suppressed, allowing the target male speaker to be heard). For sources 1 and 3, the interfering sources all lie to one side of the target, making it possible to steer a broader null or to quickly vary the position of the null, thereby yielding better separation performance.
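The single degree of freedom can be made concrete with a narrowband sketch for two microphones 8 cm apart. It is not the PEG itself: it is a fixed null-steering beamformer under assumed far-field, free-field conditions at a single frequency (1 kHz, chosen arbitrarily), with hypothetical helper names. The weights are constrained to pass the broadside target undistorted and to place a null on the interferer at +30°; we then check what happens to the interferer at -30°.

```python
import numpy as np

C = 343.0  # speed of sound in m/s
MIC_POSITIONS_M = np.array([-0.04, 0.04])  # two microphones, 8 cm apart

def steering_vector(freq_hz, angle_deg):
    """Far-field steering vector; angle is measured from broadside."""
    delays = MIC_POSITIONS_M * np.sin(np.deg2rad(angle_deg)) / C
    return np.exp(-2j * np.pi * freq_hz * delays)

def null_steering_weights(freq_hz, target_deg, null_deg):
    """Weights w satisfying w^H d(target) = 1 and w^H d(null) = 0."""
    D = np.column_stack([steering_vector(freq_hz, target_deg),
                         steering_vector(freq_hz, null_deg)])
    # Two microphones give two equations: both constraints are used up here.
    return np.linalg.solve(D.conj().T, np.array([1.0, 0.0], dtype=complex))

def gain_db(w, freq_hz, angle_deg):
    """Gain in dB of the beamformer w for a plane wave from angle_deg."""
    response = np.vdot(w, steering_vector(freq_hz, angle_deg))  # w^H d
    return 20.0 * np.log10(np.abs(response) + 1e-12)

FREQ_HZ = 1000.0
w = null_steering_weights(FREQ_HZ, target_deg=0.0, null_deg=30.0)
target_gain_db = gain_db(w, FREQ_HZ, 0.0)
nulled_gain_db = gain_db(w, FREQ_HZ, 30.0)
other_gain_db = gain_db(w, FREQ_HZ, -30.0)
```

In this setting the interferer at +30° is cancelled, but the one at -30° is not attenuated at all, because no constraint is left to control it. This is consistent with the observation above that, for the broadside source, only the strongest interferer at a given time can be nulled.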