Echo explorer

Echo

Echo

Echo

Echo

How do phones get rid of this?

https://nicolasbrailo.github.io/

(PSA: Careful when scanning untrusted QR codes!)

DON'T USE HEADPHONES

SERIOUSLY, DON'T USE HEADPHONES

Some demos can produce very loud, annoying or even painful audio. Don't use headphones: they may be detrimental to your hearing health. Besides, demoing echo while wearing headphones is not very fun.

Stop copying me

Echo explorer
  • Humans can't deal with echo. Hearing ourselves talk breaks our frail concentration, and we just can't speak anymore.
  • The differences between self-feedback, reverb and echo are to a large extent psychoacoustic phenomena:
    • Under roughly 5 ms: self-feedback. You hear yourself all the time, and this, hopefully, doesn't confuse you: it's what we use to modulate the volume of our own voice.
    • Beyond 10 ms: we don't perceive echo; we perceive our own voice as reverb. This is also normal for us: there is always some reverb, and we don't notice it.
    • Beyond 40 to 50 ms: we stop perceiving a single acoustic event and start perceiving two distinct events (self-feedback plus reverb, then echo). This is the one we can't deal with.
  • We can't demo the difference between self-feedback, reverb and echo, because this demo can't reach latencies that low, but you can still get a sense of how echo impacts your speech. Try to read this text aloud while echo is active.
  • Start the demo with "Start", then tweak the delay and the attenuation. You can build an intuitive sense of how annoying echo is, if you didn't already know, and you can also see how much attenuation you need before echo stops being a problem. (A sketch of the echo model follows this list.)
  • The rest of the demos here work in a similar way to this one, but are a bit more complicated
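
To make this concrete, here is a minimal sketch of the echo model behind the demo (Python rather than the demo's actual in-browser code; the sample rate and parameter values are illustrative): echo is just a delayed, attenuated copy of a signal mixed back in.

    import numpy as np

    def add_echo(x: np.ndarray, sr: int, delay_ms: float, attenuation: float) -> np.ndarray:
        # Mix a delayed, attenuated copy of x back into x (a single reflection).
        d = int(sr * delay_ms / 1000)            # delay in samples
        out = np.copy(x)
        out[d:] += attenuation * x[:len(x) - d]  # echo arrives delay_ms later, scaled down
        return out

    # Illustrative values: a 200 ms echo at roughly -12 dB
    sr = 48000
    t = np.arange(sr) / sr
    voice = 0.5 * np.sin(2 * np.pi * 220 * t)    # stand-in for a mic signal
    echoed = add_echo(voice, sr, delay_ms=200, attenuation=0.25)
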
AEC demos
  • The demos below present different ways to solve the echo problem in a telephony system: how to keep the audio coming out of your speaker from being picked up by your mic.
  • There are audio samples with music or speech that you can use to test each AEC algorithm: playing one simulates echo from a far-end peer. Enable a loudspeaker together with the mic, and see how well each algorithm removes the speaker audio from the mic in the output.
  • The algorithm being tested will start receiving mic data once you click "Mic enable"
  • Disabling the mic will enable the "Play" button, which plays back the output the echo canceller produced while the mic was enabled. Clicking "Mic enable" again restarts the recording.
  • Debug tracks will start a recording of the AEC input, output and state. Once you stop the debug tracks, they will be downloaded as wav files that can be imported into Audacity.
  • As part of the download, you will see a file containing raw near-end (mic) audio, raw far-end (speaker) audio, and AEC output (the processed near-end audio)
  • Some AEC implementations may include other files, such as an echo estimate (the echo that the strategy THINKS should be present in the mic). This will let you understand the internal state of the algorithm
  • Eval perf will print echo metrics to the browser's console (//TODO add a stats overlay). A common metric, ERLE, is sketched after this list.
  • There are two plots: the left one shows the audio level in the speakers; the right one, in the mic.
  • The text with a lot of numbers shows the algorithm's stats. They make sense only when reading the code.
  • You can use the "passthrough" mode to understand the natural attenuation from the acoustic path
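
For reference, one standard way to quantify echo cancellation is ERLE (Echo Return Loss Enhancement): the power ratio between the raw mic signal and the AEC output, in dB. A minimal sketch follows; I'm assuming a metric along these lines, and the demo's actual "Eval perf" output may differ.

    import numpy as np

    def erle_db(mic: np.ndarray, aec_out: np.ndarray, eps: float = 1e-12) -> float:
        # ERLE: how much the AEC attenuated the mic signal, in dB.
        # Only meaningful on segments where the mic contains echo and no near-end speech.
        p_mic = np.mean(mic ** 2)
        p_out = np.mean(aec_out ** 2)
        return 10.0 * np.log10((p_mic + eps) / (p_out + eps))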

Half duplex
  • A naive strategy for echo removal: disable the mic every time there is active far-end audio.
  • This is what early speakerphones did!
  • Play with the sliders to see how hard it is to find a good setting: a release that's too short will result in few opportunities to "talk" (e.g. missing the first few words of a sentence when trying to reply). Too long, and the results will be bursty and noisy. (A sketch of such a gate follows this list.)
  • Play with different voices and content to see how hard it is to find a GLOBAL setting. What works for music will break for speech. Even for different speakers, the attack/release parameters may have different set points.
  • The quality of communication is horrible. Half-duplex is very far from a natural conversation flow: phonemes are 30 to 50 ms long, and humans frequently speak over one another to signal intent to speak, or emotions such as agreement or disagreement. This is caused by humans' poor design, lacking an efficient out-of-band signalling system.
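
A minimal sketch of such a gate (Python, not the demo's actual code; the threshold and release defaults are made up, and correspond roughly to the demo's sliders):

    import numpy as np

    def half_duplex_gate(mic: np.ndarray, far: np.ndarray, sr: int,
                         threshold: float = 0.01, release_ms: float = 200.0) -> np.ndarray:
        # Mute the mic whenever far-end audio is active, plus a hold (release) period.
        release = int(sr * release_ms / 1000)
        out = np.copy(mic)
        hold = 0                      # samples left before the gate reopens
        for i in range(len(mic)):
            if abs(far[i]) > threshold:
                hold = release        # far end is active: (re)start the hold timer
            if hold > 0:
                out[i] = 0.0          # gate closed: drop the near-end audio
                hold -= 1
        return out
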
Half duplex config

Naive subtract
  • This "AEC" implementation subtracts the far-end signal from the raw mic signal.
  • Unsurprisingly, it doesn't work. It probably sounds like a copy of the far-end signal. At best, you may get a comb-like effect.
  • A naive signal subtraction doesn't account for the fact that it takes time for the audio from your speakers to reach your microphone: the time it takes your system to output audio, the acoustic time-of-flight, and the time it takes your system to capture the audio again.
  • At best, a naive subtraction results in two similar copies of the audio being sent. If there is enough passive attenuation (your speaker is far enough from your microphone), this algorithm will probably just sound like the original far-end audio. The sketch below shows why.
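
A small numerical sketch (Python; the 60 ms loopback delay and 0.5 attenuation are illustrative): if the real echo arrives with some delay, subtracting the far-end signal at zero delay adds energy instead of removing it.

    import numpy as np

    sr = 48000
    far = np.random.default_rng(0).standard_normal(sr)  # far-end signal (white noise)

    delay = int(0.060 * sr)                 # assume a 60 ms speaker-to-mic loopback
    mic = np.zeros(sr)
    mic[delay:] = 0.5 * far[:sr - delay]    # mic picks up echo only, no near-end speech

    naive = mic - far                       # "cancel" by subtracting at zero delay
    print("echo power:          ", np.mean(mic ** 2))
    print("after naive subtract:", np.mean(naive ** 2))  # larger, not smaller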

Time-aligned subtract
  • This implementation calculates the loopback time (how long it takes sound from your speaker to reach the microphone). This is a dynamic number, influenced by the acoustic path (the distance from your mic to your speaker) as well as by your system (how long it takes for audio to travel from the OS to the real world and back).
  • The time delay is calculated using cross correlation (see the sketch after this list).
    You can change the xcorr config to see how it impacts processing time and accuracy.
    • Min delay is the minimum assumed delay. Making this shorter shouldn't affect the audio, but it will: the xcorr algorithm has to learn the minimum delay over a larger search range, using a lot of compute.
    • Update interval: how often xcorr should run. Smaller means quicker adaptation to changes in the acoustic environment.
    • Echo window: the maximum tail of the echo. Together with min delay, it can be used to reduce compute (reducing it too much means losing echo tails).
    • Step size: how precise the xcorr search is. The implementation uses a coarse and a fine search; this controls only the coarse search. If the step size is too large, we can miss the first correlation peak; if it's too small, we can't compute it in real time.
    • TX/RX threshold: rejects xcorr updates if there is no TX or RX signal
    • Smoothing: smoothing factor for the time delay. Without any smoothing, the algorithm is prone to jumps.
    • NCC Threshold: minimum threshold to consider the normalized xcorr matching. Below threshold, the algorithm assumes the signals are uncorrelated
    • XCorr window: a longer window can make up for a shorter minimum delay, at the cost of compute. If the window is too short, we won't find a correlation between near end and far end signals.
    • Echo attenuation: a gain for the echo path. This is unlikely to be 1: the echo picked up by the microphone will typically, but not always, be quieter than the speaker signal.
  • Here, the reverb/comb effect is greatly reduced (not entirely, as the time alignment isn't perfect)
  • But it does very little for the echo: the room's coloration means the two signals are different enough that subtracting them barely removes any echo. Plus, it's hard to get the attenuation right (we end up just changing the phase slightly).
  • Time alignment is necessary but not sufficient: we need to model the path between speaker and microphone.
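
A minimal sketch of delay estimation by cross-correlation (Python/scipy; the demo's implementation layers coarse/fine search, NCC thresholds and smoothing on top of this basic idea):

    import numpy as np
    from scipy.signal import correlate

    def estimate_delay(mic: np.ndarray, far: np.ndarray, sr: int,
                       min_delay_ms: float = 0.0, max_delay_ms: float = 500.0) -> int:
        # The loopback delay is the lag that maximizes the cross-correlation
        # between the far-end signal and the mic signal.
        lo = int(sr * min_delay_ms / 1000)
        hi = int(sr * max_delay_ms / 1000)
        xc = correlate(mic, far, mode="full", method="fft")
        lags = np.arange(-len(far) + 1, len(mic))   # lag axis for "full" mode
        search = (lags >= lo) & (lags <= hi)        # only consider plausible delays
        return int(lags[search][np.argmax(np.abs(xc[search]))])
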
Time aligner config

Room impulse response
  • Time-aligned subtraction fails because the room changes the signal: reflections, absorption, frequency-dependent attenuation. Instead of just delaying the far-end signal, we can try to model the entire acoustic path. The Room Impulse Response (RIR) captures how sound travels from speaker to microphone, including all the reflections and coloration.
  • To measure the RIR, we play a known test signal and record what the mic captures. Comparing the two gives us the room's transfer function. We can then convolve the far-end audio with this RIR to predict what the echo should look like, and subtract it (both steps are sketched in code after this list).
  • Sounds great in theory. In practice:
    • Measuring the RIR is disruptive: you need to play a loud test signal (a click, sweep, or noise burst). Not ideal mid-call.
    • The IR length needs to be longer than the echo tail. Too short, and the results are garbage.
    • The RIR changes when anything moves: people, chairs, doors. The measurement becomes stale quickly.
    • Even a millisecond of timing drift breaks the cancellation. The system latency changes dynamically, so the RIR you measured may not match the current delay.
  • This demo is hard to get right. You may need to take a few RIR measurements before it starts working.
    RIR parameters explained
    • IR Length: the maximum echo length the filter can capture. Must be longer than your room's reverb tail, otherwise the RIR gets truncated and cancellation fails. Longer = more compute, but too short = garbage results.
    • Measurement Duration: how long to play and record the test signal. Longer measurements give cleaner RIR estimates (more averaging, better signal-to-noise), but are more disruptive.
    • Test Signal Type: the signal used to probe the room. Trade-off between disruption and accuracy:
      • Skip RIR: don't measure, just use a basic delay. Useful as a baseline.
      • Dirac (sine/square): a short impulse. Quick and less annoying, but lower SNR. The room response is literally the recorded signal. Square has more high-frequency content.
      • MLS (Maximum Length Sequence): pseudo-random noise that's mathematically designed for impulse response measurement. Sounds like white noise. More disruptive, but gives a much cleaner RIR estimate because cross-correlation with MLS rejects uncorrelated noise.
    • Dirac Pulse Width: for Dirac signals, how wide the pulse is. Wider pulses have more energy (better SNR) but less high-frequency content (blurrier time resolution).
    • MLS Order: the length of the MLS sequence (2^order - 1 samples). Higher order = longer sequence = better noise rejection, but longer measurement time.
    • Echo Attenuation: gain applied to the predicted echo before subtraction. The RIR measurement is never perfect, so you may need to tweak this. Too low = residual echo. Too high = artifacts and distortion.
  • RIR-based AEC is a step in the right direction: we're finally modeling the room. But a static measurement can't keep up with a dynamic environment. We need something that adapts continuously.
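
A minimal sketch of both steps (Python; this uses regularized frequency-domain deconvolution, which may differ from the demo's internals, and the parameter values are illustrative):

    import numpy as np
    from scipy.signal import fftconvolve

    def estimate_rir(played: np.ndarray, recorded: np.ndarray,
                     ir_len: int, eps: float = 1e-6) -> np.ndarray:
        # recorded = played convolved with the RIR, so estimate the RIR by dividing
        # spectra; the eps regularization avoids dividing by near-zero bins.
        n = len(played) + len(recorded)
        P = np.fft.rfft(played, n)
        R = np.fft.rfft(recorded, n)
        h = np.fft.irfft(R * np.conj(P) / (np.abs(P) ** 2 + eps), n)
        return h[:ir_len]               # truncate to the assumed echo tail

    def cancel_echo(mic: np.ndarray, far: np.ndarray, rir: np.ndarray,
                    attenuation: float = 1.0) -> np.ndarray:
        # Predict the echo by convolving the far end with the RIR, then subtract.
        predicted = fftconvolve(far, rir)[:len(mic)]
        return mic - attenuation * predicted
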
Room impulse response config

LMS adaptive filter
  • Finally a real AEC. Instead of measuring once, we try to model the entire echo path all the time. We use an LMS (Least Mean Squares) adaptive filter to do this.
  • The filter tries to predict what the echo will look like, then compares the prediction to what the mic actually captures, computes the error, and adapts the filter coefficients to make the error smaller (see the sketch after this list). Over time, the filter converges to approximate the room's transfer function.
  • This is the base of real AECs. The filter adapts continuously, so it can track changes in the acoustic environment. It won't break just because a person moved around (but it will take some time to reconverge, which you can see in the demo!)
  • This implementation is half naive: when there is no far-end signal, the filter stops adapting (good), but when there are both far-end AND near-end signals, the filter will try to remove local speech too. This situation, where both far and near speakers are active at once, is called double-talk, and it's the most challenging environment for an AEC.
  • More things LMS can't model: non-linear distortion. When your speaker clips or has harmonic distortion, the echo won't be a linear function of the far-end signal, and a linear filter can't fully cancel it. Typically, non-LTI suppressors are run after an AEC to improve performance.
  • LMS params
    • Filter Length: how many taps (samples) to use. This is the maximum echo tail it can model. Like RIR's IR length, it must be longer than your room's reverb, otherwise the filter can't capture the full impulse response and you'll have residual echo. Longer = more compute per sample, and slower convergence (more coefficients to learn). Too short = can't model the room. Too long = slow and may never fully converge.
    • Step Size (μ): the learning rate - how aggressively the filter updates its coefficients each sample. This is the most sensitive parameter:
      • Too high: the filter overshoots, oscillates, or diverges entirely. You'll hear ringing, pumping, or the echo getting worse instead of better. In extreme cases, it explodes into loud noise.
      • Too low: the filter adapts too slowly. It may take seconds to converge, and it can't track fast changes in the acoustic environment (someone moving, door opening).
      • The optimal μ depends on the signal level and filter length. NLMS (Normalized LMS) automatically scales μ by signal power to make this less touchy.
    • Leakage: each sample, the filter coefficients are multiplied by this value (slightly less than 1). This slowly "forgets" old information, preventing coefficients from drifting to extreme values over time. Values very close to 1 (like 0.9999) mean almost no leakage - the filter has a long memory. Lower values make the filter more forgetful, which can help track changes but reduces steady-state performance.
    • Min Delay: the minimum acoustic delay before echo appears. Sound takes time to travel from speaker to mic, plus system latency. The filter doesn't need to model this silent period - we can skip it and start the filter where the echo actually begins. Saves computation and helps convergence. Set this to roughly your system's round-trip latency (usually 20-100ms depending on your audio setup).
    • Reset Filter: clears all coefficients back to zero. Useful when the filter has diverged or you've changed the acoustic setup significantly. After reset, the filter needs time to reconverge.
  • Watch the filter visualization: you can see the coefficients adapt in real-time. When it's working, you'll see the filter shape stabilize to something that looks like an impulse response.
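
A minimal sketch of the normalized variant, NLMS (Python, sample-by-sample for clarity; real implementations work on blocks, and the parameter defaults here are illustrative):

    import numpy as np

    def nlms_aec(mic: np.ndarray, far: np.ndarray, n_taps: int = 1024,
                 mu: float = 0.5, leakage: float = 1.0, eps: float = 1e-8):
        # Returns the error signal (the AEC output) and the final filter taps.
        w = np.zeros(n_taps)          # coefficients: the estimated echo path
        buf = np.zeros(n_taps)        # recent far-end samples, newest first
        out = np.zeros(len(mic))
        for n in range(len(mic)):
            buf = np.roll(buf, 1)     # shift the far-end history
            buf[0] = far[n]
            y = w @ buf               # predicted echo
            e = mic[n] - y            # error: mic minus predicted echo
            out[n] = e
            # Normalized update: step size scaled by the far-end power in the buffer.
            w = leakage * w + (mu / (buf @ buf + eps)) * e * buf
        return out, w
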
LMS config
