Yelling into the VoIP

Internet! Because phones were too simple

How humans hear
How computers hear
How to move voice between computers
How to play audio

Goto src

(PSA: Careful when scanning untrusted QR codes!)

Links

Goal: understand this E2E and have a reference for TLAs

Near end / Far end | RX/TX | Render / Capture

Inside a telephony app

Stages

Acoustics / Psychoacoustics

Acoustics: how sound behaves
Psychoacoustics: how humans think sound behaves
Outside of computers: sound = variation in air pressure. A MECHANICAL process.
When plotted (Pascals/Time), looks sinusoidal.
Graphs here may or may not resemble actual audio.

Acoustics: Propagation

Acoustics: Reverb

Reverb gives humans sense of space

Acoustics: Reverb

Echo

Reverb = aural events that can't be separated. Echo = distinct aural events.

Noise

Stereo

Humans use cues from reverb, stereo effects, absorption, etc. to spatialize sounds and to create "audio focus".

Moving to capture / mic

Sound capture

Analogue to digital

Beamforming

Volume control: AGC/DRC

Volume control sidequest: dB, dBFS, SPL, dBOV, LUFS

Better explanation here
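
As a small reference for the dBFS unit in this sidequest: a minimal sketch of converting a sample amplitude to decibels relative to full scale. The function name `to_dbfs` and the `full_scale` parameter are made up for illustration; only the 20 * log10 ratio is the standard part.

```python
import math

def to_dbfs(amplitude, full_scale=1.0):
    """Level of `amplitude` in dB relative to the largest representable value.

    0 dBFS is full scale; everything quieter is negative.
    """
    if amplitude <= 0:
        return float("-inf")  # digital silence has no finite dB value
    return 20.0 * math.log10(amplitude / full_scale)

# Float audio in [-1.0, 1.0]: half amplitude is roughly 6 dB below full scale.
half_float = to_dbfs(0.5)

# 16-bit PCM: the same ratio, just with a different full-scale value (assumed 32768).
half_int16 = to_dbfs(16384, full_scale=32768)

print(round(half_float, 2), round(half_int16, 2))
```

Both values print as about -6.02, which is why "halving the amplitude" and "minus 6 dB" are often used interchangeably.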

From DSP to user app

Other DSP types may include noise suppression, wind noise reduction, compensation for different hearing impairments, etc

We're here: capture

Let's discuss: render

Render: out of userspace

Render: out of DSP, into the world

Render sidequest: hearing range and Nyquist theorem

Human hearing: 20 Hz to 20 kHz
Human speech: 400 Hz to 4 kHz
Sample rate 48 kHz = Nyquist frequency 24 kHz = covers the full human hearing range
Sample rate 8 kHz = Nyquist frequency 4 kHz = less bandwidth; covers speech, but not the full human hearing range
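
The arithmetic behind those numbers, as a minimal sketch. The helper names `nyquist_hz` and `alias_hz` are illustrative, and `alias_hz` assumes a pure tone sampled with no anti-aliasing filter in front of the converter.

```python
def nyquist_hz(sample_rate_hz):
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2.0

def alias_hz(tone_hz, sample_rate_hz):
    """Frequency a pure tone appears at after sampling (no anti-aliasing filter)."""
    folded = tone_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)

print(nyquist_hz(48000))     # 24000.0
print(nyquist_hz(8000))      # 4000.0
print(alias_hz(1000, 8000))  # 1000: below Nyquist, unchanged
print(alias_hz(5000, 8000))  # 3000: above Nyquist, folds back into the band
```

The last line is the reason capture hardware filters out content above the Nyquist frequency before sampling: otherwise it comes back as a wrong tone inside the audible band.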

Transport

Audio -> Network packets

Filtering

Frames -> Packets

Packets may overlap frames
[Lossy] compression
Compression depends on available bandwidth
Packets may be elided entirely depending on VAD (DTX)
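
A toy sketch of the frames-to-packets step with DTX, under heavy assumptions: one frame per packet, no compression, and a plain RMS energy threshold standing in for a real VAD. `Packet`, `split_frames`, `is_voice`, and `packetize` are hypothetical names, not any codec's API.

```python
import math
from dataclasses import dataclass
from typing import List

@dataclass
class Packet:
    seq: int              # increments per packet actually sent
    timestamp: int        # index of the frame's first sample
    payload: List[float]  # uncompressed samples (a real codec would encode these)

def split_frames(samples, frame_len):
    """Cut the sample stream into fixed-size frames, dropping a trailing partial frame."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def is_voice(frame, threshold=0.01):
    """Crude VAD stand-in: frame RMS energy at or above a threshold."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return rms >= threshold

def packetize(samples, frame_len, threshold=0.01):
    packets = []
    seq = 0
    for index, frame in enumerate(split_frames(samples, frame_len)):
        if not is_voice(frame, threshold):
            continue  # DTX: nothing is sent for this frame
        packets.append(Packet(seq, index * frame_len, frame))
        seq += 1
    return packets

# Voice, silence, voice: the middle frame is never transmitted.
samples = [0.5] * 4 + [0.0] * 4 + [0.5] * 4
packets = packetize(samples, frame_len=4)
print([(p.seq, p.timestamp) for p in packets])  # [(0, 0), (1, 8)]
```

The sequence number stays contiguous while the timestamp jumps, which is how a receiver can tell a deliberate silence gap from a lost packet.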

Jitter buffering
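
A minimal fixed-depth jitter buffer sketch, to make the idea concrete: hold a few packets before starting playout, then release them in sequence order at a steady tick. The `JitterBuffer` class is hypothetical; production buffers adapt their depth to measured network jitter, which this one does not.

```python
class JitterBuffer:
    def __init__(self, depth=2):
        self.depth = depth    # packets to accumulate before playout starts
        self.pending = {}     # seq -> payload, filled in arrival order
        self.next_seq = None  # next sequence number due for playout
        self.started = False

    def push(self, seq, payload):
        """Called when a packet arrives from the network, in any order."""
        if self.started and seq < self.next_seq:
            return  # arrived after its playout slot: too late, drop it
        self.pending[seq] = payload

    def pop(self):
        """Called once per playout tick.

        Returns the payload due now, or None while prebuffering or when
        the due packet is missing (the caller would conceal the gap).
        """
        if not self.started:
            if len(self.pending) < self.depth:
                return None
            self.started = True
            self.next_seq = min(self.pending)
        payload = self.pending.pop(self.next_seq, None)
        self.next_seq += 1
        return payload

jb = JitterBuffer(depth=2)
jb.push(1, "b")          # packets arrive out of order
first = jb.pop()         # None: still prebuffering
jb.push(0, "a")
played = [jb.pop(), jb.pop()]
print(first, played)     # None ['a', 'b']
```

The trade-off is visible in `depth`: a deeper buffer reorders and absorbs more jitter but adds that much latency to the call.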

Dealing with packet loss

Much better sources than this presentation

E2E pipeline

Bonus: Echo

Bonus: Transfer function

The transfer function includes: time of flight between speaker and mic, reverb, different echo paths...
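
One common way to picture this transfer function is as an impulse response: the mic hears the far-end signal convolved with a list of delayed, attenuated copies. A minimal sketch, with a made-up impulse response in which the position of each tap is the delay in samples and its value is the attenuation:

```python
def apply_echo_path(far_end, impulse_response):
    """Convolve the far-end signal with the echo path's impulse response."""
    out = [0.0] * len(far_end)
    for n in range(len(far_end)):
        acc = 0.0
        for k, h in enumerate(impulse_response):
            if n - k >= 0:
                acc += h * far_end[n - k]
        out[n] = acc
    return out

# Assumed path: a direct speaker-to-mic echo after 2 samples (time of flight),
# plus a weaker reflection off a wall after 4 samples.
path = [0.0, 0.0, 0.5, 0.0, 0.25]
click = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
echoed = apply_echo_path(click, path)
print(echoed)  # [0.0, 0.0, 0.5, 0.0, 0.25, 0.0]
```

Feeding in a single click returns the impulse response itself, which is what "measuring the transfer function" amounts to in this simplified linear model.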

Bonus: AEC models multiple paths

Bonus: AEC reconverges when transfer changes

When the path changes, the AEC needs to recalculate the transfer function. This includes accounting for third-party DSP.
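
Reconvergence can be sketched with an NLMS adaptive filter, a textbook technique for this job; it is used here as an illustration, not as a claim about what any particular AEC ships. Everything below is a toy: the mic signal contains only echo (no near-end speech, so no double-talk handling), and the two echo paths are invented.

```python
import random

def echo(far_end, path):
    """Echo heard at the mic: far-end signal convolved with the path."""
    return [sum(h * far_end[n - k] for k, h in enumerate(path) if n - k >= 0)
            for n in range(len(far_end))]

def nlms(far_end, mic, taps=8, mu=0.5, eps=1e-8):
    """Estimate the echo path sample by sample and subtract the estimated echo."""
    weights = [0.0] * taps  # current estimate of the transfer function
    history = [0.0] * taps  # recent far-end samples, newest first
    residual = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        estimate = sum(w * h for w, h in zip(weights, history))
        error = d - estimate  # what is left after cancellation
        power = sum(h * h for h in history) + eps
        step = mu * error / power
        weights = [w + step * h for w, h in zip(weights, history)]
        residual.append(error)
    return residual, weights

rng = random.Random(0)
far_end = [rng.uniform(-1.0, 1.0) for _ in range(4000)]
path_a = [0.0, 0.0, 0.5, 0.0, 0.25]
path_b = [0.0, 0.3, 0.0, 0.0, 0.4]  # e.g. the phone was moved at sample 2000
mic = echo(far_end, path_a)[:2000] + echo(far_end, path_b)[2000:]

residual, weights = nlms(far_end, mic)
print("before change:", abs(residual[1999]))
print("right after:  ", max(abs(e) for e in residual[2000:2020]))
print("reconverged:  ", abs(residual[-1]))
```

The residual is near zero before the change, jumps when the path switches, then decays again as the weights move from `path_a` toward `path_b`. That audible burst of echo after moving a device is the reconvergence period.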