When you sing into Sonitus, the sound from your microphone passes through a 13-layer signal processing pipeline before any note is played. Each layer runs in real time, processing audio in chunks of 512 samples at 44,100 samples per second. That gives the engine roughly 11.6 milliseconds per chunk to analyze your voice and generate the correct musical response. Here is what happens in those milliseconds.
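The 11.6-millisecond figure follows directly from the hop size and sample rate:

```python
HOP_SIZE = 512        # samples per processing chunk
SAMPLE_RATE = 44_100  # samples per second

hop_ms = HOP_SIZE / SAMPLE_RATE * 1000  # time budget per chunk
print(f"{hop_ms:.1f} ms")  # → 11.6 ms
```

Every layer described below has to finish its share of the work inside that budget, every chunk, without fail.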
Layer 1: Pitch Detection
The raw audio enters a pitch detection algorithm based on the YIN autocorrelation method, implemented through the Aubio library. This extracts the fundamental frequency of your voice, measured in Hertz. A confidence score determines whether the detected pitch is reliable or just background noise. Only when confidence exceeds a threshold does the engine treat the signal as intentional singing.
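Aubio does this work in optimized C, but the core YIN idea fits in a few lines. The sketch below is an illustrative reimplementation of the difference-function approach, not Aubio's code, and the 0.15 threshold is an assumed value:

```python
import numpy as np

SR, HOP = 44_100, 512

def yin_pitch(x, sr=SR, threshold=0.15):
    """YIN sketch: difference function, cumulative-mean normalization,
    first-dip-below-threshold pick. Returns (freq_hz, confidence)."""
    n = len(x) // 2
    # d(tau): how poorly the signal matches itself shifted by tau samples
    d = np.array([np.sum((x[:n] - x[tau:tau + n]) ** 2) for tau in range(n)])
    # Cumulative-mean normalization suppresses the trivial dip at tau = 0
    cmnd = np.ones(n)
    running = 0.0
    for tau in range(1, n):
        running += d[tau]
        cmnd[tau] = d[tau] * tau / running if running > 0 else 1.0
    # First dip under the threshold, walked down to its local minimum
    tau = 2
    while tau < n - 1:
        if cmnd[tau] < threshold:
            while tau + 1 < n and cmnd[tau + 1] < cmnd[tau]:
                tau += 1
            return sr / tau, 1.0 - cmnd[tau]
        tau += 1
    return 0.0, 0.0  # low confidence: treat as background noise

t = np.arange(HOP) / SR
freq, conf = yin_pitch(np.sin(2 * np.pi * 440.0 * t))
```

On a clean 440 Hz sine the detector lands within a few Hz of the true pitch with near-total confidence; on noise, the normalized difference never dips below the threshold and the function reports no pitch at all.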
Layer 2: MIDI Note Mapping
The detected frequency is converted to the nearest MIDI note number using the standard formula: 69 + 12 * log2(frequency / 440). A stability filter prevents rapid note flickering: the note must remain consistent for a minimum number of hops (successive 512-sample analysis frames) before the engine commits to it.
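Both steps are compact. In this sketch the three-hop threshold is an assumed value; the engine's actual window isn't specified:

```python
import math

def freq_to_midi(freq_hz):
    """Standard conversion: A4 = 440 Hz = MIDI note 69."""
    return round(69 + 12 * math.log2(freq_hz / 440.0))

class NoteStabilizer:
    """Commit to a note only after it has been seen for min_hops
    consecutive hops (illustrative filter, not Sonitus's exact one)."""
    def __init__(self, min_hops=3):
        self.min_hops = min_hops
        self.candidate = None
        self.count = 0
        self.committed = None

    def update(self, note):
        if note == self.candidate:
            self.count += 1
        else:
            self.candidate, self.count = note, 1
        if self.count >= self.min_hops:
            self.committed = note
        return self.committed
```

A brief wobble to a neighboring note resets the candidate counter but leaves the committed note untouched, which is exactly the anti-flicker behavior described above.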
Layer 3: Key Detection
Over time, Sonitus builds a histogram of which pitch classes (C, C#, D, and so on) appear most frequently in your singing. By correlating this histogram against rotated major and minor scale templates (the Krumhansl-Schmuckler key-finding algorithm, built on the Krumhansl-Kessler probe-tone profiles), it determines what key you are singing in. The key detection updates continuously but changes gradually, preventing jarring modulations from brief passing tones.
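The template-matching step can be sketched directly from the published Krumhansl-Kessler profiles; the test histogram at the end is illustrative:

```python
import numpy as np

# Krumhansl-Kessler probe-tone profiles (major and minor)
MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                  2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                  2.54, 4.75, 3.98, 2.69, 3.34, 3.17])
NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def detect_key(histogram):
    """Correlate a 12-bin pitch-class histogram against all 24 rotated
    major/minor templates; return the best-matching key."""
    best, best_r = None, -2.0
    for tonic in range(12):
        for profile, mode in ((MAJOR, "major"), (MINOR, "minor")):
            template = np.roll(profile, tonic)
            r = np.corrcoef(histogram, template)[0, 1]
            if r > best_r:
                best, best_r = f"{NAMES[tonic]} {mode}", r
    return best
```

Feed it a histogram dominated by C, G, and E and it confidently answers "C major"; the gradual-update behavior in the engine comes from smoothing the histogram itself, not from this matching step.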
Layer 4: Chord Inference
Given the detected key and the current melody note, the engine selects the most harmonically appropriate chord. It uses music theory rules (the melody note should be a chord tone or a common extension) combined with smooth voice-leading constraints. Chord changes happen on beat boundaries, never in the middle of a phrase. The algorithm prefers diatonic chords within the key but allows secondary dominants and borrowed chords when the melody implies them.
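The chord-tone rule on its own reduces to a lookup over the diatonic triads of the key. This sketch covers only major keys and omits extensions, voice-leading, and borrowed chords; the table layout is illustrative:

```python
# Diatonic triads in a major key, as pitch-class offsets from the tonic
DIATONIC_TRIADS = {
    "I":    [0, 4, 7],   "ii":  [2, 5, 9],   "iii": [4, 7, 11],
    "IV":   [5, 9, 0],   "V":   [7, 11, 2],  "vi":  [9, 0, 4],
    "vii°": [11, 2, 5],
}

def candidate_chords(melody_note, tonic):
    """Return the diatonic chords whose tones include the melody's
    pitch class -- the 'melody note must be a chord tone' rule."""
    pc = (melody_note - tonic) % 12
    return [name for name, tones in DIATONIC_TRIADS.items() if pc in tones]
```

A melody E over a C tonic yields I, iii, and vi as candidates; the voice-leading constraints described above would then break the tie by choosing whichever chord moves the accompaniment voices the least.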
Layer 5: Section Detection
The engine continuously analyzes energy levels, pitch range, and singing patterns to classify the current musical section as intro, verse, bridge, or chorus. This classification drives instrument selection, drum intensity, and dynamic expression. A hysteresis system prevents rapid section switching: the engine requires sustained evidence before transitioning, and it holds each section for a minimum duration.
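The hysteresis logic can be sketched as a small state machine. Both thresholds here are assumed values for illustration, not the engine's tuning:

```python
class SectionTracker:
    """Hysteresis: switch sections only after evidence_hops consecutive
    votes for a new section, and never before min_hold hops have
    elapsed in the current one."""
    def __init__(self, evidence_hops=8, min_hold=32):
        self.evidence_hops = evidence_hops
        self.min_hold = min_hold
        self.section = "intro"
        self.held = 0      # hops spent in the current section
        self.pending = None
        self.votes = 0

    def update(self, classified):
        self.held += 1
        if classified == self.section:
            self.pending, self.votes = None, 0   # evidence resets
        elif classified == self.pending:
            self.votes += 1
        else:
            self.pending, self.votes = classified, 1
        if (self.pending and self.votes >= self.evidence_hops
                and self.held >= self.min_hold):
            self.section, self.held = self.pending, 0
            self.pending, self.votes = None, 0
        return self.section
```

A single anomalous hop can never flip the section: the vote counter resets the moment the classifier agrees with the current section again.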
Layer 6: Tempo Tracking
Tempo is inferred from onset patterns in the audio signal and from the regularity of the singer's phrasing. The engine can also be set to a fixed tempo. Tempo adapts gradually, following the singer's natural acceleration and deceleration rather than rigidly snapping to a grid.
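Gradual adaptation is the key property. One common way to get it, sketched here with an assumed smoothing constant, is to exponentially smooth the tempo implied by each onset interval:

```python
class TempoTracker:
    """Nudge the running tempo toward the tempo implied by each new
    onset interval, rather than snapping to it. (Sketch; alpha is an
    assumption, not Sonitus's constant.)"""
    def __init__(self, bpm=90.0, alpha=0.1):
        self.bpm = bpm
        self.alpha = alpha

    def on_onset_interval(self, seconds):
        instant_bpm = 60.0 / seconds        # tempo implied by this gap
        self.bpm += self.alpha * (instant_bpm - self.bpm)
        return self.bpm
```

With a small alpha, a singer rushing one phrase barely moves the grid; a sustained push over many beats carries the whole arrangement with them.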
Layer 7: Expression Modulation
MIDI CC11 (expression) is modulated in real time based on singing intensity. When you sing louder, each instrument responds with increased expression. When you pull back, the instruments soften. Each instrument type has its own expression curve, so the organ responds differently from the piano, which responds differently from the drums.
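Per-instrument curves can be modeled as power functions over a normalized intensity. The exponents below are hypothetical, chosen only to show the shape of the idea:

```python
def expression_cc(intensity, curve=1.0):
    """Map singing intensity (0.0-1.0) to MIDI CC11 (0-127),
    shaped by a per-instrument power curve."""
    intensity = min(max(intensity, 0.0), 1.0)  # clamp out-of-range input
    return round(127 * intensity ** curve)

# Assumed exponents: < 1 responds early, > 1 holds back until you push
CURVES = {"organ": 0.6, "piano": 1.0, "drums": 1.8}
```

At half intensity the organ is already well into its range while the drums are still restrained, which is the kind of differentiated response the layer is designed to produce.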
Layers 8-13: Synthesis and Output
The remaining layers handle instrument voicing, voice-leading, humanization (velocity jitter, timing micro-offsets, ghost notes), feedback cancellation, ducking for speech detection, and final audio mixing through FluidSynth. The output passes through a soft-clipping limiter to prevent distortion before reaching your speakers or PA system.
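Soft clipping is the one stage here with a one-line classic form. A tanh curve is a common choice (the document doesn't specify the engine's exact curve): nearly linear for quiet signals, smoothly saturating toward ±1 instead of slamming into a hard ceiling:

```python
import math

def soft_clip(sample, drive=1.0):
    """tanh soft-clipping limiter: transparent at low levels,
    gentle saturation instead of harsh digital clipping at high levels."""
    return math.tanh(drive * sample)
```

A sample of 0.1 passes through almost untouched, while a runaway peak of 10.0 lands just below 1.0 rather than producing a distorted square edge.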
The Result
All 13 layers execute within each 11.6-millisecond audio hop. The total processing latency from microphone input to synthesized output is typically 40 to 80 milliseconds, low enough to go unnoticed during live worship. The entire engine is written in C, compiled natively for each platform, and runs without any network connection.
No cloud. No lag. Just music that follows your voice.