Reverb Algorithm
Technical Reference Document
Integrated Overview of Theory, Implementation, Perception, and Modern Approaches
Freeverb3 / FDN / Dattorro / Lexicon / Convolution Reverb / Neural FDN
When listening to music in a large concert hall, the sound continues to resonate for some time after the instrument stops playing. Even during the performance, the sound resonates in a way that adds richness and luster to the music — this is why performing in a hall is preferred. Suntory Hall in Akasaka, Tokyo, is one of the most celebrated examples of a hall prized for its beautiful acoustics.
Reverb is not merely a "residual sound" — it is the spatial information itself, determining the depth, sense of distance, and texture of the sound. Acoustically, two temporal domains coexist simultaneously:
Using a real concert hall is impractical due to time and cost constraints, and in CD mastering and similar applications, software is used to artificially recreate the acoustic properties of a hall. Freeverb3 is a well-known example. There are two primary methods for generating such reverberation:
| Method | Overview / Characteristics |
|---|---|
| FIR (Convolution Reverb) | Records and uses the actual impulse response of a real space. Highly realistic but less flexible and computationally expensive. "Technology for copying a real hall." |
| IIR (Algorithmic Reverb) | Artificially generates reverb using mathematical models. Lightweight, adjustable, and creatively flexible, but complex to design. "Technology for creating the feeling of a hall." Used in nearly all commercial hardware units such as Lexicon. |
The front-to-back perception (depth) of a sound is primarily determined by three factors:
In other words, depth perception is reverb design itself.
Humans do not perceive individual reflections; rather, they perceive patterns of energy distribution and temporal change. What is needed in a mix is not precise reflection placement but density, decay, and frequency balance.
FIR (IR) reverb can be difficult to use in certain contexts for the following reasons:
By contrast, algorithmic reverb functions as a "perceptually optimized fiction" and proves advantageous in contexts where musical pleasantness is the priority — pop music, film scores, and game audio.
The oldest and most widely implemented form — including in free software — is the reverb based on a combination of comb filters and allpass filters. A detailed explanation is provided on the ARI-WEB website. The precursor to Freeverb3, known as Freeverb, is based on this algorithm, which is referred to as "Schroeder Reverb."
The fundamental issue is modal unevenness — i.e., insufficient randomness. The comb filter produces equally spaced frequency peaks (analogous to a standing-wave structure), which is the root cause of the characteristic metallic "ping" sound.
As a result, it is rarely used in applications demanding high quality. However, its metallic character is sometimes exploited as a substitute for a plate reverb (a real unit that inputs sound into a metal plate and outputs the vibration). Freeverb generates its filters using empirically derived values and is considered to have relatively less metallic character, though its limitations remain.
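The comb-plus-allpass structure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the Freeverb or Freeverb3 code: the function names, delay lengths, and gains are assumptions chosen only to show the topology (parallel feedback combs followed by series allpass filters).

```python
import numpy as np

def comb(x, delay, g):
    """Feedback comb filter: y[n] = x[n] + g * y[n - delay]."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = x[n] + (g * y[n - delay] if n >= delay else 0.0)
    return y

def allpass(x, delay, g):
    """Schroeder allpass: y[n] = -g * x[n] + x[n - delay] + g * y[n - delay]."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        xd = x[n - delay] if n >= delay else 0.0
        yd = y[n - delay] if n >= delay else 0.0
        y[n] = -g * x[n] + xd + g * yd
    return y

def schroeder_reverb(x, comb_delays=(1103, 1217, 1319, 1427), comb_gain=0.84,
                     ap_delays=(241, 83), ap_gain=0.5):
    """Parallel combs summed, then series allpasses (illustrative values only)."""
    wet = sum(comb(x, d, comb_gain) for d in comb_delays) / len(comb_delays)
    for d in ap_delays:
        wet = allpass(wet, d, ap_gain)
    return wet

impulse = np.zeros(8000)
impulse[0] = 1.0
ir = schroeder_reverb(impulse)   # impulse response of the sketch
```

Each comb with delay D places resonant peaks at multiples of fs/D. Assuming a 44.1 kHz sample rate, the 1103-sample comb above has peaks roughly every 40 Hz, which is the evenly spaced structure blamed for the metallic character.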
| Implementation | Characteristics |
|---|---|
| Freeverb | Improved Schroeder type. Lightweight and simple. Metallic character relatively suppressed through empirical parameters. |
| CCRMA NRev | A more refined version of the same algorithm as Freeverb. |
| NVerb (v2) | Uses nested allpass filters plus input feedback to minimize ringing as much as possible. Improved low-frequency resonance. |
In a standard (series) configuration: input → AP → AP → AP. In a nested structure, another reverb algorithm is inserted inside an allpass filter. This results in:
It is not widely known that nested allpass filters outperform series allpass filters in quality, but Schroeder himself referenced this algorithm long ago.
◆ Adding modulation to the feedback section of a comb filter (making it a chorus) tends to suppress ringing. The Lexicon Concert Hall algorithm is believed to be based on "Chorus comb filter + allpass filter + nested allpass filter."
The feedback delay network (FDN) is one of the methods developed to address the limitations of Schroeder reverb. The algorithm consists of several delay lines of varying lengths along with a matrix that mixes their outputs and feeds them back. In Freeverb3, this is implemented as Hibiki Reverb.
An FDN can be described as a multi-input, multi-output recursive system as follows:
x(n) = A · x(n-D) + B · u(n)
y(n) = C · x(n)
The condition for system stability is that the spectral radius satisfies ρ(A) < 1. Because an FDN includes delays, a more practical statement is: "stable when A is unitary (or orthogonal) and attenuation is applied in the loop."
In other words, an FDN is a "collection of many decaying oscillation modes." If the eigenvalues are uneven, specific frequencies are emphasized (ringing); if the delay lengths form simple ratios, modes overlap. This is the same structural issue that causes the Schroeder reverb to sound metallic.
Using eight or more delay lines together with a Hadamard matrix makes it relatively straightforward to generate high-quality reverberation. Many implementations address frequency phase variations present in real reverb by dynamically varying the matrix in real time. This approach also accommodates algorithmic generation of early reflections and is widely used.
Vintage digital reverbs such as the Lexicon 224 and EMT 250 are highly regarded for their "warm" and beautiful sound — much of which stems from their loop-based (tank) algorithm using allpass filters. By constructing a large loop, a reverb can be created that, like FDN, increases in density over time.
While the algorithm itself is simple, finding the optimal delay lengths and gains is challenging.
The Lexicon 224 was reverse-engineered and implemented as the ESP reverb, and was published in an academic paper. The ESP manual is mirrored on the Freeverb3 website.
The feedback loop is relatively short and is classified as a Plate Reverb, but a suitable character can be achieved by adjusting the coefficients. In Freeverb3_VST, it is implemented as STRev, and separately as ProG Reverb (Progenitor Reverb) based on a different loop structure optimized for hall use.
◆ The main feedback loop of Progenitor Reverb is based on the published Dattorro algorithm, but the input diffuser and output taps are original. The feedback coefficient inside the loop and diffuser is approximately 0.5 as a baseline, with experimentation suggesting an optimal range of 0.4–0.8.
The overall structure follows this flow: Input → Pre-delay → Multiple Allpass (diffuser) → Main Loop (2-channel cross-feedback) → Tap output (multiple points) → Damping.
The core is the cross-feedback. The left loop and right loop each feed back into themselves while also feeding back into each other (cross). This mixes the two channels, so echo density increases rapidly over time.
| Component | Perceptual Role |
|---|---|
| Allpass diffuser (immediately after input) | Disperses transients, eliminates early echoes: "diffusion at the moment of entering the space" |
| Allpass in the loop | Additional diffusion. Optimal feedback coefficient: 0.4–0.8 |
| Damping (LPF) | Naturally attenuates high frequencies. Without it, the sound is "digitally sterile" |
| Modulation | Breaks fixed modes over time. Removes metallic quality; gives a "living" sense of animation |
| Tap outputs (multiple points) | Output from different temporal positions and phases. Generates stereo image and spatial width |
The essential difference from FDN: FDN mixes all signals at once via a matrix, whereas Dattorro mixes structurally (gradually and temporally). This makes it more "musically" controllable.
More recent Lexicon reverbs use frequency division to apply the most appropriate algorithm to each band when generating reverberation.
The Bass XOV parameter is a mechanism for dividing and processing low-frequency content to extend the low-band decay time, among other functions. Lexicon adopted this division algorithm quite early on, resulting in a reverb with powerful and warm low-frequency response. This approach is also physically justified, as low frequencies have low directivity in real environments while high frequencies have high directivity.
From an information-theoretic perspective: the low band achieves entropy increase (richness) early on, while the high band rapidly achieves a reduction in mutual information (diffusion).
| Generation | Representative | Core Structure | Sonic Character |
|---|---|---|---|
| 1st Gen | EMT 250 (1976) | Short loop + Allpass | Imperfect but warm, granular quality |
| 2nd Gen | Lexicon 224 / 480L | Perceptual optimization + multiband + randomization | Musical, magical tail |
| 3rd Gen | Bricasti M7 | Waveguide-like physical approximation | Reproduces real spatial feel |
| 4th Gen | DiffFDN / PINN | Learning-based integration | Unified perception, physics, and information |
The precise meaning of convolution is the mathematical operation commonly written as follows. The term "folding" as a rendering is sometimes seen in non-technical contexts, but the correct technical term is convolution.
Mathematical definition:
y(n) = Σ_k x(k) · h(n-k)
Acoustic meaning: IR = the "complete response" of a space (or device); convolution = "computing the result of playing the input signal in that space."
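The sum above can be written out directly. The following sketch uses a toy four-sample input and impulse response (both made-up values) and a hypothetical helper named convolve_direct; it is meant only to show what the formula computes, not to be an efficient implementation.

```python
import numpy as np

def convolve_direct(x, h):
    """Full linear convolution: y[n] = sum_k x[k] * h[n - k]."""
    x = np.asarray(x, dtype=float)
    h = np.asarray(h, dtype=float)
    y = np.zeros(len(x) + len(h) - 1)
    for k, xk in enumerate(x):
        # Each input sample launches a scaled, shifted copy of the IR.
        y[k:k + len(h)] += xk * h
    return y

x = np.array([1.0, 0.5, 0.0, -0.25])   # dry input (toy values)
h = np.array([1.0, 0.0, 0.6, 0.3])     # impulse response (toy values)
y = convolve_direct(x, h)
```

Reading the loop as "every input sample triggers its own copy of the IR" matches the acoustic interpretation: the output is the input played in the space that h describes.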
Converting between microphone types (e.g., SM57 → U87) requires deconvolution (an inverse filter).
H_IR(ω) = Y(ω) / X(ω)
A practical issue is noise amplification due to "division by near-zero" at frequencies where the denominator approaches zero. Regularization and band limiting are used as countermeasures.
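One common form of the regularization mentioned above adds a small constant to the denominator, so that bins where X(ω) is nearly zero do not blow up. The sketch below assumes this Tikhonov-style variant; the function name, the eps value, and the synthetic test signals are illustrative assumptions, not something specified in this document.

```python
import numpy as np

def deconvolve_regularized(y, x, eps=1e-6):
    """Estimate h from y = x * h using H = Y conj(X) / (|X|^2 + eps)."""
    n = len(y)
    X = np.fft.rfft(x, n)
    Y = np.fft.rfft(y, n)
    H = Y * np.conj(X) / (np.abs(X) ** 2 + eps)   # eps guards near-zero bins
    return np.fft.irfft(H, n)

rng = np.random.default_rng(0)
x = rng.standard_normal(256)                                   # excitation signal
h_true = np.exp(-np.arange(64) / 10.0) * rng.standard_normal(64)  # synthetic decaying IR
y = np.convolve(x, h_true)                                     # simulated recording
h_est = deconvolve_regularized(y, x)[:64]
```

A larger eps suppresses more noise but biases the estimate at frequencies where the excitation carries little energy, which is why band limiting is often applied alongside it.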
Ideally, a delta function input would be used, but in practice, the signal-to-noise ratio is too poor. The modern standard method uses a sine sweep or TSP (Time-Stretched Pulse).
Deconvolution: h(n) = y(n) * x^-1(n)
FFT convolution formula: Y = IFFT(FFT(X) · FFT(H))
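As a quick check of the formula, the sketch below zero-pads both signals to a common FFT length and compares the result with direct convolution. The helper name fft_convolve and the random test signals are assumptions for illustration.

```python
import numpy as np

def fft_convolve(x, h):
    """Linear convolution via Y = IFFT(FFT(X) * FFT(H)) with zero padding."""
    n = len(x) + len(h) - 1
    nfft = 1 << (n - 1).bit_length()        # next power of two >= n
    Y = np.fft.rfft(x, nfft) * np.fft.rfft(h, nfft)
    return np.fft.irfft(Y, nfft)[:n]        # drop the padding

rng = np.random.default_rng(1)
x = rng.standard_normal(1000)
h = rng.standard_normal(300)
y_fft = fft_convolve(x, h)
```

The zero padding matters: without it the product of the two spectra corresponds to circular convolution, and the tail of the reverb would wrap around onto the start of the output.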
Because FFT operates on blocks, a delay equal to the block size is introduced (e.g., 1024 samples ≈ 23 ms at 44.1 kHz).
This achieves both low latency and fast processing simultaneously.
IR can only represent LTI (Linear Time-Invariant) systems.
| Can Be Reproduced | Cannot Be Reproduced |
|---|---|
| Frequency characteristics, reverb, phase response | Even-order harmonic distortion in tube amplifiers (nonlinear distortion) |
| General linear spatial response | Nonlinear compression effects of speakers |
| | FM-type modulation (time-dependent frequency variation) |
IR can reproduce the "shape" of a sound but not its "behavior." Additionally, while the characteristics of recording equipment (speakers, microphones) can be partially removed through deconvolution, complete removal is impossible due to constraints of SNR, dynamic range, and distortion.
Representative convolution reverb software of the era:
| Software | Characteristics |
|---|---|
| SIR (Windows-only, freeware) | Initially high latency; progressively improved across versions |
| WAVES IR-1 / IR-L | Industry-standard commercial products |
| Altiverb | Supports stereo I/O using 4-channel IR. High spatial accuracy and high-quality bundled library. However, requires iLok and is expensive. |
| Voxengo Pristine Space | High-value commercial product |
| WizooVerb W2 | Commercial product |
◆ Altiverb's 4-channel IR enables superior spatial accuracy for stereo I/O. A procedure for converting Altiverb's proprietary format (little-endian 24-bit PCM + gain correction data in the resource fork) to WAV was shared publicly. Free IR distribution sites such as Noisevault, Voxengo, and memi.com were also widely used.
IR is not limited to "recording spaces" — it has a wide range of creative applications:
Below is a basic C++ implementation of an FDN. The template parameter length specifies the dimension of the FDN. The range of feedback is [-1.0, 1.0]; values outside this range will cause the system to diverge.
#include <array>
#include <cstddef>
#include <numeric>

// Delay<Sample> and RateLimiter<Sample> are not shown here.
template<typename Sample, size_t length>
struct FeedbackDelayNetwork {
  size_t bufIndex = 0;
  std::array<std::array<Sample, length>, 2> buf{};          // double buffer of delay outputs
  std::array<std::array<Sample, length>, length> matrix{};  // feedback matrix
  std::array<Delay<Sample>, length> delay;
  std::array<RateLimiter<Sample>, length> delayTimeSample;  // delay times in samples

  Sample process(Sample input, Sample feedback) {
    bufIndex ^= 1;
    auto &front = buf[bufIndex];
    auto &back = buf[bufIndex ^ 1];

    // Mix the previous delay outputs through the feedback matrix.
    front.fill(0);
    for (size_t i = 0; i < length; ++i)
      for (size_t j = 0; j < length; ++j)
        front[i] += matrix[i][j] * back[j];

    // Split the input evenly across the delay lines, then run each delay.
    input /= Sample(length);
    for (size_t idx = 0; idx < length; ++idx) {
      const Sample sig = input + feedback * front[idx];
      front[idx] = delay[idx].process(
        sig, delayTimeSample[idx].process());
    }
    return std::accumulate(front.begin(), front.end(), Sample(0));
  }
};
◆ Reset-related methods are omitted. When the template parameter length is large, rewriting with std::vector is recommended (approximately dim=200 is the practical upper limit).
The argument feedback is a scalar coefficient that uniformly scales the feedback matrix values. Note that because Delay performs linear interpolation, which acts as a mild low-pass filter, the output (high frequencies in particular) gradually attenuates if the delay time is non-integer.
Condition for FDN not to diverge: ρ(g·A) ≤ 1, where ρ denotes the spectral radius.
For an orthogonal matrix A, A^T · A = I holds. This ensures |λ_i| = 1 and thus energy is preserved:
‖y‖² = ‖Ay‖²
That is, an orthogonal matrix + |g| < 1 guarantees stability. Attenuation is controlled solely by g.
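These properties are easy to confirm numerically. The sketch below builds an orthogonal matrix from a QR decomposition (an assumption made for brevity; any orthogonal matrix would do) and checks the eigenvalue magnitudes, the spectral radius of g·A, and energy preservation. The size 8 and g = 0.9 are arbitrary example values.

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))
A = Q            # orthogonal feedback matrix: A^T A = I
g = 0.9          # scalar feedback gain with |g| < 1

eig_mag = np.abs(np.linalg.eigvals(A))             # |lambda_i| of A
rho = np.max(np.abs(np.linalg.eigvals(g * A)))     # spectral radius of g*A

y = rng.standard_normal(8)
energy_in = np.dot(y, y)                           # ||y||^2
energy_out = np.dot(A @ y, A @ y)                  # ||Ay||^2
```

Here every entry of eig_mag is 1 up to rounding, rho equals g, and the two energies agree, so the decay of the network is set by g alone.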
◆ According to Schlecht and Habets' "On lossless feedback delay networks," making the feedback matrix unitary or triangular prevents an FDN from diverging.
The optimal solution cannot be determined analytically; it is a hybrid of search and heuristics.
According to research by Schlecht and Habets, the main types of feedback matrices that prevent FDN from diverging are unitary matrices (possibly including complex numbers) and triangular matrices. For the implementations discussed here, real orthogonal matrices are sufficient.
| Matrix Type | Characteristics / Sonic Tendency |
|---|---|
| Random orthogonal matrix | High diffusion, uniform energy distribution. Smooth with minimal metallic character. Equivalent to scipy.stats.ortho_group.rvs() |
| Special orthogonal matrix | Determinant = 1. Difference in sound vs. random orthogonal is difficult to perceive |
| Householder matrix | Form: H = I - 2(vv^T)/(v^Tv). Constructable from dim random numbers. Somewhat weaker diffusion with more distinct character |
| Hadamard matrix | Only ±1 values. Perfectly orthogonal. Fast (additions only). Similar sound to random orthogonal but CPU-efficient |
| Circulant matrix | Shift structure with periodicity. "Pipe-like" sound. Diagonalizable via DFT |
| Triangular matrix | Strong early reflections and metallic character. Short delays are prominent |
| Conference matrix | Diagonal elements are 0. Very natural; minimizes self-loop (metallic sources). Excellent sound quality |
| Absorbent allpass matrix | From Schlecht-Habets Eq. (10). May be less efficient than not using FDN in some cases |
Implementation based on Mezzadri's "How to generate random matrices from the classical compact groups" — a C++ translation of scipy.stats.ortho_group.rvs().
#include <array>
#include <cmath>
#include <cstddef>
#include <random>
#include "pcg_random.hpp"  // pcg64, from the pcg-cpp library

template<typename Sample, size_t dim>
void randomOrthogonal(unsigned seed,
  std::array<std::array<Sample, dim>, dim> &H) {
  pcg64 rng{};
  rng.seed(seed);
  std::normal_distribution<Sample> dist{};

  // Start from the identity matrix.
  H.fill({});
  for (size_t i = 0; i < dim; ++i) H[i][i] = Sample(1);

  std::array<Sample, dim> x;
  for (size_t n = 0; n < dim; ++n) {
    // Draw a Gaussian vector whose length shrinks by one each pass.
    auto xRange = dim - n;
    for (size_t i = 0; i < xRange; ++i) x[i] = dist(rng);

    Sample norm2 = 0;
    for (size_t i = 0; i < xRange; ++i) norm2 += x[i] * x[i];

    // Build the Householder vector, scaled so that x^T x = 2.
    Sample x0 = x[0];
    Sample D = x0 >= 0 ? Sample(1) : Sample(-1);
    x[0] += D * std::sqrt(norm2);
    Sample denom = std::sqrt(
      (norm2 - x0*x0 + x[0]*x[0]) / Sample(2));
    for (size_t i = 0; i < xRange; ++i) x[i] /= denom;

    // Apply the reflection to the first xRange entries of H.
    for (size_t row = 0; row < dim; ++row) {
      Sample dotH = 0;
      for (size_t col = 0; col < xRange; ++col)
        dotH += H[col][row] * x[col];
      for (size_t col = 0; col < xRange; ++col)
        H[col][row] = D * (H[col][row] - dotH * x[col]);
    }
  }
}
◆ PCG is used as the random number generator because it is adopted in NumPy's default_rng(). It is preferred over std::minstd_rand for its better statistical quality.
Initialized with 1/sqrt(dim) and recursively constructed by tiling. dim must be a power of 2.
template<typename Sample, size_t dim>
void constructHadamardSylvester(
  std::array<std::array<Sample, dim>, dim> &mat) {
  // dim must be a nonzero power of 2.
  static_assert(dim && ((dim & (dim-1)) == 0), ...);

  // Seed value; the scale makes the final matrix orthonormal.
  mat[0][0] = Sample(1) / std::sqrt(Sample(dim));

  // Each pass doubles the filled block: [[M, M], [M, -M]].
  size_t start = 1;
  size_t end = 2;
  while (start < dim) {
    for (size_t row = start; row < end; ++row)
      for (size_t col = start; col < end; ++col) {
        auto &&value = mat[row-start][col-start];
        mat[row-start][col] = value; // Upper right
        mat[row][col-start] = value; // Lower left
        mat[row][col] = -value;      // Lower right
      }
    start *= 2;
    end *= 2;
  }
}
Expressed in the form H = I - 2(vv^T)/(v^Tv). An orthogonal matrix can be constructed from dim random numbers, and the result is symmetric (H = H^T).
template<typename Sample, size_t dim>
void randomHouseholder(unsigned seed,
  std::array<std::array<Sample, dim>, dim> &matrix) {
  // ... (initialize vec with uniform random values)
  auto scale = Sample(-2) / denom;

  // H = I - 2 (v v^T) / (v^T v); fill both triangles since H is symmetric.
  for (size_t i = 0; i < dim; ++i) {
    matrix[i][i] = Sample(1) + scale * vec[i] * vec[i];
    for (size_t j = i+1; j < dim; ++j) {
      auto value = scale * vec[i] * vec[j];
      matrix[i][j] = value;
      matrix[j][i] = value;
    }
  }
}
The size of a Conference matrix must satisfy the condition: "q+1, where q is an odd number expressible as the sum of two squares (equivalent to OEIS A286636)." Note that every usable size is therefore even. Candidate sizes from OEIS sequence A000952: 62, 54, 50, 46, 42, 38, 30, 26, 18, 14, 10, 6, 2.
Because the diagonal elements are 0, there are no simple comb filter sections, resulting in superior sound quality.
Final matrix: C[0][0]=0, C[0][i]=C[i][0]=1/sqrt(modulo), C[i][j]=S[i][j]
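For reference, the layout above resembles the classical Paley construction, in which S holds quadratic-residue signs modulo q. The sketch below assumes that reading and handles only the case where q is a prime with q ≡ 1 (mod 4), such as 5, 13, 17, or 29; prime powers and other admissible sizes need different constructions. The function name and the scaling of the whole matrix by 1/sqrt(q) are assumptions.

```python
import numpy as np

def paley_conference(q):
    """Orthogonal conference matrix of size q + 1 for a prime q with q % 4 == 1."""
    residues = {(k * k) % q for k in range(1, q)}   # nonzero quadratic residues mod q

    def chi(a):
        # Quadratic character: 0 for a = 0, +1 for residues, -1 otherwise.
        a %= q
        if a == 0:
            return 0.0
        return 1.0 if a in residues else -1.0

    C = np.zeros((q + 1, q + 1))
    C[0, 1:] = 1.0           # first row, except the zero corner
    C[1:, 0] = 1.0           # first column, except the zero corner
    for i in range(q):
        for j in range(q):
            C[i + 1, j + 1] = chi(i - j)
    return C / np.sqrt(q)    # scale so that C @ C.T = I

C = paley_conference(13)     # 14 x 14, one of the candidate sizes
```

The result is symmetric, has a zero diagonal, and is orthogonal, so it can be used directly as an FDN feedback matrix.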
A matrix introduced in Schlecht-Habets' "Time-varying feedback matrices in FDN." It represents a nested allpass filter structure. Converges when α is in the range (-1, 1).
template<typename Sample, size_t dim>
void randomAbsorbent(unsigned seed, Sample low, Sample high,
  std::array<std::array<Sample, dim>, dim> &mat) {
  // dim must be even
  constexpr size_t half = dim / 2;

  // Generate orthogonal matrix A of size half×half
  randomOrthogonal(seeder(rng), A);

  // Fill the four half×half blocks column by column.
  for (size_t col = 0; col < half; ++col) {
    auto gain = dist(rng);
    mat[half+col][half+col] = gain;             // Bottom-right
    mat[half+col][col] = Sample(1) - gain*gain; // Bottom-left
    for (size_t row = 0; row < half; ++row) {
      mat[row][half+col] = A[row][col];         // Top-right
      mat[row][col] = -A[row][col]*gain;        // Top-left
    }
  }
}
By modifying the random number generation section of randomSpecialOrthogonal as follows, the proximity to the identity matrix can be adjusted:
x[0] = Sample(1);
for (size_t i = 1; i < xRange; ++i)
  x[i] = identityAmount * dist(rng);
Acoustically, lowering identityAmount tends to strengthen early reflections while weakening the late reverb tail.
Normally, eigenvalues are determined by the feedback matrix A. However, "direct eigenvalue design" — specifying eigenvalues first and then constructing the matrix — is also possible.
Matrix reconstruction: A = V Λ V^-1 (Λ: designed eigenvalues, V: arbitrary orthogonal basis)
The essence of eigenvalue design is the "design of the decay spectrum."
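A minimal sketch of this reconstruction follows. To keep A real, it assumes the designed eigenvalues come in complex-conjugate pairs r·exp(±jθ), which lets Λ be written as a block-diagonal matrix of scaled 2×2 rotations; with an orthogonal V, V^-1 is simply V^T. The function name, radii, and angles are illustrative assumptions.

```python
import numpy as np

def design_matrix(radii, angles, seed=0):
    """Real A = V L V^T whose eigenvalues are radii[k] * exp(+/- j * angles[k])."""
    n = 2 * len(radii)
    L = np.zeros((n, n))
    for k, (r, th) in enumerate(zip(radii, angles)):
        c, s = r * np.cos(th), r * np.sin(th)
        L[2*k:2*k+2, 2*k:2*k+2] = [[c, -s], [s, c]]   # one conjugate pair
    rng = np.random.default_rng(seed)
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))  # arbitrary orthogonal basis
    return V @ L @ V.T

radii = [0.99, 0.98, 0.97, 0.96]    # per-mode decay (magnitude)
angles = [0.3, 1.1, 1.9, 2.7]       # per-mode rotation (phase), radians
A = design_matrix(radii, angles)
eig = np.linalg.eigvals(A)
```

The radii set how fast each pair of modes decays, which is the sense in which choosing them amounts to designing the decay spectrum.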
| Good Reverb | Bad Reverb |
|---|---|
| No echo sensation (density increases over time) | Metallic |
| Low frequencies extend naturally | Periodic (ping-pong feel) |
| High frequencies are not harsh | Muddy, or loses definition |
| Does not degrade stereo imaging | Specific band is prominent |
| Energy distribution becomes uniform over time | Ringing is clearly audible |
Reference values for the just-noticeable difference (JND):
Knowing the JND leads to a simple design principle: precision finer than listeners can perceive is wasted effort. Optimization at the 1 ms level or frequency resolution design below the hearing limit carries no practical benefit.
In simultaneous (frequency) masking, a loud sound masks weaker sounds in adjacent frequency bands. The "granular quality" of the EMT 250 creates masking and can sound more pleasant than complete diffusion, a paradox in which "imperfection is the perceptual optimum."
Temporal (forward) masking is an effect that makes sounds immediately following a loud sound inaudible. Early reflections are masked by the direct sound, so complete physical accuracy is not required.
Treating the sound field as an "information source" rather than a "wave":
Goal of a good reverb: Entropy increases smoothly over time.
Information-theoretic interpretation of metallic sound: A phenomenon in which information fails to diffuse and structure remains perceptible.
The perceptual "magic" of the Lexicon 480L lies in the combination of Bass XOV (multiband processing), controlled modulation, and the Loop Tank structure.
In Random Hall, internal modulation (Decay Optimization) corresponds to CHO.
◆ The manual states that "BASS MULT maximum, RT HF CUT medium, HF CUTOFF low" is the ideal concert hall setting.
Bass XOV is a mechanism that varies the optimal information diffusion rate by frequency band, based on the following physical and perceptual facts:
According to the official manual, V2 modulation is pitch variation in the late reverb tail — different from chorus or flange.
Each delay tap time D_i(t) is independently modulated:
D_i(t) = D_i + Δ · m(t, level)
// C++ implementation (simplified)
Sample sine = std::sin(2.0f * M_PI * modFreqBase * phase[i]);
Sample noise = 0;
for (int k = 0; k < 3; ++k)
  noise += std::sin(2.0f * M_PI * (modFreqBase*(k+1)) * np[i]) / (k+1);
noise /= 3.0f;
// Low: sine-dominant; High: noise-dominant; Mid: blend
The EMT 250 (1976) was the world's first commercial digital reverb and the originator of the Loop Tank concept.
The internal algorithm of the Bricasti M7 is entirely proprietary, but its acoustic characteristics strongly suggest a structure inspired by a scattering-type Waveguide Network.
| Characteristic | Lexicon 480L | Valhalla VintageVerb | Bricasti M7 |
|---|---|---|---|
| 3D depth | Magical (perceptually optimized) | Beautiful spatial spread | Overwhelming realism |
| Warmth / low end | Rich via multiband | Sufficiently warm | Natural but full-bodied |
| Musicality | Supreme (emotionally evocative) | High (Lexicon-like) | Somewhat restrained |
| Metallic quality | Good via randomization | Excellent | Nearly zero |
| Mix compatibility | Prominent, hard to lose | Forward, pop-oriented | Blends naturally |
| Character | Vintage magic | Modern lush | Modern realistic |
◆ The Lexicon 480L embodies "perceptual correctness over physical correctness." It anticipated in the 1980s — through human ears — the perceptual loss optimization that DiffFDN and PINN now pursue computationally.
Traditional FDN design has mathematically pursued stability and diffusion, but it is now possible to directly optimize for "how humans perceive the output."
J(θ) = Σ_t w_t · d(P(y_θ(t)), P(y_target(t)))
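import torch

# mel() and loudness() are helper transforms not shown here.
def perceptual_loss(y, y_target):
    Y = loudness(mel(y))
    Yt = loudness(mel(y_target))
    return torch.mean(torch.abs(Y - Yt))  # mean absolute error in the perceptual domain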
Learning the FDN itself with a neural network. The orthogonality constraint is preserved through parameterization.
import torch
import torch.nn as nn

class FDN(nn.Module):
    def __init__(self, N=8):
        super().__init__()
        self.S = nn.Parameter(torch.randn(N, N))         # unconstrained matrix parameter
        self.delay = nn.Parameter(torch.rand(N) * 2000)  # delay lengths in samples

    def orthogonal_matrix(self):
        S = self.S - self.S.T        # Antisymmetric matrix
        return torch.matrix_exp(S)   # exp(S) is always orthogonal
By taking the matrix exponential of antisymmetric matrix S, A = exp(S) is always orthogonal, so the feedback matrix is lossless by construction; combined with a feedback gain |g| < 1, this guarantees stability.
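The orthogonality claim can be checked without PyTorch. The sketch below is a NumPy stand-in for torch.matrix_exp, using a hypothetical helper expm_taylor (scaling-and-squaring with a truncated Taylor series); it is meant as a numerical sanity check under those assumptions, not as a replacement for the differentiable version.

```python
import numpy as np

def expm_taylor(M, squarings=8, terms=20):
    """Matrix exponential via scaling-and-squaring with a truncated Taylor series."""
    X = M / (2 ** squarings)          # scale down so the series converges quickly
    E = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, terms + 1):
        term = term @ X / k           # X^k / k!
        E = E + term
    for _ in range(squarings):
        E = E @ E                     # undo the scaling: exp(M) = exp(X)^(2^s)
    return E

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
S = W - W.T                           # antisymmetric: S^T = -S
A = expm_taylor(S)                    # should satisfy A^T A = I
```

Because exp(S)^T = exp(S^T) = exp(-S) = exp(S)^-1, the product A^T A is the identity for any choice of W, which is why gradient updates on the unconstrained parameter never leave the set of orthogonal matrices.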
def total_loss(x, y, y_target):
    lp = perceptual_loss(y, y_target)
    le = -entropy_loss(y)           # Maximize entropy
    lmi = mutual_info_loss(x, y)    # Minimize mutual information
    ld = decorrelation_loss(y)      # Decorrelation
    return lp + 0.1*le + 0.1*lmi + 0.1*ld
The wave equation ∂²p/∂t² = c² ∇²p is incorporated into the loss function:
Loss = DataLoss + PhysicsLoss
PhysicsLoss = ‖∂²p/∂t² - c² ∇²p‖²
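To make the physics term concrete, the sketch below evaluates the residual for a 1-D standing wave that satisfies the wave equation exactly, using central finite differences. In an actual PINN the derivatives would come from automatic differentiation of the network output; the finite-difference grid, the value c = 343 m/s, and the test field are assumptions chosen for illustration.

```python
import numpy as np

c = 343.0                        # speed of sound in m/s (assumed)
k = 2.0 * np.pi                  # wavenumber of a 1 m standing wave (assumed)
x = np.linspace(0.0, 1.0, 201)   # positions in metres
t = np.linspace(0.0, 0.01, 2001) # times in seconds
X, T = np.meshgrid(x, t, indexing="ij")
p = np.sin(k * X) * np.cos(c * k * T)   # exact solution of the wave equation

dx, dt = x[1] - x[0], t[1] - t[0]
# Second derivatives by central differences on the interior grid points.
p_tt = (p[1:-1, 2:] - 2 * p[1:-1, 1:-1] + p[1:-1, :-2]) / dt**2
p_xx = (p[2:, 1:-1] - 2 * p[1:-1, 1:-1] + p[:-2, 1:-1]) / dx**2

residual = p_tt - c**2 * p_xx
physics_loss = np.mean(residual**2)                 # PhysicsLoss on this grid
relative_loss = physics_loss / np.mean(p_tt**2)     # normalised for readability
```

For a field that obeys the equation the normalised loss is close to zero, limited only by discretisation error; a field that violates it produces a large value, which is the signal the optimizer uses.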
Traditional IR: y(n) = x(n) * h(n) (h is fixed). Neural IR: y(n) = Σ_k x(k) · h(n, k) (h is time-dependent).
Seventh Heaven reproduces the M7 using its proprietary Fusion-IR technology (modulated capture of multiple IRs). It is fundamentally different from static IRs (e.g., Samplicity M7 IR).
A model that dynamically controls eigenvalues:
λ_k(t) = r_k(t) · exp(jθ_k(t))
◆ This is an important idea that shifts reverb from a "static structure" to a "dynamic field."
IR represents only the "first-order term" of the Volterra series. Real acoustic environments contain the following nonlinearities:
All of these can be expressed via the second- and third-order terms of the Volterra series. A simple downstream distortion (waveshaper) is insufficient — nonlinearity must be embedded within the convolution process itself.
Linear system: y(t) = ∫ h₁(τ) x(t-τ) dτ
Volterra series (extended to nonlinear):
y(t) = ∫ h₁(τ) x(t-τ) dτ
+ ∬ h₂(τ₁,τ₂) x(t-τ₁) x(t-τ₂) dτ₁dτ₂
+ ...
x_i(n+1) = Σ_j A_ij x_j(n) + Σ_{j,k} B_ijk x_j(n) x_k(n)
// C++ implementation example
y[i] += alpha * x[i] * x[i]; // Second-order Volterra term
◆ α should be a small value on the order of 0.001–0.01. Clipping prevention (e.g., tanh) is essential.
class NonlinearFDN(nn.Module):
    def __init__(self, N=8):
        super().__init__()
        self.S = nn.Parameter(torch.randn(N, N))
        self.alpha = nn.Parameter(torch.tensor(0.01))  # nonlinear coefficient

    def forward(self, x):
        A = torch.matrix_exp(self.S - self.S.T)  # orthogonal mixing matrix
        y_lin = x @ A                            # linear term
        y_nl = self.alpha * (x ** 2)             # second-order term
        return y_lin + y_nl
Rather than fixing the nonlinear coefficient α, it can be made time- or state-dependent.
The essence: By making the nonlinearity a "state-dependent system," a more natural spatial response is simulated.
The dimensionality explosion of the Volterra kernel h₂(τ₁, τ₂) is approximated using a neural network.
class NeuralVolterra(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 1)
        )

    def forward(self, x):
        return self.net(x)  # x is a time-series frame (128 samples wide)

class HybridReverb(nn.Module):
    def __init__(self):
        super().__init__()
        self.fdn = FDN()            # assumes FDN defines forward(), omitted in the excerpt above
        self.nl = NeuralVolterra()

    def forward(self, x):
        y_lin = self.fdn(x)
        y_nl = self.nl(frame_signal(x))  # frame_signal is a helper not shown here
        return y_lin + y_nl.squeeze()
The following processing pipeline is the recommended configuration for a high-quality reverb with nonlinear components:
| Level | Perspective / Approach |
|---|---|
| Level 1: Physical | Wave equation: ∂²p/∂t² = c² ∇²p |
| Level 2: Structural | FDN / Waveguide Network (low-dimensional wave approximation) |
| Level 3: Statistical | Mode distribution, diffusion, energy distribution |
| Level 4: Perceptual | JND, masking, ERB filters, perceptual loss |
| Level 5: Learning | Differentiable FDN, PINN, Neural Volterra |
| Level 6: Information | Entropy maximization, mutual information minimization |
Commercial units are the result of optimizing — through human ears — the following implicit objective function:
J(θ) = d_spec + d_time + d_info + d_mask
Synthesizing all perspectives, a good reverb is one that satisfies the following conditions:
In one phrase:
"A dynamic system in which the energy, information, and perception of sound undergo controlled diffusion over time"