Network Architectures
Neural network architectures for denoising autoencoders.
Currently provides two classes:
VanillaAutoEncoder - fully-connected DAE used in production.
VariationalAutoEncoder - placeholder for a future VAE variant.
- class libdse.nets.VanillaAutoEncoder(input_dim: int, latent_dim: int, hidden_layer_struct: list[int] | None = None, dropout: list[float] | None = None)
Bases: Module
Symmetric fully-connected denoising autoencoder.
The encoder compresses the input through a user-defined stack of linear → ReLU → LayerNorm (→ Dropout) layers down to a bottleneck of size latent_dim. The decoder mirrors this structure and maps the bottleneck back to the original input dimension.
A LayerNorm is prepended to the encoder to normalise the raw input features to zero mean and unit variance.
- Parameters:
input_dim (int) – Dimensionality of the input feature vector.
latent_dim (int) – Bottleneck (latent) dimensionality.
hidden_layer_struct (list[int] or None) – Ordered list of hidden-layer widths between the input and the bottleneck. latent_dim is appended automatically. Defaults to [1024, 512, 256, 128].
dropout (float or None) – Dropout probability applied after the first hidden layer of the encoder (and the corresponding decoder layer). None or 0.0 disables dropout.
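How these parameters translate into layer sizes can be sketched as follows. This is a minimal illustration rather than the library's code: layer_widths is a hypothetical helper, and the decoder widths assume an exact mirror of the encoder, as the class description suggests.

```python
def layer_widths(input_dim, latent_dim, hidden_layer_struct=None):
    """Return (encoder_widths, decoder_widths) for a symmetric autoencoder.

    Hypothetical helper for illustration only; not part of libdse.
    """
    # Fall back to the documented default hidden stack.
    if hidden_layer_struct is None:
        hidden = [1024, 512, 256, 128]
    else:
        hidden = list(hidden_layer_struct)
    # latent_dim is appended automatically after the hidden widths.
    encoder = [input_dim, *hidden, latent_dim]
    # Assumption: the decoder is the exact mirror of the encoder.
    decoder = encoder[::-1]
    return encoder, decoder


encoder, decoder = layer_widths(input_dim=2000, latent_dim=32)
print(encoder)  # [2000, 1024, 512, 256, 128, 32]
print(decoder)  # [32, 128, 256, 512, 1024, 2000]
```

Each consecutive pair of widths corresponds to one linear layer, so the default configuration yields five linear layers on each side of the bottleneck.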
- forward(input: Tensor) → Tensor
Encode input to the bottleneck, then decode back to input space.
- Parameters:
input (torch.Tensor) – Feature batch, shape (B, input_dim).
- Returns:
Reconstructed batch, shape (B, input_dim).
- Return type:
torch.Tensor
- class libdse.nets.WaveUNet(n_layers: int, f_u: int, f_d: int, F_c: int)
Bases: Module
Wave-U-Net for end-to-end audio source separation (Stoller et al., 2018).
Conceptual overview
Wave-U-Net operates directly on the raw audio waveform, with no STFT and no spectrogram. The architecture is a 1-D analogue of the image-segmentation U-Net: a contracting encoder path progressively halves the time resolution while adding F_c feature channels per layer, a bottleneck captures the most abstract representation, and a symmetric expanding decoder path recovers the original resolution step by step.
The key insight that makes this work for separation is the skip connections: every encoder layer’s output is concatenated (channel-wise) to the corresponding decoder layer’s input. This gives the decoder access to fine-grained temporal detail that would otherwise be lost during downsampling, letting the network combine high-level context (what is happening globally) with low-level detail (exactly how the waveform looks locally) at every scale simultaneously.
Signal flow:
    raw audio → [DS 1] → decimate → [DS 2] → decimate → … → bottleneck
                  ↓                   ↓
                saved               saved    (skip connections)
                  ↓                   ↓
       output ← [US 1] ← upsample ← [US 2] ← upsample ← …
Channel schedule (following Table 1 of the paper)
Let F_c be the channel-growth factor. Encoder layer k (1-indexed) produces k * F_c channels. The bottleneck produces (n_layers + 1) * F_c channels. During decoding, the skip connection from the mirror encoder layer is concatenated before the convolution, so the number of input channels to each decoder convolution equals the sum of the upsampled decoder channels and the corresponding encoder channels.
Output
The network predicts the foreground source (e.g. vocals / speech) as a residual mask on the original waveform. The background (accompaniment / noise residual) is obtained for free as original - foreground, which enforces the implicit mixture constraint that both outputs must sum back to the input.
- Parameters:
n_layers (int) – Number of encoder (= decoder) layers. More layers mean a larger receptive field and more levels of temporal abstraction.
f_u (int) – Kernel size of every upsampling (decoder) convolution.
f_d (int) – Kernel size of every downsampling (encoder) and bottleneck convolution.
F_c (int) – Base channel-growth factor. Encoder layer k will have k * F_c output channels.
- Reference:
Stoller, D., Ewert, S., & Dixon, S. (2018). Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation. arXiv:1806.03185.
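The channel schedule described above can be worked through numerically. The sketch below uses a hypothetical helper, channel_schedule, which is not part of libdse. The encoder and bottleneck widths follow the text directly; the decoder input widths additionally assume that the decoder layer mirroring encoder layer k outputs k * F_c channels, as in Table 1 of the paper, which may differ from the actual implementation.

```python
def channel_schedule(n_layers, F_c):
    """Return (encoder_out, bottleneck_out, decoder_in) channel counts.

    Hypothetical helper for illustration only; not part of libdse.
    decoder_in is ordered deepest layer first, matching the order in
    which skip connections are consumed.
    """
    # Encoder layer k (1-indexed) produces k * F_c channels.
    encoder_out = [k * F_c for k in range(1, n_layers + 1)]
    bottleneck_out = (n_layers + 1) * F_c

    decoder_in = []
    upsampled = bottleneck_out  # channels arriving from the layer below
    for k in range(n_layers, 0, -1):
        skip = encoder_out[k - 1]
        # Concatenation adds the skip channels to the upsampled channels.
        decoder_in.append(upsampled + skip)
        # Assumption (paper, Table 1): this decoder layer outputs k * F_c.
        upsampled = k * F_c
    return encoder_out, bottleneck_out, decoder_in


enc, mid, dec_in = channel_schedule(n_layers=3, F_c=24)
print(enc)     # [24, 48, 72]
print(mid)     # 96
print(dec_in)  # [168, 120, 72]
```

Under that assumption, the decoder convolution at depth k always sees (2k + 1) * F_c input channels.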
- center_crop(input: Tensor, target_shape: int) → Tensor
Crop the time axis of input symmetrically to target_shape.
Because Conv1d without padding shortens the time axis by kernel_size - 1, encoder and decoder tensors at the same depth will have slightly different lengths. Before concatenating a skip connection we therefore crop the longer tensor to the length of the shorter one, always removing an equal number of samples from both ends to keep the remaining samples centred in time.
- Parameters:
input (torch.Tensor) – Tensor of shape (B, C, T_in) to be cropped.
target_shape (int) – Desired length T_out along the time axis. Must satisfy T_out <= T_in.
- Returns:
Tensor of shape (B, C, T_out).
- Return type:
torch.Tensor
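The index arithmetic behind a symmetric crop can be sketched in plain Python. center_crop_bounds is a hypothetical helper, not the library's method. The handling of an odd length difference is an assumption here (the extra sample is dropped from the end); the documentation does not say which side the library trims in that case.

```python
def center_crop_bounds(t_in, t_out):
    """Return (start, stop) such that x[..., start:stop] has length t_out.

    Hypothetical helper for illustration only; not part of libdse.
    """
    if t_out > t_in:
        raise ValueError("target length must not exceed input length")
    excess = t_in - t_out
    start = excess // 2   # samples removed from the front
    # Assumption: when excess is odd, the extra sample is removed
    # from the back.
    stop = start + t_out
    return start, stop


samples = list(range(10))
start, stop = center_crop_bounds(len(samples), 6)
print(samples[start:stop])  # [2, 3, 4, 5, 6, 7]
```

The same bounds apply to a tensor of shape (B, C, T_in) by slicing only the last axis.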
- stack_channels(input1: Tensor, input2: Tensor) → Tensor
Concatenate two feature maps along the channel dimension.
This is the skip-connection merge operation. Because encoder and decoder tensors differ in length (due to unpadded convolutions), input2 is centre-cropped to match the time length of input1 before concatenation.
- Parameters:
input1 (torch.Tensor) – Primary tensor, shape (B, C1, T). Its time length determines the output length.
input2 (torch.Tensor) – Skip-connection tensor, shape (B, C2, T'). Will be cropped to T along the time axis.
- Returns:
Merged tensor of shape (B, C1 + C2, T).
- Return type:
torch.Tensor
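The crop-then-concatenate behaviour can be illustrated without tensors. In the sketch below, nested lists indexed as [batch][channel][time] stand in for tensors, and both function names are hypothetical; this is not how the library implements the method.

```python
def crop_time(batch, target_len):
    """Centre-crop the time axis of a [B][C][T] nested list.

    Hypothetical stand-in for center_crop; illustration only.
    """
    t_in = len(batch[0][0])
    start = (t_in - target_len) // 2
    return [[ch[start:start + target_len] for ch in item] for item in batch]


def stack_channels_lists(input1, input2):
    """Concatenate two [B][C][T] nested lists along the channel axis."""
    target_len = len(input1[0][0])           # input1 sets the output length
    cropped = crop_time(input2, target_len)  # input2 is cropped to match
    # For each batch item, append input2's channels after input1's.
    return [item1 + item2 for item1, item2 in zip(input1, cropped)]


decoder = [[[1, 2, 3]]]                              # shape (1, 1, 3)
skip = [[[0, 10, 20, 30, 40], [0, 11, 21, 31, 41]]]  # shape (1, 2, 5)
merged = stack_channels_lists(decoder, skip)
print(merged)  # [[[1, 2, 3], [10, 20, 30], [11, 21, 31]]]
```

With real tensors, the final step would typically be a concatenation along dimension 1, for example with torch.cat.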
- forward(x: Tensor) → tuple[Tensor, Tensor]
Run a full separation forward pass.
The pass has four conceptual phases:
Encoder: n_layers rounds of Conv1d + LeakyReLU followed by hard decimation (keep every other sample). Each round halves the temporal resolution and increases the channel count by F_c. The pre-decimation activations are stashed as skip connections.
Bottleneck: one Conv1d + LeakyReLU on the most compressed representation.
Decoder: n_layers rounds of linear interpolation back to the previous resolution, skip-connection concatenation, Conv1d, and LeakyReLU. The skip connections are consumed in reverse order (deepest encoder layer first).
Output: the decoder output is concatenated with the original raw waveform, collapsed to one channel by a pointwise convolution, and passed through Tanh. The complementary output (accompaniment / noise residual) is derived as original - foreground, enforcing the mixture constraint.
- Parameters:
x (torch.Tensor) – Raw waveform batch, shape (B, 1, T).
- Returns:
Tuple (foreground, background) where both tensors have shape (B, 1, T'). T' is shorter than T because every unpadded convolution removes kernel_size - 1 samples from the time axis.
- Return type:
tuple[torch.Tensor, torch.Tensor]
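To get a feel for how much shorter T' is than T, the time length can be traced through the four phases. The sketch below is an estimate under stated assumptions, not the library's own computation: output_length is a hypothetical helper, decimation is assumed to keep samples 0, 2, 4, …, interpolation is assumed to map a length L to 2L - 1 as in the paper, and the pointwise output convolution is assumed to leave the length unchanged.

```python
def output_length(T, n_layers, f_d, f_u):
    """Estimate the output length T' for an input of length T.

    Hypothetical helper for illustration only; not part of libdse.
    """
    def check(length):
        if length < 1:
            raise ValueError("input too short for this configuration")
        return length

    L = T
    for _ in range(n_layers):
        L = check(L - f_d + 1)   # unpadded encoder convolution
        L = (L + 1) // 2         # assumed: keep samples 0, 2, 4, ...
    L = check(L - f_d + 1)       # bottleneck convolution
    for _ in range(n_layers):
        L = 2 * L - 1            # assumed: interpolate between neighbours
        L = check(L - f_u + 1)   # unpadded decoder convolution
    return L


print(output_length(64, n_layers=2, f_d=3, f_u=3))        # 43
print(output_length(147443, n_layers=12, f_d=15, f_u=5))  # 16389
```

Under these assumptions, the second call gives an output of roughly a ninth of the input length, so the loss of context at the borders can be substantial for deep configurations.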