Network Architectures

Neural network architectures for denoising autoencoders.

Currently provides two classes:

VanillaAutoEncoder - fully-connected DAE used in production.
VariationalAutoEncoder - placeholder for a future VAE variant.

class libdse.nets.VanillaAutoEncoder(input_dim: int, latent_dim: int, hidden_layer_struct: list[int] | None = None, dropout: list[float] | None = None)

Bases: Module

Symmetric fully-connected denoising autoencoder.

The encoder compresses the input through a user-defined stack of linear → ReLU → LayerNorm (→ Dropout) layers down to a bottleneck of size latent_dim. The decoder mirrors this structure and maps the bottleneck back to the original input dimension.

A LayerNorm is prepended to the encoder so that each sample's raw input features are normalised to zero mean and unit variance across the feature dimension.

Parameters:

input_dim (int) – Dimensionality of the input feature vector.
latent_dim (int) – Bottleneck (latent) dimensionality.
hidden_layer_struct (list[int] or None) – Ordered list of hidden-layer widths between the input and the bottleneck. latent_dim is appended automatically. Defaults to [1024, 512, 256, 128].
dropout (list[float] or None) – Dropout probability applied after the first hidden layer of the encoder (and the corresponding decoder layer). None or a probability of 0.0 disables dropout.
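The way hidden_layer_struct and latent_dim combine into layer widths can be sketched in plain Python. This is an illustration only, not the library's code: the helper name autoencoder_widths is hypothetical, and it assumes that "mirrors" means the decoder reuses the encoder widths in reverse order.

```python
def autoencoder_widths(input_dim, latent_dim, hidden_layer_struct=None):
    """Return (encoder_widths, decoder_widths) as lists of layer sizes.

    Hypothetical helper for illustration; not part of libdse.
    """
    if hidden_layer_struct is None:
        hidden_layer_struct = [1024, 512, 256, 128]  # documented default
    # latent_dim is appended automatically after the hidden widths.
    encoder = [input_dim, *hidden_layer_struct, latent_dim]
    # Assumption: the mirrored decoder uses the same widths in reverse.
    decoder = encoder[::-1]
    return encoder, decoder


enc, dec = autoencoder_widths(input_dim=2048, latent_dim=32)
print(enc)  # [2048, 1024, 512, 256, 128, 32]
print(dec)  # [32, 128, 256, 512, 1024, 2048]
```

Each adjacent pair of widths corresponds to one linear layer, so the default structure with these arguments would give five encoder layers and five decoder layers.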

forward(input: Tensor) → Tensor

Encode input to the bottleneck, then decode back to input space.

Parameters:

input (torch.Tensor) – Feature batch, shape (B, input_dim).

Returns:

Reconstructed batch, shape (B, input_dim).

Return type:

torch.Tensor

class libdse.nets.WaveUNet(n_layers: int, f_u: int, f_d: int, F_c: int)

Bases: Module

Wave-U-Net for end-to-end audio source separation (Stoller et al., 2018).

Conceptual overview

Wave-U-Net operates directly on the raw audio waveform: no STFT, no spectrogram. The architecture is a 1-D analogue of the image-segmentation U-Net: a contracting encoder path progressively halves the time resolution while widening the feature channels by F_c per layer, a bottleneck captures the most abstract representation, and a symmetric expanding decoder path recovers the original resolution step by step.

Skip connections are what make this work for separation: every encoder layer’s output is concatenated (channel-wise) to the corresponding decoder layer’s input. This gives the decoder access to fine-grained temporal detail that would otherwise be lost during downsampling, so the network can combine high-level context (what is happening globally) with low-level detail (exactly how the waveform looks locally) at every scale.

Signal flow:

raw audio  →  [DS 1] → decimate → [DS 2] → decimate → … → bottleneck
                 ↓                   ↓
              saved                saved          (skip connections)
                 ↓                   ↓
output     ←  [US 1] ← upsample ← [US 2] ← upsample ← …

Channel schedule (following Table 1 of the paper)

Let F_c be the channel-growth factor. The encoder layer k (1-indexed) produces k * F_c channels. The bottleneck produces (n_layers + 1) * F_c channels. During decoding the skip connection from the mirror encoder layer is concatenated before the convolution, so the number of input channels to each decoder convolution equals the sum of the upsampled decoder channels and the corresponding encoder channels.
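The schedule above can be tabulated with a few lines of plain Python. This is a sketch under one assumption not stated in this section: the decoder layer that mirrors encoder layer k is taken to output k * F_c channels. The helper name channel_schedule is hypothetical.

```python
def channel_schedule(n_layers, F_c):
    """Return (encoder_out, bottleneck_out, decoder_in) channel counts.

    Hypothetical helper for illustration; not part of libdse.
    """
    # Encoder layer k (1-indexed) produces k * F_c channels.
    encoder_out = [k * F_c for k in range(1, n_layers + 1)]
    bottleneck_out = (n_layers + 1) * F_c

    decoder_in = []
    upsampled = bottleneck_out
    # Skips are consumed deepest-first, so walk k from n_layers down to 1.
    for k in range(n_layers, 0, -1):
        skip = encoder_out[k - 1]
        # Input channels = upsampled decoder channels + skip channels.
        decoder_in.append(upsampled + skip)
        # Assumption: the mirrored decoder layer outputs k * F_c channels.
        upsampled = k * F_c
    return encoder_out, bottleneck_out, decoder_in


enc_ch, mid_ch, dec_ch = channel_schedule(n_layers=3, F_c=24)
print(enc_ch)  # [24, 48, 72]
print(mid_ch)  # 96
print(dec_ch)  # [168, 120, 72]
```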

Output

The network predicts the foreground source (e.g. vocals / speech) directly as a waveform: the final layer combines the decoder features with the original waveform and passes the result through Tanh. The background (accompaniment / noise residual) is then obtained for free as original - foreground, which enforces the mixture constraint that both outputs sum back to the input.
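A minimal sketch of the mixture constraint on plain Python lists, standing in for single-channel waveforms. It assumes the original waveform is centre-cropped to the foreground's length before subtraction, since the network's output is shorter than its input; derive_background is a hypothetical name.

```python
def derive_background(original, foreground):
    """Return (cropped_original, background) for one mono waveform.

    Hypothetical helper for illustration; not part of libdse.
    """
    # Assumption: the original is centre-cropped to the foreground length.
    start = (len(original) - len(foreground)) // 2
    cropped = original[start:start + len(foreground)]
    # background = original - foreground, sample by sample.
    background = [o - f for o, f in zip(cropped, foreground)]
    return cropped, background


original = [0.0, 0.5, -0.25, 0.75, 0.125, 0.0]
foreground = [0.25, -0.5, 0.5, 0.0]
cropped, background = derive_background(original, foreground)
print(background)  # [0.25, 0.25, 0.25, 0.125]
```

By construction, foreground plus background reproduces the cropped original at every sample.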

Parameters:

n_layers (int) – Number of encoder (= decoder) layers. More layers mean a larger receptive field and more levels of temporal abstraction.
f_u (int) – Kernel size of every upsampling (decoder) convolution.
f_d (int) – Kernel size of every downsampling (encoder) and bottleneck convolution.
F_c (int) – Base channel-growth factor. Encoder layer k will have k * F_c output channels.

Reference:

Stoller, D., Ewert, S., & Dixon, S. (2018). Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation. arXiv:1806.03185.

center_crop(input: Tensor, target_shape: int) → Tensor

Crop the time axis of input symmetrically to target_shape.

Because Conv1d without padding shortens the time axis by kernel_size - 1, encoder and decoder tensors at the same depth will have slightly different lengths. Before concatenating a skip connection we therefore crop the longer tensor to the length of the shorter one, always removing an equal number of samples from both ends to keep the remaining samples centred in time.

Parameters:

input (torch.Tensor) – Tensor of shape (B, C, T_in) to be cropped.
target_shape (int) – Desired length T_out along the time axis. Must satisfy T_out <= T_in.

Returns:

Tensor of shape (B, C, T_out).

Return type:

torch.Tensor
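The index arithmetic behind a symmetric crop can be sketched without any tensor library. The helper name center_crop_bounds is hypothetical, and the handling of an odd length difference (the extra sample is dropped from the right) is an assumption; the method's actual behaviour in that case is not documented here.

```python
def center_crop_bounds(t_in, t_out):
    """Return (start, end) slice bounds that keep t_out centred samples.

    Hypothetical helper for illustration; not part of libdse.
    """
    if t_out > t_in:
        raise ValueError("target length must not exceed input length")
    diff = t_in - t_out
    # Assumption: for an odd diff, the extra sample is removed on the right.
    start = diff // 2
    return start, start + t_out


samples = list(range(10))
start, end = center_crop_bounds(len(samples), 6)
print(samples[start:end])  # [2, 3, 4, 5, 6, 7]
```

On a tensor of shape (B, C, T_in) the same bounds would be applied to the last axis only.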

stack_channels(input1: Tensor, input2: Tensor) → Tensor

Concatenate two feature maps along the channel dimension.

This is the skip-connection merge operation. Because encoder and decoder tensors differ in length (due to unpadded convolutions), input2 is centre-cropped to match the time length of input1 before concatenation.

Parameters:

input1 (torch.Tensor) – Primary tensor, shape (B, C1, T). Its time length determines the output length.
input2 (torch.Tensor) – Skip-connection tensor, shape (B, C2, T'). Will be cropped to T along the time axis.

Returns:

Merged tensor of shape (B, C1 + C2, T).

Return type:

torch.Tensor
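The merge can be illustrated on nested Python lists, where one batch item is a list of channels and each channel is a list of samples. This is a shape sketch only: the function names reuse the documented ones for readability but are stand-ins, not the library's implementation. On real tensors the final step would be a concatenation along the channel axis, for example with torch.cat(..., dim=1).

```python
def center_crop(channels, target_len):
    """Centre-crop every channel (a list of samples) to target_len."""
    t_in = len(channels[0])
    start = (t_in - target_len) // 2
    return [ch[start:start + target_len] for ch in channels]


def stack_channels(primary, skip):
    """Crop `skip` to the primary's time length, then append its channels."""
    target_len = len(primary[0])
    return primary + center_crop(skip, target_len)


decoder_map = [[0.1, 0.2, 0.3, 0.4]]                       # C1 = 1, T = 4
encoder_map = [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]]  # C2 = 2, T' = 6
merged = stack_channels(decoder_map, encoder_map)
print(len(merged), len(merged[0]))  # 3 4
print(merged[1])                    # [2, 3, 4, 5]
```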

forward(x: Tensor) → tuple[Tensor, Tensor]

Run a full separation forward pass.

The pass has four conceptual phases:

Encoder: n_layers rounds of Conv1d + LeakyReLU followed by hard decimation (keep every other sample). Each round halves the temporal resolution and increases the channel count by F_c. The pre-decimation activations are stashed as skip connections.
Bottleneck: one Conv1d + LeakyReLU on the most compressed representation.
Decoder: n_layers rounds of linear interpolation back to the previous resolution, skip-connection concatenation, Conv1d, and LeakyReLU. The skip connections are consumed in reverse order (deepest encoder layer first).
Output: the decoder output is concatenated with the original raw waveform, collapsed to one channel by a pointwise convolution, and passed through Tanh. The complementary output (accompaniment / noise residual) is derived as original - foreground, enforcing the mixture constraint.

Parameters:

x (torch.Tensor) – Raw waveform batch, shape (B, 1, T).

Returns:

Tuple (foreground, background) where both tensors have shape (B, 1, T'). T' is slightly shorter than T due to unpadded convolutions reducing the time axis at each layer.

Return type:

tuple[torch.Tensor, torch.Tensor]
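To get a feel for how much shorter T' is than T, the length bookkeeping of the four phases can be traced in plain Python. This is a rough sketch under explicit assumptions, not the library's logic: decimation keeps samples 0, 2, 4, … (so a length L becomes ceil(L / 2)), interpolation exactly doubles the length, and the skip tensor is the one that gets cropped. The actual implementation may differ in any of these details, and wave_u_net_output_length is a hypothetical name.

```python
def wave_u_net_output_length(T, n_layers, f_d, f_u):
    """Estimate the output length T' for an input of length T.

    Hypothetical helper for illustration; not part of libdse.
    """
    length = T
    skip_lengths = []
    # Encoder: unpadded conv, save skip, then decimate.
    for _ in range(n_layers):
        length -= f_d - 1               # Conv1d without padding
        skip_lengths.append(length)     # pre-decimation activation is saved
        length = (length + 1) // 2      # assumption: keep samples 0, 2, 4, ...
    # Bottleneck: one more unpadded conv with kernel size f_d.
    length -= f_d - 1
    # Decoder: upsample, merge the cropped skip, unpadded conv.
    for skip_len in reversed(skip_lengths):
        length *= 2                     # assumption: interpolation doubles length
        if length > skip_len:
            raise ValueError("skip tensor is shorter than the decoder tensor")
        length -= f_u - 1
    if length <= 0:
        raise ValueError("input too short for this configuration")
    # The pointwise output convolution (kernel size 1) keeps the length.
    return length


print(wave_u_net_output_length(T=1024, n_layers=2, f_d=15, f_u=5))  # 916
```

Under these assumptions an input of 1024 samples loses 108 samples in total, split evenly between both ends if the crops are centred.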