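Data Utilities

The libdse.data sub-package contains everything needed to load and pre-process audio data for training: feature extractors, a streaming LibriSpeech dataset, DEMAND noise utilities, and a shared exception type.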

Feature Extraction: `libdse.data.features`

Feature extraction utilities for log-mel power spectrograms.

This module provides the abstract BaseExtractor interface and its concrete implementations, principally LogMelPowerSpectrumExtractor, which build training samples for the denoising autoencoder (DAE).

The feature pipeline converts raw mono waveforms into fixed-width log-mel power spectrogram vectors following the approach described in:

Lu, X., Tsao, Y., Matsuda, S., & Hori, C. (2013). Speech enhancement based on deep denoising autoencoder. INTERSPEECH 2013.

Pipeline summary

1. Compute the short-time Fourier transform (STFT) with a Hann window.
2. Project the magnitude-squared spectrum onto a mel filterbank.
3. Divide the resulting mel spectrogram into non-overlapping temporal windows of chunks_per_feature frames; discard incomplete trailing windows.
4. Flatten each window into a 1-D vector of length n_mels * chunks_per_feature.
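
The four steps above can be sketched with NumPy alone. This is a minimal illustration under stated assumptions, not the libdse implementation: the helper names are hypothetical, frames are taken without centring or edge padding, the mel filterbank uses the HTK-style formula, and both the position of the log compression and the flattening order (frame by frame) are guesses.

```python
import numpy as np


def hz_to_mel(freq_hz):
    """HTK-style Hz to mel conversion."""
    return 2595.0 * np.log10(1.0 + np.asarray(freq_hz) / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel) / 2595.0) - 1.0)


def mel_filterbank(sampling_rate, n_fft, n_mels):
    """Triangular filters evenly spaced on the mel scale, shape (n_mels, n_bins)."""
    n_bins = 1 + n_fft // 2
    bin_freqs = np.linspace(0.0, sampling_rate / 2.0, n_bins)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sampling_rate / 2.0), n_mels + 2))
    bank = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        low, centre, high = edges[i], edges[i + 1], edges[i + 2]
        rising = (bin_freqs - low) / (centre - low)
        falling = (high - bin_freqs) / (high - centre)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def log_mel_features(waveform, sampling_rate, window_length, hop_length,
                     n_mels, chunks_per_feature):
    """Hypothetical end-to-end sketch; expects len(waveform) >= window_length."""
    # Step 1: STFT with a Hann window (no centring or padding: an assumption).
    window = np.hanning(window_length)
    n_frames = 1 + (len(waveform) - window_length) // hop_length
    frames = np.stack([
        waveform[i * hop_length: i * hop_length + window_length] * window
        for i in range(n_frames)
    ])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2      # (n_frames, n_bins)

    # Step 2: mel projection, then log compression (the floor is an assumption).
    mel = power @ mel_filterbank(sampling_rate, window_length, n_mels).T
    log_mel = np.log(mel + 1e-10)                          # (n_frames, n_mels)

    # Steps 3 and 4: non-overlapping windows, trailing frames dropped, flattened.
    n_chunks = n_frames // chunks_per_feature
    trimmed = log_mel[: n_chunks * chunks_per_feature]
    return trimmed.reshape(n_chunks, n_mels * chunks_per_feature).astype(np.float32)
```

With one second of audio at 16 kHz, window_length=512, hop_length=128, n_mels=40 and chunks_per_feature=7, this sketch produces 122 frames and therefore 17 feature vectors of length 280.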

When a DEMANDNoiseDataset is supplied to LogMelPowerSpectrumExtractor, a noisy copy of the waveform is synthesised on the fly by add_noise_snr(). The returned training pair is then (noisy_feature, clean_feature) instead of (clean_feature, clean_feature).

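Classes

BaseExtractor: Abstract base; subclass to define custom extractors.
LogMelPowerSpectrumExtractor: Log-mel power spectrum extractor.
PowerSpectrumExtractor: Raw magnitude power spectrum extractor.
LogMagnitudeSpectrumExtractor: Log-magnitude spectrum extractor.
RawWaveformExtractor: Raw waveform extractor for time-domain models.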

Typical usage

from pathlib import Path
from libdse.data.features import LogMelPowerSpectrumExtractor
from libdse.data.noise import DEMANDNoiseDataset, DEMANDNoiseType

noise_ds = DEMANDNoiseDataset(
    entry_point=Path("data/noise/DEMAND"),
    noise_types=DEMANDNoiseType.ALL,
)
extractor = LogMelPowerSpectrumExtractor(
    sampling_rate=16_000,
    window_length=512,
    hop_length=128,
    n_mels=40,
    chunks_per_feature=7,
    noise=noise_ds,
)
# `waveform` is a 1-D float32 mono array sampled at 16 kHz.
# Called once per utterance inside a DataLoader worker; yields one pair
# per non-overlapping spectrogram window:
for noisy_feat, clean_feat in extractor(waveform):
    ...

libdse.data.features.Sample: Type alias for a feature tensor returned by an extractor.

libdse.data.features.Label: Type alias for a label (target) tensor returned by an extractor.

class libdse.data.features.BaseExtractor

Bases: ABC

Abstract base class for feature extractors.

Defines the interface expected by LibriSpeechDataset. Concrete subclasses must implement __call__(), which converts a raw mono waveform into a (sample, label) tensor pair.

sample_shape: tuple[int, ...]: Shape of a single feature vector produced by this extractor. Must be set in the subclass __init__ before the instance is passed to LibriSpeechDataset.

abstractmethod __call__(sample: ndarray[tuple[Any, ...], dtype[float32]]) → Generator[tuple[Tensor, Tensor], None, None]

Yield (feature, label) pairs from a raw audio waveform.

The waveform is split into non-overlapping windows; one pair is yielded for each window. The number of pairs depends on the duration of sample and on sample_shape.

Parameters: sample (numpy.ndarray of float32) – Mono audio waveform at the extractor’s expected sampling rate.
Returns: Generator of (feature, label) tensor pairs, each tensor having shape sample_shape.
Return type: Generator[tuple[torch.Tensor, torch.Tensor], None, None]
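
To illustrate the contract, the sketch below defines a custom extractor against a local stand-in for the abstract base. Everything here is hypothetical: ExtractorInterface only mimics BaseExtractor, FixedWindowExtractor is an invented example, and NumPy arrays are yielded in place of torch tensors so that the snippet runs without libdse or PyTorch installed.

```python
from abc import ABC, abstractmethod
from typing import Generator

import numpy as np


class ExtractorInterface(ABC):
    """Stand-in for BaseExtractor (hypothetical; not imported from libdse)."""

    sample_shape: tuple[int, ...]

    @abstractmethod
    def __call__(
        self, sample: np.ndarray
    ) -> Generator[tuple[np.ndarray, np.ndarray], None, None]:
        ...


class FixedWindowExtractor(ExtractorInterface):
    """Hypothetical extractor: each raw window is both feature and label."""

    def __init__(self, window_length: int) -> None:
        # sample_shape is set in __init__, before the instance reaches a dataset.
        self.sample_shape = (window_length,)

    def __call__(self, sample):
        width = self.sample_shape[0]
        # One pair per complete window; an incomplete tail is dropped.
        for i in range(len(sample) // width):
            window = sample[i * width:(i + 1) * width]
            yield window, window
```

Because __call__ is a generator, a consumer can iterate over the pairs of one utterance without knowing in advance how many there will be.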

class libdse.data.features.LogMelPowerSpectrumExtractor(sampling_rate: int, window_length: int, hop_length: int, n_mels: int, chunks_per_feature: int, noise: DEMANDNoiseDataset | None)

Bases: BaseExtractor

Log-mel power spectrum feature extractor.

Converts a raw mono waveform into a sequence of log-mel power spectrogram feature vectors. The STFT is computed with a Hann window of window_length samples advanced by hop_length samples (so the overlap is 50% only when hop_length is half of window_length); the power spectrum is projected through a mel filterbank; and the spectrogram is divided into non-overlapping windows of chunks_per_feature frames. Calling an instance yields one (feature, label) pair per window.

When noise is provided, a noisy version of the waveform is synthesised by add_noise_snr() at a randomly selected SNR of 0, 5, or 10 dB, and every yielded pair becomes (noisy_feature, clean_feature).
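
How the noise segment is chosen is not spelled out on this page. The sketch below shows one plausible approach, purely as an assumption: pick one of the three SNR values uniformly, then take a slice of the concatenated noise array starting at a random offset and wrapping around at the end. The function name and the wrap-around behaviour are hypothetical.

```python
import numpy as np

SNR_CHOICES_DB = (0, 5, 10)


def draw_noise_segment(noise, n_samples, rng):
    """Return (segment, snr_db): a random noise slice and a random SNR.

    Hypothetical helper; the slice wraps around the end of `noise`.
    """
    snr_db = int(rng.choice(SNR_CHOICES_DB))
    start = int(rng.integers(0, len(noise)))
    index = (start + np.arange(n_samples)) % len(noise)
    return noise[index], snr_db
```

Passing a seeded numpy.random.Generator as rng makes the draw reproducible, which can help when debugging a training run.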

The extractor is designed to be instantiated once and called repeatedly — one call per utterance — from inside a DataLoader.

Parameters:

sampling_rate (int) – Expected sample rate of input waveforms in Hz.
window_length (int) – STFT window length in samples (also used as the FFT size).
hop_length (int) – STFT hop size in samples.
n_mels (int) – Number of mel filterbank bins.
chunks_per_feature (int) – Number of consecutive spectrogram frames per output feature vector.
noise (DEMANDNoiseDataset or None) – Optional DEMAND noise dataset used for on-the-fly noise mixing. Pass None for clean-only feature extraction.

Example

extractor = LogMelPowerSpectrumExtractor(
    sampling_rate=16_000,
    window_length=512,
    hop_length=128,
    n_mels=40,
    chunks_per_feature=7,
    noise=None,
)
for feature, label in extractor(waveform):
    assert feature.shape == (40 * 7,)
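
The number of pairs produced for one utterance follows from simple integer arithmetic. The sketch below assumes that the STFT takes frames without padding the signal edges; a library that centres frames would give a slightly different frame count. Both function names are hypothetical.

```python
def frames_in_utterance(n_samples: int, window_length: int, hop_length: int) -> int:
    """Number of STFT frames, assuming no padding at the signal edges."""
    if n_samples < window_length:
        return 0
    return 1 + (n_samples - window_length) // hop_length


def pairs_in_utterance(n_samples: int, window_length: int, hop_length: int,
                       chunks_per_feature: int) -> int:
    """Complete non-overlapping windows; trailing frames are discarded."""
    n_frames = frames_in_utterance(n_samples, window_length, hop_length)
    return n_frames // chunks_per_feature


# Three seconds at 16 kHz with the settings from the example above:
n_frames = frames_in_utterance(48_000, 512, 128)   # 372 frames
n_pairs = pairs_in_utterance(48_000, 512, 128, 7)  # 53 pairs, 1 frame discarded
```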

sample_shape: tuple: Flat length of each feature vector, n_mels * chunks_per_feature.

mel_power_spectrum(sample: ndarray[tuple[Any, ...], dtype[float32]]) → ndarray[tuple[Any, ...], dtype[float32]]

Compute a (log)-mel power spectrogram and split it into fixed-length chunks.

Follows the feature extraction procedure described in:

Lu, X. et al. (2012). Speech Restoration Based on Deep Learning Autoencoder with Layer-Wised Pretraining.

The spectrogram is divided into non-overlapping temporal windows of chunks_per_feature frames. Incomplete trailing windows are discarded without padding.

Parameters: sample (numpy.ndarray of float32) – Mono audio waveform.
Returns: Array of shape (n_chunks, n_mels * chunks_per_feature) where each row is a flattened temporal window.
Return type: numpy.ndarray of float32

__call__(sample: ndarray[tuple[Any, ...], dtype[float32]]) → Generator[tuple[Tensor, Tensor], None, None]

Yield (feature, label) pairs for every non-overlapping window.

The waveform is converted to a mel power spectrogram, divided into non-overlapping windows of chunks_per_feature frames, and one pair is yielded per window. Incomplete trailing windows are discarded.

When noise is set, a synthetic noisy copy of the waveform is blended at a randomly selected SNR of 0, 5, or 10 dB before feature extraction, and the pair becomes (noisy_feature, clean_feature).

Parameters: sample (numpy.ndarray of float32) – Mono audio waveform sampled at sampling_rate Hz.
Returns: Generator of (feature, label) tensor pairs, each tensor having shape (n_mels * chunks_per_feature,).
Return type: Generator[tuple[torch.Tensor, torch.Tensor], None, None]

class libdse.data.features.PowerSpectrumExtractor(sampling_rate: int, window_length: int, hop_length: int, noise: DEMANDNoiseDataset | None)

Bases: BaseExtractor

Raw magnitude power spectrum feature extractor (no mel projection).

Converts a raw mono waveform into a sequence of single-sided magnitude power spectrum frames. Unlike LogMelPowerSpectrumExtractor, no mel filterbank is applied: the full (1 + window_length // 2)-bin power spectrum of each STFT frame is used directly as a feature vector.

Calling an instance yields one (feature, label) pair per STFT frame.

The extractor is designed to be instantiated once and called repeatedly — one call per utterance — from inside a DataLoader.

Parameters:

sampling_rate (int) – Expected sample rate of input waveforms in Hz.
window_length (int) – STFT window length in samples (also used as the FFT size). Each feature vector has length 1 + window_length // 2.
hop_length (int) – STFT hop size in samples.
noise (DEMANDNoiseDataset or None) – Optional DEMAND noise dataset used for on-the-fly noise mixing. Pass None for clean-only feature extraction.

Example

extractor = PowerSpectrumExtractor(
    sampling_rate=16_000,
    window_length=512,
    hop_length=256,
    noise=None,
)
for feature, label in extractor(waveform):
    assert feature.shape == (257,)  # 1 + 512 // 2
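
The 257-bin figure comes from the single-sided FFT of a real 512-sample frame. A minimal NumPy sketch of the per-frame computation follows; the function name is hypothetical and this is not the library's own code.

```python
import numpy as np


def frame_power_spectrum(frame: np.ndarray) -> np.ndarray:
    """Single-sided power spectrum |FFT|^2 of one Hann-windowed frame."""
    window = np.hanning(len(frame))
    spectrum = np.fft.rfft(frame * window)
    return (np.abs(spectrum) ** 2).astype(np.float32)


sampling_rate, window_length = 16_000, 512
t = np.arange(window_length) / sampling_rate
tone = np.sin(2 * np.pi * 1_000 * t).astype(np.float32)  # 1 kHz test tone
power = frame_power_spectrum(tone)
peak_bin = int(np.argmax(power))  # bin spacing is 16000 / 512 = 31.25 Hz
```

Each bin spans sampling_rate / window_length Hz, so the 1 kHz tone peaks in bin 32.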

sample_shape: tuple: Flat length of each feature vector, 1 + window_length // 2 (the number of unique frequency bins in the single-sided STFT).

magnitude_power_spectrum(sample: ndarray[tuple[Any, ...], dtype[float32]]) → ndarray[tuple[Any, ...], dtype[float32]]

Compute the single-sided magnitude power spectrum frame by frame.

Applies the STFT with a Hann window and returns |STFT|² — the power of each frequency bin for every frame. No mel projection is applied.

Parameters: sample (numpy.ndarray of float32) – Mono audio waveform.
Returns: Array of shape (1 + window_length // 2, n_frames) where each column is the power spectrum of one STFT frame.
Return type: numpy.ndarray of float32

__call__(sample: ndarray[tuple[Any, ...], dtype[float32]]) → Generator[tuple[Tensor, Tensor], None, None]

Yield (feature, label) pairs for every STFT frame.

The waveform is converted to a magnitude power spectrogram and one pair is yielded per frame (column of the spectrogram). Each tensor contains the single-sided power spectrum of that frame.

When noise is set, a synthetic noisy copy of the waveform is blended at a randomly selected SNR of 0, 5, or 10 dB before feature extraction, and the pair becomes (noisy_feature, clean_feature).

Parameters: sample (numpy.ndarray of float32) – Mono audio waveform sampled at sampling_rate Hz.
Returns: Generator of (feature, label) tensor pairs, each tensor having shape (1 + window_length // 2,).
Return type: Generator[tuple[torch.Tensor, torch.Tensor], None, None]

class libdse.data.features.LogMagnitudeSpectrumExtractor(sampling_rate: int, window_length: int, hop_length: int, noise: DEMANDNoiseDataset | None)

Bases: BaseExtractor

Log-magnitude power spectrum feature extractor (no mel projection).

Converts a raw mono waveform into a sequence of single-sided log magnitude spectrum frames.

Calling an instance yields one (feature, label) pair per STFT frame.

The extractor is designed to be instantiated once and called repeatedly — one call per utterance — from inside a DataLoader.

Parameters:

sampling_rate (int) – Expected sample rate of input waveforms in Hz.
window_length (int) – STFT window length in samples (also used as the FFT size). Each feature vector has length 1 + window_length // 2.
hop_length (int) – STFT hop size in samples.
noise (DEMANDNoiseDataset or None) – Optional DEMAND noise dataset used for on-the-fly noise mixing. Pass None for clean-only feature extraction.

Example

extractor = LogMagnitudeSpectrumExtractor(
    sampling_rate=16_000,
    window_length=512,
    hop_length=256,
    noise=None,
)
for feature, label in extractor(waveform):
    assert feature.shape == (257,)  # 1 + 512 // 2

sample_shape: tuple: Flat length of each feature vector, 1 + window_length // 2 (the number of unique frequency bins in the single-sided STFT).

log_magnitude_power_spectrum(sample: ndarray[tuple[Any, ...], dtype[float32]]) → ndarray[tuple[Any, ...], dtype[float32]]

Compute the single-sided magnitude power spectrum frame by frame.

Applies the STFT with a Hann window and returns |STFT|² — the power of each frequency bin for every frame. No mel projection is applied.

Parameters: sample (numpy.ndarray of float32) – Mono audio waveform.
Returns: Array of shape (1 + window_length // 2, n_frames) where each column is the power spectrum of one STFT frame.
Return type: numpy.ndarray of float32

__call__(sample: ndarray[tuple[Any, ...], dtype[float32]]) → Generator[tuple[Tensor, Tensor], None, None]

Yield (feature, label) pairs for every STFT frame.

The waveform is converted to a magnitude power spectrogram and one pair is yielded per frame (column of the spectrogram). Each tensor contains the single-sided power spectrum of that frame.

When noise is set, a synthetic noisy copy of the waveform is blended at a randomly selected SNR of 0, 5, or 10 dB before feature extraction, and the pair becomes (noisy_feature, clean_feature).

Parameters: sample (numpy.ndarray of float32) – Mono audio waveform sampled at sampling_rate Hz.
Returns: Generator of (feature, label) tensor pairs, each tensor having shape (1 + window_length // 2,).
Return type: Generator[tuple[torch.Tensor, torch.Tensor], None, None]

class libdse.data.features.RawWaveformExtractor(sampling_rate: int, window_length: int, noise: DEMANDNoiseDataset | None)

Bases: BaseExtractor

Raw waveform extractor — no frequency transform applied.

Splits a mono waveform into non-overlapping windows of window_length samples and yields each window directly as a feature vector. This is the natural companion extractor for time-domain models such as WaveUNet that operate on raw audio rather than spectrograms.

When noise is provided, a noisy mixture is generated with add_noise_snr() at a random SNR (0, 5, or 10 dB) and the pair becomes (noisy_window, clean_window); otherwise both elements of the pair are the same clean window.

Parameters:

sampling_rate (int) – Expected sample rate of input waveforms in Hz.
window_length (int) – Number of samples per output feature vector.
noise (DEMANDNoiseDataset or None) – Optional DEMAND noise dataset for on-the-fly noise mixing.

sample_shape: tuple: Shape of each feature vector, (window_length,).

__call__(sample: ndarray[tuple[Any, ...], dtype[float32]]) → Generator[tuple[Tensor, Tensor], None, None]

Yield (feature, label) pairs for every non-overlapping window.

The waveform is zero-padded at the end when its length is not an integer multiple of window_length, ensuring no samples are silently dropped.

Parameters: sample (numpy.ndarray of float32) – Mono audio waveform sampled at sampling_rate Hz.
Returns: Generator of (feature, label) tensor pairs, each of shape (window_length,).
Return type: Generator[tuple[torch.Tensor, torch.Tensor], None, None]
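
The zero-padding behaviour described above can be sketched in a few lines of NumPy. The function name is hypothetical and the snippet only illustrates the padding arithmetic, not the extractor itself.

```python
import numpy as np


def split_with_padding(waveform: np.ndarray, window_length: int) -> np.ndarray:
    """Split into rows of window_length samples, zero-padding the tail."""
    remainder = len(waveform) % window_length
    if remainder:
        # Pad only at the end so that every original sample is kept.
        waveform = np.pad(waveform, (0, window_length - remainder))
    return waveform.reshape(-1, window_length)


windows = split_with_padding(np.arange(1, 11, dtype=np.float32), 4)
# Ten samples with window_length=4 give three rows; the last is [9, 10, 0, 0].
```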

LibriSpeech Dataset: `libdse.data.librispeech`

Streaming PyTorch dataset for the LibriSpeech ASR corpus.

This module provides LibriSpeechDataset, an IterableDataset that streams (sample, label) tensor pairs directly from raw FLAC audio files. Feature extraction is fully delegated to a BaseExtractor instance supplied at construction, keeping the dataset class decoupled from the specific feature representation.

Internally the dataset discovers all FLAC files under entry_point, shuffles them once at construction to reduce temporal correlation between consecutive batches, and then iterates through each file. For every utterance the raw waveform is loaded at 16 kHz and passed to the extractor, and each (sample, label) pair the extractor yields is passed on directly.

Layout assumption

The entry_point directory must contain exactly one sub-directory named LibriSpeech/, matching the structure produced by the official LibriSpeech tar archives:

entry_point/
└── LibriSpeech/
    └── <speaker>/<chapter>/<utterance>.flac
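
A minimal sketch of the layout check and file discovery, using only the standard library. LayoutError stands in for EntryPointError and discover_flac_files is a hypothetical helper; the seed parameter is an assumption added here for reproducibility and is not part of the documented API.

```python
import random
from pathlib import Path
from typing import Optional


class LayoutError(Exception):
    """Stand-in for EntryPointError (hypothetical local definition)."""


def discover_flac_files(entry_point: Path, seed: Optional[int] = None) -> list:
    """Validate the layout, then return all FLAC files in shuffled order."""
    root = entry_point / "LibriSpeech"
    if not entry_point.is_dir() or not root.is_dir():
        raise LayoutError(f"{entry_point} does not contain a LibriSpeech/ directory")
    # Sort first so that a fixed seed always gives the same shuffled order.
    files = sorted(root.rglob("*.flac"))
    random.Random(seed).shuffle(files)
    return files
```

Shuffling the file list once up front is cheap because only paths are held in memory; the audio itself is read later, one utterance at a time.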

Classes

LibriSpeechDataset: Iterable PyTorch dataset.

Exceptions

EntryPointError: Raised when entry_point is invalid.

class libdse.data.librispeech.LibriSpeechDataset(entry_point: Path, extractor: BaseExtractor, sample_rate: int = 16000)

Bases: IterableDataset

Iterable PyTorch dataset for the LibriSpeech ASR corpus.

Streams (sample, label) tensor pairs directly from raw FLAC audio files. Feature extraction — STFT, mel projection, windowing, and optional noise mixing — is entirely delegated to the extractor argument, making this class agnostic about the feature representation.

FLAC files are discovered recursively under entry_point at construction and shuffled once to reduce temporal correlation between consecutive training batches. Thereafter, extractor(waveform) is called once per utterance and every (sample, label) pair it yields is passed on to the consumer.

Parameters:

entry_point (pathlib.Path) – LibriSpeech root directory. Must contain a single child directory named LibriSpeech/.
extractor (BaseExtractor) – Feature extractor instance. Called once per utterance with the raw mono waveform (float32, 16 kHz) as its sole argument; it must yield (sample, label) tensor pairs, as specified by BaseExtractor.__call__().

Raises:

EntryPointError – If entry_point is not a directory or does not contain a LibriSpeech/ sub-directory.

Note

Because the number of feature chunks per utterance is not known without reading every file, __len__() is not supported. Use the DataLoader and iterate until StopIteration.
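
The note above can be illustrated without PyTorch. In the sketch below, stream_pairs and toy_extractor are hypothetical stand-ins for the dataset's iteration logic and for a real extractor; plain lists play the role of waveforms.

```python
from itertools import chain


def stream_pairs(utterances, extractor):
    """Lazily chain the pairs that `extractor` yields for each utterance."""
    return chain.from_iterable(extractor(u) for u in utterances)


def toy_extractor(utterance, width=2):
    # Hypothetical extractor: one (feature, label) pair per complete window.
    for start in range(0, len(utterance) - width + 1, width):
        window = utterance[start:start + width]
        yield window, window


utterances = [[1, 2, 3, 4, 5], [6, 7], [8]]
# The total is only known after the stream has been consumed.
n_pairs = sum(1 for _ in stream_pairs(utterances, toy_extractor))
```

Here the three utterances contribute two, one, and zero pairs, which is why a length cannot be reported before every file has been read.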

DEMAND Noise Dataset: `libdse.data.noise`

Noise dataset utilities for the DEMAND corpus.

This module provides three public objects used to load and mix real-world background noise into clean speech:

DEMANDNoiseType — An Enum that maps human-readable environment names to the exact directory names used in the DEMAND dataset archive.
DEMANDNoiseDataset — Loads one or more noise environments from disk, concatenates them into a single array, and exposes it for slicing.
add_noise_snr() — Mixes a noise segment into a clean signal at a caller-specified signal-to-noise ratio.

The DEMAND dataset contains 18 noise environments recorded at 16 kHz on 16 channels. Only channel 1 (ch01.wav) is used here.

Typical usage

from pathlib import Path
from libdse.data.noise import DEMANDNoiseDataset, DEMANDNoiseType, add_noise_snr

noise_ds = DEMANDNoiseDataset(
    entry_point=Path("data/noise/DEMAND"),
    noise_types=DEMANDNoiseType.ALL,
)
noisy = add_noise_snr(signal=clean_waveform, noise=noise_ds.noise[:len(clean_waveform)], snr_db=10)

class libdse.data.noise.DEMANDNoiseType(*values)

Bases: Enum

Directory-name identifiers for the DEMAND noise dataset.

Each member’s value is the exact directory name inside the DEMAND archive, which follows the pattern <CATEGORY><NAME>_<FS>k.

Pass a subset of members (or the convenience member ALL) to DEMANDNoiseDataset to control which environments are loaded.

Members

Member	Directory name
`KITCHEN`	`DKITCHEN_16k`
`LIVING`	`DLIVING_16k`
`WASHING`	`DWASHING_16k`
`FIELD`	`NFIELD_16k`
`PARK`	`NPARK_16k`
`RIVER`	`NRIVER_16k`
`HALLWAY`	`OHALLWAY_16k`
`MEETING`	`OMEETING_16k`
`OFFICE`	`OOFFICE_16k`
`CAFETERIA`	`PCAFETER_16k`
`RESTAURANT`	`PRESTO_16k`
`STATION`	`PSTATION_16k`
`SQUARE`	`SPSQUARE_16k`
`TRAFFIC`	`STRAFFIC_16k`
`BUS`	`TBUS_16k`
`CAR`	`TCAR_16k`
`METRO`	`TMETRO_16k`
`ALL`	(all of the above)
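
Since a caller may pass a single member, a list of members, or ALL, the loader has to normalise its argument into a flat list of directory names. The sketch below shows one way to do that with an abridged, hypothetical enum; NoiseType and directory_names are illustrative names, not part of libdse.

```python
from enum import Enum


class NoiseType(Enum):
    """Abridged, hypothetical stand-in for DEMANDNoiseType (three members)."""

    KITCHEN = "DKITCHEN_16k"
    CAR = "TCAR_16k"
    METRO = "TMETRO_16k"
    ALL = ["DKITCHEN_16k", "TCAR_16k", "TMETRO_16k"]


def directory_names(noise_types):
    """Flatten a member, a list of members, or ALL into directory names."""
    if isinstance(noise_types, NoiseType):
        noise_types = [noise_types]
    names = []
    for member in noise_types:
        value = member.value
        # ALL carries a list of names; every other member carries one string.
        names.extend(value if isinstance(value, list) else [value])
    return names
```

For example, directory_names(NoiseType.CAR) gives a one-element list, while directory_names(NoiseType.ALL) expands to every directory name.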

KITCHEN = 'DKITCHEN_16k'

LIVING = 'DLIVING_16k'

WASHING = 'DWASHING_16k'

FIELD = 'NFIELD_16k'

PARK = 'NPARK_16k'

RIVER = 'NRIVER_16k'

HALLWAY = 'OHALLWAY_16k'

MEETING = 'OMEETING_16k'

OFFICE = 'OOFFICE_16k'

CAFETERIA = 'PCAFETER_16k'

RESTAURANT = 'PRESTO_16k'

STATION = 'PSTATION_16k'

SQUARE = 'SPSQUARE_16k'

TRAFFIC = 'STRAFFIC_16k'

BUS = 'TBUS_16k'

CAR = 'TCAR_16k'

METRO = 'TMETRO_16k'

ALL = ['DKITCHEN_16k', 'DLIVING_16k', 'DWASHING_16k', 'NFIELD_16k', 'NPARK_16k', 'NRIVER_16k', 'OHALLWAY_16k', 'OMEETING_16k', 'OOFFICE_16k', 'PCAFETER_16k', 'PRESTO_16k', 'PSTATION_16k', 'SPSQUARE_16k', 'STRAFFIC_16k', 'TBUS_16k', 'TCAR_16k', 'TMETRO_16k']: Convenience value that selects every environment at once. Pass DEMANDNoiseType.ALL to DEMANDNoiseDataset to load all 17 environments listed above in a single call.

class libdse.data.noise.DEMANDNoiseDataset(entry_point: Path, noise_types: list[DEMANDNoiseType] | DEMANDNoiseType, sample_rate: int = 16000)

Bases: object

Loads and exposes DEMAND background-noise recordings as a single array.

The DEMAND dataset contains 18 real-world noise environments, each recorded on 16 channels at 16 kHz. Only channel 1 (ch01.wav) is used here. All selected recordings are concatenated end-to-end into noise so that callers can slice arbitrary-length segments without managing individual files.

Parameters:

entry_point (pathlib.Path) – Directory that directly contains the per-environment sub-directories (e.g. DKITCHEN_16k/, TCAR_16k/, …).
noise_types (DEMANDNoiseType or list[DEMANDNoiseType]) – Noise environments to load. Pass a single DEMANDNoiseType member, a list of members, or the special value DEMANDNoiseType.ALL to load every environment at once. Every requested type must have a matching sub-directory under entry_point.

Raises:

EntryPointError – If any requested environment directory is missing under entry_point.

noise: numpy.ndarray: 1-D float32 array containing all noise samples concatenated in the order the environment directories were iterated. Slice this directly to obtain segments of arbitrary length.

__repr__() → str

Return a concise string representation of the dataset.

Returns: DEMANDNoiseDataset(fs=F, noise_samples=N)
Return type: str

libdse.data.noise.add_noise_snr(signal: ndarray[tuple[Any, ...], dtype[float32]], noise: ndarray[tuple[Any, ...], dtype[float32]], snr_db: float) → ndarray[tuple[Any, ...], dtype[float32]]

Mix noise into signal at a target signal-to-noise ratio.

The noise array is first padded (wrap mode) or truncated to match the length of signal, then scaled so that the resulting SNR equals snr_db. If the mixture clips (peak > 1.0) it is peak-normalised.

\[\text{SNR}_{\text{dB}} = 10 \log_{10}\! \left(\frac{P_{\text{signal}}}{P_{\text{noise}}}\right)\]

Parameters:

signal (numpy.ndarray of float32) – Clean mono waveform, assumed to be in [-1, 1].
noise (numpy.ndarray of float32) – Noise waveform. May be shorter or longer than signal.
snr_db (float) – Desired signal-to-noise ratio in decibels.

Returns:

Noisy mixture with the same length as signal, peak-normalised if clipping occurs.

Return type:

numpy.ndarray of float32
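
The behaviour described above (length matching, scaling to the target SNR, peak normalisation) can be sketched as follows. This is an illustration under assumptions, not the source of add_noise_snr(): the name mix_at_snr is hypothetical, and the small constant guarding against silent noise is an addition of this sketch.

```python
import numpy as np


def mix_at_snr(signal: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix `noise` into `signal` so that the signal-to-noise ratio is snr_db."""
    n = len(signal)
    if len(noise) < n:
        # Repeat the noise from its start until it covers the signal.
        noise = np.pad(noise, (0, n - len(noise)), mode="wrap")
    else:
        noise = noise[:n]

    p_signal = np.mean(signal.astype(np.float64) ** 2)
    p_noise = np.mean(noise.astype(np.float64) ** 2) + 1e-12  # avoid division by zero
    # Solve 10 * log10(p_signal / (gain**2 * p_noise)) = snr_db for gain.
    gain = np.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))

    mixture = signal + gain * noise
    peak = np.max(np.abs(mixture))
    if peak > 1.0:
        mixture = mixture / peak  # peak-normalise only when clipping would occur
    return mixture.astype(np.float32)
```

Note that peak normalisation rescales signal and noise together, so it changes the absolute level of the mixture but not the ratio between the two.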

Exceptions: `libdse.data.err`

exception libdse.data.err.EntryPointError

Bases: Exception

Raised when the dataset entry point is not a valid root.

For LibriSpeech, the expected layout is a directory containing exactly one child named LibriSpeech/. Any deviation indicates a wrong path or a manually altered dataset.

For DEMAND, the expected layout is a directory containing one or more child directories named after the requested noise types. Any deviation indicates a wrong path or a manually altered dataset.
