Data Utilities¶
The libdse.data sub-package contains everything needed to load and
pre-process audio data for training:
Feature Extraction — libdse.data.features¶
Feature extraction utilities for log-mel power spectrograms.
This module provides the abstract BaseExtractor interface and the
concrete MelPowerSpectrumExtractor implementation used to build
training samples for the denoising autoencoder (DAE).
The feature pipeline converts raw mono waveforms into fixed-width log-mel power spectrogram vectors following the approach described in:
Lu, X., Tsao, Y., Matsuda, S., & Hori, C. (2013). Speech enhancement based on deep denoising autoencoder. INTERSPEECH 2013.
Pipeline summary¶
1. Compute the short-time Fourier transform (STFT) with a Hann window.
2. Project the magnitude-squared spectrum onto a mel filterbank.
3. Divide the resulting mel spectrogram into non-overlapping temporal windows of chunks_per_feature frames; discard incomplete trailing windows.
4. Flatten each window into a 1-D vector of length n_mels * chunks_per_feature.
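The last two steps of the pipeline (windowing and flattening) can be illustrated with a small dependency-free sketch. It is not the library's implementation: the helper name chunk_and_flatten is hypothetical, plain lists stand in for NumPy arrays, and the frame-major flattening order is an assumption.

```python
def chunk_and_flatten(frames, chunks_per_feature):
    """Group consecutive frames into non-overlapping windows and flatten each one.

    frames is a list of per-frame mel vectors, each of length n_mels.
    """
    n_windows = len(frames) // chunks_per_feature  # incomplete tail is dropped
    features = []
    for w in range(n_windows):
        window = frames[w * chunks_per_feature:(w + 1) * chunks_per_feature]
        # Assumption: frame-major order (all bins of frame 0, then frame 1, ...).
        features.append([value for frame in window for value in frame])
    return features


n_mels, chunks_per_feature = 40, 7
# 17 dummy frames; frame t is filled with the value t.
frames = [[float(t)] * n_mels for t in range(17)]
features = chunk_and_flatten(frames, chunks_per_feature)
print(len(features), len(features[0]))  # 2 280
```

With 17 frames and chunks_per_feature=7, two complete windows are produced and the last three frames are discarded.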
When a DEMANDNoiseDataset is supplied to
MelPowerSpectrumExtractor, a noisy copy of the waveform is
synthesised on the fly by add_noise_snr(). The
returned training pair is then (noisy_feature, clean_feature) instead of
(clean_feature, clean_feature).
Classes¶
BaseExtractor: Abstract base; subclass to define custom extractors.
MelPowerSpectrumExtractor: Log-mel power spectrum extractor.
MagnitudePowerSpectrumExtractor: Raw magnitude power spectrum extractor.
Typical usage¶
from pathlib import Path

from libdse.data.features import MelPowerSpectrumExtractor
from libdse.data.noise import DEMANDNoiseDataset, DEMANDNoiseType

noise_ds = DEMANDNoiseDataset(
    entry_point=Path("data/noise/DEMAND"),
    noise_types=DEMANDNoiseType.ALL,
)
extractor = MelPowerSpectrumExtractor(
    sampling_rate=16_000,
    window_length=512,
    hop_length=128,
    n_mels=40,
    chunks_per_feature=7,
    noise=noise_ds,
)

# waveform: 1-D float32 NumPy array holding one mono utterance at 16 kHz.
# Called once per utterance inside a DataLoader worker; yields one pair
# per non-overlapping spectrogram window:
for noisy_feat, clean_feat in extractor(waveform):
    ...
- libdse.data.features.Sample¶
Type alias for a feature tensor returned by an extractor.
- libdse.data.features.Label¶
Type alias for a label (target) tensor returned by an extractor.
- class libdse.data.features.BaseExtractor[source]¶
Bases: ABC
Abstract base class for feature extractors.
Defines the interface expected by LibriSpeechDataset. Concrete subclasses must implement __call__(), which converts a raw mono waveform into (sample, label) tensor pairs.
- sample_shape: tuple[int, ...]¶
Shape of a single feature vector produced by this extractor. Must be set in the subclass __init__ before the instance is passed to LibriSpeechDataset.
- abstractmethod __call__(sample: ndarray[tuple[Any, ...], dtype[float32]]) Generator[tuple[Tensor, Tensor], None, None][source]¶
Yield (feature, label) pairs from a raw audio waveform.
The waveform is split into non-overlapping windows; one pair is yielded for each window. The number of pairs depends on the duration of sample and on sample_shape.
- Parameters:
sample (numpy.ndarray of float32) – Mono audio waveform at the extractor’s expected sampling rate.
- Returns:
Generator of (feature, label) tensor pairs, each tensor having shape sample_shape.
- Return type:
Generator[tuple[torch.Tensor, torch.Tensor], None, None]
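The contract described above (set sample_shape in __init__, implement __call__ as a generator of pairs) can be sketched without importing libdse or PyTorch. Everything in this block is a stand-in: ExtractorInterface only mimics the documented interface, FixedWindowExtractor is a hypothetical subclass, and plain lists replace tensors.

```python
from abc import ABC, abstractmethod


class ExtractorInterface(ABC):
    """Stand-in mirroring the documented BaseExtractor contract (not the real class)."""

    sample_shape: tuple

    @abstractmethod
    def __call__(self, sample):
        """Yield (feature, label) pairs for one waveform."""


class FixedWindowExtractor(ExtractorInterface):
    """Hypothetical subclass: clean-only, non-overlapping windows of raw samples."""

    def __init__(self, window_length):
        self.window_length = window_length
        # Set before the instance is handed to a dataset, as the contract requires.
        self.sample_shape = (window_length,)

    def __call__(self, sample):
        n_windows = len(sample) // self.window_length
        for w in range(n_windows):
            window = sample[w * self.window_length:(w + 1) * self.window_length]
            yield window, window  # no noise source, so feature == label


extractor = FixedWindowExtractor(window_length=4)
pairs = list(extractor([0.1 * i for i in range(10)]))
print(len(pairs), extractor.sample_shape)  # 2 (4,)
```

Because __call__ is abstract, instantiating the interface class directly raises TypeError, which catches subclasses that forget to implement it.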
- class libdse.data.features.LogMelPowerSpectrumExtractor(sampling_rate: int, window_length: int, hop_length: int, n_mels: int, chunks_per_feature: int, noise: DEMANDNoiseDataset | None)[source]¶
Bases: BaseExtractor
Log-mel power spectrum feature extractor.
Converts a raw mono waveform into a sequence of log-mel power spectrogram feature vectors. The STFT is computed with a Hann window of window_length samples advanced by hop_length samples; the power spectrum is projected through a mel filterbank; and the spectrogram is divided into non-overlapping windows of chunks_per_feature frames. Calling an instance yields one (feature, label) pair per window.
When noise is provided, a noisy version of the waveform is synthesised by add_noise_snr() at a randomly selected SNR of 0, 5, or 10 dB, and every yielded pair becomes (noisy_feature, clean_feature).
The extractor is designed to be instantiated once and called repeatedly, one call per utterance, from inside a DataLoader.
- Parameters:
sampling_rate (int) – Expected sample rate of input waveforms in Hz.
window_length (int) – STFT window length in samples (also used as the FFT size).
hop_length (int) – STFT hop size in samples.
n_mels (int) – Number of mel filterbank bins.
chunks_per_feature (int) – Number of consecutive spectrogram frames per output feature vector.
noise (DEMANDNoiseDataset or None) – Optional DEMAND noise dataset used for on-the-fly noise mixing. Pass None for clean-only feature extraction.
Example
extractor = MelPowerSpectrumExtractor(
    sampling_rate=16_000,
    window_length=512,
    hop_length=128,
    n_mels=40,
    chunks_per_feature=7,
    noise=None,
)
for feature, label in extractor(waveform):
    assert feature.shape == (40 * 7,)
- mel_power_spectrum(sample: ndarray[tuple[Any, ...], dtype[float32]]) ndarray[tuple[Any, ...], dtype[float32]][source]¶
Compute a (log)-mel power spectrogram and split it into fixed-length chunks.
Follows the feature extraction procedure described in:
Lu, X. et al. (2012). Speech Restoration Based on Deep Learning Autoencoder with Layer-Wised Pretraining.
The spectrogram is divided into non-overlapping temporal windows of chunks_per_feature frames. Incomplete trailing windows are discarded without padding.
- Parameters:
sample (numpy.ndarray of float32) – Mono audio waveform.
- Returns:
Array of shape (n_chunks, n_mels * chunks_per_feature) where each row is a flattened temporal window.
- Return type:
numpy.ndarray of float32
- __call__(sample: ndarray[tuple[Any, ...], dtype[float32]]) Generator[tuple[Tensor, Tensor], None, None][source]¶
Yield (feature, label) pairs for every non-overlapping window.
The waveform is converted to a mel power spectrogram, divided into non-overlapping windows of chunks_per_feature frames, and one pair is yielded per window. Incomplete trailing windows are discarded.
When noise is set, a synthetic noisy copy of the waveform is blended at a randomly selected SNR of 0, 5, or 10 dB before feature extraction, and the pair becomes (noisy_feature, clean_feature).
- Parameters:
sample (numpy.ndarray of float32) – Mono audio waveform at fs Hz.
- Returns:
Generator of (feature, label) tensor pairs, each tensor having shape (n_mels * chunks_per_feature,).
- Return type:
Generator[tuple[torch.Tensor, torch.Tensor], None, None]
- class libdse.data.features.PowerSpectrumExtractor(sampling_rate: int, window_length: int, hop_length: int, noise: DEMANDNoiseDataset | None)[source]¶
Bases: BaseExtractor
Raw magnitude power spectrum feature extractor (no mel projection).
Converts a raw mono waveform into a sequence of single-sided magnitude power spectrum frames. Unlike MelPowerSpectrumExtractor, no mel filterbank is applied: the full (1 + window_length // 2)-bin power spectrum of each STFT frame is used directly as a feature vector.
Calling an instance yields one (feature, label) pair per STFT frame.
When noise is provided, a noisy version of the waveform is synthesised by add_noise_snr() at a randomly selected SNR of 0, 5, or 10 dB, and every yielded pair becomes (noisy_feature, clean_feature).
The extractor is designed to be instantiated once and called repeatedly, one call per utterance, from inside a DataLoader.
- Parameters:
sampling_rate (int) – Expected sample rate of input waveforms in Hz.
window_length (int) – STFT window length in samples (also used as the FFT size). Each feature vector has length 1 + window_length // 2.
hop_length (int) – STFT hop size in samples.
noise (DEMANDNoiseDataset or None) – Optional DEMAND noise dataset used for on-the-fly noise mixing. Pass None for clean-only feature extraction.
Example
extractor = MagnitudePowerSpectrumExtractor(
    sampling_rate=16_000,
    window_length=512,
    hop_length=256,
    noise=None,
)
for feature, label in extractor(waveform):
    assert feature.shape == (257,)  # 1 + 512 // 2
- sample_shape: tuple¶
Flat length of each feature vector: 1 + window_length // 2 (the number of unique frequency bins in the single-sided STFT).
- magnitude_power_spectrum(sample: ndarray[tuple[Any, ...], dtype[float32]]) ndarray[tuple[Any, ...], dtype[float32]][source]¶
Compute the single-sided magnitude power spectrum frame by frame.
Applies the STFT with a Hann window and returns |STFT|², the power of each frequency bin for every frame. No mel projection is applied.
- Parameters:
sample (numpy.ndarray of float32) – Mono audio waveform.
- Returns:
Array of shape (1 + window_length // 2, n_frames) where each column is the power spectrum of one STFT frame.
- Return type:
numpy.ndarray of float32
- __call__(sample: ndarray[tuple[Any, ...], dtype[float32]]) Generator[tuple[Tensor, Tensor], None, None][source]¶
Yield (feature, label) pairs for every STFT frame.
The waveform is converted to a magnitude power spectrogram and one pair is yielded per frame (column of the spectrogram). Each tensor contains the single-sided power spectrum of that frame.
When noise is set, a synthetic noisy copy of the waveform is blended at a randomly selected SNR of 0, 5, or 10 dB before feature extraction, and the pair becomes (noisy_feature, clean_feature).
- Parameters:
sample (numpy.ndarray of float32) – Mono audio waveform at fs Hz.
- Returns:
Generator of (feature, label) tensor pairs, each tensor having shape (1 + window_length // 2,).
- Return type:
Generator[tuple[torch.Tensor, torch.Tensor], None, None]
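To make the 1 + window_length // 2 feature length concrete, the following sketch computes the single-sided power spectrum of one Hann-windowed frame with a naive DFT. It is illustrative only: the function name power_spectrum_frame is hypothetical, the periodic form of the Hann window is an assumption (libraries differ between periodic and symmetric windows), and a real implementation would use an FFT.

```python
import cmath
import math


def power_spectrum_frame(frame):
    """Single-sided power spectrum |DFT|^2 of one Hann-windowed frame."""
    n = len(frame)
    # Periodic Hann window (assumption; a symmetric window would divide by n - 1).
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(1 + n // 2):  # bins above n // 2 mirror these for real input
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


window_length = 16
# A cosine that completes exactly three cycles inside the frame.
frame = [math.cos(2 * math.pi * 3 * i / window_length) for i in range(window_length)]
power = power_spectrum_frame(frame)
print(len(power), power.index(max(power)))  # 9 3
```

The 16-sample frame yields 9 bins, and the energy peaks at bin 3, the frequency of the test cosine; the Hann window spreads a smaller amount of energy into bins 2 and 4.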
- class libdse.data.features.LogMagnitudeSpectrumExtractor(sampling_rate: int, window_length: int, hop_length: int, noise: DEMANDNoiseDataset | None)[source]¶
Bases: BaseExtractor
Log-magnitude power spectrum feature extractor (no mel projection).
Converts a raw mono waveform into a sequence of single-sided log magnitude spectrum frames.
Calling an instance yields one (feature, label) pair per STFT frame.
When noise is provided, a noisy version of the waveform is synthesised by add_noise_snr() at a randomly selected SNR of 0, 5, or 10 dB, and every yielded pair becomes (noisy_feature, clean_feature).
The extractor is designed to be instantiated once and called repeatedly, one call per utterance, from inside a DataLoader.
- Parameters:
sampling_rate (int) – Expected sample rate of input waveforms in Hz.
window_length (int) – STFT window length in samples (also used as the FFT size). Each feature vector has length 1 + window_length // 2.
hop_length (int) – STFT hop size in samples.
noise (DEMANDNoiseDataset or None) – Optional DEMAND noise dataset used for on-the-fly noise mixing. Pass None for clean-only feature extraction.
Example
extractor = LogMagnitudeSpectrumExtractor(
    sampling_rate=16_000,
    window_length=512,
    hop_length=256,
    noise=None,
)
for feature, label in extractor(waveform):
    assert feature.shape == (257,)  # 1 + 512 // 2
- sample_shape: tuple¶
Flat length of each feature vector: 1 + window_length // 2 (the number of unique frequency bins in the single-sided STFT).
- log_magnitude_power_spectrum(sample: ndarray[tuple[Any, ...], dtype[float32]]) ndarray[tuple[Any, ...], dtype[float32]][source]¶
Compute the single-sided magnitude power spectrum frame by frame.
Applies the STFT with a Hann window and returns |STFT|², the power of each frequency bin for every frame. No mel projection is applied.
- Parameters:
sample (numpy.ndarray of float32) – Mono audio waveform.
- Returns:
Array of shape (1 + window_length // 2, n_frames) where each column is the power spectrum of one STFT frame.
- Return type:
numpy.ndarray of float32
- __call__(sample: ndarray[tuple[Any, ...], dtype[float32]]) Generator[tuple[Tensor, Tensor], None, None][source]¶
Yield (feature, label) pairs for every STFT frame.
The waveform is converted to a magnitude power spectrogram and one pair is yielded per frame (column of the spectrogram). Each tensor contains the single-sided power spectrum of that frame.
When noise is set, a synthetic noisy copy of the waveform is blended at a randomly selected SNR of 0, 5, or 10 dB before feature extraction, and the pair becomes (noisy_feature, clean_feature).
- Parameters:
sample (numpy.ndarray of float32) – Mono audio waveform at fs Hz.
- Returns:
Generator of (feature, label) tensor pairs, each tensor having shape (1 + window_length // 2,).
- Return type:
Generator[tuple[torch.Tensor, torch.Tensor], None, None]
- class libdse.data.features.RawWaveformExtractor(sampling_rate: int, window_length: int, noise: DEMANDNoiseDataset | None)[source]¶
Bases: BaseExtractor
Raw waveform extractor; no frequency transform is applied.
Splits a mono waveform into non-overlapping windows of window_length samples and yields each window directly as a feature vector. This is the natural companion extractor for time-domain models such as WaveUNet that operate on raw audio rather than spectrograms.
When noise is provided, a noisy mixture is generated with add_noise_snr() at a random SNR (0, 5, or 10 dB) and the pair becomes (noisy_window, clean_window); otherwise both elements of the pair are the same clean window.
- Parameters:
sampling_rate (int) – Expected sample rate of input waveforms in Hz.
window_length (int) – Number of samples per output feature vector.
noise (DEMANDNoiseDataset or None) – Optional DEMAND noise dataset for on-the-fly noise mixing.
- __call__(sample: ndarray[tuple[Any, ...], dtype[float32]]) Generator[tuple[Tensor, Tensor], None, None][source]¶
Yield (feature, label) pairs for every non-overlapping window.
The waveform is zero-padded at the end when its length is not an integer multiple of window_length, ensuring no samples are silently dropped.
- Parameters:
sample (numpy.ndarray of float32) – Mono audio waveform at fs Hz.
- Returns:
Generator of (feature, label) tensor pairs, each of shape (window_length,).
- Return type:
Generator[tuple[torch.Tensor, torch.Tensor], None, None]
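The zero-padding behaviour described for RawWaveformExtractor can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the library code: split_with_zero_padding is a hypothetical helper and lists stand in for NumPy arrays.

```python
def split_with_zero_padding(waveform, window_length):
    """Pad the tail with zeros up to a multiple of window_length, then split."""
    remainder = len(waveform) % window_length
    if remainder:
        waveform = waveform + [0.0] * (window_length - remainder)
    return [waveform[i:i + window_length]
            for i in range(0, len(waveform), window_length)]


windows = split_with_zero_padding([1.0] * 10, window_length=4)
print(len(windows), windows[-1])  # 3 [1.0, 1.0, 0.0, 0.0]
```

Ten samples with window_length=4 give three windows; the last one carries the two remaining samples followed by two zeros, so nothing is dropped.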
LibriSpeech Dataset — libdse.data.librispeech¶
Streaming PyTorch dataset for the LibriSpeech ASR corpus.
This module provides LibriSpeechDataset, an
IterableDataset that streams (sample, label)
tensor pairs directly from raw FLAC audio files. Feature extraction is fully
delegated to a BaseExtractor instance supplied at
construction, keeping the dataset class decoupled from the specific feature
representation.
Internally the dataset discovers all FLAC files under entry_point, shuffles
them once at construction to reduce temporal correlation between consecutive
batches, and then iterates through each file. For every utterance the raw
waveform is loaded at 16 kHz and passed to the extractor, and every
(sample, label) pair the extractor yields is emitted unchanged.
Layout assumption¶
The entry_point directory must contain exactly one sub-directory named
LibriSpeech/, matching the structure produced by the official LibriSpeech
tar archives:
entry_point/
└── LibriSpeech/
└── <speaker>/<chapter>/<utterance>.flac
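A layout check and file discovery along these lines could look like the sketch below. It is an assumption-laden illustration rather than the library's code: discover_flac_files is a hypothetical helper, EntryPointError is redefined locally as a stand-in, the check only verifies that LibriSpeech/ exists, and the seeded shuffle is added purely to keep the example reproducible.

```python
import random
import tempfile
from pathlib import Path


class EntryPointError(Exception):
    """Local stand-in for libdse.data.err.EntryPointError."""


def discover_flac_files(entry_point, seed=None):
    """Validate the layout, then return all FLAC paths in shuffled order."""
    root = entry_point / "LibriSpeech"
    if not entry_point.is_dir() or not root.is_dir():
        raise EntryPointError(f"{entry_point} has no LibriSpeech/ sub-directory")
    files = sorted(root.rglob("*.flac"))  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(files)
    return files


# Build a tiny throwaway tree that follows the documented layout.
with tempfile.TemporaryDirectory() as tmp:
    entry_point = Path(tmp)
    chapter = entry_point / "LibriSpeech" / "19" / "198"
    chapter.mkdir(parents=True)
    for name in ("19-198-0000.flac", "19-198-0001.flac", "19-198.trans.txt"):
        (chapter / name).touch()
    found = [p.name for p in discover_flac_files(entry_point, seed=0)]

    # Pointing at the wrong directory is reported instead of yielding nothing.
    try:
        discover_flac_files(chapter)
        layout_error_raised = False
    except EntryPointError:
        layout_error_raised = True
```

Only the two .flac files are returned; the transcript file is ignored, and passing a directory without a LibriSpeech/ child raises the error.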
Classes¶
LibriSpeechDataset: Iterable PyTorch dataset.
Exceptions¶
EntryPointError: Raised when entry_point is invalid.
- class libdse.data.librispeech.LibriSpeechDataset(entry_point: Path, extractor: BaseExtractor, sample_rate: int = 16000)[source]¶
Bases: IterableDataset
Iterable PyTorch dataset for the LibriSpeech ASR corpus.
Streams (sample, label) tensor pairs directly from raw FLAC audio files. Feature extraction (STFT, mel projection, windowing, and optional noise mixing) is entirely delegated to the extractor argument, making this class agnostic about the feature representation.
FLAC files are discovered recursively under entry_point at construction and shuffled once to reduce temporal correlation between consecutive training batches. Thereafter each utterance is passed to extractor(waveform), and every (sample, label) pair the extractor yields is emitted in turn.
- Parameters:
entry_point (pathlib.Path) – LibriSpeech root directory. Must contain a single child directory named LibriSpeech/.
extractor (BaseExtractor) – Feature extractor instance. Called once per utterance with the raw mono waveform (float32, 16 kHz) as its sole argument; it must yield (sample, label) tensor pairs.
- Raises:
EntryPointError – If entry_point is not a directory or does not contain a LibriSpeech/ sub-directory.
Note
Because the number of feature chunks per utterance is not known without reading every file, __len__() is not supported. Use the DataLoader and iterate until StopIteration.
See also
MelPowerSpectrumExtractor: Default extractor implementation.
DEMANDNoiseDataset: Noise dataset injected into the extractor for on-the-fly mixing.
Typical usage
from pathlib import Path

from torch.utils.data import DataLoader

from libdse.data.features import MelPowerSpectrumExtractor
from libdse.data.librispeech import LibriSpeechDataset
from libdse.data.noise import DEMANDNoiseDataset, DEMANDNoiseType

noise_ds = DEMANDNoiseDataset(
    entry_point=Path("data/noise/DEMAND"),
    noise_types=DEMANDNoiseType.ALL,
)
extractor = MelPowerSpectrumExtractor(
    sampling_rate=16_000,
    window_length=512,
    hop_length=128,
    n_mels=40,
    chunks_per_feature=7,
    noise=noise_ds,
)
ds = LibriSpeechDataset(
    entry_point=Path("data/train-clean-100"),
    extractor=extractor,
)
loader = DataLoader(ds, batch_size=32)

# model and criterion are defined elsewhere by the training script.
for noisy, clean in loader:
    loss = criterion(model(noisy), clean)
- fs¶
Sampling rate for the entire LibriSpeech corpus. The original corpus is sampled at 16 kHz, and all files are resampled to this rate at load time.
- sample_shape¶
Shape of a single feature vector, as reported by the extractor.
- __repr__() str[source]¶
Return a concise string representation of the dataset.
- Returns:
LibriSpeechDataset(n_files=M, sample_shape=S)
- Return type:
str
- __len__() None[source]¶
Not implemented — the dataset length cannot be determined cheaply.
The exact number of (sample, label) pairs depends on the duration of every audio file in the corpus. Scanning all files upfront would be prohibitively slow, so len() is intentionally unsupported. Use the DataLoader and iterate until StopIteration.
- Raises:
NotImplementedError – Always.
- __iter__() Generator[tuple[ndarray[tuple[Any, ...], dtype[_ScalarT]], ndarray[tuple[Any, ...], dtype[_ScalarT]]], None, None][source]¶
Yield (sample, label) tensor pairs by streaming each FLAC file.
For every utterance the raw waveform is loaded at 16 kHz and passed to extractor via yield from. The extractor is itself a generator that yields one (sample, label) pair per non-overlapping spectrogram window, so the total number of pairs emitted by this iterator is roughly proportional to the total audio duration.
- Returns:
Generator of (sample, label) tensor pairs.
- Return type:
Generator[tuple[torch.Tensor, torch.Tensor], None, None]
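The yield from delegation described for __iter__ can be pictured with the sketch below. All names are hypothetical stand-ins: stream_pairs plays the role of the iterator, fake_load replaces FLAC decoding, and two_sample_windows replaces a real extractor.

```python
def stream_pairs(files, load_waveform, extractor):
    """Stream every pair the extractor produces, one file at a time."""
    for path in files:
        waveform = load_waveform(path)
        # Delegate to the extractor generator; its pairs pass through unchanged.
        yield from extractor(waveform)


def fake_load(path):
    # Stand-in loader: the "file name" encodes the number of samples.
    return [0.0] * int(path)


def two_sample_windows(waveform):
    # Stand-in extractor: non-overlapping windows of two samples, tail dropped.
    for start in range(0, len(waveform) - 1, 2):
        window = waveform[start:start + 2]
        yield window, window


pairs = list(stream_pairs(["4", "5", "1"], fake_load, two_sample_windows))
print(len(pairs))  # 4
```

The three stand-in files contribute 2, 2, and 0 pairs, which shows why the total length depends on every file's duration and cannot be known without reading the corpus.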
DEMAND Noise Dataset — libdse.data.noise¶
Noise dataset utilities for the DEMAND corpus.
This module provides three public objects used to load and mix real-world background noise into clean speech:
DEMANDNoiseType: An Enum that maps human-readable environment names to the exact directory names used in the DEMAND dataset archive.
DEMANDNoiseDataset: Loads one or more noise environments from disk, concatenates them into a single array, and exposes it for slicing.
add_noise_snr(): Mixes a noise segment into a clean signal at a caller-specified signal-to-noise ratio.
The DEMAND dataset contains 18 noise environments recorded at 16 kHz on
16 channels. Only channel 1 (ch01.wav) is used here.
Typical usage¶
from pathlib import Path

from libdse.data.noise import DEMANDNoiseDataset, DEMANDNoiseType, add_noise_snr

noise_ds = DEMANDNoiseDataset(
    entry_point=Path("data/noise/DEMAND"),
    noise_types=DEMANDNoiseType.ALL,
)
# clean_waveform: 1-D float32 NumPy array holding one mono utterance.
noisy = add_noise_snr(signal=clean_waveform, noise=noise_ds.noise[:len(clean_waveform)], snr_db=10)
- class libdse.data.noise.DEMANDNoiseType(*values)[source]¶
Bases: Enum
Directory-name identifiers for the DEMAND noise dataset.
Each member’s value is the exact directory name inside the DEMAND archive, which follows the pattern <CATEGORY><NAME>_<FS>k.
Pass a subset of members (or the convenience member ALL) to DEMANDNoiseDataset to control which environments are loaded.
Members¶
Member        Directory name
KITCHEN       DKITCHEN_16k
LIVING        DLIVING_16k
WASHING       DWASHING_16k
FIELD         NFIELD_16k
PARK          NPARK_16k
RIVER         NRIVER_16k
HALLWAY       OHALLWAY_16k
MEETING       OMEETING_16k
OFFICE        OOFFICE_16k
CAFETERIA     PCAFETER_16k
RESTAURANT    PRESTO_16k
STATION       PSTATION_16k
SQUARE        SPSQUARE_16k
TRAFFIC       STRAFFIC_16k
BUS           TBUS_16k
CAR           TCAR_16k
METRO         TMETRO_16k
ALL           (all of the above)
- KITCHEN = 'DKITCHEN_16k'¶
- LIVING = 'DLIVING_16k'¶
- WASHING = 'DWASHING_16k'¶
- FIELD = 'NFIELD_16k'¶
- PARK = 'NPARK_16k'¶
- RIVER = 'NRIVER_16k'¶
- HALLWAY = 'OHALLWAY_16k'¶
- MEETING = 'OMEETING_16k'¶
- OFFICE = 'OOFFICE_16k'¶
- CAFETERIA = 'PCAFETER_16k'¶
- RESTAURANT = 'PRESTO_16k'¶
- STATION = 'PSTATION_16k'¶
- SQUARE = 'SPSQUARE_16k'¶
- TRAFFIC = 'STRAFFIC_16k'¶
- BUS = 'TBUS_16k'¶
- CAR = 'TCAR_16k'¶
- METRO = 'TMETRO_16k'¶
- ALL = ['DKITCHEN_16k', 'DLIVING_16k', 'DWASHING_16k', 'NFIELD_16k', 'NPARK_16k', 'NRIVER_16k', 'OHALLWAY_16k', 'OMEETING_16k', 'OOFFICE_16k', 'PCAFETER_16k', 'PRESTO_16k', 'PSTATION_16k', 'SPSQUARE_16k', 'STRAFFIC_16k', 'TBUS_16k', 'TCAR_16k', 'TMETRO_16k']¶
Convenience value that selects every environment at once. Pass DEMANDNoiseType.ALL to DEMANDNoiseDataset to load all 17 environments defined by this enum in a single call.
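One way a consumer might normalise the three accepted forms (a single member, a list of members, or ALL) into a flat list of directory names is sketched below. NoiseType is a shortened, hypothetical stand-in for DEMANDNoiseType, and directory_names is an illustrative helper, not part of libdse.

```python
from enum import Enum


class NoiseType(Enum):
    """Three-member stand-in for DEMANDNoiseType (hypothetical, shortened)."""

    KITCHEN = "DKITCHEN_16k"
    CAR = "TCAR_16k"
    METRO = "TMETRO_16k"
    ALL = ["DKITCHEN_16k", "TCAR_16k", "TMETRO_16k"]


def directory_names(noise_types):
    """Accept one member or a list of members; return unique directory names."""
    members = noise_types if isinstance(noise_types, list) else [noise_types]
    names = []
    for member in members:
        # ALL carries a list of names; every other member carries a single name.
        values = member.value if isinstance(member.value, list) else [member.value]
        for name in values:
            if name not in names:  # keep first-seen order, drop duplicates
                names.append(name)
    return names


print(directory_names(NoiseType.ALL))
print(directory_names([NoiseType.CAR, NoiseType.KITCHEN]))
```

Deduplicating keeps a call such as directory_names([NoiseType.CAR, NoiseType.ALL]) from loading the same environment twice.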
- class libdse.data.noise.DEMANDNoiseDataset(entry_point: Path, noise_types: list[DEMANDNoiseType] | DEMANDNoiseType, sample_rate: int = 16000)[source]¶
Bases: object
Loads and exposes DEMAND background-noise recordings as a single array.
The DEMAND dataset contains 18 real-world noise environments, each recorded on 16 channels at 16 kHz. Only channel 1 (ch01.wav) is used here. All selected recordings are concatenated end-to-end into noise so that callers can slice arbitrary-length segments without managing individual files.
- Parameters:
entry_point (pathlib.Path) – Directory that directly contains the per-environment sub-directories (e.g. DKITCHEN_16k/, TCAR_16k/, …).
noise_types (DEMANDNoiseType or list[DEMANDNoiseType]) – Noise environments to load. Pass a single DEMANDNoiseType member, a list of members, or the special value DEMANDNoiseType.ALL to load every environment at once. Every requested type must have a matching sub-directory under entry_point.
- Raises:
EntryPointError – If any requested environment directory is missing under entry_point.
- noise: numpy.ndarray¶
1-D float32 array containing all noise samples concatenated in the order the environment directories were iterated. Slice this directly to obtain segments of arbitrary length.
- libdse.data.noise.add_noise_snr(signal: ndarray[tuple[Any, ...], dtype[float32]], noise: ndarray[tuple[Any, ...], dtype[float32]], snr_db: float) ndarray[tuple[Any, ...], dtype[float32]][source]¶
Mix noise into signal at a target signal-to-noise ratio.
The noise array is first padded (wrap mode) or truncated to match the length of signal, then scaled so that the resulting SNR equals snr_db. If the mixture clips (peak > 1.0) it is peak-normalised.
\[\text{SNR}_{\text{dB}} = 10 \log_{10}\! \left(\frac{P_{\text{signal}}}{P_{\text{noise}}}\right)\]
- Parameters:
signal (numpy.ndarray of float32) – Clean mono waveform, assumed to be in [-1, 1].
noise (numpy.ndarray of float32) – Noise waveform. May be shorter or longer than signal.
snr_db (float) – Desired signal-to-noise ratio in decibels.
- Returns:
Noisy mixture with the same length as signal, peak-normalised if clipping occurs.
- Return type:
numpy.ndarrayof float32
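The scaling step follows from the SNR definition: multiplying the noise by a gain g changes its power to g² · P_noise, so the target is met when g = sqrt(P_signal / (P_noise · 10^(snr_db / 10))). The sketch below applies that formula in plain Python. It is not the library's add_noise_snr(): mix_at_snr and mean_power are hypothetical names, lists replace NumPy arrays, the gain is returned only so the result can be checked, and non-silent noise is assumed.

```python
import math
from itertools import cycle, islice


def mean_power(x):
    return sum(v * v for v in x) / len(x)


def mix_at_snr(signal, noise, snr_db):
    """Return (mixture, gain) with the noise scaled to the requested SNR."""
    # Wrap (repeat) or truncate the noise so it matches the signal length.
    noise = list(islice(cycle(noise), len(signal)))
    # Assumes mean_power(noise) > 0; silent noise would divide by zero.
    gain = math.sqrt(mean_power(signal) / (mean_power(noise) * 10 ** (snr_db / 10)))
    mixture = [s + gain * n for s, n in zip(signal, noise)]
    peak = max(abs(v) for v in mixture)
    if peak > 1.0:  # peak-normalise only when the mixture would clip
        mixture = [v / peak for v in mixture]
    return mixture, gain


signal = [0.5 * math.sin(2 * math.pi * i / 50) for i in range(400)]
noise = [0.1 * (-1) ** i for i in range(64)]  # shorter than the signal on purpose
mixture, gain = mix_at_snr(signal, noise, snr_db=10)

# Verify the achieved SNR from the scaled, wrapped noise.
wrapped = list(islice(cycle(noise), len(signal)))
achieved = 10 * math.log10(mean_power(signal) / mean_power([gain * n for n in wrapped]))
```

Here achieved equals the requested 10 dB up to floating-point error. Peak normalisation rescales signal and noise together, so it changes the overall level but not the ratio between them.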
Exceptions — libdse.data.err¶
- exception libdse.data.err.EntryPointError[source]¶
Bases: Exception
Raised when the dataset entry point is not a valid root.
For LibriSpeech, the expected layout is a directory containing exactly one child named LibriSpeech/. Any deviation indicates a wrong path or a manually altered dataset.
For DEMAND, the expected layout is a directory containing one or more child directories named after the requested noise types. Any deviation indicates a wrong path or a manually altered dataset.