Resources

Audio Features Invariant to Signal Degradations

  • fourier coefficients
  • mel frequency cepstral coefficients (MFCC)
  • spectral flatness
  • sharpness
  • linear predictive coding (LPC)

In order to extract a 32-bit frame, 33 non-overlapping frequency bands are selected

  • frequency range from 300Hz to 2000Hz
  • logarithmic spacing (HAS operates on approximately logarithmic bands)

Initial

yt-dlp -x "https://www.youtube.com/watch?v=hLQl3WQQoQ0"
ffmpeg -i song-01.opus -c:a pcm_s24le song-01.wav
import os
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

audio_fpath = "./audio/"
audio_clips = os.listdir(audio_fpath)

# x, sr = librosa.load(audio_fpath + audio_clips[0], sr=None, offset=15.0, duration=0.01)
x, sr = librosa.load(audio_fpath + audio_clips[0], sr=None)
# x is the audio time series
# sr is the sample rate

# Compute the Short-Time Fourier Transform (STFT)
D = librosa.stft(x, n_fft=131072)

D = np.abs(D)

# Convert amplitude to decibels (log scale)
log_D = librosa.amplitude_to_db(D, ref=np.max)

# Plot the log spectrogram
plt.figure(figsize=(14, 5))
librosa.display.specshow(log_D, sr=sr, x_axis='time', y_axis='log')
plt.colorbar()
plt.show()