Introduction

The development of machine learning algorithms has revolutionized the way we interact with technology, with applications in fields ranging from image and speech recognition to natural language processing. One area that remains challenging is the recognition and understanding of sounds in raw audio data. Audio is a continuous signal that lacks the discrete units, such as words or characters, that written language provides, which makes it difficult to train models to understand it.

In recent years, the Wav2Vec2 model has emerged as a promising approach to solving this problem. The model uses self-supervised learning to train on raw audio data, allowing it to learn about sounds and their relationships without needing explicit labeling or annotation. This is a significant departure from traditional supervised learning methods that require large amounts of labeled data to train models.

Motivation

There are two main motivations for this work:

1. Is there a better way to train the wav-to-vec model without any labeled data?

Compared with audio, text lends itself much more easily to self-supervised learning when training a language model (LM). Commonly, a token is masked and the model is trained to guess the masked word, as in BERT. Alternatively, given $n$ tokens, the model is trained to predict the next one. Audio is more complicated: the target is no longer a single discrete token but a segment of a continuous signal.

2. Can the model benefit from training on multilingual data?

If the same model architecture is trained on two different types of data (monolingual and multilingual), which version is more beneficial for the downstream task?

Concept

The core of Wav2Vec2 is a standard Transformer, which is one of the main building blocks of the architecture. What sets the model apart is its input and its loss functions.

Preprocessing: This model takes in raw audio. No audio preprocessing step (such as a Fourier transform) is required.

Latent Representation: Raw audio $X$ is split into chunks (with some overlap) and fed into a CNN $Z$ that extracts a latent representation. For instance, chunk $i$ is mapped as $x_i \rightarrow z_i$.
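The chunking step can be sketched with explicit framing. This is only an illustration: the real model does not slice the audio by hand, because the chunking happens implicitly through the strides of the CNN. The 400-sample window (25 ms) and 320-sample hop (20 ms) at 16 kHz are figures taken from the Wav2Vec 2.0 paper rather than from the text above, and `frame_audio` is a hypothetical helper name.

```python
import numpy as np

def frame_audio(x, frame_len=400, hop=320):
    """Split a 1-D signal into overlapping frames, dropping the incomplete tail."""
    if len(x) < frame_len:
        return np.empty((0, frame_len), dtype=x.dtype)
    n_frames = 1 + (len(x) - frame_len) // hop
    # Row i holds samples [i * hop, i * hop + frame_len).
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

sample_rate = 16_000                            # assumed sampling rate
x = np.arange(sample_rate, dtype=np.float32)    # one second of dummy "audio"
frames = frame_audio(x)

overlap = 400 - 320                             # samples shared by neighbouring chunks
print(frames.shape, overlap)
```

Under these assumptions, one second of audio yields 49 chunks, and neighbouring chunks share 80 samples. Each chunk $x_i$ would then be mapped to one latent vector $z_i$.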

Contextual Representation: This latent representation $z_i$ along with $z_j$ at other time steps is fed into the standard Transformer to obtain contextual representation $c_i$. What this means is that the vector $c_i$ captures and is aware of nearby audio chunks.

Product Quantization: The same $z_i$ is also discretized into $l_i$. In other words, we try to put a label on $z_i$.

The quantizer $Q$ has two modules (codebooks), each with a vocabulary of 320 entries.
This corresponds to $320 \times 320 = 102.4$K possible combinations of code words.
The discretization uses Gumbel-Softmax so that gradients can flow through it, because the regular argmax function is not differentiable.
Each discrete label in $l_i$ is mapped back to a vector, and the vectors are concatenated (since there are two modules per $l_i$) and projected into a vector $q_i$ of the same size as $c_i$.
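The quantization step can be sketched in NumPy as follows. This is a toy illustration, not the paper's implementation: the code dimension `d_code`, the temperature `tau`, and the random codebooks are assumptions. NumPy has no automatic differentiation, so the sketch only shows the forward pass; it omits the final linear projection, and in training the hard argmax would be paired with gradients from the soft Gumbel-Softmax output.

```python
import numpy as np

rng = np.random.default_rng(0)
G, V, d_code = 2, 320, 4            # modules, entries per module, toy code dimension
codebooks = rng.normal(size=(G, V, d_code))

def gumbel_softmax(logits, tau=1.0):
    """Soft relaxation of sampling from softmax(logits)."""
    u = rng.uniform(low=1e-9, high=1.0, size=logits.shape)
    g = -np.log(-np.log(u))                    # Gumbel(0, 1) noise
    y = (logits + g) / tau
    y = y - y.max(axis=-1, keepdims=True)      # numerical stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

def quantize(logits):
    """logits: (G, V) scores for one time step. Returns (labels, q)."""
    probs = gumbel_softmax(logits)
    labels = probs.argmax(axis=-1)             # l_i: one index per module
    chosen = codebooks[np.arange(G), labels]   # (G, d_code)
    return labels, chosen.reshape(-1)          # concatenate the two modules

logits = rng.normal(size=(G, V))               # stand-in for scores computed from z_i
labels, q = quantize(logits)
n_combinations = V ** G                        # 320 * 320 = 102400
```

Here `labels` plays the role of $l_i$ and `q` the role of $q_i$ before any projection.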

Masked Prediction: Some of the latent vectors are masked before they reach the Transformer. At a masked step $i$, the model does not see $z_i$ and must instead build $c_i$ from the surrounding context. Given $c_i$, the model then has to pick, from a set of candidates, the $q_i$ that is most relevant to it.

The comparison between $c_i$ and $q_i$ is done using a simple cosine similarity function.
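The selection step can be sketched as a softmax over cosine similarities. This is a toy example under stated assumptions: the temperature of 0.1, the five random distractors, and the function names are illustrative choices, not values from the text above.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between vector a (d,) and each row of b (n, d)."""
    a_norm = a / np.linalg.norm(a)
    b_norm = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return b_norm @ a_norm

def contrastive_loss(c, candidates, target_index=0, temperature=0.1):
    """Negative log-probability of the true candidate under a softmax."""
    sims = cosine_similarity(c, candidates) / temperature
    sims = sims - sims.max()                          # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum())
    return -log_probs[target_index]

rng = np.random.default_rng(0)
d = 8
q_true = rng.normal(size=d)                           # the correct q_i
distractors = rng.normal(size=(5, d))                 # wrong candidates
candidates = np.vstack([q_true, distractors])

c_good = q_true + 0.01 * rng.normal(size=d)           # context vector close to q_i
c_bad = -q_true                                       # context vector pointing away

loss_good = contrastive_loss(c_good, candidates)
loss_bad = contrastive_loss(c_bad, candidates)
best = int(np.argmax(cosine_similarity(c_good, candidates)))
```

A context vector that points in the same direction as the true $q_i$ gets a low loss, while one that points away from it gets a high loss.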

Contrastive Loss Function
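For reference, the contrastive loss in the Wav2Vec 2.0 paper has the following form, written here with this post's index $i$. The symbols $\kappa$ (a temperature) and $\tilde{Q}_i$ (the candidate set, made of the true $q_i$ plus distractors) are notation introduced for this reconstruction.

$$
\mathcal{L}_m = -\log \frac{\exp\left(\mathrm{sim}(c_i, q_i) / \kappa\right)}{\sum_{\tilde{q} \in \tilde{Q}_i} \exp\left(\mathrm{sim}(c_i, \tilde{q}) / \kappa\right)}, \qquad \mathrm{sim}(a, b) = \frac{a^\top b}{\lVert a \rVert \, \lVert b \rVert}
$$

The loss is small when $c_i$ is much more similar to the true $q_i$ than to any of the distractors.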

Diverse Discretization: There can be cases where the model does not use the whole vocabulary of the $Q$ module and instead picks the same subset of codes again and again. To combat this problem, the authors include a diversity loss function. Essentially, it penalizes the model when it concentrates its probability on a few code labels. The lower the entropy of the code distribution, the bigger the loss value is.
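A minimal sketch of such an entropy-based penalty, assuming the input is the code probability distribution of each module averaged over a batch. The function name, the `eps` guard, and the two example distributions are illustrative choices.

```python
import numpy as np

def diversity_loss(probs, eps=1e-12):
    """probs: (G, V) averaged code probabilities, each row summing to 1.

    Returns the negative entropy summed over modules, scaled by 1 / (G * V).
    """
    G, V = probs.shape
    neg_entropy = (probs * np.log(probs + eps)).sum(axis=-1)   # shape (G,)
    return neg_entropy.sum() / (G * V)

G, V = 2, 320
uniform = np.full((G, V), 1.0 / V)          # every code used equally often
collapsed = np.full((G, V), 1e-6)           # almost all mass on a single code
collapsed[:, 0] = 1.0 - (V - 1) * 1e-6

loss_uniform = diversity_loss(uniform)
loss_collapsed = diversity_loss(collapsed)
```

The uniform distribution has the highest entropy and therefore the lowest loss, so minimizing this term pushes the model to spread its usage across all codes.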

Diversity Loss Function
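For reference, the diversity loss in the Wav2Vec 2.0 paper has the following form, where $G = 2$ is the number of modules, $V = 320$ is the vocabulary size of each module, and $\bar{p}_{g,v}$ is the probability of picking entry $v$ in module $g$, averaged over a batch. These symbol names are notation introduced for this reconstruction.

$$
\mathcal{L}_d = \frac{1}{GV} \sum_{g=1}^{G} -H(\bar{p}_g) = \frac{1}{GV} \sum_{g=1}^{G} \sum_{v=1}^{V} \bar{p}_{g,v} \log \bar{p}_{g,v}
$$

Minimizing $\mathcal{L}_d$ maximizes the entropy $H(\bar{p}_g)$ of each module, which encourages all entries to be used equally often.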

Dataset

The model was pre-trained in two separate settings: on Librispeech (960 hours of audio) and on LibriVox (53.2K hours of audio). The two datasets are not merged. Based on this blog post, it seems that the publicly released version of Wav2Vec 2.0 is pre-trained on the LibriVox dataset.

Strengths and Limitations

The self-supervised learning approach is successful. Fine-tuning the pre-trained model on the benchmark dataset for automatic speech recognition (ASR) outperforms the existing models.
Fine-tuning does not require much labeled data, which makes downstream tasks in low-resource languages feasible. The amount of labeled audio can go down to 10 hours or even 1 hour.
The paper does not properly discuss the limitations of the model.

Conclusion

In conclusion, Wav2Vec 2.0 is a promising model for self-supervised learning in the field of audio processing. By combining raw audio input with a contrastive loss and a diversity loss, the model learns representations of audio that can be fine-tuned for downstream tasks such as automatic speech recognition. One of the strengths of this approach is that it can be applied to low-resource languages, requiring only a small amount of labeled data for fine-tuning. Wav2Vec 2.0 represents a significant advance in the field of audio processing and has the potential to contribute to a wide range of applications, from speech recognition to speaker identification to music processing.