Word Embeddings
  • a type of text embedding that maps words or phrases from the vocabulary to vectors of real numbers
  • conceptually, it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with a much lower dimension

Word Embedding - Spectrum

Classification Type

Density

Description

Reverse One-Hot Encoding

dense

  • each word is assigned a unique number
  • two downsides to this approach:
    • the integer-encoding is arbitrary (it does not capture any relationship between words).
    • an integer-encoding can be challenging for a model to interpret. A linear classifier, for example, learns a single weight for each feature. Because there is no relationship between the similarity of any two words and the similarity of their encodings, this feature-weight combination is not meaningful

One-Hot Encoding

sparse

  • each word is represented as a vector that:
    • has a length equal to the size of the vocabulary
    • contains zeros everywhere except for a single 1 at the index that corresponds to the word

Word Embeddings

dense (low-dimensional, real-valued)

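  • each word is represented as a vector of real values that is much shorter than the vocabulary size, so that similar words can receive similar vectors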
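
The three encodings in this spectrum can be compared side by side with a minimal sketch. The toy vocabulary, the embedding size of 3, and the helper names `one_hot` and `embed` are assumptions made for illustration; the embedding values are random here, whereas in a real model they are learned during training.

```python
import random

# Toy vocabulary (illustrative assumption, not from any real corpus).
vocab = ["cat", "dog", "mat", "sat", "the"]

# 1) Integer encoding: each word gets an arbitrary unique number.
word_to_id = {word: i for i, word in enumerate(vocab)}

# 2) One-hot encoding: a vocabulary-sized vector with a single 1.
def one_hot(word):
    vec = [0] * len(vocab)
    vec[word_to_id[word]] = 1
    return vec

# 3) Embedding: a short real-valued vector looked up by word id.
#    Random placeholder values stand in for learned weights.
embedding_dim = 3
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(embedding_dim)] for _ in vocab
]

def embed(word):
    return embedding_table[word_to_id[word]]

print(word_to_id["dog"])   # 1
print(one_hot("dog"))      # [0, 1, 0, 0, 0]
print(len(embed("dog")))   # 3
```

The one-hot vector grows with the vocabulary, while the embedding length stays fixed at `embedding_dim` regardless of how many words are added.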

Word Embedding - Methods Generating Mapping

Word Embeddings - Taxonomy

Bag of Words (BoW)

a text, such as a sentence or a document, is represented as the bag of its words, disregarding grammar and even word order but keeping multiplicity
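
As a minimal sketch of this idea, a bag of words can be modeled as a word-count mapping. The function name `bag_of_words` and the naive whitespace tokenizer are assumptions chosen for illustration.

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase and split on whitespace: a deliberately naive tokenizer.
    return Counter(text.lower().split())

a = bag_of_words("the dog chased the cat")
b = bag_of_words("the cat chased the dog")

print(a["the"])  # 2  (multiplicity is kept)
print(a == b)    # True  (word order is discarded)
```

The two sentences mean different things but produce identical bags, which shows what the representation gives up.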

TF-IDF

scores how important a term is to a document within a corpus by multiplying the term's frequency in that document (TF) by the term's inverse document frequency (IDF) across the corpus
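
The score can be sketched as follows. This uses one common variant (relative term frequency and an unsmoothed logarithmic IDF); libraries often apply different smoothing, and the toy documents and function names are assumptions for illustration.

```python
import math

# Toy corpus of pre-tokenized documents (illustrative assumption).
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]

def tf(term, doc):
    # Share of the document's tokens that are this term.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Log of (number of documents / documents containing the term).
    # Assumes the term occurs in at least one document.
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("the", docs[0], docs))            # 0.0
print(round(tf_idf("mat", docs[0], docs), 4))  # 0.1831
```

A word that appears in every document ("the") scores zero, while a word unique to one document ("mat") scores highest there.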

Word2Vec

shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec can utilize either of two model architectures: continuous bag-of-words (CBOW) or continuous skip-gram. In the CBOW architecture, the model predicts the current word from a window of surrounding context words. In the continuous skip-gram architecture, the model uses the current word to predict the surrounding window of context words
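
The difference between the two architectures is easiest to see in the training examples each one consumes. The sketch below only builds those examples and does not train a network; the function names and the window sizes are assumptions for illustration.

```python
def context_window(tokens, i, window):
    # Words within `window` positions of index i, excluding tokens[i].
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def cbow_examples(tokens, window=2):
    # CBOW: (surrounding context words) -> current word.
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]

def skipgram_examples(tokens, window=2):
    # Skip-gram: current word -> each surrounding context word.
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_window(tokens, i, window)]

tokens = "the quick brown fox jumps".split()

print(cbow_examples(tokens, window=1)[1])
# (['the', 'brown'], 'quick')
print(skipgram_examples(tokens, window=1)[:3])
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
```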

GloVe

an unsupervised learning algorithm for obtaining vector representations of words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space
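
The input statistics can be sketched as a table of how often each pair of words appears within a fixed window. This is a simplification: it uses plain counts with no distance weighting, and it covers only the counting step, not the model fitted to the counts. The corpus, window size, and function name are assumptions for illustration.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    # Count ordered (word, neighbor) pairs within `window` positions.
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

corpus = [
    "ice is cold".split(),
    "steam is hot".split(),
    "ice is solid".split(),
]
counts = cooccurrence_counts(corpus, window=2)

print(counts[("ice", "is")])   # 2
print(counts[("is", "ice")])   # 2
print(counts[("ice", "hot")])  # 0
```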

FastText

unlike GloVe, it embeds words by treating each word as being composed of character n-grams instead of as a single whole unit. This enables it to learn better representations for rare words and to produce vectors for out-of-vocabulary words
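
The decomposition into character n-grams can be sketched as below. The function name `char_ngrams` is hypothetical, and the `<` and `>` word-boundary markers and the default n-gram range of 3 to 6 are assumptions modeled on common FastText settings.

```python
def char_ngrams(word, n_min=3, n_max=6):
    # Wrap the word in boundary markers so prefixes and suffixes
    # are distinguishable from word-internal n-grams.
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams

print(char_ngrams("where", 3, 3))
# ['<wh', 'whe', 'her', 'ere', 're>']

# An unseen word still shares n-grams with known words.
shared = set(char_ngrams("where", 3, 3)) & set(char_ngrams("wherever", 3, 3))
print(sorted(shared))
# ['<wh', 'ere', 'her', 'whe']
```

Because a word vector is built from the vectors of its n-grams, the shared n-grams give an out-of-vocabulary word such as "wherever" a usable starting representation.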

ELMo (Embeddings from Language Models)

learns contextualized word representations based on a neural language model with a character-based encoding layer and two BiLSTM layers

CoVe (Contextualized Word Vectors)

uses a deep LSTM encoder from an attentional sequence-to-sequence model trained for machine translation to contextualize word vectors

BERT (Bidirectional Encoder Representations from Transformers)

a transformer-based language representation model trained on a large cross-domain corpus. It is pre-trained with a masked language model objective, which predicts words that are randomly masked in a sequence, together with a next-sentence-prediction task for learning the associations between sentences
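
The masking step can be sketched as follows. This is a simplified illustration, not BERT's exact procedure: it always substitutes a `[MASK]` token, whereas the original recipe sometimes keeps the word or swaps in a random one. The function name, the fixed seed, and the 15% default rate are assumptions for this sketch.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    # Returns the masked sequence plus a {position: original token}
    # mapping that serves as the prediction targets.
    rng = random.Random(seed)
    masked, labels = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            labels[i] = tok
        else:
            masked.append(tok)
    return masked, labels

tokens = "the model predicts words that are randomly masked".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3)

# Filling the targets back in recovers the original sequence.
restored = [labels.get(i, tok) for i, tok in enumerate(masked)]
print(restored == tokens)  # True
```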

XLM (Cross-lingual Language Model)

a transformer pre-trained using one of three objectives: next-token prediction (causal language modeling), a BERT-like masked language modeling objective, or a translation language modeling objective

RoBERTa (Robustly Optimized BERT Pretraining Approach)

it builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates

ALBERT (A Lite BERT for Self-supervised Learning of Language Representations)

it presents parameter-reduction techniques to lower memory consumption and increase the training speed of BERT
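
One of these techniques, factorized embedding parameterization, can be illustrated with simple arithmetic: instead of one vocabulary-by-hidden embedding matrix, the model uses a small vocabulary-by-E matrix followed by an E-by-hidden projection. The sizes below are illustrative assumptions, not figures taken from this document.

```python
# Illustrative sizes (assumptions for this sketch).
vocab_size = 30_000   # V
hidden_size = 768     # H
embedding_size = 128  # E, chosen much smaller than H

# Direct embedding table: one H-dimensional vector per word.
direct_params = vocab_size * hidden_size

# Factorized: V x E lookup table plus an E x H projection.
factorized_params = vocab_size * embedding_size + embedding_size * hidden_size

print(direct_params)      # 23040000
print(factorized_params)  # 3938304
```

With these sizes the embedding parameters shrink to roughly a sixth of the direct table, because the vocabulary-sized matrix no longer scales with the hidden size.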

Resources