Word Embeddings

a type of text embedding that maps words or phrases from the vocabulary to vectors of real numbers
conceptually, it involves a mathematical embedding from a space with many dimensions per word to a continuous vector space with a much lower dimension

Word Embeddings - Spectrum

Classification Type	Density	Description
Reverse One-Hot Encoding (integer encoding)	dense	Each word is assigned a unique integer. There are two downsides to this approach: (1) the integer encoding is arbitrary and does not capture any relationship between words; (2) an integer encoding can be challenging for a model to interpret. A linear classifier, for example, learns a single weight for each feature. Because the similarity of two words is unrelated to the similarity of their encodings, this feature-weight combination is not meaningful.
One-Hot Encoding	sparse	Each word is represented as a vector whose length equals the size of the vocabulary, with zeros everywhere except a single 1 at the index that corresponds to the word.
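Word Embeddings	dense (low-dimensional)	Each word is represented as a dense vector of floating-point values. The vector length is a chosen parameter, much smaller than the vocabulary size, and the values are learned during training so that similar words end up with similar vectors.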
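
The three rows above can be compared side by side in a minimal sketch. The toy vocabulary, the embedding size of 4, and the random initial values are assumptions for illustration; in practice the embedding values are learned during training rather than sampled.

```python
import random

# Toy vocabulary (assumed for illustration).
vocab = ["cat", "mat", "on", "sat", "the"]

# Integer encoding: each word gets an arbitrary unique index.
word_to_id = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Sparse encoding: a vocabulary-sized vector with a single 1."""
    vec = [0] * len(vocab)
    vec[word_to_id[word]] = 1
    return vec


# Embedding table: one short dense vector per word. The values are random
# placeholders here; a real model would learn them during training.
EMBEDDING_DIM = 4
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)]
    for _ in vocab
]


def embed(word):
    """Dense encoding: look up the word's row in the embedding table."""
    return embedding_table[word_to_id[word]]


print(word_to_id["sat"])  # integer encoding
print(one_hot("sat"))     # one-hot vector of length len(vocab)
print(embed("sat"))       # dense vector of length EMBEDDING_DIM
```

The one-hot vector grows with the vocabulary, while the embedding length stays fixed at whatever dimension is chosen.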

Word Embeddings - Methods for Generating the Mapping

neural networks
dimensionality reduction on the word co-occurrence matrix
probabilistic models
explainable knowledge base method
explicit representation in terms of the context in which words appear
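
As one concrete instance of the last item, a word can be represented explicitly by counts of the words that appear near it. The sketch below builds such co-occurrence vectors; the toy corpus, the window size of 1, and the function name cooccurrence_vectors are assumptions for illustration. Dimensionality reduction (the second item) would then be applied to a matrix like this one to obtain lower-dimensional vectors.

```python
from collections import defaultdict


def cooccurrence_vectors(sentences, window=1):
    """Count, for every word, how often each other word appears within
    `window` positions of it. Returns (vocab, matrix), where matrix[i][j]
    is the count of vocab[j] in the context of vocab[i]."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            row = counts[word]  # ensures the word is registered even with no neighbours
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    row[tokens[j]] += 1
    vocab = sorted(counts)
    matrix = [[counts[w][c] for c in vocab] for w in vocab]
    return vocab, matrix


corpus = ["the cat sat", "the dog sat"]  # toy corpus (assumed)
vocab, matrix = cooccurrence_vectors(corpus)
for word, row in zip(vocab, matrix):
    print(word, row)
```

In this toy corpus "cat" and "dog" occur in the same contexts, so their rows are identical, which is the sense in which context counts capture word similarity.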

Word Embeddings - Taxonomy

Bag of Words (BoW)	a text, such as a sentence or a document, is represented as the bag of its words, disregarding grammar and even word order but keeping multiplicity
TF-IDF	scores how important a term is to a document in a collection by multiplying the term’s frequency in the document (TF) by the term’s inverse document frequency (IDF)
Word2Vec	shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec can utilize either of two model architectures: continuous bag-of-words (CBOW) or continuous skip-gram. In the CBOW architecture, the model predicts the current word from a window of surrounding context words. In the continuous skip-gram architecture, the model uses the current word to predict the surrounding window of context words
GloVe	Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space
FastText	unlike GloVe, it embeds words by treating each word as being composed of character n-grams rather than as a single whole unit. This enables it not only to learn representations for rare words but also to build vectors for out-of-vocabulary words
ELMo (Embeddings from Language Models)	learns contextualized word representations based on a neural language model with a character-based encoding layer and two BiLSTM layers
CoVe (Contextualized Word Vectors)	uses a deep LSTM encoder from an attentional sequence-to-sequence model trained for machine translation to contextualize word vectors
BERT (Bidirectional Encoder Representations from Transformers)	transformer-based language representation model trained on a large cross-domain corpus. Pre-training combines a masked language modeling objective, which predicts words that are randomly masked in a sequence, with a next-sentence-prediction task for learning the associations between sentences
XLM (Cross-lingual Language Model)	it’s a transformer pre-trained using next token prediction, a BERT-like masked language modeling objective, and a translation objective
RoBERTa (Robustly Optimized BERT Pretraining Approach)	it builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates
ALBERT (A Lite BERT for Self-supervised Learning of Language Representations)	it presents parameter-reduction techniques to lower memory consumption and increase the training speed of BERT
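
The simpler entries in the table above can be sketched in a few lines. The example below is a minimal illustration, not a reference implementation: the function names, the toy documents, the whitespace tokenizer, and the unsmoothed IDF formula log(N / document frequency) are all assumptions (libraries typically use smoothed variants). The char_ngrams helper uses a single n-gram length, whereas FastText combines a range of lengths.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Word counts: order is discarded, multiplicity is kept."""
    return Counter(text.lower().split())


def tf_idf(docs):
    """Return one {term: score} dict per document.
    TF is the term count divided by the document length;
    IDF is log(N / number of documents containing the term)."""
    bags = [bag_of_words(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for bag in bags for term in bag)
    scores = []
    for bag in bags:
        total = sum(bag.values())
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in bag.items()
        })
    return scores


def char_ngrams(word, n=3):
    """Character n-grams of a word wrapped in boundary markers,
    in the spirit of FastText's subword units."""
    marked = f"<{word}>"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]


docs = ["the cat sat on the mat", "the dog sat"]  # toy documents (assumed)
print(bag_of_words(docs[0]))
print(tf_idf(docs)[0])
print(char_ngrams("where"))
```

A term that occurs in every document, such as "the" here, gets an IDF of log(1) = 0 and therefore a TF-IDF score of 0, while a term unique to one document keeps a positive score.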

Resources

What are Embeddings - Vicki Boykis