Word Embeddings
- a type of Text Embeddings that maps words or phrases from the vocabulary to vectors of real numbers
- conceptually, it involves a mathematical embedding from a space with many dimensions to a continuous vector space of much lower dimension
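
As a rough illustration of the mapping described above, the sketch below contrasts a one-hot vector (one dimension per vocabulary word, which is one common reading of the "many dimensions" space) with a lookup in a small dense table. The vocabulary, `EMBEDDING_DIM`, and the random table values are all assumptions made for illustration; a real model would learn the table from data.

```python
import random

# Hypothetical toy vocabulary; real vocabularies are far larger.
vocabulary = ["king", "queen", "apple", "orange", "car"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}


def one_hot(word):
    """High-dimensional view: one dimension per vocabulary word."""
    vector = [0.0] * len(vocabulary)
    vector[word_to_index[word]] = 1.0
    return vector


EMBEDDING_DIM = 3  # assumed value, chosen only to be smaller than the vocabulary
rng = random.Random(0)
# Untrained placeholder values; training would adjust these numbers.
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)]
    for _ in vocabulary
]


def embed(word):
    """Low-dimensional view: look up the word's row in the table."""
    return embedding_table[word_to_index[word]]


print(len(one_hot("queen")), "dimensions ->", len(embed("queen")), "dimensions")
```

The one-hot vector grows with the vocabulary, while the dense vector keeps a fixed, much smaller size.
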
Word Embedding - Spectrum
[Figure: word-embeddings-spectrum.png]
| Classification Type | Density | Description |
|---|---|---|
| | dense | |
| | sparse | |
| Word Embeddings | middle | |
Word Embedding - Methods for Generating the Mapping
- neural networks
- dimensionality reduction on the word co-occurrence matrix
- probabilistic models
- explainable knowledge base method
- explicit representation in terms of the context in which words appear
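
To make the co-occurrence idea in the list above more concrete, here is a minimal sketch that counts which words appear within a small window of each other and uses each word's row of counts directly as its (explicit, sparse) vector. The toy corpus, the `WINDOW` size, and the function names are assumptions for illustration. The dimensionality-reduction approach would start from the same counts and then factorize them, which is not shown here.

```python
from collections import Counter
import math

# Hypothetical toy corpus, tokenized by whitespace.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]
WINDOW = 2  # assumed context window size on each side


def cooccurrence_counts(sentences, window):
    """Map each word to a Counter of the words seen within the window."""
    counts = {}
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            context = tokens[lo:i] + tokens[i + 1:hi]
            counts.setdefault(word, Counter()).update(context)
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


vectors = cooccurrence_counts(corpus, WINDOW)
print(vectors["cat"])
print(cosine(vectors["cat"], vectors["dog"]))
```

Words that share contexts, such as `cat` and `dog` here, end up with similar count vectors.
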
Word Embeddings - Taxonomy
[Figure: taxonomy-of-word-embeddings.png]
| Method | Description |
|---|---|
| Bag of Words | represents a text, such as a sentence or a document, as the bag of its words, disregarding grammar and even word order but keeping multiplicity |
| TF-IDF | scores a term's importance in a document by multiplying its term frequency (TF) by its inverse document frequency (IDF) |
| Word2vec | shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words. Word2vec can use either of two model architectures: continuous bag-of-words (CBOW) or continuous skip-gram. In the CBOW architecture, the model predicts the current word from a window of surrounding context words. In the continuous skip-gram architecture, the model uses the current word to predict the surrounding window of context words |
| GloVe | trained on aggregated global word-word co-occurrence statistics from a corpus; the resulting representations show interesting linear substructures of the word vector space |
| fastText | unlike GloVe, embeds words by treating each word as a composition of character n-grams rather than as a whole word. This lets it learn representations not only for rare words but also for out-of-vocabulary words |
| ELMo | learns contextualized word representations based on a neural language model with a character-based encoding layer and two BiLSTM layers |
| CoVe | uses a deep LSTM encoder from an attentional sequence-to-sequence model trained for machine translation to contextualize word vectors |
| BERT (Bidirectional Encoder Representations from Transformers) | transformer-based language representation model trained on a large cross-domain corpus. It applies a masked language model to predict words that are randomly masked in a sequence, followed by a next-sentence-prediction task for learning the associations between sentences |
| XLM | a transformer pre-trained using next-token prediction, a BERT-like masked language modeling objective, and a translation objective |
| RoBERTa | builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates |
| ALBERT (A Lite BERT for Self-supervised Learning of Language Representations) | presents parameter-reduction techniques to lower memory consumption and increase the training speed of BERT |
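
The count-based and shallow-model descriptions in the table above can be sketched in a few lines. The example below shows a bag of words, one common TF-IDF variant (relative term frequency times `log(N / document frequency)`), the input/target pairs that CBOW and skip-gram would train on, and character n-grams with `<` and `>` boundary markers as used by fastText. The documents, function names, and window size are assumptions for illustration, and no neural network is trained here; only the data each method consumes is shown.

```python
from collections import Counter
import math

# Hypothetical toy documents, already tokenized.
documents = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
]


def bag_of_words(tokens):
    """Keep each word's multiplicity, discard word order."""
    return Counter(tokens)


def tf_idf(tokens, all_documents):
    """Relative term frequency times log(N / document frequency).

    Assumes `tokens` is one of `all_documents`, so every term has a
    document frequency of at least 1.
    """
    counts = Counter(tokens)
    n_docs = len(all_documents)
    scores = {}
    for term, count in counts.items():
        doc_freq = sum(1 for doc in all_documents if term in doc)
        scores[term] = (count / len(tokens)) * math.log(n_docs / doc_freq)
    return scores


def context_windows(tokens, window):
    """Yield (context_words, current_word) for each position."""
    for i, current in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        yield context, current


def cbow_examples(tokens, window=1):
    """CBOW: the context window is the input, the current word the target."""
    return list(context_windows(tokens, window))


def skipgram_examples(tokens, window=1):
    """Skip-gram: the current word is the input, each context word a target."""
    return [
        (current, neighbor)
        for context, current in context_windows(tokens, window)
        for neighbor in context
    ]


def char_ngrams(word, n=3):
    """Character n-grams of a word wrapped in boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


print(bag_of_words(documents[0]))
print(tf_idf(documents[0], documents))
print(cbow_examples(["the", "cat", "sat"]))
print(skipgram_examples(["the", "cat", "sat"]))
print(char_ngrams("where"))
```

With this TF-IDF variant, a term that occurs in every document (such as `the`) scores zero, while a term unique to one document scores highest.
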