transformer-based language representation model trained on a large cross-domain corpus
applies a masked language model to predict words that are randomly masked in a sequence, and this is followed by a next-sentence-prediction task for learning the associations between sentences
because BERT doesn’t have a decoder component, it can’t generate text, which paved the way for GPT models to pick up where BERT left off