Optical Character Recognition (OCR) - image to text
Text-to-Image - text to image
Syntax Related
Grammar Induction - generate a formal grammar that describes a language’s syntax
Segmentation/Tokenizer:
Sentence Segmentation (Sentence Boundary Disambiguation) - task of separating a body of text into sentences
Tokenization (Word Segmentation) - process of breaking a body of text into tokens (e.g. words and/or phrases)
Morphological Segmentation - the task of separating words into individual morphemes and identifying the classes of morphemes
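To make sentence segmentation and tokenization concrete, here is a minimal regex-based sketch. The function names and the splitting rules are illustrative assumptions, not a standard algorithm. The naive sentence rule would wrongly split after an abbreviation such as "Dr." in "Dr. Smith arrived.", which is why the task is called boundary disambiguation.

```python
import re

def split_sentences(text: str) -> list[str]:
    # Naive rule (assumption): a sentence ends at ., ! or ?
    # when followed by whitespace and an uppercase letter.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]

def tokenize(sentence: str) -> list[str]:
    # A token is a word (optionally with internal apostrophes)
    # or a single punctuation character.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)

sentences = split_sentences("It rained. We stayed in! Why not?")
tokens = tokenize("Don't panic, it's fine.")
```

Real segmenters and tokenizers rely on abbreviation lists or trained models rather than a single pattern.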
Normalization - process of converting tokens that differ only superficially into a single canonical form (e.g. U.S.A → USA)
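Lower/Upper Casing - converting all tokens to a single case so that differently cased forms of the same word match (e.g. The → the)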
Stemming - the task of reducing inflected/derived words to their root form (removing affixes) (e.g. automates automatic automation → automat)
Porter’s algorithm - the most common English stemmer
Lemmatization - the task of removing inflectional endings and returning the lemma (base dictionary form of a word), grouping together the different forms of the same word (e.g. am are is → be | car cars car’s cars’ → car)
also takes the context of the word into account (e.g. its part of speech), which helps with related problems such as disambiguation
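The difference between normalization, stemming, and lemmatization can be sketched in a few lines. This is a toy illustration only: `toy_stem` is a hypothetical suffix stripper, not Porter's algorithm, and the suffix list and lemma table are assumptions chosen to reproduce the examples above.

```python
def normalize(token: str) -> str:
    # Drop periods inside abbreviations and fold case: "U.S.A" -> "usa".
    return token.replace(".", "").casefold()

# Hypothetical suffix list, checked in order (NOT the Porter rules).
SUFFIXES = ("ion", "ic", "es", "s")

def toy_stem(word: str) -> str:
    # Strip the first matching suffix, keeping at least 3 characters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table; a real lemmatizer uses a full dictionary
# plus context such as part of speech.
LEMMAS = {"am": "be", "are": "be", "is": "be",
          "cars": "car", "car's": "car", "cars'": "car"}

def toy_lemma(word: str) -> str:
    return LEMMAS.get(word.casefold(), word)

stems = [toy_stem(w) for w in ("automates", "automatic", "automation")]
lemmas = [toy_lemma(w) for w in ("am", "are", "is")]
```

The stemmer produces "automat", which is not a dictionary word, while the lemmatizer always returns a dictionary form; that is the practical distinction between the two.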
Syntactic Parsing - syntactic analysis of a sentence (e.g. the task of determining the parse tree of a given sentence)
Constituency Parsing - focuses on building a parse tree of the sentence's constituents (phrases)
Dependency Parsing - focuses on the relationships between words in a sentence (e.g. marking words as primary objects and predicates)
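The two parse styles differ mainly in the data structure they produce. The sketch below hand-annotates one short sentence both ways; the tuple layouts and helper names are assumptions made for illustration, and no parser is actually run.

```python
# Constituency tree as nested tuples: (label, child, ...); leaves are strings.
constituency = ("S",
                ("NP", ("DT", "The"), ("NN", "cat")),
                ("VP", ("VBZ", "sleeps")))

def leaves(tree):
    # Collect the words at the bottom of the tree, left to right.
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:
        words.extend(leaves(child))
    return words

# Dependency parse as (dependent, relation, head) triples;
# the root word has no head.
dependencies = [("The", "det", "cat"),
                ("cat", "nsubj", "sleeps"),
                ("sleeps", "root", None)]

def root(deps):
    # Return the single word whose head is None.
    return next(dep for dep, _rel, head in deps if head is None)

words = leaves(constituency)
head_word = root(dependencies)
```

The constituency tree groups words into nested phrases, whereas the dependency triples link each word directly to the word it depends on.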
Semantic Related
Word/Phrase Semantics:
Morphology - components of words that carry meaning aside from the word's actual definition (e.g. singular vs plural)
Lexical Semantics - meaning of individual words (in context)
Compositional Semantics - meaning of phrases/groups of words (e.g. distinction between Western Europe and Eastern Europe)
Distributional Semantics - theories and methods for quantifying and categorizing semantic similarities between linguistic items based on their distributional properties in large samples of language data
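A minimal sketch of the distributional idea: represent each word by the counts of its neighbouring words and compare words by cosine similarity. The three-sentence corpus, the window size, and the function names are assumptions for illustration; real systems use far larger corpora and usually weighted or learned vectors.

```python
from collections import Counter
from math import sqrt

def cooccurrence_vectors(sentences, window=2):
    # Map each word to a Counter of the words seen within `window` positions.
    vectors = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            context = tokens[max(0, i - window): i] + tokens[i + 1: i + 1 + window]
            vectors.setdefault(word, Counter()).update(context)
    return vectors

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

corpus = ["the cat drinks milk", "the dog drinks water", "the car needs fuel"]
vectors = cooccurrence_vectors(corpus)
cat_dog = cosine(vectors["cat"], vectors["dog"])
cat_car = cosine(vectors["cat"], vectors["car"])
```

In this toy corpus "cat" ends up closer to "dog" than to "car" because the first two share more context words.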
Machine Translation - task of translating a document from one language to another
Information Extraction - the task of extracting information (e.g. entities, relations, events, temporal, etc) from a body of text
Automatic Summarization - the task of producing a summary of a body of text
Referring Expressions Detection - a more general form of coreference resolution that also covers identifying “bridging relationships” (e.g. in “he entered John’s house through the front door”, the front door is a referring expression and the bridging relationship to be identified is the fact that the door is the front door of John’s house)
Co-Reference Resolution - the task of determining which words (“mentions”) refer to the same objects (“entities”). makes use of knowledge about how words like that or pronouns like it or she refer to previous parts of the discourse
Anaphora Resolution - a specific type of coreference resolution concerned with matching up pronouns with the nouns or named entities to which they refer
Question-Answering - given a question, determine the meaning of its words, then determine the answer. (see Search Engines - Types)
Conversational Agents or Dialogue Systems - superset of question-answering. computer programs that are able to converse with humans in natural language
Discourse Analysis - a number of tasks:
identifying the discourse structure of connected text
recognizing and classifying speech-acts in text (e.g. yes-no question, content question, statement, assertion, etc)
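As a rough illustration of speech-act classification, the sketch below labels an utterance using only its final punctuation and first word. The rules, the word list, and the label names are assumptions; practical systems use trained classifiers and the surrounding dialogue context.

```python
# Hypothetical list of words that open a content (wh-) question.
WH_WORDS = {"who", "what", "when", "where", "why", "which", "how"}

def classify_speech_act(utterance: str) -> str:
    text = utterance.strip()
    first_word = text.split()[0].lower() if text else ""
    if text.endswith("?"):
        # Questions opening with a wh-word ask for content;
        # every other question is treated as yes-no.
        return "content question" if first_word in WH_WORDS else "yes-no question"
    return "statement"

labels = [classify_speech_act(u) for u in
          ("Where is the station?", "Is it raining?", "It is raining.")]
```

Such surface rules fail on indirect speech acts (e.g. a request phrased as a question), which is part of what makes the task non-trivial.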