Treebanks

  • a corpus of sentences where each sentence is paired with a parse tree (presumably the right one)
  • useful for:
    • corpus linguistics - investigating the empirical detail of various constructions in a given language
    • statistical parsers -
  • Penn Treebank - widely used treebank

Treebank Grammars

  • treebanks implicitly define a grammar for the language covered in the treebank
  • simply take the local rules that make up the sub-trees in all trees in the treebank and you have a grammar
  • not complete, but if you have a decent size corpus, you’ll have a grammar with decent coverage
  • such grammar tend to be very flat due to the fact that they tend to avoid recursion. for example, the Penn Treebank has 4500 different rules for VPs. Among them:
    • VP → VBD PP
    • VP → VBD PP PP
    • VP → VBD PP PP PP
    • VP → VBD PP PP PP PP

Heads in Trees

  • finding heads in treebank trees is a task that arises
  • we can visualize this task by annotating the nodes of a parse tree with the heads of each corresponding node

Head Finding

  • a standard way to do head finding is to use a simple set of tree traversal rules specific to each non-terminal in the grammar