Weakly Supervised RE

The idea here is to either:

  • start out with a set of hand-crafted rules and automatically find new ones from the unlabeled text data, through an iterative process (bootstrapping), or
  • start out with a set of seed tuples, i.e. entity pairs known to stand in a specific relation (e.g. seed={(ORG:IBM, LOC:Armonk), (ORG:Microsoft, LOC:Redmond)} lists entity pairs that have the relation “based in”)
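
To make the seed-tuple idea concrete, here is a minimal sketch of how such a seed set could be represented and matched against raw sentences. The `Entity` type, the `cooccurrences` helper, and the sample sentences are all made up for illustration. Plain substring matching stands in for proper entity recognition.

```python
from typing import NamedTuple

class Entity(NamedTuple):
    label: str  # entity type, e.g. "ORG" or "LOC"
    text: str   # surface form as it appears in the text

# Seed tuples for the "based in" relation, taken from the example above.
seeds = {
    (Entity("ORG", "IBM"), Entity("LOC", "Armonk")),
    (Entity("ORG", "Microsoft"), Entity("LOC", "Redmond")),
}

def cooccurrences(sentences, seeds):
    """Yield (sentence, seed) for every sentence mentioning both entities of a seed."""
    for sentence in sentences:
        for org, loc in sorted(seeds):
            # Naive substring test; a real system would match tagged entity spans.
            if org.text in sentence and loc.text in sentence:
                yield sentence, (org, loc)

# Hypothetical sample sentences.
sentences = [
    "IBM is based in Armonk.",
    "Microsoft released a new compiler.",
    "Redmond is home to Microsoft.",
]
hits = list(cooccurrences(sentences, seeds))
```

Only the sentences that mention both members of a seed pair are kept. These are the raw material from which patterns are later derived.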


Snowball: Extracting relations from large plain-text collections

Snowball (Agichtein and Gravano, 2000) is an early example of an algorithm that does this:

  1. Start with a set of seed tuples (or extract a seed set from the unlabeled text with a few hand-crafted rules).
  2. Extract occurrences from the unlabeled text that match the tuples and tag them with an NER (named entity recognizer).
  3. Create patterns for these occurrences, e.g. “ORG is based in LOC”.
  4. Apply the patterns to the text to generate new tuples, e.g. (ORG:Intel, LOC:Santa Clara), and add them to the seed set.
  5. Go back to step 2, or terminate and use the patterns that were created for further extraction.
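
The loop above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not Snowball itself: a pattern here is just the exact string between an ORG and a LOC, whereas Snowball uses weighted context vectors and scores patterns and tuples by confidence. The gazetteer lookup stands in for a real NER, and the corpus and all function names are hypothetical.

```python
import re

# Toy stand-in for a real NER: a fixed lookup table (hypothetical data).
GAZETTEER = {
    "IBM": "ORG", "Microsoft": "ORG", "Intel": "ORG", "Google": "ORG",
    "Armonk": "LOC", "Redmond": "LOC", "Santa Clara": "LOC", "Mountain View": "LOC",
}

def tag(sentence):
    """Return (start, end, label, text) spans for known entities, in text order."""
    spans = []
    for name, label in GAZETTEER.items():
        for m in re.finditer(re.escape(name), sentence):
            spans.append((m.start(), m.end(), label, name))
    return sorted(spans)

def org_loc_pairs(sentence):
    """Yield (org, loc, text_between) for each adjacent ORG ... LOC pair."""
    ents = tag(sentence)
    for a, b in zip(ents, ents[1:]):
        if a[2] == "ORG" and b[2] == "LOC":
            yield a[3], b[3], sentence[a[1]:b[0]]

def bootstrap(corpus, seeds, max_iters=10):
    tuples, patterns = set(seeds), set()
    for _ in range(max_iters):
        # Steps 2-3: the contexts of known tuples become patterns.
        for sent in corpus:
            for org, loc, between in org_loc_pairs(sent):
                if (org, loc) in tuples:
                    patterns.add(between)
        # Step 4: any ORG/LOC pair joined by a known pattern is a candidate tuple.
        found = {
            (org, loc)
            for sent in corpus
            for org, loc, between in org_loc_pairs(sent)
            if between in patterns
        }
        # Step 5: stop once an iteration adds nothing new.
        if found <= tuples:
            break
        tuples |= found
    return tuples, patterns

# Made-up corpus for illustration.
corpus = [
    "IBM is based in Armonk.",
    "Microsoft is based in Redmond.",
    "Intel is based in Santa Clara.",
    "Intel has its headquarters in Santa Clara.",
    "Google has its headquarters in Mountain View.",
    "Microsoft opened an office in Santa Clara.",
]
seeds = {("IBM", "Armonk"), ("Microsoft", "Redmond")}
tuples, patterns = bootstrap(corpus, seeds)
```

On this toy corpus the first pass learns " is based in " and picks up (Intel, Santa Clara). The second pass uses that new tuple to learn " has its headquarters in ", which in turn yields (Google, Mountain View). The "opened an office in" sentence is never promoted because no known tuple supports it. With exact-match patterns and no confidence scoring, a single bad tuple could still pull in wrong patterns on real data.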

Example

  • seed tuple: <Mark Twain, Elmira>
  • grep/google for the contexts in which the seed tuple occurs, and generalize each one into a pattern:
    • Mark Twain is buried in Elmira, NY
      • X is buried in Y
    • The grave of Mark Twain is in Elmira
      • The grave of X is in Y
    • Elmira is Mark Twain’s final resting place
      • Y is X’s final resting place
  • use those patterns to grep for new tuples
  • iterate
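
As a rough sketch of the "use those patterns to grep for new tuples" step, the three patterns from the example can be turned into regular expressions with named groups for X and Y. The `NAME` expression (a run of capitalized words) is a crude assumption standing in for real entity recognition. The `compile_template` helper and the test sentences are made up for illustration.

```python
import re

# Crude assumption: an entity is one or more capitalized words.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# Patterns taken from the example above.
templates = [
    "X is buried in Y",
    "The grave of X is in Y",
    "Y is X’s final resting place",
]

def compile_template(template):
    """Turn a template into a regex: X and Y become named groups, the rest is literal."""
    parts = re.split(r"\b([XY])\b", template)
    regex = "".join(
        f"(?P<{part}>{NAME})" if part in ("X", "Y") else re.escape(part)
        for part in parts
    )
    return re.compile(regex)

patterns = [compile_template(t) for t in templates]

def extract(texts):
    """Return the set of (X, Y) tuples matched by any pattern in any text."""
    return {
        (m["X"], m["Y"])
        for text in texts
        for pattern in patterns
        for m in pattern.finditer(text)
    }

# Hypothetical new text to search.
texts = [
    "Emily Dickinson is buried in Amherst, MA.",
    "The grave of Jim Morrison is in Paris.",
    "Concord is Henry Thoreau’s final resting place.",
]
found = extract(texts)
```

Each new tuple in `found` could then be fed back in as a seed to look for further contexts, which is the "iterate" step.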