wtf-plugin-classify

2.1.0 • Public • Published
a plugin for wtf_wikipedia

npm install wtf-plugin-classify

This plugin uses a (large) number of heuristics to classify a Wikipedia article into a basic Person/Place/Thing scheme.

Things it looks at:

  • infoboxes (like {{Infobox Person ...}})
  • categories (like '[[Category:Canadian Saxophone Players]]')
  • templates (like {{Liechtenstein-sport-bio-stub}})
  • sections (like '==Early life==')
  • titles (like 'John Smith (poet)')

const wtf = require('wtf_wikipedia')
wtf.extend(require('wtf-plugin-classify'))

wtf.fetch('Toronto Raptors').then((doc) => {
  let res = doc.classify()
  //{
  //  type: 'Organization/SportsTeam',
  //  score: 0.9,
  //  details: {...}
  //}
})

<script src="https://unpkg.com/wtf_wikipedia"></script>
<script src="https://unpkg.com/wtf-plugin-classify"></script>
<script defer>
  wtf.plugin(window.wtfClassify)
  wtf.fetch('Radiohead', function (err, doc) {
    console.log(doc.classify())
  })
</script>
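
Once you have a result, you will usually want to do something with the type and score. Here is a rough sketch of one way to consume it. The helper names (rootType, groupByRoot), the 0.5 cutoff, and the assumption that unclassified pages come back with a null type are all illustrative and are not part of the plugin. Only the { type, score, details } shape is taken from the example above.

```javascript
// Hypothetical helper: reduce a classify() result to its top-level type,
// ignoring low-confidence answers. The 0.5 cutoff is an arbitrary assumption.
function rootType(result, minScore = 0.5) {
  if (!result || !result.type) return null // assumed shape for unclassified pages
  if (result.score < minScore) return null
  return result.type.split('/')[0]
}

// Hypothetical helper: group page titles by their top-level type
function groupByRoot(pages, minScore = 0.5) {
  const groups = {}
  pages.forEach(({ title, result }) => {
    const root = rootType(result, minScore) || 'Unknown'
    groups[root] = groups[root] || []
    groups[root].push(title)
  })
  return groups
}

// Sample data in the documented shape; the last two entries are made up
const pages = [
  { title: 'Toronto Raptors', result: { type: 'Organization/SportsTeam', score: 0.9 } },
  { title: 'Some stub', result: { type: null, score: 0 } },
  { title: 'Borderline page', result: { type: 'Place/Jurisdiction/City', score: 0.2 } },
]
console.log(groupByRoot(pages))
```

In real use, each result would come from doc.classify() rather than from hand-written sample data.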

Justification:

Traversing Wikipedia's categories to find, say, all the People or Places is a notoriously broken strategy.

Infoboxes like {{Infobox person}} are a really clear signal, but get muddled quickly with things like {{Infobox architect}}.

This library tries to do this sort of work, determining in broad terms whether a page is about a Person, a Place, or an Organization.
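
To give a feel for what "this sort of work" means, here is a toy sketch that combines three of the signals (infobox name, categories, and the parenthetical in the title) into a single vote. It is not the plugin's actual code: the rule tables, the weights, and the classifySketch function are all made up for illustration, and the real heuristics are far larger.

```javascript
// Hypothetical rule tables: NOT the plugin's real rules or weights
const infoboxRules = { person: 'Person', architect: 'Person', settlement: 'Place', company: 'Organization' }
const categoryRules = [
  [/\bplayers\b/i, 'Person'],
  [/\bcities\b/i, 'Place'],
  [/\bcompanies\b/i, 'Organization'],
]
const titleRules = { poet: 'Person', band: 'Organization', river: 'Place' }

function classifySketch({ title = '', infobox = '', categories = [] }) {
  const votes = {}
  const vote = (type, weight) => {
    votes[type] = (votes[type] || 0) + weight
  }

  // infobox name, e.g. 'Infobox architect' -> 'architect'
  const box = infobox.toLowerCase().replace(/^infobox\s+/, '')
  if (infoboxRules[box]) vote(infoboxRules[box], 0.6)

  // every category that matches a rule adds a smaller vote
  categories.forEach((cat) => {
    const hit = categoryRules.find(([reg]) => reg.test(cat))
    if (hit) vote(hit[1], 0.3)
  })

  // trailing parenthetical in the title, e.g. 'John Smith (poet)' -> 'poet'
  const paren = title.match(/\(([^)]+)\)\s*$/)
  const hint = paren ? titleRules[paren[1].toLowerCase()] : null
  if (hint) vote(hint, 0.3)

  // pick the type with the most weight, capping the score at 1
  const best = Object.entries(votes).sort((a, b) => b[1] - a[1])[0]
  if (!best) return { type: null, score: 0 }
  return { type: best[0], score: Math.min(best[1], 1) }
}

console.log(
  classifySketch({
    title: 'John Smith (poet)',
    infobox: 'Infobox architect',
    categories: ['Canadian Saxophone Players'],
  })
)
```

The point of the sketch is that no single signal is trusted on its own: an {{Infobox architect}} alone is ambiguous, but it becomes more convincing once a category and a title hint agree with it.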

Types:

Person:
  Athlete:
    AmericanFootballPlayer : true
    BaseballPlayer : true
    FootballPlayer : true
    BasketballPlayer : true
    HockeyPlayer : true
  Creator:
    Actor : true
    Musician : true
    Author : true
    Director : true
  Politician : true
Place:
  Jurisdiction:
    City : true
    Country : true
  Structure:
    Bridge : true
    Airport : true
  BodyOfWater : true
Organization:
  MusicalGroup : true
  Company : true
  SportsTeam : true
  PoliticalParty : true
  School : true
Event:
  Disaster : true
  Election : true
  MilitaryConflict : true
  SportsEvent : true
Creation:
  CreativeWork:
    Album : true
    Book : true
    Film : true
    TVShow : true
    Play : true
    Song : true
    VideoGame : true
  Product : true
  FictionalCharacter : true
Concept:
  MedicalCondition : true
  Organism : true
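
Since results come back as slash-joined strings like 'Organization/SportsTeam', it can be handy to treat the tree above as a set of paths. The sketch below uses a small excerpt of the tree. The listPaths and isA helpers are hypothetical, and it assumes that deeper types are reported as full paths (for example 'Person/Athlete/HockeyPlayer'), which the usage example above only confirms for two levels.

```javascript
// A small excerpt of the tree above, as a nested object
const tree = {
  Person: { Athlete: { BaseballPlayer: true, HockeyPlayer: true }, Politician: true },
  Organization: { MusicalGroup: true, SportsTeam: true },
}

// Flatten the tree into slash-joined paths, e.g. 'Person/Athlete/HockeyPlayer'
function listPaths(node, prefix = '') {
  return Object.keys(node).flatMap((key) => {
    const path = prefix ? `${prefix}/${key}` : key
    const child = node[key]
    return child === true ? [path] : [path, ...listPaths(child, path)]
  })
}

// True if `name` appears as any segment of the type path
function isA(type, name) {
  if (!type) return false
  return type.split('/').includes(name)
}

console.log(listPaths(tree))
console.log(isA('Organization/SportsTeam', 'Organization'))
```

Matching on whole path segments, rather than on substrings, avoids false hits between names that share a prefix.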

As of March 2020, it can classify ~65% of English Wikipedia articles:

    null: 37.71%
    People: 18.86%
    Place: 14.01%
    Organization: 8.27%
    CreativeWork: 5.38%
    Event: 4.57%
    Thing: 5.75%

i18n

It is trained on the English Wikipedia, but may also provide reasonable results in other languages.

It may help to first require wtf-plugin-i18n, which maps many templates to their English forms.

work-in-progress.

MIT

Collaborators

  • spencermountain