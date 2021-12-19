wtf_wikipedia

parse data from wikipedia

npm install wtf_wikipedia

by Spencer Kelly and many contributors



we put our information in places we can't take it out of.

and it's not just wikipedia, it's:

dj set-lists

mathematical proofs

e-sports rankings

dictionary information

    const wtf = require('wtf_wikipedia')
    wtf.fetch('Toronto Raptors').then((doc) => {
      let coach = doc.infobox().get('coach')
      coach.text()
      doc.sentences()[0].text()
    })

get clean plaintext:

    let str = `[[Greater_Boston|Boston]]'s [[Fenway_Park|baseball field]] has a {{convert|37|ft}} wall. <ref>Field of our Fathers: By Richard Johnson</ref>`
    wtf(str).text()

    let doc = await wtf.fetch('Glastonbury', 'en')
    doc.text()

get all the data from a page:

    let doc = await wtf.fetch('Whistling')
    doc.json()

the default .json() output is really verbose, but you can cherry-pick data by poking-around like this:

    doc.links().map((link) => link.json())
    doc.images()[0].json()
    doc.section('see also').link().json()

run it on the client-side:

    <script src="https://unpkg.com/wtf_wikipedia"></script>
    <script>
      wtf.fetch('Radiohead', { 'Api-User-Agent': 'Name your script here' }, function (err, doc) {
        let members = doc.infobox().get('current members')
        members.links().map((l) => l.page())
      })
    </script>

or in Deno/typescript/webpack:

import wtf from 'https://unpkg.com/wtf_wikipedia'

full wikipedia dumps

With this library, in conjunction with dumpster-dive, you can parse the whole english wikipedia in an afternoon.

npm install -g dumpster-dive

Tutorials

Plugins

these add all sorts of new functionality:

    // wtf.fetch() returns a Promise, so await the doc before calling plugin methods
    wtf.extend(require('wtf-plugin-classify'))
    let team = await wtf.fetch('Toronto Raptors')
    team.classify()

    wtf.extend(require('wtf-plugin-summary'))
    let film = await wtf.fetch('Pulp Fiction')
    film.summary()

    wtf.extend(require('wtf-plugin-person'))
    let person = await wtf.fetch('David Bowie')
    person.birthDate()

    wtf.extend(require('wtf-plugin-i18n'))
    let album = await wtf.fetch('Ziggy Stardust', 'fr')
    album.infobox().json()

Plugin

classify - person/place/thing

summary - short description text

person - birth/death information

api - fetch more data from the API

i18n - improves multilingual template coverage

wtf-mlb - fetch baseball data

wtf-nhl - fetch hockey data

nsfw - flag sexual/graphic/adult articles

image - additional methods for .images()

html - output html

wikitext - output wikitext

markdown - output markdown

latex - output latex

Ok first, 🛀

Wikitext is no small thing.

Consider:

the partial-implementation of inline-css,

nested elements do not honour the scope of other elements

the language has no errors

deep recursion of similar-syntax templates

the egyptian hieroglyphics syntax

'Birth_date_and_age' vs 'Birth-date_and_age'

the unexplained hashing scheme for image paths

the custom encoding of whitespace and punctuation

right-to-left values in left-to-right templates

PEG-based parsers struggle with wikitext's backtracking/lookarounds

there are 634,755 templates in en-wikipedia (as of Nov-2018)

there are a large number of pages that don't render properly on wikipedia, or its apps..

this library supports many recursive shenanigans, deprecated and obscure template variants, and illicit wiki-shorthands.
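to get a feel for the 'deep recursion' problem, here is a minimal sketch of finding top-level {{templates}} by counting brace depth. It is not this library's parser: findTemplates is a hypothetical helper, and it ignores things like nowiki tags and triple-brace parameters.

```javascript
// find top-level {{templates}} by counting brace depth,
// so that nested templates stay inside their parent
function findTemplates(wiki) {
  const found = []
  let depth = 0
  let start = -1
  for (let i = 0; i < wiki.length - 1; i += 1) {
    const pair = wiki[i] + wiki[i + 1]
    if (pair === '{{') {
      if (depth === 0) {
        start = i // remember where the outermost template opens
      }
      depth += 1
      i += 1 // skip the second brace
    } else if (pair === '}}' && depth > 0) {
      depth -= 1
      i += 1
      if (depth === 0) {
        found.push(wiki.slice(start, i + 1))
      }
    }
  }
  return found
}

console.log(findTemplates('{{a|{{b}}}} and {{c}}'))
// [ '{{a|{{b}}}}', '{{c}}' ]
```

a regex alone can't do this, since the closing braces of the inner and outer template look identical.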

What it does:

Detects and parses redirects and disambiguation pages

Parse infoboxes into a formatted key-value object

Handles recursive templates and links- like [[.. [[...]] ]]

Per-sentence plaintext and link resolution

Parse and format internal links

creates image thumbnail urls from File:XYZ.png filenames

Properly resolve dynamic templates like {{CURRENTMONTH}} and {{CONVERT ..}}

Parse images, headings, and categories

converts 'DMS-formatted' (59°12'7.7"N) geo-coordinates to lat/lng

parse and combine citation and reference metadata

Eliminate xml, latex, css, and table-sorting cruft
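as an illustration of the DMS conversion in the list above, here is a rough sketch of turning a degrees-minutes-seconds string into a decimal value. dmsToDecimal is a hypothetical helper written for this example, not the library's own code, and it only handles the simple form shown.

```javascript
// convert a 'DMS-formatted' coordinate into decimal degrees
// returns null if the string isn't in the expected shape
function dmsToDecimal(dms) {
  const reg = /^\s*(\d+(?:\.\d+)?)°\s*(?:(\d+(?:\.\d+)?)['′]\s*)?(?:(\d+(?:\.\d+)?)["″]\s*)?([NSEW])\s*$/i
  const m = dms.match(reg)
  if (!m) {
    return null
  }
  const deg = Number(m[1])
  const min = Number(m[2] || 0)
  const sec = Number(m[3] || 0)
  // south and west are negative
  const sign = /[SW]/i.test(m[4]) ? -1 : 1
  return sign * (deg + min / 60 + sec / 3600)
}

console.log(dmsToDecimal('59°12\'7.7"N').toFixed(4)) // '59.2021'
```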

What it doesn't do:

external 'transcluded' page data [1]

AST output

smart (or 'pretty') formatting of html in infoboxes or galleries [1]

maintain perfect page order [1]

per-sentence references (by 'section' element instead)

maintain template or infobox css styling

large tables that span different sections [1]

It is built to be as flexible as possible. In all cases, it tries to fail in considerate ways.

How about html scraping..?

Wikimedia's official parser turns wikitext ➔ HTML.

if you prefer this screen-scraping workflow, you can pluck at parts of a page like that.

that's cool!

getting structured data this way is still a complex, weird process. Manually spelunking the html is sometimes just as tricky and error-prone as scanning the wikitext itself.

The contributors to this library have come to that conclusion, as many others have.

This library has (lovingly) borrowed a lot of code and data from the parsoid project, and is grateful to those contributors.

enough chat.

flip your wikitext into a Doc object

    import wtf from 'wtf_wikipedia'
    let txt = `
    ==Wood in Popular Culture==
    * harry potter's wand
    * the simpson's fence
    `
    wtf(txt)

    let txt = `Whistling is featured in a number of television shows, such as [[Lassie (1954 TV series)|''Lassie'']], and the title theme for ''[[The X-Files]]''.`
    wtf(txt)
      .links()
      .map((l) => l.page())

returns nice plain-text of the article

    var txt = "[[Greater_Boston|Boston]]'s [[Fenway_Park|baseball field]] has a {{convert|37|ft}} wall.<ref>{{cite web|blah}}</ref>"
    wtf(txt).text()

a section is a heading '==Like This=='

    wtf(page).sections()[1].children()
    wtf(page).section('see also').remove()

    let s = wtf(page).sentences()[4]
    s.links()
    s.bolds()
    s.italics()

    // await the doc first, since wtf.fetch() returns a Promise
    let doc = await wtf.fetch('Whistling')
    doc.categories()

    let img = wtf(page).images()[0]
    img.url()
    img.thumbnail()
    img.format()

Fetch

You can grab and parse articles from any wiki api. This includes any language, any wiki-project, and most 3rd-party wikis.

    // a 3rd-party wiki
    let doc = await wtf.fetch('https://muppet.fandom.com/wiki/Miss_Piggy')
    // french wikipedia
    doc = await wtf.fetch('Tony Hawk', 'fr')
    doc.sentence().text()
    // an array of titles or page ids (awaited, since fetch returns a Promise)
    let docs = await wtf.fetch(['Whistling', 2983], { follow_redirects: false })
    // an article from german wikivoyage
    wtf.fetch('Toronto', { lang: 'de', wiki: 'wikivoyage' }).then((doc) => {
      console.log(doc.sentences()[0].text())
    })

you may also pass the wikipedia page id as parameter instead of the page title:

    let doc = await wtf.fetch(64646, 'de')

the fetch method follows redirects.
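for a sense of what a title-or-id lookup looks like underneath, here is a sketch of building a MediaWiki 'query' url. makeUrl is a hypothetical helper, and the exact parameters this library sends are an assumption here; only the MediaWiki api parameters themselves (titles, pageids, redirects) are standard.

```javascript
// build a MediaWiki 'query' url for either a page title or a numeric page id
function makeUrl(page, lang = 'en', wiki = 'wikipedia') {
  const url = new URL(`https://${lang}.${wiki}.org/w/api.php`)
  url.searchParams.set('action', 'query')
  url.searchParams.set('prop', 'revisions')
  url.searchParams.set('rvprop', 'content')
  url.searchParams.set('rvslots', 'main')
  url.searchParams.set('format', 'json')
  // ask the api to resolve redirects for us
  url.searchParams.set('redirects', 'true')
  // numbers are treated as page ids, strings as titles
  const key = typeof page === 'number' ? 'pageids' : 'titles'
  url.searchParams.set(key, String(page))
  return url.toString()
}

console.log(makeUrl(64646, 'de'))
console.log(makeUrl('Tony Hawk', 'fr'))
```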

API plugin

wtf.category(title, [lang], [options | callback])

retrieves all pages and sub-categories belonging to a given category:

    wtf.extend(require('wtf-plugin-api'))
    let result = await wtf.category('Category:Politicians_from_Paris')

wtf.random([lang], [options], [callback])

fetches a random wikipedia article, from a given language or domain

    wtf.extend(require('wtf-plugin-api'))
    wtf.random().then((doc) => {
      console.log(doc.title(), doc.categories())
    })

see wtf-plugin-api

Good practice:

The wikipedia api is pretty welcoming, though it recommends three things if you're going to hit it heavily:

pass an Api-User-Agent, so they can easily throttle bad scripts

bundle multiple pages into one request as an array (say, groups of 5?)

run it serially, or at least, slowly.

    wtf
      .fetch(['Royal Cinema', 'Aldous Huxley'], 'en', {
        'Api-User-Agent': 'spencermountain@gmail.com',
      })
      .then((docList) => {
        let links = docList.map((doc) => doc.links())
        console.log(links)
      })
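the 'bundle' and 'run it serially' advice can be combined. Here is one possible sketch: fetchInBatches is a hypothetical helper, and the fetching function is passed in, so nothing here is part of the library itself. The pause length is an arbitrary choice.

```javascript
// split a list of titles into groups (the text above suggests 5),
// and fetch each group one-after-another, pausing in between
async function fetchInBatches(fetchPages, titles, size = 5, pause = 1000) {
  const docs = []
  for (let i = 0; i < titles.length; i += size) {
    const batch = titles.slice(i, i + size)
    // one request per batch, awaited before the next one starts
    const result = await fetchPages(batch)
    docs.push(...result)
    // wait a little, unless this was the last batch
    if (i + size < titles.length) {
      await new Promise((resolve) => setTimeout(resolve, pause))
    }
  }
  return docs
}
```

to use it with this library, pass something like (batch) => wtf.fetch(batch, 'en', { 'Api-User-Agent': 'your contact here' }) as the first argument.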

API

.title() - get/set the title of the page from the first-sentence

.pageID() - get/set the wikimedia id of the page, if we have it.

.wikidata() - get/set the wikidata id of the page, if we have it.

.domain() - get/set the domain of the wiki we're on, if we have it.

.url() - (try to) generate the url for the current article

.lang() - get/set the current language (used for url method)

.namespace() - get/set the wikimedia namespace of the page, if we have it

.isRedirect() - if the page is just a redirect to another page

.redirectTo() - the page this redirects to

.isDisambiguation() - is this a placeholder page to direct you to one-of-many possible pages

.categories() - return all categories of the document

.sections() - return a list of the Document's sections

.paragraphs() - return a list of Paragraphs, in all sections

.sentences() - return a list of all sentences in the document

.images() - return all images found in the document

.links() - return a list of all links, in all parts of the document

.lists() - sections in a page where each line begins with a bullet point

.tables() - return a list of all structured tables in the document

.templates() - any type of structured-data elements, typically wrapped in like {{this}}

.infoboxes() - specific type of template, that appear on the top-right of the page

.references() - return a list of 'citations' in the document

.coordinates() - geo-locations that appear on the page

.text() - plaintext, human-readable output for the page

.json() - a 'stringifyable' output of the page's main data

.wikitext() - original wiki markup
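these methods combine naturally. As a small sketch, describe below is a hypothetical helper that uses only the Document methods listed above, and checks for redirects and disambiguation pages before reading any text. The shape of its return value is made up for this example.

```javascript
// summarize a Doc using only the methods listed above,
// handling redirects and disambiguation pages first
function describe(doc) {
  if (doc.isRedirect()) {
    return { type: 'redirect', to: doc.redirectTo() }
  }
  if (doc.isDisambiguation()) {
    return { type: 'disambiguation', title: doc.title() }
  }
  // a page may have no sentences at all
  const first = doc.sentences()[0]
  return {
    type: 'article',
    title: doc.title(),
    categories: doc.categories(),
    firstSentence: first ? first.text() : '',
  }
}
```

it takes any Doc, so describe(wtf(txt)) or describe(await wtf.fetch('Whistling')) should both work.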

Section

.title() - the name of the section, between ==these tags==

.index() - which number section is this, in the whole document.

.indentation() - how many steps deep into the table of contents it is

.sentences() - return a list of sentences in this section

.paragraphs() - return a list of paragraphs in this section

.links() - list of all links, in all paragraphs and templates

.tables() - list of all html tables

.templates() - list of all templates in this section

.infoboxes() - list of all infoboxes found in this section

.coordinates() - list of all coordinate templates found in this section

.lists() - list of all lists in this section

.interwiki() - any links to other language wikis

.images() - return a list of any images in this section

.references() - return a list of 'citations' in this section

.remove() - remove the current section from the document

.nextSibling() - a section following this one, under the current parent: eg. 1920s → 1930s

.lastSibling() - a section before this one, under the current parent: eg. 1930s → 1920s

.children() - any sections more specific than this one: eg. History → [PreHistory, 1920s, 1930s]

.parent() - the section, broader than this one: eg. 1920s → History

.text() - readable plaintext for this section

.json() - return all section data

.wikitext() - original wiki markup
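the parent/children relationships above follow from heading depth alone. Here is a sketch of how such a tree could be derived with a stack; linkSections and its input shape are assumptions made for this example, not the library's internals.

```javascript
// given headings in page-order, each with a depth (0 for a top-level heading),
// work out each section's parent and direct children
function linkSections(headings) {
  const sections = headings.map((h, index) => {
    return { title: h.title, depth: h.depth, index: index, parent: null, children: [] }
  })
  const stack = [] // the chain of currently-open ancestors
  sections.forEach((sec) => {
    // close any sections that are as deep, or deeper, than this one
    while (stack.length > 0 && stack[stack.length - 1].depth >= sec.depth) {
      stack.pop()
    }
    if (stack.length > 0) {
      sec.parent = stack[stack.length - 1]
      sec.parent.children.push(sec)
    }
    stack.push(sec)
  })
  return sections
}

const tree = linkSections([
  { title: 'History', depth: 0 },
  { title: 'PreHistory', depth: 1 },
  { title: '1920s', depth: 1 },
  { title: '1930s', depth: 1 },
  { title: 'See also', depth: 0 },
])
console.log(tree[0].children.map((s) => s.title))
// [ 'PreHistory', '1920s', '1930s' ]
```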

Paragraph

.sentences() - return a list of sentence objects in this paragraph

.references() - any citations, or references in all sentences

.lists() - any lists found in this paragraph

.images() - any images found in this paragraph

.links() - list of all links in all sentences

.interwiki() - any links to other language wikis

.text() - generate readable plaintext for this paragraph

.json() - generate some generic data for this paragraph in JSON format

.wikitext() - original wiki markup

Sentence

.links() - list of all links

.bolds() - list of all bold texts

.italics() - list of all italic formatted text

.json() - return all sentence data

.wikitext() - original wiki markup

Image

.url() - return url to full size image

.thumbnail() - return url to thumbnail (pass size to customize)

.links() - any links from the caption (if present)

.format() - get file format (e.g. jpg)

.json() - return some generic metadata for this image

.text() - does nothing

.wikitext() - original wiki markup
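about that 'hashing scheme for image paths' mentioned earlier: Wikimedia stores uploads in two folders taken from the md5 of the filename. The sketch below shows the idea for files hosted on commons. commonsPath is a hypothetical helper, and this is not necessarily how .url() builds its result; files stored on a local wiki live under a different folder.

```javascript
const nodeCrypto = require('crypto')

// build the direct upload url for a file hosted on wikimedia commons
function commonsPath(file) {
  // drop any 'File:' prefix, and use underscores for spaces
  let name = file.replace(/^(file|image):/i, '').trim().replace(/ /g, '_')
  // mediawiki capitalizes the first letter of a title
  name = name.charAt(0).toUpperCase() + name.slice(1)
  // the folders are the first one, and first two, hex characters of the md5
  const hash = nodeCrypto.createHash('md5').update(name, 'utf8').digest('hex')
  const folder = `${hash[0]}/${hash.slice(0, 2)}`
  return `https://upload.wikimedia.org/wikipedia/commons/${folder}/${encodeURIComponent(name)}`
}

console.log(commonsPath('File:Wood grain.png'))
```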

Template

.text() - does this template generate any readable plaintext?

.json() - get all the data for this template

.wikitext() - original wiki markup

Infobox

.links() - any internal or external links in this infobox

.keyValue() - generate simple key:value strings from this infobox

.image() - grab the main image from this infobox

.get() - lookup properties from their key

.template() - which infobox, eg 'Infobox Person'

.text() - generate readable plaintext for this infobox

.json() - generate some generic 'stringifyable' data for this infobox

.wikitext() - original wiki markup
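to show why infobox parsing is more than splitting on '|', here is a toy key-value parser that respects nested [[links]] and {{templates}}. Both function names are hypothetical, and unlike the library's .keyValue() this sketch leaves the values as raw wikitext.

```javascript
// split template parameters on '|', but not inside [[links]] or {{templates}}
function splitParams(body) {
  const parts = []
  let depth = 0
  let current = ''
  for (let i = 0; i < body.length; i += 1) {
    const pair = body.slice(i, i + 2)
    if (pair === '[[' || pair === '{{') {
      depth += 1
      current += pair
      i += 1
    } else if (pair === ']]' || pair === '}}') {
      depth -= 1
      current += pair
      i += 1
    } else if (body[i] === '|' && depth === 0) {
      parts.push(current)
      current = ''
    } else {
      current += body[i]
    }
  }
  parts.push(current)
  return parts
}

// turn '{{Name | key = value }}' into a template name and a key-value object
function keyValue(template) {
  const body = template.trim().replace(/^\{\{/, '').replace(/\}\}$/, '')
  const [name, ...params] = splitParams(body)
  const data = {}
  params.forEach((param) => {
    const eq = param.indexOf('=')
    if (eq !== -1) {
      data[param.slice(0, eq).trim()] = param.slice(eq + 1).trim()
    }
  })
  return { template: name.trim(), data: data }
}

console.log(keyValue('{{Infobox person | name = David Bowie | birth_place = [[Brixton|Brixton, London]] }}'))
```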

List

.lines() - get an array of each member of the list

.links() - get all links mentioned in this list

.text() - generate readable plaintext for this list

.json() - generate some generic easily-parsable data for this list

.wikitext() - original wiki markup

Reference

.title() - generate human-facing text for this reference

.links() - get any links mentioned in this reference

.text() - returns nothing

.json() - generate some generic metadata for this reference

.wikitext() - original wiki markup

Table

.links() - get any links mentioned in this table

.keyValue() - generate a simple list of key:value objects for this table

.text() - returns nothing

.json() - generate some useful metadata for this table

.wikitext() - original wiki markup

Configuration

Adding new methods:

you can add new methods to any class of the library, with wtf.extend()

    wtf.extend((models) => {
      // add a new method to every Doc
      models.Doc.prototype.isPerson = function () {
        return this.categories().find((cat) => cat.match(/people/))
      }
    })
    // await the doc first, since wtf.fetch() returns a Promise
    let doc = await wtf.fetch('Stephen Harper')
    doc.isPerson()

Adding new templates:

does your wiki use a {{foo}} template? Add a custom parser for it:

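    wtf.extend((models, templates) => {
      // a custom parser function
      templates.foo = (tmpl, list, parse) => {
        let obj = parse(tmpl)
        list.push(obj)
        return 'new-text'
      }
      // or, array-syntax to label the parameters (this replaces the function above)
      templates.foo = ['a', 'b', 'c']
      // or, number-syntax to return a parameter by its position
      templates.baz = 0
      // or, replace the template with a string
      templates.asterisk = '*'
    })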

you can determine which templates are understood to be 'infoboxes' with the 3rd parameter:

    wtf.extend((models, templates, infoboxes) => {
      Object.assign(infoboxes, { person: true, place: true, thing: true })
    })
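if the plugin pattern feels magic, it isn't. Here is a toy model of an extend() function: a plugin is just a function that receives the internals and mutates them. Everything below (the tiny Doc class, the empty registries) is a stand-in written for this example, not the library's source.

```javascript
// a stand-in for the library's Doc class
class Doc {
  constructor(categories) {
    this._categories = categories
  }
  categories() {
    return this._categories
  }
}

// the three things a plugin gets to touch
const models = { Doc: Doc }
const templates = {}
const infoboxes = {}

// a plugin is a function that receives the internals, and changes them in-place
function extend(plugin) {
  plugin(models, templates, infoboxes)
}

extend((m, t, i) => {
  m.Doc.prototype.isPerson = function () {
    return this.categories().some((cat) => /people/i.test(cat))
  }
  t.asterisk = '*'
  Object.assign(i, { person: true })
})

console.log(new Doc(['Living people']).isPerson()) // true
```

because the method lands on the prototype, every Doc created before or after the plugin was added gets it.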

3rd-party wikis

by default, a public API is provided by an installed mediawiki application. This means that most wikis have an open api, even if they don't realize it. Some wikis may turn this feature off.

It can usually be found by visiting http://mywiki.com/api.php

to fetch pages from a 3rd-party wiki:

    wtf.fetch('Kermit', { domain: 'muppet.fandom.com' }).then((doc) => {
      console.log(doc.text())
    })

some wikis will change the path of their API, from ./api.php to elsewhere. If your api has a different path, you can set it like so:

    wtf.fetch('2016-06-04_-_J.Fernandes_@_FIL,_Lisbon', { domain: 'www.mixesdb.com', path: 'db/api.php' }).then((doc) => {
      console.log(doc.template('player').json())
    })

for image-urls to work properly, the wiki should also have Special:Redirect enabled. Some wikis (like wikia) have intentionally disabled this.
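for context, Special:Redirect lets the wiki itself resolve a filename to its real location. A sketch of such a url follows; imageUrl is a hypothetical helper, and the /wiki/ path prefix is an assumption that holds for wikipedia but may differ on other wikis.

```javascript
// build an image url through Special:Redirect, so the wiki resolves the real path
function imageUrl(file, domain = 'en.wikipedia.org', width) {
  // drop any 'File:' prefix, and use underscores for spaces
  const name = file.replace(/^(file|image):/i, '').trim().replace(/ /g, '_')
  let url = `https://${domain}/wiki/Special:Redirect/file/${encodeURIComponent(name)}`
  // an optional width asks for a scaled-down thumbnail
  if (width) {
    url += `?width=${width}`
  }
  return url
}

console.log(imageUrl('File:Wood grain.png'))
// https://en.wikipedia.org/wiki/Special:Redirect/file/Wood_grain.png
```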

i18n and multi-language:

wikitext is (amazingly) used across all languages, wikis, and even in right-to-left languages. This parser actually does an okay job at it too.

Wikipedia i18n language information for Redirects, Infoboxes, Categories, and Images is included in the library, with pretty-decent coverage.

To improve coverage of i18n templates, use wtf-plugin-i18n

Please make a PR if you see something missing for your language.

this library ships separate client-side and server-side builds, to preserve filesize.

./wtf_wikipedia-client.js - with sourcemap

./wtf_wikipedia-client.mjs - as es-module (or Deno)

./wtf_wikipedia-client.min.js - for production

./wtf_wikipedia.js - main node build

./wtf_wikipedia.mjs - esmodule node (deno/typescript)

the browser version uses fetch() and the server version uses require('https').

It is not the fastest parser, and is very unlikely to beat a single-pass parser in C or Java.

Using dumpster-dive, this library can parse a full english wikipedia in around 4 hours on a macbook.

That's about 100 pages/second, per thread.
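as a back-of-envelope check on those numbers, the arithmetic can be written out. The 6,000,000 page count used below is an assumed round figure for illustration, not a number from this project, and both function names are hypothetical.

```javascript
// how many pages one thread gets through, at a steady rate
function pagesPerThread(hours, pagesPerSecond = 100) {
  return hours * 60 * 60 * pagesPerSecond
}

// how many threads are needed to finish a dump of a given size, in time
function threadsNeeded(totalPages, hours, pagesPerSecond = 100) {
  return Math.ceil(totalPages / pagesPerThread(hours, pagesPerSecond))
}

console.log(pagesPerThread(4)) // 1440000
console.log(threadsNeeded(6000000, 4)) // 5
```

so at the quoted rate, a 4-hour run implies a handful of threads working in parallel.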

MIT