A Latin-script language parser for retext producing nlcst nodes.
Whether it is Old English (“þā gewearþ þǣm hlāforde and þǣm hȳrigmannum wiþ ānum
penninge”), Icelandic (“Hvað er að frétta”), or French (“Où sont les toilettes?”),
parse-latin does a good job at tokenizing it.
Note also that parse-latin does a decent job at tokenizing Latin-like scripts,
such as Cyrillic (“Добро пожаловать!”), Georgian (“როგორა ხარ?”), and Armenian
(“Շատ հաճելի է”).
This package is ESM only: Node 12+ is needed to use it and it must be imported
instead of required.
npm:
npm install parse-latin
import inspect from 'unist-util-inspect'
import {ParseLatin} from 'parse-latin'
const tree = new ParseLatin().parse('A simple sentence.')
console.log(inspect(tree))
Inspecting the tree yields:
RootNode[1] (1:1-1:19, 0-18)
└─ ParagraphNode[1] (1:1-1:19, 0-18)
   └─ SentenceNode[6] (1:1-1:19, 0-18)
      ├─ WordNode[1] (1:1-1:2, 0-1)
      │  └─ TextNode: "A" (1:1-1:2, 0-1)
      ├─ WhiteSpaceNode: " " (1:2-1:3, 1-2)
      ├─ WordNode[1] (1:3-1:9, 2-8)
      │  └─ TextNode: "simple" (1:3-1:9, 2-8)
      ├─ WhiteSpaceNode: " " (1:9-1:10, 8-9)
      ├─ WordNode[1] (1:10-1:18, 9-17)
      │  └─ TextNode: "sentence" (1:10-1:18, 9-17)
      └─ PunctuationNode: "." (1:18-1:19, 17-18)
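The tree above is plain data, so it can be walked without extra tooling. What follows is a minimal sketch, not part of parse-latin: it hand-writes an equivalent tree (positions omitted) and assumes the shape suggested by the output, where parent nodes carry `children` and text, white space, and punctuation nodes carry `value`. The helper names `toText` and `collect` are hypothetical.

```javascript
// Hand-written equivalent of the inspected tree (positions omitted).
const tree = {
  type: 'RootNode',
  children: [
    {
      type: 'ParagraphNode',
      children: [
        {
          type: 'SentenceNode',
          children: [
            {type: 'WordNode', children: [{type: 'TextNode', value: 'A'}]},
            {type: 'WhiteSpaceNode', value: ' '},
            {type: 'WordNode', children: [{type: 'TextNode', value: 'simple'}]},
            {type: 'WhiteSpaceNode', value: ' '},
            {type: 'WordNode', children: [{type: 'TextNode', value: 'sentence'}]},
            {type: 'PunctuationNode', value: '.'}
          ]
        }
      ]
    }
  ]
}

// Concatenate the values of all leaf nodes, depth first.
function toText(node) {
  if (typeof node.value === 'string') return node.value
  return (node.children || []).map(toText).join('')
}

// Collect every node of the given type, depth first.
function collect(node, type, found = []) {
  if (node.type === type) found.push(node)
  for (const child of node.children || []) collect(child, type, found)
  return found
}

const words = collect(tree, 'WordNode').map(toText)

console.log(toText(tree)) // A simple sentence.
console.log(words) // [ 'A', 'simple', 'sentence' ]
```

Because every leaf carries its own `value`, concatenating the leaves reproduces the original input exactly, white space and punctuation included.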
This package exports the following identifier: ParseLatin.
There is no default export.
ParseLatin(value)
Exposes the functionality needed to tokenize natural Latin-script languages into
a syntax tree.
If value is passed here, it does not need to be passed to #parse().
ParseLatin#tokenize(value)
Tokenize value (string) into letters and numbers (words), white space, and
everything else (punctuation).
The returned nodes are a flat list without paragraphs or sentences.
Returns Array.<Node>: the nodes.
ParseLatin#parse(value)
Tokenize value (string) into an nlcst tree.
The returned node is a RootNode that contains paragraphs and sentences.
Returns Node: the root node.
Note: The easiest way to see how parse-latin tokenizes and parses is to use the online parser demo, which shows the syntax tree corresponding to the typed text.
parse-latin splits text into white space, word, and punctuation tokens.
It starts out with a simple definition, one that most other tokenizers use:
a word is one or more letters or numbers, white space is one or more white space
characters, and punctuation is one or more of anything else.
Then, it manipulates and merges those tokens into an (nlcst) syntax tree, adding sentences and paragraphs where needed.
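The starting definition can be sketched in a few lines. This is an illustration under stated assumptions, not parse-latin's actual implementation: the function name `naiveTokenize`, the `{type, value}` token shape, and the use of Unicode property escapes for "letters and numbers" are all choices made for this example.

```javascript
// Naive three-class tokenizer: words, white space, and everything else.
// `\p{L}` and `\p{N}` match Unicode letters and numbers (requires the `u` flag).
const pattern = /([\p{L}\p{N}]+)|(\s+)|([^\p{L}\p{N}\s]+)/gu

function naiveTokenize(value) {
  const tokens = []
  for (const match of value.matchAll(pattern)) {
    let type = 'punctuation'
    if (match[1] !== undefined) type = 'word'
    else if (match[2] !== undefined) type = 'whitespace'
    tokens.push({type, value: match[0]})
  }
  return tokens
}

console.log(naiveTokenize('A simple sentence.').map((token) => token.type))
// [ 'word', 'whitespace', 'word', 'whitespace', 'word', 'punctuation' ]

console.log(naiveTokenize('non-profit').map((token) => token.value))
// [ 'non', '-', 'profit' ]
```

The second call shows why the merging step matters: under the naive definition alone, a hyphenated word falls apart into three tokens.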
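Along the way it handles many exceptions.
Some punctuation marks are part of the word they occur in: non-profit, she’s, G.I., 11:00, N/A, &c, nineteenth- and…
Some full stops do not mark the end of a sentence: 1., e.g., id.
A sentence end does not always fall directly after the terminal mark, as with .) and ."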
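To make the idea of merging in-word punctuation concrete, here is a minimal sketch of one such rule: a single punctuation character between two words is folded into one word. It is not parse-latin's algorithm. The function name `mergeInnerWord`, the `{type, value}` token shape, and the character set are assumptions for illustration, and cases such as G.I., &c, or a trailing hyphen would need further rules that are not shown.

```javascript
// Characters this sketch treats as in-word punctuation (an assumed, incomplete set).
const innerWord = new Set(['-', '’', "'", '/', ':'])

// Merge `word, punctuation, word` runs into a single word when the punctuation
// is one in-word character; every other token is passed through unchanged.
function mergeInnerWord(tokens) {
  const result = []
  for (let index = 0; index < tokens.length; index++) {
    const token = tokens[index]
    const previous = result[result.length - 1]
    const next = tokens[index + 1]
    if (
      token.type === 'punctuation' &&
      innerWord.has(token.value) &&
      previous && previous.type === 'word' &&
      next && next.type === 'word'
    ) {
      // Fold the punctuation and the following word into the previous word.
      result[result.length - 1] = {
        type: 'word',
        value: previous.value + token.value + next.value
      }
      index++ // Skip the word that was just merged.
    } else {
      result.push(token)
    }
  }
  return result
}

const merged = mergeInnerWord([
  {type: 'word', value: 'non'},
  {type: 'punctuation', value: '-'},
  {type: 'word', value: 'profit'},
  {type: 'whitespace', value: ' '},
  {type: 'word', value: '11'},
  {type: 'punctuation', value: ':'},
  {type: 'word', value: '00'}
])

console.log(merged.map((token) => token.value)) // [ 'non-profit', ' ', '11:00' ]
```

Because the merged word replaces the previous entry in `result`, chains such as a-b-c collapse into one word in a single pass.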