readability-cli takes any HTML page and strips out unnecessary bloat using Mozilla's Readability library. As a result, you get a web page that contains only the core content and nothing more. The resulting HTML is suitable for terminal browsers, text readers, and other minimal environments.
Here is a before-and-after comparison, using an article from The Guardian as a test subject.
So much useless stuff that the main article does not even fit on the screen!
Ah, much better.
readability-cli can be installed on any system with Node.js:
npm install -g readability-cli
Arch Linux users may use the readability-cli AUR package instead.
readable [SOURCE] [options]
readable [options] -- [SOURCE]
SOURCE is a file, an http(s) URL, or '-' for standard input
Run readable --help for more information.
Read HTML from a file and output the result to the console:
readable page.html
Fetch a random Wikipedia article, get its title and an excerpt:
readable https://en.wikipedia.org/wiki/Special:Random -p title,excerpt
Fetch a web page and read it in W3M:
readable https://www.nytimes.com/2020/01/18/technology/clearview-privacy-facial-recognition.html | w3m -T text/html
Download a web page using cURL, parse it and output as JSON:
curl https://github.com/mozilla/readability | readable --base=https://github.com/mozilla/readability --json
It's a good idea to supply the --base parameter when piping input; otherwise readable won't know the document's URL, and things like relative links won't resolve correctly.
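The usage synopsis above also lists '-' as an explicit standard-input source, which none of the examples show. A minimal sketch, assuming a locally saved page named article.html that was originally downloaded from example.com (both names are placeholders):

```shell
# Pipe a saved page through readable, naming stdin explicitly with '-'.
# --base tells readable the document's original URL so that relative
# links and images in the output resolve against the original site.
cat article.html | readable --base=https://example.com/article -
```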