Convert a blog article into a clean Markdown text file.
For example, this article:
Is converted into this text file:
$ clean-mark "http://some-website.com/fancy-article"
The article is named automatically from the URL path name. In the case above, the file will be named
fancy-article.md.
The file type can be specified:
$ clean-mark "http://some-website.com/fancy-article" -t html
The available types are: HTML, TEXT and Markdown.
The output file and path can also be specified:
$ clean-mark "http://some-website.com/fancy-article" -o /tmp/article
In that case, the output will be /tmp/article.md. The extension is added automatically.
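The naming rules described above can be sketched roughly as follows. This is a hypothetical illustration, not clean-mark's actual source: the function name `outputFileName`, the `options` object, the `.html` and `.txt` extensions, and the `index` fallback are all assumptions. Only the `.md` results come from the examples above.

```javascript
// Hypothetical sketch of the naming rules; NOT clean-mark's real implementation.
const path = require('path')

// Assumed mapping from the -t value to a file extension.
// Only ".md" is confirmed by the examples above; the others are guesses.
const EXTENSIONS = { markdown: '.md', html: '.html', text: '.txt' }

function outputFileName (articleUrl, options = {}) {
  const type = (options.type || 'markdown').toLowerCase()
  const ext = EXTENSIONS[type]
  if (!ext) throw new Error(`Unknown type: ${options.type}`)

  // An explicit output path (the -o flag) takes priority; the extension is appended.
  if (options.output) return options.output + ext

  // Otherwise, use the last segment of the URL path as the base name.
  const { pathname } = new URL(articleUrl)
  const base = path.posix.basename(pathname) || 'index' // assumed fallback for "/"
  return base + ext
}

const url = 'http://some-website.com/fancy-article'
console.log(outputFileName(url)) // fancy-article.md
console.log(outputFileName(url, { type: 'html' })) // fancy-article.html (assumed extension)
console.log(outputFileName(url, { output: '/tmp/article' })) // /tmp/article.md
```

The sketch treats an explicit output path as taking priority over the URL-derived name, which matches the `-o` example above; how the real tool handles edge cases such as a URL with an empty path is not stated here.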
Simply install with npm:
$ npm install clean-mark --global
Implementation steps:
This project depends on the A-Extractor project, a database of expressions used for extracting content from blogs and articles.
The goals of the project are:
Clean-mark was tested on all major news sites. On some websites, the text or links are cut from the article. In that case, you have to edit the resulting text manually.
Please also raise an issue on A-Extractor with the link that doesn't work, and we'll add it to the database so that the text is extracted correctly next time.
Also, see how to contribute.
A massive list of Awesome Web Archiving tools: https://github.com/iipc/awesome-web-archiving
MIT © Cristi Constantin.