HTML cleaner and beautifier

Do you have crappy HTML? I do!

<table width="100%" border="0" cellspacing="0" cellpadding="0"> <tr> <td height="31"> <b>Currently we have these articles available:</b> <blockquote> <p> <a href="foo.html">The History of Foo</a> <br /> An <span color="red">informative</span> piece of <font face="arial">information</font>. </p> <p> <A HREF="bar.html">A Horse Walked Into a Bar</A> <br /> The bartender said "Why the long face?" </p> </blockquote> </td> </tr> </table>

Just look at those blank lines and random line breaks, trailing spaces, mixed tabs, deprecated tags - it's outrageous!

Let's clean it up:

var cleaner = require('clean-html'),
    fs = require('fs'),
    filename = process.argv[2];

// Read the file named on the command line, then print the cleaned markup.
fs.readFile(filename, 'utf8', function (err, data) {
    if (err) throw err;
    cleaner.clean(data, function (html) {
        console.log(html);
    });
});

Running this script on the file above produces the following output:

<table>
  <tr>
    <td>
      <b>Currently we have these articles available:</b>
      <blockquote>
        <p>
          <a href="foo.html">The History of Foo</a>
          <br>
          An <span>informative</span> piece of information.
        </p>
        <p>
          <a href="bar.html">A Horse Walked Into a Bar</a>
          <br>
          The bartender said "Why the long face?"
        </p>
      </blockquote>
    </td>
  </tr>
</table>

You can pass additional options to the clean function like this:

var options = {
    'add-remove-tags': ['table', 'tr', 'td', 'blockquote']
};

cleaner.clean(data, options, function (html) {
    console.log(html);
});

In this case, it produces:

<b>Currently we have these articles available:</b>
<p>
  <a href="foo.html">The History of Foo</a>
  <br>
  An <span>informative</span> piece of information.
</p>
<p>
  <a href="bar.html">A Horse Walked Into a Bar</a>
  <br>
  The bartender said "Why the long face?"
</p>

Sanity restored!

Options

break-around-comments

Adds line breaks before and after comments.

Type: Boolean

Default: true

break-around-tags

Tags that should have line breaks added before and after.

Type: Array

Default: ['body', 'blockquote', 'br', 'div', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'head', 'hr', 'link', 'meta', 'p', 'table', 'title', 'td', 'tr']

indent

The string to use for indentation, e.g. a tab character or one or more spaces.

Type: String

Default: '  ' (two spaces)

remove-attributes

Attributes to remove from markup.

Type: Mixed Array (strings or RegExp pattern)

Default: ['align', 'bgcolor', 'border', 'cellpadding', 'cellspacing', 'color', 'height', 'target', 'valign', 'width']
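To illustrate what a mixed array of strings and RegExp patterns can express, here is a minimal sketch of one way such a list could be matched against attribute names. The `matchesAny` helper and the matching rules (exact comparison for strings, `test` for patterns) are assumptions made for illustration, not the package's actual code.

```javascript
// Hypothetical matcher: true if `name` equals a string entry
// or is matched by a RegExp entry.
function matchesAny(name, patterns) {
    return patterns.some(function (pattern) {
        return pattern instanceof RegExp ? pattern.test(name) : pattern === name;
    });
}

// A mixed list: two exact names plus a pattern for inline event handlers.
var removeAttributes = ['border', 'width', /^on[a-z]+$/];

console.log(matchesAny('width', removeAttributes));   // exact string entry
console.log(matchesAny('onclick', removeAttributes)); // matched by the pattern
console.log(matchesAny('href', removeAttributes));    // not in the list
```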

remove-comments

Removes comments.

Type: Boolean

Default: false

remove-empty-tags

Tags to remove from markup if empty.

Type: Mixed Array (strings or RegExp pattern)

Default: []

remove-tags

Tags to always remove from markup. Nested content is preserved.

Type: Mixed Array (strings or RegExp pattern)

Default: ['center', 'font']

replace-nbsp

Replaces non-breaking white space entities (&nbsp;) with regular spaces.

Type: Boolean

Default: false

wrap

The column number where lines should wrap. Set to 0 to disable line wrapping.

Type: Integer

Default: 120
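As a small illustration of combining several of the options above, the fragment below builds an options object that could be passed as the second argument to `cleaner.clean(data, options, callback)`. The particular values are arbitrary choices for the example.

```javascript
// Options object intended for cleaner.clean(data, options, callback).
var options = {
    'indent': '\t',          // indent with a tab instead of two spaces
    'wrap': 0,               // 0 disables line wrapping
    'remove-comments': true  // strip comments from the output
};

console.log(options);
```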

Adding values to option lists

These options exist for your convenience: each one adds values to the corresponding default list instead of replacing it.

add-break-around-tags

Additional tags to include in break-around-tags.

Type: Array

Default: null

add-remove-attributes

Additional attributes to include in remove-attributes.

Type: Array

Default: null

add-remove-tags

Additional tags to include in remove-tags.

Type: Array

Default: null
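Conceptually, each of these options appends to a default list rather than overwriting it. The sketch below shows that idea with a hypothetical `mergeList` helper; it is an illustration only and is not taken from the package's source.

```javascript
// Hypothetical helper: combine a default list with optional additions.
// A null or missing `additions` value leaves the defaults unchanged.
function mergeList(defaults, additions) {
    return additions ? defaults.concat(additions) : defaults.slice();
}

var defaultRemoveTags = ['center', 'font'];

// Roughly what passing 'add-remove-tags': ['table', 'tr'] is meant to achieve.
var removeTags = mergeList(defaultRemoveTags, ['table', 'tr']);

console.log(removeTags);
```

By contrast, setting the base list directly would presumably drop the defaults unless they are repeated in the new value.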

Global installation

If this package is installed globally, it can be used from the command line:

$ cat crappy.html | clean-html

Instead of piping the input from another program, you can supply a filename as the first argument:

$ clean-html crappy.html

You can redirect the output to another file:

$ clean-html crappy.html > clean.html

Or you can edit the file in place:

$ clean-html crappy.html --in-place

All of the options above can be used from the command line. Array option values should be separated by commas:

$ clean-html crappy.html --add-remove-tags b,i,u
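For comparison with the programmatic API, the sketch below shows how a comma-separated command-line value corresponds to an array in the options object. The `parseList` helper is hypothetical and only illustrates the mapping; the command-line tool's own parsing may differ.

```javascript
// Hypothetical helper: turn a comma-separated CLI value into an array,
// trimming whitespace and dropping empty entries.
function parseList(value) {
    return value.split(',').map(function (item) {
        return item.trim();
    }).filter(function (item) {
        return item.length > 0;
    });
}

// Options object corresponding to: --add-remove-tags b,i,u
var options = { 'add-remove-tags': parseList('b,i,u') };

console.log(options);
```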

Boolean options can be set to true like this:

$ clean-html crappy.html --remove-comments

Or like this:

$ clean-html crappy.html --remove-comments true

They can be set to false like this: