Date: 2020-04-02
😍 JSON Syntax: input JSON, output the same structured JSON filled with the scraped data
🌈 Simple patterns: simple inline selectors, extractors, filters and parser.
💪 Reliable & fast: used in production within crawlers
let cheerio = require('cheerio')
let $ = cheerio.load(`
<body>
<h1>I love jsonframe!</h1>
<span itemprop="email"> Email: gabin@datascraper.pro </span>
</body>`)
let jsonframe = require('jsonframe-cheerio')
jsonframe($) // initializing the plugin
let frame = {
"title": "h1", // this is an inline selector
"email": "span[itemprop=email] < email" // output an extracted email
}
console.log( $('body').scrape(frame, { string: true } ))
/*=>
{
"title": "I love jsonframe!",
"email": "gabin@datascraper.pro"
}
*/
Install the plugin to your Node.js app through NPM
npm i jsonframe-cheerio --save
Start by loading Cheerio.
let cheerio = require('cheerio')
let $ = cheerio.load("HTML DOM to load") // See Cheerio API
Then load the jsonframe plugin.
let jsonframe = require('jsonframe-cheerio') // require from npm package
jsonframe($) // apply the plugin to the current Cheerio instance
Once the plugin is loaded, you first have to set the frame of your data.
Let's take the following HTML example:
<html>
<head></head>
<body>
<h2>Pricing</h2>
<img class="picture" src="somepath/to/image.png">
<a class="mainLink" href="some/url/to/somewhere">A Link</a>
<span class="date"> We are the 04/02/2017</span>
<div class="popup"><span>Some inner content</span></div>
<ul id="pricing" class="menu">
<li class="item">
<span class="planName">Hacker</span>
<span class="planPrice" price="0">Free</span>
<a href="/hacker"> <img src="./img/hacker.png"> </a>
</li>
<li class="item">
<span class="planName">Pro</span>
<span class="planPrice" price="39.00">$39</span>
<a href="/pro"> <img src="./img/pro.png"> </a>
</li>
</ul>
<div id="contact">
<span itemprop="usaphone">Phone USA: (912) 148-456</span>
<span itemprop="frphone">Phone FR: +332 38 30 37 90</span>
<span itemprop="email">Email: example@google.net</span>
</div>
</body>
</html>
Scraping is done with $(selector).scrape(frame, {options}), where:
selector is defined in Cheerio's documentation
frame is a JSON or JavaScript object
{options} are detailed later in their own section
let frame = {
"title": "h2" // CSS selector
}
We then pass the frame to the function:
let result = $('body').scrape(frame, { string: true })
console.log( result )
//=> {"title": "Pricing"}
The most common selector is declared inline, by specifying nothing more than the data property name and the selector as its value.
...
let frame = { "title": "h2" }
let result = $('body').scrape(frame, { string: true })
console.log( result )
/* output =>
{ "title": "Pricing" }
*/
...
You can now declare everything inline. Just be careful to always use them in the following order when combining them:
@ (attribute), < (extractor), | (filter), || (parser).
See the examples for each of them below.
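To make the ordering concrete, here is a minimal sketch of how an inline string could be split into its parts. `splitInline` is a hypothetical helper written for illustration, not the plugin's actual parser, and it assumes the order selector, `@` attribute, `<` extractor, `|` filters, `||` regex that the examples in this README follow.

```javascript
// Illustrative only: a hypothetical helper, not jsonframe's real parser.
// Assumed order: selector @ attribute < extractor | filters || regex
function splitInline(inline) {
  // Handle "||" before "|" so the regex part is not mistaken for a filter.
  const [beforeParser, ...parserParts] = inline.split('||');
  const parser = parserParts.length ? parserParts.join('||').trim() : null;

  // Everything after the first "|" is a space-separated list of filters.
  const [beforeFilters, ...filterParts] = beforeParser.split('|');
  const filters = filterParts.join(' ').split(/\s+/).filter(Boolean);

  const [beforeExtractor, extractorPart] = beforeFilters.split('<');
  const extractor = extractorPart ? extractorPart.trim() : null;

  const [selectorPart, attributePart] = beforeExtractor.split('@');
  const attribute = attributePart ? attributePart.trim() : null;

  return { selector: selectorPart.trim(), attribute, extractor, filters, parser };
}

console.log(splitInline('.planPrice @ price'));
console.log(splitInline('[itemprop=email] < email | uppercase'));
console.log(splitInline('.date || \\d{1,2}/\\d{1,2}/\\d{2,4}'));
```

The sketch does not cope with selectors that themselves contain `|` (for example `[lang|=en]`); for such cases the long form with `_s`, `_a` and `_p` avoids any ambiguity.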
_a: "attributeName" allows you to retrieve any attribute data
@ inside the selector _s allows you to do it inline
...
let frame = {
"proPrice": ".planName:contains('Pro') + span@price"
}
let result = $('body').scrape(frame, { string: true })
console.log( result )
/* output =>
{ "proPrice": "39.00" }
*/
...
< inside the selector _s allows you to do it inline
It currently supports email, telephone (also phone), date, fullName (or firstName, lastName, initials, suffix, salutation) and html (to get the inner html). By default (no declaration), we get the inner text.
...
let frame = {
"email": "[itemprop=email] < email",
"frphone": "[itemprop=frphone] < phone"
}
let result = $('body').scrape(frame, { string: true })
console.log( result )
/* output =>
{
"email": "example@google.net",
"frphone": "33238303790"
}
*/
...
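As a rough picture of what an extractor does with the inner text of the selected element, the sketch below pulls an email address and a digits-only phone number out of a string. Both functions are hypothetical stand-ins, not the plugin's code: the email pattern is the one listed in the changelog (used here without the `g` and `m` flags), and the digits-only phone rule is an assumption that happens to agree with the `frphone` output shown above.

```javascript
// Hypothetical stand-ins for the "email" and "phone" extractors.
// The real plugin may normalise values differently.
const EMAIL_PATTERN =
  /([a-zA-Z0-9._-]{0,30}@[a-zA-Z0-9._-]{0,15}\.[a-zA-Z0-9._-]{0,15})/i;

function extractEmail(text) {
  const match = text.match(EMAIL_PATTERN);
  return match ? match[1] : null; // null when no address is present
}

function extractPhoneDigits(text) {
  // Assumption: keep digits only, dropping "+", spaces and punctuation.
  const digits = text.replace(/\D/g, '');
  return digits.length > 0 ? digits : null;
}

console.log(extractEmail(' Email: gabin@datascraper.pro ')); // gabin@datascraper.pro
console.log(extractPhoneDigits('Phone FR: +332 38 30 37 90')); // 33238303790
```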
| inside the selector _s allows you to do it inline
It currently supports trim (remove spaces at beginning and end), lowercase or lcase, uppercase or ucase, capitalize or cap, words or w, noescapchar or nec, compact or cmp and number or nb.
...
let frame = {
"email1": "[itemprop=email] < email | uppercase",
"email2": "[itemprop=email] < email | capitalize"
}
let result = $('body').scrape(frame, { string: true })
console.log( result )
/* output =>
{
"email1": "EXAMPLE@GOOGLE.NET",
"email2": "EXAMPLE GOOGLE NET"
}
*/
...
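Filters are applied one after another to the extracted value. The sketch below shows that chaining with a small lookup table. It is an assumption-laden illustration, not the plugin's implementation: only the filters whose behaviour is unambiguous from their names (trim and the case filters, with their short aliases) are included, and `applyFilters` is a hypothetical name.

```javascript
// Hypothetical filter table: names and aliases come from the list above,
// the implementations are assumptions based on those names.
const FILTERS = new Map([
  ['trim', (s) => s.trim()],
  ['lowercase', (s) => s.toLowerCase()],
  ['lcase', (s) => s.toLowerCase()],
  ['uppercase', (s) => s.toUpperCase()],
  ['ucase', (s) => s.toUpperCase()],
]);

// Apply the named filters from left to right.
function applyFilters(value, names) {
  return names.reduce((current, name) => {
    const filter = FILTERS.get(name);
    if (!filter) throw new Error(`Unknown filter: ${name}`);
    return filter(current);
  }, value);
}

console.log(applyFilters('  Example@Google.net ', ['trim', 'ucase'])); // EXAMPLE@GOOGLE.NET
```

Keeping the aliases in the same table means a short name such as `ucase` and its long form always stay in sync.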
|| inside the selector _s allows you to use regexes inline
_p: /regex/ allows you to extract data based on regular expressions
...
let frame = {
"date": ".date || \\d{1,2}/\\d{1,2}/\\d{2,4}"
}
// or use the longer version for proper regex entry
frame = {
"date": {
_s: ".date",
_p: /\d{1,2}\/\d{1,2}\/\d{2,4}/ // n[n]/n[n]/nn[nn] format here
}
}
let result = $('body').scrape(frame, { string: true })
console.log( result )
/* output =>
{
"date": "04/02/2017"
}
*/
...
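The doubled backslashes in the inline form come from JavaScript string escaping, not from the plugin. The short check below uses only the standard `RegExp` constructor to show that the string form and the regex literal match the same text. How the plugin builds its regex internally is not shown here; the sample text is the content of the `.date` span above.

```javascript
// Inner text of the ".date" span from the sample HTML.
const text = ' We are the 04/02/2017';

// In a string, "\\d" is needed to produce the two characters \d.
// "/" needs no escaping in a string, only in a regex literal.
const fromString = new RegExp('\\d{1,2}/\\d{1,2}/\\d{2,4}');
const fromLiteral = /\d{1,2}\/\d{1,2}\/\d{2,4}/;

console.log(text.match(fromString)[0]);  // 04/02/2017
console.log(text.match(fromLiteral)[0]); // 04/02/2017
```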
_d: [{ }] allows you to get an array / list of data
_d: ["selector"] retrieves a list based on the selector between the quotes.
_d: ["firstSelector", "secondSelector"] works too and merges the results into one array
You can shorten it even more by listing right from the selector as follows:
"selectorName": [".selector"] which returns an array of strings
...
let frame = {
"pricing": {
_s: "#pricing .item",
_d: [{
"name": ".planName",
"price": ".planPrice"
}]
}
}
let result = $('body').scrape(frame, { string: true })
console.log( result )
/* output =>
{
"pricing": [
{
"name": "Hacker",
"price": "Free"
},
{
"name": "Pro",
"price": "$39"
}
]
}
*/
// Or a shorter way which works for simple string arrays
frame = {
"pricingNames": ["#pricing .item .planName"]
}
result = $('body').scrape(frame, { string: true })
console.log( result )
/* output =>
{
"pricingNames": ["Hacker", "Pro"]
}
*/
...
"_g": { _s: "", _d: {} } allows you to group some data selectors by a parent selector without naming the parent. You can also extend the group property name to add some meaning or simply to have several groups at the same level.
The group property name must be _g or _group, optionally followed by _ and whatever string you want.
ex: _g_head : {} or _g_body : {}
...
let frame = {
_g: {
_s: "#pricing .item",
_d: {
"name": ".planName",
"price": ".planPrice"
}
},
_g_second: {
_s: "#pricing .item",
_d: {
"secondName": ".planName",
"secondPrice": ".planPrice"
}
}
}
let result = $('body').scrape(frame, { string: true })
console.log( result )
/* output =>
{
"name": "Hacker",
"price": "Free",
"secondName": "Hacker",
"secondPrice": "Free"
}
*/
...
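One way to picture the grouping is that the fields of each group end up directly on the parent object while the group key itself disappears. The sketch below reproduces that flattening on plain objects. `mergeGroups` and its key pattern are hypothetical, written only to mirror the naming rule and the output shown above, and say nothing about how the plugin does it internally.

```javascript
// Matches _g, _group, _g_anything and _group_anything (assumed naming rule).
const GROUP_KEY = /^_(g|group)(_.+)?$/;

// Hypothetical helper: lift the fields of every group onto the parent object.
function mergeGroups(scraped) {
  const out = {};
  for (const [key, value] of Object.entries(scraped)) {
    if (GROUP_KEY.test(key)) {
      Object.assign(out, value); // group fields land on the parent
    } else {
      out[key] = value; // ordinary properties are kept as they are
    }
  }
  return out;
}

console.log(mergeGroups({
  _g: { name: 'Hacker', price: 'Free' },
  _g_second: { secondName: 'Hacker', secondPrice: 'Free' },
}));
```

Because the fields are merged into one object, two groups that use the same property name would overwrite each other in this sketch, which is a good reason to give each group distinct property names.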
"parent": { _s: "parentSelector", _d: {} } allows you to segment your data by setting a parent section from which the child data will be scraped.
You can also use "parent": { } when you only want to nest data into objects without setting a parent selector.
...
let frame = {
"pricing": {
_s: "#pricing .item",
_d: {
"name": ".planName",
"price": ".planPrice"
}
}
}
let result = $('body').scrape(frame, { string: true })
console.log( result )
/* output =>
{
"pricing":{
"name": "Hacker",
"price": "Free"
}
}
*/
...
Note here that we get the first returned result (#pricing .item).
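The difference between an object `_d: { }` (first match only) and a list `_d: [{ }]` (every match) can be pictured on plain data. In the sketch below, plain objects stand in for the two elements matched by `#pricing .item`, and `shapeResult` is a hypothetical helper, not part of jsonframe.

```javascript
// Stand-ins for the two elements matched by "#pricing .item".
// In the plugin these would be Cheerio nodes; plain objects keep this runnable.
const matchedItems = [
  { planName: 'Hacker', planPrice: 'Free' },
  { planName: 'Pro', planPrice: '$39' },
];

// Hypothetical helper: `data` plays the role of _d.
function shapeResult(items, data) {
  const wantsList = Array.isArray(data);
  const fields = wantsList ? data[0] : data;
  const readOne = (item) =>
    Object.fromEntries(
      Object.entries(fields).map(([name, source]) => [name, item[source]])
    );
  if (wantsList) return items.map(readOne); // one object per match
  return items.length > 0 ? readOne(items[0]) : null; // first match only
}

const planFields = { name: 'planName', price: 'planPrice' };
console.log(shapeResult(matchedItems, planFields));   // { name: 'Hacker', price: 'Free' }
console.log(shapeResult(matchedItems, [planFields])); // both plans, as an array
```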
See how you can properly structure your data, ready for the output!
...
let frame = {
"pricing": {
_s: "#pricing .item",
_d: [{
"name": ".planName",
"price": ".planPrice @ price",
"image": {
"url": "img @ src",
"link": "a @href"
}
}]
}
}
let result = $('body').scrape(frame, { string: true })
console.log( result )
/* output =>
{
"pricing":[
{
"name": "Hacker",
"price": "0",
"image": {
"url": "./img/hacker.png",
"link": "/hacker"
}
},
{
"name": "Pro",
"price": "39.00",
"image": {
"url": "./img/pro.png",
"link": "/pro"
}
}
]
}
*/
...
...
let frame = {
"proPrice": {
_s: ".planName:contains('Pro') + span",
_a: "price"
}
}
let result = $('body')
.scrape(frame, {
timestats: true, // default: false
string: true // default: false
})
console.log(result)
/* output =>
{
"proPrice": {
"value":"39.00",
"_timestats": "1" // ms
}
}
*/
...
One shot tests
npm run test
Watching test on updates
npm run test-watch
⚠ Careful if you have been using jsonframe 1.x.x: some things changed in version 2 to make it more flexible, faster to use (inline parameters) and more meaningful in its syntax.
2.0.52 (28/02/2017)
2.0.51 (27/02/2017)
2.0.50 (27/02/2017)
.selector < html email would work
2.0.49 (27/02/2017)
2.0.48 (27/02/2017)
Split(char) to split string based on character (default to whitespace)
numbers or nb (return potentially an array)
numbers or nb (simply filter the string to output only numbers)
between(string1&&string2) to filter data by starting and finishing string
before(string) to get data before a string
after(string) to get data after a string
left(nb) and right(nb) (slice the array elements)
fromto(startNb,endNb) to either slice an array or a string from index to index
get(nb) to extract either an array item or a character from a string
2.0.46 (26/02/2017)
2.0.45 (25/02/2017)
2.0.44 (24/02/2017)
"mails": [".parentSelector < email"]
prenom and nom to humanname extractor
right(number), left(number)
/([a-zA-Z0-9._-]{0,30}@[a-zA-Z0-9._-]{0,15}\.[a-zA-Z0-9._-]{0,15})/gmi
2.0.3 (23/02/2017)
_b). More about this soooon in the readme.
words or w, noescapchar or nec and compact or cmp
.selector | words compact. Simply separated by spaces.
2.0.2 (15/02/2017)
2.0.1 (14/02/2017)
2.0.0 (12/02/2017)
Type Extractor with shortcode < instead of |
filters with the shortcode |
"attribute", "extractor" and "parse"
"_g" or "_group" )
1.1.1 (05/02/2017)
_s, _t, _a) instead of "selector", "extractor", "attr". The idea is to easily differentiate retrieved data names from functional data.
img selected element (automatically retrieve the img src link)
_parent_ selector to target the parent content
_p (_parse works too)
_t: "html" feature to get back inner html of a selector
.scrape(frame, {timestats: true})
1.0.0 (27/01/2017)
Feel free to follow the procedure to make it even more awesome!
issue so we get the discussion started
git checkout -b my-new-feature
git commit -am 'Add some feature'
git push origin my-new-feature
Gabin Desserprit - datascraper.pro
Released under MIT License