Date: 2020-06-05
Declarative DOM extraction expression evaluator.
Powerful, succinct, composable, extendable, declarative API.
articles:
- select article {0,}
- body:
  - select .body
  - read property innerHTML
  imageUrl:
  - select img
  - read attribute src
  summary:
  - select ".body p:first-child"
  - read property innerHTML
  - format text
  title:
  - select .title
  - read property textContent
pageName:
- select .body
- read property innerHTML
Not succinct enough for you? Use aliases and the pipe operator (|) to shorten and concatenate the commands:
articles:
- sm article
- body: s .body | rp innerHTML
  imageUrl: s img | ra src
  summary: s .body p:first-child | rp innerHTML | f text
  title: s .title | rp textContent
pageName: s .body | rp innerHTML
Have you got suggestions for improvement? I am all ears.
|Name|Type|Description|Default value|
|---|---|---|---|
|evaluator|EvaluatorType|HTML parser and selector engine. See evaluators.|browser evaluator if window and document variables are present, cheerio otherwise.|
|subroutines|$PropertyType<UserConfigurationType, 'subroutines'>|User defined subroutines. See subroutines.|N/A|
Subroutines use an evaluator to parse input (i.e. convert a string into an object) and to select nodes in the resulting document.
The default evaluator is configured based on the user environment: the browser evaluator is used if the window and document variables are defined; otherwise, the cheerio evaluator is used.
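The selection rule above can be sketched in a few lines. This is only an illustration, not Surgeon's source: the helper name detectEvaluatorName and its scope parameter are assumptions introduced here so the rule can be tried without a real browser.

```javascript
// Hypothetical helper (not part of Surgeon): picks an evaluator name
// by checking whether browser globals are present on the given scope.
const detectEvaluatorName = (scope = globalThis) => {
  const hasWindow = typeof scope.window !== 'undefined';
  const hasDocument = typeof scope.document !== 'undefined';

  return hasWindow && hasDocument ? 'browser' : 'cheerio';
};

// Simulated environments: a browser-like scope and an empty (Node.js-like) scope.
const inBrowser = detectEvaluatorName({window: {}, document: {}});
const inNode = detectEvaluatorName({});
```

Both variables must be present for the browser evaluator to be chosen; a scope that defines only window falls back to cheerio.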
Have a use case for another evaluator? Raise an issue.
For an example implementation of an evaluator, refer to:
browser evaluator
Uses native browser methods to parse the document and to evaluate CSS selector queries.
Use the browser evaluator if you are running Surgeon in a browser or a headless browser (e.g. PhantomJS).
import {
  browserEvaluator
} from './evaluators';

surgeon({
  evaluator: browserEvaluator()
});
cheerio evaluator
Uses cheerio to parse the document and to evaluate CSS selector queries.
Use the cheerio evaluator if you are running Surgeon in Node.js.
import {
  cheerioEvaluator
} from './evaluators';

surgeon({
  evaluator: cheerioEvaluator()
});
A subroutine is a function used to advance the DOM extraction expression evaluator, e.g.
x('foo | bar baz', 'qux');
In the above example, the Surgeon expression uses two subroutines: foo and bar. The foo subroutine is invoked without additional values. The bar subroutine is invoked with one value ("baz").
Subroutines are executed in the order in which they are defined: the result of each subroutine is passed on to the next one. The first subroutine receives the document input (in this case, the string "qux").
Multiple subroutines can be written as an array. The following example is equivalent to the earlier example.
x([
  'foo',
  'bar baz'
], 'qux');
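The execution order described above amounts to a left-to-right reduction over the instructions. The following is a minimal sketch of that idea, not Surgeon's implementation: the run function and the upper and wrap subroutines are hypothetical, and parameters are split on spaces only (no quoting support).

```javascript
// Hypothetical subroutine registry; these are not Surgeon built-ins.
const subroutines = {
  upper: (subject) => subject.toUpperCase(),
  wrap: (subject, [left, right]) => left + subject + right
};

// Runs instructions left to right, feeding each result into the next subroutine.
const run = (instructions, input) => {
  return instructions.reduce((subject, instruction) => {
    const [name, ...values] = instruction.split(' ');

    if (!subroutines[name]) {
      throw new Error('Unknown subroutine: ' + name);
    }

    return subroutines[name](subject, values);
  }, input);
};

const result = run(['upper', 'wrap [ ]'], 'qux');
// result is '[QUX]'
```

The first subroutine receives the input ('qux'), and the value returned by the last subroutine becomes the result of the whole query.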
There are two types of subroutines: built-in subroutines and user-defined subroutines.
Note:
These functions are called subroutines to emphasise the cross-platform nature of the declarative API.
The following subroutines are available out of the box.
append subroutine
append appends a string to the input string.
|Parameter name|Description|Default|
|---|---|---|
|tail|Appends a string to the end of the input string.|N/A|
Examples:
// Assuming an element <a href='http://foo' />,
// then the result is 'http://foo/bar'.
x(`select a | read attribute href | append '/bar'`);
closest subroutine
closest subroutine iterates through all the preceding nodes (including parent nodes) searching for either a preceding node matching the selector expression or a descendant of the preceding node matching the selector.
Note: This is different from the jQuery .closest() in that the latter method does not search for parent descendants matching the selector.
|Parameter name|Description|Default|
|---|---|---|
|CSS selector|CSS selector used to select an element.|N/A|
constant subroutine
constant returns the parameter value regardless of the input.
|Parameter name|Description|Default|
|---|---|---|
|constant|Constant value that will be returned as the result.|N/A|
format subroutine
format is used to format input using sprintf.
|Parameter name|Description|Default|
|---|---|---|
|format|sprintf format used to format the input string. The subroutine input is the first argument, i.e. %1$s.|%1$s|
Examples:
// Reads the href attribute value and prefixes it with 'http://foo.com'.
x(`select a | read attribute href | format 'http://foo.com%1$s'`);
match subroutine
match is used to extract matching capturing groups from the subject input.
|Parameter name|Description|Default|
|---|---|---|
|Regular expression|Regular expression used to match capturing groups in the string.|N/A|
|Sprintf format|sprintf format used to construct a string using the matching capturing groups.|%s|
Examples:
// Extracts 1 matching capturing group from the input string.
// Throws `InvalidDataError` if the value does not pass the test.
x('select .foo | read property textContent | match "/input: (\d+)/"');
// Extracts 2 matching capturing groups from the input string and formats the output using sprintf.
// Throws `InvalidDataError` if the value does not pass the test.
x('select .foo | read property textContent | match "/input: (\d+)-(\d+)/" %2$s-%1$s');
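To make the interplay between capturing groups and the format string concrete, here is a small standalone sketch. It is not Surgeon's implementation: match and formatGroups are hypothetical helpers, and formatGroups is a deliberately tiny stand-in for sprintf that only understands %s and positional %N$s placeholders.

```javascript
// Minimal stand-in for sprintf: supports only %s and %N$s placeholders.
const formatGroups = (format, groups) => {
  let next = 0;

  return format.replace(/%(?:(\d+)\$)?s/g, (placeholder, position) => {
    // Positional placeholders are 1-based; plain %s consumes groups in order.
    const index = position === undefined ? next++ : Number(position) - 1;

    return groups[index];
  });
};

// Hypothetical match helper: extracts capturing groups and formats them.
const match = (subject, pattern, format = '%s') => {
  const result = pattern.exec(subject);

  if (!result) {
    // Surgeon signals this case with InvalidDataError; a plain Error is used here.
    throw new Error('Input does not match ' + pattern);
  }

  return formatGroups(format, result.slice(1));
};

const single = match('input: 7', /input: (\d+)/);
const swapped = match('input: 10-20', /input: (\d+)-(\d+)/, '%2$s-%1$s');
// single is '7', swapped is '20-10'
```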
nextUntil subroutine
nextUntil subroutine is used to select all following siblings of each element up to but not including the element matched by the selector.
|Parameter name|Description|Default|
|---|---|---|
|selector expression|A string containing a selector expression to indicate where to stop matching following sibling elements.|N/A|
|filter expression|A string containing a selector expression to match elements against.| |
prepend subroutine
prepend prepends a string to the input string.
|Parameter name|Description|Default|
|---|---|---|
|head|Prepends a string to the start of the input string.|N/A|
Examples:
// Assuming an element <a href='//foo' />,
// then the result is 'http://foo'.
x(`select a | read attribute href | prepend 'http:'`);
previous subroutine
previous subroutine selects the preceding sibling.
|Parameter name|Description|Default|
|---|---|---|
|CSS selector|CSS selector used to select an element.|N/A|
Example:
<ul>
  <li>foo</li>
  <li class='bar'></li>
</ul>
x('select .bar | previous | read property textContent');
// 'foo'
read subroutine
read is used to extract a value from the matching element using an evaluator.
|Parameter name|Description|Default|
|---|---|---|
|Target type|Possible values: "attribute" or "property"|N/A|
|Target name|Depending on the target type, name of an attribute or a property.|N/A|
Examples:
// Returns .foo element "href" attribute value.
// Throws error if attribute does not exist.
x('select .foo | read attribute href');
// Returns an array of "href" attribute values of the matching elements.
// Throws error if attribute does not exist on any of the matching elements.
x('select .foo {0,} | read attribute href');
// Returns .foo element "textContent" property value.
// Throws error if property does not exist.
x('select .foo | read property textContent');
remove subroutine
The remove subroutine is used to remove elements from the document using an evaluator.
The remove subroutine accepts the same parameters as the select subroutine.
The result of the remove subroutine is the input of the subroutine, i.e. the previous select subroutine result.
|Parameter name|Description|Default|
|---|---|---|
|CSS selector|CSS selector used to select an element.|N/A|
|Quantifier expression|A quantifier expression is used to control the expected result length.|See quantifier expression.|
Examples:
// Returns 'bar'.
x('select .foo | remove span | read property textContent', `<div class='foo'>bar<span>baz</span></div>`);
select subroutine
select subroutine is used to select the elements in the document using an evaluator.
|Parameter name|Description|Default|
|---|---|---|
|CSS selector|CSS selector used to select an element.|N/A|
|Quantifier expression|A quantifier expression is used to control the shape of the results (direct result or array of results) and the expected result length.|See quantifier expression.|
A quantifier expression is used to assert that the query matches a set number of nodes. A quantifier expression is a modifier of the select subroutine.
A quantifier expression is defined using the following syntax.
|Name|Syntax|
|---|---|
|Fixed quantifier|{n} where n is an integer >= 1|
|Greedy quantifier|{n,m} where n >= 0 and m >= n|
|Greedy quantifier|{n,} where n >= 0|
|Greedy quantifier|{,m} where m >= 1|
A quantifier expression can be followed by a node selector [i], e.g. {0,}[0]. This returns the node at index i from the result set; with [0], that is the first node.
If this looks familiar, it's because I have adopted the syntax from the regular expression language. However, unlike in a regular expression, a quantifier in the context of a Surgeon selector will produce an error (SelectSubroutineUnexpectedResultCountError) if the selector result length is out of the quantifier range.
Examples:
// Selects 0 or more nodes.
// Result is an array.
x('select .foo {0,}');
// Selects 1 or more nodes.
// Throws an error if 0 matches found.
// Result is an array.
x('select .foo {1,}');
// Selects between 0 and 5 nodes.
// Throws an error if more than 5 matches found.
// Result is an array.
x('select .foo {0,5}');
// Selects 0 or more nodes.
// Result is the first match in the result set (or `null`).
x('select .foo {0,}[0]');
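The quantifier syntax and its range check can be illustrated with a standalone sketch. This is not Surgeon's internal code: parseQuantifier and applyQuantifier are hypothetical helpers, a plain RangeError stands in for SelectSubroutineUnexpectedResultCountError, and returning the full array when no [i] selector is given is an assumption.

```javascript
// Parses quantifiers such as '{1}', '{0,}', '{,5}', '{0,5}' and '{0,}[0]'.
const parseQuantifier = (expression) => {
  const match = /^\{(\d*)(,?)(\d*)\}(?:\[(\d+)\])?$/.exec(expression);

  if (!match || (match[1] === '' && match[3] === '')) {
    throw new Error('Invalid quantifier expression: ' + expression);
  }

  const [, n, comma, m, index] = match;
  const min = n === '' ? 0 : Number(n);
  // Without a comma the quantifier is fixed, so max equals min.
  const max = comma === '' ? min : (m === '' ? Infinity : Number(m));
  // {n} needs n >= 1, {,m} needs m >= 1, {n,m} and {n,} need m >= n.
  const valid = comma === '' ? min >= 1 : (n === '' ? max >= 1 : max >= min);

  if (!valid) {
    throw new Error('Invalid quantifier range: ' + expression);
  }

  return {min, max, index: index === undefined ? null : Number(index)};
};

// Asserts the result count, then applies the optional [i] node selector.
const applyQuantifier = (nodes, {min, max, index}) => {
  if (nodes.length < min || nodes.length > max) {
    throw new RangeError('Unexpected result count: ' + nodes.length);
  }

  if (index === null) {
    return nodes;
  }

  return index < nodes.length ? nodes[index] : null;
};

const first = applyQuantifier(['a', 'b'], parseQuantifier('{0,}[0]'));
const none = applyQuantifier([], parseQuantifier('{0,}[0]'));
// first is 'a', none is null
```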
test subroutine
test is used to validate the current value using a regular expression.
|Parameter name|Description|Default|
|---|---|---|
|Regular expression|Regular expression used to test the value.|N/A|
Examples:
// Validates that .foo element textContent property value matches /bar/ regular expression.
// Throws `InvalidDataError` if the value does not pass the test.
x('select .foo | read property textContent | test /bar/');
See error handling for more information and usage examples of the test subroutine.
Custom subroutines can be defined using the subroutines configuration.
A subroutine is a function. A subroutine function is invoked with the following parameters:
|Parameter name|
|---|
|Current value, i.e. value used to query Surgeon or value returned from the previous (or ancestor) subroutine.|
|An array of values used when referencing the subroutine in an expression.|
|An instance of Evaluator.|
Example:
const x = surgeon({
  subroutines: {
    mySubroutine: (currentValue, [firstParameterValue, secondParameterValue]) => {
      console.log(currentValue, firstParameterValue, secondParameterValue);

      return parseInt(currentValue, 10) + 1;
    }
  }
});

x('mySubroutine foo bar | mySubroutine baz qux', 0);
The above example prints:
0 "foo" "bar"
1 "baz" "qux"
For more examples of defining subroutines, refer to:
Custom subroutines can be inlined into pianola instructions, e.g.
x(
  [
    'foo',
    (subject) => {
      // `subject` is the return value of `foo` subroutine.
      return 'bar';
    },
    'baz',
  ],
  'qux'
);
Surgeon exports an alias preset that is used to reduce the verbosity of queries.
|Name|Description|
|---|---|
|ra ...|Reads Element attribute value. Equivalent to read attribute ...|
|rdtc ...|Removes any descending elements and reads the resulting textContent property of an element. Equivalent to `remove * {0,}|
|rih ...|Reads innerHTML property of an element. Equivalent to read property ... innerHTML|
|roh ...|Reads outerHTML property of an element. Equivalent to read property ... outerHTML|
|rp ...|Reads Element property value. Equivalent to read property ...|
|rtc ...|Reads textContent property of an element. Equivalent to read property ... textContent|
|sa ...|Select any (sa). Selects multiple elements (0 or more). Returns array. Equivalent to select "..." {0,}|
|saf ...|Select any first (saf). Selects multiple elements (0 or more). Returns single result or null. Equivalent to select "..." {0,}[0]|
|sm ...|Select many (sm). Selects multiple elements (1 or more). Returns array. Equivalent to select "..." {1,}|
|smo ...|Select maybe one (smo). Selects one element. Returns single result or null. Equivalent to select "..." {0,1}[0]|
|so ...|Select one (so). Selects a single element. Returns single result. Equivalent to select "..." {1}[0].|
|t {name}|Tests value. Equivalent to test ...|
Note regarding the s ... alias: the CSS selector value is quoted. Therefore, you can write a CSS selector that includes spaces without putting the value in quotes, e.g. s .foo .bar is equivalent to select ".foo .bar" {1}.
Other alias values are not quoted. Therefore, if a value includes a space, it must be quoted, e.g. t "/foo bar/".
Usage:
import surgeon, {
  subroutineAliasPreset
} from 'surgeon';

const x = surgeon({
  subroutines: {
    ...subroutineAliasPreset
  }
});

x('s .foo .bar | t "/foo bar/"');
In addition to the built-in aliases, users can declare their own subroutine aliases.
Surgeon subroutines are referenced using expressions.
An expression is defined using the following pseudo-grammar:
subroutines ->
    subroutines _ "|" _ subroutine
  | subroutine

subroutine ->
    subroutineName " " parameters
  | subroutineName

subroutineName ->
  [a-zA-Z0-9\-_]:+

parameters ->
    parameters " " parameter
  | parameter
Example:
x('foo bar baz', 'qux');
In this example, the Surgeon query executor (x) is invoked with the expression foo bar baz and the starting value qux. The expression tells the query executor to run the foo subroutine with the parameter values "bar" and "baz" and the subject value "qux".
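As a rough illustration of how an expression breaks down into a subroutine name and parameter values, consider the sketch below. It is a simplification and not Surgeon's parser: parseExpression is a hypothetical helper that only supports double-quoted parameters and would split incorrectly on a pipe character that appears inside quotes.

```javascript
// Hypothetical, simplified expression parser (not Surgeon's grammar-based parser).
const parseExpression = (expression) => {
  return expression.split('|').map((part) => {
    // A token is either a double-quoted string or a run of non-space characters.
    const tokens = part.trim().match(/"[^"]*"|\S+/g) || [];
    const [subroutine, ...values] = tokens.map((token) => {
      return token.replace(/^"(.*)"$/, '$1');
    });

    return {subroutine, values};
  });
};

const parsed = parseExpression('foo bar baz | corge "grault garply"');
// parsed[0] is {subroutine: 'foo', values: ['bar', 'baz']}
// parsed[1] is {subroutine: 'corge', values: ['grault garply']}
```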
Multiple subroutines can be combined using an array:
x([
  'foo bar baz',
  'corge grault garply'
], 'qux');
In this example, the Surgeon query executor (x) is invoked with two expressions (foo bar baz and corge grault garply). The first subroutine is executed with the subject value "qux". The second subroutine is executed with the result of the first subroutine as its subject value.
The result of the query is the result of the last subroutine.
Read user-defined subroutines documentation for broader explanation of the role of the parameter values and the subject value.
Multiple subroutines can be combined using the pipe operator (|).
The following examples are equivalent:
x([
  'foo bar baz',
  'qux quux quuz'
]);
x([
  'foo bar baz | qux quux quuz'
]);
x('foo bar baz | qux quux quuz');
Unless redefined, all examples assume the following initialisation:
import surgeon from 'surgeon';
/**
* @param configuration {@see https://github.com/gajus/surgeon#configuration}
*/
const x = surgeon();
Use the select subroutine and the read subroutine to extract a single value.
const subject = `
<div class="title">foo</div>
`;
x('select .title | read property textContent', subject);
// 'foo'
Specify the select subroutine quantifier to match multiple results.
const subject = `
  <div class="foo">bar</div>
  <div class="foo">baz</div>
  <div class="foo">qux</div>
`;

x('select .foo {0,} | read property textContent', subject);
// [
// 'bar',
// 'baz',
// 'qux'
// ]
Use a QueryChildrenType object to name the results of the descending expressions.
const subject = `
  <article>
    <div class='title'>foo title</div>
    <div class='body'>foo body</div>
  </article>
  <article>
    <div class='title'>bar title</div>
    <div class='body'>bar body</div>
  </article>
`;

x([
  'select article {0,}',
  {
    body: 'select .body | read property textContent',
    title: 'select .title | read property textContent'
  }
], subject);
// [
// {
// body: 'foo body',
// title: 'foo title'
// },
// {
// body: 'bar body',
// title: 'bar title'
// }
// ]
Use the test subroutine to validate the results.
const subject = `
  <div class="foo">bar</div>
  <div class="foo">baz</div>
  <div class="foo">qux</div>
`;

x('select .foo {0,} | read property textContent | test /^[a-z]{3}$/', subject);
See error handling for information on how to handle test subroutine errors.
Define a custom subroutine to validate results using arbitrary logic.
Use InvalidValueSentinel to leverage the standardised Surgeon error handler (see error handling). Otherwise, simply throw an error.
import surgeon, {
  InvalidValueSentinel
} from 'surgeon';

const x = surgeon({
  subroutines: {
    isRed: (value) => {
      if (value === 'red') {
        return value;
      }

      return new InvalidValueSentinel('Unexpected color.');
    }
  }
});
As you become familiar with the query execution mechanism, typing long expressions (such as select, read attribute and read property) becomes a mundane task.
Remember that subroutines are regular functions: you can partially apply and use the partially applied functions to create new subroutines.
Example:
import surgeon, {
  readSubroutine,
  selectSubroutine,
  testSubroutine
} from 'surgeon';

const x = surgeon({
  subroutines: {
    // Read attribute.
    ra: (subject, values, bindle) => {
      return readSubroutine(subject, ['attribute'].concat(values), bindle);
    },
    // Read property.
    rp: (subject, values, bindle) => {
      return readSubroutine(subject, ['property'].concat(values), bindle);
    },
    // Select exactly one element.
    s: (subject, values, bindle) => {
      return selectSubroutine(subject, [values.join(' '), '{1}'], bindle);
    },
    // Select many (1 or more) elements.
    sm: (subject, values, bindle) => {
      return selectSubroutine(subject, [values.join(' '), '{1,}'], bindle);
    },
    t: testSubroutine
  }
});
Now, instead of writing:
articles:
- select article
- body:
  - select .body
  - read property innerHTML
You can write:
articles:
- sm article
- body:
  - s .body
  - rp innerHTML
The aliases used in this example are available in the aliases preset (read built-in subroutine aliases).
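The partial-application pattern used for these aliases can be generalised. The sketch below is an illustration only: createAlias is a hypothetical helper, and read is a stand-in that works on plain objects rather than Surgeon's readSubroutine, so the example runs without a DOM.

```javascript
// Hypothetical helper: creates an alias that prepends fixed parameter values
// before delegating to the wrapped subroutine.
const createAlias = (subroutine, fixedValues) => {
  return (subject, values, bindle) => {
    return subroutine(subject, fixedValues.concat(values), bindle);
  };
};

// Stand-in for a read subroutine that works on plain objects instead of DOM nodes.
const read = (subject, [targetType, targetName]) => {
  const source = targetType === 'attribute' ? subject.attributes : subject.properties;

  if (!(targetName in source)) {
    throw new Error('Cannot read ' + targetType + ' ' + targetName);
  }

  return source[targetName];
};

const ra = createAlias(read, ['attribute']);
const rp = createAlias(read, ['property']);

const node = {
  attributes: {href: '/foo'},
  properties: {textContent: 'Foo'}
};

const href = ra(node, ['href']);
const text = rp(node, ['textContent']);
// href is '/foo', text is 'Foo'
```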
Surgeon throws the following errors to indicate a predictable error state. All Surgeon errors can be imported. Use the instanceof operator to determine the error type.
Note: Surgeon errors are non-recoverable, i.e. a selector cannot proceed if it encounters an error. This design ensures that your selectors are capturing the expected data.
|Name|Description|
|---|---|
|ReadSubroutineNotFoundError|Thrown when an attempt is made to retrieve a non-existent attribute or property.|
|SelectSubroutineUnexpectedResultCountError|Thrown when a select subroutine result length does not match the quantifier expression.|
|InvalidDataError|Thrown when a subroutine returns an instance of InvalidValueSentinel.|
|SurgeonError|A generic error. All other Surgeon errors extend from SurgeonError.|
Example:
import {
  InvalidDataError
} from 'surgeon';

const subject = `
  <div class="foo">bar</div>
`;

try {
  x('select .foo | read property textContent | test /bar/', subject);
} catch (error) {
  if (error instanceof InvalidDataError) {
    // Handle data validation error.
  } else {
    throw error;
  }
}
Return InvalidValueSentinel from a subroutine to force Surgeon to throw an InvalidDataError.
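The relationship between a returned sentinel and the thrown error can be sketched as follows. The classes below are local stand-ins that only mirror the names used in this document; they are not the classes exported by Surgeon, and runSubroutine is a hypothetical runner written for this illustration.

```javascript
// Local stand-ins mirroring the documented names. NOT Surgeon's own classes.
class InvalidValueSentinel {
  constructor (message) {
    this.message = message;
  }
}

class SurgeonError extends Error {}

class InvalidDataError extends SurgeonError {
  constructor (subject, sentinel) {
    super(sentinel.message);
    this.subject = subject;
  }
}

// Hypothetical runner: converts a returned sentinel into a thrown error.
const runSubroutine = (subroutine, subject, values = []) => {
  const result = subroutine(subject, values);

  if (result instanceof InvalidValueSentinel) {
    throw new InvalidDataError(subject, result);
  }

  return result;
};

const isRed = (value) => {
  return value === 'red' ? value : new InvalidValueSentinel('Unexpected color.');
};

let caught = null;

try {
  runSubroutine(isRed, 'blue');
} catch (error) {
  caught = error;
}
// caught is an InvalidDataError (and therefore also a SurgeonError).
```

Because every specific error extends the generic one in this sketch, a single instanceof SurgeonError check is enough to separate library errors from unrelated failures.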
Surgeon uses roarr to log debugging information.
Export the ROARR_LOG=TRUE environment variable to enable the Surgeon debug log.