Date: 2021-09-25
An extremely fast and efficient web scraper that can parse megabytes of invalid HTML in the blink of an eye.
You can use the familiar jQuery/CSS selector syntax to easily find the data you need.
In my unit tests, I require it to be at least 10 times faster than Symfony's DOMCrawler on a 3 MB HTML document. In practice, according to my humble tests, it is two to three orders of magnitude faster than DOMCrawler in some cases, especially when selecting thousands of elements, and on average it uses about half the RAM.
See tests/README.md.
Just add this folder to your project, include_once 'hquery.php'; and you are ready to hQuery.
Alternatively, install it with Composer:
composer require duzun/hquery
or with npm:
npm install hquery.php
and then load it with require_once 'node_modules/hquery.php/hquery.php';.
// Optionally use namespaces
use duzun\hQuery;
// Either use composer, or include this file:
include_once '/path/to/libs/hquery.php';
// Set the cache path - must be a writable folder
// If not set, hQuery::fromURL() would make a new request on each call
hQuery::$cache_path = "/path/to/cache";
// Time to keep request data in cache, seconds
// A value of 0 disables cache
hQuery::$cache_expires = 3600; // default one hour
I would recommend using php-http/cache-plugin with a PSR-7 client for better flexibility.
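Because the cache path must be a writable folder, it can help to create and check it before handing it to hQuery. The following is a minimal sketch of that idea. The folder name hquery-cache under the system temp directory is an assumption made for illustration, and the hQuery assignment is guarded so the snippet also runs where the library is not loaded.

```php
<?php
// Hypothetical cache location; any writable folder works.
$cacheDir = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'hquery-cache';

// Create the folder if it is missing (recursively), then verify it is writable.
if (!is_dir($cacheDir) && !mkdir($cacheDir, 0775, true) && !is_dir($cacheDir)) {
    throw new RuntimeException("Cannot create cache folder: $cacheDir");
}
if (!is_writable($cacheDir)) {
    throw new RuntimeException("Cache folder is not writable: $cacheDir");
}

// Only configure hQuery when the class is actually available.
if (class_exists('duzun\\hQuery')) {
    \duzun\hQuery::$cache_path    = $cacheDir;
    \duzun\hQuery::$cache_expires = 3600; // seconds; 0 disables the cache
}
```

Failing early here gives a clearer error than discovering later that nothing is being cached.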
hQuery::fromFile( string $filename, boolean $use_include_path = false, resource $context = NULL )
// Local
$doc = hQuery::fromFile('/path/to/filesystem/doc.html');
// Remote
$doc = hQuery::fromFile('https://example.com/', false, $context);
Where $context is created with stream_context_create().
For an example of using $context to make an HTTP request through a proxy, see #26.
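As a rough illustration of what such a $context might look like, the sketch below builds an HTTP stream context that routes requests through a proxy. The proxy address and URL are placeholders, and the option values are assumptions rather than settings taken from #26. The option names themselves (proxy, request_fulluri, timeout, header) are standard PHP HTTP context options.

```php
<?php
// Placeholder proxy address and target URL - replace with your own.
$proxy = 'tcp://127.0.0.1:8080';
$url   = 'https://example.com/';

$context = stream_context_create([
    'http' => [
        'method'          => 'GET',
        'proxy'           => $proxy,
        'request_fulluri' => true,  // many proxies expect the full URI in the request line
        'timeout'         => 7,     // read timeout, in seconds
        'header'          => "Accept: text/html\r\n",
    ],
]);

// With hQuery loaded, the context goes in as the third argument:
// $doc = hQuery::fromFile($url, false, $context);
```

The options under the 'http' key are also used for https:// URLs.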
hQuery::fromHTML( string $html, string $url = NULL )
$doc = hQuery::fromHTML('<html><head><title>Sample HTML Doc</title><body>Contents...</body></html>');
// Set base_url, in case the document is loaded from local source.
// Note: The base_url property is used to retrieve absolute URLs from relative ones.
$doc->base_url = 'http://desired-host.net/path';
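To make the role of base_url more concrete, here is a simplified sketch of how a relative URL can be resolved against a base. resolve_url() is a hypothetical helper written for this illustration and is not hQuery's actual implementation. It assumes the base has a scheme and host, and it does not handle query-only or fragment-only references.

```php
<?php
/**
 * Hypothetical helper (not part of hQuery): resolve $rel against $base.
 * Simplified: $base must have a scheme and host; "?query" and "#fragment"
 * references are not treated specially.
 */
function resolve_url(string $base, string $rel): string
{
    if ($rel === '') {
        return $base;
    }
    // A reference with its own scheme is already absolute.
    if (is_string(parse_url($rel, PHP_URL_SCHEME))) {
        return $rel;
    }

    $b      = parse_url($base);
    $origin = $b['scheme'] . '://' . $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');

    // Protocol-relative reference: reuse the scheme of the base.
    if (strncmp($rel, '//', 2) === 0) {
        return $b['scheme'] . ':' . $rel;
    }
    // Root-relative reference: keep only the origin of the base.
    if ($rel[0] === '/') {
        return $origin . $rel;
    }

    // Path-relative reference: append to the "directory" of the base path.
    $path = $b['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);

    // Collapse "." and ".." segments; the leading empty segment marks the root.
    $segments = [];
    foreach (explode('/', $dir . $rel) as $seg) {
        if ($seg === '..') {
            if (count($segments) > 1) {
                array_pop($segments);
            }
        } elseif ($seg !== '.') {
            $segments[] = $seg;
        }
    }
    return $origin . implode('/', $segments);
}

// prints http://desired-host.net/img/logo.png
echo resolve_url('http://desired-host.net/path', 'img/logo.png'), PHP_EOL;
```

Note that with a base of http://desired-host.net/path, the last path segment is treated as a file name, so relative references resolve against the root folder.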
hQuery::fromURL( string $url, array $headers = NULL, array|string $body = NULL, array $options = NULL )
use duzun\hQuery;
// GET the document
$doc = hQuery::fromUrl('http://example.com/someDoc.html', ['Accept' => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8']);
var_dump($doc->headers); // See response headers
var_dump(hQuery::$last_http_result); // See response details of last request
// with POST
$doc = hQuery::fromUrl(
    'http://example.com/someDoc.html', // url
    ['Accept' => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8'], // headers
    ['username' => 'Me', 'fullname' => 'Just Me'], // request body - could be a string as well
    ['method' => 'POST', 'timeout' => 7, 'redirect' => 7, 'decode' => 'gzip'] // options
);
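Since the request body may also be a string, a form-encoded body can be prepared up front with PHP's http_build_query(). This is only a sketch, assuming the server expects application/x-www-form-urlencoded data. Whether hQuery encodes an array body in exactly this way is not stated here, and $url and $headers in the trailing comment are placeholder variables.

```php
<?php
$fields = ['username' => 'Me', 'fullname' => 'Just Me'];

// Encode the fields as an application/x-www-form-urlencoded string.
// The separator is passed explicitly so the result does not depend on php.ini.
$body = http_build_query($fields, '', '&');

echo $body, PHP_EOL; // username=Me&fullname=Just+Me

// The string could then be passed as the third argument:
// $doc = hQuery::fromUrl($url, $headers, $body, ['method' => 'POST']);
```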
For building advanced requests (POST, parameters, etc.), see hQuery::http_wr(),
though I recommend using a specialized (PSR-7?) library for making requests
and hQuery::fromHTML($html, $url=NULL) for processing the results.
See Guzzle, for example.
composer require php-http/message php-http/discovery php-http/curl-client
If you don't have the cURL PHP extension, just replace php-http/curl-client with php-http/socket-client in the above command.
use duzun\hQuery;
use Http\Discovery\HttpClientDiscovery;
use Http\Discovery\MessageFactoryDiscovery;
$client = HttpClientDiscovery::find();
$messageFactory = MessageFactoryDiscovery::find();
$request = $messageFactory->createRequest(
    'GET',
    'http://example.com/someDoc.html',
    ['Accept' => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8']
);
$response = $client->sendRequest($request);
$doc = hQuery::fromHTML($response, $request->getUri());
Another option is to use stream_context_create() to create a $context, then call hQuery::fromFile($url, false, $context).
hQuery::find( string $sel, array|string $attr = NULL, hQuery_Node $ctx = NULL )
// Find all banners (images inside anchors)
$banners = $doc->find('a[href] > img[src]:parent');
// Extract links and images
$links = array();
$images = array();
$titles = array();
// If the result of find() is not empty
// $banners is a collection of elements (hQuery_Element)
if ( $banners ) {
    // Iterate over the result
    foreach($banners as $pos => $a) {
        $links[$pos] = $a->attr('href'); // get absolute URL from href property
        $titles[$pos] = trim($a->text()); // strip all HTML tags and leave just text
        // Filter the result
        if ( !$a->hasClass('logo') ) {
            // $a->style property is the parsed $a->attr('style')
            if ( strtolower($a->style['position']) == 'fixed' ) continue;
            $img = $a->find('img')[0]; // ArrayAccess
            if ( $img ) $images[$pos] = $img->src; // short for $img->attr('src')
        }
    }
    // If at least one element has the class .home
    if ( $banners->hasClass('home') ) {
        echo 'There is .home button!', PHP_EOL;
        // ArrayAccess for elements and properties.
        if ( $banners[0]['href'] == '/' ) {
            echo 'And it is the first one!';
        }
    }
}
// Read charset of the original document (internally it is converted to UTF-8)
$charset = $doc->charset;
// Get the size of the document ( strlen($html) )
$size = $doc->size;
You can easily run any of the examples/ on your local machine.
All you need is PHP installed on your system.
After you clone the repo with git clone https://github.com/duzun/hQuery.php.git, you have several options to start a web server.
cd hQuery.php/examples
php -S localhost:8000
# open browser http://localhost:8000/
The next option starts a live-reload server and is good for playing with the code:
npm install
gulp
# open browser http://localhost:8080/
If you are using VSCode, simply open the project and run the debugger (F5).
