An extremely fast and efficient web scraper that can parse megabytes of invalid HTML in the blink of an eye.

You can use the familiar jQuery/CSS selector syntax to easily find the data you need.

In my unit tests, I require it to be at least 10 times faster than Symfony's DOMCrawler on a 3 MB HTML document. In practice, according to my own modest tests, it is two to three orders of magnitude faster than DOMCrawler in some cases, especially when selecting thousands of elements, and on average it uses about half the RAM.

See tests/README.md.

API Documentation

💡 Features

Very fast parsing and lookup

Parses broken HTML

jQuery-like style of DOM traversal

Low memory usage

Can handle big HTML documents (I have tested up to 20 MB, but the only limit is the amount of RAM you have)

Doesn't require cURL to be installed and automatically handles redirects (see hQuery::fromUrl())

Caches response for multiple processing tasks

PSR-7 friendly (see hQuery::fromHTML($message))

PHP 5.3+

No dependencies

🛠 Install

Just add this folder to your project and include_once 'hquery.php'; then you are ready to hQuery.

Alternatively, run composer require duzun/hquery

or, with npm, run npm install hquery.php and then require_once 'node_modules/hquery.php/hquery.php';

⚙ Usage

Basic setup:

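    // Optionally use the namespace
    use duzun\hQuery;

    // Either use Composer, or include this file:
    include_once '/path/to/libs/hquery.php';

    // Folder where responses are cached
    hQuery::$cache_path = "/path/to/cache";

    // Time to keep responses in the cache, in seconds
    hQuery::$cache_expires = 3600;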
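The setup above points hQuery::$cache_path at a folder. A minimal sketch of preparing that folder first, assuming the cache path needs to be an existing, writable directory (that requirement is my assumption here, and prepareCacheDir() is a hypothetical helper, not part of hQuery):

```php
<?php
// Hypothetical helper (not part of hQuery): make sure a directory
// exists and is writable, and return its path or null on failure.
function prepareCacheDir($dir)
{
    if (!is_dir($dir) && !@mkdir($dir, 0775, true)) {
        return null; // could not create the directory
    }
    return is_writable($dir) ? $dir : null;
}

// Example location only; use whatever path suits your project.
$cacheDir = prepareCacheDir(sys_get_temp_dir() . '/hquery-cache');

// Configure hQuery only if it is loaded and the directory is usable.
if ($cacheDir !== null && class_exists('duzun\hQuery')) {
    \duzun\hQuery::$cache_path    = $cacheDir;
    \duzun\hQuery::$cache_expires = 3600; // seconds
}
```

If the directory cannot be prepared, the sketch leaves the cache settings untouched rather than pointing hQuery at an unusable path.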

I would recommend using php-http/cache-plugin with a PSR-7 client for better flexibility.

Load HTML from a file

hQuery::fromFile( string $filename , boolean $use_include_path = false, resource $context = NULL )

    // Local file
    $doc = hQuery::fromFile('/path/to/filesystem/doc.html');

    // Remote document
    $doc = hQuery::fromFile('https://example.com/', false, $context);

Where $context is created with stream_context_create().

For an example of using $context to make an HTTP request through a proxy, see #26.
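As a rough illustration of the $context argument, here is a sketch that builds a stream context routing HTTP requests through a proxy. It uses only PHP's standard http context options; the proxy address is a placeholder and makeProxyContext() is a hypothetical helper, so this is not necessarily the approach shown in #26.

```php
<?php
// Hypothetical helper: build a stream context that sends HTTP requests
// through a proxy, using PHP's standard "http" context options.
function makeProxyContext($proxy, $timeout = 10)
{
    return stream_context_create(array(
        'http' => array(
            'proxy'           => $proxy,   // e.g. 'tcp://127.0.0.1:8080'
            'request_fulluri' => true,     // many proxies expect the full URI
            'timeout'         => $timeout, // read timeout, in seconds
            'header'          => "Accept: text/html\r\n",
        ),
    ));
}

// Placeholder proxy address; replace it with a real one.
$context = makeProxyContext('tcp://127.0.0.1:8080');

// With hQuery loaded, the context is passed as the third argument:
// $doc = hQuery::fromFile('https://example.com/', false, $context);
```

Note that the http options apply to both http:// and https:// URLs; TLS-specific settings go under a separate ssl key.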

Load HTML from a string

hQuery::fromHTML( string $html , string $url = NULL )

    $doc = hQuery::fromHTML('<html><head><title>Sample HTML Doc</title><body>Contents...</body></html>');

    // Set base_url when the document is loaded from a local source
    $doc->base_url = 'http://desired-host.net/path';

Load a remote HTML document

hQuery::fromUrl( string $url , array $headers = NULL, array|string $body = NULL, array $options = NULL )

    use duzun\hQuery;

    // GET the document
    $doc = hQuery::fromUrl(
        'http://example.com/someDoc.html',
        ['Accept' => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8']
    );

    var_dump($doc->headers);             // response headers
    var_dump(hQuery::$last_http_result); // details of the last request

    // The same URL with POST
    $doc = hQuery::fromUrl(
        'http://example.com/someDoc.html',                                         // url
        ['Accept' => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8'],           // headers
        ['username' => 'Me', 'fullname' => 'Just Me'],                             // request body
        ['method' => 'POST', 'timeout' => 7, 'redirect' => 7, 'decode' => 'gzip']  // options
    );

For building advanced requests (POST, parameters, etc.), see hQuery::http_wr(), though I recommend using a specialized (PSR-7?) library for making requests and hQuery::fromHTML($html, $url=NULL) for processing the results. See Guzzle, for example.

composer require php-http/message php-http/discovery php-http/curl-client

If you don't have the cURL PHP extension, just replace php-http/curl-client with php-http/socket-client in the above command.

    use duzun\hQuery;
    use Http\Discovery\HttpClientDiscovery;
    use Http\Discovery\MessageFactoryDiscovery;

    $client = HttpClientDiscovery::find();
    $messageFactory = MessageFactoryDiscovery::find();

    $request = $messageFactory->createRequest(
        'GET',
        'http://example.com/someDoc.html',
        ['Accept' => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8']
    );

    $response = $client->sendRequest($request);

    $doc = hQuery::fromHTML($response, $request->getUri());
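Since Guzzle is mentioned above, here is a sketch of the same pattern with it. It assumes guzzlehttp/guzzle and duzun/hquery are installed via Composer, and it relies on hQuery::fromHTML() accepting a PSR-7 message, as noted in the features list. fetchDocWithGuzzle() is a hypothetical helper name.

```php
<?php
use duzun\hQuery;
use GuzzleHttp\Client;

// Hypothetical helper: fetch a page with Guzzle and hand the PSR-7
// response to hQuery for parsing.
function fetchDocWithGuzzle(Client $client, $url)
{
    $response = $client->request('GET', $url, array(
        'headers' => array(
            'Accept' => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8',
        ),
    ));

    // Pass the URL too, mirroring fromHTML($html, $url) above.
    return hQuery::fromHTML($response, $url);
}

// Usage (needs Composer's autoloader and network access):
// require 'vendor/autoload.php';
// $doc = fetchDocWithGuzzle(new Client(array('timeout' => 7)), 'http://example.com/someDoc.html');
```

Keeping the client as a parameter lets you reuse one configured Guzzle client (timeouts, proxies, middleware) across many documents.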

Another option is to use stream_context_create() to create a $context, then call hQuery::fromFile($url, false, $context).

Processing the results

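    // Find all banners (anchors that directly contain an image)
    $banners = $doc->find('a[href] > img[src]:parent');

    // Extract links, images and titles
    $links  = array();
    $images = array();
    $titles = array();

    // If the result of find() is not empty
    if ( $banners ) {

        // Iterate over the result
        foreach ($banners as $pos => $a) {
            $links[$pos]  = $a->attr('href');
            $titles[$pos] = trim($a->text());

            // Filter the result
            if ( !$a->hasClass('logo') ) {
                // Skip anchors with fixed position
                if ( strtolower($a->style['position']) == 'fixed' ) continue;

                $img = $a->find('img')[0]; // array access on the result
                if ( $img ) $images[$pos] = $img->src;
            }
        }

        // Check the class on the whole collection
        if ( $banners->hasClass('home') ) {
            echo 'There is .home button!', PHP_EOL;

            // Array access for elements and attributes
            if ( $banners[0]['href'] == '/' ) {
                echo 'And it is the first one!';
            }
        }
    }

    // Charset of the document
    $charset = $doc->charset;

    // Size of the document
    $size = $doc->size;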

🖧 Live Demo

On DUzun.Me

A lot of people ask for sources of my Live Demo page. Here we go:

view-source:https://duzun.me/playground/hquery

🏃 Run the playground

You can easily run any of the examples/ on your local machine. All you need is PHP installed on your system. After you clone the repo with git clone https://github.com/duzun/hQuery.php.git, you have several options for starting a web server.

Option 1:

    cd hQuery.php/examples
    php -S localhost:8000

Option 2 (browser-sync):

This option starts a live-reload server and is good for playing with the code.

    npm install
    gulp

Option 3 (VSCode):

If you are using VSCode, simply open the project and run the debugger (F5).

🔧 TODO

Unit test everything

Document everything

Cookie support (implemented in mem for redirects)

Improve selectors to be able to select by attributes

Add more selectors

Use HTTPlug internally

