hQuery.php

An extremely fast and efficient web scraper that parses megabytes of HTML in a blink of an eye.

In my unit tests, I demand it be at least 10 times faster than Symfony's DOMCrawler on a 3Mb HTML document. In reality, according to my humble tests, it is two-three orders of magnitude faster than DOMCrawler in some cases, especially when selecting thousands of elements, and on average uses x2 less RAM.

See tests/README.md.

API Documentation

💡 Features

Very fast parsing and lookup
Parses broken HTML
jQuery-like style of DOM traversal
Low memory usage
Can handle big HTML documents (I have tested up to 20Mb, but the limit is the amount of RAM you have)
Doesn't require cURL to be installed and automatically handles redirects (see hQuery::fromUrl())
Caches response for multiple processing tasks
PSR-7 friendly (see hQuery::fromHTML($message))
PHP 5.3+
No dependencies

🛠 Install

Just include_once 'hquery.php'; in your project and start using hQuery.

Alternatively composer require duzun/hquery

or using npm install hquery.php, require_once 'node_modules/hquery.php/hquery.php';.

⚙ Usage

Basic setup:

// Optionally use namespaces
use duzun\hQuery;
 
// Either use commposer, or include this file:
include_once '/path/to/libs/hquery.php';
 
// Set the cache path - must be a writable folder
// If not set, hQuery::fromURL() whould make a new request on each call
hQuery::$cache_path = "/path/to/cache";
 
// Time to keed request data in cache, seconds
// A value of 0 disables cahce
hQuery::$cache_expires = 3600; // default one hour

I would recomend using php-http/cache-plugin with a PSR-7 client for better flexibility.

Load HTML from a file

hQuery::fromFile( string `$filename`, boolean `$use_include_path` = false, resource `$context` = NULL )

// Local
$doc = hQuery::fromFile('/path/to/filesystem/doc.html');
 
// Remote
$doc = hQuery::fromFile('https://example.com/', false, $context);

Where $context is created with stream_context_create().

For an example of using $context to make a HTTP request with proxy see #26.

Load HTML from a string

hQuery::fromHTML( string `$html`, string `$url` = NULL )

$doc = hQuery::fromHTML('<html><head><title>Sample HTML Doc</title><body>Contents...</body></html>');
 
// Set base_url, in case the document is loaded from local source.
// Note: The base_url property is used to retrive absolute URLs from relative ones.
$doc->base_url = 'http://desired-host.net/path';

Load a remote HTML document

hQuery::fromUrl( string `$url`, array `$headers` = NULL, array|string `$body` = NULL, array `$options` = NULL )

use duzun\hQuery;
 
// GET the document
$doc = hQuery::fromUrl('http://example.com/someDoc.html', ['Accept' => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8']);
 
var_dump($doc->headers); // See response headers
var_dump(hQuery::$last_http_result); // See response details of last request
 
// with POST
$doc = hQuery::fromUrl(
    'http://example.com/someDoc.html', // url
    ['Accept' => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8'], // headers
    ['username' => 'Me', 'fullname' => 'Just Me'], // request body - could be a string as well
    ['method' => 'POST', 'timeout' => 7, 'redirect' => 7, 'decode' => 'gzip'] // options
);

For building advanced requests (POST, parameters etc) see hQuery::http_wr(), though I recomend using a specialized (PSR-7?) library for making requests and hQuery::fromHTML($html, $url=NULL) for processing results. See Guzzle for eg.

PSR-7 example:

composer require php-http/message php-http/discovery php-http/curl-client

If you don't have cURL PHP extension, just replace php-http/curl-client with php-http/socket-client in the above command.

use duzun\hQuery;
 
use Http\Discovery\HttpClientDiscovery;
use Http\Discovery\MessageFactoryDiscovery;
 
$client = HttpClientDiscovery::find();
$messageFactory = MessageFactoryDiscovery::find();
 
$request = $messageFactory->createRequest(
  'GET', 
  'http://example.com/someDoc.html',
  ['Accept' => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8']
);
 
$response = $client->sendRequest($request);
 
$doc = hQuery::fromHTML($response, $request->getUri());

Another option is to use stream_context_create() to create a $context, then call hQuery::fromFile($url, false, $context).

Processing the results

hQuery::find( string `$sel`, array|string `$attr` = NULL, hQuery_Node `$ctx` = NULL )

// Find all banners (images inside anchors)
$banners = $doc->find('a[href] > img[src]:parent');
 
// Extract links and images
$links  = array();
$images = array();
$titles = array();
 
// If the result of find() is not empty
// $banners is a collection of elements (hQuery_Element)
if ( $banners ) {
    
    // Iterate over the result
    foreach($banners as $pos => $a) {
        $links[$pos] = $a->attr('href'); // get absolute URL from href property
        $titles[$pos] = trim($a->text()); // strip all HTML tags and leave just text
 
        // Filter the result
        if ( !$a->hasClass('logo') ) {
            // $a->style property is the parsed $a->attr('style')
            if ( strtolower($a->style['position']) == 'fixed' ) continue;
 
            $img = $a->find('img')[0]; // ArrayAccess
            if ( $img ) $images[$pos] = $img->src; // short for $img->attr('src')
        }
    }
 
    // If at least one element has the class .home
    if ( $banners->hasClass('home') ) {
        echo 'There is .home button!', PHP_EOL;
 
        // ArrayAccess for elements and properties.
        if ( $banners[0]['href'] == '/' ) {
            echo 'And it is the first one!';
        }
    }
}
 
// Read charset of the original document (internally it is converted to UTF-8)
$charset = $doc->charset;
 
// Get the size of the document ( strlen($html) )
$size = $doc->size;

🖧 Live Demo

On DUzun.Me

A lot of people ask for sources of my Live Demo page. Here we go:

view-source:https://duzun.me/playground/hquery

🔧 TODO

Unit tests everything
Document everything
~~Cookie support~~ (implemented in mem for redirects)
~~Improve selectors to be able to select by attributes~~
Add more selectors
Use HTTPlug internally

💖 Support my projects

I love Open Source. Whenever possible I share cool things with the world (check out NPM and GitHub).

If you like what I'm doing and this project helps you reduce time to develop, please consider to:

★ Star and Share the projects you like (and use)
☕ Give me a cup of coffee - PayPal.me/duzuns (contact at duzun.me)
₿ Send me some Bitcoin at this addres: bitcoin:3MVaNQocuyRUzUNsTbmzQC8rPUQMC9qafa (or using the QR below)

hquery.php

hQuery.php

💡 Features

🛠 Install

⚙ Usage

Basic setup:

Load HTML from a file

hQuery::fromFile( string `$filename`, boolean `$use_include_path` = false, resource `$context` = NULL )

Load HTML from a string

hQuery::fromHTML( string `$html`, string `$url` = NULL )

Load a remote HTML document

hQuery::fromUrl( string `$url`, array `$headers` = NULL, array|string `$body` = NULL, array `$options` = NULL )

PSR-7 example:

Processing the results

hQuery::find( string `$sel`, array|string `$attr` = NULL, hQuery_Node `$ctx` = NULL )

🖧 Live Demo

🔧 TODO

💖 Support my projects

Readme

Keywords

Package Sidebar

Install

Repository

Homepage

Weekly Downloads

Version

License

Unpacked Size

Total Files

Last publish

Collaborators

hquery.php

hQuery.php

💡 Features

🛠 Install

⚙ Usage

Basic setup:

Load HTML from a file

hQuery::fromFile( string $filename, boolean $use_include_path = false, resource $context = NULL )

Load HTML from a string

hQuery::fromHTML( string $html, string $url = NULL )

Load a remote HTML document

hQuery::fromUrl( string $url, array $headers = NULL, array|string $body = NULL, array $options = NULL )

PSR-7 example:

Processing the results

hQuery::find( string $sel, array|string $attr = NULL, hQuery_Node $ctx = NULL )

🖧 Live Demo

🔧 TODO

💖 Support my projects

Readme

Keywords

Package Sidebar

Install

Repository

Homepage

DownloadsWeekly Downloads

Version

License

Unpacked Size

Total Files

Last publish

Collaborators

hQuery::fromFile( string `$filename`, boolean `$use_include_path` = false, resource `$context` = NULL )

hQuery::fromHTML( string `$html`, string `$url` = NULL )

hQuery::fromUrl( string `$url`, array `$headers` = NULL, array|string `$body` = NULL, array `$options` = NULL )

hQuery::find( string `$sel`, array|string `$attr` = NULL, hQuery_Node `$ctx` = NULL )

Weekly Downloads