Pagean is a web page analysis tool designed to automate tests requiring web pages to be loaded in a browser window (for example 404 error loading an external resource, page renders with horizontal scrollbars). The specific tests are outlined below, but are all general tests that do not include any page-specific logic.
Install Pagean globally (as shown below), or locally, via npm.
npm install -g pagean
Pagean runs as a command line tool and is executed as follows:
Installed globally:
> pagean [options]
Installed locally:
> npx pagean [options]
Options:
  -V, --version        output the version number
  -c, --config <file>  the path to the pagean configuration file (default: "./.pageanrc.json")
  -h, --help           display help for command
Pagean requires a configuration file, which can be specified via the CLI as
detailed previously; otherwise the default file .pageanrc.json in the project
root is used. This file provides the URLs to be tested and options to configure
the tests and reports. Details on the available tests and the configuration
file format are provided below.
The tests use Puppeteer to launch a headless Chrome browser. The URLs defined
in the configuration file are each loaded once, and after page load the
applicable tests are executed. Each test result is passed or failed, but a test
can be configured to report warning instead of failed. Only a failed test
causes the test process to fail and exit with an error code (a warning does
not). If a page URL fails to load, it is retried up to two additional times,
and if still unsuccessful the URL is logged as a page error with the error
message.
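The relationship between test outcomes, warnings, and the process exit code can be sketched as follows. This is a minimal illustration, not Pagean's implementation: the function names are hypothetical, and the exit code value of 1 is an assumption (the text above only says "an error code").

```javascript
// Hypothetical helpers; the names are illustrative and not part of Pagean.

// Map a raw test outcome to the reported result. When failWarn is true,
// a failure is reported as a warning instead.
function reportedResult(testPassed, failWarn = false) {
  if (testPassed) return 'passed';
  return failWarn ? 'warning' : 'failed';
}

// Only a 'failed' result produces a non-zero exit code; warnings do not.
// Assumption: the error code is 1.
function exitCodeFor(results) {
  return results.includes('failed') ? 1 : 0;
}

const exampleResults = [
  reportedResult(true),
  reportedResult(false, true), // failure downgraded to a warning
];
console.log(exampleResults, exitCodeFor(exampleResults));
```

Page errors are left out of the sketch because the text above does not state how they affect the exit code.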
The broken link test checks for broken links on the page. It checks any <a>
tag on the page with href pointing to another location on the current page or
another page (that is, only http(s) or file protocols).
- For links within the page, this test checks for existence of the element on
  the page, passing if the element exists and failing otherwise (and passing
  for cases that are always valid, for example # or #top for the current page).
  It doesn't check the visibility of the element. Failing tests return a
  response of "#element Not Found" (where #element identifies the specific
  element).
- For links to other pages, the test tries to confirm as efficiently as
  possible whether the target link is valid. It first makes a HEAD request for
  that URL and checks the response. If an erroneous response is returned
  (>= 400 with no execution error) and not code 429 (Too Many Requests), the
  request is retried with a GET request. The test passes for HTTP responses
  < 400 and fails otherwise (if HTTP response is >= 400 or another error
  occurs).
  - This can result in false failure indications, specifically for file: links
    (404 or ECONNREFUSED) or where the browser passes a domain identity with
    the request (page loads when tested, but 401 response for links to that
    page). For these cases, or other false failures, the test configuration
    allows a Boolean checkWithBrowser option that instead checks links by
    loading the target in the browser (via puppeteer). Note this can increase
    test execution time, in some cases substantially, due to the time to open a
    new browser tab and load the page and all assets.
  - Note that file: links can only be tested with the checkWithBrowser option.
  - If the link to another page includes a hash, it's removed prior to
    checking. The test in this case is confirming a valid link, not that the
    element exists, which is only done for the current page.
  - The test configuration allows an ignoredLinks array listing link URLs to
    ignore for this test. Note this only applies to links to other pages, not
    links within the page, which are always checked.
- To optimize performance, link test results are cached and those links aren't
  re-tested for the entire test run (across all tested URLs). The test
  configuration allows a Boolean ignoreDuplicates option that can be set to
  false to bypass this behavior and re-test all links. The results for any
  failed links are included in the reports in any case.
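The check for links to other pages described in the list above (strip the hash, HEAD first, retry with GET unless the response is 429, cache the result) can be sketched as follows. This is an illustration under stated assumptions, not Pagean's code: `checkExternalLink` is a hypothetical name, and `request(method, url)` is an injected stand-in for a real HTTP client that returns a status code or throws an error with a `code` property.

```javascript
// Hypothetical sketch; `request(method, url)` is an injected stand-in that
// returns an HTTP status code or throws an error with a `code` property.
function checkExternalLink(href, request, cache = new Map()) {
  // Drop any hash: only the target page is validated, not the element.
  const url = new URL(href);
  url.hash = '';
  const target = url.href;

  // Reuse a cached result (the ignoreDuplicates: true behavior).
  if (!cache.has(target)) {
    let status;
    try {
      status = request('HEAD', target);
      // Retry with GET on an erroneous response, except 429.
      if (status >= 400 && status !== 429) {
        status = request('GET', target);
      }
    } catch (error) {
      // Execution errors (for example ENOTFOUND) are reported as the status.
      status = error.code;
    }
    const passed = typeof status === 'number' && status < 400;
    cache.set(target, { status, passed });
  }
  return { href, ...cache.get(target) };
}

// Fake transport for illustration: the server rejects HEAD but answers GET.
const fakeRequest = (method) => (method === 'HEAD' ? 405 : 200);
console.log(checkExternalLink('https://example.test/page#section', fakeRequest));
```

Injecting the transport keeps the sketch runnable without network access; a real implementation would be asynchronous.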
For any failing test, the data array in the test report includes the original
URL and the response code or error as shown below.
[
  {
    "href": "https://about.gitlab.com/not-found",
    "status": 404
  },
  {
    "href": "http://localhost:3000/brokenLinks.html#notlinked",
    "status": "#notlinked Not Found"
  },
  {
    "href": "https://this.url.does.not.exist/",
    "status": "ENOTFOUND"
  }
]
Note: this test checks all links on the page, and doesn't respect mechanisms
intended to limit web crawlers such as robots.txt or noindex tags.
The console error test fails if any error is written to the browser console, but is otherwise simply a subset of the console output test. This separation allows testing for console errors while still permitting any other console output.
The console output test fails if any output is written to the browser console. An array is included in the report with all entries, as shown below:
[
  {
    "type": "error",
    "text": "Failed to load resource: net::ERR_NAME_NOT_RESOLVED",
    "location": {
      "url": "https://this.url.does.not.exist/file.js"
    }
  }
]
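The subset relationship between the two console tests can be illustrated with entries shaped like the report sample above. This is a sketch only: the function names are hypothetical, and treating an entry as an error when its type is "error" is an assumption based on that sample.

```javascript
// Entries use the shape shown in the report sample above.
const consoleEntries = [
  { type: 'log', text: 'App started' },
  { type: 'error', text: 'Failed to load resource: net::ERR_NAME_NOT_RESOLVED' },
];

// Console output test: any entry at all is a failure.
const consoleOutputTestPassed = (entries) => entries.length === 0;

// Console error test: only entries of type "error" are a failure.
// Assumption: errors are identified by the "type" field.
const consoleErrorTestPassed = (entries) =>
  entries.every((entry) => entry.type !== 'error');

console.log(consoleOutputTestPassed(consoleEntries));
console.log(consoleErrorTestPassed(consoleEntries));
```

A page that only logs informational messages would fail the console output test but pass the console error test.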
The external script test is intended to identify any externally loaded
JavaScript files (for example loaded from a CDN) and aggregate those files so
they can undergo further analysis (for example dependency vulnerability
scanning). The test is included here since these tests load fully rendered
pages, therefore allowing the aggregation of this data for pages generated
using any language or framework. By default the test returns a warning if the
page includes any JavaScript files loaded from a different domain than the page
(although this can be overridden to fail instead by setting failWarn: false,
see the Configuration section below). These files are then downloaded and
saved in the "pagean-external-scripts" directory in the project root.
Subdirectories are created for each domain, then following the URL path.
For example, the following script…
<script src="https://bootstrapcdn.com/bootstrap/4.5.0/js/bootstrap.min.js"></script>
…is saved as ./bootstrapcdn.com/bootstrap/4.5.0/js/bootstrap.min.js. The data
array in the test report includes the original file URL and the local saved
filename or applicable error, as shown below.
[
  {
    "url": "https://code.jquery.com/jquery-3.4.1.slim.min.js",
    "localFile": "pagean-external-scripts/code.jquery.com/jquery-3.4.1.slim.min.js"
  },
  {
    "url": "http://bootstrapcdn.com/bootstrap/4.5.0/js/bootstrap.min.js",
    "error": "Request failed with status code 404"
  }
]
Each external script is saved only once, but is reported on any page where it's referenced.
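The mapping from a script URL to its saved location (domain subdirectory, then URL path) can be sketched as follows. The function names are hypothetical, comparing hostnames is an assumed reading of "different domain", and the output directory name is taken from the localFile value in the sample data above.

```javascript
const path = require('path');

// Assumption: "external" means the script host differs from the page host.
function isExternalScript(scriptUrl, pageUrl) {
  return new URL(scriptUrl).hostname !== new URL(pageUrl).hostname;
}

// Build the local path: output directory, then domain, then URL path.
// POSIX separators are used so the result matches the report format.
function localFileFor(scriptUrl, outputDir = 'pagean-external-scripts') {
  const { hostname, pathname } = new URL(scriptUrl);
  return path.posix.join(outputDir, hostname, pathname);
}

console.log(localFileFor('https://code.jquery.com/jquery-3.4.1.slim.min.js'));
// prints "pagean-external-scripts/code.jquery.com/jquery-3.4.1.slim.min.js"
```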
The horizontal scrollbar test fails if the rendered page has a horizontal
scrollbar. If a specific browser viewport size is desired for this test, that
can be configured in the puppeteerLaunchOptions.
The page load time test fails if the page load time (from start through the
load event) exceeds the defined threshold in the configuration file (or the
default of 2 seconds). The actual load time is included in the report. Tests
time out at twice the page load time threshold.
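The threshold and timeout rule above amounts to a small calculation, sketched here with a hypothetical function name. Values are in seconds, matching the stated default.

```javascript
// Hypothetical helper illustrating the threshold and timeout rule.
function evaluatePageLoadTime(loadTimeSeconds, thresholdSeconds = 2) {
  return {
    // The test fails only when the load time exceeds the threshold.
    passed: loadTimeSeconds <= thresholdSeconds,
    // The test times out at twice the threshold.
    timeoutSeconds: thresholdSeconds * 2,
  };
}

console.log(evaluatePageLoadTime(1.4));    // default threshold of 2 seconds
console.log(evaluatePageLoadTime(3.2, 3)); // custom threshold of 3 seconds
```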
The rendered HTML test is intended for cases where content is dynamically
created prior to page load (that is, the load event firing). The rendered HTML
is returned and checked with HTML Hint and the test fails if any issues are
found. An array is included in the report with all HTML Hint issues, as shown
below:
[
  {
    "col": 9,
    "evidence": "    <div id=\"div1\"></div>",
    "line": 6,
    "message": "The id value [ div1 ] must be unique.",
    "raw": " id=\"div1\"",
    "rule": {
      "description": "The value of id attributes must be unique.",
      "id": "id-unique",
      "link": "https://github.com/thedaviddias/HTMLHint/wiki/id-unique"
    },
    "type": "error"
  }
]
An htmlhintrc file can be specified in the configuration file, otherwise the default "./.htmlhintrc" file is used (if it exists). See the Configuration section below.
Note: this test may not find some errors in the original HTML that are removed/resolved as the page is parsed (for example closing tags with no opening tags).
Based on the reporters configuration, Pagean results may be displayed in the
console and saved in two reports in the project root directory (any or all of
the three):
- A JSON report named pagean-results.json.
- An HTML report named pagean-results.html.
Both reports contain:
- The time of test execution.
- A summary of the total tests and results (passed, warning, failed, and page errors).
- The detailed test results, including the URL tested, list of tests performed on that URL with results, and, if applicable, any relevant data associated with the test failure (for example the console errors if the console error test fails).
Complete reports for the example case in this project (the tests as specified
in the project .pageanrc.json file) can be found at the preceding links.
Pagean looks for a configuration file as specified via the CLI, or defaults to
a file named .pageanrc.json in the project root. If the configuration file is
not found, is not valid JSON, or doesn't contain any URLs to check, the job
fails.
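The three failure conditions above (file not found, invalid JSON, no URLs) can be sketched as a simple loader. This is an illustration only: the function names and error messages are hypothetical, and Pagean's actual validation is schema-based and more thorough.

```javascript
const fs = require('fs');

// Hypothetical validation of the configuration text.
function parseConfig(text) {
  let config;
  try {
    config = JSON.parse(text);
  } catch (error) {
    throw new Error(`Configuration is not valid JSON: ${error.message}`);
  }
  // The urls array must contain at least one value.
  if (!Array.isArray(config?.urls) || config.urls.length === 0) {
    throw new Error('Configuration must contain at least one URL');
  }
  return config;
}

// Reading a missing file throws (ENOENT), covering the "not found" case.
function loadConfig(file = './.pageanrc.json') {
  return parseConfig(fs.readFileSync(file, 'utf8'));
}

console.log(parseConfig('{ "urls": ["http://localhost:3000/"] }').urls);
```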
Below is an example .pageanrc.json file, which is broken into seven major
properties:
- htmlhintrc: An optional path to an htmlhintrc file to be used in the rendered HTML test.
- project: An optional name of the project, which is included in HTML and JSON reports.
- puppeteerLaunchOptions: An optional set of options to pass to Puppeteer on launch. The complete list of available options can be found at https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#puppeteerlaunchoptions.
- reporters: An optional array of reporters indicating the test reports that should be provided. There are three possible options: cli, html, and json. The cli option reports all test details to the console, but the final results summary is always output (even with cli disabled). If reporters is specified, at least one reporter must be included. The default value, as specified below, is all three reporters enabled.
- settings: These settings enable/disable or configure tests, and are applied to all tests overriding the default values.
  - The shorthand notation allows easy enabling/disabling of tests. In this format the test name is given with a Boolean value to enable or disable the test. In this case any other test-specific settings use the default values.
  - The longhand version includes an object for each test. Every test includes
    two possible properties (some tests include additional settings):
    - enabled: A Boolean value to enable/disable the test (default true for all tests).
    - failWarn: A Boolean value causing a failed test to report a warning instead of failure. A warning result doesn't cause the test process to fail (exit with an error code). The default value for all tests is false except the externalScriptTest, as shown below.
The shorthand:
"settings": {
  "consoleErrorTest": true
}
is equivalent to the longhand:
"settings": {
  "consoleErrorTest": {
    "enabled": true,
    "failWarn": false
  }
}
- sitemap: Specify a sitemap with URLs to test. If a sitemap is specified, the URLs from the sitemap are added to the urls array. If a URL is in the urls array with settings, those settings are retained. Note that <sitemapindex> is currently not supported. The sitemap object can have the following properties:
  - url: The URL of the sitemap (required if sitemap is included). This can be either an actual URL or a local file.
  - find: A string to search for in sitemap URLs (for example https://somehere.test) (required if replace is specified).
  - replace: The string to replace the find string with (for example http://localhost:3000) (required if find is specified).
  - exclude: An array of strings with regular expressions to exclude URLs from the sitemap (for example ['\.pdf$'] to exclude any PDF files). Since these are string representations of regular expressions, the backslash must be escaped (for example \\.). Exclude is performed before find/replace, so uses the original URLs from the sitemap.
- urls: An array of URLs to be tested, which must contain at least one value. Each array entry can either be a URL string, or an object that contains a url string and an optional settings object. This object can contain any of the settings values identified previously and overrides that setting for testing that URL. The url string can be either an actual URL or a local file, as shown in the example below.
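The order of operations for the sitemap options described above (exclude first, against the original URLs, then find/replace) can be sketched as follows. The function name is hypothetical, and replacing only the first occurrence of the find string is an assumption; the example values come from the option descriptions.

```javascript
// Hypothetical sketch of applying sitemap options to a list of sitemap URLs.
function applySitemapOptions(sitemapUrls, { find, replace, exclude = [] } = {}) {
  // Exclude patterns are strings, so they are compiled to regular expressions.
  const patterns = exclude.map((pattern) => new RegExp(pattern));
  return sitemapUrls
    // Exclude runs first, so patterns match the original sitemap URLs.
    .filter((url) => !patterns.some((pattern) => pattern.test(url)))
    // Assumption: find is a plain string and its first occurrence is replaced.
    .map((url) => (find === undefined ? url : url.replace(find, replace)));
}

const testUrls = applySitemapOptions(
  ['https://somehere.test/index.html', 'https://somehere.test/guide.pdf'],
  {
    find: 'https://somehere.test',
    replace: 'http://localhost:3000',
    exclude: ['\\.pdf$'], // backslash escaped, as in a JSON string
  }
);
console.log(testUrls); // prints [ 'http://localhost:3000/index.html' ]
```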
The following shows all available settings, except sitemap, with the default
values.
{
  "puppeteerLaunchOptions": {
    "headless": "new"
  },
  "reporters": ["cli", "html", "json"],
  "settings": {
    "brokenLinkTest": {
      "enabled": true,
      "failWarn": false,
      "checkWithBrowser": false,
      "ignoreDuplicates": true
    },
    "consoleErrorTest": {
      "enabled": true,
      "failWarn": false
    },
    "consoleOutputTest": {
      "enabled": true,
      "failWarn": false
    },
    "externalScriptTest": {
      "enabled": true,
      "failWarn": true
    },
    "horizontalScrollbarTest": {
      "enabled": true,
      "failWarn": false
    },
    "pageLoadTimeTest": {
      "enabled": true,
      "failWarn": false,
      "pageLoadTimeThreshold": 2
    },
    "renderedHtmlTest": {
      "enabled": true,
      "failWarn": false
    }
  }
}
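How shorthand values, global settings, and per-URL settings might combine with these defaults can be sketched as follows. This is an assumption-laden illustration, not Pagean's implementation: the function names are hypothetical, only two tests are included for brevity, and merging property by property (rather than replacing a test's whole object) is an assumed behavior.

```javascript
// Defaults for two tests, copied from the listing above.
const defaults = {
  consoleErrorTest: { enabled: true, failWarn: false },
  pageLoadTimeTest: { enabled: true, failWarn: false, pageLoadTimeThreshold: 2 },
};

// Expand shorthand (a Boolean) into longhand; longhand passes through.
const expand = (value) => (typeof value === 'boolean' ? { enabled: value } : value);

// Later sources win: defaults, then global settings, then per-URL settings.
// Assumption: settings are merged property by property.
function resolveSettings(globalSettings = {}, urlSettings = {}) {
  const resolved = {};
  for (const [name, testDefaults] of Object.entries(defaults)) {
    resolved[name] = {
      ...testDefaults,
      ...expand(globalSettings[name] ?? {}),
      ...expand(urlSettings[name] ?? {}),
    };
  }
  return resolved;
}

// Shorthand true resolves to the same object as the longhand form.
console.log(resolveSettings({ consoleErrorTest: true }).consoleErrorTest);
```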
Numerous example config files used in the tests can be found here.
Provided with the Pagean project are container images configured to run the
tests. All available image tags can be found in the
registry.gitlab.com/gitlab-ci-utils/pagean repository.
Details on each release can be found on the Releases page.
Note: any images in the gitlab-ci-utils/pagean/tmp repository are temporary
images used during the build process and may be deleted at any point.
In Puppeteer v19 the default cache location for installing the Chrome binary
was changed from within the project's node_modules folder to
~/.cache/puppeteer. To simplify execution in a container, the
PUPPETEER_CACHE_DIR environment variable is set to install the Chrome binaries
in /home/pptruser/.cache/puppeteer during container build, so setting it to
another value before execution can cause errors where Puppeteer can't find the
Chrome binary.
The following is an example job from a .gitlab-ci.yml file to use this image to run Pagean against another project in GitLab CI:
pagean:
  image: registry.gitlab.com/gitlab-ci-utils/pagean:latest
  stage: test
  script:
    - pagean
  artifacts:
    when: always
    paths:
      - pagean-results.html
      - pagean-results.json
      - pagean-external-scripts/
The container image shown previously includes serve and wait-on installed
globally to run a local HTTP server for testing static content. The example job
below illustrates how to use this for Pagean tests. The script starts the
server in this project's ./tests/fixtures/site directory and uses wait-on to
hold the script until the server is running and returns a valid response. The
referenced pageanrc file is the same as the project default pageanrc, but
references all test URLs from the local server.
pagean:
  image: registry.gitlab.com/gitlab-ci-utils/pagean:latest
  stage: test
  before_script:
    # Start static server in test cases directory, discarding any console output,
    # and wait until the server is running.
    - serve ./tests/fixtures/site > /dev/null 2>&1 & wait-on http://localhost:3000
  script:
    - pagean -c static-server.pageanrc.json
  artifacts:
    when: always
    paths:
      - pagean-results.html
      - pagean-results.json
      - pagean-external-scripts/
A command line tool is also available to lint pageanrc files, which is executed as follows:
Installed globally:
> pageanrc-lint [options] [file] (default: "./.pageanrc.json")
Installed locally:
> npx pageanrc-lint [options] [file] (default: "./.pageanrc.json")
Lint a pageanrc file
Options:
  -V, --version  output the version number
  -j, --json     output JSON with full details
  -h, --help     display help for command
The --json option outputs the JSON results to stdout in all cases for
consistency ([] if no errors found, so that it always outputs valid JSON).
Otherwise errors are output to stderr, for example:
.\tests\test-configs\cli-tests\some-test.pageanrc.json
<pageanrc>.puppeteerLaunchOptions must NOT have fewer than 1 properties
<pageanrc>.reporters[0] must be equal to one of the allowed values (cli, html, json)
<pageanrc>.settings.consoleOutputTest must be either Boolean or object with the appropriate properties
<pageanrc>.settings.pageLoadTimeTest.foo must NOT contain additional properties: "foo"
<pageanrc>.settings.pageLoadTimeTest must be either Boolean or object with the appropriate properties
<pageanrc>.sitemap must use 'find' and 'replace' together
<pageanrc>.urls[2].settings.consoleOutputTest must be either Boolean or object with the appropriate properties
<pageanrc>.urls[3] must be either URL string or object with the appropriate properties
<pageanrc>.urls[5] must have required property 'url'
In some cases, a single error might result in multiple messages based on the
options in the schema definition, especially for cases that can be either a
single value or an object with specific properties (for example the errors for
<pageanrc>.settings.pageLoadTimeTest in the preceding example).
Note that because of the large number of options, which are dependent on an
external project, the linting of puppeteerLaunchOptions only checks that at
least one property is provided. It doesn't check the detailed settings.
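Because the --json option always writes valid JSON to stdout, its output can be consumed from a script. The sketch below is a hedged illustration: the function names are hypothetical, the shape of individual error entries is not documented here so only the array length is used, and it assumes pageanrc-lint is installed locally and that npx is on the PATH (on Windows the spawn call may need additional options).

```javascript
const { spawnSync } = require('child_process');

// Count lint errors from the JSON written to stdout ([] means no errors).
function countLintErrors(stdout) {
  const errors = JSON.parse(stdout);
  return errors.length;
}

// Run the linter for a file and return the number of errors reported.
// Assumption: pageanrc-lint is installed locally and runnable via npx.
function lintPageanrc(file = './.pageanrc.json') {
  const { stdout, error } = spawnSync('npx', ['pageanrc-lint', '--json', file], {
    encoding: 'utf8',
  });
  if (error) throw error; // the process could not be started
  return countLintErrors(stdout);
}

console.log(countLintErrors('[]')); // prints 0
```

A CI script could call lintPageanrc() and fail the job when the count is greater than zero.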