robots-parse

A lightweight, simple robots.txt parser for Node.js.

Installation

npm install --save robots-parse

Usage

You can use the module to fetch and parse the robots.txt file of a domain, as in the example below:

const robotsParse = require('robots-parse');
 
robotsParse('github.com', (err, res) => {
  console.log('Result:', res);
});

If no callback is specified, the function returns a promise instead:

const robotsParse = require('robots-parse');
 
(async () => {
  const res = await robotsParse('github.com');
  console.log('Result:', res);
})().catch(console.error);

Alternatively, you can use the built-in parser to parse existing robots.txt content, such as a local file or a string. The parser is synchronous, so you don't need a callback or a promise.

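const request = require('request'); // HTTP client used in this example
const {parser} = require('robots-parse');

request('https://google.com/robots.txt', (err, res, body) => {
  if (err) throw err;
  const object = parser(body);
  console.log(object);
});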

Parsing an existing local robots.txt file:

const fs = require('fs');
const {parser} = require('robots-parse');

const content = fs.readFileSync('./robots.txt', 'utf-8');
const object = parser(content);

console.log(object);

How it works

By default, the script fetches and parses the robots.txt file for a given website or domain and looks for the following directives:

Agents: A user-agent identifies a specific spider. The user-agent field is matched against that specific spider’s (usually longer) user-agent.
Host: Supported by Yandex (and not by Google, even though some posts say it is), this directive tells the search engine which hostname to show for the site, for example with or without the www prefix.
Allow: The allow directive specifies paths that may be accessed by the designated crawlers. When no path is specified, the directive is ignored.
Disallow: The disallow directive specifies paths that must not be accessed by the designated crawlers. When no path is specified, the directive is ignored.
Sitemap: An absolute URL that points to a Sitemap, Sitemap Index file or equivalent URL.

If the robots.txt file is successfully retrieved and parsed, the result is an object containing the properties listed above. Each agent found has its own allow and disallow rules. The same rules are also collected, regardless of agent, in the root-level allow and disallow properties.
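
To make the grouping of rules more concrete, here is a minimal, self-contained sketch of how such a result could be built. It is not the robots-parse source code: the function name parseRobotsSketch and the property names (agents, allow, disallow, sitemaps, host) are assumptions made for illustration, so check the actual output of the module before relying on any of them.

```javascript
// Illustrative sketch only, NOT the robots-parse implementation.
// Function and property names here are hypothetical.
function parseRobotsSketch(content) {
  const result = { agents: {}, allow: [], disallow: [], sitemaps: [], host: null };
  let current = [];         // agents named by the current group of User-agent lines
  let lastWasAgent = false; // consecutive User-agent lines share one group

  for (const rawLine of content.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // drop comments
    const sep = line.indexOf(':');
    if (sep === -1) continue; // blank or malformed line

    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === 'user-agent') {
      if (!lastWasAgent) current = []; // a new group starts here
      if (value) {
        current.push(value);
        if (!result.agents[value]) result.agents[value] = { allow: [], disallow: [] };
      }
      lastWasAgent = true;
      continue;
    }
    lastWasAgent = false;

    if (field === 'allow' || field === 'disallow') {
      if (!value) continue; // no path specified: the directive is ignored
      for (const agent of current) result.agents[agent][field].push(value);
      result[field].push(value); // root-level list, regardless of agent
    } else if (field === 'sitemap') {
      result.sitemaps.push(value);
    } else if (field === 'host') {
      result.host = value;
    }
  }
  return result;
}

const sample = [
  'User-agent: *',
  'Disallow: /private/',
  'Disallow:',
  '',
  'User-agent: Googlebot',
  'Allow: /public/',
  'Sitemap: https://example.com/sitemap.xml',
  'Host: example.com',
].join('\n');

const parsed = parseRobotsSketch(sample);
console.log(JSON.stringify(parsed, null, 2));
```

In this sketch the empty Disallow line is skipped, each agent keeps its own rules, and the root-level allow and disallow arrays collect every rule regardless of agent.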

You can read more about the robots.txt specification in Google's reference documentation.

Contributing

Create an issue and describe your idea
Fork the project (https://github.com/b4dnewz/robots-parse/fork)
Create your feature branch (git checkout -b my-new-feature)
Commit your changes (git commit -am 'Add some feature')
Write tests for your code (npm run test)
Publish the branch (git push origin my-new-feature)
Create a new Pull Request
