# icrawl

Crawl pages and generate HTML files whose save paths correspond to the URL paths.
## Features

- With nginx, you can enable SEO for pages rendered on the front end.
- Built-in server: you can crawl pages directly from your build directory.
- The HTML save path corresponds to the URL path.
- Does not depend on any front-end framework.
- Provides both an API and a command-line interface.
## Examples

### Node API

```js
const path = require('path')
const Crawl = require('icrawl')

const crawl = new Crawl({
  requestTimeout: 10000,
  isNormalizeSourceURL: true,
  routes: [
    'https://nodejs.org/api/path.html',
    'https://nodejs.org/api/url.html'
  ],
  outputPath: path.join(__dirname, 'html') // directory where the crawled HTML is saved
})

crawl.start()
```
## Configuration

Create `.icrawlrc.js` in your project root:

```js
const path = require('path')

module.exports = {
  isNormalizeSourceURL: true,
  routes: [
    'https://nodejs.org/api/path.html',
    'https://nodejs.org/api/url.html'
  ],
  outputPath: path.join(__dirname, 'html')
}
```
Then add a script to `package.json`:

```json
{
  "scripts": {
    "crawl": "icrawl"
  }
}
```
## Options
- `viewport` <Object> viewport size
  - `width` <Number>
  - `height` <Number>
- `maxPageCount` <Number> Maximum number of pages opened in parallel, default: `10`
- `isNormalizeSourceURL` <Boolean | Object> Whether to convert the relative paths of images, anchors, links, and scripts in the crawled HTML to absolute paths. For example, when the crawled page URL is `http://www.example.com/example`, `/favicon.ico` is rewritten to `http://www.example.com/favicon.ico`. Each resource type can also be toggled individually. default: `false`
  - `links` <Boolean>
  - `images` <Boolean>
  - `scripts` <Boolean>
  - `anchors` <Boolean>
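The rewrite follows standard URL resolution. A minimal sketch of what the conversion amounts to (not icrawl's actual implementation) using the WHATWG URL API, with the page URL from the example above:

```javascript
// Sketch only: resolve each relative src/href against the crawled page's URL.
const pageURL = 'http://www.example.com/example'

function normalize (src) {
  // `new URL(relative, base)` implements standard URL resolution
  return new URL(src, pageURL).href
}

console.log(normalize('/favicon.ico')) // → 'http://www.example.com/favicon.ico'
```

Already-absolute URLs pass through unchanged, since the base is ignored when the reference has its own scheme.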
- `requestTimeout` <Number> Request timeout in milliseconds, default: `30000`; set to `0` to wait indefinitely
- `host` <String> default: `''`
- `routes` <Array<String>> The list of routes to crawl; relative paths require the `host` option to be set
- `outputPath` <String> Directory where the HTML is saved
- `saveHTML` <Boolean> Whether to save the crawled page as HTML, default: `true`
- `depth` <Number | Object> Page crawl depth. With page A configured in `routes` (depth `0`), a link on A to page B gives B depth `1`, and a link on B to page C gives C depth `2`. default: `0`
  - `value` <Number> page depth
  - `include` <RegExp> Only collect links matching this pattern, default: `null`
  - `exclude` <RegExp> Skip links matching this pattern, default: `null`
  - `after` <Function(Array<PageRoute>)> Callback invoked after page-link collection completes, default: `null`
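As a sketch of the object form (the route URL and patterns are illustrative assumptions, not part of icrawl):

```javascript
// .icrawlrc.js — illustrative depth configuration:
// follow links two hops deep, but only ones that stay on nodejs.org.
module.exports = {
  routes: ['https://nodejs.org/api/path.html'],
  depth: {
    value: 2,                // crawl links found on crawled pages, two levels deep
    include: /nodejs\.org/,  // only collect links matching this pattern
    exclude: /changelog/i,   // skip links matching this pattern
    after (pageRoutes) {
      // runs once link collection completes
      console.log(`collected ${pageRoutes.length} routes`)
    }
  }
}
```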
- `serverConfig` <String | Object> If the pages to crawl are not already served anywhere, this option starts a local server. If it is a `String`, it specifies the directory where the pages are located. default: `null`
  - `path` <String> Directory where the pages are located, e.g. your `build` directory; you can then run `icrawl` after your `build` command, or chain the two commands in `scripts`
  - `port` <Number> default: `3333`
  - `public` <String> Required when `isNormalizeSourceURL` is also set to `true`; relative paths are converted relative to this option
  - `isFallback` <Boolean> For SPAs: always rewrite the requested location to `index.html`
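For instance, to crawl a built SPA straight from its output directory (the directory name and routes are illustrative assumptions):

```javascript
// .icrawlrc.js — sketch: serve the local `build` directory on port 3333
// and crawl it; `isFallback` makes every route resolve to index.html (SPA).
module.exports = {
  host: 'http://localhost:3333',
  routes: ['/', '/about'],
  serverConfig: {
    path: 'build',
    port: 3333,
    isFallback: true
  }
}
```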
- `requestInterception` <Object> Filter requests. Used well, this speeds up crawling: there is usually no need to wait for images, CSS, fonts, or third-party scripts to load, since most of the time only the rendered HTML needs to be saved
  - `include` <RegExp>
  - `exclude` <RegExp>
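For example (the pattern is an assumption; adjust it to your assets):

```javascript
// .icrawlrc.js — sketch: block heavy resources that don't change the markup.
module.exports = {
  requestInterception: {
    // filter out requests for images, fonts and stylesheets
    exclude: /\.(png|jpe?g|gif|svg|webp|woff2?|ttf|css)(\?.*)?$/
  }
}
```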
- `progressBarStyle` <Object> Progress bar style
  - `prefix` <String> default: `''`
  - `suffix` <String> default: `''`
  - `remaining` <String> default: `'░'`
  - `completed` <String> default: `'█'`
## crawl.start()

- returns: `Promise`
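Because `start()` returns a `Promise`, it composes with `async`/`await`. A hedged usage sketch (the single route is just an example):

```javascript
const Crawl = require('icrawl')

const crawl = new Crawl({
  routes: ['https://nodejs.org/api/url.html']
})

;(async () => {
  // resolves once all routes have been crawled and saved
  await crawl.start()
  console.log('crawl finished')
})()
```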
## PageRoute

- `url` <String> The crawled page URL
- `root` <PageRoute> The root of the chain
- `referer` <PageRoute> The parent of this URL
## Tips

- By configuring nginx, you can enable SEO for front-end rendered pages.
- If you use nginx, you need to install the set-misc-nginx-module module, or install OpenResty directly.