WebPDFArticleScrape
WebPDFArticleScrape is an npm module that scrapes the main content out of PDFs and web page articles.
var scraper = ;
Input:
URL or Directory
Returned DataStructures:
sizeMap: Map<key: fontSize, val: Array of all text chunks of that size>
output: {
title: [Array of all text chunks classified as titles],
content: [Array of all text chunks classified as content]
}
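
To make the shape of `sizeMap` concrete, here is a minimal sketch of how text chunks could be grouped by font size. It is not the module's internal code: the `buildSizeMap` helper and the `{ fontSize, text }` chunk shape are assumptions made up for illustration.

```javascript
// Hypothetical helper, NOT part of the module's API.
// Assumes each chunk is an object of the form { fontSize, text }.
function buildSizeMap(chunks) {
  const sizeMap = new Map();
  for (const { fontSize, text } of chunks) {
    // Create the bucket for this font size the first time it is seen.
    if (!sizeMap.has(fontSize)) sizeMap.set(fontSize, []);
    sizeMap.get(fontSize).push(text);
  }
  return sizeMap;
}

// Made-up sample data.
const chunks = [
  { fontSize: 24, text: 'A Study of Scraping' },
  { fontSize: 12, text: 'First paragraph.' },
  { fontSize: 12, text: 'Second paragraph.' },
  { fontSize: 8, text: 'Page 1' },
];

const sizeMap = buildSizeMap(chunks);
console.log(sizeMap.get(12)); // the two chunks set at size 12
```

Keying by font size works because titles, body text, and page furniture usually use distinct sizes, so each bucket can be classified as a whole.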
Basic Usage
scraper;
scraper;
scraper;
scraper;
### Advanced Usage

By default, the module runs a local web server on port 8081 so that you can manually configure parsing in case the automatic classification gets a page wrong. After every web parse, it logs a link to the manual config page to the console. We encourage using it, since published configurations help us improve the system.
HOW TO USE MANUAL CONFIG SITE:
Manually toggle which font sizes correspond to the useful text:
->Yellow : the text is a title
->Green : the text is content
->Red : ignore this text
Once you have fixed the page, give your configuration a name and publish it.
A more in-depth visual explanation will be provided in the near future.
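
The effect of the yellow/green/red toggles can be sketched as a mapping from font size to a label, applied to a `sizeMap` to produce the `output` object. This is only an illustration under assumptions: the `applyConfig` function and the plain-object config format are hypothetical, and the format the config site actually publishes may differ.

```javascript
// Hypothetical sketch, NOT the module's API.
// Assumes a config object mapping font size -> 'title' | 'content' | 'ignore',
// mirroring the yellow / green / red toggles.
function applyConfig(sizeMap, config) {
  const output = { title: [], content: [] };
  for (const [fontSize, texts] of sizeMap) {
    // Font sizes missing from the config are treated as ignored.
    const label = config[fontSize] || 'ignore';
    if (label === 'title') output.title.push(...texts);
    else if (label === 'content') output.content.push(...texts);
  }
  return output;
}

// Made-up sample data.
const exampleSizeMap = new Map([
  [24, ['A Study of Scraping']],
  [12, ['First paragraph.', 'Second paragraph.']],
  [8, ['Page 1']],
]);
const exampleConfig = { 24: 'title', 12: 'content', 8: 'ignore' };

const result = applyConfig(exampleSizeMap, exampleConfig);
console.log(result);
```

In this sketch, anything not explicitly marked as title or content is dropped, which matches the "ignore" meaning of the red toggle.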
scraper //logs more information about the processing to the console
scraper
scraper //the regex for title classification sometimes causes HUGE lag, so ignoring titles can be useful
scraper
scraper //stops logging the manual config link
scraper //closes the server