WebPDFArticleScrape

WebPDFArticleScrape is an npm module that extracts the main content from PDFs and web page articles.

    var scraper = require('web-pdf-scraper');

Input:
A URL (for a web article) or a directory path (for a PDF)

Returned Data Structures:
sizeMap: Map<key: fontSize, val: Array of all text chunks of that size>
output: {
    title: [Array of all text chunks classified as titles],
    content: [Array of all text chunks classified as content]
}
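
To make the sizeMap shape concrete, here is a minimal sketch that builds a small Map by hand and picks the font size carrying the most text. The sample data, the numeric keys, and the `dominantFontSize` helper are assumptions for illustration only; they are not part of the module, and the module's own classification may work differently.

```javascript
// Hypothetical sizeMap in the shape described above: font size -> text chunks.
// The sizes and strings are invented; the real key type may differ.
const sizeMap = new Map([
  [24, ['Heart']],
  [12, ['The heart is a muscular organ.', 'It pumps blood through the body.']],
  [8, ['Jump to navigation']],
]);

// Return the font size whose chunks hold the most characters in total.
// Assumption: body text usually dominates a page by volume.
function dominantFontSize(map) {
  let bestSize = null;
  let bestLength = -1;
  for (const [size, chunks] of map) {
    const length = chunks.reduce((sum, chunk) => sum + chunk.length, 0);
    if (length > bestLength) {
      bestSize = size;
      bestLength = length;
    }
  }
  return bestSize; // null when the map is empty
}

console.log(dominantFontSize(sizeMap));
```

A heuristic like this is only a starting point for inspecting a sizeMap yourself; `smartPDF` and `smartWeb` already return classified output.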

Basic Usage

Generating sizeMap of a PDF

        scraper.scrapePDF("pdfDir.pdf")
            .then(function (sizeMap) {
                console.log(sizeMap);
            })
            .catch(function (reason) {
                console.log('Handle rejected promise (' + reason + ') here.');
            });

Generating output of a PDF

        scraper.smartPDF("pdfDir.pdf")
            .then(function (output) {
                console.log(output);
            })
            .catch(function (reason) {
                console.log('Handle rejected promise (' + reason + ') here.');
            });

Generating sizeMap of a Web Article

        scraper.scrapeWeb("https://en.wikipedia.org/wiki/Heart")
            .then(function (sizeMap) {
                console.log(sizeMap);
            })
            .catch(function (reason) {
                console.log('Handle rejected promise (' + reason + ') here.');
            });

Generating output of a Web Article

        scraper.smartWeb("https://en.wikipedia.org/wiki/Heart")
            .then(function (output) {
                console.log(output);
            })
            .catch(function (reason) {
                console.log('Handle rejected promise (' + reason + ') here.');
            });
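
Once you have an output object, you will often want it as plain text. The sketch below shows one way to do that, assuming the `{ title, content }` shape described under the returned data structures. The sample object and the `cleanChunks` and `toPlainText` helpers are hypothetical and not provided by the module.

```javascript
// Hypothetical output object in the documented { title, content } shape.
const output = {
  title: ['Heart', 'Structure'],
  content: ['The heart is a muscular organ.', 'It has four chambers.'],
};

// Collapse runs of whitespace and drop chunks that end up empty.
function cleanChunks(chunks) {
  return (chunks || [])
    .map((chunk) => String(chunk).replace(/\s+/g, ' ').trim())
    .filter((chunk) => chunk.length > 0);
}

// Build a plain-text rendering: titles on the first line, content below.
function toPlainText(result) {
  const titles = cleanChunks(result.title).join(' | ');
  const body = cleanChunks(result.content).join('\n');
  return titles + '\n\n' + body;
}

console.log(toPlainText(output));
```

In practice you would call `toPlainText` inside the `.then` callback of `smartPDF` or `smartWeb` instead of on a hand-written object.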

### Advanced Usage

By default the module runs a local web server on port 8081 so you can manually configure parsing in case the automatic classification gets it wrong. After every web parse it logs a link to the configuration page to the console. Using it helps make our system better, so we encourage you to do so.

	HOW TO USE THE MANUAL CONFIG SITE:
		Manually toggle which font sizes correspond to the useful text:
		 -> Yellow : the text is a title
		 -> Green : the text is content
		 -> Red : ignore this text

		Once you have fixed the page, give your configuration a name and publish it.

		A more in-depth visual explanation will be provided in the near future.
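
The color scheme above amounts to a mapping from font size to a label. The following sketch illustrates that idea by applying such a mapping to a sizeMap and producing an output object. The configuration format used by the module is not documented here, so `LABELS`, `colorBySize`, and `applyConfig` are hypothetical names chosen for illustration, and treating unassigned sizes as ignored is an assumption.

```javascript
// Hypothetical labels mirroring the three colors on the manual config site.
const LABELS = { yellow: 'title', green: 'content', red: 'ignore' };

// Build { title, content } from a sizeMap and a Map of font size -> color.
// Assumption: sizes with no assigned color are ignored, like red ones.
function applyConfig(sizeMap, colorBySize) {
  const output = { title: [], content: [] };
  for (const [size, chunks] of sizeMap) {
    const label = LABELS[colorBySize.get(size)] || 'ignore';
    if (label === 'title') {
      output.title.push(...chunks);
    } else if (label === 'content') {
      output.content.push(...chunks);
    }
  }
  return output;
}

// Invented sample data: size 10 has no color and is therefore skipped.
const sizeMap = new Map([
  [24, ['Heart']],
  [12, ['The heart is a muscular organ.']],
  [8, ['Jump to navigation']],
  [10, ['Figure 1']],
]);
const colorBySize = new Map([[24, 'yellow'], [12, 'green'], [8, 'red']]);

console.log(applyConfig(sizeMap, colorBySize));
```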

Additional Useful Functions:

    scraper.makeVerbose()  // logs more information about the processing to the console
    scraper.stopVerbose()

    scraper.ignoreTitles() // the regex for title classification can cause severe lag, so ignoring titles is sometimes useful
    scraper.markTitles()

    scraper.shutUp()       // stops logging the manual config link

    scraper.closeServer()  // closes the server
