a-extractor

    2.0.2 • Public • Published

    📃 Article extractor

    Database of expressions used for extracting content from blogs and articles.

    The main database is in JSON5 format, a strict subset of JavaScript. For convenience, it is also available as plain JSON.

    The extraction expressions are Cheerio selectors, which are similar to jQuery selectors.

    The targeted information is:

    • the author
    • the date when the article was written
    • and of course, the article text, as clean as possible

    This project is designed to be used with Clean-Mark, but you can use it however you want.

    86 domains available

    • abcnews.go.com
    • aeon.co
    • agroinfo.ro
    • arenait.net
    • arstechnica.com
    • articles.latimes.com
    • artsy.net
    • bbc.com
    • beta.theglobeandmail.com
    • bigthink.com
    • bindiribli.ro
    • bossfeed.net
    • businessinsider.com
    • collectivelyconscious.net
    • curentul.info
    • dailymail.co.uk
    • deepdotweb.com
    • digi24.ro
    • earthsky.org
    • edition.cnn.com
    • engadget.com
    • express.co.uk
    • farnamstreetblog.com
    • fastcompany.com
    • finesociety.ro
    • firstpost.com
    • foxnews.com
    • galacticconnection.com
    • gandeste.org
    • gazetadambovitei.ro
    • gnosticwarrior.com
    • hackread.com
    • hbr.org
    • hotnews.ro
    • howtogeek.com
    • huffingtonpost.com
    • info.localytics.com
    • infoalert.ro
    • irishmirror.ie
    • isgp-studies.com
    • jamesclear.com
    • jurnalul.ro
    • latimes.com
    • life.ro
    • mashable.com
    • merckmanuals.com
    • money.cnn.com
    • nautil.us
    • nbcnews.com
    • ncbi.nlm.nih.gov
    • neonnettles.com
    • news.com.au
    • newscientist.com
    • newyorker.com
    • nytimes.com
    • nzherald.co.nz
    • observator.tv
    • pri.org
    • qz.com
    • romaniaa.ro
    • rt.com
    • rts.earth
    • smh.com.au
    • start-up.ro
    • stiri.tvr.ro
    • stirileprotv.ro
    • techcrunch.com
    • techradar.com
    • telegraph.co.uk
    • theatlantic.com
    • theguardian.com
    • theliberal.ie
    • thenextweb.com
    • theverge.com
    • thrillist.com
    • torrentfreak.com
    • usatoday.com
    • usnews.com
    • vox.com
    • wakingtimes.com
    • wall-street.ro
    • washingtonpost.com
    • weforum.org
    • wsj.com
    • yahoo.com
    • ziare.com
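
As a rough illustration of how a consumer might pick the right expressions for an article URL, here is a minimal sketch. It assumes the database is an object keyed by domain name, with one selector string per targeted field. The entry shape, the selector values, and the `findExpressions` helper are all hypothetical and are not taken from the package itself.

```javascript
// Hypothetical sample entries: the real database schema and selectors may differ.
const sampleDb = {
  'bbc.com': { author: '.author-name', date: '.publish-date', text: '.article-body' },
  'edition.cnn.com': { author: '.byline', date: '.timestamp', text: '.article-content' }
}

// Find the entry for a URL by trying the full hostname first,
// then each parent domain (www.bbc.com -> bbc.com).
function findExpressions (db, articleUrl) {
  const host = new URL(articleUrl).hostname.toLowerCase()
  const labels = host.split('.')
  // Stop before the last label so a bare TLD such as "com" is never tried.
  for (let i = 0; i < labels.length - 1; i++) {
    const candidate = labels.slice(i).join('.')
    if (Object.prototype.hasOwnProperty.call(db, candidate)) {
      return { domain: candidate, ...db[candidate] }
    }
  }
  // No site-specific expressions are known for this URL.
  return null
}
```

Trying the full hostname before its parents lets a subdomain entry such as `edition.cnn.com` take priority while `www.bbc.com` still resolves to `bbc.com`. A `null` result would be the point at which a tool falls back to generic metadata extraction.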

    Important

    Clean-Mark already has algorithms to extract most of this information when a website is SEO-friendly, e.g. when it follows schema.org/Article, Microformats, or the Open Graph protocol.
    But it's not a perfect tool 🤖 and it needs help from us humans 🙄

    Contributions

    We ❤️ contributions !!!

    Want to report a bug, request a feature, or contribute? Contributions are accepted only through the A-Extractor GitHub repository.

    The "fork-and-pull" Git workflow:

    1. Fork the repo on GitHub
    2. Clone the project to your own machine
    3. Work on your fork
      1. Make your changes and additions
      2. Change or add tests if needed
      3. Run tests and make sure they pass
      4. Add changes to README.md if needed
    4. Commit changes to your own branch
    5. Make sure you merge the latest from "upstream" and resolve conflicts if there are any
    6. Push your work back up to your fork
    7. Submit a pull request so that we can review your changes

    License

    MIT © Cristi Constantin.

    Install

    npm i a-extractor

    Unpacked Size

    947 kB

    Total Files

    37

    Collaborators

    • croqaz