    urldatabase

    1.0.6 • Public • Published

    URL Database

    URL Database is a Node.js module that provides content categories for around 90 million domains.

    Two tiers of categories are available. The Tier 1 categories are:

    'Style & Fashion', 'Religion & Spirituality', 'Events and Attractions', 'Shopping', 'Pop Culture', 'Fine Art', 'Books and Literature', 'Television', 'Travel', 'Movies', 'Careers', 'Home & Garden', 'Hobbies & Interests', 'Family and Relationships', 'Sports', 'Real Estate', 'Food & Drink', 'Healthy Living', 'Automotive', 'Medical Health', 'Video Gaming', 'Education', 'Music and Audio', 'Technology & Computing', 'News and Politics', 'Pets', 'Personal Finance', 'Science', 'Business and Finance'

    A sample of the Tier 2 categories is listed in the Appendix of this README.

    The categories of domains were determined with the following data acquisition and machine learning pipeline:

    • the website of each domain was fetched
    • the text of each website was extracted and pre-processed (lemmatization, removal of punctuation, etc.)
    • for non-English websites, the text was translated to English using a neural machine translation (NMT) solution (the NMT models for the language pairs have BLEU scores above 40)
    • each text was classified with the Tier 1 and Tier 2 classifiers
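
    The extraction and pre-processing step can be pictured with a minimal sketch. This is not the pipeline's actual code: the function names `extractText` and `preprocess` are hypothetical, the tag stripping is a crude regular-expression approximation, and lemmatization is left out because it needs a linguistic library.

```javascript
// Illustrative only: a rough stand-in for the "extract and pre-process" step.
// Function names are hypothetical; lemmatization is deliberately omitted.

// Strip markup from an HTML string, leaving only visible text.
function extractText(html) {
  return html
    .replace(/<(script|style)[\s\S]*?<\/\1>/gi, ' ') // drop script and style blocks
    .replace(/<[^>]+>/g, ' ');                        // drop the remaining tags
}

// Lowercase the text, remove punctuation, and split it into tokens.
function preprocess(text) {
  return text
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, ' ') // replace punctuation and symbols with spaces
    .split(/\s+/)
    .filter(Boolean);                  // discard empty tokens
}

const html = '<html><head><style>p{color:red}</style></head>' +
             '<body><p>Heute im ZDF: Nachrichten, Filme und Serien!</p></body></html>';

console.log(preprocess(extractText(html)));
// [ 'heute', 'im', 'zdf', 'nachrichten', 'filme', 'und', 'serien' ]
```

    The Unicode property escapes (`\p{L}`, `\p{N}`) keep letters from non-English alphabets intact, which matters because translation happens after this step.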

    Installation

    npm i urldatabase
    

    Usage example

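    // Requires the `request` package (npm i request), which is now deprecated.
    const request = require('request');

    // POST the domain as a form-encoded body to the categorization endpoint.
    const options = {
      method: 'POST',
      url: 'https://www.alpha-quantum.com/api/domains.php',
      headers: {
        'Content-Type': 'application/x-www-form-urlencoded'
      },
      form: {
        domain: 'www.zdf.de'
      }
    };

    request(options, function (error, response) {
      // Rethrow the original error so its stack trace is preserved.
      if (error) throw error;
      // The response body holds the classification as JSON text.
      console.log(response.body);
    });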
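
    The same call can be sketched without an extra dependency. The version below assumes Node.js 18 or later (where `fetch` and `URLSearchParams` are global) and assumes the endpoint accepts the same form-encoded POST body as the example above; `buildRequest` and `categorize` are hypothetical helper names, not part of this package.

```javascript
// Assumption: Node.js 18+ and an endpoint that accepts the same
// form-encoded POST body as the `request`-based example.
const API_URL = 'https://www.alpha-quantum.com/api/domains.php';

// Build the fetch() arguments for a given domain (hypothetical helper).
function buildRequest(domain) {
  return {
    url: API_URL,
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
      body: new URLSearchParams({ domain }).toString(),
    },
  };
}

// Send the request and resolve with the raw response text.
async function categorize(domain) {
  const { url, init } = buildRequest(domain);
  const response = await fetch(url, init);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.text();
}

// Usage (performs a network call):
// categorize('www.zdf.de').then(console.log).catch(console.error);
```

    Keeping the request construction in its own function makes it possible to check the encoded body without touching the network.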
    

    Usage of URL Database

    The URL categorization database can be accessed either via the API, as shown above, or as a dataset file that serves as an offline URL Database.

    The offline URL Database can be used in internal applications, e.g. for content filtering of the websites visited by a company's employees, restricting access to non-work websites such as shopping, social media and gaming sites.

    It can also be used in cybersecurity applications or in e-commerce SaaS platforms and services.
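
    As a rough illustration of the content filtering use case, the sketch below checks a URL against a block list of Tier 1 categories. The format of the dataset file is not described in this README, so the in-memory `Map` of domain to top category is an assumption, as are the `.example` entries, the chosen blocked categories and the helper names.

```javascript
// Assumption: the offline dataset has been loaded into a Map of
// domain -> top Tier 1 category. The real dataset format may differ.
const categoryByDomain = new Map([
  ['www.zdf.de', 'Television'],
  ['shop.example', 'Shopping'],      // hypothetical entry
  ['games.example', 'Video Gaming'], // hypothetical entry
]);

// Hypothetical policy: Tier 1 categories treated as non-work.
const blockedCategories = new Set(['Shopping', 'Video Gaming']);

// Extract the lowercased hostname from a full URL or a bare domain.
function hostnameOf(urlOrDomain) {
  const withScheme = /^[a-z][a-z0-9+.-]*:\/\//i.test(urlOrDomain)
    ? urlOrDomain
    : `http://${urlOrDomain}`;
  return new URL(withScheme).hostname.toLowerCase();
}

// Return true when the domain's category is on the block list.
// Unknown domains are allowed here; a stricter policy could deny them.
function isBlocked(urlOrDomain) {
  const category = categoryByDomain.get(hostnameOf(urlOrDomain));
  return category !== undefined && blockedCategories.has(category);
}

console.log(isBlocked('https://shop.example/cart')); // true
console.log(isBlocked('www.zdf.de'));                // false
```

    A `Map` lookup keeps each check constant-time, which matters when the filter runs on every outgoing request.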

    Format of JSON output

    Example Tier 1 output from URL Database for "www.zdf.de":

    {
      "classification": [
        {
          "category": "Television",
          "value": 0.60773588801323
        },
        {
          "category": "Movies",
          "value": 0.29109074822883085
        },
        {
          "category": "Events and Attractions",
          "value": 0.07486490625416359
        },
        {
          "category": "Family and Relationships",
          "value": 0.005374985197691561
        },
        {
          "category": "Hobbies & Interests",
          "value": 0.005101833789390943
        },
        {
          "category": "Video Gaming",
          "value": 0.003984198425722353
        },
        {
          "category": "Books and Literature",
          "value": 0.002492840101745817
        },
        {
          "category": "Fine Art",
          "value": 0.0023078275948925885
        },
        {
          "category": "Shopping",
          "value": 0.000736829495733268
        },
        {
          "category": "Travel",
          "value": 0.0007148378661549944
        },
        {
          "category": "Religion & Spirituality",
          "value": 0.0006182756059490645
        },
        {
          "category": "Music and Audio",
          "value": 0.0006017436156576558
        },
        {
          "category": "News and Politics",
          "value": 0.0005944575220540115
        },
        {
          "category": "Pop Culture",
          "value": 0.0005872038218177597
        },
        {
          "category": "Healthy Living",
          "value": 0.0005831789414856245
        },
        {
          "category": "Careers",
          "value": 0.0005243635107021117
        },
        {
          "category": "Automotive",
          "value": 0.00039890616180756646
        },
        {
          "category": "Technology & Computing",
          "value": 0.0002859548776286219
        },
        {
          "category": "Real Estate",
          "value": 0.00027637364331928056
        },
        {
          "category": "Personal Finance",
          "value": 0.0001710230563593708
        },
        {
          "category": "Sports",
          "value": 0.00016042771723498377
        },
        {
          "category": "Education",
          "value": 0.00014381866308073145
        },
        {
          "category": "Pets",
          "value": 0.00012728402872631592
        },
        {
          "category": "Business and Finance",
          "value": 0.000123494990696087
        },
        {
          "category": "Style & Fashion",
          "value": 0.00011405926539219588
        },
        {
          "category": "Food & Drink",
          "value": 0.00010023782530038409
        },
        {
          "category": "Science",
          "value": 0.0000877636365314911
        },
        {
          "category": "Home & Garden",
          "value": 0.00007493299862686662
        },
        {
          "category": "Medical Health",
          "value": 0.000021605150073794945
        }
      ],
      "language": "de"
    }
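
    A consumer usually needs only the most likely category. The sketch below picks it from an object shaped like the Tier 1 output above; it assumes the response body has already been parsed (for instance with `JSON.parse`) and does not rely on the entries being sorted. `topCategory` is a hypothetical helper name.

```javascript
// Return the entry with the highest value, or null for an empty list.
function topCategory(result) {
  return result.classification.reduce(
    (best, entry) => (best === null || entry.value > best.value ? entry : best),
    null
  );
}

// Abbreviated copy of the Tier 1 output for "www.zdf.de".
const tier1 = {
  classification: [
    { category: 'Television', value: 0.60773588801323 },
    { category: 'Movies', value: 0.29109074822883085 },
    { category: 'Events and Attractions', value: 0.07486490625416359 },
  ],
  language: 'de',
};

console.log(topCategory(tier1).category); // Television
```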
    

    Here is the Tier 2 classification result for the same domain (only the highest-probability categories are shown):

    {
      "classification": [
        {
          "category": "Comedy TV",
          "value": 0.12665120800837792
        },
        {
          "category": "World Movies",
          "value": 0.11467298561750293
        },
        {
          "category": "Fantasy Movies",
          "value": 0.07605491578220645
        },
        {
          "category": "Drama Movies",
          "value": 0.05372015353841327
        },
        {
          "category": "Drama TV",
          "value": 0.048950849776443935
        },
        {
          "category": "Soap Opera TV",
          "value": 0.043373118622605095
        },
        {
          "category": "Science Fiction TV",
          "value": 0.03838582265067825
        },
        {
          "category": "Holiday TV",
          "value": 0.024368499196304464
        },
        {
          "category": "Cinemas and Events",
          "value": 0.02408407549980423
        },
        {
          "category": "Action and Adventure Movies",
          "value": 0.02262422360894283
        },
        {
          "category": "Children's TV",
          "value": 0.01985699003319781
        },
        {
          "category": "Crime and Mystery Movies",
          "value": 0.016198758949356365
        },
        {
          "category": "Reality TV",
          "value": 0.01584871616578955
        },
        {
          "category": "Horror Movies",
          "value": 0.014501264118914434
        },
        {
          "category": "Video Game Genres",
          "value": 0.013148151950373053
        },
        {
          "category": "Music TV",
          "value": 0.013036725828882795
        },
        {
          "category": "Animation TV",
          "value": 0.01281354534376587
        },
        {
          "category": "Romance Movies",
          "value": 0.011537290751170815
        },
        {
          "category": "Travel Books",
          "value": 0.010342167707548545
        },
        {
          "category": "Content Production",
          "value": 0.008501663028851797
        },
        ...
      ]
    }
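
    In this Tier 2 output the probability is spread over many categories, so keeping several labels can be more useful than keeping one. The sketch below keeps up to `k` entries above a minimum value; the helper name `topK`, the choice of `k = 3` and the 0.05 cut-off are illustrative assumptions, not recommendations from the database authors.

```javascript
// Return up to k entries with value >= minValue, highest first.
function topK(classification, k, minValue = 0) {
  return classification
    .filter((entry) => entry.value >= minValue) // filter() returns a new array
    .sort((a, b) => b.value - a.value)          // so sorting does not mutate the input
    .slice(0, k);
}

// First five entries of the Tier 2 output for "www.zdf.de".
const tier2 = [
  { category: 'Comedy TV', value: 0.12665120800837792 },
  { category: 'World Movies', value: 0.11467298561750293 },
  { category: 'Fantasy Movies', value: 0.07605491578220645 },
  { category: 'Drama Movies', value: 0.05372015353841327 },
  { category: 'Drama TV', value: 0.048950849776443935 },
];

console.log(topK(tier2, 3, 0.05).map((entry) => entry.category));
// [ 'Comedy TV', 'World Movies', 'Fantasy Movies' ]
```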
    
    

    Language support

    URL Database contains English as well as non-English domains.

    Appendix

    Tier 2 categories of domains (first 23 out of 441):

    'Beauty', 'Astrology', 'Polish', 'Fashion Trends', 'Street Style', 'Sales and Promotions', 'Celebrity Style', 'Fashion Events', 'Personal Celebrations & Life Events', 'Holiday Shopping', 'Body Art', 'Outdoor Decorating', 'Fiction', 'Personal Care', 'Interior Decorating', 'Auto Buying and Selling', 'Sci-fi and Fantasy', 'Images/Galleries', 'Gifts and Greetings Cards', 'Coupons and Discounts', 'Digital Arts', 'Soap Opera TV', "Women's Fashion", ...

    License

    MIT

    Unpacked Size

    13.1 kB

    Total Files

    4

    Collaborators

    • websitecategorization