    urldatabase

    URL Database

    URL Database is a Node.js module that provides content categories for around 90 million domains.

    Two tiers of categories are available. The Tier 1 categories are:

    'Style & Fashion', 'Religion & Spirituality', 'Events and Attractions', 'Shopping', 'Pop Culture', 'Fine Art', 'Books and Literature', 'Television', 'Travel', 'Movies', 'Careers', 'Home & Garden', 'Hobbies & Interests', 'Family and Relationships', 'Sports', 'Real Estate', 'Food & Drink', 'Healthy Living', 'Automotive', 'Medical Health', 'Video Gaming', 'Education', 'Music and Audio', 'Technology & Computing', 'News and Politics', 'Pets', 'Personal Finance', 'Science', 'Business and Finance'

    The Tier 2 categories are listed in the Appendix of this README.

    Domain categories were determined with the following data acquisition and machine learning pipeline:

    • the website of each domain was fetched
    • the text of each website was extracted and pre-processed (lemmatization, removal of punctuation, etc.)
    • for non-English websites, the text was translated to English with a neural machine translation (NMT) solution (the NMT models for the language pairs have BLEU scores above 40)
    • each text was classified with a Tier 1 and a Tier 2 classifier
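
    The pre-processing step can be illustrated with a minimal sketch. It covers only lowercasing, punctuation removal and whitespace tokenization; lemmatization, translation and classification are left out because this README does not name the tools used for them. The function name `preprocess` is hypothetical and is not part of the package.

```javascript
// Hypothetical helper: not part of the urldatabase package.
function preprocess(text) {
  return text
    .toLowerCase()
    // Replace anything that is not a Unicode letter, digit or whitespace with a space.
    .replace(/[^\p{L}\p{N}\s]/gu, ' ')
    // Split on runs of whitespace and drop empty tokens.
    .split(/\s+/)
    .filter(Boolean);
}

console.log(preprocess('Nachrichten, Sport & Filme: die ZDF-Mediathek!'));
// [ 'nachrichten', 'sport', 'filme', 'die', 'zdf', 'mediathek' ]
```

    A production pipeline would likely add language detection and a proper lemmatizer at this point.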

    Installation

    npm i urldatabase
    

    Usage example

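    // Requires the third-party `request` package (npm i request).
    // Note: `request` is deprecated but still works.
    var request = require('request');

    // The API expects a form-encoded POST containing the domain to look up.
    var options = {
      method: 'POST',
      url: 'https://www.alpha-quantum.com/api/domains.php',
      headers: {
        'Content-Type': 'application/x-www-form-urlencoded'
      },
      form: {
        domain: 'www.zdf.de'
      }
    };

    request(options, function (error, response) {
      // Rethrow the original error so its stack trace is preserved.
      if (error) throw error;
      // response.body is the response as a string (see the JSON examples below).
      console.log(response.body);
    });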
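
    Because the `request` package is deprecated, the same call can also be sketched with the `fetch` API built into Node.js 18 and later. This is a sketch that assumes the endpoint accepts the same form-encoded POST as in the example above and returns JSON. The helper names `buildRequest` and `lookupDomain` are hypothetical and are not part of the package.

```javascript
// Hypothetical helpers; the endpoint URL and the `domain` field come from the usage example.
const API_URL = 'https://www.alpha-quantum.com/api/domains.php';

// Build the fetch() init object for a form-encoded POST.
function buildRequest(domain) {
  return {
    method: 'POST',
    headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
    body: new URLSearchParams({ domain }).toString(),
  };
}

// Send the request and parse the body, assuming the API returns JSON.
async function lookupDomain(domain) {
  const response = await fetch(API_URL, buildRequest(domain));
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json();
}

// Usage (performs a network call):
// lookupDomain('www.zdf.de').then(console.log).catch(console.error);
```

    Keeping the request construction separate from the network call makes the encoding easy to check without contacting the API.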
    

    Usage of URL Database

    URL Database can be accessed either via the API, as shown above, or as a dataset file that serves as an offline URL Database.

    The offline URL Database can be used in internal applications, e.g. to filter the websites a company's employees can visit by restricting access to non-work websites such as shopping, social media and gaming sites.

    It can also be used in cybersecurity applications or in e-commerce SaaS platforms and services.
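
    The content-filtering use case can be sketched as follows. The format of the dataset file is not described in this README, so the sketch assumes it has already been loaded into a `Map` from domain to its top Tier 1 category. The names `offlineDb`, `blockedCategories` and `isBlocked`, and the sample entries, are hypothetical placeholders.

```javascript
// Assumption: the offline dataset has been loaded into a Map of domain -> top Tier 1 category.
// The entries below are made-up placeholders, not real dataset records.
const offlineDb = new Map([
  ['shop.example', 'Shopping'],
  ['games.example', 'Video Gaming'],
  ['docs.example', 'Technology & Computing'],
]);

// Categories treated as non-work; adjust these to match company policy.
const blockedCategories = new Set(['Shopping', 'Video Gaming']);

// Returns true when the domain's category is on the block list.
// Unknown domains are allowed here; a stricter policy could block them instead.
function isBlocked(domain) {
  const category = offlineDb.get(domain.toLowerCase());
  return category !== undefined && blockedCategories.has(category);
}

console.log(isBlocked('shop.example')); // true
console.log(isBlocked('docs.example')); // false
```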

    JSON output format

    Example output from URL Database for "www.zdf.de" - Tier 1:

    {
      "classification": [
        {
          "category": "Television",
          "value": 0.60773588801323
        },
        {
          "category": "Movies",
          "value": 0.29109074822883085
        },
        {
          "category": "Events and Attractions",
          "value": 0.07486490625416359
        },
        {
          "category": "Family and Relationships",
          "value": 0.005374985197691561
        },
        {
          "category": "Hobbies & Interests",
          "value": 0.005101833789390943
        },
        {
          "category": "Video Gaming",
          "value": 0.003984198425722353
        },
        {
          "category": "Books and Literature",
          "value": 0.002492840101745817
        },
        {
          "category": "Fine Art",
          "value": 0.0023078275948925885
        },
        {
          "category": "Shopping",
          "value": 0.000736829495733268
        },
        {
          "category": "Travel",
          "value": 0.0007148378661549944
        },
        {
          "category": "Religion & Spirituality",
          "value": 0.0006182756059490645
        },
        {
          "category": "Music and Audio",
          "value": 0.0006017436156576558
        },
        {
          "category": "News and Politics",
          "value": 0.0005944575220540115
        },
        {
          "category": "Pop Culture",
          "value": 0.0005872038218177597
        },
        {
          "category": "Healthy Living",
          "value": 0.0005831789414856245
        },
        {
          "category": "Careers",
          "value": 0.0005243635107021117
        },
        {
          "category": "Automotive",
          "value": 0.00039890616180756646
        },
        {
          "category": "Technology & Computing",
          "value": 0.0002859548776286219
        },
        {
          "category": "Real Estate",
          "value": 0.00027637364331928056
        },
        {
          "category": "Personal Finance",
          "value": 0.0001710230563593708
        },
        {
          "category": "Sports",
          "value": 0.00016042771723498377
        },
        {
          "category": "Education",
          "value": 0.00014381866308073145
        },
        {
          "category": "Pets",
          "value": 0.00012728402872631592
        },
        {
          "category": "Business and Finance",
          "value": 0.000123494990696087
        },
        {
          "category": "Style & Fashion",
          "value": 0.00011405926539219588
        },
        {
          "category": "Food & Drink",
          "value": 0.00010023782530038409
        },
        {
          "category": "Science",
          "value": 0.0000877636365314911
        },
        {
          "category": "Home & Garden",
          "value": 0.00007493299862686662
        },
        {
          "category": "Medical Health",
          "value": 0.000021605150073794945
        }
      ],
      "language": "de"
    }
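
    A response of this shape can be reduced to its most likely categories with a few lines of code. The sketch below is an illustration, not part of the package: `topCategories` is a hypothetical helper, and it sorts explicitly rather than relying on the API returning entries in descending order.

```javascript
// Hypothetical helper: return the n highest-scoring category names from a response object.
function topCategories(result, n = 3) {
  return [...result.classification] // copy so the input is not mutated by sort()
    .sort((a, b) => b.value - a.value)
    .slice(0, n)
    .map((entry) => entry.category);
}

// Abbreviated copy of the Tier 1 response for "www.zdf.de".
const sample = {
  classification: [
    { category: 'Television', value: 0.60773588801323 },
    { category: 'Movies', value: 0.29109074822883085 },
    { category: 'Events and Attractions', value: 0.07486490625416359 },
    { category: 'Family and Relationships', value: 0.005374985197691561 },
  ],
  language: 'de',
};

console.log(topCategories(sample, 2)); // [ 'Television', 'Movies' ]
```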
    

    Here is the Tier 2 classification result for the same domain (only the highest-probability categories are shown):

    {
      "classification": [
        {
          "category": "Comedy TV",
          "value": 0.12665120800837792
        },
        {
          "category": "World Movies",
          "value": 0.11467298561750293
        },
        {
          "category": "Fantasy Movies",
          "value": 0.07605491578220645
        },
        {
          "category": "Drama Movies",
          "value": 0.05372015353841327
        },
        {
          "category": "Drama TV",
          "value": 0.048950849776443935
        },
        {
          "category": "Soap Opera TV",
          "value": 0.043373118622605095
        },
        {
          "category": "Science Fiction TV",
          "value": 0.03838582265067825
        },
        {
          "category": "Holiday TV",
          "value": 0.024368499196304464
        },
        {
          "category": "Cinemas and Events",
          "value": 0.02408407549980423
        },
        {
          "category": "Action and Adventure Movies",
          "value": 0.02262422360894283
        },
        {
          "category": "Children's TV",
          "value": 0.01985699003319781
        },
        {
          "category": "Crime and Mystery Movies",
          "value": 0.016198758949356365
        },
        {
          "category": "Reality TV",
          "value": 0.01584871616578955
        },
        {
          "category": "Horror Movies",
          "value": 0.014501264118914434
        },
        {
          "category": "Video Game Genres",
          "value": 0.013148151950373053
        },
        {
          "category": "Music TV",
          "value": 0.013036725828882795
        },
        {
          "category": "Animation TV",
          "value": 0.01281354534376587
        },
        {
          "category": "Romance Movies",
          "value": 0.011537290751170815
        },
        {
          "category": "Travel Books",
          "value": 0.010342167707548545
        },
        {
          "category": "Content Production",
          "value": 0.008501663028851797
        },
        ...
      ]
    }
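
    Tier 2 probabilities are spread over many more categories, so a probability cut-off can be more practical than a fixed top-N. The sketch below assumes the same response shape as above; `aboveThreshold` is a hypothetical helper and the threshold of 0.05 is an arbitrary illustration, not a recommended value.

```javascript
// Hypothetical helper: keep only entries whose probability is at least `threshold`.
function aboveThreshold(result, threshold) {
  return result.classification.filter((entry) => entry.value >= threshold);
}

// First five entries of the Tier 2 response for "www.zdf.de".
const tier2Sample = {
  classification: [
    { category: 'Comedy TV', value: 0.12665120800837792 },
    { category: 'World Movies', value: 0.11467298561750293 },
    { category: 'Fantasy Movies', value: 0.07605491578220645 },
    { category: 'Drama Movies', value: 0.05372015353841327 },
    { category: 'Drama TV', value: 0.048950849776443935 },
  ],
};

const kept = aboveThreshold(tier2Sample, 0.05).map((entry) => entry.category);
console.log(kept); // [ 'Comedy TV', 'World Movies', 'Fantasy Movies', 'Drama Movies' ]
```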
    
    

    Language support

    URL Database contains English as well as non-English domains.

    Appendix

    Tier 2 categories of domains:

    'Beauty', 'Astrology', 'Polish', 'Fashion Trends', 'Street Style', 'Sales and Promotions', 'Celebrity Style', 'Fashion Events', 'Personal Celebrations & Life Events', 'Holiday Shopping', 'Body Art', 'Outdoor Decorating', 'Fiction', 'Personal Care', 'Interior Decorating', 'Auto Buying and Selling', 'Sci-fi and Fantasy', 'Images/Galleries', 'Gifts and Greetings Cards', 'Coupons and Discounts', 'Digital Arts', 'Soap Opera TV', "Women's Fashion", 'Alternative Music', 'Cookbooks', 'Travel Accessories', 'Remote Working', 'Career Advice', 'Spirituality', 'Single Life', 'Travel Books', 'Action and Adventure Movies', "Men's Health", 'Designer Clothing', 'Party Supplies and Decorations', 'Comedy TV', 'Romance Movies', 'Travel Type', 'Travel Locations', 'Poetry', 'Collecting', 'Arts and Crafts', 'Weight Loss', 'Musicals', 'Drama Movies', 'Real Estate Buying and Selling', 'Fantasy Movies', 'Rock Music', 'Swimming', 'Museums & Galleries', 'Costume', 'Parenting', 'Science Fiction TV', 'Auto Insurance', 'Science Fiction Movies', 'Country Music', "Men's Fashion", 'Reality TV', 'Crime and Mystery Movies', 'Retail Property', 'Dining Out', 'Music TV', 'Design', 'Social', 'Pet Supplies', 'Flower Shopping', 'Fitness and Exercise', 'World Movies', 'Comics and Graphic Novels', 'Awards Shows', 'Modern Art', 'Dating', 'User Generated', 'Concerts & Music Events', "Children's Literature", 'Divorce', 'Comedy Movies', 'Houses', "Children's Music", 'Career Planning', 'Hip Hop Music', 'Vegan Diets', 'Young Adult Literature', 'Interactive Content', 'Frugal Living', 'High Fashion', 'Homeschooling', 'Animation Movies', 'Home Security', 'Grocery Shopping', 'Genealogy and Ancestry', 'Olympic Sports', 'Search Engine/Listings', 'Console Games', 'Home Appliances', 'Virtual Reality', 'Feature', 'French', 'Cinemas and Events', 'Language Learning', 'Vegetarian Diets', 'Weightlifting', 'Holiday TV', 'VR/AR', 'Cheerleading', 'Workshops and Classes', 'Sporting Events', 'Cats', 'Sports Equipment', 
'Fine Art Photography', 'Content Production', 'Auto Rentals', 'Consumer Electronics', 'Magic and Illusion', 'Celebrity Homes', 'Space and Astronomy', 'Forum/Community', 'Sports TV', 'Celebrity Families', 'Drama TV', 'Outdoor Activities', 'Auto Repair', 'Home Utilities', 'Personal Taxes', 'Spanish', 'Personal Debt', 'Auto Safety', 'Games and Puzzles', 'Religious Events', 'Home Improvement', 'Theater', 'Job Search', 'Amusement and Theme Parks', 'Dogs', 'Beekeeping', 'Walking', 'Auto Parts', 'Snooker/Pool/Billiards', 'Video', 'Nightclubs', 'Video Game Genres', 'Augmented Reality', 'Horror Movies', 'Rowing', 'Bereavement', 'Pet Adoptions', 'Financial Planning', 'Celebrity Pregnancy', "Children's TV", 'Educational Content', 'Retirement Planning', 'Food Allergies', 'Beach Volleyball', 'Remodeling & Construction', 'Game', 'Zoos & Aquariums', 'Extreme Sports', 'Car Culture', 'Cosmetic Medical Services', 'Birdwatching', 'Casinos & Gambling', 'Gardening', 'Atheism', 'Ice Hockey', 'Alcoholic Beverages', 'Art and Photography Books', "Children's Games and Toys", 'Mobile Games', 'Tagalog', 'Politics', 'World Cuisines', 'Korean', 'Artificial Intelligence', 'Auto Shows', 'Business', 'Auto Technology', 'Figure Skating', 'Road-Side Assistance', 'Food Movements', 'Environment', 'Surgery', 'American Football', 'Dance and Electronic Music', 'Animation TV', 'Political Event', 'Auto Type', 'Afrikaans', 'Dutch', 'Weather', 'Marketplace/eCommerce', 'Field Hockey', 'Editorial/Professional', 'Rugby', 'Dance', 'Classical Music', 'Equine Sports', 'Email', 'Gospel Music', 'Insurance', 'Maltese', "Children's Clothing", 'Vacation Properties', 'Navajo', 'Home Entertaining', 'Bodybuilding', 'Kannada', 'Consumer Banking', 'Indie and Arthouse Movies', 'Desserts and Baking', 'Fantasy Sports', 'Baseball', 'Cooking', 'Arabic', 'Diving', 'Musical Instruments', 'PC Games', 'Golf', "Women's Health", 'World/International Music', 'Portuguese', 'Barbecues and Grilling', 'Cycling', 'Theater Venues and Events', 
'Diseases and Conditions', 'College Education', 'English', 'Landscaping', 'Land and Farms', 'Primary Education', 'Non-Alcoholic Beverages', 'Genetics', 'Irish', 'Smart Home', 'Financial Assistance', 'Chinese', 'Radio Control', 'Healthy Cooking and Eating', 'Gujarati', 'Reggae', 'Agnosticism', 'Early Childhood Education', 'Fijian', 'Model Toys', 'Apprenticeships', 'Malay', 'Bars & Restaurants', 'Samoan', 'Nutrition', 'Biographies', 'Sailing', 'Inline Skating', 'Eldercare', 'Wellness', 'Fish and Aquariums', 'Rodeo', 'Twi', 'Auto Racing', 'Household Supplies', 'Hunting and Shooting', 'Instructional Content', 'Entertainment Content', 'Homework and Study', 'Xhosa', 'Senior Health', 'Soccer', 'Urdu', 'Real Estate Renting and Leasing', 'Disabled Sports', 'Sardinian', 'Lotteries and Scratchcards', 'Croatian', 'Geography', 'Fan Conventions', 'Auto Body Styles', 'Skiing', 'Swedish', 'Online Education', 'Classic Hits', 'Comedy Events', 'Economy', 'Igbo', 'Manx', 'Macedonian', 'Hungarian', 'Finnish', 'Australian Rules Football', 'Review', 'Marathi', 'Turkish', 'Celebrity Relationships', 'Norwegian', 'City', 'Auto Recalls', 'Oriya', 'Wrestling', 'Secondary Education', 'Malls & Shopping Centers', 'Latvian', 'Humor and Satire', 'Amharic', 'Fishing Sports', 'Ojibwe', 'German', 'Swahili', 'Tibetan', 'Bowling', 'Welsh', 'Kazakh', 'Medical Tests', 'Law', 'Volleyball', 'Table Tennis', 'Estonian', 'Special Education', 'Tahitian', 'Vocational Training', 'Marriage and Civil Unions', 'Gymnastics', 'Motorcycles', 'Yoruba', 'Paranormal Phenomena', 'Ganda', 'Vietnamese', 'Talk Radio', 'Softball', 'Belarusian', 'Reptiles', 'Sindhi', 'Office Property', "Children's Health", 'Region/State', 'Bulgarian', 'Veterinary Medicine', 'Educational Assessment', 'Shona', 'Audio', 'Ewe', 'Lithuanian', 'Squash', 'Scooters', 'Blues', 'Sports Radio', 'Continent', 'Kyrgyz', 'Apartments', 'International News', 'Kinyarwanda', 'Javanese', 'Kanuri', 'Nepali', 'Textual', 'Afar', 'Indonesian', 'Computing', 'Danish', 
'Adult Education', 'Czech', 'Biological Sciences', 'Hebrew', 'Basque', 'Chamorro', 'Dash Cam Videos', 'Lao', 'General Social', 'Mongolian', 'Inuktitut', 'Italian', 'Greek', 'Basketball', 'Pashto', 'Darts', 'Martial Arts', 'Hindi', 'Bengali', 'Romansh', 'Personal Investing', 'Physics', 'Pharmaceutical Drugs', 'Breton', 'Lacrosse', 'Tswana', 'Avestan', 'Chichewa', 'Indoor Environmental Quality', 'Track and Field', 'Fula', 'Robotics', 'Opera', 'Akan', 'Malagasy', 'Farsi', 'Assamese', 'Birds', 'Vaccines', 'Western Frisian', 'Uzbek', 'Utility/Online Tool', 'Galician', 'Ido', 'Malayalam', 'Georgian', 'Punjabi', 'Quechua', 'Comedy (Music and Audio)', 'Luxembourgish', 'Tigrinya', 'Guarani', 'Jazz', 'Tsonga', 'Thai', 'Kirundi', 'Slovene', 'Boxing', 'Tamil', 'Catalan', 'Metro', 'Wolof', 'Bashkir', 'Esperanto', 'Hotel Properties', 'Turkmen', 'Interlingua', 'Sanskrit', 'Industries', 'Chemistry'

    Version

    1.0.1

    License

    MIT

    Unpacked Size

    14.5 kB

    Total Files

    3

    Collaborators

    • websitecategorization