    nrk-sapmi-crawler

    0.2.1 • Public • Published

    Crawler for NRK Sapmi news bulletins. The crawled text is the basis for stopword-sami and for an example search engine for content in Sami.

    It crawls news bulletins in Northern Sami, Lule Sami and South Sami.

    The code is not the cleanest, but it works well enough and will hopefully keep working without much maintenance for the next couple of years. If you just want the datasets, install the stopword-sami module.

    Getting a list of article IDs to crawl

    import { getList, crawlHeader, readIfExists, calculateIdListAndWrite } from 'nrk-sapmi-crawler'
    
    const southSami = {
      id: '1.13572943',
      languageName: 'Åarjelsaemien',
      url: 'https://www.nrk.no/serum/api/content/json/1.13572943?v=2&limit=1000&context=items',
      file: './lib/list.southSami.json'
    }
    
    // Bringing it all together: fetch the list from the URL and read the existing file.
    // readIfExists(...).catch(e => e) resolves with the error if the read fails,
    // so Promise.all still resolves. If there is new content, the arrays are merged and written.
    Promise.all([getList(southSami.url, crawlHeader), readIfExists(southSami.file).catch(e => e)])
      .then((data) => {
        calculateIdListAndWrite(data, southSami.id, southSami.file, southSami.languageName)
      })
      .catch(function (err) {
        console.log('Error: ' + err)
      })
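
    The merge step mentioned in the comment above could look roughly like the sketch below. It is only an illustration, not the package's implementation: `mergeIdLists` is a hypothetical helper, the IDs are made up, and it assumes the ID lists are plain arrays of strings. It also assumes a failed file read arrives as an Error value (because of `.catch(e => e)`), which it treats as an empty stored list.

```javascript
// Hypothetical helper, not part of nrk-sapmi-crawler.
// Merges newly fetched article IDs into the IDs already stored on disk,
// keeping the stored order and skipping duplicates.
function mergeIdLists (stored, fetchedIds) {
  // Assumption: a failed read shows up as an Error value, not an array.
  const storedIds = Array.isArray(stored) ? stored : []
  const seen = new Set(storedIds)
  const merged = [...storedIds]
  for (const id of fetchedIds) {
    if (!seen.has(id)) {
      seen.add(id)
      merged.push(id)
    }
  }
  return { merged, hasNewContent: merged.length > storedIds.length }
}

// Made-up IDs, for illustration only.
const result = mergeIdLists(['1.100', '1.200'], ['1.200', '1.300'])
console.log(result.merged) // [ '1.100', '1.200', '1.300' ]
console.log(result.hasNewContent) // true
```

    Returning a `hasNewContent` flag lets the caller skip the file write when nothing has changed.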

    Changing the user-agent for the crawler

    crawlHeader['user-agent'] = 'name of crawler/version - comment (i.e. contact-info)'
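
    As a small illustration of the format above, the sketch below builds such a string from its parts. `buildUserAgent` is a hypothetical helper, the crawler name and contact address are placeholders, and the local `crawlHeader` object is only a stand-in for the one exported by the package (its real default value is not shown here).

```javascript
// Stand-in for the crawlHeader object exported by the package (assumption:
// a plain object with a 'user-agent' key, as the assignment above suggests).
const crawlHeader = { 'user-agent': 'placeholder' }

// Hypothetical helper that formats "name/version - comment".
function buildUserAgent (name, version, comment) {
  return `${name}/${version} - ${comment}`
}

// Placeholder values, replace with your own crawler name and contact info.
crawlHeader['user-agent'] = buildUserAgent('my-sami-crawler', '1.0.0', 'crawler@example.com')
console.log(crawlHeader['user-agent'])
// my-sami-crawler/1.0.0 - crawler@example.com
```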

    Getting the content from a list of IDs

    import { crawlContentAndWrite } from 'nrk-sapmi-crawler'
    const appropriateTime = 2000
    
    const southSami = {
      idFile: './datasets/list.southSami.json',
      contentFile: './datasets/content.southSami.json'
    }
    
    
    async function crawl () {
      await crawlContentAndWrite(southSami.idFile, southSami.contentFile, appropriateTime)
    }
    
    crawl()
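
    The example above passes `appropriateTime = 2000` without saying what it means. Assuming it is a pause in milliseconds between requests (a guess based on the name and value, not confirmed here), a polite sequential crawl could be sketched as below. `crawlSequentially`, `sleep` and `fakeFetch` are hypothetical names, not part of the package.

```javascript
// Hypothetical sketch, not the package's code: process IDs one at a time
// and wait delayMs between requests so the server is not hammered.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

async function crawlSequentially (ids, fetchOne, delayMs) {
  const results = []
  for (let i = 0; i < ids.length; i++) {
    results.push(await fetchOne(ids[i]))
    // No need to wait after the last request.
    if (i < ids.length - 1) await sleep(delayMs)
  }
  return results
}

// A fake fetcher so the example runs without network access.
const fakeFetch = async (id) => ({ id, body: `content for ${id}` })

crawlSequentially(['1.100', '1.200'], fakeFetch, 50)
  .then((items) => console.log(items.map((item) => item.id)))
  .catch((err) => console.error('Error: ' + err))
```

    Running the requests sequentially rather than with Promise.all keeps at most one request in flight, which is the point of the delay.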

    Install

    npm i nrk-sapmi-crawler

    Weekly Downloads

    14

    Version

    0.2.1

    License

    MIT

    Unpacked Size

    138 kB

    Total Files

    12

    Last publish

    Collaborators

    • eklem