nrk-sapmi-crawler

0.2.1 • Public • Published


Crawler for NRK Sapmi news bulletins that will be the basis for stopword-sami and an example search engine for content in Sami.

Crawl news bulletins in Northern Sami, Lule Sami and South Sami.

The code is not the cleanest, but it works well enough and will hopefully run without too much maintenance for the next couple of years. If you just want the datasets, install the stopword-sami module instead.

Getting a list of article IDs to crawl

import { getList, crawlHeader, readIfExists, calculateIdListAndWrite } from 'nrk-sapmi-crawler'

const southSami = {
  id: '1.13572943',
  languageName: 'Åarjelsaemien',
  url: 'https://www.nrk.no/serum/api/content/json/1.13572943?v=2&limit=1000&context=items',
  file: './lib/list.southSami.json'
}

// Bringing it all together: fetch the URL and read the existing file; if there is new content, merge the arrays and write the result
Promise.all([getList(southSami.url, crawlHeader), readIfExists(southSami.file).catch(e => e)])
  .then((data) => {
    calculateIdListAndWrite(data, southSami.id, southSami.file, southSami.languageName)
  })
  .catch((err) => {
    console.log('Error: ' + err)
  })

To change the user-agent of the crawler

crawlHeader['user-agent'] = 'name of crawler/version - comment (i.e. contact-info)'

Getting the content from a list of IDs

import { crawlContentAndWrite } from 'nrk-sapmi-crawler'
const appropriateTime = 2000 // milliseconds to wait between requests, so the crawl stays polite

const southSami = {
  idFile: './datasets/list.southSami.json',
  contentFile: './datasets/content.southSami.json'
}

async function crawl () {
  await crawlContentAndWrite(southSami.idFile, southSami.contentFile, appropriateTime)
}

crawl()
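For readers wondering how a delay like `appropriateTime` is typically applied, here is a hypothetical, self-contained sketch of a rate-limited crawl loop. It is not the package's internals; `crawlPolitely`, `sleep`, and the stubbed fetcher are assumptions made for illustration.

```javascript
// Promise-based pause between requests
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

// Fetch items one at a time, waiting delayMs between requests so the
// server isn't hammered.
async function crawlPolitely (ids, delayMs, fetchOne) {
  const results = []
  for (const id of ids) {
    results.push(await fetchOne(id)) // fetch one article at a time
    await sleep(delayMs)             // wait before the next request
  }
  return results
}

// Demo with a stubbed fetcher and a short delay
crawlPolitely(['1.111', '1.222'], 50, async (id) => 'content of ' + id)
  .then((r) => console.log(r)) // → [ 'content of 1.111', 'content of 1.222' ]
```

Fetching sequentially with a fixed pause trades speed for politeness, which fits a crawler that only needs to run occasionally.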

Install

npm i nrk-sapmi-crawler

License

MIT

Collaborators

eklem