reddit-crawler

npm install reddit-crawler

Iterate over all submissions in a subreddit.

  • Uses Reddit's CloudSearch API.
  • Auto-renews OAuth access token as it crawls.

Usage

Until Node gets native async iterators, the crawler approximates one with its next() method, which returns Promise<Array | null>.

If the result is falsey, the crawler is done crawling the subreddit.

The array of submissions may be empty. The crawler expands its search interval until it finds results, aiming for 50-99 results per request.

const makeCrawler = require('reddit-crawler')
 
const creds = {
    username: 'foo',
    password: 'secret',
    appId: 'xxx',
    appSecret: 'yyy',
}
 
async function work() {
    const crawler = makeCrawler('webdev', {
        creds,
        userAgent: 'my-crawler:0.0.1 (by /u/foo)',
    })
 
    while (true) {
        const submissions = await crawler.next()
 
        if (!submissions) {
            console.log('end of subreddit')
            break
        }
 
        for (const sub of submissions) {
            await processSubmission(sub)
        }
    }
}
 
function processSubmission(sub) {
    console.log(`title: "${sub.title}"`)
}
 
work().catch(console.error)
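
Once your Node version supports async iteration natively, the next() contract maps cleanly onto an async generator. A minimal sketch, not part of the package; iterateSubmissions is just a local helper built on the next() behavior described above:

async function* iterateSubmissions(crawler) {
    while (true) {
        const submissions = await crawler.next()
        // A falsey result means the subreddit is exhausted
        if (!submissions) return
        // The batch may be empty; yielding nothing is fine
        for (const sub of submissions) {
            yield sub
        }
    }
}
 
async function workWithIterator() {
    const crawler = makeCrawler('webdev', {
        creds,
        userAgent: 'my-crawler:0.0.1 (by /u/foo)',
    })
 
    for await (const sub of iterateSubmissions(crawler)) {
        await processSubmission(sub)
    }
}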

Credentials are for a Reddit app and the user that owns it. By giving the crawler your creds, it can renew its access token as it crawls.

The access token expires after one hour, but large subreddits take much longer than that to crawl (while respecting Reddit's rate limit), which is why the crawler renews the token itself.

Reddit's API requires a user-agent (https://github.com/reddit/reddit/wiki/API) of the form:

<platform>:<app ID>:<version string> (by /u/<reddit username>)
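
For example, filling in that template (every value here is a placeholder):

const userAgent = 'nodejs:my-crawler:0.0.1 (by /u/foo)'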

Options

const Duration = require('reddit-crawler/duration')
 
const crawler = makeCrawler('webdev', {
    // Required
    creds,
    userAgent: 'my-crawler:0.0.1 (by /u/foo)',
    // Optional (here are the defaults)
    initInterval: Duration.ofMinutes(15),
    minInterval: Duration.ofMinutes(10),
    maxInterval: Duration.ofDays(365),
    initMax: new Date(),
})
  • initInterval: the crawler starts off requesting submissions created within this span of time.
  • minInterval/maxInterval: the crawler shrinks/grows its interval to hover around 50-99 results per request, never going below minInterval or above maxInterval.
  • maxInterval also tells the crawler when to give up: if it has grown its interval to the max and still finds no results, it assumes there are no more submissions.
  • initMax: the crawler starts at the initMax date and crawls backwards into the past. Useful when you want to resume progress without re-crawling the top N submissions (see the sketch after this list).
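
For example, a resume could persist the created_utc (epoch seconds) of the last submission it processed and pass it back as initMax on the next run. A sketch, reusing creds and processSubmission from the usage example above; loadLastTimestamp and saveLastTimestamp are hypothetical helpers you would implement yourself:

async function resume() {
    // Hypothetical helper: read the last-seen created_utc from disk or a database
    const lastCreatedUtc = await loadLastTimestamp()
 
    const crawler = makeCrawler('webdev', {
        creds,
        userAgent: 'my-crawler:0.0.1 (by /u/foo)',
        // created_utc is in seconds; Date expects milliseconds
        initMax: lastCreatedUtc ? new Date(lastCreatedUtc * 1000) : new Date(),
    })
 
    while (true) {
        const submissions = await crawler.next()
        if (!submissions) break
        for (const sub of submissions) {
            await processSubmission(sub)
            // Hypothetical helper: persist progress so the next run can resume here
            await saveLastTimestamp(sub.created_utc)
        }
    }
}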

Notes

  • Reddit asks that you hit its API no more than once per second. The crawler has a rudimentary, built-in sleep after each cloud-search request.
  • Set DEBUG=reddit-crawler to see debug logging (example below).
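
For example, assuming your entry script is named crawl.js (a placeholder):

DEBUG=reddit-crawler node crawl.js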
