nlp-corpus

lots of weird english sentences

npm install nlp-corpus

by Spencer Kelly

see french, german, and spanish translations

nlp-corpus is a proud series of weird texts from a delicious smattering of sources - aimed at getting cosmopolitan flavours of english - highbrow, lowbrow and unibrow - dialects, typos, shakespeare, unicode, 19th century, aggressive emoji, and epic nsfw slurs into your training data.

it is 50,000 sentences, or 5mb, split into 50 files of randomized sentences.

its role is mainly to kick the tires a bit, as creatively as possible, for fuzzy linguistic parsing.

suggestive American rock lyrics
campy Friends tv-show transcripts
vulnerable drug-trip reports from Erowid
singaporean SMS messages
State of the union logorrhea
generally-offensive 90's rap
Legal descriptions in NAFTA
20th century romantic fiction
pedantic arguments on reddit
arcane and dense jeopardy questions

Note that some of this text is nsfw, or contains offensive content, badly-formatted unicode, weird indentation, ascii art, antiquated shorthands, etc.

These texts were found by just clicking around on the internet. Running them blindly through your parser should be considered fair use, but please don't republish them commercially, or anything like that.

ok go.

npm install nlp-corpus

running this library server-side loads a subset of the documents - about 3mb total

import corpus from 'nlp-corpus'

// all 10k sentences, in an array
let arr = corpus.all()

// or load just a few:
arr = corpus.some(400)

// random sentence
let str = corpus.random()
// random 5 sentences (arr is already declared above, so reassign it)
arr = corpus.some(5) // n can only be <= 1,500
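
As a rough sketch of what "kicking the tires" could look like: the harness below runs every sentence through a parser and collects the ones that throw. `kickTires` and `toyParse` are made-up names for illustration, not part of nlp-corpus, and the hard-coded sample stands in for the array you'd get from `corpus.all()` or `corpus.some(400)`.

```javascript
// Hypothetical harness: `parse` is whatever parser you want to stress-test.
function kickTires(sentences, parse) {
  const failures = []
  for (const text of sentences) {
    try {
      parse(text)
    } catch (err) {
      // remember which sentence broke the parser, and why
      failures.push({ text, message: err.message })
    }
  }
  return failures
}

// Stand-in parser, for illustration only:
// it rejects empty strings and input with no latin letters.
function toyParse(text) {
  if (text.trim() === '') throw new Error('empty sentence')
  if (!/[a-z]/i.test(text)) throw new Error('no latin letters')
  return text.split(/\s+/)
}

// In real use, pass corpus.all() or corpus.some(400) instead of this sample.
const sample = ['ok go.', '', '¯\\_(ツ)_/¯', 'u there lah?']
const failures = kickTires(sample, toyParse)
console.log(failures.length + ' of ' + sample.length + ' sentences failed')
```

Collecting failures instead of stopping at the first one means a single run shows every sentence your parser chokes on.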

or on the client-side, there's a one-liner that fetches the docs:

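<script src="http://unpkg.com/nlp-corpus"></script>
<script>
  // top-level await isn't available in a classic script, so wrap it in an async function
  (async () => {
    // load a document lazily
    await nlpCorpus.fetch(2) // 1 - 20
    // (each doc is about 150kb)
    let arr = nlpCorpus.random(4) // 1 - 1,500
  })()
</script>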

Contents:

Dialog
Music lyrics
Fiction
Speeches
Wikipedia
Internet comments
Questions
Instructions
News Headlines
Reviews
Legal Text
Jokes & puns
Literature
Email text
