simple-data-analysis
2.13.0 • Public • Published

Simple data analysis (SDA) in JavaScript

This repository is maintained by Nael Shiab, computational journalist and senior data producer for CBC News.

To install with NPM:

npm i simple-data-analysis

The documentation is available here.

The library is tested for NodeJS. Please reach out if you want to make it work with Bun and Deno! :)

Core principles

The project's goals are:

  • To offer a high-performance and convenient solution in JavaScript for data analysis. It's based on DuckDB and inspired by Pandas (Python) and the Tidyverse (R).

  • To standardize and accelerate frontend/backend workflows with a simple-to-use library working both in the browser and with NodeJS (and similar runtimes).

  • To ease the way for non-coders (especially journalists and web developers) into the beautiful world of data analysis and data visualization in JavaScript.

SDA is based on duckdb-node and duckdb-wasm. DuckDB is a high-performance analytical database system. Under the hood, SDA sends SQL queries to be executed by DuckDB.

You also have the flexibility of writing your own queries if you want to (check the customQuery method) or to use JavaScript to process your data (check the updateWithJS method).
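
To picture what "sending SQL queries" means, here is a minimal sketch of a helper that turns a few options into a SQL string. The function buildMeanQuery is hypothetical and is not part of SDA; the queries SDA actually generates may look quite different.

```javascript
// Hypothetical helper, NOT part of SDA: it only illustrates the idea of
// translating a method call into a SQL string that DuckDB could execute.
// Assumption: table and column names contain no double quotes.
function buildMeanQuery(table, { values, categories }) {
    const groupBy = categories.map((column) => `"${column}"`).join(", ")
    return `SELECT ${groupBy}, AVG("${values}") AS mean FROM "${table}" GROUP BY ${groupBy}`
}

const query = buildMeanQuery("dailyTemperatures", {
    values: "t",
    categories: ["decade", "id"],
})

console.log(query)
```

A string like this one is the kind of input you could adapt and pass to your own queries if you prefer to write SQL yourself.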

Feel free to start a conversation or open an issue. Check how you can contribute.

About v2

Because v1.x.x versions weren't based on DuckDB, v2.x.x is a complete rewrite of the library with many breaking changes.

To test and compare the performance of simple-data-analysis@2.x.x, we calculated the average temperature per decade and city with the daily temperatures from the Adjusted and Homogenized Canadian Climate Data. See this repository for the code.

We ran the same calculations with simple-data-analysis@1.8.1 (both NodeJS and Bun), simple-data-analysis@2.0.1 (NodeJS), simple-data-analysis@2.7.3 (NodeJS), Pandas (Python), and the tidyverse (R).

In each script, we:

  1. Loaded a CSV file (Importing)
  2. Selected four columns, removed rows with missing temperatures, and converted date strings to dates and temperature strings to floats (Cleaning)
  3. Added a new column decade and calculated the decade (Modifying)
  4. Calculated the average temperature per decade and city (Summarizing)
  5. Wrote the cleaned-up data used to compute the averages to a new CSV file (Writing)
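
As a rough illustration of the cleaning, modifying, and summarizing steps, here is a sketch in plain JavaScript. It does not use SDA, Pandas, or the tidyverse, and it is not the benchmark code. The rows, city names, and column names are made up, and importing and writing the CSV files are left out.

```javascript
// Made-up rows standing in for a parsed CSV file.
const rows = [
    { city: "A", time: "1995-03-01", t: "2.5" },
    { city: "A", time: "1998-07-15", t: "21.5" },
    { city: "A", time: "2004-01-10", t: "" }, // Missing temperature
    { city: "B", time: "2001-06-30", t: "18" },
    { city: "B", time: "2009-12-24", t: "-4" },
]

// Cleaning: we remove rows with a missing temperature and
// convert the strings to Date and float values.
const cleaned = rows
    .filter((row) => row.t !== "")
    .map((row) => ({
        city: row.city,
        time: new Date(row.time),
        t: Number.parseFloat(row.t),
    }))

// Modifying: we floor the year to the start of its decade.
const withDecade = cleaned.map((row) => ({
    ...row,
    decade: Math.floor(row.time.getUTCFullYear() / 10) * 10,
}))

// Summarizing: we average the temperature per decade and city.
const groups = new Map()
for (const { city, decade, t } of withDecade) {
    const key = `${decade}|${city}`
    const group = groups.get(key) ?? { decade, city, sum: 0, count: 0 }
    group.sum += t
    group.count += 1
    groups.set(key, group)
}
const averages = [...groups.values()].map(({ decade, city, sum, count }) => ({
    decade,
    city,
    mean: sum / count,
}))

console.log(averages)
```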

Each script was run ten times on a MacBook Pro (Apple M1 Pro / 16 GB). We averaged the durations and calculated their standard deviation.
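
For reference, here is a small sketch of how the mean and standard deviation of ten run durations can be computed. The durations are made-up values, not our measurements, and we assume the population formula (dividing by n) since the choice between population and sample formulas is not specified here.

```javascript
// Made-up durations, in seconds, for ten hypothetical runs.
const durations = [2, 4, 4, 4, 5, 5, 7, 9, 5, 5]

function mean(values) {
    return values.reduce((sum, value) => sum + value, 0) / values.length
}

// Population standard deviation (divides by n).
// For the sample standard deviation, divide by n - 1 instead.
function standardDeviation(values) {
    const average = mean(values)
    const variance =
        values.reduce((sum, value) => sum + (value - average) ** 2, 0) /
        values.length
    return Math.sqrt(variance)
}

const averageDuration = mean(durations)
const deviation = standardDeviation(durations)

console.log({ averageDuration, deviation })
```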

The charts displayed below come from this Observable notebook.

Small file

With ahccd-samples.csv:

  • 74.7 MB
  • 19 cities
  • 20 columns
  • 971,804 rows
  • 19,436,080 data points

simple-data-analysis@1.8.1 was the slowest, but simple-data-analysis@2.x.x versions are now the fastest.

A chart showing the processing duration of multiple scripts in various languages

Big file

With ahccd.csv:

  • 1.7 GB
  • 773 cities
  • 20 columns
  • 22,051,025 rows
  • 441,020,500 data points
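
The number of data points listed for each file is its number of rows multiplied by its number of columns. A quick check with the figures from both lists:

```javascript
// Row and column counts as listed for the two files.
const files = [
    { name: "ahccd-samples.csv", rows: 971_804, columns: 20 },
    { name: "ahccd.csv", rows: 22_051_025, columns: 20 },
]

for (const file of files) {
    // Data points = rows × columns
    file.dataPoints = file.rows * file.columns
    console.log(`${file.name}: ${file.dataPoints} data points`)
}
```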

The file was too big for simple-data-analysis@1.8.1, so it's not included here.

While simple-data-analysis@2.0.1 was already fast, simple-data-analysis@2.7.3 shines even more with big files.

A chart showing the processing duration of multiple scripts in various languages

We also tried the One Billion Row Challenge, which involves computing the min, mean, and max temperature for hundreds of cities in a 1,000,000,000-row CSV file. We were impressed by the results! For more, check this repo forked from this one. The JavaScript code is here.
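
To make the computation concrete, here is a minimal plain JavaScript sketch of a single-pass min, mean, and max per city. It is not the code from the linked repository and it does not use SDA. The "city;temperature" line format and the values are assumptions made for the illustration.

```javascript
// Made-up lines, assuming a "city;temperature" format.
const lines = [
    "Montreal;-5.0",
    "Toronto;3.5",
    "Montreal;7.0",
    "Toronto;-1.5",
    "Montreal;4.0",
]

// We keep running statistics per city, so each line is read only once.
const stats = new Map()
for (const line of lines) {
    const [city, value] = line.split(";")
    const t = Number.parseFloat(value)
    const current = stats.get(city)
    if (current === undefined) {
        stats.set(city, { min: t, max: t, sum: t, count: 1 })
    } else {
        current.min = Math.min(current.min, t)
        current.max = Math.max(current.max, t)
        current.sum += t
        current.count += 1
    }
}

const summary = [...stats].map(([city, s]) => ({
    city,
    min: s.min,
    mean: s.sum / s.count,
    max: s.max,
}))

console.log(summary)
```

With a billion rows, the lines would be streamed from the file rather than held in an array, but the running statistics stay the same.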

Note that DuckDB, which powers SDA, can also be used with Python and R.

SDA in an Observable notebook

Observable notebooks are great for data analysis in JavaScript. This example shows you how to use simple-data-analysis in one of them.

SDA in an HTML page

If you want to add the library directly to your webpage, you can use an npm-based CDN like jsDelivr.

Here's some code that you can copy and paste into an HTML file. For more methods, check the SimpleDB class documentation.

<script type="module">
    // We import the SimpleDB class from the esm bundle.
    import { SimpleDB } from "https://cdn.jsdelivr.net/npm/simple-data-analysis/+esm"

    async function main() {
        // We start a new instance of SimpleDB
        const sdb = new SimpleDB()

        // We load daily temperatures for three cities.
        // We put the data in the table dailyTemperatures.
        await sdb.loadData(
            "dailyTemperatures",
            "https://raw.githubusercontent.com/nshiab/simple-data-analysis/main/test/data/files/dailyTemperatures.csv"
        )

        // We compute the decade from each date
        // and put the result in the decade column.
        await sdb.addColumn(
            "dailyTemperatures",
            "decade",
            "integer",
            "FLOOR(YEAR(time)/10)*10" // This is SQL
        )

        // We summarize the data by computing
        // the average dailyTemperature
        // per decade and per city.
        await sdb.summarize("dailyTemperatures", {
            values: "t",
            categories: ["decade", "id"],
            summaries: "mean",
        })

        // We run linear regressions
        // to check for trends.
        await sdb.linearRegressions("dailyTemperatures", {
            x: "decade",
            y: "mean",
            categories: "id",
            decimals: 4,
        })

        // The dailyTemperatures table does not have
        // the names of the cities, just the ids.
        // We load another file with the names
        // in the table cities.
        await sdb.loadData(
            "cities",
            "https://raw.githubusercontent.com/nshiab/simple-data-analysis/main/test/data/files/cities.csv"
        )

        // We join the two tables. By default,
        // join searches for a common column
        // and does a left join. The result is stored in
        // the left table (dailyTemperatures here).
        await sdb.join("dailyTemperatures", "cities")

        // We select the columns of interest
        // after the join operation.
        await sdb.selectColumns("dailyTemperatures", [
            "city",
            "slope",
            "yIntercept",
            "r2",
        ])

        // We log the results table.
        await sdb.logTable("dailyTemperatures")

        // We store the data in a variable.
        const results = await sdb.getData("dailyTemperatures")
    }

    main()
</script>

And here's the table you'll see in your browser's console tab.

The console tab in Google Chrome showing the result of simple-data-analysis computations.
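
If you are wondering what the slope, yIntercept, and r2 columns represent, here is a sketch of a simple linear regression in plain JavaScript. We assume ordinary least squares here; this is an illustration, not SDA's implementation, and the points are made-up values.

```javascript
// Ordinary least squares for y = slope * x + yIntercept, with R².
// Assumption: the x values are not all equal and the y values are
// not all equal, otherwise the divisions below return NaN.
function linearRegression(points) {
    const n = points.length
    const meanX = points.reduce((sum, p) => sum + p.x, 0) / n
    const meanY = points.reduce((sum, p) => sum + p.y, 0) / n

    let sxy = 0
    let sxx = 0
    let syy = 0
    for (const { x, y } of points) {
        sxy += (x - meanX) * (y - meanY)
        sxx += (x - meanX) ** 2
        syy += (y - meanY) ** 2
    }

    const slope = sxy / sxx
    const yIntercept = meanY - slope * meanX
    // For a simple linear regression, R² is the squared correlation.
    const r2 = (sxy * sxy) / (sxx * syy)
    return { slope, yIntercept, r2 }
}

// Made-up mean temperatures per decade for one hypothetical city.
const points = [
    { x: 1980, y: 5 },
    { x: 1990, y: 6 },
    { x: 2000, y: 7 },
    { x: 2010, y: 8 },
]

const fit = linearRegression(points)
console.log(fit)
```

A positive slope means the mean temperature rises from one decade to the next, and an r2 close to 1 means the line fits the points well.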

SDA with React

First, ensure that you have NodeJS v18 or higher installed.

Then you'll need to run this command to install the library in your code repository.

npm install simple-data-analysis

And here's an example with React and TypeScript. For more methods, check the SimpleDB class documentation.

import { useEffect, useState } from "react"

// We import the SimpleDB class.
import { SimpleDB } from "simple-data-analysis"

const Main = () => {
    // A state to store the results.
    const [results, setResults] = useState<
        { [key: string]: string | number | boolean | Date | null }[] | null
    >(null)

    // You can use sda inside a useEffect.
    useEffect(() => {
        // Because SimpleDB uses promises,
        // we need to declare an async function
        // in the useEffect hook.
        async function sdaMagic() {
            // We start a new instance of SimpleDB.
            const sdb = new SimpleDB()

            // We load daily temperatures for three cities.
            // We put the data in the table dailyTemperatures.
            await sdb.loadData(
                "dailyTemperatures",
                "https://raw.githubusercontent.com/nshiab/simple-data-analysis/main/test/data/files/dailyTemperatures.csv"
            )

            // We compute the decade from each date
            // and put the result in the decade column.
            // The calculations are written in SQL,
            // but you can also use updateWithJS to
            // use JavaScript.
            await sdb.addColumn(
                "dailyTemperatures",
                "decade",
                "integer",
                "FLOOR(YEAR(time)/10)*10" // This is SQL
            )

            // We summarize the data by computing
            // the average dailyTemperature
            // per decade and per city.
            await sdb.summarize("dailyTemperatures", {
                values: "t",
                categories: ["decade", "id"],
                summaries: "mean",
            })

            // We run linear regressions
            // to check for trends.
            await sdb.linearRegressions("dailyTemperatures", {
                x: "decade",
                y: "mean",
                categories: "id",
                decimals: 4,
            })

            // The dailyTemperatures table does not have
            // the names of the cities, just the ids.
            // We load another file with the names
            // in the table cities.
            await sdb.loadData(
                "cities",
                "https://raw.githubusercontent.com/nshiab/simple-data-analysis/main/test/data/files/cities.csv"
            )

            // We join the two tables. By default,
            // join searches for a common column
            // and does a left join. The result is stored in
            // the left table (dailyTemperatures here).
            await sdb.join("dailyTemperatures", "cities")

            // We select the columns of interest
            // after the join operation.
            await sdb.selectColumns("dailyTemperatures", [
                "city",
                "slope",
                "yIntercept",
                "r2",
            ])

            // We log the results table.
            await sdb.logTable("dailyTemperatures")

            // We can store the results in our state.
            setResults(await sdb.getData("dailyTemperatures"))
        }

        // We call the async function inside the useEffect hook.
        sdaMagic()
    }, [])

    return (
        <div>
            <p>Check the console!</p>
            <p>Here are the computed results:</p>
            <p>{JSON.stringify(results, null, " ")}</p>
        </div>
    )
}
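
The join step in the example relies on a common column and a left join. Here is a plain JavaScript sketch of that idea. The leftJoin function is hypothetical and is not SDA's implementation, the rows are made up, and we assume each id appears only once in the right table.

```javascript
// Made-up rows. Only their shape matters.
const left = [
    { id: 1, mean: 10 },
    { id: 2, mean: 12 },
    { id: 3, mean: 8 },
]
const right = [
    { id: 1, city: "Toronto" },
    { id: 2, city: "Montreal" },
]

// Hypothetical helper: keeps every row of the left table and adds
// the matching columns from the right table, or null without a match.
function leftJoin(leftRows, rightRows) {
    // We look for the first column present in both tables.
    const rightColumns = new Set(Object.keys(rightRows[0] ?? {}))
    const commonColumn = Object.keys(leftRows[0] ?? {}).find((column) =>
        rightColumns.has(column)
    )
    if (commonColumn === undefined) {
        throw new Error("No common column found")
    }

    const rightOnly = [...rightColumns].filter((c) => c !== commonColumn)
    const index = new Map(rightRows.map((row) => [row[commonColumn], row]))

    return leftRows.map((row) => {
        const match = index.get(row[commonColumn])
        const extra = Object.fromEntries(
            rightOnly.map((c) => [c, match ? match[c] : null])
        )
        return { ...row, ...extra }
    })
}

const joined = leftJoin(left, right)
console.log(joined)
```

The row with id 3 has no match in the right table, so it is kept with a null city, which is what distinguishes a left join from an inner join.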

SDA with NodeJS and similar runtimes

First, ensure that you have NodeJS v20 or higher installed.

Then you'll need to run this command to install the library in your code repository.

npm install simple-data-analysis

A package.json file should have been created. Open it and add or change the type to "module".

{
    "type": "module",
    "dependencies": {
        "simple-data-analysis": "^2.5.0"
    }
}

Here's some code you can copy and paste into a JavaScript file. It's the same as the one you would run in a browser, except we use the SimpleNodeDB class.

This class has more methods available to load data from local files and write data to files. Check the SimpleNodeDB class documentation. Its geospatial capabilities are under development. Check the loadGeoData, area, and intersection methods for more information.

import { SimpleNodeDB } from "simple-data-analysis"

async function main() {
    // We start a new instance of SimpleNodeDB
    const sdb = new SimpleNodeDB()

    // We load daily temperatures for three cities.
    // We put the data in the table dailyTemperatures.
    await sdb.loadData(
        "dailyTemperatures",
        "https://raw.githubusercontent.com/nshiab/simple-data-analysis/main/test/data/files/dailyTemperatures.csv"
    )

    // We compute the decade from each date
    // and put the result in the decade column.
    // The calculations are written in SQL,
    // but you can also use updateWithJS to
    // use JavaScript.
    await sdb.addColumn(
        "dailyTemperatures",
        "decade",
        "integer",
        "FLOOR(YEAR(time)/10)*10" // This is SQL
    )

    // We summarize the data by computing
    // the average dailyTemperature
    // per decade and per city.
    await sdb.summarize("dailyTemperatures", {
        values: "t",
        categories: ["decade", "id"],
        summaries: "mean",
    })

    // We run linear regressions
    // to check for trends.
    await sdb.linearRegressions("dailyTemperatures", {
        x: "decade",
        y: "mean",
        categories: "id",
        decimals: 4,
    })

    // The dailyTemperatures table does not have
    // the names of the cities, just the ids.
    // We load another file with the names
    // in the table cities.
    await sdb.loadData(
        "cities",
        "https://raw.githubusercontent.com/nshiab/simple-data-analysis/main/test/data/files/cities.csv"
    )

    // We join the two tables. By default,
    // join searches for a common column
    // and does a left join. The result is stored in
    // the left table (dailyTemperatures here).
    await sdb.join("dailyTemperatures", "cities")

    // We select the columns of interest
    // after the join operation.
    await sdb.selectColumns("dailyTemperatures", [
        "city",
        "slope",
        "yIntercept",
        "r2",
    ])

    // We log the results table.
    await sdb.logTable("dailyTemperatures")

    // We store the data in a variable.
    const results = await sdb.getData("dailyTemperatures")
}

main()

Here's the command to run the file. Replace index.js with the name of your file.

node index.js

And here's what you should see in your console.

The console tab in Google Chrome showing the result of simple-data-analysis computations.

If you want to generate and save charts, check the journalism library, more specifically the savePlotChart function.

License

MIT
