@thi.ng/hiccup-html-parse

0.3.41 • Public • Published


[!NOTE] This is one of 192 standalone projects, maintained as part of the @thi.ng/umbrella monorepo and anti-framework.

🚀 Please help me to work full-time on these projects by sponsoring me on GitHub. Thank you! ❤️

About

Well-formed HTML parsing and customizable transformation to nested JS arrays in @thi.ng/hiccup format.

Note: This parser is intended for well-formed HTML and will likely fail on any "quirky" (i.e. malformed or otherwise dodgy) markup.

Basic usage

import { parseHtml } from "@thi.ng/hiccup-html-parse";

const src = `<!doctype html>
<html lang="en">
<head>
    <script lang="javascript">
console.log("</"+"script>");
    </script>
    <style>
body { margin: 0; }
    </style>
</head>
<body>
    <div id="foo" bool data-xyz="123" empty=''>
    <a href="#bar">baz <b>bold</b></a><br/>
    </div>
</body>
</html>`;

const result = parseHtml(src);

console.log(result.type);
// "success"

console.log(result.result);

// [
//   ["html", { lang: "en" },
//     ["head", {},
//       ["script", { lang: "javascript" }, "console.log(\"</\"+\"script>\");" ],
//       ["style", {}, "body { margin: 0; }"] ],
//     ["body", {},
//       ["div", { id: "foo", bool: true, "data-xyz": "123" },
//         ["a", { href: "#bar" },
//           "baz ",
//           ["b", {}, "bold"]],
//         ["br", {}]]]]
// ]
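Since the result is plain nested arrays of the form `[tag, attribs, ...children]`, it can be post-processed with ordinary recursive functions. The following is a minimal sketch, not part of this package: the `HiccupNode` type and the `collectAttrib` and `textContent` helpers are hypothetical names introduced here, and the tree is the `body` subtree copied from the output above.

```typescript
// Minimal node types for this sketch only (NOT the package's own typings)
type Attribs = Record<string, unknown>;
type HiccupNode = string | [string, Attribs, ...HiccupNode[]];

// `body` subtree copied from the parse result shown above
const body: HiccupNode = [
  "body", {},
  ["div", { id: "foo", bool: true, "data-xyz": "123" },
    ["a", { href: "#bar" },
      "baz ",
      ["b", {}, "bold"]],
    ["br", {}]],
];

// Depth-first walk collecting the string values of one attribute
const collectAttrib = (
  node: HiccupNode,
  name: string,
  acc: string[] = []
): string[] => {
  // text bodies carry no attributes
  if (typeof node === "string") return acc;
  const [, attribs, ...children] = node;
  const val = attribs[name];
  if (typeof val === "string") acc.push(val);
  for (const child of children) collectAttrib(child, name, acc);
  return acc;
};

// Concatenate all text bodies in document order
const textContent = (node: HiccupNode): string => {
  if (typeof node === "string") return node;
  const [, , ...children] = node;
  return children.map(textContent).join("");
};

console.log(collectAttrib(body, "href"));
// [ "#bar" ]

console.log(textContent(body));
// "baz bold"
```

Boolean attributes such as `bool: true` are skipped by `collectAttrib` because only string values are collected.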

Parsing & transformation options

Parser behavior & results can be customized via supplied options and user transformation functions:

| Option           | Description                                | Default |
|------------------|--------------------------------------------|---------|
| `ignoreElements` | Array of element names to ignore           | `[]`    |
| `ignoreAttribs`  | Array of attribute names to ignore         | `[]`    |
| `dataAttribs`    | Keep data attribs                          | `true`  |
| `comments`       | Keep `<!-- ... -->` comments               | `false` |
| `doctype`        | Keep `<!doctype ...>` element              | `false` |
| `whitespace`     | Keep whitespace-only text bodies           | `false` |
| `collapse`       | Collapse whitespace (1)                    | `true`  |
| `unescape`       | Replace named & numeric HTML entities (1)  | `true`  |
| `tx`             | Element transform/filter function          |         |
| `txBody`         | Plain text transform/filter function       |         |

  • (1) Not in CData content sections, such as inside `<script>` or `<style>` elements
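To illustrate how the defaults listed above combine with user-supplied overrides, here is a small self-contained sketch. The `ParseOptsSketch` interface, `DEFAULTS` object and `withDefaults` helper are hypothetical names that only mirror the table; they are not the package's own typings. The `tx` and `txBody` functions are left out because their signatures are not stated here, and passing the options as a second argument to `parseHtml` is an assumption.

```typescript
// Hypothetical local mirror of the options table (NOT the package's typings)
interface ParseOptsSketch {
  ignoreElements: string[];
  ignoreAttribs: string[];
  dataAttribs: boolean;
  comments: boolean;
  doctype: boolean;
  whitespace: boolean;
  collapse: boolean;
  unescape: boolean;
}

// Default values as listed in the table
const DEFAULTS: ParseOptsSketch = {
  ignoreElements: [],
  ignoreAttribs: [],
  dataAttribs: true,
  comments: false,
  doctype: false,
  whitespace: false,
  collapse: true,
  unescape: true,
};

// Merge a partial set of overrides with the defaults
const withDefaults = (
  opts: Partial<ParseOptsSketch> = {}
): ParseOptsSketch => ({ ...DEFAULTS, ...opts });

// Example: drop <script> & <style> elements, but keep comments
const opts = withDefaults({
  ignoreElements: ["script", "style"],
  comments: true,
});

console.log(opts.comments, opts.collapse);
// true true

// Assumed call shape (not confirmed by the text above):
// const result = parseHtml(src, opts);
```

Only the overridden keys change; every other option keeps the default from the table.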

Status

ALPHA - bleeding edge / work-in-progress

Search or submit any issues for this package

Related packages

Installation

yarn add @thi.ng/hiccup-html-parse

ESM import:

import * as hp from "@thi.ng/hiccup-html-parse";

Browser ESM import:

<script type="module" src="https://esm.run/@thi.ng/hiccup-html-parse"></script>

JSDelivr documentation

For Node.js REPL:

const hp = await import("@thi.ng/hiccup-html-parse");

Package sizes (brotli'd, pre-treeshake): ESM: 1.18 KB

Dependencies

Usage examples

One project in this repo's /examples directory is using this package:

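| Description                                                                                              | Live demo | Source |
|----------------------------------------------------------------------------------------------------------|-----------|--------|
| Mastodon API feed reader with support for different media types, fullscreen media modal, HTML rewriting | Demo      | Source |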

API

Generated API docs

TODO

Benchmarks

Results from the benchmark parsing the HTML of the thi.ng website (MBA M1 2021, 16GB RAM, Node.js v20.5.1):

benchmarking: thi.ng html (87.97 KB)
        warmup... 1951.76ms (100 runs)
        total: 19375.49ms, runs: 1000 (@ 1 calls/iter)
        mean: 19.38ms, median: 19.26ms, range: [18.12..28.45]
        q1: 18.75ms, q3: 19.68ms
        sd: 4.66%

Authors

If this project contributes to an academic publication, please cite it as:

@misc{thing-hiccup-html-parse,
  title = "@thi.ng/hiccup-html-parse",
  author = "Karsten Schmidt",
  note = "https://thi.ng/hiccup-html-parse",
  year = 2023
}

License

© 2023 - 2024 Karsten Schmidt // Apache License 2.0
