Web Auto Extractor
Parse semantically structured information from any HTML webpage.
Supported formats:
- Encodings that support Schema.org vocabularies:
  - Microdata
  - RDFa-lite
  - JSON-LD
- Random Meta tags
Many websites mark up their webpages with Schema.org vocabularies for better SEO. This library helps you parse that information into JSON.
Demo it on tonicdev
Installation
npm install web-auto-extractor
Usage
// IF CommonJS
var WAE = require('web-auto-extractor').default
// IF ES6
import WAE from 'web-auto-extractor'

var parsed = WAE().parse(sampleHTML)
Let's use the following text as the sampleHTML in our example. It uses Schema.org vocabularies to structure Product information and is encoded in microdata format.
Input
ACME Executive Anvil Sleeker than ACME's Classic Anvil, the Executive Anvil is perfect for the business traveler looking for something to drop from a height. Product #: 925872 4.4 stars, based on 89 reviews Regular price: $179.99 $119.99 (Sale ends 5 November!) Available from: Executive Objects Condition: Previously owned, in excellent condition In stock! Order now!
Output
Our parsed object should look like:
The parsed object includes four objects: microdata, rdfa, jsonld and metatags. Since the above HTML does not have any information encoded in rdfa and jsonld, those two objects are empty.
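As a minimal sketch of working with that four-object result, the helper below reports which of the top-level objects actually hold data. The `parsed` value here is a hand-written stand-in, not real library output, and both its field values and the `formatsWithData` helper are hypothetical names introduced for illustration.

```javascript
// Hand-written stand-in for the result of parsing the sample above.
// The values are illustrative assumptions, not real library output.
const parsed = {
  microdata: { Product: [{ '@type': 'Product', name: 'Executive Anvil' }] },
  rdfa: {},
  jsonld: {},
  metatags: { title: ['ACME Executive Anvil'] }
}

// Return the names of the top-level objects that contain at least one key.
function formatsWithData (result) {
  return ['microdata', 'rdfa', 'jsonld', 'metatags'].filter(
    (format) => Object.keys(result[format] || {}).length > 0
  )
}

console.log(formatsWithData(parsed)) // [ 'microdata', 'metatags' ]
```

A check like this can be handy before reading from a format, since an empty object simply means the page carried no markup in that encoding.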
Caveat
I wouldn't call it a caveat; rather, the parser is strict by design. If the HTML isn't encoded correctly, it might not parse as expected, so one might assume the parser is broken.
For example, take the following HTML snippet.
Ghostbusters Black Rhino Country: United States
The problem here is that the itemprop productionCompany, which is of itemtype Organization, doesn't have any itemprop among its children (in this case, name).
The parser assumes every itemtype contains an itemprop, or every typeof contains a property in the case of rdfa. So the "Black Rhino" information is lost.
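To make the fix concrete, here is a rough sketch that assumes the snippet above was marked up along the lines of `lossyHTML` below (the exact markup is an assumption, and the country part is left out for brevity). Wrapping the company name in its own itemprop gives a strict parser a property to attach the text to. The names `lossyHTML`, `fixedHTML` and `countItemprops` are hypothetical.

```javascript
// Assumed markup: the Organization scope holds bare text, so there is
// no itemprop inside it for "Black Rhino" to be stored under.
const lossyHTML = `
<div itemscope itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Ghostbusters</h1>
  <div itemprop="productionCompany" itemscope itemtype="http://schema.org/Organization">Black Rhino</div>
</div>`

// Same markup, with the company name wrapped in an itemprop="name" child.
const fixedHTML = lossyHTML.replace(
  '>Black Rhino<',
  '><span itemprop="name">Black Rhino</span><'
)

// Count itemprop attributes as a crude way to compare the two variants.
function countItemprops (html) {
  return (html.match(/\bitemprop=/g) || []).length
}

console.log(countItemprops(lossyHTML)) // 2
console.log(countItemprops(fixedHTML)) // 3
```

The only difference between the two strings is the extra `span`, which is the child itemprop the strict parser expects to find.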
It would be nice to fix this by adding a non-strict mode for parsing this information. PRs are welcome.
License
MIT