Web Auto Extractor
Parse semantically structured information from any HTML webpage.
Supported formats:
- Encodings that support Schema.org vocabularies: microdata, RDFa and JSON-LD
- Random meta tags
Many websites mark up their webpages with Schema.org vocabularies for better SEO. This library helps you parse that information into JSON.
Demo it on tonicdev
npm install web-auto-extractor
// If CommonJS
var WAE = require('web-auto-extractor').default
// If ES6
import WAE from 'web-auto-extractor'

var parsed = WAE().parse(sampleHTML)
Let's use the following text as the sampleHTML in our example. It uses Schema.org vocabularies to structure Product information and is encoded in microdata format.
ACME Executive Anvil
Sleeker than ACME's Classic Anvil, the Executive Anvil is perfect for the business traveler looking for something to drop from a height.
Product #: 925872
4.4 stars, based on 89 reviews
Regular price: $179.99
$119.99 (Sale ends 5 November!)
Available from: Executive Objects
Condition: Previously owned, in excellent condition
In stock! Order now!
The parsed object should include four objects: microdata, rdfa, jsonld and metatags. Since the above HTML does not have any information encoded in rdfa or jsonld, those two objects are empty.
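As a rough sketch of how one might check which of these objects actually carry data, the helper below lists the non-empty ones. The `nonEmptyFormats` name and the sample `parsed` value are hypothetical stand-ins for illustration. Only the four top-level keys come from the description above, and the inner values are made up.

```javascript
// Hypothetical stand-in for the result of parsing the microdata sample.
// Only the four top-level keys are taken from the text; inner values are invented.
var parsed = {
  microdata: { Product: [{ name: 'Executive Anvil' }] },
  rdfa: {},
  jsonld: {},
  metatags: { description: ['placeholder'] }
}

// Return the names of the top-level objects that contain at least one key.
function nonEmptyFormats (result) {
  return Object.keys(result).filter(function (key) {
    var value = result[key]
    return value !== null && typeof value === 'object' && Object.keys(value).length > 0
  })
}

console.log(nonEmptyFormats(parsed)) // [ 'microdata', 'metatags' ]
```

A check like this can be handy before reading from a format, since an empty object simply means nothing was found in that encoding.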
I wouldn't call it a caveat; rather, the parser is strict by design. If the HTML isn't marked up correctly, it might not parse as expected, which can make the parser look broken.
For example, take the following HTML snippet.
Ghostbusters
Black Rhino
Country: United States
The problem here is that the productionCompany item, which is of type Organization, doesn't have any itemprop among its children (in this case, name). The parser assumes that every itemtype contains an itemprop, or that every typeof contains a property in the case of rdfa. So the "Black Rhino" information is lost.
It would be nice to fix this by having a non-strict mode for parsing this information. PRs are welcome.
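Until then, the strict rule can be made concrete with a minimal sketch that flags typed items with no properties. It works on a hand-written tree of plain objects rather than real HTML. The `hasProperty` and `findEmptyItems` helpers, the tree shape, and the `countryOfOrigin` property name are all assumptions for illustration and are not part of the library.

```javascript
// Hypothetical tree mirroring the Ghostbusters snippet; not the library's data model.
var movie = {
  itemtype: 'Movie',
  children: [
    { itemprop: 'name', text: 'Ghostbusters' },
    { itemprop: 'productionCompany', itemtype: 'Organization', text: 'Black Rhino', children: [] },
    {
      itemprop: 'countryOfOrigin', // assumed property name
      itemtype: 'Country',
      children: [{ itemprop: 'name', text: 'United States' }]
    }
  ]
}

// True if the node has an itemprop below it that belongs to this item.
function hasProperty (node) {
  return (node.children || []).some(function (child) {
    if (child.itemprop) return true
    // A nested item without an itemprop starts its own scope.
    if (child.itemtype) return false
    // Plain wrapper elements may hold properties deeper down.
    return hasProperty(child)
  })
}

// Collect the itemtype of every typed node that has no property of its own.
function findEmptyItems (node, found) {
  found = found || []
  if (node.itemtype && !hasProperty(node)) found.push(node.itemtype)
  var children = node.children || []
  children.forEach(function (child) { findEmptyItems(child, found) })
  return found
}

console.log(findEmptyItems(movie)) // [ 'Organization' ]
```

Under these assumptions, the Organization item is the only one flagged, which matches the item whose text is dropped in the example above.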