- Readability extraction modeled on Arc90's Readability.
- Automatically scrape the article from any page.
- Make any web page readable, whether it is in Chinese or English.
- Score Rule
- Extract Selectors
- Image Fallback
- Customize Settings
How it works
In my case, the spider processes about 1500k documents per day. The maximum crawling speed is 1.2k documents per minute (1k per minute on average), and each spider kernel uses about 200 MB of memory. Accuracy is about 90%; the remaining 10% can be fixed by customizing Score Rules or Selectors. In my experience, that is better than any other readability module.
Server info:
- 20M fiber-optic bandwidth
- 8 Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz CPUs
- 32 GB memory
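
As a rough sanity check on the figures above, the per-minute rates can be converted to per-day totals and compared against the available bandwidth. This is only a back-of-the-envelope sketch: it assumes that "20M" means 20 megabits per second and that bandwidth is the only limit, neither of which is stated above. All variable names are illustrative.

```python
# Back-of-the-envelope check of the throughput figures quoted above.
# Assumption (not stated in the text): "20M" bandwidth = 20 megabits per second.

MINUTES_PER_DAY = 24 * 60

avg_docs_per_minute = 1_000    # "avg 1k /minute"
peak_docs_per_minute = 1_200   # "1.2k /minute"

# Daily totals if each rate were sustained around the clock.
avg_docs_per_day = avg_docs_per_minute * MINUTES_PER_DAY
peak_docs_per_day = peak_docs_per_minute * MINUTES_PER_DAY

# Bytes that the assumed link can carry in one minute.
bandwidth_bits_per_second = 20 * 1_000_000
bytes_per_minute = bandwidth_bits_per_second / 8 * 60

# Largest average transfer size per document that the link could sustain.
avg_bytes_per_doc = bytes_per_minute / avg_docs_per_minute
peak_bytes_per_doc = bytes_per_minute / peak_docs_per_minute

print(f"average rate: {avg_docs_per_day:,} documents/day")
print(f"peak rate:    {peak_docs_per_day:,} documents/day")
print(f"budget per document at average rate: {avg_bytes_per_doc / 1000:.0f} KB")
print(f"budget per document at peak rate:    {peak_bytes_per_doc / 1000:.0f} KB")
```

The average rate works out to 1,440,000 documents per day, which lines up with the "about 1500k" figure. Under the bandwidth assumption, each document could average at most about 150 KB on the wire at the average rate, or about 125 KB at the peak rate.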