I made my first crawler with crawler, or node-crawler as it is known on GitHub. In this post I will just briefly cover how to get started with it, as the source code of my crawler is a little buggy at the moment and I don’t want to go over it here just yet.
To get started with crawler I call the main constructor function to create an instance of crawler, and give it an options object with a callback method that will be called for each crawled page.
For now I am keeping maxConnections at 1 and disabling the jQuery feature, which seems to be giving me errors for some reason. I have not looked into why.
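As a rough sketch of that setup, something like the following should work. The option names (maxConnections, jQuery, callback) and the (error, res, done) callback signature are what I understand the node-crawler 1.x README to describe, so check them against the version you have installed. The createCrawler helper and its onPage parameter are hypothetical names of my own. I pass the constructor in as an argument only so the helper can be tried without the package installed.

```javascript
// Sketch only: assumes the node-crawler 1.x options and callback signature.
// createCrawler and onPage are made-up names for this example.
function createCrawler(Crawler, onPage) {
  return new Crawler({
    maxConnections: 1, // crawl one page at a time
    jQuery: false,     // skip the built-in parsing, so res.$ is not set
    callback: function (error, res, done) {
      if (error) {
        console.error(error);
      } else {
        onPage(res); // res.body holds the raw page content
      }
      done(); // tell crawler this page is finished
    }
  });
}

// Usage (needs `npm install crawler`; the URL is a placeholder):
// const Crawler = require('crawler');
// const c = createCrawler(Crawler, (res) => console.log(res.body.length));
// c.queue('http://example.com/');
```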
One of the dependencies of node-crawler is cheerio, which is a lean implementation of jQuery for node.js. In short, it’s a nice little node package for working with HTML.
When the jQuery bool is set to true, crawler automatically parses the page body with cheerio and sets the parsed body to res.$. If I want to use something else to do this, I would feed what is in res.body to it, as that will always be the raw HTML of a page, if what is being crawled is HTML.
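Here is a minimal sketch of feeding res.body to a parser by hand when the jQuery option is off. The pageTitle helper is a hypothetical name of my own, and its load argument is assumed to behave like cheerio.load (HTML string in, query function out). Any parser with a similar shape could be swapped in.

```javascript
// Sketch: parse the raw HTML in res.body instead of relying on res.$.
// pageTitle is a made-up helper; `load` is assumed to work like cheerio.load.
function pageTitle(res, load) {
  const html = String(res.body); // res.body may be a Buffer or a string
  const $ = load(html);          // build a query function for this page
  return $('title').text();      // text of the page's <title> element
}

// Usage inside the crawler callback (needs `npm install cheerio`):
// const cheerio = require('cheerio');
// console.log(pageTitle(res, cheerio.load));
```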
There are some events that the crawler emits, and handlers can be attached to them to define what to do when, for example, crawling stops because of an empty queue.
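For the empty queue case, a small sketch might look like this. It assumes the crawler instance is an event emitter and that the event is named 'drain', which is how I read the node-crawler 1.x README. The reportWhenDone helper and its log parameter are hypothetical names for this example.

```javascript
// Sketch: assumes the crawler emits a 'drain' event when its queue is empty.
// reportWhenDone is a made-up helper name.
function reportWhenDone(crawler, log) {
  let pages = 0;
  crawler.on('drain', function () {
    log('queue empty, crawled ' + pages + ' page(s)');
  });
  // Call the returned function once per crawled page, e.g. inside the callback.
  return function countPage() {
    pages += 1;
  };
}

// Usage, with c being a crawler instance:
// const countPage = reportWhenDone(c, console.log);
// ...then call countPage() in the callback before done().
```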