Rate-limited website scraping with Node.js and async
Yesterday a job posting from my previous employer popped up in my Facebook feed, which reminded me of the programming exercise that we included in the interview process just before I left the company. In short, it comes down to:
- Funda has an API that lets you run queries; the response is paged, with a maximum of 25 objects per page
- The API is rate limited to about 100 requests per minute
- Request all pages for a given query
- Count how often each realtor ID appears in the results
- Aggregate the counts per realtor ID and create a top 10 list of realtors with the most objects
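To get a feel for the numbers in the list above, here is a small back-of-the-envelope sketch. The helper names `minIntervalMs` and `pagesNeeded` are made up for this illustration, and the total of 1000 objects is just an example figure, not something the API promises.

```javascript
// Hypothetical helper: turn a requests-per-minute budget into the
// minimum delay between two requests, in milliseconds.
function minIntervalMs(requestsPerMinute) {
  return Math.ceil(60000 / requestsPerMinute);
}

// Hypothetical helper: number of paged requests needed to fetch
// `totalObjects` results when each page holds `pageSize` objects.
function pagesNeeded(totalObjects, pageSize) {
  return Math.ceil(totalObjects / pageSize);
}

console.log(minIntervalMs(100));    // 600
console.log(pagesNeeded(1000, 25)); // 40
```

So at roughly 100 requests per minute we should leave about 600 ms between requests, and an example query with 1000 objects would take 40 requests.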
Expanding async
The async library already has a pretty convenient way to create dynamically sized queues with concurrency, in the form of:
JavaScript:
// create a queue that does 4 items at the same time
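To make the idea of a queue with concurrency concrete, here is a minimal, dependency-free sketch. It is a stand-in written for this illustration and not the async library's implementation; `createQueue` and the simulated 10 ms worker are assumptions.

```javascript
// Minimal stand-in for a concurrency-limited queue (NOT async's own code).
// worker(task, done) processes one task and calls done() when finished.
function createQueue(worker, concurrency) {
  var waiting = []; // tasks that have not been started yet
  var running = 0;  // tasks currently in progress
  var queue = {
    drain: null,    // optional callback, fired when all work is finished
    push: function (task) {
      waiting.push(task);
      startNext();
    }
  };

  function startNext() {
    // start tasks until every slot is taken or nothing is waiting
    while (running < concurrency && waiting.length > 0) {
      running++;
      worker(waiting.shift(), function done() {
        running--;
        if (waiting.length === 0 && running === 0) {
          if (queue.drain) queue.drain();
        } else {
          startNext();
        }
      });
    }
  }

  return queue;
}

// Usage: 10 items, at most 4 of them in flight at the same time.
var active = 0, maxActive = 0, finished = [];
var q = createQueue(function (task, done) {
  active++;
  maxActive = Math.max(maxActive, active);
  setTimeout(function () { // simulated asynchronous work
    active--;
    finished.push(task);
    done();
  }, 10);
}, 4);

q.drain = function () {
  console.log("processed " + finished.length + " items, at most " + maxActive + " at once");
};

for (var i = 1; i <= 10; i++) q.push(i);
```

The first four items start right away; every time one finishes, the next waiting item takes the free slot.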
To add rate limiting to queues, I created a mixin that adds some methods to async. It sets up a kind of event loop that fires every X ms, where X is the minimum interval at which we are allowed to query the target website. The usage stays the same, but the queue variable now has a chainable method 'rateLimit'. Executing the same code as before, but rate limited to 1 request per second, gives a sorted response: even though we have a concurrency of four, processing an item takes at most 1 second, so the previous item has always finished before the next one starts.
JavaScript:
// change
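The timer idea can be sketched without any library. This is not the mixin itself and does not reproduce its 'rateLimit' API; `createRateLimitedQueue` is a hypothetical name, and the 20 ms interval and 5 ms of simulated work are chosen only to keep the demo fast.

```javascript
// Hypothetical sketch of a rate-limited queue; not the author's mixin.
// A timer fires every intervalMs and starts at most one task per tick.
function createRateLimitedQueue(worker, concurrency, intervalMs) {
  var waiting = [];
  var running = 0;
  var timer = null;
  var queue = {
    drain: null, // optional callback, fired when all work is finished
    push: function (task) {
      waiting.push(task);
      if (timer === null) timer = setInterval(tick, intervalMs);
    }
  };

  function tick() {
    if (waiting.length === 0) {
      if (running === 0) {
        // nothing waiting and nothing running: stop the timer
        clearInterval(timer);
        timer = null;
        if (queue.drain) queue.drain();
      }
      return;
    }
    if (running >= concurrency) return; // all slots busy, retry next tick
    running++;
    worker(waiting.shift(), function () { running--; });
  }

  return queue;
}

// Usage: each task takes less time than the interval, so results stay in order.
var completed = [];
var limited = createRateLimitedQueue(function (n, done) {
  setTimeout(function () { completed.push(n); done(); }, 5);
}, 4, 20);

limited.drain = function () {
  console.log(completed.join(", ")); // prints "1, 2, 3, 4, 5"
};

[1, 2, 3, 4, 5].forEach(function (n) { limited.push(n); });
```

Note that in this sketch the first task only starts after one full interval. Because each task finishes before the next tick, the concurrency limit of four is never actually reached, which is why the output comes back sorted.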
Turning it into real-world code
The response that we get from Funda has a 'Paging' object whose 'VolgendeUrl' field contains the next URL that we can call. If it's empty, we've reached the end of our set. In pseudo code:
pseudo:
func processItem (url)
    resp = request(url)
    if resp.Paging.VolgendeUrl
        processItem resp.Paging.VolgendeUrl
    else
        "done"
In JavaScript with async, this looks like:
JavaScript:
var async = require("async");
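As a rough illustration of the paging loop, here is a sketch that follows 'VolgendeUrl' until it is empty. It runs against a fake in-memory API instead of Funda: `fakePages`, `fetchPage`, `crawl`, the URLs, and the 'Objects' field name are all assumptions made for this example.

```javascript
// Fake in-memory "API" so the sketch runs without network access.
// The 'Objects' field name and the URLs are assumptions.
var fakePages = {
  "/page/1": {
    Objects: [{ MakelaarId: 1 }, { MakelaarId: 2 }],
    Paging: { VolgendeUrl: "/page/2" }
  },
  "/page/2": {
    Objects: [{ MakelaarId: 1 }],
    Paging: { VolgendeUrl: "" } // empty: this is the last page
  }
};

// Hypothetical stand-in for an HTTP request.
function fetchPage(url, callback) {
  setImmediate(function () {
    var page = fakePages[url];
    if (!page) return callback(new Error("unknown url: " + url));
    callback(null, page);
  });
}

// Request a page, hand it to onPage, then follow VolgendeUrl until it is empty.
function crawl(startUrl, onPage, onDone) {
  function step(url) {
    fetchPage(url, function (err, resp) {
      if (err) return onDone(err);
      onPage(resp);
      var next = resp.Paging && resp.Paging.VolgendeUrl;
      if (next) step(next);
      else onDone(null);
    });
  }
  step(startUrl);
}

var visited = 0;
crawl("/page/1", function () { visited++; }, function (err) {
  if (err) throw err;
  console.log("done after " + visited + " pages"); // done after 2 pages
});
```

In the real script each `step` would be pushed onto the rate-limited queue instead of being called directly.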
Counting realtor IDs
Because the purpose of the assignment is to count the realtor IDs, we'll add a simple object map in which we gather all the data:
JavaScript:
// the key will be the realtor ID and the value the number of times we encountered this realtor
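A minimal sketch of such a map, plus the top 10 step from the assignment, could look like the following. The helper names `countRealtors` and `topRealtors` and the sample objects are assumptions; only the 'MakelaarId' field comes from the API.

```javascript
// key: realtor ID, value: number of times we encountered this realtor
var makelaarMap = {};

// Hypothetical helper: add the realtors of one page of objects to the map.
function countRealtors(objects) {
  objects.forEach(function (obj) {
    var id = obj.MakelaarId;
    makelaarMap[id] = (makelaarMap[id] || 0) + 1;
  });
}

// Hypothetical helper: turn the map into a list sorted by count, highest first.
function topRealtors(map, limit) {
  return Object.keys(map)
    .map(function (id) { return { id: id, count: map[id] }; })
    .sort(function (a, b) { return b.count - a.count; })
    .slice(0, limit);
}

countRealtors([{ MakelaarId: 7 }, { MakelaarId: 3 }, { MakelaarId: 7 }]);
console.log(topRealtors(makelaarMap, 10));
// realtor "7" comes first with a count of 2, then realtor "3" with a count of 1
```

Object keys are always strings, so the IDs in the result are strings even when the API returns numbers.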
Hooking it together
A few small things remain. First, we need to incorporate the base URL. Then we need to normalize the URLs we receive from 'VolgendeUrl', and maybe do some sanitizing. The final script will look something like this:
JavaScript:
var async = require("async");
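For the normalization step, one possible approach is to resolve whatever 'VolgendeUrl' contains against the base URL with Node's built-in `URL` class. This is only a sketch: `BASE_URL` is a placeholder and not the real Funda endpoint, and `normalizeNextUrl` is a hypothetical helper name.

```javascript
// Placeholder base URL; the real API endpoint is not shown here.
var BASE_URL = "http://api.example.invalid/";

// Hypothetical helper: returns an absolute URL, or null when there is no next page.
function normalizeNextUrl(next) {
  if (typeof next !== "string" || next.trim() === "") return null;
  // new URL(input, base) resolves relative paths and leaves absolute URLs intact
  return new URL(next.trim(), BASE_URL).href;
}

console.log(normalizeNextUrl("/p2/?type=koop"));
// http://api.example.invalid/p2/?type=koop
console.log(normalizeNextUrl("http://api.example.invalid/p3"));
// http://api.example.invalid/p3
console.log(normalizeNextUrl("")); // null
```

Returning null for an empty value lets the crawl loop use the same check for "no next page" regardless of whether the API sends an empty string or leaves the field out.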
Running it
To run it, execute the following commands on your local system or in Cloud9 IDE:
bash:
$ git clone https://github.com/janjongboom/async node_modules/async
$ npm install request
# paste the code in server.js
$ node server.js
Comments
Why use the word 'makelaar' in your code (to name just one) when the rest of your post is in English?
Besides, isn't Funda a Dutch outfit anyway?
Xantios wrote on Wednesday 25 April 2012 @ 20:04: "Why use the word 'makelaar' in your code (to name just one) when the rest of your post is in English? Besides, isn't Funda a Dutch outfit anyway?"
I noticed that too, indeed. As far as I'm concerned: code (and therefore variables as well) should always be in English.
kipusoep wrote on Thursday 26 April 2012 @ 10:20: "[...] I noticed that too, indeed. As far as I'm concerned: code (and therefore variables as well) should always be in English."
Meh, the point is that Funda's API is in Dutch, so the field is called 'MakelaarId'. Given that, a 'makelaarMap' is still defensible.
The reason we do this at Funda is that we think you should preferably write your code in English, but that real domain terms that are used in a very specific way within your organization are better left untranslated. You end up with code containing names like GetMakelaarContracts() and CalculateWoonoppervlakte(). That is indeed ugly, but it prevents confusion. In our organization, Makelaar and Woonoppervlakte mean something very specific, and you lose that in GetEstateAgentContracts() and CalculateTotalLivingSurface(). In my opinion, anyway.
This reminds me of my own Funda crawler, which seems a lot simpler.
I used Mechanize (http://mechanize.rubyforge.org/) and Ruby on Rails and created a rake-file (http://guides.rubyonrails.org/command_line.html#rake) for this:
desc "Get houses from Funda"
task :funda => :environment do
  require 'mechanize'
  agent = Mechanize.new
  i = 1
  agent.get("http://www.funda.nl/koop/heel-nederland/p#{i}/")
  begin
    agent.page.search(".nvm").each do |node|
      street = node.search(".item").map(&:text).map(&:strip).first
      info = node.search(".specs").map(&:text).map(&:strip).first
      price = node.search(".nvm-extern").map(&:text).map(&:strip).first
      broker = node.search(".rel a").map(&:text).map(&:strip).first
      House.create! do |house|
        house.street = street
        house.broker = broker
        house.price = price
        house.info = info
      end
    end
    i = i.next
    next_page = agent.page.link_with(:href => "/koop/heel-nederland/p#{i}/")
  end while (next_page.click unless next_page.nil?)
end
Don't know if this still works, tho.
[Comment edited on Thursday 26 April 2012 14:33]
BeRtjh wrote on Thursday 26 April 2012 @ 14:31: "This reminds me of my own Funda crawler, which seems a lot simpler. I used Mechanize (http://mechanize.rubyforge.org/) and Ruby on Rails and created a rake-file (http://guides.rubyonrails.org/command_line.html#rake) for this: [...]"
Yeah, it's probably not the easiest way to do this, but adding rate limiting is what made it interesting for me personally.
Teun wrote on Thursday 26 April 2012 @ 13:52: "The reason we do this at Funda is that we think you should preferably write your code in English..."
Hi Teun, I also came across the assignment online and thought it would make a nice showcase for a REST client I'm developing. You can find the source on GitHub: https://github.com/albertjan/houses and the source for the REST client here: https://github.com/albertjan/DynamicRestClient
How do you access the Funda web service exactly? Is this still possible without a paid subscription?
I looked into this myself a while ago because I wanted to automatically keep track of some objects I am interested in.
Daniel de Witte wrote on Thursday 13 December 2012 @ 13:44: "How do you access the Funda web service exactly? Is this still possible without a paid subscription? [...]"
There is an API key available, but it's only intended to be used for the pre-job-interview programming exercise. Other parts of the API aren't public (at least that was the case a year ago, when I left Funda).