Google disallows automated access in its Terms of Service, so if you accept those terms you would be breaking them by scraping.
That said, I know of no lawsuit from Google against a scraper. Even Microsoft scraped Google results to power their search engine Bing; they got caught red-handed in 2011.
There are three ways to scrape Google results:
1) Use their API
- You can issue around 40 requests per hour. You are limited to what they give you, which is not really useful if you want to track ranking positions or see what a real user would see. That is data you are not allowed to gather.
- If you want a higher volume of API requests you need to pay.
- 60 requests per hour cost 2,000 USD per year; more queries require a custom deal. (A minimal API-call sketch follows below.)
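For reference, here is a minimal sketch of what such an API request looks like, assuming the API in question is Google's Custom Search JSON API; the API key and search-engine ID placeholders are hypothetical and must be replaced with your own.

```python
# Minimal sketch: one query against Google's Custom Search JSON API.
import requests

API_KEY = "YOUR_API_KEY"          # hypothetical placeholder
SEARCH_ENGINE_ID = "YOUR_CX_ID"   # hypothetical placeholder

def api_search(query, start=1):
    """Fetch one page (up to 10 results) of API search results."""
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": SEARCH_ENGINE_ID, "q": query, "start": start},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("items", [])

for item in api_search("example query"):
    print(item["title"], item["link"])
```

The hourly quota is enforced on Google's side, so a real ranking tracker would still have to spread these calls out over time.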
2) Scrape the normal result pages
- Here comes the tricky part. It is possible to scrape the normal result pages, but Google does not allow it.
- If you scrape at a rate higher than 8 keyword requests per hour you risk detection; higher than 10/h (updated from 20) will get you blocked, in my experience.
- By using multiple IPs you can raise the rate, so with 100 IP addresses you can scrape up to 1,000 requests per hour (24k a day). A minimal rotation sketch follows at the end of this section.
- There is an open source search engine scraper written in PHP at http://scraping.compunect.com It reliably scrapes Google, parses the results properly, and manages IP addresses, delays, etc. So if you can use PHP it's a nice kickstart; otherwise the code will still be useful to learn how it is done.
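Below is a minimal sketch of the proxy rotation and throttling described above. The proxy endpoints, the user agent, and the exact rate ceiling are assumptions; the point is only to show how delays and IP rotation keep each address below the detection threshold.

```python
# Minimal sketch: throttled result-page scraping through a rotating proxy list.
import itertools
import random
import time

import requests

PROXIES = [                        # hypothetical proxy endpoints, one per IP you control
    "http://user:pass@10.0.0.1:8080",
    "http://user:pass@10.0.0.2:8080",
]
REQUESTS_PER_IP_PER_HOUR = 8       # stay under the ~8-10/h ceiling mentioned above

def scrape_serp(keyword, proxy):
    """Fetch the raw HTML of one result page for `keyword` through `proxy`."""
    resp = requests.get(
        "https://www.google.com/search",
        params={"q": keyword},
        proxies={"http": proxy, "https": proxy},
        headers={"User-Agent": "Mozilla/5.0"},   # a plain UA; rotate these in practice
        timeout=15,
    )
    resp.raise_for_status()
    return resp.text                             # raw HTML, parsing is still up to you

keywords = ["keyword one", "keyword two", "keyword three"]
global_delay = 3600 / (REQUESTS_PER_IP_PER_HOUR * len(PROXIES))

for keyword, proxy in zip(keywords, itertools.cycle(PROXIES)):
    html = scrape_serp(keyword, proxy)
    print(keyword, len(html), "bytes")
    time.sleep(global_delay + random.uniform(0, 5))  # jitter so requests don't look uniform
```

With 100 proxies instead of two, the same pattern scales toward the hourly rates mentioned above.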
3) Alternatively, use a scraping service
- I used the service at http://scraping.services instead. They also provide open source code, and so far it's running well (several thousand result pages per hour during the refreshes).
- The downside of such a service is that your solution is “bound” to one professional supplier; the upside is that it was a lot cheaper than the other options I evaluated (and faster in our case).
- One option to reduce the dependency on one company is to run two approaches at the same time: use the scraping service as the primary source of data and fall back to a proxy-based solution like the one described above (a minimal fallback sketch follows below).
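The primary-plus-fallback idea boils down to a small wrapper. The sketch below assumes a hypothetical `query_scraping_service` client for whatever service you use and reuses the proxy-based `scrape_serp` sketch from option 2); neither name comes from any actual service API.

```python
# Minimal sketch: scraping service as primary source, proxy scraping as fallback.
import logging

def fetch_results(keyword):
    """Try the external scraping service first; fall back to self-managed proxies."""
    try:
        return query_scraping_service(keyword)        # hypothetical service client
    except Exception as exc:                          # outage, quota, network error, ...
        logging.warning("service failed for %r (%s); using proxy fallback", keyword, exc)
        return scrape_serp(keyword, PROXIES[0])       # proxy-based sketch from option 2)
```

This way a service outage only degrades you to the slower proxy rate instead of stopping data collection entirely.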
Feel free to comment on this article.