Want to crawl a site for outbound link data? Perhaps you’d like to easily analyze header tags across your entire site. You could aggregate a ton of publicly available data from around the web.
We’re excited to announce that our general crawling-as-a-service API endpoint is now live and in beta. We’ve been playing with the idea for a while now, and after some initial development and testing, we are making it available on Partner API accounts.
We expect to scale this to hundreds of billions of requests a month. To do that carefully and smoothly, we’re asking anyone who is interested to contact us to get started. We just want to make sure we can maintain performance at the volume people will need. We’re already easily processing billions of pages, but we still want to make sure we’ve got our sh*t together.
We haven’t made the docs for this service publicly available yet. Each person we’ve worked with so far has been interested in something pretty specific, which is why we want to work with each of you closely. We are already parsing out a lot of the data you can expect in a typical HTML page (title, description, H1s, links, etc. – see this gist for an example) and are adding more options every day. We want to know what kind of things people need in the default JSON version of a page, so if you have ideas, definitely let us know.
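To make "the default JSON version of a page" a bit more concrete, here is a rough Python sketch that pulls the fields mentioned above (title, description, H1s, links) out of an HTML page using only the standard library. It is an illustration, not our parser: the class name `PageSummary`, the `summarize` helper, and the JSON field names (including `outbound_links`) are all assumptions made up for this example, and the real response format may differ.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PageSummary(HTMLParser):
    """Collects the title, meta description, H1s, and links from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.description = ""
        self.h1s = []
        self.links = []
        self._capture = None  # tag whose text is currently being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("title", "h1"):
            self._capture = tag
            self._buffer = []
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or ""
        elif tag == "a" and attrs.get("href"):
            # Resolve relative hrefs against the page URL.
            self.links.append(urljoin(self.base_url, attrs["href"]))

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._capture:
            text = " ".join("".join(self._buffer).split())
            if tag == "title":
                self.title = text
            else:
                self.h1s.append(text)
            self._capture = None


def summarize(html, base_url):
    """Return a dict with assumed field names; not the actual API schema."""
    parser = PageSummary(base_url)
    parser.feed(html)
    parser.close()
    host = urlparse(base_url).netloc
    outbound = [
        u for u in parser.links
        if urlparse(u).scheme in ("http", "https") and urlparse(u).netloc != host
    ]
    return {
        "url": base_url,
        "title": parser.title,
        "description": parser.description,
        "h1s": parser.h1s,
        "links": parser.links,
        "outbound_links": outbound,
    }


SAMPLE = """<html><head><title>Example Page</title>
<meta name="description" content="A tiny demo page."></head>
<body><h1>Hello</h1><a href="/about">About</a>
<a href="https://other.example.org/x">Other</a></body></html>"""

page = summarize(SAMPLE, "https://example.com/")
print(json.dumps(page, indent=2))
```

Running it on the sample page prints a small JSON document with one H1, two links, and one outbound link. That is the kind of per-page record you could aggregate across a whole site.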
If you use Ruby and want to get started, check out our Partner API Ruby Gem.
When you sign up for our Partner API, we’ll credit your account with 1,000,000 pages of free crawling. Contact us for more information.
The Future
Imagine being able to…
- Write a little bit of code.
- Upload that as a plugin.
- Call that plugin using a parameter in your POST request along with the URL to crawl.
- GET the JSON version of that page with plugin specific data appended to the default data we already parse out.
Well, that’s what we’re wrapping up right now as one of the next features for the Partner API and crawling as a service. If you’d like to get started, please contact us for more information.
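As a purely illustrative sketch of the plugin idea described above, the Python below models the flow locally: a small plugin function, a registry standing in for "uploaded" plugins, and a result where the plugin's output is appended to the default data. Everything here is hypothetical, including the names `word_count_plugin`, `PLUGINS`, and `crawl_result` and the `plugins` key in the output. The feature is not finished, so the real interface may look quite different, and no actual API call is made.

```python
import json
import re


def default_page_data(url, html):
    """Stand-in for the default parsed data (assumed shape, title only)."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.I | re.S)
    return {"url": url, "title": match.group(1).strip() if match else ""}


def word_count_plugin(html):
    """Hypothetical plugin: counts the words left after stripping tags."""
    text = re.sub(r"<[^>]+>", " ", html)
    return {"word_count": len(text.split())}


# Hypothetical registry standing in for plugins you have uploaded.
PLUGINS = {"word_count": word_count_plugin}


def crawl_result(url, html, plugin=None):
    """Append the named plugin's output to the default data (assumed layout)."""
    data = default_page_data(url, html)
    if plugin is not None:
        data["plugins"] = {plugin: PLUGINS[plugin](html)}
    return data


html = "<html><head><title>Demo</title></head><body><p>three short words</p></body></html>"
result = crawl_result("https://example.com/", html, plugin="word_count")
print(json.dumps(result, indent=2))
```

The point of keeping plugin output under its own key is that the default fields stay stable no matter which plugin you call, so existing code that reads the default JSON keeps working.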