Crawling as a Service From AuthorityLabs – Get 1 Million Pages Free

Want to crawl a site for outbound link data? Perhaps you’d like to analyze header tags across your entire site, or aggregate a ton of publicly available data from around the web.

We’re excited to announce that our general crawling as a service API endpoint is now live in beta. We’ve been playing with the idea for a while now, and after some initial development and testing, we are making it available on Partner API accounts.

We’re planning to scale this to hundreds of billions of requests a month, and in order to do that carefully and smoothly, you will need to contact us to get started if you’re interested. We just want to make sure we can maintain performance at the volume people will need. We’re already processing billions of pages with ease, but we still want to make sure we’ve got our sh*t together.

We haven’t made the docs for this service publicly available yet. Each person we’ve worked with so far has been interested in something pretty specific, which is why we want to work closely with each of you. We are already parsing out a lot of the data you’d expect from a typical HTML page (title, description, H1s, links, etc. – see this gist for an example) and are adding more options every day. We want to know what kinds of things people need in the default JSON version of a page, so if you have ideas, definitely let us know.
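To give a rough idea of the format, the default JSON version of a page might look something like the sketch below. The field names here are illustrative only – the gist linked above shows the actual structure.

    {
      "url": "http://example.com/",
      "title": "Example Domain",
      "description": "An example meta description",
      "h1": ["Example Domain"],
      "links": [
        { "href": "http://www.iana.org/domains/example", "text": "More information" }
      ]
    }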

If you’re looking to get started and run Ruby, check out our Partner API Ruby Gem.

When you sign up for our Partner API, we’ll credit your account with 1,000,000 pages of free crawling. Contact us for more information.
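If you’d rather hit the endpoint directly, here’s a minimal sketch using Ruby’s standard library. It assumes the insight endpoint mentioned in the comments below, a queue-then-fetch flow, and hypothetical parameter names (auth_token, url) – the docs we share when you sign up have the real interface.

    require 'net/http'
    require 'json'
    require 'uri'

    endpoint = URI('http://api.authoritylabs.com/web/insight.json')

    # Queue the page for crawling (parameter names are assumptions,
    # not the documented interface).
    Net::HTTP.post_form(endpoint,
                        'auth_token' => 'YOUR_PARTNER_API_KEY',
                        'url'        => 'http://example.com/')

    # Later, fetch the parsed JSON version of the page.
    endpoint.query = URI.encode_www_form(auth_token: 'YOUR_PARTNER_API_KEY',
                                         url: 'http://example.com/')
    page = JSON.parse(Net::HTTP.get_response(endpoint).body)
    puts page['title']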

The Future

Imagine being able to…

  1. Write a little bit of code.
  2. Upload that as a plugin.
  3. Call that plugin using a parameter in your POST request along with the URL to crawl.
  4. GET the JSON version of that page with plugin-specific data appended to the default data we already parse out.

Well, that’s what we’re wrapping up right now as one of the next features for the Partner API and crawling as a service. If you’d like to get started, please contact us for more information.
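To make that concrete, a call with a plugin might look something like the sketch below. The plugin parameter and the word_count plugin are made-up names for illustration – the real interface is still being finalized.

    require 'net/http'
    require 'uri'

    endpoint = URI('http://api.authoritylabs.com/web/insight.json')

    # Queue a crawl that also runs your uploaded plugin
    # ('plugin' and 'word_count' are hypothetical names).
    Net::HTTP.post_form(endpoint,
                        'auth_token' => 'YOUR_PARTNER_API_KEY',
                        'url'        => 'http://example.com/',
                        'plugin'     => 'word_count')

    # The JSON you GET back would then include the default parsed
    # fields plus whatever your plugin appended.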

About Brian LaFrance

Brian is the marketing director here at AuthorityLabs. He's not an expert or guru at anything, but he makes good things happen.

Comments

  1. Parnell Springmeyer says:

    Not sure how I feel about this – why aren’t you just stripping out all of the media and gzipping the HTML to serve to the client?

    Any programmer worth his salt will know that getting back a distilled version of the original resource is, generally, A Bad Idea. Additionally, you’re throwing out a much more capable form of markup by using JSON: XML.

    JSON has no notion of types, parent -> child relationships, and many other features that make XML, XSD, XSLT, and that whole family very powerful.

    I could see this being potentially interesting if you parsed out the meaningful data into an XML representation with proper type classing &c… That would be very useful. But JSON, really?

    • Parnell Springmeyer says:

      I mean, that’s why we have all of these tools, like lxml or simplexml – that can create rich object trees of the marked up data. You can do so much more with that than just a little JSON blob!

    • Chase Granberry says:

      Parnell… I see what you’re saying, but in general we went with JSON because there’s not as much overhead associated with it… it’s exactly that… little :) We may end up providing XML also, but we just haven’t needed it yet and the rest of our stuff is JSON.

      Aside from the JSON vs XML debate, I think you’re missing the point. Yes, it’s easy to turn one HTML page into JSON, but do that for millions, hundreds of millions, or billions of pages and it starts getting much more complex. We now have the ability to abstract a lot of the complexity and load out of the process and help people scale crawling anything more quickly and easily.

      Oh… and to answer your first question… you can easily access exactly what we turned into JSON by making a request to http://api.authoritylabs.com/web/insight.html instead of http://api.authoritylabs.com/web/insight.json