I’m learning to crawl…

I’m learning to fly but I ain’t got wings
Coming down is the hardest thing
– Tom Petty

Blog Discovery

This chart shows how long the tail is when climbing up the authority scale until a blog reaches more than 1,000 reactions. Then the sky is the limit!

I was looking for a way to get more blogs for the blog monitoring project and was thinking about different ways to go about it.

So, I decided to write a blog crawler. I settled on the following strategy for discovering new blogs:

Starting from one of the great blogs that I have already discovered manually, I look up its inbound links (using the Technorati Cosmos API call) and keep only the linking blogs whose authority is equal to or higher than that of the starting blog. My working assumption is that blogs with higher authority will lead me to higher-quality blogs. The crawler then continues recursively along this path, hopefully as high as possible. It stops when the current blog (the parent node) has no more inbound links (blog reactions) with equal or higher authority.

I believe that starting from each of the blogs in my origin list will lead to a list of high-potential blogs, and that repeating the crawl every now and then will help me discover great new bloggers.

Once I solve the blog categorization problem (Technorati doesn't provide this information through its API), I could also improve the crawl to find only blogs from the same domain of interest. That way I can check a blog's visibility (and maybe its reach or influence) within a category.
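The strategy above can be sketched in a few lines of Python. This is only an illustration, not my actual implementation: the `Blog` record, the `crawl_up` function and the `fetch_inbound` callback are hypothetical names, and the callback stands in for whatever wraps the Technorati Cosmos call. I also assume two things the description does not spell out: an explicit stack instead of recursion, and a set of already seen URLs so that blogs of equal authority linking to each other cannot loop forever. The default budget of 500 lookups mirrors the daily API limit.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Blog:
    url: str
    authority: int


# Hypothetical callback: given a blog URL, return the blogs that link to it.
# In practice this would wrap the inbound-links (Cosmos) API call.
InboundFetcher = Callable[[str], Iterable[Blog]]


def crawl_up(start: Blog, fetch_inbound: InboundFetcher,
             max_calls: int = 500) -> List[Blog]:
    """Climb from `start` through inbound links of equal or higher authority."""
    discovered: Dict[str, Blog] = {start.url: start}
    stack = [start]
    calls = 0
    while stack and calls < max_calls:
        current = stack.pop()
        calls += 1  # one lookup per visited blog
        for linker in fetch_inbound(current.url):
            # Keep only blogs at least as authoritative as the parent node.
            if linker.authority < current.authority:
                continue
            # Assumption: skip blogs we have already seen to avoid cycles.
            if linker.url in discovered:
                continue
            discovered[linker.url] = linker
            stack.append(linker)
    found = [b for b in discovered.values() if b.url != start.url]
    # Highest authority first, so the best catch is at the top of the list.
    return sorted(found, key=lambda b: b.authority, reverse=True)
```

A branch ends naturally when a blog has no inbound links that pass the authority filter, which is the stop condition described above.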

I did a test run today starting from one of the blogs from tier3:

The starting blog: authority = 16, rank = 517455

The crawler ended up finding 128 new blogs.

The top blog in the list is: Daily Kos: State of the Nation, authority = 10854, rank = 12

I think that this is not a bad catch.

I wanted to continue but I maxed out my Technorati API calls for the day (500) 🙂

The theory behind this strategy is that it is easier to get on the radar of low-authority (up-and-coming) blogs and then keep moving your message up. If these bloggers already have some visibility among higher-quality blogs, they may help to expand your reach. You can see how many comments I got from tier3 compared with tier2 and tier1.

A few more thoughts:

  • I could add a check for freshness using the lastupdate date
  • It is also possible to go top down: start from a top blog and discover its outbound links (without using Technorati)
  • The same approach could be applied to Twitter: who follows you, what field they are from, and what their credentials are in the blogosphere
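The freshness check from the first bullet could look roughly like this. It is a sketch under assumptions: `is_fresh` is a hypothetical helper, the 30-day threshold is arbitrary, and the lastupdate value is assumed to be already parsed into a timezone-aware `datetime`.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_fresh(last_update: datetime, max_age_days: int = 30,
             now: Optional[datetime] = None) -> bool:
    """Return True if the blog was updated within the last `max_age_days`."""
    # `now` can be injected for testing; by default use the current UTC time.
    now = now or datetime.now(timezone.utc)
    return now - last_update <= timedelta(days=max_age_days)
```

The crawler could call such a check before following a blog further, so that abandoned blogs do not use up API calls.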

I have more thoughts and observations after running my first test, and I skipped some of the implementation details, but I hope that you can see the picture.

I’m learning to fly around the clouds
But what goes up must come down
– Tom Petty

Love to hear your thoughts.
