In his post “Could Someone Explain Technorati,” Chris Brogan wonders about the consistency, accuracy and reliability of the Technorati service. I can’t explain the behavior of the system over there, but I can share some of my experience dealing with different challenges using online APIs (web services) and data. The objective here is to help other mashupers better prepare for future integration efforts across multiple web services. Since it appears that the mashuper community is growing faster than the web service providers, I’m sure more fellow API consumers can share some stories of their own. I’ll be happy to hear about them.
I see three participants’ perspectives in this “love triangle”: the web site visitor, the mashuper (the API consumer) and the service provider.
My visitor experience:
Chris Brogan talks about his experience from the user perspective in his post. I have nothing to add here, except that as a service provider, satisfying my loyal community would be my top concern. Maybe the way to deal with the case from Chris’s post is by monitoring for exceptions (a drastic rise or fall in the rank/authority).
My mashup experience:
As I mentioned in some of my earlier posts (here, here and here), I’m working on a small project for finding productive bloggers by monitoring for consistent improvements in their Technorati rank. So I now monitor the rank for over 800 bloggers on a frequent basis and post some of the results to a designated Twitter account: blogmon.
The first set of challenges is dealing with volatile data:
- Sometimes there is no authority in the results (inboundblogs).
- Sometimes there is no valid last-update date in the results: <lastupdate>1970-01-01 00:00:00 GMT</lastupdate>
- Most of the time there is no author (the user did not add one).
- Sometimes there are no tags (the user did not add them).
- Sometimes, as Chris mentioned, the rank is off for a short period of time.
For example see Seth Godin’s Blog rank history:
last update    rank    authority
2/12/2008      19      8599
2/25/2008      18      8697
3/17/2008      19      8658
3/22/2008      16      8827
4/10/2008      15      8946
4/19/2008      16      8882
4/23/2008      17      8819
5/12/2008      17      8828
5/14/2008      16      8863
5/20/2008      15      8890
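The “consistent improvement” check that drives my project can be sketched roughly as follows. This is a hypothetical illustration, not my actual code: `is_improving` and the `min_drops` threshold are assumptions, and it relies only on the fact that a lower Technorati rank is better.

```python
# Hypothetical sketch: flag a blog as "improving" when its rank
# (lower is better) dropped enough times across the stored history.
# The min_drops threshold is an arbitrary assumption for illustration.
def is_improving(rank_history, min_drops=3):
    """rank_history: list of ranks ordered oldest -> newest."""
    drops = sum(1 for a, b in zip(rank_history, rank_history[1:]) if b < a)
    return drops >= min_drops

# Seth Godin's rank history from the table above, oldest first:
seth = [19, 18, 19, 16, 15, 16, 17, 17, 16, 15]
```

With this data `is_improving(seth)` comes out true: the rank dropped five times over the period, even though it also bounced back up a few times in between.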
These are the details that a consumer of volatile online data must plan for and find ways to compensate for:
- Check the validity of the date.
- Don’t just count on the last result, i.e. search for the last valid result and monitor over time.
- Be prepared to post partial results (e.g. no top tags or author).
- Most important: guard your data, i.e. protect what you take from the service and store in your records.
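The points above can be sketched as a small validate-and-merge step. This is a hedged illustration, not Technorati’s API or my real code: the record shape and `merge_result` are assumptions; only the epoch sentinel date comes from the results shown earlier.

```python
# Hypothetical sketch: validate volatile results before letting them
# overwrite stored records, so a bogus response can't wipe good data.
from datetime import datetime

EPOCH_SENTINEL = "1970-01-01 00:00:00 GMT"  # the invalid date seen above

def parse_last_update(raw):
    """Return a datetime, or None when the date is missing or bogus."""
    if raw is None or raw.strip() == EPOCH_SENTINEL:
        return None
    try:
        return datetime.strptime(raw.strip(), "%Y-%m-%d %H:%M:%S %Z")
    except ValueError:
        return None

def merge_result(stored, fresh):
    """Keep the last *valid* value per field instead of blindly
    overwriting the stored record with whatever came back."""
    merged = dict(stored)
    for field in ("rank", "authority", "author", "tags"):
        if fresh.get(field) not in (None, "", []):
            merged[field] = fresh[field]
    if parse_last_update(fresh.get("lastupdate")) is not None:
        merged["lastupdate"] = fresh["lastupdate"]
    return merged
```

The key design choice is that missing or sentinel values never replace previously stored valid ones, which is what “guard your data” means in practice.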
The next set of challenges has to do with the web service’s behavior:
- I got the following error once or twice: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.
- Some API requests come back with:
<META HTTP-EQUIV="REFRESH" CONTENT="2; URL=http://api.technorati.com/bloginfo?url=****&key=****&version=0.9&start=1&limit=30&claim=0&highlight=0">
**I intentionally masked the URL, title, image and my developer key with ****
This result can crash your system if not handled.
- Finally, and I get this one a lot :)
<?xml … “http://api.technorati.com/dtd/tapi-002.xml”>
<error>You have used up your daily allotment of Technorati API queries.</error>
- I can’t picture my dev world without exception handling – it is the ultimate protection against unexpected web service behavior in this specific case. So guard every call, every XML load and every parsing step by wrapping them in a try/catch block.
- Logging – log expected and unexpected behavior for later analysis and recovery.
- Build the system so exceptions are caught and logged, but execution can move on to the next task.
- This is something I learned from a smart Army officer: “If there is a doubt, there is no doubt” – basically, it is better not to report at all than to report inaccurate data.
- Find ways to minimize API calls – e.g. I ask for tags only when I find a blog worth reporting on.
- A thought: I’m not an expert in XML and DTDs, but could it be that using a DTD slows down the web service? If you know more about it, please share with me/us. Is it really necessary on read-only calls?
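The guard-everything advice above can be sketched as a small monitoring loop. This is an assumed illustration, not my real code: `fetch_rank` stands in for whatever makes the actual Technorati bloginfo call, and the XML shapes mirror the error responses shown earlier.

```python
# Hypothetical sketch: wrap each call/parse in try/except so one bad
# response (HTML refresh page, rate-limit error, dropped connection)
# is logged and skipped instead of crashing the whole run.
import logging
import xml.etree.ElementTree as ET

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("blogmon")

def parse_rank(xml_text):
    """Extract the rank; raise on error payloads or non-XML responses."""
    root = ET.fromstring(xml_text)          # raises ParseError on HTML junk
    err = root.find(".//error")
    if err is not None:
        raise RuntimeError(err.text)        # e.g. the daily-allotment error
    return int(root.findtext(".//rank"))

def monitor(blog_urls, fetch_rank):
    """Catch, log and move on to the next blog – never stop the batch."""
    results = {}
    for url in blog_urls:
        try:
            results[url] = parse_rank(fetch_rank(url))
        except Exception as exc:
            log.warning("skipping %s: %s", url, exc)
    return results
```

Note that a blog that fails simply produces no entry at all, which follows the “if there is a doubt, there is no doubt” rule: nothing is reported rather than something possibly wrong.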
About the service:
I can’t talk much about what a web service provider feels or experiences (I’m sure Ian Kallen from Technorati has a lot to share on this subject), but I want to say a few things:
- Please don’t get this post wrong: I’m a fan of Technorati – I use it, I deeply appreciate their service, and I’m thankful for the option of using the APIs. As I said earlier, the intention is to share from experience and to help you better prepare for such an effort.
- I guess it is hard to estimate the load on the system with such growth in the number of mashupers out there. So my heart is with them.
- There are two more threats that a web service provider needs to protect itself from, and I’m sure those consume some energy: protecting the hard-gathered data and its environments from abuse and malicious attacks.
One last comment: ironically, I have had no problems with Twitter so far :) but I’m aware of the pain that some Twitter API users suffer occasionally.