Fantastic article by Jeff Atwood, of Stack Overflow [1], on Exception Driven Development – some highlighted excerpts:
“If you're waiting around for users to tell you about problems with your website or application, you're only seeing a tiny fraction of all the problems that are actually occurring. The proverbial tip of the iceberg.
…
The first thing any responsibly run software project should build is an exception and error reporting facility
… Our exception logs are a de-facto to do list for our team
… Broad-based trend analysis of error reporting data shows that 80% of customer issues can be solved by fixing 20% of the top-reported bugs. Even addressing 1% of the top bugs would address 50% of the customer issues. The same analysis results are generally true on a company-by-company basis too.
… Although I remain a fan of test driven development, the speculative nature of the time investment is one problem I've always had with it. If you fix a bug that no actual user will ever encounter, what have you actually fixed? While there are many other valid reasons to practice TDD, as a pure bug fixing mechanism it's always seemed far too much like premature optimization for my tastes. I'd much rather spend my time fixing bugs that are problems in practice rather than theory.
You can certainly do both. But given a limited pool of developer time, I'd prefer to allocate it toward fixing problems real users are having with my software based on cold, hard data. That's what I call Exception-Driven Development. Ship your software, get as many users in front of it as possible, and intently study the error logs they generate. Use those exception logs to hone in on and focus on the problem areas of your code. Rearchitect and refactor your code so the top 3 errors can't happen any more. Iterate rapidly, deploy, and repeat the process. This data-driven feedback loop is so powerful you'll have (at least from the users' perspective) a rock stable app in a handful of iterations.”
Side-stepping the implementation details (I personally haven’t seen a strong justification for using anything beyond Enterprise Library + MSMQ + a custom database), the value is really in what you collect and how you store it rather than in the particulars of your approach (whether it be log4net, EL, or ELMAH [2]).
At a minimum, the following information should be available in indexed columns for querying:
- IP
- Web Server
- DB Server
- Message
- Time
- Severity
- SessionId (ASP.NET’s – crucial for correlating activity)
- File Name (Class)
- Method Name
- Line Number (IL Offset)
- Url
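As a rough illustration of the list above, here is a minimal sketch of such a table, using Python and SQLite purely as stand-ins for whatever database you actually run. The table name, column names, and the `create_store` / `log_exception` helpers are all hypothetical. Indexing every column, including the message text, follows the list literally; a real schema would tune this.

```python
import sqlite3

# Hypothetical names throughout; SQLite stands in for the real database.
SCHEMA = """
CREATE TABLE exception_log (
    id           INTEGER PRIMARY KEY,
    ip           TEXT,
    web_server   TEXT,
    db_server    TEXT,
    message      TEXT,
    logged_at    TEXT,      -- ISO-8601 UTC timestamp
    severity     TEXT,
    session_id   TEXT,      -- ASP.NET session id, for correlating requests
    file_name    TEXT,
    method_name  TEXT,
    line_number  INTEGER,   -- or the IL offset when no line info exists
    url          TEXT
);
"""

INDEXED_COLUMNS = [
    "ip", "web_server", "db_server", "message", "logged_at", "severity",
    "session_id", "file_name", "method_name", "line_number", "url",
]


def create_store(conn):
    """Create the table plus one index per queryable column."""
    conn.executescript(SCHEMA)
    for col in INDEXED_COLUMNS:
        conn.execute(
            f"CREATE INDEX ix_exception_log_{col} ON exception_log ({col})"
        )


def log_exception(conn, **fields):
    """Insert one row; only the known columns are accepted."""
    unknown = set(fields) - set(INDEXED_COLUMNS)
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    columns = ", ".join(fields)
    placeholders = ", ".join(f":{name}" for name in fields)
    cur = conn.execute(
        f"INSERT INTO exception_log ({columns}) VALUES ({placeholders})",
        fields,
    )
    return cur.lastrowid
```

Calling `create_store(sqlite3.connect(":memory:"))` followed by `log_exception(conn, ip="10.0.0.5", web_server="WEB01", message="boom")` is enough to try it out.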
Next, the exception should have attached to it, in some form, a set of Extended Properties (typically stored in a different table under a CLOB):
- Full Stack Trace (customized, in some instances)
- Original Request Headers (all, especially cookie)
- Request Total Bytes
- Http Method
- Url Referrer
- ASP.NET Request Cookies (differ from header cookie, if changed)
- Form Data
- Session Data (particularly IsNewSession)
For session and form data, you have to put domain-specific rules in place to avoid capturing unnecessary data (ViewState, for example, or DataSets stored in Session). It’s prudent to plan on keeping the indexed data for a minimum of 6 months, if possible, and the CLOB values for half of that period.
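To make the filtering and retention ideas a little more concrete, here is a small Python sketch. The skipped field names, the size cap, the JSON layout, and the function names are assumptions for illustration only, as is treating 6 months as 180 days. Your own domain rules will differ.

```python
import json
from datetime import datetime, timedelta, timezone

# Assumed domain rules: adjust the field names and the cap to your application.
SKIPPED_FORM_FIELDS = {"__VIEWSTATE", "__EVENTVALIDATION"}
MAX_VALUE_CHARS = 2000

# "6 months" approximated as 180 days; CLOB values are kept half as long.
INDEXED_RETENTION = timedelta(days=180)
CLOB_RETENTION = INDEXED_RETENTION / 2


def trim(value):
    """Stringify a value and cap its length."""
    text = str(value)
    if len(text) > MAX_VALUE_CHARS:
        return text[:MAX_VALUE_CHARS] + "...[truncated]"
    return text


def summarize_session(session):
    """Keep simple values; record bulky objects (DataSets etc.) by type only."""
    summary = {}
    for key, value in session.items():
        if value is None:
            summary[key] = None
        elif isinstance(value, (str, int, float, bool)):
            summary[key] = trim(value)
        else:
            summary[key] = f"<{type(value).__name__} omitted>"
    return summary


def build_extended_properties(stack_trace, headers, form, session, is_new_session):
    """Return the JSON text to store in the CLOB column."""
    clean_form = {
        key: trim(value)
        for key, value in form.items()
        if key not in SKIPPED_FORM_FIELDS
    }
    return json.dumps({
        "stack_trace": stack_trace,
        "request_headers": dict(headers),
        "form": clean_form,
        "session": summarize_session(session),
        "is_new_session": is_new_session,
    })


def retention_cutoffs(now):
    """Return (indexed_cutoff, clob_cutoff); older rows are candidates for purging."""
    return now - INDEXED_RETENTION, now - CLOB_RETENTION
```

A nightly job could call `retention_cutoffs(datetime.now(timezone.utc))` and delete CLOB rows older than the second value while keeping the indexed rows until the first.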
Being able to look back on trends is crucial; if you have time to implement a more elaborate warehousing strategy, all the better. But if you can’t answer basic questions like:
- What are the top trending exceptions in the last hour, day, week, or month?
- Is the error isolated to one web server or wide-spread? What about database servers?
- Are there trends by IP?
- Did the user see a Yellow Screen of Death or was it a behind-the-scenes exception?
- What was the sequence of exceptions for a particular session (series of requests leading up to it)?
- What cookies did a request start with; what cookies were assigned, prior to the exception?
- What were the form and header values that generated a particular exception?
… then you really should stop development and reconsider your bearings.
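For what it’s worth, once the indexed columns exist, the first few of those questions reduce to short queries. The sketch below uses Python and SQLite as stand-ins and assumes a hypothetical `exception_log` table with `message`, `logged_at` (ISO-8601 text), `web_server`, `session_id`, and `url` columns. None of these names come from any particular logging library.

```python
import sqlite3

# Assumes a hypothetical exception_log table with the columns
# message, logged_at (ISO-8601 text), web_server, session_id and url.


def top_exceptions(conn, since, limit=3):
    """Most frequent exception messages logged at or after `since`."""
    return conn.execute(
        "SELECT message, COUNT(*) AS hits FROM exception_log "
        "WHERE logged_at >= ? "
        "GROUP BY message ORDER BY hits DESC, message LIMIT ?",
        (since, limit),
    ).fetchall()


def spread_by_server(conn, message):
    """Is the error isolated to one web server or wide-spread?"""
    return conn.execute(
        "SELECT web_server, COUNT(*) FROM exception_log "
        "WHERE message = ? GROUP BY web_server ORDER BY web_server",
        (message,),
    ).fetchall()


def session_timeline(conn, session_id):
    """Sequence of exceptions for one session, oldest first."""
    return conn.execute(
        "SELECT logged_at, url, message FROM exception_log "
        "WHERE session_id = ? ORDER BY logged_at",
        (session_id,),
    ).fetchall()
```

Passing a timestamp one hour, one day, or one week in the past as `since` covers the different trending windows with the same query.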
[1] – Jeff Atwood: Exception Driven Development
http://www.codinghorror.com/blog/archives/001239.html
[2] – More on ELMAH
http://www.hanselman.com/blog/ELMAHErrorLoggingModulesAndHandlersForASPNETAndMVCToo.aspx
http://code.google.com/p/elmah/
