Fantastic article by Jeff Atwood, of Stack Overflow [1], on Exception Driven Development – some highlighted excerpts:
“If you're waiting around for users to tell you about problems with your website or application, you're only seeing a tiny fraction of all the problems that are actually occurring. The proverbial tip of the iceberg.
…
The first thing any responsibly run software project should build is an exception and error reporting facility
… Our exception logs are a de-facto to do list for our team
… Broad-based trend analysis of error reporting data shows that 80% of customer issues can be solved by fixing 20% of the top-reported bugs. Even addressing 1% of the top bugs would address 50% of the customer issues. The same analysis results are generally true on a company-by-company basis too.
… Although I remain a fan of test driven development, the speculative nature of the time investment is one problem I've always had with it. If you fix a bug that no actual user will ever encounter, what have you actually fixed? While there are many other valid reasons to practice TDD, as a pure bug fixing mechanism it's always seemed far too much like premature optimization for my tastes. I'd much rather spend my time fixing bugs that are problems in practice rather than theory.
You can certainly do both. But given a limited pool of developer time, I'd prefer to allocate it toward fixing problems real users are having with my software based on cold, hard data. That's what I call Exception-Driven Development. Ship your software, get as many users in front of it as possible, and intently study the error logs they generate. Use those exception logs to hone in on and focus on the problem areas of your code. Rearchitect and refactor your code so the top 3 errors can't happen any more. Iterate rapidly, deploy, and repeat the process. This data-driven feedback loop is so powerful you'll have (at least from the users' perspective) a rock stable app in a handful of iterations.”
Side-stepping the implementation details (I personally haven’t seen strong justification in favour of using anything beyond Enterprise Library + MSMQ + Custom Database), the value is really in what you collect and how you store it rather than the particulars of your approach (whether it be log4net, EL or ELMAH [2]).
At a minimum, the following information should be available in indexed columns for querying:
- IP
- Web Server
- DB Server
- Message
- Time
- Severity
- SessionId (ASP.NET’s – crucial for correlating activity)
- File Name (Class)
- Method Name
- Line Number (IL Offset)
- Url
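To make the list above concrete, here is a minimal sketch of such a table, using Python's built-in sqlite3 module purely for illustration. The table name, column names and index names are my own assumptions; a production setup would more likely sit in SQL Server or similar, and you may not want to index every column (a long free-text Message, for instance).

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS exception_log (
    id          INTEGER PRIMARY KEY,
    ip          TEXT,
    web_server  TEXT,
    db_server   TEXT,
    message     TEXT,
    logged_at   TEXT,     -- ISO 8601 timestamp, so text ordering matches time ordering
    severity    TEXT,
    session_id  TEXT,     -- ASP.NET session id, used to correlate requests
    file_name   TEXT,
    method_name TEXT,
    line_number INTEGER,  -- or IL offset when no line number is available
    url         TEXT
);
"""

# One index per queryable column, mirroring the list in the post.
INDEXED_COLUMNS = [
    "ip", "web_server", "db_server", "message", "logged_at", "severity",
    "session_id", "file_name", "method_name", "line_number", "url",
]


def create_exception_log(conn):
    """Create the table and a single-column index for each queryable field."""
    conn.executescript(SCHEMA)
    for col in INDEXED_COLUMNS:
        conn.execute(
            f"CREATE INDEX IF NOT EXISTS ix_exception_log_{col} "
            f"ON exception_log ({col})"
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
create_exception_log(conn)
```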
Next, the exception should have attached to it, in some form, a set of Extended Properties (typically stored in a different table under a CLOB):
- Full Stack Trace (customized, in some instances)
- Original Request Headers (all, especially cookie)
- Request Total Bytes
- Http Method
- Url Referrer
- ASP.NET Request Cookies (differ from header cookie, if changed)
- Form Data
- Session Data (particularly IsNewSession)
For session and form data, you have to exercise some caution in putting domain-specific rules in place to avoid unnecessary data (like ViewState, for example; or DataSets stored in Session). It’s prudent to plan on keeping the indexed data for a minimum of 6 months, if possible, and CLOB values for half of that period.
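As a rough sketch of what such domain-specific rules might look like, the snippet below drops known-bulky form fields and truncates anything oversized before the values are serialized into the CLOB. The function name, the skip list and the size cap are assumptions for illustration; __VIEWSTATE and __EVENTVALIDATION are the usual Web Forms hidden fields, but your own rules will differ.

```python
import json

# Assumed rules: adjust the skip list and the cap to your own domain.
SKIPPED_KEYS = {"__VIEWSTATE", "__EVENTVALIDATION"}
MAX_VALUE_CHARS = 2000  # hypothetical cap per value


def scrub(values, skipped_keys=SKIPPED_KEYS, max_chars=MAX_VALUE_CHARS):
    """Return a copy of form/session values that is small enough to store."""
    cleaned = {}
    for key, value in values.items():
        if key in skipped_keys:
            # Record that the field was present without keeping its payload.
            cleaned[key] = "[skipped]"
            continue
        # Non-string values (e.g. objects held in session) are stored as their repr.
        text = value if isinstance(value, str) else repr(value)
        if len(text) > max_chars:
            text = text[:max_chars] + "...[truncated]"
        cleaned[key] = text
    return cleaned


form = {"__VIEWSTATE": "x" * 50000, "email": "user@example.com"}
clob = json.dumps(scrub(form))  # this string is what would go into the CLOB column
```

Keeping a "[skipped]" marker rather than dropping the key outright means you can still tell, months later, that the field was posted.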
Being able to look back on trends is crucial; if you have time to implement a more elaborate warehousing strategy, all the better. But if you can’t answer basic questions like:
- What are the top trending exceptions in the last hour, day, week, or month?
- Is the error isolated to one web server or wide-spread? What about database servers?
- Are there trends by IP?
- Did the user see a Yellow Page of Death or was it a behind-the-scenes exception?
- What was the sequence of exceptions for a particular session (series of requests leading up to it)?
- What cookies did a request start with; what cookies were assigned, prior to the exception?
- What were the form and header values that generated a particular exception?
… then you really should stop development and reconsider your bearings.
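For a feel of how cheap these questions are to answer once the data is in indexed columns, here is a small sketch against a cut-down table, again using sqlite3 for illustration. The table layout, the function names and the four sample rows are all made up for the example; they are not real log data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE exception_log ("
    "message TEXT, web_server TEXT, session_id TEXT, logged_at TEXT)"
)

# Made-up sample rows, only here so the queries have something to return.
rows = [
    ("NullReferenceException", "web01", "s1", "2024-01-01T10:00:00"),
    ("NullReferenceException", "web01", "s2", "2024-01-01T10:05:00"),
    ("TimeoutException", "web02", "s1", "2024-01-01T10:07:00"),
    ("NullReferenceException", "web01", "s3", "2024-01-01T10:09:00"),
]
conn.executemany("INSERT INTO exception_log VALUES (?, ?, ?, ?)", rows)


def top_exceptions(conn, since, limit=3):
    """Most frequent exception messages logged at or after `since`."""
    return conn.execute(
        "SELECT message, COUNT(*) AS hits FROM exception_log "
        "WHERE logged_at >= ? GROUP BY message "
        "ORDER BY hits DESC, message LIMIT ?",
        (since, limit),
    ).fetchall()


def servers_for(conn, message):
    """Per-web-server counts: is the error isolated or wide-spread?"""
    return conn.execute(
        "SELECT web_server, COUNT(*) FROM exception_log "
        "WHERE message = ? GROUP BY web_server ORDER BY web_server",
        (message,),
    ).fetchall()


def session_timeline(conn, session_id):
    """Exceptions for one session, in the order they happened."""
    return conn.execute(
        "SELECT logged_at, message FROM exception_log "
        "WHERE session_id = ? ORDER BY logged_at",
        (session_id,),
    ).fetchall()


print(top_exceptions(conn, "2024-01-01T10:00:00"))
print(servers_for(conn, "NullReferenceException"))
print(session_timeline(conn, "s1"))
```

The same three shapes (group by message, group by server, order by time within a session) cover most of the questions in the list; the cookie and form questions need the extended properties as well.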
[1] – Jeff Atwood: Exception Driven Development
http://www.codinghorror.com/blog/archives/001239.html
[2] – More on ELMAH
http://www.hanselman.com/blog/ELMAHErrorLoggingModulesAndHandlersForASPNETAndMVCToo.aspx
http://code.google.com/p/elmah/
2 comments:
While I create applications in a different environment I too agree logging can be a PIA but the P is worth it because it can really save your A.
I created a process to scrub a contacts table of social security numbers. When my process found an SSN it logged the raw field and then the update statement with the splatted out SSNs. My recollection is there were like 25 million records to process, so I split the process up into date based partitions and we would process one partition per run.
Reviewing results in the middle of the entire process I noted some records contained MULTIPLE SSNs, so I had to change my regular expression substitution (from /.../.../ to /.../.../g) to account for the possibility of multiple SSNs.
Guess what? I had about 12 million records possibly containing multiple SSNs.
Hehehehe, all I did was to create a script to create a new file by extracting the relevant data from the log file and feed the new file back into my main process. This is not the only time log files have come to the rescue.
CodeSmith Tools recently released a new product, CodeSmith Insight, that is an excellent tool for Exception Driven Development. Insight's application integration allows you to (among other things) report all unhandled exceptions just by referencing the Insight client assembly. Also, possibly best of all, it's free to get started!
Check it out at http://www.codesmithinsight.com/