Crisis management of IT projects – a 10-point checklist

Introduction

Many of the projects that we undertake for our customers start from an existing project that has gone wrong. The cause may be a misunderstanding between the former supplier and the customer about what the project was meant to deliver, internal staff who are not qualified in the field to which the project belongs, inadequate or poor technical implementation, or simply deficient routines. It can be hard to tell which of these is to blame without a thorough analysis, but going through the ten points below should make it easier to get an overview of where the problems come from.

1. Responsibilities and process
The first point has nothing to do with technical matters – it is about roles, about responsibility for the different parts of a project and about how you collaborate. It can be clarified relatively easily by identifying how various types of tasks are handled: has it been defined who is responsible for what, who the contact person is in each area, who can stand in for whom, what the agreed response time is, what it takes for a requirement to count as fulfilled, and so on? Once these organisational aspects have been uncovered and documented, it should be obvious to everybody how new features and errors are to be handled in the easiest way.

2. Environments, source code control and backups
When you handle a large system, it is a minimum requirement that a test environment has been set up where new features and corrections can be tested and approved before being put into operation. Large systems should also have a staging environment (a copy of the production environment with replicated data). In addition, the source code should be under revision control so that you can return to an earlier version if unexpected problems occur. A backup procedure should also be in place so that earlier data can be recovered if an unexpected error occurs or if the system is compromised. You should test-restore a sample of your backups to confirm that they are intact and that a restore is actually possible if it proves necessary.
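
A sample test of a backup can be as simple as restoring it to a scratch location and comparing checksums. The sketch below assumes a file-based backup; the function names are hypothetical, and the copy step is only a stand-in for whatever restore procedure your backup tool actually provides.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_is_restorable(original: Path, backup: Path) -> bool:
    """Restore the backup to a scratch directory and compare it with the original."""
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / backup.name
        # Assumption: a plain file copy stands in for the real restore step.
        shutil.copy2(backup, restored)
        return sha256_of(restored) == sha256_of(original)
```

For live data that changes after the backup is taken, you would compare against a checksum recorded at backup time rather than against the current original.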

3. Handling of errors and monitoring
Even the most tried and tested system may contain errors. One thing that separates a very good system from a good one is how errors are handled. When an error occurs, the first thing that should happen is that the user is given an explanation (for instance ‘The system is down temporarily – please try again in five minutes’). At the same time the administrator should be notified by email with detailed information about the error, so that the cause can be examined closely and the error can be remedied quickly. In .NET systems, error handling can typically be built on Microsoft Enterprise Library, which is one of the common standard frameworks for this purpose. In other words, if your system does not already include a well-documented model for handling errors and for monitoring, you should make sure it gets one.
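
The pattern described above can be sketched in a few lines: catch the error at the boundary, give the user a friendly message, and send the details to the administrator. This is an illustration only, not how Microsoft Enterprise Library works; `handle_request` and the `notify_admin` callback are hypothetical names, and in a real system the callback would send the email.

```python
import logging
import traceback
from typing import Callable

USER_MESSAGE = "The system is down temporarily. Please try again in five minutes."

logger = logging.getLogger("app.errors")


def handle_request(action: Callable[[], str], notify_admin: Callable[[str], None]) -> str:
    """Run an action; on failure, inform the user and alert the administrator."""
    try:
        return action()
    except Exception:
        # Keep the full stack trace for the administrator, never for the user.
        details = traceback.format_exc()
        logger.error("Unhandled error:\n%s", details)
        notify_admin(details)
        return USER_MESSAGE
```

Passing the notifier in as a parameter keeps the sketch independent of any particular mail setup and makes the behaviour easy to test.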

4. Traceability
Errors and problems do occur, but if you do not know why, they are difficult to rectify and will probably repeat themselves. By focusing on developing traceable systems you avoid problems recurring. First of all this means that logs must be in place wherever errors may occur (for instance sending of emails, the IIS application pool and so on). In this way you can examine things instead of guessing. Another way of increasing traceability is to make sure the right notification systems are in place. If errors happen in production, it is important that they are discovered quickly as part of the handling of errors (see the previous point).
On the organisational side, traceability means documenting and registering changes during deployments and similar events (with date, time and person), so that if an error happens, you know who made the change and can ask them about it.
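
Registering changes with date, time and person does not require special tooling. The sketch below appends one JSON line per change to a log file; the `ChangeRecord` fields and the function names are assumptions made for illustration, and many teams would get the same information from their deployment tool or revision control history instead.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import List


@dataclass
class ChangeRecord:
    person: str
    description: str
    timestamp: str  # ISO 8601, always in UTC


def record_change(log_file: Path, person: str, description: str) -> ChangeRecord:
    """Append one change (who, what, when) to the change log."""
    record = ChangeRecord(person, description, datetime.now(timezone.utc).isoformat())
    with log_file.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(asdict(record)) + "\n")
    return record


def read_changes(log_file: Path) -> List[ChangeRecord]:
    """Load every recorded change, oldest first."""
    with log_file.open(encoding="utf-8") as handle:
        return [ChangeRecord(**json.loads(line)) for line in handle if line.strip()]
```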

5. Transaction management
Transaction management ensures that, even if something goes wrong, it does not go seriously wrong. Banking systems are a good example of systems that are wrapped in transactions: if something goes wrong during a payment, for instance, no money should be drawn from the account. All systems contain internal dependencies, and if one step fails, the whole process should be stopped and rolled back so that one system error does not produce further errors that are difficult to explain and handle. As the person responsible for a system, you should make sure all central areas of functionality are covered by correct transaction logic so that an error does not cause other unwanted, related errors.
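
The bank example can be sketched with Python's built-in sqlite3 module, where using the connection as a context manager commits on success and rolls back if an exception is raised. The `accounts` table with `name` and `balance` columns is an assumption made for this illustration.

```python
import sqlite3


def transfer(conn: sqlite3.Connection, source: str, target: str, amount: int) -> None:
    """Move money between two accounts, or change nothing at all."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, source),
        )
        row = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (source,)
        ).fetchone()
        if row is None or row[0] < 0:
            raise ValueError("unknown source account or insufficient funds")
        updated = conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, target),
        )
        if updated.rowcount != 1:
            raise ValueError("unknown target account")
```

If the second update fails, the withdrawal made by the first update is rolled back as well, so the money is never drawn without arriving somewhere.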

6. Documentation
Thorough documentation means that programmers, people responsible for systems and others can be replaced without serious problems. It ensures that important knowledge is not lost or forgotten along the way, and that procedures (for instance concerning release and deployment) are complied with. It also reduces the risk of human errors caused by a lack of information. Documentation is a discipline in itself and should not be underestimated if you want your system to be safe and if you want to be independent of the system supplier.

7. Learning
Everybody makes mistakes, and if you do not learn from your mistakes, they will keep repeating themselves. For this reason regular, retrospective analyses of processes are critical. When errors occur, they should be discussed, proposals for solutions should be prepared and implemented, and the solutions should be evaluated and, if possible, improved. An organisation that learns all the time can continually improve both the quality of its products and the quality of its development work.

8. Use of standard systems
If something can be bought, this is often cheaper than building it yourself. Problems often occur when home-made functions take over areas that established CMS, finance, bulk email and CRM systems could have handled better. If standard systems exist on the market for a specific area of your system, it is often beneficial to integrate with one of them instead of reinventing things from scratch. To clarify this, you can ask whether there is something in your system that should be replaced by a standard system and whether this would provide advantages.

9. Best practice
All large tasks consist of a series of subtasks to which there are good and bad solutions. For most implementation tasks there are ‘best practice’ methods and procedures. If you follow these known patterns all the way through a project, the risk of errors is minimised because you use known and tested models from which most surprises have been eliminated. If your project is not based on best practice, it should be reviewed, and any ‘home-made’ solutions should be replaced by best practice in the field.

10. Test
Do not let users act as testers without them knowing about it. Good tests start with good internal procedures that are close to what is being developed, such as browser and device compatibility tests. Systems should also be covered by automated tests (unit tests, validation tests, stress tests and so on), and regression tests based on realistic user scenarios should be used for stabilization. Only when all this preliminary work has been done should real users be involved, and this should take place in a controlled fashion.
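
As an illustration of automated tests, the sketch below uses Python's built-in unittest module. The `order_total` function and its business rule are hypothetical and stand in for whatever central functionality your own system has; the last test shows how a regression test pins down an error so that it cannot return unnoticed.

```python
import unittest


def order_total(unit_price_cents: int, quantity: int, discount_percent: int = 0) -> int:
    """Hypothetical business rule: order total in cents after a percentage discount."""
    if quantity < 1:
        raise ValueError("quantity must be at least 1")
    if not 0 <= discount_percent <= 100:
        raise ValueError("discount must be between 0 and 100")
    gross = unit_price_cents * quantity
    return gross - (gross * discount_percent) // 100


class OrderTotalTests(unittest.TestCase):
    def test_plain_order(self):
        self.assertEqual(order_total(250, 4), 1000)

    def test_discount_is_applied(self):
        self.assertEqual(order_total(250, 4, discount_percent=10), 900)

    def test_regression_zero_quantity_is_rejected(self):
        # Regression test for an imagined earlier error: empty orders were accepted.
        with self.assertRaises(ValueError):
            order_total(250, 0)
```

Saved in a file, tests like these can be run with `python -m unittest` as part of every build, before any real user sees the change.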

Summary

  • Are responsibilities clearly allocated and is your project process clear as well?
  • Have environments been separated (development, testing, possible staging, production), is the source code subject to revision control and do you test your backups?
  • Do you deal with errors by means of user-friendly messages and notifications until the errors have been remedied?
  • Can you easily trace errors to their underlying causes?
  • Do you avoid extra errors caused by the first error by having transaction management of central functionality?
  • Are you in possession of updated documentation that ensures independence from your supplier?
  • Do you learn from errors and do you evaluate things on a continual basis?
  • Do you use standard systems where it makes sense instead of inventing things from scratch?
  • Are you sure you comply with best practice when you develop your solutions?
  • Do you test things thoroughly enough (compatibility tests, automated tests and regression tests) before involving users?

If you can answer yes to the 10 questions above, there is a much greater chance that your IT project will succeed. When you are one step ahead and the right procedures are in place, you stand a much better chance of developing good IT practices and avoiding surprises and weekend work. It is never too late to take control of your procedures and you can keep on improving them.
