An Introduction To Information Technology Practices
Work in progress; last updated 2007-09-01
Since 1985 I have been working professionally within Information Technology. Over the course of those years I have seen many different ways of delivering IT services, and have become opinionated about the good, the bad, and the ugly with respect to IT service and practice. This page is an attempt to capture my opinions in one place.
My approach in documenting my thoughts will be to focus on good practices, and only to talk about bad practices when they are destructive to the work environment. There is a true need to distinguish between what small companies should do vs. the practices that will benefit large corporations—large corporations are able to gain economies of scale—and I have attempted to indicate such situations.
Within this page I have used terminology that is commonly used and understood by IT professionals. I have attempted to define these terms as they are used, but please email me and alert me to terms that I have not sufficiently explained.
The volume of email most users receive borders on extreme, and filing that email for future retrieval can be a difficult problem. Here's how I file email for retrieval, along with my information retrieval methods: Christopher's Email Filing & Retrieval Method.
Application developers always clamour for access to the most powerful computers possible: the argument developers pose is that if they can compile their programs more quickly, then higher-quality (that is, bug-free) and feature-rich code will be written faster. Having both worked as an application developer and managed such staff, I do not accept such claims: there is no direct correlation between fast computers and programmer work product. On the contrary, programmers take greater care to ensure high-quality code when compiles are slow, because they do not want to see their compiles fail. When deciding what type of computer to deploy to programmers, do not be suckered into the faster-is-better argument.
When application developers create and test applications it is important that they do their work in an environment that is no better than the users they are developing for. If the developers have fast computers with extra-large monitor screens attached to high speed networks, then they will create applications that only perform well in an environment that has those attributes. If your customers (that is, the users) will mostly use slow computers with normal-sized monitors attached to slow networks, then you must act to ensure that the programs you deliver to them behave well in such an environment.
Twenty years ago, when computer environments were predominantly terminals attached to minicomputers (or mainframes), an application best practice was to connect developer terminals to the minicomputer via a serial line of the same (or slower) speed as was deployed to users. If developers had high-speed terminal-to-minicomputer links, then their applications would only perform well in such environments, which resulted in frustrated users.
Today, where applications are often compiled on the programmer's desktop, there is a legitimate need to give programmers a computer that compiles the applications in a reasonable timeframe; but the right balance must be found:
When planning the server environment for your business, it is important to understand when the services provided by each server are required, and then to architect your environment accordingly. It has become fashionable for IT managers to talk about "five nines" availability; however, in my experience very few of those who bandy availability numbers around really understand the subject.
The essential elements one must understand regarding server availability are:
To assist in communicating to others just what various levels of availability mean, I created a set of tables that show the amount of unplanned outage time one should expect to see in various hours of production situations. A sample from that workbook appears below. The workbook is available via this link: Availability Outage Tables.zip
7x24 Service Hours Coverage, No Maintenance Window
(unplanned downtime one should expect per week, month, year, and 5 years)

Availability | Planned Outage Windows | Weekly | Monthly | Yearly | 5-Yearly
99.9995 % | None | 3s. | 13s. | 2m. 38s. | 13m. 9s.
99.999 %  | None | 6s. | 26s. | 5m. 15s. | 26m. 18s.
99.995 %  | None | 30s. | 2m. 10s. | 26m. 17s. | 2h. 11m. 28s.
99.99 %   | None | 1m. | 4m. 19s. | 52m. 34s. | 4h. 22m. 57s.
99.75 %   | None | 24m. 55s. | 1h. 48m. | 21h. 54m. | 4d. 13h. 33m. 36s.
99.5 %    | None | 49m. 51s. | 3h. 36m. | 1d. 19h. 48m. | 1w. 2d. 3h. 7m. 12s.
99.25 %   | None | 1h. 14m. 46s. | 5h. 24m. | 2d. 17h. 42m. | 1w. 6d. 16h. 40m. 48s.
99.0 %    | None | 1h. 39m. 42s. | 7h. 12m. | 3d. 15h. 36m. | 2w. 4d. 6h. 14m. 24s.
98.0 %    | None | 3h. 19m. 23s. | 14h. 24m. | 1w. 7h. 12m. | 1M. 6d. 1h. 58m. 48s.
97.0 %    | None | 4h. 59m. 5s. | 21h. 36m. | 1w. 3d. 22h. 48m. | 1M. 3w. 3d. 8h. 13m. 12s.
96.0 %    | None | 6h. 38m. 46s. | 1d. 4h. 48m. | 2w. 14h. 24m. | 2M. 1w. 5d. 3h. 57m. 36s.
95.0 %    | None | 8h. 18m. 28s. | 1d. 12h. | 2w. 4d. 6h. | 2M. 4w. 2d. 10h. 12m.
94.0 %    | None | 9h. 58m. 9s. | 1d. 19h. 12m. | 3w. 21h. 36m. | 3M. 2w. 4d. 5h. 56m. 24s.
93.0 %    | None | 11h. 37m. 51s. | 2d. 2h. 24m. | 3w. 4d. 13h. 12m. | 4M. 6d. 1h. 40m. 48s.
92.0 %    | None | 13h. 17m. 32s. | 2d. 9h. 36m. | 4w. 1d. 4h. 48m. | 4M. 3w. 3d. 7h. 55m. 12s.
91.0 %    | None | 14h. 57m. 14s. | 2d. 16h. 48m. | 1M. 2d. 9h. 54m. | 5M. 1w. 5d. 3h. 39m. 36s.
90.0 %    | None | 16h. 36m. 55s. | 3d. | 1M. 6d. 1h. 30m. | 5M. 4w. 2d. 9h. 54m.
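The figures in the table follow directly from the availability percentage. Here is a minimal sketch of the calculation, assuming a 365-day year and 7x24 coverage; the original workbook may use slightly different period lengths, so expect small rounding differences:

```python
# Compute expected unplanned downtime for an availability target over
# 7x24 coverage. Assumes a 365-day year; the workbook behind the table
# above may round periods differently.

def downtime_seconds(availability_pct, period_hours):
    """Seconds of allowed unplanned downtime in the given period."""
    return (1 - availability_pct / 100.0) * period_hours * 3600

def fmt(seconds):
    """Render a duration in seconds as days/hours/minutes/seconds."""
    seconds = round(seconds)
    parts = []
    for label, size in (("d", 86400), ("h", 3600), ("m", 60), ("s", 1)):
        qty, seconds = divmod(seconds, size)
        if qty:
            parts.append(f"{qty}{label}")
    return " ".join(parts) or "0s"

for pct in (99.999, 99.9, 99.0):
    weekly = downtime_seconds(pct, 7 * 24)
    yearly = downtime_seconds(pct, 365 * 24)
    print(f"{pct} %: weekly {fmt(weekly)}, yearly {fmt(yearly)}")
```

For example, 99.999 % over a year allows roughly 5 minutes 15 seconds of unplanned downtime, which agrees with the table row above.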
Embedded inside The Uptime Institute whitepaper "Tier Classifications Define Site Infrastructure Performance" is an excellent discussion about availability and the constraints required to achieve high levels of availability.
Always establish regular outage windows with your users; then use them!
When developing applications that interact with the network in any manner, it is important that the way in which the application uses the network is properly accounted for. The classic mistake in this domain is to write an application that assumes all network access is across a LAN, where latency and round-trip times are very low. This doesn't manifest itself as a performance problem until one or more of the network links leaves the Local Area Network (LAN) and traverses the Wide Area Network (WAN). WAN latency and round-trip times are at least an order of magnitude greater than the corresponding LAN times, which generally degrades application performance.
Here are typical latency times from a server migration project I was recently part of, where moving a server from an in-State data center (~100 km away) to a data center across the continent (~1,200 km away) caused a database job to run in an unacceptable timeframe. Although this section is entitled "WAN vs. LAN", the point I'm making is illustrated by this change in WAN latency.
Latency              | 5 ms    | 50 ms
Initial DB Load Time | 65 min. | 102 min.
Tuned DB Load Time   | 18 min. | 33 min.
Before the server move, this particular batch job was completing within the scheduled window; but after the move, the database load held up other processing because the job ran beyond the planned window. Once the pressure was on the application support team to reduce execution time, a support call to the database vendor revealed that a simple protocol change (OLE was changed to ODBC) could speed execution. After this simple application change the database load ran faster across the slow WAN link than it had previously run in the faster WAN environment.
The bottom line is that the application development team should have used the faster database protocol to start with. Just because the initial application design was acceptable doesn't mean that it was correct.
"Always plan for the worst" is a good motto to apply to program design, and in this case it means using the most efficient protocols right from the start. Do not assume that today's LAN-based application will always be hosted in that lightning-fast environment.
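The arithmetic behind this lesson is worth seeing once. The sketch below uses hypothetical row counts and batch sizes (not figures from the project above) to show why the number of network round trips, multiplied by latency, dominates the runtime of a "chatty" protocol:

```python
# Hypothetical illustration: a chatty protocol pays the full network
# latency once per row; a batched protocol amortises it across many
# rows. Row count and batch size are made up for the example.

def load_time_seconds(rows, latency_ms, rows_per_round_trip):
    """Latency cost alone (ignores transfer and server time)."""
    round_trips = rows / rows_per_round_trip
    return round_trips * latency_ms / 1000.0

rows = 1_000_000
for latency in (5, 50):
    chatty = load_time_seconds(rows, latency, 1)      # one row per trip
    batched = load_time_seconds(rows, latency, 100)   # 100 rows per trip
    print(f"{latency} ms latency: chatty {chatty / 60:.0f} min, "
          f"batched {batched / 60:.1f} min")
```

Note how the chatty variant's runtime scales linearly with latency, while batching shrinks the latency term by the batch factor; this is the same effect the OLE-to-ODBC change exploited.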
Certification for new help desk personnel.
This section contains topics that either do not fit within one of the sections that appear above, or that apply to all of those topic areas. The topics in this section are presented in alphabetic order (by heading).
As a general rule, human-based systems (that is, systems in which people perform the work) are inherently flexible, whereas automated systems are inherently fragile. When properly trained, people have the ability to deal dynamically with exceptions to the rule; automated systems, however, can only cope with the situations they have been specifically designed for.
The above generalisation also applies to situations where unskilled labour is applied to skilled tasks by way of "scripts" that the workers must follow. In the IT industry, the classic anecdote is that "This is how the Navy can take a bunch of high school grads and successfully operate a nuclear submarine." A more familiar example is a help desk where individuals with very little training follow written scripts to take calls and solve problems. Help desks operated in this manner are not able to deal with exceptional situations, so when faced with a situation that doesn't fit the scripts the help desk agent passes the call to someone more skilled.
I mention this flexible-fragile concept because it must always be accounted for as you seek to automate as your company grows. A small company that relies upon IT staff to perform all its IT functions (for example, installing patches on systems) usually hasn't invested in much process documentation; however, as the company grows, efficiencies are sought, and tools are purchased & deployed, exceptional situations are more likely to cause outages. You must plan accordingly. In other words, don't count on everything always going well.
All IT services cost money. One of the challenges an IT manager faces is obtaining adequate funding to deliver the IT service that the user community needs.
The Funding Continuum:
- Corporate Budget — the IT department is provided with a budget to spend, but those consuming the services do not explicitly send any monies to that department.
- User-Funded — all of the budget necessary to fund the IT services is provided to the IT department through explicit funding from those consuming those services.
For companies with locations in more than one time zone, performance reporting must be carefully designed. For example, if you have help desks that serve multiple time zones, should daily call metrics be reported from the call centre's perspective or from the clients' (callers') perspective? And if it is from the clients' perspective, how is the fact that the callers originate in multiple time zones factored into the metrics calculations?
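One concrete way to report from the callers' perspective is to convert each call's timestamp into the caller's local time zone before bucketing it into a "day". A minimal sketch, using made-up call records and time zone names:

```python
# Sketch (hypothetical data model): bucket help-desk calls by day from
# the caller's perspective. Each call carries a UTC timestamp and the
# caller's IANA time zone; the caller's local date decides which day's
# metrics the call belongs to.
from collections import Counter
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

calls = [  # (timestamp in UTC, caller's time zone) -- illustrative data
    (datetime(2007, 8, 31, 23, 30, tzinfo=timezone.utc), "America/Toronto"),
    (datetime(2007, 9, 1, 1, 15, tzinfo=timezone.utc), "Europe/London"),
]

daily = Counter()
for when_utc, tz in calls:
    local_day = when_utc.astimezone(ZoneInfo(tz)).date()
    daily[local_day] += 1

for day, count in sorted(daily.items()):
    print(day, count)
```

Note that the first call, placed late on 31 August UTC, counts toward 31 August for the Toronto caller, while the second counts toward 1 September for the London caller; reporting from the call centre's single time zone would bucket them differently.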
If you haven't yet learnt about ITIL (the Information Technology Infrastructure Library), then I strongly encourage you to educate yourself about it. ITIL is a set of IT practices that have been codified into a set of books (hence, "library"). Most large IT service providers are now restructuring their service offerings around the way in which ITIL suggests IT services be performed. ITIL does not attempt to describe the types of IT practices I have described above; rather, ITIL addresses the fundamental manner in which the various IT service areas work together. Conforming to the ITIL service model gives you a common terminology and understanding of IT service with which to engage the industry as a whole, including your customers (that is, those consuming your IT services).
Implementing the full weight of discipline described by ITIL is beyond the means of all but large companies; however, this does not mean that small IT shops should be ignorant of the processes and disciplines described by ITIL: all of the issues addressed by ITIL are faced by small companies, and all of those issues must be addressed no matter what size of company your IT shop is supporting.
E.g., as employees leave the company, what happens to the IT resources that were assigned to them?
Undergirding all of the IT practices described above is the management of the staff you have hired to perform the work. If you don't treat your people right, then they will not be motivated to do the job properly. I have documented elsewhere on this website an approach to salary planning that I have used very successfully with IT managers and their staff.
The steps to follow in order to decide what needs to be measured and reported:
On its own, outsourcing does not save money.
Click on each graph to open a more detailed view.
When is the code "good enough" to release? Without formal release discipline and process, this question is impossible to answer; but even with process and discipline one must still have an objective way to determine when enough bugs have been found and eradicated, and such a determination must involve empirical measurement. Long before Six Sigma took the IT world by storm, I was introduced to the concept of defects per thousand by Otfried Riml, a telecom manufacturing manager who moved into corporate Information Technology late in his career. Otfried's insight in applying defects per thousand to software release quality was groundbreaking and fundamentally changed how the department I was working in (inside Bell-Northern Research, Division 4 IT) released the BNR UNIX workstation software environment to developers.
The software release team I was part of already had a formal release process, but we lacked the empirical measurement methodology necessary to objectively determine when it was time to declare that our software load was ready to release. The objective measure we needed was "new problems per thousand user hours per week."
The rationale behind this defects-per-thousand-type measurement is that when quality is poor, a typical user finds more new defects in the product per unit of time than when quality is good. In the case of the collection of software that comprised the UNIX desktop environment, our release target was to have fewer than 1 new problem per thousand user hours. The metric was tracked and plotted as you see in the graph on the right side of this page. The graph shown here is actual project data from a release of the environment in the mid-1990s.
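The metric itself is simple arithmetic. A sketch of the calculation follows; the field names and sample figures are illustrative, not taken from the original tracking system:

```python
# Sketch of the release metric described above: new problems found per
# thousand user hours in a reporting period (a week, in our case).
# Inputs are illustrative; the real data came from problem tracking.

def problems_per_thousand_user_hours(new_problems, users, hours_per_user):
    """New problems per 1,000 user hours in the period."""
    user_hours = users * hours_per_user
    return new_problems / user_hours * 1000.0

# Example: 250 trial users, 40 hours each this week, 8 new problems logged.
rate = problems_per_thousand_user_hours(8, 250, 40)
print(f"{rate:.2f} new problems per 1,000 user hours")
```

In this hypothetical week the rate is 0.80, which would sit below the release target of 1.0 quoted above; a week of trial use that logged 12 new problems would sit above it.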
A second key metric we used to determine if the software package was ready for release was the unresolved problem backlog. That set of graphs also appears to the right, labelled "problem and incident metrics". Associated with the backlog target was the requirement that no Priority 1 problems were open and that all open Priority 2 problems had acceptable workarounds in place.
In our case we were applying these metrics to the collection of the operating system, tools, and applications that comprised the UNIX workstation software environment. No single application or program was measured on its own; rather, the metrics were measured for the collection of software as a whole. The reason is that computer programs never execute in isolation from each other; rather, they intentionally and unintentionally interact with each other in ways the programmers did not envision. The whole purpose behind testing and trialling the packages together was to flush out and remediate these interactions, as well as to detect bugs in individual programs (and have them corrected by their authors).
The criteria described in this section—defects per thousand, problem backlog, and problem priority/severity—allowed us to make an objective and defensible release decision. This empirical method facilitated buy-in from our users and project sponsors and resulted in less friction during the decision-making process.
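The combined gate can be written down as a single predicate. This is a sketch of the decision rule described in this section, not the actual tooling we used; the 1.0 threshold is the target quoted earlier, and the parameter names are my own:

```python
# Sketch: one objective release gate combining the criteria from the
# text: the defect-rate target, no open Priority 1 problems, and no
# open Priority 2 problems lacking an acceptable workaround.

def ready_to_release(new_problems_per_k_user_hours,
                     open_priority1_problems,
                     open_priority2_without_workaround):
    """True when the software load meets all three release criteria."""
    return (new_problems_per_k_user_hours < 1.0
            and open_priority1_problems == 0
            and open_priority2_without_workaround == 0)

print(ready_to_release(0.8, 0, 0))   # meets all three criteria
print(ready_to_release(1.3, 0, 0))   # defect rate still too high
```

Having the rule written down, whatever its exact form, is what made our release decisions defensible: anyone could check the inputs and reach the same conclusion.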
When evaluating how to respond to a complaint or an actual service issue (incident), always determine the actual business impact associated with the complaint or incident; then, respond appropriately to the business impact instead of responding to the emotion that has accompanied the complaint or incident.
"Greasing the skids"
Doing the right work at the right time is always a key to success, no matter what type of work is being done. See the "Cost Management Via Problem Prioritisation" paper elsewhere on this site for a rational, objective approach to work prioritisation.
Which business processes and applications are critical to a company's livelihood? When planning IT investment, it is important to know what is important and what is not. See the "What Actually Matters?" paper elsewhere on this site for a method for bringing clarity to such examinations.