Sunday, June 16, 2013

User Messages - Use 'em or Lose 'em

My "day job" of the last three years has meant that I've been quite delinquent it keeping true to Jerry Weinberg's challenge, but a recent case-in-point from that job brings to mind another anecdote and lesson learned that prompts me to recount it.

The time in question was the late 1980s, a time of much upheaval in the telecommunications industry in the aftermath of Judge Greene's order breaking up the Bell System.  While that era and all of the events surrounding it had immense impact on many people and companies, one fallout of the situation was the opening up, to many companies, of business opportunities previously the exclusive domain of Western Electric and Bell Labs.  The one in which I was intimately involved was an application called AMA TeleProcessing Systems (AMATPS).  The systems implementing this application were the first fully automated replacement for the various physical-transport approaches to collecting Call Detail Records (CDRs) from telephone switching equipment and delivering them to central locations for various purposes, the principal one at the time being customer billing.  A group of former GTE executives and private investors acquired the assets of a company named Pacific Western Engineering, eventually renaming it Telematic Products.  After an ill-fated attempt to leverage their former GTE business contacts to capture the GTE-equivalent AMATPS opportunity, the company had the fortunate good timing to be able to respond to Pacific Bell's RFP for AMATPS (each RBOC tendered its own solicitation, "independently" of the old AT&T/Bell Labs/Western Electric default product selection).  Against all odds, Telematic won the business and proceeded to deploy their solution, the UTS-4000 AMAT paired with a DEC VAX Collector.  It was about the time Telematic won this watershed business that I was hired as their VP, Engineering.  There is a whole back story to that hiring, but it's for another blog posting (working title: "Biker Rusty and the Workshop Pub Incident").

There is a whole set of blog postings that come to mind for lessons learned over the next 3-4 years, but leave it for now that a small private company like Telematic had many hard days to endure in conforming to the expectations of 50+ years of Bell System Practices (BSPs) which our best (and virtually only) customer imposed upon the company.  Eventually, the Collector system component was displaced, Bell System-wide, by an AT&T solution, but Telematic leveraged its perseverance and success in implementing Pacific Bell's deployment so effectively that we secured 5 of the 7 RBOCs (AT&T won BellSouth, though at the expense of a $1 million settlement with Telematic due to issues surrounding the selection process, and TeleSciences Corporation won US West--the subject of two Harvard Business School case studies in which I was also intimately involved, but that's for yet another blog posting).  One thing to bear in mind concerning AMATPS is that long distance and intra-LATA toll charges in that era averaged around $2.50 per CDR.  Pacific Bell alone accounted for well in excess of one million CDRs per day, so it goes without saying that the availability requirements of an AMATPS system were stringent.  Telematic's UTS-4000 AMAT operated during its entire production phase without the loss of a single CDR over a period of at least a decade.  This was accomplished via a 100% redundancy design approach, with no single point of failure capable of interrupting the operation of the unit.
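For readers who like to see the arithmetic, here is a quick back-of-the-envelope sketch in Python using the figures quoted above; the single-unit availability number is purely illustrative and not a UTS-4000 specification.

```python
# Back-of-the-envelope figures from the post; the per-unit availability
# number below is purely illustrative, not a UTS-4000 specification.
cdrs_per_day = 1_000_000        # Pacific Bell alone, "well in excess of" this
revenue_per_cdr = 2.50          # typical toll charge per CDR in that era

daily_exposure = cdrs_per_day * revenue_per_cdr
print(f"Revenue riding on the collection path: ${daily_exposure:,.0f}/day")
# -> roughly $2,500,000/day for a single RBOC

# Why 100% redundancy matters: two independent units, each with
# availability A, yield a combined availability of 1 - (1 - A)^2,
# provided no single point of failure couples them.
A_unit = 0.999                  # illustrative single-unit availability
A_pair = 1 - (1 - A_unit) ** 2
print(f"Redundant-pair availability: {A_pair:.4%}")
# -> 99.9999% for this illustrative case
```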

Given the stringency of the availability requirements (BellCore actually certified the total system availability of our design to be at least 99.9995%), there came a time when Telematic's growing pains exhausted whatever residual tolerance its demanding customers had, and a "come to Jesus" challenge to attain system maturity was lodged.  The big event to secure this maturity was Version 3.0 of the UTS-4000 AMAT software.  It was not only our customers who had no residual appetite for indulging a "young", entrepreneurial company; the era of individual heroism by an admittedly heroic development team, which I was honored to lead, had also reached the point where professional practices had to assume prominence.  To that end, and I'm sure much to the chagrin of all, I appointed myself a member of the final test team, working with two of those aforementioned heroes, the husband-and-wife team of Carrie and Dave Campbell.  It didn't surprise me when the Development Director, my long-time friend and confidant John Hansen, decided that he had better be a member of that team too.  Over the next days and weeks the team ran the system through its entire gamut of feature and failure-mode injection tests, with iterative revisions looped back through the strict "waterfall" development life-cycle process (before you laugh, I'd ask the reader whether they would bet their mortgage and stock options on an Agile/Extreme Programming process for a system with five-nines-and-five availability requirements and $20 million/day revenue responsibility).
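To put that certification figure in concrete terms, here is a small Python sketch of the downtime budget it implies; the revenue-per-day figure is the one quoted above, and the per-minute split is simply that figure averaged over a day.

```python
# What a "five nines and five" (99.9995%) availability certification
# means in plain terms: the allowable downtime budget per year.
availability = 0.999995
minutes_per_year = 365 * 24 * 60          # 525,600

downtime_minutes = (1 - availability) * minutes_per_year
print(f"Allowed downtime: {downtime_minutes:.2f} minutes per year")
# -> about 2.63 minutes per year, total, across all failure modes

# At $20 million/day of revenue responsibility, even that sliver
# of downtime is worth quantifying (figures as quoted in the post).
revenue_per_day = 20_000_000
revenue_per_minute = revenue_per_day / (24 * 60)
print(f"Revenue at risk within the downtime budget: "
      f"${downtime_minutes * revenue_per_minute:,.0f} per year")
```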

Finally, to the point of this posting: one of the strict requirements I imposed upon the release was that each and every console message emanating from the system be fully documented as to what it signified and what action or reaction it required of the operator.  Understand that the system, virtually from day one, spewed several yards/meters of console messages, many of them indicating fault or alarm severities, and they had been a long-standing source of concern (or at least erosion of confidence) as to whether the system was operating in a stable and predictable state.  At first, the development staff was somewhat bemused at my apparently quirky obsession with this issue.  Eventually, they evolved to a state of permanent crankiness as I refused to relax the requirement one notch.  That attitude collapsed in a heap on the day, toward the end of the test/retest iterations, when one of the last remaining messages was determined to be the external symptom of an extremely subtle (but finitely probable) fault condition that would leave the system operating in a mode exposed to a single-fault total system outage.  After that, the motivation behind my seemingly arbitrary requirement needed no explanation at all: console messages are the only way the system has of engaging the mind of a human operator in determining whether any action or reaction is critical to system stability, and thus the developer owes it to that person to enable them to succeed at making the proper determination.  The final outcome of the effort was that a couple dozen messages survived the cut, each and every one of which was documented in the maintenance manual, along with its action/reaction requirements.  I'll leave it to the reader to decide whether the effort was worth it.
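For what it's worth, the same discipline translates naturally into code.  Below is a minimal sketch, in Python, of a message catalog that refuses to emit anything undocumented; the message IDs, texts, and function names are hypothetical illustrations, not the UTS-4000's actual messages.

```python
# A minimal sketch of the discipline described above: every console
# message the system can emit lives in a catalog that records what it
# signifies and what action, if any, the operator must take.  Names and
# message IDs here are hypothetical, not the UTS-4000's actual set.
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    INFO = "informational"
    MINOR = "minor alarm"
    MAJOR = "major alarm"
    CRITICAL = "critical alarm"

@dataclass(frozen=True)
class CatalogEntry:
    meaning: str          # what the message signifies
    operator_action: str  # the documented action/reaction, even if "none"

MESSAGE_CATALOG = {
    "AMAT-101": CatalogEntry(
        meaning="Primary collection link restored after switchover",
        operator_action="None; verify standby link status at next scheduled check",
    ),
    "AMAT-407": CatalogEntry(
        meaning="Standby unit heartbeat lost; system running without redundancy",
        operator_action="Dispatch maintenance immediately; a single fault can now cause an outage",
    ),
}

def emit(message_id: str, severity: Severity) -> None:
    """Refuse to emit any console message that is not fully documented."""
    entry = MESSAGE_CATALOG.get(message_id)
    if entry is None:
        raise ValueError(f"Undocumented console message: {message_id}")
    print(f"[{severity.value.upper()}] {message_id}: {entry.meaning}")
    print(f"  Operator action: {entry.operator_action}")

emit("AMAT-407", Severity.MAJOR)
```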

