Sunday, June 16, 2013
User Messages - Use 'em or Lose 'em
The time in question was the late 1980s, a period of much upheaval in the telecommunications industry in the aftermath of Judge Greene's order breaking up the Bell System. While that era and all of the events surrounding it had immense impact on many people and companies, one result was the opening up to many companies of business opportunities previously the exclusive domain of Western Electric and Bell Labs. The one in which I was intimately involved was an application called AMA TeleProcessing Systems (AMATPS), AMA being Automatic Message Accounting. The systems implementing this application were the first fully automated replacement for the various physical-transport approaches to moving Call Detail Records (CDRs) from telephone switching equipment to central locations for various purposes, the principal one at the time being customer billing.
A group of former GTE executives and private investors acquired the assets of a company named Pacific Western Engineering, eventually renaming it Telematic Products. After an ill-fated attempt to leverage their former GTE business contacts to capture the GTE-equivalent AMATPS opportunity, the company had the good fortune of being positioned to respond to Pacific Bell's RFP for AMATPS (each RBOC tendered its own solicitation, "independently" of the old AT&T/Bell Labs/Western Electric default product selection). Against all odds, Telematic won the business and proceeded to deploy its solution: the UTS-4000 AMAT paired with a DEC VAX Collector. It was about the time Telematic won this watershed business that I was hired as their VP of Engineering. There is a whole back story on that hiring, but it's for another blog posting (working title: "Biker Rusty and the Workshop Pub Incident").
A whole set of blog postings comes to mind for lessons learned over the next three or four years, but suffice it to say for now that a small private company like Telematic had many hard days to endure conforming to the expectations of 50+ years of Bell System Practices (BSPs) which our best (and virtually only) customer imposed upon the company. Eventually, the Collector system component was displaced, Bell System-wide, by an AT&T solution, but Telematic leveraged its perseverance and success in implementing Pacific Bell's deployment to secure five of the seven RBOCs (AT&T won BellSouth, though at the expense of a $1 million settlement with Telematic over issues surrounding the selection process, and TeleSciences Corporation won US West--the subject of two Harvard Business School case studies in which I was also intimately involved, but that's for yet another blog posting). One thing to bear in mind concerning AMATPS is that long distance and intra-LATA toll calls in that era carried charges of around $2.50 per CDR. Pacific Bell alone accounted for well in excess of one million CDRs per day, so it goes without saying that the availability requirements of an AMATPS system were stringent. Telematic's UTS-4000 AMAT operated during its entire production life without the loss of a single CDR over a period of at least a decade. This was accomplished via a 100% redundancy design, with no single point of failure capable of interrupting the unit's operation.
Given the stringency of the availability requirements (BellCore actually certified the total system availability of our design to be at least 99.9995%), there came a time when Telematic's demanding customers had no remaining tolerance for its growing pains, and a "come to Jesus" challenge to attain system maturity was issued. The big event to secure this maturity was Version 3.0 of the UTS-4000 AMAT software. It was not only our customers who had lost their appetite for indulging a "young", entrepreneurial company; the era of individual heroism by the admittedly heroic development team I was honored to lead had also reached the point where professional practices had to assume prominence. To that end, and I'm sure much to the chagrin of all, I appointed myself to the final test team, working with two of those aforementioned heroes, the husband-and-wife team of Carrie and Dave Campbell. It didn't surprise me when the Development Director, my long-time friend and confidant John Hansen, decided that he had better be a member of that team too. Over the next days and weeks the team ran the system through its entire gamut of feature and failure-mode injection tests, with iterative revisions looped back through the strict "waterfall" development life-cycle process (before you laugh, I'd ask the reader whether they would bet their mortgage and stock options on an Agile/Extreme Programming process for a system with a "five nines and a five" (99.9995%) availability requirement and $20 million/day revenue responsibility).
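As a back-of-the-envelope illustration of what a number like that implies (my arithmetic here, not BellCore's certification model): for two fully redundant sides with independent failures and perfect failover, total availability is

    A_system = 1 - (1 - A_side)^2

so each side need only achieve roughly 99.78% availability (about 20 hours of downtime per year) for the pair to reach 1 - (1 - 0.9978)^2 ≈ 99.9995%, i.e., under three minutes of total outage per year. The real analysis is far more involved, but that squaring effect is exactly why the no-single-point-of-failure design was non-negotiable.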
Finally, to the point of this posting: one of the strict requirements I imposed upon the release was that each and every console message emanating from the system be fully documented as to what it signified and what action or reaction it required of the operator. Understand that the system, virtually from day one, spewed several yards/meters of console messages, many of them indicating fault or alarm states of such severity that they had been a long-standing source of concern (or at least an erosion of confidence) that the system was operating in a stable and predictable state.
At first, the development staff was somewhat bemused at my apparently quirky obsession with this issue. Eventually, they evolved to a state of permanent crankiness as I refused to relax the requirement one notch. This attitude collapsed in a heap on the day, toward the end of the test/retest iterations, when one of the last remaining messages was determined to be the external symptom of an extremely subtle (but finitely probable) fault condition that would leave the system operating in a mode exposed to a single-fault total system outage.
After that, my seemingly arbitrary requirement needed no explanation at all: console messages are the only way the system has of engaging the mind of a human operator in determining whether any action or reaction is critical to system stability, and thus the developer owes it to that person to enable them to succeed at making the proper determination. In the end, a couple dozen messages survived the cut, each and every one of which was documented in the maintenance manual, along with its action/reaction requirements. I'll leave it to the reader to decide whether the effort was worth it or not.
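For readers who want something concrete, here is a minimal sketch in C of the discipline the requirement enforces (entirely my illustration, with hypothetical message IDs and text; the UTS-4000 implementation looked nothing like this): every message the system can emit lives in a single catalog, and no entry exists without its documented meaning and required operator action.

    #include <stdio.h>

    /* Illustrative sketch only: hypothetical message IDs and text.
       The point is structural: a console message cannot exist without
       its documented meaning and required operator action. */

    typedef enum { INFO, MINOR, MAJOR, CRITICAL } severity_t;

    typedef struct {
        int         id;
        severity_t  severity;
        const char *text;     /* what the operator sees */
        const char *meaning;  /* what the condition signifies */
        const char *action;   /* what the operator must do, if anything */
    } console_msg_t;

    /* The complete catalog: if a message is not here, the system cannot emit it. */
    static const console_msg_t catalog[] = {
        { 101, INFO,     "SIDE B ASSUMED ACTIVE ROLE",
          "Planned or automatic switchover completed normally.",
          "None; verify side A returns to standby within 5 minutes." },
        { 407, CRITICAL, "STANDBY SIDE NOT RESPONDING",
          "Unit is running simplex; one more fault means total outage.",
          "Dispatch maintenance immediately; do not defer." },
    };

    static void emit(int id) {
        for (size_t i = 0; i < sizeof catalog / sizeof catalog[0]; i++) {
            if (catalog[i].id == id) {
                printf("MSG %d [%d]: %s\n", id, (int)catalog[i].severity,
                       catalog[i].text);
                return;
            }
        }
        /* An undocumented message is itself a defect worth shouting about. */
        printf("MSG %d: *** UNDOCUMENTED MESSAGE -- FIX THE CATALOG ***\n", id);
    }

    int main(void) { emit(407); emit(999); return 0; }

The maintenance manual entries then fall straight out of the catalog, which is precisely the property the Version 3.0 requirement was after.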
Sunday, June 20, 2010
4 Absence of Evidence is Not Evidence of Absence – The System Testing Conundrum
Jerry’s discussion of the pitfalls and psychological traps surrounding testing struck home, especially bringing to mind one episode I have always referred to as the “No Show Stopper Case.”
I had recently been promoted to Vice President of Quality Management in GTE’s Telenet division. The division had just launched the first commercial e-mail system, TeleMail, and was poised to release the full-featured version 2—with much fanfare. Some of the maintenance releases of version 1 had not gone well, with much bad will among early customers and a spreading reputation for poor quality.
It was in this atmosphere that the CEO directed me to conduct an independent review of the recommendation to immediately release the new version into production. This was not a product that could be rolled out to isolated beta customer sites, but rather a network-based service, so any problems would affect all customers and further erode our fragile industry reputation.
The director of the development shop and his staff were apparently unaware that I had spent three years of my GTE career heading the corporate-wide effort at standardization of methods and tools for large system development. I suspect that they thought an hour or so of boring technical slides would make the “Quality Control” guy glaze over in boredom and retreat, in awe of their brilliance. Then they could get on with their release.
The meeting started to deteriorate (at least from the project team’s perspective) when I started probing for such practices as test case generation and test coverage metrics. By the time I got around to asking to see their development process standards, the room had developed a noticeable chill. When the director stated that the project was under too tight a deadline for such “overhead”, I asked what was, then, the criterion on which he was recommending to go into production. His answer was, “We tested until there were no more show stoppers.”
It was at this point that I channeled my high school Jesuit teachers and applied the principle of “reductio ad absurdum” to state, “Well, then, I plan to recommend that we hold the release until your staff delivers to me the exhaustive list of ‘show stoppers’ for which you tested, thereby proving they are all absent.” The director began sputtering at the absurdity of my request, at which point I suggested that maybe the absurdity was lodged in the claim that there were none in the program, given that there was no definition and no test to which he could point to prove the assertion.
After things settled down, we agreed that the delicacy of the situation with our customers and our historically poor industry reputation probably did justify a reasonably short delay to apply some additional rigor to the testing and test-results analysis. When we did, it turned out that the system functioned as specified, but its performance was dismally inadequate for full production use, so performance optimization was undertaken in parallel with the upgrading of the testing discipline.
The blind spot exposed in this case was hardly unique to this team, but seems to be rather common, as continuing industry experience with software project failures indicates.
By the way, over three years’ time, including this incident, Telenet went from last place to first place in industry ratings for our products and services.
Saturday, June 5, 2010
3 If You Don’t Think Documentation is a Critical Component of a Product, Think Again
Jerry’s Silver Anniversary edition comments on documentation rang true when I recall an incident that came up during a review of the maintenance manual for a telecommunications operations support system. At the time, I was serving as VP of Engineering in a private company producing a business-critical billing teleprocessing system for telephone companies.
I was not always present for the review of every component of documentation, but there were a lot of programmers on the staff with no particular telecommunications systems experience, so I was very interested to see how the maintenance manual turned out. The system was conceived to operate continuously throughout its decade or more of deployment, carrying all of the usage detail records for every phone call in all but one of the major telephone companies in the United States (as well as several in other countries), from which all of their usage-based billing revenues were derived. Obviously, routine maintenance and fault repair needed to be handled very carefully.
We were not 30 minutes into the presentation by the documentation team when a procedure was presented that started out: “First, power down both sides of the unit.” The unit in question was a central office-based real-time teleprocessing node with 100% redundancy and fault tolerance designed into the hardware and, presumably, the software. I probably scared the meeting attendees half to death as my 6+ foot, 250+ pound frame nearly levitated with agitation. As I explained that the last time both sides of this system were ever intended to be powered down at the same time was the moment just before both sides were first powered up for the rest of their service life, I could tell from the looks on the faces of the programmers in the meeting (on top of the looks they had worn on entering the room at even having to attend a documentation review) that there was a serious issue.
The senior programmer attending tried, as delicately as he could under the circumstances of my mood, to explain why this was the way to perform the procedure. I was calm enough to explain to him and the rest of those present that any design or implementation that compelled such a step in any procedure was flawed, and that they had better go back to the drawing board and rethink their solution. Luckily, as it turned out, the software’s design and implementation were able to adapt to the reality that had somehow escaped the programmers’ understanding. The hardware, which I had also influenced greatly during its design, was totally capable of supporting the proper approach.
I offer this example of the critical role documentation should play in the total packaging of a software-centered product.
In retrospect, I’d go so far as to say that today I’d take the approach of teaming systems designers and documentation specialists to co-produce the manuals before the detailed design. I’d make approval of the design and user documentation components a gating event for the commencement of detailed design and implementation.
Thursday, May 27, 2010
2 Jerry’s Comments on Comments Start a Fight
Soon after I was put in charge of the Tools and Standards group of my first big project at GTE, I discovered and enthusiastically read Jerry Weinberg’s The Psychology of Computer Programming. One thing that particularly caught my attention was the experiment regarding comments in code and their effect on a debugging exercise. I had written, and been misled by, a few such comments myself, so I recommended to my boss that we set our coding standard such that no line-by-line comments were to be used.
This meshed nicely with another standard I was recommending, based on my discovery of a method for creating and documenting detailed designs called HIPO (Hierarchy plus Input-Process-Output), wherein the entire discussion of the code was encapsulated in a block of text we called the Prologue. There were also sections in the design documentation for internal storage variables, as well as flow charts for the logic of the component.
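To make the pattern concrete, here is a small sketch (in C, and purely my reconstruction in the HIPO spirit, not an actual project artifact) of what the standard produced: all of the narrative lives in the Prologue, and the code below it carries no line-by-line commentary.

    /********************************************************************
     * PROLOGUE (a reconstruction in the HIPO spirit)
     *
     * MODULE:   avg_window
     * PURPOSE:  Compute the arithmetic mean of the first n samples.
     * INPUT:    samples[] -- buffer of readings
     *           n         -- number of valid entries (n >= 1)
     * PROCESS:  Sum the n entries, then divide by n.
     * OUTPUT:   Returns the mean as a double.
     * STORAGE:  No static state; the caller owns the buffer.
     ********************************************************************/
    #include <stddef.h>

    double avg_window(const double samples[], size_t n) {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++)
            sum += samples[i];
        return sum / (double)n;
    }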
Much to my surprise, virtually the entire programmer workforce on the project rose in opposition to this standard, even when presented with the evidence of Jerry’s experimental results. Being supremely confident in my powers of persuasion, I even constructed a sample bit of code, with and without comments and with an injected bug, but no one was convinced, mainly because I had let the cat out of the bag by showing them the section from the book. To further complicate matters, there were programmers of 13 different nationalities on this project, so language was a further barrier to communication.
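That demo is long lost, but it was something in this spirit (a reconstruction, not the original code): the line-by-line comment confidently asserts the intended behavior and steers the debugger’s eye right past the injected bug.

    #include <stdio.h>

    /* A reconstruction of the kind of demo described above. */

    int sum_first_n(const int a[], int n) {
        int total = 0;
        /* add up elements a[0] through a[n-1] */  /* <-- the comment lies */
        for (int i = 1; i < n; i++)                /* injected bug: i should start at 0 */
            total += a[i];
        return total;
    }

    int main(void) {
        int a[] = { 10, 20, 30 };
        printf("%d\n", sum_first_n(a, 3));  /* prints 50, not the expected 60 */
        return 0;
    }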
Finally, the project director declared the standard as approved and, luckily, everyone soon got used to producing documentation and code to the new pattern and project cohesion was not destroyed.
Even so, to this day I am still struck by how attached to the status quo programmers can become, maybe because of their apparent tendency toward conservative rather than radical approaches.
1 Our COBOL Program Owns the Machine – Give Us the Feature
We went through the proper channels and documented the request, going so far as to specify the form of the system routine call that would be used in an application program, be it in assembly language or COBOL. In no time at all, an official response from Burroughs was returned, summarily declaring that what we asked couldn’t be done and we should find another way to do what we were trying to do.
Well, that’s all it took. Due to the gigantic nature of the contract between the Air Force and Burroughs (150+ mainframe computers, plus all the extras that can be charged for in a plum DOD contract), the project programming team had access to the full assembly source code of the MCP. Ron and I proceeded to pore over it until we found an unused executive interrupt vector location. Then we exploited a neat (albeit dangerous) feature that the COBOL compiler guys had put into their implementation of the language: one could place a console command into a working storage variable and, by referencing it from a form of the COBOL PERFORM verb, send operator console commands and receive responses as if you were the operator actually typing at the physical console of the machine. Through a series of these steps, we discovered the relocatable memory location of our COBOL application, poked an entry point address into the unused interrupt vector slot, and triggered an executive-mode interrupt to that address, upon which the entire machine was under the control of our COBOL application program in executive mode.
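For the curious, the essence of the maneuver reduces to something like the following toy C simulation (all names and addresses are mine for illustration; the real thing was done against Burroughs MCP internals that are long gone): find an unused executive interrupt vector, poke your own entry point into it, and trigger that interrupt so control arrives at your code with executive privilege.

    #include <stdio.h>

    /* Toy simulation only: nothing here resembles actual MCP code. */

    typedef void (*isr_t)(void);               /* an interrupt service routine */

    #define VECTOR_SLOTS 8
    static isr_t exec_vectors[VECTOR_SLOTS];   /* stand-in for the executive vector table */

    static void our_entry_point(void) {
        /* In the real stunt, this is where the console lamps flashed
           and the "impossible" code finally ran. */
        printf("Executive mode is ours: flashing the console lamps...\n");
    }

    /* Simulates the hardware dispatching through the vector table. */
    static void trigger_interrupt(int slot) {
        if (exec_vectors[slot] != NULL)
            exec_vectors[slot]();
    }

    int main(void) {
        int unused_slot = 5;                           /* found by reading the MCP source */
        exec_vectors[unused_slot] = our_entry_point;   /* the "poke" */
        trigger_interrupt(unused_slot);                /* control passes to our code */
        return 0;
    }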
We then, just to be cute and to prove that we “owned” the machine, caused the master console lamps to flash in some silly little pattern before executing the very code that we needed, the code the gurus at Burroughs had told us was impossible. Needless to say, when we demonstrated our little trick program, the Burroughs guys demanded to know how we had pulled it off. After feeling mighty proud of ourselves, we dragged the concession out of them that they would indeed add the feature to the MCP, in return for which we would disclose the secret of our trick.
There are many times I have seen the tightened jaw and steel in the eye of a programmer who is told something can’t be done. I must admit that it is difficult to harness this energy for sustained productive results, but it is real and must reveal something about the inner workings of the programmer’s mind. I’m sure Ron and I could have gotten our program done without this fun little excursion, but our minds didn’t work that way.