Friday, November 7, 2014

Product Reliability




When the war ended, I spent my first six years working at RCA Communications with a team of analysts reviewing the severe damage inflicted by the Japanese on receiving and transmitting facilities during the war.  During this tour I found that, despite the major damage to equipment, a substantial number of individual items could still be repaired and returned to reliable service by the Korean and Philippine nations – a testament to the quality of prewar American workmanship.

Subsequently, my employment as an administrator with a number of organizations introduced me to quality control as performed by my employer firms and their suppliers.  The basic approach to quality control was for the firms to produce a product that complied with the applicable requirements and specifications.  The product could be physical, such as an instrument or a truck; it could also be a service, the results of an investigation or a report on a particular subject.  Whatever the product, it was prepared and tested by the manufacturer or producer prior to delivery to ensure compliance with the specifications.

For purposes of this essay, products have been placed into two broad categories.  The first comprises products such as automobiles, clothes washers and dryers, lawn care equipment, simple household tools, stoves and similar items, manufactured and sold through retailers with a broad warranty guaranteeing that the product will serve its intended purpose if used properly; these are not a part of this review.  The second, which is the subject here, comprises products that are subject to review or additional testing by the purchaser.

Companies purchasing the latter products maintained a department for inspecting items received from their suppliers.  The extent of the incoming inspections varied widely with the companies.  Small numbers of spot checks would be performed if the purchaser had confidence in the supplier’s ability to produce quality products consistently.  Larger numbers of spot checks were performed if the purchaser was planning to put the product to use in a critical service, operation or product.  The U.S. AEC, for example, performed a 100% test of all products used in their nuclear testing program. 

All things considered, in those early days, producers tested their products in an effort to ship a reliable, quality product to the purchaser.  The purchaser, in turn, would only spot check the incoming product unless there was a special need for more extensive testing.

But, as the years went by and producers sought ways to reduce their costs, the quality control picture changed dramatically.  Producers continued to perform quality control tests, but gradually reduced the extent of the tests to the minimum necessary, in the view of each producer, to meet specifications.  This led to a more basic warranty which, while reducing testing, emphasized the right of the purchaser to return the item if it did not meet the purpose intended.  In the eyes of the producer, the item worked and if the purchaser could show that it didn’t, then it could be returned.  This led to some gallows humor on the part of critics, with the familiar favorite story of the parachute: If it didn’t open, just return it for a replacement.

What has all this testing to do with product reliability?  It follows that if a product does its job under the specified conditions and for the specified length of time, the product is reliable.  In some instances where it was possible, but not necessarily probable, that a product was less reliable than desired, redundancy was introduced and two or more identical parts were installed to do the work of one.  A missile, for example, may use redundancy in many flight components to improve performance in flight, thereby increasing reliability but concurrently increasing weight and cost.

A fact, not necessarily well-known, has to do with the NASA Apollo program.  In 1969 when Apollo 11/Eagle was destined to make the first landing on the Moon, NASA was faced with a serious problem.  Early on in the design of Apollo and in anticipation of the planned flights, computers were far, far less reliable than they are today.  In fact, the Windows 386, which was issued by Microsoft in December 1987, or some eighteen years after Apollo 11, still referred to DOS and was never considered “powerful,” though it was a vast improvement over the Apollo 11 unit.  Yet, NASA was going to send astronauts to the Moon using computers that were only a shadow of the 386’s pathetic capability.  Redundancy!  The answer to the problem was redundancy: NASA decided to use quadruple redundancy and installed four computers for improved reliability.  The wisdom of this decision became clearly apparent with the successful conclusion of the Apollo 13 near disaster.

The problem faced by engineers to design a product of multiple parts and maintain high reliability is reflected in the definition of reliability and the formula devised by mathematicians to arrive at the percentage of product reliability.

“The overall reliability of any device is defined as the product of the design, component or parts, and fabrication or assembly reliabilities.”  This relationship can be expressed as:

P(ov) = P(d) x P(c) x P(f), where (ov), (d), (c) and (f) represent reliability: overall, design, component or parts, and fabrication or assembly, respectively.  

It is obvious that overall reliability can be increased by improvement of the reliability of any one or more of the individual reliabilities.  Ah, but watch what happens when the number of components increases.

The engineer, when designing his device, decides to include components having individual reliabilities of 99.0%.  Assume the device works only when every component works, and take design and fabrication reliability as perfect, so that the overall reliability is simply 0.99 multiplied by itself once for each component.  If the device has 10 components, the overall reliability will drop to 90.4%.  Should his device require 100 components, the ov will become a mere 36.6%, and if the component total increases to 1000, we are left with an ov of <1%.  A problem indeed and certainly not one that can be tolerated when lives are at stake.
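The arithmetic above can be checked with a short Python sketch.  It assumes, as the figures in the text imply, that design and fabrication reliability are treated as perfect and that the device works only if every one of its identical, independent components works.  The function name `series_reliability` is illustrative and not part of the original essay.

```python
def series_reliability(component_reliability, n_components):
    """Overall reliability when all n identical, independent components
    must work. Assumption: design and fabrication reliability are 1.0."""
    return component_reliability ** n_components


if __name__ == "__main__":
    # Component counts taken from the essay; 0.99 is the 99.0% figure.
    for n in (10, 100, 1000):
        ov = series_reliability(0.99, n)
        print(f"{n:>5} components at 99.0% each: ov = {ov:.3%}")
```

Run as a script, this reproduces the figures quoted: roughly 90.4% for 10 components, 36.6% for 100, and far below 1% for 1000.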

“Not necessarily so,” says our engineer.  “I will solve the problem by increasing component reliability.”  And so he does, increasing his component specification reliability to 99.99%, a not so easy reliability to achieve.  What happens?  A device having 10 components will now have an ov of 99.9%.  Increasing the required number of components to 100 will drop the ov to a slightly lower value of 99.0%, and 1000 components to 90.5%.  Depending on the requirements of the product, these percentages might be acceptable, but at a substantial increase in cost due to using the very high component reliability rate.  However, should the device be more complex and require 10,000 components, the ov tumbles to 36.8%.  A more complex device of 100,000 would reduce the ov to <1% and we’re back to square one.

Modern day manned space vehicles have varying numbers of components ranging above 10,000.  Using component reliabilities of 99.0%, the best vehicle reliability that can be expected is less than 1%.  Raising component reliability to 99.99%, an expensive and unrealistic goal, increases the vehicle reliability to 36.8%, at best.  For a manned mission, these rates are totally unacceptable.
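Turning the calculation around shows how demanding the component specification would have to be.  The sketch below is only an illustration under the same simplifying assumptions used for the figures above (identical, independent components, all of which must work, with design and fabrication treated as perfect).  The function name and the 99% vehicle target are hypothetical choices, not figures from any program.

```python
def required_component_reliability(target_overall, n_components):
    """Per-component reliability needed so that n identical, independent
    components in series reach the target overall reliability."""
    return target_overall ** (1.0 / n_components)


if __name__ == "__main__":
    # Hypothetical target: a vehicle that works 99 times out of 100.
    for n in (1_000, 10_000, 100_000):
        r = required_component_reliability(0.99, n)
        print(f"{n:>7} components: each must reach {r:.7%}")
```

For 10,000 components, the required figure comes out at about 99.9999% per component, two further "nines" beyond the 99.99% already described as expensive and unrealistic.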

What to do?  We have discussed improving component reliability, the most likely area of improvement, and found it wanting.  Design and manufacturing reviews may provide a source of relief, but state-of-the-art imposes limits.  There had to be a solution, ultimately found in redundancy.
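Why redundancy helps can be sketched with the standard textbook idealization, which the essay itself does not spell out: if a function is covered by several identical units that fail independently, and any one working unit is enough, the function fails only when all of its units fail.  The function names below are illustrative, and real vehicles depart from this model through shared failure causes, switching logic, and the weight and cost penalties noted earlier.

```python
def redundant_stage_reliability(unit_reliability, copies):
    """Reliability of one function backed by `copies` identical units.
    Assumptions: failures are independent and any one working unit suffices."""
    return 1.0 - (1.0 - unit_reliability) ** copies


def vehicle_reliability(unit_reliability, copies, n_functions):
    """All n_functions must work; each is covered by `copies` units."""
    return redundant_stage_reliability(unit_reliability, copies) ** n_functions


if __name__ == "__main__":
    # 99.0% units and 10,000 functions, matching the essay's example sizes.
    for copies in (1, 2, 3, 4):
        ov = vehicle_reliability(0.99, copies, 10_000)
        print(f"{copies} unit(s) per function: ov = {ov:.4%}")
```

Under these assumptions, ordinary 99.0% units give a vehicle reliability of essentially zero when used singly, roughly 37% when doubled, about 99% when tripled, and better than 99.9% when quadrupled, which is the effect that 99.99% components alone could not deliver.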

When WW-II ended and we entered the cold-war period, agencies such as DOD, DARPA and the AEC were faced with such reliability problems, but not until the manned space programs arrived were lives dependent upon a quick and safe resolution.  When NASA first entered the scene and work started on the Mercury program followed by Gemini, the need to solve the reliability problem intensified.  It quickly became clear that the envelope of the current state-of-the-art fell drastically short of the safety needed for manned flight.  The sub-orbital flights of Mercury provided some breathing room, but Gemini was on the horizon.

During my participation in the many government research and development programs while an administrator, I was involved in most, if not all, of the cold-war and space programs.  As such, I was intimately aware of the steps taken by the government to overcome many previously neglected problems.  Overall reliability was just one.  It took many years of effort to make programs such as Mercury, Gemini and Apollo possible.

NASA recognized the state-of-the-art limitations early enough to permit a solution by incorporating extensive redundancy in the design of their space vehicles.  The inability of industry to provide the required reliability at any cost dictated the alternate solution.  Our entire space program is based on obtaining the highest reliability possible at reasonable cost supplemented by extensive redundancy.  With the passage of time, and the expenditure of annual research and development dollars, the shift from redundancy to added reliability in design and manufacture has been major, steady and effective.  Despite our efforts, indications are that improvements to reliability are asymptotic and may remain so.

While we have yet to achieve our ultimate goal, total vehicle reliability, and it is questionable whether we ever will, research and development continues.

May 2009
