When the war ended, I spent my first six years working at
RCA Communications with a team of analysts reviewing the severe damage the
Japanese had inflicted on receiving and transmitting facilities during the war. During this tour I found that, despite the
major damage to equipment, there was still a substantial number of individual
items that could be repaired and returned to reliable service by the Korean and Philippine
nations – a testament to the quality of prewar American workmanship.
Subsequently, my employment as an administrator with a
number of organizations introduced me to quality control as performed by my
employer firms and their suppliers. The
basic approach to quality control was for the firms to produce a product that
complied with the applicable requirements and specifications. The product could be physical, such as an
instrument or a truck; it could also be a service, the results of an
investigation, or a report on a particular subject. Whatever the product, it was prepared and
tested by the manufacturer or producer prior to delivery to ensure compliance
with the specifications.
For purposes of this essay, products have been placed into
two broad categories. The first consists of consumer products such as automobiles, clothes washers and
dryers, lawn care equipment, simple household tools, stoves and similar items,
manufactured and sold through retailers with a broad warranty guaranteeing that
the product will meet its intended purpose if used properly; these are not a part
of this review. The second consists of products that are subject to review or additional
testing by the purchaser, and it is this category that the rest of this essay addresses.
Companies purchasing the latter products maintained a
department for inspecting items received from their suppliers. The extent of the incoming inspections varied
widely from company to company. A small number
of spot checks would be performed if the purchaser had confidence in the
supplier’s ability to produce quality products consistently. More extensive checks were performed
if the purchaser was planning to put the product to use in a critical service,
operation or end product. The U.S. AEC, for
example, performed 100% testing of all products used in its nuclear testing
program.
All things considered, in those early days, producers tested
their products in an effort to ship a reliable, quality product to the
purchaser. The purchaser, in turn, would
only spot check the incoming product unless there was a special need for more
extensive testing.
But, as the years went by and producers sought ways to
reduce their costs, the quality control picture changed dramatically. Producers continued to perform quality
control tests, but gradually reduced the extent of those tests to the minimum
that, in the producer’s view, was necessary to meet specifications. This led to a more basic warranty which,
while reducing testing, emphasized the right of the purchaser to return the
item if it did not meet its intended purpose.
In the eyes of the producer, the item worked, and if the purchaser could
show that it didn’t, it could be returned.
This led to some gallows humor on the part of critics, the familiar
favorite being the story of the parachute: if it didn’t open, just return it for a
replacement.
What has all this testing to do with product reliability? It follows that if a product does its job
under the specified conditions and for the specified length of time, the product
is reliable. In some instances where it
was possible, but not necessarily probable, that a product was less reliable
than desired, redundancy was introduced and two or more identical parts were
installed to do the work of one. A
missile, for example, may use redundancy in many flight components to improve
performance in flight, thereby increasing reliability but concurrently
increasing weight and cost.
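For readers who want to see why duplication helps, a small Python sketch follows; the 99.0% part reliability is purely illustrative and not drawn from any actual missile program. If any one of n identical parts can do the job, the assembly fails only when every one of them fails.

```python
# Illustrative sketch: reliability of n redundant parts, each of
# reliability r, where any single working part is enough to do the job.
# The assembly fails only if all n parts fail, with probability (1 - r)**n.

def parallel_reliability(r: float, n: int) -> float:
    """Reliability of n identical redundant parts of reliability r."""
    return 1.0 - (1.0 - r) ** n

print(f"1 part:  {0.99:.4%}")                           # 99.0000%
print(f"2 parts: {parallel_reliability(0.99, 2):.4%}")  # 99.9900%
print(f"3 parts: {parallel_reliability(0.99, 3):.4%}")  # 99.9999%
```

The gain comes at the cost of carrying the extra parts, which is the weight and cost penalty noted above.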
A fact, not necessarily well-known, has to do with the NASA
Apollo program. In 1969 when Apollo
11/Eagle was destined to make the first landing on the Moon, NASA was faced
with a serious problem. Early in the
design of Apollo, and in anticipation of the planned flights, computers were
far, far less capable and less reliable than they are today.
In fact, Windows/386, which Microsoft issued in December
1987, eighteen years after Apollo 11, still relied on DOS and was never
considered “powerful,” though it was a vast improvement over the Apollo 11
unit. Yet NASA was going to send
astronauts to the Moon using computers that were only a shadow of the 386’s pathetic
capability. Redundancy! The answer to the problem was redundancy: NASA
decided to use quadruple redundancy and installed four computers for improved
reliability. The wisdom of this decision
became clearly apparent with the successful conclusion of the Apollo 13 near-disaster.
The problem engineers face in designing a product with
multiple parts while maintaining high reliability is reflected in the definition of
reliability and in the formula devised by mathematicians to arrive at the
percentage of product reliability.
“The overall reliability of any device is defined as the
product of the design, component or parts, and fabrication or assembly
reliabilities.” This relationship can be
expressed as:
P(ov) = P(d) × P(c) × P(f), where
(ov), (d), (c) and (f) represent reliability: overall, design, component or
parts, and fabrication or assembly, respectively.
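As a minimal sketch of that relationship, with three reliability values made up purely for illustration:

```python
# Overall reliability as the product of the design, component and
# fabrication reliabilities.  The three values are made up for illustration.

p_design      = 0.998   # P(d)
p_component   = 0.990   # P(c)
p_fabrication = 0.995   # P(f)

p_overall = p_design * p_component * p_fabrication   # P(ov)
print(f"P(ov) = {p_overall:.1%}")   # roughly 98.3%
```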
It is obvious that overall reliability can be increased by
improving any one or more of the individual
reliabilities. Ah, but watch what
happens when the number of components increases.
The engineer, when designing his device, decides to include
components having individual reliabilities of 99.0%. If the device has 10 components, then, because
the component term is itself the product of the individual part reliabilities,
the overall reliability drops to 90.4% (0.99 multiplied by itself ten times). Should his device require 100
components, the ov becomes a mere 36.6%, and if the component total
increases to 1000, we are left with an ov of <1%. A problem indeed, and certainly not one that
can be tolerated when lives are at stake.
“Not necessarily so,” says our engineer. “I will solve the problem by increasing component
reliability.” And so he does, raising
his component specification reliability to 99.99%, by no means an easy reliability to
achieve. What happens? A device having 10 components will now have
an ov of 99.9%. Increasing the required
number of components to 100 drops the ov to a slightly lower value of
99.0%, and 1000 components to 90.5%.
Depending on the requirements of the product, these percentages might be
acceptable, but at a substantial increase in cost due to the very high
component reliability required. However,
should the device be more complex and require 10,000 components, the ov tumbles
to 36.8%. A still more complex device of 100,000
components would reduce the ov to <1%, and we’re back to square one.
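The figures above follow directly from raising the component reliability to the power of the component count; the short Python loop below, offered only as a sketch of the arithmetic, reproduces them.

```python
# Overall reliability of a device whose n components must all work:
# each component has reliability r, so the device as a whole has r**n.

def series_reliability(r: float, n: int) -> float:
    """Reliability of n components in series, each with reliability r."""
    return r ** n

for r in (0.99, 0.9999):
    for n in (10, 100, 1000, 10000, 100000):
        print(f"r = {r}, n = {n:>6}: ov = {series_reliability(r, n):.1%}")
```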
Modern-day manned space vehicles have component counts
ranging well above 10,000. Using
component reliabilities of 99.0%, the best vehicle reliability that can be
expected is less than 1%. Raising
component reliability to 99.99%, an expensive and unrealistic goal, increases
the vehicle reliability to 36.8% at best.
For a manned mission, these rates are totally unacceptable.
What to do? We have
discussed improving component reliability, the most likely area of improvement,
and found it wanting. Design and
manufacturing reviews may provide some relief, but the state of the art
imposes limits. There had to be a
solution, and it was ultimately found in redundancy.
When WW-II ended and we entered the cold-war period,
agencies such as DOD, DARPA and the AEC were faced with such reliability
problems, but not until the manned space programs arrived were lives dependent
upon a quick and safe resolution. When
NASA first entered the scene and work started on the Mercury program followed
by Gemini, the need to solve the reliability problem intensified. It quickly became clear that the envelope of
the current state of the art fell drastically short of the safety needed for
manned flight. The sub-orbital flights
of Mercury provided some breathing room, but Gemini was on the horizon.
During my years as an administrator on government research and
development programs, I was involved in most, if not
all, of the cold-war and space programs. As
such, I was intimately aware of the steps taken by the government to overcome
many previously neglected problems.
Overall reliability was just one.
It took many years of effort to make programs such as Mercury, Gemini
and Apollo possible.
NASA recognized the state-of-the-art limitations early
enough to permit a solution by incorporating extensive redundancy in the design
of its space vehicles. The inability
of industry to provide the required reliability at any cost dictated the
alternative solution. Our entire space
program is based on obtaining the highest reliability possible at reasonable
cost, supplemented by extensive redundancy.
With the passage of time, and the expenditure of annual research and
development dollars, the shift from redundancy to added reliability in design
and manufacture has been major, steady and effective. Despite our efforts, indications are that
improvements to reliability are asymptotic and may remain so.
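To see how the two ideas work together, here is an illustrative sketch, not a historical one: a vehicle of 1,000 parts at 99.99% each, with one critical computer given the quadruple redundancy described earlier. The 95% figure for a single computer is invented purely for the example.

```python
# Illustrative only: combining series reliability with redundancy on a
# single critical subsystem.  All figures are invented for the example.

def series(r: float, n: int) -> float:
    """n parts in series, each of reliability r."""
    return r ** n

def redundant(r: float, copies: int) -> float:
    """copies of an identical unit, any one of which is sufficient."""
    return 1.0 - (1.0 - r) ** copies

rest_of_vehicle = series(0.9999, 1000)    # about 90.5%
computer_single = 0.95                    # one hypothetical computer
computer_quad   = redundant(0.95, 4)      # about 99.9994%

print(f"one computer:   {computer_single * rest_of_vehicle:.1%}")   # about 86.0%
print(f"four computers: {computer_quad * rest_of_vehicle:.1%}")     # about 90.5%
```

The redundant computers give back nearly all of the reliability the weak link would otherwise cost, which is the bargain the space program struck.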
While we have yet to achieve our ultimate goal of total
vehicle reliability, and it is questionable whether we ever will, research and
development continues.