T. Dombeck

9/11/02

FERMILAB ACCELERATOR CHAIN VULNERABILITIES

Summary

A study was undertaken to answer questions raised by Fermilab management about the vulnerability of the accelerator chain to the failure of one or more critical elements that might put the physics program out of commission for a long period of time. In this report we limit the study to those components that would jeopardize Collider operations for a period of three or more months. This question was posed to those responsible for the development, operation, and maintenance of the facility. There were tens of critical components identified whose failure might pose such a threat to operations. These items are summarized in this report, along with recommendations that might mitigate the possibility of a failure.

Historically, the Fermilab accelerator chain has had a few such catastrophic failures, which occurred during the early decades of operation. A substation power transformer was lost in 1985, at which time Fermilab was running only external beams [1]. Though the repair required about eight months, the physics program at that time was able to limp along, but with much reduced capability. Another catastrophic failure during the first decade of Tevatron running resulted in the loss of thirty-three superconducting dipole magnets in the single year of 1989 [2, 3]. Such a loss occurring at this time would virtually exhaust the pool of spare magnets, jeopardizing future running. The causes of these particular catastrophic failures were investigated, and their sources have been addressed to ensure they do not recur.

The current study has uncovered an additional set of components that might cause catastrophic downtimes due to failure. A number of these critical elements lie in the oldest parts of the accelerator chain, namely the Linac, the 8-GeV Booster, and their associated transfer lines. These Proton Source systems have performed remarkably well over their thirty years of operations. In the case of the Booster, there is no reason to doubt it will continue to perform at its current repetition rate of about 2 Hz. However, the proposed increase in pulse rate from 2 to 12 Hz to meet the demands of the Mini-Boone and NUMI experiments will place a strain on the aging equipment.

Furthermore, there are some components that were not designed for the increased Booster repetition rates, for instance, the four orbit bump magnets. There have not been any failures in these magnets over the ten years of their operation, and there are two spares; however, there are concerns about heating problems at the highest repetition rates. Tests indicate that the magnets may perform appropriately up to 7.5 Hz, and this would meet the requirements for the Tevatron and Mini-Boone. Higher repetition rates (~12 Hz) needed to include NUMI may require a modification of the Booster injection system [4].

In this report, vulnerabilities due to one-of-a-kind components are also discussed. Such items appear throughout the accelerator complex. Though reliability has been good, some of these items do not have spares, such as the 400-MeV chopper at the end of the Linac. Other specialty items, such as kicker magnets, have single spares, but the lack of availability of replacement parts could pose a problem. To limit the scope of the present study, some one-of-a-kind components were not considered, for example a failure in the focusing horns in the NUMI or Mini-Boone beamlines, or failures in the experimental detectors. The repair of such devices might require much time to implement and put those experiments out of commission. However, such components were not considered because the Collider accelerator complex would not necessarily be affected by their failure.

In some cases, having spares would be prohibitively expensive, such as for the Linac accelerating cavities. Creating spares would require substantial resources due to the number of unique structures down the length of the machine. Furthermore, experience at other laboratories, ANL [5] and LAMPF [6], has shown that accelerating cavities rarely experience catastrophic damage even after many decades of operation. Even though replacing a damaged structure would require many months of downtime, building an inventory of replacement parts would seem unwarranted.

There are other one-of-a-kind electrical devices that could stop operations where it would be less expensive to provide a spare. Some of these components are thirty years old, such as the 345-kV MSS switchgear, and are beyond the Laboratory's capability to repair in-house. In other cases, the expertise to fix the item has dissipated, such as for the low-level rf for the 8-GeV Booster. Presently, estimates to repair or replace these items cover a range of potential downtimes and may constitute a risk to operations. Other one-of-a-kind devices, such as pickups in the Accumulator stochastic cooling system, are not as much of a risk. In a crisis where one of these units is lost, it is believed that a work-around or a replacement could be devised within a few weeks.

There are a few specialty components where multiple failures might jeopardize operations because only one spare exists. In this category are the low-beta magnets, the low-beta power supplies, and the Main Injector power supply transformers. In each case, replacements require a special order with a long lead time. Procuring additional spares would be advisable. However, as they are moderately expensive, a staged approach might be possible. For instance, the greatest risk for the Main Injector supplies occurs when the NUMI experiment is operational, so for the immediate future the emphasis could be to provide an additional spare for the Tevatron low-beta supplies.

Another example of high-risk specialty items is the kicker magnets in the Main Injector and in the Tevatron. In general, a failed kicker magnet could be quickly repaired except for one critical component, the long ceramic insert. Except for the single spares on hand, there is currently no source available to fabricate ceramic replacements. In view of this problem, the handling of the spare kickers should be placed under particular scrutiny to prevent inadvertent breakage of these key components. A long-term solution to obtain ceramic inserts should also be considered.

There are a number of infrastructure items that might produce long down periods if allowed to progress to the point of failure. Among these are the silting of the cooling ponds around the Tevatron and the degeneration of the wooden power poles. These items are inspected on a regular basis, and there is a progressive remedial plan worked out, though sufficient funding has not been identified as yet to complete the work.

Finally, there are staffing concerns at all levels throughout the Beams Division accelerator groups. The general feeling is that the complexity of the system has grown in the last decade with the addition of the Main Injector, the Recycler and various external beam experiments. The plans are to expand the program further with B-TeV and a 120-GeV beamline. There has not been a corresponding increase in personnel, and this has resulted in more reliance on contract labor. In some cases there has been a loss of particular expertise due to attrition. This could result in accidents due to "human error." Such was the case last year when a technician unfamiliar with the Tevatron components inadvertently connected a water cooling line to a He feed line, resulting in the contamination of ten superconducting components. The results were not disastrous in this case; however, a lack of proper personnel could make it difficult to recover from a large-scale failure in a timely fashion, thus jeopardizing the core physics research program of the Collider.


Introduction and Methodology

The definition of what constitutes a critical component whose failure could shut down Fermilab operations is intimately connected to the resulting downtime that can be tolerated. The number of such components increases dramatically, probably exponentially, with the inverse of the tolerated downtime. For instance, there are probably thousands of such items that might cause a one-week downtime. Fortunately, each such failure has a small probability of occurring and this situation is best analyzed from a machine reliability standpoint [8].

On the other hand, there are perhaps one hundred such critical components that might cause a one-month downtime, tens of items causing a three-month downtime, and only a few items that might cause a six-month downtime. As one month is on the order of the usual downtime for routine maintenance, a failure of this magnitude would probably result in the next scheduled maintenance period being advanced. Therefore, in this study we limited consideration to those items that might cause extended downtimes of three months or longer.

That said, it is difficult to assess whether a certain component failure would indeed result in a downtime of three or more months. Facing such a disaster, the Laboratory would muster its forces to provide workarounds to bring operations back on line as soon as possible. On the other hand, it is equally difficult to analyze how a particular failure might cascade and generate related failures that spread throughout the complex, e.g., a power failure that ruins electronic circuits or computing files in other systems, resulting in months of troubleshooting. Therefore, the philosophy in this report is to err on the conservative side: if a three-month downtime was believed to be possible for any particular component, it was put on the list.

This report contains a set of tables presented by accelerator type and/or subsystem with items identified by accelerator personnel that were deemed to pose a risk to operations. Comments are given on the current status of the identified components. A risk factor (high, medium or low) is assigned based on the perceived probability for a failure, though no attempt has been made to quantitatively analyze the probability of a failure occurring. Only in a few cases, such as superconducting dipoles, has there been a sufficient sample to arrive at meaningful conclusions. Finally, a discussion follows with recommendations to mitigate the effect of a potential failure for the highest risk items.

A summary of a joint study performed by Beams and Technical Division personnel is presented at the end of the report. This study addressed the status of spares and the general question of performing repairs on large components such as conventional and superconducting magnets, as well as Tevatron spool pieces. These components require large-scale in-house maintenance facilities. In essence it was felt that there were sufficient spares on hand for most of the accelerator chain. There were exceptions. For instance, there are sufficient spares for the gradient magnets in the Booster; however, the operational status of these spares is not confirmed. There are also a few specialty magnets, such as the kickers in the Main Injector and Tevatron, that have a spare, but fixing a broken device might not be possible. For other one-of-a-kind magnets, it was felt that even with a need to bring old tooling out of "moth balls," repairs could be achieved in the allotted time with the initiation of a crash program at the time of the failure.

Finally, there could be catastrophic losses due to fires or floods that might affect large parts of the accelerator complex. Disasters of this sort are better studied from an historical perspective along with the standard equipment installed to mitigate losses. Therefore, these are not discussed in this report. On the other hand, there are a number of potential risks for "human error" accidents, some of which are discussed in this report. For example, there are cases where both the piece of equipment and its spare are stored in the same location. This practice might jeopardize the recovery of operations after a fire. The vulnerabilities due to such procedural concerns are not discussed in any detail in this report; however, they might warrant a future investigation to determine what mitigating actions might be undertaken.


Front End Source, Linac, and Associated Beamlines

Component / No. / Spares / Risk* / Vulnerability Comments
Cockcroft-Walton / 2 / L / Aging equipment, but two sources in operation, thus providing a spare (See discussion below.)
750-kV, 200-MHz Buncher Cavity / 1 / 0 / L / One of the oldest pieces of equipment in the accelerator chain (See discussion below.)
200-MHz Structures and In-Tank Quads / 5 / 0 / L / Some of the oldest pieces of equipment in the accelerator chain (See discussion below.)
#7835 Amplifier Tubes (200 MHz) / 5 / 1 / H / Failure rate is about 3.5/yr., 7 in rebuild, single vendor (See discussion below.)
Modulator Switching Tubes (200 MHz) / 15 / 40 / M / Failure rate is about 7/yr., single vendor (See discussion below.)
805-MHz Structures and In-Tank Quads / 28 / 0 / L / 10 yrs. old (See discussion below.)
400-MeV Chopper / 1 / 0 / L / Required to run Booster, no backup, but reliability high so not deemed risk at this time (See discussion below.)
400-MeV Spectrometer Magnet / 1 / 0 / L / No backup, but magnet has spare coil capacity and so not deemed risk at this time

*Risk of failure deemed H=high, M=medium, and L=low.
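
As a reading aid, the table above can be written out as records and tallied by risk rating. This is only an illustrative sketch in Python: the record layout and the names Component, FRONT_END, risk_tally, and no_spares are assumptions introduced here, not part of the study. The Cockcroft-Walton row is read as a count of 2 with no separate spares entry, which is itself an interpretation of the table.

```python
from collections import Counter
from typing import NamedTuple, Optional


class Component(NamedTuple):
    name: str
    count: int              # "No." column
    spares: Optional[int]   # "Spares" column; None where the table gives no entry
    risk: str               # "H", "M", or "L"


# Values transcribed from the table; nothing here is new data.
FRONT_END = [
    Component("Cockcroft-Walton", 2, None, "L"),
    Component("750-kV, 200-MHz Buncher Cavity", 1, 0, "L"),
    Component("200-MHz Structures and In-Tank Quads", 5, 0, "L"),
    Component("#7835 Amplifier Tubes (200 MHz)", 5, 1, "H"),
    Component("Modulator Switching Tubes (200 MHz)", 15, 40, "M"),
    Component("805-MHz Structures and In-Tank Quads", 28, 0, "L"),
    Component("400-MeV Chopper", 1, 0, "L"),
    Component("400-MeV Spectrometer Magnet", 1, 0, "L"),
]

# Number of table rows at each risk rating.
risk_tally = Counter(c.risk for c in FRONT_END)

# Rows whose spares entry is explicitly zero.
no_spares = [c.name for c in FRONT_END if c.spares == 0]

for rating in ("H", "M", "L"):
    print(rating, risk_tally[rating])
print("No spares:", ", ".join(no_spares))
```

Listing the zero-spare rows next to their risk ratings makes it easy to see that every item without a spare is rated low risk in this table, while the high- and medium-risk items are the consumable tubes.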

The Front End Source and the 200-MHz components are some of the oldest pieces of equipment in the accelerator chain. They must be operational in order to have any protons available for the physics program. They are expected to operate with high availability, >97%, and have done so for many years. However, signs of aging are appearing, for example in the modulators and the amplifier tubes, compounded by the lack of vendors to make replacement parts.

The Front End Source situation was recently summarized in a talk presented by Bob Webber to the Laboratory Directorate [9], along with a proposal to replace the oldest items with more up-to-date components, such as an RFQ and 402-MHz accelerating cavities. The cost of such an upgrade is estimated to be about $27.5M; however, there are a number of reasons to consider it, including the promise of a five times brighter proton beam. Even with the most optimistic timetable for the upgrade, there would be a fairly long period of at least 6 years using the present equipment, during which backup 200-MHz parts must be maintained.
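
The six-year window can be set against the tube failure rates quoted in the table with a rough back-of-the-envelope calculation. The Python sketch below is not part of the study. It assumes a constant failure rate, no new purchases, table columns read as count and spares in that order, and (in the second case) that all seven tubes in rebuild return to service. The names years_of_coverage, UPGRADE_WINDOW_YEARS, and POOLS are hypothetical.

```python
UPGRADE_WINDOW_YEARS = 6.0  # "at least 6 years" on the present equipment


def years_of_coverage(spares_on_hand: float, failures_per_year: float) -> float:
    """Years until a spare pool is exhausted.

    Assumes a constant failure rate and no resupply (a simplification).
    """
    if failures_per_year <= 0:
        raise ValueError("failure rate must be positive")
    return spares_on_hand / failures_per_year


# (spares available, failures per year), taken from the table rows.
POOLS = {
    "#7835 amplifier tubes, spare on hand only": (1, 3.5),
    "#7835 amplifier tubes, spare plus 7 in rebuild": (1 + 7, 3.5),
    "Modulator switching tubes": (40, 7.0),
}

for name, (spares, rate) in POOLS.items():
    years = years_of_coverage(spares, rate)
    verdict = "covers" if years >= UPGRADE_WINDOW_YEARS else "falls short of"
    print(f"{name}: about {years:.1f} yr, {verdict} the "
          f"{UPGRADE_WINDOW_YEARS:.0f}-yr window")
```

Under these simplifying assumptions none of the pools would last the full six years without rebuilds or new purchases, which is consistent with the statement that backup 200-MHz parts must be maintained over that period.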