In 2003, the Alcoa Warrick Smelter was 43 years old, and it had the second-highest maintenance costs in the corporation’s global smelting system. Asset reliability in the plant continued to suffer, and equipment instability kept the plant from fully implementing lean manufacturing tools. A formal assessment of the smelter’s repair and maintenance (R&M) efforts found a mostly reactive approach, with the focus on responding well to emergency breakdowns.
That year, the location’s top management provided support to embark on a Reliability Excellence (REX) journey, which created a significant transformation. Ten years later, in 2013, Alcoa Warrick Smelter’s R&M costs are 29% below its 2003 pre-REX base (44% lower after adjusting for inflation), and gains from improved overall equipment effectiveness (OEE) have matched R&M savings dollar-for-dollar annually.
A formal asset integrity audit performed in 2010 by corporate-level resources confirmed that these cost savings were real — in other words, they weren’t gained by simply deferring R&M. In fact, the Warrick Smelter had the lowest percent of corrective actions needing attention in the next five years of all the corporation’s global smelters.
Building a high-performance engineering team contributed to the success of reliability excellence now in use for the smelting business at Alcoa Warrick Operations.
Importance of reliability engineers
Reliability engineering is different than a traditional engineering role in manufacturing. These are not engineers who provide the routine, day-to-day support for production centers. Instead, reliability engineers are in a strategic role — focused on failure prevention and, most importantly, helping to determine how to improve reliability and operate the plant’s assets at the lowest cost.
Do you have reliability engineers at your plant? If so, what types of tasks are they doing? Are they managing capital projects? Are they firefighting? Are they in tactical roles? If so, they’re not reliability engineers.
If a problem isn’t solved to root cause, it may keep recurring. If a plant doesn’t know which assets are the most critical, then the plant may be focusing on the wrong things. If a facility isn’t using equipment failure data to direct its resources on the true equipment bad actors, then there is most likely a lot of money being left on the table.
Reliability engineers help with all of that and lots more.
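As a rough illustration of using failure data to find the bad actors, the Python sketch below totals repair cost per asset and ranks the assets Pareto-style. The asset names, record layout, and figures are hypothetical, and ranking by repair cost alone is an assumption; a plant might just as reasonably rank by downtime or by criticality-weighted cost.

```python
from collections import defaultdict

# Hypothetical failure records: (asset_id, downtime_hours, repair_cost).
# All names and figures are illustrative, not plant data.
failure_log = [
    ("conveyor_7", 4.0, 12_000),
    ("crane_2", 1.5, 3_000),
    ("conveyor_7", 6.0, 20_000),
    ("pump_14", 0.5, 800),
    ("crane_2", 2.0, 4_500),
    ("conveyor_7", 3.0, 9_000),
]


def rank_bad_actors(log):
    """Total repair cost per asset, sorted from most to least costly."""
    totals = defaultdict(float)
    for asset_id, _downtime_hours, repair_cost in log:
        totals[asset_id] += repair_cost
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def cumulative_share(ranked):
    """Return (asset, cost, cumulative fraction of total cost) rows."""
    grand_total = sum(cost for _, cost in ranked)
    running = 0.0
    rows = []
    for asset_id, cost in ranked:
        running += cost
        rows.append((asset_id, cost, running / grand_total))
    return rows


ranked = rank_bad_actors(failure_log)
for asset_id, cost, share in cumulative_share(ranked):
    print(f"{asset_id:12s} {cost:10,.0f} {share:6.1%}")
```

The assets at the top of the list, which account for most of the cumulative cost, are the natural first candidates for root cause work.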
But we like firefighting
How many times have you experienced a major equipment failure at your plant, and felt relieved when it was over? We all have praised our firefighters; these are the individuals who excel in a crisis and, in many cases, thrive during every minute of it.
We need them. There is no doubt that when a production center is interrupted, we need resources to respond. And when a major downtime event occurs, we need people with strong troubleshooting skills and those who can get our equipment back up and running again. These are the ones who are working hard to reduce mean time to repair (MTTR). They could be engineers, technicians, craftsmen, or others. And when they get the equipment running again, we thank them and feel the weight lifted off our shoulders. How many of us have given lavish praise to these “knights in shining armor” when they swoop in to save the day? We probably all have, and that is not a bad thing, but what we sometimes forget in the heat of the moment is to step back and ask, “How did we get into this mess in the first place? Why did this failure occur?” And most importantly, “What are we going to do to prevent it from happening again?”
This is where the reliability engineer steps in. Reliability engineers are focused not on MTTR but on mean time between failures (MTBF). While others are on the scene, working to do whatever it takes to restore immediate production flow, the reliability engineers should be there investigating what happened. They will talk with the operators, review operational data and trends, take photographs of the scene, pull up the history of similar incidents, review camera footage, if available, and try to piece together all the available evidence. This is the detective work that will enable them to lead a root cause analysis (RCA) to determine why the failure occurred.
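To make the difference between the two measures concrete, here is a minimal Python sketch that computes both from a downtime log. The event data and the reporting period are hypothetical, and the sketch assumes the asset is scheduled to run continuously, so all time outside the repair windows counts as uptime.

```python
from datetime import datetime, timedelta

# Hypothetical downtime events for one asset: (failure_start, back_in_service).
events = [
    (datetime(2013, 1, 5, 8, 0), datetime(2013, 1, 5, 14, 0)),    # 6 h down
    (datetime(2013, 2, 10, 22, 0), datetime(2013, 2, 11, 2, 0)),  # 4 h down
    (datetime(2013, 3, 20, 6, 0), datetime(2013, 3, 20, 8, 0)),   # 2 h down
]
period_start = datetime(2013, 1, 1)
period_end = datetime(2013, 4, 1)


def total_downtime(events):
    """Sum of all repair windows as a timedelta."""
    return sum((end - start for start, end in events), timedelta())


def mttr_hours(events):
    """Mean time to repair: total downtime divided by number of failures."""
    return total_downtime(events).total_seconds() / 3600 / len(events)


def mtbf_hours(events, period_start, period_end):
    """Mean time between failures: total uptime divided by number of failures.

    Assumes continuous scheduled operation over the whole period.
    """
    uptime = (period_end - period_start) - total_downtime(events)
    return uptime.total_seconds() / 3600 / len(events)


print(f"MTTR: {mttr_hours(events):.1f} h")
print(f"MTBF: {mtbf_hours(events, period_start, period_end):.1f} h")
```

Faster repairs shrink MTTR, but only eliminating failures raises MTBF, which is why the two roles pull on different numbers.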
And speaking of RCA, it may seem obvious, but you need to have proper follow-up mechanisms to ensure the RCA action items are getting completed. Have your reliability engineers keep these corrective and preventive action items in front of your teams so that these tasks get done. Archive your RCA files to be easily accessible later on. And if the failure returns, retrieve the previous RCA and review it to try to understand what may have been missed and why the failure recurred.
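One simple follow-up mechanism is a list of open RCA action items that is reviewed regularly. The Python sketch below is only an illustration: the `ActionItem` fields, RCA identifiers, owners, and dates are all hypothetical, and a real plant would more likely pull this information from its maintenance management system.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    rca_id: str        # hypothetical identifier of the archived RCA
    description: str
    owner: str
    due: date
    done: bool = False


def overdue_items(items, today):
    """Open action items past their due date, oldest due date first."""
    late = [item for item in items if not item.done and item.due < today]
    return sorted(late, key=lambda item: item.due)


# Illustrative entries only.
items = [
    ActionItem("RCA-001", "Replace worn coupling guard", "A. Smith", date(2013, 3, 1)),
    ActionItem("RCA-001", "Update lubrication PM", "B. Jones", date(2013, 2, 15), done=True),
    ActionItem("RCA-002", "Add vibration route point", "C. Lee", date(2013, 2, 1)),
]

for item in overdue_items(items, today=date(2013, 3, 15)):
    print(item.rca_id, item.owner, item.due, item.description)
```

Keeping the RCA identifier on every item also makes it easy to retrieve the archived analysis if the failure returns.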
Free-time reliability engineering
Focus. That is one of the most important parts of our success. If you have a reliability engineer in a hybrid role — doing some maintenance engineering or some project engineering — then you don’t have a true reliability engineer. In a hybrid role, the crisis of the day or a production manager’s pet project can take precedence over working on long-term objectives. If you expect reliability engineers to work on reliability when they can find the time, you aren’t going to make the gains you’re seeking. They must be focused.
Yes, we all are busier than ever these days. We all wear multiple hats. But reliability engineering is one role where we must discipline ourselves to keep the engineers focused solely on failure elimination and prevention. When you pull a reliability engineer off a proactive task to work on a reactive task, you are losing ground in your reliability efforts. Set yourself up with a maintenance engineer or maintenance professional to handle the tactical production needs while the reliability engineer is allowed to focus on the strategic efforts.
And you may ask, “How many reliability engineers do I need?” Well, it depends on several factors, such as the size of your plant or how reactive you are. A guideline is, if you have 100 craftsmen at your plant, then two dedicated reliability engineers would be about right.
What about PdM?
How does predictive maintenance (PdM) or condition-based monitoring fit into all of this? PdM is an important part of REX, no doubt about it. But if you only do PdM without going after root cause, your equipment failures will return. PdM lets you know about problems early. It buys you the time to plan and schedule the repair instead of letting the equipment run to final failure. And we estimate that letting the equipment run to failure is, on average, seven to 10 times more expensive than repairing it proactively. So on one hand, it’s good to know about your equipment anomalies early. But, on the other hand, you need to ask yourself why the defect occurred in the first place. Root cause efforts can help with this.
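The seven-to-10-times estimate can be turned into a quick cost-avoidance range for each PdM find. The Python sketch below is an assumption-laden illustration: it reads the estimate as "a run-to-failure repair costs 7 to 10 times the planned repair," and the find names and planned repair costs are hypothetical.

```python
# Multiplier range taken from the estimate in the text, read here as:
# run-to-failure cost = 7x to 10x the cost of a planned, proactive repair.
RTF_MULTIPLIER_LOW = 7
RTF_MULTIPLIER_HIGH = 10


def cost_avoidance_range(planned_repair_cost):
    """Return (low, high) estimated cost avoided by repairing proactively."""
    low = planned_repair_cost * (RTF_MULTIPLIER_LOW - 1)
    high = planned_repair_cost * (RTF_MULTIPLIER_HIGH - 1)
    return low, high


# Hypothetical PdM finds with their planned repair costs (illustrative).
pdm_finds = {"fan_bearing": 2_000, "gearbox_seal": 5_500}

for name, planned in pdm_finds.items():
    low, high = cost_avoidance_range(planned)
    print(f"{name}: planned {planned:,}, estimated avoidance {low:,} to {high:,}")
```

A range like this is only an estimate of the repair avoided; it says nothing about why the defect appeared, which still takes root cause work.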
Equipment autopsies
We do equipment autopsies at Alcoa. In our case, it is the death of a piece of equipment. It could also be on a near-death piece of equipment, one where the root cause for trouble remains undiagnosed. We want to understand why failures occur, so we can take steps toward preventing future failures. The root cause aspect of REX cannot be stressed enough. Let’s use our motor team as an example.
We have hundreds of motors at our plant, some small and some huge. As part of REX, we created a cross-functional team to focus on motors. The team’s mission is to improve reliability and reduce costs. The team is made up of representatives from management, engineering, and supervision, as well as PdM/reliability technicians. When we first started, we had many motor run-to-failures (RTFs). We started conducting motor autopsies to understand what was causing the failures; these are discussed as a team at our motor meetings. Initially, we found some common failure modes and developed corrective/preventive actions to go after those. Our motor team also implements best-practice maintenance strategies for motors, everything from proper motor greasing and storage practices to PdM. Our team sets aggressive, yearly objectives, and we track all motor costs meticulously. What are the results? What used to be several motors a week running to failure is now less than one motor that has run to failure. Our motor costs have decreased by more than 90% since we started, not to mention the significant improvement in production uptime. This is another example of how solving problems to root cause reduces costs and increases equipment reliability.
Metrics, more metrics
It’s true that there are literally hundreds of ways to measure success and progress on reliability. Deciding which ones to use can seem like a daunting task. We tried to keep it simple. R&M costs are a key metric for us. If you are doing the reliability work and your costs are not coming down, then there is something wrong. However, measuring maintenance costs alone is not a good way to measure success. Many of us have examples of plants where maintenance was deferred and R&M costs decreased for a short time, until the equipment started to fall apart and R&M costs skyrocketed. Therefore, in addition to R&M costs, you need to somehow measure your plant’s reliability.
You could choose overall equipment effectiveness (OEE). If your OEE is increasing and your maintenance costs are decreasing, then that is generally a good sign. The percentage of emergency work is another useful metric. Think of emergency work as the killer of reliability. Pretty much everything about emergency work is bad. It’s costly and less efficient, and it can have a higher safety risk. So, if both your percent of emergency work and R&M costs are decreasing, that is good news. Regardless of which metrics you choose, our recommendation is to keep it simple. Don’t get bogged down in too many metrics, and remember that metrics are there to help you improve. So if the number is ugly, it is what it is. Measure it accurately and honestly. Then use it as a springboard to track progress and drive improvement.
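A minimal Python sketch of these checks follows. It uses the common definition of OEE as availability times performance times quality, and it assumes percent emergency work is measured in maintenance labor hours (counting work orders would be an equally reasonable basis). The yearly figures are hypothetical.

```python
def oee(availability, performance, quality):
    """OEE as the product of three ratios, each between 0 and 1."""
    for ratio in (availability, performance, quality):
        if not 0.0 <= ratio <= 1.0:
            raise ValueError("each factor must be between 0 and 1")
    return availability * performance * quality


def percent_emergency_work(emergency_hours, total_hours):
    """Emergency labor hours as a percentage of all maintenance labor hours."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return 100.0 * emergency_hours / total_hours


def healthy_trend(previous, current):
    """True when R&M cost and emergency work fall while OEE does not drop."""
    return (
        current["rm_cost"] < previous["rm_cost"]
        and current["emergency_pct"] < previous["emergency_pct"]
        and current["oee"] >= previous["oee"]
    )


# Hypothetical yearly snapshots; rm_cost is in arbitrary currency units.
snapshots = [
    {"year": 1, "rm_cost": 10.0, "oee": oee(0.85, 0.90, 0.98),
     "emergency_pct": percent_emergency_work(400, 1000)},
    {"year": 2, "rm_cost": 9.0, "oee": oee(0.90, 0.92, 0.98),
     "emergency_pct": percent_emergency_work(250, 1000)},
]

for before, after in zip(snapshots, snapshots[1:]):
    status = "good sign" if healthy_trend(before, after) else "investigate"
    print(f"year {before['year']} -> {after['year']}: {status}")
```

Pairing the cost figure with a reliability figure is what guards against the deferred-maintenance trap: costs that fall while OEE drops or emergency work rises would be flagged for a closer look.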
What is that reliability group doing for me again?
Sometimes the firefighters tend to get the praise, and those preventing the failures are forgotten because they are behind the scenes. As part of REX, we need to constantly educate and reinforce the value of the various members of our reliability programs. This includes the reliability engineers and the PdM technicians.
Take the PdM technicians, for example. They are performing routes using various PdM technologies, such as vibration analysis, ultrasonic examination, and infrared thermography. They are our scouts, finding equipment defects high up on the P-F curve, the interval between a detectable potential failure and functional failure. From there, corrective work orders should be generated to repair the piece of equipment in a planned and scheduled fashion, before final failure occurs. Sometimes their efforts go unnoticed. As leaders and managers of REX, we need to make sure they are not taken for granted. We must broadcast their successes frequently and in ways that are easily understood by all, especially high-level management.
As an example, let’s say one of your PdM technicians finds an anomaly on a high criticality piece of equipment using ultrasound. This defect gets repaired through your work control process proactively and the equipment never reaches final failure. Then you go through a root cause exercise and determine ways to prevent this defect from recurring. This is a total success. But it may go completely unnoticed without intentional methods to recognize this achievement.
Your method of broadcasting these types of successes is up to you. Perhaps you prefer sending out a spreadsheet that lists the various cost savings/avoidances. Or maybe you have a single PowerPoint slide that describes what happened, what was done, and the benefit. Or you may post “Equipment Bad Actors Eliminated” on a bulletin board. Or you may even choose to have your reliability engineer give some success presentations to your plant lead team. Whatever methods you choose, make sure you know your audience and keep it simple. And don’t forget to tie your successes to financial impact.
The right seat on the bus
In order to be successful at REX, it’s important to have the right systems and processes in place. But it’s also important to have the right people in the right roles. Individuals should have passion for what they do. They should feel supported and understand why their jobs are important and what their direct impact is on the business. And all employees at the facility should understand their own roles and everyone else’s, too. REX is not just another program of the month. Instead, it’s a permanent change to the way you do things. It should be ingrained into your plant’s culture. It is truly a long-term journey, one that really never ends.
Ready to begin
“But we must get everything just right before starting.” No, you don’t. One key point to remember: don’t allow perfection to get in the way of good. Start simple. Build your momentum one experience at a time. You’ll be amazed at how successes feed upon each other. Always make sure you broadcast your successes. And the successes you broadcast should be those of the team.
And as you progress in your REX journey, you will have setbacks. Perhaps you had a PM kaizen where you cut too deep into your PMs. Or maybe you thought you did a thorough job on an RCA, but you missed a key root cause. Whatever it is, be up front about the mistake and use it as an opportunity to learn.
There will be bumps in the road. Don’t let that get you down. Be persistent and keep pushing ahead. Culture changes are not easy. Keep your vision, and remember that, with REX, it is a long-term investment. But it is an investment that will indeed pay off.