Podcast: The nuts and bolts of a successful failure mode and effects analysis
Brian Hronchek is a principal trainer and consultant at Eruditio. Over the years, Brian has worked as a maintenance manager, reliability engineer, and planning manager, and his background includes stints in the U.S. Marine Corps and Purdue Aerospace Engineering. Brian recently spoke with Plant Services editor in chief Thomas Wilk about how to fill out the FMEA and then use it to build robust equipment maintenance plans.
Below is an excerpt from the podcast:
PS: I enjoyed our first conversation, which focused on the more theoretical aspects of an FMEA. What is it? Why do you do it? What's its position when you're building a maintenance strategy? And I'm really happy that we get to talk again today because what we didn't have time to go over last time were the actual nuts and bolts, building blocks of what this thing is and how to do it. So thanks again for being here, and that's the big question we're going to answer today, is all right, how do you do this thing?
BH: How do you do this thing? This is such a great question to ask. You know in my own experience, walking in as a reliability engineer and learning these tools, as I was earning the CMRP, going in and stumbling over it a lot of times, it's like, how do I answer these questions? What's the difference between the failure effects? How do I answer what's in these blocks? So what I want to talk about today is really demystifying or putting a little strategy behind this because everything we do should lead to something else, and this is no different. If we do this right, it's going to answer so many questions downstream that it's just going to make the next step even easier, and the next step even easier until we have a plant that's running really well.
PS: When we were talking about this before recording, you mentioned that one of the first questions that I think people wonder when they're about to engage in one of these is that somebody has to make the FMEA happen. Someone's been directed to in the plant, or someone thinks they have to do it, but no one's given that person any direction. They're the ones who know they're responsible for it. They want to be responsible. And yet, what do they do first? Where do they look? So if you were there with them on the plant floor, what would you tell them on what's the first steps here?
BH: The first thing to understand is that your chain of command, your bosses up the chain, they want you to do this because they know it's the right thing to do, but they may not be the technical expert in it. So ask a few questions first.
The hierarchy: if you've done a good job on the hierarchy, then all of your assets that are alike should be at the same level across your whole facility or across your whole business. All of your motors should be at Level 8. All of your pump skids should be at Level 7. All of the process sections should be at Level 6, or something along those lines. And the reason that's important is because you need to go back to your bosses and say what level are we doing our FMEA at? What level of the hierarchy? Are we doing it at 6, which is a little bit higher, which means we might get through it faster, but it may be a little bit less valuable for us? Or are we doing it at Level 8, where we're going to take a lot of time to get it done, but it's going to have more that can be duplicated and we're going to have a much more granular view of the entire business and exactly what is important to the business versus not?
Because if I go to Level 6 and I'm looking at a humongous asset, like an overhead crane, right, and we say well, it's a critical crane, yeah, but you know what, the door latch on the cab is not as critical as the hoist motor, so you can't just treat everything on the crane as critical. If we get down to Level 8, then we're able to differentiate between the motor and the latch and the door and all those different things, right? Focus exactly on what we need.
PS: Before we go down this road much further, I wanted to ask what kind of work normally has to be in place before you do an FMEA? Are we talking about there's got to be some sort of criticality analysis in place at the very least? Anything else has to happen first?
BH: Yeah, hierarchy first. Build a good strong hierarchy, that's the foundation of your reliability engineering efforts, right. Criticality is second. You can do an FMEA without criticality, but that just means you're going to be over-maintaining your assets or you're going to be misguided in where you're applying your efforts. So do the criticality and take that top tier, pick it for your business. Is it 5%? Is it 10%? Is it 20%? What percent of your assets are going to go through the FMEA process? Then start working through only those assets first.
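As a rough illustration of picking that top tier, here is a minimal sketch that assumes every asset already has a numeric criticality score. The asset names, the scores, and the `top_tier` helper are hypothetical, and the 20% cut is just one of the options mentioned above.

```python
import math

def top_tier(criticality, fraction):
    """Return the highest-criticality assets, covering `fraction` of the list."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Round up so even a short asset list yields at least one candidate.
    count = math.ceil(len(criticality) * fraction)
    ranked = sorted(criticality, key=criticality.get, reverse=True)
    return ranked[:count]

# Hypothetical criticality scores (higher = more critical to the business).
assets = {
    "hoist motor": 92, "pump skid 1": 85, "pump skid 2": 80, "gearbox": 74,
    "fan assembly": 61, "reservoir": 55, "manifold": 48, "filter": 40,
    "cab door": 12, "cab door latch": 5,
}

fmea_candidates = top_tier(assets, 0.20)
print(fmea_candidates)
```

Only the assets returned here would go through the FMEA first; the rest wait until the top tier is done.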
PS: Thanks for going over that. So you've got the criticality analysis in place, backed up by the hierarchy. You're about to move into the FMEA process. Again, imagine that you’re this person who has no idea what to do. They're going to read this out of a book and they're going to lean on you for help. What are the first pieces of information that you would inventory to start building out the FMEA?
BH: So there's a couple of things. First we have to build the tool. When you look at all of the columns and you get further to the right and there's that risk priority number (or RPN), which is made of severity, occurrence, and detection, let's build out the scales for severity, occurrence, and detection first.
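For readers who want to see the arithmetic, the RPN is conventionally the product of the three scores. The sketch below assumes 1-to-10 scales for all three; the failure modes and their scores are made up for illustration.

```python
def rpn(severity, occurrence, detection):
    """Risk priority number: the product of the three 1-10 scores."""
    scores = {"severity": severity, "occurrence": occurrence, "detection": detection}
    for name, score in scores.items():
        if not 1 <= score <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {score}")
    return severity * occurrence * detection

# Hypothetical FMEA lines: (failure mode, severity, occurrence, detection).
lines = [
    ("motor bearing seizes", 8, 3, 4),
    ("cab door latch sticks", 2, 5, 2),
    ("filter clogs", 5, 6, 3),
]

# Highest RPN first, so the riskiest failure modes rise to the top.
ranked = sorted(lines, key=lambda row: rpn(*row[1:]), reverse=True)
for mode, s, o, d in ranked:
    print(f"{mode}: RPN = {rpn(s, o, d)}")
```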
All three of those scales are there for different reasons. Severity is measuring how much it's going to hurt the business if this failure mode occurs, or if this failure mechanism appears. So that's severity. Now remember, last time we talked about criticality and root cause analysis being either a theoretical understanding of how much something impacts the business or a measure of the actual impact with the root cause analysis. Let's take those same ideas – the money, the safety, the customers – let's take those groups and let's use those same ones to measure the severity. We do this because, #1, the work's already done, and #2, it's an accurate measure of our business.
Let's build out our scales for severity. If this failure happens, how much does it cost? What's the worst that can happen? For example, if the worst it could happen is it shuts down for six months and it costs half a billion dollars, well, then that's going to be the 10. And what's the least that could happen? Well, it can be unnoticeable. We never even know that it happened, it doesn't impact anything, and that's going to be a 1.
Then we're going to scale everything in between, and I would encourage you guys to use a logarithmic scale, not a linear scale. So if the top one is a billion dollars, the next one down is half a billion, and the next one down is 1/4 billion. Let's find some way to scale this so that the higher the number, the greater the leap it makes to start separating this particular failure mode from the distractions that might be around it. Let it highlight itself and stick out a little bit.
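One way to picture a scale like this is to start from the worst-case cost and divide by a fixed ratio for each step down. The sketch below assumes a ratio of 2, matching the halving in the example, and a 10-point scale; the function names and the sample costs are hypothetical, and a plant could just as well pick a larger ratio.

```python
def severity_thresholds(worst_case_cost, ratio=2.0, levels=10):
    """Minimum cost for each score from `levels` down to 2.

    Each step down divides the threshold by `ratio`. Score 1 is reserved
    for anything below the lowest threshold.
    """
    return {levels - i: worst_case_cost / ratio ** i for i in range(levels - 1)}

def severity_score(cost, thresholds):
    """Highest score whose threshold the cost meets, or 1 if it meets none."""
    for score in sorted(thresholds, reverse=True):
        if cost >= thresholds[score]:
            return score
    return 1

# Assumed worst case of one billion dollars, as in the spoken example.
thresholds = severity_thresholds(1_000_000_000)
for cost in (1_000_000_000, 600_000_000, 250_000_000, 0):
    print(cost, "->", severity_score(cost, thresholds))
```

Because each step is a multiple of the one below it, the gap in dollars between a 9 and a 10 is far larger than the gap between a 2 and a 3, which is what lets the worst failure modes stand out.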
PS: OK.
BH: So that occurrence, how often does it occur? It could occur daily, weekly, monthly. It could occur once every long, long time. Now occurrence is another one where we can apply a little bit of strategy. If you have the data, then you use the data from the CMMS. If you don't have the data, get your mechanics out there and ask them how often it happens. And if you don't have either one of those things, then make a good guess, it'll get you in the ballpark, and then as you collect the data, because now you're going to be using your CMMS, then you can revisit this in a couple of years and get a finer tune on it.
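If the CMMS does hold failure dates, a first pass at occurrence can be as simple as the average gap between failures. This is only a sketch: the bands that turn a gap into a score are assumptions chosen to echo the daily, weekly, monthly idea, and the failure history is invented.

```python
from datetime import date

# Hypothetical bands: (longest average gap in days, occurrence score).
BANDS = [(1, 10), (7, 8), (30, 6), (365, 4)]
RARE_SCORE = 2  # assumed score for anything rarer than once a year

def mean_days_between_failures(failure_dates):
    """Average gap in days between failures, or None with fewer than two records."""
    if len(failure_dates) < 2:
        return None
    ordered = sorted(failure_dates)
    span_days = (ordered[-1] - ordered[0]).days
    return span_days / (len(ordered) - 1)

def occurrence_score(failure_dates):
    """Occurrence score from history, or None when there is too little data."""
    gap = mean_days_between_failures(failure_dates)
    if gap is None:
        return None  # no usable data: ask the mechanics or make a good guess
    for max_gap, score in BANDS:
        if gap <= max_gap:
            return score
    return RARE_SCORE

# Invented failure history for one asset.
history = [date(2023, 1, 9), date(2023, 2, 6), date(2023, 3, 6), date(2023, 4, 3)]
print(mean_days_between_failures(history), occurrence_score(history))
```

A `None` result is the cue to fall back on the mechanics' experience, then revisit the score once the CMMS has collected more history.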
The third category there is detection. We've seen some generic scales out there that say, you know, days, weeks, months, years. Can you find it before or after? Is it able to be detected at all? And those are all good things, but this is what I want you to think about. I want you to think about your business, and the way your business operates. What's your definition for emergency work? What's your definition for break-in work? What's your planning horizon? Those types of things. How often do you do an outage? How often do you do a weekly or monthly regular shutdown? Because those are really, really great lines to draw between your scales.
So detection. If it can't be detected at all – 10. I can't detect it before it fails. I can't detect it after it fails. I will never know, right? It's a hidden failure. Next is: I can't detect it before, but I can detect it immediately after – 9. I know that it happened because it goes BLIP, and it was like there it is. But then what do you do from there? Emergency work. You know, for a lot of us, the definition for emergency work is if you have to do it in less than 24 hours. Well, OK, so can I detect it less than 24 hours ahead of time, before the failure?
Then break-in work: for a lot of us, it's within the schedule week. If I can detect it within 24 hours, can I detect it within seven days before it happens? If it's beyond that, what's your planning horizon? Can you get a job planned? Let's say you have a job plan library and if it's identified two weeks ahead of time, I have time to get everything pulled and kitted and ready to go, if the procedure is already written. And then if I don't have one written, what's the next horizon? Is it four weeks before I have time to write a new procedure and get the bill of materials, everything put together and have it kitted and planned and ready to go? So use your business to set the delineation between your detection scores. Now hopefully that makes sense in a podcast rather than visually.
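Since this is easier to see than to hear, here is a small sketch of a detection scale drawn along those business horizons. The 10 and the 9 come straight from the description above; the scores assigned to the 24-hour, seven-day, two-week, and four-week horizons are assumptions, and each plant would substitute its own definitions.

```python
# Assumed horizons: (warning time in days must be under this, detection score).
HORIZONS = [
    (1, 8),   # under 24 hours of warning: emergency work
    (7, 6),   # under a week: break-in work
    (14, 4),  # under two weeks: not enough time to kit an existing job plan
    (28, 2),  # under four weeks: not enough time to write a new procedure
]

def detection_score(warning_days=None, detectable_after=True):
    """Detection score from how far ahead of failure the problem can be seen.

    warning_days=None means the failure cannot be detected before it happens.
    """
    if warning_days is None:
        # 10: hidden failure, never detected. 9: seen only once it has failed.
        return 9 if detectable_after else 10
    for horizon_days, score in HORIZONS:
        if warning_days < horizon_days:
            return score
    return 1  # enough warning to plan, kit, and schedule the work

print(detection_score(detectable_after=False))  # hidden failure
print(detection_score())                        # detected only after failure
print(detection_score(warning_days=3))          # a few days of warning
```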
PS: It does, and I think we’ll do, Brian, we'll find a sample FMEA, we'll put some links in the podcast notes. When you were walking through some of those processes, I was thinking how long do you think it should take to build the average FMEA? If they don't have all the information handy, does it take about a week to do one of these things, to figure out what the information you have and to track down the additional data?
BH: It really doesn't take as long as it feels. Really it's getting rid of all this fog shrouding how to do it well. Once you run through a few, you can knock out a few hundred lines a day with an entire team in the room just focusing on “what's the next question I have to answer” and just blowing through line by line getting these things done. There's also a lot of duplication where you know, “I've done a motor once, I've got another motor to do, I just have to answer the questions again.” So the failure modes aren't different, but the criticality scores or the RPN scores are going to be different, right? So as you get going and get rolling, it starts to get easier and easier.
PS: OK.
BH: So the first part of setup is determining those scales, right? And do that as a team. Bring your team in. Let everybody agree on what severity, occurrence, and detection mean at each level, you know, one through 10 or whatever it is. Let everybody agree on that because they're the ones that are going to have to score it later and you don't want them saying, “but I don't agree with your definition.”
Now we go back to the beginning and we have got our answer on what level of the hierarchy we're going to go after. And let's say we're going after Level 7, this is our pump skids, this is our fan assemblies, things at that level. It might be a motor, gearbox, pump or a motor and a fan and a pulley system. That's the level of asset we're going to go after. So go one level above that, draw your functional block diagram. Fill out everything inside that block with those assets – I've got a hydraulic system, inside the hydraulic system I have 20 pump skids, inside that I'm going to draw smaller blocks for the pump skids, right? And the pump skids feed into a manifold, pump skids draw from a reservoir, you have piping that's involved.
So now we've got our block diagram and what we're going to do is we're going to set ourselves up to answer all these questions in your FMEA – function, functional failure, failure modes, failure mechanisms – with this drawing. Everybody likes drawing. Pictures are better than words, right? Nobody loves sitting in front of an Excel spreadsheet and answering questions. That's horrible, right? So we're going to get on the whiteboard, we're going to draw this.
So the question is, what else do I draw? I've got the blocks. Well, what you're going to draw is the function. So my pump system has to provide fluid, it's a hydraulic system, I'm going to provide fluid. How much? I don't know. Do we have the drawings? Let's assume we have the drawings and it says we have to provide 500 gpm to whatever system is downstream, 500 gpm of clean hydraulic fluid. What if you don't have it? I have to provide some amount of hydraulic fluid to this system downstream. We know that we're providing hydraulic fluid, but the question is, is it important to get those numbers off the drawings or not.
It's great if you can, but it's not critical and this is why. When we get into our functional failures, there's really only three types of functional failures. There is too much, too little, or not at all.
If I jump in the shower in the morning and the pressure is 3,000 psi and I peel my skin off, that is too much. If I jump into the shower and it's trickling out, it's still running, but it's not enough, so there's too little, and if it doesn't come on at all, that's not at all. And it’s the same thing with our equipment, right? So even if I don't know it's 500 gpm, I know that I have to provide hydraulic fluid, and I can provide too much hydraulic fluid, too little, or not at all.
So the same thing between each one of your components. I've got a pump skid that is supposed to provide a smaller amount of fluid. I've got a filter that's supposed to provide cleanliness to a certain degree. If I can't answer the question of exactly how much, I can still answer, it's supposed to be clean. It's either not clean enough, it's too clean, or it doesn't do anything at all.
That brings us to how the functions help us define the functional failures. And again you've got too much, too little or not at all. Now, is too much a bad thing? Sometimes. Sometimes it doesn't matter, right? Sometimes it's like, well, too much air is OK, too little air is where we have a problem. Too much flow is fine, too little flow is where there's a problem, so sometimes you can immediately go in and say, well, functional failures – too much, too little, or not at all – the one we don't care about is too much, so let's just focus on too little or not at all. Sometimes the too much is, you know, if you have a system that only has the capacity to hold 3,000 psi and you pump 4,000 psi into it and the whole thing explodes, then too much is a bad thing. Is it making sense so far?
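The three functional failure types map naturally onto rows of the worksheet. The following sketch is one hypothetical way to generate those rows for each function, with a flag for the cases where too much is not a concern; the function descriptions are illustrative.

```python
from enum import Enum

class FunctionalFailure(Enum):
    TOO_MUCH = "too much"
    TOO_LITTLE = "too little"
    NOT_AT_ALL = "not at all"

def functional_failures(function, too_much_matters=True):
    """Build (function, functional failure) rows for one function."""
    rows = []
    for failure in FunctionalFailure:
        # Skip "too much" when the team has agreed it causes no problem.
        if failure is FunctionalFailure.TOO_MUCH and not too_much_matters:
            continue
        rows.append((function, failure.value))
    return rows

# Exact numbers are optional: the function can be stated without the 500 gpm.
hydraulic_rows = functional_failures("provide clean hydraulic fluid downstream")
air_rows = functional_failures("provide air", too_much_matters=False)

for row in hydraulic_rows + air_rows:
    print(row)
```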
About the Podcast
Great Question: A Manufacturing Podcast offers news and information for the people who make, store and move things and those who manage and maintain the facilities where that work gets done. Manufacturers from chemical producers to automakers to machine shops can listen for critical insights into the technologies, economic conditions and best practices that can influence how to best run facilities to reach operational excellence.