Fight cyberattacks and their impacts with a strong response plan
Ron Brash is the director of Cyber Security Insights at Verve Industrial, a multi-faceted role that includes engaging with customers to work on identifying vulnerabilities that are often not disclosed, explaining and researching technical details buried in products onsite or in a lab, and staying on top of trends or tools. Plant Services Chief Editor Thomas Wilk caught up with Brash for a discussion on cybersecurity trends in 2021, a conversation that took on added urgency as last year’s data breach of the U.S. federal government came increasingly to light.
PS: Thanks to the SolarWinds vulnerabilities, cybersecurity is back on people's minds after a lengthy period focused on COVID. When the C-suite comes to visit maintenance and reliability (M&R), asking what they are doing right now to prevent or respond to cyberattacks, what, in your opinion, should M&R be ready to say?
RB: Well, excluding the SolarWinds part of that question, from a generic standpoint (and I'm saying this with tongue in cheek of course) the honest answer is we need help. There's no shame in asking for help.
Very rarely have I ever been at a site where the site overseer or the site manager has an inkling about what's going on in the systems that provide functionality to their process control (they know that a system such as a DCS or a relay does X, but not what it depends on from a software, networking, or security perspective), so I think there's no shame in asking for assistance and making it a collaborative effort, to be honest. Especially if you inherit those assets, right? If your systems need networking investments that can provide better security and reliability for today and tomorrow, then ask for them by stating the business and security cases for doing so.
I think that's also one of the barriers to these conversations about improving OT sites: IT is terrified of that land called OT. Both groups need to talk to each other. Engineers often aren't very good at talking about those types of things, but being honest about the lack of security features, or about not having the processes and procedures to respond to a cyber-related event, is a good thing, especially if they're not aware of it. Being honest about things isn't bad, and also, resources need to be empowered and strong enough to say, “No, I might not know about X, but I can be proactive about it while taking charge.”
I think management is generally aware that they incur liability and risks when they avoid security discussions. And in the past, for M&R, or whatever terminology that business might use, security and risk haven't been part of those discussions at the depth required. But if, as you say, the C-suite does come knocking on their door, then it's M&R's responsibility to make sure that the organization is adequately prepared to both manage the systems under their purview and help engineer out the risks that affect the business for those same systems.
I think there's a good conversation that can be had there in a proactive way, and I don't think any C-suite or management would try to avoid that discussion, because nowadays there's no excuse for the C-suite and boards to not be aware of those issues. "We need help" shouldn't be a rear end-covering discussion. It should get M&R visibility on their issues, elevate their challenges up to the management level, and in doing so let them look to management for the support needed to resolve those issues (e.g., budget and commitment).
M&R as a whole is not homogeneous either, so it's important to look at individual concerns, not just at things holistically across the organization. M&R might be in hot water if, after an incident occurs, management finds out that M&R and management hadn't done their due diligence and there's no insurance coverage. So, it definitely can be a spicy conversation, but I think honesty and not putting up walls is the best approach.
PS: Let me shift to a slightly more technical area of the discussion: these teams are being asked to tie OT into the cloud versus relying on air-gapping strategies. What are your thoughts on how people are balancing this trend to move more OT systems into the cloud, or at least make them network-connected, versus relying on air-gapping?
RB: I have a few takes on that, and hopefully, they'll all make sense, because there's a couple of threads here. So, in your question, you use the word “balancing” air-gapping versus the benefits of keeping systems on-premises or in the cloud. “Balancing” is an interesting choice of words because I don't think there actually is a balance discussion going on consciously. I argue that the openness and interoperability of systems that was all the rage in the 1980s and 1990s is winning the battle because systems of today have changed. They didn't come with Ethernet cards back in the day; they had serial, or didn't even have that, right? They were on some sort of coaxial bus or something, and today they all have IP addresses.
So, I think that's just the way it is, de facto. For example, when you go buy a car nowadays, it has Bluetooth. That's just the way it is. If you bought a car 10 years ago, it didn't have Bluetooth. I think the cloud will bring a similar paradigm change, whether we like it or not. I'm a bit of a pundit for saying let's not go to the cloud today, but I acknowledge that it's going to occur, and I do see the benefits of it. There is an increase in interconnected systems, whether locally or in a distributed, multisite design, and I am seeing fewer and fewer islands. So, clearly, the air-gapping strategy is going away because people are seeing it as more of a pain. And especially with COVID and the focus on increasing revenue and decreasing costs, it makes sense that you would see that.
Now, I think there is a place for islands, especially when you can't do anything about a particular type of system, or it's very, very important that it maintains a certain level of integrity. Maybe you don't want networking on it, especially if you're dealing with a safety instrumented system. But generally, I find that when people or organizations use islands, and even data diodes, they are using them as a cop-out for skipping electronic maintenance like upgrades and patching. They say, "Oh, we got a diode. That's fine. Leave it alone. We don't need that upgrade. That'll only cause us headaches." Changing the oil in my car is also a pain, and yet it still needs to be done.
However, when people look at using islands and diodes, they actually are promoting more negative behaviors that put the organization at risk. Often you'll see people put a diode at the edge of a manufacturing cell, and because the diode is a one-way communication device, people will think they need to get around it. You will see them go over with a transient laptop or a USB stick to push updates and make changes, and this leads to unsafe behaviors and security control bypasses. Talk about throwing away investments…
There's another risk: if you follow that same design paradigm, you actually wind up with an increased risk of a distributed infection. What that means is, if you look at the SolarWinds issue, there was basically a delayed fuse, right? It would wait 14 or 15 days before it activated. That is a very common strategy in malware, where it goes in, it lays low, and then it comes to life. So, if you have that same transient laptop go to all of your lines, the infection might wait and take all the lines down at the same time, and you would never have known that. So, I do think there's a time and a place for air-gapping, but there's the risk of someone doing something they shouldn't be doing as a consequence of that security, convenience, and cost triangle, and I think that needs to be balanced out in the long term.
Now, onto cloud and on-prem systems: many organizations, and especially production systems, are not geared for these things. Sure, you can put telemetry, SAT, historians, logging servers, and maybe even some of your Microsoft Active Directory systems in the cloud, but there is no way you can put the physical inputs and outputs for, let us say, a conveyor belt or a bunch of physical counters in the cloud. It will not work, because it can't. Now, what I do think is that hybrid or cloud infrastructure will eventually arrive, but it will require a monolithic shift in infrastructure to support that paradigm change.
For example, I was in Japan last year as part of a U.S. joint work exercise, weirdly enough, as a Canadian, and so I met a bunch of creative individuals from ISA (the International Society of Automation), and we went over there as a kind of group collective. There was also a gentleman you might know of, Andy Bochman from INL, who's a well-known figure in this space. And when we went over to Japan, they were discussing how they were planning to put a big portion of their OT infrastructure, especially the SCADA and monitoring aspects of it, into the cloud.
And when you think about it, that's so far ahead of what North America is doing, and potentially even parts of Europe, but there's a fundamental difference between North America and Japan: Japan has fiber optics to everyone's curb. That is not the case in the United States, nor in Canada. Of course, you might think, “well, that will come.” But it won't in the short term. And the cost of doing so, especially with the price of broadband and communication gear, is really high, so I don't see it coming quite as rapidly here as it may in other countries.
So, concurrently, along those same lines, as a stopgap, I'm actually starting to see more and more virtualization at the site level. To me, that is a fundamental change, not only because it's a stepping stone to the eventual cloud move, but because it also decreases physical hardware dependency and promotes recovery and resiliency. It also offers you additional digital twin testbed opportunities. As for a hybrid shift to the cloud, or what some people call a fog, I see virtualization as the way to get there.
Ultimately, I think if organizations are not able to secure systems locally, the move to the cloud is only going to put them at further risk. Even though it will be an eventuality, you're probably not ready for that change yet, so it's best to keep systems on-prem and secure them first, and then look at that transition later where it makes sense.
PS: I appreciate the depth of that answer because it really does chart out for our readers what the issues are in this area and what their options are, including virtualization as an emerging solution. Can you talk about the elements of a solid response plan to a cyberattack?
RB: Sure. Let's say it was ransomware. No plan should include payment of the ransom, period, because you can't guarantee that the attacker won't come back and extort you further. So, you should almost always expect to have to rebuild, and you should have processes that make asynchronous rebuilds the de facto approach.
Ransomware doesn't necessarily have to be about extortion, either. It can be an activity to hide or misdirect attention from other activities. So, in a cyber-related incident such as a phishing attack that delivers some sort of malicious payload, or ransomware that encrypts a system, you need to assume that that asset, once compromised, is toast. This doesn't necessarily mean that the hardware has to go into the destruction pile, but it does mean that in any event where the integrity of a system is directly affected, or adjacent systems are potentially in question, you have to weigh the potential for the compromise to recur due to persistent access methods that an attacker might have placed on your system.
Their intention might not have been just to ransom you. Their intention might have been for you to restore a backup and burn through resources, time, and effort, and that backup might itself be infected with something that causes further exhaustion of budgets, so you need to do a little extra due diligence and plan for bad days.
From a general standpoint, though, a solid response plan presupposes a well-defined and tested combination of recovery and vulnerability management strategies, especially one that scales when under pressure. Rebuilding one system at a time isn't a big deal. But rebuilding a whole bunch of systems simultaneously, especially when you're under pressure, that's the bigger issue here. It's not uncommon for me to hear a site manager or site technician say, "Oh, I just rebuilt the system a while back because the hard drive failed." This isn't much different from that. The only difference is that they probably had other systems up and running and covering for that one.
When you deal with ransomware or even a hardware failure, you need to do a recovery. But when it's malware-related, it's probably going to affect more than a single system, and generally, you're going to need to rapidly correct the flaws that allowed the attacker in. You're going to need to archive the system images, for forensics and posterity at least. You're going to have to restore and update software. You're going to have to reset credentials. And then you ultimately have to get things operational again.
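For illustration, a single-asset rebuild along the lines Brash describes might be sketched in script form roughly as follows; the host name, image paths, and helper commands are hypothetical placeholders rather than any specific vendor's tooling.

#!/usr/bin/env python3
"""Hypothetical post-ransomware rebuild runbook for one asset.

Sequence: archive the compromised disk image for forensics, re-image from a
known-good backup taken before the incident, apply pending patches, rotate
credentials, then hand off for verification before return to service.
"""
import shutil
import subprocess
from pathlib import Path

FORENSICS_DIR = Path("/mnt/forensics")                      # write-once evidence share (placeholder)
KNOWN_GOOD_IMAGE = Path("/mnt/backups/hmi-01_golden.img")   # pre-incident golden image (placeholder)

def archive_for_forensics(compromised_image: Path) -> Path:
    """Keep a copy of the compromised image before anything is wiped."""
    dest = FORENSICS_DIR / compromised_image.name
    shutil.copy2(compromised_image, dest)
    return dest

def restore_known_good(target_device: str) -> None:
    """Re-image the asset from the known-good backup."""
    subprocess.run(
        ["dd", f"if={KNOWN_GOOD_IMAGE}", f"of={target_device}", "bs=4M"],
        check=True,
    )

def apply_updates() -> None:
    """Close the flaw that let the attacker in (placeholder patch script)."""
    subprocess.run(["/usr/local/bin/apply_vendor_patches.sh"], check=True)

def reset_credentials() -> None:
    """Rotate local and service credentials that may have been captured."""
    subprocess.run(["/usr/local/bin/rotate_credentials.sh"], check=True)

if __name__ == "__main__":
    archive_for_forensics(Path("/mnt/staging/hmi-01_compromised.img"))
    restore_known_good("/dev/sdb")
    apply_updates()
    reset_credentials()
    print("hmi-01 rebuilt; verify against process values before returning to service")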
Nonetheless, all these things are good opportunities for proceduralization and for stress-testing end-to-end procedures and the people involved, all of which can be reinforced through drills ahead of an incident. If people started looking at cyber as just one of the things they must deal with, like a system failure, and began treating it as such, then they would get all the benefits of being prepared, and they would be able to recover faster.
But ultimately, in a ransomware-type attack, systems generally do not fail one at a time. They fail several at a time, so you also need to make sure that whatever your recovery strategy is, it can scale to multiple systems at once. That might mean multiple people. That might mean automation technology. But it also means having the network bandwidth to restore several systems at once.
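As a sketch of that scaling concern, several rebuilds could be driven in parallel with a cap on concurrency so restore traffic doesn't saturate plant network links; the asset list and restore script below are hypothetical placeholders.

#!/usr/bin/env python3
"""Hypothetical parallel-restore driver: rebuild several assets at once,
capping concurrency so restore traffic doesn't exhaust network bandwidth."""
from concurrent.futures import ThreadPoolExecutor, as_completed
import subprocess

ASSETS = ["hmi-01", "hmi-02", "eng-ws-01", "historian-01"]  # assets hit in the same incident (placeholder)
MAX_PARALLEL_RESTORES = 2                                   # tune to available bandwidth and staff

def restore_asset(name: str) -> str:
    """Placeholder: call whatever imaging/backup tool the site actually uses."""
    subprocess.run(["/usr/local/bin/restore_from_golden_image.sh", name], check=True)
    return name

with ThreadPoolExecutor(max_workers=MAX_PARALLEL_RESTORES) as pool:
    futures = {pool.submit(restore_asset, a): a for a in ASSETS}
    for done in as_completed(futures):
        asset = futures[done]
        try:
            done.result()
            print(f"{asset}: restored, pending verification")
        except subprocess.CalledProcessError:
            print(f"{asset}: restore failed, escalate to manual rebuild")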
I know what I'm saying sounds like a lot, but I think an important piece here is that ransomware is the attack of today. Viruses, in the traditional sense, were the attack of the early 2000s, but if you had planned for virus recovery in the 2000s, you're probably pretty well prepared to recover from ransomware today (minus the questionable effectiveness of antimalware definitions, though). Many of the elements I'm talking about are reusable and can be leveraged for any cyber-related incident, whether it's a physical failure or a cyber disaster such as a wholesale ransomware situation. The positive is that if you didn't have those things in place yesterday, you can benefit from them today and tomorrow in many, many instances, and that's something I think most cyber people don't think about.
PS: COVID-19 has been testing supply chain resiliency around the world. Would you say that the supply chain in general has passed this stress test? And related, how vulnerable do you think the supply chain is these days to cyberattacks?
RB: A lot of that depends on where you are in the world and what you need, right, and the software supply chain is another issue altogether. Based on a GSBF ranking system – good, satisfactory, barely, or fail – I believe that depending on where you are and what that chain is, we are at the line between barely and satisfactory in terms of the results that we see. In other words, I order my package, it arrives at my door. Or there's food in the grocery store, so I guess it's okay, right? There might be delays, but I guess it's okay. Or maybe the stock isn't quite up to par.
Now, for essentials, it seems that the big organizations are pulling through this. But there's a cost, I think. People get tired. People get sick and systems break. So, for those of us who have been around cars, you don't run a car at redline for an extended period of time, because things will quickly degrade, and degrade at a rate much quicker than anticipated. Eventually, things will get hot and fail or break.
I suspect that as lockdowns keep occurring and people get tired, or maybe they skip a maintenance cycle because they're trying to churn out as much toilet paper or as many personal health protection and safety devices as they can, things are going to start to break sooner rather than later. Or what'll happen is we'll get through this outage, and then, because we've skipped all those maintenance cycles, a whole bunch of bad things and catastrophic failures will occur after the fact. So, I think while operations are still occurring reasonably well, even if at a reduced level, and we're getting our packages, I think we're missing the behind-the-scenes part.
Even before COVID, most industrial process equipment was, and still is, poorly protected, and it was designed on a model built for previous generations that doesn't work well in a modern digital age. So, for example, if we're talking about electronic system maintenance, generally there's a mindset of, if it's not broken, don't fix it. Or better yet, people thought the concept of multiple redundant lines would help you in a physical failure event, but if you look at things from an electronic perspective, when something goes sideways, and surely it will, there is zero technological diversity in place to prevent all of the manufacturing cells or lines from being brought down by a single weaponized approach that targets all lines.
So, to sum up what I'm trying to say, I guess we are in a fragile state. Right now you're playing cyber roulette, even though it looks like you're still getting your packages. But I do think that we're going to incur future costs as an unintended consequence, especially for maintenance. I think when a few of those links in the chain start to snap or fatigue, the whole system, or parts of it, will collapse. I do think we haven't done enough to promote resiliency with respect to industrial control systems and operational technology. But that's just my sense from all of the sites I've ever been to.
I do think there's a big open question about what we do after COVID. My perspective on the web of problems is that we have done a very poor job of investing in critical infrastructure and maintaining it and keeping it reliable. Post-COVID, as companies, governments, and municipalities tighten their belts, I think, and I fear, that people are going to let security, especially the maintenance of electronic systems, degrade even further because they're afraid of shutdowns and unintentional disruptions that result in a loss of revenue or service. In fact, we are starting to see it fail now in automotive, where they cannot get components, and so production is being halted or rationed.