Transcript 2
Course Overview
Hi everyone. Welcome to the Business Continuity, Disaster Recovery, and Incident Response for the
Certified in Cybersecurity course. This certification prep course will help you prepare for the Certified
in Cybersecurity examination. In this course, we're going to cover the skills measured in the three
sections of the business continuity management system as listed in the exam outline. This domain
counts for about 10% of the total exam. This is the second of the five domains for the CC
examination, and it addresses the areas of business resilience and even survival. My name is Kevin
Henry. I'm an educator and security professional, and I've been developing and teaching information
security courses for over 20 years based on my own years of practical experience in the field. This
course will address the key topics related to the principles of incident management and response,
business continuity planning, and disaster recovery planning. This course is supported by several
reference and exercise files. To download the exercise files, navigate to the Exercise files tab and
click on the Download button. In the exercise files, you'll find a helpful study guide that you can use
to follow along during your certification prep for the CC exam. The study guide contains a glossary, a
list of key points to remember, and some sample questions. I'm happy to join you on your
certification prep journey with the Business Continuity, Disaster Recovery, and Incident Response
Incident Response
Welcome to the second domain of the Certified in Cybersecurity Certification course. This domain is
entitled Business Continuity, Disaster Recovery, and Incident Response. These three elements
make up the business continuity management system and are crucial to helping organizations
prepare for and manage the many problems and challenges that all organizations face. This domain
represents 10% of the examination content, which makes it the lowest weighted domain in the exam.
Consider this to be a free 10% in the exam. We'll cover these topics in a way that makes them
logical and understandable so you're well prepared for the exam questions. This domain is divided
into three sections, incident response, business continuity, and disaster recovery. Bad things
happen. Every organization must be prepared to face adversity and unexpected problems. Power
outages, hacking, employee errors, storms, and equipment failures are some of the many types of
incidents that can and do affect business mission and operations. The secret to managing a crisis is
to be prepared, have a plan, in fact, many plans, different plans to deal with different types of
incidents. Most incidents can be resolved quickly allowing the resumption of normal business activity,
but sometimes an incident requires a more detailed response, a business continuity plan to enable
the continuation of critical business processes. And a severe crisis may require a disaster recovery
plan to rebuild services at perhaps another location. These plans work together to ensure life safety,
always the first priority, and the identification and containment of the incident and the ability to return
to normal as quickly as possible. I hope you enjoy this domain. Let's get started with incident
response.
Incident Response
Let's take a look at incident response. The outcomes of a business continuity management system
are that we have plans in place for incidents through incident response planning which address
things like life safety, containment of the incident, documentation of the incident, and the ability to
return to normal operations. Business continuity planning is based on a business impact analysis,
the critical business functions, the recovery time objective, the data recovery point objective, and the
requirements to enable recovery of systems. Disaster recovery planning covers the relocation of IT and
other services to an alternate location. When we look at an event, an event can be defined as any
measurable occurrence, something happened, somebody walked in, somebody walked out. That's
an event. An incident is a type of event with a potential to affect business mission. In other words, we
could call it an adverse event. All incidents are types of events, but certainly not all events are types
of incidents. Our goal is to build resilient systems. We see that term used a lot today: business
resilience means we can continue operations even during adverse circumstances. We
have response plans in place to address especially things that have happened in the past. If it's
happened before, there is a chance it could happen again. We also have to know what are the
current trends and threats, what are the types of attacks being used today? Is today's problem, say
ransomware or DDoS attacks? We should know what the current, should we say, tool of choice of
hackers is. And of course, we should look at areas of change because everything worked well until
we made a change. In many cases, it's when we have a change in staff, a change in procedures, a
change in equipment that we get more incidents as well. Incident management is a structured
process that starts with preparation. Let's be prepared in case something happens, then we can
prevent it as much as possible if we know the things that can happen. But we have to be alert to the
fact that things can still happen, even though we are prepared and have prevented, so we need
good detection. When something happens, we need to stop it from spreading, and that, of course, is
containment. Then we want to get back to normal, restoration, and apply lessons that were learned.
We can see, for example, a fire is an example of an incident. We're prepared by having equipment
and alarms and smoke detectors. We try to prevent fires through good practice of not overloading
electrical circuits or having dangerous circumstances that could lead to fire, but we have those
detectors so if there is a fire we'd know about it. The first thing we want to do if there is a fire is to
contain it, stop it from spreading, close fire doors, for example. After the fire is out, we need to
rebuild, restoration, and then, of course, learn how could we make sure that this doesn't happen
again. The idea of preparation starts with policy. Do we have policies about how to deal with things
and who has the authority if there is an incident? So it's not such that in a case of a crisis, everybody
is wondering well, who can make the decisions? Who's in charge? We have defined team members,
each with their own role, and of course, with the procedures of how we would do things. We want to
make sure that everything is documented because when we have things documented, we'll be able
to go back and review what went well, what could we improve on, for example. And of course, we
want to have regular reporting back to management and our customers and employees of what is
the current status of the incident. The idea of prevention, of course, is to have learned what are the
things that could happen, so hopefully we reduce the vulnerabilities or reduce the likelihood of
something happening again. The better we can be at prevention, the better we can be hopefully at
avoiding having to deal with incidents at all. We know that a lot of this is learning from what are the
bad guys doing. The types of attacks they're using are the things I should especially be watching for,
in other words, offense drives defense. We need to monitor and know what's happening on our
systems, networks, applications, and users. We see far too often that the problem is the attack had
gone on for months and nobody recognized it because nobody knew what was normal activity. We
should test our controls to make sure they're working, and certainly we should have awareness
programs so people know what to watch for and what to do if something happens. The key points
review. The secret to incident management is preparation, manage the incident and don't let the
incident manage you. Prevention is better than recovery, and learn from past incidents how to be
better prepared.
Detection
Of course, we want to try to prevent incidents, but we have to be ready for when they happen. We
need to detect the incidents, and this can come through the use of various tools and technology that,
for example, detect a change in behavior on a system, network, or user; look for signatures of known
types of attacks, which we often see with, for example, malicious code; or use heuristics, which is a
type of artificial intelligence that tries to learn when there is something that maybe is undesirable.
We use alarms because they can notify us if there is something that's gone wrong, and an alert can
come in that makes us, our employees, customers, and suppliers aware of a problem, so we can, of
course, communicate about it with these outside parties as well. One of the things that can be important
is to do audits and reviews of how well we've handled incidents in the past. What are the things we
could learn? When it comes to incident detection, the first line of defense is often the help desk. They
are the first people who become aware of people calling in and saying we're having a problem. They
have trouble tickets, and we should look for trends and patterns in the types of problems that people
are having. Alerts come in from our various monitoring systems, maybe a security information and event
management (SIEM) system, for example. But when something goes wrong, our first priority must always be
life safety, looking after our employees, customers, and certainly the community around us as well.
But then we have to do some analysis of the incident. The analysis of the incident should lead to a
classification. Is this really an incident or is it just noise? If it's not really serious, we could call
it a false positive. Or is it a true positive, an incident and something we need to immediately take
action on? The identification of it as a real incident should lead to a classification of its severity:
is this just a minor problem, is it serious, or is it even catastrophic, something that could affect the
whole organization? Depending on the classification, we can determine whether it's just an internal
problem or something that has come from outside. Was it something that was done
intentionally or just accidentally? And then, of course, we activate the appropriate response teams. If
it's a minor incident, maybe just a few people are involved, but if it's catastrophic, it could be that we
activate teams right up to the senior management level and even our public relations group as well.
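The classification step just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Alert` fields, the three severity levels, and the team names in `TEAMS_BY_SEVERITY` are hypothetical and would come from an organization's own incident response plan.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    MINOR = 1
    SERIOUS = 2
    CATASTROPHIC = 3


@dataclass
class Alert:
    description: str
    is_true_positive: bool  # analyst's verdict: a real incident, or just noise
    severity: Severity
    external: bool          # came from outside rather than an internal problem
    intentional: bool       # done on purpose rather than by accident


# Hypothetical escalation table; real team names belong in the response plan.
TEAMS_BY_SEVERITY = {
    Severity.MINOR: ["incident handler"],
    Severity.SERIOUS: ["incident handler", "incident response team"],
    Severity.CATASTROPHIC: [
        "incident handler",
        "incident response team",
        "senior management",
        "public relations",
    ],
}


def triage(alert: Alert) -> list[str]:
    """Return the teams to activate; an empty list means no incident is declared."""
    if not alert.is_true_positive:
        return []  # false positive: noise, nothing to activate
    return list(TEAMS_BY_SEVERITY[alert.severity])
```

A false positive activates nobody, a minor incident involves only a few people, and a catastrophic one escalates up to senior management and public relations. The `external` and `intentional` fields are simply recorded here so the responders have them to hand.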
We want to contain incidents so we can contain the bad effect or adverse impact of the incident, and
often, we'll do this through things like isolation. We have a system that's infected, we disconnect it
from the network. In the case of a fire, we close fire doors, we disable network connections, we put a
system into quarantine so we can examine and see what's going on. And then quite often, this is
where we'll use a sandbox. A sandbox means we put, for example, malware or an infected machine
into a secure environment where we can watch its execution and watch what it's trying to do, but it is
limited to that area, often a virtual machine, so it can't infect or spread to other systems. One of the
things we often will do is power the system down, which gives us a chance to stop it from
continuing to generate whatever type of malicious activity it's doing. We then sometimes, in a minor
incident, might just monitor. For example, we have things like honeypots where we can try to watch
the type of activity, or we see something that's going wrong and it's not something which is spreading
quickly, but we can monitor so we can see whether or not there is something going on and how that
is developing. We can learn maybe some of the behavior, the tools and techniques of the attacker.
Some of the considerations when we want to contain or stop something from spreading depend on
whether or not this is a critical system. If this is a system that is critical to business operations,
maybe I can't power it down or isolate it. We also have to look at whether this is something that's
going to spread or something which is just, for example, in one area and not going to start infecting
other systems or networks. We also said that, in some cases, we'll allow an attack to continue because we're
trying to gather evidence, we're trying to learn what's actually been going on so hopefully we can
improve our response and protection. The key points review. Incident management starts with
preparation, but then follows up with the ability, the watchfulness, so we detect any type of an
Once the incident has been detected, classified, and we've tried to contain it, we want to then
eradicate the problem. Eradication is where we remove the damage, the damaged system
or software, and rebuild the system, maybe from backups, making sure that we have a clean
backup that is not infected as well, and apply any patches that were missing that maybe allowed the
attack to happen in the first place. In some cases today, the problem is that many of the attacks will
actually affect the hardware itself, and there have been a number of cases, especially with
ransomware, where it's been necessary to replace the hardware because it's impossible to
reliably remove the infection that's in there. The idea of restoration is we want to get back to normal,
and of course, part of getting back to normal is to recover the things that are most important first. We
set out timelines and priorities for recovery. It's important, though, that we don't just get back to
normal and become re-infected. So we need to take steps to make sure that we've identified the
actual root cause of the initial infection and taken steps to prevent that from happening again. We've
talked a number of times about documentation and sometimes the documentation of the incident is
the most valuable thing we have. It outlines the steps and procedures we are to use in the recovery
process, but then it also documents what we did so that we can make sure that we can review it,
what went well, what could be improved, are there decisions that would have been easier to make if
we'd had more information, for example. So we keep this documentation in order to assist in
reviewing the feedback, and of course, future incidents. If we've already addressed this problem
once, it's really good if we know how we did it and we don't have to reinvent the wheel and try to find
out how to make that same repair again or repeat even the same mistakes again. Reporting is
important. We should obviously report when the incident is over so that everybody knows that
this is now finished and completed. But part of the report should include our analysis and
assessment of the incident, what caused it. It could be more than one thing. It could be many small
things, not one big thing. We often say that the problem is that organizations look too much for the
trigger, but the trigger was just the spark that started it. That was a small part. There were many
other things that led up to the incident before maybe that spark or trigger happened. We document
and report on what we did. How did we fix the problem? And certainly from all of that, we assess
how the staff responded as well. Not everybody is good during a time of stress, and we want to know
who are the people that do work well and excel when it's a time of stress, so those are key people on
our teams. All of this should result in lessons learned. Now the problem, of course, with many
organizations is that by the time the incident is over, they didn't document anything, and therefore,
they don't learn what they could have learned from it. The key points review. We need to have
incident response plans because incidents will happen, so it's a critical capability required for every
business, but we also need senior management support when there is an incident. It's not that
everybody is guessing what should we do, but the senior management supports the plans we have
in place. We know that the plans should be detailed and action-oriented and should list the
procedures we will follow and it should be required that everybody follows those procedures. All of
the team members should be properly chosen, trained, and equipped to be able to do their job in a
crisis time. And certainly, incident response should link to our other plans as well, such as business
The final step in this incident response and incident management module is to review what we
learned from this incident. In other words, we conduct a post incident review and apply lessons
learned. When we review, we should look at what went well. We certainly want to continue the things
that went well, but we also want to do a very truthful self-assessment of what could be
improved. We want to know who demonstrated competence and the appropriate demeanor. Did
people get angry and argue during the middle of the crisis? Who were the ones that displayed
leadership, the ability to make good decisions, rational decisions, in the middle of chaos? One of the
things we'll sometimes do is we'll do a review right following the incident when the emotions are still
high, everybody's still a little bit, shall we say, agitated, and that's often called a hot wash. Let's
hear, right now, what happened. The next step is to do a cold wash, to go back later and look at it in
the cold light of dawn, and now that people have had a chance to recover, think about it, and sort of
say, okay, what do we think now that we've had a little more time to reflect on it? Both are important
because sometimes in the cold wash, we can have lost some of the things that we knew about at the
time. But in a hot wash, we didn't always use the most rational thinking either. The idea of lessons
learned is to improve our preparation, improve our plans, improve our teams, make sure we have
the right tools and training that are needed, improve our prevention through things like enhanced
controls and improve our detection. I remember talking with one company that had a major breach,
and as they said, the one thing that they learned was they weren't even monitoring the right things.
They had monitored many things, but they didn't monitor the things that would have told them about
that breach. And certainly, we have to look at whether or not our containment really worked. Was it
an effective response? A lot of this comes down to awareness, letting people know what we can
learn, what they can do, certainly making the whole situation alive for them as well and address the
lessons learned through our various awareness sessions. One of the things is that we want
everybody in the staff to be a part of our security team and have a security culture so they are
conscious of the types of threats that are out there and know what to watch for. In summary, every
incident contains key learning points that the organization can learn from. We often say the problem
is trying to extract those small little flakes of gold from the mountain of rubble of the actual incident
itself. We want to improve our incident response so we're better prepared for future incidents.
Business Continuity
Let's continue with this Business Continuity, Disaster Recovery, and Incident Response for the
Certified in Cybersecurity certification with a more detailed look at business continuity. Earlier on, we
saw this definition, business resilience, a common word being used today, and it can be defined as
the ability to continue operations, even during adverse circumstances, so this is the heartbeat or the
main thrust of some type of business continuity program: continuing operations, not just recovering.
We saw before that incident response is very often the first step, but when it's a severe incident, it
might trigger the need to implement and start to use business continuity plans. The outcomes of the
business continuity management system were to have an incident response plan focused on life
safety, containment, documentation, and return to normal, but then to have a business continuity plan
focused on business impact analysis, critical business functions, recovery time objective, the data
recovery point objective, and the recovery requirements. When we looked at disaster recovery
planning, we're looking at a catastrophic event that meant we had to relocate IT and other services
to an alternate location. Business continuity is just simply project management. It starts with project
initiation, then moves on to business impact analysis. Based on the business impact analysis, we'll
select our recovery strategy. Then we write plans for how to implement that recovery strategy in the
event of a serious incident, but we know that all plans need to be tested. We need to roll it out,
communicate it so that everyone is aware of what to do in a crisis, and certainly through testing, we
train our staff, and we also find any flaws in the plan. Every type of use of the plan, whether it's a test
or a real incident, will allow us also to learn more about how to make the plans better and maintain
the plan. The heartbeat of business continuity is understanding the business, and this is a process
known as business impact analysis, or BIA, and it could easily be said this is the critical
and most important step in the actual business continuity planning process. Through business
impact analysis, we identify what is critical, the critical business functions, processes, for example,
that are going to have the most impact on the profitability, the reputation, and operations of the
organization. Some departments are more important than others. For a while, I worked in internal
audit, and believe me, we weren't a critical process. Most of the business thought they'd run better
without us, but the ones that are important need to be identified so that's where we set our priorities.
We also need to know what are the critical supporting processes in order to support those critical
business functions. In other words, the dependencies that critical business functions have on
supporting processes. When we want to recover a business process, we need to know what we
need in resources, people, data, facilities, equipment, and supply chain. The BIA allows us to
determine our priorities for recovery. Let's look at how this all works. We have the element of time
and business impact analysis is all about impact over time. In that way, it's different from risk
management because when we looked at risk management back in the Security Principles course,
we saw that risk was based on impact and likelihood. So here, we're looking at impact over time, so
very much an overlapping type of supporting process, but slightly different from a risk assessment.
Over time, the business is running as normal, normal operations, but then one day, we encounter a
crisis. As a result of that crisis, our level of business drops to 0. We're no longer producing a product,
we're no longer meeting our mission. Now, immediately we should start to determine what is the
impact of that inability to operate our business over time, and we can see that that quite often will
grow kind of exponentially at the end. Over the first few hours, people understand if we've got a little
bit of an outage, but the longer it goes, the greater the damage to our reputation and finance
becomes. Now, this is different for different business processes. Obviously, if this is the life support
system, this is measured in minutes, not in hours or days. One of the things we try to determine
through all of this is when the level of impact would be high enough that we actually encounter
business failure, the business has to shut down. We are unable to continue business operations.
We've lost the confidence of our customers, our owners, our bankers, for example, and that point in
time at which we would encounter business failure can be called the maximum tolerable downtime.
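One way to picture the maximum tolerable downtime is as the first point on the impact-over-time curve where the impact reaches the level the business cannot survive. The sketch below assumes a made-up impact curve and a made-up failure threshold; in practice both would come out of the business impact analysis, and the function names are hypothetical.

```python
from typing import Callable, Optional


def maximum_tolerable_downtime(
    impact_at: Callable[[int], float],
    failure_threshold: float,
    horizon_hours: int,
) -> Optional[int]:
    """Return the first whole hour of outage at which impact reaches the
    failure threshold, or None if it is not reached within the horizon."""
    for hour in range(horizon_hours + 1):
        if impact_at(hour) >= failure_threshold:
            return hour
    return None


def example_impact(hours: int) -> float:
    # Hypothetical curve: losses start at 1,000 and double every 8 hours of
    # outage, so the damage grows faster the longer the outage lasts.
    return 1000.0 * 2 ** (hours / 8)


# Assumed failure point: the business cannot survive an impact of 60,000.
mtd_hours = maximum_tolerable_downtime(example_impact, 60_000.0, 24 * 7)
print(mtd_hours)
```

A life support process would simply use a much steeper curve and a time step of minutes rather than hours.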
Sometimes we'll hear that called the maximum tolerable period of disruption. In the old days, we
used to hear it called maximum allowable downtime. I think sometimes they change the name just to
keep us all a little confused. So we look at all the business processes of the organization. We said
that some are more critical than others, and we want to know what are the critical supporting
processes for each of the critical business processes as well. We'll quite often then group them. There is
no way to recover a business process without also recovering its supporting processes, so our
recovery plan should look at recovering both of them, should we say, concurrently. We can say it
simply this way, you cannot recover essential services without recovering supporting processes. One
of the things we need to learn is what will our owners, what will regulators, and what will our
customers tolerate? These would be tolerable levels of outage. We all know that, in some cases, the
customer will say, oh yeah, sure, your systems are down, I'll call back in an hour. In other cases, we
will lose the customer. So this is where we have to understand what our customers expect. Are there
regulations that say we must provide a certain level of service bound by say, government
regulations? All of these can help us determine the point of business failure, something we called
before the maximum tolerable downtime for those critical processes and their supporting processes.
Then we want to determine what is our ideal time of recovery, and this is known as the recovery time
objective, and we'll have different recovery time objectives for different processes. The requirement,
of course, is that the recovery time objective must be, in fact, we could say, significantly less than
the maximum tolerable downtime. I don't want to write a plan that would have me recover my critical
business process an hour before the business would fail. The other thing we have to look at is the
recovery point objective, and I always call it the data recovery point objective because what this
refers to is what is my data recovery point. I'm really saying that if I have a major interruption, how
much data can I afford to lose. So really what this measures is the amount of data that can be lost in
the case of an outage and how old the data would be when it's restored. When we looked at the
resource requirements, we need to identify what would be required in order to restore systems. Now
that, as we said, also included some of our supporting processes, our dependencies, but also it
includes things like the controls we put in place that could be added to try to make sure that this
doesn't just happen again right away. So let's go back to that diagram we looked at before. The idea
here of BIA was that we determine what was the level of impact over time until the point of business
failure. Then we want to say, okay, what would it cost for us to recover the business? Now, the cost
of recovery is often the inverse of the duration of the outage. In other words, I could have a very
minimal amount of, should we say, outage time, but then the cost of the recovery is very high. So in
most cases, instead, we will try to find more of that crossover point at which point we could say the
cost of recovery is sort of, I should say, in line with the impact. This is where we want to set our
recovery time objective. So we write plans to try to recover these critical business processes by this
point in time. But when I recover, say after a fire that wiped out my head office, I have to go to my
data backups and maybe I did data backups on a regular basis, but the time of the failure was not
the same as the time of my last data backup. So I, when I rebuild my systems, am going to have to
use the most recent backup I have, which quite simply means that quite likely all of the data from the
time of the last backup until the time of the crisis will actually be lost data. All of this allows me to set
out my priorities and plans for recovery. I establish the priorities for system recovery based on
cost, as well as the level of impact to the business. And of course, I must have a plan which is
feasible, not unrealistic; I can't recover a major system in a few minutes. It must be something which
is acceptable, acceptable to, should we say, our customers, our owners, management, something
which is suitable for the type of business we're in. And of course, this is something that quite often is
a little bit contentious. We'll have a lot of different people think, well, my department is most
important so you should recover my department first. In the end, we need to go back to senior
management and hope that they will approve the actual choices we've made for which parts of the
business should be recovered first. The key points review. It's business impact analysis that provides
us the information we need in order to move ahead with selecting our recovery strategies and writing
plans. It's critical to the business continuity planning process. It identifies all of the critical business
processes, documents the resources required to restore those processes, and gives us now the
ability to choose restoration timelines. It sets out our priorities, and through this, helps us to move on
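As a rough illustration of the targets discussed in this part of the course, the sketch below checks that a recovery time objective sits below the maximum tolerable downtime by some margin, and measures how much data would be lost between the last backup and the failure. The class and function names, the four-hour safety margin, and the sample figures are all assumptions made for illustration; they do not come from the course.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryTargets:
    process: str
    mtd: timedelta  # maximum tolerable downtime: the point of business failure
    rto: timedelta  # recovery time objective: when we plan to be back up
    rpo: timedelta  # recovery point objective: how much data we can afford to lose


def rto_is_safe(targets: RecoveryTargets,
                safety_margin: timedelta = timedelta(hours=4)) -> bool:
    """The RTO should be significantly less than the MTD.
    'Significantly' is modelled here as an assumed fixed margin."""
    return targets.rto + safety_margin <= targets.mtd


def data_loss_window(last_backup: datetime, failure: datetime) -> timedelta:
    """Everything written between the last backup and the failure is lost."""
    if failure < last_backup:
        raise ValueError("failure time precedes the last backup")
    return failure - last_backup


def meets_rpo(targets: RecoveryTargets,
              last_backup: datetime, failure: datetime) -> bool:
    return data_loss_window(last_backup, failure) <= targets.rpo


# Hypothetical figures for one critical process.
orders = RecoveryTargets("order processing",
                         mtd=timedelta(hours=48),
                         rto=timedelta(hours=24),
                         rpo=timedelta(hours=1))
backup_time = datetime(2024, 1, 1, 0, 0)
failure_time = datetime(2024, 1, 1, 9, 30)
print(rto_is_safe(orders))
print(data_loss_window(backup_time, failure_time))
print(meets_rpo(orders, backup_time, failure_time))
```

With a nightly backup and a failure at 9:30 the next morning, nine and a half hours of data are gone, which is far more than a one-hour recovery point objective allows. That gap is what pushes an organization toward more frequent backups.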
One of the essential resources required to recover our IT systems today is really data, and we have
to be prepared so we are able to recover the data if we have a major outage. That means we have
some type of data preservation plan. We use this term the recovery point objective. The recovery
point objective means we won't lose too much data and that recovery point objective determines, or
in many cases, influences or drives what our backup strategy should be. Do we back up our data to
the cloud so that it should be available from an offsite location if our head office burned down? Do
we actually have some type of storage area network with say internal hard drives or some type of
removable storage that we could put off in a secure location, say every day? Do we mirror our data
on two different, should we say, systems, maybe even geographically dispersed locations? Do we
take all of our data, say once an hour and write it off into an electronic vault or maybe every 1000
transactions so it goes offsite, and if there was a problem with the primary site, well the most I would
ever lose is that 1000 transactions that have happened since the last time I did a vault. When we
deal with databases, whenever we make a change to a database, we write a little journal entry that
allows us to recover the actual changes made to the database, even if the database was corrupt or
failed. The thing is that if that journal is just kept on the same system that failed, it's probably going to
be lost as well. So we will write that journal off to another location. We'll take a full database backup
on a regular basis, and we can apply those journals to bring the database right up to the time of the
failure, minimizing the amount of actual data loss. We want to build our systems to be resilient. That
means quite often fault tolerant. We put in things like, for example, duplication and redundancy of
equipment and networks so that if one failed, the others will still be able to keep going, and one of
the solutions to that is a cluster. Maybe I have a number of servers working together and all of them
sharing the load. If one goes down, the others just keep on processing and should have a very
minimal impact on our customers and users. We build high-availability systems, systems where
we've built in the ability to failover if a piece of equipment fails, for example. We also make sure we
have the appropriate levels of quality of service, which ensures that we have the bandwidth and the
storage we need for our processing to actually be handled. In summary, in this module, we set
out the foundation for continuity of operations. Our goal is to ensure the organization is prepared to
deal with and manage disruption to business mission and operations. This is so that we can sustain
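As a rough illustration of the database journaling idea from this module, here is a minimal sketch in Python. It assumes a toy key-value "database", and the names JournalEntry, restore, and backup_seq are hypothetical, made up for this example rather than taken from any real database product. It shows a full backup being brought forward by replaying the journal entries that were written off to another location after the backup was taken.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JournalEntry:
    """One recorded change (hypothetical format): set `key` to `value` at sequence `seq`."""
    seq: int
    key: str
    value: str


def restore(full_backup: dict, backup_seq: int, journal: list[JournalEntry]) -> dict:
    """Rebuild the database from the last full backup plus the offsite journal."""
    db = dict(full_backup)  # work on a copy so the backup itself is never modified
    for entry in sorted(journal, key=lambda e: e.seq):
        if entry.seq > backup_seq:  # only replay changes made after the backup
            db[entry.key] = entry.value
    return db


# Full backup taken after journal entry 2 (assumed example data).
backup = {"acct-1": "100", "acct-2": "250"}
offsite_journal = [
    JournalEntry(1, "acct-1", "100"),
    JournalEntry(2, "acct-2", "250"),
    JournalEntry(3, "acct-1", "80"),
    JournalEntry(4, "acct-3", "40"),
]

recovered = restore(backup, backup_seq=2, journal=offsite_journal)
print(recovered)
```

The point of the sketch is that the only data at risk is whatever changed after the last journal entry that made it offsite, which is why the journal should not live only on the system that might fail.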
Disaster Recovery
Let's continue looking at Business Continuity, Disaster Recovery, and Incident Response for the
Certified in Cybersecurity course. Now let's take a look at the third part of this, disaster recovery. We
looked earlier at this slide about the outcomes of a business continuity management system, and we
said that, of the three parts, incident response planning was, first of all, concerned with life safety,
containment, documentation, and getting back to normal. Business continuity planning was based on
the business impact analysis, the critical business functions, the recovery time objective, the data
recovery point objective, and the various recovery requirements. Now when we look at disaster
recovery planning, we're looking primarily at the relocation of IT and other services to an alternate
location. Our primary location has been damaged, we can't use it, so we need to recover by
rebuilding our systems and our processes, for example, at another place. When we choose that other
place, which we could call our recovery site, there are a number of factors that are used in
determining what is an appropriate recovery site. For example, how quickly do I need to recover?
If it's an 8-hour drive away, that's maybe not something that's going to work if I need to recover in 4
hours. So the recovery time objective drives the site selection, but we also know that if I need to
recover very quickly, it's probably going to cost me more as well. So in some cases, the fastest
recovery would be having redundant sites, where if one fails, the other is still running, but that doubles my
cost of operation. So quite often, we choose a less expensive option, such as a warm site. We also
have to look at how we are going to prioritize our systems recovery. We want to prioritize by
recovering the most critical business processes first. Now most critical could be from a financial
perspective or it could be from a reputational perspective as well. We also realize there are
challenges. If I have a recovery site too far away, it could be difficult to manage when I have
employees and systems at different sites based, of course, on process criticality. So the selection of
that contingency site is going to bring in a number of factors such as what would it cost, what's its
availability, can I be sure it's there when I need it, and will it help me meet my recovery time
objective? I want it close enough, but not so close that it could be affected by the same threat that
damaged my primary site, so proximity is a consideration. We want to have a site which is secure so
we don't have to worry about other problems, for example, relocating to a site which itself would be
under an immense threat. We also have to worry about employees. They need to get to that site,
and logistics is often missed in disaster recovery plans. How can my employees get to this alternate
site? If they have to work there for the next 6 months to a year, that may not be so easy if that site is
hours' drive away and there is no public transit available. We also want to make sure we have
support, whether we're discussing power, fire, police, ambulances, or food; all of these are
Writing the plan. Now, we usually say writing the plan, and we often use the singular, a business
continuity plan, but for a large organization there are quite often 100 different plans. Our recovery in
the case of a fire, for example, is very different than it is in the case of malware. So we
write plans to deal with the various types of situations we could expect to face. A plan should be
thorough; it should address all types of situations. Yes, there can always be things that
happen that we didn't expect, but if I've written good plans, those could be adjusted to whatever type of
incident this is. We get the team together because we want the business continuity plan and disaster
recovery plans to address all areas, not just IT and not just the business; we have to look at
everything from finance to operations and logistics. A plan should be a series of steps and actions.
We should try to minimize verbiage. We don't want a person to have to read pages of documentation
in the middle of a crisis. Instead, we want them to read it and say: do this, check it off, then do this,
check it off, and all these things mean we move towards the actual resumption of business processes.
We should write the plan for what we often call a worst-case scenario, the most resource-intensive
situation, because then we can always use a part of the plan if it's not a worst-case scenario, and that
means that by planning for a worst-case scenario, any type of lesser incident or situation would still be addressed
in that plan. One of the problems we have is that during a crisis, we have an elevated level of risk.
We know that many of the normal controls we would have had in place, separation of
duties, for example, are missing. We have people making decisions that go beyond what was their
normal budgetary authority. So this is an elevated security risk as well that we have to watch for. We
want to have teams that are ready to go. We assign roles and responsibilities, as well as, of course,
the leaders, but for every leader, there should be a deputy, a person who can fill in if that leader is
not available. Ideally, we want to have people on the teams who understand more than just their area
so that if another team is in some way impaired from being able to do its job, there is
cross-training and there is support that can be provided. An important thing in a crisis is to have clear
leadership and lines of reporting. We define who's in charge, who makes the decisions, who talks to
the media, so that we have good and clearly understood reporting relationships and it's not the case that
everybody's just doing whatever they think is best. We need to ensure that the people on our teams
have the appropriate training so they can execute their responsibilities, as well as the tools they
would need in order to do their job. An important part in any crisis is communication, communication
with our employees, managers, our customers, all of the stakeholders, or in other words, all of the
people who could be affected by this crisis. We want management to know what's going on so they
can provide direction and certainly answer questions from the media. We quite often have to report
to government and regulatory agencies, let's say if we had a spill of diesel fuel or some other type of
our customers so they have confidence that we are there to support and help them and that we are not
going to disappear and leave their warranties worth nothing. This is especially
important when we're dealing with a privacy breach. We want all of our customers to be confident
that we have done everything we can to protect their information, but also that we're being upfront
about what has happened and how we will prevent that from happening in the future. We need to
communicate with our suppliers. We quite often rely on them for some of the raw materials we'll need,
and we don't want them to stop shipping those products because they're afraid they'll never get paid.
And of course, our shareholders. By law, in many cases, we have to communicate with our
shareholders all at the same time so they're all aware of what's going on if this is something that
could affect share price. When we talk about reporting, we want to do regular reports on the status of
the crisis to management, and this quite often can be done through an emergency operations center,
the heartbeat or control point where we'll actually manage all the various teams and activities, and
from this point, we can communicate to management what's going on. We should have checklists, since
our plans are action-oriented, so we can show milestones and the progress we've made towards
addressing various types of systems or issues. And then, of course, we want to get back to normal.
We will call this the process of restoration. To restore to normal means I will recover the business
functions at whatever is now going to be my primary site. Now, normally, when we recovered after
the incident, we recovered our most critical business processes first, but when I restore, I'm going to
bring back the less important areas first. That will allow me to test my migration plan, my networks,
and my systems before I jeopardize my most critical business processes by trying to move them into
whatever the new normal is going to be. No plan can be trusted unless it's been tested, and we do
tests of the plan with the intention of finding any deficiencies. The point of the test is to find
something that could go wrong so we can fix it before the incident. The testing also helps us to train
our staff so they develop skills and know how to respond effectively. The test should be thorough,
they should be as accurate and realistic as possible so we know that this is how things would work in
a real world situation. When we test, it's always good to start small. Do some little tests of just
individual processes before we move on to more complex types of tests. One of the problems is that
very often, from any incident and from any test, there are lessons that have
been identified but never acted on. It is important that those become lessons learned. We apply what we learned so we
improve and the same problem doesn't just happen again. In summary, in this module, we set out the requirements
for disaster recovery planning. This is for the most serious types of incidents that would require
relocation of operations.
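To make the difference between recovery order and restoration order concrete, here is a minimal sketch in Python. The BusinessProcess class, the criticality scale (1 meaning most critical), and the sample processes are all assumptions made up for this illustration. It simply shows that recovery at the alternate site goes most critical first, while restoration back to the primary site goes least critical first, so the migration is tested before the critical processes are moved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BusinessProcess:
    name: str
    criticality: int  # hypothetical scale: 1 = most critical, larger = less critical


def recovery_order(processes):
    """Order for recovering at the alternate site: most critical first."""
    return sorted(processes, key=lambda p: p.criticality)


def restoration_order(processes):
    """Order for moving back to the primary site: least critical first."""
    return sorted(processes, key=lambda p: p.criticality, reverse=True)


# Assumed example data, not taken from the course material.
processes = [
    BusinessProcess("payroll", 2),
    BusinessProcess("order processing", 1),
    BusinessProcess("internal newsletter", 3),
]

recovery_names = [p.name for p in recovery_order(processes)]
restoration_names = [p.name for p in restoration_order(processes)]
print("Recover: ", recovery_names)
print("Restore: ", restoration_names)
```

In practice the criticality ranking would come out of the business impact analysis rather than being typed in by hand.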
Domain Summary
Congratulations on completing the Business Continuity, Disaster Recovery, and Incident Response
for the Certified in Cybersecurity examination. Let's do a quick summary of the important things we
covered in this domain. This domain is worth 10% of the examination. It looked at these three areas
and how they relate to each other and how they ensure that our systems will be available for
business to operate in a secure manner. The first step in all of this really is incident response. We
deal with incidents as they happen, some may be major, some may be minor, but sometimes we
need to then also invoke a second process, that of business continuity. That is when the duration of
an incident would exceed acceptable timelines, and we need to take steps to keep the business
going, hence the name business continuity. One of the things that we often have to do when there
has been a major disruption is recover things like IT services, and that is why we also have disaster
recovery, often seen as the recovery of IT, even at an alternate location, which in many ways is
kind of a subset of business continuity. We have to remember that when something happens, the
first priority is always life safety. We want to make sure that people are safe, and therefore, that is
the first thing we must address. We can look at how NIST, the National Institute of Standards and
Technology, defined all of these areas of incident response as, first of all, being prepared. We're prepared, we
know what to do, so when we detect something, we already have a plan. We execute that plan to
try to contain the incident and recover from what actually happened. And sometimes as we're trying
to contain, we learn more, and we do more detection and analysis until we have finally contained and
eradicated the problem and we can do a review: what did we learn? What we learned as part of
post-incident activity can help us be better prepared for next time. When we looked at business
continuity, we defined a number of key parts of what we're trying to do. We often don't try to recover
everything. We set a priority on critical business functions first, and we do this through that process
we called business impact analysis, in other words, analyzing what that impact of an outage would
be on the business. We also had to determine what our drop-dead deadlines were, the maximum
tolerable downtime, and that is the point by which we had to recover or else maybe we could be out
of business altogether, but that wasn't our goal for recovery. Our goal for recovery was based on the
recovery time objective, that's when we wanted to recover by, and we set that so that we could put in
place a plan to help us to recover the critical business functions by that point in time. We looked at
disaster recovery as recovery of operations at an alternate location which included, of course, the
recovery of the data we needed for the business to run, the personnel required, the equipment that
we required for our business to operate, and of course, looking at things such as where that location
could be as well. So here we've looked at these three important points worth 10% of the exam, and
we can move on to our next steps. Review each of these areas, make sure we understand them and
didn't just memorize them, do the sample questions, for example, to ensure we really have
understood the concepts behind them, and then proceed to the next domain, Access Control
Concepts.