System Monitoring
Protocols like SNMP give us the opportunity to perform network monitoring.
But in this episode I want to drill down a little bit and talk about this: if we
have these tools, what are we really looking for? What are we monitoring? What do
we want to know about our networks and our individual hosts to make sure
everything's up and running?
So let's just go through some basic concepts.
Some of the stuff that you'll probably see on the exam.
Now keep in mind there's a lot more.
This is just some of the core things I want you to consider when you're doing
network monitoring.
So what I've got here in front of me is a very popular free tool called Zabbix.
This is an SNMP tool, and by the way, I like Zabbix, but there's a lot of great
ones out there that you pay for.
For example, SolarWinds is a company that makes incredibly powerful tools.
You've got to pay for them.
But in an enterprise environment, who cares? It's actually not that much money to
have the right kind of tools to make sure your network's running.
So anyway, back to the screen. What we're taking a look at here is Zabbix's front
end, which we use to monitor our networks.
So let's just take a little peek around and see what we got here.
First of all, over here you'll see that I've got some errors that are showing up
right now.
So what's happening here is I've got two of what you could call alerts.
These guys simply call them problems, and it gives me some idea of what some of my
problems might be.
Now let's take a really close look here. One of these says one of the processes is
more than 75 percent busy on my Zabbix server itself.
How does it know that that's a problem?
Well, what happened here is I set a baseline. I said that for my particular
server, CPU utilization when it's running normally, which is what a baseline is
all about, should be no more than about 70 percent.
So what's happened here is this one has gone up. It's at 75 percent.
And because I established a baseline that said what it should be, it sees that
there's an exception to that rule, and it's now coming back to me and saying I've
got a problem.
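To make the baseline idea concrete, here is a minimal sketch in Python of that kind of comparison. It is not how Zabbix implements its problems; the `Baseline` class and `check` function are hypothetical names made up for illustration, and the only numbers used are the 70 percent baseline and 75 percent reading from the example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Baseline:
    """Hypothetical record of what 'normal' looks like for one metric."""
    metric: str
    max_normal: float  # highest value still considered normal, in percent


def check(baseline: Baseline, observed: float) -> Optional[str]:
    """Return a problem message if the reading exceeds the baseline, else None."""
    if observed > baseline.max_normal:
        return (f"PROBLEM: {baseline.metric} at {observed:.0f}% "
                f"exceeds baseline of {baseline.max_normal:.0f}%")
    return None


cpu_baseline = Baseline("CPU utilization", 70.0)
print(check(cpu_baseline, 75.0))  # above the baseline, so a message comes back
print(check(cpu_baseline, 55.0))  # within the baseline, so None
```

The point is simply that without the stored baseline there is nothing to compare the 75 percent reading against, so nothing could be flagged.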
So let's take a look right below there.
Now if you look here, this is pretty much my primary switch.
I've got one switch, one router, and a server connected to that switch.
And on the switch itself, it's looking at one very specific connection on that
Gigabit Ethernet switch.
So it's 0.18, and I've got some information.
This has happened before a few times.
So it says high error rate. One of the things we run into when we're talking about
monitoring is we have to deal with certain types of metrics, the types of
information that we're looking for.
So error rate is a big deal. When we're talking about error rate, we're talking
about frames or packets (this is a switch, so it'll be frames) that are malformed,
broken, fractured. Something's going on with what's coming in, it's above my
baseline value, and it's telling me that I've got a problem.
So error rate is a big issue:
what percentage or what amount of frames or packets coming into my device are
physically messed up or not formed the right way?
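Here is a rough sketch in Python of how an error rate could be worked out from two polls of an interface's counters. The dictionary keys `frames_in` and `errors_in` are hypothetical names, and the sketch assumes `frames_in` counts every frame received, good or bad. Real SNMP agents expose similar counters (IF-MIB has `ifInErrors`, for example), but the exact objects a given tool polls are not something this sketch claims to reproduce.

```python
def error_rate_percent(prev: dict, curr: dict) -> float:
    """Percentage of frames received between two polls that had errors.

    Assumes both polls carry cumulative counters, so the difference
    between them is what arrived during the polling interval.
    """
    frames = curr["frames_in"] - prev["frames_in"]
    errors = curr["errors_in"] - prev["errors_in"]
    if frames <= 0:
        return 0.0  # no traffic in this interval, so no rate to report
    return 100.0 * errors / frames


# Two made-up polls of the same switch port.
first_poll = {"frames_in": 1000, "errors_in": 2}
second_poll = {"frames_in": 2000, "errors_in": 52}

rate = error_rate_percent(first_poll, second_poll)
print(f"error rate: {rate:.1f}%")  # 50 bad frames out of 1000 received
```

A rate like this only becomes an alert once it is compared against whatever value you decided was normal for that port.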
The other one is going to be utilization. When we talk about utilization, we're
really talking about CPUs.
So let's see if I can pull up utilization on this guy.
All right.
So what you're actually looking at here is the CPU load on this particular domain
controller.
Now you'll notice the way I've got this one set up. By the way, these graphs
don't just magically appear. I have to configure each one of these depending on
what metrics I'm interested in.
So we take a look at this, and I've got three different levels here.
I've got a one-minute average, a five-minute average, and a 15-minute average.
So as you can imagine, the one-minute average is really spiky.
And then the five-minute average is a little less spiky.
And then the 15-minute average is fairly smooth.
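A small Python sketch shows why the longer averages look smoother. It uses a plain moving average over per-minute samples, which is an assumption made for illustration; operating systems and monitoring tools may compute their load averages differently (often with exponential damping). The `RollingAverage` class and the sample values are made up.

```python
from collections import deque


class RollingAverage:
    """Simple moving average over the last `window` samples."""

    def __init__(self, window: int):
        self.samples = deque(maxlen=window)  # old samples fall off automatically

    def add(self, value: float) -> float:
        self.samples.append(value)
        return sum(self.samples) / len(self.samples)


avg1, avg5, avg15 = RollingAverage(1), RollingAverage(5), RollingAverage(15)

# Fourteen quiet minutes followed by one sudden spike.
loads = [1.0] * 14 + [16.0]
for load in loads:
    one, five, fifteen = avg1.add(load), avg5.add(load), avg15.add(load)

print(one, five, fifteen)  # 16.0 4.0 2.0
```

The same spike shows up at full height in the one-minute line, much lower in the five-minute line, and barely at all in the 15-minute line, which is the spiky-to-smooth pattern on the graph.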
So what I'm doing here is kind of keeping track of how the CPU load on this
particular domain controller behaves.
The trick here is that everything's running great right now.
So what I'm doing is I'm establishing a baseline.
This is what I would expect it to do.
Now I can look at this graph over time and I could say something like, OK, if my
CPU load gets above 400K right here, I can go ahead and set up an alarm, some form
of alert, and then have a notification. That notification can show up, as we saw a
moment ago, on my dashboard as just a little flashing light.
I can get a text message on my phone if I want.
I've even got an application I can run on my smartphone and basically have a mini
dashboard with notifications and all that.
So there's a million ways to notify.
It's just a matter of which one you want to configure for yourself.
So when we're talking about utilization, we're really talking about CPU. The other
one I want to talk about is packet drops.
The reason packet drops are an important metric is that packet drops
measure the amount of packets that a particular device can't handle.
All devices have buffers, and switches are the best example. There's a certain
point where they're getting so much traffic that they begin to get a buffer
overflow and they begin to drop packets.
Now that isn't going to stop communication, because if something gets dropped it
will be resent by whoever sent it.
But it is giving us a sense that that particular device is getting a little bit
overloaded. Now that can be for the entire device, or it could be on a per-port
basis, depending on how you've got your setup configured.
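The buffer idea can be sketched as a toy model in Python. This is not how any real switch is built; `PortBuffer` is a hypothetical class with a made-up capacity, and it simply drops whatever arrives once the buffer is full, counting each drop so a monitor could read the total.

```python
from collections import deque


class PortBuffer:
    """Toy model of one port's buffer that drops packets when it is full."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.queue = deque()
        self.dropped = 0  # the counter a monitoring tool would watch

    def receive(self, packet) -> bool:
        if len(self.queue) >= self.capacity:
            self.dropped += 1  # buffer overflow: the packet is discarded
            return False
        self.queue.append(packet)
        return True

    def forward(self, count: int) -> int:
        """Send up to `count` queued packets out, freeing buffer space."""
        sent = 0
        while self.queue and sent < count:
            self.queue.popleft()
            sent += 1
        return sent


port = PortBuffer(capacity=4)
for n in range(6):          # a burst of 6 packets hits a 4-packet buffer
    port.receive(n)
print(port.dropped)         # 2 packets did not fit

port.forward(2)             # the port drains 2 packets
for n in range(6, 9):       # 3 more arrive, but only 2 slots are free
    port.receive(n)
print(port.dropped)         # 3 drops in total
```

A rising drop count with traffic still flowing is exactly the "getting a little bit overloaded" signal: communication continues, but the device is running out of room.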
The other big one is bandwidth. Bandwidth just basically says how much data am I
moving per second.
This can be in number of packets.
This can be in bytes or megabytes or gigabytes, whatever's right for the system.
But one very important metric for anybody is understanding how
much data we are moving.
For example, I've got a web server on here.
Now this web server is pretty new, so I want to be able to watch it. As people
begin to access this web server more and more, my bandwidth should be going up,
and that's a good thing.
It means I'm making money. But by the same token, I might reach a point where I'm
starting to max it out.
I may want to consider bringing in another web server. I may want to consider
increasing the size of my pipe to that web server. But it's very important that we
measure that bandwidth or throughput to make sure we understand what's going on.
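Here is a short Python sketch of how throughput can be derived from a byte counter read at two points in time, and how close that is to maxing out the pipe. The function names, the sample counter values, and the 1 Gbps link speed are all assumptions for illustration. The modulo step is there because cumulative counters eventually wrap back to zero; it only copes with a single wrap between polls.

```python
def throughput_bps(prev_bytes: int, curr_bytes: int,
                   interval_s: float, counter_bits: int = 64) -> float:
    """Bits per second moved between two polls of a cumulative byte counter."""
    delta = (curr_bytes - prev_bytes) % (2 ** counter_bits)  # survives one wrap
    return delta * 8 / interval_s


def link_utilization_percent(bps: float, link_speed_bps: float) -> float:
    """How much of the link's capacity the measured throughput uses."""
    return 100.0 * bps / link_speed_bps


# Made-up example: 75,000,000 bytes moved during a 60-second polling interval.
bps = throughput_bps(1_000_000, 76_000_000, 60)
print(bps)                                             # 10000000.0, i.e. 10 Mbps
print(link_utilization_percent(bps, 1_000_000_000))    # 1.0 percent of 1 Gbps
```

Tracked over weeks, a number like this is what tells you whether the new web server is still comfortable or whether it is time to think about a second server or a bigger pipe.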
All right.
The other thing, and I don't have an example for it here so I'm just going to
mention it, is the other big issue that comes into play: what we call file
integrity monitoring.
I've got a lot of critical files.
Now let's say I've got a database, for example, and I've got some really big SQL
files and I need to keep an eye on them.
If these files reach a certain size, for example, I could run into trouble. And if
these files have a problem, at least in terms of the hash values that let me know
the file is in good order, I need to know about it.
So file integrity monitoring is a very, very important tool, but it tends to be a
little bit more specialized, in terms of a database or something where you're
looking at individual files.
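As a minimal sketch of the idea, the Python below records a file's SHA-256 hash and then checks the file later against that hash and a size limit. The function names, the size limit, and the sample file are hypothetical, and a hash comparison like this suits files that are expected to stay unchanged; it is only meant to illustrate the two checks mentioned above, not any particular product's approach.

```python
import hashlib
import os
import tempfile


def sha256_of(path: str) -> str:
    """Hash the file in chunks so large files don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_file(path: str, expected_hash: str, max_bytes: int) -> list:
    """Return a list of problems found; an empty list means the file looks fine."""
    problems = []
    if os.path.getsize(path) > max_bytes:
        problems.append("file exceeds size limit")
    if sha256_of(path) != expected_hash:
        problems.append("hash does not match recorded value")
    return problems


# Demonstration with a throwaway file standing in for a critical file.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "critical.dat")
    with open(path, "wb") as f:
        f.write(b"known good contents")
    recorded = sha256_of(path)                       # taken while all is well

    print(check_file(path, recorded, max_bytes=1024))   # []

    with open(path, "ab") as f:
        f.write(b" plus an unexpected change")
    print(check_file(path, recorded, max_bytes=1024))   # hash mismatch reported
```

Recording the hash while everything is known to be good plays the same role here that the baseline plays for CPU or error rate.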
So when it comes to monitoring,
keep in mind that you've got a lot of choices, and SNMP runs underneath all of
these different tools. You'll see guys get into fistfights trying to decide
which monitoring tool they prefer over the other.
But keep in mind it doesn't matter. You're all going to have baselines, you're all
going to have metrics, and it's just a matter of personal choice.
Abnormal warnings of high error rate or utilization might signify security breaches
or broken equipment
A baseline helps identify irregular activity that needs to be investigated
File integrity is an important part of a monitoring program