Best Practices v1.0

The document outlines best practices for VMware vSphere environments, covering various settings such as ESXi images, host hardware requirements, and VM configurations. It emphasizes the importance of proper resource allocation, security measures, and efficient storage management, while also detailing recommendations for networking and monitoring. Key recommendations include using the latest VMware Tools, enabling DRS and HA, and avoiding shared disk clusters.

Uploaded by Suresh

S.No Setting

1 ESXi Image

2 Host Hardware Requirement

3 vSphere CPU Scheduler

4 Timekeeping synchronization

5 VMware Tools

6 vCenter Server Appliance

7 Virtual disk layouts simple

8 DRS Affinity Rules

9 EVC Usage
10 Avoid using shared disk clusters (i.e., RDM)

11 Distributed Resource Scheduler

12 vSphere High Availability

13 VM Snapshot

14 vHardware Versions

15 ESXi Host Performance and Power Management

16 Fault Tolerance Usage


17 VMware Thin Provisioning

18 VM Hot-Add Usage

19 Lockdown Mode

20 VM Configuration

21 Transparent Page Sharing

22 Certificate

23 Datastore sizing

24 Storage Multi-pathing
25 Thin Provisioning and Datastore space

26 VASA Providers

27 Storage DRS

28 RDM Disks

29 Datastore Mapping and LUN ID

30 ATS & VAAI

31 VMCP

32 APD & PDL

33 SIOC
34 Drivers & Firmware

35 Distributed Switches

36 Network Discovery Protocol

37 Network State Tracking

38 vMotion – vLAN

39 Physical networking considerations

40 Physical Uplinks Design Considerations

41 DvSwitch Health Check


42 DvSwitch NIC Teaming

43 VM vNIC Usage

44 Drivers & Firmware

45 VMware Support Request Template

46 VM Antivirus / Any Scans on Virtual Infrastructure

47 Performance Metrics
48 Log Management

49 VM Backup Solution

50 Monitoring vCenter Events

51 VMware PowerCLI

52 vRealize Operations and Log Insight

53 VMware Patching
54 SATHC Usage

55 ESXi Active Directory Integration


Best Practice
Compute
• Use the ESXi customization image from the hardware vendor instead of the vanilla image from VMware
• Deploy one or more management VMs to be used for 3rd-party scripts and tools such as IBM SRM
• Monitor hardware status from the vCenter “Hardware Status” tab, available from the host inventory view, and configure
alerts in your event management system accordingly

• Hypervisor hosts should have 10+ cores per physical processor


• Hyper-threading cores should not be counted when planning the pCPU:vCPU ratio for capacity planning. (Whatever ratio is
considered the upper limit for a normalized workload, be sure to understand that some environments
may require a lower CPU ratio for adequate performance.)
• Consider the number of VMs per host when sizing your ESXi host hardware
• Use caution when sizing RAM in ESXi hosts (the 1.25:1 vRAM:pRAM ratio is a commonly used guideline)
• RAM to CPU sizing of 4-5GB per vCPU
• Hosts should have fully redundant hardware components
• Have enough network adapters and ports to provide fully redundant uplinks on your vSwitches (2 for Management, and 2 for
NFS, if used)
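The sizing rules above can be sketched as a simple check. This is an illustrative sketch, not from the source: the acceptable pCPU:vCPU upper bound is left as a parameter because the document does not fix a single value, and hyper-threads are deliberately excluded from the core count.

```python
# Sketch: capacity-planning check per the guidance above. Hyper-threading
# cores are NOT counted; pass physical cores only. The ratio limit is an
# assumed parameter, since the acceptable bound varies per workload.

def cpu_overcommit_ratio(physical_cores: int, total_vcpus: int) -> float:
    """Ratio of provisioned vCPUs to physical cores (hyper-threads excluded)."""
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return total_vcpus / physical_cores

def within_ratio(physical_cores: int, total_vcpus: int, limit: float = 4.0) -> bool:
    """True if the vCPU overcommit stays at or under the chosen limit."""
    return cpu_overcommit_ratio(physical_cores, total_vcpus) <= limit
```

For example, a 20-core host (hyper-threads ignored) carrying 40 vCPUs has a 2:1 overcommit and passes a 4:1 limit.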

• Security vulnerabilities can exist in a host’s physical processors, which are then advertised to a virtual machine's guest OS
• Notable vulnerabilities include Meltdown, Spectre, L1 Terminal Fault (L1TF | Foreshadow), and MDS (Fallout/RIDL/ZombieLoad).
• Kyndryl requires using the SCAv1 scheduler as it provides the most security by mitigating all aspects of these vulnerabilities;
deployments must be sized and designed with SCAv1 enabled at build.
• A side effect of the SCAv1 scheduler, while the most secure, is reduced processor capacity and performance due to lower
processor utilization potential and loss of hyper-threading availability to the entire system.

• Use In-guest time synchronization mechanisms


• Use NTP Device external to vSphere
• Always configure NTP on your ESXi hosts, and make that NTP configuration consistent across all hosts
• NTP is configured on each ESXi host and the vCenter Server Appliances
• Update VMware Tools with the latest releases so you can utilize more features for the in-guest OS, security fixes, and driver updates
• Use the most current supported versions of VMware Tools or open-vm-tools

• Run vCenter Server as a Virtual Appliance with vCenter HA enabled


• Configure vCenter Server Appliance using the Automatic Backup feature
• Schedule your backup via backup tools (such as SP4VE) when vCenter Server is not under heavy load
• Configure HA Recovery Priority to ‘High’ for the vCenter Server Virtual Machines
• Configure “High” for Memory and CPU on Shares for vCenter Server Appliance
• Use vCenter Server Appliance with embedded platform services controller as standard
• Use vCenter High Availability (VCHA) as standard

• Keep virtual disk layouts simple: one vmdk = one logical partition
• Keep “.vmdk” files for VMs on same datastore
• Use DRS Affinity Rules; but do not overuse
• Configure Affinity and Anti-Affinity rule based on the requirements
• The more VMs with DRS rules in place, the fewer options DRS has to move VMs, thus impacting load balancing
• Enable EVC and set it to the maximum supported CPU baseline for each cluster
• By default, EVC is not enabled in VMware clusters, because not all CPUs support EVC. To find out if yours do, see
https://kb.vmware.com/s/article/1003212
• The use of RDMs adds unnecessary complexity and additional requirements that, at times, outweigh the benefits
• In typical RDM configurations you have to make sure that there is SCSI ID consistency across the hosts where the RDM will
be used
• In addition, you cannot take snapshots of shared disk volumes, which makes backup and change rollback difficult
• Kyndryl recommends AGAINST using RDM and other shared disk technologies on VMware clusters.
• Use DRS in fully automated mode
• DRS will move Virtual Machines between hosts using vMotion to evenly balance cluster workload
• DRS considers CPU, Memory and Network utilization metrics

• Enable vSphere HA on the Cluster


• vSphere HA and VMCP should be configured with the below parameters:
vSphere HA = Enabled
Enable Host Monitoring = Enabled
Host Failure Response = Restart VMs
Response for Host Isolation = Disabled
Datastore with PDL = Power off and restart VMs
Datastore with APD = Power off and restart VMs – Conservative Restart Policy
VM Monitoring = Disabled.
Admission Control = Slot Policy (powered-on VMs)
Heartbeat Datastores = Automatically select accessible from the hosts
• Starting with vSphere 7.0 Update 1, HA and DRS are controlled with a new feature called vSphere Cluster Services (vCLS); vCLS VMs are deployed automatically
on each vSphere cluster depending on the size of the cluster

• Snapshots should not be kept longer than 24-72 hours.


• Snapshots are NOT a replacement for VM Backups.
• No more than 2-3 snapshots on a VM at any time.
• Create vCenter Server alarms to detect if a virtual machine is running from a snapshot. Configure these alarms to notify your event management system.
• Use RVTools / vInfoTools to monitor the VMs that are running with snapshots.
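The snapshot policy above (24-72 hour retention, limited count) can be expressed as a small audit function. This is a hypothetical sketch: the input shape (VM name mapped to snapshot ages in hours) is an assumption, not a real vSphere API; in practice the data would be gathered via PowerCLI or RVTools exports.

```python
# Sketch: flag VMs that violate snapshot policy. Thresholds default to the
# most permissive values in the text (72 hours, 3 snapshots).

def snapshot_violations(vm_snapshots: dict, max_age_hours: int = 72,
                        max_count: int = 3) -> dict:
    """Return {vm_name: [reasons]} for VMs breaking snapshot policy."""
    issues = {}
    for vm, ages in vm_snapshots.items():
        reasons = []
        if len(ages) > max_count:
            reasons.append(f"{len(ages)} snapshots (max {max_count})")
        stale = [a for a in ages if a > max_age_hours]
        if stale:
            reasons.append(f"{len(stale)} snapshot(s) older than {max_age_hours}h")
        if reasons:
            issues[vm] = reasons
    return issues
```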
• Update vHardware to the latest supported by the running vSphere version.
• Virtual hardware features include BIOS and EFI support, available virtual PCI slots, maximum number of CPUs, and other features
typical to physical hardware.

UEFI:
• Set the power settings on the UEFI to Platform Controlled
• Disable c-States
• Enable Hardware-assisted CPU Virtualization (VT-x / AMD-v)
• Enable Hyper-Threading
• Set CPU/RAM speed to Maximum Performance
ESXi:
• Configure Power Management to High Performance.

• Fault Tolerance provides a higher level of business continuity than vSphere HA.
• Do not use Fault Tolerance unless there is a strong use case.
• Fault Tolerance is not widely used within Kyndryl, and there might be better ways to provide HA for an application.
• Fault Tolerance should be used only for the particular use cases mentioned.
• Fault Tolerance strongly recommends a 10-Gbit logging network.
• The Fault Tolerance logging network carries unencrypted traffic.
• A vSphere Enterprise Plus license allows FT to be enabled on a virtual machine with up to 8 vCPUs.
• Kyndryl recommends use of thin provisioning at the storage layer. Do NOT use thin provisioning from VMware if storage-side thin provisioning is in use
• NEVER use thin provisioning on vSphere AND storage devices simultaneously
• NFS storage, which does not use SCSI reservations or locks, by default uses thin provisioning on the storage side; NFS datastores
are an acceptable use
• Disable CPU and RAM Hot-Add on virtual machines
• The primary disadvantage is that when Hot-Add is enabled, it disables vNUMA for that virtual machine. vNUMA aligns the
physical CPU and physical memory that a VM uses within the physical NUMA infrastructure of the host.
• Kyndryl recommends keeping Lockdown mode DISABLED except for the most secure environments, where
Normal Lockdown mode may be set.
• Remember that it is also best practice to keep the SSH Service and ESXi Shell Service disabled. Enable them only when needed and
disable again when complete.

• Use optimal VM configuration for best performance as below


VMware Tools - Use most current VMware Tools versions supported by the Guest OS
Virtual Hardware - Use most current supported for your vSphere ESXi hosts
Boot Firmware - UEFI for Virtual Hardware 13 and above, BIOS for lower levels
CPU - Build VMs with the minimal amount of CPU possible. Add CPUs only if application performance requires it
RAM - Build VMs with the minimal amount of RAM possible. Add RAM only if application performance requires it
CPU / RAM Hot Add – Disabled
VM Swap - Keep with the virtual machine configuration file
Network Adapter - Use VMXNET3 for best performance and least CPU overhead when supported by the Guest OS
SCSI Controller - Use VMware Paravirtual when supported by the Guest OS
VMDK - Thick Provisioned Lazy Zero (Thick Provisioned Eager Zero can be used for VM disks hosting databases)
CD-ROM - Keep detached
Floppy Disk - Keep detached (or even remove completely)
USB Controller - Remove from all VMs
DirectPath IO (VMXNET3) - DirectPath IO should be disabled unless the VM product specification requires it.

• Kyndryl recommends setting TPS mode to “SHARED” for standard VMware platforms to take advantage of memory page sharing.
• Disable salting - Set Mem.ShareForceSalting = 0 to enable inter-VM page sharing.
• Kyndryl recommends keeping TPS mode at the default setting “NON-SHARED” for the most secure environments.
Ensure that the vCenter and ESXi certificates are valid, and also check that STS certificates are working.
Storage
• Use datastores with fixed sizes and enough free space.
• The general recommendation and best practice is a standard datastore size no larger than 4TB.
• No more than 12-15 VMs per datastore.
• Maintain 15-20% free space on all datastores.
• Do not use LUN extents or expand LUNs
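The datastore rules above combine into a per-datastore compliance check. A minimal sketch, not from the source; the thresholds are taken as the strictest values in the bullets (4TB, 15 VMs, 15% free):

```python
# Sketch: return policy violations for one datastore; empty list = compliant.

def datastore_ok(capacity_tb: float, free_pct: float, vm_count: int) -> list:
    """Check one datastore against the sizing guidance above."""
    problems = []
    if capacity_tb > 4:
        problems.append("larger than 4TB")
    if vm_count > 15:
        problems.append("more than 15 VMs")
    if free_pct < 15:
        problems.append("less than 15% free space")
    return problems
```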

• Use Round Robin multipathing to access your LUNs; this implies that your storage backend has active-active paths
• When using Round Robin, also set the IOPS limit to make the use of Round Robin more efficient on each path
• By default, ESXi sets the Round Robin IOPS limit per path to 1000. Kyndryl recommends changing this as follows:
If using Fibre Channel storage, set the IOPS limit to ‘10’.
If using iSCSI storage, set the IOPS limit to ‘1’.
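The transport-dependent IOPS limits above can be encoded as a small lookup. This is an illustrative helper, not part of any VMware tooling; on the host the value is typically applied with `esxcli storage nmp psp roundrobin deviceconfig set`, per VMware documentation.

```python
# Sketch: pick the Round Robin path-switch IOPS limit per the guidance above.
# Unknown transports raise rather than guessing a value.

def recommended_iops_limit(transport: str) -> int:
    """Return the recommended Round Robin IOPS limit for a storage transport."""
    limits = {"fc": 10, "iscsi": 1}
    try:
        return limits[transport.lower()]
    except KeyError:
        raise ValueError(f"unknown transport: {transport}")
```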
• It is recommended NOT to use VMware thin provisioning on any cluster.
• Use thin provisioning on the storage device if required by the account.
• Thin provisioning on the SAN side may be available but should not be used while using thin provisioning on vSphere. Use thin
provisioning on the storage side rather than on the hypervisor side when available.

• VMware vSphere vStorage APIs for Storage Awareness (VASA) is a set of software APIs that a storage array uses to advertise its capabilities and present the
array to vCenter Server.
• In your vSphere HTML5 UI, from the Hosts & Clusters view select your “vCenter Server” > “Configure” > “Storage Providers” and register the provider's
URL, login account and password.
A security certificate from the VASA provider may need to be installed on the vCenter Server system.

• Use Storage Clusters and Storage DRS to balance space usage and I/O metrics.
The first setting is the SDRS automation level; this should be set to ‘Fully Automated’ to take full advantage of SDRS.
• In the runtime rules, you can enable or disable I/O Metric Inclusion as part of any SDRS recommendation; if disabled, I/O load is ignored
and only datastore space usage will be considered when recommending SDRS migrations.
• If using EasyTier (or a similar feature, which is enabled by default on most IBM managed storage arrays), disable “I/O-metric-
triggered migrations” in Storage DRS (sDRS).
• Do NOT use Raw Device Mapping (RDM). If you need to use RDMs, use the ‘Fixed’ multi-pathing policy on the
RDM LUN. The use of RDMs for VM disks adds additional support requirements for next to no benefit.
• Keep LUN IDs identical across all ESXi hosts in a cluster.
• Having different LUN IDs for the same LUN across different hosts can cause issues such as, but not limited to, vMotion and path management
issues, etc.
• Do NOT use duplicate LUN IDs within an ESXi host.
• Ensure datastore presentation is consistent across all ESXi hosts in a cluster.

• Disable ATS Heartbeat. (This setting must be kept ENABLED (set to ‘1’) if running vSAN or NFS, as they can be negatively impacted
if not kept enabled.)
VMFS3.UseATSForHBOnVMFS5 = “0”
• Keep VAAI enabled; it enables the ESXi host to offload specific virtual machine and storage operations to the array. Given the
benefits, it is recommended that these settings be kept enabled on all modern vSphere deployments.
• VAAI is preferred to be kept enabled if you are running SVC version 7.5.x and onwards. If you are running older versions, it may cause
problems.
HardwareAcceleratedLocking, HardwareAcceleratedMove, HardwareAcceleratedInit

• Enable VM Component Protection (VMCP). VMCP can protect VMs from storage-related events such as Permanent Device Loss (PDL) and All Paths Down (APD)
• VMCP will not prevent these issues from occurring, but it will allow HA to perform a quick and automated recovery

• Set All Path Down in ESXi Advanced Settings


Misc.APDHandlingEnable = 1
Misc.APDTimeout = 140
• Set Permanent Device Loss in ESXi Advanced Settings
• PDL is a situation that can occur when a disk device either fails or is removed from the vSphere host. When a device enters a PDL
state, the vSphere host can take action to prevent directing any further, unnecessary I/O to this device.
Disk.AutoremoveOnPDL = 1
VMkernel.Boot.terminateVMOnPDL = TRUE
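A baseline like the APD/PDL settings above can be audited across hosts by comparing a host's advanced settings (as a plain key/value dict, gathered for example with PowerCLI's `Get-AdvancedSetting`) against the desired values. A sketch under that assumption; the baseline dict simply mirrors the settings listed in the text:

```python
# Sketch: audit a host's advanced settings against a recommended baseline.
# Values are compared as strings since advanced settings surface as text.

APD_PDL_BASELINE = {
    "Misc.APDHandlingEnable": "1",
    "Misc.APDTimeout": "140",
    "Disk.AutoremoveOnPDL": "1",
}

def noncompliant(host_settings: dict, baseline: dict) -> dict:
    """Return {setting: (actual, expected)} for missing or deviating values."""
    out = {}
    for key, expected in baseline.items():
        actual = host_settings.get(key)
        if str(actual) != expected:
            out[key] = (actual, expected)
    return out
```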

• Enable Storage I/O Control (SIOC) on all datastores and keep settings consistent within a datastore cluster.
• If SIOC is not consistently enabled for all LUNs in an SDRS cluster, the I/O metric used for SDRS optimization cannot be applied reliably.
• Keep Storage adapter firmware and drivers at current levels.
• Be cautious of known issues and compatibility concerns with each version.
Network
• Use Distributed vSwitch for virtual machine networks
• Distributed virtual switches are defined at the data center level, which means that virtual switch configuration is consistent across all hosts in the
same data center.
• Distributed virtual switches enable advanced features such as Rx traffic shaping, consistent network configuration, the vSphere Distributed
Switch API, and LLDP (Link Layer Discovery Protocol, which is a standard equivalent of CDP, Cisco Discovery Protocol)

• Enable discovery protocols on your physical network switches


• When you are using Cisco physical network components, the Cisco Discovery Protocol can be used to view physical switch information from the
vSphere client
• Note that in vSphere 6.7 and above, another open standard discovery protocol can be used on non-Cisco switches:
vSphere Distributed Switches v5.0.0 and later can use LLDP, and you can enable this in the dvSwitch settings
• Note that LLDP must also be enabled in the physical switch configuration and on the specific switch ports

• The recommendation is to use Link Status Only and Link State Tracking on physical switches that support it
• With link state tracking enabled on the physical network switches, set the network failover detection to ‘Link Status Only’
(this is also the default)
• If you want to monitor downstream link failures and there is no Link State Tracking on the physical switches, use
Beacon Probing.
• Keep vMotion on separate VLAN from all other traffic
• vMotion should be placed on a separate, isolated, non-routable VLAN from all other host network traffic

When configuring Top of Rack (ToR) switches, consider the following best practices
• Configure redundant physical switches to enhance availability
• Configure switch ports that connect to ESXi hosts manually as trunk ports
• Modify the Spanning Tree Protocol (STP) on any port that is connected to an ESXi NIC to reduce the time to reach a forwarding state (e.g., PortFast)
• If DHCP is required on the account, provide DHCP or DHCP Helper capabilities on all VLANs that need it

• To accommodate your workload needs and the vSphere services and products you are running, you will need a certain number of uplinks on
your host. This also depends on the throughput that each host has available per uplink. A general rule of thumb:
Speed - Qty - Service
10GBE - 2 - vSphere
10GBE - 4 - vSphere + (vMotion and/or vSAN)
10GBE - 6 - vSphere + vMotion + vSAN + NSX-T
25GBE - 2 - vSphere
25GBE - 2 - vSphere + (vMotion and/or vSAN)
25GBE - 4 - vSphere + vMotion + vSAN + NSX-T
A 1GbE connection will also be needed for the server management card (IMM/iLO) on each host.
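The rule-of-thumb table above can be turned into a small sizing helper. An illustrative sketch only; it encodes exactly the rows listed (vSphere alone, vSphere plus vMotion and/or vSAN, and the full vSphere + vMotion + vSAN + NSX-T stack) and nothing more:

```python
# Sketch: minimum uplink count per host from the rule-of-thumb table.
# `services` is a set of lowercase names, e.g. {"vsphere", "vmotion"}.

def required_uplinks(speed_gbe: int, services: set) -> int:
    """Return the minimum uplink count for a host at the given link speed."""
    full_stack = {"vmotion", "vsan", "nsx-t"} <= services
    extra = services - {"vsphere"}
    if speed_gbe >= 25:
        return 4 if full_stack else 2
    # 10GbE rows
    if full_stack:
        return 6
    return 4 if extra else 2
```

Note the 25GbE rows in the table collapse vSphere alone and vSphere plus vMotion/vSAN to the same two uplinks, which the function reproduces.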

• Keep DvSwitch Health Check feature disabled


• If you are having network configuration problems in your environment and believe this feature will help,
perform a health check and disable it within 20-30 minutes. If left enabled for a prolonged time, it may disrupt the environment
due to MAC flooding of the physical network
• Use multiple pNIC uplinks for maximum redundancy.
• The general best practice is to NOT use Link Aggregation Control Protocol (LACP), but to use Load Based Teaming instead
• However, in certain scenarios, there might be specific requirements that require LACP to be utilized
• Remember that LACP introduces complexity since it needs to be configured on both the physical switch and the dvSwitch.
vSphere is very dependent on the network switch configuration, so you need to make sure to align all of the following:
• The hashing algorithm used on the physical switch must be the same as the one used on the LAG on the dvSwitch.
• All NICs that are used for LACP must use the same Speed and Duplex settings.
• The number of ports in an LACP port channel on the physical switch must be equal to the number of uplinks in the LAG.
• Pay careful attention to configuration steps in environments with a limited number of uplinks (e.g., two), as a misconfiguration can cause loss of
connectivity to the Management interface of the ESXi host.
• LACP also introduces a set of technical limitations, which are documented in the VMware vSphere 6 documentation.

• Limit the number of vNICs on a virtual machine.


• With these increased network speeds, you should connect only one vNIC to each virtual machine
• We also understand that multiple vNICs connected to more than one VLAN may be necessary due to specific requirements
• Following this best practice will ensure that management of VMs is kept as simple as possible. It will also avoid accidentally bridging
between two subnets that must be kept separate (i.e., PCI, DMZ, etc.).
• Keep Network adapter firmware and drivers at current levels.
• Be cautious of known issues and compatibility concerns with each version.
Management
• Use the VMware Support Request Template when opening new cases.
https://kyndryl.box.com/v/VMwareSRTemplate
• Please be sure to open any VMware tickets with the proper severity. Sev1 tickets should only be opened for production-down situations.

• Plan antivirus/vulnerability scans in VMs carefully.


• In a virtual infrastructure, most virtual machines share the same storage backend, whereas in a physical
infrastructure each server typically has its own dedicated storage
• So, while it may be fine to scan all your internal storage every week in the latter configuration, this does not hold in a virtual one, where it
would mean potentially hundreds of virtual machines hitting the storage backend with I/O-intensive scans at once
• You want to make sure that the weekly or monthly A/V scans are scheduled at different and/or evenly distributed times
• This guidance is valid for any management software that may cause high I/O when processing all VMs at the same time

• You should never assume that you will approach performance troubleshooting in a virtual machine the same way as on a physical server; performance
management and troubleshooting in a virtual infrastructure requires very specific skills.
• VMware offers several mechanisms to help you deal with resource contention, such as resource pools, shares, limits and reservations
• If your VM is running slow, it could indeed be lack of CPU, RAM, storage or network resources in your host or
even in your cluster. It could also be caused by inappropriate resource contention policies, such as inadequate shares or limits, or an unbalanced
memory configuration in a NUMA node
• You may also consider using specific tools for troubleshooting performance in virtual infrastructures, such as esxtop on ESXi
servers or a third-party product like VMware vRealize Operations
• Ensure that all logs on all vSphere components are configured to be retained per best practices and account requirements
• The following Advanced Settings are required to keep logs both locally as well as on a managed syslog server:
Syslog.global.logDirUnique = True
Syslog.global.logHost = <external syslog server IP or FQDN>
Syslog.global.logDir = [<datastore name>]/scratch/log
• It is important that all logs are retained for at least 180 days (or as dictated by the account CSD).
• It is suggested that all ESXi hosts and vCenter Server logs be configured to be stored on a syslog server.

• Use a VM Backup solution that leverages the vStorage API


• For the vast majority of your data, it is strongly recommended to use a backup proxy solution such as SP4VE
• In Kyndryl Backup as a Service, the recommended solution is SP4VE because of its tight integration with vSphere; it is the most widely
used backup solution in Strategic Outsourcing accounts

• Configuring vCenter to forward alarms to your event management system is thus a critical component of your monitoring strategy
• vCenter provides default and customizable events that let you be notified of things like:
Storage access degradation
Datastore free space
Network connectivity loss or redundancy degraded
Host hardware status
Loss of connectivity between a host and the vCenter server
• Available options are to use SNMP, use SMTP notifications, configure custom actions on vCenter alarms (e.g., scripts or
event forwarding), and use custom monitoring solutions that are designed for VMware infrastructures
Leverage VMware PowerCLI for Automation

• You should use vRealize Operations (vROps) and vRealize Log Insight (vRLI) to help manage your VMware environment and
servers.
• The main purpose of vROps is to be the eyes and ears of your vSphere infrastructure so you can easily understand issues before they
even occur. It will also help you properly size your environment and the workloads running on it
• vRLI will collect and apply analytics to the logs from your vSphere infrastructure as well as from your workloads; when an issue
occurs you can see what happened before the issue, which helps you troubleshoot the situation
• vROps and vRLI have full integration with one another, as well as with vSphere

• Patch ESXi hosts at least every Quarter


• Upgrade host firmware and drivers every 6 months.
• In addition to OS patching, you should maintain a lifecycle policy for server firmware and drivers for your hardware: BIOS, HBAs, NICs,
etc.
• Allow a release period of 2-4 weeks before applying the latest patch/firmware. This time allows critical issues to be discovered by others rather
than you
• Always read the release notes of each patch or firmware update, as it can bring new constraints or known issues
• These time frames are general guidance. You may need to update more often depending on customer requirements or known issues
fixed with updated versions
• Keep ESXi patches and firmware uniform across all hosts in your vSphere cluster
• When applying new storage or network drivers, be sure to consult with your Storage/Network administrators
• Deploy and use SATHC for all VMware deployments. Remediate any findings to comply with VMware best practices.
• SATHC for VMware focuses on identifying potential risks and findings in VMware environments.
• The items reviewed, based on SDE decision points, are divided into 6 overall topics:
Virtual Machine Settings
Host Settings
Host Network
Host Storage
Cluster Settings
General Management
• Keep all naming standards unique and accurate, such as Clusters, Datacenters, Datastores and other objects

• It is not recommended to join ESXi hosts to the Active Directory.


• Active Directory is a common attack vector for malicious actors to compromise an environment, hence the recommendation not to join ESXi
hosts to Active Directory
• Use local accounts; do not integrate ESXi hosts with Active Directory.
• Make sure local account passwords are stored securely in an approved credentials vault with controlled access (e.g.,
1Password)
• Direct access to ESXi hosts should be a path of last resort. Hosts should only be accessed directly as part of break-glass procedures; APIs
should be used to manage ESXi where possible (e.g., PowerCLI)
• Network access to ESXi hosts should be restricted to specific sources (such as Jump Hosts, VDI, or management networks)
S.No Setting

1 User Management

2 LPAR configuration

3 Logging

4 Timekeeping synchronization

5 VIOS
6 Network Installation Manager (NIM)

7 HACMP (cluster)

8 Insecure Daemons

9 DLPAR

10 OS Backup

11 Sysdump

12 CLUSTER RESOURCE GROUP POLICY

13 Patching

14 HMC
15 Performance tuning

16 Fault Tolerance

17 Monitoring

18 Authentication

19 Patching Tools

20 Paging space

21 Zoning and Topology

22 Storage Multi-pathing
23 Disk config

24 FCS config

25 ENT Network config

26 Network config

27 Drivers & Firmware


Best Practice
Compute

• Direct root access should not be allowed to users. In the /etc/ssh/sshd_config file, find PermitRootLogin and set it to no.
• histexpire should be set to the recommended value 26.
• histsize should be set to the recommended value 20.
• maxage should be set to the recommended value 8.
• maxexpired should be set to the recommended value 2.
• maxrepeats should be set to the recommended value 2.
• minage should be set to the recommended value 0.
• minalpha should be set to the recommended value 2.
• mindiff should be set to the recommended value 0.
• minlen should be set to the recommended value 6 (8 for the root user).
• minother should be set to the recommended value 2.
• pwdwarntime should be set to the recommended value 5.
• Enhanced access should be provided to users according to their requirement via sudo.
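The attribute values above can be audited programmatically. A sketch, assuming the relevant stanza of /etc/security/user has already been parsed into a plain dict of integers; the baseline dict is copied straight from the recommended values in the text, with root's stricter minlen handled separately:

```python
# Sketch: compare a user's AIX password attributes against the recommended
# baseline listed above. Returns only the attributes that deviate.

RECOMMENDED = {
    "histexpire": 26, "histsize": 20, "maxage": 8, "maxexpired": 2,
    "maxrepeats": 2, "minage": 0, "minalpha": 2, "mindiff": 0,
    "minlen": 6, "minother": 2, "pwdwarntime": 5,
}

def audit_user(attrs: dict, is_root: bool = False) -> dict:
    """Return {attribute: (actual, expected)} for settings off the baseline."""
    expected = dict(RECOMMENDED)
    if is_root:
        expected["minlen"] = 8  # stricter minimum length for root
    return {k: (attrs.get(k), v) for k, v in expected.items()
            if attrs.get(k) != v}
```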

• A shared CPU scheme should be used for better utilization of available resources on the frame.
• A 1:2 pCPU:vCPU ratio should be configured for better performance; vCPUs should not exceed the available physical CPUs.
• Min. and Max. CPU and memory should be set according to the application's future usage so that we can utilize DLPAR.
• Use virtualized storage (NPIV) and ensure adequate disk space and performance for LPARs.
• CPU: Assign dedicated/shared processors or CPU pools to LPARs based on workload characteristics.
• Memory: Allocate memory appropriately to avoid overcommitment and ensure adequate performance. Implement
a change management process to track and manage modifications to LPAR settings and configurations.

• It is recommended to use a separate filesystem for logs: if the default FS is used, /var sometimes fills up. With a dedicated
log filesystem, only logging will stop at 100% utilization; the system still remains up.
• Configure syslog.conf accordingly for the new FS; below is a configuration example:
mail.debug /usr/local/logs/mailog rotate size 2m files 10 compress
*.emerg /usr/local/logs/syslog rotate size 2m files 10 compress
*.alert /usr/local/logs/syslog rotate size 2m files 10 compress
*.crit /usr/local/logs/syslog rotate size 2m files 10 compress
*.err /usr/local/logs/syslog rotate size 2m files 10 compress
auth.notice /usr/local/logs/infolog rotate size 2m files 10 compress
• Utilize different log levels (debug, info, notice, warning, err, crit, alert, emerg) appropriately in syslog.conf.
• Enable auditing (the AIX audit subsystem) to track and log security-related events such as user logins, file access, and configuration changes.

• Use an NTP server and update /etc/ntp.conf properly for time sync.
• Always configure NTP on all your LPARs for time sync.
• There are some exceptions for DB servers where we are not supposed to use xntpd, per DB requirements.

• A dual VIO setup should be configured for redundancy


• Use redundant network adapters (SEA) for network connectivity to avoid single points of failure.
• Separate network traffic for management, client access, and inter-VIOS communication.
• Utilize VLANs for network segmentation and isolation.
• Maintain up-to-date documentation of VIOS configurations, network settings, and storage mappings.
• Implement a change management process to track and manage modifications to VIOS settings and configurations.
• All NFS daemons (nfsd, mountd, lockd, statd) should be running, because the NIM server is used as an NFS server.
• NIM should be on the N-1 OS level. Keep the NIM software up to date with the latest patches and fixes provided by IBM.
• The golden image should be customized properly according to the project.
• NIMVG must be configured separately with at least 500GB of storage.
• Integrate NIM with automation tools (e.g., Ansible, Puppet) to automate repetitive tasks such as deployments.
• Develop and implement standardized test procedures for NIM operations, including deployment and provisioning.
• Implement regular backups of NIM configuration data, scripts, and resources to ensure rapid recovery.
• Enable logging for NIM operations to capture events, errors, and activities for troubleshooting and auditing purposes.

• It is used for critical applications to provide extra redundancy for application services.
• Ensure shared access to critical data and applications using shared storage solutions (SAN) with redundant paths.
• Implement MPIO for redundant paths to shared storage to enhance reliability and performance.
• Use cluster management tools (such as the PowerHA SystemMirror CLI or GUI) to monitor cluster health and status.
• Conduct regular failover testing to verify the effectiveness of cluster configurations and failover procedures.
• Telnet should be in the stopped state, as it is known as a security hole.
• FTP should be in the stopped state, as it is known as a security hole.
• RSH should be in the stopped state, as it is known as a security hole.
• An RMC connection should exist between the LPAR and the HMC.
• IBM.DRM should be running on the client LPAR.
• A mksysb should be taken on a weekly basis.
• An alt_disk clone should be taken for the system so that we have a fast recovery plan. It is mandatory before any major change and serves as a fallback
for any cluster failure.


• A VIOS mapping backup should be taken on a weekly basis.
• A VIOS OS backup should be taken on a weekly basis.
• Primary and secondary sysdump devices should be configured on all LPARs.
• The estimated size should be configured for the sysdump LV using sysdumpdev -e

• Start Policy: “Online On Home Node Only”
• Fallover Policy: “Fallover To Next Priority Node In The List”
• Fallback Policy: “Never Fallback”

• All systems should be on N-1 level[Firmware,I/O,VIO,AIX,HACMP,HMC]


• Patching should happen at least quarterly.
• Upgrade I/O firmware and server firmware twice a year.
• Keep patches and firmware uniform across all hosts in your environment.

• Ensure that the HMC version is compatible with the firmware versions of the managed Power Systems.
• Integrate the HMC with IBM Power Systems Director or other management tools to automate tasks and monitoring.
• Ensure HMC configurations comply with organizational security policies and industry regulations (e.g., PCI DSS).
• Adjust kernel parameters (vmo, no, ioo) to optimize memory, paging, and I/O performance based on workload.
• Use JFS2 or enhanced concurrent capable file systems (such as GPFS) for improved performance and availability.
• Configure TCP/IP settings (no) for optimal network performance, including buffer sizes, congestion control, TCP window size, and packet sizes.
• Utilize AIX performance monitoring tools (nmon, topas, vmstat, iostat) to monitor system resources (CPU, memory, disk, network).
• Conduct performance testing and validation (AIXpert, benchmarks) before deploying new applications to production to identify
performance impacts.

• PowerHA SystemMirror (formerly HACMP): Implement clustering solutions to provide failover capabilities. PowerHA
SystemMirror allows for automatic failover and recovery in the event of node or resource failures.
• Use RAID (Redundant Array of Independent Disks) configurations (e.g., RAID 1) to mirror data across disks.
• Implement redundant network interfaces, switches, and paths to eliminate single points of failure.
• Use redundant power supplies and uninterruptible power supplies (UPS) to maintain power availability.
• Implement application-level monitoring and automated failover mechanisms (e.g., application clustering) to recover without manual
intervention.
• Keep AIX, firmware, and hardware drivers up to date with the latest patches, fixes, and security updates.

• HMC GUI Performance and Capacity Monitoring (PCM)


• Topas CEC view
• Nmon
For user security, use authentication tools such as:
• PAM
• LDAP

• FLRT (Fix Level Recommendation Tool)


• FLRTVC (efixes and ifixes)

• Ensure an adequate amount of paging space is allocated; traditional AIX guidance suggests 2-4 times the size of physical memory, adjusted for actual workload.
• Spread paging space across multiple physical disks or disk controllers to distribute I/O load and prevent bottlenecks.
• Perform capacity planning to anticipate future memory needs based on growth projections and workload trends.
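The paging-space checks and changes above map to a few standard AIX commands; hd6 is the default paging LV, while hdisk2 and the sizes shown are illustrative.

```shell
# Review paging spaces and how they are spread across disks.
lsps -a

# Grow an existing paging space by 4 logical partitions.
chps -s 4 hd6

# Add a second paging space on a different disk to distribute paging I/O
# (activate now with -n and at every restart with -a).
mkps -s 32 -n -a rootvg hdisk2
```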

Storage and Network

• Core-to-edge topology with director-class switches.


• All SAN topologies should always be divided into two fabrics in a failover configuration.
• Use soft zoning (creating zones using only the worldwide port name, WWPN) for individual initiator-target pairs.

• For AIX, the recommended number of paths is between two and four, even though AIX can support more; fewer paths reduce boot and cfgmgr time and improve error recovery, failover, and failback operations.
• Multipathing provides failover protection by maintaining multiple paths to storage devices; if one path fails, I/O continues on the surviving paths, preserving application availability.
• The best practice is to use the AIX native path-control module (AIXPCM). AIXPCM is integrated into the operating system, which eliminates separate activities related to updating the PCM.
• Queue depth (queue_depth): Specifies the number of I/O operations that AIX can concurrently send to the hdisk device; set this per the storage vendor's recommendation.
• Reservation policy (reserve_policy): Required for all MPIO devices; set [reserve_policy=no_reserve].
• Path selection algorithm (algorithm): Determines the way I/O is distributed across paths; set [algorithm=shortest_queue].
• FC error recovery (fc_err_recov): Fails I/O operations immediately, without waiting for a timeout, if a path is lost. Recommended value: fast_fail
• Dynamic tracking (dyntrk): Allows dynamic SAN changes. Recommended value: yes
• Maximum transfer size (max_xfer_size): How big a block I/O can pass over the HBA port. Recommended value: per the storage vendor's guidance.
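The attribute values above are inspected with `lsattr` and applied with `chdev`; the device names (hdisk2, fscsi0, fcs0) and the queue_depth/max_xfer_size figures are examples and should follow the storage vendor's guidance.

```shell
# Inspect the current disk and FC adapter attributes.
lsattr -El hdisk2 -a queue_depth -a reserve_policy -a algorithm
lsattr -El fscsi0 -a fc_err_recov -a dyntrk

# Apply the recommended values; -P defers the change to the next reboot
# for devices that are currently in use.
chdev -l hdisk2 -a reserve_policy=no_reserve -a algorithm=shortest_queue -P
chdev -l hdisk2 -a queue_depth=32 -P           # example value; confirm with vendor
chdev -l fscsi0 -a fc_err_recov=fast_fail -a dyntrk=yes -P
chdev -l fcs0 -a max_xfer_size=0x200000 -P     # example value; confirm with vendor
```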

Set buffer allocation on all interfaces to get maximum throughput from the network:
• max_buf_huge=128
• min_buf_huge=64
• max_buf_large=128
• min_buf_large=64
• max_buf_medium=512
• min_buf_medium=256
• max_buf_small=4096
• min_buf_small=2048
• max_buf_tiny=4096
• min_buf_tiny=2048
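The buffer pools listed above are attributes of the (virtual) Ethernet adapter and can be applied in one `chdev` call; ent0 is an example device name.

```shell
# Apply the buffer-pool sizes from the list above to the adapter; -P
# defers the change to the next reboot/reactivation if the adapter is busy.
chdev -l ent0 -P \
  -a max_buf_huge=128   -a min_buf_huge=64 \
  -a max_buf_large=128  -a min_buf_large=64 \
  -a max_buf_medium=512 -a min_buf_medium=256 \
  -a max_buf_small=4096 -a min_buf_small=2048 \
  -a max_buf_tiny=4096  -a min_buf_tiny=2048
```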

• Configure the /etc/hosts and /etc/resolv.conf files for local hostname resolution and DNS server settings.
• Enable DNS caching (/etc/netsvc.conf) to reduce DNS lookup times and improve application performance.
• Configure network bonding (EtherChannel, IEEE 802.3ad) to aggregate multiple network interfaces for redundancy and throughput.
• Choose network switches from reputable vendors that are compatible with AIX and support industry standards.
• Port speed and capacity: Select switches with adequate port speed (e.g., 1 Gbps, 10 Gbps, 25 Gbps) for the expected workload.
• Plan and implement a network topology that meets performance and redundancy requirements (e.g., bandwidth needs and fault tolerance).
• Use Virtual LANs (VLANs) to logically segment network traffic, improve security, and optimize network traffic flow.
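A minimal sketch of the name-resolution settings above; the nameserver addresses and the search domain are placeholders (documentation IPs), not real values.

```shell
# /etc/netsvc.conf: resolve hosts from local files first, then DNS.
echo 'hosts = local, bind' >> /etc/netsvc.conf

# /etc/resolv.conf: point at redundant DNS servers (placeholder addresses).
cat >> /etc/resolv.conf <<'EOF'
nameserver 192.0.2.10
nameserver 192.0.2.11
search example.com
EOF
```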

• Keep Storage adapter firmware and drivers at current levels.


• Be cautious of known issues and compatibility concerns with each version.
S.No Setting

1 Oracle Database Configuration best practices

2 Oracle patching

3 Oracle RAC best practices


4 Oracle ASM best Practices
5 Oracle DataGuard best practices
6 Oracle RMAN backup & Recovery
7 Oracle database performance best practices
Best Practice
Compute

• Use an SPFILE, which enables a single, central parameter file to hold all database initialization parameters; it can be stored in an ASM disk group.
• Enable archive log mode & forced logging for critical databases.
• When using manual memory management, set adequate SGA (SGA_MAX_SIZE) & PGA (PGA_AGGREGATE_TARGET) sizes; set individual components, especially buffer cache size and shared pool size, to sufficiently high values.
• Set an adequate SYSAUX tablespace size, as it stores AWR snapshots; by default the Oracle database captures a snapshot once every hour.
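The SPFILE and archive-log practices above can be sketched as a sqlplus session; the +DATA disk group and ORCL database name are assumptions, and enabling ARCHIVELOG requires a clean restart.

```shell
# Illustrative sqlplus session run as a DBA on the database host.
sqlplus / as sysdba <<'EOF'
-- Move to a central SPFILE stored in an ASM disk group (names are examples).
CREATE SPFILE='+DATA/ORCL/spfileORCL.ora' FROM PFILE;

-- Enable archive log mode and forced logging for a critical database.
SHUTDOWN IMMEDIATE;
STARTUP MOUNT;
ALTER DATABASE ARCHIVELOG;
ALTER DATABASE FORCE LOGGING;
ALTER DATABASE OPEN;
EOF
```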
S.No Setting

1 MS SQL Configuration best practices

2 MS SQL patching
3 MS SQL High Availability best practices

6 MS SQL Database backup & Recovery

MS SQL database performance best


7
practices
Best Practice

• Place SQL Server files on separate disks - SQL Server accesses data (.mdf) and log (.ldf) files with very different I/O patterns: data file access is mostly random whilst transaction log file access is sequential. Separating files with different access patterns helps to minimize disk head movements and thus optimizes storage performance. Use RAID-protected storage for data files, log files, and TempDB for best performance and availability.
• TempDB sizing -- Proactively inflate TempDB files to their full size to avoid disk fragmentation. A good rule of thumb: number of TempDB data files = number of cores (up to 8).
• Memory configuration -- Set the min server & max server memory that the SQL instance may utilize. Leave at least 6 GB of RAM for the operating system to avoid performance issues.
• Max degree of parallelism (MAXDOP) -- On a server with a single NUMA node and fewer than 8 logical processors, keep MAXDOP at or below the number of logical processors; with 8 or more logical processors, set MAXDOP to 8.
• Using antivirus software -- Customers might have requirements that antivirus scanning software must run on all servers, including SQL Server instances. Microsoft has published strict guidelines for running antivirus where SQL Server is installed, specifying the files, directories, and processes to exclude from scanning.
• SQL Server database autogrowth -- If the auto-growth setting for a database is not configured correctly, the database may experience frequent auto-grow events. Each time SQL Server grows a file, transactions stop. Make sure you enable Instant File Initialization (IFI), as it allows SQL Server to skip zeroing out new space and begin using the allocated space immediately for data files. It does not apply to growth of your transaction log files, which need all the zeroes.
• Auto create and update statistics -- By default, SQL Server creates and updates statistics automatically for your databases; you have the option to manually disable these features, but disabling auto create and update statistics should be done with care. There are three ways to create SQL Server statistics:

1)If the auto_create_statistics option is enabled (enabled by default)


2)Manually create statistics
3) When a new index is created
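The memory and parallelism settings described above can be sketched as a sqlcmd session; the server name and the memory figure (a 58 GB cap for a hypothetical 64 GB host) are assumptions.

```shell
# Illustrative sqlcmd session against an example instance (sqlprod01).
sqlcmd -S sqlprod01 -E <<'EOF'
EXEC sp_configure 'show advanced options', 1; RECONFIGURE;
-- Leave headroom for the OS (example: cap SQL Server at ~58 GB on a 64 GB host).
EXEC sp_configure 'max server memory (MB)', 59392; RECONFIGURE;
-- Single NUMA node with more than 8 logical processors: cap MAXDOP at 8.
EXEC sp_configure 'max degree of parallelism', 8; RECONFIGURE;
GO
EOF
```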

1) Patch SQL servers regularly -- It is recommended to patch SQL servers regularly, at least once a month, and more often for large enterprises.

2) Establish a rollback plan for emergencies -- Sometimes, patches create more problems than they fix; that is where a patch rollback plan comes in handy.

3) Avoid patching servers when agent tasks are running

4) Notify users of upcoming patches

5) Set up a SQL patching schedule


1. Regular Backups: Even with HA solutions in place, regular backups are essential for data protection.
2. Test Failover Procedures: Periodically test your failover procedures to ensure they work as expected before a real outage.
3. Monitor Performance: Use SQL Server monitoring tools to keep an eye on the performance and health of your HA configuration.
4. Keep Systems Updated: Apply the latest patches and updates to your SQL Server and Windows Server environments to address vulnerabilities and improve stability.
5. Document Everything: Maintain thorough documentation of your HA configuration, procedures, and contacts for quick reference during an emergency.
S.No Platform
1 WebSphere
2 WebSphere
3 WebSphere
4 WebSphere
5 WebSphere
6 WebSphere
7 WebSphere
8 WebSphere
9 WebSphere
10 WebLogic
11 WebLogic
12 WebLogic
13 WebLogic
14 WebLogic
15 Middleware(Common for All)
16 Middleware(Common for All)
17 Middleware(Common for All)
18 Middleware(Common for All)
19 Middleware(Common for All)
20 Middleware(Common for All)
21 Middleware(Common for All)
22 Middleware(Common for All)
23 Middleware(Common for All)
24 Middleware(Common for All)
25 Middleware(Common for All)
26 Middleware(Common for All)
27 Middleware(Common for All)
BestPractice
WebSphere dump locations should be configured in the JVM. Please use a separate filesystem, not under /root or the /WebSphere installation path
Automatic Restart in JVM Monitoring Policy Should be enabled
IIM should be installed on separate or dedicated mount point
Unused JVM or Cluster should not exists
IP Address should not be used in Profile Configuration
WebSphere Garbage Collection Should be Enabled
WebSphere Min & Max Heap Size of JVM should be Equal (for better performance)
Fixpack and IM update to latest versions
Ulimit File Descriptor should be minimum 10k
The DOMAIN_HOME and the ORACLE_HOME are in the same file system path
Redundant Oracle Home binary directories need to be configured
Cluster listen address need to be configured
derby.jar file has not been removed and DERBY_FLAG is not set to FALSE either
Patching need to be upto date or N-1
Third party certificate is advisable and need to trust the certificate between App and WebServers
Log rotation and log archival need to be in place
Logs and dumps need to be separate filesystem or drive
Need to use TLSv1.2 or higher
Based on requirements, LDAP/configuration parameters (like timeout, connection pool, etc.) need to be in place
Processes need to run with a functional ID (like wasadm/jbossadm/tomcat/weblogic)
Console security needs to be enabled
Either product monitoring (PMI) or third-party monitoring (like Tivoli) needs to be in place for performance monitoring
Auto restart script need to be in-place for all product
Functional ID should have proper permission to execute
Advised to use Customized Port
System time need to be sync across platforms (advised to use NTP)
SOP need to be available for all activity
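The file-descriptor requirement above (minimum 10k) can be verified with a small POSIX shell check; this is report-only and does not change any limits.

```shell
# Report whether the current shell's open-file-descriptor limit meets a
# minimum (default 10240, mirroring the 10k minimum stated above).
check_fd_limit() {
    min=${1:-10240}
    fd=$(ulimit -n)
    if [ "$fd" = "unlimited" ] || [ "$fd" -ge "$min" ]; then
        echo "OK $fd"
    else
        echo "LOW $fd (minimum $min)"
    fi
}

check_fd_limit 10240
```

Persistent limits are set in /etc/security/limits.conf (Linux) or /etc/security/limits (AIX) for the functional ID, not just in the interactive shell.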
Sr.No Technologies Frequency
1 Oracle Daily
2 Oracle Daily
3 Oracle Daily
4 Oracle Daily
5 Oracle Daily
6 Oracle Daily
7 Oracle Daily
8 Oracle Daily
9 Oracle Daily
10 Oracle Daily
11 Oracle Daily
12 Oracle Daily
13 Oracle Daily
14 Oracle Daily
15 Oracle Daily
16 Oracle Daily
17 Oracle Daily
18 Oracle Daily
19 Oracle Daily
20 Oracle Daily
21 Oracle Daily
22 Oracle Daily
23 Oracle Daily
24 Oracle Daily
25 Oracle Daily
26 Oracle Daily
27 Oracle Daily
28 Oracle Daily
29 Oracle Weekly
30 Oracle Weekly
31 Oracle Weekly
32 Oracle Weekly
33 Oracle Weekly
34 Oracle Weekly
35 Oracle Weekly
36 Oracle Weekly
37 Oracle Weekly
38 Oracle Weekly
39 Oracle Weekly
40 Oracle Weekly
41 Oracle Weekly
42 Oracle Weekly
43 Oracle Weekly
44 Oracle Weekly
45 Oracle Weekly
46 Oracle Weekly
47 Oracle Weekly
48 Oracle Weekly
49 Oracle Weekly
50
51 Oracle Monthly
52 Oracle Monthly
53 Oracle Monthly
54 Oracle Monthly
55 Oracle Monthly
56 Oracle Monthly
57 Oracle Monthly
58 Oracle Monthly
59 Oracle Monthly
60 Oracle Daily - night shift
61 Oracle Daily - night shift
62 Oracle Daily - night shift
63 Oracle Daily - night shift
64 Oracle Daily - night shift
65 Oracle Daily - night shift
66 Oracle Daily - night shift
67 Oracle Daily - night shift
68 Oracle Daily - night shift
69 Oracle Daily - night shift
70 Oracle Quarterly
71 Oracle Quarterly
72 Oracle Quarterly
Task
Oracle Database instance is running or not
Database Listener is running or not.
Check any session blocking the other session
Check the alert log for an error
Check the Top session using more Physical I/O
Check the number of log switch per hour
How much redo generated per hour
Run the AWR/statpack report if any performance issues reported
Detect lock objects
Check the SQL query consuming lot of resources.
Check the usage of SGA
Display database sessions using undo/rollback segments
State of all the DB Block Buffer
Check the tables/indexes are fragmented
Check the Chaining & Migrated Rows
Check the stale stats on objects
Check is there any dbms jobs run & check the status of the same
Check the Sync of the database from primary & standby db
Check and monitor the RMAN Backup files & its storage
Check the Recovery Size Area(FRA)
Check the latest Archivelog and Full Backup are done or not
Check the usage of physical RAM and SGA – Paging or Swapping exist or not.
Check Sql_ids running more than 20mins on OLTP databases
Check all CRONTAB housekeeping script logs
If its RAC make sure all Cluster services are ONLINE
Check any filesystem space issue
Check OEM critical/warning Alerts
Check Grid ASM rebalancing status
Check the size of tables & check whether they need to be partitioned or not
Check for Block corruption
Check the tables without PK
Check the tables having no Indexes
Check the tables having more Indexes
Check the tables having FK but there is no Index
Check the objects having the more extents
Check the frequently pinned objects & place them in a separate tablespace & in cache
Check the free space at O/s Level
Check the CPU, Memory usage at O/s level define the threshold for the same.
Check the used & free Block at object level as well as on tablespaces.
Check the objects reaching their max extents
Check free Space in the tablespace
Check invalid objects of the database
Check open cursor not reaching to the max limit
Check locks not reaching to the max lock
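Several of the daily checks above (tablespace free space, invalid objects, blocking sessions) can be wrapped in one scripted sqlplus session; this is a sketch assuming DBA privileges on the database host.

```shell
# Illustrative daily-check queries run as SYSDBA.
sqlplus -s / as sysdba <<'EOF'
SET PAGESIZE 100 LINESIZE 160
-- Free space per tablespace
SELECT tablespace_name, ROUND(SUM(bytes)/1024/1024) free_mb
FROM   dba_free_space GROUP BY tablespace_name;
-- Invalid objects
SELECT owner, object_type, COUNT(*) cnt
FROM   dba_objects WHERE status = 'INVALID'
GROUP BY owner, object_type;
-- Sessions blocking other sessions
SELECT blocking_session, sid, serial#, wait_class, seconds_in_wait
FROM   v$session WHERE blocking_session IS NOT NULL;
EOF
```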
Weekly
Check I/O of each data file
Check the stale stats
Check the CPU, memory usage at OS level, define the threshold for the same & follow the documented SOP
Check is there any dead lock was occurred
Monitor weekly report of RMAN full database backup and incremental backups
Check the database size & compare it with the previous size to find the exact growth of the database
Find Tablespace Status, segment management, initial & Max Extents and Extent Management
Check location of data file also check auto extendable or not
Check default tablespace & temporary tablespace of each user
Check the Indexes which is not used yet
Check the extents of each object and compare whether any object extent overrides what is defined at the tablespace level
Check whether any tablespace needs coalescing
Check the overall database statistics
Trend Analysis of objects with tablespace, last analyzed, no. of Rows, Growth in days & growth in KB
Analyzed the objects routinely.
Check which indexes need to be rebuilt
Check the tablespace for respective Tables & Indexes
Check the No. of DML operation perform after last analysis
Check the No. of Date of Last Analysis & No. of Record in the Table
Determine whether a table requires analysis or not
Check Catproc & Catlogs Objects are Valid or not (dba_registry)
Share Pool Advisory
Buffer Cache Advisory
Check the UNDO tablespace and retention
Patching activity on GI/RDBMS home
Database Reorganization
Check the quota of non-system tables in the system tablespace.
Domain Sub domain Best Practice Area
DOCUMENTATION Red hat Documentation
DOCUMENTATION Red hat Documentation
LICENSE MANAGEMENT Red hat
LICENSE MANAGEMENT Red hat Manageability

LOG MANAGEMENT Red hat Manageability

LOG MANAGEMENT Red hat Capacity

LOG MANAGEMENT Red hat Manageability

DATA GATHERING & UPLOADING Red hat Documentation


PATCH MANAGEMENT Red hat Manageability
SPECIFIC TECHNICAL CHECKS Red hat Manageability
SPECIFIC TECHNICAL CHECKS Red hat Manageability
SPECIFIC TECHNICAL CHECKS Red hat Availability
SPECIFIC TECHNICAL CHECKS Red hat Performance
SPECIFIC TECHNICAL CHECKS Red hat Manageability
SPECIFIC TECHNICAL CHECKS Red hat Capacity
SPECIFIC TECHNICAL CHECKS Red hat Capacity
SPECIFIC TECHNICAL CHECKS Red hat Performance
SPECIFIC TECHNICAL CHECKS Red hat Performance
SPECIFIC TECHNICAL CHECKS Red hat Manageability
SPECIFIC TECHNICAL CHECKS Red hat Manageability
BACKUP Red hat Resilience

SPECIFIC TECHNICAL CHECKS High Availability Manageability

SPECIFIC TECHNICAL CHECKS High Availability Resilience

SPECIFIC TECHNICAL CHECKS High Availability Resilience

SPECIFIC TECHNICAL CHECKS High Availability Availability

SPECIFIC TECHNICAL CHECKS High Availability Resilience

BACKUP Red hat Resilience


Best practice
SOP for Red hat Linux OS must be available
Check list for on-boarding new server must be available
All Licensed products must be activated and license inventory should be maintained
Alert must be configured for expiring licenses to avoid any non-compliance

Centralized logging tool must be used to capture , Analysis and Process the logs

Log rotation policy must be applied on all systems.

(Auditing) The events and the degree of detail that the logs must capture should be configured to include the following system actions:
- User logon events
- User creation events
- Security events
- Failed events
- Object access
- Commands execution

Upload the sosreport for all servers (VMs and hosts) for further analysis of the system configurations
Centralized patching tool should be available (Ex. Satellite , YUM , SMT )
Redundant NTP should be configured
DNS should be configured and avoid using local resolution (/etc/hosts)
Bonding should be used for physical servers
Server uptime should not be more than 2 years
KDUMP needs to be enabled and configured
File system should not be utilized greater than 85%
Memory utilization does not exceed 90%
OS tuning should be enabled (as per application recommendation)
NTP should be synchronized
SYSSTAT Must be installed on all servers
SAR should be configured to capture the performance reports
Full File system backup should be configured for critical VMs

Dedicated cluster logging should be enabled

Fencing resource should be configured and tested

The two-node parameter should be enabled in the Corosync cluster configuration

Cluster status should be healthy

Cluster Configuration Backup need to be taken Monthly

ReaR Backup must be implemented for Physical servers where snapshot is not possible
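The filesystem-utilization check above (no more than 85%) can be scripted with portable `df -P` output; the pseudo-filesystem filtering is left to the caller, and the threshold is a parameter.

```shell
# List filesystems whose usage exceeds a threshold (default 85%),
# parsing portable `df -P` output ($5 = capacity%, $6 = mount point).
fs_over_threshold() {
    limit=${1:-85}
    df -P | awk -v limit="$limit" 'NR > 1 {
        use = $5; sub(/%/, "", use)        # strip the % sign
        if (use + 0 > limit) print $6, use "%"
    }'
}

fs_over_threshold 85
```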
Best Practice Motivation
It ensures proper BAU operations
It ensures proper BAU operations
It ensures proper BAU operations
It ensures all devices are compliant
Centralized logging is a critical component of observability into modern infrastructure and applications; with it, teams can diagnose problems and understand user journeys
It ensure the File System availability

It ensure that all critical events are captured.

Required for health Checks validation


It helps in managing the patch repositories
Ensures uniform time throughout
Avoids reliance on local name resolution
Ensures the server is reachable in case of a port/cable issue
Ensures that server maintenance is done
Ensure that RHEL support policies are not impacted
Ensure the Disk Performance and availability
Ensure Server performance is not impacted
Ensure Server performance is not impacted
Ensure the timestamps on all devices are in sync
Ensure that troubleshooting packages are available in case of issue
Ensures that historical performance data is available
Ensure files recovery for critical VMs

Ensure that cluster events are logged properly

It ensures service availability

It ensures that quorum does not affect service availability for a two-node cluster

Ensures application availability

Ensure Cluster recovery in Case of Failure

Ensure files recovery for critical Servers.


Sr.No Technologies Frequency
1 MSSQL Daily
2 MSSQL Daily
3 MSSQL Daily
4 MSSQL Daily
5 MSSQL Daily
6 MSSQL Daily
7 MSSQL Daily
8 MSSQL Daily
9 MSSQL Daily
10 MSSQL Daily
11 MSSQL Daily
12 MSSQL Daily
13 MSSQL Daily
14 MSSQL Daily
15 MSSQL Daily
16 MSSQL Daily
17 MSSQL Daily
18 MSSQL Daily
19 MSSQL Daily
20 MSSQL weekly on Monday
21 MSSQL Weekly
22 MSSQL Weekly
23 MSSQL Weekly
24 MSSQL Weekly
25 MSSQL Weekly
26 MSSQL Weekly
27 MSSQL Weekly
29 MSSQL Monthly
30 MSSQL Weekly
31 MSSQL Weekly
32 MSSQL Weekly
34 MSSQL Weekly
35 MSSQL Weekly
36 MSSQL Weekly
37 MSSQL weekly
38 MSSQL Monthly
39 MSSQL Monthly
40 MSSQL Monthly
41 MSSQL Monthly
42 MSSQL Monthly
43 MSSQL Monthly
44 MSSQL Daily - night
45 MSSQL Quarterly
46 MSSQL Quarterly
47 MSSQL Quarterly
48 MSSQL Quarterly
49 MSSQL Quarterly
50 MSSQL Quarterly
51 MSSQL Quarterly
Check SQL Server Services are running or not
Check All the disk drives are under threshold
Check any session blocking the other session
Check the error log for an error
Check any long running sessions.
Check DC & DR are in SYNC for logshipping.
Check for Read replica is in sync with Primary Replica in Always ON AG.
Check for backup failure (log/diff/full)
Check Memory Utilisation at OS Level
Check CPU Utilisation at OS Level
check status for any DBA Jobs scheduled daily.
Checklist followed for new Subsystem build.
Check the backups are moving to remote location.
Check and monitor the Backup storage
Check if any maintenance jobs are running during business hours & check the status of the same
Check if any backup failure happened in the last 24 hours.
Check the latest log backup and Full Backup are done or not
Monitor critical alerts based on Mail alerting for critical events.
Check if there is any change FOR TODAY and whether an SOP for the activity is included; otherwise prepare a detailed plan
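The daily backup-failure and backup-age checks above can be scripted against msdb; the server name (sqlprod01) is an assumption, and the 24-hour window matches the checklist items.

```shell
# Illustrative sqlcmd query: databases with no full backup in the last 24h.
sqlcmd -S sqlprod01 -E <<'EOF'
SELECT d.name,
       MAX(b.backup_finish_date) AS last_full_backup
FROM   sys.databases d
LEFT JOIN msdb.dbo.backupset b
       ON b.database_name = d.name AND b.type = 'D'   -- 'D' = full backup
WHERE  d.name <> 'tempdb'
GROUP BY d.name
HAVING MAX(b.backup_finish_date) < DATEADD(HOUR, -24, GETDATE())
    OR MAX(b.backup_finish_date) IS NULL;
GO
EOF
```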
Check all SQL Server DBA Jobs status log for weekend jobs.
Check the size of tables & check whether purging or archiving is in place for very large tables.
check if DBCC CHECKDB failure and issues.
Check server level wait stats.
Check very large tables having no Indexes
Check whether large tables have duplicate or redundant indexes.
Check the tables having FK but there is no Index
Check the Index fragmentation
Check the free space at O/s Level
Check the CPU, Memory usage at O/s level define the threshold for the same.
House keeping scripts should be scheduled.
Check Disk queue length and response time in case of critical servers slowness.
Check the CPU, memory usage at OS level, define the threshold for the same & follow the documented SOP
Check is there any dead lock was occurred
Monitor weekly report for full database backup and differential backups
Database Reorganization
Check the database size & compare it with the previous size to find the exact growth of the database
Check the database size & compare it with the previous size to find the exact growth of the database
Check the Indexes which is not used yet
Check Patch level is upto date
Check the overall database statistics
Check which indexes need to be rebuilt
Check that index (rebuild) jobs for critical servers are running without any blocking
MS SQL version & service pack, database names, and the size of each database should be collected and documented
Check the size of the database, compare it with the last quarter's size, and project the growth of the database
Patching
Check Sql server version for EOL/EOS and prepare plan for upgrade.
Generate a memory/CPU utilisation report and store it for comparison each quarter to make a projection
Make sure SOPs for all the BAU activities are available; that will prevent human errors
Make sure a checklist for building a new SQL Server standalone/cluster/Always On AG/DR log-shipping setup is available
S.No Setting
Compute

1 Domain User ID

2 OS Security

3 Monitoring

4 NTP

5 Vulnerabilities

6 recovery plan
Best Practice
Compute

Enable Single Sign-On (SSO): Simplify access for users while maintaining security by implementing SSO.
Use Role-Based Access Control (RBAC): Assign permissions based on roles to minimize access rights to what is necessary.
Lower Exposure of Privileged Accounts: Limit the number of privileged accounts and monitor their use closely.
Enforce Strong Password Policies: Require complex passwords that combine letters, numbers, and symbols, and ensure they are changed regularly.

Patch and Install Security Updates: Utilize features like Hotpatch to install updates without rebooting, ensuring higher availability.
Upgrade to the Latest Version: Keep your server updated to the latest version, such as Windows Server 2022, to benefit from the newest security features.
Restrict Remote Access: Limit remote access to servers and ensure secure configurations.
Extended Security Updates: For older versions, take advantage of ESUs to protect your workloads as you plan for modernization.
Ongoing Monitoring: Implement continuous monitoring to detect and respond to security threats promptly.

Implement continuous monitoring to detect and respond to security threats promptly


• Use In-guest time synchronization mechanisms
• Use NTP Device external to PDC

Understand Common Vulnerabilities: Be aware of common security gaps such as outdated systems, misconfigurations, and credential exposure.

Plan for Security Breaches: Have a recovery plan in place for vital data and infrastructure functions in case of a security compromise.
S.No Setting
Network

1 User Management

2 Logging

3 NTP

4 ACCESS RULES

5 Monitoring

6 Routing Protocols

7 Routine
Best Practice
Network

• Individual userids/ Authentication Servers


• Passwords should be encrypted.
• AAA should be in place for authentication, authorisation, and accounting
• Minimum password length should be 9 characters
• Passwords should be alphanumeric
• A new password must be different from the last password
• Password aging should be enforced
• RADIUS communication should be secured

•Log system messages and debug output to a remote host.


•Enable logging of system messages.
•Enable system message logging to a local buffer.
• Timestamp should be enabled
•Logging console should be enabled
•Logging retention should be there for minimum 90 Days or based on the client policy

• Configure NTP server for time sync.


• Always configure NTP on your network and security devices to keep same time logging
• Create accesslist to allow NTP only from secured and trusted server.

• Telnet and aux services should be disabled


• SSHv2 should be enabled with higher key
• HTTP/HTTPS/FTP and other insecured services should be disabled
• access-list in line VTY session
• Session timeout should be configured
• Banner should be enabled
• SNMPv3 should be configured
• Monitor all of the required parameters based on the KPIs (CPU, memory, bandwidth, errors, failures)
• Daily reporting should be enabled
• SNMPv1 and v2 should be disabled
• Restrict the SNMP server to trusted hosts
• Disable default communities (private/public)
• Implement regular backups of NIM configuration data, scripts, and resources to ensure rapid recovery

• Enable comprehensive logging for NIM operations to capture events, errors, and activities for troubleshooting
• Authentication should be configured on Routing Protocols(OSPF/BGP or any protocol which is used)
• There should be control on routes via route-map or path metrics.

• Take weekly or monthly device backups based on the policy in the client infrastructure
• If there is a change or activity, take a backup of the device before starting
• Change control should be followed for every change
• Monitoring and reporting should be in place.
S.No Parameters

1 Array Health

2 Firmware

3 Capacity

4 CPU Stats
5 Host Stats
6 Host port Stats
7 PD stats
8 IO threshold for each ports
9 VV stats
10 Storage Multipathing
11 SNMP
12 SMTP
13 Stale Volumes

14 Persistent Ports

15 High availability

16 RAID groups
17 SAS Drives configuration

18 NL SAS drives configuration

19 Fast cache/Easy Tiering

20 Provisioning VVs
21 Priority Optimization

22 Adaptive Optimization
23 Front-end port cabling

24 Standard SOPs for New host addition, Zoning, Volume creation

25 Automated Health check

Check # Check conditions

1 Password expiration must be min. 1 and max. 90 days


2 Number of allowed login attempts must be min. 1 and max. 5
3 Minimum days to wait before password change must be at least 1
4 Minimum number of unique passwords must be at least 8

5 Password Requirements - Minimum password length must be at least 15


6 Minimum number of numeric digits in password must be at least 1
7 Access via Telnet (port 23) must be denied access, except from BNA
8 Access via HTTP (port 80) must be denied access, except from BNA

9 Check that unused ports are persistent disabled (nolight + disabled)

10 Password setting Enable Admin lockout must be enabled

11 Password setting lockout duration must be set to min. 30 minutes

12 Forward syslog messages to a syslog server

1 Check for no L-Ports present


2 Timeout must be 10 min. or less
3 Check that NTP is configured

4 SNMP v1 write must be disabled

5 SNMP v1 trap enabled, IP-address present and severity is set to 3 or 4

6 SNMP v1 community must be different from default

7 SNMP v1 accesscontrollist must be ro

8 Fixed speed on IFL/ISL ports

9 Chassisname must be different from default

10 Check for mixed zoning not used

11 Check for hanging zones or aliases

12 Check for undefined zones or aliases

13 All F-Ports must be used in zones

14 Each CP must have an ipaddress configured

15 Remaining buffers must be greater than 50 pr. ASIC

16 F-ports in FID127 not recommended

17 check number of LSAN zones are less than 2800


18 Trunking must be enabled on all Fibrechannel ISLs

19 Bottleneck Monitor must be enabled and configured for LrOnly


20 MAPS policy must be either Kyndryl_SO Standard or Default policy

21 MAPS alerting must be enabled

22 In Order Delivery (IOD) setting

23 DLS must be set with Lossless enabled

27 Insistent Domain ID must be enabled

29 Credit Tools: recovery must be enabled and set to LrOnly for internal and backend ports

30 Edge Hold Time must be within the range 80 - 220

31 SNMP v3 trap enabled, IP-address present and severity is set to 3 or 4

32 Dynamic Portname should be enabled running FOS 7.4+


33 Credit Tools: C2 FE Complete Credit Loss Detection must be enabled

34 MAPS config: Active policy, notification should not contain SDDQ or FENCE
35 Relay Host (SMTP) should be present

36 Security Protocol must be minimum TLS v1.2 (FOS v8+)

1 SFP status on all ports including TX and RX values

2 If trunking is enabled, trunking license must be present


3 Fabric Watch or Fabric Vision license should be installed

4 Firmware must be identical on primary and secondary partitions

1 Deskew on ISLs must be below 30


2 Check all Fans are OK

3 Overall switch health must be healthy

4 All temp. Sensors must be OK

5 All temp. Sensors must be below 75 degrees Centigrade

6 HA must be in synchronous state

7 Switch violation must be empty


8 CRC errors per port must be below 2500 per day

9 Encoding errors outside of frame per port must be below 25 per minute

10 Encoding errors inside of frame per port must be below 25 per minute

11 Er_bad_os per port must be below 5 per minute

12 Er_rx_c3_timeout per port must be below 100 per day

13 Er_tx_c3_timeout per port must be below 100 per day

14 Er_c3_dest_unreach per port must be below 5 per minute

15 Power Supply must be present and OK

16 Blades must be OK if present

17 SFPs must be healthy

18 ISL port must not be segmented

19 Check that Brocade zoning DB is not fuller than 80%


20 ErrDump must not contain any MAPS messages with severity CRITICAL on which action has not been taken
21 ErrDump must not contain any MAPS messages with severity ERROR on which action has not been taken
22 ErrDump must not contain any MAPS messages with severity WARNING on which action has not been taken
23 One or more ports detected as slow drain device and being set in quarantine

24 ITW errors per port must be below 500 per day.

25 HA Local CP must be Active


26 HA Remote CP must be Standby and Healthy

27 Time TX Credit Zero (tim_txcrd_z) per port must be below 1min / 24H
Storage Best Practices
Hardware Health status - All the Hardware health should be optimal to perform
daily operations
Upgrading to the most current Firmware allows the storage system to benefit
from the ongoing design improvements and enhancements.

• Verify the Total allocated , used and free space are as per the thresholds.
• Effective capacity management (by removing unmapped volumes, stale
volumes/snapshots, periodic cleanup)
• Look for Drive Types used and empty slots for additional storage

Verify the CPU stats and Idle time


Connected Hosts and status
Frontend Ports IOPS, Reads and Writes Svt etc
Frontend Ports IOPS, Reads and Writes Svt etc
For each port, the average I/O, KBytes per sec
Virtual Volumes IOPS, Reads and Writes Svt etc
Multipathing validation of all hosts
SNMP alerting configuration - For call home support with vendor
SMTP alerting configuration - For email alerts
Verify unmapped volumes for space reclamation

For Fibre Channel host ports, the following requirements must be met:
• The same host port on host-facing HBAs in the nodes in a node pair must be
connected to the same Fibre Channel fabric and preferably
different Fibre Channel switches on the fabric (for example, 0:1:1 and 1:1:1).
• The host-facing HBAs must be set to “target” mode.
• The host-facing HBAs must be configured for point-to-point connection. (There
is no support for loop mode.)
• The Fibre Channel fabric being used must support NPIV and have NPIV enabled.

DR set up and Replication features

• For all drive types, use RAID 6 for maximum availability.


• When creating storage pools, accept defaults according to
performance/capacity requirements.
• The number of pools should be kept to a minimum.
• Do not set growth limits on Pools. If a warning threshold is required, set a
growth warning (warning in terms of capacity), not an allocation
warning (warning in percentage).
• SAS drives should be RAID 6 by default. This configuration yields the highest availability for modern high-capacity drives.
• For applications that have a very high write ratio (more than 50% of the access
rate), create a pool using RAID 1 if performance (as
opposed to usable capacity) is the primary concern.

• NL SAS drives should be in RAID 6, which is the default.


• You can change the set size (data to parity ratio) from the default value of 8
(6+2) if the system configuration supports it.
• Do not use RAID 5 with NL disks.
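The set sizes quoted above translate directly into usable-capacity arithmetic; a small sketch:

```python
# Usable-capacity fraction for a RAID set expressed as data+parity
# disks, e.g. the default RAID 6 set size of 8 (6+2).

def usable_fraction(data_disks, parity_disks):
    """Fraction of raw capacity available for data in one RAID set."""
    return data_disks / (data_disks + parity_disks)

print(usable_fraction(6, 2))   # RAID 6 (6+2) -> 0.75
print(usable_fraction(3, 1))   # RAID 5 (3+1) -> 0.75
print(usable_fraction(1, 1))   # RAID 1 mirror -> 0.5
```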

• Reduces latency for random read-intensive workloads
• Responds quickly, providing smart and flexible second-stage data caching based on application and workload demands
• Enables Flash Cache across the entire system, or for particular workloads, to accelerate them
Tiering helps segregate data onto the corresponding drive types depending on the criticality of the data (it moves hot data to SSD drives, warm data to SAS drives, and cold data to NL-SAS drives).

• Because the granularity of deduplication is 16 KiB, the efficiency is greatest when I/Os are aligned to this granularity. For hosts that use file systems with tunable allocation units, consider setting the allocation unit to a multiple of 16 KiB.
• Deduplication is performed on the data contained within the VVs of a pool. For
maximum deduplication, store data with duplicate affinity
on VVs within the same CPG.
• Deduplication is ideal for data that has a high level of repetition. Data that has
been previously deduplicated, compressed, or encrypted is
not a good candidate for deduplication and should be stored on thinly provisioned
volumes.
• When using an array as external storage to a third-party array, deduplication
might not function optimally.
• Use TPVVs with dedup enabled when there is a high level of redundant data
and the primary goal is capacity efficiency.
• Use TPVVs with dedup enabled with the appropriate dataset.
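Since the dedup granularity is 16 KiB, a host's file system allocation unit can be sanity-checked as a multiple of it; a minimal sketch:

```python
KIB = 1024
DEDUP_GRANULARITY = 16 * KIB  # 16 KiB deduplication granularity

def is_dedup_aligned(allocation_unit_bytes):
    """True if a file system allocation unit is a multiple of 16 KiB."""
    return allocation_unit_bytes % DEDUP_GRANULARITY == 0

print(is_dedup_aligned(64 * KIB))  # 64 KiB allocation unit -> True
print(is_dedup_aligned(4 * KIB))   # 4 KiB allocation unit -> False
```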
• When implementing maximum limits (the maximum amount of IOPS/bandwidth
a given VVset or domain is allowed to achieve), the best
practice is to use System Reporter data in order to quantify volume performance
and set maximum limits rules accordingly.
• When implementing the minimum goal (the minimum amount of
IOPS/bandwidth below which the system will not throttle a given VVset
or domain in order to meet the latency goal of a higher priority workload), you
should set the minimum goal by looking at the historical
performance data and understanding the minimum amount of performance that
should be granted to the applications that reside in that
VVset. The volumes in the VVset might use more IOPS/bandwidth than what is set
by the minimum goal, but they will be throttled to the
given limit as the system gets busier. The performance might also go below the
minimum goal. This can happen if the application is not
pushing enough IOPS or if the sum of all minimum goals defined is more than the
I/O capability of the system or a given tier of storage.
• Latency goal (the service time [svctime] the system attempts to fulfill for a
given QoS rule) requires rules with a minimum goal
specification to exist so the system can throttle those workloads. A reasonable
latency goal should be set. You can do this by looking at
historical performance data. The latency goal is also influenced by the tier on
which the volume resides.

• The following combinations are acceptable within the same Adaptive Optimization configuration (policy):
– SSD, Fast Class (SAS), and NL
– SSD and Fast Class (SAS)
– Fast Class (SAS) and NL
• Using different RAID levels within the same policy is acceptable.
• When configuring two- or three-tier solutions containing SSDs, if region density
data is not available for sizing the SSD tier, assume that
the SSD tier will only provide the greatest number of IOPS per drive. The increase
in the estimated number of IOPS on larger drives is not
because of a difference in technology, but rather the increased probability that
“hot” data regions are on the larger SSDs and not on the
smaller SSDs.
• Always size the solution assuming the NL tier will contribute 0% of the IOPS
required from the solution.
• Configurations that contain only SSD and NL are not recommended, unless this is for a well-known application with a very small ratio of active capacity compared to the total usable capacity (1%–2%).
• Each HPE 3PAR controller node should be connected to two fabrics to protect
against fabric failures.
• Ports of the same pair of nodes with the same ID should be connected to the
same fabric. Example:
– 0:2:3 and 1:2:3 on fabric 1
– 0:2:4 and 1:2:4 on fabric 2
• Connect odd ports to fabric 1 and even ports to fabric 2 and so forth.
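The cabling convention above (same slot:port on each node of a pair on the same fabric; odd ports to fabric 1, even ports to fabric 2) can be verified mechanically; a sketch assuming the N:S:P port notation used in the examples:

```python
# Sketch: derive the expected fabric for a controller port "N:S:P"
# using the odd-port -> fabric 1 / even-port -> fabric 2 convention.

def expected_fabric(port):
    """Return 1 or 2 for a port given as 'node:slot:port'."""
    node, slot, port_no = (int(x) for x in port.split(":"))
    return 1 if port_no % 2 == 1 else 2

def pair_consistent(port_a, port_b):
    """Ports with the same slot:port on a node pair must share a fabric."""
    return expected_fabric(port_a) == expected_fabric(port_b)

print(expected_fabric("0:2:3"))           # odd port -> fabric 1
print(expected_fabric("1:2:4"))           # even port -> fabric 2
print(pair_consistent("0:2:3", "1:2:3"))  # True
```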
Standard SOPs should be maintained so that a uniform configuration is implemented across all storage arrays

Deploy SATHC (Kyndryl-owned tool) for Storage Automated Health Check, which runs the pre-configured commands on all storage arrays and generates an Excel output listing the alerts and the best practices that have not been followed in the configurations. It also helps fix the issues by providing remediation steps.
Brocade Health Check Best Practices

Command to issue
Security/Compliance Settings

passwdcfg --show

ipfilter --show

switchshow | grep -v light


passwdcfg --showall
passwdcfg.adminlockout: 1
Must be 1 (admin lockout enabled). Default is 0.
passwdcfg --showall
passwdcfg.lockoutduration: 30
Must be at least 30. Default is 0.
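The two passwdcfg checks can be validated by parsing captured `passwdcfg --showall` output; a sketch (the sample text mirrors the values shown above):

```python
# Sketch: validate adminlockout and lockoutduration from
# "passwdcfg --showall" output captured as plain text.

def passwd_checks(output):
    """Return (lockout_ok, duration_ok) for the two checks above."""
    cfg = {}
    for line in output.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            cfg[key.strip()] = value.strip()
    lockout_ok = cfg.get("passwdcfg.adminlockout") == "1"
    duration_ok = int(cfg.get("passwdcfg.lockoutduration", "0")) >= 30
    return lockout_ok, duration_ok

sample = """passwdcfg.adminlockout: 1
passwdcfg.lockoutduration: 30"""
print(passwd_checks(sample))  # (True, True)
```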

syslogdipshow
Configuration settings

switchshow
timeout
tsclockserver

snmpconfig --show snmpv1

snmpconfig --show accesscontrollist

switchshow - all E* ports must be XG, not NX

chassisshow - chassis name must not be different from Kyndryl_xxxx_xxx

Zone members must be either pWWN or D,P; if both are present in the same zone, the check will fail.

Zones not used in the effective configuration

Defined zones or aliases pointing at non-existent WWPNs/D,Ps

Check that all F-Ports are used in an alias or zone

ipaddrshow

portbuffershow

Check whether F-Ports exist in FID 127 - not recommended

zoneshow count number of LSAN_ in zoneshow


switchshow All E-ports for a switch must have either "trunk master" or "master is"
stated.
bottleneckmon --showcredittools Internal port credit recovery is Enabled with
LrOnly
mapspolicy --show -summary - Prereq is FOS v7.2
mapsconfig --show
RASLOG,SNMP,EMAIL,SW_CRITICAL,SW_MARGINAL,SFP_MARGINAL Since mapsdb
--show command will be used for another check it can also be used for this check.
Prereq is FOS v7.2

iodshow

look for NOT seeing "IOD is not set"


dlsshow

DLS is set with Lossless enabled


What command to display ?

Should be set to Insistent Domain ID=yes

creditrecovmode --show
look for Enabled with LROnly for port recovery and Enabled on credit loss
detection

segment: configshow

match: switch.edgeHoldTime:\s*(\d+)
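A sketch applying the edgeHoldTime match pattern (with its regex backslashes restored) to configshow output and validating the captured value against the recommended settings (0, meaning the default applies, or a value in the 80-220 interval):

```python
import re

# Pattern for the edge hold time line in "configshow" output.
EHT_RE = re.compile(r"switch\.edgeHoldTime:\s*(\d+)")

def edge_hold_time_ok(configshow_output):
    """True if edgeHoldTime is 0 (default) or within 80-220."""
    m = EHT_RE.search(configshow_output)
    if not m:
        return False  # setting not found at all
    value = int(m.group(1))
    return value == 0 or 80 <= value <= 220

print(edge_hold_time_ok("switch.edgeHoldTime:220"))  # True
print(edge_hold_time_ok("switch.edgeHoldTime:500"))  # False
```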
snmpconfig --show snmpv3

Trap Entry must contain IP


Without proper names set on switch ports, debugging and alerting are more difficult, and investigations can take longer.
creditrecovmode --show
Internal port credit recovery is Enabled with LrOnly
C2 FE Complete Credit Loss Detection is Enabled

mapspolicy --show -summary


relayconfig --show
seccryptocfg --show
HTTPS should use a minimum of TLS v1.2
HW/SW settings

sfpshow port number

licenseshow If ISLs present, trunk license must be also


licenseshow

firmwareshow - v<version> must be identical on all partitions


Status Monitoring

trunkshow
fanshow

switchstatusshow

tempshow

tempshow

hashow must be: HA enabled, Heartbeat Up, HA State synchronized

switchviolation --dump -dcc


portstatsshow er_crc "value" <2500

portstatsshow er_enc_out "value" <10000

portstatsshow er_enc_in "value" <10000

portstatsshow er_bad_os "value" <2500

portstatsshow er_rx_c3_timeout "value" <2500

portstatsshow er_tx_c3_timeout "value" <2500

portstatsshow er_c3_dest_unreach "value" <2500
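The per-counter thresholds listed above can be evaluated in a single pass over collected portstatsshow values; a sketch:

```python
# Sketch: compare portstatsshow error counters against the
# thresholds listed above (counters must stay BELOW these values).

THRESHOLDS = {
    "er_crc": 2500,
    "er_enc_out": 10000,
    "er_enc_in": 10000,
    "er_bad_os": 2500,
    "er_rx_c3_timeout": 2500,
    "er_tx_c3_timeout": 2500,
    "er_c3_dest_unreach": 2500,
}

def failed_counters(stats):
    """Return the counters that are at or above their threshold."""
    return {name: value for name, value in stats.items()
            if name in THRESHOLDS and value >= THRESHOLDS[name]}

stats = {"er_crc": 3000, "er_enc_out": 12, "er_bad_os": 0}
print(failed_counters(stats))  # {'er_crc': 3000}
```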

psshow

slotshow

sfpshow

switchshow
Mod_* and *_Flt -> bad SFP
No_Sync -> bad attached device

switchshow | check if port is segmented

239 8 47 3def00 id N8 Online FC E-Port


segmented,10:00:00:05:33:19:aa:00 (zone conflict)(Trunk master)

look for word segmented


cfgsize

Relation between available and max


"errdump --severity critical --count 20 --reverse" {msgs_critical}

"errdump --severity error --count 20 --reverse"

"errdump --severity warning --count 20 --reverse"


sddquarantine --show
fosexec --fid all -cmd "sddquarantine --show"

Good examples:

"sddquarantine" on FID 128:


Ports marked as Slow Drain Quarantined in the Local Switch:
None

portshow-all
Invalid_word: greater than 500 per day

#### <portshow-all>
portIndex: 0
portName: slot1 port0
portHealth: OFFLINE

Authentication: None
portDisableReason: Persistently disabled port
portCFlags: 0x0
portFlags: 0x4021 PRESENT U_PORT DISABLED LED
LocalSwcFlags: 0x0
portType: 17.0
portState: Persistently Disabled
Protocol: FC
portPhys: 4 No_Light portScn: 2 Offline
port generation number: 0
state transition count: 1

portId: 590000
portIfId: 43120038
portWwn: 20:00:00:05:1e:48:7a:00
portWwn of device(s) connected:

Distance: normal
portSpeed: N8Gbps

Credit Recovery: Inactive


LE domain: 0
Peer beacon: Off
FC Fastwrite: OFF

hashow must be:

SANA_DCX2:FID16:dlutz> hashow
Local CP (Slot 4, CP0): Active, Cold Recovered
Remote CP (Slot 5, CP1): Standby, Healthy
HA enabled, Heartbeat Up, HA State synchronized
hashow must be:

SANA_DCX2:FID16:dlutz> hashow
Local CP (Slot 4, CP0): Active, Cold Recovered
Remote CP (Slot 5, CP1): Standby, Healthy
HA enabled, Heartbeat Up, HA State synchronized

portstatsshow
port: 173
tim_txcrd_z 536351149
Failed because the captured value [536351149] was not less than or equal to
[24000000]

Value must be below 24M
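The 24M limit follows from the counter's tick size: per Brocade documentation, tim_txcrd_z increments once per 2.5 microsecond interval spent at zero transmit credit, so 24,000,000 ticks corresponds to roughly one minute of credit starvation in the 24-hour window. A sketch:

```python
# tim_txcrd_z counts 2.5 microsecond intervals spent at zero
# transmit credit (per Brocade documentation), so the 1 min / 24 h
# rule translates to a tick threshold of 24,000,000.

TICK_SECONDS = 2.5e-6  # 2.5 microseconds per tim_txcrd_z tick

def txcrd_zero_seconds(ticks):
    """Convert a tim_txcrd_z delta into seconds at zero credit."""
    return ticks * TICK_SECONDS

def within_limit(ticks, limit_seconds=60):
    """True if the port spent no more than limit_seconds at zero credit."""
    return txcrd_zero_seconds(ticks) <= limit_seconds

print(txcrd_zero_seconds(24_000_000))  # ~60 seconds: the 24M limit
print(within_limit(536351149))         # the failing example above
```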


Potential risks

Industry security standards define certain requirements surrounding passwords, such as length, expiration, unique characters, frequency, etc.
By not meeting the security requirements set forth in the specification, we put the Customer and Kyndryl at risk and may breach our agreed SLA with the customer

As per ITCS104, telnet port 23 is not allowed to be used, except from BNA
As per ITCS104, HTTP port 80 should be denied access, except from BNA.
Having non-disabled ports creates difficulty while collecting the inventory of the environment

By not meeting the security requirements set forth in the specification we put
Customer and Kyndryl at risk and may breach our agreed SLA with the customer

By not meeting the security requirements set forth in the specification we put
Customer and Kyndryl at risk and may breach our agreed SLA with the customer

Problems on the fabric may go undetected. Syslog messages should be forwarded to an external syslog server for long-term safekeeping and log analysis.

Having a loop port in the environment will lead to performance issues. Loop ports may appear in the SAN due to improper login/registering of a host into the SAN
This error has occurred because the timeout is set too high
Without clock synchronization it is much more difficult to correlate logs of events
across multiple devices, and unsynchronized clocks may cause problems with
some protocols.

SNMP authentication is weak, which could allow an attacker to modify a configuration through SNMP write more easily than by using a userid and password.
If no IP address or traps are defined for SNMP, generated alerts cannot be trapped and sent for investigation. Serious events risk going undetected, causing impact to customers.
SNMP communities are functionally equivalent to passwords and may be used to access device configuration and status information. If SNMP write is enabled, they can be used to modify device configuration.

SNMP authentication is weak, which could allow an attacker to modify a configuration through SNMP write more easily than by using a userid and password.
With ISL/IFL ports in "Auto Negotiate" mode, the switches keep checking connectivity, which causes both switches to exchange capabilities and may lead to principal switch polling
From FOS 6.3.x, supportsave requires the chassis name, as it collects data based on the chassis name.

Potential security and performance issues, when not following Best Practice

Potential security and performance issues, when not following Best Practice

Potential security and performance issues, when not following Best Practice

Potential security and performance issues, when not following Best Practice
Having no IP assigned to one or more control processors will result in access issues to the environment.

Whenever the environment runs out of buffer-to-buffer credits, or has fewer than 50 buffer-to-buffer credits per ASIC, frames are lost, which eventually results in loss of data and major performance issues due to latency.
Considering the topology of the fabric, hosts/servers may lose connectivity to the storage, resulting in loss of data.
A switch where Virtual Fabrics are enabled should have only EX_Ports in the base fabric (FID 127), as the base fabric is mainly used for routing. The base fabric is also used for Inter-Chassis Links (ICLs) between two physical chassis
If the number of LSAN zones reaches the limit, it will increase the bandwidth consumed and eventually lead to performance issues.
Running an ISL without trunking, i.e., single Fibre connection(s) between 2 SAN switches, carries some risks:
• Single point of failure, causing fabric segmentation and loss of connectivity if the only (or last) connection between the 2 switches is lost
• Performance bottleneck; ISL Trunking is designed to significantly reduce traffic congestion in storage networks.
If there are 2 ISLs between the switches, there are multiple scenarios why a trunk is not formed:
• there is no trunking license
• the links are cabled to different ASICs at either end
• the difference in the lengths of the cables is too large
• there is "noise" in one cable; this could be a bad connector or patch panel, a cable bent too much, etc.

Credit loss resulting in frame discards, causing I/O failures and I/O timeouts.
Lack of monitoring data, to solve issues with switch and/or links.

No Events sent, if issues are detected by MAPS.

FICON device require that all frames and frame sequences be received in order or
device/equipment checks will occur. IOD being enabled is a requirement for any
fabric with FICON devices.

Similarly for OpenSystems, IOD must be disabled to ensure maximum performance during reconfiguration

Without Lossless, I/O operations will pause when rebalancing occurs due to switch issues, resulting in I/O timeouts to host systems.
The switch domain ID may change after a switch reboot, impacting some switch functions and causing servers to discover storage devices with new addresses, leading to potential storage access issues.

Errors have occurred either because credit tools are not enabled and/or recovery is set to something other than LrOnly.

Running Fabric OS version 7.4 to 8, this feature enables credit loss recovery on
internal and back-end ports
Edge hold time is not set to recommended value of either:
0 - meaning default value is applied
OR
Set to a number in the interval 80 - 220.
If no IP address or traps are defined for SNMP, generated alerts cannot be trapped and sent for investigation.
Serious events risk going undetected, causing impact to customers.
Without proper names set on switch ports, debugging and alerting are more difficult, and investigations can take longer.
Internal credit loss will not be detected which will result in critical performance
impact to hosts using the fabric.

The Brocade default policies have caused many outages when FENCE and SDDQ are enabled, as they (even the conservative one) are too aggressive for the average Kyndryl SO environment.

With the Kyndryl_SO custom thresholds, we allow FENCE and SDDQ to be optionally enabled for high-availability fabrics.
Important events can go undetected

Unauthorized access and audit failure

Having a non-Brocade vendor SFP connected to a Brocade DCX results in the port going offline
Not providing a trunking license to the switch is a limitation on the expansion of the fabric
Lack of monitoring data, to solve issues with switch and/or links.

Running different firmware versions might result in running an undesired firmware version after boot, and in not having the same features and functions available as expected.

Whenever you have a high deskew value, the environment is at risk of performance issues. Brocade states that deskew values up to 30 are supported without visible performance issues, but once the deskew crosses 20 the environment will have performance issues, due to the following reasons:
1. Data loss due to frame loss
2. Increase in latency due to high deskew
3. Extended problem determination
Fans running at abnormal speed can lead to switch shutdown
Whenever the switch status changes from HEALTHY to MARGINAL, there is a potential threat that the switch may go down, which will impact the environment; hosts will lose connectivity with the SAN
Whenever switch temperatures exceed the permissible limits, the switch will go down. Considering the topology of the fabric, hosts/servers will lose connectivity to the storage, resulting in loss of data.
Whenever switch temperatures exceed the permissible limits, the switch will go down. Considering the topology of the fabric, hosts/servers will lose connectivity to the storage, resulting in loss of data.
If HA is not in a synchronized state, the heartbeat between the two CPs is not up and HA is disabled, resulting in no failover in case of any kind of problem with the active CP.
If a device is connected but not defined in the switch database, the port will go offline and the device will not log into the switch database
Frames with CRC errors are typically discarded by the receiving device, causing a command timeout on the device that sent the frame.
Links with encoding errors outside of frames probably have poor link quality, causing ordered sets to be corrupted and resulting in I/O errors.
Links with encoding errors outside of frames probably have poor link quality, causing ordered sets to be corrupted and resulting in I/O errors.

Loss of synchronization if running an 8Gb link, causing interruption to the data stream.


Discards of frames will result in I/O timeouts and retransmission of frames, causing interruption of the data stream.
Discards of frames will result in I/O timeouts and retransmission of frames, causing interruption of the data stream.
Discards of frames will result in I/O timeouts and retransmission of frames, causing interruption of the data stream.
Loss of power, resulting in the entire switch going offline and thereby impacting server and storage connectivity.
Any blades present that are not OK are potentially:
1. impacting server and storage connectivity, if faulty
2. a waste of capacity, if present but not used (status DISABLED)

Lack of connectivity between the switch and device, server or switch in the other
end, potentially causing severe problems like Single Point of Failures.

This condition indicates that there is at least one switch in the fabric we can no
longer communicate with, resulting in a segmented fabric where not all devices
are able to communicate.

Not able to add more items to the configuration, such as zones and aliases
Depending on the type of message, the impact can be anything from a single server to the entire environment, causing performance, stability, and other issues.
Depending on the type of message, the impact can be anything from a single server to the entire environment, causing performance, stability, and other issues.
Depending on the type of message, the impact can be anything from a single server to the entire environment, causing performance, stability, and other issues.
If the port were not quarantined, it could potentially impact the rest of the fabric, causing performance and availability issues.

If the port remains quarantined, performance for the device on this port will be limited.
Note: Only a fixed number of ports can be quarantined at any time; having too many quarantined ports will limit the ability to protect against other slow-drain devices.

Link is typically online but unusable resulting in loss of a path to the attached
device.

If HA is not in a synchronized and working state, the heartbeat between the two CPs is not up and HA is disabled, resulting in no failover in case of any kind of problem with the active CP.
If HA is not in a synchronized and working state, the heartbeat between the two CPs is not up and HA is disabled, resulting in no failover in case of any kind of problem with the active CP.

Congestion
