Sample SLAs (Gold, Silver, Bronze, Copper)
These SLAs outline the levels of service and support that our IT department will provide to stakeholders within the organization, in four tiers: Gold, Silver, Bronze, and Copper.
Gold SLA (Mission-Critical Services)
Availability: 99.99% uptime for mission-critical systems.
NOTE: How will you measure this? Open-source tools such as Nagios, Zabbix, and Prometheus, or commercial services such as New Relic and Datadog, can track uptime. Monitor from a diverse set of locations or regions where your service is used; this helps identify regional outages and gives a more accurate picture of overall availability. Use synthetic transactions or automated scripts to simulate user interactions and test the availability and responsiveness of your services, so that issues can be caught before they impact real users.
Response Time: <15 minutes for critical incidents.
NOTE: This needs to be defined. The SLA should clarify what constitutes a "response"; it may be a simple acknowledgment that the request has been received. SLAs may also define exceptions or exclusions where response time commitments do not apply, such as scheduled maintenance periods or issues caused by the customer's actions.
Resolution Time: Critical incidents resolved within 2 hours.
NOTE: The resolution commitment depends on how "response" is defined (see the Response Time note above).
Support Hours: 24/7 support availability.
Cybersecurity: Continuous monitoring and immediate response to security incidents.
Regular external penetration testing and vulnerability assessments.
RTO for cybersecurity incidents: <1 hour. RPO for cybersecurity incidents: zero data loss.
Data Retention/Recovery:
Snapshots every 15 minutes for 8 hours.
Daily backups for 21 days with an RPO of 4 hours. Data recovery within 2 hours (RTO) for mission-critical data.
Weekly backups for 6 weeks.
Monthly backups for 6 months.
Yearly backups for 5 years.
NOTE: Backups should be immutable for 14 days.
Automated recovery testing.
Change Requests: Expedited processing for high-priority changes.
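To make the Gold availability target concrete, it helps to convert the percentage into an allowed-downtime budget. The following is a minimal sketch, not a prescribed method: the function name and the 30-day and 365-day measurement windows are assumptions, since this SLA does not yet state a measurement window.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float) -> float:
    """Allowed downtime, in minutes, for an availability target over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# Gold target from this document; the 30-day and 365-day windows are assumptions.
monthly = downtime_budget_minutes(99.99, 30)
yearly = downtime_budget_minutes(99.99, 365)
print(f"Gold: {monthly:.2f} min per 30 days, {yearly:.2f} min per year")
```

Under these assumptions, 99.99% leaves about 4.3 minutes of downtime per 30 days. A single outage lasting as long as the 15-minute response target would already exceed that budget, which is worth weighing when the targets are finalized.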
Silver SLA (Business-Critical Services)
Availability: 99.9% uptime for business-critical systems.
NOTE: How will you measure this? Open-source tools such as Nagios, Zabbix, and Prometheus, or commercial services such as New Relic and Datadog, can track uptime. Monitor from a diverse set of locations or regions where your service is used; this helps identify regional outages and gives a more accurate picture of overall availability. Use synthetic transactions or automated scripts to simulate user interactions and test the availability and responsiveness of your services, so that issues can be caught before they impact real users.
Response Time: <30 minutes for high-priority incidents.
NOTE: This needs to be defined. The SLA should clarify what constitutes a "response"; it may be a simple acknowledgment that the request has been received. SLAs may also define exceptions or exclusions where response time commitments do not apply, such as scheduled maintenance periods or issues caused by the customer's actions.
Resolution Time: High-priority incidents resolved within 4 hours.
NOTE: The resolution commitment depends on how "response" is defined (see the Response Time note above).
Support Hours: Business hours support with on-call availability.
Cybersecurity: Regular security assessments and proactive threat hunting.
RTO for cybersecurity incidents: <4 hours. RPO for cybersecurity incidents: 1 hour.
Data Retention/Recovery:
Snapshots every 60 minutes for 8 hours.
Daily backups for 14 days with an RPO of 8 hours. Data recovery within 4 hours (RTO) for business-critical data.
Weekly backups: retention TBD.
Monthly backups: retention TBD.
Yearly backups: retention TBD.
NOTE: Backups should be immutable for 14 days.
Automated recovery testing.
Change Requests: Timely processing of change requests within business hours.
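Once "response" and "resolution" are defined, each incident can be checked against the Silver targets mechanically. The sketch below is illustrative only and rests on several assumptions that this SLA leaves open: "response" means acknowledgment, both clocks start when the incident is opened, time is measured on the wall clock rather than in business hours, and no exclusions apply. The `Incident` class and `check_incident` function are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    opened: datetime
    acknowledged: datetime  # assumption: "response" means acknowledgment
    resolved: datetime


def check_incident(incident: Incident,
                   response_target: timedelta,
                   resolution_target: timedelta) -> dict:
    """Compare one incident against response and resolution targets.

    Assumption: both clocks start when the incident is opened and run on
    wall-clock time, with no exclusions for maintenance windows.
    """
    response_time = incident.acknowledged - incident.opened
    resolution_time = incident.resolved - incident.opened
    return {
        "response_met": response_time < response_target,
        "resolution_met": resolution_time <= resolution_target,
    }


# Silver targets from this document: <30 minutes response, 4 hours resolution.
incident = Incident(
    opened=datetime(2024, 1, 8, 9, 0),
    acknowledged=datetime(2024, 1, 8, 9, 20),
    resolved=datetime(2024, 1, 8, 13, 30),
)
result = check_incident(incident, timedelta(minutes=30), timedelta(hours=4))
print(result)  # {'response_met': True, 'resolution_met': False}
```

In this made-up example the incident was acknowledged in 20 minutes but took 4.5 hours to resolve, so it meets the response target and misses the resolution target. If the resolution clock is meant to start at acknowledgment instead, only the `resolution_time` line changes.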
Bronze SLA (Standard Services)
Availability: 99.5% uptime for standard services.
NOTE: How will you measure this? Open-source tools such as Nagios, Zabbix, and Prometheus, or commercial services such as New Relic and Datadog, can track uptime. Monitor from a diverse set of locations or regions where your service is used; this helps identify regional outages and gives a more accurate picture of overall availability. Use synthetic transactions or automated scripts to simulate user interactions and test the availability and responsiveness of your services, so that issues can be caught before they impact real users.
Response Time: <1 hour for standard incidents.
NOTE: This needs to be defined. The SLA should clarify what constitutes a "response"; it may be a simple acknowledgment that the request has been received. SLAs may also define exceptions or exclusions where response time commitments do not apply, such as scheduled maintenance periods or issues caused by the customer's actions.
Resolution Time: Standard incidents resolved within 8 hours.
NOTE: The resolution commitment depends on how "response" is defined (see the Response Time note above).
Support Hours: Business hours support.
Cybersecurity: Periodic security assessments.
RTO for cybersecurity incidents: <8 hours. RPO for cybersecurity incidents: 4 hours.
Data Retention/Recovery:
Daily snapshots for 3 days.
Weekly backups for 6 weeks with an RPO of 24 hours. Data recovery within 8 hours (RTO) for standard data.
Monthly backups: retention TBD.
Yearly backups: retention TBD.
Change Requests: Processed within standard change request timelines.
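A retention schedule like the Bronze one can be sanity-checked with simple arithmetic: how many restore points each schedule holds, and whether the most frequent capture interval is short enough to meet the stated RPO. This is a rough sketch under stated assumptions: the `Schedule` class and helper names are hypothetical, every capture is assumed to succeed, and snapshots are counted as valid restore points even though they may share storage with the source.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Schedule:
    name: str
    interval: timedelta   # time between captures
    retention: timedelta  # how long each capture is kept

    def restore_points(self) -> int:
        """Number of captures held once the schedule reaches steady state."""
        return self.retention // self.interval


def worst_case_data_loss(schedules: list[Schedule]) -> timedelta:
    """Longest gap between captures, assuming the most frequent schedule never fails."""
    return min(s.interval for s in schedules)


# Bronze values from this document.
bronze = [
    Schedule("snapshot", timedelta(days=1), timedelta(days=3)),
    Schedule("weekly backup", timedelta(weeks=1), timedelta(weeks=6)),
]
rpo = timedelta(hours=24)

for s in bronze:
    print(f"{s.name}: {s.restore_points()} restore points")
print("RPO achievable:", worst_case_data_loss(bronze) <= rpo)
```

With the Bronze values, the daily snapshots are what make the 24-hour RPO reachable; the weekly backups alone would not. Running the same check against the other tiers is a quick way to spot schedules that cannot meet their stated RPO.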
Copper SLA (Best Effort Services)
Availability: "Best effort" availability.
Measurement and monitoring are NOT required.
"Whoa, we're half way there...Whoa oh, livin' on a prayer...Take my hand, we'll make it, I swear...Whoa oh, livin' on a prayer"
Response Time: "Best effort" response time, subject to resource availability.
Resolution Time: "Best effort" resolution time, subject to resource availability.
Support Hours: Limited support during business hours.
Cybersecurity: Limited cybersecurity monitoring and response.
RTO for cybersecurity incidents: "Best effort", subject to resource availability. RPO for cybersecurity incidents: "Best effort", subject to resource availability.
NOTE: Copper systems may be quarantined or isolated in response to a security incident.
Data Retention/Recovery:
Monthly backups for 6 months with an RPO of 72 hours. Data recovery on a best effort basis, subject to resource availability.
Change Requests: Processed on a best effort basis, subject to resource availability.
Please note that the categorization of services into these SLA levels is based on their criticality to the organization. Gold SLA services are the most critical and receive the highest level of attention, while Copper SLA services are considered "best effort" and may have longer response and resolution times. These SLAs will help ensure that we provide the appropriate level of support to meet the needs of different parts of our organization while managing resource allocation effectively.
Maintain open and transparent communication with your customers. If an incident occurs that affects availability, inform your customers promptly and provide updates on the resolution process.
VMware Tags and SLAs
Using VMware tags can be a helpful way to implement different Service Level Agreements (SLAs) for virtual machines (VMs) and infrastructure components within a VMware environment. VMware tags allow you to categorize and label VMs and other objects, making it easier to manage and enforce SLAs. Here's how you can use VMware tags to accomplish this:
1. Create SLA Categories:
First, determine the tiers that align with your SLAs (e.g., Gold, Silver, Bronze, Copper).
In the VMware vSphere Client, navigate to the "Tags & Custom Attributes" section, create a tag category (for example, "SLA"), and create one tag per tier inside it. Setting the category to allow only one tag per object keeps a VM from carrying two tiers at once.
2. Assign Tags to VMs and Objects:
Assign the appropriate SLA tags to each VM and infrastructure object (e.g., datastores, clusters) based on their SLA requirements.
You can assign tags to VMs when creating or editing them in the vSphere Client. Tags can also be applied to other objects such as datastores or resource pools.
3. Automation and Policies:
Use VMware's policy-based management features to automate actions based on the assigned tags. For example:
Resource Allocation: Create resource allocation policies that allocate CPU, memory, and storage resources based on the VM's SLA tag.
Backup and Recovery: Integrate backup and recovery solutions that automatically back up VMs with specific tags according to their SLA requirements.
Security Policies: Implement security policies that vary based on the security requirements associated with each SLA category.
4. Monitoring and Reporting:
Leverage VMware management and monitoring tools to keep track of SLA compliance. You can create custom dashboards or reports that show the performance and compliance status of VMs based on their tags.
5. Resource Allocation and DRS:
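Use VMware's Distributed Resource Scheduler (DRS) to automatically balance VMs across hosts. DRS places VMs according to resource demand, shares, reservations, limits, and affinity rules, so map each SLA tag to those settings (for example, higher shares or reservations for VMs tagged "Gold"). This ensures that VMs with higher SLA requirements get the necessary resources.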
6. Alerting and Notifications:
Configure alerting and notification systems to trigger alerts when VMs with specific SLA tags experience issues or breaches of their SLAs.
7. Scaling and Maintenance:
Automate scaling and maintenance tasks based on SLA tags. For example, VMs with a "Gold" tag may require more frequent updates and patching than those with a "Copper" tag.
8. Documentation and Communication:
Maintain documentation that outlines the SLA categories and their associated VMware tags. Ensure clear communication with your team so that everyone understands the significance of each tag.
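The tagging and policy steps above depend on every VM carrying exactly one SLA tag. A minimal sketch of that check, with a tag-to-policy lookup, is shown below. It assumes tag assignments have already been exported from vCenter into a plain dictionary; the export step is omitted, and the VM names, the non-SLA "Linux" tag, and the function name are hypothetical. The snapshot intervals come from the tiers earlier in this document.

```python
SLA_TAGS = {"Gold", "Silver", "Bronze", "Copper"}

# Snapshot intervals in minutes, taken from the tiers in this document.
# Copper defines no snapshots, only monthly backups, hence None.
SNAPSHOT_INTERVAL_MIN = {"Gold": 15, "Silver": 60, "Bronze": 1440, "Copper": None}


def sla_tier(vm_name: str, tags: list[str]) -> str:
    """Return the single SLA tag on a VM, or raise if there is not exactly one."""
    matches = [t for t in tags if t in SLA_TAGS]
    if len(matches) != 1:
        raise ValueError(f"{vm_name}: expected exactly one SLA tag, found {matches}")
    return matches[0]


# Hypothetical inventory export: VM name -> tags attached to it.
inventory = {
    "erp-db-01": ["Gold", "Linux"],
    "intranet-01": ["Bronze"],
    "lab-test-07": ["Linux"],  # missing an SLA tag
}

for vm, tags in inventory.items():
    try:
        tier = sla_tier(vm, tags)
        print(f"{vm}: {tier}, snapshot every {SNAPSHOT_INTERVAL_MIN[tier]} min")
    except ValueError as err:
        print(f"NEEDS REVIEW -> {err}")
```

Failing loudly on untagged or doubly tagged VMs is a deliberate choice in this sketch: silently defaulting such a VM to a tier would hide exactly the gaps that the documentation step is meant to catch.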
Further reading: "What Are vSphere Tags and How to Use Them" and "vSphere 7 Tagging Best Practices".
By using VMware tags in this way, you can effectively manage and enforce SLAs for your virtual infrastructure, making it easier to provide the appropriate level of service to different parts of your organization while ensuring resource allocation, security, and compliance.
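As one illustration of enforcement, measured availability can be compared against each VM's tier target, keyed by its SLA tag. The sketch below assumes your monitoring tool can report successful and total probe counts per VM; the sample data, VM names, and function names are hypothetical, and the targets are the availability figures from the tiers in this document.

```python
# Availability targets (percent) from the tiers in this document.
# Copper is best effort, so it has no target to breach.
TARGET_PCT = {"Gold": 99.99, "Silver": 99.9, "Bronze": 99.5, "Copper": None}


def availability_pct(successful_probes: int, total_probes: int) -> float:
    """Share of monitoring probes that succeeded, as a percentage."""
    if total_probes <= 0:
        raise ValueError("total_probes must be positive")
    return 100 * successful_probes / total_probes


def breaches(samples: list[tuple[str, str, int, int]]) -> list[str]:
    """Return names of VMs whose measured availability is below their tier target.

    Each sample is (vm_name, sla_tag, successful_probes, total_probes).
    """
    failing = []
    for vm, tag, ok, total in samples:
        target = TARGET_PCT[tag]
        if target is not None and availability_pct(ok, total) < target:
            failing.append(vm)
    return failing


# Hypothetical probe counts for one 30-day window at one probe per minute.
samples = [
    ("erp-db-01", "Gold", 43_198, 43_200),      # 2 failed probes
    ("crm-app-02", "Silver", 43_100, 43_200),   # 100 failed probes
    ("lab-test-07", "Copper", 40_000, 43_200),  # best effort, never flagged
]
print(breaches(samples))  # ['crm-app-02']
```

Probe-based availability is only as fine-grained as the probe interval, so at one probe per minute a Gold VM can miss at most four probes in 30 days before it falls below 99.99%.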