Universal Service Monitoring with Datadog

Uploaded by

harshascosmos1

Will be working with the Architecture teams.
Will have to handle clients and advise them on how to create dashboards.

12/06: [9:24 PM] Shukla, Vivek


Harshavardhan, Tenepalli, Sathyanathan, Sangeetha - I believe you saw the email
conversation between Jason and Matt, so please watch for any invite from Jason
(hopefully he will send it tomorrow, since everyone is in Dallas for a customer
meeting until today) and be ready.
Please refer to the following email for the customer's challenges. It will be an
opportunity for you to recommend good, innovative solutions/options during
meetings with them.
[9:25 PM] Shukla, Vivek
Once you get a chance, please set up a quick 10-15 minute call with Kalyan Krishna,
Prabhakula to share your desktop and show the email, and to review how you want to
present yourselves for requirements gathering and any recommendations while meeting
several people from the apps team.

Datadog UNIVERSAL SERVICE MONITORING: It is difficult to find out why a container is
using more CPU without factoring in the services that run inside it.
These services are one of the biggest blind spots in monitoring infrastructure
health.
Universal Service Monitoring (USM) automatically discovers, monitors, and maps every
service running on your infrastructure, without code changes or redeployment,
using the Datadog Agent and eBPF.
USM discovers services and their dependencies and maps them in the Service Map.
Every service discovered by USM is covered with consistent health metrics
(throughput, error count, and latency), meaning you can cover your entire
fleet of services with a single monitor.
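
To make the "single monitor for the whole fleet" idea concrete, here is a minimal Python sketch. It does not call Datadog; the service names, numbers, and thresholds are made up for illustration. It only shows how one rule over throughput, error count, and latency can be applied uniformly to every discovered service.

```python
from dataclasses import dataclass


@dataclass
class ServiceStats:
    name: str              # hypothetical service name
    requests: int          # throughput over the evaluation window
    errors: int            # error count over the same window
    p95_latency_ms: float  # latency over the same window


def unhealthy_services(stats, max_error_rate=0.05, max_p95_ms=500.0):
    """Apply one rule to every service and return the names that breach it."""
    flagged = []
    for s in stats:
        # Guard against division by zero for idle services.
        error_rate = s.errors / s.requests if s.requests else 0.0
        if error_rate > max_error_rate or s.p95_latency_ms > max_p95_ms:
            flagged.append(s.name)
    return flagged


# Illustrative data only; real values would come from the monitoring platform.
fleet = [
    ServiceStats("checkout", 1000, 80, 120.0),
    ServiceStats("search", 5000, 10, 900.0),
    ServiceStats("auth", 2000, 4, 60.0),
]
print(unhealthy_services(fleet))
```

Because the rule is written once and evaluated per service, adding a newly discovered service needs no new rule, which is the point of grouping a monitor by service.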

MATT'S REQUIREMENTS
LINKS:
1. https://docs.datadoghq.com/agent/fleet_automation/ (Fleet Automation)
   https://www.datadoghq.com/blog/deploy-datadog-on-windows-with-ansible/ (deploy the
   Datadog Agent with Ansible)
2. https://learn.microsoft.com/en-us/azure/azure-monitor/vm/vminsights-optout
   (to disable Azure VM insights)
   https://learn.microsoft.com/en-us/azure/virtual-desktop/insights?tabs=monitor

[1:24 AM] Parrish, Jason


DataDog Administrator: This is the person who is responsible for setting up and
configuring the DataDog environment, including the DataDog agents,
integrations, dashboards, alerts, and logs. The DataDog administrator also manages
the access and permissions of the DataDog users and
ensures that the DataDog environment is secure and compliant with the
organization's policies and standards.
The DataDog administrator should have the following skills:
Knowledge of DataDog features and capabilities
Experience with cloud platforms and services, such as AWS, Azure, or GCP
Proficiency in scripting languages, such as Python, Ruby, or PowerShell
Ability to troubleshoot and resolve DataDog issues
Good communication and documentation skills
[1:24 AM] Parrish, Jason
DataDog Developer: This is the person who is responsible for developing and
deploying the applications, systems, and services that are monitored by DataDog.
The DataDog developer also creates and maintains the custom metrics, traces, and
logs that are sent to DataDog for analysis and visualization. The DataDog developer
should have the following skills:
Knowledge of DataDog APIs and SDKs
Experience with application development and deployment tools, such as Git, Jenkins,
or Docker
Proficiency in programming languages, such as Java, C#, or PHP
Ability to integrate DataDog with other tools and platforms, such as Slack,
PagerDuty, or Splunk
Good testing and debugging skills
[1:24 AM] Parrish, Jason
DataDog Analyst: This is the person who is responsible for analyzing and
interpreting the data collected by DataDog. The DataDog analyst also creates and
modifies the DataDog dashboards, reports, and alerts that provide insights and
actionable information to the stakeholders. The DataDog analyst should have the
following skills:
Knowledge of DataDog query language and visualization tools
Experience with data analysis and reporting tools, such as SQL, Excel, or Tableau
Proficiency in data science and machine learning techniques, such as regression,
clustering, or anomaly detection
Ability to communicate and present the data findings and recommendations to the
stakeholders
Good critical thinking and problem-solving skills

[12:46 AM] Banks, Mike


Active Directory - Anatoly Miller
Remi - Kelvin Malyar
AutoSys
GoAnywhere
CDS/Marketdata
Veritas - Matt Juaire, Steven Daugherty, Roy Borges
Verint - Matt Juaire, Steven Daugherty, Roy Borges
[12:47 AM] Banks, Mike
Active Directory / Replication is #1
[1:01 AM] Parrish, Jason
RBAC_App_Datadog_Users
[1:14 AM] Banks, Mike
Some more DataDog videos:
https://www.youtube.com/playlist?list=PL0xeHY_ImQVVXHAExfdxLdfufEtZs2Ye2
[1:23 AM] Shukla, Vivek
SRE - Solution Reliability Engineer

[11:06 PM] Shukla, Vivek


Documentation provided by Datadog and Atlassian (OpsGenie) for integration:
Datadog: OpsGenie (datadoghq.com)
OpsGenie: Integrate Opsgenie with Datadog | Opsgenie | Atlassian Support
OpsGenie to JSM: Integrate Opsgenie with Jira Service Management Cloud |
Opsgenie | Atlassian Support
[11:07 PM] Shukla, Vivek
Kalyan Krishna, Prabhakula - Please strongly emphasize upskilling our team on
Datadog integration and dashboard creation.
As we discussed, it is the customer's topmost requirement.

[9:41 PM] Shukla, Vivek


Everyone - Start going through it at the first opportunity and share once
completed.
Kalyan Krishna, Prabhakula - Please add it to our KT tracker so the team completes
it on priority.

https://www.datadoghq.com/blog/monitoring-101-collecting-data/

The tag "env:core-vm-domain_infra" likely represents metadata associated with
Active Directory (AD) hosts within your infrastructure. Let's break down what it
might mean:

env: This likely stands for "environment" and indicates the environment or context
in which the host resides. In this case, "core-vm-domain_infra" might suggest that
these hosts belong to the infrastructure environment of your core virtual machine
domain.

core-vm-domain_infra: This part of the tag provides more specific information about
the environment or purpose of the hosts. "Core" might indicate that these hosts are
essential or central to your infrastructure. "VM" could signify that these hosts
are virtual machines. "Domain" could suggest that they belong to a specific domain
within your infrastructure, such as an Active Directory domain. "_infra" might
further specify that these hosts are infrastructure-related.

So, altogether, "env:core-vm-domain_infra" likely indicates that these hosts are
part of the core infrastructure, specifically related to virtual machines within a
domain environment.

Yes, you can use this tag to filter your AD hosts in Datadog. Using this tag, you
can specifically target hosts that match this metadata, allowing you to monitor and
manage them more effectively within your monitoring and analytics platform.
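
As a rough illustration of tag-based filtering, the Python sketch below splits a `key:value` tag and selects hosts that carry it. The host names and the extra `role:` tags are invented for the example; only `env:core-vm-domain_infra` comes from the notes above.

```python
def parse_tag(tag):
    """Split a 'key:value' tag on the first colon; value is None for bare tags."""
    key, sep, value = tag.partition(":")
    return (key, value) if sep else (tag, None)


def hosts_with_tag(hosts, wanted):
    """Return the sorted names of hosts whose tag list contains `wanted`."""
    return sorted(name for name, tags in hosts.items() if wanted in tags)


# Hypothetical inventory; in practice the tags come from the agent or integrations.
hosts = {
    "ad-dc-01": ["env:core-vm-domain_infra", "role:domain-controller"],
    "ad-dc-02": ["env:core-vm-domain_infra"],
    "sql-01": ["env:prod", "role:database"],
}

print(parse_tag("env:core-vm-domain_infra"))
print(hosts_with_tag(hosts, "env:core-vm-domain_infra"))
```

In a Datadog query the same tag would appear as the scope in curly braces, for example `avg:system.cpu.user{env:core-vm-domain_infra} by {host}`.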

I would like to see tomorrow what dashboards and sample integrations our team has
created in the Datadog environment that we can use for 2 weeks after signup.

Harshavardhan, Tenepalli
1/12/2024, 11:23 PM
Vivek, we can't install the Datadog Agent on the TEK laptop; we need admin
credentials for that. But we do have a lab environment pre-configured with
dashboards and integrations in the Datadog Learning Center courses we have been
assigned.
Yes, we need to create it in the lab environment. Please also get a good, robust,
value-driven dashboard ready for Tuesday's demo. Thanks!

Please gain the required knowledge and create a good dashboard with as many metrics
as we can (CPU, memory, SLA, and so on), and be ready for a good dashboard demo on
Tuesday.
The dashboard should look good and be value-driven for the support team. Thanks!

END TO END DEMO FOR DATADOG APP INTEGRATION AND DASHBOARD: Dashboard for the
support team (meaningful and important metrics)

We need a machine, a Datadog login, and an agent install; configured a Windows
server on the dd-1 host.

Dashboard creation with widgets: we don't need to monitor all machines (only the
priority ones/ones in production).
Powerpacks are used to monitor system CPU; at present I have taken the default
ones. What is our requirement, and what should we create in the dashboards?
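
A dashboard limited to priority hosts can be described as a JSON document whose widget queries are scoped by a tag. The Python sketch below builds such a document without sending it anywhere. The field names follow the general shape of Datadog's dashboard JSON as I understand it, and the `env:prod` scope is an assumption, so both should be checked against the API reference and the real tagging scheme before use.

```python
import json


def timeseries_widget(title, query):
    """Build one timeseries widget definition for the given metric query."""
    return {
        "definition": {
            "type": "timeseries",
            "title": title,
            "requests": [{"q": query, "display_type": "line"}],
        }
    }


def build_dashboard(title, scope):
    """Build a dashboard body whose widgets only cover hosts matching `scope`."""
    return {
        "title": title,
        "layout_type": "ordered",
        "widgets": [
            timeseries_widget(
                "CPU (user) by host",
                f"avg:system.cpu.user{{{scope}}} by {{host}}",
            ),
            timeseries_widget(
                "Usable memory by host",
                f"avg:system.mem.usable{{{scope}}} by {{host}}",
            ),
        ],
    }


# "env:prod" is a placeholder scope for the priority/production machines.
dashboard = build_dashboard("Support team overview", "env:prod")
print(json.dumps(dashboard, indent=2))
```

Keeping the scope as a parameter means the same layout can be reused for another team by changing only the tag.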

We should come to the dashboard only upon receiving an incident. That's why we need
alerting and notifications based on user thresholds, and therefore monitors that
check metrics, integration availability, network endpoints, and more.
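
A threshold-based metric monitor can be sketched as a small JSON body. The Python below only builds that body; it makes no API call. The thresholds, the `env:prod` scope, and the `@opsgenie-infra-oncall` notification handle are placeholders, and the exact handle format should be confirmed in the OpsGenie integration documentation.

```python
def build_metric_monitor(name, metric, scope, critical, warning, notify):
    """Build a metric monitor body that alerts per host above `critical`."""
    query = f"avg(last_5m):avg:{metric}{{{scope}}} by {{host}} > {critical}"
    # Template variables are kept as plain strings to avoid f-string brace escaping.
    message = (
        "{{#is_alert}}"
        + f"{metric} is above {critical} on "
        + "{{host.name}}{{/is_alert}} "
        + notify
    )
    return {
        "name": name,
        "type": "metric alert",
        "query": query,
        "message": message,
        "options": {
            "thresholds": {"critical": critical, "warning": warning},
            "notify_no_data": False,
        },
    }


monitor = build_metric_monitor(
    "High CPU on production hosts",
    "system.cpu.user",
    "env:prod",                # placeholder scope
    90,                        # placeholder critical threshold (percent)
    80,                        # placeholder warning threshold (percent)
    "@opsgenie-infra-oncall",  # hypothetical notification handle
)
print(monitor["query"])
print(monitor["message"])
```

Grouping by host means one monitor definition raises a separate alert for each machine that crosses the threshold, so the team is notified instead of having to watch the dashboard.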

Everyone and Harshavardhan, Tenepalli, Maity, Subhadeep, Gambiraopet, Shivanand,
Sathyanathan, Sangeetha
- The new requirement regarding Datadog is from the DBA team. Please review and, if
possible, get ready with a similar Datadog demo for Jason, Mike, and me covering
the following. It will be a good start to working with our first stakeholder.

1. Encrypt the database Datadog user credentials.
2. Create a dashboard for the DBA team with simple and in-depth views.
3. Set up some alerts if resources are above threshold.
4. Create a DB dashboard for APP teams. (PLEASE SHARE DETAILS OF THE SPOC, I.E.
SINGLE POINT OF CONTACT)

Active Directory - Anatoly Miller


Remi - Kelvin Malyar
AutoSys
GoAnywhere
CDS/Marketdata
Veritas - Matt Juaire, Steven Daugherty, Roy Borges
Verint - Matt Juaire, Steven Daugherty, Roy Borges

what tags are there for these applications?

Tag filter-

Coming to the hosts, I have added all Oracle, a few SQL Server, and 1 Postgres
already. So only a few SQL Server hosts are missing; other than that, it is done.

Please try to be ready to demo on Friday for sure, as we need to start working with
the stakeholder as early as possible to show them our value.

Requirement:
Harshavardhan, Tenepalli, Gambiraopet, Shivanand, Sathyanathan, Sangeetha - Please
try your best to complete by today this new task/requirement shared by Mike Banks
for adding more hosts to the same dashboard (a few metrics as needed),
and Reply All by today before logging off so we will be on track to show it to Matt,
possibly early next week.
Thanks for your focused attention on it.
Great, if it is ready before our shift ends then please IM Mike/Jason and set up a
quick meeting including Kalyan Krishna, Prabhakula to walk through it.

SRE stands for Site Reliability Engineer, not Solution Reliability Engineer. Site
Reliability Engineering (SRE) is a discipline that incorporates
aspects of software engineering and applies them to infrastructure and operations
problems. The main goals of SRE are to create scalable and highly reliable software
systems.
SREs are responsible for ensuring that the systems are reliable, scalable, and
efficient, while also automating operational tasks to improve reliability and
performance.
They work closely with software developers to design and implement reliable
systems and processes.


SRE Datadog likely refers to the utilization of Datadog, a popular monitoring and
analytics platform, within Site Reliability Engineering (SRE) practices. Datadog is
commonly used by SRE teams to monitor the performance, availability, and
reliability of their systems and applications.

SREs leverage Datadog's features to:

Monitor the health and performance of various components in their infrastructure,
including servers, containers, databases, and applications.
Set up alerts and notifications to proactively detect and respond to issues before
they impact users.
Analyze metrics and logs to identify trends, troubleshoot problems, and optimize
system performance.
Create dashboards and visualizations to gain insights into system behavior and
share information with other teams.
By integrating Datadog into their SRE workflows, teams can effectively manage and
maintain the reliability and availability of their systems in modern, dynamic
environments.

Harshavardhan, Tenepalli, Sathyanathan, Sangeetha - As discussed, please send a
daily update on our Datadog progress via email to Mike Banks and Jason Parrish,
with cc to Prabhakula and me. Please share the status in bullet points to showcase
what was accomplished/completed today and the next action items.
Please send this email daily before logging off. Thanks!

[1/5 1:23 AM] Shukla, Vivek


SRE - Site Reliability Engineer

Datadog is expecting an executable, which we are unable to build unless we have
programming expertise.

1. Tags: who is maintaining/assigning them? In our monitoring, should we handle only
the UI side (metrics/dashboards/alerts), or is it part of my responsibilities to do
configurations for the machines?
Should we configure the agents, or, if they have already configured the agents, do
we need access to the config files? Is my responsibility only to create monitors
for machines with the agent installed?
2. AD metrics: I will try to look for generic AD thresholds (by referring to both
the AD and Datadog documentation), though I am not sure whether they will help us
exactly... request timeout/latency.
3. Custom logs: we will need logs from specific locations (APP teams will ask in
future) and must integrate them with Datadog, defining/configuring them to their
requirements (watch the Datadog custom logs configuration course). We need a YAML
file configuration to get custom logs as required by the APP teams. (We need to
update the YAML file; do we need access credentials?)
4. Raise a request with the Datadog Support team, since we have the Enterprise
version. DB: even following the documentation, Datadog expects an executable to
encrypt and then decrypt the credentials, and we are not sure what the executable
has to look like. If we raise a request, Datadog Support will tell us about the
configuration.
The file we are using is already exposed - the DB team has to tell us.
Regarding secrets: we are not in a position to implement it. We have tried, but it
expects an executable which we are unable to build unless we have some expertise in
programming. (Let's be clear.) Even if we save the secrets in a file and then use
this encryption, the file in which we save the secrets is already exposed, so how
is it controlled? Need clarity regarding: how do you handle the file in which
secrets are placed if it is compromised? YAML file permissions (700, etc.).
5. DBA: they might ask - "Here I have configured 4 servers; we need a dashboard
with this many servers."
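
On the executable mentioned in point 4: as I understand the Datadog Agent's secrets management feature, the agent runs a configured command, writes a JSON request listing secret handles to its standard input, and reads a JSON object with a `value` and `error` per handle from its standard output. The Python sketch below shows only that request/response step. The handle name `db_password`, the in-memory store, and the error text are all illustrative, and the exact protocol and the file permission rules for the executable should be confirmed with the Datadog documentation or Datadog Support.

```python
import json


def handle_request(raw_request, store):
    """Resolve each requested secret handle against `store`.

    `raw_request` is the JSON text the agent is assumed to send, for example
    {"version": "1.0", "secrets": ["db_password"]}.
    Returns the JSON text to write back: one entry per handle.
    """
    request = json.loads(raw_request)
    response = {}
    for handle in request.get("secrets", []):
        if handle in store:
            response[handle] = {"value": store[handle], "error": None}
        else:
            response[handle] = {"value": None, "error": "secret not found"}
    return json.dumps(response)


# Demo with an in-memory store. A real executable would instead do roughly:
#   sys.stdout.write(handle_request(sys.stdin.read(), store))
# with `store` loaded from a protected source (a vault, or a file readable only
# by the agent's user), which is an assumption about the deployment.
demo_store = {"db_password": "example-only"}
demo_request = json.dumps({"version": "1.0", "secrets": ["db_password", "missing"]})
print(handle_request(demo_request, demo_store))
```

This does not by itself answer the question about an exposed secrets file: the executable only moves the password out of the integration YAML, so wherever the store lives still needs restricted permissions or a proper secrets manager.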

LET ME LOOK INTO THIS AND COME BACK TO YOU.

5.

2. *Integration Requirements*:
- Are there any existing systems or services that need to be integrated with
Datadog?
- Do you use any specific frameworks, languages, or technologies that require
custom instrumentation?

3. *Critical Metrics and KPIs*:


- What are the most critical performance metrics and key performance indicators
(KPIs) for your application?
- Which metrics do you currently track or wish to track using Datadog?

4. *Alerting and Notifications*:


- What are your criteria for triggering alerts?
- Who should be notified in case of an incident or performance degradation?
- Do you have any specific escalation policies or response procedures?

5. *Logging and Tracing*:


- Do you currently use any logging or tracing solutions?
- Are there any specific log data or traces that you want to collect and analyze
with Datadog?

6. *Scaling and Resource Usage*:


- How does your application scale under load?
- What resource utilization metrics are important for monitoring scalability?
- Are there any performance bottlenecks or resource constraints that need to be
monitored?

7. *Security and Compliance*:
- Are there any security or compliance requirements that need to be addressed
with monitoring?
- Do you have any specific security-related events or metrics that should be
monitored?

8. *User Experience and Business Metrics*:


- How do you measure the user experience and business impact of your
application?
- Are there any specific user actions or business metrics that you want to
monitor?

9. *Customization and Dashboarding*:


- What level of customization do you require for dashboards and visualizations?
- Are there any specific reports or views that stakeholders need access to?

10. *Training and Support*:


- Do team members require training on using Datadog?
- What level of support do you expect during the setup and ongoing monitoring
process?

By asking these questions, you can gain insights into the application team's
monitoring needs and tailor the Datadog setup to effectively meet
those requirements.
