DevOps
Foundational Concepts
1. What is DevOps?
DevOps is a modern approach to building and delivering software that combines cultural
philosophies, practices, and tools to increase an organization's ability to deliver applications at
high velocity. It aims to break down the traditional walls or "silos" that exist between software
development teams (Dev) and IT operations teams (Ops).
Detailed Explanation: Think of a traditional restaurant. The chefs who cook the food
(Developers) work in the kitchen, and the waiters who serve the food to customers (Operations)
work in the dining area. If a customer complains that the food is cold, the waiters might blame
the chefs for slow cooking, and the chefs might blame the waiters for slow serving. Each team
works in its own silo, and the customer remains unhappy. This is the traditional, inefficient model
of software delivery.
DevOps makes the chefs and waiters work as one single, collaborative team with a shared goal:
delivering a great experience to the customer. They share ownership of the entire process.
● Cultural Philosophy: The team adopts a mindset of shared responsibility. If the food is
cold, it's the team's problem to solve together.
● Practices: They implement better processes, like creating a direct communication
channel between the kitchen and the dining floor.
● Tools & Automation: They use tools like food warmers (automation) and an efficient
kitchen-to-table routing system (a CI/CD pipeline) to ensure the software (hot, delicious
food) is delivered quickly and reliably.
The core business value of DevOps is to shorten the systems development life cycle, which
means faster time to market for new features, increased reliability and stability of the system,
and the ability to innovate more quickly than competitors.
While they are highly complementary and often used together, Agile and DevOps focus on different parts of the software lifecycle. In the restaurant analogy, Agile is the kitchen's process: it governs how the chefs plan their work, cook in small batches, and adapt the menu based on feedback. Agile answers the question, "How do we build the right software and adapt to change?"
DevOps, on the other hand, is a broader philosophy that covers the entire delivery lifecycle,
from the moment an order is placed to the moment the food is delivered and even after. It
includes the kitchen's process (Agile) but also focuses on how the food is packaged, delivered to
the table, and how customer feedback is collected and sent back to the kitchen. DevOps answers
the bigger question, "How do we deliver and run the software reliably and quickly?"
The key differences:
● Scope: Agile covers primarily the development lifecycle; DevOps covers the entire software delivery and operations lifecycle.
● Teams: Agile focuses on collaboration within the dev team; DevOps focuses on collaboration between Dev, Ops, and Security.
● Goal: Agile aims to build the right software and adapt to change; DevOps aims to deliver and run that software reliably and at speed.
Among the core principles of DevOps:
● Culture: This is the most important and foundational principle. It's about changing the
organizational mindset from finger-pointing to shared ownership. It means fostering a
culture of collaboration between all teams (Dev, Ops, Security, Business). A key part of
this is creating psychological safety and adopting a blameless approach to failures.
The Three Ways are the founding principles from the book The Phoenix Project that describe the journey to
achieving a DevOps mindset. They provide a high-level framework for thinking about your
processes.
● The First Way (Systems Thinking): This means looking at the entire software delivery
process as one single system, from a business idea all the way to a feature running in
production and delivering value to the customer. The goal is to maximize the flow of
work through this system, moving from left (Development) to right (Operations) as
quickly and smoothly as possible. This involves identifying and removing bottlenecks,
just like a factory would optimize its assembly line.
● The Second Way (Amplify Feedback Loops): This is about creating fast and constant
feedback from the right side (Operations) back to the left side (Development). When a
problem happens in production, developers should know about it immediately, not days
later. This is achieved through comprehensive monitoring, automated testing that provides
instant results, and quick communication channels. The goal is to find and fix problems as
close to the source as possible, which makes them cheaper and easier to fix.
● The Third Way (Culture of Continual Experimentation and Learning): This is about
creating a high-trust culture where it's safe to experiment, take calculated risks, and learn from both successes and failures.
A blameless culture is an environment where, when something goes wrong, the immediate
question is "What went wrong?" and "How can we prevent this from happening again?"—not
"Who did this?" It's a foundational element of DevOps because it focuses on fixing the system
and the process, not on blaming an individual.
This is critically important because DevOps encourages speed and frequent deployments, which
naturally increases the chance of errors. If people are afraid of being punished for making a
mistake, they will start to hide problems, avoid taking risks, and point fingers at other teams. This
creates a culture of fear that destroys collaboration and slows everything down. A blameless
culture fosters psychological safety, which encourages transparency, learning, and a sense of
shared ownership, ultimately making the entire system more reliable and resilient.
Site Reliability Engineering (SRE) is a discipline that takes the principles of software
engineering and applies them to operations problems. It can be thought of as a very specific,
prescriptive implementation of DevOps, originally developed at Google.
While DevOps is a broad philosophy about culture and collaboration ("what" and "why"), SRE
provides the specific practices and roles for "how" to do it. For example, SRE introduces
concrete concepts like:
● Service Level Objectives (SLOs): Setting specific, measurable reliability targets for a
service.
● Error Budgets: A data-driven way to balance reliability with the need to release new
features.
● Toil Automation: A commitment to automating manual, repetitive operational tasks.
Essentially, if DevOps is the goal of making Dev and Ops work together, SRE is a job description
for an engineer who achieves that goal using software engineering practices.
Continuous Integration (CI) is the practice of developers frequently merging their code changes into a shared central repository, where each merge automatically triggers a build and a suite of tests. Think of it like multiple authors writing a book. Instead of each author writing their entire
chapter in isolation and then trying to combine them all at the end (which would be a nightmare
of conflicting storylines), CI is like having them add one paragraph at a time to a shared
document. After each new paragraph is added, an automated editor checks for spelling and
grammar mistakes. This allows the team to find and fix integration problems early and quickly,
leading to higher quality code and a more stable project.
Continuous Delivery (CD) is the logical next step after Continuous Integration. It's the practice
of automating the entire software release process. After the code is successfully built and tested
in the CI stage, the CD process automatically packages it and prepares it for release.
The result is a production-ready artifact that has passed all automated checks. With Continuous
Delivery, the operations team can deploy this new version to production with the push of a
button. The key idea is that the software is always in a releasable state, which dramatically
reduces the risk and anxiety of release days.
Continuous Deployment takes Continuous Delivery one step further by removing the manual
button push. In this practice, every single code change that passes the entire automated test suite
is automatically deployed to production without any human intervention.
This is the ultimate goal for many DevOps teams, but it requires a very high level of confidence
in the automated testing and monitoring systems. While Continuous Delivery ensures you can
release at any time, Continuous Deployment ensures you do release every time a change is ready.
A CI/CD pipeline is the automated workflow that acts as the backbone of the entire DevOps
process. It's the set of steps that takes a developer's code from their machine all the way to the
production environment.
● Source Stage: This stage is triggered when a developer commits code to a version control
system like Git.
● Build Stage: The pipeline checks out the code, compiles it if necessary, and packages it
into a runnable artifact. For example, it might create a Docker image or a Java JAR file.
● Test Stage: The pipeline runs a series of automated tests against the artifact to ensure it's
high quality and doesn't have any bugs. This can include unit tests, integration tests, and
security scans.
● Deploy Stage: If all the tests pass, the artifact is deployed to an environment. This
usually starts with a staging or testing environment, and after further checks, it's promoted
to the production environment for users.
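As an illustration, here is a minimal sketch of these stages expressed in GitLab CI syntax. The registry, image name, and script commands are placeholders, not a prescribed implementation.

```yaml
# Hypothetical .gitlab-ci.yml illustrating the build, test, and deploy stages.
# The source stage is implicit: GitLab triggers this pipeline on every push.
stages:
  - build
  - test
  - deploy

build:
  stage: build
  script:
    - docker build -t registry.example.com/my-app:$CI_COMMIT_SHORT_SHA .
    - docker push registry.example.com/my-app:$CI_COMMIT_SHORT_SHA

test:
  stage: test
  script:
    - ./run-tests.sh   # placeholder for unit tests, integration tests, and scans

deploy_staging:
  stage: deploy
  script:
    - kubectl set image deployment/my-app my-app=registry.example.com/my-app:$CI_COMMIT_SHORT_SHA
  environment: staging
```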
Some of the most popular CI/CD tools:
● Jenkins: The most popular open-source automation server. It's incredibly powerful and
flexible due to its massive library of plugins, but this flexibility can also make it complex
to set up and manage.
● GitLab CI/CD: A CI/CD tool that is tightly integrated into the GitLab platform. It's
known for its power and simplicity, using a single YAML file (.gitlab-ci.yml) for
configuration and providing built-in features like a container registry.
● GitHub Actions: A CI/CD tool that is natively integrated into GitHub. It's event-driven,
meaning you can trigger workflows on any GitHub event (like a push, a pull request, or a
new issue). It has a large marketplace of reusable "actions" that make it easy to build
complex pipelines.
● CircleCI: A popular cloud-based CI/CD tool known for its speed, performance, and
easy-to-use YAML configuration.
A Jenkinsfile is a text file that contains the definition of a Jenkins pipeline. It is checked into the
source code repository alongside the application code. This is the core of "pipeline as code."
Instead of configuring a pipeline by clicking around in the Jenkins user interface, you define all
the stages, steps, and logic of your pipeline in this file using a Groovy-based language. There are
two types of syntax:
● Declarative: a newer, more structured and opinionated syntax that is easier to read and write, and is the recommended choice for most pipelines.
● Scripted: the original syntax, which is essentially free-form Groovy code; it is more flexible but harder to maintain.
13. How would you design a CI/CD pipeline for a microservices architecture?
The key principle for a microservices pipeline is that each service must be independently
deployable. To achieve this without creating a management nightmare, you would design a
reusable pipeline template.
Instead of creating a unique, handcrafted pipeline for each of the dozens or hundreds of services,
you create a master template that defines the standard stages (build, test, scan, deploy). In
Jenkins, this is done with a Shared Library. In GitLab CI, this is done with the include keyword
to import a template. In GitHub Actions, this is a Reusable Workflow.
Each microservice then has its own repository and a very small pipeline configuration file that
simply calls this reusable template. This approach keeps the core pipeline logic centralized and
easy to update, while giving each service team the autonomy to release their service whenever
they need to.
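For example, in GitLab CI the per-service pipeline file can stay tiny by pulling in a shared template. The project path, template file, and variable below are hypothetical:

```yaml
# Hypothetical .gitlab-ci.yml for one microservice: all the real pipeline
# logic lives in a shared template maintained by the platform team.
include:
  - project: platform/ci-templates        # assumed path of the shared template repo
    file: /templates/standard-service.yml

variables:
  SERVICE_NAME: payments-api              # assumed service-specific setting
```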
14. What is Git? Explain the difference between git pull and git fetch.
Git is a distributed version control system, which means it's a tool for tracking changes in code
and coordinating work among multiple people.
● git fetch: This command connects to a remote repository (like GitHub) and downloads all
the new changes, but it does not integrate any of them into your current working files. It's
like going to the post office to pick up your mail; you have the letters in your hand, but
you haven't opened them yet. This is a safe way to review what others have done without
affecting your own work.
● git pull: This command is actually two commands in one: git fetch followed by git merge.
It downloads all the new changes and then immediately tries to merge them into your
current branch. Using the same analogy, this is like having the mail delivered directly to
your desk and automatically opened and sorted into your current paperwork.
15. What is the difference between git rebase and git merge?
Both commands are used to integrate changes from one branch into another, but they do so in
very different ways, resulting in a different project history.
● git merge: creates a new "merge commit" that ties the two branches together. It preserves the complete, true history of both branches, but that history can become cluttered with merge commits.
● git rebase: replays your branch's commits on top of the target branch, producing a clean, linear history. Because it rewrites commits, the golden rule is to never rebase a branch that has already been shared with others.
git cherry-pick is a command that allows you to select a single commit from one branch and
apply it onto another branch.
Imagine you have a bug fix on a feature branch that you also need on your main production
branch right now, but you're not ready to merge the entire feature branch. git cherry-pick allows
you to go into the feature branch, "pick" just that one bug fix commit, and apply it to your main
branch. It's a surgical way to move a specific change between branches.
A branching strategy is a set of rules and conventions that a team agrees on for how they will
create, name, and merge branches in Git.
● GitFlow is a very structured and complex strategy that is well-suited for projects with
scheduled, versioned releases (like traditional software). It uses multiple long-lived
branches, including main for stable releases and develop for integrating features. It also
uses temporary branches for features, releases, and hotfixes. This makes it robust but can
be slow and complex for modern web applications.
● Trunk-Based Development is a much simpler strategy where all developers work on a single shared branch (the "trunk", usually main), merging small changes frequently, often several times a day. Long-lived branches are avoided, which keeps integration continuous and pairs well with CI/CD.
A .gitignore file is a simple text file where you list files and directories that Git should
intentionally ignore. Any file or directory listed in it will not be tracked by Git, meaning it won't
be staged or committed to the repository.
Every project generates files that are specific to a local environment or are temporary build
artifacts (like log files, dependency folders such as node_modules, or compiled code).
Committing these files to a shared repository would cause unnecessary conflicts and bloat. The
.gitignore file prevents this. Most importantly, it's used to ignore files containing secrets and
credentials (like .env files). Committing secrets is a major security risk, and using .gitignore is
the first line of defense against it.
Infrastructure as Code (IaC) is the practice of defining and managing infrastructure (servers, networks, databases) in machine-readable files rather than through manual configuration. Imagine building a house with a detailed architectural blueprint instead of just telling workers to
"build a wall here." The blueprint is your code. This approach makes your infrastructure
repeatable, consistent, and scalable. You can store your infrastructure code in Git, review
changes with your team, and automatically deploy it, just like you would with application code.
Configuration Management is a process for ensuring that your servers and software are
maintained in a desired, consistent state. It's a subset of IaC that focuses on what happens on a
server after it has been created.
For example, you might use an IaC tool like Terraform to create a new virtual server. Then, you
would use a Configuration Management tool like Ansible to log into that server and perform
tasks like installing a web server, configuring its settings, creating user accounts, and starting
services. It's about managing the software and settings on the machine itself.
21. What is the difference between imperative and declarative IaC tools?
● Imperative tools require you to specify the exact sequence of steps needed to reach the desired state (the "how").
● Declarative tools let you describe only the desired end state (the "what"), and the tool works out the steps to get there. Terraform and Kubernetes manifests are declarative; a shell script that creates resources step by step is imperative.
Terraform state is a file (by default, terraform.tfstate) that acts as Terraform's brain. It's a JSON
file that keeps track of the infrastructure it manages. It's critically important because it maps the
resources defined in your code to the real-world resources that have been created in your cloud provider.
Without the state file, Terraform would have no memory of what it created. For team
collaboration, the state file must be stored in a remote, shared location (like an AWS S3
bucket). This is called remote state. This remote location should also support state locking,
which is a crucial feature that prevents multiple people from running Terraform at the same time
and corrupting the state.
Ansible is a configuration management and automation tool. It's extremely popular for two main
reasons:
1. Agentless: Unlike many other tools, Ansible does not require you to install any special
software (an "agent") on the servers you want to manage. It communicates with them
over standard, existing protocols like SSH for Linux and WinRM for Windows. This
makes it incredibly easy and fast to set up.
2. Simple (YAML): Ansible's automation workflows, called "playbooks," are written in
YAML, which is a simple, human-readable language. This lowers the barrier to entry,
allowing system administrators and developers who aren't expert programmers to write
powerful automation.
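A small playbook sketch follows; the "web" host group and the nginx package are assumptions chosen purely for illustration.

```yaml
# Illustrative Ansible playbook: ensures nginx is installed and running on
# every host in the "web" inventory group. Running it twice changes nothing
# the second time, because the modules are idempotent.
- name: Configure web servers
  hosts: web
  become: true
  tasks:
    - name: Install nginx
      ansible.builtin.apt:
        name: nginx
        state: present
        update_cache: true

    - name: Ensure nginx is started and enabled
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: true
```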
A "Golden Image" is a perfectly configured, standardized template for a virtual machine (VM)
or a container. It's like a master cookie cutter that you use to create all your new servers.
Instead of starting with a bare operating system and configuring it every time, you create a
golden image that already has the OS hardened, all necessary security patches applied, and
common tools like monitoring and logging agents pre-installed. This ensures that every new
server you create is consistent, secure, and ready to go much faster. This process should be
automated using a tool like HashiCorp's Packer, which allows you to define the image as code.
A container is a lightweight, standalone package that contains an application and all its
dependencies. It's like a standardized shipping container for software. A Virtual Machine (VM),
on the other hand, is a complete emulation of a physical computer.
The key difference is in what they virtualize. A VM virtualizes the hardware, so each VM needs
its own full copy of an operating system. Containers virtualize the operating system itself, so
multiple containers can share the same host OS kernel. This makes containers much smaller
(megabytes vs. gigabytes for VMs), faster to start (seconds vs. minutes), and more efficient in
resource usage.
Docker is the most popular platform for developing, shipping, and running applications inside
containers. It provides a simple set of tools and a standardized format that made containers easy
to use. The Dockerfile is a text file that acts as a recipe, containing all the instructions needed to
build a Docker image, which is the template for creating a container.
A container orchestrator is a tool that automates the deployment, management, scaling, and
networking of containers. When you move from running a few containers on your laptop to
running hundreds or thousands of containers in production across multiple servers, you need an
orchestrator to manage the complexity.
A Kubernetes cluster is made up of two main parts:
● Control Plane (The Brain): This is the set of master nodes that manage the state of the
cluster. Its components include the kube-apiserver (the front door to the cluster), etcd (the
cluster's database), kube-scheduler (which decides which node a new application should
run on), and kube-controller-manager (which keeps everything running correctly).
● Worker Nodes (The Muscle): These are the machines where your actual applications
run. Each worker node has a kubelet (an agent that communicates with the control plane),
kube-proxy (which handles networking), and a Container Runtime (like Docker) to run
the containers.
A Pod is the smallest and most basic deployable object in Kubernetes. It represents a single
instance of a running application. A Pod can contain one or more containers, but they are all
managed as a single unit. Containers within the same Pod are always located on the same worker
node and share the same network and storage resources.
● A Deployment is the standard way to run a stateless application, like a web server or an
API. It manages a set of identical, interchangeable Pods (replicas) and can handle tasks
like rolling updates and scaling. If a Pod in a Deployment dies, Kubernetes simply
replaces it with a new, identical one.
● A StatefulSet is used to run stateful applications, like a database, where each instance is
unique and not interchangeable. It provides important guarantees like a stable, unique
network identity for each Pod and stable, persistent storage that is tied to that identity.
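A minimal Deployment sketch; the name, labels, and image are placeholders:

```yaml
# Illustrative Deployment: runs three identical replicas of a stateless web Pod
# and automatically replaces any Pod that dies.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.25       # placeholder application image
          ports:
            - containerPort: 80
```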
A Service is a Kubernetes object that provides a stable network endpoint (a single, consistent IP
address and DNS name) for a set of Pods. Pods in Kubernetes are ephemeral: they can be created and destroyed at any time, and each replacement Pod gets a new IP address. A Service gives clients one stable address and load-balances traffic across whichever healthy Pods currently match its label selector.
An Ingress is an API object that manages external access to the services in a cluster, primarily
for HTTP and HTTPS traffic. While a Service provides internal access, an Ingress is what
exposes your application to the outside world.
It acts as a smart router or reverse proxy. It can handle tasks like routing traffic to different
services based on the requested URL (e.g., /api goes to the API service, / goes to the web service)
and managing SSL/TLS certificates. To work, an Ingress requires an Ingress Controller (like
Nginx or Traefik) to be running in the cluster.
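For illustration, here is a Service selecting Pods labelled app: web, plus an Ingress routing an assumed hostname to it:

```yaml
# Illustrative Service and Ingress. The hostname is an assumption, and an
# Ingress Controller must be installed in the cluster for the Ingress to work.
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 80
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
spec:
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web
                port:
                  number: 80
```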
34. What are Persistent Volumes (PV) and Persistent Volume Claims (PVC)?
● Persistent Volume (PV): This is a piece of storage in the cluster, like an AWS EBS
volume or a Google Persistent Disk. It's an administrator-level resource that represents
the available storage.
● Persistent Volume Claim (PVC): This is a request for storage made by a user or an
application. A Pod uses a PVC to request a specific size and type of storage. Kubernetes
then binds this PVC to an available PV. This system decouples the application's need for
storage from the underlying storage infrastructure.
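A PVC sketch; the requested size and storage class name are assumptions and depend on the cluster:

```yaml
# Illustrative PersistentVolumeClaim: requests 10Gi of storage, which
# Kubernetes binds to a matching PV (often provisioned dynamically).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: standard   # assumed storage class name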
Helm is often called "the package manager for Kubernetes." It's a tool that simplifies the process
of installing and managing complex Kubernetes applications.
An application might require many different Kubernetes objects to run. Helm packages all of
these YAML definitions into a single, versioned unit called a Chart. This Chart can be easily
shared and installed with a single command. Helm also allows you to use templates, so you can
customize a Chart for different environments (like dev and prod) without duplicating all the files.
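For example, a chart's values file typically exposes the settings that differ between environments; the keys below are purely illustrative:

```yaml
# Hypothetical values.yaml for a chart. Installing the same chart with a
# different values file (e.g. `helm install web ./chart -f values-prod.yaml`)
# customizes the same templates for another environment.
replicaCount: 2
image:
  repository: registry.example.com/web   # placeholder image
  tag: "1.4.0"
ingress:
  enabled: true
  host: app.example.com                   # placeholder hostname
```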
Monitoring is the eyes and ears of your system. In a fast-paced DevOps environment where
changes are deployed frequently, you need immediate feedback on the health and performance of your systems. Monitoring allows you to:
● Proactively detect issues: Find and resolve problems before they impact your users.
● Understand deployment impact: See immediately if a new release has improved or
degraded performance.
● Make data-driven decisions: Use metrics to decide when to scale your infrastructure or
optimize your application.
● Ensure reliability: Track your performance against your SLOs and SLAs.
Observability is a more advanced concept than monitoring. It's a measure of how well you can
understand a system's internal state just by observing its external outputs. The three pillars that
provide this understanding are:
● Metrics: Aggregated numeric data over time. They tell you what is happening (e.g., CPU
is high).
● Logs: Detailed, timestamped records of events. They tell you why it's happening (e.g., a
specific error message is flooding the logs).
● Traces: A detailed view of a single request as it flows through all the different services in
a distributed system. They tell you where the problem is happening (e.g., which specific
microservice is slow).
39. What are some popular monitoring tools (the "Prometheus Stack")?
The open-source stack built around Prometheus is the de facto standard for cloud-native monitoring. It typically consists of Prometheus itself (which scrapes and stores time-series metrics), Alertmanager (which deduplicates and routes alerts), Grafana (which visualizes the metrics in dashboards), and exporters (like the Prometheus Node Exporter) that expose metrics from hosts and services.
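As a small illustration, an alerting rule in Prometheus is itself just YAML; the job label and thresholds here are assumptions:

```yaml
# Illustrative Prometheus alerting rule: fires if a scraped target for the
# assumed job "my-app" has been down for five minutes. Alertmanager then
# routes the resulting alert (e.g. to Slack or a pager).
groups:
  - name: availability
    rules:
      - alert: InstanceDown
        expr: up{job="my-app"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Target {{ $labels.instance }} of job my-app is down"
```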
A related stack, commonly called the ELK Stack, is the standard for centralized logging:
● Elasticsearch: A powerful search and analytics engine based on Lucene. It's where all the
log data is stored and indexed.
● Logstash: A data processing pipeline that can ingest data from various sources, transform
or enrich it, and then send it to a destination like Elasticsearch.
● Kibana: A visualization tool that allows you to explore, search, and create dashboards for
the log data stored in Elasticsearch. (Note: Today, this is often called the Elastic Stack,
and a lightweight data shipper called Beats is often used to send data to Elasticsearch.)
SLIs, SLOs, and SLAs are the foundation of SRE and are used to define and measure reliability.
● SLI (Service Level Indicator): a direct measurement of some aspect of service behavior, such as request latency or the proportion of successful requests.
● SLO (Service Level Objective): the internal target you set for an SLI, for example "99.9% of requests succeed over a 30-day window."
● SLA (Service Level Agreement): a contractual commitment to customers, usually with financial or legal consequences if the promised level of service is not met.
DevSecOps is the philosophy of integrating security practices into every phase of the DevOps
lifecycle. The goal is to "shift left," which means moving security considerations to the earliest
stages of development, rather than treating security as a final gate at the end of the process. It's
about making security a shared responsibility of the entire team and automating security checks
throughout the CI/CD pipeline.
43. What are some security practices you would integrate into a CI/CD pipeline?
Typical practices include: static application security testing (SAST) on every commit; software composition analysis (SCA) to scan dependencies for known vulnerabilities; secret scanning to catch credentials accidentally committed to the repository; container image scanning before images are pushed to a registry; scanning of IaC files such as Terraform or Kubernetes manifests for insecure configurations; and dynamic application security testing (DAST) against a running staging environment.
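One common way to wire several of these checks into a GitLab pipeline is to include GitLab's built-in scanning templates. This is a sketch; adjust it to whichever tools your organization actually uses:

```yaml
# Sketch: adds SAST, dependency, container, and secret scanning jobs to an
# existing GitLab CI pipeline by including the maintained templates.
include:
  - template: Security/SAST.gitlab-ci.yml
  - template: Security/Dependency-Scanning.gitlab-ci.yml
  - template: Security/Container-Scanning.gitlab-ci.yml
  - template: Security/Secret-Detection.gitlab-ci.yml
```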
This is a critical security concept when using a cloud provider like AWS, Azure, or GCP. It
defines which security tasks are handled by the cloud provider and which are handled by you, the
customer.
In general, the provider is responsible for the security of the cloud. This includes the physical
security of the data centers, the hardware, and the core networking and virtualization
infrastructure. The customer is responsible for security in the cloud. This includes configuring
your network security (like security groups), managing user access (IAM), encrypting your data,
and securing your applications.
The most secure and recommended practice is to integrate Kubernetes with an external,
dedicated secret management system like HashiCorp Vault or AWS Secrets Manager.
While Kubernetes has a built-in Secret object, it only Base64 encodes the data by default, which
is not truly encrypted. An external vault provides strong encryption, fine-grained access control
policies, and detailed audit logs. You would use a tool like the Secrets Store CSI Driver, which mounts secrets from the external vault directly into your Pods as files, so the sensitive values never need to be stored as plain Kubernetes Secret objects.
This scenario (a critical vulnerability discovered in a library your application uses in production) requires a swift, systematic response known as a "hotfix." The process is:
1. Assess: Immediately assess the impact and exploitability of the vulnerability to determine
its urgency.
2. Patch: Work with the development team to update the vulnerable library to a patched,
secure version.
3. Test & Deploy: Push the fix through the full, automated CI/CD pipeline. It is critical not
to skip tests, as the new library version could introduce breaking changes.
4. Hotfix: Deploy the patched version to production immediately as a high-priority release.
5. Verify: Monitor the service closely after deployment to confirm that the fix is effective
and has not negatively impacted application performance.
48. A developer tells you "It works on my machine." How do you solve this for
good?
This phrase indicates an inconsistency between the developer's local environment and the testing
or production environment. This can be due to differences in OS versions, system libraries, or
language runtimes. Docker solves this by using a Dockerfile to define the exact environment for
the application. This file is used to build a Docker image, which is a self-contained, immutable
template. Because the container packages the application with its environment, you create a
consistent and portable unit that runs the exact same way everywhere.
A CrashLoopBackOff error means Kubernetes is trying to start a container, but it's repeatedly
crashing. To troubleshoot, I would follow these steps:
1. Check the Logs: The first and most important command is kubectl logs <pod-name>.
This will show the output from the container and will usually contain an error message
that tells you exactly why the application is failing. If the container has already restarted, kubectl logs <pod-name> --previous shows the logs from the crashed instance.
2. Describe the Pod: kubectl describe pod <pod-name> shows recent events, such as failed liveness probes, image pull errors, or an OOMKilled status caused by memory limits.
3. Check the Configuration: Verify environment variables, mounted ConfigMaps and Secrets, and the container's command and arguments, since a missing dependency or bad configuration is a very common cause of startup crashes.
A reverse proxy is a server that sits in front of one or more backend servers and forwards client
requests to them. From the client's perspective, it appears as if they are communicating directly
with the reverse proxy itself.
GitOps is an operational framework that uses a Git repository as the single source of truth for
both declarative infrastructure and applications.
The core idea is that the desired state of your entire system is declared in a Git repository. Any
change to production must be made via a commit to this repository. An automated agent (like
Argo CD or Flux) runs inside your Kubernetes cluster, constantly comparing the live state of the
cluster to the desired state in Git. If it detects a difference, it automatically "pulls" the changes
and applies them to the cluster. This provides a perfect audit trail, eliminates configuration drift,
and makes rollbacks as simple as reverting a Git commit.
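For illustration, this is roughly what an Argo CD Application object looks like; the repository URL, path, and namespaces are placeholders:

```yaml
# Illustrative Argo CD Application: the agent continuously syncs the manifests
# found at the given repo path into the target namespace, and reverts drift.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/my-app-config.git   # placeholder repo
    targetRevision: main
    path: k8s/production
  destination:
    server: https://kubernetes.default.svc
    namespace: my-app
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```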
A sidecar container is an auxiliary container that runs alongside a main application container
within the same Kubernetes Pod. It's used to extend or enhance the functionality of the main
container without being part of the application's code.
Because containers in the same Pod share the same network and storage, the sidecar can interact
closely with the main container. This pattern allows you to separate concerns. Common use cases
include a logging agent sidecar (like Fluentd) that collects and forwards logs, a monitoring
sidecar that exports metrics, or a service mesh proxy (like Envoy) that intercepts all network
traffic.
Database schema changes are handled with a migration tool (such as Flyway or Liquibase) so they can be version-controlled just like application code. You write each schema change (like CREATE TABLE) in a separate, version-numbered SQL
script, which is then committed to your Git repository. The migration tool maintains a special
history table in your database to track which scripts have already been applied. As part of your
CI/CD pipeline, you run the migration tool, which automatically applies only the new, unapplied
migration scripts in the correct order. This makes database changes repeatable,
version-controlled, and safe to automate.
A service mesh is a dedicated and transparent infrastructure layer that handles service-to-service
communication within a microservices architecture. It provides a way to manage complex
communication challenges like traffic management, security, and observability without requiring
any changes to the application code.
Chaos Engineering is the discipline of proactively and deliberately injecting failures into a
production system to build confidence in its ability to withstand turbulent and unexpected
conditions. It's about finding weaknesses before they cause a real outage.
The goal is not to break things randomly, but to run controlled experiments. You start with a
hypothesis about how your system will behave when a failure is injected (e.g., "If we shut down
one of our three API servers, latency will not increase"). You then carefully inject that failure in
production and measure the impact to see if your hypothesis was correct. This practice helps you
uncover hidden dependencies and build more resilient systems. Netflix's Chaos Monkey is the
most famous example.
56. How do you keep your skills updated in the fast-changing world of
DevOps?
Staying current in DevOps requires a proactive and continuous learning mindset. My approach is
multi-faceted:
A pull request (PR) or merge request (MR) is a feature of Git hosting platforms like GitHub and
GitLab. It's a formal way to propose changes to a repository. A developer creates a PR to notify
team members that they have completed a feature on a separate branch and it's ready to be
reviewed and merged into the main branch. It's a central place to discuss the changes, perform
code reviews, and run automated CI checks before integration.
58. What is the difference between git stash, git clean, and git reset?
● git stash: Temporarily saves your uncommitted local changes (both staged and unstaged)
so you can switch branches or work on something else. You can later re-apply these
changes with git stash pop.
● git clean: Removes untracked files from your working directory. This is a destructive
command and is useful for getting a completely clean working state.
● git reset: Unstages files and can be used to undo commits. git reset --soft moves the
HEAD pointer but keeps your changes. git reset --hard is destructive and discards all local
changes, resetting your branch to a specific commit.
Pipeline as code is the practice of defining your CI/CD pipeline's configuration in a text file that
is stored in the project's source code repository (e.g., a Jenkinsfile or .gitlab-ci.yml). This
provides several benefits: the pipeline is version-controlled, it can be reviewed and edited like
any other code, and it provides a single source of truth for the entire build and release process.
An artifact repository is a storage system designed to hold the binary outputs (artifacts) of your
build process. Examples include Docker images, Java JAR files, or Python packages. It's the versioned, central source from which later pipeline stages and deployments pull the exact artifact that was built and tested; common implementations include Nexus, JFrog Artifactory, and container registries.
62. What is the difference between a Docker image and a Docker container?
● Image: A Docker image is a read-only, inert template that contains the application and all
its dependencies. It's the blueprint.
● Container: A Docker container is a runnable instance of an image. It's the live, running
application. You can have many containers running from the same image.
Docker volumes are the preferred mechanism for persisting data generated by and used by
Docker containers. They are managed by Docker and are stored in a dedicated part of the host
filesystem. Volumes are used because the container's own filesystem is ephemeral; if the
container is removed, its data is lost. Volumes provide a way to store data (like a database's files)
outside the container's lifecycle, allowing it to persist.
Both specify the command to be executed when a container starts, but they have different
purposes.
● CMD: Provides the default command and/or parameters for an executing container. These
can be easily overridden by supplying a command when you run the container (e.g.,
docker run <image> new-command).
● ENTRYPOINT: Configures a container that will run as an executable. It's harder to
override. The best practice is to use ENTRYPOINT to specify the main executable and
CMD to specify the default flags or arguments for that executable.
A multi-stage build is a feature in a Dockerfile that allows you to use multiple FROM statements.
Each FROM instruction begins a new build stage. This is extremely useful for creating small,
secure production images. For example, you can use a first stage with a full SDK to build your
application, and then a second, minimal stage that only copies the compiled artifact from the first
stage, without including all the build tools and dependencies.
A DaemonSet ensures that all (or some) Nodes run a copy of a Pod. As nodes are added to the
cluster, Pods are added to them. As nodes are removed, those Pods are garbage collected. This is
useful for deploying cluster-wide daemons like a log collector (e.g., Fluentd) or a monitoring
agent (e.g., Prometheus Node Exporter) on every node.
Network Policies are a Kubernetes resource that control the traffic flow between Pods. By
default, all Pods in a cluster can communicate with each other. Network Policies allow you to
define firewall rules to restrict that communication, creating a more secure, zero-trust network.
For example, you can create a policy that only allows frontend Pods to communicate with
backend Pods on a specific port.
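A sketch of the frontend-to-backend policy just described; the labels and port are assumptions:

```yaml
# Illustrative NetworkPolicy: Pods labelled app=backend accept ingress traffic
# only from Pods labelled app=frontend, and only on TCP port 8080.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
```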
A rolling update is the default deployment strategy for Kubernetes Deployments. It ensures zero
downtime by incrementally replacing Pods running the old version of an application with Pods running the
new version, one by one or in small batches. During the update, it ensures that a certain number
of Pods are always available to serve traffic.
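The pace of a rolling update can be tuned on the Deployment; this fragment of a Deployment spec is purely illustrative:

```yaml
# Illustrative Deployment strategy settings: create at most one extra Pod and
# allow at most one Pod to be unavailable while the update progresses.
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
```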
Kubernetes uses probes to monitor container health:
● Liveness Probe: The kubelet uses liveness probes to know when to restart a container. If
the liveness probe fails, the kubelet kills the container, and the container is subject to its
restart policy. This is used to recover from deadlocks.
● Readiness Probe: The kubelet uses readiness probes to know when a container is ready
to start accepting traffic. If the readiness probe fails, the endpoints controller removes the
Pod's IP address from the endpoints of all Services that match the Pod. This is used to
ensure a Pod only receives traffic when it's ready.
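A container spec fragment showing both probes; the port and the /healthz and /ready endpoints are assumed to be provided by the application:

```yaml
# Illustrative probe configuration inside a Pod or Deployment container spec.
containers:
  - name: api
    image: registry.example.com/api:1.0   # placeholder image
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5
```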
Toil is the kind of operational work that is manual, repetitive, automatable, tactical, devoid of
enduring value, and that scales linearly as a service grows. The goal of an SRE team is to reduce
and eliminate toil by automating these tasks, which frees up engineers to work on more valuable,
long-term engineering projects.
A Content Delivery Network (CDN) is a geographically distributed network of proxy servers and
their data centers. It is used to provide high availability and performance by distributing the
service spatially relative to end-users. CDNs are commonly used to deliver static content like
images, CSS, and JavaScript files. By caching this content closer to the user, a CDN reduces
latency and offloads traffic from the origin servers.
DNS (Domain Name System) is the phonebook of the Internet. It translates human-readable
domain names (like www.example.com) into machine-readable IP addresses (like 192.0.2.10).
An A record is the most basic type of DNS record and is used to map a domain name to the IP
address (IPv4) of the computer hosting that domain.
75. What is the difference between a load balancer and a reverse proxy?
The terms are often used interchangeably, but there is a subtle difference. A reverse proxy
accepts a request from a client, forwards it to a server that can fulfill it, and returns the server's
response. A load balancer is a specific type of reverse proxy that distributes incoming traffic
among a group of backend servers to improve reliability and capacity. All load balancers are
reverse proxies, but not all reverse proxies are load balancers.
Mutual TLS is a method for mutual authentication. Unlike standard TLS, where only the client
verifies the server's identity, mTLS requires both the client and the server to present a certificate
and verify each other's identity before establishing a secure connection. This is a key security
feature in zero-trust networks and is often implemented by service meshes like Istio.
A WAF is a specific type of firewall that applies a set of rules to an HTTP conversation. It sits in
front of a web application and is designed to protect it from common web-based attacks like SQL
injection, cross-site scripting (XSS), and file inclusion by filtering and monitoring the HTTP
traffic.
The principle of least privilege is a security concept that states that a user or process should only
be given the minimum levels of access – or permissions – needed to perform its job function. For
example, an application that only needs to read from a database should be given a read-only user
account, not an administrator account.
1. Acknowledge & Communicate: Immediately acknowledge the alert and inform the team
and stakeholders that an incident is in progress.
2. Triage & Mitigate: Form a response team and work to quickly identify the impact and
scope. The immediate priority is to restore service, even if it's a temporary fix or a
rollback.
3. Identify Root Cause: Once the service is stable, begin the investigation to find the
underlying root cause of the issue.
4. Resolve & Document: Implement a permanent fix for the root cause.
5. Post-Mortem: Conduct a blameless post-mortem to document the incident, the
resolution, and the action items to prevent it from happening again.
● Push-based: A central server (like Jenkins) actively "pushes" the new version of the
application to the target servers. This is the traditional model.
● Pull-based: An agent running on the target server "pulls" the desired configuration from
a central repository and applies it. This is the model used in GitOps, where an agent like Argo CD or Flux continuously reconciles the live system with the state declared in Git.
Idempotency is the property of an operation that ensures it can be applied multiple times without
changing the result beyond the initial application. It's critical for automation because it makes
scripts safe to re-run. For example, an idempotent script to create a user would create the user if it
doesn't exist, and do nothing if it already exists, without throwing an error. Tools like Ansible are
designed to be idempotent.
A feature flag is a technique that allows you to turn certain functionality on and off during
runtime without deploying new code. It's like a set of if/else statements in your code that can be
controlled remotely. This allows you to deploy new features "dark" (turned off) to production,
and then turn them on for specific users (like internal testers or a small percentage of customers)
for testing before a full release. It decouples deployment from release.
A Value Stream Map is a lean management tool used to visualize the entire flow of work required
to deliver a product or service to a customer. In DevOps, it's used to map the entire software
delivery process, from idea to production. The goal is to identify bottlenecks, waste, and areas for
improvement in the delivery pipeline.
● Monolith: An application built as a single, unified unit. All components are tightly
coupled and deployed together. It's simpler to develop initially but becomes harder to
scale and maintain.
● Microservices: An application built as a collection of small, independent services. Each
service is responsible for a specific business capability, is developed and deployed
independently, and communicates with other services over a network. This architecture is
more complex but offers better scalability, resilience, and team autonomy.
An API Gateway is a management tool that sits between a client and a collection of backend
services. In a microservices architecture, it acts as a single entry point for all clients. It can handle
tasks like request routing, authentication, rate limiting, and logging, which simplifies the client's
interaction with the backend and reduces the burden on individual microservices.
This refers to the small group of users or servers that receive a new release in a Canary Release
strategy. Just like the canaries that were used to detect toxic gases in coal mines, this small group
serves as an early warning system. If they experience issues, the release is rolled back before it
affects the wider user base.
"Shift-left" is the core principle of DevSecOps. It means moving security practices to the left, or
earlier, in the software development lifecycle. Instead of performing a security audit at the end of
the process, security is integrated from the beginning, with automated scans in the CI pipeline,
threat modeling during design, and security-conscious coding practices.
A service catalog is a centralized portal that provides a list of standardized IT services that can be
requested by developers or other users. In a DevOps context, this could be a self-service portal
where a developer can request a new CI/CD pipeline, a new Kubernetes namespace, or a new
cloud environment, all of which are provisioned automatically.
This is an analogy used to describe the shift in mindset required for cloud computing.
● Pets: Servers that are treated as unique and indispensable, like a family pet. They are
given names, and when they get sick, you nurse them back to health. This is the
traditional data center model.
● Cattle: Servers that are treated as identical, disposable units in a herd. They are given
numbers, not names. When one gets sick, you don't try to fix it; you simply replace it with
a new, healthy one. This is the cloud-native, immutable infrastructure model.
Immutable infrastructure is a model where servers are never modified after they are deployed. If
you need to update an application or apply a patch, you don't log in and change the server.
Instead, you build a new server from a new image with the changes, deploy it, and then terminate
the old one. This approach leads to more consistent and reliable systems and prevents
configuration drift.
Technical debt is a concept in software development that reflects the implied cost of rework
caused by choosing an easy (limited) solution now instead of using a better approach that would
take longer. Just like financial debt, if technical debt is not repaid, it can accumulate "interest,"
making it harder and harder to implement changes in the future.
A linter is a static code analysis tool used to check for programmatic and stylistic errors in source
code. In DevOps, linters are used to automatically check the syntax and quality of everything
from application code to Dockerfiles and Terraform configurations as part of the CI pipeline.
A DevOps evangelist is a leader who promotes and champions the adoption of DevOps culture
and practices within an organization. They work to break down silos, educate teams, and
demonstrate the value of collaboration, automation, and continuous improvement.
This is a personal question, but a good answer shows your passion and experience.
● Example: "My favorite tool is currently Terraform. I appreciate its declarative approach
The future of DevOps is heading towards greater abstraction and intelligence. We're seeing a rise
in Platform Engineering to provide developers with seamless self-service experiences. AI and
AIOps will play a much larger role in automating complex tasks, from analyzing pipeline data to
predicting and preventing production incidents. Finally, with the growth of technologies like
WebAssembly, we'll see even more focus on building secure, portable, and efficient applications
that can run anywhere, from the cloud to the edge.
Infrastructure Drift is the term for when the configuration of your production environment
"drifts" or becomes different from the state defined in your Infrastructure as Code (IaC)
repository. This usually happens when someone makes a manual change to the live environment
(e.g., changing a security group rule in the AWS console) instead of updating the code.
This is dangerous because it makes your IaC an unreliable source of truth and can cause
automated deployments to fail or have unintended consequences. The best way to prevent it is
by:
1. Enforcing a GitOps Workflow: Mandate that all changes must be made through code
and a pull request.
2. Implementing Drift Detection: Use tools that can periodically scan your live
environment and compare it to the state defined in your code. Tools like Terraform can
show you the drift when you run a terraform plan, and other specialized tools can run
these checks automatically and alert you if drift is detected.