Skip to content

Conversation

@weilerN
Copy link
Contributor

@weilerN weilerN commented Dec 11, 2025

📝 Description

Add first-class support for configuring Kubernetes probes (readiness, liveness, startup) on the ApplicationRuntime sidecar.
This includes a clear API for setting and deleting probes, validation of probe types, and automatic enrichment of HTTP probe ports from the internal application port at deploy time.


🛠️ Changes Made

  • Introduced ProbeType and ProbeTimeConfig enums for consistent probe configuration keys.
  • Added set_probe() and delete_probe() APIs to ApplicationRuntime for managing sidecar probes.
  • Implemented validation and safe handling of probe types (enum and string-based).
  • Added automatic enrichment of HTTP probe ports using internal_application_port just before deployment.
  • Explicitly blocked probe APIs on non-Application runtimes with clear errors.

✅ Checklist

  • I updated the documentation (if applicable)
  • I have tested the changes in this PR
  • I confirmed whether my changes are covered by system tests
    • If yes, I ran all relevant system tests and ensured they passed before submitting this PR
    • I updated existing system tests and/or added new ones if needed to cover my changes
  • If I introduced a deprecation:

🧪 Testing

  • Added unit tests covering all supported scenarios and edge cases
  • Dev test: tested that probes changes are propagated properly to the Nuclio side:
    • Set a sleep as part of healthcheck function:

        @app.route("/internal")
        def internal():
            sleep(10)
            return "/internal"     
      
    • created application runtime and deployment failed (unhealthy):
      Warning Unhealthy 3m1s (x867 over 138m) kubelet Readiness probe failed: Get "http://10.233.67.57:5000/internal": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

image
  • Configured probes:
application.set_probe(type="readiness",
                    initial_delay_seconds=15,
                    period_seconds=10,
                    timeout_seconds=20,
                    http_path="/internal",
                    http_port=5000)
  • Than deployed again and the function deployed healthy:
image

🔗 References


🚨 Breaking Changes?

  • Yes (explain below)
  • No

🔍️ Additional Notes

@weilerN weilerN force-pushed the ML-11708-support_runtime_probes branch 2 times, most recently from 17e66cb to a3f2f2f Compare December 11, 2025 17:00
Copy link
Member

@TomerShor TomerShor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An initial review

Comment on lines 67 to 236
_K8S_CLASSES = {
"V1HTTPGetAction": V1HTTPGetAction,
"V1ExecAction": V1ExecAction,
"V1TCPSocketAction": V1TCPSocketAction,
"V1GRPCAction": V1GRPCAction,
"V1Probe": V1Probe,
}
_SCHEMA_CACHE: dict[str, dict] = {}


def _get_schema(class_name: str) -> dict:
"""Get and cache Kubernetes class schema."""
if class_name in _SCHEMA_CACHE:
return _SCHEMA_CACHE[class_name]

k8s_class = _K8S_CLASSES.get(class_name)
if not k8s_class:
raise ValueError(f"Unknown Kubernetes class: {class_name}")

schema = getattr(k8s_class, "openapi_types", None) or getattr(
k8s_class, "swagger_types", None
)
if not schema:
raise RuntimeError(f"No schema found for {class_name}")

_SCHEMA_CACHE[class_name] = schema
return schema


def _validate_schema(
data: dict, schema: dict, schema_class_name: str
) -> tuple[bool, str]:
"""Recursively validate dict against Kubernetes API schema."""
_type_checkers = {
"str": lambda v: isinstance(v, str),
"int": lambda v: isinstance(v, int)
or (isinstance(v, float) and v.is_integer()),
"bool": lambda v: isinstance(v, bool),
"datetime": lambda v: isinstance(v, str | dict),
"date": lambda v: isinstance(v, str | dict),
"object": lambda v: isinstance(v, str | dict),
}

def _is_compatible(field_value: typing.Any, expected_field_type: str) -> bool:
if field_value is None:
return True
if expected_field_type in _type_checkers:
return _type_checkers[expected_field_type](field_value)
return False

for key, value in data.items():
if key not in schema:
return False, f"{schema_class_name}: unknown field '{key}'"

expected_type = schema[key]

# Handle nested Kubernetes classes
if isinstance(expected_type, str) and expected_type in _K8S_CLASSES:
if value is None:
continue
if not isinstance(value, dict):
return (
False,
f"{schema_class_name}.{key}: expected dict, got {type(value).__name__}",
)
try:
nested_schema = _get_schema(expected_type)
is_valid, error = _validate_schema(value, nested_schema, expected_type)
if not is_valid:
return False, error
except (RuntimeError, ValueError) as e:
return False, f"{schema_class_name}.{key}: {str(e)}"

elif not _is_compatible(value, expected_type):
return (
False,
f"{schema_class_name}.{key}: invalid type {type(value).__name__}",
)

return True, ""


def _validate_http_probe(
http_get: dict, probe_type: str, sidecar_name: str
) -> str | None:
"""Validate HTTP probe configuration.
Only path and port are mandatory if httpGet is configured.
:param http_get: The httpGet configuration dict from the probe
:param probe_type: The type of probe (readinessProbe, livenessProbe, startupProbe)
:param sidecar_name: The name of the sidecar container
:return: Error message string if validation fails, None if validation passes
"""
if not isinstance(http_get, dict):
return f"{probe_type} in sidecar '{sidecar_name}': httpGet must be a dict"
if not http_get.get("path"):
return f"{probe_type} in sidecar '{sidecar_name}': httpGet.path is required"
if not http_get.get("port"):
return f"{probe_type} in sidecar '{sidecar_name}': httpGet.port is required"
if not isinstance(http_get.get("port"), int | str):
return (
f"{probe_type} in sidecar '{sidecar_name}': httpGet.port must be int or str"
)
scheme = http_get.get("scheme")
if scheme and scheme not in ["HTTP", "HTTPS"]:
return f"{probe_type} in sidecar '{sidecar_name}': httpGet.scheme must be HTTP or HTTPS"
return None


def _validate_sidecar_probes(sidecars: list[dict]) -> None:
"""Validate probe configurations in sidecars against Kubernetes V1Probe schema."""
# Minimum values are based on Kubernetes probe configuration documentation:
# https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#configure-probes
timing_fields = {
PROBE_TIMING_INITIAL_DELAY_SECONDS: (
lambda v: isinstance(v, int) and v >= 0,
"non-negative integer (minimum 0)",
),
PROBE_TIMING_PERIOD_SECONDS: (
lambda v: isinstance(v, int) and v >= 1,
"integer >= 1 (minimum 1)",
),
PROBE_TIMING_TIMEOUT_SECONDS: (
lambda v: isinstance(v, int) and v >= 1,
"integer >= 1 (minimum 1)",
),
PROBE_TIMING_FAILURE_THRESHOLD: (
lambda v: isinstance(v, int) and v >= 1,
"integer >= 1 (minimum 1)",
),
PROBE_TIMING_SUCCESS_THRESHOLD: (
lambda v: isinstance(v, int) and v >= 1,
"integer >= 1 (minimum 1)",
),
}

try:
probe_schema = _get_schema("V1Probe")
except (RuntimeError, ValueError) as e:
raise mlrun.errors.MLRunBadRequestError(f"Failed to load probe schema: {e}")

for sidecar in sidecars:
sidecar_name = sidecar.get("name", "unknown")

for probe_type in PROBE_KEYS:
probe = sidecar.get(probe_type)
if not probe:
continue

# Schema validation
is_valid, error = _validate_schema(probe, probe_schema, "V1Probe")
if not is_valid:
raise mlrun.errors.MLRunBadRequestError(
f"Invalid {probe_type} in sidecar '{sidecar_name}': {error}"
)

# HTTP probe validation - only path and port are mandatory if httpGet is configured
http_get = probe.get("httpGet")
if http_get:
error = _validate_http_probe(http_get, probe_type, sidecar_name)
if error:
raise mlrun.errors.MLRunBadRequestError(error)

# Timing parameters validation
for field, (validator, desc) in timing_fields.items():
if field in probe and not validator(probe[field]):
raise mlrun.errors.MLRunBadRequestError(
f"{probe_type} in sidecar '{sidecar_name}': {field} must be a {desc} integer"
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Private functions should be below Public functions
  2. This entire section shouldn't be in this file at all. While relevant for nuclio, it is a very kubernetes thing, so I would put it in mlrun/k8s_utils.py.

Generally, the endpoints package should be more of a high level transport layer. It should call the validations but not necessarily implement all of them.
I see that this file doesn't fully adhere to this convention, but it doesn't mean we should continue this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also - why is this needed? was this a requirement?
When gets the request from MLRun, it unmarshals the probe configuration into the Probe Go type, so if it doesn't adhere to it - the request will fail.
So are do we really need to do it in MLRun first?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomerShor After re-reading the requirements, I realized I missed an important part of the validation definition:

plus provide the user with an option to supply a dictionary with any parameter that is part of the probe configuration (in which case we will not validate the dictionary and it’s up to the user to make sure the parameters are correct).

Given this, the validation I added is not needed on the MLRun server side, so I’m going to remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@weilerN weilerN force-pushed the ML-11708-support_runtime_probes branch from 870c3d6 to fb81c79 Compare December 14, 2025 12:48
@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@weilerN weilerN force-pushed the ML-11708-support_runtime_probes branch 3 times, most recently from b05db2a to fac0095 Compare December 15, 2025 09:05
Copy link
Member

@rokatyy rokatyy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some inline comments

Comment on lines 341 to 338
try:
type = ProbeType(type.value if hasattr(type, "value") else type)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can also be done like here:

@classmethod
def is_valid(cls, value: str) -> bool:
return value in cls._value2member_map_

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 389 to 392
if not isinstance(type, ProbeType):
try:
type = ProbeType(type.value if hasattr(type, "value") else type)
except (ValueError, AttributeError):
raise ValueError(
f"Invalid probe type: {type}. Must be one of: {[p.value for p in ProbeType]}"
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value is used, even if it was modified after set_probe() was called
"""
internal_port = None
if hasattr(self.spec, "internal_application_port"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is defined as a property

@property
def internal_application_port(self):
return self._internal_application_port
, so it always exists, no need to check this again

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 1031 to 1032
http_get["port"] = internal_port
sidecar[probe_key] = probe_config
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a debug log with an enriched port value

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +1026 to +1013
raise ValueError(
f"Cannot enrich {probe_type.value} probe: HTTP probe requires a port, "
"but internal_application_port is not set. "
"Please set the internal_application_port or provide http_port in set_probe()."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just fyi - technically it should not be none, because we always enrich it here

self.internal_application_port = (
internal_application_port
or mlrun.mlconf.function.application.default_sidecar_internal_port
)
. So it can be none only if user explicitly set _internal_application_port to None

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rokatyy Is it possible that mlrun.mlconf.function.application.default_sidecar_internal_port is None?
If not than no problem I will remove this check

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@weilerN not really

Comment on lines 357 to 367
# Override timing parameters from explicit arguments
if initial_delay_seconds is not None:
probe_config[ProbeTimeConfig.INITIAL_DELAY_SECONDS.value] = (
initial_delay_seconds
)
if period_seconds is not None:
probe_config[ProbeTimeConfig.PERIOD_SECONDS.value] = period_seconds
if failure_threshold is not None:
probe_config[ProbeTimeConfig.FAILURE_THRESHOLD.value] = failure_threshold
if timeout_seconds is not None:
probe_config[ProbeTimeConfig.TIMEOUT_SECONDS.value] = timeout_seconds
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Override timing parameters from explicit arguments
if initial_delay_seconds is not None:
probe_config[ProbeTimeConfig.INITIAL_DELAY_SECONDS.value] = (
initial_delay_seconds
)
if period_seconds is not None:
probe_config[ProbeTimeConfig.PERIOD_SECONDS.value] = period_seconds
if failure_threshold is not None:
probe_config[ProbeTimeConfig.FAILURE_THRESHOLD.value] = failure_threshold
if timeout_seconds is not None:
probe_config[ProbeTimeConfig.TIMEOUT_SECONDS.value] = timeout_seconds
probe_config.update({
config.value: value
for config, value in {
ProbeTimeConfig.INITIAL_DELAY_SECONDS: initial_delay_seconds,
ProbeTimeConfig.PERIOD_SECONDS: period_seconds,
ProbeTimeConfig.FAILURE_THRESHOLD: failure_threshold,
ProbeTimeConfig.TIMEOUT_SECONDS: timeout_seconds,
}.items()
if value is not None
})

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@weilerN weilerN force-pushed the ML-11708-support_runtime_probes branch 2 times, most recently from daf226c to 2d5ccf0 Compare December 16, 2025 12:03
@weilerN weilerN marked this pull request as ready for review December 17, 2025 07:47
@weilerN weilerN requested review from a team and liranbg as code owners December 17, 2025 07:47
@weilerN weilerN force-pushed the ML-11708-support_runtime_probes branch from 2bbacfc to a92461e Compare December 17, 2025 09:01
@weilerN weilerN requested review from TomerShor and rokatyy December 17, 2025 09:04
"\n",
"### Setting probes\n",
"\n",
"The `set_probe()` method allows you to configure probes.\n"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a note that this overrides any existing probe configuration

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

auth_session = f"{_env_prefix}AUTH_SESSION"


# Kubernetes probe type keys
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Kubernetes probe type keys
# Kubernetes probe types

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +368 to +369
# Store probe configuration in the sidecar
sidecar = self._set_sidecar(self._get_sidecar_name())
sidecar[type.key] = probe_config
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if I don't pass ANY argument? all default to None, so will this basically delete the probe configuration, or should we deny this usage?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomerShor

  • If no arguments are provided (i.e. application.set_probes()), the execution will fail because type is a mandatory parameter.
  • Good question. I’ll raise this with Product. Currently, calling application.set_probes(type="readiness") creates a probes sidecar with empty values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@weilerN weilerN requested a review from TomerShor December 18, 2025 06:49
Copy link
Member

@TomerShor TomerShor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM !
great work

Copy link
Member

@rokatyy rokatyy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@weilerN weilerN force-pushed the ML-11708-support_runtime_probes branch 2 times, most recently from 83159bb to 6ceef05 Compare December 18, 2025 12:20
@TomerShor
Copy link
Member

@weilerN pull latest development, the CI failures should be fixed

@weilerN weilerN force-pushed the ML-11708-support_runtime_probes branch 3 times, most recently from d041bb1 to e4f83a4 Compare December 21, 2025 15:27
@weilerN weilerN force-pushed the ML-11708-support_runtime_probes branch from e4f83a4 to 772dce5 Compare December 22, 2025 10:51
@weilerN weilerN requested a review from TomerShor December 22, 2025 11:40
@TomerShor TomerShor merged commit cd91468 into mlrun:development Dec 22, 2025
16 checks passed
TomerShor pushed a commit that referenced this pull request Dec 23, 2025
### 📝 Description
This change adds backend validation for sidecar probe configurations
during Nuclio function deployment.

Before deploying the function, the API now validates that each
configured sidecar probe (`readinessProbe`, `livenessProbe`,
`startupProbe`) defines exactly one Kubernetes health-check mechanism
(`httpGet`, `exec`, `tcpSocket`, or `grpc`). Configurations that are
missing a health check or define multiple checks are rejected with a
clear 400 Bad Request error.

This validation is intentionally minimal and aligned with the Kubernetes
V1Probe schema, ensuring obvious misconfigurations are caught early
while avoiding deep field-level validation.

---

### 🛠️ Changes Made
- Added backend validation for sidecar probes before deployment.

---

### ✅ Checklist
- [ ] I updated the documentation (if applicable)
- [x] I have tested the changes in this PR
- [x] I confirmed whether my changes are covered by system tests
- [x] If yes, I ran all relevant system tests and ensured they passed
before submitting this PR
- [ ] I updated existing system tests and/or added new ones if needed to
cover my changes
- [ ] If I introduced a deprecation:
  - [ ] I followed the [Deprecation Guidelines](./DEPRECATION.md)
  - [ ] I updated the relevant Jira ticket for documentation

---

### 🧪 Testing
- Dev test: tested both negative and positive flows
- Added unit tests

---

### 🔗 References
- Ticket link: https://iguazio.atlassian.net/browse/ML-11788
- Design docs links:
- External links:
https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#configure-probes

---

### 🚨 Breaking Changes?

- [ ] Yes (explain below)
- [x] No

<!-- If yes, describe what needs to be changed downstream: -->

---

### 🔍️ Additional Notes
- This PR follows up on #9043
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants