[Application Runtime] Support setting probes for the application runtime sidecars #9043

weilerN · 2025-12-11T16:54:24Z

📝 Description

Add first-class support for configuring Kubernetes probes (readiness, liveness, startup) on the ApplicationRuntime sidecar.
This includes a clear API for setting and deleting probes, validation of probe types, and automatic enrichment of HTTP probe ports from the internal application port at deploy time.

🛠️ Changes Made

Introduced ProbeType and ProbeTimeConfig enums for consistent probe configuration keys.
Added set_probe() and delete_probe() APIs to ApplicationRuntime for managing sidecar probes.
Implemented validation and safe handling of probe types (enum and string-based).
Added automatic enrichment of HTTP probe ports using internal_application_port just before deployment.
Explicitly blocked probe APIs on non-Application runtimes with clear errors.

✅ Checklist

I updated the documentation (if applicable)
I have tested the changes in this PR
I confirmed whether my changes are covered by system tests
- If yes, I ran all relevant system tests and ensured they passed before submitting this PR
- I updated existing system tests and/or added new ones if needed to cover my changes
If I introduced a deprecation:
- I followed the Deprecation Guidelines
- I updated the relevant Jira ticket for documentation

🧪 Testing

Added unit tests covering all supported scenarios and edge cases
Dev test: tested that probes changes are propagated properly to the Nuclio side:
- Set a sleep as part of healthcheck function:
```
  @app.route("/internal")
  def internal():
      sleep(10)
      return "/internal"     
```
- created application runtime and deployment failed (unhealthy):
  Warning Unhealthy 3m1s (x867 over 138m) kubelet Readiness probe failed: Get "http://10.233.67.57:5000/internal": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Configured probes:

application.set_probe(type="readiness",
                    initial_delay_seconds=15,
                    period_seconds=10,
                    timeout_seconds=20,
                    http_path="/internal",
                    http_port=5000)

Than deployed again and the function deployed healthy:

🔗 References

Ticket link: https://iguazio.atlassian.net/browse/ML-11708
Design docs links:
External links:

🚨 Breaking Changes?

Yes (explain below)
No

🔍️ Additional Notes

TomerShor

An initial review

mlrun/common/runtimes/constants.py

TomerShor · 2025-12-11T18:40:34Z

server/py/services/api/api/endpoints/nuclio.py

+_K8S_CLASSES = {
+    "V1HTTPGetAction": V1HTTPGetAction,
+    "V1ExecAction": V1ExecAction,
+    "V1TCPSocketAction": V1TCPSocketAction,
+    "V1GRPCAction": V1GRPCAction,
+    "V1Probe": V1Probe,
+}
+_SCHEMA_CACHE: dict[str, dict] = {}
+
+
+def _get_schema(class_name: str) -> dict:
+    """Get and cache Kubernetes class schema."""
+    if class_name in _SCHEMA_CACHE:
+        return _SCHEMA_CACHE[class_name]
+
+    k8s_class = _K8S_CLASSES.get(class_name)
+    if not k8s_class:
+        raise ValueError(f"Unknown Kubernetes class: {class_name}")
+
+    schema = getattr(k8s_class, "openapi_types", None) or getattr(
+        k8s_class, "swagger_types", None
+    )
+    if not schema:
+        raise RuntimeError(f"No schema found for {class_name}")
+
+    _SCHEMA_CACHE[class_name] = schema
+    return schema
+
+
+def _validate_schema(
+    data: dict, schema: dict, schema_class_name: str
+) -> tuple[bool, str]:
+    """Recursively validate dict against Kubernetes API schema."""
+    _type_checkers = {
+        "str": lambda v: isinstance(v, str),
+        "int": lambda v: isinstance(v, int)
+        or (isinstance(v, float) and v.is_integer()),
+        "bool": lambda v: isinstance(v, bool),
+        "datetime": lambda v: isinstance(v, str | dict),
+        "date": lambda v: isinstance(v, str | dict),
+        "object": lambda v: isinstance(v, str | dict),
+    }
+
+    def _is_compatible(field_value: typing.Any, expected_field_type: str) -> bool:
+        if field_value is None:
+            return True
+        if expected_field_type in _type_checkers:
+            return _type_checkers[expected_field_type](field_value)
+        return False
+
+    for key, value in data.items():
+        if key not in schema:
+            return False, f"{schema_class_name}: unknown field '{key}'"
+
+        expected_type = schema[key]
+
+        # Handle nested Kubernetes classes
+        if isinstance(expected_type, str) and expected_type in _K8S_CLASSES:
+            if value is None:
+                continue
+            if not isinstance(value, dict):
+                return (
+                    False,
+                    f"{schema_class_name}.{key}: expected dict, got {type(value).__name__}",
+                )
+            try:
+                nested_schema = _get_schema(expected_type)
+                is_valid, error = _validate_schema(value, nested_schema, expected_type)
+                if not is_valid:
+                    return False, error
+            except (RuntimeError, ValueError) as e:
+                return False, f"{schema_class_name}.{key}: {str(e)}"
+
+        elif not _is_compatible(value, expected_type):
+            return (
+                False,
+                f"{schema_class_name}.{key}: invalid type {type(value).__name__}",
+            )
+
+    return True, ""
+
+
+def _validate_http_probe(
+    http_get: dict, probe_type: str, sidecar_name: str
+) -> str | None:
+    """Validate HTTP probe configuration.
+
+    Only path and port are mandatory if httpGet is configured.
+
+    :param http_get: The httpGet configuration dict from the probe
+    :param probe_type: The type of probe (readinessProbe, livenessProbe, startupProbe)
+    :param sidecar_name: The name of the sidecar container
+    :return: Error message string if validation fails, None if validation passes
+    """
+    if not isinstance(http_get, dict):
+        return f"{probe_type} in sidecar '{sidecar_name}': httpGet must be a dict"
+    if not http_get.get("path"):
+        return f"{probe_type} in sidecar '{sidecar_name}': httpGet.path is required"
+    if not http_get.get("port"):
+        return f"{probe_type} in sidecar '{sidecar_name}': httpGet.port is required"
+    if not isinstance(http_get.get("port"), int | str):
+        return (
+            f"{probe_type} in sidecar '{sidecar_name}': httpGet.port must be int or str"
+        )
+    scheme = http_get.get("scheme")
+    if scheme and scheme not in ["HTTP", "HTTPS"]:
+        return f"{probe_type} in sidecar '{sidecar_name}': httpGet.scheme must be HTTP or HTTPS"
+    return None
+
+
+def _validate_sidecar_probes(sidecars: list[dict]) -> None:
+    """Validate probe configurations in sidecars against Kubernetes V1Probe schema."""
+    # Minimum values are based on Kubernetes probe configuration documentation:
+    # https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#configure-probes
+    timing_fields = {
+        PROBE_TIMING_INITIAL_DELAY_SECONDS: (
+            lambda v: isinstance(v, int) and v >= 0,
+            "non-negative integer (minimum 0)",
+        ),
+        PROBE_TIMING_PERIOD_SECONDS: (
+            lambda v: isinstance(v, int) and v >= 1,
+            "integer >= 1 (minimum 1)",
+        ),
+        PROBE_TIMING_TIMEOUT_SECONDS: (
+            lambda v: isinstance(v, int) and v >= 1,
+            "integer >= 1 (minimum 1)",
+        ),
+        PROBE_TIMING_FAILURE_THRESHOLD: (
+            lambda v: isinstance(v, int) and v >= 1,
+            "integer >= 1 (minimum 1)",
+        ),
+        PROBE_TIMING_SUCCESS_THRESHOLD: (
+            lambda v: isinstance(v, int) and v >= 1,
+            "integer >= 1 (minimum 1)",
+        ),
+    }
+
+    try:
+        probe_schema = _get_schema("V1Probe")
+    except (RuntimeError, ValueError) as e:
+        raise mlrun.errors.MLRunBadRequestError(f"Failed to load probe schema: {e}")
+
+    for sidecar in sidecars:
+        sidecar_name = sidecar.get("name", "unknown")
+
+        for probe_type in PROBE_KEYS:
+            probe = sidecar.get(probe_type)
+            if not probe:
+                continue
+
+            # Schema validation
+            is_valid, error = _validate_schema(probe, probe_schema, "V1Probe")
+            if not is_valid:
+                raise mlrun.errors.MLRunBadRequestError(
+                    f"Invalid {probe_type} in sidecar '{sidecar_name}': {error}"
+                )
+
+            # HTTP probe validation - only path and port are mandatory if httpGet is configured
+            http_get = probe.get("httpGet")
+            if http_get:
+                error = _validate_http_probe(http_get, probe_type, sidecar_name)
+                if error:
+                    raise mlrun.errors.MLRunBadRequestError(error)
+
+            # Timing parameters validation
+            for field, (validator, desc) in timing_fields.items():
+                if field in probe and not validator(probe[field]):
+                    raise mlrun.errors.MLRunBadRequestError(
+                        f"{probe_type} in sidecar '{sidecar_name}': {field} must be a {desc} integer"
+                    )


Private functions should be below Public functions

This entire section shouldn't be in this file at all. While relevant for nuclio, it is a very kubernetes thing, so I would put it in mlrun/k8s_utils.py.

Generally, the endpoints package should be more of a high level transport layer. It should call the validations but not necessarily implement all of them.
I see that this file doesn't fully adhere to this convention, but it doesn't mean we should continue this.

Also - why is this needed? was this a requirement?
When gets the request from MLRun, it unmarshals the probe configuration into the Probe Go type, so if it doesn't adhere to it - the request will fail.
So are do we really need to do it in MLRun first?

@TomerShor After re-reading the requirements, I realized I missed an important part of the validation definition:

plus provide the user with an option to supply a dictionary with any parameter that is part of the probe configuration (in which case we will not validate the dictionary and it’s up to the user to make sure the parameters are correct).

Given this, the validation I added is not needed on the MLRun server side, so I’m going to remove it.

Fixed here - CR comments 1 - Avoid non required validations

review-notebook-app · 2025-12-14T17:07:21Z

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

rokatyy

some inline comments

rokatyy · 2025-12-12T11:42:08Z

mlrun/runtimes/nuclio/application/application.py

+            try:
+                type = ProbeType(type.value if hasattr(type, "value") else type)


can also be done like here:

mlrun/mlrun/common/schemas/partition.py

Lines 25 to 27 in 7728707

@classmethod

def is_valid(cls, value: str) -> bool:

return value in cls._value2member_map_

Fixed here - CR 1- fixing Kate comments

rokatyy · 2025-12-12T11:44:03Z

mlrun/runtimes/nuclio/application/application.py

+        if not isinstance(type, ProbeType):
+            try:
+                type = ProbeType(type.value if hasattr(type, "value") else type)
+            except (ValueError, AttributeError):
+                raise ValueError(
+                    f"Invalid probe type: {type}. Must be one of: {[p.value for p in ProbeType]}"
+                )


Fixed here - CR 1- fixing Kate comments

rokatyy · 2025-12-12T11:46:35Z

mlrun/runtimes/nuclio/application/application.py

+          value is used, even if it was modified after set_probe() was called
+        """
+        internal_port = None
+        if hasattr(self.spec, "internal_application_port"):


it is defined as a property

mlrun/mlrun/runtimes/nuclio/application/application.py

Lines 196 to 198 in 435559a

@property

def internal_application_port(self):

return self._internal_application_port

, so it always exists, no need to check this again

Fixed here - CR 1- fixing Kate comments

rokatyy · 2025-12-12T11:53:13Z

mlrun/runtimes/nuclio/application/application.py

+                    http_get["port"] = internal_port
+                    sidecar[probe_key] = probe_config


add a debug log with an enriched port value

Fixed here - CR 1- fixing Kate comments

rokatyy · 2025-12-12T11:54:44Z

mlrun/runtimes/nuclio/application/application.py

+                        raise ValueError(
+                            f"Cannot enrich {probe_type.value} probe: HTTP probe requires a port, "
+                            "but internal_application_port is not set. "
+                            "Please set the internal_application_port or provide http_port in set_probe()."


just fyi - technically it should not be none, because we always enrich it here

mlrun/mlrun/runtimes/nuclio/application/application.py

Lines 162 to 165 in a3f2f2f

self.internal_application_port = (

internal_application_port

or mlrun.mlconf.function.application.default_sidecar_internal_port

)

. So it can be none only if user explicitly set _internal_application_port to None

@rokatyy Is it possible that mlrun.mlconf.function.application.default_sidecar_internal_port is None?
If not than no problem I will remove this check

@weilerN not really

rokatyy · 2025-12-15T09:15:51Z

mlrun/runtimes/nuclio/application/application.py

+        # Override timing parameters from explicit arguments
+        if initial_delay_seconds is not None:
+            probe_config[ProbeTimeConfig.INITIAL_DELAY_SECONDS.value] = (
+                initial_delay_seconds
+            )
+        if period_seconds is not None:
+            probe_config[ProbeTimeConfig.PERIOD_SECONDS.value] = period_seconds
+        if failure_threshold is not None:
+            probe_config[ProbeTimeConfig.FAILURE_THRESHOLD.value] = failure_threshold
+        if timeout_seconds is not None:
+            probe_config[ProbeTimeConfig.TIMEOUT_SECONDS.value] = timeout_seconds


Suggested change

# Override timing parameters from explicit arguments

if initial_delay_seconds is not None:

probe_config[ProbeTimeConfig.INITIAL_DELAY_SECONDS.value] = (

initial_delay_seconds

)

if period_seconds is not None:

probe_config[ProbeTimeConfig.PERIOD_SECONDS.value] = period_seconds

if failure_threshold is not None:

probe_config[ProbeTimeConfig.FAILURE_THRESHOLD.value] = failure_threshold

if timeout_seconds is not None:

probe_config[ProbeTimeConfig.TIMEOUT_SECONDS.value] = timeout_seconds

probe_config.update({

config.value: value

for config, value in {

ProbeTimeConfig.INITIAL_DELAY_SECONDS: initial_delay_seconds,

ProbeTimeConfig.PERIOD_SECONDS: period_seconds,

ProbeTimeConfig.FAILURE_THRESHOLD: failure_threshold,

ProbeTimeConfig.TIMEOUT_SECONDS: timeout_seconds,

}.items()

if value is not None

})

Fixed here - CR 1- fixing Kate comments

TomerShor · 2025-12-17T11:16:03Z

docs/runtimes/application.ipynb

+    "\n",
+    "### Setting probes\n",
+    "\n",
+    "The `set_probe()` method allows you to configure probes.\n"


Add a note that this overrides any existing probe configuration

Fixed here - CR 2- align function parameters with an user friendly usage

TomerShor · 2025-12-17T11:16:21Z

mlrun/common/runtimes/constants.py

    auth_session = f"{_env_prefix}AUTH_SESSION"
+
+
+# Kubernetes probe type keys


Suggested change

# Kubernetes probe type keys

# Kubernetes probe types

Fixed here - CR 2- align function parameters with an user friendly usage

TomerShor · 2025-12-17T11:17:58Z

mlrun/runtimes/nuclio/application/application.py

+        # Store probe configuration in the sidecar
+        sidecar = self._set_sidecar(self._get_sidecar_name())
+        sidecar[type.key] = probe_config


What if I don't pass ANY argument? all default to None, so will this basically delete the probe configuration, or should we deny this usage?

@TomerShor

If no arguments are provided (i.e. application.set_probes()), the execution will fail because type is a mandatory parameter.

Good question. I’ll raise this with Product. Currently, calling application.set_probes(type="readiness") creates a probes sidecar with empty values.

Fixed here - CR 3 - validate no empty request in client side

TomerShor

LGTM !
great work

rokatyy

LGTM

TomerShor · 2025-12-21T07:33:53Z

@weilerN pull latest development, the CI failures should be fixed

### 📝 Description This change adds backend validation for sidecar probe configurations during Nuclio function deployment. Before deploying the function, the API now validates that each configured sidecar probe (`readinessProbe`, `livenessProbe`, `startupProbe`) defines exactly one Kubernetes health-check mechanism (`httpGet`, `exec`, `tcpSocket`, or `grpc`). Configurations that are missing a health check or define multiple checks are rejected with a clear 400 Bad Request error. This validation is intentionally minimal and aligned with the Kubernetes V1Probe schema, ensuring obvious misconfigurations are caught early while avoiding deep field-level validation. --- ### 🛠️ Changes Made - Added backend validation for sidecar probes before deployment. --- ### ✅ Checklist - [ ] I updated the documentation (if applicable) - [x] I have tested the changes in this PR - [x] I confirmed whether my changes are covered by system tests - [x] If yes, I ran all relevant system tests and ensured they passed before submitting this PR - [ ] I updated existing system tests and/or added new ones if needed to cover my changes - [ ] If I introduced a deprecation: - [ ] I followed the [Deprecation Guidelines](./DEPRECATION.md) - [ ] I updated the relevant Jira ticket for documentation --- ### 🧪 Testing - Dev test: tested both negative and positive flows - Added unit tests --- ### 🔗 References - Ticket link: https://iguazio.atlassian.net/browse/ML-11788 - Design docs links: - External links: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#configure-probes --- ### 🚨 Breaking Changes? - [ ] Yes (explain below) - [x] No  --- ### 🔍️ Additional Notes - This PR follows up on #9043

github-actions bot added area/sdk area/server area/api labels Dec 11, 2025

weilerN force-pushed the ML-11708-support_runtime_probes branch 2 times, most recently from 17e66cb to a3f2f2f Compare December 11, 2025 17:00

TomerShor requested changes Dec 11, 2025

View reviewed changes

weilerN force-pushed the ML-11708-support_runtime_probes branch from 870c3d6 to fb81c79 Compare December 14, 2025 12:48

github-actions bot added the area/tests label Dec 14, 2025

github-actions bot added the area/docs label Dec 14, 2025

weilerN force-pushed the ML-11708-support_runtime_probes branch 3 times, most recently from b05db2a to fac0095 Compare December 15, 2025 09:05

rokatyy reviewed Dec 15, 2025

View reviewed changes

github-actions bot added the area/system-tests label Dec 16, 2025

weilerN force-pushed the ML-11708-support_runtime_probes branch 2 times, most recently from daf226c to 2d5ccf0 Compare December 16, 2025 12:03

weilerN marked this pull request as ready for review December 17, 2025 07:47

weilerN requested review from a team and liranbg as code owners December 17, 2025 07:47

weilerN force-pushed the ML-11708-support_runtime_probes branch from 2bbacfc to a92461e Compare December 17, 2025 09:01

weilerN requested review from TomerShor and rokatyy December 17, 2025 09:04

TomerShor requested changes Dec 17, 2025

View reviewed changes

weilerN requested a review from TomerShor December 18, 2025 06:49

TomerShor approved these changes Dec 18, 2025

View reviewed changes

rokatyy approved these changes Dec 18, 2025

View reviewed changes

weilerN force-pushed the ML-11708-support_runtime_probes branch 2 times, most recently from 83159bb to 6ceef05 Compare December 18, 2025 12:20

weilerN mentioned this pull request Dec 18, 2025

[Application Runtime] Validate probes health check #9088

Closed

10 tasks

weilerN force-pushed the ML-11708-support_runtime_probes branch 3 times, most recently from d041bb1 to e4f83a4 Compare December 21, 2025 15:27

weilerN added 13 commits December 22, 2025 12:51

Client side implementation

29484cd

Server side implementation

9834d52

Move consts to common + fmt

30e4868

linter

c586efb

CR comments 1 - Improve constants usage

74f22bf

CR comments 1 - Avoid non required validations

edb6542

default scheme should be upper case

ddc4ff5

Add UTs

2386468

Update docs

90dbcd7

Add system test

716476e

CR 1- fixing Kate comments

c377d83

CR 2- align function parameters with an user friendly usage

449cd69

CR 3 - validate no empty request in client side

772dce5

weilerN force-pushed the ML-11708-support_runtime_probes branch from e4f83a4 to 772dce5 Compare December 22, 2025 10:51

weilerN requested a review from TomerShor December 22, 2025 11:40

TomerShor approved these changes Dec 22, 2025

View reviewed changes

TomerShor merged commit cd91468 into mlrun:development Dec 22, 2025
16 checks passed

weilerN mentioned this pull request Dec 22, 2025

[Application Runtime] Validate probes health check #9101

Merged

10 tasks

		try:
		type = ProbeType(type.value if hasattr(type, "value") else type)

	@classmethod
	def is_valid(cls, value: str) -> bool:
	return value in cls._value2member_map_

	@property
	def internal_application_port(self):
	return self._internal_application_port

		http_get["port"] = internal_port
		sidecar[probe_key] = probe_config

	self.internal_application_port = (
	internal_application_port
	or mlrun.mlconf.function.application.default_sidecar_internal_port
	)

		auth_session = f"{_env_prefix}AUTH_SESSION"


		# Kubernetes probe type keys

[Application Runtime] Support setting probes for the application runtime sidecars #9043

[Application Runtime] Support setting probes for the application runtime sidecars #9043

Uh oh!

Conversation

weilerN commented Dec 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

📝 Description

🛠️ Changes Made

✅ Checklist

🧪 Testing

🔗 References

🚨 Breaking Changes?

🔍️ Additional Notes

Uh oh!

TomerShor left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

review-notebook-app bot commented Dec 14, 2025

Uh oh!

rokatyy left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

TomerShor left a comment

Choose a reason for hiding this comment

Uh oh!

rokatyy left a comment

Choose a reason for hiding this comment

Uh oh!

TomerShor commented Dec 21, 2025

Uh oh!

weilerN commented Dec 11, 2025 •

edited

Loading