Skip to content

Conversation

@rueian
Copy link
Collaborator

@rueian rueian commented Feb 20, 2025

Why are these changes needed?

Part of the #3076, adding length validation to RayCluster, RayService, and RayJob to make sure we don't cut k8s service names:

  1. RayCluster: 63 - len("-serve-svc") = 53 characters at most.
  2. RayService: 53 - len("-xxxxx") = 47 characters at most.
  3. RayJob: 53 - len("-xxxxx") = 47 characters at most.

After #3101, the longest suffix for a K8s service is now "-serve-svc", which takes 10 characters. Therefore, we can only use 63-10=53 characters for the name of a RayCluster.

After #3102, we now add a "-" and 5 random characters to the name of either RayService or RayJob to create the underlying RayCluster. Therefore, they can only have 53-6=47 characters for their name.

Any RayCluster, RayService, or RayJob with a name exceeding the above limit will not be reconciled further and will receive an invalid spec event, which can be observed by kubectl describe.

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@rueian rueian force-pushed the validate-raycluster-name branch from 1032d24 to 5cd981c Compare February 20, 2025 22:24
@rueian rueian changed the title feat: validate the length of RayCluster name and worker group names [feat][RayCluster] validate the length of RayCluster name and worker group names Feb 21, 2025
@rueian rueian force-pushed the validate-raycluster-name branch from d465e97 to b24ba93 Compare February 21, 2025 02:22
labels := map[string]string{
utils.RayClusterLabelKey: cluster.Name,
utils.RayIDLabelKey: utils.GenerateIdentifier(cluster.Name, rayv1.HeadNode),
utils.RayIDLabelKey: utils.CheckLabel(utils.GenerateIdentifier(cluster.Name, rayv1.HeadNode)),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an old bug that forgets to chunk the value for utils.RayIDLabelKey.

labels := map[string]string{
utils.RayClusterLabelKey: cluster.Name,
utils.RayIDLabelKey: utils.GenerateIdentifier(cluster.Name, rayv1.HeadNode),
utils.RayIDLabelKey: utils.CheckLabel(utils.GenerateIdentifier(cluster.Name, rayv1.HeadNode)),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an old bug that forgets to chunk the value for utils.RayIDLabelKey.

)

func TestRayClusterGCSFaultTolerence(t *testing.T) {
func TestRayClusterGCSFaultTolerance(t *testing.T) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix typo.

@rueian rueian marked this pull request as ready for review February 21, 2025 05:09
@kevin85421 kevin85421 self-assigned this Feb 21, 2025
@rueian rueian force-pushed the validate-raycluster-name branch 2 times, most recently from e6ee5ac to 8817876 Compare February 25, 2025 00:18
@rueian rueian marked this pull request as draft February 25, 2025 23:49
@rueian rueian force-pushed the validate-raycluster-name branch from 3426bc8 to e26affb Compare March 2, 2025 23:13
logger := ctrl.LoggerFrom(ctx)

// making sure the name is valid
svc.Name = utils.CheckName(svc.Name)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No more utils.CheckName to cut service names.

Comment on lines -265 to +267
return CheckName(fmt.Sprintf("%s-%s-%s", ownerName, rayv1.HeadNode, "svc")), nil
return fmt.Sprintf("%s-%s-%s", ownerName, rayv1.HeadNode, "svc"), nil
case RayClusterCRD:
headSvcName := CheckName(fmt.Sprintf("%s-%s-%s", ownerName, rayv1.HeadNode, "svc"))
headSvcName := fmt.Sprintf("%s-%s-%s", ownerName, rayv1.HeadNode, "svc")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No more utils.CheckName to cut service names.

// GenerateServeServiceName generates name for serve service.
func GenerateServeServiceName(serviceName string) string {
return CheckName(fmt.Sprintf("%s-%s-%s", serviceName, ServeName, "svc"))
return fmt.Sprintf("%s-%s-%s", serviceName, ServeName, "svc")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No more utils.CheckName to cut service names.

@rueian rueian changed the title [feat][RayCluster] validate the length of RayCluster name and worker group names [feat] validate the name length of RayCluster, RayService, and RayJob Mar 3, 2025
@rueian rueian force-pushed the validate-raycluster-name branch from e26affb to c9c3e9c Compare March 3, 2025 04:40
@rueian rueian force-pushed the validate-raycluster-name branch from c9c3e9c to 78f39a2 Compare March 3, 2025 05:18
@rueian rueian marked this pull request as ready for review March 3, 2025 06:51
@davidxia
Copy link
Collaborator

davidxia commented Mar 3, 2025

Would it be useful to also validate in webhooks like here?

func (w *RayClusterWebhook) validateName(rayCluster *rayv1.RayCluster) *field.Error {
if !nameRegex.MatchString(rayCluster.Name) {
return field.Invalid(field.NewPath("metadata").Child("name"), rayCluster.Name, "name must consist of lower case alphanumeric characters or '-', start with an alphabetic character, and end with an alphanumeric character (e.g. 'my-name', or 'abc-123', regex used for validation is '[a-z]([-a-z0-9]*[a-z0-9])?')")
}
return nil
}

These webhooks can optionally be enabled by user (disabled by default). They prevent creation of invalid resources with a more obvious error than the controller events.

@rueian
Copy link
Collaborator Author

rueian commented Mar 3, 2025

Hi @davidxia, while it would certainly be useful to replicate the validation to the webhook, we don’t do it now. Most Kuberay users don’t use the validation webhook.

@rueian rueian force-pushed the validate-raycluster-name branch from af47a93 to 32f719c Compare March 3, 2025 16:48
Copy link
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I felt that adding E2E tests for this was a bit overkill. Does env tests check the K8s resource name length?

@rueian rueian force-pushed the validate-raycluster-name branch 2 times, most recently from 342afe0 to 5371440 Compare March 6, 2025 06:49
@rueian
Copy link
Collaborator Author

rueian commented Mar 6, 2025

I felt that adding E2E tests for this was a bit overkill. Does env tests check the K8s resource name length?

Sure, I changed them to env tests.

}

if err := utils.ValidateRayClusterMetadata(instance.ObjectMeta); err != nil {
logger.Error(err, fmt.Sprintf("The RayCluster metadata is invalid %s/%s", instance.Namespace, instance.Name))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid using fmt.Sprintf in logger.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.


if err := utils.ValidateRayClusterMetadata(instance.ObjectMeta); err != nil {
logger.Error(err, fmt.Sprintf("The RayCluster metadata is invalid %s/%s", instance.Namespace, instance.Name))
r.Recorder.Eventf(instance, corev1.EventTypeWarning, string(utils.InvalidRayClusterSpec),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

InvalidRayClusterSpec -> InvalidaRayClusterMetadata?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

}

func ValidateRayJobSpec(rayJob *rayv1.RayJob) error {
if len(rayJob.Name) > MaxRayJobNameLength {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can (1) separate the logic into ValidateRayJobMetadata / ValidateRayServiceMetadata (2) only pass CR spec to the function ValidateXXXXSpec as a follow up PR. This will be a good first issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. It's weird that we have ValidateRayClusterMetadata and ValidateRayCluseterSpec, but only ValidateRayJobSpec and no ValidateRayJobMetadata.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done. I can open the issue for (2) after this is merged.

namespace := "default"
rayCluster := rayClusterTemplate(strings.Repeat("r", utils.MaxRayClusterNameLength), namespace)
numOfHosts := int32(4)
rayCluster.Spec.WorkerGroupSpecs[0].NumOfHosts = numOfHosts
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to set numOfHosts?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need it to create the -headless service. A comment is added.

numOfHosts := int32(4)
rayCluster.Spec.WorkerGroupSpecs[0].NumOfHosts = numOfHosts
rayCluster.Annotations = make(map[string]string)
rayCluster.Annotations[utils.EnableServeServiceKey] = "true"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would you mind adding comments to explain why this test requires this (i.e. -serve-svc is the longest suffix name)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, a comment is added.

time.Second*3, time.Millisecond*500).Should(BeNil(), "Should be able to see RayCluster: %v", rayCluster.Name)
})

It("Check RayCluster service names", func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe also check Pods are created

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok! Added above.


var _ = Context("RayService env tests", func() {
Describe("RayService with a maximum name", Ordered, func() {
// If Autoscaler scales up the pending or active RayCluster, zero downtime upgrade should not be triggered.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this comment?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deleted. thanks!

It("Should create a pending RayCluster", func() {
Eventually(
getPreparingRayClusterNameFunc(ctx, rayService),
time.Second*15, time.Millisecond*500).Should(Not(BeEmpty()), "Pending RayCluster name: %v", rayService.Status.PendingServiceStatus.RayClusterName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change all time.Second*15 in this PR to time.Second*3 for consistency.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@rueian rueian force-pushed the validate-raycluster-name branch from 5371440 to d3a273c Compare March 6, 2025 07:54
Copy link
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just left some nit comments. Other LGTM

}

if err := utils.ValidateRayClusterMetadata(instance.ObjectMeta); err != nil {
logger.Error(err, "The RayCluster metadata is invalid", "namespace", instance.Namespace, "name", instance.Name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need namespace and name? The context information should be added automatically.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

errorMessage: fmt.Sprintf("RayCluster name should be no more than %d characters", MaxRayClusterNameLength),
},
{
name: "Both RayCluster name is ok (== MaxRayClusterNameLength)",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does "both" refer to?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

ObjectMeta: metav1.ObjectMeta{
Name: strings.Repeat("A", MaxRayClusterNameLength),
},
Spec: rayv1.RayClusterSpec{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is cluster spec needed for this test?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.


if err := utils.ValidateRayClusterSpec(&instance.Spec, instance.Annotations); err != nil {
logger.Error(err, fmt.Sprintf("The RayCluster spec is invalid %s/%s", instance.Namespace, instance.Name))
logger.Error(err, "The RayCluster spec is invalid", "namespace", instance.Namespace, "name", instance.Name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need namespace and name?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

@kevin85421
Copy link
Member

@MortalHappiness please review another iteration. Thanks!

@rueian rueian force-pushed the validate-raycluster-name branch from 7da6ea2 to ee6c895 Compare March 7, 2025 21:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants