Skip to content

Conversation

@danielye11
Copy link

@danielye11 danielye11 commented Oct 3, 2022

Update the logic for container stats collection, storing the stats on the container object now. This is in preparation for the CRI changes described in KEP-2371, which adds a new gRPC call that returns unstructured Prometheus metrics. These will not be returned via a runtime struct, and instead will be added to stats. The additional unstructured metrics described in the above KEP will also be modified and retrieved from the container stats cache.

Directly tied to this issue

@k8s-ci-robot
Copy link

Hi @danielye11. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


c := s.containers[id]
c.Stats = newContainerStats
if c.Stats != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to check if c.Stats is nil here? Shouldn't it be initialized previously?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this checks for the case where old stats is nil here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also get a memory panic if I take it out

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, where do we actually assign c.Stats? Is it ever assigned?

As I see we are only returning *containerstorestats.ContainerStats in generatedContainerMetrics, but we are not actually storing the stats on the container object? Is that right? If that's the case, then c.Stats will always be nil...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, good point -- actual caching of the metrics should be fixed now.

@samuelkarp
Copy link
Member

CI is failing as all of your commits are missing valid Signed-off-by lines in the commit message. Please see the contribution guide for instructions to sign-off your commits.

@samuelkarp
Copy link
Member

Your commit messages and PR description don't really provide much context for the changes you'd like to contribute to containerd. Can you describe more of what you're trying to achieve? Is there an associated issue for this PR with more information?

@samuelkarp samuelkarp added kind/enhancement area/cri Container Runtime Interface (CRI) status/more-info-needed Awaiting contributor information status/needs-update Awaiting contributor update labels Oct 5, 2022
@danielye11
Copy link
Author

Your commit messages and PR description don't really provide much context for the changes you'd like to contribute to containerd. Can you describe more of what you're trying to achieve? Is there an associated issue for this PR with more information?

Updated description of PR

@bobbypage
Copy link
Contributor

Left a few small comments, but looking really good. Maybe time to move this out of draft?

@danielye11 danielye11 force-pushed the stats-refactor branch 6 times, most recently from 8459d3f to 1eda6ec Compare October 12, 2022 21:55
@danielye11 danielye11 marked this pull request as ready for review October 12, 2022 22:00
@samuelkarp samuelkarp self-requested a review October 12, 2022 22:05
@danielye11 danielye11 force-pushed the stats-refactor branch 3 times, most recently from 362cfa2 to b330dc0 Compare October 14, 2022 22:41
@danielye11 danielye11 force-pushed the stats-refactor branch 3 times, most recently from 650ce29 to 45f4d5f Compare October 17, 2022 21:54
Change how ListPodSandboxStats and ListContainerStats retrieves metrics.
Previously, metrics were fetched from cgroup and returned directly via a
runtime struct (ListContainerStats). With the need of Prometheus
exporting, we will now collect these metrics and cache them, fetching
metrics from the cache and exporting them to whichever metric type we
need.

Cache the metrics added in stats object of container and sandbox and revamp getUsageNanoCores to not cache metrics inside that function. Add and update tests based on changes.

Signed-off-by: Daniel Ye <[email protected]>
Copy link
Member

@samuelkarp samuelkarp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danielye11, thanks for opening this PR! I took a first pass at it and left a few comments, but also I have two more overall comments:

  1. I think I'm missing the caching and expiration/eviction behavior here. I do see where stats are stored (in the container.Store), but it looks like they're never updated and also never read.
  2. It would be helpful for me as a reviewer to break this up into at least two commits: one where the structural changes (types, refactoring) are done and a second with the functional changes (new data + caching behavior).

Comment on lines +21 to +24
ContainerCPUStats
ContainerMemoryStats
ContainerFileSystemStats
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the rationale for struct embedding here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All stats whose additional metrics will eventually be returned via Prometheus format. Rationale would be to add the additional caching logic in the future, and since stats and metrics will be set in the same place (reading the same linux cgroup file) we can read from the container stats store or sandbox stats store.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe my question wasn't clear. What's the reason for embedding versus using fields?

// embedding
type ContainerStats struct {
	ContainerCPUStats
	ContainerMemoryStats
	ContainerFileSystemStats
}

// fields
type ContainerStats struct {
	CPU        ContainerCPUStats
	Memory     ContainerMemoryStats
	FileSystem ContainerFileSystemStats
}

Comment on lines +27 to +28
// Timestamp in nanoseconds at which the information were collected. Must be > 0.
Timestamp int64
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each of these three structs (ContainerCPUStats, ContainerMemoryStats, and ContainerFileSystemStats contain a Timestamp field. Since all three are then embedded into ContainerStats, only one of the three Timestamp fields will be visible as ContainerStats.Timestamp; the other two will be shadowed. This can be a source of confusing behavior; if we do need to have this embedding pattern, please move Timestamp out of the individual structs and up into ContainerStats directly.

Timestamp int64
// Cumulative CPU usage (sum across all cores) since object creation.
UsageCoreNanoSeconds uint64
// Total CPU usage (sum of all cores) averaged over the sample window.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the sample window?


type ContainerCPUStats struct {
// Timestamp in nanoseconds at which the information were collected. Must be > 0.
Timestamp int64
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Existing code using the ContainerStats.Timestamp field are seeing a time.Time. This code changes it to an int64. Is there a specific reason why? Generally time.Time would be preferred as a standard type for representing time.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CRI fields use int64, figured it would be easier than converting

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is an internal struct that needs to be converted on the way out anyway, I think it'd make more sense to centralize the conversion and maintain the standard time.Time here.

WorkingSetBytes uint64
// Available memory for use. This is defined as the memory limit - workingSetBytes.
AvailableBytes uint64
// Total memory in use. This includes all memory regardless of when it was accessed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I can parse "when it was accessed". Does this mean you're tracking the total number of bytes over the lifetime of a container (so maybe you mean the max usage?).

case *v1.Metrics:
if metrics.Memory != nil && metrics.Memory.Usage != nil {
workingSetBytes := getWorkingSet(metrics.Memory)
func (c *criService) setContainerStats(containerID string, isSandbox bool, cs *containerstorestats.ContainerStats) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function appears to do nothing if there is already a stats.ContainerStats attached to the container or sandbox. Is that intentional?

}

cpuStats, err := c.cpuContainerStats(meta.ID, false /* isSandbox */, s, protobuf.FromTimestamp(stats.Timestamp))
c.setContainerStats(meta.ID, false, &cs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's odd to store a pointer to a struct, and then continue to mutate the struct (lines 115 and 121). It looks like this breaks the intent of the mutex in the sandbox.Store and container.Store, opening up the possibility of a data race.

Comment on lines +244 to +249
if isSandbox {
sandbox, err := c.sandboxStore.Get(containerID)
if err != nil {
return fmt.Errorf("failed to get sandbox container: %s: %w", containerID, err)
}
case *v2.Metrics:
if metrics.Memory != nil {
workingSetBytes := getWorkingSetV2(metrics.Memory)
oldStats = sandbox.Stats
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look like setContainerStats is ever called with isSandbox set to true. Can you remove the parameter and the codepath?

}

func (c *criService) cpuContainerStats(ID string, isSandbox bool, stats interface{}, timestamp time.Time) (*runtime.CpuUsage, error) {
func (c *criService) generatedCPUContainerStats(ID string, isSandbox bool, stats interface{}, timestamp time.Time) (*containerstorestats.ContainerCPUStats, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ID and isSandbox parameters appear unused. Can you remove them?

return &cs, nil
}

func (c *criService) generatedMemoryContainerStats(ID string, stats interface{}, timestamp time.Time) (*containerstorestats.ContainerMemoryStats, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ID parameter appears unused. Can you remove it?

@samuelkarp samuelkarp removed the status/needs-update Awaiting contributor update label Oct 21, 2022
@danielye11
Copy link
Author

@danielye11, thanks for opening this PR! I took a first pass at it and left a few comments, but also I have two more overall comments:

  1. I think I'm missing the caching and expiration/eviction behavior here. I do see where stats are stored (in the container.Store), but it looks like they're never updated and also never read.
  2. It would be helpful for me as a reviewer to break this up into at least two commits: one where the structural changes (types, refactoring) are done and a second with the functional changes (new data + caching behavior).

Thanks for review, will address comments and split up PR. Just some more context, but for this PR we don't specifically need more advanced caching logic. There's not a big regression for adding more structs in container store and it will help a lot for adding prometheus exporting for extra pod and container metrics (which will be put in container stats and sandbox stats store).

@github-actions
Copy link

This PR is stale because it has been open 90 days with no activity. This PR will be closed in 7 days unless new comments are made or the stale label is removed.

@github-actions github-actions bot added the Stale label Aug 27, 2023
@github-actions
Copy link

github-actions bot commented Sep 3, 2023

This PR was closed because it has been stalled for 7 days with no activity.

@github-actions github-actions bot closed this Sep 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/cri Container Runtime Interface (CRI) kind/enhancement needs-ok-to-test Stale status/more-info-needed Awaiting contributor information

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

4 participants