Skip to content

Add multi tenant query federation#3250

Merged
pracucci merged 15 commits intocortexproject:masterfrom
simonswine:crosstenant
Dec 23, 2020
Merged

Add multi tenant query federation#3250
pracucci merged 15 commits intocortexproject:masterfrom
simonswine:crosstenant

Conversation

@simonswine
Copy link
Copy Markdown
Contributor

@simonswine simonswine commented Sep 29, 2020

What this PR does:

  • This intends to allow cross tenant querying. The implemenation follows this proposal

Which issue(s) this PR fixes:

Fixes #923

Checklist/TODO

  • Merge Add tenant resolver #3486 first
  • Provide a design document with the details of the the form of the implementation (esp. what this would mean for configured tenant limits and metrics exposed about queries)
  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@simonswine simonswine force-pushed the crosstenant branch 3 times, most recently from fc84860 to 023f1f1 Compare October 23, 2020 11:29
@simonswine simonswine force-pushed the crosstenant branch 2 times, most recently from 1d43854 to 3ceeb5e Compare October 26, 2020 16:07
@simonswine simonswine changed the title Multi tenant queries proof-of-concept WIP: Multi tenant query federation Nov 13, 2020
@simonswine simonswine force-pushed the crosstenant branch 2 times, most recently from 2a41025 to d815857 Compare November 26, 2020 18:06
Comment on lines 23 to 28
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not too sure if I can cacheGenNumber like that

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's likely it will lead to collisions. However, it will increase the number of characters that prefix the Memcached item key string. This means that for multi-tenant queries across enough tenants caching could break due to the item key being too long. The existing limit is not very high at 150 characters. Multi-tenant queries will already have larger cache item keys since it will contain the concatenation of multiple tenant IDs. The cachegen number will exacerbate this issue further.

There are two options for dealing with this:

  1. Ignore the cache generation for multi-tenant queries in this PR. It's not a commonly used feature and IMO it would be fine to ignore it for this PR. It

  2. Consolidate the cachegen number into a small, consistent, unique string. One option off the top of my head is to convert the strings to integers and adding them together then use a string of the resulting value. The

My preference would be option 1 since it would reduce the scope of the PR a bit more and make it a bit easier to review. Cache generation handling can be added in a future PR.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have deferred that to a future PR as suggested and added a TODO comment

@simonswine simonswine changed the title WIP: Multi tenant query federation Add multi tenant query federation Dec 1, 2020
Copy link
Copy Markdown
Contributor

@jtlisi jtlisi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, I added a few questions and nits. I want to go through it one more time before I add an approval. However, I didn't any major issues in my first pass.

@simonswine simonswine marked this pull request as ready for review December 3, 2020 18:20
Copy link
Copy Markdown
Contributor

@jtlisi jtlisi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Most of the nits/suggestions are very minor except for the cachegen topic. Once that is sorted I'll approve and see if we can get a second reviewer on this.

Comment on lines 23 to 28
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's likely it will lead to collisions. However, it will increase the number of characters that prefix the Memcached item key string. This means that for multi-tenant queries across enough tenants caching could break due to the item key being too long. The existing limit is not very high at 150 characters. Multi-tenant queries will already have larger cache item keys since it will contain the concatenation of multiple tenant IDs. The cachegen number will exacerbate this issue further.

There are two options for dealing with this:

  1. Ignore the cache generation for multi-tenant queries in this PR. It's not a commonly used feature and IMO it would be fine to ignore it for this PR. It

  2. Consolidate the cachegen number into a small, consistent, unique string. One option off the top of my head is to convert the strings to integers and adding them together then use a string of the resulting value. The

My preference would be option 1 since it would reduce the scope of the PR a bit more and make it a bit easier to review. Cache generation handling can be added in a future PR.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: We should be able to enable this feature for every e2e test and not have any failures. Doing so could provide a good smoke test for the feature.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jtlisi Not entirely sure if you meant to do this locally once or do you want to have that as part of the integration test run by CI?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still a TODO, will look into this tomorrow

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to commit any changes here. Just smoke testing that the tests pass if you enable it for globally on your local should be enough.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have done that today and it didn't not come up with any unexpected failure (apart from the Backward_compatibility tests, which is expected as the feature flag is not recognised by them)

Copy link
Copy Markdown
Contributor

@csmarchbanks csmarchbanks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

Do we want to have an entirely new package for tenantfederation? It seems like all that is there is a new type of queryable which could also just be added to the querier package. Similarly, should the configuration be a new top level, or nested under querier? One thing that would help me decide is if there is more cross cutting flags that will be needed? Right now the flag is only used in querier related code.

This experimental feature allows queries to cover data from more than a
single Cortex tenant and can be activated by supplying
`-tenant-federation.enabled` to all cortex instances.

To query multiple tenants a `|` separated tenant list can be specified
in the `X-Scope-OrgID` header. The source tenant of a metric will be
exposed in the label `__tenant_id__`.

Signed-off-by: Christian Simon <[email protected]>
This ensures the limit is aggregated correctly of the setting of each
individual tenant. It also implements the logic for the v2 query
frontend.

Signed-off-by: Christian Simon <[email protected]>
Signed-off-by: Christian Simon <[email protected]>
Regexp matcher need to be initialized, this adds a test for regexp
matcher and fixes the underlying issue.

Signed-off-by: Christian Simon <[email protected]>
Signed-off-by: Christian Simon <[email protected]>
To avoid any future regressions in the multi tenant support within the
query path, a faillint command tests if TenantID is used and if it
finds one, it suggestest using TenantIDs instead>

Signed-off-by: Christian Simon <[email protected]>
@simonswine
Copy link
Copy Markdown
Contributor Author

* `pkg/frontend/transport/handler.go` is still usig `tenant.TenantID`. How should it behave there?

* What do you think about adding a `faillint` rule to fail on `tenant.TenantID` used in `pkg/frontend/...`, `pkg/scheduler/...` and `pkg/querier/tenantfederation/...` to make sure we handle it everywhere (in the future)?

* What's about `tenant.TenantID`  usage in `pkg/querier/queryrange`? Have you checked where we should handle it? For example, I think `limitsMiddleware` should be tenants federation aware at least. This is another package where having a `faillint` may help to prevent future regressions.

@pracucci I think I have addressed all of the remaining points (I think 🙂 ).

Esp those ones had the biggest changes, so certainly are in need for another review:

  • The result cache gets skipped once there are more than one tenant b41a4df
  • limitsMiddleware is now using the "strictest" of the tenant's limits 0667242
  • Fail lint is checking the packages in the query path: e312687

@pracucci
Copy link
Copy Markdown
Contributor

pracucci commented Dec 23, 2020

Thanks a lot @simonswine for your hard work and patiently address all comments! I agree on skipping the query results caching in this PR and address it in a follow up PR (given we have to design how to deal with cache invalidation done for deleted series).

This PR looks in a good shape, considering it's an experimental feature. We can merge this PR, so that we begin testing it, and in the meanwhile you can work on follow up work (eg. query results caching).

@pracucci pracucci merged commit 6614def into cortexproject:master Dec 23, 2020
@bhiravabhatla
Copy link
Copy Markdown

Is this released as part of 1.6. Could not find this in release notes. Its already present in configuration documentation though.

@pstibrany
Copy link
Copy Markdown
Contributor

pstibrany commented Jan 20, 2021

Is this released as part of 1.6. Could not find this in release notes. Its already present in configuration documentation though.

It will be part of Cortex 1.7.

Documentation on the website currently reflects master branch, with latest (unreleased) changes.

@cabrinha
Copy link
Copy Markdown

cabrinha commented Feb 11, 2021

In order to add Cortex as a datasource for grafana, when using auth and tenants... Whats the config for the datasource need to look like in order to see every tenant in one datasource?

@simonswine
Copy link
Copy Markdown
Contributor Author

In order to add Cortex as a datasource for grafana, when using auth and tenants... Whats the config for the datasource need to look like in order to see every tenant in one datasource?

@cabrinha unfortunately there is no way to specify this currently. You need to list the tenants explicitly (separated by a |). If you care about this enough it would be good to create a new issue with a feature request and detailing your use case a bit more.

simonswine added a commit to simonswine/dskit that referenced this pull request Feb 21, 2022
* Add tenant query federation

This experimental feature allows queries to cover data from more than a
single Cortex tenant and can be activated by supplying
`-tenant-federation.enabled` to all cortex instances.

To query multiple tenants a `|` separated tenant list can be specified
in the `X-Scope-OrgID` header. The source tenant of a metric will be
exposed in the label `__tenant_id__`.

Signed-off-by: Christian Simon <[email protected]>

* Aggregate the limit of maxQueriers correctly

This ensures the limit is aggregated correctly of the setting of each
individual tenant. It also implements the logic for the v2 query
frontend.

Signed-off-by: Christian Simon <[email protected]>

* Fix tenant labels and make LabelNames more efficient

Signed-off-by: Christian Simon <[email protected]>

* Use tsdb_errors for error handling

Signed-off-by: Christian Simon <[email protected]>

* Fix uninitialized label matcher

Regexp matcher need to be initialized, this adds a test for regexp
matcher and fixes the underlying issue.

Signed-off-by: Christian Simon <[email protected]>

* Use map for filterValuesByMatchers to reduce allocations

Signed-off-by: Christian Simon <[email protected]>

* Address review suggestions

Signed-off-by: Christian Simon <[email protected]>

* Add validation.SmallestPositiveNonZeroIntPerTenant to avoid code duplication

Signed-off-by: Christian Simon <[email protected]>

* Add limitations and experimental status to docs

Signed-off-by: Christian Simon <[email protected]>

* Remove unnecessary cast of defaultTenantLabel

Signed-off-by: Christian Simon <[email protected]>

* Handle query-range limits for multi tenant queries

Signed-off-by: Christian Simon <[email protected]>

* Skip results cache, if multi tenants query

Signed-off-by: Christian Simon <[email protected]>

* Add failint to ensure query path supports multiple tenants

To avoid any future regressions in the multi tenant support within the
query path, a faillint command tests if TenantID is used and if it
finds one, it suggestest using TenantIDs instead>

Signed-off-by: Christian Simon <[email protected]>

* Align CHANGELOG line with the flag description

Signed-off-by: Christian Simon <[email protected]>

* Add a limitation about missing results cache

Signed-off-by: Christian Simon <[email protected]>
simonswine added a commit to grafana/dskit that referenced this pull request Mar 22, 2022
* Add tenant query federation

This experimental feature allows queries to cover data from more than a
single Cortex tenant and can be activated by supplying
`-tenant-federation.enabled` to all cortex instances.

To query multiple tenants a `|` separated tenant list can be specified
in the `X-Scope-OrgID` header. The source tenant of a metric will be
exposed in the label `__tenant_id__`.

Signed-off-by: Christian Simon <[email protected]>

* Aggregate the limit of maxQueriers correctly

This ensures the limit is aggregated correctly of the setting of each
individual tenant. It also implements the logic for the v2 query
frontend.

Signed-off-by: Christian Simon <[email protected]>

* Fix tenant labels and make LabelNames more efficient

Signed-off-by: Christian Simon <[email protected]>

* Use tsdb_errors for error handling

Signed-off-by: Christian Simon <[email protected]>

* Fix uninitialized label matcher

Regexp matcher need to be initialized, this adds a test for regexp
matcher and fixes the underlying issue.

Signed-off-by: Christian Simon <[email protected]>

* Use map for filterValuesByMatchers to reduce allocations

Signed-off-by: Christian Simon <[email protected]>

* Address review suggestions

Signed-off-by: Christian Simon <[email protected]>

* Add validation.SmallestPositiveNonZeroIntPerTenant to avoid code duplication

Signed-off-by: Christian Simon <[email protected]>

* Add limitations and experimental status to docs

Signed-off-by: Christian Simon <[email protected]>

* Remove unnecessary cast of defaultTenantLabel

Signed-off-by: Christian Simon <[email protected]>

* Handle query-range limits for multi tenant queries

Signed-off-by: Christian Simon <[email protected]>

* Skip results cache, if multi tenants query

Signed-off-by: Christian Simon <[email protected]>

* Add failint to ensure query path supports multiple tenants

To avoid any future regressions in the multi tenant support within the
query path, a faillint command tests if TenantID is used and if it
finds one, it suggestest using TenantIDs instead>

Signed-off-by: Christian Simon <[email protected]>

* Align CHANGELOG line with the flag description

Signed-off-by: Christian Simon <[email protected]>

* Add a limitation about missing results cache

Signed-off-by: Christian Simon <[email protected]>
simonswine added a commit to grafana/dskit that referenced this pull request Mar 29, 2022
* Add tenant resolver (cortexproject/cortex#3486)

* Add tenant resolver package

This implements the multi tenant resolver as described by the [proposal]
for multi tenant query-federation.

By default it behaves like before, but it's implementation can be
swapped out.

[proposal]: cortexproject/cortex#3364

Signed-off-by: Christian Simon <[email protected]>

* Replace usages of `ExtractOrgID`

Use TenantID or UserID depending on which of the methods are meant to be
used.

Signed-off-by: Christian Simon <[email protected]>

* Replace usages of `ExtractOrgIDFromHTTPRequest`

This is replaced by ExtractTenantIDFromHTTPRequest, which makes sure
that exactly one tenant ID is set.

Signed-off-by: Christian Simon <[email protected]>

* Add methods to `tenant` package to use resolver directly

Signed-off-by: Christian Simon <[email protected]>

* Remove UserID method from Resolver interface

We need a better definition for what we are trying to achieve with
UserID before we can add it to the interface

Signed-off-by: Christian Simon <[email protected]>

* Update comment on the TenantID/TenantIDs

Signed-off-by: Christian Simon <[email protected]>

* Improve performance of NormalizeTenantIDs

- reduce allocations by reusing the input slice during de-duplication

Signed-off-by: Christian Simon <[email protected]>

* Add multi tenant query federation (cortexproject/cortex#3250)

* Add tenant query federation

This experimental feature allows queries to cover data from more than a
single Cortex tenant and can be activated by supplying
`-tenant-federation.enabled` to all cortex instances.

To query multiple tenants a `|` separated tenant list can be specified
in the `X-Scope-OrgID` header. The source tenant of a metric will be
exposed in the label `__tenant_id__`.

Signed-off-by: Christian Simon <[email protected]>

* Aggregate the limit of maxQueriers correctly

This ensures the limit is aggregated correctly of the setting of each
individual tenant. It also implements the logic for the v2 query
frontend.

Signed-off-by: Christian Simon <[email protected]>

* Fix tenant labels and make LabelNames more efficient

Signed-off-by: Christian Simon <[email protected]>

* Use tsdb_errors for error handling

Signed-off-by: Christian Simon <[email protected]>

* Fix uninitialized label matcher

Regexp matcher need to be initialized, this adds a test for regexp
matcher and fixes the underlying issue.

Signed-off-by: Christian Simon <[email protected]>

* Use map for filterValuesByMatchers to reduce allocations

Signed-off-by: Christian Simon <[email protected]>

* Address review suggestions

Signed-off-by: Christian Simon <[email protected]>

* Add validation.SmallestPositiveNonZeroIntPerTenant to avoid code duplication

Signed-off-by: Christian Simon <[email protected]>

* Add limitations and experimental status to docs

Signed-off-by: Christian Simon <[email protected]>

* Remove unnecessary cast of defaultTenantLabel

Signed-off-by: Christian Simon <[email protected]>

* Handle query-range limits for multi tenant queries

Signed-off-by: Christian Simon <[email protected]>

* Skip results cache, if multi tenants query

Signed-off-by: Christian Simon <[email protected]>

* Add failint to ensure query path supports multiple tenants

To avoid any future regressions in the multi tenant support within the
query path, a faillint command tests if TenantID is used and if it
finds one, it suggestest using TenantIDs instead>

Signed-off-by: Christian Simon <[email protected]>

* Align CHANGELOG line with the flag description

Signed-off-by: Christian Simon <[email protected]>

* Add a limitation about missing results cache

Signed-off-by: Christian Simon <[email protected]>

* Restrict path segments in TenantIDs (cortexproject/cortex#4375) (cortexproject/cortex#4376)

This prevents paths generated from TenantIDs to become vulnerable to
path traversal attacks. CVE-2021-36157

Signed-off-by: Christian Simon <[email protected]>

* Update nolint statement

Co-authored-by: Bryan Boreham <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Query across multiple users / tenants / instances

7 participants