Skip to content

Inject tracking data#2110

Merged
karriebear merged 11 commits intostreamlit:developfrom
jrhone:inject_tracking_data
Oct 8, 2020
Merged

Inject tracking data#2110
karriebear merged 11 commits intostreamlit:developfrom
jrhone:inject_tracking_data

Conversation

@jrhone
Copy link
Copy Markdown
Contributor

@jrhone jrhone commented Oct 7, 2020

What

  • We need to inject s4a data into user apps so it gets sent along as tracking data to Segment.
  • Users should be prevented from injecting this data and messing up our metrics.
  • Data needs to be injected early in the app lifecycle, before any tracking data is sent.

How

  • Set a global variable on the window object from s4a, and then the core application will be able to do something like window.parent.trackingData

Data to inject:

  • hosted: s4a/s4t/localhost
  • user id: users database id in s4a, hashed for security reasons
  • github owner
  • github repo
  • github branch
  • github path

Note

  • I snuck in a change to add the installation id to the event tracking

Copy link
Copy Markdown
Contributor

@akrolsmir akrolsmir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some parts that I was personally confused about:

We need to inject s4a data into user apps so it gets sent along as tracking data to Segment.

Curious, why can't the s4a outer wrapper send this data directly to segment?

hosted: s4a/s4t/localhost

How can "localhost" be chosen if streamlitTracking is only exposed on S4A/S4T?

users database id in s4a, hashed for security reasons

Note that hashing isn't an infallible way of obscuring data (but I'm expecting that this is just one of many security measures.) What happens if the user database id is exposed?

// Use the tracking data injected by S4A if the app is hosted there
private getHostTrackingData(): Record<string, unknown> {
if (
window.location.origin.toLowerCase().endsWith(".streamlit.io") &&
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a more specific prefix for S4A that we want to use? (share.streamlit.io, s4a.streamlit.io)?

I wonder if we'd want to someday e.g. host apps on docs.streamlit.io but not want to include those in metrics; but we can also just cross that bridge when we get there.

Copy link
Copy Markdown
Contributor Author

@jrhone jrhone Oct 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will double check the specificity of the prefix..

Regarding the docs.streamlit.io example tho.. if it's just a page that has some user apps in iFrames, it would need to set a value for window.streamlitTracking so the embedded user apps tracked it. So by doing nothing, nothing's tracked.

If I didn't get your example right let me know.

@jrhone
Copy link
Copy Markdown
Contributor Author

jrhone commented Oct 7, 2020

Curious, why can't the s4a outer wrapper send this data directly to segment?

The S4A wrapper will have its own separate tracking as well (mainly for things happening in the dashboard), but we also want to tag core's data with some S4A attributes.

@jrhone
Copy link
Copy Markdown
Contributor Author

jrhone commented Oct 7, 2020

How can "localhost" be chosen if streamlitTracking is only exposed on S4A/S4T?

That's a great point. I haven't given too much thought yet to how we'll add the hosted:localhost tag but I think it's out of scope with regards to this PR so will punt for now.

Technically though it shouldn't need streamlitTracking or depend on being in an iFrame.

@jrhone
Copy link
Copy Markdown
Contributor Author

jrhone commented Oct 7, 2020

Note that hashing isn't an infallible way of obscuring data (but I'm expecting that this is just one of many security measures.) What happens if the user database id is exposed?

Well, one case I can think of... if we have any APIs that accept a user_id a malicious user could try to access the api with this information as input, but we should be enforcing proper access controls on our API endpoints.

I suppose if we used auto incrementing IDs then we could be exposing a range of IDs to 3rd parties so they could try to harvest data from multiple users, given the ID of one.

But yea, just adding an additional security measure.

@jrhone jrhone marked this pull request as ready for review October 8, 2020 14:55
@jrhone jrhone requested review from a team October 8, 2020 14:55
expect(mm.identify.mock.calls[0][1]).toMatchObject({
authoremail: SessionInfo.current.authorEmail,
machineIdV1: SessionInfo.current.installationIdV1,
machineIdV2: SessionInfo.current.installationIdV2,
Copy link
Copy Markdown
Contributor

@karriebear karriebear Oct 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like machineIdV1 and machineIdV2 should still belong in this object? It doesn't look like there's any conditional statements that would now remove these? Unless you're just covering these with the additional tests down below?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea just moved the check for these into the tests below to make it a bit cleaner.

declare global {
interface Window {
streamlitDebug: any
streamlitTracking: Record<string, unknown>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you rename this to something clearer? Like streamlitShareMetadata or something?

Because "Streamlit tracking" could mean all sorts of things...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to streamlitShareMetadata

@karriebear karriebear merged commit 6a1a1d0 into streamlit:develop Oct 8, 2020
private static getHostTrackingData(): Record<string, unknown> {
if (isInChildFrame() && window.parent.streamlitShareMetadata) {
return pick(window.parent.streamlitShareMetadata, [
"hosted",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: can we rename this to "hostType" or "hostedAt" ?

Also, this can take values "s4a", "s4t", "localhost", right? But what if the app is in, say, Heroku? Should we have "other" as a valid string too? (Or replace "localhost" with "other" if Heroku shows as localhost today?)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, renaming to hostedAt

Currently we are only assigning S4A as a value to this and there are no restrictions on the list of values we can add.

So we can add other and localhost later on when we think of a way to detect and include this data.

"repo",
"branch",
"mainModule",
"creatorId",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checking: on the metrics side we also have access to the app's URL, right? Or is that something we need to add to streamlitShareMetadata?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In BigQuery we have the app's URL

Screen Shot 2020-10-08 at 3 29 41 PM

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Screen Shot 2020-10-08 at 3 30 55 PM

tconkling added a commit to tconkling/streamlit that referenced this pull request Oct 12, 2020
# By karrie (7) and others
# Via GitHub
* develop:
  Removing cache option from main menu if s4a (streamlit#2149)
  Fix empty deploy page (streamlit#2148)
  Don't wait for unit tests before starting Cypress (streamlit#2142)
  Fix formatting of st.file_uploader docstring (streamlit#2141)
  Fix broken link in 0.68 changelog (streamlit#2144)
  Fix useEffect warning (streamlit#2137)
  Add global GTM container (streamlit#2128)
  Allow Streamlit server to handle Range Requests (streamlit#1967)
  rename hosted to hostedAt (streamlit#2132)
  Update change log
  Update notices
  Up version to 0.68.0
  Rename hosted to hostedAt in tracking data (streamlit#2132)
  Inject tracking data (streamlit#2110)
  [Feature Branch] File uploader (streamlit#2130)
  links for docs (streamlit#2129)
  Upgrade ProtobufJS and fix build script (streamlit#2118)
  Refresh landing page (streamlit#2116)
  Improve docstrings + tutorials for Layout (streamlit#2117)
  Better 'streamlit run' error message when no extension provided (streamlit#2115)

# Conflicts:
#	frontend/src/components/elements/Video/Video.tsx
#	frontend/src/components/widgets/FileUploader/FileUploader.test.tsx
#	frontend/src/components/widgets/FileUploader/FileUploader.tsx
#	frontend/src/lib/utils.ts
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants