
Conversation

adlersantos (Member) commented on Feb 10, 2022

Description

This PR sets the stage for adding more categories (subfolders) in every datasets/$DATASET folder, such as the upcoming docs folder for the datasets' documentation set.

To do this, we need to change how the files and folders under every dataset are organized.

From

datasets/$DATASET/_terraform/
datasets/$DATASET/_images/
datasets/$DATASET/$PIPELINE_1/
datasets/$DATASET/$PIPELINE_2/

to a structure that introduces two levels, infra and pipelines:

datasets/$DATASET/infra/
datasets/$DATASET/pipelines/_images/
datasets/$DATASET/pipelines/$PIPELINE_1/
datasets/$DATASET/pipelines/$PIPELINE_2/

which also allows us to add a docs folder. When the docset feature is ready, the hierarchy will look like:

datasets/$DATASET/infra/
datasets/$DATASET/pipelines/
datasets/$DATASET/docs/

and we can keep adding other domain-specific folders as necessary without affecting infra- or pipelines-related files.
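For illustration only, here is a minimal Python sketch of the move described above. This is not the actual migration performed in this PR; `migrate_dataset` is a hypothetical helper, and the sketch assumes every subfolder other than `_terraform` belongs under `pipelines/`:

```python
from pathlib import Path

def migrate_dataset(dataset_dir: Path) -> None:
    """Reshape one datasets/$DATASET folder into infra/ and pipelines/."""
    pipelines = dataset_dir / "pipelines"
    pipelines.mkdir(exist_ok=True)

    # _terraform/ becomes the top-level infra/ folder.
    terraform = dataset_dir / "_terraform"
    if terraform.is_dir():
        terraform.rename(dataset_dir / "infra")

    # Everything else (_images/ and each $PIPELINE folder) moves under pipelines/.
    for child in list(dataset_dir.iterdir()):
        if child.is_dir() and child.name not in ("infra", "pipelines"):
            child.rename(pipelines / child.name)

if __name__ == "__main__":
    for dataset in Path("datasets").iterdir():
        if dataset.is_dir():
            migrate_dataset(dataset)
```

In a real repository you would likely use `git mv` instead of plain renames so Git records the moves as renames and preserves file history.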

Checklist

Note: If an item applies to you, all of its sub-items must be fulfilled

  • (Required) This pull request is appropriately labeled
  • Please merge this pull request after it's approved
  • I'm adding or editing a feature
    • I have updated the README accordingly
    • I have added tests for the feature
  • I'm adding or editing a dataset
    • The Google Cloud Datasets team is aware of the proposed dataset
    • I put all my code inside datasets/<DATASET_NAME> and nothing outside of that directory
  • I'm adding/editing documentation
  • I'm submitting a bugfix
    • I have added tests to my bugfix (see the tests folder)
  • I'm refactoring or cleaning up some code

adlersantos added the revision: readme (Improvements or additions to the README), feature request (New feature or request), and cleanup (Cleanup or refactor code) labels on Feb 10, 2022
adlersantos changed the title from "!feat: Reorganize pipelines and infra files into their respective folders" to "feat!: Reorganize pipelines and infra files into their respective folders" on Feb 10, 2022
leahecole (Contributor) left a comment:


Non-blocking nits. Otherwise, LGTM.

- GCS bucket to store final, downstream, customer-facing data
- Sometimes, for very large datasets, you might need to provision a [Dataflow](https://cloud.google.com/dataflow/docs) job
- GCS bucket to store downstream data, such as those linked to in the [Datasets Marketplace](https://console.cloud.google.com/marketplace/browse?filter=solution-type:dataset).
- Sometimes, for very large datasets that require processing to be parallelized, you might need to provision a [Dataflow](https://cloud.google.com/dataflow/docs) (i.e. Apache Beam) job

nit

Suggested change
- Sometimes, for very large datasets that require processing to be parallelized, you might need to provision a [Dataflow](https://cloud.google.com/dataflow/docs) (i.e. Apache Beam) job
- Sometimes, for very large datasets that require processing to be parallelized, you might need to provision a [Dataflow](https://cloud.google.com/dataflow/docs) (Apache Beam) job

# Environment Setup

We use Pipenv to make environment setup more deterministic and uniform across different machines.
We use Pipenv to make environment setup more deterministic and uniform across different machines. If you haven't done so, install Pipenv using the instructions found [here](https://pipenv-fork.readthedocs.io/en/latest/install.html#installing-pipenv).

Suggested change
We use Pipenv to make environment setup more deterministic and uniform across different machines. If you haven't done so, install Pipenv using the instructions found [here](https://pipenv-fork.readthedocs.io/en/latest/install.html#installing-pipenv).
We use Pipenv to make environment setup more deterministic and uniform across different machines. If you haven't done so, install Pipenv using these [instructions](https://pipenv-fork.readthedocs.io/en/latest/install.html#installing-pipenv).

nit: a more screen-reader-friendly hyperlink

In addition, the command above creates a "dot env" directory in the project root. The directory name is the value you set for `--env`. If it's not set, the value defaults to `dev` which generates the `.dev` folder.

Consider this "dot" directory as your own dedicated space for prototyping. The files and variables created in that directory will use an isolated environment. All such directories are gitignored.
Consider a dot directory as your own sandbox, specific to your machine, that's mainly used for prototyping. As will be seen later, this directory is where you will set the variables specific to your environment: such as actual GCS bucket names, GCR repository URLs, and secrets (we recommend using [Secret Manager](https://cloud.google.com/composer/docs/secret-manager) for this). The files and variables created or copied in the dot directories are isolated from the main repo, i.e. all dot directories are gitignored.

small nits

Suggested change
Consider a dot directory as your own sandbox, specific to your machine, that's mainly used for prototyping. As will be seen later, this directory is where you will set the variables specific to your environment: such as actual GCS bucket names, GCR repository URLs, and secrets (we recommend using [Secret Manager](https://cloud.google.com/composer/docs/secret-manager) for this). The files and variables created or copied in the dot directories are isolated from the main repo, i.e. all dot directories are gitignored.
We strongly recommend using a dot directory as your own sandbox, specific to your machine, that's mainly used for prototyping. This directory is where you will set the variables specific to your environment: such as actual GCS bucket names, GCR repository URLs, and secrets (we recommend using [Secret Manager](https://cloud.google.com/composer/docs/secret-manager) for this). The files and variables created or copied in the dot directories are isolated from the main repo, meaning that all dot directories are gitignored.
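To make the `--env` behavior quoted above concrete, here is a rough sketch of the assumed logic. It is not the repo's actual code; `resolve_env_dir` is a hypothetical name:

```python
from pathlib import Path

def resolve_env_dir(env: str = "dev", root: Path = Path(".")) -> Path:
    """Create and return the gitignored "dot" directory for a given --env value.

    The directory name is a dot followed by the --env value, so the
    default of "dev" yields a `.dev` folder in the project root.
    """
    env_dir = root / f".{env}"
    env_dir.mkdir(exist_ok=True)
    return env_dir

print(resolve_env_dir())           # .dev
print(resolve_env_dir("staging"))  # .staging
```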

happyhuman merged commit 7408d44 into main on Feb 10, 2022
happyhuman deleted the new-dataset-subfolders branch on Feb 10, 2022 at 22:15
adlersantos mentioned this pull request on Feb 11, 2022