feat!: Reorganize pipelines and infra files into their respective folders #292
Conversation
**leahecole** left a comment:
Non-blocking nits. Otherwise, LGTM
```diff
- - GCS bucket to store final, downstream, customer-facing data
- - Sometimes, for very large datasets, you might need to provision a [Dataflow](https://cloud.google.com/dataflow/docs) job
+ - GCS bucket to store downstream data, such as those linked to in the [Datasets Marketplace](https://console.cloud.google.com/marketplace/browse?filter=solution-type:dataset).
+ - Sometimes, for very large datasets that requires processing to be parallelized, you might need to provision a [Dataflow](https://cloud.google.com/dataflow/docs) (i.e. Apache Beam) job
```
nit
```diff
- - Sometimes, for very large datasets that requires processing to be parallelized, you might need to provision a [Dataflow](https://cloud.google.com/dataflow/docs) (i.e. Apache Beam) job
+ - Sometimes, for very large datasets that requires processing to be parallelized, you might need to provision a [Dataflow](https://cloud.google.com/dataflow/docs) (Apache Beam) job
```
```diff
  # Environment Setup

- We use Pipenv to make environment setup more deterministic and uniform across different machines.
+ We use Pipenv to make environment setup more deterministic and uniform across different machines. If you haven't done so, install Pipenv using the instructions found [here](https://pipenv-fork.readthedocs.io/en/latest/install.html#installing-pipenv).
```
nit - more screenreader-friendly hyperlink

```diff
- We use Pipenv to make environment setup more deterministic and uniform across different machines. If you haven't done so, install Pipenv using the instructions found [here](https://pipenv-fork.readthedocs.io/en/latest/install.html#installing-pipenv).
+ We use Pipenv to make environment setup more deterministic and uniform across different machines. If you haven't done so, install Pipenv using these [instructions](https://pipenv-fork.readthedocs.io/en/latest/install.html#installing-pipenv).
```
```diff
  In addition, the command above creates a "dot env" directory in the project root. The directory name is the value you set for `--env`. If it's not set, the value defaults to `dev` which generates the `.dev` folder.

- Consider this "dot" directory as your own dedicated space for prototyping. The files and variables created in that directory will use an isolated environment. All such directories are gitignored.
+ Consider a dot directory as your own sandbox, specific to your machine, that's mainly used for prototyping. As will be seen later, this directory is where you will set the variables specific to your environment: such as actual GCS bucket names, GCR repository URLs, and secrets (we recommend using [Secret Manager](https://cloud.google.com/composer/docs/secret-manager) for this). The files and variables created or copied in the dot directories are isolated from the main repo, i.e. all dot directories are gitignored.
```
small nits

```diff
- Consider a dot directory as your own sandbox, specific to your machine, that's mainly used for prototyping. As will be seen later, this directory is where you will set the variables specific to your environment: such as actual GCS bucket names, GCR repository URLs, and secrets (we recommend using [Secret Manager](https://cloud.google.com/composer/docs/secret-manager) for this). The files and variables created or copied in the dot directories are isolated from the main repo, i.e. all dot directories are gitignored.
+ We strongly recommend using a dot directory as your own sandbox, specific to your machine, that's mainly used for prototyping. This directory is where you will set the variables specific to your environment: such as actual GCS bucket names, GCR repository URLs, and secrets (we recommend using [Secret Manager](https://cloud.google.com/composer/docs/secret-manager) for this). The files and variables created or copied in the dot directories are isolated from the main repo, meaning that all dot directories are gitignored.
```
Description
This PR sets the stage for adding more categories (subfolders) in every `datasets/$DATASET` folder, such as the upcoming `docs` folder for the datasets' documentation set. To do this, we need to change how the files and folders under every dataset are organized.

From the current layout

to introducing two levels, `infra` and `pipelines`:

which allows us to also add a `docs` folder. When the docset feature is ready, the hierarchy will look like

and we can keep adding other domain-specific folders as necessary without affecting infra- or pipelines-related files.
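The directory trees in the original description were images that did not survive extraction. As a rough sketch of the hierarchy being described (the individual file names shown here are assumptions for illustration, not the PR's actual file listing), the two new levels plus the upcoming `docs` folder would give something like:

```
datasets/$DATASET/
├── infra/        # infrastructure provisioning files (e.g. Terraform configs)
│   └── ...
├── pipelines/    # pipeline definitions (e.g. Airflow DAGs and their configs)
│   └── ...
└── docs/         # upcoming: the dataset's documentation set
    └── ...
```

Each top-level subfolder scopes one concern, so adding a new domain-specific folder later touches nothing under `infra` or `pipelines`.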
Checklist
Note: If an item applies to you, all of its sub-items must be fulfilled
- … the `README` accordingly
- … `datasets/<DATASET_NAME>` and nothing outside of that directory
- … (`tests` folder)