M1 Data Engineering Tasks and Components
In this module, you first learn about the role of a data engineer.
Second, we cover the differences between a data source and a data sink.
Then, we review the different types of data formats that a data engineer will
encounter.
The next topic addresses the options for storing data on Google Cloud.
Finally, we look at the features of Analytics Hub that allow you to easily share
datasets both within and outside your organization.
Get the data to where it can be useful: raw data ingestion and storage
What does a data engineer do? At a basic level, a data engineer builds data pipelines.
Why does the data engineer build data pipelines? Because they want to get their data
into a place, such as a dashboard, report, or machine learning model, from which
the business can make data-driven decisions.
The data has to be in a usable condition so that someone can use it to make
decisions. Many times, the raw data is, by itself, not very useful.
Once data becomes useful, the data engineer will often apply updates or
transformations to add new value to the data.
Finally, data engineers create processes and operations to move data usage into
production settings.
In the most basic sense, a data engineer moves data from data sources to data sinks
in four stages: replicate and migrate, ingest, transform, and store.
The four stages: replicate and migrate, ingest, transform, and store. Replicate and
migrate options include Storage Transfer Service, Transfer Appliance, change data
capture, and scheduling capabilities.
The replicate and migrate stage of a data pipeline focuses on the tools and options to
bring data from external or internal systems into Google Cloud for further refinement.
There are a wide variety of tools and options at your disposal. They will be covered in
more detail throughout this course.
The ingest stage of a data pipeline is the point where data becomes a data source
and is available for usage downstream.
Think of a data source as the starting point of your data journey. It is raw,
unprocessed data waiting to be transformed into valuable insights. Any system,
application, or platform that creates, stores, or shares data can be considered a data
source.
Two examples of Google Cloud products used in the ingest phase are Cloud Storage,
a data lake holding various types of data sources, and Pub/Sub, an asynchronous
messaging system delivering data from external systems.
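As a minimal sketch of this handoff, assuming a hypothetical bucket, dataset, and
table, files landing in a Cloud Storage data lake can be loaded into BigQuery with a
LOAD DATA statement:

-- Sketch only: ingest CSV files from a Cloud Storage data lake into BigQuery.
-- The bucket, dataset, and table names are placeholders.
LOAD DATA INTO my_dataset.raw_events
FROM FILES (
  format = 'CSV',
  skip_leading_rows = 1,
  uris = ['gs://my-ingest-bucket/events/*.csv']
);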
The transform stage of a data pipeline represents action taken on a data source to
adjust, modify, join, or customize a data source so that it matches a specific
downstream data or reporting requirement.
You explore each of these patterns in their own modules later in the course.
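As one illustrative sketch of such a transformation, assuming a hypothetical raw
orders table, a BigQuery SQL statement can reshape raw data into a reporting table:

-- Sketch only: transform a raw orders table into a daily reporting table.
-- Dataset, table, and column names are placeholders.
CREATE OR REPLACE TABLE my_dataset.daily_sales AS
SELECT
  DATE(order_timestamp) AS order_date,
  SUM(order_amount) AS total_sales,
  COUNT(*) AS order_count
FROM my_dataset.raw_orders
GROUP BY order_date;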
The store stage of a data pipeline represents the last step, when we deposit data in
its final form.
A data sink is the final stop in the data journey. It's where processed and transformed
data is stored for future use, analysis, and decision-making. Think of it as the
reservoir at the end of the river, where valuable information is collected and readily
available.
Two examples of Google Cloud products used in the store phase are BigQuery, a
serverless data warehouse, and Bigtable, a highly scalable NoSQL database.
Unstructured data, such as documents, images, and audio files, is usually suited for
Cloud Storage, but BigQuery also offers the capability to store unstructured data via
object tables.
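A minimal sketch of creating such an object table, assuming a hypothetical Cloud
Storage bucket and an existing BigQuery connection:

-- Sketch only: expose unstructured files in Cloud Storage through a BigQuery object table.
-- The dataset, connection, and bucket names are placeholders.
CREATE EXTERNAL TABLE my_dataset.product_images
WITH CONNECTION `us.my-gcs-connection`
OPTIONS (
  object_metadata = 'SIMPLE',
  uris = ['gs://my-bucket/images/*']
);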
There is also structured data, which represents information stored in tables, rows, and
columns.
Storage classes by expected access: Standard for hot data, Nearline for data accessed
about once per month, Coldline for about once every 90 days, and Archive for about
once a year, such as compliance data.
There are several key products on Google Cloud that are used by data engineers.
One main product is Cloud Storage. Unstructured data is usually well-suited to be
stored in Cloud Storage.
Within Cloud Storage, objects are accessed by using HTTP requests, including
ranged GETs to retrieve portions of the data. The only key is the object name. There
is object metadata but the object itself is treated as unstructured bytes. The scale of
the system allows for serving large static content and accepting user-uploaded
content including videos, photos, and files. Objects can be up to 5 Terabytes each.
Cloud Storage is built for availability, durability, scalability, and consistency. It's an
ideal solution for hosting static websites and storing images, videos, objects and
blobs, and any unstructured data.
Cloud Storage has four primary storage classes: standard storage, nearline storage,
coldline storage, and archive storage. The classes are differentiated by the expected
frequency of object access.
Choosing storage for structured data: AlloyDB for PostgreSQL workloads that need high
scalability, BigQuery for SQL analytical workloads, and Bigtable for NoSQL workloads.
You have a full range of cost-effective storage services for structured data to choose
from when developing with Google Cloud. No one size fits all, and your choice of
storage and database solutions will depend on your application and workload.
Spanner is Google Cloud’s fully managed relational database service that offers both
strong consistency and horizontal scalability.
Firestore is a fast, fully managed, serverless, NoSQL document database built for
automatic scaling, high performance, and ease of application development.
Two key concepts in data engineering are the data lake and the data warehouse.
A data lake is a vast repository for storing raw, unprocessed data in various formats,
including unstructured, semi-structured, and structured. It serves as a centralized
storage solution for diverse data types, enabling flexible use cases like data science,
applications, and business decision-making.
BigQuery: built-in machine learning and geographic information system capabilities,
and integration with other storage services.
BigQuery is a great solution for online analytical processing, or OLAP, workloads for
big data exploration and processing. BigQuery is also well-suited for reporting with
business intelligence tools.
Three ways to access BigQuery: the web UI with its SQL editor, the bq command-line
tool, and the REST API (seven languages supported). For example, from the command
line:

> bq query --use_legacy_sql=false \
'# Get the count of comments by user for articles with "Google" in the title
SELECT
  contributor_username,
  COUNT(comment) AS comments
FROM
  `[Link]`
WHERE
  title LIKE "%Google%"
  AND contributor_username IS NOT NULL
GROUP BY 1
ORDER BY 2 DESC
LIMIT 100;'
The first is through the web UI with its SQL editor in the Google Cloud console.
The second is via the bq command-line tool, which is part of the Google Cloud SDK.
The last is via a robust REST API, which supports calls in seven programming
languages.
BigQuery resource hierarchy: projects (for example, Project X and Project Y) contain
datasets, and each dataset contains tables, views, ML models, and routines.
BigQuery organizes data tables into units called datasets. These datasets are scoped
to your Google Cloud project. When you reference a table from the command line, in
SQL queries, or in code, you refer to it by using the construct project.dataset.table.
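For example, a query can reference a table in one of the Google Cloud public datasets
by its fully qualified project.dataset.table name (the table here is only an
illustration):

-- Reference a table by project, dataset, and table name.
SELECT name, number
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE state = 'TX'
LIMIT 10;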
Access control is through IAM and is at the dataset, table, view, or column level. In
order to query data in a table or view, you need at least read permissions on the table
or view.
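One way to grant that access is with a BigQuery SQL DCL statement; a minimal sketch,
with a hypothetical project, dataset, and user:

-- Sketch only: grant read access on a dataset.
-- The project, dataset, and user are placeholders.
GRANT `roles/bigquery.dataViewer`
ON SCHEMA `my-project.my_dataset`
TO "user:analyst@example.com";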
Data governance in BigQuery with Dataplex: unified metadata, auto-discovery, data
lifecycle, data quality, data classification, data organization, and insights and
semantic search (some of these are future capabilities).
Metadata is a key element to making data more manageable and useful across an
organization. Dataplex is a comprehensive data management solution that allows you
to centrally discover, manage, monitor, and govern distributed data across your
organization.
With Dataplex, you can break down data silos, centralize security and governance
while enabling distributed ownership, and easily search and discover data based on
business context.
Dataplex also offers built-in data intelligence, support for open-source tools, and a
robust partner ecosystem, helping you to trust your data and accelerate time to
insights.
Dataplex lets you standardize and unify metadata, security policies, governance,
classification, and data lifecycle management across this distributed data.
Another common use case is when your data is accessible only to data engineers,
and is later refined and made available to data scientists and analysts. In this case,
you can set up a lake to have the following:
A raw zone for the data, which is accessed by data engineers and data scientists.
A curated zone for the refined data, which is accessed by data scientists and analysts.
Challenges of sharing BigQuery datasets directly: managing ETL processes and
pipelines, complex permissions, onboarding users in IAM, and no visibility into data
usage.
When sharing data, you need to consider security and permissions, destination options
for data pipelines, data freshness and accuracy, and finally, usage monitoring.
The Analytics Hub workflow: (1) a data provider publishes a shared BigQuery dataset
as a public or private listing in a data exchange; (2) users search Analytics Hub for
listings; (3) subscribing to a listing creates a linked dataset in the subscriber's
project; and (4) the linked dataset is queried with BigQuery.
Analytics Hub helps organizations unlock the value of data sharing, leading to new
insights and business value.
With Analytics Hub, you create a rich data ecosystem by publishing and subscribing
to analytics-ready datasets. Because data is shared in place, data providers are able
to control and monitor how their data is being used.
Analytics Hub provides a self-service way to access valuable and trusted data assets,
including data provided by Google.
Finally, Analytics Hub provides an opportunity to monetize data assets, and it removes
the task of building the infrastructure required for monetization.
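For reference, a linked dataset created by subscribing to a listing is queried like
any other dataset in your project; a minimal sketch with hypothetical names:

-- Sketch only: query a linked dataset created from an Analytics Hub subscription.
-- The project, linked dataset, and table names are placeholders.
SELECT *
FROM `my-project.linked_weather_data.daily_observations`
LIMIT 100;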
30 min
Learning objectives
● Load data into BigQuery from various sources.
● Load data into BigQuery using the CLI and the Google Cloud console.
● Use DDL to create tables.
In this lab, you practice loading data into BigQuery. The primary objective of this lab is
to load data into BigQuery using both the command-line interface and the Google
Cloud console. You also experience loading several datasets into BigQuery and using
the Data Definition Language, or DDL.
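As a minimal sketch of the kind of DDL exercised in the lab, the statement below
creates a table with an explicit schema; the dataset, table, and column names are
placeholders rather than the lab's actual ones:

-- Sketch only: create a table with DDL before loading data into it.
CREATE TABLE IF NOT EXISTS my_dataset.names_by_year (
  name STRING,
  gender STRING,
  count INT64,
  year INT64
);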