M1 Data Engineering Tasks and Components

The document outlines the essential tasks and components of data engineering, focusing on the role of a data engineer, data sources versus sinks, data formats, and storage solutions on Google Cloud. It details the stages of data pipelines including replication, ingestion, transformation, and storage, as well as tools and services available for each stage. Additionally, it discusses metadata management and sharing datasets using Analytics Hub, emphasizing the importance of data governance and management across distributed data.


01
Data engineering tasks and components

Data engineering tasks and components

01 The role of a data engineer

02 Data sources versus data sinks

03 Data formats

04 Storage solution options on Google Cloud

05 Metadata management options on Google Cloud

06 Share datasets using Analytics Hub

In this module, first, you learn about the role of a data engineer.

Second, we will cover the differences between a data source and a data sink.

Then, we will review different types of data formats that a data engineer will
encounter.

The next topic addresses the options for storing data on Google Cloud.

Then, we cover the choices available for metadata management.

Finally, we will look at the features of Analytics Hub that allow you to easily share
datasets both within and outside your organization.

Data engineering tasks and components

01 The role of a data engineer

02 Data sources versus data sinks

03 Data formats

04 Storage solution options on Google Cloud

05 Metadata management options on Google Cloud

06 Share datasets using Analytics Hub

Let’s start by discussing the role of a data engineer.



A data engineer builds data pipelines to enable data-driven decisions

Get the data to where it can be useful: raw data ingestion and storage
Get the data into a usable condition: data transformation
Add new value to the data: data provisioning and enrichment
Manage the data: security, privacy, discovery, governance
Productionize data processes: pipeline monitoring and automation

What does a data engineer do? At a basic level, a data engineer builds data pipelines.

Why does the data engineer build data pipelines? Because they want to get their data
into a place, such as a dashboard, report, or machine learning model, where the
business can make data-driven decisions.

The data has to be in a usable condition so that someone can use this data to make
decisions. Many times, the raw data is, by itself, not very useful.

Once data becomes useful, the data engineer will often apply updates or
transformations to add new value to the data.

Of course, new data environments require data management practices to ensure
currency and accuracy.

Finally, data engineers create processes and operations to move data usage into
production settings.

Data engineering tasks revolve around ingesting, transforming, and storing data

Replicate and migrate: transfer raw data into Google Cloud
Ingest: raw data is available in a data source
Transform: process data using EL, ELT, or ETL tools
Store: processed data is available in a data sink

In the most basic sense, a data engineer moves data from data sources to data sinks
in four stages: replicate and migrate, ingest, transform, and store.

Replication and migration services onboard your data into Google Cloud

Products: gcloud storage, Storage Transfer Service, Transfer Appliance, Datastream
Capabilities: online or offline transfer, scheduling, change data capture

Pipeline stages: Replicate and migrate → Ingest → Transform → Store

The replicate and migrate stage of a data pipeline focuses on the tools and options to
bring data from external or internal systems into Google Cloud for further refinement.

There are a wide variety of tools and options at your disposal. They will be covered in
more detail throughout this course.
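
As a minimal sketch of the online-transfer option, the gcloud storage command can copy files from a local machine or another bucket into Cloud Storage; the bucket and path names below are hypothetical.

# Copy a local directory of raw export files into a Cloud Storage landing bucket.
gcloud storage cp --recursive ./raw-exports gs://my-landing-bucket/raw-exports/

# Copy objects from an existing bucket into the same landing bucket.
gcloud storage cp gs://source-bucket/sales/*.csv gs://my-landing-bucket/sales/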

Data engineering tasks and components

01 The role of a data engineer

02 Data sources versus data sinks

03 Data formats

04 Storage solution options on Google Cloud

05 Metadata management options on Google Cloud

06 Share datasets using Analytics Hub



Data sources are the origin point of your raw data on Google Cloud

Cloud Storage: unstructured or structured data
Pub/Sub: asynchronous messaging
Spanner: relational databases
… many more!

Pipeline stages: Replicate and migrate → Ingest → Transform → Store

The ingest stage of a data pipeline is the point where data becomes a data source
and is available for usage downstream.

Think of a data source as the starting point of your data journey. It is raw,
unprocessed data waiting to be transformed into valuable insights. Any system,
application, or platform that creates, stores, or shares data can be considered a data
source.

Two examples of Google Cloud products used in the ingest phase are Cloud Storage,
a data lake holding various types of data sources, and Pub/Sub, an asynchronous
messaging system delivering data from external systems.
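
For illustration, a minimal Pub/Sub ingest flow might look like the following; the topic, subscription, and message contents are made up for this sketch.

# Create a topic and a pull subscription for incoming events.
gcloud pubsub topics create incoming-events
gcloud pubsub subscriptions create incoming-events-sub --topic=incoming-events

# Publish a sample message, as an external system might.
gcloud pubsub topics publish incoming-events --message='{"order_id": 123, "status": "NEW"}'

# Pull and acknowledge the message downstream.
gcloud pubsub subscriptions pull incoming-events-sub --auto-ack --limit=1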

Transformation services add new value to your data

Products: Dataproc, Dataflow, Dataform, … many more!
Patterns: extract and load; extract, load, transform; extract, transform, load

Pipeline stages: Replicate and migrate → Ingest → Transform → Store

The transform stage of a data pipeline represents action taken on a data source to
adjust, modify, join, or customize a data source so that it matches a specific
downstream data or reporting requirement.

There are three main transformation patterns:

● extract and load;


● extract, load, and transform; and
● extract, transform, and load.

You explore each of these patterns in their own modules later in the course.
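
As one hedged example of the ELT pattern using the bq tool (the dataset, table, and bucket names are placeholders): load the raw data as-is, then transform it inside BigQuery with SQL.

# Extract and load: load raw CSV data into a staging table without transformation.
bq load --autodetect --source_format=CSV my_dataset.raw_orders gs://my-bucket/orders/*.csv

# Transform: create a cleaned table from the raw one using SQL inside BigQuery.
bq query --use_legacy_sql=false '
CREATE OR REPLACE TABLE my_dataset.orders_clean AS
SELECT order_id, customer_id, CAST(order_total AS NUMERIC) AS order_total
FROM my_dataset.raw_orders
WHERE order_id IS NOT NULL;'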

Data sinks store your processed data on Google Cloud

BigQuery and Bigtable: streaming and batch load
Dataplex: security and governance
Analytics Hub: simple data sharing

Pipeline stages: Replicate and migrate → Ingest → Transform → Store

The store stage of a data pipeline represents the last step, when we deposit data in
its final form.

A data sink is the final stop in the data journey. It's where processed and transformed
data is stored for future use, analysis, and decision-making. Think of it as the
reservoir at the end of the river, where valuable information is collected and readily
available.

Two examples of Google Cloud products used in the store phase are BigQuery, a
serverless data warehouse, and Bigtable, a highly scalable NoSQL database.

Data engineering tasks and components

01 The role of a data engineer

02 Data sources versus data sinks

03 Data formats

04 Storage solution options on Google Cloud

05 Metadata management options on Google Cloud

06 Share datasets using Analytics Hub



Data can have different formats

Considerations: data type, business need, how the data is stored and processed

Unstructured: documents, images, audio files
Structured: tables, rows, columns

Data exists in two primary formats, unstructured and structured.

Unstructured data is information stored in a non-tabular form such as documents,
images, and audio files.

Unstructured data is usually suited for Cloud Storage, but BigQuery also offers the
capability to store unstructured data via object tables.
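
As a hedged sketch of such an object table over unstructured files (the connection, dataset, and bucket names are assumptions):

# Create an object table that exposes image files in Cloud Storage as rows in BigQuery.
bq query --use_legacy_sql=false '
CREATE EXTERNAL TABLE my_dataset.product_images
WITH CONNECTION `us.my-gcs-connection`
OPTIONS (
  object_metadata = "SIMPLE",
  uris = ["gs://my-bucket/images/*"]
);'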

There is also structured data, which represents information stored in tables, rows, and
columns.

Data engineering tasks and components

01 The role of a data engineer

02 Data sources versus data sinks

03 Data formats

04 Storage solution options on Google Cloud

05 Metadata management options on Google Cloud

06 Share datasets using Analytics Hub



Cloud Storage holds your unstructured data

Storage classes, by expected frequency of access:
● Standard storage: hot data
● Nearline storage: about once per month
● Coldline storage: about once every 90 days
● Archive storage: about once a year

Typical data: application data, database backups, log files, compliance data

Reliability and scalability | Accessed by HTTP request | Retrieved by object name | Max object size of 5 TB

There are several key products on Google Cloud that are used by data engineers.
One main product is Cloud Storage. Unstructured data is usually well-suited to be
stored in Cloud Storage.

Within Cloud Storage, objects are accessed by using HTTP requests, including
ranged GETs to retrieve portions of the data. The only key is the object name. There
is object metadata but the object itself is treated as unstructured bytes. The scale of
the system allows for serving large static content and accepting user-uploaded
content including videos, photos, and files. Objects can be up to 5 Terabytes each.

Cloud Storage is built for availability, durability, scalability, and consistency. It's an
ideal solution for hosting static websites and storing images, videos, objects and
blobs, and any unstructured data.

Cloud Storage has four primary storage classes: standard storage, nearline storage,
coldline storage, and archive storage. The classes are differentiated by the expected
period of object access.
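
A small sketch of working with storage classes follows; the bucket and file names are hypothetical.

# Create a bucket whose objects default to Nearline storage (accessed about once a month).
gcloud storage buckets create gs://my-backup-bucket --location=us-central1 --default-storage-class=NEARLINE

# Upload a backup file; it inherits the bucket's default storage class.
gcloud storage cp db-backup-2024-01.sql gs://my-backup-bucket/backups/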

Options for storing structured data

Transactional workload:
● SQL, local/regional scalability: Cloud SQL
● SQL, high-scalability PostgreSQL: AlloyDB
● SQL, global scalability: Spanner
● NoSQL: Firestore

Analytical workload:
● SQL: BigQuery
● NoSQL: Bigtable

You have a full range of cost-effective storage services for structured data to choose
from when developing with Google Cloud. No one size fits all, and your choice of
storage and database solutions will depend on your application and workload.

Cloud SQL is Google Cloud’s managed relational database service.

AlloyDB is a fully managed, high-performance PostgreSQL database service from
Google Cloud.

Spanner is Google Cloud’s fully managed relational database service that offers both
strong consistency and horizontal scalability.

Firestore is a fast, fully managed, serverless, NoSQL document database built for
automatic scaling, high performance, and ease of application development.

BigQuery is a fully managed, serverless enterprise data warehouse for analytics.

Bigtable is a high-performance NoSQL database service. Bigtable is built for fast
key-value lookup and supports consistent sub-10 millisecond latency.

Data lake versus data warehouse

Data format: a data lake stores data in its native format (unstructured, semi-structured, structured); a data warehouse applies a schema (structured or semi-structured).
Data type: a data lake holds raw data; a data warehouse holds data pre-processed and aggregated from multiple data sources.
Purpose: a data lake serves data science, applications, and business decisions; a data warehouse serves long-term business analysis.
Dependencies: a data lake needs tools and processes to enable data discovery, governance, security, and metadata management; a data warehouse is standalone.
Service: data lake: Cloud Storage; data warehouse: BigQuery.

The two key concepts in data engineering are that of the data lake and the data
warehouse.

A data lake is a vast repository for storing raw, unprocessed data in various formats,
including unstructured, semi-structured, and structured. It serves as a centralized
storage solution for diverse data types, enabling flexible use cases like data science,
applications, and business decision-making.

A data warehouse is a structured repository designed for storing pre-processed and
aggregated data from multiple sources. Primarily used for long-term business
analysis, it enables efficient querying and reporting for informed decision-making.
Data warehouses often operate as standalone systems, independent of other data
storage solutions.

BigQuery is a serverless, fully managed data warehouse

● Security at the dataset, table, column, and row level
● Rich ecosystem for data transformations
● Built-in machine learning and geographic information system capabilities
● Integration with other storage services
● Scalable storage and analytics services
● Real-time analytics on streaming data

BigQuery is a fully managed, serverless enterprise data warehouse for analytics.


BigQuery has built-in features like machine learning, geospatial analysis, and
business intelligence. BigQuery can scan terabytes in seconds, and petabytes in
minutes.

BigQuery is a great solution for online analytical processing, or OLAP, workloads for
big data exploration and processing. BigQuery is also well-suited for reporting with
business intelligence tools.

Connecting to BigQuery is easy

bq command line tool:

> bq query --use_legacy_sql=false \
  '# Get count of comments by user for articles with "Google" in the title
   SELECT contributor_username, COUNT(comment) AS comments FROM ...;'

Web UI with SQL editor:

SELECT
  contributor_username,
  COUNT(comment) AS comments
FROM
  `[Link]`
WHERE
  title LIKE "%Google%"
  AND contributor_username IS NOT NULL
GROUP BY 1
ORDER BY 2 DESC
LIMIT 100;

Access options: Web UI with SQL editor | bq command line tool | REST API (7 languages supported)

BigQuery has several easy-to-use options for accessing data.

The first is via the Google Cloud console’s SQL editor.

The second is via the bq command line tool which is part of the Google Cloud SDK.

The last is via a robust REST API which supports calls in seven programming
languages.
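
As one hedged illustration of the REST API option (the project ID and query below are placeholders), the jobs.query endpoint accepts a standard SQL query over HTTP:

# Run a query through the BigQuery REST API using an OAuth token from gcloud.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{"query": "SELECT 1 AS ok", "useLegacySql": false}' \
  "https://bigquery.googleapis.com/bigquery/v2/projects/my-project/queries"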

How BigQuery’s resources are organized

Projects (for example, Project X and Project Y) contain datasets (Dataset A, Dataset B, and so on).
Each dataset contains tables, views, ML models, and routines.
A table is referenced by its fully qualified name, project.dataset.table.

BigQuery organizes data tables into units called datasets. These datasets are scoped
to your Google Cloud project. When you reference a table from the command line in
SQL queries or in code, you refer to it by using the construct project.dataset.table.
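
For example (the project, dataset, and table names here are made up), a fully qualified reference looks like this:

# Query a table by its fully qualified project.dataset.table name.
bq query --use_legacy_sql=false \
  'SELECT COUNT(*) AS row_count FROM `my-project.dataset_a.orders`;'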

You can secure your BigQuery resources on multiple levels

IAM: dataset access, table/view access
Column-level security
Row-level security

Access control is through IAM and is at the dataset, table, view, or column level. In
order to query data in a table or view, you need at least read permissions on the table
or view.
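
As a hedged sketch (the principal and table names are hypothetical), read access can be granted at the table level with BigQuery's SQL GRANT statement:

# Give one user read-only access to a single table, without granting the whole dataset.
bq query --use_legacy_sql=false \
  'GRANT `roles/bigquery.dataViewer`
   ON TABLE `my-project.dataset_a.orders`
   TO "user:analyst@example.com";'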

Data engineering tasks and components

01 The role of a data engineer

02 Data sources versus data sinks

03 Data formats

04 Storage solution options on Google Cloud

05 Metadata management options on Google Cloud

06 Share datasets using Analytics Hub



Centrally discover, manage, monitor, and govern distributed data with Dataplex

Analytics layer: BigQuery, Dataproc, Dataflow, Vertex AI
Dataplex: unified metadata, auto-discovery, data lifecycle, data quality, data classification, data organization, unified security, unified governance
Governance features: data-to-AI governance, data governance in BigQuery, insights and semantic search, data discovery with Data Catalog, end-to-end data lineage
Storage layer: Cloud Storage, multi-cloud*, on-premises*, streaming*

* future capabilities

Metadata is a key element to making data more manageable and useful across an
organization. Dataplex is a comprehensive data management solution that allows you
to centrally discover, manage, monitor, and govern distributed data across your
organization.

With Dataplex, you can break down data silos, centralize security and governance
while enabling distributed ownership, and easily search and discover data based on
business context.

Dataplex also offers built-in data intelligence, support for open-source tools, and a
robust partner ecosystem, helping you to trust your data and accelerate time to
insights.

Example: group and share your data based on readiness using Dataplex

Landing zone: ingested data, limited access (data engineers)
Raw zone: cleaned data, immutable (data engineers and data scientists)
Curated zone: processed data, source of trust (all users)

Each zone maps to underlying Cloud Storage buckets and BigQuery datasets.

Dataplex lets you standardize and unify metadata, security policies, governance,
classification, and data lifecycle management across this distributed data.

Another common use case is when your data is accessible only to data engineers,
and is later refined and made available to data scientists and analysts. In this case,
you can set up a lake to have the following:

A raw zone for the data which is accessed by data engineers and data scientists.

A curated zone for the data which is accessed by all users.
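
A rough sketch of that setup with the gcloud CLI follows; the lake name, zone names, and flag values are assumptions and may need adjusting for your environment.

# Create a Dataplex lake, then a raw zone and a curated zone within it (names are hypothetical).
gcloud dataplex lakes create sales-lake --location=us-central1

gcloud dataplex zones create raw-zone \
  --lake=sales-lake --location=us-central1 \
  --type=RAW --resource-location-type=SINGLE_REGION

gcloud dataplex zones create curated-zone \
  --lake=sales-lake --location=us-central1 \
  --type=CURATED --resource-location-type=SINGLE_REGION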



Data engineering tasks and components

01 The role of a data engineer

02 Data sources versus data sinks

03 Data formats

04 Storage solution options on Google Cloud

05 Metadata management options on Google Cloud

06 Share datasets using Analytics Hub



How about sharing data outside your organization? This is challenging

Your organization: a BigQuery dataset containing tables, views, ML models, and routines
External organization: needs access to that data in BigQuery

Approaches: export and copy? an ETL process? onboard external users in IAM?
Challenges: data freshness, pipeline management, no data usage visibility, complex permissions

Sharing data is challenging, especially outside of your organization.

You need to consider security and permissions, destination options for data pipelines,
data freshness and accuracy, and finally, usage monitoring.

Analytics Hub was created to meet these data sharing challenges.



Sharing data across organizations with Analytics Hub is easy

Publisher project: shared datasets in BigQuery are published as public or private listings in an Analytics Hub data exchange.
Subscriber project: users search the listings, subscribe, and receive a linked dataset in their own BigQuery project that they can query.

Steps: 1 Publish, 2 Search, 3 Subscribe, 4 Query

Analytics Hub helps organizations unlock the value of data sharing, leading to new
insights and business value.

With Analytics Hub, you create a rich data ecosystem by publishing and subscribing
to analytics-ready datasets. Because data is shared in place, data providers are able
to control and monitor how their data is being used.

Analytics Hub provides a self-service way to access valuable and trusted data assets,
including data provided by Google.

Finally, Analytics Hub provides an opportunity to monetize data assets. Analytics Hub
removes the task of building the infrastructure required for monetization.

Lab: Loading Data into BigQuery

30 min

Learning objectives
● Load data into BigQuery from various sources.
● Load data into BigQuery using the CLI and the Google
Cloud console.
● Use DDL to create tables.

In this lab, you practice loading data into BigQuery. The primary objective of this lab is
to load data into BigQuery using both the command-line interface and the Google
Cloud console. You also experience loading several datasets into BigQuery and using
the Data Definition Language, or DDL.

Lab URL: [Link]
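
As a preview of what the lab covers, here is a hedged example of loading via the CLI and creating a table with DDL; the dataset, table, and bucket names are placeholders.

# Load a CSV file from Cloud Storage into a BigQuery table using the bq CLI.
bq load --source_format=CSV --skip_leading_rows=1 --autodetect \
  my_dataset.sales_raw gs://my-bucket/sales/2024-01.csv

# Create a table with DDL instead of loading a file.
bq query --use_legacy_sql=false '
CREATE TABLE IF NOT EXISTS my_dataset.sales_summary (
  sale_date DATE,
  region STRING,
  total_revenue NUMERIC
);'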
