Migrating Data from HDFS to BigQuery
This guide will walk you through the entire process of migrating 500 Hive external tables in
Parquet format to BigQuery using Google Cloud services such as Google Cloud Storage
(GCS), Dataflow, Dataform, and Cloud Composer.
Step 1: Set Up IAM Permissions
Ensure the necessary IAM roles are assigned to the service accounts used by Cloud
Composer, Dataflow, and other Google Cloud services:
Google Cloud Storage: roles/storage.admin
BigQuery: roles/bigquery.admin
Dataflow: roles/dataflow.admin
Cloud Composer: roles/composer.admin
Service Account: Ensure the service account used by your Cloud Composer (Airflow) environment has been granted the roles above. Note that the *.admin roles are broad; in production, prefer the narrowest roles that still let each service do its job.
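For example, the roles can be granted with gcloud. A minimal sketch; the service account e-mail below is a placeholder for the account your environment actually uses:
sh
# Grant one role per command; repeat for each role listed above.
gcloud projects add-iam-policy-binding your-project-id \
  --member="serviceAccount:your-sa@your-project-id.iam.gserviceaccount.com" \
  --role="roles/storage.admin"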
Step 2: Transfer Data from HDFS to GCS using hadoop distcp
If you have direct access to the Hadoop cluster, you can use hadoop distcp to transfer the
data from HDFS to GCS:
sh
hadoop distcp hdfs://namenode:8020/path-to-hive-tables/ gs://your-bucket/path/
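For a copy of this size it helps to make the transfer resumable and to cap parallelism, and the Cloud Storage connector must be installed on the cluster so the gs:// scheme resolves. A sketch using standard distcp flags (the map count of 50 is illustrative):
sh
# -update copies only missing or changed files on reruns; -m caps the number of parallel map tasks.
hadoop distcp -update -m 50 hdfs://namenode:8020/path-to-hive-tables/ gs://your-bucket/path/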
Step 3: Create Dataflow Pipeline for Transformation and Loading
Create a Dataflow job to read Parquet files from GCS, add new columns, and load the data
into BigQuery.
Dataflow Pipeline (Python)
1. Create the Dataflow script (dataflow_pipeline.py). A parameterized per-table variant is sketched at the end of this step:
python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions, StandardOptions
from apache_beam.io import ReadFromParquet, WriteToBigQuery
from datetime import datetime


def add_columns(element):
    # ReadFromParquet yields one dict per record; add the audit columns here.
    element['timestamp'] = datetime.utcnow().isoformat()
    element['source_name'] = 'hive_source'
    return element


def run():
    options = PipelineOptions()
    google_cloud_options = options.view_as(GoogleCloudOptions)
    google_cloud_options.project = 'your-project-id'
    google_cloud_options.job_name = 'hive-to-bigquery'
    google_cloud_options.staging_location = 'gs://your-bucket/staging'
    google_cloud_options.temp_location = 'gs://your-bucket/temp'
    options.view_as(StandardOptions).runner = 'DataflowRunner'

    with beam.Pipeline(options=options) as p:
        (p
         | 'ReadFromParquet' >> ReadFromParquet('gs://your-bucket/path-to-parquet-files/*.parquet')
         | 'AddColumns' >> beam.Map(add_columns)
         | 'WriteToBigQuery' >> WriteToBigQuery(
             'your-project-id:your_dataset.your_table',
             schema='SCHEMA_AUTODETECT',
             create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
         )


if __name__ == '__main__':
    run()
2. Upload the script to GCS:
sh
gsutil cp dataflow_pipeline.py gs://your-bucket/path-to-dataflow-script.py
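The script above hard-codes one input path and one destination table. To reuse it for all 500 tables, it can accept the input and output as pipeline options and be launched once per table. A minimal sketch, assuming hypothetical --input and --output_table arguments (the AddColumns step from the script above would slot in unchanged):
python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class HiveMigrationOptions(PipelineOptions):
    """Hypothetical per-table options supplied at launch time."""

    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_argument('--input', help='GCS glob of Parquet files for one table')
        parser.add_argument('--output_table', help='Destination as project:dataset.table')


def run():
    options = HiveMigrationOptions()
    with beam.Pipeline(options=options) as p:
        (p
         | 'ReadFromParquet' >> beam.io.ReadFromParquet(options.input)
         | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
             options.output_table,
             schema='SCHEMA_AUTODETECT',
             create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
         )


if __name__ == '__main__':
    run()
With this variant, runner, project, and temp location are passed on the command line (or by the Airflow operator in Step 5) instead of being hard-coded.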
Step 4: Set Up Dataform for Schema Management
1. Initialize a Dataform project:
sh
dataform init my_dataform_project
cd my_dataform_project
2. Configure dataform.json:
json
{
  "warehouse": "bigquery",
  "defaultSchema": "your_dataset"
}
3. Create SQLX files for table definitions:
Create a definitions directory if it doesn't exist, and add SQLX files for your tables (a script to generate them in bulk is sketched at the end of this step).
For example, definitions/example_table.sqlx:
sqlx
config {
  type: "table",
  description: "This is an example table created from a Dataform script",
  columns: {
    id: "The unique identifier",
    name: "The name of the entity",
    timestamp: "The timestamp when the row was inserted",
    source_name: "The source of the data"
  }
}

select
  id,
  name,
  current_timestamp() as timestamp,
  'hive_source' as source_name
from
  your_dataset.your_table
4. Run Dataform:
sh
dataform run
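Hand-writing 500 SQLX files is impractical, so one option is to generate them from a list of table names. A minimal sketch, assuming a tables.txt file with one table name per line and a shared template (both the file name and the template are illustrative; since the Dataflow job already adds timestamp and source_name, the template simply selects everything from the loaded table):
python
from pathlib import Path

# Hypothetical template applied to every table listed in tables.txt.
TEMPLATE = """config {{
  type: "table",
  description: "Curated copy of {table} migrated from Hive"
}}

select * from your_dataset.{table}
"""

definitions = Path('definitions')
definitions.mkdir(exist_ok=True)

for table in Path('tables.txt').read_text().splitlines():
    table = table.strip()
    if table:
        (definitions / f'{table}.sqlx').write_text(TEMPLATE.format(table=table))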
Step 5: Orchestrate the Process Using Cloud Composer
Create modular Airflow DAGs to automate each step of the process.
5.1 Airflow DAG to Transfer Data to GCS
Create an Airflow DAG (transfer_data_to_gcs_dag.py) to transfer data from HDFS to GCS. Note that a plain BashOperator only works if the Composer workers have the Hadoop client installed and network access to the cluster; an SSH-based alternative is sketched after this DAG:
python
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

default_args = {
    'owner': 'airflow',
    'start_date': days_ago(1),
    'retries': 1,
}

with DAG('transfer_data_to_gcs', default_args=default_args,
         schedule_interval='@daily') as dag:
    transfer_data = BashOperator(
        task_id='distcp_to_gcs',
        bash_command='hadoop distcp hdfs://namenode:8020/path-to-hive-tables/ gs://your-bucket/path/'
    )
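If the Hadoop client is not available on the Composer workers, a common alternative is to run distcp on a cluster edge node over SSH. A minimal sketch, assuming an Airflow SSH connection named hadoop_edge_node has been configured (the connection ID is a placeholder):
python
from airflow import DAG
from airflow.providers.ssh.operators.ssh import SSHOperator
from airflow.utils.dates import days_ago

with DAG('transfer_data_to_gcs_ssh', start_date=days_ago(1),
         schedule_interval='@daily') as dag:
    # Run distcp on an edge node that already has Hadoop and the Cloud Storage connector.
    transfer_data = SSHOperator(
        task_id='distcp_to_gcs',
        ssh_conn_id='hadoop_edge_node',  # assumed Airflow connection
        command='hadoop distcp hdfs://namenode:8020/path-to-hive-tables/ gs://your-bucket/path/'
    )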
5.2 Airflow DAG to Run Dataflow Job
Create an Airflow DAG (load_data_to_bigquery_dag.py) to run the Dataflow job:
python
from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import DataflowCreatePythonJobOperator
from airflow.utils.dates import days_ago

default_args = {
    'owner': 'airflow',
    'start_date': days_ago(1),
    'retries': 1,
}

with DAG('load_data_to_bigquery', default_args=default_args,
         schedule_interval='@daily') as dag:
    run_dataflow_job = DataflowCreatePythonJobOperator(
        task_id='run_dataflow',
        py_file='gs://your-bucket/path-to-dataflow-script.py',
        dataflow_default_options={
            'project': 'your-project-id',
            'region': 'us-central1',
            'staging_location': 'gs://your-bucket/staging',
            'temp_location': 'gs://your-bucket/temp',
            'runner': 'DataflowRunner'
        }
    )
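With 500 tables you will typically want one Dataflow task per table (or per batch) rather than a single hard-coded job. A sketch of dynamic task generation, assuming the pipeline has been parameterized with the hypothetical --input and --output_table options from the end of Step 3 (the table list here is illustrative):
python
from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import DataflowCreatePythonJobOperator
from airflow.utils.dates import days_ago

# Illustrative subset of the 500 tables; in practice load this from a config file.
TABLES = ['orders', 'customers', 'payments']

with DAG('load_all_tables_to_bigquery', start_date=days_ago(1),
         schedule_interval='@daily') as dag:
    for table in TABLES:
        DataflowCreatePythonJobOperator(
            task_id=f'run_dataflow_{table}',
            py_file='gs://your-bucket/path-to-dataflow-script.py',
            options={
                # Hypothetical per-table options added in the parameterized script.
                'input': f'gs://your-bucket/path/{table}/*.parquet',
                'output_table': f'your-project-id:your_dataset.{table}',
            },
            dataflow_default_options={
                'project': 'your-project-id',
                'region': 'us-central1',
                'temp_location': 'gs://your-bucket/temp',
            }
        )
In practice the table list would come from a configuration file or an export of the Hive metastore rather than being hard-coded in the DAG.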
5.3 Airflow DAG to Run Dataform
Create an Airflow DAG (run_dataform_dag.py) to run Dataform (this assumes the Dataform CLI is installed and the project is available on the Airflow workers):
python
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

default_args = {
    'owner': 'airflow',
    'start_date': days_ago(1),
    'retries': 1,
}

with DAG('run_dataform', default_args=default_args,
         schedule_interval='@daily') as dag:
    run_dataform = BashOperator(
        task_id='run_dataform',
        bash_command='dataform run --project-dir /path/to/your/dataform/project'
    )
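If you prefer a single end-to-end workflow instead of three independent daily DAGs, the same tasks can be chained in one DAG so each step runs only after the previous one succeeds. A minimal sketch reusing the operators above (paths and bucket names remain placeholders):
python
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.google.cloud.operators.dataflow import DataflowCreatePythonJobOperator
from airflow.utils.dates import days_ago

with DAG('hive_to_bigquery_end_to_end', start_date=days_ago(1),
         schedule_interval='@daily') as dag:
    transfer_data = BashOperator(
        task_id='distcp_to_gcs',
        bash_command='hadoop distcp hdfs://namenode:8020/path-to-hive-tables/ gs://your-bucket/path/'
    )

    run_dataflow_job = DataflowCreatePythonJobOperator(
        task_id='run_dataflow',
        py_file='gs://your-bucket/path-to-dataflow-script.py',
        dataflow_default_options={
            'project': 'your-project-id',
            'region': 'us-central1',
            'temp_location': 'gs://your-bucket/temp',
        }
    )

    run_dataform = BashOperator(
        task_id='run_dataform',
        bash_command='dataform run --project-dir /path/to/your/dataform/project'
    )

    # Enforce ordering: copy to GCS, then load to BigQuery, then build Dataform models.
    transfer_data >> run_dataflow_job >> run_dataform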
Conclusion
By following these steps, you can efficiently migrate your Hive external tables to BigQuery.
This approach leverages GCS for intermediate storage, Dataflow for transformation and
loading, Dataform for schema management, and Cloud Composer for orchestration. Each
component is modular, allowing for maintainability and scalability.