Commit 17bda3a

docs: Add instructions for migrating from HBase to Bigtable (offline via snapshots) (googleapis#3197)

Squashed commit history:

* docs: Add README for HBase Tools and Beam import/export and validator pipelines (googleapis#2949)
* responding to some review comments
* more cleanups; adding hashes export and copy to bucket
* reran through commands and fixed/cleaned up
* cleanup for Jordan
* fix references from hbase-tools to hbase-1.x-tools
* update version, grammar, additional param info
* remove unnecessary commands; add timestamp to output file
* docs: fix readme title for Bigtable HBase tools (googleapis#3013)
* docs: Fix broken links for HBase Migration tools (googleapis#3097)
* use more refined link
* update header in readme
* revert schema translator class
* update link generators and typo

1 parent e156474 commit 17bda3a

File tree — 2 files changed: +295 −31

  • bigtable-dataflow-parent/bigtable-beam-import
  • bigtable-hbase-1.x-parent/bigtable-hbase-1.x-tools

---

File: bigtable-dataflow-parent/bigtable-beam-import/README.md (203 additions & 31 deletions)
# HBase Sequence Files to Cloud Bigtable using Beam

This folder contains tools to support importing and exporting HBase data to
Google Cloud Bigtable using Cloud Dataflow.

## Setup

To use the tools in this folder, you can download them from the Maven repository,
or you can build them yourself using Maven.

### Download the jars

Download [the import/export jar](https://search.maven.org/artifact/com.google.cloud.bigtable/bigtable-beam-import), which bundles all required dependencies.

### Build the jars yourself

Go to the top-level directory, build the repo,
then return to this subdirectory.

```
cd ../../
mvn clean install -DskipTests=true
cd bigtable-dataflow-parent/bigtable-beam-import
```

***
# Tools

## Data export pipeline

You can export data into a snapshot or into sequence files. If you're migrating
your data from HBase to Bigtable, snapshots are the preferred method.

### Exporting snapshots from HBase

Perform these steps from a Unix shell on an HBase edge node.

1. Set the environment variables.
   ```
   TABLE_NAME=your-table-name
   SNAPSHOT_NAME=your-snapshot-name
   SNAPSHOT_EXPORT_PATH=/hbase-migration-snap
   BUCKET_NAME="gs://bucket-name"

   NUM_MAPPERS=16
   ```
1. Take the snapshot.
   ```
   echo "snapshot '$TABLE_NAME', '$SNAPSHOT_NAME'" | hbase shell -n
   ```
1. Export the snapshot.
   ```
   hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot $SNAPSHOT_NAME \
    -copy-to $BUCKET_NAME$SNAPSHOT_EXPORT_PATH/data -mappers $NUM_MAPPERS
   ```
1. Create hashes for the table to be used during the data validation step.
   [Visit the HBase documentation for more information on each parameter](http://hbase.apache.org/book.html#_step_1_hashtable).
   ```
   hbase org.apache.hadoop.hbase.mapreduce.HashTable --batchsize=10 --numhashfiles=10 \
    $TABLE_NAME $BUCKET_NAME$SNAPSHOT_EXPORT_PATH/hashtable
   ```

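The three `hbase` invocations above can be strung together into a single reviewable script. This is a sketch, not part of the official tooling: it builds each command as a string and only prints it (a dry run), so you can inspect variable expansion before executing anything on a real edge node. All values are placeholders.

```shell
#!/bin/sh
# Dry-run sketch of the snapshot-export steps above. Pipe the output to
# `sh` (or replace the printf with eval) to actually run the commands.
TABLE_NAME=your-table-name
SNAPSHOT_NAME=your-snapshot-name
SNAPSHOT_EXPORT_PATH=/hbase-migration-snap
BUCKET_NAME="gs://bucket-name"
NUM_MAPPERS=16

# 1. Take the snapshot via the HBase shell.
SNAPSHOT_CMD="echo \"snapshot '$TABLE_NAME', '$SNAPSHOT_NAME'\" | hbase shell -n"

# 2. Copy the snapshot to Cloud Storage.
EXPORT_CMD="hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot $SNAPSHOT_NAME \
 -copy-to $BUCKET_NAME$SNAPSHOT_EXPORT_PATH/data -mappers $NUM_MAPPERS"

# 3. Create row hashes for the later validation step.
HASH_CMD="hbase org.apache.hadoop.hbase.mapreduce.HashTable --batchsize=10 --numhashfiles=10 \
 $TABLE_NAME $BUCKET_NAME$SNAPSHOT_EXPORT_PATH/hashtable"

printf '%s\n' "$SNAPSHOT_CMD" "$EXPORT_CMD" "$HASH_CMD"
```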
### Exporting sequence files from HBase

1. On your HDFS, set the environment variables.
   ```
   TABLE_NAME="my-new-table"
   EXPORTDIR=/usr/[USERNAME]/hbase-${TABLE_NAME}-export
   hadoop fs -mkdir -p ${EXPORTDIR}
   MAXVERSIONS=2147483647
   ```
1. On an edge node that has the HBase classpath configured, run the export commands.
   ```
   cd $HBASE_HOME
   bin/hbase org.apache.hadoop.hbase.mapreduce.Export \
     -Dmapred.output.compress=true \
     -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
     -Dhbase.client.scanner.caching=100 \
     -Dmapred.map.tasks.speculative.execution=false \
     -Dmapred.reduce.tasks.speculative.execution=false \
     $TABLE_NAME $EXPORTDIR $MAXVERSIONS
   ```

### Exporting snapshots from Bigtable

Exporting HBase snapshots from Bigtable is not supported.

### Exporting sequence files from Bigtable

1. Set the environment variables.
   ```
   PROJECT_ID=your-project-id
   INSTANCE_ID=your-instance-id
   CLUSTER_NUM_NODES=3
   TABLE_NAME=your-table-name

   BUCKET_NAME=gs://bucket-name
   ```
1. Run the export.

[//]: # ({x-version-update-start:bigtable-client-parent:released})
   ```
   java -jar bigtable-beam-import-1.24.0-shaded.jar export \
     --runner=dataflow \
     --project=$PROJECT_ID \
     --bigtableInstanceId=$INSTANCE_ID \
     --bigtableTableId=$TABLE_NAME \
     --destinationPath=$BUCKET_NAME/hbase_export/ \
     --tempLocation=$BUCKET_NAME/hbase_temp/ \
     --maxNumWorkers=$(expr 3 \* $CLUSTER_NUM_NODES)
   ```
[//]: # ({x-version-update-end})

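The `expr 3 \* $CLUSTER_NUM_NODES` idiom above sizes the Dataflow worker pool at three workers per Bigtable node; that ratio is the guideline used throughout these commands, not a hard limit. A minimal sketch of the arithmetic:

```shell
# Sizing maxNumWorkers from the Bigtable cluster size, as in the export
# command above (the backslash stops the shell from globbing "*").
CLUSTER_NUM_NODES=3
MAX_NUM_WORKERS=$(expr 3 \* $CLUSTER_NUM_NODES)
echo "$MAX_NUM_WORKERS"   # 9 for a 3-node cluster
```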
## Importing to Bigtable

You can import data into Bigtable from a snapshot or from sequence files. Before you begin the import, you must create
the tables and column families in Bigtable via the [schema translation tool](https://github.com/googleapis/java-bigtable-hbase/tree/master/bigtable-hbase-1.x-parent/bigtable-hbase-1.x-tools)
or by using the `cbt` command line tool and running the following:

    cbt createtable your-table-name
    cbt createfamily your-table-name your-column-family

Once your import is completed, follow the instructions for the validator below to ensure it was successful.

While the import runs, keep an eye on the Bigtable cluster's CPU usage and adjust the number of Dataflow workers accordingly.

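If `cbt` has not yet been configured with a default project and instance (via `~/.cbtrc`), the same table setup can be done with explicit `-project` and `-instance` flags. A hedged sketch with placeholder names; it defaults to a dry-run echo so the commands can be reviewed first:

```shell
# Create the destination table and column family with explicit
# project/instance flags instead of relying on ~/.cbtrc defaults.
# CBT defaults to a dry-run echo; set CBT=cbt to execute for real.
PROJECT_ID=your-project-id
INSTANCE_ID=your-instance-id
TABLE_NAME=your-table-name
CBT="${CBT:-echo cbt}"

$CBT -project "$PROJECT_ID" -instance "$INSTANCE_ID" createtable "$TABLE_NAME"
$CBT -project "$PROJECT_ID" -instance "$INSTANCE_ID" createfamily "$TABLE_NAME" your-column-family
```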
### Snapshots (preferred method)

1. Set the environment variables.
   ```
   PROJECT_ID=your-project-id
   INSTANCE_ID=your-instance-id
   TABLE_NAME=your-table-name
   CLUSTER_NUM_NODES=3
   REGION=us-central1

   BUCKET_NAME=gs://bucket-name
   SNAPSHOT_GCS_PATH="$BUCKET_NAME/hbase-migration-snap"
   SNAPSHOT_NAME=your-snapshot-name
   ```
1. Run the import.

[//]: # ({x-version-update-start:bigtable-client-parent:released})
   ```
   java -jar bigtable-beam-import-1.24.0-shaded.jar importsnapshot \
     --runner=DataflowRunner \
     --project=$PROJECT_ID \
     --bigtableInstanceId=$INSTANCE_ID \
     --bigtableTableId=$TABLE_NAME \
     --hbaseSnapshotSourceDir=$SNAPSHOT_GCS_PATH/data \
     --snapshotName=$SNAPSHOT_NAME \
     --stagingLocation=$SNAPSHOT_GCS_PATH/staging \
     --tempLocation=$SNAPSHOT_GCS_PATH/temp \
     --maxNumWorkers=$(expr 3 \* $CLUSTER_NUM_NODES) \
     --region=$REGION
   ```
[//]: # ({x-version-update-end})

### Sequence Files

1. Set the environment variables.
   ```
   PROJECT_ID=your-project-id
   INSTANCE_ID=your-instance-id
   CLUSTER_NUM_NODES=3
   CLUSTER_ZONE=us-central1-a
   TABLE_NAME=your-table-name

   BUCKET_NAME=gs://bucket-name
   ```
1. Run the import.

[//]: # ({x-version-update-start:bigtable-client-parent:released})
   ```
   java -jar bigtable-beam-import-1.24.0-shaded.jar import \
     --runner=dataflow \
     --project=$PROJECT_ID \
     --bigtableInstanceId=$INSTANCE_ID \
     --bigtableTableId=$TABLE_NAME \
     --sourcePattern="$BUCKET_NAME/hbase-export/part-*" \
     --tempLocation=$BUCKET_NAME/hbase_temp \
     --maxNumWorkers=$(expr 3 \* $CLUSTER_NUM_NODES) \
     --zone=$CLUSTER_ZONE
   ```
[//]: # ({x-version-update-end})

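One sharp edge in the import command's `--sourcePattern` flag: it must be quoted so the shell does not expand the `part-*` glob locally, and it must use double quotes so `$BUCKET_NAME` still expands; single quotes would pass the literal string `$BUCKET_NAME/...` to the pipeline. A quick sketch of the difference:

```shell
# Double quotes expand the variable but leave the glob for the pipeline to
# interpret; single quotes suppress both.
BUCKET_NAME=gs://bucket-name
GOOD_PATTERN="$BUCKET_NAME/hbase-export/part-*"
BAD_PATTERN='$BUCKET_NAME/hbase-export/part-*'
echo "$GOOD_PATTERN"
```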
## Validating data

Once your snapshot or sequence files are imported, run the validator to
check for rows with mismatched data.

1. Set the environment variables.
   ```
   PROJECT_ID=your-project-id
   INSTANCE_ID=your-instance-id
   TABLE_NAME=your-table-name
   REGION=us-central1

   BUCKET_NAME=gs://bucket-name
   SNAPSHOT_GCS_PATH="$BUCKET_NAME/hbase-migration-snap"
   ```
1. Run the sync job. It will put the results into `$SNAPSHOT_GCS_PATH/sync-table/output-TIMESTAMP`.

[//]: # ({x-version-update-start:bigtable-client-parent:released})
   ```
   java -jar bigtable-beam-import-1.24.0-shaded.jar sync-table \
     --runner=dataflow \
     --project=$PROJECT_ID \
     --bigtableInstanceId=$INSTANCE_ID \
     --bigtableTableId=$TABLE_NAME \
     --outputPrefix=$SNAPSHOT_GCS_PATH/sync-table/output-$(date +"%s") \
     --stagingLocation=$SNAPSHOT_GCS_PATH/sync-table/staging \
     --hashTableOutputDir=$SNAPSHOT_GCS_PATH/hashtable \
     --tempLocation=$SNAPSHOT_GCS_PATH/sync-table/dataflow-test/temp \
     --region=$REGION
   ```
[//]: # ({x-version-update-end})
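The sync job's `--outputPrefix` embeds a Unix timestamp so repeated runs do not overwrite each other. Note that this needs `$(...)` command substitution; a `${...}` form is parameter expansion and would not invoke `date` at all. A minimal sketch (bucket path is a placeholder):

```shell
# Building the timestamped output prefix used by the sync job above.
SNAPSHOT_GCS_PATH="gs://bucket-name/hbase-migration-snap"
OUTPUT_PREFIX="$SNAPSHOT_GCS_PATH/sync-table/output-$(date +"%s")"
echo "$OUTPUT_PREFIX"
```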
---

File: bigtable-hbase-1.x-parent/bigtable-hbase-1.x-tools/README.md (92 additions & 0 deletions)
# HBase Tools

This folder contains tools to help HBase users migrate to Cloud Bigtable.
Pipelines to import and export data are under [bigtable-beam-import](../../bigtable-dataflow-parent/bigtable-beam-import/README.md).

## Setup

To use the tools in this folder, you can download them from the Maven repository,
or you can build them yourself using Maven.

### Download the jars

Download [the Bigtable tools jar](http://search.maven.org/remotecontent?filepath=com/google/cloud/bigtable/bigtable-hbase-1.x-tools/1.24.0/bigtable-hbase-1.x-tools-1.24.0-shaded.jar), which bundles all required dependencies.

### Build the jars

Go to the top-level directory, build the repo, then return to this subdirectory.

```
cd ../../
mvn clean install -DskipTests=true
cd bigtable-hbase-1.x-parent/bigtable-hbase-1.x-tools
```

## Schema Translation tool

This tool creates tables in Cloud Bigtable based on the tables in an HBase cluster.
You specify a [name regex](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html?is-external=true)
and it will copy column families, garbage collection rules,
and table splits.

1. Define the environment variables to easily run the command.
   ```
   PROJECT_ID=your-project-id
   INSTANCE_ID=your-instance-id
   TABLE_NAME_REGEX=your-table-name

   ZOOKEEPER_QUORUM=localhost
   ZOOKEEPER_PORT=2181
   ```
1. Execute the following command to copy the schema from HBase to Cloud Bigtable.

[//]: # ({x-version-update-start:bigtable-client-parent:released})
   ```
   java \
    -Dgoogle.bigtable.project.id=$PROJECT_ID \
    -Dgoogle.bigtable.instance.id=$INSTANCE_ID \
    -Dgoogle.bigtable.table.filter=$TABLE_NAME_REGEX \
    -Dhbase.zookeeper.quorum=$ZOOKEEPER_QUORUM \
    -Dhbase.zookeeper.property.clientPort=$ZOOKEEPER_PORT \
    -jar bigtable-hbase-1.x-tools-1.24.0-jar-with-dependencies.jar
   ```
[//]: # ({x-version-update-end})
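The table filter is a `java.util.regex` pattern matched against table names, so one run can translate a single table or a whole family of them. A few hypothetical example values (the table names are placeholders, not from this repo):

```shell
# Hypothetical values for google.bigtable.table.filter (java.util.regex syntax).
TABLE_NAME_REGEX='my-table'          # exactly one table
TABLE_NAME_REGEX='prod-.*'           # every table whose name starts with "prod-"
TABLE_NAME_REGEX='(users|orders)'    # an explicit list of tables
```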

### Alternative: Exporting Schema

If your HBase master is in a private VPC or can't connect to the internet, you can
export the HBase schema to a file and use that file to create the tables in Cloud Bigtable.

#### Export schema

1. On a host that can connect to HBase, define the export location for your schema file.
   ```
   HBASE_EXPORT_PATH=/path/to/hbase-schema.json
   ```
1. Run the export tool from the host.

[//]: # ({x-version-update-start:bigtable-client-parent:released})
   ```
   java \
    -Dgoogle.bigtable.table.filter=$TABLE_NAME_REGEX \
    -Dgoogle.bigtable.output.filepath=$HBASE_EXPORT_PATH \
    -Dhbase.zookeeper.quorum=$ZOOKEEPER_QUORUM \
    -Dhbase.zookeeper.property.clientPort=$ZOOKEEPER_PORT \
    -jar bigtable-hbase-1.x-tools-1.24.0-jar-with-dependencies.jar
   ```
[//]: # ({x-version-update-end})

#### Import schema

1. Copy the schema file to a host that can connect to Google Cloud.
   ```
   SCHEMA_FILE_PATH=path/to/hbase-schema.json
   ```
1. Create tables in Cloud Bigtable using the schema file.

[//]: # ({x-version-update-start:bigtable-client-parent:released})
   ```
   java \
    -Dgoogle.bigtable.project.id=$PROJECT_ID \
    -Dgoogle.bigtable.instance.id=$INSTANCE_ID \
    -Dgoogle.bigtable.input.filepath=$SCHEMA_FILE_PATH \
    -jar bigtable-hbase-1.x-tools-1.24.0-jar-with-dependencies.jar
   ```
[//]: # ({x-version-update-end})
