# HBase Sequence Files to Cloud Bigtable using Beam

This folder contains tools to support importing and exporting HBase data to
Google Cloud Bigtable using Cloud Dataflow.

## Setup

To use the tools in this folder, you can download them from the Maven repository, or
you can build them yourself with Maven.

### Download the jars

Download [the import/export jar](https://search.maven.org/artifact/com.google.cloud.bigtable/bigtable-beam-import), which bundles all required dependencies.

### Build the jars yourself

Go to the top-level directory, build the repo,
and then return to this subdirectory.

```
cd ../../
mvn clean install -DskipTests=true
cd bigtable-dataflow-parent/bigtable-beam-import
```

***
# Tools

## Data export pipeline

You can export data into a snapshot or into sequence files. If you're migrating
your data from HBase to Bigtable, using snapshots is the preferred method.

### Exporting snapshots from HBase

Perform these steps from a Unix shell on an HBase edge node.

1. Set the environment variables.
    ```
    TABLE_NAME=your-table-name
    SNAPSHOT_NAME=your-snapshot-name
    SNAPSHOT_EXPORT_PATH=/hbase-migration-snap
    BUCKET_NAME="gs://bucket-name"

    NUM_MAPPERS=16
    ```
1. Take the snapshot.
    ```
    echo "snapshot '$TABLE_NAME', '$SNAPSHOT_NAME'" | hbase shell -n
    ```
1. Export the snapshot.
    ```
    hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot $SNAPSHOT_NAME \
        -copy-to $BUCKET_NAME$SNAPSHOT_EXPORT_PATH/data -mappers $NUM_MAPPERS
    ```
1. Create hashes for the table to be used during the data validation step.
   [Visit the HBase documentation for more information on each parameter](http://hbase.apache.org/book.html#_step_1_hashtable).
    ```
    hbase org.apache.hadoop.hbase.mapreduce.HashTable --batchsize=10 --numhashfiles=10 \
        $TABLE_NAME $BUCKET_NAME$SNAPSHOT_EXPORT_PATH/hashtable
    ```
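The two commands above write to sibling paths under the same bucket. Note that `$BUCKET_NAME$SNAPSHOT_EXPORT_PATH` concatenates with no separator, so the leading `/` in `SNAPSHOT_EXPORT_PATH` matters. A quick sanity check using the sample values from step 1:

```shell
BUCKET_NAME="gs://bucket-name"
SNAPSHOT_EXPORT_PATH=/hbase-migration-snap

# ExportSnapshot copies the table data here:
echo "$BUCKET_NAME$SNAPSHOT_EXPORT_PATH/data"       # gs://bucket-name/hbase-migration-snap/data
# HashTable writes the validation hashes here:
echo "$BUCKET_NAME$SNAPSHOT_EXPORT_PATH/hashtable"  # gs://bucket-name/hbase-migration-snap/hashtable
```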

### Exporting sequence files from HBase

1. On your HDFS cluster, set the environment variables and create the export directory.
    ```
    TABLE_NAME="my-new-table"
    EXPORTDIR=/usr/[USERNAME]/hbase-${TABLE_NAME}-export
    hadoop fs -mkdir -p ${EXPORTDIR}
    MAXVERSIONS=2147483647
    ```
1. On an edge node that has the HBase classpath configured, run the export commands.
    ```
    cd $HBASE_HOME
    bin/hbase org.apache.hadoop.hbase.mapreduce.Export \
        -Dmapred.output.compress=true \
        -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
        -Dhbase.client.scanner.caching=100 \
        -Dmapred.map.tasks.speculative.execution=false \
        -Dmapred.reduce.tasks.speculative.execution=false \
        $TABLE_NAME $EXPORTDIR $MAXVERSIONS
    ```

### Exporting snapshots from Bigtable

Exporting HBase snapshots from Bigtable is not supported.

### Exporting sequence files from Bigtable

1. Set the environment variables.
    ```
    PROJECT_ID=your-project-id
    INSTANCE_ID=your-instance-id
    CLUSTER_NUM_NODES=3
    TABLE_NAME=your-table-name

    BUCKET_NAME=gs://bucket-name
    ```
1. Run the export.
[//]: # ({x-version-update-start:bigtable-client-parent:released})
    ```
    java -jar bigtable-beam-import-1.24.0-shaded.jar export \
        --runner=dataflow \
        --project=$PROJECT_ID \
        --bigtableInstanceId=$INSTANCE_ID \
        --bigtableTableId=$TABLE_NAME \
        --destinationPath=$BUCKET_NAME/hbase_export/ \
        --tempLocation=$BUCKET_NAME/hbase_temp/ \
        --maxNumWorkers=$(expr 3 \* $CLUSTER_NUM_NODES)
    ```
[//]: # ({x-version-update-end})

## Importing to Bigtable

You can import data into Bigtable from a snapshot or from sequence files. Before you begin your
import, you must create the tables and column families in Bigtable via the [schema translation tool](https://github.com/googleapis/java-bigtable-hbase/tree/master/bigtable-hbase-1.x-parent/bigtable-hbase-1.x-tools)
or by using the `cbt` command line tool and running the following:

    cbt createtable your-table-name
    cbt createfamily your-table-name your-column-family
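For the `cbt` commands above to work without repeating flags, `cbt` can read its default project and instance from `~/.cbtrc`. A minimal sketch of that file, with placeholder IDs you would replace with your own:

```shell
# Write a minimal ~/.cbtrc so cbt knows which project and instance to target.
# "your-project-id" and "your-instance-id" are placeholders.
cat > ~/.cbtrc <<EOF
project = your-project-id
instance = your-instance-id
EOF
```

Alternatively, `cbt` accepts `-project` and `-instance` flags on each invocation.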

Once your import is completed, follow the instructions for the validator below to ensure it was successful.

Please pay attention to the cluster CPU usage and adjust the number of Dataflow workers accordingly.
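As a rule of thumb, the commands in this README size the Dataflow worker pool at roughly three workers per Bigtable node. A sketch of that arithmetic (the node count here is a placeholder):

```shell
# Rule of thumb: ~3 Dataflow workers per Bigtable node.
CLUSTER_NUM_NODES=3   # placeholder; use your cluster's actual node count
MAX_NUM_WORKERS=$(expr 3 \* $CLUSTER_NUM_NODES)
echo $MAX_NUM_WORKERS   # 9
```

If cluster CPU usage climbs too high during a job, lower this value and rerun.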

### Snapshots (preferred method)

1. Set the environment variables.
    ```
    PROJECT_ID=your-project-id
    INSTANCE_ID=your-instance-id
    CLUSTER_NUM_NODES=3
    TABLE_NAME=your-table-name
    REGION=us-central1

    BUCKET_NAME=gs://bucket-name
    SNAPSHOT_GCS_PATH="$BUCKET_NAME/hbase-migration-snap"
    SNAPSHOT_NAME=your-snapshot-name
    ```
1. Run the import.
[//]: # ({x-version-update-start:bigtable-client-parent:released})
    ```
    java -jar bigtable-beam-import-1.24.0-shaded.jar importsnapshot \
        --runner=DataflowRunner \
        --project=$PROJECT_ID \
        --bigtableInstanceId=$INSTANCE_ID \
        --bigtableTableId=$TABLE_NAME \
        --hbaseSnapshotSourceDir=$SNAPSHOT_GCS_PATH/data \
        --snapshotName=$SNAPSHOT_NAME \
        --stagingLocation=$SNAPSHOT_GCS_PATH/staging \
        --tempLocation=$SNAPSHOT_GCS_PATH/temp \
        --maxNumWorkers=$(expr 3 \* $CLUSTER_NUM_NODES) \
        --region=$REGION
    ```
[//]: # ({x-version-update-end})

### Sequence Files

1. Set the environment variables.
    ```
    PROJECT_ID=your-project-id
    INSTANCE_ID=your-instance-id
    CLUSTER_NUM_NODES=3
    CLUSTER_ZONE=us-central1-a
    TABLE_NAME=your-table-name

    BUCKET_NAME=gs://bucket-name
    ```
1. Run the import.
[//]: # ({x-version-update-start:bigtable-client-parent:released})
    ```
    java -jar bigtable-beam-import-1.24.0-shaded.jar import \
        --runner=dataflow \
        --project=$PROJECT_ID \
        --bigtableInstanceId=$INSTANCE_ID \
        --bigtableTableId=$TABLE_NAME \
        --sourcePattern="$BUCKET_NAME/hbase-export/part-*" \
        --tempLocation=$BUCKET_NAME/hbase_temp \
        --maxNumWorkers=$(expr 3 \* $CLUSTER_NUM_NODES) \
        --zone=$CLUSTER_ZONE
    ```
[//]: # ({x-version-update-end})

## Validating data

Once your snapshot or sequence file is imported, you should run the validator to
check whether any rows have mismatched data.

1. Set the environment variables.
    ```
    PROJECT_ID=your-project-id
    INSTANCE_ID=your-instance-id
    TABLE_NAME=your-table-name
    REGION=us-central1

    BUCKET_NAME=gs://bucket-name
    SNAPSHOT_GCS_PATH="$BUCKET_NAME/hbase-migration-snap"
    ```
1. Run the sync job. It will put the results into `$SNAPSHOT_GCS_PATH/sync-table/output-TIMESTAMP`.
[//]: # ({x-version-update-start:bigtable-client-parent:released})
    ```
    java -jar bigtable-beam-import-1.24.0-shaded.jar sync-table \
        --runner=dataflow \
        --project=$PROJECT_ID \
        --bigtableInstanceId=$INSTANCE_ID \
        --bigtableTableId=$TABLE_NAME \
        --outputPrefix=$SNAPSHOT_GCS_PATH/sync-table/output-$(date +"%s") \
        --stagingLocation=$SNAPSHOT_GCS_PATH/sync-table/staging \
        --hashTableOutputDir=$SNAPSHOT_GCS_PATH/hashtable \
        --tempLocation=$SNAPSHOT_GCS_PATH/sync-table/temp \
        --region=$REGION
    ```
[//]: # ({x-version-update-end})
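The sync job's output prefix includes a timestamp so that repeated validation runs don't collide. Forming it with `date +"%s"` (seconds since the epoch) looks like this, with a placeholder bucket name:

```shell
BUCKET_NAME="gs://bucket-name"   # placeholder bucket
SNAPSHOT_GCS_PATH="$BUCKET_NAME/hbase-migration-snap"
TS=$(date +"%s")                 # epoch seconds, e.g. 1600000000
OUTPUT_PREFIX="$SNAPSHOT_GCS_PATH/sync-table/output-$TS"
echo "$OUTPUT_PREFIX"
```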