# HBase Sequence Files to Cloud Bigtable using Beam

This folder contains tools to support importing and exporting HBase data to
Google Cloud Bigtable using Cloud Dataflow.

## Setup

To use the tools in this folder, you can download them from the Maven repository, or
you can build them yourself with Maven.

### Download the jars

Download [the import/export jar](https://search.maven.org/artifact/com.google.cloud.bigtable/bigtable-beam-import), which bundles all required dependencies.

### Build the jars yourself

Go to the top-level directory, build the repo,
and then return to this subdirectory.

```
cd ../../
mvn clean install -DskipTests=true
cd bigtable-dataflow-parent/bigtable-beam-import
```

***
# Tools

## Data export pipeline

You can export data into a snapshot or into sequence files. If you're migrating
your data from HBase to Bigtable, using snapshots is the preferred method.

### Exporting snapshots from HBase

Perform these steps from a Unix shell on an HBase edge node.

1. Set the environment variables.
    ```
    TABLE_NAME=your-table-name
    SNAPSHOT_NAME=your-snapshot-name
    SNAPSHOT_EXPORT_PATH=/hbase-migration-snap
    BUCKET_NAME="gs://bucket-name"

    NUM_MAPPERS=16
    ```
1. Take the snapshot.
    ```
    echo "snapshot '$TABLE_NAME', '$SNAPSHOT_NAME'" | hbase shell -n
    ```
1. Export the snapshot.
    ```
    hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot $SNAPSHOT_NAME \
        -copy-to $BUCKET_NAME$SNAPSHOT_EXPORT_PATH/data -mappers $NUM_MAPPERS
    ```
1. Create hashes for the table to be used during the data validation step.
   [Visit the HBase documentation for more information on each parameter](http://hbase.apache.org/book.html#_step_1_hashtable).
    ```
    hbase org.apache.hadoop.hbase.mapreduce.HashTable --batchsize=10 --numhashfiles=10 \
        $TABLE_NAME $BUCKET_NAME$SNAPSHOT_EXPORT_PATH/hashtable
    ```
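The two commands above write to sibling paths under the same bucket. Note that `$BUCKET_NAME$SNAPSHOT_EXPORT_PATH` concatenates with no separator, so the leading `/` in `SNAPSHOT_EXPORT_PATH` matters. A quick sanity check using the sample values from step 1:

```shell
BUCKET_NAME="gs://bucket-name"
SNAPSHOT_EXPORT_PATH=/hbase-migration-snap

# ExportSnapshot copies the table data here:
echo "$BUCKET_NAME$SNAPSHOT_EXPORT_PATH/data"       # gs://bucket-name/hbase-migration-snap/data
# HashTable writes the validation hashes here:
echo "$BUCKET_NAME$SNAPSHOT_EXPORT_PATH/hashtable"  # gs://bucket-name/hbase-migration-snap/hashtable
```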

### Exporting sequence files from HBase

1. On your HDFS cluster, set the environment variables and create the export directory.
    ```
    TABLE_NAME="my-new-table"
    EXPORTDIR=/usr/[USERNAME]/hbase-${TABLE_NAME}-export
    hadoop fs -mkdir -p ${EXPORTDIR}
    MAXVERSIONS=2147483647
    ```
1. On an edge node that has the HBase classpath configured, run the export commands.
    ```
    cd $HBASE_HOME
    bin/hbase org.apache.hadoop.hbase.mapreduce.Export \
        -Dmapred.output.compress=true \
        -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
        -Dhbase.client.scanner.caching=100 \
        -Dmapred.map.tasks.speculative.execution=false \
        -Dmapred.reduce.tasks.speculative.execution=false \
        $TABLE_NAME $EXPORTDIR $MAXVERSIONS
    ```

### Exporting snapshots from Bigtable

Exporting HBase snapshots from Bigtable is not supported.

### Exporting sequence files from Bigtable

1. Set the environment variables.
    ```
    PROJECT_ID=your-project-id
    INSTANCE_ID=your-instance-id
    CLUSTER_NUM_NODES=3
    TABLE_NAME=your-table-name

    BUCKET_NAME=gs://bucket-name
    ```
1. Run the export.
[//]: # ({x-version-update-start:bigtable-client-parent:released})
    ```
    java -jar bigtable-beam-import-1.24.0-shaded.jar export \
        --runner=dataflow \
        --project=$PROJECT_ID \
        --bigtableInstanceId=$INSTANCE_ID \
        --bigtableTableId=$TABLE_NAME \
        --destinationPath=$BUCKET_NAME/hbase_export/ \
        --tempLocation=$BUCKET_NAME/hbase_temp/ \
        --maxNumWorkers=$(expr 3 \* $CLUSTER_NUM_NODES)
    ```
[//]: # ({x-version-update-end})

## Importing to Bigtable

You can import data into Bigtable from a snapshot or from sequence files. Before you begin your
import, you must create the tables and column families in Bigtable via the [schema translation tool](https://github.com/googleapis/java-bigtable-hbase/tree/master/bigtable-hbase-1.x-parent/bigtable-hbase-1.x-tools)
or by using the `cbt` command line tool and running the following:

    cbt createtable your-table-name
    cbt createfamily your-table-name your-column-family
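For the `cbt` commands above to work without repeating flags, `cbt` can read its default project and instance from `~/.cbtrc`. A minimal sketch of that file, with placeholder IDs you would replace with your own:

```shell
# Write a minimal ~/.cbtrc so cbt knows which project and instance to target.
# "your-project-id" and "your-instance-id" are placeholders.
cat > ~/.cbtrc <<EOF
project = your-project-id
instance = your-instance-id
EOF
```

Alternatively, `cbt` accepts `-project` and `-instance` flags on each invocation.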

Once your import is completed, follow the instructions for the validator below to ensure it was successful.

Please pay attention to the cluster CPU usage and adjust the number of Dataflow workers accordingly.
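As a rule of thumb, the commands in this README size the Dataflow worker pool at roughly three workers per Bigtable node. A sketch of that arithmetic (the node count here is a placeholder):

```shell
# Rule of thumb: ~3 Dataflow workers per Bigtable node.
CLUSTER_NUM_NODES=3   # placeholder; use your cluster's actual node count
MAX_NUM_WORKERS=$(expr 3 \* $CLUSTER_NUM_NODES)
echo $MAX_NUM_WORKERS   # 9
```

If cluster CPU usage climbs too high during a job, lower this value and rerun.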

### Snapshots (preferred method)

1. Set the environment variables.
    ```
    PROJECT_ID=your-project-id
    INSTANCE_ID=your-instance-id
    CLUSTER_NUM_NODES=3
    TABLE_NAME=your-table-name
    REGION=us-central1

    BUCKET_NAME=gs://bucket-name
    SNAPSHOT_GCS_PATH="$BUCKET_NAME/hbase-migration-snap"
    SNAPSHOT_NAME=your-snapshot-name
    ```
1. Run the import.
[//]: # ({x-version-update-start:bigtable-client-parent:released})
    ```
    java -jar bigtable-beam-import-1.24.0-shaded.jar importsnapshot \
        --runner=DataflowRunner \
        --project=$PROJECT_ID \
        --bigtableInstanceId=$INSTANCE_ID \
        --bigtableTableId=$TABLE_NAME \
        --hbaseSnapshotSourceDir=$SNAPSHOT_GCS_PATH/data \
        --snapshotName=$SNAPSHOT_NAME \
        --stagingLocation=$SNAPSHOT_GCS_PATH/staging \
        --tempLocation=$SNAPSHOT_GCS_PATH/temp \
        --maxNumWorkers=$(expr 3 \* $CLUSTER_NUM_NODES) \
        --region=$REGION
    ```
[//]: # ({x-version-update-end})

### Sequence Files

1. Set the environment variables.
    ```
    PROJECT_ID=your-project-id
    INSTANCE_ID=your-instance-id
    CLUSTER_NUM_NODES=3
    CLUSTER_ZONE=us-central1-a
    TABLE_NAME=your-table-name

    BUCKET_NAME=gs://bucket-name
    ```
1. Run the import.
[//]: # ({x-version-update-start:bigtable-client-parent:released})
    ```
    java -jar bigtable-beam-import-1.24.0-shaded.jar import \
        --runner=dataflow \
        --project=$PROJECT_ID \
        --bigtableInstanceId=$INSTANCE_ID \
        --bigtableTableId=$TABLE_NAME \
        --sourcePattern="$BUCKET_NAME/hbase-export/part-*" \
        --tempLocation=$BUCKET_NAME/hbase_temp \
        --maxNumWorkers=$(expr 3 \* $CLUSTER_NUM_NODES) \
        --zone=$CLUSTER_ZONE
    ```
[//]: # ({x-version-update-end})

## Validating data

Once your snapshot or sequence file is imported, you should run the validator to
check whether any rows have mismatched data.

1. Set the environment variables.
    ```
    PROJECT_ID=your-project-id
    INSTANCE_ID=your-instance-id
    TABLE_NAME=your-table-name
    REGION=us-central1

    BUCKET_NAME=gs://bucket-name
    SNAPSHOT_GCS_PATH="$BUCKET_NAME/hbase-migration-snap"
    ```
1. Run the sync job. It will put the results into `$SNAPSHOT_GCS_PATH/sync-table/output-TIMESTAMP`.
[//]: # ({x-version-update-start:bigtable-client-parent:released})
    ```
    java -jar bigtable-beam-import-1.24.0-shaded.jar sync-table \
        --runner=dataflow \
        --project=$PROJECT_ID \
        --bigtableInstanceId=$INSTANCE_ID \
        --bigtableTableId=$TABLE_NAME \
        --outputPrefix=$SNAPSHOT_GCS_PATH/sync-table/output-$(date +"%s") \
        --stagingLocation=$SNAPSHOT_GCS_PATH/sync-table/staging \
        --hashTableOutputDir=$SNAPSHOT_GCS_PATH/hashtable \
        --tempLocation=$SNAPSHOT_GCS_PATH/sync-table/temp \
        --region=$REGION
    ```
[//]: # ({x-version-update-end})
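The sync job's output prefix includes a timestamp so that repeated validation runs don't collide. Forming it with `date +"%s"` (seconds since the epoch) looks like this, with a placeholder bucket name:

```shell
BUCKET_NAME="gs://bucket-name"   # placeholder bucket
SNAPSHOT_GCS_PATH="$BUCKET_NAME/hbase-migration-snap"
TS=$(date +"%s")                 # epoch seconds, e.g. 1600000000
OUTPUT_PREFIX="$SNAPSHOT_GCS_PATH/sync-table/output-$TS"
echo "$OUTPUT_PREFIX"
```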