Hadoop compression: This is the process of compressing data stored in the Hadoop Distributed File System
(HDFS) to reduce storage space and improve processing performance. Compression is especially
important in big data environments, where storage and processing requirements can quickly become
prohibitively expensive.
Hadoop provides a number of built-in compression codecs, including gzip, bzip2, and Snappy, that can
be used to compress and decompress data. In addition, Hadoop allows users to create custom
compression codecs if they have specific compression requirements.
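Before the command-line example, here is a minimal Java sketch of the codec API itself: it writes a gzip-compressed file to HDFS. The class name GzipWriteExample and the path /user/hadoop/demo/sample.gz are illustrative placeholders, not part of any standard example.

import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class GzipWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Instantiate the gzip codec; any CompressionCodec implementation works here.
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

        // Placeholder output path; the .gz extension matches the codec.
        Path outPath = new Path("/user/hadoop/demo/sample.gz");

        // Wrap the raw HDFS stream in a compressing stream and write through it.
        try (OutputStream out = codec.createOutputStream(fs.create(outPath))) {
            out.write("hello, compressed HDFS\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}

The same pattern works for the other built-in codecs by swapping the codec class, for example BZip2Codec or SnappyCodec.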
Here is an example of how to apply gzip compression in a Hadoop Streaming MapReduce job:
1. First, we need to create a Hadoop input directory and copy the input data into it:
$ hadoop fs -mkdir input
$ hadoop fs -put /path/to/data.txt input/
2. Next, we run a MapReduce job that uses gzip compression on the input data. Here is an example
command to do this:
$ hadoop jar /path/to/hadoop-streaming.jar \
-D mapreduce.map.output.compress=true \
-D mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.GzipCodec \
-input input \
-output output \
-mapper mapper.py \
-reducer reducer.py
In this command, we use the -D option to set two configuration properties:
mapreduce.map.output.compress: This property enables compression for the intermediate
data produced by the mapper.
mapreduce.map.output.compress.codec: This property specifies the compression codec to use,
in this case the gzip codec (org.apache.hadoop.io.compress.GzipCodec).
The rest of the command is standard MapReduce job configuration, including the input and
output directories and the mapper and reducer scripts. The same two properties can also be
set programmatically in a job driver, as sketched after this list.
3. After the job completes, we can view the output data using the following command:
$ hadoop fs -cat output/part-00000
Because only the intermediate (map output) data was compressed, the final output files are plain text
and hadoop fs -cat prints them directly. If job output compression were also enabled (via
mapreduce.output.fileoutputformat.compress), the output files would carry a .gz extension and would
need to be piped through zcat to decompress them.
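For comparison, here is a sketch of how the same two properties could be set programmatically in a job driver instead of with -D options. The class name CompressedJobDriver is a placeholder, and no mapper or reducer classes are set, so Hadoop falls back to its identity classes; the point is only to show where the compression settings live.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Same settings as the -D options above, expressed in code.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                      GzipCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compression demo");
        job.setJarByClass(CompressedJobDriver.class);

        FileInputFormat.addInputPath(job, new Path("input"));
        FileOutputFormat.setOutputPath(job, new Path("output"));

        // Compressing the final job output is a separate setting from map
        // output compression; uncomment to enable it as well:
        // FileOutputFormat.setCompressOutput(job, true);
        // FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}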
Here is an outline of the steps involved in Hadoop compression:
1. Input data is stored in HDFS in uncompressed form.
2. The MapReduce job is configured to enable compression and specify the compression codec to
be used. This can be done using the following configuration properties:
mapreduce.map.output.compress: Enables compression for the intermediate data
produced by the mapper.
mapreduce.map.output.compress.codec: Specifies the compression codec to use.
These properties can be set with -D options on the hadoop jar command line, as shown above, or in the job's driver code.
3. The input data is processed by the mapper, which produces intermediate data in uncompressed
form.
4. The intermediate data is passed through a compression stage, where it is compressed using the
specified codec.
5. The compressed data is shuffled to the reducer, and the framework decompresses it transparently
before the reducer processes it. (The reduce stage itself is optional for map-only jobs.) A client-side
analogue of this decompression is sketched after this list.
6. The output data is stored in HDFS in uncompressed form, unless job output compression is also enabled (for example via mapreduce.output.fileoutputformat.compress).
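To make the decompression side concrete, here is a minimal Java sketch that reads a compressed file back from HDFS. CompressionCodecFactory chooses the codec from the file extension (.gz, .bz2, .snappy), which mirrors how Hadoop's input formats detect compressed files; the path /user/hadoop/demo/sample.gz is again a placeholder.

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CompressedReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Placeholder path to a gzip-compressed file in HDFS.
        Path inPath = new Path("/user/hadoop/demo/sample.gz");

        // The factory maps the .gz extension to GzipCodec; it returns null
        // for files with no recognized compression extension.
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(inPath);

        InputStream raw = fs.open(inPath);
        InputStream in = (codec == null) ? raw : codec.createInputStream(raw);

        // Read and print the decompressed lines.
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}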