This blog post explains how to enable LZO compression on an HDInsight cluster.
ARM Template
You will need to modify the ARM template. In the configurations section under clusterDefinition:
- Add a core-site section that lists the available compression codecs and sets the LZO codec class
- Add a mapred-site section that enables map output compression and sets its compression codec class
"properties": {
  "clusterVersion": "[parameters('clusterVersion')]",
  "osType": "Linux",
  "clusterDefinition": {
    "kind": "spark",
    "configurations": {
      "gateway": {
        "restAuthCredential.isEnabled": true,
        "restAuthCredential.username": "[parameters('clusterLoginUserName')]",
        "restAuthCredential.password": "[parameters('clusterLoginPassword')]"
      },
      "core-site": {
        "io.compression.codecs": "org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.SnappyCodec,com.hadoop.compression.lzo.LzopCodec",
        "io.compression.codec.lzo.class": "com.hadoop.compression.lzo.LzoCodec"
      },
      "mapred-site": {
        "mapreduce.map.output.compress": "true",
        "mapreduce.map.output.compression.codec": "com.hadoop.compression.lzo.LzoCodec"
      },
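Once the cluster is deployed, it can be useful to confirm that the LZO codec actually ended up in the codec list. The following is a minimal sketch, not part of the original setup: the helper name has_codec is hypothetical, and the sample value is simply the io.compression.codecs string from the template above. On a cluster node the live value could be read with hdfs getconf instead, as noted in the comment.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the comma-separated codec list in $1
# contains the exact class name given in $2.
has_codec() {
  # Strip spaces, then wrap the list in commas so only whole names match.
  _list=",$(printf '%s' "$1" | tr -d ' '),"
  case "$_list" in
    *",$2,"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Sample value copied from the template above. On a cluster node you could
# read the deployed value instead (assumes the hdfs CLI is on the PATH):
#   codecs="$(hdfs getconf -confKey io.compression.codecs)"
codecs="org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.SnappyCodec,com.hadoop.compression.lzo.LzopCodec"

if has_codec "$codecs" "com.hadoop.compression.lzo.LzopCodec"; then
  echo "LzopCodec is registered"
else
  echo "LzopCodec is missing"
fi
```

Wrapping the list in commas avoids false positives from class names that merely share a prefix with the one being checked.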
Install compression libraries on cluster nodes
You will also need to install the LZO compression libraries on each cluster node:
apt install -y liblzo2-2 liblzo2-dev hadooplzo hadoop-lzo hadooplzo-native
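If the install has to run on several nodes, the command above could be wrapped in a small script. This is only a sketch under stated assumptions: the names run and install_lzo and the DRY_RUN variable are hypothetical, and apt-get is used in place of apt because it is the interface intended for scripts. The package list is the one from the command above.

```shell
#!/bin/sh
# Hypothetical wrapper around the install command from this post.
# DRY_RUN defaults to 1, so the script only prints what it would run;
# set DRY_RUN=0 (and run as root) to actually install the packages.
DRY_RUN="${DRY_RUN:-1}"

# Package list taken from the apt command above.
LZO_PACKAGES="liblzo2-2 liblzo2-dev hadooplzo hadoop-lzo hadooplzo-native"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

install_lzo() {
  # Left unquoted on purpose so the list splits into separate arguments.
  run apt-get install -y $LZO_PACKAGES
}

install_lzo
```

The dry-run default is a design choice for the sketch: it lets you check the exact command on one node before running it for real everywhere.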
If you are using Snappy, you will also need to install the Snappy compression libraries:
apt install -y libsnappy1 libsnappy-dev