SSH to Azure HDInsight Premium cluster nodes


With an HDInsight Standard cluster any user can SSH to the cluster nodes. In comparison, HDInsight Premium cluster nodes by default restrict SSH access to two groups: sudo and root. My initial assumption was that Microsoft had done this for security reasons, but then why allow the root user to log in over SSH? That is something most sysadmins disable.

HDInsight Premium cluster nodes have the following line in /etc/ssh/sshd_config:

AllowGroups  sudo root

This line states that members of the groups sudo and root (in the latter case that's effectively just the root user) are permitted to log in via SSH. If you would like to allow any user to log in via SSH, simply remove this line.

A better approach is to create a group in AD (and ensure this group is synchronised to the HDInsight cluster – this is something you must configure when you deploy the cluster) and use that instead.

There seems to be a limitation that AllowGroups does not work with AD groups other than those shown via id <username>. I suspect this behaviour is due to a limitation with winbind – when using SSSD and realmd to domain join a Linux VM, the full group membership is shown for a user. Furthermore, if your AD group names contain spaces this won't work, because the space character is used to separate the group names in the AllowGroups directive – you can partially work around this by using the asterisk wildcard character:

AllowGroups  sudo domain*users
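
A minimal sketch of making this change for an AD group named "hdinsight ssh users" (the group name is illustrative); sshd -t validates the configuration before reloading:

sudo sed -i 's/^AllowGroups.*/AllowGroups sudo hdinsight*ssh*users/' /etc/ssh/sshd_config
sudo /usr/sbin/sshd -t && sudo service ssh reload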

How to enable LZO compression on HDInsight


This blog post explains how to enable LZO compression on an HDInsight cluster.

ARM Template

You will need to modify the ARM template and, under the clusterDefinition, configurations section:

  • Add a core-site section and specify the codecs and the compression codec class
  • Add a mapred-site section to enable map output compression and set the compression codec class
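
A minimal sketch of what this could look like in the template (the codec class names assume the standard hadoop-lzo package – adjust to your installation):

"clusterDefinition": {
  "configurations": {
    "core-site": {
      "io.compression.codecs": "org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec",
      "io.compression.codec.lzo.class": "com.hadoop.compression.lzo.LzoCodec"
    },
    "mapred-site": {
      "mapreduce.map.output.compress": "true",
      "mapreduce.map.output.compress.codec": "com.hadoop.compression.lzo.LzoCodec"
    }
  }
}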

Install compression libraries on cluster nodes

You will also need to install the compression libraries on the cluster nodes – typically via a script action so they are installed on every node.

On the point of compression libraries, if you are using Snappy you will need to install the Snappy compression libraries in the same way.
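
A hedged sketch of such a script action for Ubuntu-based nodes (the package names are assumptions and vary by Ubuntu release, e.g. libsnappy1 on 14.04 vs libsnappy1v5 on 16.04 – verify against your image):

sudo apt-get update
# LZO native libraries
sudo apt-get install -y liblzo2-2 liblzo2-dev
# Snappy native libraries
sudo apt-get install -y libsnappy1 libsnappy-dev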

Displaying HDInsight cluster information at login time


This blog post describes how to display HDInsight cluster information when a user logs in via SSH.

Linux HDInsight clusters run Ubuntu, which allows you to customise the Message of the Day (MOTD) by placing scripts under /etc/update-motd.d and making them executable.

The script has been published to GitHub: https://github.com/vijayjt/AzureHDInsight/blob/master/script-actions/get-cluster-info.sh
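
For illustration, a stripped-down sketch of the mechanism (the script name and the information printed are placeholders; the published script above does the real work):

sudo tee /etc/update-motd.d/95-hdinsight-info > /dev/null <<'EOF'
#!/bin/sh
echo "Cluster node: $(hostname -f)"
echo "HDP version:  $(ls /usr/hdp 2>/dev/null | head -n 1)"
EOF
sudo chmod +x /etc/update-motd.d/95-hdinsight-info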

Azure HDInsight Premium


This blog post discusses HDInsight Premium, which is currently in preview. HDInsight Premium adds the ability to domain join HDInsight clusters and introduces Apache Ranger, which can then be used to control access to databases/tables on HDInsight.

At the time of writing the documentation for HDInsight Premium is very poor and there are a number of limitations and issues, most of which are not documented, so I hope this post will help others.

Overview

HDInsight Premium allows you to join clusters to Azure AD Domain Services (AAD DS) domains. This then allows you to use accounts in your on-premise domain (provided you are synchronising users/groups via AAD Connect and have enabled password hash synchronisation) in HDInsight. Furthermore, you can then configure role based access control for Hive using Apache Ranger.

At the time of writing HDInsight Premium is in preview and has not GA'd – this means it is not backed by a full SLA. The Premium SKU is only available for "Hadoop" clusters – which do not come with Spark. However, HDInsight Premium with Spark clusters is available in private preview to a limited number of customers.

The domain-joining feature relies on Azure AD Domain Services (AAD DS) – which provisions a Microsoft-managed read-only domain controller. Until recently it was only possible to deploy AAD DS to a classic VNET, which then required a VNET peering connection to the ARM VNET containing your HDInsight cluster (this obviously requires that your VNETs are in the same region).

AD Connect and Password Synchronisation

In order to use accounts from your on-premise domain to authenticate with HDInsight you need two things:
  • Firstly, you must use Azure AD Connect to synchronise users and groups to Azure AD
  • Secondly, you need to enable password synchronisation.
Since HDInsight Premium implements authentication using Kerberos, Azure AD Domain Services must hold the users' passwords. This in turn requires that we synchronise password hashes from the on-premise domain to our Azure AD directory.
It should be noted that:
  • Password synchronisation will apply to all users that are being synchronised to Azure AD.
  • Synchronisation traffic uses HTTPS
  • When synchronizing passwords, the plain-text version of your password is not exposed to the password synchronization feature, to Azure AD, or any of the associated services.
  • The original hash is not transmitted to Azure AD. Instead, the SHA256 hash of the original MD5 hash is transmitted. As a result, if the hash stored in Azure AD is obtained, it cannot be used in an on-premises pass-the-hash attack.
Accounts are synchronised from the on-premise Active Directory to Azure AD; the AD objects are then synchronised to the Azure AD Domain Services instance. The synchronisation from Azure AD to Azure AD Domain Services is one-way/unidirectional. Your managed domain is largely read-only except for any custom OUs you create. Therefore, you cannot make changes to user attributes, user passwords, or group memberships within the managed domain, and there is no reverse synchronisation of changes from your managed domain back to your Azure AD tenant.
  • On-premise to Azure AD synchronisation: this is usually on an hourly basis, unless you have a newer version of Azure AD Connect and have customised the synchronisation interval.
  • Azure AD to AAD DS: the documentation states this takes 20 minutes, but in my experience it is usually closer to 1 hour.
What if you don't want to synchronise the password hash (e.g. if your security department objects)? In this case you can use cloud-only users and AD groups instead.

Azure AD Domain Services

Create an Azure AD Domain Services (AAD DS) instance from the Azure portal. Once the AAD DS instance is created you will receive two IP addresses, which are the domain controllers.
Note that it may take 10-20 minutes before the AAD DS IP addresses are available.

VNET DNS

The ARM VNET that contains the HDInsight cluster and the VNET that contains the AAD DS instance will need to be reconfigured to use the two IPs as DNS servers – this is required, otherwise cluster creation will fail.
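
For illustration, reconfiguring a VNET's DNS servers with the Azure CLI might look like this (the resource names are placeholders; the IPs are the two returned by AAD DS):

az network vnet update --resource-group MyRG --name hdinsight-vnet \
  --dns-servers 10.0.0.4 10.0.0.5
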
When you create your AAD DS instance, the actual domain used will match the domain you have set as primary in Azure AD. If the primary domain is of the form <MyAADTenant>.onmicrosoft.com, then this is the domain that will be used. As we will see later, this has some implications in terms of LDAPS configuration.

Enabling SSL/TLS for AAD DS

HDInsight requires that you enable LDAPS for AAD DS. If you have a public domain configured as your primary in Azure AD then you can obtain a certificate from a public CA such as Symantec or DigiTrust. However, if your primary is using the default Microsoft-provided domain <MyAADTenant>.onmicrosoft.com, then since you don't own onmicrosoft.com you will need to use a self-signed certificate and request an exception by raising a support case with Microsoft.
Next, the SSL certificate needs to be uploaded in PFX format with the private key (you will also need the password) via the Azure portal, and Secure LDAP enabled.
Ensure that "Allow secure LDAP access over the internet" is disabled (which is the default).
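
If you do need a self-signed certificate, a hedged sketch using openssl (the wildcard CN must match your AAD DS domain; the values here are illustrative):

openssl req -x509 -newkey rsa:2048 -days 365 -nodes \
  -subj '/CN=*.mytenant.onmicrosoft.com' \
  -keyout ldaps.key -out ldaps.crt
# Bundle the certificate and private key into a PFX; the export password you
# choose here is the one you supply when uploading via the portal
openssl pkcs12 -export -out ldaps.pfx -inkey ldaps.key -in ldaps.crt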

Management Server

You cannot RDP to the two IP addresses or otherwise log on directly to the domain controllers. So how do you manage AAD DS?
The answer is to create a management Windows Server 2012 R2 VM within the VNET that contains the AAD DS instance, then join the server to the domain using an account that is a member of the "AAD DC Administrators" AD group (created when the AAD DS instance is created).
Next install the RSAT and DNS management tools.
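
A one-line sketch for this, run in an elevated PowerShell session on the management VM (feature names as on Windows Server 2012 R2):

Install-WindowsFeature -Name RSAT-ADDS, RSAT-DNS-Server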

OUs

Although the Microsoft documentation does not mention this, my recommendation is that you create an HDInsight OU and then OUs under that for each HDInsight cluster. This will make it easy to find the computer, account and SPN objects for each cluster.

Cluster Domain Join Account

When creating an HDInsight Premium cluster, you must specify a "domain account" which is used by the cluster to join the nodes to the AAD DS instance. The account will require the following permissions:
  • Permissions to join machines to the domain
  • Permissions to place the machines into the OU created for HDInsight clusters
  • Permissions to create service principals within the OU
  • Permissions to create reverse DNS entries
The Microsoft documentation appears to give an example of using an account that is a member of "AAD DC Administrators".
However, given the account used to domain join the cluster also becomes the cluster admin (e.g. in Ambari), I would strongly advise against doing this, as such an account would have full control over the AAD DS instance. Furthermore, if you have multiple clusters, e.g. dev, test and production, or one per business group, they would all have admin access to AAD DS.
Therefore a separate account should be used for each cluster, since this prevents a compromise of one cluster being used to gain access to another. Using a separate account also enables administration of clusters to be delegated to different teams.
The permissions can then be granted as follows:
  • Right-click the OU and select Delegate Control
  • Click Next
  • Click Add
  • Select the account to be used for domain joining and click OK
  • Click Next, select "Delegate the following common tasks", and select "Create, delete, and manage user accounts"
  • Click Next, then click Finish
  • From ADUC click View > Advanced Features
  • Right-click the OU and click Properties
  • Click the Security tab
  • Grant the domain join account the following permissions:
    • Read
    • Write
    • Create all child objects
    • Delete all child objects
The username (sAMAccountName) must be 15 characters or less and all lowercase – otherwise cluster provisioning using this account will fail. This is not documented by Microsoft – I had to find this out the hard way by digging through log files and looking at how Microsoft had implemented domain joined clusters. Microsoft are doing this using winbind/samba, which is where this limitation comes from (that, and a combination of compatibility with Windows 2000). It's not clear to me why Microsoft are not using SSSD and realmd instead.
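
A quick sanity check for a candidate name, reflecting those constraints (the account name is illustrative):

JOIN_ACCOUNT='hdiclusterjoin'
if [[ ${#JOIN_ACCOUNT} -le 15 && "$JOIN_ACCOUNT" =~ ^[a-z0-9]+$ ]]; then
  echo 'OK'
else
  echo 'Invalid: must be 15 characters or less and all lowercase'
fi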

DNS

A forward DNS zone is automatically created when Azure AD Domain Services is provisioned; reverse zones are not. HDInsight Premium relies upon Kerberos for authentication, which requires that reverse DNS entries exist for the nodes in the cluster. As a result we must configure (via the management server) reverse DNS zones for all the subnets that will contain HDInsight Premium clusters, and enable secure updates.
The reverse DNS zones need to be configured on /8, /16 or /24 boundaries (classless ranges are not supported directly).
You might also want to consider adding conditional forwarding for your on-premise domains if you have connectivity to them.
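
A hedged sketch, run from the management server once the DNS tools are installed (the zone, forwarder domain and IP addresses are illustrative; -ComputerName should point at one of the two AAD DS domain controller IPs):

Add-DnsServerPrimaryZone -ComputerName 10.0.0.4 -NetworkId '10.2.0.0/24' `
    -ReplicationScope 'Forest' -DynamicUpdate 'Secure'
Add-DnsServerConditionalForwarderZone -ComputerName 10.0.0.4 -Name 'corp.example.com' `
    -MasterServers 10.10.0.10 -ReplicationScope 'Forest'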

Issues and Limitations

I've summarised below the main issues and limitations that I have come across (this is based on testing with HDInsight Premium Spark clusters):
  • HDInsight Premium is in public preview – which means that it is not subject to any SLAs
  • The synchronisation lag can be quite large – in theory it should be 1 hour 20 minutes from on-premise AD to AAD DS, but in practice it is more like 2 hours. You need to keep this in mind when troubleshooting permission/access issues.
  • The documentation for HDInsight is pretty bare bones and contains mistakes/errors.
    • For example, this article https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-domain-joined-configure-use-powershell#run-the-powershell-script links to a repo in GitHub that is supposed to do the AAD DS configuration for you. However, apart from a README.md file it is an empty repo;
    • It does not explain the permissions required to domain join a cluster in enough detail e.g. on the OU, the exact DNS permissions, how to create reverse DNS zones (unless you are a DNS admin you won’t know this);
    • There are special requirements for the username of the domain join account but these are not documented anywhere.
  • If you delete a cluster it leaves behind the DNS entries (forward and reverse), computer accounts, as well as the user and service principal objects. This obviously clutters AAD DS but can also cause problems if you want to do CI/CD and the objects already exist.
  • The components that are available with HDInsight are also not well documented e.g.
    • Jupyter is currently not available – presumably because it's not trivial to integrate with Kerberos. You can use Zeppelin though.
    • The Microsoft-provided Hue script action will not work because it does not support Kerberos – a significant amount of effort would be required to add this. In light of this you would have to use Ambari Hive views.
    • Oozie is not available on the cluster either.
    • Applications are not supported – which means you cannot add edge nodes via an ARM template
  • Other things that are not documented include:
    • If you are using Azure Data Factory (ADF) then Hive activities do not work.
    • Spark activities with ADF do work, but you have to disable CSRF protection in the livy.conf configuration file (you can do this via Ambari) – this isn't a good idea from a security standpoint.
  • Ranger policies are only provided for Hive/Spark – they do not cover HDFS. I believe this is because of the limitations with Azure Storage authorisation and authentication listed here: https://hadoop.apache.org/docs/current3/hadoop-azure/index.html#Limitations

How to configure Apache Zeppelin to use LDAP Authentication on HDInsight


Apache Zeppelin supports integration with Active Directory/LDAP via the Shiro pluggable authentication module.

Configuration files

The following configuration files/settings need to be changed:

  • zeppelin-config – set zeppelin.anonymous.allowed to false; this disables anonymous access to Zeppelin.
  • zeppelin-env – the shiro_ini_content setting should be configured with the following:

[users]
# List of users with their password allowed to access Zeppelin.
# To use a different strategy (LDAP / Database / ...) check the shiro doc at http://shiro.apache.org/configuration.html#Configuration-INISections
# LDAP configuration, for user Authentication, currently tested for single Realm
[main]
activeDirectoryRealm = org.apache.zeppelin.server.ActiveDirectoryGroupRealm
activeDirectoryRealm.systemUsername = CN=<service account tbc>,CN=Users,DC=my,DC=domain,DC=com
activeDirectoryRealm.systemPassword = <not the password>
#activeDirectoryRealm.hadoopSecurityCredentialPath = jceks://user/zeppelin/zeppelin.jceks
activeDirectoryRealm.searchBase = CN=Users,DC=my,DC=domain,DC=com
activeDirectoryRealm.url = ldap://<domain controller fqdn>:389
#activeDirectoryRealm.groupRolesMap = 'tbc'
#activeDirectoryRealm.authorizationCachingEnabled = true
shiro.loginUrl = /api/login
[urls]
# anon means the access is anonymous.
# authcBasic means Basic Auth Security
# To enforce security, the authc filter is applied to all URLs:
/** = authc

The first few lines under [main] define the user account and password used to connect to the domain controller.

We then define the search base path to use when looking up users/groups, and the domain controller to connect to.

The last line enables authentication for all URLs.

You have two options for applying these configuration changes:
  • Through the Ambari web interface; or
  • At cluster deployment time via the HDInsight bootstrap configuration in an ARM template – although these configuration files are not officially listed in the Microsoft documentation, it is possible to configure them in an ARM template (in the clusterDefinition, configurations section).

The only problem is that you will most likely not want to add the password to the ARM template, so you could add the password via the Ambari web interface post-deployment, or inject it into the template at runtime.

How to create user specific databases on HDInsight Standard


This post describes a way of creating user specific databases on HDInsight Standard. This uses a similar technique to that described in a previous post.

Overview

The script creates databases using beeline, taking the list of database names from a CSV file. Since we are creating user specific databases, the database names should match the usernames.
  • First create a CSV file
    • The first line should contain the header name dbname
    • The subsequent lines should contain the database names
  • Store the CSV file on the default Azure Storage account
  • Attach the Storage account to the HDInsight cluster
  • Deploy the cluster with an ARM template that uses a custom script
  • The script
    • Determines the cluster name
    • Based on the cluster name it looks for a file named <clustername>-user-db-list.csv on the storage account
    • Copies the file to the node and iterates through its lines to create the databases (a sketch of this loop follows below)
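
A minimal sketch of that loop (the file name, JDBC URL and credentials are illustrative – HDInsight's HiveServer2 is typically reachable on the head node over HTTP transport):

CSVFILE='mycluster-user-db-list.csv'   # i.e. <clustername>-user-db-list.csv
# Skip the header line, then create one database per line
tail -n +2 "$CSVFILE" | while IFS=',' read -r dbname; do
  beeline -u 'jdbc:hive2://headnodehost:10001/;transportMode=http' \
    -n admin -p "$CLUSTER_PASSWORD" \
    -e "CREATE DATABASE IF NOT EXISTS ${dbname};"
done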

Future improvements

If we wanted to create user specific databases but use a different name for the database, the CSV and script could be modified to use two columns: the first the database name and the second the owner of the database.
The script assumes the storage account that contains the CSV file has the string "artifacts" in its name; the script could and should be updated to take the storage account and container name as parameters.

Modifying the PAM Configuration on HDInsight Standard


As mentioned in a previous blog post on HDInsight standard, Microsoft modify the PAM configuration (at least this is the case on HDInsight 3.5) such that when you create a user and try to set the password you are asked to set the password twice.

The gist below can be used to reset the PAM configuration. In the code below the various PAM configuration files have been gzipped and base64 encoded.

This technique of using gzip compressed and base64 encoded files is very useful when running script actions on HDInsight, or more generally when configuring Linux VMs via custom script extensions.
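
A minimal sketch of the encode/decode round trip (the variable content is a truncated placeholder; the real gist embeds one variable per PAM file):

# On a machine with known-good files: compress and base64 encode
gzip -c common-auth | base64 -w 0 > common-auth.b64

# In the script action: decode and decompress back into place
COMMON_AUTH_B64='H4sIAAAA...'   # paste the encoded content here
echo "$COMMON_AUTH_B64" | base64 -d | gunzip > /etc/pam.d/common-auth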

The code creates files:

  • common-account
  • common-auth
  • common-password
  • common-session
  • common-session-noninteractive

These should replace the ones in /etc/pam.d/

HDInsight: Creating Local OS and Ambari users via the REST API


HDInsight is a semi-managed Hadoop cluster offering on the Microsoft Azure cloud. Although the Standard version isn't geared towards multiple users from a security perspective, I recently had to figure out a way to create local users across the cluster at cluster build time. This blog post describes one way you could do this.

Overview

The way we will create users on the cluster at boot time is:

  • Create a CSV file containing usernames,user and group ids, shell etc.
  • Store the CSV file on the default Azure Storage account
  • Attach the Storage account to the HDInsight cluster
  • Deploy the cluster with an ARM template that uses a custom script which creates the user accounts
  • The script
    • Determines the cluster name
    • Based on the cluster name it looks for a file named <clustername>-user-list.csv on the storage account
    • Copies the file to the node and iterates through the lines in the file and:
      • Creates local OS users;
      • Create local OS groups for the users;
      • If the user is an admin user it adds them to sudoers;
      • Creates user accounts in Ambari – note that in an HDInsight Standard cluster the user accounts in Ambari are separate from the OS level accounts;
      • Creates Pig and Oozie views if they do not already exist;
      • Adds the user to either the
        • clusteruser group, which will have read only access to cluster stats/configuration and access to Hive views etc. in Ambari, or the
        • clusteradministrator group, which will have full access to manage everything through Ambari;
      • Grants access to various Ambari views to the aforementioned groups

The CSV file containing the user details includes the uid; we do this to ensure the users have the same uid across all nodes in the cluster.

The creation of Ambari users, groups, checking membership of the Ambari groups and creating views makes heavy use of the Ambari REST API.

At the moment the script lists all the storage accounts associated with the cluster, finds one whose name contains the string "artifacts", and looks on this storage account for the user list CSV file. The script needs to be improved by making the storage account and container names parameters.

I have uploaded the script to GitHub here: https://github.com/vijayjt/AzureHDInsight/blob/master/script-actions/create-local-users.sh

Ambari REST API

As mentioned the script makes heavy use of the Ambari REST API.

  • Checking a user exists
    • We do this by calling the REST API endpoint http://${ACTIVEAMBARIHOST}:8080/api/v1/users
    • Then iterating through the users returned to see if the username is in the list
    • Lines 2 – 16 in the gist shown below provide some example code for checking if a user exists in Ambari (this is an extract from the full script mentioned above).
  • Check if a user is a member of an Ambari Group
    • We call the REST API endpoint http://${ACTIVEAMBARIHOST}:8080/api/v1/groups/${GROUP_TO_CHECK}/members
    • To obtain a list of users that are a member of the specified group
    • Then we iterate through the list to see if the user is in the list
  • Adding a user to Ambari
    • As shown in line 88 of the gist, we make a POST request to the endpoint http://${ACTIVEAMBARIHOST}:8080/api/v1/users with a JSON body that contains the username and password
  • Adding a group to Ambari
    • As shown in line 91 of the gist, we make a POST request to the endpoint http://${ACTIVEAMBARIHOST}:8080/api/v1/groups with a JSON body that contains the group name
  • Adding a user to a group in Ambari
    • As shown in line 94 of the gist, we make a POST request to the endpoint http://${ACTIVEAMBARIHOST}:8080/api/v1/groups/${ambarigroup}/members with a JSON body that contains the username and group name
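
Putting a couple of these endpoints together, a minimal sketch of the check-then-create flow (the credentials, variables and grep-based JSON matching are simplifications of what the full script does):

AMBARI="http://${ACTIVEAMBARIHOST}:8080/api/v1"
NEWUSER='alice'
# Check whether the user already exists
if curl -s -u "admin:${AMBARI_ADMIN_PW}" "${AMBARI}/users" | grep -q "\"user_name\" : \"${NEWUSER}\""; then
  echo "User ${NEWUSER} already exists in Ambari"
else
  # Create the user (Ambari requires the X-Requested-By header on POST requests)
  curl -s -u "admin:${AMBARI_ADMIN_PW}" -H 'X-Requested-By: ambari' -X POST \
    -d "{\"Users/user_name\":\"${NEWUSER}\",\"Users/password\":\"${NEWUSER_PW}\",\"Users/active\":true}" \
    "${AMBARI}/users"
fi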

 

Important considerations

PAM Configuration

It should be noted that Microsoft do modify the PAM configuration files, which can lead to an issue where the passwd <username> command prompts you for the password twice, because they have added Kerberos-related configuration items to PAM. I suspect they made this change for Azure HDInsight Premium, which supports domain joined clusters, and accidentally included it on Standard clusters as well. I will write a separate post on how to change the PAM configuration.

 

Force password change

When the account is created you should force the user to change their password shortly afterwards – one way of doing this is shown below.
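
On Ubuntu you can expire the password immediately, which forces a change at the next login (the username is a placeholder):

sudo chage -d 0 someuser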

Security of user list file on Azure storage

The user passwords are stored in plaintext on the Azure storage account. This is one of the reasons we delete the file after cluster provisioning. If you were to leave it on the storage account, any user on the cluster would be able to view it using hdfs commands.

If you persisted the script and you later scale the cluster from the Azure portal to add additional nodes, you will need to reinstate the file, otherwise the local OS users will not be created on the new nodes.
It goes without saying, but you should ensure storage account encryption is enabled and the container is private.

As an alternative you could look to distribute SSH keys instead of using password authentication.

What other options are there for provisioning users on HDInsight Standard?

If you are using a configuration management tool such as Chef, Ansible or Puppet, you could create the accounts via one of these tools and also use it to distribute SSH keys so that no passwords are involved.

If you do take this approach you need to be careful that there are no scripts that rely on the Chef/Ansible code executing first to create the users, otherwise the script actions may fail or you may end up with a race condition.

How to create an Azure AD Application and Service Principal that uses certificate authentication


Creating Azure AD Applications and Service Principals that use certificate based authentication is not quite as straightforward as you might expect.

The following article provides the instructions on how to do this https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-group-authenticate-service-principal#create-service-principal-with-self-signed-certificate

However, what if you want to use multiple certificates via the KeyCredentials parameter to New-AzureRmAdApplication? In this case you might guess from the article above that you could create an array of objects of type:

Microsoft.Azure.Commands.Resources.Models.ActiveDirectory.PSADKeyCredential

The problem is that if you have a version of the Azure PowerShell module newer than 4.2.1, the object will not have a type property, as per this issue: https://github.com/Azure/azure-powershell/issues/4491

Assuming you don't want to downgrade to version 4.2.1, how do you achieve this? Well, the issue mentions that the correct way of doing this is to use the New-AzureRmAdAppCredential cmdlet, as shown in the example code below:
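
A minimal sketch of that approach (the application display name and certificate paths are illustrative):

# Load each certificate and attach it to the existing AD application
$app = Get-AzureRmADApplication -DisplayNameStartWith 'MyApp'
foreach ($certPath in @('C:\certs\cert1.cer', 'C:\certs\cert2.cer')) {
    $cert = New-Object System.Security.Cryptography.X509Certificates.X509Certificate2($certPath)
    $certValue = [System.Convert]::ToBase64String($cert.GetRawCertData())
    New-AzureRmADAppCredential -ApplicationId $app.ApplicationId `
        -CertValue $certValue -StartDate $cert.NotBefore -EndDate $cert.NotAfter
}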

Checkpoint firewalls vs. Azure Network Security Groups (NSGs)


This post compares Azure Network Security Groups (NSGs) with virtual firewall appliances, specifically Checkpoints.

Azure Network Security Groups

Azure Network Security Groups are the built-in firewalling mechanism in Azure; they allow you to control traffic to the virtual network (VNET) that contains your IaaS VMs (and potentially PaaS infrastructure such as App Service Environments – ASEs).

Network security rules are like firewall rules; they consist of:

  • A name for the rule
  • A description
  • The direction of the rule, e.g. Inbound if it applies to traffic coming into the VNET/subnet or Outbound if it applies to traffic leaving a VNET/subnet
  • An action: allow or deny
  • Source address range
  • Source port
  • Destination address range
  • Destination port

The network security rules are then grouped together into NSGs. These NSGs are in turn either applied at the subnet level or to individual Network Interfaces (NICs) associated with VMs.

Even if you have not created any NSGs there are some default built-in system rules; these essentially block all inbound traffic (except from the Azure load balancer), and allow all traffic within a VNET and all outbound traffic.
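
For illustration, a hedged sketch using the Azure CLI (the resource and rule names are placeholders): create an NSG, add an inbound rule allowing HTTPS from the internet, and associate the NSG with a subnet:

az network nsg create --resource-group MyRG --name web-nsg
az network nsg rule create --resource-group MyRG --nsg-name web-nsg --name allow-https \
  --priority 100 --direction Inbound --access Allow --protocol Tcp \
  --source-address-prefix Internet --source-port-range '*' \
  --destination-address-prefix '*' --destination-port-range 443
az network vnet subnet update --resource-group MyRG --vnet-name my-vnet --name web-subnet \
  --network-security-group web-nsg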

Checkpoint firewall architecture on Azure

Checkpoint firewalls on Azure are virtual machines running the Checkpoint software. However, to make use of Checkpoints you need a number of other Azure services in place – and if you also want high availability, you need a few more:

  • Azure VMs (the Checkpoints are deployed from the Azure Marketplace) – two if you want high availability
  • Azure Load Balancer (if you’re deploying them in a high availability configuration), with NAT rules
  • Azure Route Tables to direct traffic through the Checkpoints
  • An Azure AD Application and Service Principal – used by the Checkpoints to modify the Load Balancer configuration to direct traffic to the correct Checkpoint in the event the primary fails, and to modify the cluster public IP association and the route tables

Comparison

Complexity
  • NSGs: Low – simply define rules in the portal, via the Azure PowerShell module or ARM templates.
  • Checkpoints: High – the Checkpoints depend on a number of other Azure services. When there are issues it can be difficult to reason about where the problem lies.

Cost
  • NSGs: Low – there is no additional cost to using NSGs.
  • Checkpoints: High – you pay for the VMs, Checkpoint licensing (including any blades that you need), the Azure Load Balancer, the storage used by the Checkpoint VMs and the public IPs.

Management overhead
  • NSGs: Low – there is no infrastructure to manage.
  • Checkpoints: High – you have to manage the VMs, update/patch the Checkpoint software, and maintain the route tables and load balancer. If you have a lot of VNETs and/or on-premise ranges, managing the route tables (and static routes on the Checkpoints) can be a headache.

Scalability
  • NSGs: Relatively easy to scale – the only limits are the number of NSGs and rules per NSG. The only problem is that you can easily hit those limits if you have to implement a default deny rule on outbound traffic and then whitelist the Azure IP address ranges.
  • Checkpoints: Difficult to scale – you are limited to two Checkpoints per cluster. In addition, the Checkpoints are in an active-passive configuration, so only one handles traffic while the other sits idle burning cash. Furthermore, since you cannot scale out, the best you can do is scale up. On the plus side, you don't need to worry about limits on the number of rules, so you can easily whitelist the Azure IP ranges.

Features
  • NSGs: Currently limited to traditional firewall rules. If you want other capabilities such as a Web Application Firewall (WAF), you would need to use another service such as the WAF tier of the Azure Application Gateway.
  • Checkpoints: You can use all the features (blades) that Checkpoints provide, such as WAF, IPS etc.

Automation
  • NSGs: Highly automatable – you can automate the management of rules via PowerShell and/or ARM templates.
  • Checkpoints: Automating deployments is far more complex since it involves VMs, a load balancer and route tables. Furthermore, when it comes to implementing rules, as far as I am aware there is no API or PowerShell module for Checkpoints – the closest thing available is the dbedit command line tool.

High availability
  • NSGs: Comes for free – it's a distributed service.
  • Checkpoints: HA is far more complex – and complex systems are harder to make highly available. The route tables and static routes on the Checkpoints can make this a fragile solution.

Disaster recovery
  • NSGs: Microsoft are effectively responsible for this. That said, you can keep your NSG rules in version control, so should you make a mistake it is easy to roll back (and identify the breaking change, as well as have an audit chain of what was changed).
  • Checkpoints: There is no real DR solution – while your Checkpoint management servers hold most of the configuration (so you can redeploy the config), if you lose a Checkpoint VM you have to rebuild it and reconfigure the cluster, which requires a lot of manual effort (you can automate some of this, at least on the Azure side, but not everything).

It should be noted that there were a number of announcements at Ignite 2017 that will simplify NSG rules, such as Application Security Groups, Service Tags and Augmented Rules.

The most common reasons I hear why an organisation wants to use Checkpoints (or another virtual firewall appliance for that matter) in place of NSGs are:

Skills: we already use Checkpoints on-premise so it’s easier to manage those from a skills perspective

Given the way that Checkpoints work in Azure today, I think this is a fallacy. While it's true that your network or security team (or whichever team manages your firewalls on-premise) will be familiar with how to configure rules on Checkpoints, there are some fundamental differences – namely, they are unlikely to be familiar with:

  • Azure AD Applications
  • Azure Load Balancer
  • Azure Route Tables
  • Managing VMs on Azure

Security/features are better with Checkpoints

This is mostly true, in the sense that you can bolt on WAF or IDS/IPS capabilities to a Checkpoint and manage them through a "single pane of glass". NSGs are (currently – I think this will change) more rudimentary in comparison.

That said, you are in a sense increasing your attack surface, as you must manage the Checkpoint VMs. In addition, because Checkpoints on Azure have so many moving parts, in my view it's far easier to make mistakes and create a vulnerability/security risk – such as misconfiguring your route table so that traffic is not filtered by your Checkpoints. You also must safeguard the Azure AD application credentials (which are stored in cleartext on the Checkpoints), as they are used to modify the route tables, load balancer and public IP associations.

NSGs are hard to troubleshoot as it’s provided as a service

Since Checkpoint rules are typically managed through a management server from a GUI, it is believed this makes it easier to troubleshoot issues. This is true in that it's easier to capture traffic going through the firewall and determine if the firewall is accepting or denying traffic. However, things are not so simple in the real world – if you have issues with return traffic, e.g. route tables, it becomes far more complex. You can't easily troubleshoot as you normally would, because you cannot install the Network Watcher agent on the Checkpoints as they are third party appliances.

NSGs used to be hard to troubleshoot, but now you can set up NSG logs to be stored in Azure Storage, look at the effective security rules applied to NICs to determine if traffic is allowed, and use Network Watcher to capture traffic.

I can’t meet audit requirements with NSG

Before the advent of NSG logs and Network Watcher this used to be true, but not any more – with these solutions you can retain NSG logs for audit purposes and meet other requirements (such as feeding them into a SIEM or IDS/IPS).

You can also export logs (yes, this does require more work than it would with Checkpoints) to your SIEM or log analytics tool; for example, NSG logs / Network Watcher data can be fed into Splunk.

Summary

This is not to say that Checkpoints on Azure are not a good solution – what I am saying is that you need to understand the trade-offs you are making in choosing either NSGs or a virtual firewall appliance such as a Checkpoint.

The idea of perimeter security is slowly giving way to more modern approaches that apply network security not just at the network edge, but via policies that apply at the node (e.g. server, container) level – similar to the network security policies available in the containerisation world. In fact, in the cloud this model makes far more sense than the old perimeter security model.