Friday, December 2, 2016

Azure HDInsight Apps & Domain-Joined Premium HDInsight Clusters

Up until recently, HDInsight had a separate security model from Azure AD.  Microsoft has introduced Domain-Joined HDInsight clusters, which brings authentication with Azure AD & authorization with Apache Ranger to HDInsight.  Currently this feature is part of Premium HDInsight, and supports Hive, though you should be able to install/custom configure other applications.

When spinning up an Azure HDInsight cluster, there are a few options during setup, including adding marketplace apps from 3rd-party vendors.

The blurb from the install screen.

External Apps

  • DatameerDatameer offers analysts an interactive way to discover, analyze, and visualize the results on Big Data. Pull in additional data sources easily to discover new relationships and get the answers you need quickly.
  • Streamsets Data Collector for HDnsight provides a full-featured integrated development environment (IDE) that lets you design, test, deploy, and manage any-to-any ingest pipelines that mesh stream and batch data, and include a variety of in-stream transformations—all without having to write custom code.
  • Cask CDAP 3.5 for HDInsight provides the first unified integration platform for big data that cuts down the time to production for data applications and data lakes by 80%. This application only supports Standard HBase 3.4 clusters.
More info on each of these.

Datameer is a data ingest, transformation, and visualization tool.  Connects to tons of sources, including cloud-based, on-prem, relational, nosql, and even Cobol.  
There's a built-in plugin to export to a Tableau Server.

Streamsets Data Collector, for building data pipelines and providing operational dashboards.

Cask CDAP

Each of the applications have specific requirements for the version of HDInsight / HDP that will be used and additional configuration within the applications themselves.

Custom Apps

In addition to the apps in the Azure Marketplace, you can install custom applications, one way is by using shell actions.  

Some resources for installing custom applications with Azure HDInsight.
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apps-install-custom-applications

HDInsight provides several scripts to install the following components on HDInsight clusters:



Installing the script action for Hue and some other components will restart the HDInsight cluster services, rendering the server unavailable for jobs.

Bootstrapping Config Files

If you download configs through Ambari or from command-line, you can provision new clusters with changed configuration files.

Connectivity

Connecting to the cluster using ssh

Working with HD Insight, Hue and many of the other browser-based tools sometimes requires Setting up SSH tunneling, if you don't have access directly to HDInsight network.

You can avoid this by setting up some edge nodes in the cluster.
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apps-use-edge-node

Extending the network is also useful.  When provisioning an HDInsight cluster, you can assign it to a VNet.
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-extend-hadoop-virtual-network

RStudio & RevoR

Rikard Sandström has some helpful links - installing a Premium HDInsight Spark cluster with RStudio + Tunnel and Creating HDInsight Clusters with Templates.