August 24, 2021
Introduction

PolyBase is a service that comes with Microsoft's SQL Server that allows users to set up External Tables to access remote database tables from other databases. It then allows you to query these tables using TSQL, so there is no need to know the different SQL syntax of the different data sources.

PolyBase was introduced in Microsoft's SQL Server 2016 with limited data sources it could connect to. In SQL Server 2019 we saw the number of data sources it could connect to expand significantly.

This article will show you how you can set up PolyBase on Microsoft SQL Server 2019 on a Windows server. PolyBase is available in SQL Server 2016 and above, Azure Synapse Analytics, and Parallel Data Warehouse.

Pre-Installation

Before we start installing PolyBase, there are a few things to note about the limitations and other things of note of this technology.

  • PolyBase does not use Windows authentication for accessing databases. In order to access another database, you will need to have a database user set up with proper read permissions on the database you want to access.
  • PolyBase currently is read-only. It does not allow you to write any data back to a data source. For this reason, the database user used for setting up PolyBase should have read only permissions on the data source.
  • PolyBase can be installed only one time per server. This means if you have multiple installs of SQL Server on the same server, PolyBase can only be set up for one of those instances.
  • Each node of PolyBase has 8 readers used to pull data from a remote source. The way the remote source is set up will affect the amount of data that can be pulled at one time.
  • PolyBase has a configuration option called Pushdown. Pushdown tells PolyBase whether to try to execute the query on the remote server, or pull all data back to the local server and process the query there. Having Pushdown on allows you to attempt to filter the data and do joins to table at the original data source so not as much data will be pulled back to the local server. Sometimes this is not possible to do on some data sources. I have found on some ODBC connections you have to turn this option off in order for the queries to operate correctly. You can find more information on Pushdown in the Microsoft Documentation on Pushdown computations in PolyBase .
  • An issue I have ran into with some ODBC connection is a data type conversion issue with numeric types. If you run into this issue, I recommend trying to enable the “Enforce DOUBLE” option on the ODBC DSN for the data source.
  • You will need an active directory service account for running PolyBase in scale-out mode if you are wanting to run PolyBase on more than one server in a cluster. To learn more about scale-out groups go to the Microsoft Documentation on PolyBase scale-out groups .
  • You can not index an External Table.
  • You can join multiple External Tables in a query.
  • You can not use table hints with External Tables in Views.

Installation

This guide is going to assume that on the most part you know how to install SQL Server. For that reason, there will be some sections I am going to skip and just note the important parts.

PolyBase is included as part of the SQL Server installation. You can get a copy of SQL Server Developer Edition if you need installation media.

Once you have the installation media downloaded, run setup.exe to start the installation.

This will open the SQL Server Installation Center. Once this is open, click on the “Installation” option from the navigation on the left side of the window. Then click on “New SQL Server stand-alone installation or add features to an existing installation”.

Follow the wizard until you get to the “Feature Selection” screen. This is the screen where we need to pay special attention. You will at minimum need to select “Database Engine Services” and “PolyBase Query Service for External Data”. I also highly suggest you select “Java connector for HDFS data sources” even if you aren’t planning to connect to HDFS. For some reason this can cause issues in the install. It is just safer to select this than to leave it unchecked. Once you have selected the features, click “Next >”.

You can install PolyBase on an existing instance of SQL Server or on a new instance of SQL Server. If you are installing it on an existing instance of SQL Server that has had a CU applied to it, you will need to reapply that CU after the installation of PolyBase. This is because PolyBase is installed at the same version as the installation media.

Once you select the instance of SQL Server you are installing PolyBase on, you will be presented with a screen for the PolyBase Configuration. On this screen you are given the option to “Use this SQL Server as a standalone PolyBase-enabled instance.” Or “Use this SQL Server as a part of PolyBase scale-out group.”.

If you are installing it on a single machine, you can use either option. I do recommend using the scale-out option if there is a remote chance you will eventually set up another instance to use as a child node.

If you are installing this as either a master or child node of a scale-out cluster, you need to select the second option.

Once you select and click “Next >”, you will be taken to the screen to install Java or use an existing installation. Normally, I just select the first option and use what the install recommends for this. Click on “Next >”.

The next screen is the Server Configuration screen where we can set up service accounts for the different services. Normally using a service account when running SQL Server is a good practice, but in this case, it is required if you are running PolyBase in a scale-out mode. PolyBase requires that all the nodes in a cluster to be running on the same account.

After you set the username and password for the different services, the rest of the installation is pretty standard. Finish the rest of the SQL Server installation and configure the database how you want.

Microsoft has documentation on this installation process that can be found here .

Post-Installation

Once you have completed the installation, you will need to open the Services management in Windows. The easiest way to do this is hitting the Windows key on your keyboard and then typing the word Services. Then select the services app.

Once you have opened Services, scroll down to SQL Server PolyBase Data Movement and SQL Server PolyBase Engine. If they are in the status “Running”, then you can skip to the “Configure your Database” section of this post. If these services are stuck in “Starting”, we need to do a little more configuration work.

To fix this, we need to open the SQL Server Configuration Manager. Once you have it open, click on “SQL Native Client 11.0 Configuration”. Then double-click on “Client Protocols”. Quite likely the TCP/IP option is Disabled. Right-Click on TCP/IP and select “Enable”.

In order for the PolyBase services to work, you will need to restart the SQL Server services. The easiest and most reliable way I have found to do this is rebooting the server/machine.

Once the server reboots, check the Services screen again and the SQL Server PolyBase Data Movement and SQL Server PolyBase Engine services should now be in the Running status.

Configure Your Database

Now that we have the PolyBase service up and running, we will need to configure our database to use it. The common tools to do this is either through Microsoft SQL Server Management Studio (SSMS) or Azure Data Studio (ADS). For this example, we will be using SSMS.

Once you fire up SSMS and connect to the new SQL Server you set up, you will notice there are three databases that exist: DWConfiguration, DWDiagnostics, and DWQueue. Do not make any modifications to these databases. They are used by PolyBase and you can really mess things up by messing with them.

Go ahead and create a new database. I called my database DataWarehouse.

Once your database is created, open a New Query and make sure the source is set to your new database.

First, we will need to check if PolyBase is installed. For SQL Server 2019 it likely will not be unless you installed it as part of Big Data Clusters. To check if it is enabled, run the following code.

		
			SELECT SERVERPROPERTY ('IsPolyBaseInstalled') AS IsPolyBaseInstalled;
			GO
		
	

Likely it will return a result of 0 indicating it is not enabled. To enable, run the following code. You can then rerun the above code to confirm it is enabled.

		
			exec sp_configure @configname = 'polybase enabled', @configvalue = 1;
			RECONFIGURE;
		
	

Setting Up External Data Sources and Tables

Now that we have PolyBase installed and configured, let's get it set up to use.

The first thing we need to do is create a master key using the following script.

		
			USE [DataWarehouse];
			
			CREATE MASTER KEY ENCRYPTION BY PASSWORD = N'MasterKey1234567890!';
			GO
		
	

From here, things will vary based on the data source you are connecting to. In my example, I am going to be connecting to another local database called WideWorldImporters.

In order to set up the data source we will need to set up the credentials to access the data source. In my WideWorldImporters database I already set up a user called WWI_USER. In my new database I will create the credentials for that user by using the following script. This script will use the credentials from the WideWorldImporters database and create a scoped credential called [WWI_User].

		
			USE [DataWarehouse];
			
			CREATE DATABASE SCOPED CREDENTIAL [WWI_User]
			WITH IDENTITY = N'WWI_User', SECRET = N'Password1!';
			GO
		
	

Now that we have the credentials set up, we can set up the data source. We are going to call this data source [WWI] and use our scoped credentials we created in the previous script.

		
			USE [DataWarehouse];
			
			CREATE EXTERNAL DATA SOURCE [WWI]
			WITH (LOCATION = N'sqlserver://localhost', CREDENTIAL = [WWI_User]);
			GO
		
	

To break down the previous script a little, the location is the server type and location of the server your external database is on. The remote database is on a SQL Server and that server is called localhost. You are better to use the actual server name though. Also, as mentioned previously, we are using the [WWI_user] credentials we previously created. Microsoft has more information on this here . You can also check out my github repo to find out more information on the subject. In my Intro to PolyBase Repo you will definitely want to check out my Cheat Sheet.

Now that we have our data source made, we can create our external tables.

		
			USE [DataWarehouse];
			
			CREATE EXTERNAL TABLE [dbo].[Invoices]
			(
			[InvoiceID] INT NOT NULL
			, [CustomerID] INT NOT NULL
			, [BillToCustomerID] INT NOT NULL
			, [OrderID] INT
			, [SalespersonPersonID] INT NOT NULL
			, [InvoiceDate] DATE NOT NULL
			, [CustomerPurchaseOrderNumber] NVARCHAR(20) COLLATE Latin1_General_100_CI_AS
			)
			WITH (DATA_SOURCE = [WWI],LOCATION = N'[WideWorldImporters].[Sales].[Invoices]');
			GO
		
	

Now that I created my table, I can simply query my new [dbo].[Invoices] table.

A few More Things

One of the pains of setting up External Tables is knowing what the schema of the table is you are trying to create an External Table for. Check out my blog on PolyBase External Table Schema Tips for some help on how to easily find out the schemas from your external tables.

One thing that Microsoft recommends you do for your external tables is the create statistics for the tables. This helps SQL Server figure out the best execution plan for querying the external data. This is especially helpful in large data sets since you can not index External Tables.

		
			USE [DataWarehouse];
			
			CREATE STATISTICS statistics_sales_invoice ON [dbo].[Invoices] (InvoiceID) WITH FULLSCAN;
			GO
		
	

Now that I created my table, I can simply query my new [dbo].[Invoices] table.

Conclusion

PolyBase is a great tool for querying external data sources directly from SQL Server without having to know multiple SQL syntaxes. This article should help you get started on your journey. Feel free to reach out with any experiences or issues you have had with this technology. I love hearing how people have used it and what problems they have encountered. You can contact me using my Contact Me page or using the send me feedback link below.

I don't have a comments section yet, so feel free to send me feedback on this blog.


Kevin Williams

Kevin is a data engineer and is the Business Intelligence Practice Lead at Software Design Partners specializing in data warehousing. He is a father, an occasional gamer, and lover of many different types of music.


The opinions expressed on this site are my own and may not represent my employer's view.
Share this post
About this blog...

A guide to install PolyBase in SQL Server 2019 on a Windows Server.

Archives


An error has occurred. This application may no longer respond until reloaded. Reload 🗙