PolyBase External Table Schema Tips
Introduction
One of the key pieces of working with PolyBase is knowing what the source database table looks like so you can create the External tables properly. There are a couple ways of figuring out the proper data types to use when creating your External Table.
Option 1: Azure Data Studio
Azure Data Studio has an extension that can be installed called “Data Virtualization”. This extension adds a new menu option called “Virtualize Data” when right-clicking on a database under the Connections tab. This option will take you through a wizard that will walk you through the whole process of creating a Master Key if you do not have one, an external user, an external data source, and external tables. At the end of the wizard, you will have an option to generate a script that generates a script with all the code needed to set everything up. You can also just allow the wizard to complete the process. In past experience, the wizard seems to be buggy, so using the script is a better option. This allows you to customize the script as needed.
Something about this method that I found interesting, is the generated script did not use NVARCHAR(MAX). it instead uses. NVARCHAR(4000).
Option 2: Foo Query
This method is less conventional and it works in both SQL Server Management Studio and Azure Data Studio.
If you already have the data source set up, you can use a simple query to find out the schema of a table. For the example below I am creating a table called [dbo].[Invoices] from the external data source table called [WideWorldImporters].[Sales].[Invoices]. If I don't know the schema I can just use [foo] int as a column. It does not specifically need to be [foo] int, it just needs to be a column that does not exist.
CREATE EXTERNAL TABLE [dbo].[Invoices]
(
[foo] int
)
WITH (DATA_SOURCE = [WWI],LOCATION = N'[WideWorldImporters].[Sales].[Invoices]');
GO
When I run the above query, I receive a message.
Msg 105083, Level 16, State 1, Line 1 105083;The following columns in the user defined schema are incompatible with the external table schema for table 'Invoices': user defined column: 'foo' was not found in the external table. The detected external table schema is: ([InvoiceID] INT NOT NULL, [CustomerID] INT NOT NULL, [BillToCustomerID] INT NOT NULL, [OrderID] INT, [DeliveryMethodID] INT NOT NULL, [ContactPersonID] INT NOT NULL, [AccountsPersonID] INT NOT NULL, [SalespersonPersonID] INT NOT NULL, [PackedByPersonID] INT NOT NULL, [InvoiceDate] DATE NOT NULL, [CustomerPurchaseOrderNumber] NVARCHAR(20) COLLATE Latin1_General_100_CI_AS, [IsCreditNote] BIT NOT NULL, [CreditNoteReason] NVARCHAR(MAX) COLLATE Latin1_General_100_CI_AS, [Comments] NVARCHAR(MAX) COLLATE Latin1_General_100_CI_AS, [DeliveryInstructions] NVARCHAR(MAX) COLLATE Latin1_General_100_CI_AS, [InternalComments] NVARCHAR(MAX) COLLATE Latin1_General_100_CI_AS, [TotalDryItems] INT NOT NULL, [TotalChillerItems] INT NOT NULL, [DeliveryRun] NVARCHAR(5) COLLATE Latin1_General_100_CI_AS, [RunPosition] NVARCHAR(5) COLLATE Latin1_General_100_CI_AS, [ReturnedDeliveryData] NVARCHAR(MAX) COLLATE Latin1_General_100_CI_AS, [ConfirmedDeliveryTime] DATETIME2(7), [ConfirmedReceivedBy] NVARCHAR(4000) COLLATE Latin1_General_100_CI_AS, [LastEditedBy] INT NOT NULL, [LastEditedWhen] DATETIME2(7) NOT NULL). Completion time: 2021-08-21T16:27:56.2782071-05:00
Looking through this message I can see it says “The detected external table schema is: ([InvoiceID] INT NOT NULL, [CustomerID] INT NOT NULL,…”. I can just take that section of the message and build out my script. In this particular case, I do not need all the columns, so I just create a script from the section that I do need.
Unlike option 1 above, this method does return NVARCHAR(MAX) as opposed to NVARCHAR(4000). The significance of this is beyond the scope of this article, but is worth pointing out.
CREATE EXTERNAL TABLE [dbo].[Invoices]
(
[InvoiceID] INT NOT NULL
, [CustomerID] INT NOT NULL
, [BillToCustomerID] INT NOT NULL
, [OrderID] INT
, [SalespersonPersonID] INT NOT NULL
, [InvoiceDate] DATE NOT NULL
, [CustomerPurchaseOrderNumber] NVARCHAR(20) COLLATE Latin1_General_100_CI_AS
)
WITH (DATA_SOURCE = [WWI],LOCATION = N'[WideWorldImporters].[Sales].[Invoices]');
GO
Conclusion
These are a few tricks to help while working with PolyBase External Tables. Let me know if there are any other tips or tricks that you use to make your life easier while working with PolyBase.
I don't have a comments section yet, so feel free to send me feedback on this blog.
Kevin is a data engineer and is the Business Intelligence Practice Lead at Software Design Partners specializing in data warehousing. He is a father, an occasional gamer, and lover of many different types of music.
The opinions expressed on this site are my own and may not represent my employer's view.
About this blog...
A few simple PolyBase tips to figuring out the External Table schema.