November 09, 2021

Introduction

This blog post is meant to be an introduction to architecting an on-premises data warehouse using the Microsoft SQL Server Business Intelligence stack. This includes Microsoft SQL Server, Analysis Services, and Integration Services.

There are many other platforms out there for building a data warehouse, and many of these concepts will apply to those technologies as well. I am focusing on this particular technology stack because it is the one I work with the most and am most fond of.

Before we get too far along, let's briefly cover what a data warehouse is and why you would need one.

Data Warehouse Introduction

Data warehouses are typically large databases filled with structured data from multiple sources. These data sources can be customer relationship management (CRM) systems, enterprise resource planning (ERP) systems, warranty systems, or any other database or data source in your company. Many times this even includes data from spreadsheets that have been floating around the company for years. Typically, a process called ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform) is used to pull data from the disparate sources and apply business logic to shape it into a reportable structure.

One of the most common reasons businesses have a data warehouse is to bring data together from across the company into a centralized reporting platform. Once the data is centralized and standardized, reporting and analytics tools can query it to build reports and dashboards.

Data Warehouse Concepts

Data warehouses can be structured several different ways, but the same terminology is generally used to describe how they are structured. The terms you will often hear are measures, facts, and dimensions.

A measure is a numeric value that can be aggregated. This can be things like sales amount, quantity sold, costs, margins, etc. A fact is a table that holds measures.

A dimension is a table that holds descriptive information about a fact. This can be things like the customer an item was sold to, the order number, the invoice number, the date sold, or the date invoiced. Dimensions are also typically very denormalized.
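To make the distinction concrete, here is a minimal sketch of what a dimension and fact pair might look like. The table and column names are hypothetical and heavily simplified for illustration.

-- Hypothetical, simplified dimension and fact tables
CREATE TABLE dbo.DimCustomer (
    CustomerKey  INT IDENTITY(1,1) PRIMARY KEY, -- surrogate key
    CustomerId   INT NOT NULL,                  -- business key from the source system
    CustomerName NVARCHAR(100) NOT NULL,
    Country      NVARCHAR(50) NULL              -- denormalized descriptive attribute
);

CREATE TABLE dbo.FactSales (
    CustomerKey  INT NOT NULL REFERENCES dbo.DimCustomer (CustomerKey),
    OrderDateKey INT NOT NULL,                  -- would point at a date dimension
    SalesAmount  DECIMAL(18, 2) NOT NULL,       -- the measures live in the fact table
    QuantitySold INT NOT NULL
);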

I am not going to go very deep into table structure in this post, as it is a complex enough topic to deserve multiple posts of its own.

Data Warehouse Architecture

One of the first resources I was ever recommended for learning about data warehousing was the book The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling by Ralph Kimball. It did a great job of explaining some of the modeling concepts that still hold up well today. I would recommend checking it out if you are getting into building data warehouses as data modeling plays an important role in a data warehouse's design.

Today, though, we are going to keep things at a higher level and look at the overall architecture of a data warehouse. Let's take a look at how I typically structure one.

[Diagram: data warehouse architecture]

One of the concepts that Kimball shares is the idea of having multiple layers in a data warehouse. The first layer is the data layer, where all the data is stored and manipulated. This layer is protected from user access to prevent users from modifying the data or querying it in the middle of an ETL process.

This layer consists of the data warehouse tables, including any staging tables used to hold data for the ETL process. These objects are typically organized using database schemas to group related tables, views, and stored procedures together. Over time I have created a collection of scripts for creating the overall structure of a data warehouse. You can find them on GitHub in my DWH-Scripts project. Feel free to check it out and leave feedback.
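As a rough illustration, a schema layout along these lines keeps the layers separated. The schema names here are just the convention used in this sketch, not a requirement, and under this layout the example tables from earlier would live in the dwh schema.

-- Hypothetical schema layout separating the layers of the warehouse
CREATE SCHEMA stg;  -- staging tables the ETL process loads first
GO
CREATE SCHEMA ext;  -- PolyBase external tables over source systems
GO
CREATE SCHEMA dwh;  -- the fact and dimension tables themselves
GO
CREATE SCHEMA pres; -- presentation views exposed to Analysis Services
GO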

Additionally, if you are utilizing External Tables using PolyBase, these tables would be included in the data layer as well. If you are not familiar with PolyBase, I give an introduction to it in my blog post Installing PolyBase on SQL Server 2019.
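If you do go the PolyBase route, the setup looks roughly like this. The server name, credential, and table below are placeholders, and this sketch assumes a database master key already exists; see the linked post for a proper walkthrough.

-- Hypothetical PolyBase setup; all names and locations are placeholders
CREATE DATABASE SCOPED CREDENTIAL ErpCredential
    WITH IDENTITY = 'erp_reader', SECRET = '<password>';

CREATE EXTERNAL DATA SOURCE ErpSource
    WITH (LOCATION = 'sqlserver://erp-server:1433',
          PUSHDOWN = ON,
          CREDENTIAL = ErpCredential);

-- The external table lives in the data layer alongside the staging tables
CREATE EXTERNAL TABLE ext.ErpCustomers (
    CustomerId   INT,
    CustomerName NVARCHAR(100)
)
WITH (LOCATION = 'ErpDb.dbo.Customers', DATA_SOURCE = ErpSource);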

The second layer in a data warehouse is the presentation layer. This is the layer where the data is curated for user consumption. Consumption of the data is often done using reporting or analytics tools.

Most of the time both of these layers are just separated by schemas in the same database. While that is perfectly fine, I typically encourage companies to utilize SQL Server Analysis Services. With this approach, views expose just the data Analysis Services needs.
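Continuing the hypothetical layout from above, a presentation view might flatten the warehouse tables into exactly the shape the model consumes:

-- Hypothetical presentation view exposing only what the model needs
CREATE VIEW pres.Sales
AS
SELECT c.CustomerName,
       c.Country,
       f.OrderDateKey,
       f.SalesAmount,
       f.QuantitySold
FROM dwh.FactSales AS f
JOIN dwh.DimCustomer AS c
    ON c.CustomerKey = f.CustomerKey;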

Analysis Services is a technology that allows you to build data marts in the form of OLAP cubes for users to consume. This takes a lot of the complication out of managing access to the data warehouse database; all user access is then controlled through Analysis Services. Analysis Services also allows you to build measures and calculated columns directly into the model, as well as format the facts and dimensions to give the end user a much better experience when consuming the data.

Analysis Services is an on-premises service, but cloud options exist as well. If you are looking for a cloud-based platform, you can utilize Azure Analysis Services or Power BI Premium using XMLA endpoints. The on-premises version allows the creation of multidimensional or tabular cubes. Azure Analysis Services and Power BI Premium only support tabular models, as those are in-memory cubes.

ETL Process

In order to move data around, an ETL process has to be utilized. This process has a couple of pieces.

The first part of the process is getting the data to the same server as the data warehouse. This makes it much easier to manipulate the data, as it is all local. This can also be referred to as ELT, since the transformation happens after the data is extracted from the source system and loaded onto the data warehouse server. When data needs to be copied to the server before being loaded into the data warehouse, it is usually loaded into staging tables.

For this process you can utilize a lot of different ETL tools, such as SQL Server Integration Services. Personally, I am not overly fond of ETL tools. I prefer to work with T-SQL for extracting, transforming, and loading the data. This is one of the reasons I prefer to use PolyBase when possible: it allows me to directly query other data sources. Even when using PolyBase to pull the data, I still use Integration Services as an orchestration tool to manage when my stored procedures are triggered and which data sources they pull from. This also allows me to run multiple stored procedures at the same time to load data faster.
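As a sketch of what one of those stored procedures might look like, using the hypothetical objects from earlier and assuming a stg.Customers staging table exists:

-- Hypothetical extract procedure that Integration Services would trigger;
-- it refreshes a staging table from the PolyBase external table defined earlier
CREATE PROCEDURE stg.LoadCustomers
AS
BEGIN
    SET NOCOUNT ON;

    TRUNCATE TABLE stg.Customers;

    INSERT INTO stg.Customers (CustomerId, CustomerName)
    SELECT CustomerId, CustomerName
    FROM ext.ErpCustomers; -- querying the external table pulls from the source
END;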

The drawback to PolyBase is that it does not work with data sources such as APIs. For APIs, you will have to use an ETL tool or build a custom integration using the platform of your choice. Again, since I am not fond of ETL tools, I prefer to build an integration in .NET. Occasionally a customer will have an ETL tool they use as a standard, though, so that tool will be utilized to stay consistent.

The second part is loading the data into the data warehouse tables. This can be part of the same ETL process as above, but it should run after the staging tables are loaded to reduce the risk of data arriving in the data warehouse in the wrong order. This is something that I have seen bite people a lot over the years. Instead of moving and loading data in steps, they try to do it all at once and it becomes a mess.
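Continuing the hypothetical example, the second step might then upsert the staged rows into the dimension once every staging load it depends on has finished:

-- Hypothetical load step: runs only after the staging tables are loaded
MERGE dwh.DimCustomer AS tgt
USING stg.Customers AS src
    ON tgt.CustomerId = src.CustomerId
WHEN MATCHED THEN
    UPDATE SET tgt.CustomerName = src.CustomerName
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerId, CustomerName)
    VALUES (src.CustomerId, src.CustomerName);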

Take the time to plan how your data is going to flow into your data warehouse. A little bit of thought can save you many hours of rewriting the warehouse later on.

Conclusion

Building a data warehouse is no small feat. It takes a lot of effort and many hours of planning. Hopefully this article provided some useful information and resources to get you started, along with a high-level overview of what is involved in building a data warehouse on-premises using Microsoft SQL Server.

I don't have a comments section yet, so feel free to send me feedback on this blog.


Kevin Williams

Kevin is a data engineer and is the Business Intelligence Practice Lead at Software Design Partners specializing in data warehousing. He is a father, an occasional gamer, and lover of many different types of music.


The opinions expressed on this site are my own and may not represent my employer's view.