Automate Lakehouse Table Maintenance In Fabric
Introduction
Automation makes things much easier for us, especially when running multiple repetitive tasks at the same time. In this post we are going to look at how to run multiple tasks in Python in a Microsoft Fabric Notebook. In particular, we will perform some maintenance tasks across multiple tables in a lakehouse at the same time.
This post assumes you have access to Microsoft Fabric.
Before We Get Started
In this notebook we want to be able to run multiple tasks at the same time, because running them synchronously (one after another) could take a long time. To run tasks concurrently, we are going to use a library called concurrent.futures, which was added to the standard library in Python 3.2. You will need to make sure your Notebook environment supports that version or later.
You can read more information about concurrent.futures in the Python documentation.
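To see the pattern in isolation before we apply it to lakehouse tables, here is a minimal sketch; the slow_task function and its sleep call are just stand-ins for any long-running work and are not part of the maintenance script.

import concurrent.futures
from concurrent.futures import ThreadPoolExecutor
import time

def slow_task(name):
    # Stand-in for a long-running operation
    time.sleep(1)
    return f"Done: {name}"

# With three workers, five one-second tasks finish in roughly two seconds
with ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(slow_task, n) for n in ["a", "b", "c", "d", "e"]]
    for future in concurrent.futures.as_completed(futures):
        print(future.result())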
Creating The Notebook
In Microsoft Fabric, create a Notebook and attach the default lakehouse you want to work with.
Next, we need to bring in the libraries and functions we are going to use.
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime
Now we define a couple of functions that we are going to use. The first gets all the tables from the default lakehouse, so we have a list of tables to iterate through.
def get_table_list():
    # Get the default lakehouse name
    lakehouse_name = notebookutils.runtime.context.get("defaultLakehouseName")
    # Get the list of tables from the lakehouse
    tables = notebookutils.lakehouse.listTables(lakehouse_name)
    # Return the table names as a list
    return [table.name for table in tables]
If you are interested in learning more about notebookutils.runtime, check out my blog post Retrieve Environment Info In MS Fabric Notebook.
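If you would rather not depend on notebookutils for this step, a similar list can usually be built with Spark's catalog API, since the tables of the default lakehouse are available in the notebook's Spark session. This is just a sketch under that assumption, not part of the original script.

def get_table_list_spark():
    # List the tables registered in the current (default lakehouse) schema
    return [table.name for table in spark.catalog.listTables()]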
The second function runs various commands against a table. Since this is a maintenance script that we run weekly, it performs commands such as OPTIMIZE and VACUUM. You can see how I use them in my previous blog post called MS Fabric Lakehouse Data Missing?
def refresh_table(table_name):
    try:
        print(f"[{datetime.now()}] Starting refresh for table: {table_name}")
        # Compact small files to improve read performance
        spark.sql(f"OPTIMIZE `{table_name}`")
        # Remove unreferenced files older than 7 days (168 hours)
        spark.sql(f"VACUUM `{table_name}` RETAIN 168 HOURS")
        # Invalidate cached metadata for the table
        spark.sql(f"REFRESH TABLE `{table_name}`")
        # Recompute table statistics for the query optimizer
        spark.sql(f"ANALYZE TABLE `{table_name}` COMPUTE STATISTICS")
        print(f"[{datetime.now()}] Completed refresh for table: {table_name}")
        return f"Success: {table_name}"
    except Exception as e:
        error_msg = f"Failed to refresh {table_name}: {str(e)}"
        print(f"[{datetime.now()}] {error_msg}")
        return f"Error: {error_msg}"
Before we create the main process in our script, we need to set a few variables.
# Maximum number of tables to process at the same time
max_workers = 5
# Seconds to allow when fetching each result
timeout_limit = 300
table_list = get_table_list()
# Alternative approach: target specific tables in the lakehouse instead of all of them
# table_list = [
#     "fake_erp_dbo_customer",
#     "fake_erp_dbo_purchase_order"
# ]
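Whichever approach you use, it is worth confirming what the job will actually touch before launching it.

# Quick sanity check before kicking off the parallel run
print(f"[{datetime.now()}] {len(table_list)} tables queued: {sorted(table_list)}")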
Now we can bring it all together.
job_start_time = datetime.now()
print(f"[{job_start_time}] Job Starting")

with ThreadPoolExecutor(max_workers=max_workers) as executor:
    future_to_table = {
        executor.submit(refresh_table, table): table for table in table_list
    }
    for future in concurrent.futures.as_completed(future_to_table):
        table_name = future_to_table[future]
        try:
            result = future.result(timeout=timeout_limit)
        except concurrent.futures.TimeoutError:
            error_msg = f"Timeout: {table_name} took longer than {timeout_limit} seconds"
            print(f"[{datetime.now()}] {error_msg}")
        except Exception as exc:
            error_msg = f"Exception for {table_name}: {exc}"
            print(f"[{datetime.now()}] {error_msg}")

job_end_time = datetime.now()
job_duration = (job_end_time - job_start_time).total_seconds()
print(f"[{job_end_time}] Job Completed in {job_duration:.1f} seconds")
The important part of the code above is the ThreadPoolExecutor, which allows us to run multiple functions at the same time. How many is dictated by the variable max_workers, which is set to 5: the ThreadPoolExecutor will keep at most five calls to refresh_table running at the same time, starting a new one as each finishes.
# Initialize the ThreadPoolExecutor with max_workers set to 5 via the variable defined earlier
with ThreadPoolExecutor(max_workers=max_workers) as executor:
    future_to_table = {
        # Loop through the table list and submit the "refresh_table" function for each table
        executor.submit(refresh_table, table): table for table in table_list
    }
    # Grab the result from each function as it completes
    for future in concurrent.futures.as_completed(future_to_table):
        table_name = future_to_table[future]
        result = future.result(timeout=timeout_limit)
There is also specific error handling for timeouts, as you can see in the following code.
except concurrent.futures.TimeoutError:
    error_msg = f"Timeout: {table_name} took longer than {timeout_limit} seconds"
    print(f"[{datetime.now()}] {error_msg}")
This error is raised if fetching a result takes longer than the time we allow, in this case 300 seconds, which is five minutes.
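One subtlety worth knowing: as_completed only yields futures that have already finished, so in this structure the result() call returns immediately and the timeout acts more as a safety net than a hard per-table limit. If you want to cap how long the whole job is allowed to wait, one alternative sketch is to pass the timeout to as_completed itself; the try/except then has to wrap the loop, because the TimeoutError is raised by the iterator rather than by result().

# Alternative sketch: cap the total wait for the whole job
try:
    for future in concurrent.futures.as_completed(future_to_table, timeout=timeout_limit):
        table_name = future_to_table[future]
        result = future.result()  # Already finished, so no per-result timeout needed
except concurrent.futures.TimeoutError:
    print(f"[{datetime.now()}] Timed out after {timeout_limit} seconds with tables still running")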
You can check out the full Jupyter notebook in my GitHub.
Conclusion
The code in this article can help you create a notebook in Microsoft Fabric that runs maintenance on the tables in a lakehouse. Add the code to your own notebook, or take a look at my GitHub repo and download a copy of the notebook from there. Once you have the notebook created in your Microsoft Fabric workspace, attach your default lakehouse and schedule the notebook to run as often as you need.
I usually run this once a week for the OPTIMIZE and VACUUM commands, but whenever the bronze and silver layers are both in a lakehouse, I also run it after the bronze and silver pipelines in my ETL process.
I don't have a comments section yet, so feel free to send me feedback on this blog.
Kevin is a data engineer and is the Business Intelligence Practice Lead at Software Design Partners specializing in data warehousing. He is a father, an occasional gamer, and lover of many different types of music.
The opinions expressed on this site are my own and may not represent my employer's view.