MS Fabric Lakehouse Data Missing?
Introduction
One of the issues I have run into multiple times in Microsoft Fabric involves loading data into a Lakehouse table and querying it immediately, only to see that the data has not fully loaded yet. In truth, the data is there in the underlying parquet files, but the delta store has not yet been updated to see those files. This is much more of a problem on lower Microsoft Fabric capacities such as F16 and below. There are a few things you can try to help with this.
Recommendations
Recommendation 1
The first recommendation is to force a refresh of the metadata for a particular table. If you have a default Lakehouse attached to your notebook, you can simply run the code below. Don't forget to change the table_name before you run it.
table_name = "lakehouse_table"
spark.sql(f"REFRESH TABLE {table_name}")
spark.sql(f"ANALYZE TABLE {table_name} COMPUTE STATISTICS")
Recommendation 2
After a table has existed for a while, the number of underlying parquet files starts to grow. This is because the delta store does not usually remove underlying files when they are no longer needed. A file becomes unneeded when the data in it is no longer the latest version of a row in the table. The delta store keeps these files around so that you can use time travel in queries to pull older versions of the table.
This can be a great feature, but if your table does not need to have that time travel ability, it is a good idea to clean out unused files after a while.
In addition to files that are no longer used, there may also be many very small files, which can also cause slowness. To address both, we can use a couple more commands below. Again, make sure you change the table name before using the code.
table_name = "lakehouse_table"
spark.sql(f"OPTIMIZE {table_name}")
spark.sql(f"VACUUM {table_name} RETAIN 168 HOURS")
The OPTIMIZE command consolidates smaller parquet files into larger ones. VACUUM then removes files that are no longer referenced by the delta store and are older than the retention period (168 hours is the 7-day default). You can read more about this in the Microsoft documentation.
It is a good idea to run this as a weekly maintenance script. I will be posting a script to do this with many tables at the same time in my next blog post.
Recommendation 3
I have also found that just running a quick query against the table after refreshing the metadata can force the metadata store to register the updates. This recommendation is a bit more fickle than the first two.
Conclusion
In Microsoft Fabric, lower tier capacities tend to be very slow at refreshing table metadata when new data is loaded. The above recommendations have helped me force refreshes so that data is available in the tables when I go to query them after loading new data.
I don't have a comments section yet, so feel free to send me feedback on this blog.
Kevin is a data engineer and is the Business Intelligence Practice Lead at Software Design Partners specializing in data warehousing. He is a father, an occasional gamer, and lover of many different types of music.
The opinions expressed on this site are my own and may not represent my employer's view.
About this blog...
This post takes a look at MS Fabric data not showing up in a lakehouse after a data load.
Archives
- 2025
- 2023
- 2022
- 2021