Transforming a Wide Table into an Effective Data Model in Power Query
Introduction
One day at work, a client reached out to resolve an issue with their Power BI report. Upon examining the data, I discovered a single wide table that contained 30 to 40 columns or more.
Naturally, I inquired about the data source. The response was, “Excel, of course.”
That led me to ask: “Is it possible to create a good data model from this?”
My client's question was, “Why?”
And this is where our journey begins.
Ideally, I would transfer the source file into a relational database and create a well-structured data model. However, it’s rare for my clients to invest in something that doesn't seem immediately beneficial.
But what’s the importance of a good data model? Doesn’t a flat table suffice?
At first glance, yes, especially with a few thousand rows. However, complications arise as the data volume increases.
Here’s a brief summary of why a solid data model is essential:
Why is a Good Data Model Important?
There are numerous reasons to prioritize a good data model. The two main ones are efficiency and usability.
By organizing data into separate tables and establishing relationships between them, Power BI can operate more efficiently. Additionally, this approach reduces the size of the Power BI file by eliminating redundancies.
For a more comprehensive understanding, I highly recommend reading the SQLBI article linked in the References section below.
Furthermore, those who interact with the data model will find attributes grouped in distinct tables rather than sifting through a lengthy, alphabetized list.
Now that we understand the importance, the next question arises: What constitutes a good data model?
The answer: A Star Schema.
What is a Star Schema?
If you’re already familiar with a Star Schema, feel free to skip ahead.
My initial data model appears as follows:
The main table, “Azure_UsageDetails,” sits at the center of the diagram, while two additional tables, including a Date table, assist with time-based reporting.
Upon closer inspection of the original table, I noticed several intriguing columns:
- BillingProfile
- MeterName
- SubscriptionName
These columns share a common trait: despite having around 55,000 rows, they contain only a limited number of distinct values, indicating low cardinality.
Moreover, these columns provide descriptive context rather than containing quantitative data like “Quantity” or “UnitPrice.”
The goal is to transfer these columns into separate tables, termed Dimensions, which will contain only unique values or combinations derived from these columns.
Refer to the diagram below:
This is why the structure is known as a Star Schema. We could whimsically call it a Squid Schema, but that name wouldn’t resonate with most people.
In this structure, each Dimension table connects to the central table via an ID column. It's crucial that these ID columns are of integer type, never text.
The central table of the star schema is termed a “Fact” table, housing values for each event, transaction, or data point.
Let’s Begin
Now, let’s outline the necessary preparatory steps:
- Identify columns with low cardinality.
- Determine columns for grouping.
- Assign names for each Dimension table.
- Replace all empty cells.
- Duplicate the data table.
First, access Power Query.
Next, enable data profiling:
Power Query defaults to profiling only the first 1,000 rows, but it’s often beneficial to analyze the entire dataset. To do this, click on “Column profiling based on top 1000 rows” and switch to “Column profiling based on entire data set”:
This adjustment may take some time depending on data volume.
Upon examining my dataset, I identified three potential candidates for Dimensions:
From these, I can create three Dimensions:
- BillingProfiles
- Subscriptions
- Meters
Notice that I’ve used plural names for the Dimension tables, since each one holds one or more instances of its entity.
Next, let’s analyze the columns beginning with “Meter”:
Two key observations arise:
- The cardinality of the four columns varies.
- There are many empty entries in the MeterRegion column (76%).
Given the cardinality, I can establish a hierarchy for the Meters:
- Top-Level: Meter Region
- Second-Level: Meter Category
- Third-Level: Meter Subcategory
- Leaf: Meter Name
The Fact table will relate to the Meter Name column, specifically through an ID created from the Leaf Level based on the distinct combination of all four columns. This combination is essential because Meter Names may appear multiple times across different Meter Subcategories.
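In Power Query’s M language, that distinct combination can be produced with a single expression. The following is only a sketch: Source stands for the copied table, and the column names are assumptions based on my dataset. Table.Distinct without an explicit column list compares entire rows, which yields exactly the distinct combinations of all four columns.
Table.Distinct(Table.SelectColumns(Source, {"MeterRegion", "MeterCategory", "MeterSubCategory", "MeterName"}))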
Next, I will replace the empty rows in the Meter Region column with meaningful text to prevent a hierarchy node labeled (Null).
To do this, I right-click on the column name and select “Replace Values.” In the dialog box, I enter null as the value to find and “Empty” as the replacement:
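Behind the scenes, Power Query records this as a single Table.ReplaceValue step, roughly like the following (the name of the preceding step is an assumption):
Table.ReplaceValue(#"Changed Type", null, "Empty", Replacer.ReplaceValue, {"MeterRegion"})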
Consequently, the MeterRegion column no longer has any empty entries:
Next, I will review all columns designated as Dimension columns and replace any blank values.
I can identify these by checking the “Empty” line in the Profiling Area (indicated by the red lines).
You might consider using an alternative term like “None” for the replacement text.
To prevent duplicate column names when forming the Dimension tables, I will rename all columns ending in “Id” so that they end in “GUID” instead (excluding InvoiceID, which will remain unchanged).
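One way to script this rename in a single step is with Table.TransformColumnNames. This is only a sketch; the suffix logic and the InvoiceID exclusion are my reading of the step described above:
Table.TransformColumnNames(
    #"Replaced Values",
    (name) => if Text.EndsWith(name, "Id") and name <> "InvoiceID"
        then Text.Start(name, Text.Length(name) - 2) & "GUID"  // e.g., SubscriptionId becomes SubscriptionGUID
        else name
)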
Having cleaned the data, I can now duplicate my table to start constructing the Dimension tables. I right-click on the Azure_UsageDetail table and select “Duplicate”:
I’ll rename this duplicate to Azure_UsageDetail_Copy.
However, this duplicate isn’t needed in our Data model, so I will disable the table loading:
This allows me to use the table as a source for subsequent operations without including it in Power BI.
Now, I can utilize the copy of the original table to build my Dimension tables.
Creating a Simple Dimension Table
The first Dimension will focus on Subscriptions.
The necessary steps are as follows:
- Create the table referencing the copied table.
- Remove all other columns.
- Eliminate duplicates.
- Add an ID column.
After these tasks, I will merge this column back into the original table:
- Merge both tables.
- Expand the ID column.
- Remove the original Subscription column from the main table.
Right-click on the copied table and select “Reference” (for clarification on the difference between “Duplicate” and “Reference,” please refer to the References section below).
Next, I’ll rename this table to Subscriptions (Double-click on the table).
I can either use the “Choose Columns” feature or the “Remove Other Columns” option to retain only the two Subscription columns. I select both columns (using Shift-click) and right-click:
Next, I’ll remove duplicates:
Now, I have a compact table with two columns and two rows.
To introduce an ID column, I’ll use the Add Index Column feature:
I’ll rename the newly created “Index” column to “SubscriptionId.”
Now, the Subscriptions Dimension table is complete.
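Put together, the entire Subscriptions query looks roughly like this sketch (the GUID column name stems from the earlier renaming step, and starting the index at 1 is an assumption):
let
    Source = Azure_UsageDetail_Copy,
    SubscriptionCols = Table.SelectColumns(Source, {"SubscriptionName", "SubscriptionGUID"}),
    DistinctRows = Table.Distinct(SubscriptionCols),
    Subscriptions = Table.AddIndexColumn(DistinctRows, "SubscriptionId", 1, 1, Int64.Type)
in
    Subscriptions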
Next, I’ll replace the existing “Subscription” columns in the original table with the new ID column. I will utilize the “Merge Queries” feature to link the two tables:
It’s crucial to select both columns to ensure proper row assignments.
You can verify the correct assignment by checking the information line at the bottom: when both numbers align, everything is fine.
To obtain the ID column, I’ll expand the merged table:
This yields an additional column with the corresponding SubscriptionId.
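In M, the merge and the expansion come down to two steps along these lines (step names and key columns are assumptions based on my dataset):
#"Merged Queries" = Table.NestedJoin(
    #"Renamed Columns",                        // preceding step of the fact query
    {"SubscriptionName", "SubscriptionGUID"},  // keys on the fact side
    Subscriptions,                             // the new Dimension query
    {"SubscriptionName", "SubscriptionGUID"},  // keys on the Dimension side
    "Subscriptions",
    JoinKind.LeftOuter
),
#"Expanded Subscriptions" = Table.ExpandTableColumn(#"Merged Queries", "Subscriptions", {"SubscriptionId"})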
I’ll repeat these steps for all “simple” Dimension tables, but some may require additional steps during construction.
Something More Complex
While refining the Data model, I noticed several columns that are interrelated:
- All “Billing” columns
- ChargeType
- Frequency
- PublisherType
- PricingModel
- InvoiceNo
Since all these columns pertain to invoicing, I decided to consolidate them into a single Dimension table.
The initial steps mirror those described earlier:
- Create a referenced table.
- Remove unnecessary columns.
- Add an Index column for the ID.
- Rename columns for clarity.
However, I recognized that I could derive the Billing Month from the Billing Period column and then drop the Billing Period column afterward, keeping the table lean for reporting.
Thus, I added a Custom column with the following formula:
Date.MonthName([Billing Period]) & "-" & Text.From(Date.Year([Billing Period]))
This yields:
To ensure this column can be sorted by month number, I added another Custom column:
(Date.Year([Billing Period])*100) + Date.Month([Billing Period])
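As a quick check with a hypothetical Billing Period of 1 January 2023: the first formula returns “January-2023”, while the second returns (2023 * 100) + 1 = 202301, so later months always receive larger numbers.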
Using the Sort by Column feature in Power BI, I can sort the Billing Month column based on this new BillingMonthNum column.
Next, I’ll set the appropriate Data type (Text for the Billing Month and “Whole Number” for the BillingMonthNum column).
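In M, both type assignments fit into a single Table.TransformColumnTypes step (the preceding step name is an assumption):
Table.TransformColumnTypes(#"Added Custom", {{"Billing Month", type text}, {"BillingMonthNum", Int64.Type}})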
An alternative method for adding a sorting column is through Ranking.
For example, consider the Invoice No column:
If we want to incorporate a sorting mechanism for this column, an Index column won’t suffice, since it would generate a continuous number irrespective of the content.
Moreover, several rows have null as the Invoice No.
An Index is therefore unusable as a sort key in Power BI: the same Invoice No would map to several different index values, while the Sort by Column feature requires exactly one sort value per value of the sorted column.
We can resolve this by employing the following expression for a new Custom column:
Table.AddRankColumn(#"Changed Type", "InvoiceNoRank", "Invoice No", [RankKind = RankKind.Dense])
- #"Changed Type" refers to the preceding step (the input table).
- “InvoiceNoRank” is the name of the new column.
- “Invoice No” is the column for ranking.
- The final parameter, [RankKind = RankKind.Dense], is the crucial one.
With [RankKind=RankKind.Dense], I instruct the function to assign the same ranking to rows with identical invoice numbers while maintaining a continuous number range.
For further insights on this function, refer to the following article:
Adding Pre-Calculated Rank in Power BI Using Power Query
The outcome looks like this:
You can see that all entries with a blank Invoice No receive a ranking of 1, with subsequent invoice numbers following as 2, 3, 4, etc.
When merging this table with the fact table, I will select all columns that uniquely identify each row, specifically the Billing Period and Invoice No columns.
Final Steps — Completing the Data Model
At last, I can finalize the Data model.
Before transitioning to Power BI, I’ll remove any obsolete columns from the Original table in Power Query and rename it to “FactAzureUsageDetails.”
Upon loading the data into Power BI, I’ll need to verify the automatically created relationships to ensure all connections are made through the ID columns. Sometimes, Power BI defaults to creating relationships via Name columns, which is less than ideal.
Since I previously established a report based on the original data model, I will need to update all visuals and measures that utilized columns now relocated to the dimension tables.
Final Result
This is the completed Data model:
This is precisely what I aimed to achieve.
Interestingly, the new pbix file is now larger than the original—800 kB versus 750 kB previously. The raw data itself is approximately 20 MB.
In the past, I've made such modifications that resulted in a smaller pbix file. However, I suspect that given the limited data volume, the added complexity has led to the increase in size.
Conclusion
The effort involved in refining my Data model was significant but not overwhelming. It’s definitely worthwhile to pursue this methodology for a more efficient solution in Power BI, rather than relying on a table with numerous columns that can be cumbersome to navigate.
Imagine consolidating all columns related to a single topic into one Dimension table. Even when sorted alphabetically, finding them becomes much easier compared to having them dispersed across the original table.
In my view, this topic warrants greater attention. I often see colleagues creating solutions in Power BI with overly wide tables. My immediate question is, “Why utilize such a table? Why not implement a Star Schema in your Data model?”
The typical response is, “Because the data arrives in that format, and I lack the time to adjust the structure.”
However, once performance issues arise, my first step is always to analyze and optimize the Data model.
I recommend using DAX Studio with VertiPaq Analyzer (SQLBI Tools Page) to gather statistics about the Data model and identify potential issues.
For a brief introduction on using this combination, check out the following resource:
How to use Vertipaq Analyzer with DAX Studio for Power BI Model analysis
Feel free to search for “How to use VertiPaq Analyzer” in your preferred search engine.
I’m contemplating writing an article on this topic soon, so please let me know in the comments if you’re interested in such content.
References
The data derives from my personal Azure Subscription, where I download monthly consumption from the Azure Portal and import it into an Azure SQL database.
For further reading on the significance of a Star Schema as a Data model, refer to the SQLBI article:
Power BI - Star schema or single table - SQLBI
Additionally, for clarity on the difference between the Duplicate and Reference features in Power Query, check this article:
Reference vs Duplicate in Power BI; Power Query Back to Basics