Schema Design Principles in Modern Data Warehousing - Part 2 [Updated]

Published on Aug 18, 2025 | Author: Fiona | ~ 14 min read
‘Before analytics can scale, before dashboards can be trusted, and before transformations can be efficient, the foundation has to be solid. That foundation is schema design. A good schema makes querying intuitive, keeps pipelines maintainable, and ensures your warehouse grows sustainably.’

Introduction

A database comprises more than one table, and the relationships among all the tables in a database are collectively referred to as a schema. A good schema design makes querying intuitive, keeps pipelines maintainable, and ensures your warehouse grows sustainably.

Fact vs Dimension Tables

To understand schema design, we first need to unpack some terminology. A fact table is simply a table that contains facts, or measurable events, like sales details, transaction data, or website clicks. A dimension table contains data that describes the events in the fact table, like customer details, products, stores, or dates. In summary, fact tables hold the key numeric measures, while dimension tables provide the context that makes them easy to query and analyze.
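
As a minimal sketch (using the table and column names from the case study later in this post), the split looks like this in DDL: the fact table holds foreign keys plus numeric measures, and each dimension holds descriptive attributes.

    -- Fact table: foreign keys to the dimensions plus numeric measures.
    CREATE TABLE core.fact_sales (
        transaction_sk     BIGINT,
        transaction_date   DATE,
        customer_sk        BIGINT,         -- points at core.dim_customer
        product_sk         BIGINT,         -- points at core.dim_product
        store_sk           BIGINT,         -- points at core.dim_store
        transaction_amount DECIMAL(10, 2)  -- the measure being analyzed
    );

    -- Dimension table: descriptive attributes that give the measures context.
    CREATE TABLE core.dim_product (
        product_sk   BIGINT,
        product_name VARCHAR(100),
        category     VARCHAR(50),
        price        DECIMAL(10, 2)
    );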

Schema Architecture Styles

The most popular architectural styles include:

Star schema

A schema where dimension tables surround a single central fact table. A star schema can have any number of dimension tables, each with a many-to-one relationship between the fact table and that dimension. It’s known for its fast performance and easy-to-read queries: since the fact table is central, you rarely need complicated joins. However, its simple design becomes a limitation when data relationships get more complex.
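
A typical star-schema query therefore needs only single-hop joins from the fact table out to each dimension. A minimal sketch, using the core tables from the case study later in this post:

    -- Star schema: every join is one hop from the fact table to a dimension.
    SELECT
        c.country,
        s.region,
        SUM(f.transaction_amount) AS total_sales
    FROM core.fact_sales   AS f
    JOIN core.dim_customer AS c ON f.customer_sk = c.customer_sk
    JOIN core.dim_store    AS s ON f.store_sk    = s.store_sk
    GROUP BY c.country, s.region;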

Snowflake Schema

Consists of one fact table connected to many dimension tables, which can in turn be connected to other dimension tables through many-to-one relationships. Typically, the tables are normalized to third normal form, with each dimension representing one level of a hierarchy, which reduces duplicate data. On the other hand, the more tables you join, the more complex the queries become. This can mean slower query performance and confused analysts who have to track multiple dimension tables. If your data isn’t large and complicated, the snowflake schema may not be the best option.
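
To make the extra join depth concrete, here is a hedged sketch in which the product dimension has been normalized so that category details live in their own table (dim_category and its columns are hypothetical, not part of the case study below):

    -- Snowflake schema: the category attribute now sits one join further away.
    SELECT
        cat.category_name,
        SUM(f.transaction_amount) AS total_sales
    FROM core.fact_sales   AS f
    JOIN core.dim_product  AS p   ON f.product_sk  = p.product_sk
    JOIN core.dim_category AS cat ON p.category_sk = cat.category_sk  -- hypothetical normalized table
    GROUP BY cat.category_name;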

Galaxy (Fact Constellation) schema

Involves multiple fact tables, each potentially having its own set of dimensions, with some dimensions shared between fact tables. This is more complex than both the star and snowflake schemas, due to the presence of multiple fact tables and shared dimensions. Such a schema is typically implemented in a large organization with numerous business processes, enabling cross-domain analysis and insights. If you opt for this choice, it’s best to have an active maintenance schedule to mitigate performance and data quality challenges.
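
As a sketch, imagine a second, hypothetical fact table (core.fact_returns, with a refund_amount measure) that shares dim_product with fact_sales; cross-domain analysis then goes through the shared dimension, aggregating each fact table first to avoid join fan-out:

    -- Galaxy schema: two fact tables analysed through a shared dimension.
    WITH sales AS (
        SELECT product_sk, SUM(transaction_amount) AS sales_amount
        FROM core.fact_sales
        GROUP BY product_sk
    ), product_returns AS (
        SELECT product_sk, SUM(refund_amount) AS refund_amount
        FROM core.fact_returns  -- hypothetical second fact table
        GROUP BY product_sk
    )
    SELECT p.category, s.sales_amount, r.refund_amount
    FROM core.dim_product AS p
    LEFT JOIN sales           AS s ON s.product_sk = p.product_sk
    LEFT JOIN product_returns AS r ON r.product_sk = p.product_sk;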

Once a schema is selected, it’s important to adhere to it throughout your architecture. And as your architecture evolves and your needs change, a transition can be made to a more suitable schema.

Data Layered Design (Raw → Staging → Curated)

Even if you are not running a full lakehouse, it's important to layer your warehouse for clarity, modularity, and testable pipelines, following your chosen data modelling standards.

Raw layer

Stores data as it lands from the source systems, with minimal transformation (mirroring the raw.* tables in the case study below); the staging and curated layers then progressively clean, conform, and aggregate that data.

Naming Conventions

It’s also important to consider naming by layer, as we’ve seen in the data layered design section. It helps with traceability and maintainability, allowing engineers to track the entire flow of a process for root cause analysis and to make the necessary updates.
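
For example, layer-prefixed names (mirroring the case study later in this post) let an engineer trace a reported figure back through the stack, one schema at a time:

    -- Mart layer: the reported aggregate
    SELECT month, total_transactions FROM mart.sales.sales_summary_monthly;
    -- Core layer: the conformed fact behind it
    SELECT transaction_sk, transaction_amount FROM core.fact_sales;
    -- Raw layer: the untouched source record
    SELECT src_transaction_id FROM raw.sales.transaction;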

Clear, singular table and column names. Using clear, singular names for your tables and columns keeps the schema consistent and easy to read.

Avoid ambiguity; define your tables, columns, and pipelines with self-explanatory names. A user should be able to deduce what data can be found in a column or table from its name, and which pipeline supports a particular dataset.
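
A quick illustration (the ambiguous names are made up; the clear ones come from the case study below):

    -- Ambiguous: which id? which date? what value?
    -- SELECT id, dt, val FROM data1;

    -- Self-explanatory: the names say exactly what the data is.
    SELECT src_transaction_id, transaction_date, transaction_amount
    FROM core.fact_sales;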

Common pitfalls and anti-patterns

While creating your schema design, it's easy to get carried away and lose sight of your original vision. Here are some things to watch out for:

  1. Schema sprawl: too many tables, no clear purpose. In the curated/data mart layer, it's easy to get carried away and create many tables that answer questions which seemed useful at the time. But the main question is: are they actually useful? Is there an intended use case? A good schema design should be useful and cost-conscious; a table that doesn't answer any current business question may not be practical.

  2. Mixing layers: keeping raw and transformed data in the same schema is a bad design choice. For example, if you need to perform root cause analysis, it becomes difficult to pinpoint whether the problem comes from the transformation, the pipeline, or the data source.

  3. Inconsistent data types: be aware of inconsistent data types across tables. Having a column as a string in one table and an integer in another is like having half of your tools in metric and the rest in imperial. It's doable, but you can easily lose the plot: it leads to messy queries, data quality risks, and performance hits from type conversion (see the sketch after this list).

  4. Lack of version control: Version control provides a mitigation plan in the event of introducing breaking changes, which will inevitably occur at some point in an engineer’s career.

  5. No documentation: As an engineer, it's important to remember one rule: a schema is only fully understandable to its architect. Document the complex logic as well as the parts that seem obvious.
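
To illustrate pitfall 3, here is a hedged sketch of the defensive casting that inconsistent types force into every query (the orders and customers tables and their columns are hypothetical):

    -- customer_id is an INT in orders but a STRING in customers:
    -- every join now pays for a cast, and one bad value can break the query.
    SELECT o.order_id, c.customer_name
    FROM orders    AS o
    JOIN customers AS c
        ON CAST(o.customer_id AS VARCHAR(20)) = c.customer_id;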

Case study

Consider the source-driven flat schema below, which uses the operational data store approach. It mirrors how data comes out of the source systems, prioritizing quick ingestion over a clean schema, and is usually only useful for short-term reporting.

Single Wide Table Dumped from Source

analytics.sales_data

transaction_id | customer_id | name | joinDate | email | prod_id | product | cat | price | storeCode | store | region | amount | ts
1001 | C123 | Alice Johnson | 2022/03/01 | alice@email.com | P001 | WidgetA | Gadgets | 49.99 | NYC | S-NYC | East | 49.99 | 2024-07-15 14:03
1002 | C124 | Rob Hunter | 2023-01-15 | rob@email.com | P002 | WidgetB | Gadgets | 19.99 | NYC | Store1 | East | 19.99 | 2024-07-15 14:05
1003 | C123 | Alice Johnson | Mar 1 2022 | alice@email.com | P003 | WidgetC | Premium | 99.99 | LA | S-LA | West | 99.99 | 2024-07-16 10:11

What’s Wrong Here?

1. Inconsistent Data Types: joinDate is sometimes YYYY/MM/DD, sometimes YYYY-MM-DD, and sometimes free text (Mar 1 2022). amount is a float, while price arrives as a string in some loads.

2. No Surrogate Keys: all joins rely on natural keys like customer_id, prod_id, and storeCode. If a customer's ID changes in the source (system migration, merges), history is broken.

3. Mixed Grain: customer attributes (name, email) are repeated on every transaction row, leading to update anomalies, larger storage, and wasted compute.

4. Inconsistent Naming: some columns are snake_case (transaction_id), some camelCase (joinDate), some shortcodes (prod_id). Store information appears in three columns: storeCode, store, and region.

5. No Clear Layering: raw source data and derived values live in the same table. There is no separation of raw → core → mart.

To fix these issues, we’ll implement the recommendations discussed earlier:

The raw layer

raw.sales.transaction

src_transaction_id | src_customer_id | src_product_id
1001 | C123 | P001
1002 | C124 | P002
1003 | C123 | P003

raw.sales.customers

src_customer_id | customer_name | join_date | email | country
C123 | Alice Mcdoodle | 2022-03-01 | alice@email.com | UK
C124 | Rob Hunter | 2023-01-15 | rob@email.com | US

raw.sales.products

src_product_id | product_name | category | price
P001 | Widget A | Gadgets | 49.99
P002 | Widget B | Gadgets | 19.99
P003 | Widget C | Premium | 99.99

raw.sales.stores

src_store_id | store_name | region
S01 | Store NYC | East
S02 | Store LA | West

Core Layer (Business Logic + Surrogate Keys)

In this layer, we standardize data types, introduce surrogate keys, and enforce naming conventions. Facts and dimensions are also identified here.
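
As a hedged sketch (the exact surrogate-key mechanism, such as sequences, identity columns, or hash keys, depends on your warehouse), a core dimension might be derived from the raw layer like this:

    -- Build core.dim_customer from the raw layer:
    -- standardize types, keep the natural key for traceability,
    -- and generate a surrogate key.
    CREATE TABLE core.dim_customer AS
    SELECT
        ROW_NUMBER() OVER (ORDER BY src_customer_id) AS customer_sk,
        src_customer_id,
        customer_name,
        CAST(join_date AS DATE)                      AS join_date,
        email,
        country
    FROM raw.sales.customers;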

Dimension Tables

core.dim_customer

customer_sk | src_customer_id | customer_name | join_date | email | country
1 | C123 | Alice Mcdoodle | 2022-03-01 | alice@email.com | UK
2 | C124 | Rob Hunter | 2023-01-15 | rob@email.com | US

core.dim_product

product_sk | src_product_id | product_name | category | price
10 | P001 | Widget A | Gadgets | 49.99
11 | P002 | Widget B | Gadgets | 19.99
12 | P003 | Widget C | Premium | 99.99

core.dim_store

store_sk | src_store_id | store_name | region
100 | S01 | Store NYC | East
101 | S02 | Store LA | West

Fact Table

core.fact_sales

transaction_sk | transaction_date | customer_sk | product_sk | store_sk | transaction_amount | src_transaction_id
5001 | 2024-07-15 | 1 | 10 | 100 | 49.99 | 1001
5002 | 2024-07-15 | 2 | 11 | 100 | 19.99 | 1002
5003 | 2024-07-16 | 1 | 12 | 101 | 99.99 | 1003
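
A similarly hedged sketch of the fact load, assuming the raw transaction table also carries the store code, amount, and timestamp from the source (columns not shown in the trimmed sample above); natural keys are resolved to the dimensions’ surrogate keys:

    -- Load core.fact_sales: join raw transactions to the core dimensions
    -- on the natural (src_*) keys to pick up the surrogate keys.
    -- transaction_ts, amount, and src_store_id are assumed to exist in the raw table.
    CREATE TABLE core.fact_sales AS
    SELECT
        ROW_NUMBER() OVER (ORDER BY t.src_transaction_id) AS transaction_sk,
        CAST(t.transaction_ts AS DATE)                    AS transaction_date,
        c.customer_sk,
        p.product_sk,
        s.store_sk,
        CAST(t.amount AS DECIMAL(10, 2))                  AS transaction_amount,
        t.src_transaction_id
    FROM raw.sales.transaction AS t
    JOIN core.dim_customer AS c ON t.src_customer_id = c.src_customer_id
    JOIN core.dim_product  AS p ON t.src_product_id  = p.src_product_id
    JOIN core.dim_store    AS s ON t.src_store_id    = s.src_store_id;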

Mart Layer (Presentation / Reporting)

Aggregates for BI consumption.

mart.sales.sales_summary_monthly

month | region | category | total_transactions
2024-07 | East | Gadgets | 2
2024-07 | West | Premium | 1

Monthly sales summary by product category.
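
A hedged sketch of how this mart table could be derived from the core layer (DATE_TRUNC syntax varies slightly between warehouses):

    -- Monthly transaction counts by region and product category, for BI consumption.
    CREATE TABLE mart.sales.sales_summary_monthly AS
    SELECT
        DATE_TRUNC('month', f.transaction_date) AS month,
        s.region,
        p.category,
        COUNT(*)                                AS total_transactions
    FROM core.fact_sales  AS f
    JOIN core.dim_store   AS s ON f.store_sk   = s.store_sk
    JOIN core.dim_product AS p ON f.product_sk = p.product_sk
    GROUP BY DATE_TRUNC('month', f.transaction_date), s.region, p.category;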

Schema Style

The core layer is a well-defined star schema: one fact table and multiple dimension tables, joined via surrogate keys. Natural keys are retained for traceability (the src_* columns), data types are standardized (e.g., dates, decimals), and clear, consistent naming is applied across layers (raw tables, core dim_/fact_ tables, and the mart), as discussed in the previous sections.

 

Watch out for more parts!
