Schema Design Principles in Modern Data Warehousing - Part 2

Published on Feb 6, 2026 | Author: Alulu | ~ 19 min read
Tags: Data Style Guide, SQL, Data Warehouse, Data Cleaning, ETL, ELT, Style Guide
Before analytics can scale, before dashboards can be trusted, and before transformations can be efficient, the foundation has to be solid. That foundation is schema design. A good schema makes querying intuitive, keeps pipelines maintainable, and ensures your warehouse grows sustainably.

Introduction

A database comprises more than one table, and the relationships among all the tables in a database are collectively referred to as a schema. A good schema design makes querying intuitive, keeps pipelines maintainable, and lets the warehouse grow sustainably.

Fact vs Dimension Tables

To understand schema design, we need to dissect some terminology first. A fact table is simply a table that contains facts, or measurable events, like sales details, transaction data, or website clicks. A dimension table contains data that describes the events in the fact table, like customer details, clients, products, or dates. In summary, fact tables hold the key numeric measures, while dimension tables provide the context that makes them easy to query and analyze.
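To make this concrete, here is a minimal sketch in ANSI-flavored SQL; the table and column names are illustrative (they anticipate the case study later in this post):

```sql
-- Dimension table: descriptive context, one row per product.
CREATE TABLE dim_product (
    product_sk   INT PRIMARY KEY,     -- surrogate key owned by the warehouse
    product_name VARCHAR(100),
    category     VARCHAR(50)
);

-- Fact table: one row per measurable event (a sale), referencing dimensions.
CREATE TABLE fct_sales (
    transaction_sk     INT PRIMARY KEY,
    product_sk         INT REFERENCES dim_product (product_sk),
    transaction_date   DATE,
    transaction_amount DECIMAL(10, 2) -- the numeric measure being analyzed
);
```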

Schema Architecture Styles

The most popular architectural styles include:

Star Schema

A schema where dimension tables surround a single central fact table. A star schema can have any number of dimension tables, with a many-to-one relationship between the fact table and each dimension table. It's known for its fast performance and easy-to-read queries: since the fact table is central, you rarely need complicated joins, which is what makes it the most popular schema style. However, its simple design becomes a limitation when data relationships get more complex.
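As a sketch of what that looks like in practice (using the fct_sales, dim_product, and dim_store tables built in the case study below), every dimension is exactly one join away from the fact table:

```sql
-- Star schema query: one join per dimension, no chains of joins.
SELECT
    s.region,
    p.category,
    SUM(f.transaction_amount) AS total_sales
FROM core.fct_sales   AS f
JOIN core.dim_product AS p ON f.product_sk = p.product_sk
JOIN core.dim_store   AS s ON f.store_sk   = s.store_sk
GROUP BY s.region, p.category;
```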

Snowflake Schema

Consists of one fact table connected to many dimension tables, which can in turn be connected to other dimension tables through many-to-one relationships. Typically, the tables are normalized to third normal form, with each dimension representing one level of a hierarchy, which reduces duplicate data. On the other hand, the more tables you join, the more complex the queries become. This can mean slower query performance and confused analysts who have to track multiple dimension tables. If the data isn't large and complicated, the snowflake schema may not be the best option.
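For contrast, here is a sketch of the same question against a snowflaked design, assuming a hypothetical dim_category table has been split out of dim_product; the category name is now two joins away instead of one:

```sql
-- Snowflake query: dim_product holds only a key into the normalized
-- dim_category table (hypothetical), adding an extra hop to every query.
SELECT
    c.category_name,
    SUM(f.transaction_amount) AS total_sales
FROM core.fct_sales    AS f
JOIN core.dim_product  AS p ON f.product_sk  = p.product_sk
JOIN core.dim_category AS c ON p.category_sk = c.category_sk
GROUP BY c.category_name;
```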

Galaxy (Fact Constellation) Schema

Involves multiple fact tables, each potentially having its own set of dimensions, with some dimensions shared between fact tables. This is more complex than both the snowflake and the star schema, due to the presence of multiple fact tables and shared dimensions. Such a schema is typically implemented in a large organization with numerous business processes, enabling cross-domain analysis and insights. If you opt for this choice, it's best to have an active maintenance schedule to keep performance and data quality challenges in check.
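A sketch of cross-domain analysis under a galaxy schema, assuming a hypothetical fct_shipments fact table that shares (conforms) the dim_store dimension with fct_sales; each fact is pre-aggregated so combining them does not fan out rows:

```sql
-- Galaxy schema: two fact tables share the conformed dim_store dimension.
-- Aggregating each fact separately avoids row fan-out when combining them.
WITH sales AS (
    SELECT store_sk, SUM(transaction_amount) AS total_sales
    FROM core.fct_sales
    GROUP BY store_sk
),
shipments AS (  -- fct_shipments is hypothetical
    SELECT store_sk, COUNT(*) AS total_shipments
    FROM core.fct_shipments
    GROUP BY store_sk
)
SELECT
    st.region,
    SUM(sa.total_sales)     AS total_sales,
    SUM(sh.total_shipments) AS total_shipments
FROM core.dim_store AS st
LEFT JOIN sales     AS sa ON sa.store_sk = st.store_sk
LEFT JOIN shipments AS sh ON sh.store_sk = st.store_sk
GROUP BY st.region;
```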

Once a schema is selected, it’s important to adhere to it throughout your architecture. And as your architecture evolves and your needs change, a transition can be made to a more suitable schema.

Data Layered Design (Raw → Staging → Curated)

Even if you are not using a full lakehouse, it's important to layer your warehouse for clarity, modularity, and testable pipelines, in line with your chosen data modelling standards.

Raw layer

Data lands here exactly as it arrives from the source systems, untouched. Keeping a faithful copy preserves traceability and makes reprocessing possible.

Staging Layer

Light cleanup happens here: casting types, standardizing names, and deduplicating, while staying close to the source grain.

Intermediate/Transform/Core Layer

Business logic lives here: facts and dimensions are modeled, surrogate keys are introduced, and naming conventions are enforced.

Curated/Refined Layer

Aggregated, consumption-ready tables (data marts) are served from here to BI tools and reports.

![Data layered design: raw → staging → core → mart](https://miro.medium.com/v2/resize:fit:1100/format:webp/1*DLLogWqmw-gkrTpD2lTKqg.png)
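As a minimal sketch of this layering, assuming one schema per layer and an illustrative staging view (the raw.customers name and cleanup rules here are hypothetical):

```sql
-- One schema per layer makes the raw -> staging -> core -> mart flow explicit.
CREATE SCHEMA raw;
CREATE SCHEMA staging;
CREATE SCHEMA core;
CREATE SCHEMA mart;

-- Staging sketch: standardize types and values without changing the grain.
CREATE VIEW staging.customers AS
SELECT
    src_customer_id,
    TRIM(customer_name)     AS customer_name,
    CAST(join_date AS DATE) AS join_date,
    LOWER(TRIM(email))      AS email
FROM raw.customers;
```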

Naming Conventions

It's also important to consider naming by layer, as we saw in the data layered design section. Layer-aware names make flows traceable and maintainable, allowing engineers to track the entire path of a process for root-cause analysis and make the necessary updates.

Clear, singular table and column names. Using clear, singular names for your columns and tables (dim_customer, not dim_customers) helps keep references consistent and predictable across the warehouse.

Avoid ambiguity; define your tables, columns, and pipelines with self-explanatory names. A user should be able to deduce what data can be found in a column or table by its name, and which pipeline supports a particular dataset.
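For example, cleaning up the ambiguous names seen in the case study table below might look like this (PostgreSQL-style RENAME syntax; the target names are one reasonable convention, not the only one):

```sql
-- Replace camelCase and cryptic shortcodes with self-explanatory snake_case.
ALTER TABLE analytics.sales_data RENAME COLUMN "joinDate" TO join_date;
ALTER TABLE analytics.sales_data RENAME COLUMN prod_id    TO product_id;
ALTER TABLE analytics.sales_data RENAME COLUMN cat        TO category;
ALTER TABLE analytics.sales_data RENAME COLUMN ts         TO transaction_ts;
```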

Common pitfalls and anti-patterns

While creating your schema design, it's easy to get carried away and lose the vision you had. In this process, here are some things you need to watch out for:

  1. Schema sprawl: too many tables, no clear purpose. In the curated/data mart layer, it's easy to get carried away and create many tables that answer questions that seemed useful at the time. But the main question is: are they actually useful? Is there an intended use case? A good schema design should be useful and cost-conscious; a table that doesn't answer any current business question may not be worth building.

  2. Mixing layers: keeping raw and transformed data in the same schema is a bad design choice. For example, when you need to perform root-cause analysis, it'll be difficult to pinpoint whether the problem comes from the transformation, the pipeline, or the data source.

  3. Inconsistent data types: be aware of inconsistent data types across tables. Having a column as a string in one table and an int in another is like having half of your tools in the metric system and the rest in imperial: it's doable, but you can easily lose the plot. It leads to messy queries, data quality risks, and performance hits from type conversion.

  4. Lack of version control: version control provides a mitigation plan for breaking changes, which will inevitably be introduced at some point in an engineer's career.

  5. No documentation: as an engineer, it's important to remember one rule: a schema is only obvious to its architect. Document the complex logic and the seemingly obvious tables alike; one lightweight approach is sketched below.
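One lightweight way to keep documentation next to the schema itself (COMMENT ON syntax as in PostgreSQL; the wording is illustrative):

```sql
-- Attach documentation directly to the objects it describes.
COMMENT ON TABLE core.dim_customer IS
    'One row per customer. Surrogate key: customer_sk; natural key kept as src_customer_id.';
COMMENT ON COLUMN core.fct_sales.transaction_amount IS
    'Final sale amount, standardized to DECIMAL(10,2) during the core load.';
```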

Case study 

Consider the source-driven flat schema below, which uses the operational data store approach. It mirrors how data comes out of the source systems, prioritizing quick ingestion over a clean schema, and is usually useful for short-term reporting.

Single Wide Table Dumped from Source

analytics.sales_data

 

| transaction_id | customer_id | name | joinDate | email | prod_id | product | cat | price | storeCode | store | region | amount | ts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1001 | C123 | Alice Johnson | 2022/03/01 | alice@email.com | P001 | WidgetA | Gadgets | 49.99 | NYC | S-NYC | East | 49.99 | 2024-07-15 14:03 |
| 1002 | C124 | Rob Hunter | 2023-01-15 | rob@email.com | P002 | WidgetB | Gadgets | 19.99 | NYC | Store1 | East | 19.99 | 2024-07-15 14:05 |
| 1003 | C123 | Alice Johnson | Mar 1 2022 | alice@email.com | P003 | WidgetC | Premium | 99.99 | LA | S-LA | West | 99.99 | 2024-07-16 10:11 |

What’s Wrong Here?

1. Inconsistent Data Types: joinDate is sometimes YYYY/MM/DD, sometimes YYYY-MM-DD, and sometimes free text (Mar 1 2022). amount is a float, while price arrives as a string in some loads.

2. No Surrogate Keys: all joins rely on natural keys like customer_id, prod_id, and storeCode. If a customer's ID changes in the source (system migration, merges), history is broken.

3. Mixed Grain: customer attributes (name, email) are repeated on every transaction row, causing update anomalies, larger storage, and wasted compute.

4. Inconsistent Naming: some columns are snake_case (transaction_id), some camelCase (joinDate), and some shortcodes (prod_id). The store alone is spread across three columns: storeCode, store, and region.

5. No Clear Layering: raw source data and derived values live in the same table, with no separation of raw → core → mart.

To fix these issues, we’ll implement the recommendations discussed earlier: 

The raw layer

raw.sales.transaction

| src_transaction_id | src_customer_id | src_product_id |
|---|---|---|
| 1001 | C123 | P001 |
| 1002 | C124 | P002 |
| 1003 | C123 | P003 |

raw.sales.customers

| src_customer_id | customer_name | join_date | email | country |
|---|---|---|---|---|
| C123 | Alice Mcdoodle | 2022-03-01 | alice@email.com | UK |
| C214 | Rob Hunter | 2023-01-15 | rob@email.com | US |



raw.sales.products

| src_product_id | product_name | category | price |
|---|---|---|---|
| P001 | Widget A | Gadgets | 49.99 |
| P002 | Widget B | Gadgets | 19.99 |
| P003 | Widget C | Premium | 99.99 |

raw.sales.stores_raw

| src_store_id | store_name | region |
|---|---|---|
| S01 | Store NYC | East |
| S02 | Store LA | West |

 

Core Layer (Business Logic + Surrogate Keys)

In this layer, we standardize data types, introduce surrogate keys, and enforce naming conventions. Facts and dimensions are also identified here.
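A sketch of how the surrogate keys might be minted when loading dim_customer from the raw layer (ROW_NUMBER is one common approach; an IDENTITY or sequence column works too, and the table paths follow the case study):

```sql
-- Build the customer dimension: standardize types, mint surrogate keys,
-- and retain the natural key (src_customer_id) for traceability.
CREATE TABLE core.dim_customer AS
SELECT
    ROW_NUMBER() OVER (ORDER BY src_customer_id) AS customer_sk,
    src_customer_id,
    customer_name,
    CAST(join_date AS DATE) AS join_date,
    email,
    country
FROM raw.sales.customers;
```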

Dimension Tables

core.dim_customer

| customer_sk | src_customer_id | customer_name | join_date | email | country |
|---|---|---|---|---|---|
| 1 | C123 | Alice Mcdoodle | 2022-03-01 | alice@email.com | UK |
| 2 | C214 | Rob Hunter | 2023-01-15 | rob@email.com | US |

 

core.dim_product

| product_sk | src_product_id | product_name | category | price |
|---|---|---|---|---|
| 10 | P001 | Widget A | Gadgets | 49.99 |
| 11 | P002 | Widget B | Gadgets | 19.99 |
| 12 | P003 | Widget C | Premium | 99.99 |

 

core.dim_store

| store_sk | src_store_id | store_name | region |
|---|---|---|---|
| 100 | S01 | Store NYC | East |
| 101 | S02 | Store LA | West |

 

Fact Table

core.fct_sales

| transaction_sk | transaction_date | customer_sk | product_sk | store_sk | transaction_amount | src_transaction_id |
|---|---|---|---|---|---|---|
| 5001 | 2024-07-15 | 1 | 10 | 100 | 49.99 | 1001 |
| 5002 | 2024-07-15 | 2 | 11 | 100 | 19.99 | 1002 |
| 5003 | 2024-07-16 | 1 | 12 | 101 | 99.99 | 1003 |
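A sketch of how this fact table might be loaded, swapping each natural key for its dimension's surrogate key (it assumes the raw feed also carries the store ID, timestamp, and amount, which are elided from the raw excerpt above):

```sql
-- Resolve natural keys to surrogate keys while loading the fact table.
INSERT INTO core.fct_sales
    (transaction_sk, transaction_date, customer_sk, product_sk,
     store_sk, transaction_amount, src_transaction_id)
SELECT
    ROW_NUMBER() OVER (ORDER BY t.src_transaction_id) + 5000 AS transaction_sk,
    CAST(t.transaction_ts AS DATE)   AS transaction_date,    -- assumed column
    c.customer_sk,
    p.product_sk,
    s.store_sk,
    CAST(t.amount AS DECIMAL(10, 2)) AS transaction_amount,  -- assumed column
    t.src_transaction_id
FROM raw.sales.transaction AS t
JOIN core.dim_customer AS c ON c.src_customer_id = t.src_customer_id
JOIN core.dim_product  AS p ON p.src_product_id  = t.src_product_id
JOIN core.dim_store    AS s ON s.src_store_id    = t.src_store_id;  -- assumed column on t
```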

 

Mart Layer (Presentation / Reporting)

Aggregates for BI consumption.

mart.sales.sales_summary_monthly

| month | region | category | total_transactions |
|---|---|---|---|
| 2024-07 | East | Gadgets | 2 |
| 2024-07 | West | Premium | 1 |

 

Monthly sales summary by region and product category.
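A sketch of how this mart could be derived from the core star schema (TO_CHAR month bucketing as in PostgreSQL; other warehouses spell it differently):

```sql
-- Aggregate the core star schema into a monthly reporting table.
CREATE TABLE mart.sales.sales_summary_monthly AS
SELECT
    TO_CHAR(f.transaction_date, 'YYYY-MM') AS month,
    s.region,
    p.category,
    COUNT(*) AS total_transactions
FROM core.fct_sales   AS f
JOIN core.dim_store   AS s ON f.store_sk   = s.store_sk
JOIN core.dim_product AS p ON f.product_sk = p.product_sk
GROUP BY 1, 2, 3;
```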

Schema Style

The core layer is a textbook star schema: one fact table and multiple dimension tables, joined via surrogate keys. Natural keys are retained for traceability (the src_* columns), data types are standardized (e.g., dates, decimals), and clear, consistent naming is enforced (the dim_ and fct_ prefixes, plus the raw, core, and mart schemas), as discussed in previous sections.

Watch out for more parts!
