A database typically comprises more than one table, and the relationships among all the tables in a database are collectively referred to as a schema. A good schema design:
- Reduces data redundancy
- Prevents data inconsistencies and inaccuracies
- Preserves data integrity
- Facilitates rapid data retrieval and analysis
- Keeps sensitive and confidential data secure and accessible only to those who need it
To understand schema design, we need to define some terminology first. A fact table is simply a table that contains facts, or measurable events, like sales details, transaction data, or website clicks. A dimension table contains data that describes the events in the fact table, like customer details, clients, products, or dates. In summary, fact tables hold the key numeric data, while dimension tables provide the context, making the data easy to query and analyze.
The most popular architectural styles include:
**Star schema.** Dimension tables surround a single central fact table. A star schema can have any number of dimension tables, with a many-to-one relationship between the fact table and each dimension table. It's known for its fast performance and easy-to-read queries, which makes it the most popular choice. Since the fact table is central, you rarely need complicated joins (see the sketch after these descriptions). However, its simplistic design becomes a limitation when data relationships get more complex.

**Snowflake schema.** Consists of one fact table connected to many dimension tables, which can in turn connect to other dimension tables through many-to-one relationships. Typically, the tables are normalized to third normal form, with each dimension table representing one level of a hierarchy, which reduces duplicate data. On the other hand, the more tables you join, the greater the complexity. This results in slower query performance and confused analysts who have to track multiple dimension tables. If the data isn't large and complicated, the snowflake schema may not be the best option.

**Galaxy schema.** Involves multiple fact tables, each potentially having its own set of dimensions, with some dimensions shared between fact tables. This is more complex than both the snowflake and star schemas, due to the presence of multiple fact tables and shared dimensions. Such a schema is typically implemented in a large organization with numerous business processes, enabling cross-domain analysis and insights. If you opt for this choice, it's best to have an active maintenance schedule to mitigate performance and data quality challenges.
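To make the star schema concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (fact_sales, dim_customer, dim_product) are illustrative choices, not a prescribed standard:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables hold the descriptive context.
cur.execute("CREATE TABLE dim_customer (customer_sk INTEGER PRIMARY KEY, customer_name TEXT)")
cur.execute("CREATE TABLE dim_product (product_sk INTEGER PRIMARY KEY, product_name TEXT, category TEXT)")

# The central fact table holds the measurable events, with a
# many-to-one reference to each dimension.
cur.execute("""
CREATE TABLE fact_sales (
    transaction_sk INTEGER PRIMARY KEY,
    customer_sk    INTEGER REFERENCES dim_customer(customer_sk),
    product_sk     INTEGER REFERENCES dim_product(product_sk),
    amount         REAL
)""")

cur.execute("INSERT INTO dim_customer VALUES (1, 'Alice Johnson')")
cur.execute("INSERT INTO dim_product VALUES (10, 'Widget A', 'Gadgets')")
cur.execute("INSERT INTO fact_sales VALUES (5001, 1, 10, 49.99)")

# A typical star-schema query: one hop from the fact to each dimension.
for row in cur.execute("""
    SELECT c.customer_name, p.category, SUM(f.amount) AS total
    FROM fact_sales f
    JOIN dim_customer c ON f.customer_sk = c.customer_sk
    JOIN dim_product  p ON f.product_sk  = p.product_sk
    GROUP BY c.customer_name, p.category
"""):
    print(row)  # ('Alice Johnson', 'Gadgets', 49.99)
```

Note how each join is a single hop from the fact table; a snowflake schema would add further joins from dimension tables out to sub-dimension tables.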
Once a schema is selected, it's important to adhere to it throughout your architecture. As your architecture evolves and your needs change, you can transition to a more suitable schema.
Even if you are not using a full lakehouse, it's important to layer your warehouse for clarity, modularity, and testable pipelines, in line with your selected data modelling standards. The typical layers are described below, followed by a small code sketch of the flow.
**Raw layer**
- Data is an exact copy of the source.
- No transformations are done.
- It is used for audit, exploration, and reproducibility purposes.
- Consumers are usually data engineers.
**Staging layer**
- Light cleaning.
- Standardized formats.
- Ready for joins and deduplication.
- Highly governed and documented.
- Maintains history if the raw source doesn't support it.
- Consumers are usually data engineers, scientists, and analysts.
**Core layer**
- Business logic begins.
- Joins, aggregations, and filters.
- Considered the single source of truth.
- Slowly changing dimension (SCD Type 2) history is maintained (sketched after this list).
- Consumed by data analysts, data scientists, and operational analysts.
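Here is a minimal sketch of the SCD Type 2 pattern mentioned above, again using Python's sqlite3 module; the dim_customer layout and the valid_from/valid_to/is_current columns are illustrative assumptions, not a fixed standard:

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical SCD Type 2 dimension: every attribute change gets a new
# row, and the validity columns preserve the full history.
cur.execute("""
CREATE TABLE dim_customer (
    customer_sk     INTEGER PRIMARY KEY AUTOINCREMENT,
    src_customer_id TEXT,
    country         TEXT,
    valid_from      TEXT,
    valid_to        TEXT,   -- NULL means "still current"
    is_current      INTEGER
)""")
cur.execute(
    "INSERT INTO dim_customer (src_customer_id, country, valid_from, valid_to, is_current) "
    "VALUES ('C123', 'UK', '2022-03-01', NULL, 1)"
)

def apply_scd2_update(src_id: str, new_country: str, changed_on: str) -> None:
    """Close the current row, then insert a new current row."""
    cur.execute(
        "UPDATE dim_customer SET valid_to = ?, is_current = 0 "
        "WHERE src_customer_id = ? AND is_current = 1",
        (changed_on, src_id),
    )
    cur.execute(
        "INSERT INTO dim_customer (src_customer_id, country, valid_from, valid_to, is_current) "
        "VALUES (?, ?, ?, NULL, 1)",
        (src_id, new_country, changed_on),
    )

apply_scd2_update("C123", "US", date.today().isoformat())
for row in cur.execute("SELECT * FROM dim_customer ORDER BY customer_sk"):
    print(row)  # the closed UK row, then the current US row
```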
**Data products (mart layer)**
- Facilitates self-service analytics.
- Feeds the BI tools and data consumers.
- Consumers include BI analysts, heads of departments, and data scientists.
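The sketch below shows one way data might move through these layers, using Python's sqlite3 module. SQLite has no schema namespaces, so raw_/stg_/core_ table prefixes stand in for raw., staging., and core. schemas, and all names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Raw layer: an exact copy of the source, no transformations.
cur.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT, loaded_at TEXT)")
cur.execute("INSERT INTO raw_orders VALUES ('1001', ' 49.99 ', '2024-07-15 14:03')")

# Staging layer: light cleaning and standardized formats, ready for joins.
cur.execute("""
CREATE TABLE stg_orders AS
SELECT CAST(order_id AS INTEGER)  AS order_id,
       CAST(TRIM(amount) AS REAL) AS amount,
       DATE(loaded_at)            AS load_date
FROM raw_orders
""")

# Core layer: business logic begins; the single source of truth.
cur.execute("CREATE TABLE core_fact_orders AS SELECT * FROM stg_orders WHERE amount > 0")

for row in cur.execute("SELECT * FROM core_fact_orders"):
    print(row)  # (1001, 49.99, '2024-07-15')
```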
It’s also important to consider naming by layer, as we've seen in the layering and design section above. It improves traceability and maintainability, allowing engineers to track the entire flow of a process for root cause analysis and make the necessary updates.
Use clear, singular names for your tables and columns. Avoid ambiguity; define your tables, columns, and pipelines with self-explanatory names. A user should be able to deduce what data can be found in a column or table by its name, and which pipeline supports a particular dataset.
While creating your schema design, it's easy to get carried away and lose the vision you started with. Here are some things to watch out for:
- Schema sprawl: too many tables, no clear purpose. In the curated/data mart layer, it's easy to get caught up in the moment and create many tables that answer questions that seemed useful at the time. But the main question is: are they useful? Is there an intended use case? A good schema design should be useful and cost-conscious. A table that doesn't answer any current business question may not be practical.
- Mixing layers: raw and transformed data in the same schema is a bad design choice. For example, if you need to perform root cause analysis, it will be difficult to pinpoint whether the problem comes from the transformation, the pipeline, or the data source.
- Inconsistent data types: be aware of inconsistent data types across tables. Having a column as a string in one table and an int in another is like having half of your tools in the metric system and the rest in imperial: it's doable, but you can easily lose the plot. It leads to messy queries, data quality risks, and performance hits from type conversion (see the sketch after this list).
- Lack of version control: version control provides a mitigation plan for breaking changes, which will inevitably occur at some point in an engineer's career.
- No documentation: as an engineer, it's important to remember one rule: a schema is only obvious to its architect. Document the complex logic, and the seemingly obvious parts, too.
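To illustrate the inconsistent-data-types pitfall, here is a tiny Python sketch of the silent mismatch; the orders/payments structures are hypothetical:

```python
# Simulating a join between two extracts whose "same" key column has
# different types: a string in one system, an integer in another.
orders = {"1001": "Widget A"}   # transaction_id stored as a string
payments = [(1001, 49.99)]      # transaction_id stored as an integer

for txn_id, amount in payments:
    # The lookup silently misses because "1001" != 1001, just as
    # mismatched column types cause dropped rows or costly implicit
    # casts in SQL joins.
    print(txn_id, orders.get(txn_id, "NO MATCH"), amount)
```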
Consider the source-driven flat schema below, which uses the operational data store approach. It mirrors how data comes out of the source systems, prioritizing quick ingestion over a clean schema, and is usually useful only for short-term reporting.
analytics.sales_data
| transaction_id | customer_id | name | joinDate | email | prod_id | product | cat | price | storeCode | store | region | amount | ts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1001 | C123 | Alice Johnson | 2022/03/01 | alice@email.com | P001 | WidgetA | Gadgets | 49.99 | NYC | S-NYC | East | 49.99 | 2024-07-15 14:03 |
| 1002 | C124 | Rob Hunter | 2023-01-15 | rob@email.com | P002 | WidgetB | Gadgets | 19.99 | NYC | Store1 | East | 19.99 | 2024-07-15 14:05 |
| 1003 | C123 | Alice Johnson | Mar 1 2022 | alice@email.com | P003 | WidgetC | Premium | 99.99 | LA | S-LA | West | 99.99 | 2024-07-16 10:11 |
What’s Wrong Here?
1. Inconsistent Data Types: joinDate is sometimes YYYY/MM/DD, sometimes YYYY-MM-DD, sometimes free text (Mar 1 2022). amount is a float, while price arrives as a string in some loads (a normalization sketch follows this list).
2. No Surrogate Keys: all joins rely on natural keys like customer_id, prod_id, and storeCode. If a customer's ID changes in the source (system migration, merges), history is broken.
3. Mixed Grain: customer attributes (name, email) are repeated on every transaction row, causing update anomalies, larger storage, and wasted compute.
4. Inconsistent Naming: some columns are snake_case (transaction_id), some camelCase (joinDate), some shortcodes (prod_id). The store appears in three forms: storeCode, store, and region.
5. No Clear Layering: raw source data and derived values live in the same table, with no separation of raw → core → mart.
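As an illustration of the first issue, the inconsistent joinDate values can be normalized during cleaning. A minimal sketch using Python's datetime, assuming the three formats seen in the table above are the only ones that occur:

```python
from datetime import datetime

# The three inconsistent formats observed in analytics.sales_data.
KNOWN_FORMATS = ["%Y/%m/%d", "%Y-%m-%d", "%b %d %Y"]

def normalize_join_date(raw: str) -> str:
    """Return the date in ISO YYYY-MM-DD form, or fail loudly."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")

for raw in ["2022/03/01", "2023-01-15", "Mar 1 2022"]:
    print(raw, "->", normalize_join_date(raw))
```

Failing loudly on an unknown format is deliberate: a silent fallback would reintroduce the very inconsistency the staging layer exists to remove.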
To fix these issues, we’ll implement the recommendations discussed earlier:
raw.sales.transactions

| src_transaction_id | src_customer_id | src_product_id | src_store_id | amount | ts |
|---|---|---|---|---|---|
| 1001 | C123 | P001 | S01 | 49.99 | 2024-07-15 14:03 |
| 1002 | C124 | P002 | S01 | 19.99 | 2024-07-15 14:05 |
| 1003 | C123 | P003 | S02 | 99.99 | 2024-07-16 10:11 |
raw.sales.customers
| src_customer_id | customer_name | join_date | email | country |
|---|---|---|---|---|
| C123 | Alice Johnson | 2022-03-01 | alice@email.com | UK |
| C124 | Rob Hunter | 2023-01-15 | rob@email.com | US |
raw.sales.products
| src_product_id | product_name | category | price |
|---|---|---|---|
| P001 | Widget A | Gadgets | 49.99 |
| P002 | Widget B | Gadgets | 19.99 |
| P003 | Widget C | Premium | 99.99 |
raw.sales.stores

| src_store_id | store_name | region |
|---|---|---|
| S01 | Store NYC | East |
| S02 | Store LA | West |
In the core layer, we standardize data types, introduce surrogate keys, and enforce naming conventions. Facts and dimensions are also identified here.
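As a rough sketch of how the surrogate keys in the tables below might be assigned, using SQLite's AUTOINCREMENT as a stand-in for a proper warehouse key sequence (table names flattened, since SQLite has no schema namespaces):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Stand-in for raw.sales.customers.
cur.execute("CREATE TABLE raw_customers (src_customer_id TEXT, customer_name TEXT, join_date TEXT)")
cur.executemany(
    "INSERT INTO raw_customers VALUES (?, ?, ?)",
    [("C123", "Alice Johnson", "2022-03-01"),
     ("C124", "Rob Hunter", "2023-01-15")],
)

# The surrogate key is generated by the warehouse, independent of the
# natural key, so a source-system ID change cannot break history.
cur.execute("""
CREATE TABLE dim_customer (
    customer_sk     INTEGER PRIMARY KEY AUTOINCREMENT,
    src_customer_id TEXT UNIQUE,   -- natural key kept for traceability
    customer_name   TEXT,
    join_date       TEXT
)""")
cur.execute("""
INSERT INTO dim_customer (src_customer_id, customer_name, join_date)
SELECT src_customer_id, customer_name, join_date FROM raw_customers
""")

for row in cur.execute("SELECT * FROM dim_customer"):
    print(row)  # (1, 'C123', 'Alice Johnson', '2022-03-01'), ...
```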
core.dim_customer
| customer_sk | src_customer_id | customer_name | join_date | email | country |
|---|---|---|---|---|---|
| 1 | C123 | Alice Johnson | 2022-03-01 | alice@email.com | UK |
| 2 | C124 | Rob Hunter | 2023-01-15 | rob@email.com | US |
core.dim_product
| product_sk | src_product_id | product_name | category | price |
|---|---|---|---|---|
| 10 | P001 | Widget A | Gadgets | 49.99 |
| 11 | P002 | Widget B | Gadgets | 19.99 |
| 12 | P003 | Widget C | Premium | 99.99 |
core.dim_store
| store_sk | src_store_id | store_name | region |
|---|---|---|---|
| 100 | S01 | Store NYC | East |
| 101 | S02 | Store LA | West |
core.fact_sales
| transaction_sk | transaction_date | customer_sk | product_sk | store_sk | transaction_amount | src_transaction_id |
|---|---|---|---|---|---|---|
| 5001 | 2024-07-15 | 1 | 10 | 100 | 49.99 | 1001 |
| 5002 | 2024-07-15 | 2 | 11 | 100 | 19.99 | 1002 |
| 5003 | 2024-07-16 | 1 | 12 | 101 | 99.99 | 1003 |
The mart layer holds aggregates for BI consumption.
mart.sales.sales_summary_monthly
| month | region | category | total_transactions |
|---|---|---|---|
| 2024-07 | East | Gadgets | 2 |
| 2024-07 | West | Premium | 1 |
Monthly sales summary by region and product category.
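A minimal sketch of how this summary could be derived from the core tables with a single aggregate query; the setup rows mirror core.fact_sales, core.dim_store, and core.dim_product above, with names flattened for SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE fact_sales (transaction_date TEXT, customer_sk INTEGER, "
            "product_sk INTEGER, store_sk INTEGER, amount REAL)")
cur.execute("CREATE TABLE dim_store (store_sk INTEGER, region TEXT)")
cur.execute("CREATE TABLE dim_product (product_sk INTEGER, category TEXT)")
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?)",
                [("2024-07-15", 1, 10, 100, 49.99),
                 ("2024-07-15", 2, 11, 100, 19.99),
                 ("2024-07-16", 1, 12, 101, 99.99)])
cur.executemany("INSERT INTO dim_store VALUES (?, ?)", [(100, "East"), (101, "West")])
cur.executemany("INSERT INTO dim_product VALUES (?, ?)",
                [(10, "Gadgets"), (11, "Gadgets"), (12, "Premium")])

# Aggregate the fact table by month, region, and category for BI use.
cur.execute("""
CREATE TABLE sales_summary_monthly AS
SELECT STRFTIME('%Y-%m', f.transaction_date) AS month,
       s.region,
       p.category,
       COUNT(*) AS total_transactions
FROM fact_sales f
JOIN dim_store   s ON f.store_sk   = s.store_sk
JOIN dim_product p ON f.product_sk = p.product_sk
GROUP BY month, s.region, p.category
""")
for row in cur.execute("SELECT * FROM sales_summary_monthly ORDER BY region"):
    print(row)  # ('2024-07', 'East', 'Gadgets', 2), ('2024-07', 'West', 'Premium', 1)
```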
The core layer is a defined star schema: one fact table and multiple dimension tables, joined via surrogate keys. Natural keys are retained for traceability (src_* columns), data types are standardized (e.g., dates, decimals), and clear, consistent naming is in place (raw, core, and mart schemas; dim_ and fact_ prefixes), as discussed in previous sections.
Watch out for more parts!