A database typically comprises more than one table, and the relationships among all the tables in a database are collectively referred to as a schema. A good schema design:
- Reduces data redundancy
- Prevents data inconsistencies and inaccuracies
- Preserves data integrity
- Facilitates rapid data retrieval and analysis
- Keeps sensitive and confidential data secure and accessible only to those who need it
To understand schema design, we need to define some terminology first. A fact table is simply a table that contains facts, or measurable events, like sales details, transaction data, or website clicks. A dimension table contains data that describes the events in the fact table, like customer details, clients, products, or dates. In summary, fact tables hold the key numeric data, while dimension tables provide the context, making the data easy to query and analyze.
The most popular architectural styles include:
**Star schema.** Dimension tables surround a single central fact table. A star schema can have any number of dimension tables, with a many-to-one relationship between the fact table and each dimension table. It's known for its fast performance and easy-to-read queries, which makes it the most popular choice. Since the fact table is central, you rarely need complicated joins (see the sketch after these descriptions). However, its simplistic design becomes a limitation when data relationships get more complex.

**Snowflake schema.** Consists of one fact table connected to many dimension tables, which can in turn connect to other dimension tables through many-to-one relationships. Typically, the tables are normalized to third normal form, with each dimension table representing one level of a hierarchy, which reduces duplicate data. On the other hand, the more tables you join, the greater the complexity. This results in slower query performance and confused analysts who have to track multiple dimension tables. If the data isn't large and complicated, the snowflake schema may not be the best option.

**Galaxy schema.** Involves multiple fact tables, each potentially having its own set of dimensions, with some dimensions shared between fact tables. This is more complex than both the snowflake and star schemas, due to the presence of multiple fact tables and shared dimensions. Such a schema is typically implemented in a large organization with numerous business processes, enabling cross-domain analysis and insights. If you opt for this choice, it's best to have an active maintenance schedule to mitigate performance and data quality challenges.
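To make the star schema concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (fact_sales, dim_customer, dim_product) are illustrative choices, not a prescribed standard:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables hold the descriptive context.
cur.execute("CREATE TABLE dim_customer (customer_sk INTEGER PRIMARY KEY, customer_name TEXT)")
cur.execute("CREATE TABLE dim_product (product_sk INTEGER PRIMARY KEY, product_name TEXT, category TEXT)")

# The central fact table holds the measurable events, with a
# many-to-one reference to each dimension.
cur.execute("""
CREATE TABLE fact_sales (
    transaction_sk INTEGER PRIMARY KEY,
    customer_sk    INTEGER REFERENCES dim_customer(customer_sk),
    product_sk     INTEGER REFERENCES dim_product(product_sk),
    amount         REAL
)""")

cur.execute("INSERT INTO dim_customer VALUES (1, 'Alice Johnson')")
cur.execute("INSERT INTO dim_product VALUES (10, 'Widget A', 'Gadgets')")
cur.execute("INSERT INTO fact_sales VALUES (5001, 1, 10, 49.99)")

# A typical star-schema query: one hop from the fact to each dimension.
for row in cur.execute("""
    SELECT c.customer_name, p.category, SUM(f.amount) AS total
    FROM fact_sales f
    JOIN dim_customer c ON f.customer_sk = c.customer_sk
    JOIN dim_product  p ON f.product_sk  = p.product_sk
    GROUP BY c.customer_name, p.category
"""):
    print(row)  # ('Alice Johnson', 'Gadgets', 49.99)
```

Note how each join is a single hop from the fact table; a snowflake schema would add further joins from dimension tables out to sub-dimension tables.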
Once a schema is selected, it's important to adhere to it throughout your architecture. As your architecture evolves and your needs change, you can transition to a more suitable schema.
Even if you are not using a full lakehouse, it's important to layer your warehouse for clarity, modularity, and testable pipelines, in line with your selected data modelling standards. The typical layers are described below, followed by a small code sketch of the flow.
**Raw layer**
- Data is an exact copy of the source.
- No transformations are done.
- It is used for audit, exploration, and reproducibility purposes.
- Consumers are usually data engineers.
**Staging layer**
- Light cleaning.
- Standardized formats.
- Ready for joins and deduplication.
- Highly governed and documented.
- Maintains history if the raw source doesn't support it.
- Consumers are usually data engineers, scientists, and analysts.
**Core layer**
- Business logic begins.
- Joins, aggregations, and filters.
- Considered the single source of truth.
- Slowly changing dimension (SCD Type 2) history is maintained (sketched after this list).
- Consumed by data analysts, data scientists, and operational analysts.
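Here is a minimal sketch of the SCD Type 2 pattern mentioned above, again using Python's sqlite3 module; the dim_customer layout and the valid_from/valid_to/is_current columns are illustrative assumptions, not a fixed standard:

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical SCD Type 2 dimension: every attribute change gets a new
# row, and the validity columns preserve the full history.
cur.execute("""
CREATE TABLE dim_customer (
    customer_sk     INTEGER PRIMARY KEY AUTOINCREMENT,
    src_customer_id TEXT,
    country         TEXT,
    valid_from      TEXT,
    valid_to        TEXT,   -- NULL means "still current"
    is_current      INTEGER
)""")
cur.execute(
    "INSERT INTO dim_customer (src_customer_id, country, valid_from, valid_to, is_current) "
    "VALUES ('C123', 'UK', '2022-03-01', NULL, 1)"
)

def apply_scd2_update(src_id: str, new_country: str, changed_on: str) -> None:
    """Close the current row, then insert a new current row."""
    cur.execute(
        "UPDATE dim_customer SET valid_to = ?, is_current = 0 "
        "WHERE src_customer_id = ? AND is_current = 1",
        (changed_on, src_id),
    )
    cur.execute(
        "INSERT INTO dim_customer (src_customer_id, country, valid_from, valid_to, is_current) "
        "VALUES (?, ?, ?, NULL, 1)",
        (src_id, new_country, changed_on),
    )

apply_scd2_update("C123", "US", date.today().isoformat())
for row in cur.execute("SELECT * FROM dim_customer ORDER BY customer_sk"):
    print(row)  # the closed UK row, then the current US row
```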
**Data products (mart layer)**
- Facilitates self-service analytics.
- Feeds the BI tools and data consumers.
- Consumers include BI analysts, heads of departments, and data scientists.
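The sketch below shows one way data might move through these layers, using Python's sqlite3 module. SQLite has no schema namespaces, so raw_/stg_/core_ table prefixes stand in for raw., staging., and core. schemas, and all names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Raw layer: an exact copy of the source, no transformations.
cur.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT, loaded_at TEXT)")
cur.execute("INSERT INTO raw_orders VALUES ('1001', ' 49.99 ', '2024-07-15 14:03')")

# Staging layer: light cleaning and standardized formats, ready for joins.
cur.execute("""
CREATE TABLE stg_orders AS
SELECT CAST(order_id AS INTEGER)  AS order_id,
       CAST(TRIM(amount) AS REAL) AS amount,
       DATE(loaded_at)            AS load_date
FROM raw_orders
""")

# Core layer: business logic begins; the single source of truth.
cur.execute("CREATE TABLE core_fact_orders AS SELECT * FROM stg_orders WHERE amount > 0")

for row in cur.execute("SELECT * FROM core_fact_orders"):
    print(row)  # (1001, 49.99, '2024-07-15')
```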
It’s also important to consider naming by layer, as we've seen in the layering and design section above. It improves traceability and maintainability, allowing engineers to track the entire flow of a process for root cause analysis and make the necessary updates.
Use clear, singular names for your tables and columns. Avoid ambiguity; define your tables, columns, and pipelines with self-explanatory names. A user should be able to deduce what data can be found in a column or table by its name, and which pipeline supports a particular dataset.
While creating your schema design, it's easy to get carried away and lose the vision you started with. Here are some things to watch out for:
- Schema sprawl: too many tables, no clear purpose. In the curated/data mart layer, it's easy to get caught up in the moment and create many tables that answer questions that seemed useful at the time. But the main question is: are they useful? Is there an intended use case? A good schema design should be useful and cost-conscious. A table that doesn't answer any current business question may not be practical.
- Mixing layers: raw and transformed data in the same schema is a bad design choice. For example, if you need to perform root cause analysis, it will be difficult to pinpoint whether the problem comes from the transformation, the pipeline, or the data source.
- Inconsistent data types: be aware of inconsistent data types across tables. Having a column as a string in one table and an int in another is like having half of your tools in the metric system and the rest in imperial: it's doable, but you can easily lose the plot. It leads to messy queries, data quality risks, and performance hits from type conversion (see the sketch after this list).
- Lack of version control: version control provides a mitigation plan for breaking changes, which will inevitably occur at some point in an engineer's career.
- No documentation: as an engineer, it's important to remember one rule: a schema is only obvious to its architect. Document the complex logic, and the seemingly obvious parts, too.
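To illustrate the inconsistent-data-types pitfall, here is a tiny Python sketch of the silent mismatch; the orders/payments structures are hypothetical:

```python
# Simulating a join between two extracts whose "same" key column has
# different types: a string in one system, an integer in another.
orders = {"1001": "Widget A"}   # transaction_id stored as a string
payments = [(1001, 49.99)]      # transaction_id stored as an integer

for txn_id, amount in payments:
    # The lookup silently misses because "1001" != 1001, just as
    # mismatched column types cause dropped rows or costly implicit
    # casts in SQL joins.
    print(txn_id, orders.get(txn_id, "NO MATCH"), amount)
```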
Consider the source-driven flat schema below, which uses the operational data store approach. It mirrors how data comes out of the source systems, prioritizing quick ingestion over a clean schema, and is usually useful only for short-term reporting.
analytics.sales_data
| transaction_id | customer_id | name | joinDate | email | prod_id | product | cat | price | storeCode | store | region | amount | ts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1001 | C123 | Alice Johnson | 2022/03/01 | alice@email.com | P001 | WidgetA | Gadgets | 49.99 | NYC | S-NYC | East | 49.99 | 2024-07-15 14:03 |
| 1002 | C124 | Rob Hunter | 2023-01-15 | rob@email.com | P002 | WidgetB | Gadgets | 19.99 | NYC | Store1 | East | 19.99 | 2024-07-15 14:05 |
| 1003 | C123 | Alice Johnson | Mar 1 2022 | alice@email.com | P003 | WidgetC | Premium | 99.99 | LA | S-LA | West | 99.99 | 2024-07-16 10:11 |
What’s Wrong Here?
1. Inconsistent Data Types: joinDate is sometimes YYYY/MM/DD, sometimes YYYY-MM-DD, sometimes free text (Mar 1 2022). amount is a float, while price arrives as a string in some loads (a normalization sketch follows this list).
2. No Surrogate Keys: all joins rely on natural keys like customer_id, prod_id, and storeCode. If a customer's ID changes in the source (system migration, merges), history is broken.
3. Mixed Grain: customer attributes (name, email) are repeated on every transaction row, causing update anomalies, larger storage, and wasted compute.
4. Inconsistent Naming: some columns are snake_case (transaction_id), some camelCase (joinDate), some shortcodes (prod_id). The store appears in three forms: storeCode, store, and region.
5. No Clear Layering: raw source data and derived values live in the same table, with no separation of raw → core → mart.
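As an illustration of the first issue, the inconsistent joinDate values can be normalized during cleaning. A minimal sketch using Python's datetime, assuming the three formats seen in the table above are the only ones that occur:

```python
from datetime import datetime

# The three inconsistent formats observed in analytics.sales_data.
KNOWN_FORMATS = ["%Y/%m/%d", "%Y-%m-%d", "%b %d %Y"]

def normalize_join_date(raw: str) -> str:
    """Return the date in ISO YYYY-MM-DD form, or fail loudly."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")

for raw in ["2022/03/01", "2023-01-15", "Mar 1 2022"]:
    print(raw, "->", normalize_join_date(raw))
```

Failing loudly on an unknown format is deliberate: a silent fallback would reintroduce the very inconsistency the staging layer exists to remove.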
To fix these issues, we’ll implement the recommendations discussed earlier:
raw.sales.transactions

| src_transaction_id | src_customer_id | src_product_id | src_store_id | amount | ts |
|---|---|---|---|---|---|
| 1001 | C123 | P001 | S01 | 49.99 | 2024-07-15 14:03 |
| 1002 | C124 | P002 | S01 | 19.99 | 2024-07-15 14:05 |
| 1003 | C123 | P003 | S02 | 99.99 | 2024-07-16 10:11 |
raw.sales.customers
| src_customer_id | customer_name | join_date | email | country |
|---|---|---|---|---|
| C123 | Alice Johnson | 2022-03-01 | alice@email.com | UK |
| C124 | Rob Hunter | 2023-01-15 | rob@email.com | US |
raw.sales.products
| src_product_id | product_name | category | price |
|---|---|---|---|
| P001 | Widget A | Gadgets | 49.99 |
| P002 | Widget B | Gadgets | 19.99 |
| P003 | Widget C | Premium | 99.99 |
raw.sales.stores

| src_store_id | store_name | region |
|---|---|---|
| S01 | Store NYC | East |
| S02 | Store LA | West |
In the core layer, we standardize data types, introduce surrogate keys, and enforce naming conventions. Facts and dimensions are also identified here.
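As a rough sketch of how the surrogate keys in the tables below might be assigned, using SQLite's AUTOINCREMENT as a stand-in for a proper warehouse key sequence (table names flattened, since SQLite has no schema namespaces):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Stand-in for raw.sales.customers.
cur.execute("CREATE TABLE raw_customers (src_customer_id TEXT, customer_name TEXT, join_date TEXT)")
cur.executemany(
    "INSERT INTO raw_customers VALUES (?, ?, ?)",
    [("C123", "Alice Johnson", "2022-03-01"),
     ("C124", "Rob Hunter", "2023-01-15")],
)

# The surrogate key is generated by the warehouse, independent of the
# natural key, so a source-system ID change cannot break history.
cur.execute("""
CREATE TABLE dim_customer (
    customer_sk     INTEGER PRIMARY KEY AUTOINCREMENT,
    src_customer_id TEXT UNIQUE,   -- natural key kept for traceability
    customer_name   TEXT,
    join_date       TEXT
)""")
cur.execute("""
INSERT INTO dim_customer (src_customer_id, customer_name, join_date)
SELECT src_customer_id, customer_name, join_date FROM raw_customers
""")

for row in cur.execute("SELECT * FROM dim_customer"):
    print(row)  # (1, 'C123', 'Alice Johnson', '2022-03-01'), ...
```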
core.dim_customer
| customer_sk | src_customer_id | customer_name | join_date | email | country |
|---|---|---|---|---|---|
| 1 | C123 | Alice Johnson | 2022-03-01 | alice@email.com | UK |
| 2 | C124 | Rob Hunter | 2023-01-15 | rob@email.com | US |
core.dim_product
| product_sk | src_product_id | product_name | category | price |
|---|---|---|---|---|
| 10 | P001 | Widget A | Gadgets | 49.99 |
| 11 | P002 | Widget B | Gadgets | 19.99 |
| 12 | P003 | Widget C | Premium | 99.99 |
core.dim_store
| store_sk | src_store_id | store_name | region |
|---|---|---|---|
| 100 | S01 | Store NYC | East |
| 101 | S02 | Store LA | West |
core.fact_sales
| transaction_sk | transaction_date | customer_sk | product_sk | store_sk | transaction_amount | src_transaction_id |
|---|---|---|---|---|---|---|
| 5001 | 2024-07-15 | 1 | 10 | 100 | 49.99 | 1001 |
| 5002 | 2024-07-15 | 2 | 11 | 100 | 19.99 | 1002 |
| 5003 | 2024-07-16 | 1 | 12 | 101 | 99.99 | 1003 |
The mart layer holds aggregates for BI consumption.
mart.sales.sales_summary_monthly
| month | region | category | total_transactions |
|---|---|---|---|
| 2024-07 | East | Gadgets | 2 |
| 2024-07 | West | Premium | 1 |
Monthly sales summary by region and product category.
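A minimal sketch of how this summary could be derived from the core tables with a single aggregate query; the setup rows mirror core.fact_sales, core.dim_store, and core.dim_product above, with names flattened for SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE fact_sales (transaction_date TEXT, customer_sk INTEGER, "
            "product_sk INTEGER, store_sk INTEGER, amount REAL)")
cur.execute("CREATE TABLE dim_store (store_sk INTEGER, region TEXT)")
cur.execute("CREATE TABLE dim_product (product_sk INTEGER, category TEXT)")
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?)",
                [("2024-07-15", 1, 10, 100, 49.99),
                 ("2024-07-15", 2, 11, 100, 19.99),
                 ("2024-07-16", 1, 12, 101, 99.99)])
cur.executemany("INSERT INTO dim_store VALUES (?, ?)", [(100, "East"), (101, "West")])
cur.executemany("INSERT INTO dim_product VALUES (?, ?)",
                [(10, "Gadgets"), (11, "Gadgets"), (12, "Premium")])

# Aggregate the fact table by month, region, and category for BI use.
cur.execute("""
CREATE TABLE sales_summary_monthly AS
SELECT STRFTIME('%Y-%m', f.transaction_date) AS month,
       s.region,
       p.category,
       COUNT(*) AS total_transactions
FROM fact_sales f
JOIN dim_store   s ON f.store_sk   = s.store_sk
JOIN dim_product p ON f.product_sk = p.product_sk
GROUP BY month, s.region, p.category
""")
for row in cur.execute("SELECT * FROM sales_summary_monthly ORDER BY region"):
    print(row)  # ('2024-07', 'East', 'Gadgets', 2), ('2024-07', 'West', 'Premium', 1)
```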
The core layer is a defined star schema: one fact table and multiple dimension tables, joined via surrogate keys. Natural keys are retained for traceability (src_* columns), data types are standardized (e.g., dates, decimals), and clear, consistent naming is in place (raw, core, and mart schemas; dim_ and fact_ prefixes), as discussed in previous sections.
Watch out for more parts!