Neural Digest | Know the magic behind it

Introduction

What is a data style guide? In the context of data engineering, it is a set of rules and guidelines that dictate how data should be formatted, presented and labelled within a specific context. It is used to create standards for variable naming, table naming, column naming, file naming and even coding practices.

A general coding style guide focuses on the syntax and visual formatting of code, whereas a data style guide focuses on how data should be formatted, presented and structured within a warehouse or lake. In most cases, a data style guide also includes a coding style guide. It provides a structured approach to data engineering, laying out the how, why and where.

Let’s dive into why it’s important and the parts that make it whole. This is the first part of a series; to find the other parts, search for “data style guide”.

Why a data style guide

The key cornerstones of a data style guide are usually to provide:

Clarity: defining clear standards for data labelling and formatting helps users and computers understand the meaning at a glance. This prevents errors such as accidentally using the wrong table or column in a query.
Consistency: uniform conventions across datasets and projects within a team make information easier to compare and interpret.
Collaboration: no matter how small a team is, someone else will look at your code or processes at some point, whether a teammate or the person who replaces you after you leave. Shared conventions make that collaboration easier and the work more readable.
Scalability: as the business grows, so does its data. A good guide supports growth in data volume, complexity and new use cases, and stays relevant as technology and business needs evolve. It should not be so rigid that it stifles innovation, and it should plan for schema evolution and versioning strategies.

How are other guys doing it?

If you are looking for inspiration, check out the dbt style guide here (https://docs.getdbt.com/best-practices/how-we-style/0-how-we-style-our-dbt-projects). Note that these are only suggestions; every use case and organization’s needs are different. Tune your guide to your needs; nothing is written in stone. If you are part of a team, consult your teammates and agree on a set style. The important part is to decide on a style and stick to it.

A style guide is only useful if engineers use it, and the best style guide is one that integrates seamlessly into existing workflows. Make the guide easily accessible by hosting it on a wiki, GitHub or an internal portal, and integrate checks into CI/CD pipelines to automate enforcement.

Common pitfalls with data style guides

Strict rules and guidelines limit flexibility and innovation. Distinguish between the rules and the guidelines in the guide; this leaves room for scenarios where the style guide may not be suitable. Aim for a balance between standardization and flexibility.

Lack of enforcement and ownership. An unenforced style guide might get ignored and lead to inconsistencies. Engineers might view it as extra red tape rather than a helpful resource. To overcome this challenge one can offer training and onboarding documentation, integrate the guide into workflows using automated linting tools, and foster a culture of accountability among the engineers to adopt the guide.

Lack of automation where possible. Engineers are human and make mistakes sometimes. Adopt formatters and linters where possible; they ease a code reviewer’s load during a PR review. An engineer can focus on the functionality and logic of the code, leaving the linters and formatters to handle the cosmetic side. Tools like GitHub Actions, CodePipeline and Jenkins are viable options for running these checks automatically.

Balancing standardization and flexibility

When designing a style guide, striking a balance between standardization and flexibility is important. Without it, overly rigid rules can stifle innovation and create unnecessary friction, while too much flexibility can lead to inconsistency and inefficiency. A successful data style guide provides clear, enforceable rules while allowing situational flexibility where necessary. The following are strategies to consider.

Define the core rules vs guidelines

Categorize the recommendations into negotiable and non-negotiable. Non-negotiable rules could include using snake_case for variable names and encrypting all personally identifiable information (PII). Negotiable guidelines could include partitioning large datasets unless there is a specific reason not to, and preferring CSV storage over Parquet, or vice versa, while allowing exceptions for operability. This keeps the critical aspects of data governance intact while letting teams adjust less critical aspects to fit their needs.
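One way to make this split concrete is to record a level against every recommendation, so tooling knows which violations block a change and which only warn. The sketch below is a minimal illustration, not a prescribed format: the `Recommendation` class, the keys in `STYLE_GUIDE` and the `triage` helper are all hypothetical names, populated with the examples from the paragraph above.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    RULE = "rule"            # non-negotiable: a violation should block the change
    GUIDELINE = "guideline"  # negotiable: a violation only produces a warning


@dataclass(frozen=True)
class Recommendation:
    key: str
    description: str
    level: Level


# Hypothetical catalogue built from the examples in the text.
STYLE_GUIDE = [
    Recommendation("snake_case_names", "Names use snake_case", Level.RULE),
    Recommendation("pii_encrypted", "PII is encrypted", Level.RULE),
    Recommendation("partition_large_datasets", "Large datasets are partitioned", Level.GUIDELINE),
    Recommendation("preferred_storage_format", "Use the agreed storage format", Level.GUIDELINE),
]


def triage(violated_keys):
    """Split violated recommendation keys into blocking errors and warnings."""
    by_key = {rec.key: rec for rec in STYLE_GUIDE}
    errors, warnings = [], []
    for key in violated_keys:
        rec = by_key[key]
        if rec.level is Level.RULE:
            errors.append(rec.key)
        else:
            warnings.append(rec.key)
    return errors, warnings


errors, warnings = triage(["pii_encrypted", "partition_large_datasets"])
print("blocking:", errors)
print("warnings:", warnings)
```

Keeping the level next to the recommendation means a guideline can later be promoted to a rule by changing one field, without touching the checks themselves.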

Allow for exceptions with justifications

Instead of a rigid, inflexible style guide, introduce a formal process for requesting exceptions. If an exception is accepted, the documentation is updated and shared for transparency. However, the team should take ownership of the process and not let it become a loophole for ignoring standards.
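An exception process is easier to keep honest when approved exceptions live in a small, reviewable register rather than in someone's memory. The following is a rough sketch under assumed names: `StyleException`, its fields, the example entry and the `is_waived` helper are all hypothetical, and the expiry date is an added assumption meant to force a periodic re-review.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class StyleException:
    rule: str           # which rule is waived
    target: str         # the table, column or file it applies to
    justification: str  # why the exception was requested
    approved_by: str    # who signed it off
    expires: date       # assumed field: forces a periodic re-review


# Hypothetical register; in practice this could live in version control
# next to the style guide so every change is reviewed and visible.
EXCEPTIONS = [
    StyleException(
        rule="snake_case_names",
        target="legacy_crm.CustomerID",
        justification="Column name is fixed by an upstream export",
        approved_by="data-platform-leads",
        expires=date(2030, 1, 1),
    ),
]


def is_waived(rule, target, today):
    """Return True if an approved, unexpired exception covers this rule and target."""
    return any(
        e.rule == rule and e.target == target and today <= e.expires
        for e in EXCEPTIONS
    )
```

An automated check could call `is_waived` before reporting a violation, so that only documented and justified exceptions pass.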

Automation of core rules

Linters and CI/CD tools are good solutions for enforcing consistency without burdening engineers. Automated checks can enforce naming conventions for tables and columns, schema validation rules, and compliance with security policies. Leave flexible areas, such as query optimization techniques, to manual review.
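As an illustration of what such an automated check might look like, here is a minimal sketch that flags table and column names that are not snake_case. The regular expression, the `find_naming_violations` function and the sample `schema` are assumptions made for this example; a real project would more likely configure an existing linter than write its own.

```python
import re

# Assumed definition of snake_case: lowercase words separated by single underscores.
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def find_naming_violations(schema):
    """Return the table and column names that are not snake_case.

    `schema` maps each table name to a list of its column names.
    """
    violations = []
    for table, columns in schema.items():
        if not SNAKE_CASE.fullmatch(table):
            violations.append(f"table {table}")
        for column in columns:
            if not SNAKE_CASE.fullmatch(column):
                violations.append(f"column {table}.{column}")
    return violations


# Hypothetical schema used only to exercise the check.
schema = {
    "customer_orders": ["order_id", "createdAt"],
    "StagingUsers": ["user_id"],
}
violations = find_naming_violations(schema)
for violation in violations:
    print(violation)
```

In a CI job, the script could exit with a non-zero status whenever `violations` is non-empty, which is enough for most pipeline tools to fail the build.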

Create team-specific extensions

Organizations with multiple teams working in different domains (data engineering, machine learning, etc.) can allow for differences between teams through domain-specific add-ons. For example, the coding style applies to everyone, while a team-specific extension for the data science team includes best practices for feature engineering and dataset versioning.
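The layering described here can be sketched as a shared base guide plus optional per-team add-ons. Everything in the example is hypothetical (the `BASE_GUIDE` and `TEAM_EXTENSIONS` contents and the `guide_for` function), and the choice to reject extensions that override shared rules is an assumption of this sketch, not something every organization will want.

```python
# Hypothetical shared rules that apply to every team.
BASE_GUIDE = {
    "naming": "snake_case",
    "pii_encryption": "required",
}

# Hypothetical domain-specific add-ons, keyed by team.
TEAM_EXTENSIONS = {
    "data_science": {
        "feature_engineering_docs": "required",
        "dataset_versioning": "required",
    },
}


def guide_for(team):
    """Return the shared guide combined with the team's add-ons.

    Extensions may add entries but must not override shared rules.
    """
    extension = TEAM_EXTENSIONS.get(team, {})
    clashes = set(extension) & set(BASE_GUIDE)
    if clashes:
        raise ValueError(f"extension overrides shared rules: {sorted(clashes)}")
    return {**BASE_GUIDE, **extension}


print(guide_for("data_science"))
```

A team without an extension simply gets the shared guide back, so the base rules remain the single source of truth.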

Continuous review and evolution

The style guide is a living document and should be treated as such, evolving based on feedback, new technologies and business needs. It is best to conduct periodic reviews, gather feedback on pain points and maintain a changelog to track updates and ensure transparency.

The next article will discuss naming, schema and data modelling best practices.

Designing a Data Style Guide - Part 1
