Kimani Mbugua - Data and Technology blog

Automating repo scaffolding with Azure DevOps

Sun, 05 Mar 2023 00:00:00 +0000

Integrating Cookiecutter with YAML pipelines in Azure DevOps to scaffold repos automatically provides a simple, repeatable workflow and further minimises manual effort. This post shows how we can set up this automation in Azure DevOps via YAML pipelines.

Using Azure DevOps to run Cookiecutter templates

Sun, 19 Feb 2023 00:00:00 +0000

Ever wondered how you can scaffold repos in a YAML pipeline? This post will show how we can do this in Azure DevOps.

Using cookiecutter hooks to enhance code scaffolding

Sun, 29 Jan 2023 00:00:00 +0000

Using Cookiecutter to scaffold code repositories offers a useful way to kick-start projects. To enhance the user experience even more, this post will look at using hooks to perform actions such as input validation and clean-up activities.
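
As a rough illustration of input validation in a hook, here is a minimal sketch of what a `hooks/pre_gen_project.py` file might contain. Cookiecutter runs this hook before generating the project and aborts generation if it exits with a non-zero status. The variable name `project_slug`, the naming rule and the helper names `is_valid_slug` and `main` are assumptions made for this example, not taken from the post.

```python
import re
import sys

# Hypothetical rule: start with a lowercase letter, then lowercase
# letters, digits or underscores only.
SLUG_PATTERN = re.compile(r"[a-z][a-z0-9_]*")


def is_valid_slug(slug: str) -> bool:
    """Return True when the whole string satisfies the naming rule."""
    return SLUG_PATTERN.fullmatch(slug) is not None


def main(slug: str) -> int:
    """Return the exit code the hook should finish with."""
    if not is_valid_slug(slug):
        print(f"ERROR: '{slug}' is not a valid project slug.", file=sys.stderr)
        return 1  # a non-zero exit code makes Cookiecutter stop generation
    return 0


if __name__ == "__main__":
    # In a real hook, Cookiecutter renders this file with Jinja first, so the
    # value would come from "{{ cookiecutter.project_slug }}" (assumed name).
    # Here it is read from the command line so the sketch runs on its own.
    sys.exit(main(sys.argv[1] if len(sys.argv) > 1 else ""))
```

Keeping the check in a small function separate from the exit call makes the rule easy to test outside of Cookiecutter.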

Scaffolding repos with cookiecutter

Sun, 15 Jan 2023 00:00:00 +0000

For code development projects, we often end up creating code repositories (repos) that have similar structures or components to what we have previously used.

Cookiecutter is a tool that we can use to scaffold the creation of our repos and this post will guide you through how to do this and more.

Why you want Databricks Auto Loader

Sat, 27 Feb 2021 00:00:00 +0000

Auto Loader is one of the standout features in Databricks and this post will introduce you to why you’d want to use it to address common data ingestion challenges.

Part 3 - Pre-commit hooks - SQL Linting

Sat, 13 Feb 2021 00:00:00 +0000

It can be a challenge to keep code formatted consistently and, where consistency is lacking, errors soon follow.

In part 3 of this pre-commit hooks series, we’ll focus on how we can use pre-commit hooks in Azure Git repos to automatically check for stylistic and programmatic errors in SQL scripts.

Part 2 - Detect secrets in Azure repos

Sat, 30 Jan 2021 00:00:00 +0000

Even with the advent of cloud computing and all manner of technology enhancements, exposing secrets seems to be a problem that won’t go away.

Without the right controls in place, developers can leak secrets that can cause financial and reputational damage to an organisation.

In part 2, we’ll look at how we can use a pre-commit hook to try and detect secrets in our code.

Part 1 - Pre-commit hooks in Azure repos

Sat, 23 Jan 2021 00:00:00 +0000

Having standards for code development is a necessity but making sure those standards are followed can be a challenge.

As human beings, we make mistakes and can overlook standards at the very moment we need to apply them.

Central to that challenge is making sure standards are applied before changes are committed.

In this series, we’ll look at taking on that challenge with pre-commit hooks. We’ll explore what pre-commit hooks are, why we might want to use them and how they work.

Moving to a Hugo static site

Fri, 31 Dec 2021 00:00:00 +0000

In this post, I’ll share my motivations and experiences of moving my blog, towards the end of 2021, from Wix to a Hugo static site.

Pin Databricks Clusters

Mon, 23 Aug 2021 00:00:00 +0000

This post is for anyone who is unaware that interactive Databricks clusters can be deleted 30 days after termination, unless the cluster is “pinned”.

Part 4 - Bad records path

Mon, 14 Jun 2021 00:00:00 +0000

In part 4, the final part of this beginner’s mini-series on how to handle bad data, we will look at how we can retain the flexibility to capture bad data and proceed uninterrupted.

Specifically, we’ll use the “badRecordsPath” option in Azure Databricks, which has been available since Azure Databricks runtime 3.0.

Part 3 - Permissive

Sun, 30 May 2021 00:00:00 +0000

In the 3rd instalment of this 4-part mini-series, we will look at how we can handle bad data using PERMISSIVE mode. It is the default mode when reading data using the DataFrameReader but there’s a bit more to it than simply replacing bad data with NULLs.

Part 2 - Dropmalformed

Mon, 17 May 2021 00:00:00 +0000

In the second part, we’ll continue to focus on the DataFrameReader class and look at the DROPMALFORMED option to remove bad data.

Part 1 - Failfast

Mon, 10 May 2021 00:00:00 +0000

Receiving bad data is often a case of “when” rather than “if”, so the ability to handle bad data is critical in maintaining the robustness of data pipelines.

In this beginner’s 4-part mini-series, we’ll look at how we can use the Spark DataFrameReader to handle bad data and minimise disruption in Spark pipelines. There are many other creative methods beyond those discussed here, and I invite you to share them if you’d like.

Delta Lake table restore

Mon, 19 Apr 2021 00:00:00 +0000

Restoring a table is one of the most common restore operations. In this post, we’ll look at how one of Delta Lake’s neat features allows us to accomplish fast and simple table restores to previous versions.

Key vault secrets in ADF pipelines

Thu, 18 Mar 2021 00:00:00 +0000

This short post looks at some considerations when using key vault secrets in Data Factory to securely pass information in pipeline activities. This is not an exhaustive list, but do take note.

Secret redaction caution

Thu, 18 Mar 2021 00:00:00 +0000

Secret redaction within Databricks is a great feature that helps to prevent unintentional exposure of your secrets. This post walks through a short demo of why we need to remain cautious about secret exposure, even with secret redaction in place.

Databricks managed identity setup in ADF

Wed, 17 Feb 2021 00:00:00 +0000

This post shows how to quickly set up a managed identity for Databricks activities in Data Factory (ADF), to eliminate the need to manage credentials.

Cutting code and architects

Mon, 15 Feb 2021 00:00:00 +0000

Over the years working on data platforms, I have seen architects writing less and less code, to the point where in more recent times, a seemingly growing number of architects (in data projects) are writing no code at all.

Concurrency defaults in ADF

Mon, 09 Nov 2020 00:00:00 +0000

In this short post, we’ll look at concurrency default values in ADF and the implications of changing them, or not.

ADF activity policy

Mon, 26 Oct 2020 00:00:00 +0000

In this post, we’ll explore the Azure Data Factory (ADF) activity policy, its configuration and the implications of its default behaviour.

Specify dynamic JSON content in ADF

Sun, 11 Oct 2020 00:00:00 +0000

This article shows how to use the JSON editor and key vault secret references in Azure Data Factory (ADF) to provide an alternative experience for linked service connectors that do not have built-in parameterisation support.

Hi. I’m Kimani.

I help companies with the strategy, design and implementation of their data projects.

As well as other more general subjects, I particularly enjoy sharing my experience with data and cloud technologies.

I’m based in Sussex, England and outside of work you can find me playing cricket, hanging out with family or on the occasional scuba dive.
