<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Data Quality on Kimani Mbugua - Data and Technology blog</title><link>http://kimanimbugua.com/tags/data-quality/</link><description>Recent content in Data Quality on Kimani Mbugua - Data and Technology blog</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Mon, 14 Jun 2021 00:00:00 +0000</lastBuildDate><atom:link href="http://kimanimbugua.com/tags/data-quality/rss.xml" rel="self" type="application/rss+xml"/><item><title>Part 4 - Bad records path</title><link>http://kimanimbugua.com/post/handling-bad-data-part-4-bad-records-path/</link><pubDate>Mon, 14 Jun 2021 00:00:00 +0000</pubDate><guid>http://kimanimbugua.com/post/handling-bad-data-part-4-bad-records-path/</guid><description>&lt;p&gt;In part 4, the final part of this beginner’s mini-series on how to handle bad data, we will look at how we can retain the flexibility to capture bad data and still proceed uninterrupted.&lt;/p&gt;
&lt;p&gt;Specifically, we&amp;rsquo;ll use the “&lt;a href="https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/handling-bad-records"&gt;badRecordsPath&lt;/a&gt;” option in Azure Databricks, which has been available since Azure Databricks runtime 3.0.&lt;/p&gt;</description></item><item><title>Part 3 - Permissive</title><link>http://kimanimbugua.com/post/handling-bad-data-part-3-permissive/</link><pubDate>Sun, 30 May 2021 00:00:00 +0000</pubDate><guid>http://kimanimbugua.com/post/handling-bad-data-part-3-permissive/</guid><description>&lt;p&gt;In the 3rd instalment of this 4-part mini-series, we will look at how we can handle bad data using PERMISSIVE mode. It is the default mode when reading data using the DataFrameReader, but there’s a bit more to it than simply replacing bad data with NULLs.&lt;/p&gt;</description></item><item><title>Part 2 - Dropmalformed</title><link>http://kimanimbugua.com/post/handling-bad-data-part-2-dropmalformed/</link><pubDate>Mon, 17 May 2021 00:00:00 +0000</pubDate><guid>http://kimanimbugua.com/post/handling-bad-data-part-2-dropmalformed/</guid><description>&lt;p&gt;In the second part, we’ll continue to focus on the DataFrameReader class and look at the &lt;strong&gt;DROPMALFORMED&lt;/strong&gt; option, which &lt;strong&gt;removes&lt;/strong&gt; bad data.&lt;/p&gt;</description></item><item><title>Part 1 - Failfast</title><link>http://kimanimbugua.com/post/handling-bad-data-part-1-failfast/</link><pubDate>Mon, 10 May 2021 00:00:00 +0000</pubDate><guid>http://kimanimbugua.com/post/handling-bad-data-part-1-failfast/</guid><description>&lt;p&gt;Receiving bad data is often a case of “when” rather than “if”, so the ability to handle bad data is critical in maintaining the robustness of data pipelines.&lt;/p&gt;
&lt;p&gt;In this beginner’s 4-part mini-series, we’ll look at how we can use the Spark DataFrameReader to handle bad data and minimise disruption in Spark pipelines. There are many other creative methods beyond those discussed here, and I invite you to share them if you’d like.&lt;/p&gt;</description></item></channel></rss>