Wednesday, July 13, 2016

To Purge or not to Purge

There comes a time in every data warehouse team tenure where you have to ascertain if you truly need to keep data from 1993.   Data Governance would dictate that you have a policy surrounding your data retention.  I worked for a client that had data "back to Noah" per the General Manager and the way the data warehouse code was written to calculate subscribers.  A database needs to be maintained or you experience a lot of problems.   But back to the data purging issue, let’s look at some reasons why you want and don't want to purge data.   


What is Purging
Purging is the process of freeing up space in the database or of deleting obsoletedata that is not required by the system. The purge process can be based on the age of the data or the type of data. Archiving is the process of backing up the obsolete data that will be deleted during the purge process

Reasons not to purge:
1. Disk space is cheap, why not keep the data
2. You don't know if you will ever "Need" the data
3. What if we are audited

Reason to purge the data
1. Improve performance - Why would you want to search through  1 Million records when you could search through 500k records - Query times, 
2. Legal reason -PII requires you to setup a purging policy.
3. Data beyond X number of years is irrelevant.
4. Costs of maintaining data - physical disks, space, bloated databases

Sample Purging policy

Every day, purge data older than 45 days from the current data set, moving it to the history data set.
The first day of every month, purge data older than 13 months from the history data set, moving it to the dormant data set.
Every day, purge data older than eight years from the dormant data set


Even if your organization utilizes an ODS where you have current information for X years and then an EDW where you have historical data, the EDW would still need to have some sort of purging policy or you will experience problems down the road in database performance or lack thereof.   Working with your data governance team to review issues with data, setting data standards and retention policy can only be good for your organization.   Sometimes your organization has to champion the data governance for your organization in order to show the business that data is important to your organization but maintaining that data is important to both IT and the business.

No comments:

Post a Comment