Rules of Scrubbing Data

After setting up a scrub server, we need to review each database for data that needs to be scrubbed.  The data you need to scrub depends upon the government rules you need to follow.  Discovering the data to be scrubbed requires both querying the database and talking to our developers.  Here is a list of possible field names that we need to search for:

Email

Address

City

State

Zip

Postal Code

Name

Credit Card

Social Security Number (SSN)

Here is a query that searches for one of these fields, using email as the example pattern:

select t.name as TableName
      ,c.name as ColumnName
      ,x.name as ColumnType
      ,c.max_length  -- in bytes; -1 means a (max) type
from sys.columns c
inner join sys.tables t
   on c.object_id = t.object_id
inner join sys.types x
   on c.system_type_id = x.system_type_id
where c.name like '%email%'  -- swap in each pattern from the list above
  and (x.name = 'varchar'
   or x.name = 'nvarchar')

 

Remember, this will only find the fields that match a certain pattern.  Your developers may help you find other fields.
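To check every name in the list in one pass, the same query can be driven from a table of patterns. This is only a sketch: the pattern list is my assumption about common naming conventions, and the char and nchar types are added to the filter because fixed-length columns often hold SSNs and card numbers.

```sql
-- Hypothetical pattern list; extend it to match your own naming conventions.
DECLARE @patterns TABLE (pattern nvarchar(100) NOT NULL);

INSERT INTO @patterns (pattern)
VALUES (N'%email%'), (N'%address%'), (N'%city%'), (N'%state%'),
       (N'%zip%'), (N'%postal%'), (N'%name%'), (N'%card%'),
       (N'%ssn%'), (N'%social%');

SELECT s.name AS SchemaName,
       t.name AS TableName,
       c.name AS ColumnName,
       x.name AS ColumnType,
       c.max_length
FROM sys.columns AS c
INNER JOIN sys.tables  AS t ON c.object_id = t.object_id
INNER JOIN sys.schemas AS s ON t.schema_id = s.schema_id
INNER JOIN sys.types   AS x ON c.system_type_id = x.system_type_id
WHERE x.name IN (N'char', N'varchar', N'nchar', N'nvarchar')
  -- EXISTS returns each column once even if it matches several patterns
  AND EXISTS (SELECT 1 FROM @patterns AS p WHERE c.name LIKE p.pattern)
ORDER BY SchemaName, TableName, ColumnName;
```

Expect false positives (a column called TableName matches %name%), so treat the output as a list to review with your developers, not a final answer.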

The database might also have fields that have been encrypted with cell-level encryption or a third-party product.  If data is encrypted, you will need to change the encryption certificate to a non-Production certificate.  I recommend sharing one QA/Dev certificate across the non-Production environments.
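One way to see whether a database uses cell-level encryption is to look at its certificates, its symmetric keys, and its binary columns. The sketch below uses only catalog views; treating every binary or varbinary column as a candidate is my assumption, since those types also hold images and other non-encrypted data.

```sql
-- Certificates defined in the current database
SELECT name, subject, expiry_date, pvt_key_encryption_type_desc
FROM sys.certificates;

-- Symmetric keys defined in the current database
SELECT name, algorithm_desc, key_length, create_date
FROM sys.symmetric_keys;

-- Binary columns, which are where cell-level encrypted values are usually stored
SELECT t.name AS TableName,
       c.name AS ColumnName,
       x.name AS ColumnType,
       c.max_length
FROM sys.columns AS c
INNER JOIN sys.tables AS t ON c.object_id = t.object_id
INNER JOIN sys.types  AS x ON c.system_type_id = x.system_type_id
WHERE x.name IN (N'binary', N'varbinary')
ORDER BY TableName, ColumnName;
```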

After you have discovered all the fields that need to be scrubbed, the next step is to determine how you want to scrub the data.  One thing to note: you may need to do something special when scrubbing the data.  For example, when I worked for a credit card company, we had to maintain the first digit of the credit card number.  Yes, there was some code which had the meaning of that digit hard coded.
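As an illustration of that kind of special rule, here is a minimal sketch that keeps the first digit of a card number and replaces the rest. The table dbo.Payment, its integer key PaymentId, and a 16-character CardNumber column are all hypothetical, and the result will not pass a Luhn check, so adjust it if your application validates card numbers.

```sql
-- Hypothetical table: dbo.Payment (PaymentId int PRIMARY KEY, CardNumber char(16))
-- Keeps digit 1 and fills the other 15 positions with the zero-padded key.
UPDATE dbo.Payment
SET CardNumber = LEFT(CardNumber, 1)
               + RIGHT(REPLICATE('0', 15) + CAST(PaymentId AS varchar(11)), 15);
```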

Now, let’s discuss the different methods for scrubbing data:

  1. Change the data. For example, change the social security number by replacing it with the primary key padded with zeros.  This assumes your primary key is an integer value.
  2. Scramble the data. Using the primary key, pull the data to scrub out of the table and randomize it.  After the data is randomized, put it back into the original table using the primary key.
  3. Remove the data. You can null, zero out or space out the data.  Remember, you are changing the selectivity of the data.
  4. Change all email fields. Remember you don’t want to spam your customers during development, QA testing or UAT testing.
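Methods 1, 3 and 4 above can each be sketched as a single UPDATE. The table dbo.Customer and its columns are hypothetical, the key is assumed to be a positive integer of at most nine digits, and Phone is assumed to be nullable.

```sql
-- Hypothetical table:
-- dbo.Customer (CustomerId int PRIMARY KEY, SSN char(9),
--               Phone varchar(20) NULL, Email varchar(100))

-- Method 1: change the data. SSN becomes the zero-padded primary key.
UPDATE dbo.Customer
SET SSN = RIGHT(REPLICATE('0', 9) + CAST(CustomerId AS varchar(11)), 9);

-- Method 3: remove the data. Every row now holds the same value,
-- which is what changes the selectivity of the column.
UPDATE dbo.Customer
SET Phone = NULL;

-- Method 4: change all email fields. example.com is reserved for
-- documentation and testing, so nothing sent there reaches a customer.
UPDATE dbo.Customer
SET Email = 'user' + CAST(CustomerId AS varchar(11)) + '@example.com';
```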

When I am scrubbing data, I want to keep as much of the data the same as I can.  If I can scramble the data, it is my first choice.  For other options of scrubbing or generating data, please check out this article from Red-Gate, https://www.simple-talk.com/sql/database-administration/obfuscating-your-sql-server-data/.
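Here is one way the scramble could look, as a sketch rather than the exact process I run. The table dbo.Customer and the column LastName are hypothetical. The values are copied to a temp table with two row numbers, one in key order and one in random order, and then written back so each row receives the value from a randomly chosen donor row.

```sql
-- Hypothetical table: dbo.Customer (CustomerId int PRIMARY KEY, LastName nvarchar(50))

-- Pull the data out, numbering each row by key and by a random order.
SELECT CustomerId,
       LastName,
       ROW_NUMBER() OVER (ORDER BY CustomerId) AS KeyOrder,
       ROW_NUMBER() OVER (ORDER BY NEWID())    AS RandomOrder
INTO #Scramble
FROM dbo.Customer;

-- Put the data back using the primary key: each row takes the LastName
-- of the row whose KeyOrder equals its own RandomOrder.
UPDATE c
SET c.LastName = donor.LastName
FROM dbo.Customer AS c
INNER JOIN #Scramble AS k     ON k.CustomerId   = c.CustomerId
INNER JOIN #Scramble AS donor ON donor.KeyOrder = k.RandomOrder;

DROP TABLE #Scramble;
```

The set of values in the column is unchanged, so selectivity is preserved. A few rows may draw their own original value by chance, and rare values can still identify a person, so review the result before releasing it.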

In our next article, we will wrap up by automating the process of scrubbing data.

 

Setting Up a Scrub Server

A scrub server is a server where we will restore protected Production data. Production data must be changed so that it does not expose data that is protected by government regulations. Some of these regulations include privacy and personally identifiable information (PII) laws. Other rules you may fall under include HIPAA and the PCI standard. This is not an exhaustive list of the rules that cover data in your production environment. Check with your government and management for the laws you need to adhere to.

With that said, when I set up a Scrub server, it is in a very secure area where data access is very, very limited. For example, in my current company, the server is in a separate domain from Production and QA/Dev. Only DBAs are allowed to access this server. If you have multiple DBAs at your location, you may even want to limit which DBAs have access to this server. Our goal is to automate the entire scrubbing process so no one has to access the data, including copying backup files from Production to the Scrub server and then to a shared scrub location for QA/Dev to retrieve.

First, we have to decide which version of SQL Server will be on the Scrub server, or whether to run multiple versions. I chose to have only one version of SQL Server on my Scrub server. The choice depends on the version of SQL Server in Production. Hopefully, your QA/Dev environments are on the same version of SQL Server as Production or newer. Note, I did not say the same service pack or cumulative update, but the same release: SQL Server 2008 R2, 2012, 2014 or 2016. You may be in the process of deploying Service Pack 1 to Production, but if you are like me, you change your Development servers first and, if the deployment goes well, move the service pack to QA, then UAT and finally Production. So the Scrub server must be on the same major version as Production, not necessarily the same service pack.
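A quick way to compare servers is to run the same version query on Production and on the Scrub server. This sketch splits the four-part ProductVersion string with PARSENAME; the column aliases are my own.

```sql
SELECT SERVERPROPERTY('ProductVersion') AS ProductVersion,  -- e.g. major.minor.build.revision
       SERVERPROPERTY('ProductLevel')   AS ProductLevel,    -- RTM, SP1, SP2, ...
       PARSENAME(CAST(SERVERPROPERTY('ProductVersion') AS varchar(32)), 4) AS MajorVersion,
       PARSENAME(CAST(SERVERPROPERTY('ProductVersion') AS varchar(32)), 3) AS MinorVersion;
```

Major version 10 with minor version 50 is SQL Server 2008 R2, 11 is 2012, 12 is 2014 and 13 is 2016. ProductLevel can differ between the servers; the major and minor versions are what need to line up.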

Second, you must have a lot of disk space. By far, this is my largest SQL Server in terms of hard drive space. People may say you are wasting disk space, but remember what you are really doing is protecting the Company from a data breach. Once I explained to management that scrubbing the data protects it, the disk space request was no longer an issue. On my scrub server, I have enough disk space to restore all of my Production databases from all of my Production servers. Yes, that is a lot of disk space. My scrub server has 1.6 TB of disk space. To put this in perspective, my largest database is 160 GB. In my previous company where I had PCI data, the largest production database was 700 GB and my scrub server had 4 TB of disk space. Remember you will need space for the backup from Production, the restored database and the backup you will take after you scrub it. So a 160 GB database would need about 480 GB of space (three copies of roughly 160 GB each) if you can’t compress the backup files.
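To size the disk for your own servers, the three-copies rule can be applied to the allocated file sizes on each Production instance. This is a rough upper bound, not a measurement: it assumes an uncompressed backup is about as large as the allocated data and log files, which usually overstates it.

```sql
-- Run on each Production instance. size is stored in 8 KB pages.
SELECT d.name AS DatabaseName,
       CAST(SUM(CAST(mf.size AS bigint)) * 8 / 1024.0 / 1024.0 AS decimal(12,2))     AS AllocatedGB,
       -- Production backup + restored database + scrubbed backup
       CAST(SUM(CAST(mf.size AS bigint)) * 8 / 1024.0 / 1024.0 * 3 AS decimal(12,2)) AS ScrubSpaceGB
FROM sys.master_files AS mf
INNER JOIN sys.databases AS d ON d.database_id = mf.database_id
WHERE d.database_id > 4   -- skip master, tempdb, model and msdb
GROUP BY d.name
ORDER BY AllocatedGB DESC;
```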

Finally, I recommend making the Scrub server a virtual server to keep costs down. My Scrub VM has 2 vCPUs and 8 GB of RAM. The scrub server does not need much horsepower. With automation, I restore, scrub and back up 17 databases in 25 minutes, run by a SQL Server Agent job with PowerShell and T-SQL.
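For a single database, the restore, scrub and backup cycle comes down to three statements. This is only a sketch of the shape of the job, not my actual automation: the database name Sales, the logical file names, the drive paths and the procedure dbo.usp_ScrubData are all hypothetical, and backup compression depends on your version and edition.

```sql
-- 1. Restore the Production backup onto the Scrub server.
RESTORE DATABASE Sales
FROM DISK = N'D:\Incoming\Sales_prod.bak'
WITH MOVE N'Sales'     TO N'D:\Data\Sales.mdf',      -- logical names are assumptions;
     MOVE N'Sales_log' TO N'D:\Log\Sales_log.ldf',   -- check them with RESTORE FILELISTONLY
     REPLACE, STATS = 10;

-- 2. Scrub it. usp_ScrubData stands in for whatever scrub logic you deploy.
EXEC Sales.dbo.usp_ScrubData;

-- 3. Back up the scrubbed copy to the location QA/Dev can read.
BACKUP DATABASE Sales
TO DISK = N'D:\Outgoing\Sales_scrubbed.bak'
WITH INIT, COMPRESSION, STATS = 10;
```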

As we wrap up this blog, the most important thing to remember when creating a scrub server is security. You will be restoring Production data, so keep it secure. Let’s build, troubleshoot and test with the best possible dataset we can provide to our team by scrubbing Production data. The next blog post will discuss how we discover what data needs to be scrubbed and how I scatter the data across the database.

Why Do I Need Production Data for Testing?

Every day as we test and develop, we need data to see how a screen will work. If the screen is filling in an address of an employee and the employee is adding their home address state, how does the screen behave? Is the UI allowing just US states or is this a worldwide application? If this is a legacy database system, what values have been allowed in the past? What happens if I run a state report?

Have you ever heard, “but it works on my machine”? Is this because of data perfection in Development and QA or having specific failure conditions? Can you think of all the data scenarios that accompany Production data? What about performance? Why did the application fail? What happens if I add this index?

Here are the reasons I believe you should get a scrubbed version of your production database into your Development, QA and UAT environments.

  1. Data Selectivity – If you have data in your address table, are 90 percent of the addresses in one state, such as California, or are the states spread out? If your database is only a US database, do you only have US states or do you have other data?
  2. Application Errors – When you need to figure out why the application is failing in Production, you need to validate the error condition. Is there more than one row of data causing this error? Did a data value overflow the data type?
  3. Data Usage Stats – When you look at the data statistics, what happens if you add an index? Does this explain why sometimes your query is fast and other times it is very slow?
  4. Data Anomalies – What data are you expecting in the field? Is the data what you are expecting? Do we need to fix the screen to only allow certain values to be entered? What if you have a legacy database which has unexpected data values, how does the application handle the situation if the data can’t be changed?
  5. Index Maintenance – What is the effect of the index you want to add or remove?
  6. Data Security – Never move real Production data outside of Production without scrubbing the data first.
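The selectivity and anomaly questions in the list above are the kind you can only answer against real data. As a sketch, assuming a hypothetical dbo.Address table with a State column and a hypothetical dbo.UsState lookup table of valid codes:

```sql
-- Data selectivity: how are the rows distributed across states?
SELECT State,
       COUNT(*) AS RowCnt,
       CAST(100.0 * COUNT(*) / SUM(COUNT(*)) OVER () AS decimal(5,2)) AS PctOfRows
FROM dbo.Address
GROUP BY State
ORDER BY RowCnt DESC;

-- Data anomalies: which State values are not valid US state codes?
-- Rows with a NULL State are reported too.
SELECT a.State,
       COUNT(*) AS RowCnt
FROM dbo.Address AS a
WHERE NOT EXISTS (SELECT 1 FROM dbo.UsState AS s WHERE s.StateCode = a.State)
GROUP BY a.State
ORDER BY RowCnt DESC;
```

Run against hand-built Development data, both queries tend to come back clean. Run against a scrubbed copy of Production, they show what the application really has to handle.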

All of this should not be done or debugged in Production. Let’s get a copy from Production, scrub it and work with this data. Next, let’s learn how to build a data scrubbing environment.

It Has Been A While!

Hello and welcome back!  It has been a while since my last blog post.  Since the last post, I moved from Denver to Raleigh.  Last week while at the PASS Summit, I was visiting with Jason Horner and he suggested I blog about the subject I was sharing with him. The topic is Restoring a Production Database to a Non-Production Server.  Over the next few weeks, we will be reviewing several sub-topics about restoring a production database. I call this process, Scrubbing the Data. Here is an overview of the topics we will be discussing:

  • Why do I need Production Data outside Production
  • Developing Environment-Aware Servers
  • Setting up a Scrub Server
  • Rules of Scrubbing Data
  • Changing TDE Database and Cell Level Encryption by Certificate
  • Complete Scrubbing Process

I will be leaving on vacation on Friday, so keep your eyes open for the first post on scrubbing your data around the end of November.