Before GDPR came into effect in May 2018, Gartner found that over 80% of companies admitted to using sensitive production data for testing. That is a remarkable figure and it would be naïve to think that this practice has been eliminated only a year later. Indeed, as I have described in my earlier blog pieces, Delphix actually makes the use of production data much easier and so increases the risk of exposing personal data.
The solution to this problem is to mask any data being copied for non-production purposes. To do this, Delphix has a comprehensive masking capability that is fully integrated with its data virtualisation engine.
Delphix masking supports profiling, masking, and tokenizing a variety of different data sources including distributed databases, mainframe, PaaS databases, and files. It maintains absolute consistency when masking data, regardless of the DBMS or data file it is stored in.
Imagine you have a system comprising three different applications: one on a SQL Server database, another on Oracle and the third on a mainframe using DB2. If you use Delphix to mask all these data sources, then whenever you mask the name “David Webster”, for example, its masked value might always be “Thomas Jones”. That holds in every database, including the DB2 database on the mainframe. This very important feature maintains the referential integrity of data across all your data sources and means you only have to code a masking rule once for each data element and not separately for each platform.
With the exception of tokenization, described below, Delphix masking is irreversible, even if you have access to the encryption keys.
Take the name “David”. This might be masked as “Thomas” and will always be masked as “Thomas”. However, the names “Brian”, “Frederick”, “Rasheed”, “Wasim” and others may also be masked as “Thomas”. So, even if you see “Thomas”, you cannot determine the source value. This many-to-one data conversion makes the masking extremely secure.
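To make the many-to-one idea concrete, here is a minimal Python sketch. It is not Delphix's implementation: the function `secure_lookup`, the list `LOOKUP` and the key are all hypothetical names chosen for illustration, and a keyed hash (HMAC-SHA256) is simply one plausible way to get a mapping that is consistent yet collides by design.

```python
import hashlib
import hmac

# Hypothetical lookup list of fictional first names; a real list would be far larger.
LOOKUP = ["Thomas", "Angela", "Peter", "Maria"]


def secure_lookup(value: str, key: bytes, lookup: list[str] = LOOKUP) -> str:
    """Deterministically map a value onto one entry of a small lookup list."""
    # Normalise so that " david " and "David" are treated as the same source value.
    normalised = value.strip().lower().encode("utf-8")
    digest = hmac.new(key, normalised, hashlib.sha256).digest()
    # Reduce the digest to an index; many inputs share each index (many-to-one).
    index = int.from_bytes(digest[:8], "big") % len(lookup)
    return lookup[index]


key = b"example-key-not-for-production"  # assumption: any secret byte string
names = ["David", "Brian", "Frederick", "Rasheed", "Wasim", "Neil"]
masked = {name: secure_lookup(name, key) for name in names}

for original, replacement in masked.items():
    print(f"{original} -> {replacement}")
```

Because the same key and the same function can be used against every data source, a given name receives the same replacement everywhere. And because six source names are squeezed into four replacements, at least two of them must share a masked value, so a masked name on its own does not identify a single source value.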
There are eight Delphix masking algorithms out of the box:
| Algorithm | Description |
| --- | --- |
| Secure Lookup | Replaces sensitive data with fictional data. Repeating data patterns, or collisions, may occur because “David”, “Rasheed” and “Neil” could all be masked as “Thomas”. Because names and addresses will recur in real data, this is realistic behaviour. |
| Segmented Mapping | Supports unique masked values by dividing a target value into separate segments and masking each segment separately. For example, with a National Insurance number that has both alpha and numeric sections, you can control how the data is masked in each section to maintain a valid format. |
| Mapping | Maps source data values sequentially to a list of values pre-supplied in a lookup table. |
| Binary Lookup | Replaces objects in large object columns, such as scanned images. It does not change any data within the images; instead, the entire image is replaced with a fictional one. Typically, you might upload one small image that is used to mask all images. |
| Min Max | Enables very high or low data to be normalised. Figures such as an executive salary can be masked into a mid-range value, making it impossible to attribute the data to any individual. |
| Data Cleansing | Allows you to standardise varied spellings, misspellings, and abbreviations to the same data string. For example, “North Yorks”, “N. Yorks” and “N Yorkshire” can all be standardised as “North Yorkshire”. This is particularly useful for analytics. |
| Free Text | A free text redaction algorithm removes sensitive data that appears in free-text strings such as “Comments” or “Notes” columns in a database. It requires some expertise to configure it to recognise sensitive data within a block of text. |
| Tokenization | This is the only algorithm that allows the masking to be reversed. A tokenization algorithm masks data before it is sent to a third party for analysis. The third party can identify specific records that need attention without access to the sensitive data, and feed back corrected, tokenized data and results. The masking is then reversed using the token key. |
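The reversible nature of tokenization can be sketched in a few lines of Python. This is an illustration only, not how Delphix does it: the `TokenVault` class is hypothetical, and an in-memory lookup table stands in here for whatever the product's token key actually is.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory vault that swaps values for random tokens and back."""

    def __init__(self) -> None:
        self._to_token: dict[str, str] = {}
        self._to_value: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        # Reuse the existing token so the same value always gets the same token.
        if value not in self._to_token:
            token = secrets.token_hex(8)
            while token in self._to_value:  # regenerate on the rare collision
                token = secrets.token_hex(8)
            self._to_token[value] = token
            self._to_value[token] = value
        return self._to_token[value]

    def detokenize(self, token: str) -> str:
        # Only the holder of the vault can reverse the mapping.
        return self._to_value[token]


vault = TokenVault()
token = vault.tokenize("David Webster")  # this is what the third party would see
restored = vault.detokenize(token)       # reversed once the results come back
```

Unlike the many-to-one masking described earlier, each value here gets its own token, which is what makes it possible to map results from the third party back to the original records.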
You choose the best algorithm for each data element that you need to mask and then define specific rules that will be applied consistently across the data estate. Combining the masking capability with the virtualisation at the heart of the Delphix engine means you can provision fully masked, virtual copies of production data in minutes for analytics and testing.
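One way to picture “one rule per data element, applied everywhere” is a small rule table keyed by column name. The Python sketch below is an assumption-laden illustration rather than Delphix configuration: `RULES`, `cleanse_county`, `redact` and `mask_row` are hypothetical names, and the blanket redaction is a crude stand-in for a properly configured free text algorithm.

```python
from typing import Callable, Dict

# Data Cleansing example from the table above: variants collapse to one spelling.
CANONICAL_COUNTY = {
    "north yorks": "North Yorkshire",
    "n. yorks": "North Yorkshire",
    "n yorkshire": "North Yorkshire",
}


def cleanse_county(value: str) -> str:
    """Standardise known variants; leave unknown values unchanged."""
    return CANONICAL_COUNTY.get(value.strip().lower(), value)


def redact(value: str) -> str:
    """Crude placeholder for free text redaction: replace the whole string."""
    return "REDACTED"


# One rule per data element, defined once and reused for every source.
RULES: Dict[str, Callable[[str], str]] = {
    "county": cleanse_county,
    "notes": redact,
}


def mask_row(row: Dict[str, str], rules: Dict[str, Callable[[str], str]] = RULES) -> Dict[str, str]:
    """Apply the rule for each column that has one; pass other columns through."""
    return {col: rules[col](val) if col in rules else val for col, val in row.items()}


# Hypothetical rows from two different sources sharing the same data elements.
sql_server_row = {"id": "1", "county": "N. Yorks", "notes": "Call David after 5pm"}
db2_row = {"id": "7", "county": "North Yorks", "notes": "Prefers email"}

masked_rows = [mask_row(sql_server_row), mask_row(db2_row)]
```

Both rows end up with the same county spelling and redacted notes, even though they came from different platforms, because the rule is attached to the data element and not to the database.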
It is worth noting that Delphix masking is not limited to data that has been virtualised with Delphix – you can also use it against physical data sources.
In the next edition of this blog I will look at how Delphix can help expedite the migration of data workloads into the cloud.