Realistic Synthetic Data at Scale: Influenced by, but not Production Data | SNIA

Abstract

To have high confidence in a product, testing it against a data set that resembles production data is a must. The challenge is generating test data that represents production. Production data is not predictable: it does not follow a simple formula, and many variables characterize it. Broadly, test data can be divided into two categories: Arbitrary, which is random and unstructured, and Realistic, which follows patterns and is predictable and controlled. To generate Realistic test data, the right patterns need to be captured by analyzing existing production data. Access to production data can be regulated and is not easy to obtain. However, it is possible to implement code that reads the relevant data from production without exposing the actual data and updates, when required, the models used to generate test data, so that the generated test data represents production data in selected dimensions, as directed by the business needs of the product under test. In this session Mehul Sheth will talk about Druva's journey in generating test data at scale that is highly influenced by production data and has the "genes" of production data, yet takes not a single byte "as-is" from production. Although Druva's journey and decisions may be unique and not directly applicable in all scenarios, the session will highlight the thought process, algorithms and decisions in a generic fashion. It focuses on the ability to assess the model and tweak it to include edge conditions while keeping it realistic, applicable at all times, versatile, repeatable and easily controllable. Specifically, the session describes a process for modeling a directory tree of files and folders using the variables (such as file size, the number of files and folders in each folder at each depth, patterns in file and folder names, the ratio of different file types and others) that may be important for the application under test.
The session then shows how to apply this model to generate file sets of different sizes filled with completely random data while maintaining the relations between the modeled variables. Datasets generated this way are random in their raw form, yet they maintain the characteristics of the model and can be used for performance or stress testing of anti-virus software, legal discovery software or backup software. Extending the concept further, it can be used to model any data and metadata, such as mailboxes or transactional databases.
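
The two steps described above, modeling a directory tree and then generating a random file set from the model, can be pictured with a minimal Python sketch. It is an illustration under assumptions, not Druva's implementation: the function names `build_model` and `generate_tree`, the dictionary layout and the choice to sample from the observed values are all hypothetical. The sketch reads only metadata (counts, sizes, extensions) and never file contents, and it covers only a subset of the variables named in the abstract: files and folders per folder at each depth, and file sizes grouped by file type. Patterns in file and folder names are not modeled.

```python
import os
import random
from collections import defaultdict


def build_model(root):
    """Collect metadata-only statistics from a tree; file contents are never read."""
    model = {
        "files_per_dir": defaultdict(list),  # depth -> observed file counts per folder
        "dirs_per_dir": defaultdict(list),   # depth -> observed subfolder counts per folder
        "sizes_by_ext": defaultdict(list),   # extension -> observed file sizes in bytes
    }
    root = os.path.abspath(root)
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        model["files_per_dir"][depth].append(len(filenames))
        model["dirs_per_dir"][depth].append(len(dirnames))
        for name in filenames:
            ext = os.path.splitext(name)[1]
            size = os.path.getsize(os.path.join(dirpath, name))
            model["sizes_by_ext"][ext].append(size)
    return model


def generate_tree(model, out_root, seed=None):
    """Create a synthetic tree with random bytes, sampled from the model."""
    rng = random.Random(seed)  # a fixed seed makes the output repeatable
    exts = list(model["sizes_by_ext"])
    # Weighting by observed count preserves the ratio of file types.
    weights = [len(model["sizes_by_ext"][e]) for e in exts]
    created = 0

    def fill(path, depth):
        nonlocal created
        os.makedirs(path, exist_ok=True)
        n_files = rng.choice(model["files_per_dir"].get(depth, [0]))
        for i in range(n_files):
            ext = rng.choices(exts, weights)[0]
            # Size is drawn per extension, keeping the size/type relation.
            size = rng.choice(model["sizes_by_ext"][ext])
            data = bytes(rng.getrandbits(8) for _ in range(size))  # simple, not fast
            with open(os.path.join(path, f"f{i:04d}{ext}"), "wb") as fh:
                fh.write(data)
            created += 1
        # Folders at the deepest observed depth have no subfolders,
        # so the recursion stops at the depth seen in the source tree.
        n_dirs = rng.choice(model["dirs_per_dir"].get(depth, [0]))
        for i in range(n_dirs):
            fill(os.path.join(path, f"d{i:04d}"), depth + 1)

    fill(out_root, 0)
    return created
```

A call such as `generate_tree(build_model(source_dir), target_dir, seed=42)` would write a tree whose depth profile and size-per-type distribution follow the source, while every byte of content is freshly generated. Sampling directly from observed values is the simplest possible choice here; a fitted distribution per variable would be an alternative where the model itself must not retain exact production values.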

Learning Objectives

Production-like data generation for testing
Synthetic data generation at scale
Modeling production data without exposing customer data