Data Virtuality Best Practices 


 
 
 
Contents 
Introduction 

Server Environments 
Single Server Environment 
Multiple Server Environments 

Version Control 
Fault Tolerance, Load Balancing, and Redundancy 

Permissions Strategy 

Data Source Onboarding 

Naming & Schema Standards 
Materialization 

Access Roles 

Schema and Architecture Best Practices 
Logical Data Warehouse Schema 
Sandbox Schema 
Virtual Data Building Blocks 
Corporate Access Layer 

Further Information 


 

 
 
 
Introduction 
The following best practices outline several tactical and strategic suggestions to help 
organize your virtual data assets and fully leverage the power of the Data Virtuality Logical 
Data Warehouse platform. 

Server Environments 
Single Server Environment 
The simplest Data Virtuality installation is a single server acting as a production 
server, a development server, or both. A single server setup requires no metadata 
synchronization and is a common approach during an evaluation phase of working with data 
virtualization. Although it is the quickest server environment to set up, it does have 
drawbacks. A single server cannot provide a redundant backup in the event of hardware 
failure, so when hardware fails the server is unavailable until it is repaired. A single 
server environment also does not allow upgrades, patches, performance tuning, and similar 
changes to be tested before they reach production, which increases the risk of any change 
and makes a backup strategy, including periodic full restore testing, much more important. 
  

 
 
  

Multiple Server Environments 


A more common approach is to set up at least two servers: one for production loads and 
the other for development, testing, and maintenance rehearsal. These server names 
commonly carry a suffix such as '_Prod' or '_Dev' to differentiate between the two. 

A multiple server environment does not necessarily make things more complicated: the two 
servers could run as Virtual Machines (VMs) on the same VM host, or on separate commodity 
server hardware if desired. In either case, it is best to keep the production and 
development servers on precisely the same hardware and software configurations so that a 
change tested in development will perform as expected on the production server. 
 

 
  

Version Control 
Data Virtuality supports version control by exporting the server configuration as a SQL 
script that can be stored and versioned in any version control system. This script can be 
restored onto the original server or any other Data Virtuality server, which simplifies 
several server management functions such as rebuilding and restoring servers, and cloning 
or migrating servers. Tight integration between version control and the Data Virtuality 
server is on the near-term roadmap as of 2019-04-12. 
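As an illustration, the exported configuration is plain SQL, which diffs and versions 
cleanly. A hypothetical fragment of such an export is sketched below; the object names 
are invented and the real export contents are defined by the product: 

    -- Hypothetical excerpt of an exported server configuration script.
    -- Treat as a sketch: the actual export is generated by the server.
    CREATE VIEW Core_Views.Orders AS
        SELECT ord_id AS Order_Id,
               ord_dt AS Order_Date
        FROM   Raw_EDW.ORDERS;

Because the entire configuration reduces to a single script, one commit per change set in 
Git (or any other VCS) provides history, diffs, and rollback for the whole logical data 
warehouse. 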
 

Fault Tolerance, Load Balancing, and Redundancy 


These are advanced server topology configurations and are best discussed with Data 
Virtuality's solution engineers with respect to your overall goals for data integration 
acceleration and uptime guarantees. 

  
 
Permissions Strategy 
Data accessibility has multiple considerations. On one hand, we want to democratize the 
data or, in other words, provide expedient access to anyone with a business need to 
analyze it. On the other hand, we must address the privacy and security concerns relevant 
to PCI, PII, SOX, HIPAA, etc. Because of these concerns, it is important to configure 
data access permissions for any and all data accessible through a data virtualization 
platform. 

 
There are two key security access methods to be familiar with in this regard. The first, 
system account access, is the most practical security approach for use with Data 
Virtuality. In some cases a data source has its own user credential system, separate from 
the LAN or Data Virtuality ID; in other cases, system or policy restrictions make it 
necessary to use a 'system' account to access the source. Significant care must be taken 
when allowing users to access source data through a system account, since every user then 
has exactly the same rights as the system account, and as each other; further, system 
accounts tend to have much broader access than individual user accounts. To protect 
sensitive data when using system accounts, the logical data warehouse should be 
configured to hide or mask sensitive data elements, whether individual values within a 
field, an entire field, or an entire table or group of tables. This can be done by 
setting up a view into a raw table and applying any necessary filters or masking within 
the view. 
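As a sketch, assume a raw view Raw_EDW.CUSTOMER that contains a payment card number (all 
names here are illustrative). A masking view over it might look like: 

    -- Illustrative masking view (names are hypothetical): expose only
    -- non-sensitive fields and all but the last four digits of the card.
    CREATE VIEW Core_Views.Customer_Safe AS
        SELECT cus_id,
               cus_name,
               '****-****-****-' || RIGHT(card_number, 4) AS card_number_masked
        FROM   Raw_EDW.CUSTOMER;

Users working through the system account are granted access to Customer_Safe only; the 
raw view itself is never exposed to them. 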

 
The second security method, pass-through authentication, uses the same credentials each 
user already employs to log into the LAN, Data Virtuality, the data sources, etc. These 
credentials are passed through the logical data warehouse to each data source, so any 
access rights previously assigned to that user ID are 'passed through' the data 
virtualization layer. Pass-through authentication can be used only in limited 
circumstances, since the approach must be supported by both the consuming application 
and the data source. 

Data Source Onboarding 
There are several considerations when onboarding a data source in Data Virtuality and it is 
suggested that business and technology groups partner to discuss these factors and jointly 
decide how to best meet the needs of analytical data access while ensuring production 
system stability and performance.  
 
Typically, the actual connection to the data source should be installed and configured by a 
technology consortium including expertise from the source data system technology 
caretakers and the DV platform admin. There are several nuances to these types of 
connections and a collaborative approach ensures the highest level of performance and 
stability for both the source system and for your logical data warehouse.  
 
When a new data source is onboarded, some or all of its tables will need to be presented 
as raw views, meaning they can be seen within your logical data warehouse from a query or 
visualization tool. Which tables are presented as raw views should be decided by your 
overall analytical goals and the sensitivity of the source data; PCI, PII, SOX, and HIPAA 
obligations should all be considered at this point. It is far easier to block access to 
inappropriate data by eliminating the possibility of access at the raw view level than it 
is to control every possible access point within the logical data warehouse once the 
sensitive data is onboarded. 
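For example (table names are illustrative), onboarding might surface only the approved 
tables as raw views and simply never create a view for anything sensitive: 

    -- Expose approved tables as raw views; the payroll table is deliberately
    -- never surfaced, so no downstream view can reach it.
    CREATE VIEW Raw_EDW.ORDERS   AS SELECT * FROM DataSource_EDW.ORDERS;
    CREATE VIEW Raw_EDW.PRODUCTS AS SELECT * FROM DataSource_EDW.PRODUCTS;
    -- (no raw view is created for DataSource_EDW.PAYROLL)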
 
Other factors to discuss during onboarding include geographic distance, source system 
performance, the type and schema of the data, and expected frequency of use. In reviewing 
these factors and weighing options, it is common for data sourcing strategies to shift 
somewhat. For example, onboarding a data warehouse or data lake will give you access to 
data from several source systems and may serve your purposes adequately. When real-time 
data is needed, however, a strategy that combines the warehoused data with live system 
connections is more appropriate, even when the data warehouse already receives a daily 
load from that source system. 

In general, when source system performance is an issue, a good data sourcing approach is 
to connect the logical data warehouse to a secondary copy of the live data, such as a 
data warehouse or a standby or shadow server, or to use change data capture (CDC) or 
replication to materialize the data within the logical data warehouse storage area. 

These conversations tend to be very specific to each organization. They help define your 
overall data strategy as well as the strategy specific to your logical data warehouse, 
and they are a good starting point toward architecting it. 

  

 
Naming & Schema Standards 
There will be several different types of objects within your Data Virtuality Logical Data 
Warehouse. The basic components are data sources, raw views, core views, and business 
views. When navigating your Logical Data Warehouse, it is extremely helpful if objects 
are named so that the different types and their purposes can be clearly distinguished. A 
chart of objects, purposes, and naming suggestions follows: 

Object Type: Data Source 
Purpose: The original data sources existing outside of your Logical Data Warehouse. 
Examples: Oracle, financial systems, CRM, etc. 
Naming Convention: Prepend the data source name with 'DataSource_'. 
Examples: DataSource_Oracle, DataSource_Fin, DataSource_CRM 

Object Type: Raw Views 
Purpose: Raw views represent tables from source data systems that appear almost precisely 
as they do in the source system, with minimal renaming, transformations, or cleansing. 
Raw views are stored in schemas named to represent the original data source. For example, 
if we onboard our enterprise data warehouse named 'EDW', we create a schema named Raw_EDW 
and store all raw views from that source in it, each raw view representing one table from 
the EDW. This approach allows quick and easy schema updates in the event that the source 
system schema changes. 
Naming Convention: Prepend the raw view schemas with 'Raw_'. 
Examples: Raw_Oracle, Raw_EDW 

Object Type: Core Views 
Purpose: A core view is a view into a source system in which filtering, naming changes, 
and modeling are applied prior to any virtual joins occurring. An example change made in 
the core layer might be to present the customer name in a field identified as 
"Customer_Name", whereas in the source system it may span multiple fields with more 
cryptic names such as EDW_CUS_FIR, EDW_CUS_LAS, etc. The business-friendly semantic name 
makes browsing and discovering data much easier since it removes the guesswork from field 
identification. In addition, the entire table can be renamed at this layer to further 
clarify the stored data. 
Naming Convention: Place all core views in a single schema named 'Core_Views' for easy 
identification. 

Object Type: Business Views 
Purpose: Business views are virtualized views representing sets of data joined across one 
or more data sources. Semantic naming, aggregate functions, and filtering can all be 
applied in business views to make them business-friendly and as useful as possible. 
Business views are stored in virtual schemas named to represent the business unit or 
business function. 
Naming Convention: Prepend business view schemas with 'BusinessViews_'. 
Examples: BusinessViews_Customer_360View, BusinessViews_Finance 
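To make the layering concrete, a core view following these conventions might combine and 
rename the cryptic EDW fields from the example above (EDW_CUS_ID is an invented field 
added for illustration): 

    -- Core view: business-friendly names over the raw EDW customer table.
    CREATE VIEW Core_Views.Customer AS
        SELECT EDW_CUS_ID                        AS Customer_Id,
               EDW_CUS_FIR || ' ' || EDW_CUS_LAS AS Customer_Name
        FROM   Raw_EDW.CUSTOMER;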
 
Beyond data sources, schemas, and views, Data Virtuality also has a configurable physical 
table structure residing in a schema identified as 'dwh'. This is a useful location to 
store various metadata, slowly changing dimensions, temp tables, field-to-field mappings, 
and helper functions. A helpful guideline to support future growth is to name these 
tables so it is clear which functions they support or what purpose they serve within the 
Logical Data Warehouse. 

 
Materialization 
Within a Data Virtuality Logical Data Warehouse you have the option to access data live 
from the original source or to access a materialized view of the data. Querying the live 
data source ensures that you are retrieving the most up-to-date information in real time, 
but can respond more slowly than accessing materialized data. Querying the live data 
source can also negatively impact the source system's performance for its primary 
intended purpose, which is most likely some sort of production process. 

 
Materialization is a performance strategy wherein data is selectively stored in a repository 
to maximize analytical query performance and also to protect source system stability. The 
possibilities for materialization include materializing views, joins, aggregations, and also 
creating indexes on the tables in physical storage. Data Virtuality also uses machine 
learning to provide recommendations on which data to materialize based on user query 
patterns to proactively keep your logical data warehouse operating at peak performance. 
Materialization can also be enabled manually on any view or source table in the logical data 
warehouse.  
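Conceptually, materializing a view amounts to persisting its result set in local storage. 
The statement below is generic SQL shown only to illustrate the idea; in Data Virtuality 
itself, materialization is configured through the platform's own tooling rather than this 
syntax: 

    -- Generic-SQL illustration only: persist an expensive view's results in
    -- the local 'dwh' storage area so queries avoid hitting the source.
    CREATE MATERIALIZED VIEW dwh.mat_customer AS
        SELECT * FROM Core_Views.Customer;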

 
In deciding whether to materialize a data source, or a subset of one, there are several 
factors to consider. In general, if the data source is historically slow, plagued with 
performance problems, geographically far away, provides mission-critical services to your 
organization, or is only available through batch feeds, then materialization is the 
suggested strategy for making that data accessible. Any other data source can be 
connected to and queried live, without the use of materialization. 

 
Replication is closely related to materialization. The difference between the two is that 
materialization creates exact "shadow" copies of views or tables in local storage, 
whereas replication creates new tables in storage that can be queried explicitly by their 
own names and that can also be changed or manipulated, for example for cleansing 
purposes. 

  

 
   
Access Roles 
Identifying user roles and permissions is a helpful strategy for organizing and 
allocating appropriate groups of rights to appropriate groups of users. This minimizes 
the administrative work required when adding or removing a user, or when a user's rights 
must change because their role within the organization has changed. An example of a basic 
role chart follows: 

Role: DV Admin 
Typical Member: Operational Support; Technology Platform 'Owner' 
Development Environment Rights: 
● Server configuration 
● Install upgrades and patches 
● Onboard data sources 
● Create shared virtual views 
● Create personal virtual views 
● Modify existing views of others 
● Modify own existing views 
● Setup and configure caching/materialization 
● Connect to & query live data 
● Connect to & query materialized data 
Production Environment Rights: 
● Server configuration 
● Install upgrades and patches 
● Onboard data sources 
● Create shared virtual views 
● Create personal virtual views 
● Modify existing views of others 
● Modify own existing views 
● Setup and configure caching/materialization 
● Connect to & query live data 
● Connect to & query materialized data 

Role: DV Power User 
Typical Member: Business Group Analyst Lead 
Development Environment Rights: 
● Onboard data sources 
● Create shared virtual views 
● Create personal virtual views 
● Modify existing views of others 
● Modify own existing views 
● Setup and configure caching/materialization 
● Connect to & query live data 
● Connect to & query materialized data 
Production Environment Rights: 
● Create shared virtual views 
● Create personal virtual views 
● Modify own existing views 
● Connect to & query live data 
● Connect to & query materialized data 

Role: DV User 
Typical Member: Business Analyst 
Development Environment Rights: 
● Create personal virtual views 
● Modify own existing views 
● Setup and configure caching/materialization 
● Connect to & query live data 
● Connect to & query materialized data 
Production Environment Rights: 
● Create personal virtual views 
● Modify own existing views 
● Connect to & query materialized data 
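In practice this rights matrix maps onto object- and schema-level permissions. The 
statements below are a generic-SQL sketch with invented role and object names; Data 
Virtuality administers users and roles through its own facilities: 

    -- Generic-SQL sketch of the production rights above (names invented).
    GRANT SELECT ON Core_Views.Customer      TO dv_user;        -- query data
    GRANT SELECT ON Raw_EDW.ORDERS           TO dv_power_user;  -- includes raw/live access
    GRANT ALL PRIVILEGES ON dwh.mat_customer TO dv_admin;       -- manage materializations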
    
Schema and Architecture Best 
Practices 
When you first start to build out your logical data warehouse you might ask: How do I get 
started? How do I lay out my database design? Although your specific schema strategy will 
depend on your unique data sources, technology environment, analyst preferences, and 
business strategy, there are common elements to all good logical data warehouse schemas 
that we will review here. 

 
Logical Data Warehouse Schema 
The Logical Data Warehouse itself is the overarching data warehouse schema, which may 
include physical data stores, virtual data stores, virtual views, core views, 
materialized views, and sandboxes. Although each organization's logical data warehouse is 
likely to be very different from the next, there are core components to each, and core 
principles that, when followed, will help maximize the value you extract from your Data 
Virtuality Logical Data Warehouse. 

The key areas of your Data Virtuality Logical Data Warehouse are the Raw, Core, and 
Business layers. Each of these layers serves a specific purpose but is also flexible, so 
some crossover between layers is likely and in some cases necessary. The Raw layer is a 
foundational layer that, for the most part, presents the data tables of the source 
systems as they are in the source system. In certain cases it may be appropriate to apply 
some transformations, masking, data cleansing, etc. in order to make the tables relevant 
in the logical data warehouse. There is no data integration in this layer. 
 
Next is the Core layer, in which business-friendly naming, appropriate transformations 
and aggregations, and the collapsing of third normal form source structures into star 
schemas may all be applied. Views from this Core layer should be shared and treated as 
the building blocks upon which virtual views are built, since they have been cleaned up 
and transformed for analytics. 

The top layer in the schema is the Business layer. It is here that logical data marts, 
virtual building blocks, and stacked views are created. Although the other two layers are 
critical in providing the foundation, it is the Business layer that provides the 
acceleration and empowerment for analysts. 
 
Providing department- or functional-group-specific virtual data marts is a powerful way 
to provide 360-degree views of each area of your business. For example, a retail 
organization may have logical data marts for customer, product, traffic, and sales, with 
each logical data mart managed by the most appropriate group. These virtual views can 
easily evolve along with the business. Data Virtuality's logical data warehouse provides 
agile, business-driven evolution of your data warehouse, ensuring that you can analyze 
data at the speed of your business, consistently and reliably. 

  
Sandbox Schema 
Each department, or functional group of analysts, can be allotted a sandbox schema that is 
especially valuable for data exploration and discovery. The sandbox schema includes 
access to the entire logical data warehouse schema along with a sandbox area where the 
analysts have elevated access rights. Additional data sources may be onboarded or 
uploaded to the sandbox to allow early exploration and discovery of data that has not yet 
been formally integrated into the logical data warehouse. The sandbox schema can also be 
utilized as a playground of sorts wherein rapid prototyping efforts can be executed without 
interfering with the production schema. In the sandbox schema, new views can be built on 
the logical data warehouse schema, the sandbox schema, or both, whereas in the logical 
data warehouse schema views can only be built on production views that exist within the 
logical data warehouse itself. 
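For example (names are illustrative), an analyst could prototype against a freshly 
uploaded spreadsheet by joining it with a production core view, all without touching the 
production schema: 

    -- Sandbox prototype: join an uploaded table with a production view.
    CREATE VIEW Sandbox_Marketing.campaign_scores AS
        SELECT c.Customer_Name,
               u.campaign_score
        FROM   Core_Views.Customer          c
        JOIN   Sandbox_Marketing.upload_csv u
          ON   u.customer_id = c.Customer_Id;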

 
 
  
  
 
   
 

 
Virtual Data Building Blocks 
Creating and sharing virtual building blocks is a terrific way to save your analysts time 
by reducing redundant effort. These building blocks also serve to standardize how KPIs 
and metrics are calculated within your organization and give any analyst prepackaged 
datasets that streamline their efforts and ensure data quality and accuracy. 
 
The idea of the virtual data building blocks is to create different types of business views 
that can be stacked as needed to quickly serve new initiatives with minimal additional 
development time. There are three suggested types of business views in this approach: 
macro, filtering, and micro level business views.  
 

 
[Figure: Virtual Data Building Blocks (used with permission from Dirk Garner)] 

 
At the macro level you would build broad reusable views without filtering that can be used 
as a base layer for more specific needs. So, for example, for website traffic you might build 
a business view that contains all web traffic from all devices and for all purposes on all 
dates. It is unlikely you would ever need to query that view directly because of its broad 
scope but building it as a foundational layer provides a starting point for anyone wanting to 
analyze any aspect of traffic.  
 
On top of the macro view layer you would then create filtering views as needed. In our 
example of web traffic a filtering view might be all product-browsing traffic. Another 
example might be purchase conversions by landing page, by device. Filtering views can be 
easily reused when similar specific data subsets are needed.  
 
An initiative-specific micro view layer might sit on top of the filtering layer, which in 
turn sits on top of the macro layer. The micro layer joins across one or more filtering 
views and can apply additional filters and aggregations so that only the desired data for 
a specific initiative is presented. This micro layer precisely serves data visualization, 
data discovery, BI, and many other use cases. These views can be shared and re-used but 
are less likely to have broad audiences like the Macro and Filtering layers will have. 
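A sketch of the three-layer stack for the web traffic example follows; every schema, 
view, and field name here is illustrative: 

    -- Macro layer: all traffic, no filters; rarely queried directly.
    CREATE VIEW BusinessViews_Web.traffic_all AS
        SELECT * FROM Core_Views.Web_Traffic;

    -- Filtering layer: a reusable subset (product-browsing traffic only).
    CREATE VIEW BusinessViews_Web.traffic_product_browsing AS
        SELECT * FROM BusinessViews_Web.traffic_all
        WHERE  page_type = 'product';

    -- Micro layer: initiative-specific aggregation over the filtering layer.
    CREATE VIEW BusinessViews_Web.mobile_product_sessions_daily AS
        SELECT visit_date, COUNT(*) AS sessions
        FROM   BusinessViews_Web.traffic_product_browsing
        WHERE  device_type = 'mobile'
        GROUP  BY visit_date;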

Corporate Access Layer 


One challenge larger organizations face is creating a corporate data access layer that 
can serve business users with data while respecting the differences among the 
corporation's business units in their business practices and technical landscapes. 

Data Virtuality provides a great way to build a corporate data access layer using a number 
of interconnected Data Virtuality instances.  

The architecture involves one instance (or, respectively, one development instance and 
one production instance) at each business unit, and one instance or instance pair at the 
parent organization. Each business unit's instance (or instance pair) is responsible for: 

1) Providing data access and data modelling capabilities for the business unit's own 
data. 
2) Exposing a defined subset of the local data to the Data Virtuality instance of the 
parent organization. 
3) Consuming the subset of data published by the Data Virtuality instance of the parent 
organization. 

The Data Virtuality instance of the parent organization, in turn, is responsible for: 

1) Gathering data from business units for centralized data access 


2) Publishing subsets of data from different business units to other business units 

Such an architecture allows centralized access to and modelling of data across a 
corporation while maintaining a high level of flexibility and autonomy for each 
individual business unit. One special benefit is the ability to address the varying legal 
requirements of different countries, for example with regard to not storing data abroad 
or anonymizing personal data directly at the source. 
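As a sketch (all names invented), once a business unit's instance has been onboarded by 
the parent instance as an ordinary data source, the parent can re-expose the published 
subset under a corporate schema: 

    -- Parent instance: the business unit's published subset appears as just
    -- another data source; re-expose it in a corporate-wide schema.
    CREATE VIEW BusinessViews_Corporate.emea_sales AS
        SELECT * FROM DataSource_BU_EMEA.published_sales;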
 
  
 

Further Information 
Further information and assistance is available by contacting Data Virtuality at 
+1 650 898 0227 or through our website: http://support.datavirtuality.com/hc 
