Data Virtuality Best Practices
Server Environments
Single Server Environment
Multiple Server Environments
Version Control
Permissions Strategy
Introduction
The following best practices outline several tactical and strategic suggestions to help
organize your virtual data assets and fully leverage the power of the Data Virtuality Logical
Data Warehouse platform.
Server Environments
Single Server Environment
The simplest Data Virtuality installation is a single server acting as a production server, a development server, or both. In a single server setup no synchronization of metadata is required, and this is a common approach during an evaluation phase of working with data virtualization. Although it is the quickest server environment to set up, it does have drawbacks. A single server cannot provide a redundant backup in the event of hardware failure, meaning that when hardware fails the server will be unavailable until it is repaired. A single server environment also does not allow upgrades, patches, performance tuning, and similar changes to be tested before they are pushed to production, so changes carry greater risk and a backup strategy, including periodic full restore testing, becomes much more important.
Version Control
Data Virtuality supports version control by exporting the server configuration as a SQL script that can be stored and versioned in any version control system. This SQL script can be restored to the original server or to any other Data Virtuality server, which simplifies several server management tasks such as rebuilding, restoring, cloning, and migrating servers. Tighter integration between version control systems and the Data Virtuality server is on the near-term roadmap as of 2019-04-12.
Permissions Strategy
Data accessibility has multiple considerations. On one hand, we want to democratize the
data or, in other words, provide expedient access to anyone with a business need to
analyze data. On the other hand, we need to consider the privacy and security concerns
relevant to PCI, PII, SOX, HIPAA, etc. Because of these concerns, it is important to configure
data access permissions for any and all data accessible through a data virtualization
platform.
There are two key security access methods to be familiar with in this regard. The first, system account access, is the most practical security approach for use with Data Virtuality. In some cases, a data source may have its own user credential system that is entirely separate from the LAN or Data Virtuality ID. In other cases, system or policy restrictions make it necessary to use a 'system' account to access data sources. Significant care must be taken when allowing users to access source data through a system account, since every user then has exactly the same rights as the system account and as each other. Further, system accounts tend to have much broader access than individual user accounts do. To protect sensitive data when using system accounts, the logical data warehouse should be configured to hide or mask sensitive data elements, whether they are values within a field, an entire field, or an entire table or group of tables. This can be done by creating a view on top of a raw table and applying any necessary filters or masking within the view.
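As a minimal sketch of this approach, consider a raw CRM table containing card numbers (the Raw_CRM schema, table, and column names below are hypothetical, and exact string functions may differ slightly by SQL dialect):

    -- Expose customers without revealing the full card number,
    -- and filter out rows that should not be visible at all.
    CREATE VIEW Core_Views.Customers_Masked AS
    SELECT
        CUSTOMER_ID,
        CUSTOMER_NAME,
        -- keep only the last four digits of the card number
        '****-****-****-' || RIGHT(CARD_NUMBER, 4) AS CARD_NUMBER_MASKED
    FROM Raw_CRM.CUSTOMERS
    WHERE RECORD_STATUS = 'ACTIVE';

Users connecting through the system account are then granted access to the masked view only, while the underlying raw table remains hidden.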
The second security method, pass-through authentication, uses the same credentials that each user already uses to log into the LAN, Data Virtuality, the data sources, etc. These credentials can be passed through the logical data warehouse to each data source, so that any access rights previously assigned to that user ID are applied, or 'passed through', by the data virtualization layer. Pass-through authentication can only be used in limited circumstances, since the approach must be supported by both the consuming application and the data source.
Data Source Onboarding
There are several considerations when onboarding a data source in Data Virtuality and it is
suggested that business and technology groups partner to discuss these factors and jointly
decide how to best meet the needs of analytical data access while ensuring production
system stability and performance.
Typically, the actual connection to the data source should be installed and configured by a
technology consortium including expertise from the source data system technology
caretakers and the DV platform admin. There are several nuances to these types of
connections and a collaborative approach ensures the highest level of performance and
stability for both the source system and for your logical data warehouse.
When each new data source is onboarded it will be necessary to present some or all of the tables within that data source as raw views, meaning they can be seen within your logical data warehouse from a query or visualization tool. Which tables are presented as raw views should be decided by your overall analytical goals and the sensitivity of the source data. Requirements such as PCI, PII, SOX, and HIPAA should all be taken into account at this point. It is easier to block access to inappropriate data by eliminating the possibility of access at the raw view level than it is to try to control every possible access point within the logical data warehouse once the sensitive data has been onboarded.
Other considerations to discuss during onboarding are factors such as geographic distance,
source system performance, type and schema of data, and expected frequency of use. In
reviewing these factors and weighing options it is common that data sourcing strategies
will shift somewhat. For example, onboarding a data warehouse or data lake will gain you
access to data from several source systems and may serve your purposes adequately.
However, when real-time data is needed, a strategy that combines warehoused data with live system connections is more appropriate, even when the data warehouse receives a daily load from that source system, so that the need for real-time data is met while still leveraging the warehoused data.
In general, when source system performance is an issue, a good data sourcing approach is
to connect the logical data warehouse to a secondary copy of the live data such as a data
warehouse, standby or shadow server, or you could use CDC or replication to materialize
the data within the logical data warehouse storage area.
These conversations tend to be very specific to each organization; they help define your overall data strategy as well as the strategy specific to your logical data warehouse, and they are a good starting point toward architecting your logical data warehouse.
Naming & Schema Standards
There will be several different types of objects within your Data Virtuality Logical Data
Warehouse. The basic components are data sources, raw views, core views, and business
views. It is extremely helpful when navigating your Logical Data Warehouse to see objects
named in such a way as to be able to clearly distinguish among the different types and
their purposes. A chart of objects, naming suggestions, and purposes follows:
Object Type: Data Source
Purpose: The original data sources existing outside of your Logical Data Warehouse. Examples: Oracle, financial systems, CRM, etc.
Naming convention: Prepend the data source name with 'DataSource_'. Examples: DataSource_Oracle, DataSource_Fin, DataSource_CRM.

Object Type: Raw Views
Purpose: Raw views represent tables from source data systems and appear almost precisely as they do in the source system, with minimal renaming, transformations, or cleansing; this approach allows quick and easy schema updates in the event that the source system schema changes. Raw views are stored in schemas named to represent the original data source. For example, if we onboard our enterprise data warehouse named 'EDW', we create a schema named Raw_EDW and store all raw views from that source in it; each raw view in this schema represents a table from the EDW.
Naming convention: Prepend raw view schemas with 'Raw_'. Examples: Raw_Oracle, Raw_EDW.

Object Type: Core Views
Purpose: A core view is a view into a source system in which filtering, naming changes, and modeling are applied prior to any virtual joins occurring. An example change made in the core layer might be to present the customer name in a field identified as "Customer_Name", whereas in the source system it may be spread across multiple fields with more cryptic names such as EDW_CUS_FIR, EDW_CUS_LAS, etc. The business-friendly semantic name makes browsing and discovering data much easier since it removes the guesswork from field identification. In addition, an entire table can be renamed at this layer to further clarify the stored data.
Naming convention: Place all core views in a single schema named "Core_Views" for easy identification.

Object Type: Business Views
Purpose: Business views are virtualized views joining one or more tables from disparate data sources. Semantic naming, aggregate functions, and filtering can all be applied in business views to make them business-friendly and as useful as possible. Business views are stored in virtual schemas named to represent the business unit or business function.
Naming convention: Prepend business view schemas with 'BusinessViews_'. Examples: BusinessViews_Customer_360View, BusinessViews_Finance.
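As a minimal sketch of the core view renaming described above (the Raw_EDW table and column names are hypothetical and will differ by source system):

    -- Core view presenting business-friendly semantic names
    CREATE VIEW Core_Views.Customer AS
    SELECT
        EDW_CUS_ID AS Customer_ID,
        -- combine the cryptic first and last name fields into a single field
        EDW_CUS_FIR || ' ' || EDW_CUS_LAS AS Customer_Name
    FROM Raw_EDW.EDW_CUSTOMER;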
Beyond data sources, schemas, and views, Data Virtuality also has a configurable physical table structure residing in a schema identified as 'dwh'. This is a useful location to store various metadata, slowly changing dimensions, temp tables, field-to-field mappings, and helper functions. A helpful guideline to support future growth is to name these tables so that it is clear which functions they support or what their purpose is within the Logical Data Warehouse.
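For example, descriptively named helper tables in the 'dwh' schema might look like the following (the table names and columns are purely illustrative, and the exact DDL supported by your internal storage may differ):

    -- Field-to-field mapping between source systems and the core layer
    CREATE TABLE dwh.map_source_to_core_fields (
        source_system VARCHAR(50),
        source_field  VARCHAR(100),
        core_field    VARCHAR(100)
    );

    -- Slowly changing dimension history for customer attributes
    CREATE TABLE dwh.scd_customer_history (
        customer_id INTEGER,
        attribute   VARCHAR(100),
        old_value   VARCHAR(255),
        new_value   VARCHAR(255),
        valid_from  DATE,
        valid_to    DATE
    );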
Materialization
Within a Data Virtuality Logical Data Warehouse you have the option to access data live from the original source or to access a materialized view of the data. Querying the live data sources ensures that you are retrieving the most up-to-date information in real time, but it can be slower than accessing materialized data. Querying the live data source can also have a negative impact on the source system's performance for its primary purpose, which is most likely some sort of production process.
Materialization is a performance strategy wherein data is selectively stored in a repository
to maximize analytical query performance and also to protect source system stability. The
possibilities for materialization include materializing views, joins, aggregations, and also
creating indexes on the tables in physical storage. Data Virtuality also uses machine
learning to provide recommendations on which data to materialize based on user query
patterns to proactively keep your logical data warehouse operating at peak performance.
Materialization can also be enabled manually on any view or source table in the logical data
warehouse.
In deciding whether or not to materialize a data source, or a subset of a data source, there
are several factors to consider. In general, if the data source is historically slow, plagued
with performance problems, is geographically far away, provides mission critical services to
your organization, or is only available through batch feeds, then materializing is the
suggested strategy for making that data accessible. For any other data source, connecting directly to the live source without materialization is suggested.
Replication is a process closely related to materialization. The difference between the two is that materialization creates exact "shadow" copies of views or tables in the local storage, whereas replication creates new tables in the storage that can be queried explicitly by their own names and that can also be changed or manipulated, for example for cleansing purposes.
Access Roles
Identifying user roles and permissions is a helpful strategy for organizing and allocating appropriate groups of rights to appropriate groups of users. This minimizes the administrative work required when adding or removing a user, or when a user's rights need to change because their role within the organization has changed. Typical roles might include, for example, an administrator with full configuration rights, developers who create and maintain views, and analysts with read access to the business views relevant to their function.
Logical Data Warehouse Schema
The Logical Data Warehouse itself is the overarching data warehouse schema that may include physical data stores, virtual data stores, virtual views, core views, materialized views, and sandboxes. Although each organization's logical data warehouse is likely to be very different from the next, there are core components to each, and core principles that, when followed, will help maximize the value you extract from your Data Virtuality Logical Data Warehouse.
The key areas of your Data Virtuality Logical Data Warehouse are the Raw, Core, and Business layers. Each of these layers serves a specific purpose but is also flexible, so some crossover between layers is likely and in some cases necessary. The Raw layer is a
foundational layer that, for the most part, presents the data tables of the source systems
as they are in the source system. In certain cases it may be appropriate to apply some
transformations, masking, data cleansing, etc. in order to make the tables relevant in the
logical data warehouse. There is no data integration in this layer.
Next is the Core layer, in which business-friendly naming, appropriate transformations, and aggregations can be applied, and appropriate third normal form source structures can be collapsed into star schemas. Views from this Core layer should be shared and treated as building blocks upon which to build virtual views, since they have already been cleaned up and transformed for analytics.
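A hedged sketch of such a collapse, joining normalized tables from a single source into one flattened core view (all table and column names are hypothetical):

    -- Collapse normalized product, category, and supplier tables
    -- into a single denormalized core view
    CREATE VIEW Core_Views.Product AS
    SELECT
        p.PRD_ID  AS Product_ID,
        p.PRD_NAM AS Product_Name,
        c.CAT_NAM AS Category_Name,
        s.SUP_NAM AS Supplier_Name
    FROM Raw_EDW.PRODUCT p
    JOIN Raw_EDW.CATEGORY c ON p.CAT_ID = c.CAT_ID
    JOIN Raw_EDW.SUPPLIER s ON p.SUP_ID = s.SUP_ID;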
The top layer in the schema is the Business layer. It is here that logical data marts, virtual building blocks, and stacked views are created. Although the other two layers are critical in providing the foundation, it is the Business layer that provides the acceleration and empowerment for analysts.
Providing department or functional group-specific virtual data marts is a powerful way to
provide 360-degree views of each area of your business. For example, a retail organization may have a logical data mart for customer, product, traffic, and sales, with each logical data mart being managed by the most appropriate group. These virtual views can easily evolve along with the business. Data Virtuality's logical data warehouse provides agile, business-driven evolution of your data warehouse, ensuring that you can analyze data at the speed of your business consistently and reliably.
Sandbox Schema
Each department, or functional group of analysts, can be allotted a sandbox schema that is
especially valuable for data exploration and discovery. The sandbox schema includes
access to the entire logical data warehouse schema along with a sandbox area where the
analysts have elevated access rights. Additional data sources may be onboarded or
uploaded to the sandbox to allow early exploration and discovery of data that has not yet
been formally integrated into the logical data warehouse. The sandbox schema can also be
utilized as a playground of sorts wherein rapid prototyping efforts can be executed without
interfering with the production schema. In the sandbox schema, new views can be built
on the logical data warehouse schema, the sandbox schema, or both, whereas in the logical data warehouse schema views can only be built on production views that exist within the logical data warehouse itself.
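As a minimal illustration of this pattern (the Sandbox_Marketing schema, the uploaded table, and the column names are hypothetical):

    -- Prototype view combining a production core view with data
    -- uploaded only to the marketing team's sandbox schema
    CREATE VIEW Sandbox_Marketing.Campaign_Target_Customers AS
    SELECT
        c.Customer_ID,
        c.Customer_Name,
        u.campaign_code
    FROM Core_Views.Customer c
    JOIN Sandbox_Marketing.uploaded_campaign_list u
      ON c.Customer_ID = u.customer_id;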
Virtual Data Building Blocks
Creating and sharing virtual building blocks is a terrific way to save your analysts' time by
reducing duplicative and redundant effort. These building blocks also serve to standardize
how KPIs and metrics are calculated within your organization and provide the means for
any analyst to use prepackaged datasets to streamline their efforts and ensure data quality
and accuracy.
The idea of the virtual data building blocks is to create different types of business views
that can be stacked as needed to quickly serve new initiatives with minimal additional
development time. There are three suggested types of business views in this approach:
macro, filtering, and micro level business views.
At the macro level you would build broad reusable views without filtering that can be used
as a base layer for more specific needs. So, for example, for website traffic you might build
a business view that contains all web traffic from all devices and for all purposes on all
dates. It is unlikely you would ever need to query that view directly because of its broad
scope but building it as a foundational layer provides a starting point for anyone wanting to
analyze any aspect of traffic.
On top of the macro view layer you would then create filtering views as needed. In our
example of web traffic a filtering view might be all product-browsing traffic. Another
example might be purchase conversions by landing page, by device. Filtering views can be
easily reused when similar specific data subsets are needed.
An initiative-specific micro view layer might sit on top of the filtering layer, which sits on top of the macro layer. The micro layer joins across one or more filtering layers and can apply additional filters and aggregations so that only the desired data for a specific initiative is presented. This micro layer serves data visualization, data discovery, BI, and many other use cases precisely. These views can be shared and re-used but are less likely to have broad audiences like the macro and filtering layers will have.

(The Virtual Data Building Blocks approach is used with permission from Dirk Garner.)
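A minimal sketch of how these stacked layers might look as views (all schema, view, and column names are hypothetical):

    -- Macro view: all web traffic, no filtering
    CREATE VIEW BusinessViews_Traffic.All_Web_Traffic AS
    SELECT visit_id, visit_date, device_type, page_type, landing_page, converted
    FROM Core_Views.Web_Traffic;

    -- Filtering view: product-browsing traffic only
    CREATE VIEW BusinessViews_Traffic.Product_Browsing_Traffic AS
    SELECT *
    FROM BusinessViews_Traffic.All_Web_Traffic
    WHERE page_type = 'PRODUCT';

    -- Micro view: conversions by landing page and device for one initiative
    CREATE VIEW BusinessViews_Traffic.Conversions_By_Landing_Page AS
    SELECT landing_page, device_type, COUNT(*) AS conversions
    FROM BusinessViews_Traffic.Product_Browsing_Traffic
    WHERE converted = TRUE
    GROUP BY landing_page, device_type;

Each layer can be reused independently: the macro view serves as a foundation, the filtering view answers recurring questions, and the micro view delivers exactly the data a specific initiative needs.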
Data Virtuality provides a great way to build a corporate data access layer using a number of interconnected Data Virtuality instances.
The architecture involves one instance (or one development instance and one production instance) at each business unit and one instance or instance pair at the parent organization. Each business unit instance, or pair of instances, is responsible for:
1) Providing data access and data modelling capabilities for the business unit's own data.
2) Exposing a defined subset of the local data to the Data Virtuality instance of the parent organization.
3) Consuming a subset of data which is published by the Data Virtuality instance of the parent organization.
The Data Virtuality instance of the parent organization, in turn, is responsible for consuming the subsets exposed by the business unit instances, modelling them centrally, and publishing defined subsets back to the business units.
Such an architecture allows centralized access and modelling of data in a corporation while maintaining a high level of flexibility and autonomy for each individual business unit. One special benefit is the possibility to address the varying legal requirements of different countries, for example with regard to not storing data abroad or anonymizing personal data directly at the source.
Further Information
Further information and assistance is available by contacting Data Virtuality at +1 650 898 0227 or through our website: http://support.datavirtuality.com/hc