The terms data virtualization, data federation, and data integration are used more and more often. Unfortunately, these terms have never been defined properly. Let’s see if, together, we can come up with generally accepted definitions.
Data Virtualization
Data virtualization is the process of offering data consumers a data access interface that hides the technical aspects of stored data, such as location, storage structure, API, access language, and storage technology.
Data virtualization provides an abstraction layer that data consumers can use to access data in a consistent manner. A data consumer can be any application retrieving or manipulating data, such as a reporting or data entry application. This abstraction layer hides all the technical aspects of data storage. The applications don’t have to know where the data is physically stored, where the database servers run, what the source APIs and database languages are, and so on.
Technically, data virtualization can be implemented in many different ways. Here are a few examples:
- With a federation server, multiple data stores can be made to look like one.
- An enterprise service bus (ESB) can be used to develop a layer of services that allow access to data.
- Placing data stores in the cloud is also a form of data virtualization.
- In a way, building up a virtual database in memory with data loaded from physical databases can also be regarded as data virtualization.
- Organizations could also develop their own software-based abstraction layer that hides where and how the data is stored.
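The last option, a home-grown abstraction layer, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names (`VirtualizationLayer`, `SqliteCustomerStore`, `CsvProductStore`), the `fetch` and `get` methods, and the sample data are all hypothetical and not taken from any product. The point is that the data consumer calls one interface and never sees whether the data sits in a SQL database or in a CSV file.

```python
import csv
import io
import sqlite3
from typing import Optional, Protocol


class DataStore(Protocol):
    """Hypothetical contract every wrapped data store must satisfy."""

    def fetch(self, key: str) -> Optional[dict]: ...


class SqliteCustomerStore:
    """Customers live in a relational database and are read with SQL."""

    def __init__(self) -> None:
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute("CREATE TABLE customers (id TEXT PRIMARY KEY, name TEXT)")
        self._conn.execute("INSERT INTO customers VALUES ('c1', 'Acme')")

    def fetch(self, key: str) -> Optional[dict]:
        row = self._conn.execute(
            "SELECT id, name FROM customers WHERE id = ?", (key,)
        ).fetchone()
        return {"id": row[0], "name": row[1]} if row else None


class CsvProductStore:
    """Products live in a flat file and are read with a CSV parser."""

    def __init__(self) -> None:
        self._text = "id,name\np1,Widget\n"

    def fetch(self, key: str) -> Optional[dict]:
        for row in csv.DictReader(io.StringIO(self._text)):
            if row["id"] == key:
                return dict(row)
        return None


class VirtualizationLayer:
    """Single access interface that hides location, API, and access language."""

    def __init__(self) -> None:
        self._stores: dict = {}

    def register(self, entity: str, store: DataStore) -> None:
        self._stores[entity] = store

    def get(self, entity: str, key: str) -> Optional[dict]:
        return self._stores[entity].fetch(key)


layer = VirtualizationLayer()
layer.register("customer", SqliteCustomerStore())
layer.register("product", CsvProductStore())
```

A consumer would call `layer.get("customer", "c1")` and `layer.get("product", "p1")` in exactly the same way. Note that nothing is combined across the two stores here, which is why this is virtualization without federation.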
Data Federation
Data federation is a form of data virtualization where the data stored in a heterogeneous set of autonomous data stores is made accessible to data consumers as one integrated data store by using on-demand data integration.
This definition is based on the following concepts:
- Data virtualization: Data federation is a form of data virtualization. Note that not all forms of data virtualization imply data federation.
- Heterogeneous set of data stores: Data federation should make it possible to bring data together from data stores using different storage structures, different access languages, and different APIs.
- Autonomous data stores: Data stores accessed by data federation are able to operate independently; in other words, they can be used outside the scope of data federation.
- One integrated data store: Regardless of how and where data is stored, it should be presented as one integrated data set. This implies that data federation involves transformation, cleansing, and possibly even enrichment of data.
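- On-demand integration: This refers to when the data from a heterogeneous set of data stores is integrated. With data federation, the data is integrated at the moment a data consumer asks for it, not in advance.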
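The concepts above can be illustrated with a minimal sketch, assuming two toy data stores: a relational one (an in-memory SQLite database named `crm`) and a document-style one (a list of JSON strings named `order_documents`). Both names, the schema, and the function `federated_customer_totals` are invented for this example. The stores stay autonomous and heterogeneous; the integrated result is computed only when the function is called and is never stored.

```python
import json
import sqlite3

# Store 1 (hypothetical): a relational database queried with SQL.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])

# Store 2 (hypothetical): a document store holding JSON, amounts in cents.
order_documents = [
    json.dumps({"customer_id": 1, "amount_cents": 1250}),
    json.dumps({"customer_id": 1, "amount_cents": 300}),
    json.dumps({"customer_id": 2, "amount_cents": 1000}),
]


def federated_customer_totals() -> list:
    """Integrate both stores at request time and return one unified result."""
    # Read the document store through its own "API" (JSON parsing).
    totals_in_cents: dict = {}
    for document in order_documents:
        order = json.loads(document)
        customer_id = order["customer_id"]
        totals_in_cents[customer_id] = (
            totals_in_cents.get(customer_id, 0) + order["amount_cents"]
        )

    # Read the relational store through SQL.
    customers = crm.execute("SELECT id, name FROM customers ORDER BY id").fetchall()

    # Join the two sources and transform cents into currency units.
    return [
        {"id": cid, "name": name, "total": totals_in_cents.get(cid, 0) / 100}
        for cid, name in customers
    ]
```

Because nothing is materialized, an order appended to `order_documents` shows up in the very next call to `federated_customer_totals()`. Each store can also still be used directly, outside the federation.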
Data Integration
Data integration is the process of combining data from a heterogeneous set of data stores to create one unified view of all that data.
Data integration involves joining data, transforming data values, enriching data, and cleansing data values. This is the approach taken when using ETL tools.
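As a rough illustration of these four activities, here is a small batch-style sketch in plain Python. It is not how any particular ETL tool works; the sample records, the `COUNTRY_NAMES` lookup table, and the `integrate` function are assumptions made up for this example.

```python
# Hypothetical source 1: customer records with messy names and country codes.
customers = [
    {"id": 1, "name": "  acme corp ", "country": "nl"},
    {"id": 2, "name": "Globex", "country": "US"},
]

# Hypothetical source 2: orders whose amounts arrive as text, some invalid.
orders = [
    {"customer_id": 1, "amount": "12.50"},
    {"customer_id": 2, "amount": "n/a"},
    {"customer_id": 2, "amount": "7.00"},
]

# Reference data used for enrichment (assumed, not from the article).
COUNTRY_NAMES = {"NL": "Netherlands", "US": "United States"}


def integrate(customers: list, orders: list) -> list:
    """Build one unified view from both sources."""
    unified: dict = {}
    for customer in customers:
        code = customer["country"].strip().upper()
        unified[customer["id"]] = {
            "id": customer["id"],
            # Cleansing: trim whitespace and normalize capitalization.
            "name": customer["name"].strip().title(),
            # Enriching: replace the country code with a full name.
            "country": COUNTRY_NAMES.get(code, code),
            "total": 0.0,
        }

    for order in orders:
        try:
            # Transforming: convert the text amount into a number.
            amount = float(order["amount"])
        except ValueError:
            # Cleansing: skip amounts that cannot be parsed.
            continue
        # Joining: match each order to its customer by key.
        target = unified.get(order["customer_id"])
        if target is not None:
            target["total"] += amount

    return list(unified.values())


unified_view = integrate(customers, orders)
```

Here the result is computed once and kept in `unified_view`, as it would be when loaded into a target database. Running the same logic at query time instead would be the on-demand style used by data federation.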
Data virtualization might not need data integration. It depends on the number of data sources being accessed. Data federation always requires data integration. Conversely, data federation is just one style of data integration.