Data virtualisation: the key to better machine learning?

Alberto Pan, Chief Technical Officer at Denodo, looks at how data virtualisation is enabling organisations to make the most of their data.

There’s no getting away from it: in today’s digital economy, data has become the lifeblood of every organisation, regardless of size or sector. Nowadays, almost every action, reaction and interaction produces data which, if harnessed correctly, can improve business processes and increase employee productivity.

What to do with all that data?

In short, the reams of data being produced each day within an organisation can be extremely valuable. As a result, data lakes have become a principal data management architecture for data scientists.

By storing all raw data – whether structured or unstructured – in one physical repository, data lakes make discovery somewhat easier for organisations. They save businesses time and money whilst providing massive computing power so that useful data can be efficiently transformed and combined to meet the needs of any process.

Above all, data lakes enable businesses to generate a range of insights which can help them to make informed decisions. By using machine learning to analyse the historical data stored within them, businesses are even able to forecast likely outcomes and work out how to achieve the best results in terms of employee productivity, business processes and so on.

So, what’s the problem?

Despite all the benefits that data lakes have to offer, most businesses still struggle with data discovery and integration when it comes to making the most of their data with machine learning. In fact, research suggests that data scientists spend as much as 80% of their time on these tasks.

It’s simple: storing data in its original form does not remove the need to adapt it later for the machine learning process. And having all your data in the same physical place doesn’t necessarily make discovery easy. Instead, it’s the modern-day, digital equivalent of trying to find a needle in a haystack.

In fact, many organisations have hundreds of data repositories distributed across on-premises platforms, data centres and cloud providers, so it’s more like trying to find a needle in several haystacks. It’s easy to see why organisations wishing to access the benefits highlighted above have a challenge on their hands.

Over the past few years, some data preparation tools have emerged to make simple integration tasks more accessible to data scientists, but more complex tasks still need an advanced skillset. That’s where data virtualisation comes in.

How can data virtualisation help?

Data virtualisation provides a single access point to any data – regardless of where it is located and no matter what format it is in. It can seamlessly stitch together data abstracted from various underlying sources and deliver it in real time. This liberates users and organisations alike by providing a fast, cost-effective way to access different data.

Most importantly, data virtualisation can help organisations to:

  • Discover data: Having a data virtualisation layer makes more data accessible. Unlike older systems, these new technologies remove the need for data to be replicated, so adding new content is not only faster but also cheaper. But that’s not all: data virtualisation platforms are also user-friendly. They offer a searchable, browsable catalogue of all available data sets. The data catalogue includes extensive metadata about each data set so that organisations can find and access what they need at any time.
  • Integrate data: Data virtualisation tools organise data according to a consistent data representation and query model. This means that no matter where the data was originally stored, organisations can view all their data as if it were stored in a relational database or as if it were all in one place. By using data virtualisation, organisations can also create reusable logical data sets which can be adapted to meet the needs of each individual machine learning process. They make life easier for data scientists and IT departments by taking care of some of the more complex issues such as transformations and performance optimisation.
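
To make the two points above more concrete, here is a minimal Python sketch of the idea, not a representation of how any particular product works. It assumes two made-up sources (an in-memory SQLite table of customers and a CSV extract of orders), and the names VirtualDataset, VirtualLayer, search and join are all hypothetical. The layer keeps only metadata in its catalogue and reads each source when a query runs, rather than replicating the data beforehand.

```python
import csv
import io
import sqlite3
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Row = Dict[str, str]


@dataclass
class VirtualDataset:
    """Catalogue entry: metadata plus a function that reads the source on demand."""
    name: str
    description: str
    columns: List[str]
    fetch: Callable[[], Iterable[Row]]


class VirtualLayer:
    """Hypothetical single access point; sources are only read when queried."""

    def __init__(self) -> None:
        self._catalogue: Dict[str, VirtualDataset] = {}

    def register(self, dataset: VirtualDataset) -> None:
        self._catalogue[dataset.name] = dataset

    def search(self, keyword: str) -> List[str]:
        """Discover data: match a keyword against names, descriptions and columns."""
        kw = keyword.lower()
        return sorted(
            d.name
            for d in self._catalogue.values()
            if kw in d.name.lower()
            or kw in d.description.lower()
            or any(kw in c.lower() for c in d.columns)
        )

    def join(self, left: str, right: str, key: str) -> List[Row]:
        """Integrate data: combine two sources at query time on a shared key."""
        right_index = {r[key]: r for r in self._catalogue[right].fetch()}
        return [
            {**row, **right_index[row[key]]}
            for row in self._catalogue[left].fetch()
            if row[key] in right_index
        ]


# Source 1 (assumed for illustration): a relational table held in SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)", [("c1", "EMEA"), ("c2", "APAC")]
)


def fetch_customers() -> List[Row]:
    cur = conn.execute("SELECT customer_id, region FROM customers")
    names = [col[0] for col in cur.description]
    return [dict(zip(names, row)) for row in cur]


# Source 2 (assumed for illustration): a CSV extract, as might sit in a data lake.
ORDERS_CSV = "customer_id,amount\nc1,120\nc2,75\nc1,30\n"


def fetch_orders() -> List[Row]:
    return list(csv.DictReader(io.StringIO(ORDERS_CSV)))


layer = VirtualLayer()
layer.register(VirtualDataset(
    "customers", "Customer master data", ["customer_id", "region"], fetch_customers
))
layer.register(VirtualDataset(
    "orders", "Raw order lines", ["customer_id", "amount"], fetch_orders
))

print(layer.search("region"))
for row in layer.join("orders", "customers", "customer_id"):
    print(row)
```

A data scientist working against a layer like this would call search to find candidate data sets and join to obtain a combined view, without needing to know that one source is a database and the other a file. A real platform would add much more, such as query optimisation and security, which this sketch deliberately leaves out.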

As machine learning and data lakes continue to proliferate and support modern analytics, data virtualisation is enabling data scientists to seamlessly expose the results of machine learning analysis so that informed business decisions can be made. By simplifying the discovery and integration processes, data virtualisation is opening the door to a whole new world of possibilities for all organisations wishing to drive real business value from their data.