
Developing the Right Data Strategy for Your Organization

- By Randy Bean

When it comes to making actionable use of data, there is no single playbook or set of common practices that apply universally to all businesses, CIO Journal Columnist Randy Bean says. “Organizations would be well served to break from accepted dogma and apply fresh thinking as they consider how best to align their resources, capabilities, and people to make wise use of their data,” he writes.

FastWorks Friday: Getting smarter about testable hypotheses and experiments

Do you struggle with coming up with good Leap-of-Faith Assumptions (LOFAs) when applying FastWorks? That is, do you have a hard time articulating a testable hypothesis (a statement proposing some relationship between two or more variables that can be tested)?

The best article I’ve seen on the topic helped me a great deal some time back. It’s from the go-to thinker on applying analytics in business contexts, Prof. Tom Davenport. In this classic 2009 Harvard Business Review article, “How to Design Smart Business Experiments”, Prof. Davenport tackles one of the most difficult things to wrap one’s head around: what makes a smart business experiment. Here’s a key quotation from the article: “Formalized testing can provide a level of understanding about what really works that puts more intuitive approaches to shame.” For me, the article clarified what could be testable, how to test it quickly, and how to learn from the results.

The attached image adds an important related point: how to make sure the experiments you run are shared, so that they encourage a culture of testing hypotheses and learning from them.


The article is incredibly practical, offering good rules of thumb, especially for those of us looking to apply FastWorks as managers. Of course, FastWorks is a lot more than hypotheses and experiments. But the first step is learning to move quickly: test the assumptions that would most hurt success if they turned out to be wrong and that are quickest to test.
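To make this concrete, here is a minimal sketch (my own illustration, not something from Davenport’s article) of reading out a small experiment behind a testable hypothesis such as “offering free shipping increases checkout conversion.” The scenario, group sizes, and conversion counts are hypothetical, and the two-proportion z-test is just one simple way to judge the result.

```python
# Illustrative two-proportion z-test for a business experiment.
# Hypothesis: "Offering free shipping (test group) increases order
# conversion versus the current checkout (control group)."
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, one-sided p-value) for H0: test rate <= control rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return z, 1 - NormalDist().cdf(z)

# Hypothetical pilot: 2,000 customers per group.
z, p = two_proportion_z_test(conv_a=180, n_a=2000, conv_b=225, n_b=2000)
print(f"z = {z:.2f}, p = {p:.3f}")  # a small p-value is evidence the change helped
```

The point is less the statistics than the discipline: state the assumption, decide up front what result would confirm or refute it, run the smallest experiment that can produce that result, and share what you learned.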

What have you done to get to testable hypotheses more quickly?

Clearly Defining Data Virtualization, Data Federation, and Data Integration

More and more often, the terms data virtualization, data federation, and data integration are used. Unfortunately, these terms have never been defined properly. Let’s see if, together, we can come up with generally accepted definitions.

Data Virtualization

Data virtualization is the process of offering data consumers a data access interface that hides the technical aspects of stored data, such as location, storage structure, API, access language, and storage technology.

Data virtualization provides an abstraction layer that data consumers can use to access data in a consistent manner. A data consumer can be any application retrieving or manipulating data, such as a reporting or data entry application. This abstraction layer hides all the technical aspects of data storage. The applications don’t have to know where all the data has been stored physically, where the database servers run, what the source API and database language is, and so on.

Technically, data virtualization can be implemented in many different ways. Here are a few examples:

  • With a federation server, multiple data stores can be made to look like one.
  • An enterprise service bus (ESB) can be used to develop a layer of services that allow access to data.
  • Placing data stores in the cloud is also a form of data virtualization.
  • In a way, building up an in-memory virtual database from data loaded out of physical databases can also be regarded as data virtualization.
  • Organizations could also develop their own software-based abstraction layer that hides where and how the data is stored.
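As a minimal sketch of that last example, the code below shows a home-grown abstraction layer that gives data consumers a single read interface while hiding where and how each dataset is stored. The class names and the two backing stores (an in-memory list and a SQLite table) are purely illustrative assumptions.

```python
# Minimal data-virtualization sketch: consumers call catalog.read("orders")
# without knowing whether the data lives in memory, SQLite, or elsewhere.
import sqlite3
from typing import Protocol

class DataSource(Protocol):
    def read(self) -> list[dict]: ...

class InMemorySource:
    def __init__(self, rows: list[dict]):
        self._rows = rows
    def read(self) -> list[dict]:
        return list(self._rows)

class SqliteSource:
    def __init__(self, conn: sqlite3.Connection, query: str):
        self._conn, self._query = conn, query
    def read(self) -> list[dict]:
        cur = self._conn.execute(self._query)
        cols = [c[0] for c in cur.description]
        return [dict(zip(cols, row)) for row in cur.fetchall()]

class VirtualCatalog:
    """The abstraction layer: maps logical dataset names to physical sources."""
    def __init__(self):
        self._sources: dict[str, DataSource] = {}
    def register(self, name: str, source: DataSource) -> None:
        self._sources[name] = source
    def read(self, name: str) -> list[dict]:
        return self._sources[name].read()  # the consumer never sees the backend

# Usage: the consumer only knows logical dataset names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.execute("INSERT INTO orders VALUES (1, 99.5), (2, 12.0)")
catalog = VirtualCatalog()
catalog.register("customers", InMemorySource([{"id": 1, "name": "Acme"}]))
catalog.register("orders", SqliteSource(conn, "SELECT * FROM orders"))
print(catalog.read("orders"))
```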

Data Federation

Data federation is a form of data virtualization where the data stored in a heterogeneous set of autonomous data stores is made accessible to data consumers as one integrated data store by using on-demand data integration.

This definition is based on the following concepts:

  • Data virtualization: Data federation is a form of data virtualization. Note that not all forms of data virtualization imply data federation.
  • Heterogeneous set of data stores: Data federation should make it possible to bring data together from data stores using different storage structures, different access languages, and different APIs.
  • Autonomous data stores: Data stores accessed by data federation are able to operate independently; in other words, they can be used outside the scope of data federation.
  • One integrated data store: Regardless of how and where data is stored, it should be presented as one integrated data set. This implies that data federation involves transformation, cleansing, and possibly even enrichment of data.
  • On-demand integration: This refers to when the data from the heterogeneous set of data stores is integrated: at the moment a data consumer requests it, rather than in advance in a batch process (the sketch after this list illustrates the idea).
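Here is a minimal sketch of on-demand integration under assumed sources: two autonomous, heterogeneous stores (a SQLite customer table and a list of semi-structured web-session records) are joined and lightly cleansed only at the moment a consumer asks for the combined view. The function and field names are hypothetical.

```python
# On-demand federation sketch: two autonomous, heterogeneous stores are
# integrated only when the consumer requests the combined view.
import sqlite3

# Store 1: a relational customer table (autonomous; usable on its own).
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customer (id INTEGER, name TEXT, country TEXT)")
crm.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                [(1, "acme corp", "US"), (2, "globex", "DE")])

# Store 2: semi-structured web-session records (different format and access style).
sessions = [
    {"customer_id": 1, "pages": 12},
    {"customer_id": 1, "pages": 3},
    {"customer_id": 2, "pages": 7},
]

def federated_customer_activity() -> list[dict]:
    """Build one integrated view on demand; nothing is pre-materialized."""
    rows = crm.execute("SELECT id, name, country FROM customer").fetchall()
    totals: dict[int, int] = {}
    for s in sessions:
        totals[s["customer_id"]] = totals.get(s["customer_id"], 0) + s["pages"]
    return [
        {"customer": name.title(),        # light cleansing/transformation
         "country": country,
         "total_pages": totals.get(cid, 0)}
        for cid, name, country in rows
    ]

print(federated_customer_activity())
```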

Data Integration

Data integration is the process of combining data from a heterogeneous set of data stores to create one unified view of all that data.

Data integration involves joining data, transforming data values, enriching data, and cleansing data values. This is the approach taken when using ETL tools.

Data virtualization might not need data integration; whether it does depends on the number of data sources being accessed. Data federation, by contrast, always requires data integration. Seen from the other direction, data federation is just one style of integrating data.
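To contrast with the on-demand federation sketch above, here is a minimal batch-style (ETL-like) integration sketch: data is extracted from two assumed sources, transformed and cleansed, and loaded into a unified target table ahead of time rather than at query time. The table and column names are hypothetical.

```python
# Batch data-integration (ETL) sketch: extract, transform, load ahead of time.
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE unified_customer (id INTEGER, name TEXT, revenue REAL)")

# Extract: pull from two assumed source systems.
billing_rows = [(1, "ACME CORP ", 1200.0), (2, "globex", 450.5)]
crm_names = {1: "Acme Corp", 2: "Globex GmbH"}

# Transform: cleanse names, prefer the CRM spelling, carry revenue along.
def transform(billing, crm):
    for cust_id, raw_name, revenue in billing:
        name = crm.get(cust_id, raw_name).strip().title()
        yield cust_id, name, round(revenue, 2)

# Load: materialize the unified view in the target store.
warehouse.executemany(
    "INSERT INTO unified_customer VALUES (?, ?, ?)",
    transform(billing_rows, crm_names))

print(warehouse.execute("SELECT * FROM unified_customer").fetchall())
```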

Pivotal Open Sources key parts of its Big Data Suite

Pivotal announced today that it was moving three core pieces of its Big Data Suite to open source, while continuing to offer advanced features and support in a commercial version.

The three components moving to open source are GemFire, the platform’s in-memory NoSQL database; HAWQ, its enterprise SQL-on-Hadoop engine; and Greenplum DB, the suite’s massively parallel processing (MPP) analytics database.

Read more at the link below:
http://www.cio.com/article/2884323/big-data/big-data-suite-goes-open-source.html

Decoding Hadoop – What is so Big about Big Data?

Every day, around 2.5 quintillion (2.5 × 10^18) bytes of data are created, and most of the world’s data has been created in just the past two years. Most of the data available today is unstructured, yet it can yield useful real-time information.

Thus big data can be defined as any data exhibiting the three Vs: Volume, Velocity, and Variety.

Volume – Any data of considerable size, say in the multi-terabyte to multi-petabyte range, needs processing that differs from conventional algorithms in order to preserve speed and accuracy. Volume is therefore the first dimension a data set must have to qualify as big data.

Velocity – Velocity is the second dimension of big data. Conventional algorithms can process large volumes of data by trading off time, which can range from a few hours to overnight processing. But for use cases such as national security or real-time proactive action, overnight is not an option. When a huge amount of data must be processed in a short span of time, the data can be called big data.

Variety – The third dimension of big data is that the data mixes various types and formats: logs, video, audio, pictures, financial transactions, text messages, and so on. Traditional databases mandate that data be represented in rows and columns, but in real life information is generated in all kinds of artifacts.

When volume, velocity, and variety are combined, the data needs to be processed differently from conventional methods. This is where big data processing comes into the picture (a toy sketch of the idea follows).
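As a toy illustration of that different processing model (this mimics the divide-and-combine idea behind Hadoop-style processing, not Hadoop’s actual API), the sketch below splits a data set into chunks, maps over the chunks in parallel, and reduces the partial results:

```python
# Illustrative map/reduce-style processing: split a large input into chunks,
# compute partial results in parallel, then combine them.
from collections import Counter
from multiprocessing import Pool

def map_chunk(lines: list[str]) -> Counter:
    """Map step: count words in one chunk of the data."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def chunked(seq, size):
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

if __name__ == "__main__":
    log_lines = ["error disk full", "ok", "error timeout", "ok ok"] * 1000
    with Pool() as pool:
        partials = pool.map(map_chunk, list(chunked(log_lines, 500)))
    total = sum(partials, Counter())   # Reduce step: merge partial counts
    print(total.most_common(3))
```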

There are two other Vs recommended by SAS for big data:

Variability – The volume, velocity, and variety can vary depending on the time of day or on a particular event. During festivals, for example, sales data, retail transaction data, and browsing data can all peak.

Complexity – Data comes from many sources in today’s world. In some situations, data from different sources and in different formats must be matched up to reach a logical conclusion.

Five Hot Trends Impacting Your Decision-Making Environment and What You Need to Worry about

All, in today’s TDWI webinar, Claudia Imhoff gave a talk on the five hot BI trends impacting data management: big data analytics, advanced analytics, self-service BI, mobile BI, and cloud-based BI solutions.

A few take-aways about big data analytics:

Definition of this trend: Data sets with sizes beyond the ability of commonly used software tools to capture, integrate, manage, and process within a reasonable amount of time.

Impact: Hence the rise of Hadoop, data warehouse appliances with solid-state drives or in-memory technologies, and data virtualization.

One-size-fits-all data management is no longer viable due to the changing workloads being placed on the DWH architecture. Therefore we will need to:

  • Extend the DW environment to support new workloads, for example by adding Hadoop into the architecture and integrating it with the existing data warehouse
  • Modify data modeling and integration approaches to include data virtualization, data blending, and *data refineries
  • Modify data governance approaches, applying different levels of governance to security, compliance, quality, and retention needs

What are *data refineries? Data lakes built on Hadoop technology may need to be used as sandbox and experimental areas for data refineries.

Combining consumer data to solve big problems

This Forbes article includes some great examples of business models evolving to make money and improve our lives through mining and sharing data. The businesses are not making money immediately, but they are investing in data capture, analytics, and partnerships to start capitalizing on their data for long-term revenue.

Key points in the article:
- Your data combined with those of thousands of other people can tackle bigger problems such as cutting your company’s health care budget or sparing the nearby utility from building another power plant.
- Smart-thermostat maker Nest Labs has quietly built a side business managing the energy consumption of a slice of its customers on behalf of electric companies.
- In wearables, health tracker Fitbit is selling companies its tracking bracelets and analytics services to better manage their health care budgets, and its rival Jawbone may be preparing to do the same.
- These companies are capitalizing on the terabytes of data they collect from consumers and, to an extent, on the largesse of taxpayers. According to the U.S. Energy Information Administration, state governments have increased the money allocated to helping utilities manage energy demand from $1.3 billion in 2003 to $6 billion in 2012.

See full article - http://www.forbes.com/sites/parmyolson/2014/04/17/the-quantified-other-nest-and-fitbit-chase-a-lucrative-side-business/

How can we leverage our data either directly or with business partners/customers to generate revenue and solve big problems?

Start-ups are now offering BI solutions to our middle market customers

We know we have data that we can leverage as an asset, and we can derive great insights and predictions for our customers. This article describes what start-ups are doing to build a winning BI business model in the middle market. I like the intercept/middleware model mentioned near the end of the article. CDF is looking at potentially playing in that space or partnering with other companies that offer services to dealers and capture additional data from them.

http://www.nytimes.com/2014/07/10/business/smallbusiness/finding-affordable-ways-to-use-big-data.html?_r=1

Great analysis of the 2014 Gartner BI Quadrant – winners and losers

The Gartner report was released a few months ago. I was looking for an objective review of their quadrant and who was gaining ground and who was losing ground. I found this excellent summary with insightful commentary. Link

Gainers: Tableau, Qlik and Spotfire. Losers: Microsoft, MicroStrategy, SAP and Oracle.

We don’t need more tools right now but we do need to keep an eye on the future for when we might want to retire a tool or move more solutions to a tool we already have.

Hope you find the analysis informative.

The Industries Plagued by the Most Uncertainty

It’s a cliché to say that the world is more uncertain than ever before, but few realize just how much uncertainty has increased over the past 50 years. To illustrate this, consider that patent applications in the U.S. have increased by 6x (from 100k to 600k annually) and, worldwide, start-ups have increased from 10 million to almost 100 million per year.  That means new technologies and new competitors are hitting the market at an unprecedented rate.  Although uncertainty is accelerating, it isn’t affecting all industries the same way. That’s because there are two primary types of uncertainty — demand uncertainty (will customers buy your product?) and technological uncertainty (can we make a desirable solution?) — and how much uncertainty your industry faces depends on the interaction of the two.

Demand uncertainty arises from the unknowns associated with solving any problem, such as hidden customer preferences. The more unknowns there are about customer preferences, the greater the demand uncertainty. For example, when Rent the Runway founder Jenn Hyman came up with the idea to rent designer dresses over the internet, demand uncertainty was high because no one else was offering this type of service.  In contrast, when Samsung and Sony were deciding whether to launch LED TVs, which offered better picture quality than plasma TVs at a slightly higher price, there was lower uncertainty about demand because customers were already buying TVs.

Technological uncertainty results from unknowns regarding the technologies that might emerge or be combined to create a new solution. For example, a wide variety of clean technologies (including wind, solar, and hydrogen) are vying to power vehicles and cities at the same time that a wide variety of medical technologies (chemical, biotechnological, genomic, and robotic) are being developed to treat diseases. As the overall rate of invention across industries increases, so does technological uncertainty.

Consider the 2×2 matrix below. The horizontal axis plots each industry based on technological uncertainty, measured as the average R&D expenditures as a percentage of sales in the industry over the past ten years. The vertical axis plots each industry’s demand uncertainty, measured as an equal weighting of industry revenue volatility, or change, over the past 10 years and the percentage of firms in the industry that entered or exited during that same period. Although these are imperfect measures, they identify the industries facing the highest and lowest baseline levels of uncertainty.