Posted by: Ravi Padaki | March 27, 2013

Big Data Open Source Technology Landscape

Here is a handy guide for anyone interested in the purely open source technologies available to process big data. This is a draft and may not cover every offering from Apache. The intent is to highlight the major technologies and their intended use, and to clear up some of the confusion around common big data buzzwords such as Hadoop, Hive and HBase. I would welcome any suggestions to enhance this landscape.

Big data open source technology stack

This visual guide is based on a somewhat hierarchical stack. From bottom to top, the stack starts with hardware and storage options. Before data is added to the stack, the file systems must be aware of where the data resides. Data warehouses are built on file systems. Databases can also reside directly on file systems without depending on a data warehouse, but they are shown above data warehouses because they can be seen as a logical subset of the data stored there. All the stored data would be of no use if it could not be retrieved, so query languages and enterprise search engines form the next layer. Two parallel stacks apply to every layer. The administration and management stack reflects the need to administer and manage all the services continuously for optimal performance. The other parallel stack shows data integration services with third-party tools and data sources.

More on when to use what, with real-life commercial applications, in future blogs! If you are building an application from any combination of these stacks, drop me an email! I would be interested in covering your use case.

Here is a list of references from the diagram for more information on each of the technologies:

  1. Open Compute Project
  2. Apache Hadoop
  3. Apache Hive
  4. Apache HBase
  5. CouchDB
  6. MySQL
  7. Apache Cassandra
  8. MongoDB
  9. Apache Pig
  10. Apache Solr
  11. Apache Flume
  12. Apache Sqoop
  13. Apache Ambari
  14. Apache Oozie

Posted by: Ravi Padaki | March 21, 2013

Workshop on Big Data for Business Managers

Big data technology is revolutionizing the way businesses look at their data! How many businesses are poised to take advantage of this is another question. My company, Pravi Solutions, has designed a one-day workshop to empower business users to launch big data initiatives in their organizations! The workshop is on April 5, 2013 in Bangalore. Here are 5 reasons why you should attend, especially if you are in retail, telecommunications, pharmaceuticals, life sciences, digital marketing, ecommerce, utilities, energy or any other sector that has a lot of data but is not able to use it!

Why should you attend?

  • You are curious about how big data can help drive profitability and efficiency but don’t know whom to ask
  • You want to know more about the possibilities of Big Data for your business
  • You know what big data is but are not sure how to make technology choices
  • You know about big data and have an idea about the technology but don’t know how to get started
  • You have launched a big data project but need to be aware of other commercial implementations and successes

Your competition is analyzing big data. Don’t get left behind! Have a Big Data strategy today!

Register for the workshop now by visiting http://www.pravisolutions.com/workshops

Posted by: Ravi Padaki | March 5, 2013

What Big Data means for business?

This blog is for businesses across sectors including retail, telco, pharma, digital marketing, airlines and utilities that want to know more about Big Data and the multitude of opportunities it presents. It starts with how to identify whether a business has a big data problem, then explains the key drivers of Big Data, and finally describes 3 ways Big Data can benefit businesses.

First, do you have a Big Data problem/opportunity?

Aberdeen Group recommends treating anything more than 5 TB of data as Big Data. That’s a good benchmark to quickly assess whether a business has a big data problem. There are two other dimensions, as comprehensively defined by Gartner: a) whether the business generates data in real time, leading to a torrent of data every second, and b) whether the data has variety, i.e. not just machine-generated or transactional data in databases but also videos, emails, chats, audio files etc. that are not easily queryable. Some experts argue that volume and velocity have long been familiar problems, and that the ability to process variety is what is driving the big data opportunity. We will look at the drivers of big data adoption later in this blog.

Consider data across all customer touch points

A common problem in organizations is that data about the user is spread across multiple data sources. For example, the user’s online behavior is captured in web logs, the purchase history in financial data marts and the inventory levels in ERPs. So, when counting the size of your data, add up the data collected across all customer touch points. These include customer support calls, emails and chats; users’ responses to online or mobile marketing campaigns; social media mentions; and so on.

Take the short test – do you have a big data problem?

  • Volume – Is your data more than 5 TB/month? Yes/No
  • Velocity – Is data being generated at a rapid pace in real time? Yes/No
  • Variety – Is data manifested in unstructured formats such as emails, chats etc.? Yes/No

If you have answered yes to any one or more of them, then you have a Big Data problem! Congratulations! You are sitting on a gold mine!
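
For readers who prefer to see the short test written down, here is a minimal Python sketch of it. The function and parameter names are hypothetical, and the 5 TB/month threshold is simply the figure from the test above; treat it as an illustration of the yes/no logic rather than a formal assessment tool.

```python
def big_data_dimensions(monthly_volume_tb, generated_in_real_time,
                        has_unstructured_data, volume_threshold_tb=5.0):
    """Return the dimensions of the short test that are answered 'yes'.

    All names here are illustrative assumptions; the default threshold
    is the 5 TB/month figure used in the test.
    """
    dimensions = []
    if monthly_volume_tb > volume_threshold_tb:
        dimensions.append("volume")
    if generated_in_real_time:
        dimensions.append("velocity")
    if has_unstructured_data:
        dimensions.append("variety")
    return dimensions


def has_big_data_problem(monthly_volume_tb, generated_in_real_time,
                         has_unstructured_data):
    """One 'yes' or more is enough to qualify."""
    return len(big_data_dimensions(monthly_volume_tb, generated_in_real_time,
                                   has_unstructured_data)) > 0


# Example: modest volume, no real-time feed, but lots of emails and chats.
print(big_data_dimensions(2.0, False, True))
```

In this example only the variety question is answered with a yes, which is already enough to qualify under the test.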

Key Drivers of Big Data

According to the joint study by IDC and EMC, the size of the digital universe in 2012 was 2,837 exabytes and is expected to grow to 40,000+ exabytes by the year 2020. That’s not far off: it’s only 7 more years from now! (An exabyte is a billion gigabytes.)
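
To put those two figures in perspective, the yearly growth rate they imply can be worked out with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes smooth compound growth over the 8 years from 2012 to 2020, and the function name is hypothetical.

```python
def implied_annual_growth(start_size, end_size, years):
    """Compound annual growth rate that takes start_size to end_size in `years` years."""
    return (end_size / start_size) ** (1 / years) - 1


# Figures quoted above: 2,837 exabytes in 2012 and 40,000 exabytes in 2020.
rate = implied_annual_growth(2837, 40000, 2020 - 2012)
print(f"Implied growth: {rate:.0%} per year")
```

Under that assumption the digital universe grows by roughly 39% every year, which means it doubles in size about every two years.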

The key business driver is the fact that 90% of the data in the digital universe was created in the last 2 years alone and that, according to Aberdeen, it is still growing at 40% per year. We are all walking data generators, thanks to mobile devices and the internet of things! The promise of Big Data, however, is not about the growing torrent of data but about the ability to answer questions businesses have long been asking. These questions have always been with the business, but it had to be content with limited accuracy: for various technological and cost reasons, only samples of large data sets were considered in statistical processing. Big data now makes it not only feasible but also cheaper to process the full data sets, so businesses start to benefit from more accurate predictions and recommendations.

The key drivers of Big data, in my opinion, are mainly technological and they are:
1. Hadoop and its ecosystem – Hadoop is a data processing platform, not a database, and it comes with a suite of scheduler, workflow and query tools. It is fundamentally a distributed file system that can be set up over commodity hardware and that brings the compute to where the data is stored instead of moving the data to where the compute is. The advent of Hadoop has catalyzed the processing of large volumes of data. Not all applications are suited to Hadoop, only those where the processing can be contained to the location of the data. There are also several other open source frameworks that extend Hadoop. More on the technological landscape of Hadoop in my next blog.
2. Cheaper storage – The cost of a gigabyte of storage has fallen from $1M in 1980 to $0.10 in 2012. And it’s still headed downwards!
3. Cloud computing – Elastic compute power and Infrastructure as a Service have taken away the complexity of businesses having to set up their own IT infrastructure and data centers. Large businesses have their own private clouds.
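
The idea of running the compute where the data is stored can be illustrated with a toy example. The sketch below is plain Python and does not use any Hadoop API; the node names and log records are made up. Each "node" summarizes its own partition of a web log locally, and only the small summaries are combined at the end.

```python
from collections import Counter

# Hypothetical cluster: each "node" stores its own partition of a web log.
partitions = {
    "node-1": ["home", "cart", "home"],
    "node-2": ["cart", "checkout"],
    "node-3": ["home", "checkout", "checkout"],
}


def map_local(records):
    """Runs on the node that stores the records and returns a small summary."""
    return Counter(records)


def reduce_counts(partials):
    """Combines the per-node summaries into one overall result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Only the small per-node counts are moved around, never the raw records.
partials = [map_local(records) for records in partitions.values()]
page_views = reduce_counts(partials)
print(dict(page_views))
```

This also hints at why not every application fits: the approach works when each node can do useful work on its own slice of the data without needing the other slices.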

Big data: what is now possible for businesses?

Imagine a customer walking into a Saks 5th Ave store in New York and receiving personalized recommendations on their mobile device about what might best suit them, based on several factors including the season of the year, size, likes and dislikes, word-of-mouth recommendations, past purchase history and, more importantly, what is available at that moment in the store, down to the color, size, brand and price (including promotions)! It’s almost as if the shop owner has known you for 20 years!

Here are 3 things that Big data can do right now for business:
1. Improve the top line through unprecedented personalization at scale.
2. Improve the bottom line through unprecedented efficiency at scale.
3. Improve governance through unprecedented monitoring at scale.

Let’s look at each of them with examples. It is only a matter of time before big data processing becomes mainstream, with enormous implications for businesses and governments!

Unprecedented personalization at scale!

Businesses have long asked the question: what will make my customer buy? In the old days of relationship-based, single-chain selling, shop owners knew their customers personally and could recommend products to them. With Big Data, this is possible again. The case of Target recommending pregnancy-related products to a family before they knew their family member was pregnant is a well-known example of personalized, targeted promotion. Call centres are investing in technologies that derive insights from text mining the emails and chats between support centres and customers in order to offer personalized advice, thereby cutting the time to resolve issues and delighting customers. Personalization will transform the following areas:
a. Acquiring new customers
b. Retaining and upselling to existing customers
c. Customer support services

Unprecedented efficiency at scale

Harvard Business Review reported on an airport that reduced wait times for airplanes about to land, saving the airlines millions of dollars and helping them run their flights efficiently. It did so by looking at several past parameters, such as weather and air traffic, and asking what was common whenever there was a delay. Here is another example, from the retail sector, that could use Big Data. Suppliers currently have no real-time visibility into SKU-level inventory turnover. As a result, they have to wait for reports at the end of the day or week to know how products are moving on the shelf. For example, they cannot tell right now how long a SKU has been out of stock. If suppliers could see movement at the SKU level, they could quickly decide to restock. Today that visibility is missing because the data is too big and not accessible. Likewise, efficiency can be examined across supply chain operations including logistics, manufacturing, merchandising, shipping, stocking etc.

Unprecedented monitoring at scale

CCTV cameras record stream after stream of data in public places, but there is not enough manpower to watch all the recordings. CCTV is sometimes used simply as a deterrent. Being able to process the unstructured data in video streams and run pattern matching algorithms to identify interesting cases could have huge implications for the security of citizens all over the world. The same applies to enforcing traffic rules, where monitoring by limited staff means limited success in controlling accidents on an overcrowded planet. Unprecedented monitoring also applies to anti-virus and fraud detection, which need to look at historical data for non-obvious patterns.

These are the 3 main questions businesses can start to ask. Of course, the era of Big data is just beginning and the possibilities are just emerging. We will see a lot more applications in the coming years.

Summary

Businesses should be really excited to be in this technological phase of big data! The business goals have not changed with the advent of Big data. Only the ability to answer those questions with depth has changed. Change is probably an understatement. Soon, businesses will be transformed with personalization, efficiency and governance at scale! What do you think? Please share your thoughts.

Posted by: Ravi Padaki | August 27, 2012

3 point framework to make data actionable

In my earlier blog, “3 easy ways to get data right”, I talked about the importance of planning for data early in the process. In this blog, I offer details of data planning for data practitioners, product managers, business analysts and frontline users of data.

Someone aptly said data is as useful as the decisions it enables. More often than not, data can be intimidating, perplexing, scarce, abundant, complex; everything at the same time. If you can relate to one or more of these points below, then read along:

  1. Too much data and yet no insights
  2. Ok, great insight. So, what?
  3. I don’t know what I am looking for, so give me all the data you have

These are symptoms of a lack of clarity about what the data should tell its consumer. Creators and consumers of data therefore need a framework that simplifies the process and aligns with the business. Let me mention at the outset that there are (at least) two types of data consumption; the main theme of this blog is the second. The first is data mining, which is highly exploratory in nature and is usually about building or validating a hypothesis. Its turnaround times are unpredictable, as the nature of the problem itself is unbounded. This is of course a fascinating domain of analytics, but it does not enable quick business decisions.

The second type of consumption is one where data is contextual to the business process. The opportunity (and the challenge) is to provide data ready for consumption by business users who need to make quick decisions. So, the question is how do we understand, create and design data that enables quick decisions?

Here is the 3 point “business data framework”:

  1. First, identify the business goals
  2. Second, for each business goal, list the business tasks performed
  3. Third, for each business task, ask the question – what data do I need to accomplish this business task in order to achieve the business goal?

Sounds simple, doesn’t it? It is, as a framework. Let’s put it into practice with an example. Suppose you are running an ad network. Which business goals do you consider to be business drivers? Suppose one of them is to increase spend from advertisers. Why is this a business goal? Because ad spend is a direct contributor to the profit margins of your network and thus to the success of your company. There is also no intermediate player between ad spend and the profit margins: the more the ad spend, the higher the network profits. How many such direct contributors to the business can you list as goals?

Let’s take ad spend as a business goal and look at step 2 in the framework: listing the business tasks needed to achieve the business goal. Business tasks are the actions taken in a typical day in the life of a business user on the frontline of trading. To achieve the goal of increasing ad spend, a business user might take several actions depending on the context and history of the client, such as recommending the right type of supply. Each task can have subtasks if it needs to be deconstructed further. Let’s take one of the tasks intended to increase ad spend: recommend a good supply partner for the buyer to link to!

Here is the business-to-data mapping. Let’s begin by asking the question: what data do we need to enable the business user to recommend supply to increase ad spend? Let’s presume the client has provided certain input criteria such as size of reach (impressions), target audience or channel type, country/region, offer type, initial bid price and creative specifications. The minimum data functionality required is to allow the user to discover, i.e. search and browse, a set of potential partners based on their relevance to the criteria and ultimately recommend a partner for the advertising client. This can increase in sophistication if we want to automatically suggest a partner at various phases of the campaign depending on its performance. In addition to this data set, the user may also be interested in the past performance of the candidate partners with respect to other similar types of demand!
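
One way to capture the outcome of this exercise is a simple goal → task → data structure. The Python sketch below uses the ad network example; the dictionary layout and the function name are illustrative assumptions, not a standard schema, and a spreadsheet would serve just as well.

```python
# Goal -> tasks -> data needed, filled in from the ad network example.
# The structure is a hypothetical illustration, not a standard schema.
business_data_map = {
    "Increase spend from advertisers": {
        "Recommend a supply partner": [
            "client input criteria (reach, audience, region, offer type, bid price, creative specs)",
            "searchable list of candidate partners ranked by relevance to the criteria",
            "past performance of candidate partners on similar types of demand",
        ],
    },
}


def data_requirements(mapping):
    """Flatten the mapping into (goal, task, data) rows for review with the business."""
    rows = []
    for goal, tasks in mapping.items():
        for task, data_items in tasks.items():
            for item in data_items:
                rows.append((goal, task, item))
    return rows


for goal, task, item in data_requirements(business_data_map):
    print(f"{goal} | {task} | {item}")
```

Each row ties a piece of data back to the task and goal it serves, which makes it easy to spot data that is being collected for no stated purpose, and tasks that have no data behind them.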

This approach has multiple benefits. It defines the data problem in a way that a) makes it easy and meaningful for the consumer to process data and b) allows creators to deliver high-impact innovation in analytics and package data contextually with the business process.

Summary

The framework enables alignment of data with business goals, which is critical to high-impact, high-velocity, data-enabled decision making. In summary, list the business goals and tasks and ask: what data is needed to accomplish the business task in order to achieve the business goal? How frequently should this mapping be conducted? After the first time, once every quarter should be good enough to check in with the business and to ensure coverage of new capabilities in core products.

Posted by: Ravi Padaki | June 30, 2012

3 easy ways to get data right

Business users have often lamented the lack of the right data. There is either no data available or there is too much data. The underlying question is: what is the right data? And when data is made available, can it be meaningful and actionable? Here are 3 suggestions that will help business users and product organizations get the right data:

Plan data early in the process
Understanding the business goals and strategy of an organization is fundamental to defining and designing actionable data. Data and analytics involve capturing, collecting, collating and presenting data. The trick is to avoid the trap of discussing any of these elements early on. When we track too many metrics, it is usually an indication of a lack of clarity on what to track and what matters the most. Understanding the need for data by aligning with the business goals and the current business strategy of the customers is more important than getting into the minutiae of metrics early on. This alignment can also reveal gaps in analytical services.

Hence, data discussions should ideally begin at the time of finalizing business and product strategies. We also need to simultaneously identify and segregate reporting solutions for billing purposes from needs of analytical solutions.

For example, when launching a new feature, add a data strategy section to the MRD or business case that describes what data is needed to help with each of the decisions taken by the user. If you are thinking about a new strategy for your regional markets, then think about what data you need to enable that strategy. To get a fresh perspective, consult your nearest data team.

Recognize that not all data is the same
Data is a term used in general to describe the whole gamut of business intelligence. However, there is a gradation in data, described below. From bottom to top, the gradation is data, reports and analytics.

1. Raw data is in structured or unstructured form and may or may not have been processed with some basic filtering to weed out extraneous data. This initially collected data usually runs into terabytes and needs a large amount of storage. Typically it is not stored for more than 3 to 7 days.
2. Reports are in aggregated format and are obtained after several layers of cleaning, aggregation and filtering based on business rules. Reports provide information in tables or trend charts. Aggregated reports in different time period formats can be stored for longer periods of time, based on business needs.
3. Analytics are algorithmic solutions based on mathematical/statistical models applied to aggregated data.

Insights are unexpected revelations from studying reports or analytics; they usually correlate the movement of two or more metrics or explain their causation. They are typically produced by business analysts.

For example, ad serving impression log records are stored on the grid (after some basic filtering, e.g. of fraudulent clicks). This has the highest level of fidelity in terms of capturing the event but is also very much Big Data, considering the volume of records generated per second. Data at this level is raw and does not make much sense to users. The next level of processing, sorting and filtering can produce metrics, e.g. unique users reached in an advertiser campaign. This is a report that distills information out of raw data. Analytics is the realm of solutions that analyze unique users by targeting attributes, exposure to viewers and conversions, and recommend targeting attributes for higher performance based on mathematical models.
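
The step from raw data to a report can be sketched in a few lines of Python. The record layout and the campaign and user ids below are made up for illustration; a real impression log would carry many more fields and would be processed on the grid rather than in memory.

```python
from collections import defaultdict

# Hypothetical raw impression log records: (campaign_id, user_id, is_fraudulent).
raw_impressions = [
    ("camp-A", "u1", False),
    ("camp-A", "u1", False),
    ("camp-A", "u2", False),
    ("camp-B", "u3", False),
    ("camp-B", "u4", True),  # dropped by the basic filter below
]


def unique_users_by_campaign(records):
    """Aggregate raw records into a report: campaign -> number of unique users reached."""
    users = defaultdict(set)
    for campaign_id, user_id, is_fraudulent in records:
        if is_fraudulent:
            continue  # basic filtering before aggregation
        users[campaign_id].add(user_id)
    return {campaign: len(ids) for campaign, ids in users.items()}


report = unique_users_by_campaign(raw_impressions)
print(report)
```

Five raw records collapse into two numbers that a business user can read at a glance, which is the difference between the bottom two levels of the gradation.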

Make data work for you
Know the power of data by familiarizing yourself with the maturity stack. The business intelligence maturity stack can be described as follows in terms of increasing complexity and value.

1. Composition – This provides information on the parts that make up the whole. For example, the number of campaigns.
2. Performance – This provides information on the state of the parts and the whole. For example, reach and delivery by campaign.
3. Recommendations – This answers the questions of what is happening and why it is happening. Recommendations can be in real time if the business domain demands it. For example, campaign optimization recommendations.
4. Prediction – This answers the question of what might happen given a set of assumptions and controllable levers. It also allows the user to try out what-if scenarios and understand their impact before they are implemented. For example, how many users will I reach if I change my targeting attributes or my pricing type to CPM?
5. Optimization – This controls future outcomes in choreographed situations, including auto-optimization of business process workflows. For example, change the budget delivery type to ASAP if CTR is trending upwards.
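
The optimization example in the last item can be written as a simple rule. The Python sketch below is only one possible reading of "trending upwards" (each of the last few readings higher than the one before); the function names, the 3-reading window and the "EVEN" label for the default delivery type are all assumptions made for illustration.

```python
def ctr_trending_up(daily_ctr, window=3):
    """True if each of the last `window` CTR readings is higher than the one before it."""
    recent = daily_ctr[-(window + 1):]
    if len(recent) < window + 1:
        return False  # not enough history to call it a trend
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


def choose_delivery_type(daily_ctr, current="EVEN"):
    """Switch budget delivery to ASAP when CTR is trending upwards; otherwise keep the current type."""
    return "ASAP" if ctr_trending_up(daily_ctr) else current


print(choose_delivery_type([0.010, 0.011, 0.012, 0.014]))
print(choose_delivery_type([0.012, 0.011, 0.013, 0.012]))
```

The first campaign has three rising readings in a row and is switched to ASAP; the second has no clear trend and keeps its current delivery type.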

Composition and performance provide information that is important and insightful for running the business, but recommendations, predictions and optimizations are where the hidden value of data lies, and they will provide the force multipliers in value to users.

Summary
In summary, plan data early on, recognize that not all data is the same, and make data work for you. The key success criterion for data remains the impact of the decisions made using it. Hopefully, this note will empower business users the next time they ask for data and drive more value to analytic products.
