With my involvement in some data science work recently, I have had the privilege to explore a lot tools of the trade – Rapid Miner, Python, Tensorflow and Azure Machine Learning to name a few. My experience has been highly enriching but I felt there was no Swiss knife that can handle the initial – and the most critical stage of a Data science project: i.e. Hypothesis stage.
During this stage, scientists typically need to quickly prep the data, find the correlation patterns and establish hypotheses. It requires them to fail fast by identifying null hypotheses and spurious correlations and stay focussed on the right path. I recently explored Power BI and would like to share my findings through this blog.
Business Problem
Let us take a business case of a juice vendor say Julie. Julie sells various kinds of juices and she collects some data about her business operations on daily basis. Say we have the following data for the month of July which looks like below. It is pretty much – when, where, what and for how much?
Now say I am a data scientist who is trying to help Julie to increase her sales and give her some insights that what should she focus on to get the best bang of her buck. I have been tasked to build an estimation model for Julie based on simple linear regression.
Feature Engineering
I will start by analysing various correlations between the features and our target variable i.e. Revenue. It can be commenced by importing the data into Power BI and looking after the following basics
1) Eliminate the null values with mean value of the feature
2) Dedupe any rows
3) Engineer some new features as below
Feature |
DAX formula |
Day Type Purpose of this feature is to distinguish between a week day and a weekend day. I wanted to test a hypothesis that weekend day might generate more sales than a week day. |
Day Type = IF(WEEKDAY(Lemonade2016[Date],3) >= 5,”Weekend”,”Weekday”) |
Total Items Sold |
Lemon + Orange |
Revenue |
Total Items Sold * Price |
Data preparation and feature engineering was a breeze in Power BI, thanks its extensive support of DAX, calculated columns and measures. The dataset looks like below now.
Hypotheses Development
Once we had our dataset ready in Power BI, the next task was to analyse the patterns between Revenue and other features
Hypothesis 1 – There is a positive correlation between Temperature and Revenue
Result: Passed
Hypothesis 2 – There are more sales on a weekend day
Result: Failed
I derived these results using the below visualizations built briskly using Power BI platform
Next off to some advanced hypothesis development. Shall we?
I needed to understand the relationship between the leaflets given on a particular day and their relationship with Revenue. Time to pull some heavy plumbing in, so I decided to tow R into in the mix. Power BI comes with inbuilt (almost!) support with R and I was able to quickly spawn a coplot using just 6-8 lines of R in the R Script Editor of Power BI
Interesting insight was how correlation differs based on the day. This was made possible using the Power BI slicer as shown below
Wednesday – Less correlation between leaflets and sales |
Sunday – High correlation between leaflets and sales |
Power BI + R = Advanced Insights
If you need to analyse the dynamics between various features and how this dynamics impacts your target variable i.e. Revenue. You can easily model that in Power BI. Below is a dynamic co plot that shows the incremental causal relationship between Leaflets, Revenue and Temperature.
The 6 quadrants at the bottom should be read in conjunction with 6 steps in the top box. The bottom left is the first step and the top right the last step of leaflets. Basically it shows how the correlation between Temperature and Revenue is affected by leaflets bin size
I ended my experiment by building a simple regression model that can give you prediction of your Revenue if you enter Temperature, Price and Leaflets. Below is the code for model in case you are keen
Power BI is a very simple and powerful tool for the exploratory data scientist in you. Give it a go.
Hi Manny,
Great post. Quick question, how did you got the results for the above two hypothesis. I mean yes, by looking at the charts I can identify the pattern/results based on regression line and means. But data science should have more statistical rigour i.e. hypothesis testing should involve confidence interval/margin of error etc. I would assume that you were just trying to present a simple example, but would be really interested in some advanced level posts on your blogs regarding this.
Also by any chance have you explored Bayesian statistical testing. Would be great to have your thoughts on how Bayesian testing results can be presented using powerbi using data within dynamics crm – where we can see the updated the results in real time once new data arrives. Below is a sample calculator for AB testing:
https://www.peakconversion.com/2012/02/ab-split-test-graphical-calculator/
Btw enjoyed your posts so far.
Cheers,
R
LikeLike
Thanks for your encouraging comment. Yes you are correct about this hypothesis being just a sample. Advanced techiques are required for production grade scenarios. My focus was more on the capabilites of PowerBI and have kept the statistical reasoning simple. Will explore the techniques you mentioned in more detail.
LikeLike