Posted in Artificial Intelligence, Machine Learning

Dedupe Duplicates using Fuzzy / Proximity search

Last year I wrote a post about finding similar accounts for Dynamics CRM, which generated a lot of interest in the community. Understandably so, as duplicate accounts are a very common requirement that comes up in nearly every CRM project. CRM's duplicate detection capabilities are only basic: they do partial matching but cannot do any fuzzy or proximity matching.

Even with the latest and most advanced weapon in CRM's armoury, i.e. Relevance Search, it is not yet at the point where it could tell that the following accounts are in fact the same companies.

Account | Potential Duplicate | Reason
Waste Management | Waset Manaegment | Typo
Public Storage Co. | Storage Public Co. | Wrong order
Scotts Miracle-Gro | Scott Miracles Gro | Plural
Melbourne University | Melbourne Univ. | Short form

I decided to improve and generalise my code a bit, so that it can be used not only for CRM but for any general requirement where you need to find duplicates based on proximity. I am going to share the code and approach in this blog.

Approach

This proximity search is based on string matching algorithms built on Edit Distance, i.e. the number of single-character edits needed to turn one string into another. The program starts by looking for the closest whole-string matches first. If it cannot find one, it widens the search filter to find partial and proximity matches (i.e. words in the same neighbourhood, ordered in a different way, etc.).
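To make the idea of Edit Distance concrete, here is a minimal sketch of the classic Levenshtein distance in plain Python. It is only an illustration of the concept; the function name edit_distance is my own, and the fuzzywuzzy library used later computes its scores with its own implementation.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions or substitutions needed to turn a into b."""
    # previous[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    # A typo such as "Waset" for "Waste" is only a couple of edits away
    print(edit_distance("Waste", "Waset"))
```

A small distance relative to the length of the strings suggests a likely duplicate, which is the intuition behind the percentage scores used in the thresholds.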

Results

I have also attached the original files that I used during my testing, i.e. the file containing duplicates and the results (where duplicates were found). Below is a brief snapshot of the results from my test run.

Company | Duplicate Found
Kimberly-Clark | Kimberly Clark
San disk | SanDisk
Macy’s | Macy
Starwood Hotels & Resorts | Starwood Hotels And Resorts
Expeditors Washington | Expeditors International of Washington

There were some false positives in the results as well, so you can adjust the thresholds of the algorithm as per your data.
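To get a feel for how a threshold decides what counts as a duplicate, the sketch below uses only the standard library. It approximates the idea behind a token-sort comparison (sort the words, then compare) with difflib; the helper names and the default threshold of 80 are assumptions for illustration, and the scores will not match fuzzywuzzy's exactly.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Similarity of two strings as a score from 0 to 100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_sort_similarity(a: str, b: str) -> int:
    """Compare two names after lower-casing and sorting their words,
    so that word order no longer matters."""
    def normalise(text: str) -> str:
        return " ".join(sorted(text.lower().split()))
    return similarity(normalise(a), normalise(b))


def is_duplicate(a: str, b: str, threshold: int = 80) -> bool:
    """Hypothetical decision rule: flag as duplicate at or above the threshold."""
    return token_sort_similarity(a, b) >= threshold


if __name__ == "__main__":
    pair = ("Public Storage Co.", "Storage Public Co.")
    print(similarity(*pair), token_sort_similarity(*pair), is_duplicate(*pair))
```

Raising the threshold reduces false positives at the cost of missing some genuine duplicates, so it is worth trying a few values against a sample of your own data.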

How to use

You have a list of companies and you want to know which of them are duplicates. This is what you need to do.

1. Export the list into a CSV file.

2. Point the code to your file.

3. Run the code and it generates a new file results.csv with a new column called Duplicate

Complete source code

Python is a beautiful language and does big things in just a few lines of code. Just install Python on your desktop and run the following file. No frills, no servers, no deployment. Too easy.

# ProximitySearch.py
# AUTHOR - MANNY GREWAL 2017 (https://mannygrewal.wordpress.com)
# THIS CODE WILL DO FUZZY SEARCH FOR SEARCH TERM INSIDE A DATABASE. THE PRECEDENCE OF SEARCH STARTS WITH THE
# NARROWEST FILTER WHICH SLOWLY WIDENS UP. THE IDEA IS TO GET TO PERFECT MATCHES BEFORE NEAR MATCHES. EACH
# FILTER HAS ITS OWN THRESHOLD CUTOFF.

#IMPORT THE PACKAGES NEEDED
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import csv
import os


#DEFINE AND CONFIGURE
FULL_MATCHING_THRESHOLD = 80
PARTIAL_MATCHING_THRESHOLD = 100
SORT_MATCHING_THRESHOLD = 100
TOKEN_MATCHING_THRESHOLD = 100
MAX_MATCHES=1

#READ THE CURRENT DATABASE
companies_db = "<local path of your CSV file>/CompaniesShort.csv"
# Skip the header row and load the single column of company names
current_db_dataframe = pd.read_csv(companies_db, skiprows=1, index_col=False, names=['Company'])

def find_matches(matchThis):
    # Compare against every other company. remove() drops only the first occurrence,
    # so an exact duplicate elsewhere in the list can still be matched.
    rows = current_db_dataframe['Company'].values.tolist()
    rows.remove(matchThis)
    # Start with the narrowest filter and widen the search only when nothing is found
    matches = process.extractBests(matchThis, rows, scorer=fuzz.ratio, score_cutoff=FULL_MATCHING_THRESHOLD, limit=MAX_MATCHES)
    if len(matches) == 0:
        matches = process.extractBests(matchThis, rows, scorer=fuzz.partial_ratio, score_cutoff=PARTIAL_MATCHING_THRESHOLD, limit=MAX_MATCHES)
        if len(matches) == 0:
            matches = process.extractBests(matchThis, rows, scorer=fuzz.token_set_ratio, score_cutoff=TOKEN_MATCHING_THRESHOLD, limit=MAX_MATCHES)
            if len(matches) == 0:
                matches = process.extractBests(matchThis, rows, scorer=fuzz.token_sort_ratio, score_cutoff=SORT_MATCHING_THRESHOLD, limit=MAX_MATCHES)

    # Each match is a (name, score) tuple; return the best name, or None if nothing matched
    return matches[0][0] if len(matches) > 0 else None


# Look up the closest match for every name in the Company column
current_db_dataframe['Duplicate'] = current_db_dataframe['Company'].apply(find_matches)

current_db_dataframe.to_csv("results.csv")
Posted in IoT, Machine Learning

Azure IoT Hub Streaming Analytics Simulator

Azure IoT Hub Streaming Analytics Simulator is an application written by Manny Grewal. The purpose of this blog is to explain the What, Why and How of this application.

 

What?

Streaming analytics is a growing trend that deals with analysing data in real-time. Real-time data streams have a short life span, their relevance decreases with time, so they demand quick analysis and rapid action.

Some areas where such applications are highly useful include data streams emitted by

  • Data centres to detect intrusions, outages
  • Factory production line to detect wear and tear of the machinery
  • Transactions and phone calls to detect fraud
  • Time-series analysis
  • Anomaly detection

 

Data used by streaming analytics applications is temporal in nature, i.e. it is based on short intervals of time. What is happening at the interval TX can be influenced by what happened 2 minutes ago, i.e. at the interval TX-2.

So the relationships between various events are time-based rather than entity based (e.g. as in general Entity Relational Database based systems)

Take the scenario of a Data Centre which has two sensors that emit a couple of data streams – Fan Speed of the server hardware and its temperature.

If the temperature reading of the server hardware is going up, it could be related to a dwindling Fan Speed reading. We need to look at both readings over an interval of time to establish a hypothesis on their correlation.

 

 

Why?

In order to model and work with streaming analytics it is important to have an event generator that can generate the data streams in a time-series fashion.

Some examples of such generators are vehicle sensors, IoT devices, medical devices, transactions, etc. that generate data quickly.

The purpose of this application is to simulate the data generated by those devices. It helps you set up quickly and start modelling some data for your IoT experiments.

 

 

Main benefits of this app

1. Integrated with Azure IoT Hub i.e. the messages emitted by this application are sent to the Azure IoT Hub and can be leveraged by the Intelligence and Big Data ecosystem of Azure.

2. This app comes with 4 preset sensors

a. Temperature/Humidity

b. Air Quality

c. Water Pollution

d. Phone call simulator

3. Configure > Ready. The app can be easily pointed to your Azure instance and can start sending messages to your Azure IoT Hub.

4. Can be extended, if you are handy with .NET development. I have designed the app following the SOLID principles so it can be extended and customised. The link to the source code is below.

 

How?

 

App and source code can be downloaded from my Github

 

A quick tour of the app is below

IoT Hub

 

 

Configure

The app needs to be configured with details of your Azure IoT Hub account.

The following files need to be configured

1. App.Config

2. If you are registering Devices in the Hub, then keys for the devices need to be stored in the SensorBuilder.cs

3. You may need to restore the Nuget Packages to build the application

 

Once the above three steps have been completed, you can build the application and the EXE of the application will be generated.

 

Sensor Tuning

Sensors can be tuned from the classes inheriting IDataPoint e.g. in the FloatDataPoint.cs

The following properties can be used to tune the sensors

Property Name | Tuning
MinValue | The minimum value of the sensor reading e.g. for climatic temperature it can be -40C
MaxValue | The maximum value of the sensor reading e.g. for climatic temperature it can be 55C
CommonValue | This is the average value of the sensor e.g. for warmer months it can be 30C
FluctuationPercentage | How much variance you want in the generated data
AlertThresholdPercentage | When should an alert be generated if the reading passes a certain threshold e.g. 80% of the maximum value
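To illustrate how these five properties could work together, here is a small Python sketch of a simulated data point. It is not the app's actual .NET implementation: the class name, the way fluctuation is applied around the common value and the "Alert" level name are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class FloatDataPointSketch:
    """Hypothetical stand-in for a tunable sensor data point."""
    min_value: float
    max_value: float
    common_value: float
    fluctuation_percentage: float
    alert_threshold_percentage: float

    def next_reading(self, rng: random.Random):
        # Assumption: readings fluctuate around the common value by up to
        # +/- fluctuation_percentage of that value
        spread = abs(self.common_value) * self.fluctuation_percentage / 100.0
        value = self.common_value + rng.uniform(-spread, spread)
        # Keep the reading inside the configured sensor range
        value = max(self.min_value, min(self.max_value, value))
        # Assumption: an alert fires once the reading passes the given
        # percentage of the maximum value
        alert_level = self.max_value * self.alert_threshold_percentage / 100.0
        level = "Alert" if value >= alert_level else "Normal"
        return value, level


if __name__ == "__main__":
    temperature = FloatDataPointSketch(-40.0, 55.0, 30.0, 10.0, 80.0)
    rng = random.Random(42)  # seeded so repeated runs give the same stream
    for _ in range(3):
        print(temperature.next_reading(rng))
```

With a common value of 30C and 10% fluctuation, readings stay between 27C and 33C, well under the alert level of 44C (80% of 55C).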

 

Azure IoT Hub

The messages sent by the sensor simulator can be accessed in the Azure IoT Hub. Once you have configured your hub and related streaming jobs, the messages can be seen in the dashboard as below.

image

 

The messages are sent in the JSON format and below is a structure of one of the messages emitted by a sensor located at Berwick, VIC


{
  "IncludeSensorHeader": 1,
  "MessageId": "949a3618-c4a4-42bc-9c2a-39da86aa9191",
  "EmittedOn": "2017-06-30T11:13:45.3543200",
  "SensorDataHeader": {
    "Sensor": "Berwick",
    "DeviceId": "G543",
    "Lat": -38.0309,
    "Long": 145.3461
  },
  "SpecialMessage": null,
  "Readings": [
    {
      "ReadingValue": 27.9943523,
      "MetaData": {
        "Name": "Temperature",
        "Unit": "C"
      },
      "Level": "Normal"
    },
    {
      "ReadingValue": 49.6043358,
      "MetaData": {
        "Name": "Humidity",
        "Unit": "RH"
      },
      "Level": "Normal"
    }
  ],
  "EventProcessedUtcTime": "2017-07-01T11:26:53.1434112Z",
  "PartitionId": 0,
  "EventEnqueuedUtcTime": "2017-06-30T11:13:48.4340000Z",
  "IoTHub": {
    "MessageId": null,
    "CorrelationId": null,
    "ConnectionDeviceId": "G543",
    "ConnectionDeviceGenerationId": "636297589019910475",
    "EnqueuedTime": "2017-06-30T11:13:47.5760000Z",
    "StreamId": null
  }
}
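If you want to consume these messages downstream, a minimal Python sketch for pulling the readings out of one message could look like the following. The message is a trimmed copy of the one above, and summarise_readings is a hypothetical helper name, not part of the simulator.

```python
import json

# Trimmed copy of the sample message, keeping only the fields used here
RAW_MESSAGE = """
{
  "MessageId": "949a3618-c4a4-42bc-9c2a-39da86aa9191",
  "SensorDataHeader": {"Sensor": "Berwick", "DeviceId": "G543"},
  "Readings": [
    {"ReadingValue": 27.9943523,
     "MetaData": {"Name": "Temperature", "Unit": "C"},
     "Level": "Normal"},
    {"ReadingValue": 49.6043358,
     "MetaData": {"Name": "Humidity", "Unit": "RH"},
     "Level": "Normal"}
  ]
}
"""


def summarise_readings(raw: str):
    """Return one (sensor, name, value, unit, level) tuple per reading."""
    message = json.loads(raw)
    sensor = message["SensorDataHeader"]["Sensor"]
    return [
        (sensor,
         reading["MetaData"]["Name"],
         reading["ReadingValue"],
         reading["MetaData"]["Unit"],
         reading["Level"])
        for reading in message["Readings"]
    ]


if __name__ == "__main__":
    for row in summarise_readings(RAW_MESSAGE):
        print(row)
```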

Posted in Computer Vision, Dynamics 365

Can Dynamics CRM understand images? Yes! Using deep learning.

Machine Learning is quite a buzzword these days and we have witnessed how quickly Microsoft and other vendors have made progress in this area. Just a couple of years back Microsoft had no product or tool in this space and today they have close to a dozen. Recently Microsoft has integrated Machine Learning into SQL Server and Dynamics CRM, and it is slowly becoming core to its product line.

I would not be surprised if machine learning becomes a mandatory skill for most of the development jobs in the next decade.

How can Image Recognition help CRM?

Attaching documents is a common feature asked for in many CRM projects, where customers complete an application form and then upload scanned copies to support their application. Think of invoices, receipts, certificates, licenses, etc. As of now there is no way for Dynamics CRM to detect whether the scan that a customer is uploading is a picture of a license, a beach or a car.

What if Dynamics CRM could detect and recognise the scanned image and tell the user that it is expecting a license, not a Dilbert on the beach?

clip_image001

Source: Ol.v!er [H2vPk] – Flickr

Wouldn’t it be great?

There are some image engines that can tell you what an uploaded picture contains, but there isn’t any engine or tool (to my knowledge) that can tell whether an uploaded document is a license or not. This is because there are only subtle differences between scanned copies of various documents.

In this blog series I will build and demonstrate an approach to bring this kind of image recognition capability to our favourite Dynamics CRM. We will use a branch of machine learning called Deep Learning that is very good at tasks related to Computer Vision. I will not be delving into the concepts of Deep Learning (there are numerous posts and videos on the internet) but will try to cover the major building blocks in this whirlwind tour.

Australian Identity Documents

I will take a real business case which is ubiquitous in many online applications in Australia, where a customer is asked to provide a scan of their Australian ID as proof. For our blog we will use the following Australian IDs:

1) Victoria Driver’s License clip_image003

Courtesy: VicRoads

 

2) Australian Visa clip_image005

Courtesy: http://www.thejumpingkoala.com/

3) Medicare card clip_image007

Courtesy: Medicare

Note: Because of their sensitive nature I will only be showing sample documents in this blog.

The expectation is that the system can tell if the user is attaching a scanned copy of their Australian Visa when the record type is Australian Visa. So we will validate the image based on its content.

A good thing about deep learning based systems is that the detection algorithms do not rely on exact colour, resolution and placement but rather on pattern and feature matching. I got pretty good results when I built this system, which I will share in later posts.

Technical Setup

Deep Learning based systems use a concept of neural networks to train themselves and to perform their tasks. There are many kinds of neural networks and the one that does the job for us is the Convolutional Neural Network. CNNs are good at image related tasks.

In order to train a CNN from scratch you need a lot of hardware and computing power, and I do not have that. So I will be using a partially trained network and customise it for our specific task, i.e. to identify the images of those 3 types of Australian IDs.

Let us cover the building blocks of our solution

TensorFlow TM

TensorFlow is an open source framework for Deep Learning and we will be using it to train our engine.

Python

TensorFlow comes in many platforms but we will use its Python version.

Dynamics 365

Once our model is trained we will deploy it online as a web service that CRM can query. I will not be posting the integration code here as I have already posted code to integrate Dynamics CRM with Machine Learning web services in my other blog.

 

Let us start by training an image recognition model that can classify an image e.g. a scanned copy and tell if it is an Australian ID e.g. driving license or visa scan, etc.

Approach

We will use an approach called Transfer Learning. In this approach you take an existing Convolutional Neural Network and retrain its last few layers. Think of it this way: you have already got a network that can tell an aeroplane from a dog, but you need to retrain it to pick up more subtle differences, i.e. the difference between a scanned invoice and a scanned passport.

TensorFlow is based on the concept of a tensor, which is a multi-dimensional array. The tensor we care about here is a vector that contains the features of an image. We will grab the output of the penultimate layer of the network and use the features it produces for some sample images of a Medicare card, an Australian Visa and Victoria’s Driver license as our training data.

Once the model is trained, we will use a simple Support Vector Machine to classify the uploaded image and predict the likelihood of it being an Australian ID. The output of the SVC classifier will be a predicted class along with a likelihood probability, e.g.

(Visa, 0.83) Model thinks 83% the image is that of an Australian Visa
(Medicare, 0.89) 89%, it is a Medicare
(License, 0.45) 45% it is a license

If the confidence percentage is low, it means that the image is not in the class of our interest, e.g. in the last example the uploaded image is most likely not a license. As a rule of thumb, a probability of 0.80 is a good mark for the prediction to be reliable.
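The rule of thumb above can be expressed as a tiny decision function. This is a sketch only: interpret_prediction is a hypothetical helper, and it assumes the classifier's output has already been turned into a dictionary of class names to probabilities.

```python
def interpret_prediction(probabilities, threshold=0.80):
    """Return (label, confidence) when the best class is reliable enough,
    otherwise (None, confidence) to signal 'not one of our documents'."""
    label, confidence = max(probabilities.items(), key=lambda item: item[1])
    if confidence >= threshold:
        return label, confidence
    return None, confidence


if __name__ == "__main__":
    # Made-up probabilities, for illustration only
    print(interpret_prediction({"Visa": 0.83, "Medicare": 0.10, "License": 0.07}))
    print(interpret_prediction({"License": 0.45, "Visa": 0.30, "Medicare": 0.25}))
```

Returning None for low-confidence predictions lets the calling system (e.g. CRM) ask the user to upload the expected document instead of accepting a weak guess.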

Training Pool

Below are the screenshots of the samples that I used as a training set for my image classification model. As you can see, the images differ in terms of angles, positioning, colours, etc. The system can still learn based on the important properties and disregard the irrelevant ones.

Australian Visa

Training Set

clip_image008
Medicare

Training Set

clip_image009
Victoria Driver’s License

Training Set

clip_image010

Training Phase

The training procedure involves sorting all the training images into folders which are named after their class. As you can see in the screenshots above, the Windows folders are named after the class, i.e. DriversLicense, Medicare and Visa.

We then iterate over all these images and pass them through the network up to its penultimate layer, which gives us a feature tensor (a 2048-dimensional array for that image). We then label the image with its respective class.
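The folder-per-class convention makes labelling almost free. Below is a minimal sketch, using only the standard library, of how the (image path, label) pairs could be collected before feature extraction. The function name and the list of accepted file extensions are assumptions, not taken from my original code.

```python
from pathlib import Path

# Assumption: training images are stored as JPEG or PNG files
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def collect_labelled_images(training_root):
    """Walk the class-named folders under training_root and return a list
    of (image_path, class_label) pairs, where the label is the folder name."""
    samples = []
    for class_dir in sorted(Path(training_root).iterdir()):
        if not class_dir.is_dir():
            continue
        for image_path in sorted(class_dir.iterdir()):
            if image_path.suffix.lower() in IMAGE_EXTENSIONS:
                samples.append((str(image_path), class_dir.name))
    return samples


# Example usage with a hypothetical folder layout:
#   training/DriversLicense/L1.jpg, training/Medicare/M1.jpg, training/Visa/V1.jpg
#   samples = collect_labelled_images("training")
```

Each path would then be run through the network to get its feature vector, and the folder name becomes the label for that vector.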

Support Vector Machine

Once we have the feature tensor and label of every image, our training dataset is complete and we feed it to a Support Vector Machine and train the model. To save time, I pickled the model so that it can be reused for all predictions.

I know some of this terminology may be new to you, but in the next post I will explain the architecture and some sample code that generates the predictions. Then it will start falling into place. See you then.

Part 3

In the previous two instalments I have been explaining the image recognition system that I built to recognise Australian IDs and discussed how our traditional CRM can benefit from such intelligent capabilities.

In this post I will cover the Architecture and share some sample code

Architecture

clip_image012

As you can see above there are basically two major pillars of the system

A) Python

B) CRM ecosystem

Python is used to build the model using TensorFlow, then the compiled version of the trained model is deployed to an online webservice that should be able to accept binary contents like image data.

On the CRM ecosystem side, a user can upload the image in a web portal or directly from CRM based on the scenario, then we need to pass it to the model and get the score.

Source Code

Below is an excerpt of the source code from one of the unit tests that will give you a glimpse of what happens under the hood on the Python side of the fence. This is just one class for introductory purposes, not the entire source code.

# Written against the TensorFlow 1.x API (tf.Session, tf.GraphDef)
import os
import pickle
import sklearn
import numpy as np
from sklearn.svm import SVC
import tensorflow as tf
import tensorflow.python.platform
from tensorflow.python.platform import gfile

model_dir = 'inception'

def CreateImageGraph():
    #Get the tensorflow graph
    with gfile.FastGFile(os.path.join(
            model_dir, 'classify_image_graph_def.pb'), 'rb') as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())
        _ = tf.import_graph_def(graph_def, name='')

def ClassifyAustralianID(image):
    nb_features = 2048
    #Initialise the feature tensor
    features = np.empty((1,nb_features))
    CreateImageGraph()
    with tf.Session() as sess:
        next_to_last_tensor = sess.graph.get_tensor_by_name('pool_3:0')
        print('Processing %s...' % (image))
        if not gfile.Exists(image):
            tf.logging.fatal('File does not exist %s', image)
        image_data = gfile.FastGFile(image, 'rb').read()
        #Get the feature tensor
        predictions = sess.run(next_to_last_tensor,{'DecodeJpeg/contents:0': image_data})
        features[0,:] = np.squeeze(predictions)
    clear = '\n' * 20
    print(clear)
    return features

if __name__ == '__main__':
    #Unpickle the trained model
    trainedSVC = pickle.load(open('Trained SVC','rb'))
    #Path to the image to be classified
    unitTestImagePath = 'Test\\L5.jpg'
    #Get feature tensor of the image
    X_test = ClassifyAustralianID(unitTestImagePath)
    print("Trying to match the image at path %s....." % unitTestImagePath)
    #Get predicted probabilities of various classes
    y_predict_prob=trainedSVC.predict_proba(X_test)
    #Get predicted class
    y_predict_class=trainedSVC.predict(X_test)
    #Choose the item with the best probability
    bestProb = y_predict_prob.argsort()[0][-1]
    #Print the predicted class along with its probability
    print("(%s, %s)" % (y_predict_class, y_predict_prob[0][bestProb]))

 

The purpose of the above stub is to test the prediction function ClassifyAustralianID with a sample image L5.jpg, which is below. As we can see, it is a driving license.

clip_image013

Running this image against the model gives us this output

clip_image014

It means the model is 93% sure that the input image matches the Driving License class. In my testing I found that any prediction above 80% was correct.

For example, the confidence percentage for the below images was low because they do not belong to one of our classes (Drivers License, Visa or Medicare), which is the expected output.

clip_image015

Closing Notes

Image recognition is a field of budding research and getting a lot of attention these days because of driverless cars, robots, etc. This little proof of concept gave me a lot of insight into how things work behind the scenes and it was a great experience to create such a smart system. The world of machine learning is very interesting!!

Hope you enjoyed the blog.

Posted in Machine Learning

Power BI for Data Scientists

With my involvement in some data science work recently, I have had the privilege to explore a lot of tools of the trade: Rapid Miner, Python, TensorFlow and Azure Machine Learning, to name a few. My experience has been highly enriching but I felt there was no Swiss Army knife that can handle the initial and most critical stage of a data science project, i.e. the Hypothesis stage.

During this stage, scientists typically need to quickly prep the data, find the correlation patterns and establish hypotheses. It requires them to fail fast by identifying null hypotheses and spurious correlations and stay focussed on the right path. I recently explored Power BI and would like to share my findings through this blog.

Business Problem

Let us take the business case of a juice vendor, say Julie. Julie sells various kinds of juices and she collects some data about her business operations on a daily basis. Say we have the following data for the month of July, which looks like below. It is pretty much when, where, what and for how much.

clip_image001

Now say I am a data scientist who is trying to help Julie increase her sales and give her some insights into what she should focus on to get the best bang for her buck. I have been tasked to build an estimation model for Julie based on simple linear regression.

Feature Engineering

I will start by analysing various correlations between the features and our target variable, i.e. Revenue. This starts with importing the data into Power BI and taking care of the following basics

1) Replace the null values with the mean value of the feature

2) Dedupe any rows

3) Engineer some new features as below

Feature | DAX formula
Day Type | Day Type = IF(WEEKDAY(Lemonade2016[Date],3) >= 5,"Weekend","Weekday")
Total Items Sold | Lemon + Orange
Revenue | Total Items Sold * Price

The purpose of the Day Type feature is to distinguish between a week day and a weekend day. I wanted to test a hypothesis that a weekend day might generate more sales than a week day.

Data preparation and feature engineering was a breeze in Power BI, thanks to its extensive support of DAX, calculated columns and measures. The dataset looks like below now.
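For readers who think in code rather than DAX, the same three engineered features can be sketched in Python as below. The column names follow the formulas above (Date, Lemon, Orange, Price), while the function name and the sample row values are made up for illustration. Python's date.weekday() numbers Monday as 0 and Sunday as 6, the same scheme as WEEKDAY with return type 3.

```python
from datetime import date


def engineer_features(row):
    """Return a copy of the row with Day Type, Total Items Sold and Revenue added."""
    # weekday() gives Monday=0 ... Sunday=6, so 5 and 6 are the weekend
    day_type = "Weekend" if row["Date"].weekday() >= 5 else "Weekday"
    total_items_sold = row["Lemon"] + row["Orange"]
    revenue = total_items_sold * row["Price"]
    return {
        **row,
        "Day Type": day_type,
        "Total Items Sold": total_items_sold,
        "Revenue": revenue,
    }


if __name__ == "__main__":
    # Illustrative, made-up row
    sample = {"Date": date(2016, 7, 2), "Lemon": 10, "Orange": 5, "Price": 0.5}
    print(engineer_features(sample))
```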

clip_image001[4]

Hypotheses Development

Once we had our dataset ready in Power BI, the next task was to analyse the patterns between Revenue and other features

Hypothesis 1 – There is a positive correlation between Temperature and Revenue

Result: Passed

Hypothesis 2 – There are more sales on a weekend day

Result: Failed

I derived these results using the below visualizations built briskly using Power BI platform

clip_image003
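If you want to put a number on a hypothesis such as the first one, the Pearson correlation coefficient is a common choice. The sketch below is a plain Python version for illustration; it is not what Power BI runs internally, and the sample numbers are made up rather than taken from Julie's dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences.
    Values near +1 indicate a strong positive linear relationship,
    values near -1 a strong negative one, and values near 0 little or none."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(variance_x * variance_y)


if __name__ == "__main__":
    # Made-up illustrative values, not real sales data
    temperature = [22.0, 25.5, 27.0, 30.5, 33.0]
    revenue = [40.0, 46.0, 51.0, 60.0, 66.0]
    print(round(pearson(temperature, revenue), 3))
```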

Next off to some advanced hypothesis development. Shall we?

I needed to understand the relationship between the leaflets given out on a particular day and Revenue. Time to pull some heavy plumbing in, so I decided to tow R into the mix. Power BI comes with inbuilt (almost!) support for R and I was able to quickly spawn a coplot using just 6-8 lines of R in the R Script Editor of Power BI.

clip_image004

An interesting insight was how the correlation differs based on the day. This was made possible using the Power BI slicer as shown below.

clip_image006 clip_image008
Wednesday – Less correlation between leaflets and sales
Sunday – High correlation between leaflets and sales

Power BI + R = Advanced Insights

If you need to analyse the dynamics between various features and how these dynamics impact your target variable, i.e. Revenue, you can easily model that in Power BI. Below is a dynamic coplot that shows the incremental causal relationship between Leaflets, Revenue and Temperature.

The 6 panels at the bottom should be read in conjunction with the 6 steps in the top box. The bottom left is the first step and the top right the last step of leaflets. Basically it shows how the correlation between Temperature and Revenue is affected by the leaflets bin size.

clip_image009

I ended my experiment by building a simple regression model that can give you prediction of your Revenue if you enter Temperature, Price and Leaflets. Below is the code for model in case you are keen

clip_image010

Power BI is a very simple and powerful tool for the exploratory data scientist in you. Give it a go.

Posted in Computer Vision

How developers can move to the next level

Bored of writing plugins, workflows, integrations and web pages and want to try something interesting? Try artificial intelligence.

It is so interesting and powerful that once you are into it you will never look back. Drones are in the air and driverless cars are being trialled. All such smart machines have one key requirement i.e. Visual Recognition.

That is the ability to understand what a frame contains: what is in that image, what is in the video?

It is quite fascinating to think about how a program can interpret an image.

If that is something you like then read on.

 

How a program understands an image

Images are matrices of pixel values. Think of an image as a 3D array where the first dimension runs along the height of the image (rows), the second dimension runs along the width (columns) and the third dimension is the colour channel, i.e. RGB.

For the below image, an array value of [10][5][0]=157 means the value of the Red channel of the pixel at row 10 and column 5 is 157, and its Green channel value may be 34, i.e. [10][5][1]=34.

 

image

Source: openframeworks.cc

So at a very basic level, image interpretation is all about applying machine learning to these matrices.
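To make the [row][column][channel] indexing tangible, here is a small sketch that builds a blank 32 x 32 RGB image as nested Python lists and then flattens it into a single vector. The helper names are hypothetical, and the row-by-row flattening order is just one possible layout; real datasets may store their pixels in a different order.

```python
RED, GREEN, BLUE = 0, 1, 2  # index of each colour channel


def make_image(height, width, fill=(0, 0, 0)):
    """Create a height x width x 3 image as nested lists."""
    return [[list(fill) for _ in range(width)] for _ in range(height)]


def flatten(image):
    """Turn the 3D array into one long vector, row by row, pixel by pixel."""
    return [value for row in image for pixel in row for value in pixel]


image = make_image(32, 32)
image[10][5][RED] = 157    # Red channel of the pixel at row 10, column 5
image[10][5][GREEN] = 34   # Green channel of the same pixel

flat = flatten(image)      # 32 * 32 * 3 values in a single vector

if __name__ == "__main__":
    print(len(flat))
```

A flat vector like this is what a simple classifier compares, value by value, against the vectors of the images it already knows.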

 

How to write a basic Image classifier

In this blog, I will highlight how you can write a very basic image classifier. It will not be state of the art but it can give you an understanding of the basics. There is a great source available that can help you train your image classifier. The CIFAR dataset gives you around 50K classified images in their matrix form that your program can train upon and an additional 10K images that you can use to test the accuracy of your program. At the end of this blog I will leave you with the link to the full source code of a working classifier.

 

Training Phase

In the training phase you load all these images in an array and also store their category in an equivalent array e.g. let me show you some code

unpickledFile = self.LoadFile(fileName)
# Get the raw images.
rawImages = unpickledFile[b'data']
# Get the class-numbers for each image. Convert to numpy-array.
classNames = np.array(unpickledFile[b'labels'])
# Reshape 32 * 32 * 3 (3D) vector into 3072 (1D) vector
flattenedMatrix = np.reshape(rawImages, (self.NUM_EXAMPLES, self.NUMBER_OF_PIXELS * self.NUMBER_OF_PIXELS * self.TOTAL_CHANNELS))

 

In the above code we are loading the CIFAR dataset and converting it into two arrays. The array flattenedMatrix contains the image pixels and the array classNames contains what the image actually shows, e.g. a boat, horse, car, etc.

So flattenedMatrix[400] will give us the pixel values of the example at index 400 and classNames[400] will give us its category, e.g. a car.

That way the program can relate which pixel values correspond to which objects and create patterns that it can match against during prediction.

Prediction

This being a very simple classifier, it uses a simple prediction algorithm called kNN, i.e. k Nearest Neighbours. Prediction occurs by finding the closest neighbours among the images the program already knows.

For example if k=5, then for an input image X the program finds 5 closest images whose pixel values are similar to X. Then the class of X is computed based on the majority vote e.g. if 3 of those images are of category horse, then X is also most likely to be a horse.

Below is some code that shows how this computation occurs

def Predict(self, testData, predictedImages=False):
    # testData is the N X 3072 array where each row is 3072 D vector of pixel values between 0 and 1
    totalTestRows = testData.shape[0]
    # A vector where each element is zero with N rows where each row will be predicted class i.e. 0 to 9
    Ypred = np.zeros(totalTestRows, dtype = self.trainingLabels.dtype)
    Ipred = np.zeros_like(testData)

    # Iterate for each row in the test set
    for i in range(totalTestRows):
        # It uses Numpy broadcasting. Below is what is happening
        # testData[i,:] is test row of 3072 values
        # self.trainingExamples - testData[i,:] gives you a difference matrix of size 50000 X 3072 where each element is the difference value
        # np.sum() computes sums across the columns e.g. [ 2 4 9] sum is 15,
        # distances is 50000 rows where each element is the distance (cumulative sum of all 3072 columns) from test record (i)
        distances = np.sum(np.abs(self.trainingExamples - testData[i,:]), axis = 1)
        # Partition by nearest K distances (smallest K)
        nearest_K_distances = np.argpartition(distances, self.K)[:self.K]
        # K matches
        labels_K_matches = self.trainingLabels.take(nearest_K_distances)
        # top matched label
        best_label = np.bincount(labels_K_matches).argmax()
        Ypred[i] = best_label
        # do we need to return predicted Image as well
        if predictedImages:
            best_label_arg = np.argwhere(labels_K_matches == best_label)
            # store the match
            Ipred[i] = self.trainingExamples[nearest_K_distances[best_label_arg[0][0]]]
    return Ypred, Ipred
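If the NumPy broadcasting above feels dense, the same majority-vote idea can be shown on a toy example in plain Python. This is a simplified sketch with made-up 2D points instead of 3072-value image vectors; the function names are my own and it is far slower than the vectorised version.

```python
from collections import Counter


def l1_distance(a, b):
    """Sum of absolute differences between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(training_examples, training_labels, test_vector, k):
    """Predict the label of test_vector by majority vote among its
    k nearest training examples (nearest by L1 distance)."""
    distances = [
        (l1_distance(example, test_vector), label)
        for example, label in zip(training_examples, training_labels)
    ]
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy data: three "horse" points near the origin, two "car" points far away
    examples = [[0, 0], [0, 1], [1, 0], [9, 9], [9, 8]]
    labels = ["horse", "horse", "horse", "car", "car"]
    print(knn_predict(examples, labels, [1, 1], k=3))
```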

 

As outlined above, if you want to try this yourself, the full source code is available on my Github page.

Posted in Chatbots, Dynamics 365

Part 2 – Bot Framework

The recently released Bot Framework equips us with the basic plumbing that is required for chat sessions and making connections with services like LUIS. Some of the key features of Bot Builder SDK include

· Support for both C# and Node.js

· Open source on Github

· Conversation support – Prompts, Dialog and Rulesets for form flows

· Chat emulator – a client for testing

· Connector to Cognitive services like LUIS

Once you have the prerequisites discussed in the previous part, you can create a new bot project from Visual Studio by going to

File > New > Project > Bot Application

The project setup is based on WebAPI / MVC style routing and you need to implement a message controller. Below is a screenshot of the source code for the bot

clip_image001

Handling messages

The main entry point of the bot framework is the MessagesController as shown below


[BotAuthentication]
public class MessagesController : ApiController
{
    [ResponseType(typeof(void))]
    public virtual async Task<HttpResponseMessage> Post([FromBody] Activity activity)
    {
        // check if activity is of type message
        if (activity != null && activity.GetActivityType() == ActivityTypes.Message)
        {
            await Conversation.SendAsync(activity, () => new InsuranceDialog());
        }
        else
        {
            // HandleSystemMessage is not shown in this excerpt
            HandleSystemMessage(activity);
        }
        return new HttpResponseMessage(System.Net.HttpStatusCode.Accepted);
    }
}

The controller is secured by the BotAuthentication attribute, which secures the bot’s endpoint. We then check the incoming activity to ensure it is of type message and initiate a dialog called InsuranceDialog. The dialog then passes the message to LUIS to determine the customer’s intent and generates a reply accordingly. We will dig into more details of LUIS in the next blog.

Replies

Replies from the bots are posted back on the chat screen using some of the common methods described below

context.PostAsync("Hi there. Welcome to BestPrice.");

The above line shows how to post a basic message back to the user.

PromptDialog.Choice<TypeOfInsuranceOptions>(context, 
ResumeTypeOptionsAsync, options, "Let us know what are you interested in?");

Here we are using a dialog class which not only posts a message with options but also listens to the user’s input i.e. the option they chose.

PromptDialog.Confirm(context, HandleInsuranceOptions, "Do you want to know about our insurance?", "Didn't get that!");

This is an example of a confirm message where we expect a Yes or No from the user

Using the Channel Emulator

One of the most useful applications for such projects is the Bot Framework Channel Emulator, a client you use to test your bots. It can connect to both online and locally deployed bot apps. You need to ensure that the AppId and Secret you use in the emulator are the ones your bot app uses, i.e. the ones in its web.config. Below is a screenshot of our bot being tested locally. Let us meet in the next blog post where we explore LUIS.

[Screenshot: the bot being tested locally in the emulator]

Posted in Chatbots, Dynamics 365

Build a Chatbot for Dynamics CRM – Part 1

“Chatbots are about taking the power of human language and applying it more pervasively to our computing.”

Satya Nadella

We have seen an age of mobile phone apps, and guess what is coming next? Chatbots. To acknowledge their soaring growth and to capitalise on this business opportunity, Microsoft released a full framework for building bots at this year’s Build conference. It is called the Bot Framework.

Microsoft is not alone in the game: Facebook and Amazon have released their bot platforms as well, and the developer base is growing at an astonishing pace. Technology is making huge leaps in Natural Language Processing, with Google having just open-sourced its NLP parser and Microsoft having enriched its language platform LUIS (Language Understanding Intelligent Service). These advancements, coupled with the capability to build chatbots, present an incredible opportunity for developers and businesses alike. A proof of their popularity is the statistic that, since last year, bots have outnumbered humans on the internet. So not only are they a raging trend but also a hot market.

But what does all this mean for businesses? Put simply, organisations will be able to leverage Conversation as a platform where they can deploy intelligent chatbots to serve their customers. The return on investment is quite attractive too: according to one survey, the average cost of a customer transaction via phone is around $2.50, while the average cost of a digital transaction (online or on mobile) is only around $0.17. It is not all doom and gloom for humans though; a lot of human involvement will still be required to fill in what bots lack, at least for the foreseeable future.

I decided to give Microsoft’s bot platforms a whirl to check how easy it is to build a basic chatbot. Through this blog series, I will walk you through the process of building a chatbot that can interact with Dynamics CRM and can optionally be deployed using Microsoft portals. We will use two spanking new platforms released recently as a part of Microsoft’s Cognitive Suite: Bot Framework and LUIS. Before we start building, let us first understand how bots fit into the ecosystem.

[Diagram: how bots fit into the ecosystem]

We will use the setup outlined in the above diagram. The bot will be primarily built on the Bot Framework using .NET (Node.js is also supported) and it will interact with LUIS to parse the natural language and try to understand what the customer means.

There will be three more parts to this series and I will also link the source code of the working bot in the last part

Part 1 – Introduction

Part 2 – Bot Framework

Part 3 – LUIS

Part 4 – Chatbot Integration and Deployment

Let us lay out the scenario to understand what we are building.

Scenario

Say we are an insurance company called BestPrice and we are deploying a chatbot that customers can converse with to learn about our products and to register their interest. The bot will pass some of the conversations to LUIS to determine the customer’s intent. Three intents will be used for this demo

Greeting – Conversation is just a greeting like hi, hello, etc.

Enquire – The customer wants to enquire about our insurance products

Engage – The customer wants us to engage with them

Prerequisites

In order to set up the project you need to have the following prerequisites

1. Bot Framework VS template

2. Bot Framework Channel Emulator

3. Bot Framework DLLs (via NuGet)

4. A developer account with Bot Framework

5. A developer account with LUIS with subscription key

6. Once the bot is deployed online it needs to be registered with bot framework

You can read more about the above prerequisites here or search them online

In the next instalment we will start building the bot and go through some of the key building blocks.

Posted in Dynamics 365, Machine Learning

Use Machine Learning to predict customers you might lose – Part 4

So far we have seen how Dynamics CRM can be connected to Azure ML to receive predictions. Once we have the integration going, there is no dearth of possibilities. You may like to build an alert / flagging functionality that alerts a Customer Service rep to contact a customer if their predictors indicate that they might churn. You may incorporate predictions into exec reporting so that the execs are aware of the churn trends and can make decisions to minimise churn.

Insights

One of the things I discussed at the start of this series was being able to get some insights into the key drivers of customer churn, e.g. how do you know which features are most likely to cause churn? Answering such questions begins with analysing your data. A few starting points are

1. From your data find out what fields change with respect to the Churn variable e.g. does the churn rate increase as the income of the customer goes up or is it dependent on their usage?

There are measures like correlation, covariance, entropy, etc that can help you answer such questions.

2. Find the distribution of your data and identify any outliers e.g. check if there is a skew in the data or if the classes are unbalanced. You may need to apply some statistical techniques like variance, standard deviation to have a better platform to delve into some of these insights.
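To make the two starting points above a little more concrete, here is a minimal Python sketch that computes a Pearson correlation between one feature and the churn flag, and checks how balanced the classes are. The sample numbers are made up for illustration, and the function names `pearson` and `class_balance` are hypothetical helpers, not part of Azure Machine Learning.

```python
from collections import Counter
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def class_balance(labels):
    """Fraction of records that fall into each class."""
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}

# Made-up sample: overage per customer and whether they churned (1) or stayed (0)
overage = [10, 20, 150, 200, 30, 180]
churned = [0, 0, 1, 1, 0, 1]

print(round(pearson(overage, churned), 3))  # close to 1 means overage rises with churn
print(class_balance(churned))               # a heavily skewed split would need attention
```

A value near +1 or -1 suggests the feature moves with the churn flag, while a value near 0 suggests little linear relationship. Correlation alone does not prove that the feature causes churn.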

Azure Machine Learning does provide some modules straight off the bat that can make the job easier e.g. it has the following modules

Compute Elementary Statistics

Compute Linear Correlation

Getting advanced insights can be tricky depending on your algorithm or the setup of the experiment (project). But there are ways, e.g. with a bit of Python code you can produce the decision tree below. The last label in each box, class = {LEAVE, STAY}, tells us whether the customer will churn based on the path they fall under.

[Image: decision tree generated for the churn model]

Above is the automatically generated insight that tells us that overage is the most important variable in deciding customer churn. If overage exceeds 97.5, a customer is more likely to churn. This does not mean that every customer whose overage is more than 97.5 will churn, nor does it mean that every customer whose overage is less than that will stay. It is just that overage is the strongest indicator of churn based on our data.

We can even derive decision rules from insights like these, e.g. customers with overage less than 97.5 and leftover minutes less than 24.5 are most likely to stay. Conversely, customers with overage more than 97.5 and average income more than $100,059.50 are most likely to leave.
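The two rules just described can be written down as a tiny function. This is only a sketch of those two paths; the function name `churn_rule` and its parameter names are illustrative, and the remaining branches of the tree are not described in this post, so the function returns `None` for them rather than guessing.

```python
def churn_rule(overage, leftover_minutes, average_income):
    """Apply the two decision rules read off the tree; None means 'not covered'."""
    if overage < 97.5 and leftover_minutes < 24.5:
        return "STAY"
    if overage > 97.5 and average_income > 100059.5:
        return "LEAVE"
    # Other paths through the tree are not spelled out in the post
    return None

print(churn_rule(overage=50, leftover_minutes=10, average_income=40000))    # STAY
print(churn_rule(overage=120, leftover_minutes=10, average_income=150000))  # LEAVE
print(churn_rule(overage=120, leftover_minutes=10, average_income=40000))   # None
```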

Here is another one that shows the impact of House Value, Handset Value and other features on the churn

[Image: insight showing the impact of House Value, Handset Value and other features on churn]

Once decision rules have been identified based on the above insights, policies can be made to retain customers who are at risk of churn, e.g. give them discounts, offer them a change of plan, reward them with loyalty offerings, etc.

Where to from here?

Hopefully by now you appreciate the potential of machine learning and recognise the opportunity it provides when it is complemented with traditional information systems like CRMs, ERPs and Document Management systems. The field of machine learning is enormous and sometimes quite complex too, as it is based on scientific techniques and mathematics. You need to understand a lot of theory if you want to get into the black box, i.e. how machine learning does what it does.

But the great thing about the Azure Machine Learning suite is that it makes entry into machine learning easier by taking care of the complexities and giving you an easy-to-understand and easy-to-use environment. You have full control over the data structure and algorithms used in your project. It can be tuned as per the needs of your organisation to get the best possible results.

For example you can tune the example I provided in the following different ways

1. Rather than going with Random Forest you can choose Support Vector Machines or Neural Networks and compare the results.

2. You are not restricted to JavaScript; you can call the web services from a plugin. That way, in a data migration scenario, you can set the prediction scores as the data is being imported.

3. You can also change the confidence percentage threshold to ignore predictions where the confidence is less than a certain amount.

So there are a lot of possibilities. Hope you enjoyed the series.

Happy CRM + ML!!

Posted in Dynamics 365, Machine Learning

Use Machine Learning to predict customers you might lose – Part 2

Continuing our journey from the previous post where we defined the problem of churn prediction, in this instalment let us create the model in Azure Machine Learning. We are trying to predict the likelihood of a customer’s churn based on certain features of their profile, which is stored in the Telecom Customer entity. We will use a technique called Supervised Learning, where we first train the model on our data and let it learn the trends before it can start giving us insights.

Obviously you need access to Azure Machine Learning. Once you log in, you can create a new Experiment. That gives you a workspace designer and a toolbox (somewhat like SSIS/BizTalk) from which you can drag controls and feed them into each other. So it is a flexible model and for most tasks you do not need to write code.

Below is a screenshot of my experiment with toolbox on the left

[Screenshot: the experiment in the workspace designer, with the toolbox on the left]

Machine learning is slightly atypical for the usual CRM audience. I will not be able to fit the full details of each of these tools in this blog, but I will touch on each step so that you understand at a high level what is going on inside these boxes. Let us address them one by one.

Dynamics CRM 2016 Telecom

This module is the input data module where we read the CRM customer information in the form of a dataset. At the time of writing, there is no direct connection available from Azure Machine Learning to CRM Online. But where there is a will, there is a way, i.e. I discovered that you can connect to CRM in the following ways

1. You schedule a daily export of CRM data into a location that Azure Machine Learning can read, e.g. Azure blob storage or a web URL over HTTP

2. You can write a small Python-based module that connects to Dynamics using Azure Directory Services; the module can then pass the data to Azure ML as a DataFrame

From my experience having an automatic sync is not important from Dynamics to Azure ML but it is important the other way round i.e. Azure ML to Dynamics.

Split Data

This module basically splits your data into two sets

1. Training dataset – The data based on which the machine learning model will learn

2. Testing dataset – The data based on which the accuracy of the model will be determined

I have chosen a stratified split, which ensures that the Training and Testing datasets both keep the same proportion of each class being predicted as the full dataset. The split ratio is 80/20, i.e. 80% of the records will be used for training and 20% for testing.
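For readers who like to see the idea in code, here is a minimal Python sketch of an 80/20 stratified split. It is not what the Split Data module runs internally; the function name `stratified_split`, the `status` field and the sample records are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_split(records, label_of, train_fraction=0.8, seed=42):
    """Split records so each class keeps the same share in both sets."""
    by_class = defaultdict(list)
    for record in records:
        by_class[label_of(record)].append(record)

    rng = random.Random(seed)  # fixed seed keeps the split repeatable
    train, test = [], []
    for group in by_class.values():
        shuffled = group[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        train.extend(shuffled[:cut])
        test.extend(shuffled[cut:])
    return train, test

# Made-up data: 100 customers, a quarter of whom have left
customers = [{"id": i, "status": "LEAVE" if i % 4 == 0 else "STAY"} for i in range(100)]
train, test = stratified_split(customers, lambda c: c["status"])
print(len(train), len(test))
```

Splitting class by class is what keeps the LEAVE/STAY ratio the same in both sets; a plain random split could leave the smaller class under-represented in the test data.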

Two-class Decision Forest

This is the main classifier, i.e. the module that does the bulk of the work. The classifier of choice here is a random forest with bootstrap aggregation. Two-class makes sense for us because our prediction has two outcomes, i.e. whether the customer will churn or not.

Random forests are fast classifiers and fairly resistant to overfitting. Rather than relying on a single decision tree, they train many trees, each on a different random sample of your data (together called an ensemble). In the end the predictions of the individual trees are combined to come up with an overall prediction score. You can read more about this classifier here.
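The two ingredients of bootstrap aggregation, resampling the data and combining votes, can be sketched in a few lines of Python. This is a toy illustration rather than the Azure ML implementation: the three "trees" are hand-written stand-in rules with made-up thresholds, and all names are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement; each tree would train on one such sample."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(votes):
    """Return the most common vote (ties go to the vote seen first)."""
    return Counter(votes).most_common(1)[0][0]

def forest_predict(trees, customer):
    """Ask every tree for a prediction and combine them by majority vote."""
    return majority_vote(tree(customer) for tree in trees)

# Stand-ins for trained trees; the thresholds are invented for this example
trees = [
    lambda c: "LEAVE" if c["overage"] > 97.5 else "STAY",
    lambda c: "LEAVE" if c["leftover"] > 24.5 else "STAY",
    lambda c: "LEAVE" if c["income"] > 100000 else "STAY",
]

customer = {"overage": 120, "leftover": 30, "income": 50000}
print(forest_predict(trees, customer))  # two of the three stand-in trees vote LEAVE

rows = list(range(10))
print(bootstrap_sample(rows, random.Random(0)))  # some rows repeat, some are left out
```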

Train Model

This module basically connects the classifier to the data. As you can see in the screenshot of the experiment I posted above, there are two arrows coming out of Split Data; the one on the left is the 80% one, i.e. the training dataset. The output of this module is a trained model that is ready to make predictions.

Score Model

This step uses the trained model from the previous step and tests the accuracy of the model against our test data. Put simply, here we start feeding the model data that it has not seen before and count how many times it gave the correct prediction versus the wrong one.
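Counting hits and misses is simple enough to sketch in Python. The labels below are made up, and `confusion_counts` and `accuracy` are illustrative helper names rather than anything provided by Azure ML; LEAVE is treated as the positive class.

```python
def confusion_counts(actual, predicted, positive="LEAVE"):
    """Count true/false positives and negatives for a two-class prediction."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts

def accuracy(counts):
    """Share of predictions that matched the actual outcome."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total

# Made-up test data: what really happened vs what the model predicted
actual    = ["LEAVE", "STAY", "STAY",  "LEAVE", "STAY"]
predicted = ["LEAVE", "STAY", "LEAVE", "STAY",  "STAY"]

counts = confusion_counts(actual, predicted)
print(counts)
print(accuracy(counts))
```

These four counts are what a confusion matrix lays out in a two-by-two grid.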

Evaluate Model

The scores (hit vs miss) generated from the previous modules are evaluated in this step. In Data Science there are standard metrics to measure this kind of performance e.g. Confusion Matrices, ROC curves and many more. Below is the screenshot of the Confusion matrix

[Screenshots: Evaluate Model results, including the confusion matrix]

I know there are a lot of confusing details here (hence the name Confusion Matrix), but as a rule of thumb we need to focus on AUC, i.e. the area under the ROC curve. As shown in the results above, we have a decent AUC of 72.9%. In layman’s terms, this is the chance that the model scores a randomly chosen customer who churned higher than a randomly chosen customer who stayed; it is not the same thing as the percentage of correct predictions. A higher number is generally better, but a suspiciously high score on a small sample can be a sign of overfitting, i.e. a state where your model does very well on the sample data but not so well on real-world data. So our model is good to go.
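For reference, AUC can be computed directly as the fraction of (churner, non-churner) pairs in which the model gives the churner the higher score, counting ties as half. The Python sketch below does exactly that on made-up scores; the function name `auc` is illustrative, and the pairwise loop is written for clarity rather than speed.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (1 = churned, 0 = stayed)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # churner ranked above non-churner
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(positives) * len(negatives))

# Made-up example: two churners, two non-churners, with the model's churn scores
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(auc(labels, scores))  # 3 of the 4 pairs are ranked correctly
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking.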

You can read more about the metrics and terms above here

In the next blog we will deploy and integrate the model with Dynamics CRM.

Posted in Dynamics 365, Machine Learning

Use Machine Learning to predict customers you might lose – Part 1

“Customer satisfaction is worthless. Customer loyalty is priceless.”

Jeffrey Gitomer

Business is becoming increasingly competitive these days and getting new customers increasingly difficult. The wisest thing to do in this cut-throat scenario is to hold on to your existing customer base while trying to develop new business. Realistically, no matter how hard it tries, every organisation still loses a percentage of its customers every year to the competition. This process of losing customers is called Churn.

Progressive organisations take churn seriously: they want to know in advance approximately how many customers they are going to lose this year and what is causing the churn. Having an insight into customer churn at least gives an organisation an opportunity to proactively take measures to control the churn before it is too late and the customer is gone.

Two pieces of information help the most when it comes to minimising the churn

1. Which customers are we going to lose this year

2. What are the biggest drivers of customer churn

The answers to the above questions are often hidden in the customer data itself, but extracting them from swathes of data is an art, or rather a science, called Data Science. With recent advances in practical Data Science techniques like Machine Learning, getting these answers is becoming increasingly feasible even for small organisations that do not have the luxury of a Data Science team. This is thanks to services like Azure Machine Learning, which are trying to democratise these advanced techniques to a level where even a small customer can leverage them to solve their business puzzles.

Let me show you how your Dynamics CRM can leverage the powerful Machine Learning cortex to get some insights into the key drivers of customer churn. In this blog series, we will build a machine learning model that will answer the questions regarding churn. I have divided the series into four parts as below

Part 1 – Introduction

Part 2 – Creating a Machine Learning model

Part 3 – Integrate the model with Dynamics CRM

Part 4 – Gaining insights within Dynamics CRM

I will take the example of a Telecom organisation, but the model can be extended to any kind of organisation in any capacity and from any industry.

Scenario

Let us say there is a Telecom company called TelcoOrg which uses Microsoft CRM 2016 and has an entity called Telecom Customer that stores each customer’s telco profile. Such a profile may include data regarding the customer’s mobile plan, phone usage, demography and reported satisfaction.

Understanding the features

In data science projects, it is crucial to understand the data points (called features). You need to carefully select those features that are relevant to the problem at hand; some of the features also need to be engineered and normalised before they start generating some information gain.

Below are the features that we will be using in this scenario of our Churn problem, with a quick explanation of the information contained in each

Feature – Information

Has a College degree? – If the customer has a college degree

Cost price of phone – Price of the customer’s phone as per the plan/contract with TelcoOrg

Value of customer’s house – Approximate value of customer’s house based on Property Information websites like RPData, etc.

Average Income – Yearly income as reported by the customer

Leftover minutes per month – Average number of minutes a customer normally does not use from monthly quota

Average call duration – Average duration of calls made based on call history

Usage category – The category customer’s phone usage falls under as compared to other customers e.g. Very High, High, Average, Low or Very Low

Average overcharges – Average number of times a customer is usually overcharged per month

Average long duration calls – Average number of calls a customer usually makes per month that are more than 15 minutes long

Considering change? – How customer responded to TelcoOrg’s survey when asked if they are considering changing to another provider e.g. Yes, considering, Maybe, Not looking, etc.

Reported level of satisfaction – How customer responded to TelcoOrg’s survey when asked if they are satisfied with TelcoOrg’s service e.g. Unsatisfied, Neutral, Satisfied

Account Status – Current Status of the customer (i.e. if they have left or are currently Active)

Predicted Churn Status – This is the predicted status returned by the Azure Machine Learning Web Service

Prediction Confidence Percentage – This field means how confident Azure Machine Learning Web Service is regarding its prediction. A threshold can be set to only consider the predictions above e.g. we can say, take only those predictions where WS is 70% confident.
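As a small illustration of the confidence threshold just mentioned, the Python sketch below keeps only the predictions where the web service is at least 70% confident. The record layout and the field names `predicted_status` and `confidence` are assumptions made for this example, not the actual web service response format.

```python
def confident_predictions(predictions, threshold=70.0):
    """Keep only predictions whose confidence percentage meets the threshold."""
    return [p for p in predictions if p["confidence"] >= threshold]

# Made-up predictions for three customers
predictions = [
    {"customer": "A", "predicted_status": "LEAVE", "confidence": 91.0},
    {"customer": "B", "predicted_status": "STAY", "confidence": 55.5},
    {"customer": "C", "predicted_status": "LEAVE", "confidence": 70.0},
]

kept = confident_predictions(predictions)
print([p["customer"] for p in kept])  # customer B falls below the 70% threshold
```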

The screenshot below shows the information from the Telecom Customer entity. The section highlighted in blue shows the predictions from the Azure Machine Learning web service. Whenever any of the fields on this CRM form changes, the web service updates its prediction scores based on the record’s data. I will provide details later in the series as to how I built this integration.

[Screenshot: Telecom Customer form with the prediction fields highlighted in blue]

Below is a screenshot of some of these records

[Screenshot: list of Telecom Customer records]

We will achieve the following business benefits using Azure Machine Learning

1. Customers who are predicted to be at a higher risk of leaving (churn) can be flagged, so the customer retention teams can get in touch with them to proactively address their concerns in a bid to retain them

2. Find what factors affect churn the most, i.e. out of all these fields we will determine which are more likely to make a customer leave than others

3. We will also get insights into some business rules that dictate churn i.e. the drivers

I hope you understand the problem now and find it interesting so far. Let us meet in the next part of the blog where I will show how a machine learning model is created.