Selection:

The goal of this case is try to generate a decision tree to classify customer calls duration. In order to achieve this goal, we would use customer information, marketing data as well as phone call details information.

Data selected

Field Name / Table Name / Description
CustomerID / CDemographics / The ID of the customer
Sex / CDemographics / The sex of a customer
Date of birth / CDemographics / The date of birth of a customer
Education Level / CDemographics / The education level of a customer
PhoneNo / CDemographics / Phone number of the customer
Joining Date / CDemographics / Date of joining ABC telecom
Join From / CDemographics / Which telecom company does the customer previously joined
CustomerID / MarketingData / The ID of the customer in the MarketingData
Special Offer / MarketingData / Special Offer offered to the customer.
PhoneNo / PhoneCallDetails / The origin phone number which makes the call
StartTime / PhoneCallDetails / Time of making the call
EndTime / PhoneCallDetails / Time of ending the call

Preprocessing

Missing value

Field Name / Table Name / No. of missing record / Total no. of record
in table / Percentage
Date of birth / CDemographics / 4 / 256 / 1.56%
Joining Date / CDemographics / 7 / 256 / 2.73%

There are total 10 records which missing “DateOfBirth” or “JoiningDate”. Since there is only a small portion (3.91%), we would ignore those records with missing field “DateOfBirth” or “JoiningDate”.

JoinForm in CDemographics table

There are totally 10 possible value of field “JoinForm” in CDemographics table. Some of the value is very close to each other. It is believed that these should be caused by data entry error. The field “JoinForm” would be corrected as shown in the following table.

Correction Table

Original Value / New Value
Anemal / Animal
Animal / Animal
Apple / Apple
FolishTnoe / FolishTone
FoolishTone / FoolishTone
Monday / Monday
OldWorld / OldWorld
OldWrold / OldWorld
Thre2One / Three2One
Three2One / Three2One

EducationLevel in CDemographics table

This field consists an unknown value “Sec” and there are total 23 records have this value. It is believed the value of this 23 records should be S (Secondary). The mapping between the original values and corrected values are shown in the following table.

Original Value / New Value
A / A
D / D
P / P
S / S
Sec / S
T / T

Ambiguous meaning

EducationLevel in CDemographics table

The group “D” (Degree) and “A” (Degree or above) are overlapped. The group “D” (Degree) would be merged into group “A” (Degree or above).


Transformation – Initial Stage

Summering Data

Age:

The age is derived from the “Dateofbirth” field in the CDemographics table.

The age would be calculated by the following formula:

Age = (timestampdiff(MONTH, dateofbirth, now()) / 12)

The unit of this field is year. Please note that the above formula is a portion of the resulting SQL statement.

No. of joined month:

The field no. of joined month is derived from “joiningDate” field in the CDemographics table.

The age would be calculated by the following formula:

No. of joined month = timestampdiff(MONTH, JoiningDate, now())

The unit of this field is month. Please note that the above formula is a portion of the resulting SQL statement.

Average Call Duration

The field CallDuration is derived from difference of starttime and endtime divided by the total number of call in the phonecalldetails. The CallDuration is calculated by a view created in the database. The name of the view is called “PhoneCallDetails3”. The unit of this field is minute.


Special Offer

There are three fields about Special Offer would be created. Their name is A_SPECIAL_OFFER, B_SPECIAL_OFFER, C_SPECIAL_OFFER. A view “MarketingData2” is created in order to retrieved these three fields. Basically, if a customer has a record in marketingData table and the value of “SpecialOffer” is “A”, then the derived field A_SPECIAL_OFFER would be ‘Y’. Similar logic is applied to fields B_SPECIAL_OFFER and C_SPECIAL_OFFER.

Binning

Age

Initially we would divide this field into 3 bins by using fixed width binning method.

Bin / From / To
Bin 1 / 20 / 31.2
Bin 2 / 31.2 / 42.4
Bin 3 / 42.4 / 53.6
Bin 4 / 53.6 / 64.8
Bin 5 / 64.8 / 76

No. of joined month:

Initial we would divide this field into 6 bins by using fixed width binning method.

Bin / From / To
Bin 1 / 28 / 37.8333
Bin 2 / 37.8333 / 47.6666
Bin 3 / 47.6666 / 57.5
Bin 4 / 57.5 / 67.3333
Bin 5 / 67.3333 / 77.1666
Bin 6 / 77.1666 / 87

Average Call Duration

Initial we would divide this field into 3 bins by using fixed width binning method.

Bin / From / To
Bin 1 / 0.0167 / 55.5667
Bin 2 / 55.5667 / 111.1167
Bin 3 / 111.1167 / 166.667


Data mining – Initial Stage

We would like to use C5 algorithm to generated decision tree for classify customers average call duration.

Parameters

Parameter Name / Description
Sex / Sex of the customer
A_SpecialOffer / Indicator for Customer having Special Offer A.
If customer has special offer a, the value of this field is Y, else N
B_SpecialOffer / Indicator for Customer having Special Offer B.
If customer has special offer b, the value of this field is Y, else N
C_SpecialOffer / Indicator for Customer having Special Offer C.
If customer has special offer c, the value of this field is Y, else N
JoinFrom / Which telecom company does the customer using before joining ABC telecom.
CorrectedEducation / The education level of customer.
It consists of only four possible values.
P – Primary School
S – Secondary School
T – Tertiary Education
A – Degree or above
Age_Bin / The binned value of the customer age
Joined_Bin / The binned value of the no. of month does the customer joined ABC telecom.
CallDuration_BIN / The binned value of the customer average monthly air time.
CallStartTime / The start time of the call, Either “Peak” or “Not-Peak”


Data Mining 1st Stage

Setting of Clementine

SQL statement provided to Clementine:

SELECT

Sex,

(timestampdiff(MONTH, dateofbirth, now()) / 12) as age,

EducationLevel,

timestampdiff(MONTH, JoiningDate, now()) as joined,

JoinFrom,

A_SPECIAL_OFFER,

B_SPECIAL_OFFER,

C_SPECIAL_OFFER,

CallDuration,

CallStartTime

FROM cdemographics, MarketingData2, Phonecalldetails3

WHERE cdemographics.customerid = marketingdata2.customerid

AND cdemographics. phoneNo = Phonecalldetails3.phoneNo

and dateofbirth is not null

and JoiningDate is not null

and cell2 is not null


Generated Decision Tree


Test result of 1st run

Evaluation – 1st Stage

The generated decision tree is too large, we try to do a tree pruning in order to decrease the size of the tree.

Transformation – 2nd Stage

In this stage, we are going to change the no. of bin of average call air time from 3 to 2 and also the generated decision tree is too large. Tree Pruning would be applied.

Average Call Duration

Bin / From / To
Bin 1 / 0.0167 / 83.3417
Bin 2 / 83.3417 / 166.667


Data Mining 2nd Stage

Tree pruning setting

Generated Decision Tree in 2nd stage

Test result of 2nd stage


Evaluation – 2nd Stage

After going through the 1st stage and 2nd stage, the generated decision tree is not very accurate; it is only about 53% which means that there should be some other factors which affect the average duration of the call but they are not included in the dataset provided. That should be the best decision tree could be genreated from what we have known.

Interpretation

Mapping of number of month joined ABC telecom

Bin / From / To / Logical Name
Bin 1 / 28 / 37.8333 / Joined 2 to 3 years
Bin 2 / 37.8333 / 47.6666 / Joined 3 to 4 years
Bin 3 / 47.6666 / 57.5 / Joined 4 to 5 years
Bin 4 / 57.5 / 67.3333 / Joined 5 to 6 years
Bin 5 / 67.3333 / 77.1666 / Joined 6 to 7 years
Bin 6 / 77.1666 / 87 / Joined 7 to 8 years

Mapping of age of customer

Bin / From / To / Logic Name
Bin 1 / 20 / 31.2 / Very Young Customer
Bin 2 / 31.2 / 42.4 / Young customer
Bin 3 / 42.4 / 53.6 / Middle Aged Customer
Bin 4 / 53.6 / 64.8 / Old Customer
Bin 5 / 64.8 / 76 / Very Old Customer

Mapping of age of average call duration of customer

Bin / From / To / Logical Name
Bin 1 / 0.0167 / 83.3417 / Short
Bin 2 / 83.3417 / 166.667 / Long


Totally there are 13 rules generated and they are listed below

Rule 1 / If a customer joined from 2 to 3 years and the customer is young and her sex = female then the average call duration is short.
Rule 2 / If a customer joined from 2 to 3 years and the customer is old and the customer does not have special offer A then the average call duration is short
Rule 3 / If a customer joined from 2 to 3 years and the customer is very young and his sex = male then the average call duration is short.
Rule 4 / If a customer joined for 3 to 4 years then the customer’s average call duration is short.
Rule 5 / If a customer joined for 4 to 5 years then the customer’s average call duration is short.
Rule 6 / If a customer joined for 5 to 6 years and the customer is young then average call duration is short.
Rule 7 / If a customer joined for 5 to 6 years and the customer is middle age then average call duration is short.
Rule 8 / If a customer joined for 5 to 6 years and the customer is very old and the customer does not have special offer A then average call duration is short.
Rule 9 / If a customer joined for 6 to 7 years and the customer is old then average call duration is short.
Rule 10 / If a customer joined for 7 to 8 years and his sex is male then average call duration is short.
Rule 11 / If a customer joined for 2 to3 years and the customer is very young then average call duration is long.
Rule 12 / If a customer joined for 2 to3 years and the customer is young and his sex is male then average call duration is long.
Rule 13 / If a customer joined for 2 to3 years and the customer is middle age then average call duration is long.
Rule 14 / If a customer joined for 2 to3 years and the customer is old and he has special offer A then average call duration is long.
Rule 15 / If a customer joined for 2 to3 years and the customer is very old and her sex is female then average call duration is long.
Rule 16 / If a customer joined for 5 to6 years and the customer is very young and her sex is female then average call duration is long.
Rule 17 / If a customer joined for 5 to 6 years and the customer is old then average call duration is long.
Rule 18 / If a customer joined for 5 to 6 years and the customer is very old and the customer does not have special offer A then average call duration is long.
Rule 19 / If a customer joined for 6 to 7 years and the customer is very young then average call duration is long.
Rule 20 / If a customer joined for 6 to 7 years and the customer is young then average call duration is long.
Rule 21 / If a customer joined for 6 to 7 years and the customer is middle age then average call duration is long.
Rule 22 / If a customer joined for 6 to 7 years and the customer is very old then average call duration is long.
Rule 23 / If a customer joined for 6 to 7 years and her sex is female then average call duration is long.

From the above rules, we found that the no. of month that the customer joined ABC telecom and the age of the customer are very important factors to classify the length of the average call duration. For those customers who joined ABC telecom very long time or very short time, they tend to have long average call duration. On the other hand, for those customers who in the middle range, they usually have short average call duration. By comparing rule 2 and rule 14, and rule 8 and rule 18, we found that special Offer A is not always an effective way to increase the average duration of call or we should say special offer A is only working on a specific group of customer.

The following suggestions are made based on above finding in order to increase average call duration of customer.

1.  Offer special offer A to customer who joined from 2 to 3 years and the customer is old.

2.  Special promotion of target user, for example customer joined ABC telecom from 4 to 6 years.