library(ggplot2)
library(survival)
library(corrplot)
library(car)
# read data
df <- read.csv("acquisition_and_defection_data.csv")
# show the structure of the data
str(df)
'data.frame': 500 obs. of 14 variables:
$ Customer : int 1 2 3 4 5 6 7 8 9 10 ...
$ Acquisition : int 1 0 0 1 1 0 0 0 1 1 ...
$ Duration : int 384 0 0 730 579 0 0 0 730 730 ...
$ Retention : int 0 0 0 1 0 0 0 0 1 1 ...
$ First_Purchase: int 434 0 0 226 363 0 0 0 599 271 ...
$ Acq_Expense : int 760 148 253 610 672 436 363 884 452 787 ...
$ Ret_Expense : int 2310 0 0 2193 801 0 0 0 1341 2266 ...
$ Future_CLV : int 0 0 0 5732 0 0 0 0 6916 6084 ...
$ Industry : int 1 1 1 1 1 0 0 0 1 1 ...
$ Revenue : num 30.2 39.8 54.9 45.8 69 ...
$ Employees : int 1240 166 1016 122 313 359 902 264 1782 539 ...
$ Breadth : int 5 0 0 2 4 0 0 0 1 1 ...
$ Frequency : int 2 0 0 12 7 0 0 0 11 14 ...
$ X : logi NA NA NA NA NA NA ...
# convert variables to factors
cols <- c("Acquisition", "Retention", "Industry")
df[cols] <- lapply(df[cols], factor)
# relabel factor levels as No/Yes
levels(df$Acquisition) <- make.names(c("No", "Yes"))
levels(df$Retention) <- make.names(c("No", "Yes"))
levels(df$Industry) <- make.names(c("No", "Yes"))
# delete X column from df
df$X <- NULL
# show sample of the data
head(df)
Understanding Acquisition: In this first analysis, develop an acquisition model (acquired or not) based upon the following variables: a) acquisition expense, b) Industry, c) Firm revenue and d) Firm employees. Give a basic assessment of the predictive ability of your model and then provide a brief overview of the factors influencing successful acquisition. Since the number of predictor variables is limited, be sure to explore non-linear and/or interaction/moderating effects where possible.
# fit initial logistic regression model to predict acquisition
acq_glm_fit <- glm(Acquisition ~ Acq_Expense + Industry + Revenue + Employees,
data = df,
family = binomial(link = "logit"))
# fit the intercept-only (null) model for comparison
int_fit <- glm(Acquisition ~ 1,
data = df,
family = binomial(link = "logit"))
# show the results of the model
summary(acq_glm_fit)
Call:
glm(formula = Acquisition ~ Acq_Expense + Industry + Revenue +
Employees, family = binomial(link = "logit"), data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.7653 -0.1760 0.0224 0.2127 1.9177
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.352e+01 1.497e+00 -9.033 < 2e-16 ***
Acq_Expense 2.148e-02 2.219e-03 9.677 < 2e-16 ***
IndustryYes 1.126e-01 3.640e-01 0.309 0.7570
Revenue 2.741e-02 1.110e-02 2.469 0.0135 *
Employees 3.813e-03 5.477e-04 6.961 3.38e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 678.97 on 499 degrees of freedom
Residual deviance: 217.66 on 495 degrees of freedom
AIC: 227.66
Number of Fisher Scoring iterations: 7
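The prompt asks for a basic assessment of predictive ability; one minimal in-sample sketch (the object names acq_prob, acq_pred, and conf_mat below are illustrative, not part of the original script) is a confusion matrix at a 0.5 cutoff:
# (sketch) in-sample predicted probabilities and class predictions at a 0.5 cutoff
acq_prob <- predict(acq_glm_fit, type = "response")
acq_pred <- ifelse(acq_prob > 0.5, "Yes", "No")
# confusion matrix of predicted vs. actual acquisition, and overall accuracy
conf_mat <- table(Predicted = acq_pred, Actual = df$Acquisition)
conf_mat
sum(diag(conf_mat)) / sum(conf_mat)
A held-out split or cross-validation would give a stricter estimate; the in-sample figure is only a first check.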
# subset dataframe to only continuous variables
cont_vars <- df[c("Acq_Expense", "Revenue", "Employees")]
# panel function to print correlations and p-values in the upper panels of pairs()
panel.cor <- function(x, y, digits = 2, cex.cor, ...)
{
usr <- par("usr"); on.exit(par(usr = usr)) # restore user coordinates on exit
par(usr = c(0, 1, 0, 1))
# correlation coefficient
r <- cor(x, y)
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste("r= ", txt, sep = "")
text(0.5, 0.6, txt)
# p-value calculation
p <- cor.test(x, y)$p.value
txt2 <- format(c(p, 0.123456789), digits = digits)[1]
txt2 <- paste("p= ", txt2, sep = "")
if(p<0.01) txt2 <- paste("p= ", "<0.01", sep = "")
text(0.5, 0.4, txt2)
}
# create scatterplots and correlations
pairs(cont_vars, upper.panel = panel.cor)
# show VIF for model
vif(acq_glm_fit)
Acq_Expense Industry Revenue Employees
1.924360 1.032087 1.098079 1.852277
# likelihood ratio test of the full model against the intercept-only model
anova(int_fit,acq_glm_fit,test="LRT")
Analysis of Deviance Table
Model 1: Acquisition ~ 1
Model 2: Acquisition ~ Acq_Expense + Industry + Revenue + Employees
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 499 678.97
2 495 217.66 4 461.31 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
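As the prompt suggests exploring non-linear and interaction effects, one hedged sketch (the quadratic and interaction terms are illustrative candidates, and acq_glm_fit2 is a new name) expands the specification and compares it to the base model:
# (sketch) add a quadratic acquisition-expense term and an expense-by-industry interaction
acq_glm_fit2 <- glm(Acquisition ~ Acq_Expense + I(Acq_Expense^2) + Industry + Acq_Expense:Industry + Revenue + Employees,
data = df,
family = binomial(link = "logit"))
# compare the nested models by likelihood ratio test and AIC
anova(acq_glm_fit, acq_glm_fit2, test = "LRT")
AIC(acq_glm_fit, acq_glm_fit2)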
Modeling Defection: The next step is to model the defection/retention process for the 292 customers that were acquired. In doing so, utilize two different techniques: logistic regression and survival analysis. You may employ any of the variables you feel appropriate in either or both models. Again, alternative expressions of the variables may be useful. Also, can information from the acquisition model be incorporated into the defection models?
# number of potential customers vs acquired
table(df$Acquisition)
No Yes
208 292
# subset dataset into acquired customers
cust_acq <- df[df$Acquisition == "Yes", ]
# sort df by Future_CLV
cust_acq <- cust_acq[order(cust_acq$Future_CLV),]
# plot FutureCLV by Retention
ggplot(data = cust_acq) +
geom_point(mapping = aes(x = seq(1, length(cust_acq$Future_CLV)), y = Future_CLV, color = Retention)) +
ggtitle("Future CLV vs. Rention for Acquired Customers") +
labs(x = "Rank",y = "Future CLV") +
theme(plot.title = element_text(lineheight = 0.8, face = "bold", hjust = 0.5)) +
theme(legend.position = c(0.1, 0.85))
# sort df by Duration
cust_acq <- cust_acq[order(cust_acq$Duration),]
# plot Duration by Retention
ggplot(data = cust_acq) +
geom_point(mapping = aes(x = seq(1, length(cust_acq$Duration)), y = Duration, color = Retention)) +
ggtitle("Duration vs. Rention for Acquired Customers") +
labs(x = "Rank",y = "Duration") +
theme(plot.title = element_text(lineheight = 0.8, face = "bold", hjust = 0.5)) +
theme(legend.position = c(0.1, 0.85))
# fit initial logistic regression model for retention (churn), excluding Future_CLV and Duration
churn_glm_fit <- glm(Retention ~ First_Purchase + Acq_Expense + Ret_Expense + Industry + Revenue + Employees + Breadth + Frequency,
data = cust_acq,
family = binomial(link = "logit"))
Warning message: glm.fit: fitted probabilities numerically 0 or 1 occurred
# show the results of the model
summary(churn_glm_fit)
Call:
glm(formula = Retention ~ First_Purchase + Acq_Expense + Ret_Expense +
Industry + Revenue + Employees + Breadth + Frequency, family = binomial(link = "logit"),
data = cust_acq)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.31948 -0.02187 0.00000 0.01035 2.52118
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.720040 3.719032 0.731 0.464545
First_Purchase 0.014186 0.010309 1.376 0.168780
Acq_Expense -0.062551 0.012759 -4.903 9.45e-07 ***
Ret_Expense 0.009469 0.001786 5.301 1.15e-07 ***
IndustryYes 6.456693 1.439686 4.485 7.30e-06 ***
Revenue 0.060758 0.037490 1.621 0.105096
Employees -0.002217 0.002530 -0.876 0.380835
Breadth 0.973277 0.290250 3.353 0.000799 ***
Frequency 1.084277 0.212643 5.099 3.41e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 403.14 on 291 degrees of freedom
Residual deviance: 59.85 on 283 degrees of freedom
AIC: 77.85
Number of Fisher Scoring iterations: 9
# show VIF for model
vif(churn_glm_fit)
First_Purchase Acq_Expense Ret_Expense Industry Revenue Employees Breadth Frequency
16.085131 18.066321 8.440694 4.475781 4.039566 10.831953 2.303818 11.646520
# subset dataframe to only continuous variables
cont_vars <- cust_acq[c("First_Purchase", "Acq_Expense", "Ret_Expense", "Revenue", "Employees", "Breadth", "Frequency")]
# create scatterplots and correlations
pairs(cont_vars, upper.panel = panel.cor)
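Since corrplot is loaded but not yet used, the same correlation structure can also be summarized numerically; a minimal sketch:
# (sketch) numeric correlation matrix for the continuous predictors
corrplot(cor(cont_vars), method = "number", type = "upper")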
# refit the logistic regression model without Acq_Expense
churn_glm_fit1 <- glm(Retention ~ First_Purchase + Ret_Expense + Industry + Revenue + Employees + Breadth + Frequency,
data = cust_acq,
family = binomial(link = "logit"))
# show the results of the model
summary(churn_glm_fit1)
Call:
glm(formula = Retention ~ First_Purchase + Ret_Expense + Industry +
Revenue + Employees + Breadth + Frequency, family = binomial(link = "logit"),
data = cust_acq)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.46126 -0.34800 -0.02215 0.31653 2.60652
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.724e+01 2.236e+00 -7.711 1.25e-14 ***
First_Purchase 5.203e-02 6.836e-03 7.612 2.70e-14 ***
Ret_Expense 3.628e-03 5.562e-04 6.522 6.92e-11 ***
IndustryYes 2.258e+00 4.868e-01 4.638 3.52e-06 ***
Revenue -1.231e-01 2.220e-02 -5.545 2.95e-08 ***
Employees -1.164e-02 1.632e-03 -7.135 9.68e-13 ***
Breadth 2.130e-01 1.231e-01 1.731 0.0835 .
Frequency 4.115e-01 6.245e-02 6.588 4.45e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 403.14 on 291 degrees of freedom
Residual deviance: 157.63 on 284 degrees of freedom
AIC: 173.63
Number of Fisher Scoring iterations: 7
# show VIF for model
vif(churn_glm_fit1)
First_Purchase Ret_Expense Industry Revenue Employees Breadth Frequency
19.305937 1.924286 1.336773 3.495110 13.721545 1.104398 2.089035
# subset dataframe to only continuous variables
cont_vars <- cust_acq[c("First_Purchase", "Ret_Expense", "Revenue", "Employees", "Breadth", "Frequency")]
# create scatterplots and correlations
pairs(cont_vars, upper.panel = panel.cor)
# refit the logistic regression model without First_Purchase
churn_glm_fit2 <- glm(Retention ~ Ret_Expense + Industry + Revenue + Employees + Breadth + Frequency,
data = cust_acq,
family = binomial(link = "logit"))
# show the results of the model
summary(churn_glm_fit2)
Call:
glm(formula = Retention ~ Ret_Expense + Industry + Revenue +
Employees + Breadth + Frequency, family = binomial(link = "logit"),
data = cust_acq)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.6710 -0.8412 -0.2798 0.9119 1.9508
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.3965059 1.0031115 -7.374 1.66e-13 ***
Ret_Expense 0.0017371 0.0002934 5.921 3.19e-09 ***
IndustryYes 1.2532274 0.3084229 4.063 4.84e-05 ***
Revenue 0.0281038 0.0087820 3.200 0.00137 **
Employees 0.0009735 0.0003055 3.187 0.00144 **
Breadth 0.0572668 0.0843417 0.679 0.49715
Frequency 0.1939535 0.0341566 5.678 1.36e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 403.14 on 291 degrees of freedom
Residual deviance: 305.40 on 285 degrees of freedom
AIC: 319.4
Number of Fisher Scoring iterations: 5
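A quick summary of the reduced model's fit, as a sketch, is McFadden's pseudo-R-squared computed from the deviances reported above (roughly 0.24 here):
# (sketch) McFadden pseudo-R-squared: 1 - residual deviance / null deviance
1 - churn_glm_fit2$deviance / churn_glm_fit2$null.deviance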
# show VIF for model
vif(churn_glm_fit2)
Ret_Expense Industry Revenue Employees Breadth Frequency
1.174202 1.139574 1.082365 1.127100 1.021835 1.258904
# subset dataframe to only continuous variables
cont_vars <- cust_acq[c("Ret_Expense", "Revenue", "Employees", "Breadth", "Frequency")]
# create scatterplots and correlations
pairs(cont_vars, upper.panel = panel.cor)
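The prompt also asks whether information from the acquisition model can be carried over. One hedged sketch (acq_prob_hat and churn_glm_fit3 are illustrative names) adds the acquisition model's fitted probability as an extra covariate in the retention logit:
# (sketch) predicted acquisition probability from the earlier model, used as a covariate
cust_acq$acq_prob_hat <- predict(acq_glm_fit, newdata = cust_acq, type = "response")
churn_glm_fit3 <- glm(Retention ~ Ret_Expense + Industry + Revenue + Employees + Breadth + Frequency + acq_prob_hat,
data = cust_acq,
family = binomial(link = "logit"))
summary(churn_glm_fit3)
Because this probability is estimated on the same customers, treat any effect as exploratory rather than causal.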
# create a "survival object" for each observation, using time and churn data.
cust_acq$survival <- Surv(cust_acq$Duration, cust_acq$Retention == "No")
# fit a Kaplan-Meier survival curve for all acquired customers
fit <- survfit(survival ~ 1, data = cust_acq)
# plot the survival curve with a title
plot(fit, lty = 1, mark.time = FALSE, ylim = c(.75, 1), main = "Kaplan-Meier Survival Curve for Acquired Customers", xlab = 'Days since Subscribing', ylab = 'Percent Surviving')
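Before moving to a semi-parametric model, stratifying the Kaplan-Meier curve can hint at which covariates matter; a sketch by Industry (chosen only because it is the lone categorical predictor):
# (sketch) Kaplan-Meier curves split by Industry
fit_ind <- survfit(survival ~ Industry, data = cust_acq)
plot(fit_ind, lty = 1:2, mark.time = FALSE, ylim = c(.75, 1), xlab = 'Days since Subscribing', ylab = 'Percent Surviving')
legend("bottomleft", legend = levels(cust_acq$Industry), lty = 1:2)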
# fit a Cox proportional hazards model
churn_coxph <- coxph(survival ~ Ret_Expense + Industry + Revenue + Employees + Breadth + Frequency,
data = cust_acq)
# show the results of the model
summary(churn_coxph)
Call:
coxph(formula = survival ~ Ret_Expense + Industry + Revenue +
Employees + Breadth + Frequency, data = cust_acq)
n= 292, number of events= 157
coef exp(coef) se(coef) z Pr(>|z|)
Ret_Expense -0.0011142 0.9988864 0.0001553 -7.173 7.36e-13 ***
IndustryYes -0.7286930 0.4825393 0.1644832 -4.430 9.41e-06 ***
Revenue -0.0187239 0.9814503 0.0049961 -3.748 0.000178 ***
Employees -0.0004553 0.9995448 0.0001767 -2.576 0.009994 **
Breadth -0.0772937 0.9256180 0.0474889 -1.628 0.103606
Frequency -0.1115135 0.8944794 0.0182321 -6.116 9.58e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
Ret_Expense 0.9989 1.001 0.9986 0.9992
IndustryYes 0.4825 2.072 0.3496 0.6661
Revenue 0.9815 1.019 0.9719 0.9911
Employees 0.9995 1.000 0.9992 0.9999
Breadth 0.9256 1.080 0.8434 1.0159
Frequency 0.8945 1.118 0.8631 0.9270
Concordance= 0.722 (se = 0.024 )
Rsquare= 0.306 (max possible= 0.997 )
Likelihood ratio test= 106.5 on 6 df, p=0
Wald test = 99.41 on 6 df, p=0
Score (logrank) test = 101.1 on 6 df, p=0
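The Cox model assumes proportional hazards; a standard diagnostic, sketched here, is cox.zph() from the survival package:
# (sketch) test the proportional hazards assumption per covariate and globally
ph_test <- cox.zph(churn_coxph)
ph_test
# Schoenfeld residual plots; a systematic trend over time would flag a violation
plot(ph_test)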