In [3]:
source('normality.r')
In [4]:
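d=read.table("titanic.data") # no header=TRUE: the header line is read as row 1, so columns are named V1..V5 and stored as factors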
In [5]:
head(d)
summary(d)
A data.frame: 6 × 5
V1                                             V2      V3    V4      V5
<fct>                                          <fct>   <fct> <fct>   <fct>
Name                                           PClass  Age   Gender  Survived
Allen, Miss Elisabeth Walton                   1st     29    female  1
Allison, Miss Helen Loraine                    1st     2     female  0
Allison, Mr Hudson Joshua Creighton            1st     30    male    0
Allison, Mrs Hudson JC (Bessie Waldo Daniels)  1st     25    female  0
Allison, Master Hudson Trevor                  1st     0.92  male    1
                            V1            V2            V3           V4     
 Carlsson, Mr Frans Olof     :   2   1st   :322   22     : 35   female:462  
 Connolly, Miss Kate         :   2   2nd   :280   21     : 31   Gender:  1  
 Kelly, Mr James             :   2   3rd   :711   30     : 31   male  :851  
 Abbing, Mr Anthony          :   1   PClass:  1   18     : 30               
 Abbott, Master Eugene Joseph:   1                36     : 29               
 Abbott, Mr Rossmore Edward  :   1                (Other):601               
 (Other)                     :1305                NA's   :557               
        V5     
 0       :863  
 1       :450  
 Survived:  1  
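
The summary above lists `PClass`, `Gender` and `Survived` as factor levels, which suggests that the header line of the file was read as a data row. Below is a minimal sketch of the difference `header = TRUE` makes, using a small temporary file whose rows are copied from the `head(d)` output. The quoting of the name field is an assumption about the file layout, and the rest of this notebook still relies on the `V1`..`V5` names.

```r
# Toy file: rows copied from the head(d) output; the quoting is an assumption.
tmp <- tempfile(fileext = ".data")
writeLines(c(
  '"Name" "PClass" "Age" "Gender" "Survived"',
  '"Allen, Miss Elisabeth Walton" "1st" 29 "female" 1',
  '"Allison, Miss Helen Loraine" "1st" 2 "female" 0',
  '"Allison, Master Hudson Trevor" "1st" 0.92 "male" 1'
), tmp)

raw   <- read.table(tmp)                 # header line becomes data row 1
clean <- read.table(tmp, header = TRUE)  # header line becomes the column names

str(raw)    # V1..V5, no numeric columns
str(clean)  # Name, PClass, Age, Gender, Survived; Age is numeric
unlink(tmp)
```

With the header parsed, `Age` needs no `as.numeric(as.character(...))` conversion and there is no stray first row to drop.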
               
               
               
               

Q1) Is there a significant difference in age distribution between those who survived and those who did not?

In [6]:
S=subset(d,d$V5==1) # Survivors
NS = subset(d,d$V5==0) # Non-Survivors
In [7]:
t1=as.numeric(as.character(S$V3))
t2=as.numeric(as.character(NS$V3))
t1 = na.omit(t1)
t2 = na.omit(t2)
In [8]:
par(mfrow=c(1,2))
hist(t1)
boxplot(t1)
In [9]:
par(mfrow=c(1,2))
hist(t2)
boxplot(t2)
In [10]:
par(mfrow=c(1,2))
qqnorm(t1)
qqnorm(t2)
In [11]:
par(mfrow=c(1,2))
plot(ecdf(t2))
lines(seq(0,60,1),pnorm(seq(0,60,1),mean(t2),sd(t2)))
plot(ecdf(t1))
lines(seq(0,60,1),pnorm(seq(0,60,1),mean(t1),sd(t1)))
In [12]:
normtest(t1)
normtest(t2)
A data.frame: 5 × 2
Method                                          P.value
<fct>                                           <dbl>
Shapiro-Wilk normality test                     0.0004996137
Anderson-Darling normality test                 0.0017831268
Cramer-von Mises normality test                 0.0068092810
Lilliefors (Kolmogorov-Smirnov) normality test  0.0322461921
Shapiro-Francia normality test                  0.0020591830
Warning message in cvm.test(x):
“p-value is smaller than 7.37e-10, cannot be computed more accurately”
A data.frame: 5 × 2
Method                                          P.value
<fct>                                           <dbl>
Shapiro-Wilk normality test                     1.461444e-09
Anderson-Darling normality test                 5.014279e-16
Cramer-von Mises normality test                 7.370000e-10
Lilliefors (Kolmogorov-Smirnov) normality test  2.646866e-16
Shapiro-Francia normality test                  1.719147e-08

Both samples show strong evidence against normality (every test rejects at the 5% level), so a non-parametric test is used.
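
`normality.r` is not shown in this notebook. The method names in the output and the `cvm.test` warning suggest that `normtest` wraps `shapiro.test` together with the tests from the `nortest` package. The sketch below is a guess at such a helper under that assumption (the name `normtest_sketch` is hypothetical); it falls back to Shapiro-Wilk alone when `nortest` is not installed.

```r
# Hypothetical stand-in for normtest(); the real normality.r may differ.
normtest_sketch <- function(x) {
  x <- x[!is.na(x)]
  tests <- list(shapiro.test)
  if (requireNamespace("nortest", quietly = TRUE)) {
    tests <- c(tests, list(nortest::ad.test, nortest::cvm.test,
                           nortest::lillie.test, nortest::sf.test))
  }
  results <- lapply(tests, function(f) f(x))
  # Each result is an "htest" object carrying a method name and a p-value
  data.frame(
    Method  = vapply(results, function(r) r$method, character(1)),
    P.value = vapply(results, function(r) r$p.value, numeric(1))
  )
}

set.seed(1)
res <- normtest_sketch(rexp(200))  # simulated, clearly non-normal sample
res
```

Small p-values in every row, as in the tables above, point away from normality.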

In [13]:
wilcox.test(t1,t2)
	Wilcoxon rank sum test with continuity correction

data:  t1 and t2
W = 65469, p-value = 0.1917
alternative hypothesis: true location shift is not equal to 0

Ans Q1) There is no significant difference in age distribution between survivors and non-survivors (Wilcoxon rank sum test, p = 0.1917).
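
One caveat on this answer: the Wilcoxon rank sum test is mainly sensitive to a shift in location, so a non-significant result does not rule out other differences in shape or spread. The sketch below uses simulated samples (not the Titanic ages) with the same centre but different spread, and compares `wilcox.test` with the two-sample Kolmogorov-Smirnov test `ks.test`. On the real ages, which contain many ties, `ks.test` may warn that its p-value is approximate.

```r
set.seed(42)                         # simulated data, not the Titanic ages
x <- rnorm(300, mean = 30, sd = 5)
y <- rnorm(300, mean = 30, sd = 20)

w  <- wilcox.test(x, y)              # mainly sensitive to a location shift
ks <- ks.test(x, y)                  # sensitive to any difference between the ECDFs

c(wilcoxon = w$p.value, ks = ks$p.value)
```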

Q2) Is there a significant difference in age distribution between those who survived and those who did not after controlling for gender?

Controlling for gender: male

In [14]:
S=subset(d,(d$V5==1) & (d$V4=='male')) # Survivors
NS = subset(d,(d$V5==0)&(d$V4=='male')) # Non-Survivors
t1=as.numeric(as.character(S$V3))
t2=as.numeric(as.character(NS$V3))
t1 = na.omit(t1)
t2 = na.omit(t2)
In [15]:
par(mfrow=c(1,2))
hist(t1)
boxplot(t1)
In [16]:
par(mfrow=c(1,2))
hist(t2)
boxplot(t2)
In [17]:
par(mfrow=c(1,2))
qqnorm(t1)
qqnorm(t2)
In [18]:
par(mfrow=c(1,2))
plot(ecdf(t2))
lines(seq(0,60,1),pnorm(seq(0,60,1),mean(t2),sd(t2)))
plot(ecdf(t1))
lines(seq(0,60,1),pnorm(seq(0,60,1),mean(t1),sd(t1)))
In [19]:
normtest(t1)
normtest(t2)
A data.frame: 5 × 2
Method                                          P.value
<fct>                                           <dbl>
Shapiro-Wilk normality test                     0.004201276
Anderson-Darling normality test                 0.008390771
Cramer-von Mises normality test                 0.045051620
Lilliefors (Kolmogorov-Smirnov) normality test  0.059854415
Shapiro-Francia normality test                  0.013508441
Warning message in cvm.test(x):
“p-value is smaller than 7.37e-10, cannot be computed more accurately”
A data.frame: 5 × 2
Method                                          P.value
<fct>                                           <dbl>
Shapiro-Wilk normality test                     6.368376e-10
Anderson-Darling normality test                 2.227363e-16
Cramer-von Mises normality test                 7.370000e-10
Lilliefors (Kolmogorov-Smirnov) normality test  6.088379e-15
Shapiro-Francia normality test                  8.325134e-09

For males, there is evidence against normality of age for both survivors (four of the five tests reject at the 5% level; Lilliefors gives p = 0.060) and non-survivors (all p-values below 1e-8), so a non-parametric test is used.

In [20]:
wilcox.test(t1,t2)
	Wilcoxon rank sum test with continuity correction

data:  t1 and t2
W = 14453, p-value = 0.003962
alternative hypothesis: true location shift is not equal to 0

For males, there is strong evidence of a difference in age distribution between survivors and non-survivors (Wilcoxon rank sum test, p = 0.003962).

Controlling for gender: female

In [21]:
S=subset(d,(d$V5==1) & (d$V4=='female')) # Survivors
NS = subset(d,(d$V5==0)&(d$V4=='female')) # Non-Survivors
t1=as.numeric(as.character(S$V3))
t2=as.numeric(as.character(NS$V3))
t1 = na.omit(t1)
t2 = na.omit(t2)
In [22]:
par(mfrow=c(1,2))
hist(t1)
boxplot(t1)
In [23]:
par(mfrow=c(1,2))
hist(t2)
boxplot(t2)
In [24]:
par(mfrow=c(1,2))
qqnorm(t1)
qqnorm(t2)
In [25]:
par(mfrow=c(1,2))
plot(ecdf(t2))
lines(seq(0,60,1),pnorm(seq(0,60,1),mean(t2),sd(t2)))
plot(ecdf(t1))
lines(seq(0,60,1),pnorm(seq(0,60,1),mean(t1),sd(t1)))
In [26]:
normtest(t1)
normtest(t2)
A data.frame: 5 × 2
Method                                          P.value
<fct>                                           <dbl>
Shapiro-Wilk normality test                     0.0026901879
Anderson-Darling normality test                 0.0006357403
Cramer-von Mises normality test                 0.0008234442
Lilliefors (Kolmogorov-Smirnov) normality test  0.0001661670
Shapiro-Francia normality test                  0.0077707718
A data.frame: 5 × 2
Method                                          P.value
<fct>                                           <dbl>
Shapiro-Wilk normality test                     0.11296551
Anderson-Darling normality test                 0.11530637
Cramer-von Mises normality test                 0.08483381
Lilliefors (Kolmogorov-Smirnov) normality test  0.12109238
Shapiro-Francia normality test                  0.11744076

For females, there is strong evidence against normality of age among survivors (all p-values below 0.01) but no significant evidence against normality among non-survivors (all p-values above 0.08). Since one of the two samples is non-normal, the non-parametric test is used again.

In [27]:
wilcox.test(t1,t2)
	Wilcoxon rank sum test with continuity correction

data:  t1 and t2
W = 9408.5, p-value = 0.005119
alternative hypothesis: true location shift is not equal to 0

For females, there is strong evidence of a difference in age distribution between survivors and non-survivors (Wilcoxon rank sum test, p = 0.005119).

Ans Q2) There is strong evidence of a difference in age distribution between survivors and non-survivors within each gender (males: p = 0.003962; females: p = 0.005119), even though the pooled comparison in Q1 showed none.
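
The male and female analyses repeat the same steps on different subsets. A minimal sketch of how that could be folded into one helper applied per gender is shown here; it runs on made-up data with named columns, and `toy`, `age_test` and the column names are illustrative rather than taken from the notebook.

```r
set.seed(1)
toy <- data.frame(                     # made-up data, for illustration only
  Gender   = rep(c("female", "male"), each = 40),
  Survived = rep(c(0, 1), times = 40),
  Age      = runif(80, min = 1, max = 70)
)

# Hypothetical helper: Wilcoxon rank sum test of Age by Survived in one subset
age_test <- function(df) wilcox.test(Age ~ Survived, data = df)

by_gender   <- lapply(split(toy, toy$Gender), age_test)
p_by_gender <- sapply(by_gender, function(r) r$p.value)
p_by_gender
```

`split` returns one data frame per gender, so the same test runs on each stratum without copying the cell.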

Q3) Is there a significant difference in the survival probabilities of the two genders?

In [38]:
a = table(as.character(d$V4[-1]),as.character(d$V5[-1])) # drop row 1 (the header line read as data)
a
        
           0   1
  female 154 308
  male   709 142
In [29]:
chisq.test(a)
	Pearson's Chi-squared test with Yates' continuity correction

data:  a
X-squared = 329.84, df = 1, p-value < 2.2e-16
In [30]:
chisq.test(a,simulate.p.value=TRUE)
	Pearson's Chi-squared test with simulated p-value (based on 2000
	replicates)

data:  a
X-squared = 332.06, df = NA, p-value = 0.0004998
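
The simulated p-value of 0.0004998 is the smallest value that 2000 replicates can produce. `chisq.test` computes it as (1 + number of simulated statistics at least as large as the observed one) / (B + 1), so it is better read as "p is at most 1/2001" than as an exact value. A quick check of that arithmetic:

```r
B <- 2000              # default number of replicates when simulate.p.value = TRUE
min_p <- 1 / (B + 1)   # p-value when no simulated statistic reaches the observed one
signif(min_p, 4)       # 0.0004998
```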
In [31]:
fisher.test(a)
	Fisher's Exact Test for Count Data

data:  a
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.07620521 0.13155709
sample estimates:
odds ratio 
 0.1003494 
In [40]:
fisher.test(a,alternative='less')
	Fisher's Exact Test for Count Data

data:  a
p-value < 2.2e-16
alternative hypothesis: true odds ratio is less than 1
95 percent confidence interval:
 0.0000000 0.1262404
sample estimates:
odds ratio 
 0.1003494 

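Ans Q3) There is strong evidence of a difference in survival probabilities between the two genders (chi-squared and Fisher's exact tests, p < 2.2e-16). Females survived far more often: 308 of 462 (about 67%) against 142 of 851 males (about 17%).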
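
To make the direction and size of the effect explicit, the sketch below rebuilds the 2 × 2 table from the counts printed above and computes the survival rate per gender and the sample odds ratio. `fisher.test` reports a conditional maximum likelihood estimate (0.1003), so the plain cross-product ratio computed here differs slightly.

```r
# Counts copied from the table `a` above (rows: gender, columns: Survived)
a <- matrix(c(154, 709, 308, 142), nrow = 2,
            dimnames = list(Gender = c("female", "male"),
                            Survived = c("0", "1")))

# Proportion of each gender that survived
surv_rate <- prop.table(a, margin = 1)[, "1"]

# Sample odds ratio: odds of not surviving, female relative to male
odds_ratio <- (a["female", "0"] * a["male", "1"]) /
              (a["female", "1"] * a["male", "0"])

round(surv_rate, 3)
round(odds_ratio, 4)
```

An odds ratio near 0.1 means that the odds of dying were roughly ten times lower for females than for males.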

Q4) Is there significant difference in survival probabilities for the two genders even after taking the effects of passenger class into account?

In [33]:
#Passenger Class 1 
p1=subset(d,d$V2=="1st")
#Passenger Class 2 
p2=subset(d,d$V2=="2nd")
#Passenger Class 3
p3=subset(d,d$V2=="3rd")

head(p1)
A data.frame: 6 × 5
   V1                                             V2      V3    V4      V5
   <fct>                                          <fct>   <fct> <fct>   <fct>
2  Allen, Miss Elisabeth Walton                   1st     29    female  1
3  Allison, Miss Helen Loraine                    1st     2     female  0
4  Allison, Mr Hudson Joshua Creighton            1st     30    male    0
5  Allison, Mrs Hudson JC (Bessie Waldo Daniels)  1st     25    female  0
6  Allison, Master Hudson Trevor                  1st     0.92  male    1
7  Anderson, Mr Harry                             1st     47    male    1

Checking for passenger class 1st

In [34]:
t1 = table(as.character(p1$V4),as.character(p1$V5))
t1
chisq.test(t1)
fisher.test(t1)
fisher.test(t1,alternative='less')
        
           0   1
  female   9 134
  male   120  59
	Pearson's Chi-squared test with Yates' continuity correction

data:  t1
X-squared = 119.64, df = 1, p-value < 2.2e-16
	Fisher's Exact Test for Count Data

data:  t1
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.01397898 0.07139961
sample estimates:
odds ratio 
0.03344369 
	Fisher's Exact Test for Count Data

data:  t1
p-value < 2.2e-16
alternative hypothesis: true odds ratio is less than 1
95 percent confidence interval:
 0.00000000 0.06440093
sample estimates:
odds ratio 
0.03344369 

Checking for passenger class 2nd

In [35]:
t2 = table(as.character(p2$V4),as.character(p2$V5))
t2
chisq.test(t2)
fisher.test(t2)
fisher.test(t2,alternative='less')
        
           0   1
  female  13  94
  male   148  25
	Pearson's Chi-squared test with Yates' continuity correction

data:  t2
X-squared = 142.76, df = 1, p-value < 2.2e-16
	Fisher's Exact Test for Count Data

data:  t2
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.01055891 0.05030944
sample estimates:
odds ratio 
0.02391888 
	Fisher's Exact Test for Count Data

data:  t2
p-value < 2.2e-16
alternative hypothesis: true odds ratio is less than 1
95 percent confidence interval:
 0.00000000 0.04531073
sample estimates:
odds ratio 
0.02391888 

Checking for passenger class 3rd

In [36]:
t3 = table(as.character(p3$V4),as.character(p3$V5))
t3
chisq.test(t3)
fisher.test(t3)
fisher.test(t3,alternative='less')
        
           0   1
  female 132  80
  male   441  58
	Pearson's Chi-squared test with Yates' continuity correction

data:  t3
X-squared = 63.201, df = 1, p-value = 1.867e-15
	Fisher's Exact Test for Count Data

data:  t3
p-value = 1.184e-14
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.1440935 0.3266507
sample estimates:
odds ratio 
 0.2175681 
	Fisher's Exact Test for Count Data

data:  t3
p-value = 9.239e-15
alternative hypothesis: true odds ratio is less than 1
95 percent confidence interval:
 0.0000000 0.3070662
sample estimates:
odds ratio 
 0.2175681 

Ans Q4) There is strong evidence of a difference in survival probabilities between the two genders within every passenger class (p < 1e-13 in each class), so the gender effect persists after taking passenger class into account.
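
Testing each class separately answers the question one stratum at a time. A single stratified test is also possible. The sketch below applies the Cochran-Mantel-Haenszel test (`mantelhaen.test` from R's stats package), which is not part of the original analysis, to the three 2 × 2 tables printed above.

```r
# Counts copied from the per-class tables above.
# Fill order within each class: female/0, male/0, female/1, male/1
counts <- array(
  c(  9, 120, 134, 59,    # 1st class
     13, 148,  94, 25,    # 2nd class
    132, 441,  80, 58),   # 3rd class
  dim = c(2, 2, 3),
  dimnames = list(Gender   = c("female", "male"),
                  Survived = c("0", "1"),
                  PClass   = c("1st", "2nd", "3rd"))
)

# Tests for a gender/survival association, stratified by passenger class
cmh <- mantelhaen.test(counts)
cmh
```

The test pools the three tables into one common odds ratio, which assumes the association is roughly similar across classes. The per-class estimates above (0.033, 0.024 and 0.218) differ noticeably, so the pooled figure is only a summary.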