source('normality.r')

d=read.table("titanic.data")

head(d)
summary(d)

                            V1            V2            V3           V4     
 Carlsson, Mr Frans Olof     :   2   1st   :322   22     : 35   female:462  
 Connolly, Miss Kate         :   2   2nd   :280   21     : 31   Gender:  1  
 Kelly, Mr James             :   2   3rd   :711   30     : 31   male  :851  
 Abbing, Mr Anthony          :   1   PClass:  1   18     : 30               
 Abbott, Master Eugene Joseph:   1                36     : 29               
 Abbott, Mr Rossmore Edward  :   1                (Other):601               
 (Other)                     :1305                NA's   :557               
        V5     
 0       :863  
 1       :450  
 Survived:  1
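
The single counts for PClass, Gender and Survived in the summary suggest that the file's header line was read as a data row, which would also explain why every column is a factor. A minimal sketch of reading such a file with header=TRUE follows. The inline sample reuses three rows shown by head(d), and the tab separator is an assumption about the file layout, not something confirmed here. The rest of this analysis keeps the V1..V5 names.

```r
# Three rows as displayed by head(d); the tab separator is an assumption.
sample_text <- "Name\tPClass\tAge\tGender\tSurvived
Allen, Miss Elisabeth Walton\t1st\t29\tfemale\t1
Allison, Miss Helen Loraine\t1st\t2\tfemale\t0
Allison, Mr Hudson Joshua Creighton\t1st\t30\tmale\t0"

# header=TRUE takes the column names from the first line,
# so Age is parsed as a number instead of a factor.
demo <- read.table(text = sample_text, header = TRUE, sep = "\t",
                   quote = "", stringsAsFactors = FALSE)
str(demo)
```

With named numeric columns, the as.numeric(as.character(...)) conversions used below would no longer be needed.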

Q1) Is there a significant difference in the age distribution between those who survived and those who did not?

S=subset(d,d$V5==1) # Survivors
NS = subset(d,d$V5==0) # Non-Survivors

t1=as.numeric(as.character(S$V3))
t2=as.numeric(as.character(NS$V3))
t1 = na.omit(t1)
t2 = na.omit(t2)

par(mfrow=c(1,2))
hist(t1)
boxplot(t1)

par(mfrow=c(1,2))
hist(t2)
boxplot(t2)

par(mfrow=c(1,2))
qqnorm(t1)
qqnorm(t2)

par(mfrow=c(1,2))
plot(ecdf(t2))
curve(pnorm(x,mean(t2),sd(t2)),add=TRUE) # fitted normal CDF over the whole plotted age range, not only 0-60
plot(ecdf(t1))
curve(pnorm(x,mean(t1),sd(t1)),add=TRUE)

normtest(t1)
normtest(t2)

Warning message in cvm.test(x):
“p-value is smaller than 7.37e-10, cannot be computed more accurately”

The tests give strong evidence against normality in both groups, so we use a non-parametric test.
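
normtest() comes from normality.r, which is not shown here. Judging by the method names in its output and the cvm.test warning, it appears to collect p-values from several normality tests. A hypothetical stand-in might look like the sketch below. The name normtest_sketch is illustrative, and the use of the nortest package for the non-base tests is an assumption, so those tests are skipped when the package is not installed.

```r
# Hypothetical stand-in for normtest(); the real function lives in normality.r.
normtest_sketch <- function(x) {
  x <- x[!is.na(x)]
  # Shapiro-Wilk is part of base R (stats)
  res <- list("Shapiro-Wilk normality test" = shapiro.test(x)$p.value)
  # Assumption: the remaining tests come from the nortest package
  if (requireNamespace("nortest", quietly = TRUE)) {
    res[["Anderson-Darling normality test"]] <- nortest::ad.test(x)$p.value
    res[["Cramer-von Mises normality test"]] <- nortest::cvm.test(x)$p.value
    res[["Lilliefors (Kolmogorov-Smirnov) normality test"]] <- nortest::lillie.test(x)$p.value
    res[["Shapiro-Francia normality test"]] <- nortest::sf.test(x)$p.value
  }
  data.frame(Method = names(res), P.value = unname(unlist(res)))
}

set.seed(1)
skewed <- rexp(200)            # simulated, clearly non-normal sample
out <- suppressWarnings(normtest_sketch(skewed))
out
```

Small p-values across the methods are what "strong evidence against normality" refers to in this analysis.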

wilcox.test(t1,t2)

	Wilcoxon rank sum test with continuity correction

data:  t1 and t2
W = 65469, p-value = 0.1917
alternative hypothesis: true location shift is not equal to 0

Ans: There is no significant difference in the age distribution of survivors and non-survivors (Wilcoxon rank-sum test, p = 0.1917).
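
The Wilcoxon rank-sum test is mainly sensitive to a shift in location. As a sketch on simulated ages (not the Titanic data), conf.int=TRUE adds an estimate of that shift with a confidence interval, and ks.test compares the two distributions as a whole. The group sizes and gamma parameters are arbitrary assumptions.

```r
set.seed(42)
# Simulated, right-skewed "ages" for two groups; purely illustrative
g1 <- rgamma(150, shape = 4, scale = 7)
g2 <- rgamma(200, shape = 4, scale = 7.5)

w <- wilcox.test(g1, g2, conf.int = TRUE)  # also returns a shift estimate and CI
k <- ks.test(g1, g2)                       # two-sample Kolmogorov-Smirnov test

w$estimate   # estimated difference in location (g1 minus g2)
w$conf.int   # confidence interval for that shift
k$p.value    # p-value for a difference anywhere in the distributions
```

Applied to t1 and t2, the same two calls would show how large the age shift is, in addition to whether it is significant.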

Q2) Is there a significant difference in age distribution between those who survived and those who did not after controlling for gender?

Controlling for gender male

S=subset(d,(d$V5==1) & (d$V4=='male')) # Survivors
NS = subset(d,(d$V5==0)&(d$V4=='male')) # Non-Survivors
t1=as.numeric(as.character(S$V3))
t2=as.numeric(as.character(NS$V3))
t1 = na.omit(t1)
t2 = na.omit(t2)

par(mfrow=c(1,2))
hist(t1)
boxplot(t1)

par(mfrow=c(1,2))
hist(t2)
boxplot(t2)

par(mfrow=c(1,2))
qqnorm(t1)
qqnorm(t2)

par(mfrow=c(1,2))
plot(ecdf(t2))
curve(pnorm(x,mean(t2),sd(t2)),add=TRUE) # fitted normal CDF over the whole plotted age range, not only 0-60
plot(ecdf(t1))
curve(pnorm(x,mean(t1),sd(t1)),add=TRUE)

normtest(t1)
normtest(t2)

Warning message in cvm.test(x):
“p-value is smaller than 7.37e-10, cannot be computed more accurately”

There is strong evidence against normality in the age distributions of both male survivors and male non-survivors, so we proceed with a non-parametric test.

wilcox.test(t1,t2)

	Wilcoxon rank sum test with continuity correction

data:  t1 and t2
W = 14453, p-value = 0.003962
alternative hypothesis: true location shift is not equal to 0

There is strong evidence of a difference in the age distribution of survivors and non-survivors among male passengers (p = 0.003962).

Controlling for gender female

S=subset(d,(d$V5==1) & (d$V4=='female')) # Survivors
NS = subset(d,(d$V5==0)&(d$V4=='female')) # Non-Survivors
t1=as.numeric(as.character(S$V3))
t2=as.numeric(as.character(NS$V3))
t1 = na.omit(t1)
t2 = na.omit(t2)

par(mfrow=c(1,2))
hist(t1)
boxplot(t1)

par(mfrow=c(1,2))
hist(t2)
boxplot(t2)

par(mfrow=c(1,2))
qqnorm(t1)
qqnorm(t2)

par(mfrow=c(1,2))
plot(ecdf(t2))
curve(pnorm(x,mean(t2),sd(t2)),add=TRUE) # fitted normal CDF over the whole plotted age range, not only 0-60
plot(ecdf(t1))
curve(pnorm(x,mean(t1),sd(t1)),add=TRUE)

normtest(t1)
normtest(t2)

There is strong evidence against normality in the age distribution of female survivors, and no significant evidence against normality for female non-survivors. Because at least one group departs from normality, we again use a non-parametric test.

wilcox.test(t1,t2)

	Wilcoxon rank sum test with continuity correction

data:  t1 and t2
W = 9408.5, p-value = 0.005119
alternative hypothesis: true location shift is not equal to 0

There is strong evidence of a difference in the age distribution of survivors and non-survivors among female passengers (p = 0.005119).

Ans Q2) There is strong evidence of a difference in the age distributions of survivors and non-survivors within each gender, i.e. even after controlling for the gender of the individual.
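
The per-gender analysis repeats the same steps twice. A sketch of running the test in a loop over genders is shown below; age_test_by_gender is an illustrative helper, and toy is a synthetic data frame that only mimics the V3/V4/V5 layout, so its p-values say nothing about the Titanic data.

```r
set.seed(7)
# Synthetic stand-in with the same column layout (V3 age, V4 gender, V5 survived)
toy <- data.frame(
  V3 = as.character(round(runif(120, 1, 70))),
  V4 = rep(c("male", "female"), each = 60),
  V5 = as.character(rbinom(120, 1, 0.4)),
  stringsAsFactors = TRUE
)

# Run the survivor vs non-survivor age comparison separately for each gender
age_test_by_gender <- function(df) {
  age <- as.numeric(as.character(df$V3))
  sapply(c("male", "female"), function(g) {
    surv <- age[df$V4 == g & df$V5 == 1 & !is.na(age)]
    dead <- age[df$V4 == g & df$V5 == 0 & !is.na(age)]
    # exact=FALSE: rounded ages contain ties, so use the normal approximation
    wilcox.test(surv, dead, exact = FALSE)$p.value
  })
}

p_by_gender <- age_test_by_gender(toy)
p_by_gender
```

Calling age_test_by_gender(d) would apply the same comparison to the real data and avoids copying the block once per gender.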

Q3) Is there a significant difference in the survival probabilities of the two genders?

a = table(as.character(d$V4[-1]),as.character(d$V5[-1])) # drop row 1, the header line that was read in as data
a

        
           0   1
  female 154 308
  male   709 142

chisq.test(a)

	Pearson's Chi-squared test with Yates' continuity correction

data:  a
X-squared = 329.84, df = 1, p-value < 2.2e-16

chisq.test(a,simulate.p.value=T)

	Pearson's Chi-squared test with simulated p-value (based on 2000
	replicates)

data:  a
X-squared = 332.06, df = NA, p-value = 0.0004998
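
The simulated p-value of 0.0004998 is the smallest value this method can report with 2000 replicates, namely 1/(B+1), so it should be read as "below the resolution of the simulation" rather than as an exact figure. The statistic also differs slightly from the previous call because the simulated version does not apply the Yates correction. A short sketch using the counts from the table above:

```r
# Gender-by-survival counts copied from the table above
a <- matrix(c(154, 709, 308, 142), nrow = 2,
            dimnames = list(c("female", "male"), c("0", "1")))

B <- 2000                # chisq.test's default number of replicates
p_floor <- 1 / (B + 1)   # smallest p-value the simulation can return

set.seed(1)
sim <- chisq.test(a, simulate.p.value = TRUE, B = B)
c(floor = p_floor, simulated = sim$p.value)
```

Increasing B lowers this floor at the cost of more computation.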

fisher.test(a)

	Fisher's Exact Test for Count Data

data:  a
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.07620521 0.13155709
sample estimates:
odds ratio 
 0.1003494

fisher.test(a,alt='l')

	Fisher's Exact Test for Count Data

data:  a
p-value < 2.2e-16
alternative hypothesis: true odds ratio is less than 1
95 percent confidence interval:
 0.0000000 0.1262404
sample estimates:
odds ratio 
 0.1003494

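Ans Q3) There is significant evidence of a difference in the survival probabilities of the two genders. The estimated odds ratio of about 0.10 means the odds of not surviving for females were roughly one tenth of those for males.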
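
To make the size of the difference concrete, the survival proportions and the sample odds ratio can be computed directly from the counts in the table above. This is a minimal sketch; note that fisher.test reports a conditional maximum likelihood estimate, which is why its odds ratio (0.1003) differs slightly from the plain cross-product ratio computed here.

```r
# Gender-by-survival counts copied from the table above
a <- matrix(c(154, 709, 308, 142), nrow = 2,
            dimnames = list(Gender = c("female", "male"),
                            Survived = c("0", "1")))

# Proportion surviving within each gender (row-wise proportions)
surv_prob <- prop.table(a, margin = 1)[, "1"]

# Sample odds ratio: odds of not surviving, female relative to male
sample_or <- (a["female", "0"] * a["male", "1"]) /
             (a["female", "1"] * a["male", "0"])

surv_prob
sample_or
```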

Q4) Is there significant difference in survival probabilities for the two genders even after taking the effects of passenger class into account?

#Passenger Class 1 
p1=subset(d,d$V2=="1st")
#Passenger Class 2 
p2=subset(d,d$V2=="2nd")
#Passenger Class 3
p3=subset(d,d$V2=="3rd")

head(p1)

Checking for passenger Class 1st

t1 = table(as.character(p1$V4),as.character(p1$V5))
t1
chisq.test(t1)
fisher.test(t1)
fisher.test(t1,alt='l')

        
           0   1
  female   9 134
  male   120  59

	Pearson's Chi-squared test with Yates' continuity correction

data:  t1
X-squared = 119.64, df = 1, p-value < 2.2e-16

	Fisher's Exact Test for Count Data

data:  t1
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.01397898 0.07139961
sample estimates:
odds ratio 
0.03344369

	Fisher's Exact Test for Count Data

data:  t1
p-value < 2.2e-16
alternative hypothesis: true odds ratio is less than 1
95 percent confidence interval:
 0.00000000 0.06440093
sample estimates:
odds ratio 
0.03344369

Checking for passenger Class 2nd

t2 = table(as.character(p2$V4),as.character(p2$V5))
t2
chisq.test(t2)
fisher.test(t2)
fisher.test(t2,alt='l')

        
           0   1
  female  13  94
  male   148  25

	Pearson's Chi-squared test with Yates' continuity correction

data:  t2
X-squared = 142.76, df = 1, p-value < 2.2e-16

	Fisher's Exact Test for Count Data

data:  t2
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.01055891 0.05030944
sample estimates:
odds ratio 
0.02391888

	Fisher's Exact Test for Count Data

data:  t2
p-value < 2.2e-16
alternative hypothesis: true odds ratio is less than 1
95 percent confidence interval:
 0.00000000 0.04531073
sample estimates:
odds ratio 
0.02391888

Checking for passenger Class 3rd

t3 = table(as.character(p3$V4),as.character(p3$V5))
t3
chisq.test(t3)
fisher.test(t3)
fisher.test(t3,alt='l')

        
           0   1
  female 132  80
  male   441  58

	Pearson's Chi-squared test with Yates' continuity correction

data:  t3
X-squared = 63.201, df = 1, p-value = 1.867e-15

	Fisher's Exact Test for Count Data

data:  t3
p-value = 1.184e-14
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.1440935 0.3266507
sample estimates:
odds ratio 
 0.2175681

	Fisher's Exact Test for Count Data

data:  t3
p-value = 9.239e-15
alternative hypothesis: true odds ratio is less than 1
95 percent confidence interval:
 0.0000000 0.3070662
sample estimates:
odds ratio 
 0.2175681

Ans Q4) There is strong evidence of a significant difference in the survival probabilities of the two genders within every passenger class, i.e. even after taking the effect of passenger class into account.
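
The three per-class tests can be complemented by a single stratified test. The sketch below applies the Cochran-Mantel-Haenszel test (mantelhaen.test in base R's stats package) to the counts from the three tables above. This is an additional technique, not something the analysis above ran, and it assumes a roughly common odds ratio across classes, which the per-class estimates (about 0.03, 0.02 and 0.22) may not fully support.

```r
# 2 x 2 x 3 array: Gender x Survived x PClass, filled column by column
counts <- array(
  c(9, 120, 134, 59,      # 1st class: female/male not survived, female/male survived
    13, 148, 94, 25,      # 2nd class
    132, 441, 80, 58),    # 3rd class
  dim = c(2, 2, 3),
  dimnames = list(Gender = c("female", "male"),
                  Survived = c("0", "1"),
                  PClass = c("1st", "2nd", "3rd"))
)

# Tests the gender-survival association while stratifying on passenger class
cmh <- mantelhaen.test(counts)
cmh$p.value    # p-value of the stratified test
cmh$estimate   # Mantel-Haenszel common odds ratio estimate
```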

Method	P.value
<fct>	<dbl>
Shapiro-Wilk normality test	0.0004996137
Anderson-Darling normality test	0.0017831268
Cramer-von Mises normality test	0.0068092810
Lilliefors (Kolmogorov-Smirnov) normality test	0.0322461921
Shapiro-Francia normality test	0.0020591830

Method	P.value
<fct>	<dbl>
Shapiro-Wilk normality test	1.461444e-09
Anderson-Darling normality test	5.014279e-16
Cramer-von Mises normality test	7.370000e-10
Lilliefors (Kolmogorov-Smirnov) normality test	2.646866e-16
Shapiro-Francia normality test	1.719147e-08

Method	P.value
<fct>	<dbl>
Shapiro-Wilk normality test	0.004201276
Anderson-Darling normality test	0.008390771
Cramer-von Mises normality test	0.045051620
Lilliefors (Kolmogorov-Smirnov) normality test	0.059854415
Shapiro-Francia normality test	0.013508441

Method	P.value
<fct>	<dbl>
Shapiro-Wilk normality test	6.368376e-10
Anderson-Darling normality test	2.227363e-16
Cramer-von Mises normality test	7.370000e-10
Lilliefors (Kolmogorov-Smirnov) normality test	6.088379e-15
Shapiro-Francia normality test	8.325134e-09

Method	P.value
<fct>	<dbl>
Shapiro-Wilk normality test	0.0026901879
Anderson-Darling normality test	0.0006357403
Cramer-von Mises normality test	0.0008234442
Lilliefors (Kolmogorov-Smirnov) normality test	0.0001661670
Shapiro-Francia normality test	0.0077707718

A data.frame: 6 × 5
V1	V2	V3	V4	V5
<fct>	<fct>	<fct>	<fct>	<fct>
Name	PClass	Age	Gender	Survived
Allen, Miss Elisabeth Walton	1st	29	female	1
Allison, Miss Helen Loraine	1st	2	female	0
Allison, Mr Hudson Joshua Creighton	1st	30	male	0
Allison, Mrs Hudson JC (Bessie Waldo Daniels)	1st	25	female	0
Allison, Master Hudson Trevor	1st	0.92	male	1

A data.frame: 6 × 5
	V1	V2	V3	V4	V5
	<fct>	<fct>	<fct>	<fct>	<fct>
2	Allen, Miss Elisabeth Walton	1st	29	female	1
3	Allison, Miss Helen Loraine	1st	2	female	0
4	Allison, Mr Hudson Joshua Creighton	1st	30	male	0
5	Allison, Mrs Hudson JC (Bessie Waldo Daniels)	1st	25	female	0
6	Allison, Master Hudson Trevor	1st	0.92	male	1
7	Anderson, Mr Harry	1st	47	male	1