I finished the book 'The Manga Guide to Statistics' by Shin Takahashi this weekend.
This book is a good read for intermediate statisticians wanting to brush up on the basics taught in high school or an early undergraduate degree. It is not a beginner-friendly book, as it requires prior understanding of statistical concepts. Overall, I felt that the author did not do justice to the topic of hypothesis testing, and the last chapter was very rushed. The author could also have gone into more detail on p-value calculation.
If you are an experienced statistician looking for a book which gives an overview of what you learnt in high school, this book is perfect for you. But if you are someone who wants to go deep into important topics such as hypothesis testing, I would recommend consulting a different book.
Chapter-by-chapter summary
Chapter 1 - What are numerical and categorical variables? Explanations with examples are presented. A brief discussion of how the findings from a sample can be applied to a population is also included.
Chapter 2 - Numerical data is discussed in detail. The author plots the data as a histogram/bar graph. The reader is introduced to terms such as mean, median, midpoint (of each class) and standard deviation.
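To make the Chapter 2 terms concrete, here is a minimal sketch using Python's standard `statistics` module. The scores are made-up numbers, not data from the book, and I am assuming the population form of the standard deviation (dividing by n).

```python
import statistics

# Hypothetical test scores; made up for illustration, not taken from the book.
scores = [50, 60, 70, 80, 90]

mean = statistics.mean(scores)      # arithmetic average
median = statistics.median(scores)  # middle value of the sorted data

# Population standard deviation (divides by n).
# statistics.stdev would give the sample version (divides by n - 1).
std_dev = statistics.pstdev(scores)

print(mean, median, round(std_dev, 2))
```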
Chapter 3 - Categorical data is discussed in detail. The author converts categorical data in a table to a cross-tabulation format.
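A cross-tabulation is just a count of how often each pair of categories occurs. The sketch below shows one way to build it in Python; the survey rows and the helper name `cross_tabulate` are my own inventions for illustration.

```python
from collections import Counter

# Hypothetical survey rows: (gender, preferred medium). Made-up data.
responses = [
    ("female", "phone"),
    ("female", "face-to-face"),
    ("male", "phone"),
    ("male", "phone"),
    ("female", "face-to-face"),
    ("male", "face-to-face"),
]

def cross_tabulate(rows):
    """Count how often each (row category, column category) pair occurs."""
    counts = Counter(rows)
    row_labels = sorted({r for r, _ in rows})
    col_labels = sorted({c for _, c in rows})
    # Counter returns 0 for pairs that never occur, so every cell is filled.
    return {r: {c: counts[(r, c)] for c in col_labels} for r in row_labels}

table = cross_tabulate(responses)
for gender, row in table.items():
    print(gender, row)
```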
Chapter 4 - Concepts such as the standard score (z-score) and the deviation score are introduced. The deviation score is a rescaled version of the standard score (standard score × 10 + 50).
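A rough sketch of both scores, assuming the common convention that the deviation score rescales the standard score to a mean of 50 and a standard deviation of 10. The data and function names are hypothetical.

```python
import statistics

def standard_score(value, data):
    """z-score: how many standard deviations `value` lies from the mean."""
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)  # population standard deviation (assumption)
    return (value - mean) / sd

def deviation_score(value, data):
    """Standard score rescaled to mean 50 and standard deviation 10 (assumed convention)."""
    return standard_score(value, data) * 10 + 50

# Hypothetical class scores; made up for illustration.
scores = [50, 60, 70, 80, 90]

z = standard_score(90, scores)
d = deviation_score(90, scores)
print(round(z, 2), round(d, 2))
```

A student who scores exactly the class mean gets a standard score of 0 and a deviation score of 50.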
A probability density function is introduced. It is obtained by approximating the histogram of the data with a smooth curve. The concepts of normal distribution and standard normal distribution (the transformation of normally distributed data to z-scores) are also introduced.
As an example, let's assume I score 63/100 on a test. The score is first converted to a z-score. Using the standard normal distribution curve, I can compute the following -
- ratio of students who scored more than me
- probability that a randomly chosen student scored more than me
Therefore, the area under the curve = ratio = probability.
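The calculation can be sketched with `statistics.NormalDist` (Python 3.8+). The class mean and standard deviation below are assumptions I picked so that 63 lands exactly one standard deviation above the mean; the review does not give them.

```python
from statistics import NormalDist

my_score = 63
class_mean = 53  # hypothetical value, not from the book
class_sd = 10    # hypothetical value, not from the book

# Convert the raw score to a z-score.
z = (my_score - class_mean) / class_sd

# Area under the standard normal curve to the right of z:
# the share of students expected to score above 63.
share_above = 1 - NormalDist().cdf(z)

print(round(z, 2), round(share_above, 4))
```

Under these assumptions, about 16% of students would be expected to score higher.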
Here, the author briefly introduces the equation of a line: y = ax + b, where a is the slope (coefficient) and b is the y-intercept.
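He also introduces the degrees of freedom (d.o.f.) of a distribution, which play a role similar to the slope of a line: changing the value changes the shape of the curve. The d.o.f. can also depend on the sample size; the larger the sample, the higher the degrees of freedom. The total area under the curve is always 1, but for a Chi-sq distribution, the higher the degrees of freedom, the larger the area to the right of any given value.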
The author introduces the following metrics -
- Correlation coefficient: comparing 2 numeric variables [-1, 1]
- Correlation ratio: comparing a numeric and a categorical variable [0, 1]
- Cramer's coefficient: comparing 2 categorical variables [0, 1]
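As a rough sketch, two of these metrics can be computed by hand as follows. This uses the standard textbook formulas (Pearson's r, and Cramer's coefficient as sqrt(chi-sq / (n * (k - 1))) with k the smaller table dimension); the function names are my own, and the correlation ratio is left out to keep the example short.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient between two numeric variables, in [-1, 1]."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def chi_squared(table):
    """Chi-sq statistic of a cross-tabulation given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

def cramers_v(table):
    """Cramer's coefficient between two categorical variables, in [0, 1]."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0]))
    return math.sqrt(chi_squared(table) / (n * (k - 1)))

print(pearson_r([1, 2, 3], [2, 4, 6]))
print(cramers_v([[10, 0], [0, 10]]))
```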
There are informal standards for each of these metrics to judge the strength of the relationship (using the absolute value of the coefficient). For instance -
- 1 - 0.9 : very strong correlation
- 0.9 - 0.7 : fairly strong correlation
- 0.7 - 0.5 : fairly weak correlation
- below 0.5 : not related
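These informal bands translate directly into a small lookup. The function name is hypothetical, and how the exact boundary values (0.9, 0.7, 0.5) are assigned is my own assumption, since the bands above share their endpoints.

```python
def describe_strength(coefficient):
    """Map a coefficient to the informal labels above, using its absolute value."""
    magnitude = abs(coefficient)
    if magnitude >= 0.9:
        return "very strong correlation"
    if magnitude >= 0.7:
        return "fairly strong correlation"
    if magnitude >= 0.5:
        return "fairly weak correlation"
    return "not related"

for value in (-0.95, 0.8, 0.6, 0.2):
    print(value, "->", describe_strength(value))
```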
The drawback of the correlation coefficient is that it can only detect linear relationships. Further, all of these coefficients are computed from a sample: if a different sample had been collected, the coefficient would have been different, so the value may not be consistent across samples.
This is where the material gets very dense. The author introduces hypothesis testing, which addresses the sampling drawback of the above coefficients. A coefficient computed from a sample cannot tell us for certain what the population value is, but a hypothesis test lets us judge, at a chosen significance level, whether the sample gives enough evidence that the coefficient for the population is not 0. The author illustrates this with the Chi-sq test, which checks whether there is a relationship between 2 categorical variables, e.g. gender and the medium of asking a person out (phone or face-to-face). The procedure is as follows:
- Set the null hypothesis that there is no relationship between gender and medium of asking out (i.e. Cramer's coefficient for the population is 0)
- The data for a sample is converted to cross-tabulation format
- From the cross-tabulation, the Chi-sq test statistic is calculated
- The previous two steps are repeated 10,000 times for 10,000 different samples of the population to get a value of the Chi-sq test statistic for each sample
- The distribution of the test statistic is plotted
If the null hypothesis is true, this distribution follows a Chi-sq distribution. In practice only one sample is available, so its test statistic is compared against the Chi-sq distribution: if the value falls far out in the right tail (a small p-value), the null hypothesis is rejected and we conclude that the 2 variables are related; otherwise, the hypothesis of no relationship cannot be rejected.
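The repeated-sampling idea can be sketched as a small simulation. Everything here is an assumption for illustration: the population proportions, the sample size of 200, and the use of 2,000 repetitions instead of 10,000 to keep it quick. Gender and medium are drawn independently, so the null hypothesis holds by construction, and roughly 5% of the test statistics should exceed 3.841, the 5% critical value of the Chi-sq distribution with 1 degree of freedom.

```python
import random

def chi_squared(table):
    """Chi-sq statistic of a cross-tabulation given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip cells of an empty row or column
                stat += (observed - expected) ** 2 / expected
    return stat

def draw_sample_table(rng, n, p_female=0.5, p_phone=0.4):
    """Draw n people with gender and medium chosen independently (null is true)."""
    table = [[0, 0], [0, 0]]  # rows: female/male, columns: phone/face-to-face
    for _ in range(n):
        row = 0 if rng.random() < p_female else 1
        col = 0 if rng.random() < p_phone else 1
        table[row][col] += 1
    return table

rng = random.Random(0)     # fixed seed so the run is repeatable
CRITICAL_5PCT_DF1 = 3.841  # 5% critical value, Chi-sq with 1 d.o.f.

stats = [chi_squared(draw_sample_table(rng, 200)) for _ in range(2000)]
share_extreme = sum(s > CRITICAL_5PCT_DF1 for s in stats) / len(stats)
print(share_extreme)
```

A statistic from real data that lands beyond the critical value would be unusual if the variables were unrelated, which is the reasoning behind rejecting the null hypothesis.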