Install the packages used in the book. Package gregmisc is no longer available; it is replaced by gplots in this document. Package Rassoc is no longer available; all of its functions were copied from the archived version.

The Rassoc package is no longer active. You can get the archived version from http://cran.r-project.org/web/packages/Rassoc/index.html .

The following functions are copy-pasted from the archived version.

This book was written for use as a textbook of statistical genetics (genetic statistics). Statistical genetics is a field that interprets biological phenomena mathematically, with heredity as the central axis.

A parent and child are similar, because genes are transmitted from parent to child. Although they are very similar to each other, they can still be distinguished. Individuals in a population are also mutually similar, and yet there is variation in the population. Genetics aims to understand this variation in similarity through the transmission of genes and their functions.

The child's gene set consists of half of the mother's set and half of the father's set. Conversely, only half of each parent's gene set is transmitted to the child. Since the choice of this half occurs at random, genetic phenomena are stochastic. Another field concerned with stochastic phenomena is "statistics". "Genetics" and "statistics" share their basis; indeed, "statistics" has its roots in "genetics".

Currently, genetics aims to understand biological phenomena with genomic approaches and from molecular-genetic aspects using large-scale data sets. On the other hand, statistics has broadened its usage to almost all academic fields (Figure 0.1).

Mathematical-statistical approaches in genetics and genomics are very wide-ranging. This textbook covers gene-mapping-specific statistical issues and the basics of statistical thinking that are shared by the heterogeneous statistical methods of genetic statistics.

Although research on genetic phenomena and gene function covers a wide range of topics, this document was constructed by focusing on what is common among them when understanding them from the statistical side.

Part I deals with the biological matters related to genetics and genes.

Part II covers matters relating to the basics of handling data. It focuses on categorical data types because of the discreteness of genes. It also explains methods that focus on populational aspects of samples, and methods that focus on individual relations among samples.

Part III covers descriptive statistics; dimensions and degrees of freedom, which are the basics of estimation and testing; probability; likelihood; and distributions, and describes the related indices. Likelihood and linkage analysis are also handled in this part.

Part IV describes how to derive meaning from data.

Part V is on the issues when data sets are large or complicated.


As described above, this book was constructed by focusing on what is common when understanding the statistical aspects of research on genetic phenomena and gene function. Conversely, we deliberately decided not to treat commentary on the various individual techniques used in gene mapping and gene-function analysis, or the definitions of their various indices. Of course, one must use those individual approaches and indices properly; but if the aim were merely to understand them one by one, one would not build the strength to cope with the new methods and indices that will continue to be developed and proposed. Instead, by putting weight on the concepts that are common to a variety of techniques and indices, we hope readers will acquire skills they can apply. It is like the desk study of a language: by securing the foundation, one aims to become able to read even a sentence seen for the first time. Accordingly, this document is not organized so that you read one chapter if you are interested in mapping and another chapter if you want to know about clustering. Matters relating to a single research theme come up here and there, and the best way to read is to cross-reference them as you go.

In addition, terms and concepts that appear more than once are not necessarily explained carefully at their first appearance, but are described in the place where a careful description is considered most useful. This avoids making the first half of the book heavy with definitions and descriptions at first appearance. Therefore, even if some points remain somewhat vague, please keep reading as long as the meaning comes through; they should become convincing when read again later. When a term highlighted in bold is difficult to understand as you read, please look it up in the index, which shows the pages where the term appears. When it appears in several places, the page that explains it most carefully is listed in bold, so please read the description on that page; on that page, the term is typeset in a distinct bold font. In terms of a web article, it is as if cross-links were stretched here and there; please use the index with the intention of jumping along such links.

In addition, we make extensive use of figures, and we also include the source code, written in the free statistical software R, used to draw them. After grasping the content in natural-language sentences, forming an image with the figures, and confirming the exact representation in formulas, we hope the R source will further help understanding. Since the R source often runs simulations, we recommend that you actually use it and try producing output under different conditions. Some figures may be easier to understand in color; you can browse them at http://www.genome.med.kyoto-u.ac.jp/func-gen-photo/index.php?album=StatGenetTextbook. The R source used in this document can be downloaded from http://www.genome.med.kyoto-u.ac.jp/StatGenet/lectures/2010/StatGenetTextbook/Rsrc.zip.


In writing this book, I became indebted to many people. In particular, I learned a great deal from everyone in the private meeting who took on all manner of consultations about my day-to-day research: Masato Inoue (Waseda University, Faculty of Science and Technology, Advanced Electrical and Information Engineering and Life Science), Seiya Imoto (University of Tokyo, Institute of Medical Science, Human Genome Center), Seiko Endo (Daiichi Sankyo Co., Ltd.), Jun Ohashi (University of Tsukuba, Graduate School of Comprehensive Human Sciences) and Natsuhiko Kumasaka (RIKEN Center for Genomic Medicine, statistical analysis research team), listed in alphabetical order. I would like to take this opportunity to thank them. Among these members, Jun Ohashi in particular gave me advice over a wide range of topics while I was writing this book; I am truly grateful. In addition, Chikashi Terao (Kyoto University Hospital, Department of Immunology and Collagen Disease, internal medicine) and Takahisa Kawaguchi (Kyoto University Graduate School of Medicine, Center for Genomic Medicine) checked the document and the R source, in addition to our usual joint research activities, and Kawaguchi wrote the R source of the demonstration programs used in this book. I am also grateful to my wife Junko, who, from a position far removed from genetic statistics, looked through the entire manuscript and gave me opinions on its structure. And above all, I thank everyone at Ohmsha who gave me the opportunity to publish a book in the field of statistical genetics, and everyone in Ohmsha's editorial department who patiently bore with me even though the content and structure changed greatly as the writing advanced.

In this way, this book was finished thanks to many people. We hope that it will interest many readers in statistical genetics, and that it will give people of diverse academic backgrounds a chance to work in this field.


Here we mention related books on genetics, statistics, and R for reading alongside this book. This book is neither an introduction to be read before the books listed here, nor a technical book to be read after them. Either may be read first, but reading both should deepen understanding.

What is heredity? It means that when there is a blood connection, certain features are shared. When a human child is born from human parents, the feature of being human is shared by the parents and the child. If this were all too obvious, human beings would not have found it interesting and might not have created the word "heredity" to pinpoint the phenomenon. However, a child shares some features with its parents but not others; this attracted our ancestors' interest, the idea of "heredity" was born, and that seems to be the starting point of "genetics".

A feature of an organism is called a trait. Any feature can be a trait. Which traits one takes interest in determines the content of a study, and by taking a new analytical point of view one can also define a new trait. In that sense, it is useful to organize points of view for considering traits. One way to classify traits is by which of the five senses (sight, hearing, taste, smell, touch) assesses them. Another classification divides them into anatomical/structural, physiological/functional, molecular-biological, and pharmacological. One can also classify by whether a feature belongs to the individual itself, or concerns how the individual relates to predators and the environment (such as behavior or immune response). Other points of view, such as whether the trait is a mathematical concept, whether it can be physically measured, and how its appearance changes, also help to transform the categories. Data are obtained by observing these traits; such observed trait data are the phenotype.

Genes and their descent from parents to offspring have two aspects.

Making almost identical offspring

Making offspring different from their parents

Heredity is a mixture of identity and diversity.

Genetics studies hereditary phenomena (a mixture of identity and diversity) as realizations of stochastic events in genes.

Fitness is an important idea.

(Figure 1.1 )


An organism is characterized by leaving something like itself to the next generation. In asexual reproduction, the offspring directly takes over the parent's genes; in organisms with sexual reproduction, however, the offspring is not at all an identical copy of a parent, but takes over halves of the gene sets of its two parents. In terms of adaptation, since the parents are living well, making the child the same as the parents could be thought of as a strategy expected to go well. On the other hand, focus not on the individual but on the species to which the individual belongs. If all individuals of a species were identical, then when the environment changes, all individuals might find it difficult to survive at the same time; this can be thought of as a bad idea. If there is a variety of individuals, the possibility that a change of environment annihilates the species is reduced, so being composed of diverse individuals can be considered a good idea. In this way, genetic phenomena have a dual nature: they secure identity and they secure diversity.

What is gene?

It is quite difficult.

What is genome?

It is a bit easier than “what is gene?”.

Basic knowledge of chromosomes is necessary.

Let's draw a bar graph that represents the sizes of the human chromosomes (R1-1.R).

```
# Length of chromosomes (bp)
# A vector of chromosome length
chromLen<-c(247249719,242951149,199501827,191273063,180857866,
170899992,158821424,146274826,140273252,135374737,134452384,
132349534,114142980,106368585,100338915,88827254,78774742,
76117153,63811651,62435964,46944323,49691432,154913754,57772954)
# barplot. Labels of chromosomes are on the X axis and their lengths on the Y axis. The col argument specifies the color of the bars.
barplot(chromLen,names=c(1:22,"X","Y"),col="black")
```

Define the terms below.

locus (loci), base, SNP, allele, haplotype, exon, intron, genotype, diplotype, phenotype

Define the terms below.

diploid, homozygote, heterozygote, 2×3 table of a binary phenotype and the three genotypes of a biallelic locus

Autosomes exist in pairs, one from each parent. This means that you have two duplicate sets of genetic information. Organisms with two such sets are called diploid. When an individual has two identical alleles at a locus, it is said to be homozygous at that locus; when it has two different alleles, heterozygous. From here on, we handle diploid organisms only. Humans are also diploid. A diploid organism has two alleles at each autosomal locus. The combination of the two alleles an individual carries is its genotype. On the other hand, the appearance of an individual's trait is its phenotype. The most clear-cut relationships between genotype and phenotype appear as the dominant and recessive modes of inheritance. Assume two alleles M and m, and a trait with phenotypes "no" and "yes". Individuals have one of three genotypes: MM, Mm and mm.

Assume the allele m brings the "yes" phenotype. The relation between genotypes and phenotypes can be laid out in a \(2\times 3\) table: \[ \begin{array}{|c|c|c|c|c|} \hline &MM&Mm&mm&\text{sum}\\ \hline \text{yes}&n_{10}&n_{11}&n_{12}&n_{1.}\\ \hline \text{no}&n_{20}&n_{21}&n_{22}&n_{2.}\\ \hline \text{sum}&n_{.0}&n_{.1}&n_{.2}&n_{..}\\ \hline \end{array} \] \[ \sum_{i=1}^2 n_{ij}=n_{.j};\quad \sum_{j=0}^2 n_{ij}=n_{i.};\quad \sum_{i=1}^2 \sum_{j=0}^2 n_{ij}=n_{..} \] The rates of the phenotype being "yes", \(f_j=\frac{n_{1j}}{n_{.j}},\ j=0,1,2\), differ by genotype. Take MM as the reference genotype. Then each genotype's "yes" fraction can be expressed as a fold change from MM's fraction, \(\lambda_j=\frac{f_j}{f_0}\), which is the genotype relative risk (GRR). By definition \(\lambda_0=1\). When the GRR of the heterozygote (Mm) equals that of the homozygote mm (\(\lambda_1=\lambda_2\)), the impact of m on this phenotype is called the dominant mode of inheritance. On the contrary, when \(\lambda_1=\lambda_0\), we call it the recessive mode of inheritance.
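As a small illustration of the table and the genotype relative risk, here is a minimal R sketch; the matrix `tab` and all its counts are made up for illustration, not data from the text.

```
# Hypothetical 2x3 table: rows are phenotype yes/no, columns are genotypes
tab <- matrix(c(10, 90,   # MM: yes, no
                30, 70,   # Mm
                60, 40),  # mm
              nrow = 2, dimnames = list(c("yes", "no"), c("MM", "Mm", "mm")))
f <- tab["yes", ] / colSums(tab)  # "yes" fraction per genotype
lambda <- f / f["MM"]             # GRR relative to the reference genotype MM
lambda
```

With these counts, the fractions are 0.1, 0.3, 0.6, so the GRRs are 1, 3 and 6.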

There are intermediate forms that fit neither the dominant nor the recessive mode. There are two common parameterizations. One is based on the formula \[ \lambda_1=x \lambda_2 + (1-x) \lambda_0 \] in which x = 0.5 is the additive model, x = 0 is recessive and x = 1 is dominant.

The other expression is \[ \lambda_1=\lambda_2^y \times \lambda_0^{1-y} \] in which y = 0 is recessive, y = 0.5 is multiplicative, and y = 1 is dominant.

When x = 0.5, \[ \lambda_1=\frac{\lambda_0+\lambda_2}{2} \] is the arithmetic mean.

When y = 0.5, \[ \lambda_1=\sqrt{\lambda_0 \times \lambda_2} \] is the geometric mean.
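The two parameterizations above can be checked numerically. In this sketch, `lin` and `geom` are hypothetical helper names, and the values \(\lambda_0=1\), \(\lambda_2=4\) are made up.

```
lambda0 <- 1; lambda2 <- 4
lin  <- function(x) x * lambda2 + (1 - x) * lambda0  # linear (x) parameterization
geom <- function(y) lambda2^y * lambda0^(1 - y)      # multiplicative (y) parameterization
lin(0.5)   # arithmetic mean: 2.5
geom(0.5)  # geometric mean: 2
c(lin(0), lin(1))  # recessive and dominant limits: lambda0 and lambda2
```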

In addition, when the effect of having one copy of m falls outside the range between the effects of zero copies and two copies, it is called overdominance (super-dominance).

In the x and y parameterizations, intermediate effects are expressed by \(0 < x, y < 1\), and all other effects, including overdominance, by \(x, y < 0\) or \(x, y > 1\).

\(-\infty < x, y < \infty\) covers all inheritance models.

(Figure 2.1).

Mutations

Somatic mutations Germline mutations

Crossovers, recombination and recombinants

Figure 2.2

Identity by descent (IBD) vs Identity by state (IBS)

IBD = {0,1,2}

Parent-offspring: \[ Pr(IBD=\{0,1,2\}) = (0,1,0) \] Siblings: \[ Pr(IBD=\{0,1,2\}) = (0.25,0.5,0.25) \] \[ 0.25 = 0.5 \times 0.5 \]
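The sibling probabilities (0.25, 0.5, 0.25) can be checked with a small simulation: for each parent, the two sibs inherit the same parental allele with probability 0.5, and the IBD count at a locus is the sum over the two parents. A minimal sketch (variable names are hypothetical):

```
set.seed(1)
n <- 100000
momShare <- rbinom(n, 1, 0.5)  # do the sibs inherit the same maternal allele?
dadShare <- rbinom(n, 1, 0.5)  # do the sibs inherit the same paternal allele?
ibd <- momShare + dadShare     # IBD count per simulated sib pair
table(ibd) / n                 # approximately (0.25, 0.5, 0.25)
```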

Table 2.1

Relation between kinship and the probabilities of the number of IBD alleles: \[ \begin{array}{|c|ccc|c|cc|} \hline & \multicolumn{3}{c|}{\text{Pr. by number of IBD}} & \text{expected IBD} & \multicolumn{2}{c|}{\text{concordance}}\\ \text{relationship} & 2 & 1 & 0 & & \text{mean} & \text{variance}\\ \hline \text{self / identical twins} & 1 & 0 & 0 & 2 & 1 & 0\\ \text{parent-child} & 0 & 1 & 0 & 1 & \frac{1}{2} & 0\\ \text{siblings} & \frac{1}{4} & \frac{1}{2} & \frac{1}{4} & 1 & \frac{1}{2} & \frac{1}{8}\\ \text{grandparent-grandchild} & 0 & \frac{1}{4} & \frac{3}{4} & \frac{1}{4} & \frac{1}{8} & \frac{3}{64}\\ \hline \end{array} \]

The probabilities that the number of IBD alleles is 0, 1 or 2 are useful as information on genetic closeness. However, handling three probability values, a probability vector with three elements, can be inconvenient. When the closeness of kinship is your interest, it may be convenient to have a single value representing it; for example, the expected value of the IBD number is a candidate for such a purpose.

When the observable values are \(X=\{x_1,x_2,...\}\) and their probability is \(P=\{p_1,p_2,...\} (\sum_{i} p_i =1)\), the expected value is

\[ E(X)=\sum_{i} x_i p_i . \]

In the case of identical twins \(E(X) = 2\). The relation between one and one's self is also \(E(X) = 2\).

It might be reasonable to set the strength of the identity relation, the strongest blood relationship, to 1; then \(\frac{E(X)}{2}\) can also be a good index, which could be called the "concordance rate of alleles". The concordance rates of various relations are shown in Table 2.1.

The concordance rate between parent and child and that between sibs are the same, but the IBD probability vectors are different ({0,1,0} vs. {0.25,0.5,0.25}). This means the concordance rate is convenient because of its simplicity, but the simplicity is achieved by losing the detailed information in the probability vectors.

- Mean, variance, and moments

The expected value \(E(X)\) of \(X=\{x_1,x_2,...\}\) with \(P=\{p_1,p_2,...\} (\sum_{i} p_i =1)\) is \[ E(X)=\sum_{i} x_i p_i \]

This expected value can be called the "mean" of X.

The variance \(V(X)\) is an index representing the variation of the values, defined as \[ V(X)=\sum_{i} p_i (x_i-E(X))^2 . \]

Both mean and variance are indices to describe X and its distribution.

They can be named more generally: the mean is the 1st moment, and the variance is the 2nd moment around the mean. In general, X and its distribution can be summarized with the k-th moment around zero or around the mean.

It is expressed as, \[ \mu_k(c) = \sum_{i} p_i (x_i-c)^k \] with c as a center.

Therefore moments around zero is, \[ \mu_k(0) = \sum_{i} p_i x_i^k \]

and moments around mean is,

\[ \mu_k(E(X))=\sum_{i} p_i(x_i-E(X))^k \]

Let's make a function to calculate moments in R, as an R exercise.

```
# k-th moment of values x with probabilities p;
# center=TRUE gives the moment around the mean
momentProb <- function(x, p, order = 1, center = FALSE) {
  if (center) {
    x <- x - momentProb(x, p, order = 1, center = FALSE)
  }
  sum(x^order * p)
}
datax <- c(1, 2, 3)
datap <- c(0.2, 0.3, 0.5)
momentProb(x = datax, p = datap, order = 2, center = TRUE)
```

```
# Same as momentProb() with equal probabilities p = rep(1, length(x)) / length(x)
momentX <- function(x, order = 1, center = FALSE) {
  momentProb(x, p = rep(1, length(x)) / length(x), order = order, center = center)
}
```

Let’s apply the functions to the IBD vectors.

```
x <- c(1, 0.5, 0)          # concordance value for each IBD number
pDoho <- c(1/4, 1/2, 1/4)  # sibling probabilities by IBD number
momentProb(x, pDoho, order = 1, center = FALSE)  # expected value
momentProb(x, pDoho, order = 2, center = TRUE)   # variance
```

The results are shown in Table 2.1. Using the expected value (mean) and the variance, you can distinguish parent-child from siblings.

- 23 pairs of chromosomes

There are 23 pairs of chromosomes in humans. Assume no recombination. One chromosome of each pair is from the mother and the other from the father. We want to calculate how many pairs are IBD=2, how many pairs are IBD=1, and how many pairs are IBD=0.

When you compare identical twins, all 23 pairs are IBD=2. How about the case of siblings?

In total, 46 = 2 × 23 chromosomes descend from parents to an offspring.

Each one of 46 can be shared by a sibling pair or not. Let 1 and 0 denote shared and not-shared, respectively.

For a chromosome pair, the probability of “11” is 0.25 and “01” or “10” is 0.5 and “00” is 0.25.

There are 4 ways to make a repeated permutation of length 2: 00, 01, 10, 11.

Each probability is \[ \frac{1}{2^{2}} \]

For k chromosome pairs, there are \(2^{2k}\) possible {0,1} sequences, each with probability \[ \frac{1}{2^{2k}} \]

The probability that all 46 chromosomes are IBD between a sib pair is the probability of the sequence "11…1", which is \(\frac{1}{2^{2k}}\) with \(k=23\).

The probability that i out of \(2k\) chromosomes are IBD is the product of \[ \binom{2k}{i}=\frac{(2k)!}{i!\,(2k-i)!} \] and

\[ \frac{1}{2^{2k}} . \]

Therefore \[ \binom{2k}{i}\frac{1}{2^{2k}}=\frac{(2k)!}{i!\,(2k-i)!}\frac{1}{2^{2k}}. \] Figure 2.3 shows the probability of the number of IBD chromosomes.
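The closed form above matches R's built-in binomial density; a quick check:

```
# choose(2k, i) / 2^(2k) should equal dbinom(i, 2k, 0.5) for every i
k <- 23
i <- 0:(2 * k)
manual <- choose(2 * k, i) / 2^(2 * k)
all.equal(manual, dbinom(i, 2 * k, 0.5))  # TRUE
```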

When the number of pairs is bigger, the number of shared chromosomes is less variable. In other words, if the number of chromosomes is small, some sib pairs share many chromosomes and some share only a few; when the number of chromosomes is big, the variation stays in a narrow range.

```
sibIdValue <- function(k = 1) {
  (0:(2 * k)) / (2 * k)  # i / (2k)
}
sibIdProb <- function(k = 1) {
  dbinom(0:(2 * k), 2 * k, 0.5)
}
numch <- 1:23
means <- vars <- rep(0, length(numch))
for (i in 1:length(numch)) {
  identity <- sibIdValue(numch[i])
  prob <- sibIdProb(numch[i])
  if (i > 1) { par(new = TRUE) }
  plot(identity, prob, type = "b", ylim = c(0, 1))
  means[i] <- momentProb(identity, prob, order = 1, center = FALSE)  # mean
  vars[i] <- momentProb(identity, prob, order = 2, center = TRUE)    # variance
}
plot(means)
plot(vars)
```

- Chromosome lengths vary

In the previous section, the lengths of the chromosomes were plotted as shown in Figure 1.2. The shared fraction when i chromosomes are shared differs depending on whether the shared chromosomes are long or short.

The following simulation takes the length of chromosomes into calculation for two scenarios; one scenario is that all chromosomes have the same length and the other scenario is that their lengths vary as human chromosomes.

The average concordance rate is 0.5 for both and the variance of them is also similar (R2-5.R).

```
# Simulate sib-pair concordance: each chromosome is shared with probability 0.5,
# and f gives each chromosome's fraction of the total length
SibSim <- function(f = f, Niter = 10000) {
  rs <- matrix(rbinom(Niter * length(f), 1, 0.5), nrow = Niter)
  rs %*% f
}
chs <- rep(chromLen[1:23], 2)  # 46 chromosomes with the human lengths
f <- chs / sum(chs)
Niter <- 10000
simOut <- SibSim(f, Niter = Niter)
cheven <- rep(1, length(f))    # equal-length scenario
feven <- cheven / sum(cheven)
simOutEven <- SibSim(feven, Niter = Niter)
plot(ppoints(Niter, a = 0), sort(simOut), type = "l", ylim = c(0, 1))
plot(ppoints(Niter, a = 0), sort(simOutEven), type = "l", ylim = c(0, 1))
```

The mean and variance tell some information on the distribution, and they are similar between the two scenarios. Figure 2.4 plots the sorted concordance rates; the lines look different, smooth vs. step-like.

In the simulation faithful to the actual chromosome lengths (a), the curve is fairly smooth, whereas in the equal-length case (b) it becomes a staircase pattern: with equal lengths, the concordance rate takes discrete values.

Visualization can reveal information beyond the summarizing index values.

Three points from the example above:

Without plotting data, some information might be missed

Be careful when handling data with special conditions (“equal-length” was the condition in the example above).

Discreteness affects the distribution

- 23 pairs are divided into pieces - crossovers and recombination: Poisson process and the exponential distribution

Crossovers make recombinants, which divide chromosomes into segments. Segments are transmitted from parent to offspring.

The number of crossovers per chromosome per generation is not very large.

The number of crossovers per unit chromosomal length can be assumed to follow a Poisson distribution, a distribution for rare stochastic events.

See Figure 2.5. Crossover sites are generated in two different ways: one determines the sites site-by-site without using Poisson random numbers, and the other uses a Poisson random number generator.

```
RecombSim <- function(L = 10000, r = 0.001, Niter = 1000) {
  m <- matrix(rbinom((L - 1) * Niter, 1, r), nrow = Niter)
  apply(m, 1, sum)
}
RecombPois <- function(L = 10000, r = 0.001, Niter = 1000) {
  rpois(Niter, (L - 1) * r)
}
L <- 10000; r <- 0.0001; Niter <- 1000
NumSim <- RecombSim(L = L, r = r, Niter = Niter)
NumPois <- RecombPois(L = L, r = r, Niter = Niter)
ylim <- c(0, max(NumSim, NumPois))
plot(ppoints(Niter, a = 0), sort(NumSim), ylim = ylim, col = gray(6/8))
par(new = T)
plot(ppoints(Niter, a = 0), sort(NumPois), type = "l", ylim = ylim)
```

Crossovers make mosaics of chromosomes as shown in Figure 2.6.

The shared fraction (IBD fraction) should be the sum of the segments with the same color.

Because crossovers make the segments transmitted to offspring shorter, they do not change the mean of the IBD fraction but make its variance smaller (R2-7.R, Figure 2.7).

```
Niter <- 1000
L <- 1000000
r <- 0.0001
# crossover sites along a chromosome of length L
crosses <- sort(sample(1:(L - 1), rpois(1, (L - 1) * r), replace = FALSE))
A <- c(0, crosses)
B <- c(crosses, L)
C <- B - A  # segment lengths between successive crossovers
# exponential random numbers with the same mean, for comparison
rexps <- rexp(length(C), 1 / mean(C))
ylim <- c(0, max(C, rexps))
plot(sort(C), ylim = ylim, cex = 0.5, pch = 15)
par(new = T)
plot(sort(rexps), col = "red", ylim = ylim, type = "l")
```

New alleles are born with mutation events. Their initial allele frequency is \(\frac{1}{2n}\), where \(n\) is the population size. Their frequency changes over generations because some alleles are transmitted to the next generation more often than others. This stochastic change in allele frequency is called genetic drift.

A newly born allele might go extinct, or it might eventually take over from the original allele.
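As a minimal sketch of genetic drift (the parameter values are arbitrary, and the unlimited-pool binomial sampling used here is one simple model, not the finite-pool model developed later), the next generation's allele count can be drawn as binomial sampling of 2n chromosomes:

```
set.seed(2)
n <- 50                 # diploid population size: 2n = 100 chromosomes
gens <- 200
freq <- numeric(gens)
freq[1] <- 1 / (2 * n)  # a newly arisen mutation starts at frequency 1/(2n)
for (g in 2:gens) {
  # next generation: binomial sampling of 2n chromosomes at the current frequency
  freq[g] <- rbinom(1, 2 * n, freq[g - 1]) / (2 * n)
}
# after many generations the allele is often lost (0) or fixed (1)
tail(freq, 1)
```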

- Contingency table

Figure 2.8.

The model is that the chromosomes make a pool that is expanded, because each chromosome has a chance to transmit more than one copy to the next generation. The next-generation chromosome pool is then sampled from this expanded pool.

This sampling process can be expressed as a contingency table.

This is expressed in a 2 × 2 table, and its exact occurrence probability is \(\frac{4!\, 44!\, 12!\, 36!}{48!\, 2!\, 10!\, 2!\, 34!}\).
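The factorial expression can be verified against R's hypergeometric density; the cell counts (2, 2, 10, 34) and margins (4, 44) × (12, 36) are read off the expression:

```
# log-factorials avoid overflow of the large factorials
manual <- exp(lfactorial(4) + lfactorial(44) + lfactorial(12) + lfactorial(36) -
              (lfactorial(48) + lfactorial(2) + lfactorial(10) +
               lfactorial(2) + lfactorial(34)))
# drawing 12 of 48 chromosomes, 4 of which are mutant; probability of 2 mutants
all.equal(manual, dhyper(2, m = 4, n = 44, k = 12))  # TRUE
```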

- State transition

Assume a population with only two individuals. In total there are 4 chromosomes. The number of copies of the mutant allele can be 0, 1, 2, 3, or 4.

See Figure 2.9. The number of copies of the mutant allele will change in the next generation.

Figure 2.9 describes the state transition.

A state probability vector is a vector whose length is the number of states; its elements are non-negative and the sum of the elements is 1.

From the state with 0 mutant alleles, the number of mutant copies in the next generation is 0 with probability 1.

From the state with 1 mutant allele, the next-generation value can be 0, 1, 2, 3 or 4. The arrows in Figure 2.9 indicate the possible transitions.

(R2-8.R Figure 2.10).

```
# Given the population size, the number of mutant alleles, etc., compute the
# probability of each mutant-allele count in later generations.
# N: chromosome population size, p: number of mutant chromosomes,
# k: maximum number of chromosomes each chromosome leaves to the next
#    generation, infty: TRUE if there is no limit (infinite pool)
probDrift02NInf <- function(N, p, k, infty = FALSE) {
  if (infty) {  # make the pool from which chromosomes are drawn infinitely large
    ret <- rep(0, (N + 1))  # states with 0, 1, 2, ..., N mutant chromosomes
    pr1 <- p / N    # mutant chromosome frequency
    pr2 <- 1 - pr1  # non-mutant chromosome frequency
    if (pr1 == 0) {  # with no mutant chromosome, the next generation has none either
      ret[1] <- 1
    } else if (pr2 == 0) {  # with no non-mutant chromosome, the next generation is all mutant
      ret[N + 1] <- 1
    } else {
      for (i in 0:N) {  # probability that i out of N are mutant chromosomes
        ret[i + 1] <- exp(lgamma(N + 1) - lgamma(i + 1) - lgamma(N - i + 1) + i * log(pr1) + (N - i) * log(pr2))
      }
    }
    ret
  } else {  # the pool from which the next generation is drawn has a finite size
    ret <- rep(0, (N + 1))
    kN <- k * N
    kp <- k * p
    kNp <- kN - kp
    k1N <- (k - 1) * N
    commonLN <- lgamma(kp + 1) + lgamma(kNp + 1) + lgamma(k1N + 1) + lgamma(N + 1) - lgamma(kN + 1)
    for (i in 0:N) {
      if (kp >= i) {  # cannot draw more mutant chromosomes than exist in the pool
        n11 <- i
        n12 <- kp - n11
        n21 <- N - n11
        n22 <- k1N - n12
        if (n11 >= 0 && n12 >= 0 && n21 >= 0 && n22 >= 0) {
          ret[i + 1] <- exp(commonLN - lgamma(n11 + 1) - lgamma(n12 + 1) - lgamma(n21 + 1) - lgamma(n22 + 1))
        }
      }
    }
    ret
  }
}
# Given a state probability vector, compute the probabilities of the
# next generation's mutant-chromosome counts
nextGenExProb2 <- function(x, k, infty) {
  N <- length(x) - 1  # chromosome population size
  ret <- rep(0, length(x))
  for (i in 1:length(x)) {  # for each state 0, 1, ..., N, compute the transition probabilities to the next-generation states
    p <- i - 1
    tmpret <- probDrift02NInf(N, p, k, infty = infty)
    # the probability that the current count is p is x[i]; add the
    # next-generation state probabilities weighted accordingly
    ret <- ret + tmpret * x[i]
  }
  return(ret)
}
# k: maximum number of offspring chromosomes, initial: initial number of mutant
# chromosomes, np: population size (individuals), ng: number of simulated
# generations, infty: whether to use the infinite pool
DriftSim3 <- function(k = 2, initial = 1, np = 20, ng = 100, infty = FALSE) {
  m <- matrix(rep(0, (np * 2 + 1) * ng), nrow = ng)  # probabilities of all states in all generations
  m[1, initial + 1] <- 1  # generation 1 has the initial count with probability 1
  for (i in 2:ng) {  # simulate the second and later generations one by one
    m[i, ] <- nextGenExProb2(m[i - 1, ], k = k, infty = infty)
  }
  return(m)
}
# 20 individuals, 40 chromosomes, max 2 offspring chromosomes,
# 10 initial mutant chromosomes, 25 generations
out <- DriftSim3(k = 2, np = 20, initial = 10, ng = 25, infty = FALSE)
# display the result as a bird's-eye view;
# theta and phi set the viewpoint, shade adds shading for readability
persp(out, theta = 120, phi = 30, shade = 0.2, xlab = "generation", ylab = "No. mutants", zlab = "probability")
```

- Transition matrix

When the transition from one state to another is stochastic, the transitions among n states are represented by an n × n matrix, the transition matrix.

The i-th column of the matrix is the vector of next-generation probabilities of all states when the probability of the i-th state before the transition is 1.

Using this matrix, the next state probability vector can be calculated by multiplying the matrix with the current state vector, and the calculation can be repeated.

The next time-point state depends only on the state one step before; no earlier states affect it. Such a process is called a Markov chain.

Markov chains will appear again in Chapter 8, on linkage analysis.
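A minimal sketch of such a transition matrix for the 4-chromosome population of Figure 2.9, assuming the unlimited-pool (binomial) sampling case rather than the finite-pool model; `Tmat`, `v` and `v2` are hypothetical names:

```
# states: 0..4 copies of the mutant allele among 4 chromosomes
N <- 4
states <- 0:N
# column j+1 holds the next-generation distribution given j mutant copies now
Tmat <- sapply(states, function(j) dbinom(states, N, j / N))
v <- c(0, 1, 0, 0, 0)  # state probability vector: one mutant copy for sure
v2 <- Tmat %*% v       # one generation later
Tmat %*% v2            # two generations later
```

Note that states 0 and 4 are absorbing: once the mutant allele is lost or fixed, the state no longer changes.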

\[ pr_{i \to j}=\frac{(2N)!\,(2(k-1)N)!\,(ki)!\,(k(2N-i))!}{(2kN)!\,j!\,(2N-j)!\,(ki-j)!\,(2N(k-1)-(ki-j))!} \]

The corresponding 2 × 2 table is shown in the Japanese version of the textbook.

DNA RNA protein codon

Genetic information recorded in DNA exerts its function after being copied into RNA. As shown in Figure 2.11, the upper information "CAGGTT" of the DNA is taken as the RNA copy "CAGGUU". This is called transcription. DNA uses "A", "C", "G" and "T", but RNA uses "U" instead of "T". Many RNAs are further translated into proteins; such an RNA is called a messenger RNA (mRNA). An RNA can also exert its function without being translated; in that case it is called a functional RNA gene.

Reading the information of an mRNA and making a chain of amino acids (a protein) is called translation. Twenty kinds of amino acids are used in translation. Since \(4^2 < 20 \le 4^3\), the shortest base-sequence length that can identify the 20 amino acids with permutations of the 4 bases is 3; in fact, each permutation of 3 bases is translated into a corresponding amino acid. A sequence of 3 bases is called a codon; the correspondence between codons and amino acids is shown in Figure 2.12. Translation is controlled by a special codon that means the start of the translation process and by special codons that mean its end. Since the initiation codon "AUG" corresponds to the specific amino acid "Met", the first amino acid of a newly made protein is always "Met". There are several stop codons; because they correspond to no amino acid, the translation work is completed there.
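A toy illustration of translation, using the RNA "CAGGUU" from Figure 2.11 and a deliberately partial codon table (`codonTable` is a hypothetical helper holding only a few of the 64 codons):

```
# a few standard codon assignments; the full table has 64 entries
codonTable <- c(AUG = "Met", CAG = "Gln", GUU = "Val",
                UAA = "Stop", UAG = "Stop", UGA = "Stop")
mrna <- "CAGGUU"
# split the mRNA into consecutive 3-base codons
codons <- substring(mrna, seq(1, nchar(mrna), by = 3), seq(3, nchar(mrna), by = 3))
codonTable[codons]  # Gln, Val
```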

Figures 3.1 and 3.2

substitution insertion/deletion repeat inversion translocation

microscopic vs. macroscopic variants

structural variants

Figure 3.3

{A,T,G,C} : \(4^n\)

{0,1} : \(2^n\)

splicing

splicing variants

Using R, let's make the sequences of splicing variants (R3-1.R).

```
seq1<-sample(c("A","T","G","C"),100,replace=TRUE) # make a random DNA sequence of length 100
exonpattern1<-c(11:20,41:60,81:90) # exon pattern of mRNA1 (three exons)
exonpattern2<-c(11:20,41:60,66:77,81:90) # mRNA2 has an extra exon inserted between the 2nd and 3rd exons of mRNA1
seq1[seq1=="T"]<-"U" # in mRNA, "T" of DNA becomes "U"
mRNA1<-seq1[exonpattern1] # extract the mRNA1 sequence
mRNA2<-seq1[exonpattern2]
```

DNA sequence variations ~ Genome variations

Other omics layer variations

(Figure 3.4).

Diversity means that “values” vary. Therefore variance, a statistical index, is a good indicator of it.

In some conditions, multiple variances can be summed up, and one variance can be decomposed into multiple pieces that are themselves variances of something.

Assume a population with subpopulations.

You can measure variance of a whole population. Also you can measure variance of each subpopulation and also variance of means of subpopulations.

These “variances” are mutually related, and their relation is informative.

Assume a variable \(X_s\) and its \(N\) samples \(\{x_{s,1},...,x_{s,N}\}\).

Variance of \(X_s\) is, \[ V(X_s)=\frac{1}{N}\sum_{i=1}^N (x_{s,i}-\mu_s)^2 \] where \(\mu_s=\frac{1}{N}\sum_{i=1}^N x_{s,i}\).

You can write this as, \[ \sigma_{s,s}=V(X_s,X_s)=\frac{1}{N}\sum_{i=1}^N (x_{s,i}-\mu_s) \times (x_{s,i}-\mu_s) \]

Squaring is multiplying a value by itself.

If you replace one of the two \(s\)’s with \(t\),

\[ \sigma_{s,t}=\frac{1}{N}\sum_{i=1}^N (x_{s,i}-\mu_s) \times (x_{t,i}-\mu_t) \] where \(X_t\) is another variable.

This should be something: something related to variance but considering two variables. Actually this is called covariance.

Assume \(X1,X2\), such as test scores of physics and biology.

You are interested in the sum of two subjects, \(P=\{p_i=x_{1,i}+x_{2,i}\}\)

\[ \sigma_{P,P}=\sigma_{X1,X1}+\sigma_{X2,X2}+\sigma_{X1,X2}+\sigma_{X2,X1}=\sum_{s=1}^2 \sum_{t=1}^2 \sigma_{X_s,X_t} \] Covariance is 0 when two variables are mutually independent. You can easily find detailed description or proof of this.

Rather than repeating the regular textbook, here we simulate these with R.

```
nsample<-10000;X1<-rnorm(nsample);X2<-rnorm(nsample) # rnorm(), x1 and x2 are independent
p12<-X1+X2
cov12<-cov(cbind(X1,X2)) #X1, X2 variance covariance
cov12
vpp_2<-cov(p12,p12) #p12 variance
vpp_2-sum(cov12) # p12 vs. sum of variance and covariance
plot(X1,X2)
# x1 and x2 are dependent
X3<-X1+0.1*rnorm(nsample)
p13<-X1+X3
cov13<-cov(cbind(X1,X3))
cov13
vpp_3<-cov(p13,p13)
vpp_3-sum(cov13)
plot(X1,X3)
```

Number of variables is arbitrary, k.

\[ \sigma_{p,p}=\sum_{s=1}^k \sum_{t=1}^k \sigma_{s,t} \]

Phenotypes vary and have variance.

Genotypes vary and have variance.

Environmental factors vary and have variance.

We assume phenotypic variance is sum of genotypic variance and environmental variance.

\[ \sigma_{P,P}=\sigma_{G,G}+\sigma_{E,E} \]

Heritability, how heavily phenotype is affected by genes, is defined as \[ \frac{\sigma_{G,G}}{\sigma_{P,P}} \]
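As a numerical illustration of this definition (a sketch with made-up variances, not the book's code), simulate a phenotype as the sum of independent genotypic and environmental values:

```r
set.seed(1)
N <- 100000
G <- rnorm(N, 0, sqrt(0.6)) # genotypic values, variance 0.6
E <- rnorm(N, 0, sqrt(0.4)) # environmental values, variance 0.4
P <- G + E                  # sigma_PP = sigma_GG + sigma_EE when G and E are independent
var(P)        # close to 1
var(G)/var(P) # heritability estimate, close to 0.6
```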

Assume a biallelic polymorphism with 2 alleles, A and a, whose allele frequencies are p and 1-p.

G0, G1 and G2 (“aa”, “Aa” and “AA”) are the diplotypes carrying 0, 1 and 2 copies of allele A, respectively.

Under the assumption of independence in the combination of two alleles, \[ \begin{align*} G2=p^2\\ G1=2p(1-p)\\ G0=(1-p)^2 \end{align*} \] The state under this assumption is called Hardy-Weinberg equilibrium (HWE).

When the assumption is not true, a term representing the deviation from HWE should be added.

\[ \begin{align*} G2=p^2+\Delta\\ G1=2p(1-p)-2\times \Delta\\ G0=(1-p)^2+\Delta \end{align*} \]

This can be re-written as, \[ \begin{align*} G2=p^2+p(1-p)F\\ G1=2p(1-p)(1-F)\\ G0=(1-p)^2+p(1-p)F. \end{align*} \]

Here F is an index called the inbreeding coefficient.
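A small helper makes the role of F concrete (an illustrative sketch; the inbreeding coefficient is written f to avoid R's built-in F):

```r
# Genotype frequencies given allele frequency p and inbreeding coefficient f.
# f = 0 recovers the Hardy-Weinberg proportions.
genotypeFreq <- function(p, f){
  c(AA = p^2 + p*(1-p)*f,
    Aa = 2*p*(1-p)*(1-f),
    aa = (1-p)^2 + p*(1-p)*f)
}
genotypeFreq(0.3, 0)        # HWE: 0.09 0.42 0.49
genotypeFreq(0.3, 0.1)      # heterozygote deficit
sum(genotypeFreq(0.3, 0.1)) # the three frequencies always sum to 1
```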

Giving values 1 and 0 to “A” and “a”, we can describe the population numerically.

Mean and variance of the population of 2N chromosomes are, \[ \begin{equation*} \mu(h)=(p \times 1 +(1-p) \times 0) =p \end{equation*} \] \[ \begin{equation*} v(h)=p \times (1-p)^2+ (1-p) \times (0-p)^2=p(1-p) \end{equation*} \]

Considering diploid population,

\[ \begin{equation*} \mu(d)=G0\times 0 +G1\times 1 + G2 \times 2=2p=2\mu(h) \end{equation*} \] \[ \begin{align*} v(d)&=G0 \times (0-\mu(d))^2+G1\times (1-\mu(d))^2+G2\times (2-\mu(d))^2\\ &=2p(1-p)(1+F)\\ &=2v(h)+2\times F \times v(h) \end{align*} \]

When F=0, the variance of diploids is twice that of haploids.

The covariance between the two alleles of an individual is expressed as \[ Cov(d)=F \times v(h). \]

\[ F=\frac{Cov(d)}{v(h)}=\frac{Cov(d)}{p(1-p)} \] F is a measure of deviation from HWE and it is covariance of two alleles.
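This identity can be checked numerically; the sketch below (illustrative, not the book's code) samples genotypes with a given F, splits each genotype into its two alleles, and recovers F from their covariance:

```r
set.seed(2)
p <- 0.3; f <- 0.2; N <- 200000
gf <- c(p^2 + p*(1-p)*f, 2*p*(1-p)*(1-f), (1-p)^2 + p*(1-p)*f) # AA, Aa, aa
g <- sample(c(2, 1, 0), N, replace=TRUE, prob=gf) # number of A alleles per individual
a1 <- as.numeric(g == 2) + (g == 1)*rbinom(N, 1, 0.5) # first allele (1 = A)
a2 <- g - a1                                          # second allele
cov(a1, a2)/(p*(1-p)) # close to f = 0.2
```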

See Figure 3.6.

Two columns for two variables with 0 or 1.

F for HWD and r for LD are proportional to the covariance of two variables.

Difference between HWD and LD is the way to treat (1,0) and (0,1).

See Figure 3.7.

To handle the dominance model, we need a third column. The third column carries the information on heterozygotes.

The genotype value is the sum of the three columns.

\[ \begin{equation*} Vd=G1(1-G1)d^2 \end{equation*} \begin{equation*} Cov1d=Cov2d=\frac{1}{2}G1(G2-G0)d \end{equation*} \]

Polygenic traits or complex genetic traits.

When many factors get together, the overall effects tend to take normal distribution.

R3-3.R models many loci that are in HWE and mutually in linkage equilibrium (LE).

The phenotypic variance is the sum of each locus’s variance and their covariances.

```
covXY<-function(x,y){ # covariance from two value vectors
mx<-mean(x)
my<-mean(y)
sum((x-mx)*(y-my))/length(x)
}
covMat<-function(m){ # covariance for every pair of column vectors
ret<-matrix(rep(0,length(m[1,])^2),nrow=length(m[1,]))
for(i in 1:length(m[1,])){
for(j in 1:length(m[1,])){
ret[i,j]=covXY(m[,i],m[,j])
}
}
ret
}
N<-1000;M<-100 # number of individuals, number of biallelic polymorphisms
af<-runif(M) # allele frequencies; on runif() see Appendix A.5 on probability functions and pseudo-random number generation
r<-rnorm(M) # score per allele
d<-rnorm(M) # per-marker deviation from the additive model
Xmat<-matrix(rep(0,N*M),nrow=N)
for(i in 1:M){ # per-marker scores for all individuals
x<-rep(0,N)
x<-x+rbinom(N,1,af[i])
x<-x+rbinom(N,1,af[i])
x[x==1]<-1+d[i] # heterozygote-specific value (deviation from the additive model)
x<-x*r[i] # multiply by the effect size of the locus
Xmat[,i]<-x
}
X<-apply(Xmat,1,sum) # per-individual score over all polymorphisms
mean((X-mean(X))^2) # variance of the total score...
sum(covMat(Xmat)) # ...matches the sum of all variance-covariance entries
hist(X) # distribution of per-individual total scores
```

Part II: evaluating observations (Chapter 4), capturing samples individually (Chapter 5), and capturing samples as a group (Chapter 6). To understand the genotype-phenotype relationship of organisms, we observe genotypes and phenotypes and handle them as data. In Part II, after considering the types of data, which are the basis of data handling, we look at two approaches for handling sample data: one approach distinguishes the individual samples, and the other captures the samples as a group.

Since genetic statistics organizes and interprets life phenomena in terms of genes, the types and structures of data are also classified from that point of view. Data are divided into two:

genotype

phenotype (covering everything other than genotype)

Figure 4.1.

Interactions between layers.

One-way vs. two-ways.

Intermediate phenotypes and terminal phenotypes.

- Data types

Discrete vs. continuous

Ordered vs. not-ordered (vs. partially-ordered)

Two-category and 3-or-more-category data differ from the standpoint of order.

Table 4.1

- For two-category data, an order may or may not exist

A or non-A.

A is a subset of universe. non-A is complement of A.

Figure 4.2

Two categories can be handled as ordered. Figure 4.3.

- 3 or more category type

Three categories can be drawn as subsets as shown in Figure 4.2.

The three genotypes of a SNP may be treated as ordered (Figure 4.3) when focusing on the number of copies of one allele.

The symmetry of three categories can be represented as a triangle (Figure 4.3).

Ordered categories can be generated by dividing a continuous variable with thresholds (Figure 4.3).

Order can be placed in one dimensional space.

- Continuous type

Variables of this type are realized on a line.

When a line structure exists among the values, they are totally ordered. When some values are ordered but no order is defined between others, they are called partially ordered.

Combinatorial genotypes of 2 or more SNPs are partially ordered.

\[ \begin{align*} (0,0) < (0,1) < (0,2)\\ (0,0) < (1,0) < (2,0)\\ (0,0) < (1,1) < (2,2) \end{align*} \]

The number of combinatorial genotypes of N markers, each with \(k_i\) genotypes, is \[ \prod_{i=1}^N k_i =k_1\times k_2 \times ... \times k_N \]

Zero or more items might be selected from a item list. That may be found in questionnaires or in clinical criteria.

Some clinical criteria have a scoring system that has a weight vector for the items.

There are multiple ways to record them as shown in tables in the Japanese textbook.

Combination \(\binom{N}{k}=\frac{N!}{k!(N-k)!}\) or permutation \(\frac{N!}{(N-k)!}\) is important.
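In R, these counts are available directly; a quick check with illustrative numbers:

```r
choose(5, 2)              # 10 unordered ways to pick 2 items out of 5
choose(5, 2)*factorial(2) # 20 ordered ways
factorial(5)/factorial(3) # the same, N!/(N-k)!
```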

In case of a questionnaire, multiple items might be selected (combination) or they should be selected in the order of preference (permutation).

Assume a biallelic variant, with A and T alleles. Genotypes are AA, AT and TT.

We genotyped N individuals, and \(n_0,n_1,n_2\) individuals had AA, AT and TT, respectively.

This observation makes an \(N \times 2\) table, shown in the Japanese textbook.

The sum of the first column is the number of A alleles observed, and the sum of the second column is the number of T alleles.

All row sums are 2, because everybody is diploid.

Total sum is 2N.

This can be seen as the result of a repeated procedure that selects two from two categories; the two values can be the same or different.

Let’s test the null hypothesis where alleles are selected at random.

We perform the “exact probability test” for HWE. You may find some notes on exact probability tests in general in Chapter 13.

There are \(n_0+n_2\) cells with value 2, \(n_0 + n_2\) cells with value 0, and \(2\times n_1\) cells with value 1; the \(N\) row-marginal counts are all 2, and \(n_A\) and \(n_T\) are the marginal column sums, therefore,

\[ \begin{equation*} \frac{\prod_{i=1}^N (2!) n_A! n_T!}{(2N)! (2!)^{n_0+n_2} (0!)^{n_0+n_2}(1!)^{2\times n_1}}= \frac{2^{n_1} n_A! n_T!}{(2N)!} \end{equation*} \] Note \(0!=1, 1!=1, 2!=2\), and \(N=n_0+n_1+n_2\).

The table treats all N rows separately, which means the individuals are mutually distinguishable.

However the observation is on the numbers of people with three genotypes and no distinction of individuals.

The number of ways to obtain the \(N \times 2\) table is \(\frac{N!}{n_0! n_1! n_2!}\), therefore, the exact probability to observe \(n_0,n_1,n_2\) under the assumption of HWE is \[ \begin{equation*} \frac{N!}{n_0!n_1!n_2!} \times \frac{2^{n_1} n_A! n_T!}{(2N)!} = \frac{2^{n_1} n_A! n_T! N!}{(2N)!n_0!n_1!n_2!} \end{equation*} \]

```
hweExact<-function(g=c(813,182,5)){ # number of people with three genotypes
n<-sum(g) # total
nA<-2*g[1]+g[2] # No. allele A
na<-2*g[3]+g[2] # No. allele a
evod<-g[2]%%2 # heterozygote count even or odd
maxAa<-min(nA,na) # nA and na always have the same parity as g[2]
Aa<-seq(from=evod,to=maxAa,by=2) # list of possible heterozygous people number
AA<-(nA-Aa)/2 # corresponding No. AA
aa<-(na-Aa)/2 # corresponding No. aa
obs<-(g[2]-evod)/2+1 # index of the observed heterozygote count
prob<-rep(0,length(Aa)) # exact probs of all cases
prob<-exp(n*lgamma(2+1)+lgamma(nA+1)+lgamma(na+1)-lgamma(2*n+1)-(AA*lgamma(2+1)+Aa*lgamma(1+1)+aa*lgamma(2+1))+lgamma(n+1)-(lgamma(AA+1)+lgamma(Aa+1)+lgamma(aa+1)))
p.value<-sum(prob[prob<=prob[obs]]) # p-value
# Aa No. heteros
# prob exact probs
# obsprob exact prob of observed table
# p.value
list(Aa=Aa,prob=prob,obsprob=prob[obs],p.value=p.value)
}
xx<-hweExact(c(813,182,5))
xx$p.value # test p-value
sum(xx$prob) # sum of all exact probs is 1
```

Some variables might be observed only when another variable takes a particular value.

For example, item X is prompted to be answered only when answer to item Y is “yes”.

For example, genotype of a marker on Y chromosome is only available for males.

Mutually independent n categories can be placed in (n-1) dimensional space as vertices of (n-1)-simplex.

- The angle between two vectors pointing at two vertices is \(\theta\) with \(\cos\theta=-\frac{1}{n-1}\).

Assume k+1 dimensional space and its orthogonal basis vectors \(e_i =(0,0,....,0,1,0,...,0); i=1,2,...,k+1\).

They represent k+1 categories.

Take the subspace that is \[ \sum_{i=1}^{k+1} \beta_i e_i; \sum \beta_i = 1, \beta_i \ge 0 \], which is k-simplex.

The center of k-simplex is,

\[ o = \frac{1}{k+1} \sum_{i=1}^{k+1} e_i = \frac{1}{k+1} (1,1,...,1) \]

\[ f_i=e_i-o=(-\frac{1}{k+1},-\frac{1}{k+1},....,-\frac{1}{k+1},1-\frac{1}{k+1},-\frac{1}{k+1},...,-\frac{1}{k+1}) \] are k+1 vectors in k-dimensional space.

\[ |f_i|=\sqrt{\frac{1}{(k+1)^2}(k+k^2)}=\sqrt{\frac{k}{k+1}} \]

For any i and j (\(i \ne j\)), \[ \cos(\theta_{i,j})=\frac{f_i \cdot f_j}{|f_i||f_j|}=-\frac{1}{k+1}\cdot \frac{k+1}{k}=-\frac{1}{k} \]

- 1-dimensional space (line): regular 1-simplex, a line segment

- 2-dimensional space (plane): regular 2-simplex, an equilateral triangle

- 3-dimensional space (space): regular 3-simplex, a regular tetrahedron

As k grows, \(\cos(\theta_{i,j})=-\frac{1}{k} \to 0\), meaning the categories approach mutually orthogonal, independent-like relations.
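The angle formula can be verified numerically; a small sketch (not from the book):

```r
# cos(theta) between two vertex vectors of a regular k-simplex centered at the origin.
simplexCos <- function(k){
  e <- diag(k+1)   # orthonormal basis vectors, one per category
  f <- e - 1/(k+1) # subtract the simplex center from every vertex
  sum(f[1,]*f[2,])/(sqrt(sum(f[1,]^2))*sqrt(sum(f[2,]^2)))
}
simplexCos(2) # -1/2: equilateral triangle
simplexCos(3) # -1/3: regular tetrahedron
```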

A complete graph is a graph in which all vertices are mutually connected, meaning all pairwise relations are the same; it is a graph representation of multiple mutually independent categories.

Figure 4.4.

Pairwise comparison, the comparison of two, can be seen as the evaluation of the relation of two.

There are two directions to evaluate the relation of A and B: the relation of A to B and the relation of B to A.

\(A=B\) or \(A \ne B\) won’t change after you exchange A and B.

When A “is larger than” B, B “is smaller than” A. In this case, after you exchange A and B, you have to change “larger” to “smaller”.

The former relation is symmetric and the latter is asymmetric relation.

When A is 2 and B is 1, A is larger than B by 1 and B is smaller than A by 1.

“Larger” and “smaller” are different but “by 1” are common between two expressions.

“1” is distance between A and B and distance seems “symmetric”.

Distance can be said as non-negative measure to quantitate difference between two.

It is better to define distance clearly.

Metric space is a set for which distances between all members of the set are defined.

- Distance

is defined between two of something (A, B)

takes the value of the non-negative (greater than or equal to 0)

should be symmetric (distance from A to B and distance from B to A are the same)

When A=B, distance between A and B should be 0

For three items, there are three distances. There should be a triangle whose edge lengths are the three distances; in some cases, the triangle degenerates into a segment, with two vertices at the ends and the third vertex on the segment. This rule is called the triangle inequality.

- triangle inequality

Figure 4.5. \[ \begin{equation*} b+c>a,c+a>b,a+b>c \end{equation*} \] then, \[ \begin{align*} x=\frac{a+b+c}{2}-a = \frac{1}{2} (b+c-a) \ge 0\\ y=\frac{a+b+c}{2}-b = \frac{1}{2} (c+a-b) \ge 0\\ z=\frac{a+b+c}{2}-c = \frac{1}{2} (a+b-c) \ge 0\\ x+y=c, y+z=a, z+x=b \end{align*} \]

This means that when distances defined for all pairs satisfy the triangle inequality, there is a tree realizing those distances.
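The branch lengths x, y, z above are easy to compute; a minimal sketch (the parameter cc stands for the edge length c, renamed to avoid masking R's c() function):

```r
starBranches <- function(a, b, cc){ # branch lengths of the 3-leaf star tree
  s <- (a + b + cc)/2
  c(x = s - a, y = s - b, z = s - cc) # all non-negative iff the triangle inequality holds
}
starBranches(3, 4, 5) # x=3, y=2, z=1: x+y=5, y+z=3, z+x=4
```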

Any formula that satisfies the above mentioned rules, defines distance.

Euclidean distance is a familiar one: \[ \begin{equation*} d_E(A,B)=\sqrt{\sum_{i=1}^k (a_i-b_i)^2} \end{equation*} \]

This is another one, called Manhattan distance; \[ \begin{equation*} d_M(A,B)=\sum_{i=1}^k |a_i-b_i|=\sum_{i=1}^k \delta_i \end{equation*} \] where \(\delta_i=|a_i-b_i|\).

R’s dist() function offers multiple distance measures.

For example, the distance between DNA sequences can be the number of different bases.

Human: cccggaCAcCgActtcccGgggctcatt

Mouse: cccggaTGcAgGcttcccAgggctcatt

Five out of 28 bases are different. The distance is 5.
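Counting differing bases (the Hamming distance) is a short computation in R; a sketch using the two sequences above, with case ignored:

```r
hammingDist <- function(s1, s2){ # number of differing bases in equal-length sequences
  b1 <- strsplit(toupper(s1), "")[[1]]
  b2 <- strsplit(toupper(s2), "")[[1]]
  sum(b1 != b2)
}
human <- "cccggaCAcCgActtcccGgggctcatt"
mouse <- "cccggaTGcAgGcttcccAgggctcatt"
hammingDist(human, mouse) # 5
```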

When there is an insertion/deletion, it is more difficult to measure the distance between them.

Human: … cccggaCAcCgActtcccGgggctcattACcctCAc …

Mouse: … cccggaTGcAgGcttcccAgggctcatt = Tcct = Tc …

To define distance between DNA sequences that are different with substitutions and ins/dels, we have to have a formula to score the difference taking both types of variants into account.

The “BLAST” method quantitates the distance between DNA sequences by the rareness that some biological events generated one sequence from another. When quantitating the rareness, it uses a distribution called the extreme value distribution.

Categories placed in the shape of a simplex: distance expresses a symmetric relationship with a value greater than or equal to 0, so it cannot quantify relationships that include negative values. When categories are placed in space as a simplex, all categories are mutually in the same relationship, and the equal angle between them expresses this equality. Thus an angle \(\theta\) can also represent the relationship between two things, and \(\cos\theta\) can be used as the quantity representing it. When two data items are given as vectors, it is expressed with the inner product and the lengths of the vectors.

\(cos(\theta)=\frac{\sum_{i=1}^k x_i y_i}{|x||y|}\)

When the vectors are centered (mean 0), this is the value called the correlation coefficient.
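A quick numerical check that the centered cosine equals Pearson's correlation (illustrative, not from the book):

```r
set.seed(3)
x <- rnorm(50); y <- rnorm(50)
xc <- x - mean(x); yc <- y - mean(y) # center both vectors
cosTheta <- sum(xc*yc)/(sqrt(sum(xc^2))*sqrt(sum(yc^2)))
cosTheta - cor(x, y) # essentially 0
```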

So far, we compared two.

From here we compare N samples.

You may be interested in one out of N and compare the one vs. (N-1) others.

You may be interested in all N and compare \(\frac{N(N-1)}{2}\) pairs. Or you may compare \(\frac{N(N+1)}{2}\) pairs including “one-and-itself” relations. In case the pairwise relation is asymmetric, \(N^2\) comparisons should be made.

See Figure 4.7.

When all items are located on a line, they are in perfect order.

Sometimes, some paris have order but others don’t.

Assume a power set of a set \(\{1,2,...,N\}\), with N elements.

There are \(2^N\) elements that are subset of \(\{1,2,...,N\}\) in the power set.

Between some elements in the power set, inclusion relation exists, but for some pairs no.

Figure 4.8 is a Hasse diagram representing inclusion relation of elements of power set with 4 elements.

They can be seen as a 4-dimensional hypercube projected onto a 2-D plane.

```
powerSetHasse<-function(k=4){
lev<-k+1
numelem<-rep(0,lev)
a<-0:k
numelem<-choose(k,a)
numnode<-2^k
nodesX<-rep(0,numnode)
nodesY<-rep(0,numnode)
vecs<-matrix(rep(0,numnode*k),ncol=k)
keta<-2^(0:(k-1))
counter<-rep(0,lev)
startval<--(numelem-1)/2
for(i in 2:numnode){
vecs[i,]<-vecs[i-1,]
vecs[i,1]<-vecs[i,1]+1
for(j in 1:(k-1)){
if(vecs[i,j]==2){
vecs[i,j]<-0
vecs[i,j+1]<-vecs[i,j+1]+1
}else{
break # assigning j<-k does not end an R for loop; use break
}
}
tmpsum<-sum(vecs[i,])
nodesY[i]<-tmpsum
nodesX[i]<-counter[tmpsum]+startval[tmpsum+1]
counter[tmpsum]<-counter[tmpsum]+1
}
values<-vecs%*%keta
edges<-matrix(rep(FALSE,numnode^2),ncol=numnode)
for(i in 1:(numnode-1)){
for(j in (i+1):numnode){
tmpdist<-sum(abs(vecs[i,]-vecs[j,]))
if(tmpdist==1){
edges[i,j]<-TRUE
}
}
}
Edges<-which(edges,arr.ind=TRUE)
plot(nodesX,nodesY,pch=15)
segments(nodesX[Edges[,1]],nodesY[Edges[,1]],nodesX[Edges[,2]],nodesY[Edges[,2]])
}
powerSetHasse(4)
```

When distance is defined for all pairs of N elements, the information can be expressed as a N x N non-negative symmetric matrix, called distance matrix.

Using nj() function in ape package of R (nj standing for neighbor joining method), you can generate a tree structure that satisfies the distance matrix.

Tree construction is one of hierarchical clustering methods, that will be a topic in Chapter 5.

Example output of dist() for five random samples (a lower-triangular distance matrix):

```
         1        2        3        4
2 4.725108
3 1.372943 4.600435
4 4.135837 2.220284 4.311534
5 2.020896 3.608861 1.844573 3.591590
```

```
# Ns samples in k dimensional space
Ns<-5;k<-5;x <- matrix(rnorm(Ns*k), nrow=Ns)
dist(x,method="euclidean") # Euclidean distance
library(ape) # ape package
treu<-nj(dist(x,method="euclidean")) # tree construction with euclidean distance
trman<-nj(dist(x,method="manhattan")) # with manhattan distance
par(mfcol=c(1,2))
plot(treu);plot(trman)
par(mfcol=c(1,1))
```

The ways to handle samples can be classified into two; one is to handle samples individually and the other is to study samples as a distribution.

This chapter 5 is on the ways to handle them individually.

Next chapter 6 is on the ways to study them as a distribution.

Graph is a pair of a set of vertices and a set of edges.

Vertex

Edge

Directed vs. non-directed

Connection

Degree

Neighbor

Path

Cycle

Distance between vertex pair

Complete graph

Acyclic graph

Tree graph

-Graph algorithms

Figure 5.1.

\[ \text{No. edges} = \text{No. vertices} - 1 \]

Root; rooted tree vs. unrooted tree

leaf

Figure 5.3.

Phylogenetic tree for evolutionary history

Rooted tree

Cladogram

Hierarchical clustering

Starting from a data set, hierarchical clusters are to be generated with repeated procedures.

Multiple algorithms of hierarchical clustering.

Definition of distance

Definition of distance between tentative clusters

Rules in merging order

Figure 5.5.

Let’s try hierarchical clustering with R’s hclust() function (R5-1.R, Figure 5.6).

```
distMatrix<-dist(x,method="manhattan") # use Manhattan distance
trclust<-hclust(distMatrix,method="ward.D") # Ward's method for the between-cluster distance (older R accepted method="ward")
plot(trclust)
```

N samples and M variables make N x M data matrix.

Using the matrix data, N samples can be hierarchically clustered and M variables can be also hierarchically clustered as shown in Figure 5.7, a heat map.

```
m<-matrix(rbinom(120,1,0.5),20,6)
heatmap(m)
```

A data set with M samples and N markers.

For N markers, \(\frac{N(N-1)}{2}\) pair LD index values can be calculated.

Because the order of the N markers represents their genomic location, the order should not be changed; the LD index pattern is evaluated with the order kept.

The LD index r is a correlation coefficient.

```
cormatrix<-cor(m);rsqmatrix<-cormatrix^2
image(1:nrow(rsqmatrix),1:ncol(rsqmatrix),rsqmatrix,col=gray((100:0)/100))
```

M rows x N columns data set.

Based on your interest, you may rearrange/not-rearrange order of columns / rows, as shown in Figure 5.9.

Relation among N items,

Relation among M items,

Relation both among N items and among M items

Figure 5.10.

Pedigree: Graph-like but not a graph

Transformation of pedigree to a graph

Cousin marriage as a closed loop

Directed graph

- The graph of the individual’s relationship

Autosomal recessive trait shown in Figure 5.11.

- The graph of chromosomal relationship

Figure 5.12 is a graph representing chromosomal relationship.

Figure 5.12 is one of many such graphs satisfying Figure 5.11, meaning there are multiple “right answers” of chromosomal relation that satisfy the individual relation.

Figure 5.13 shows two answers.

This is why we have to estimate the chromosomal relation pattern(s) (genetic relation) that should describe the individual relation pattern (phenotypic relation).

Figure 5.14.

Chromosomal transmission graph changes its pattern at crossing over sites.

When focusing on a particular locus, no branchings in chromosomal graph.

When overlapping chromosomal graphs along chromosomes, every offspring chromosome is connected to paternal and maternal chromosome.

Figure 5.15.

For each locus, multiple trees exist in a population.

Multiple leaves share a common ancestral chromosome.

When going back, they merge; “coalescent”

In brief, linkage analysis tries to identify the tree whose pattern best fits the pattern of the phenotype.

R code is downloadable: R5-sup1.R

Tree is a graph without cycle.

Network is a graph with cycle.

In general graphs with cycles are difficult to handle and difficult to be interpreted.

Cycles may represent “positive / negative feedback”.

```
ARGsim<-function(n=8,g=5,l=6,r=r){
ids<-1:n
now<-1:n
parents<-rep(0,n*g*l)
data<-rep(0,n*g*l)
for(i in 1:n){
data[(1+(i-1)*l):(i*l)]<-i
}
count<-1+n*l
for(i in 2:g){
tmp<-sample(ids,replace=FALSE)
for(j in 1:(n/2)){
countinside<-1
for(x in 1:2){
first<-1
if(runif(1)>0.5){
first<-2
}
data[count]<-now[tmp[(j-1)*2+first]]
parents[count]<-tmp[(j-1)*2+first]
now[countinside]<-data[count]
count<-count+1
countinside<-countinside+1
for(k in 1:(l-1)){
if(runif(1)<r){
first<-first+1
if(first==3){
first<-1
}
}
data[count]<-now[tmp[(j-1)*2+first]]
parents[count]<-tmp[(j-1)*2+first]
now[countinside]<-data[count]
count<-count+1
countinside<-countinside+1
}
}
}
}
return(list(allele=matrix(data,nrow=l),parents=matrix(parents,nrow=l)))
}
n<-8
g<-10
l<-200
r<-0.05
d<-ARGsim(n=n,g=g,l=l,r=r)
mmm<-matrix(d$parents[1,],ncol=g)
plotSlice<-function(m){
points<-which(m>=0,arr.ind=TRUE)
plot(points[,1],points[,2])
for(i in 2:length(m[1,])){
for(j in 1:length(m[,1])){
segments(m[j,i],i-1,j,i)
}
}
}
locus<-1:l
PlotSerial<-function(m,locus,g){
for(i in locus){
mmm<-matrix(m[i,],ncol=g)
plotSlice(m=mmm)
}
}
PlotOverlap<-function(m,locus,g){
for(i in locus){
mmm<-matrix(m[i,],ncol=g)
par(new=T)
plotSlice(m=mmm)
}
}
PlotSerial(d$parents,locus=locus,g=g)
PlotOverlap(d$parents,locus=locus,g=g)
for(i in 1:l){
mmm<-matrix(d$parents[i,],ncol=g)
plotSlice(m=mmm)
}
locus<-1
mmm<-matrix(d$parents[locus,],ncol=g)
plotSlice(m=mmm)
```

In Chapter 5, we treated samples individually.

In this Chapter 6, we characterize the samples as a whole.

Assume samples with one quantitative variable. Let’s draw a box plot, a density plot and a cumulative distribution of them (R6-1.R).

```
n1<-1000;n2<-500 # No. samples
# Generate unimodal values (unimodal ~ one-peak)
popdata1<-rnorm(n1,0,0.5) # Random value generation from normal dist
par(mfcol=c(1,3)) # Divide a drawing area into three
plot(ecdf(popdata1)) # empirical cumulative distribution
plot(density(popdata1)) # density
boxplot(popdata1) # boxplot
par(mfcol=c(1,1))
summary(popdata1) # summarization of samples
# bimodal values
popdata2<-c(rnorm(n1,0,0.5),rnorm(n2,5,1))
par(mfcol=c(1,3))
plot(ecdf(popdata2))
plot(density(popdata2))
boxplot(popdata2)
par(mfcol=c(1,1))
```

Read unimodality from the density plots.

How do you read the same information from the cumulative plots?

Height in the density plot ~ gradient of the cumulative plot.

When it is unimodal, basic descriptive statistics such as minimum, maximum, quantiles, median and mean are informative.

Read bimodality from the density and cumulative plots.

How is the boxplot useful?

How are the basic statistics useful?

What we should do for two dimensional data sets are similar to ones in one dimensional space.

Let’s make a data set with three peaks in 2-dimensional space and draw a scatter plot, a filled contour and a perspective view (R6-2.R, Figure 6.2).

```
# Use rnorm()
n1<-500;n2<-300;n3<-200;x<-c(rnorm(n1,0,0.5),rnorm(n2,5,1),rnorm(n3,8,2));y<-c(rnorm(n1,0,2),rnorm(n2,3,2),rnorm(n3,-3,1))
#library(gregmisc) # has a function hist2d()
library(gplots)
h2d <- hist2d(x,y, show=FALSE,same.scale=TRUE, nbins=c(10,10)) # taking 2d histogram information
plot(x,y) # scatter plot
filled.contour( h2d$x, h2d$y, h2d$counts, nlevels=9,col=gray((8:0)/8) ) # gray scale for density
persp( h2d$x, h2d$y, h2d$counts,ticktype="detailed", theta=60, phi=30,shade=0.5, col="cyan") # perspective view
```

When samples are in higher dimensional space, visualization of them is impossible (or difficult in the same way), but what we want to do for them is similar; we should grab their distribution and summarize them with descriptive statistics.

Because the samples look like they have three peaks, let’s assign each sample to one of three cluster groups. The k-means method is one of the methods called non-hierarchical clustering.

First you specify the number of clusters, k.

Tentatively assign all samples to one of the k clusters. Calculate the center of each cluster based on the samples in the cluster.

For each sample, you have k center coordinates and the sample’s own coordinate. Based on the k+1 coordinate information, you re-assign the sample to one of k clusters. After the reassignment, the center coordinates of clusters should be updated.

You repeat the reassignments until no reassignment is necessary.

R6-3.R draws Figure 6.2 (d).

```
# non-hierarchical clustering
m3<-matrix(c(x,y),ncol=2)
cl <- kmeans(m3, 3) # kmeans method, specifying no. clusters = 3
plot(m3, col = cl$cluster)
points(cl$centers,pch = 8, cex=10) # marking the center of clusters
```

Population genetics is a field to study populations of organisms from genetic standpoint and a big branch in statistical genetics.

Individuals in a population may interact each other at random, but in reality their interactions are not always at random.

Geology and other natural factors and cultural factors (language, religion, ethnic identity, state) cause the non-randomness.

Random stochastic events affected by various non-random factors along time generates uneven population along time axis.

When individuals mate at random in a population, the distribution of genotypes should be in the state called Hardy-Weinberg equilibrium (HWE).

It is easy to model an HWE population. Because of various factors violating HWE, populations in reality are not in HWE; they are in Hardy-Weinberg disequilibrium (HWD).

To model populations in HWD, you may mix multiplpe populations in HWE.

We model a population that is a mixture of two HWE populations in the followings.

Assume two populations and a biallelic polymorphism with A and a alleles. The allele frequencies of A in the two populations are p and q, respectively, and the mixture combines the two with fractions c and (1-c).

Two populations; \[ \begin{equation*} p^2,2p(1-p),(1-p)^2 \end{equation*} \] \[ \begin{equation*} q^2,2q(1-q),(1-q)^2 \end{equation*} \] Their mixture; \[ \begin{equation*} (g_{m1},g_{m2},g_{m3})=(cp^2+(1-c)q^2,\ 2cp(1-p)+2(1-c)q(1-q),\ c(1-p)^2+(1-c)(1-q)^2) \end{equation*} \]
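A sketch of this mixture in R (illustrative values; the mixing fraction is written w to avoid masking R's c() function):

```r
mixGenotypes <- function(p, q, w){ # w = fraction of population 1 in the mixture
  w*c(p^2, 2*p*(1-p), (1-p)^2) + (1-w)*c(q^2, 2*q*(1-q), (1-q)^2)
}
gm <- mixGenotypes(p=0.2, q=0.8, w=0.5)
gm                    # AA, Aa, aa frequencies of the mixture: 0.34 0.32 0.34
pm <- gm[1] + gm[2]/2 # mixture allele frequency, w*p + (1-w)*q = 0.5
gm[2] - 2*pm*(1-pm)   # negative: fewer heterozygotes than HWE expects
```

The last line shows that the mixture itself deviates from HWE whenever \(p \ne q\).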

Give values 1 and 0 to A and a, respectively. Now you can express the state of each population with its mean and variance. \[ \begin{align*} m_1=2p,v_1=2p(1-p)\\ m_2=2q,v_2=2q(1-q) \end{align*} \]

The mixture population’s allele frequency is \[ cp+(1-c)q, \] which is easy.

The mean of the mixture population is easy. \[ \begin{equation*} mm=2(cp+(1-c)q)=c m_1+(1-c)m_2 \end{equation*} \]

Variance is a bit difficult, because we have to consider covariance as we learned in Chapter 3.
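One way to see it (a simulation sketch, not the book's code): by the law of total variance, the mixture variance is the weighted average of the two variances plus a between-population term from the mean difference, \(v_m = c v_1 + (1-c) v_2 + c(1-c)(m_1-m_2)^2\). The mixing fraction is written w in the code.

```r
set.seed(4)
p <- 0.2; q <- 0.8; w <- 0.5; N <- 100000
g1 <- rbinom(round(w*N), 2, p)     # genotype scores (0,1,2) from population 1
g2 <- rbinom(round((1-w)*N), 2, q) # genotype scores from population 2
m1 <- 2*p; v1 <- 2*p*(1-p)
m2 <- 2*q; v2 <- 2*q*(1-q)
mix <- c(g1, g2)
var(mix)                            # empirical mixture variance, ~0.68
w*v1 + (1-w)*v2 + w*(1-w)*(m1-m2)^2 # theoretical value: 0.32 + 0.36 = 0.68
```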

- Diffusion equation

Assume two isolated islands, A-island and B-island.

At some time point, all chromosomes had M allele in A-island and all had m in B-island.

Suddenly the two islands are connected somehow.

Since the connection was established, \(d\) individuals move from A-island to B-island per unit time and the same number \(d\) move from B-island to A-island, without change in the population sizes \(P_A, P_B\) of the two islands.

Let \(Fa(t),Fb(t)\) denote allele frequency of M at time t in A and B islands, respectively, with t=0 being the time of establishment of connection.

When \(t \le 0\), \(Fa(t)=1, Fb(t)=0\).

The change from \(t=T\) to \(t=T+\delta\) is expressed as below; \[ \begin{equation*} 2P_A Fa(T+\delta)=2P_A Fa(T) -2d\delta Fa(T) + 2d \delta Fb(T) \end{equation*} \] \[ \begin{equation*} 2P_B Fb(T+\delta)=2P_B Fb(T) -2d\delta Fb(T) + 2d \delta Fa(T). \end{equation*} \] Multiplying the first equation by \(P_B\), the second by \(P_A\) and subtracting, \[ \begin{equation*} 2P_A P_B (Fa(T+\delta)-Fb(T+\delta))=2P_A P_B (Fa(T)-Fb(T))-2d\delta (P_A+P_B)(Fa(T)-Fb(T)), \end{equation*} \] then, \[ \begin{equation*} (Fa(T+\delta)-Fb(T+\delta))-(Fa(T)-Fb(T))=- d\delta \frac{P_A+P_B}{P_A P_B} (Fa(T)-Fb(T)). \end{equation*} \]

\[ \begin{equation*} G(T+\delta)-G(T)=- d\delta \frac{P_A+P_B}{P_A P_B} G(T), \end{equation*} \] where \(G(T)=Fa(T)-Fb(T)\), and \[ \begin{equation*} \frac{d}{dt} G(t)=- d \frac{P_A+P_B}{P_A P_B} G(t), \end{equation*} \] then, \[ \begin{equation*} G(t)= K e^{- d \frac{P_A+P_B}{P_A P_B}t}. \end{equation*} \]

Now \(G(0)=1\), therefore \(K=1\).

The differential equation for \(G(t)\) means that the change per unit time is proportional to the difference in allele frequency between the two islands.

Because \[ \begin{equation*} P_A Fa(t)+P_B Fb(t)=P_A Fa(0)+P_B Fb(0)=P_A, \end{equation*} \] therefore, \[ \begin{equation*} Fa(t)=\frac{P_A + P_B e^{- d \frac{P_A+P_B}{P_A P_B}t}}{P_A+P_B} \end{equation*} \begin{equation*} Fb(t)=\frac{P_A (1- e^{- d \frac{P_A+P_B}{P_A P_B}t})}{P_A+P_B} \end{equation*} \]

Let’s draw this chronological change with R (R6-4.R, Figure 6.3).

```
pa<-9000;pb<-1000;d<-100;t<-0:100 # pa,pb: population sizes, d: number of immigrants per unit time, t: generation
fa<-(pa+pb*exp(-d*(pa+pb)/(pa*pb)*t))/(pa+pb)
fb<-(pa*(1-exp(-d*(pa+pb)/(pa*pb)*t)))/(pa+pb)
plot(t,fa,ylim=c(0,1),type="l",xlab="t",ylab="frequency")
par(new=T)
plot(t,fb,ylim=c(0,1),type="l",xlab="t",ylab="frequency")
```

Over time (horizontal axis), the allele frequencies of the two islands converge to the same value. The speed of convergence depends on the fraction of migrating individuals in the island populations.

Eventually the two islands' allele frequencies become the same even with ongoing migration; this is an equilibrium.

The differential equation here is the simplest form of the diffusion equation known from thermodynamics, a branch of physics.

- Representation by a transition matrix

Assume five islands, A, B, C, D, and E. The population of each island does not change. The fractions of movement from one island to the others are constant.

The number of individuals moving from island A to island X is \[ P_A m_{A\to X};\ X=B,C,D,E \]

and the fraction remaining on island A is \[ m_{A\to A}=1-\sum_{X\in \{B,C,D,E\}}m_{A \to X}, \] then

\[ f_A(T+1)=f_A(T)m_{A\to A} +\sum_{X \in \{B,C,D,E\}} f_{X}(T) m_{X \to A}=\sum_{X \in \{A,B,C,D,E\}} f_{X}(T) m_{X \to A} \]

The matrix representation of these is, \[ \begin{pmatrix} m_{A\to A}&m_{B\to A}& m_{C\to A}& m_{D\to A}& m_{E\to A}\\ m_{A\to B}&m_{B\to B}& m_{C\to B}& m_{D\to B}& m_{E\to B}\\ m_{A\to C}&m_{B\to C}& m_{C\to C}& m_{D\to C}& m_{E\to C}\\ m_{A\to D}&m_{B\to D}& m_{C\to D}& m_{D\to D}& m_{E\to D}\\ m_{A\to E}&m_{B\to E}& m_{C\to E}& m_{D\to E}& m_{E\to E} \end{pmatrix} \begin{pmatrix} f_A(T) \\ f_B(T) \\ f_C(T) \\ f_D(T) \\ f_E(T) \end{pmatrix} = \begin{pmatrix} f_A(T+1) \\ f_B(T+1) \\ f_C(T+1) \\ f_D(T+1) \\ f_E(T+1) \end{pmatrix} \]
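A sketch of this update rule in R (the migration scheme below is our own toy choice, not from the text): with equal island sizes and a doubly stochastic transition matrix, iterating \(f(T+1)=Mf(T)\) drives the five allele frequencies to a common value.

```r
# Toy example (assumed): each generation a fraction mig of every island's
# population redistributes evenly over all five islands, so the transition
# matrix is doubly stochastic and island sizes stay constant.
nI  <- 5
mig <- 0.1
Tr  <- (1 - mig) * diag(nI) + mig * matrix(1 / nI, nI, nI)
f   <- c(1, 0, 0, 0, 0)  # allele M fixed on island A, absent elsewhere
for (g in 1:300) f <- as.vector(Tr %*% f)
round(f, 6)  # all five frequencies converge to the overall mean 0.2
```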

In the example of Section 6.3.3, the positions of individuals within islands were not considered. It means that islands were assumed to be points without size, i.e., 0-dimensional objects.

Now we consider “space”.

As the simplest example, assume a straight line, a 1-dimensional space where individuals are located and move around.

They move because some locations in the space are better to live in.

Individuals tend to move toward the best spot and eventually all individuals end up there.

This movement is expressed in a form of diffusion equation.

The R source code to draw this figure can be downloaded (R6-sup1.R).

```
L<-100
ngen<-100
X<-rep(0,L)
X[20]<-1
Peak<-60.3
Benefit<-L-abs((1:L)-Peak) # benefit of each location; the peak is the best spot
counter<-0
C<-10 # plot every C generations
frac<-0.5 # fraction of individuals that moves per generation
plot(X,ylim=c(0,1),type="l")
for(i in 2:ngen){
  tmpX<-rep(0,L)
  for(j in 1:L){
    pre<-j-1
    self<-j
    post<-j+1
    if(j==1) pre<-L # the line is closed into a ring
    if(j==L) post<-1
    tmpB<-c(Benefit[pre],Benefit[self],Benefit[post])
    maxid<-which.max(tmpB) # which.max returns a single index even in case of ties
    if(maxid==1){
      tmpX[pre]<-tmpX[pre]+X[self]*frac
    }else if(maxid==2){
      tmpX[self]<-tmpX[self]+X[self]*frac
    }else if(maxid==3){
      tmpX[post]<-tmpX[post]+X[self]*frac
    }
    tmpX[self]<-tmpX[self]+X[self]*(1-frac)
  }
  counter<-counter+1
  if(counter==C){
    par(new=T)
    plot(tmpX,ylim=c(0,1),type="l")
    counter<-0
  }
  X<-tmpX
}
# The same procedure wrapped in a function
PlotMove<-function(L=100,ngen=200,loc=20,Peak=60.3,C=10,frac=0.5){
  X<-rep(0,L)
  X[loc]<-1
  Benefit<-L-abs((1:L)-Peak)
  counter<-0
  plot(X,ylim=c(0,1),type="l")
  for(i in 2:ngen){
    tmpX<-rep(0,L)
    for(j in 1:L){
      pre<-j-1
      self<-j
      post<-j+1
      if(j==1) pre<-L
      if(j==L) post<-1
      tmpB<-c(Benefit[pre],Benefit[self],Benefit[post])
      maxid<-which.max(tmpB)
      if(maxid==1){
        tmpX[pre]<-tmpX[pre]+X[self]*frac
      }else if(maxid==2){
        tmpX[self]<-tmpX[self]+X[self]*frac
      }else if(maxid==3){
        tmpX[post]<-tmpX[post]+X[self]*frac
      }
      tmpX[self]<-tmpX[self]+X[self]*(1-frac)
    }
    counter<-counter+1
    if(counter==C){
      par(new=T)
      plot(tmpX,ylim=c(0,1),type="l")
      counter<-0
    }
    X<-tmpX
  }
}
PlotMove()
```

So far we have handled simple space and time. Three-dimensional space plus time makes a four-dimensional space.

Sometimes a space is not "flat". For example, a sphere is a three-dimensional object and its surface exists in three-dimensional space, but the surface itself is two-dimensional, and it is closed and finite.

Spatio-temporal space is handled, of course, in life science and genetics, but we also handle spaces of information, or spaces of observed variables, which can be very high-dimensional.

When we study informational spaces, they vary a lot.

There are two points to be sensitive about regarding spaces.

Finite vs. infinite

Open vs. closed

These two points look similar but are somewhat different. For example, when a space is closed, the space itself is finite, but movements in it can be infinite (Figure 6.5).

When we make a model, we often assume infinity, which usually makes the model easier.

Population size can be modeled as increasing without limit; that might be enough for some cases but not when you have to consider saturation.

Point mutations can be modeled as happening on a DNA molecule of infinite length; that might be good for some scenarios but not for others.

Let us consider space and time in general from the standpoints of physics and thermodynamics.

A world that is a point without time.

Everything is located at the same spot and everything happens instantly. HWE is achieved without delay.

A world that has spatial extent but no time. A spatial distribution appears. The distribution might consist of discrete items. Sometimes it is easier to handle a material with density in the space: a "continuum". In physics, it might be called a fluid.

Even with spatial extent, if there is no time, everything happens instantly, meaning the equilibrium is achieved without transition states. It might be called a steady state.

Assume a world with time.

Now we can have states that move from one to another. There are states that are periodic, and the periodic conditions might be steady states.

In biology, steady but dynamic conditions seem to be dominant.

These descriptions required terms from thermodynamics. Thermodynamics handles populations of very small items as if the components were infinitesimally small.

Statistical mechanics is a branch of physics closely related to thermodynamics; it pays attention to individual components and their stochastic features, which allows us to study chaotic, non-linear phenomena.

That is why thermodynamics and statistical mechanics are useful for studying complicated phenomena in life science.

Chapter 7: Scale, variable, degrees of freedom, dimension

Chapter 8: Statistic, index, probability, likelihood

Chapter 9: Probability and likelihood

Chapter 10: Likelihood and variables in linkage analysis

Chapter 11: Indices

In Part II, we discussed what data are, what types of data there are, and what concepts should be taken into account to handle data and samples. Part III introduces the concepts necessary for handling actual data.

- How many values are necessary to describe a data set

Total number

Marginal counts

Expected value of cells under the assumption of independence

Deviation from expected

Two out of 2x3=6 cells

- Relationships between the variable sets

\[ \begin{align*} &n_{..}=\sum_{i=1}^2 \sum_{j=1}^3 n_{ij}\\ &n_{1.}=\sum_{j=1}^3 n_{1j}, n_{.1}=\sum_{i=1}^2 n_{i1}, n_{.2}=\sum_{i=1}^2 n_{i2}\\ &\delta_{11}=n_{11}- \frac{ n_{1.}\times n_{.1}}{n_{..}}, \delta_{12}=n_{12}- \frac{ n_{1.}\times n_{.2}}{n_{..}} \end{align*} \]
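These quantities can be computed directly in R; here with the same 2x3 counts used in R7-1.R:

```r
# Marginal counts, expected cell values under independence, and deviations
n <- matrix(c(25, 23, 12,
              15, 17,  8), nrow = 2, byrow = TRUE)
n.. <- sum(n)                                   # total count
expected <- outer(rowSums(n), colSums(n)) / n.. # n_i. * n_.j / n_..
delta <- n - expected                           # deviations from expectation
delta
```

Every row and column of `delta` sums to zero, which is why only two of the six deviations are free.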

- Rough vs. Fine

Total \(n_{..}\) and marginal counts were …

Rough vs. Fine

Rough: “No big difference from expected”

Fine : “Big difference from expected like …”

- Quantification of the rough report and the degree of freedom

"How big" can be measured as some kind of "distance", a statistic.

e.g. the \(\chi^2\) statistic.

The \(\chi^2\) value is a function of the deviation values of two cells.

The \(\chi^2\) value tends to be bigger when it is a function of more variables.

No. variables: degrees of freedom

It is convenient to convert the \(\chi^2\) value, which depends on the degrees of freedom, into something independent of the degrees of freedom: the p-value (R7-1.R).

```
obtable2 <- matrix(c(25, 23, 12, 15, 17, 8), nrow = 2, byrow = TRUE) # 2x3 table
chisq.test(obtable2)
# Measuring "distance" as chi-sq and converting it to p-value independent from condition, df
```

Rough: “No big difference from expected” can be rephrased as

Rough: “No big difference from expected with p = 0.901”

- The degrees of freedom of the contingency table

2x3 table has 6 values.

Six values are necessary to tell the table in full.

No. variables ~ degrees of freedom of the 2x3 table is 6.

But it is often said that the "degrees of freedom" of a 2x3 table is 2.

What is the difference between them?

- Variation

Total 100 patients, each taking one of three drugs

Three groups 40 + 40 + 20 = 100

A blood test

See Figure 7.1 (a).

100 values to describe 100 patients test results.

Alternatively: the average of all patients plus 99 patients' individual results. You can calculate the 100th patient's result from the average and the other 99 values.

See Figure 7.1 (b).

The averages of the three groups are different, so telling each group's average is a good idea.

Again 100 values are necessary.

1 total average

2 group averages (the third group's average is determined by the total average and the other two)

\(n_1-1\) individual values of group 1

\(n_2-1\) individual values of group 2

\(n_3-1\) individual values of group 3
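Counting the values listed above confirms that they add up to the number of patients:

```r
# 1 total average + 2 free group averages + (n_i - 1) individual values per group
nx <- c(40, 40, 20)                 # group sizes
1 + (length(nx) - 1) + sum(nx - 1)  # = 100, the total number of patients
```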

- Small variation among samples in the same groups

Rough vs. fine

Total mean only

Total mean and total variance

When the differences among the group means are big enough, the group mean values should be added.

Do the variances differ among groups or not? Is the difference big enough?

SSw: Sum of squares around the total mean \(m_w\): \[ \begin{equation*} SSw=\sum_{i,j} (v_{ij}-m_w)^2 \end{equation*} \]

SSi: Sum of squares of each group's values around the group mean \[ \begin{equation*} SSi=\sum_{i=1}^3 \sum_{j=1}^{n_{i}} (v_{ij}-m_{x_i})^2 \end{equation*} \]

SSb: Sum of squares of the group means around the total mean, weighted by group size

\[ \begin{equation*} SSb=\sum_{i=1}^3 n_i \times (m_{x_i}-m_w)^2 \end{equation*} \]

Their relation:

\[ \begin{equation*} SSw=SSi+SSb \end{equation*} \]

Given a data set, \(SSw\) is calculated.

When \(SSi\) is relatively bigger, \(SSb\) is relatively smaller, and that means no big difference among the groups.

Because \(SSw\) is fixed by the data, once \(SSi\) is given, \(SSb\) is automatically determined.

In other words, only two out of three (\(SSw,SSi,SSb\)) are free.

Therefore either one of \[ \begin{equation*} \frac{SSi}{SSw},\frac{SSb}{SSw}, \frac{SSb}{SSi} \end{equation*} \] represents the difference among groups.

It is fine to use any one of three.

Considering degrees of freedom of \(SSi\) and \(SSb\), \[ \begin{equation*} F=\frac{\frac{SSb}{df_b}}{\frac{SSi}{df_i}} \end{equation*} \] is used to measure the intensity of difference among groups.

The statistical test using this F statistic is called analysis of variance (ANOVA).

As we converted the \(\chi^2\) value to a p-value, we convert the \(F\) value to a p-value based on the F-distribution.

```
# Sample sizes, means, and SDs of the three groups
nx <- c(40, 40, 20); mb<- c(50, 60, 45); sdb <- c(50, 50, 50)*0.01
# Generate data with random numbers
t1 <- rnorm(nx[1], mb[1], sdb[1]); t2 <- rnorm(nx[2], mb[2], sdb[2]); t3 <- rnorm(nx[3], mb[3], sdb[3])
tb <- c(t1, t2, t3)
# Attach group labels to the samples
x <- c(rep("x1", nx[1]), rep("x2", nx[2]), rep("x3", nx[3]))
boxplot(tb ~ x)
# ANOVA
summary(aov(tb ~ x))
```

\(\frac{SSb}{df_b}\) and \(\frac{SSi}{df_i}\) are unbiased estimates of the inter-group and intra-group variances.
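As a check (a sketch; the variable names are ours), the decomposition \(SSw = SSi + SSb\) and the F statistic can be computed by hand on simulated data like that above:

```r
set.seed(1)
nx <- c(40, 40, 20); mb <- c(50, 60, 45); sdb <- c(0.5, 0.5, 0.5)
tb <- unlist(mapply(rnorm, nx, mb, sdb))            # simulated test values
x  <- rep(c("x1", "x2", "x3"), nx)                  # group labels
mw  <- mean(tb)
SSw <- sum((tb - mw)^2)                             # total sum of squares
SSi <- sum(tapply(tb, x, function(v) sum((v - mean(v))^2)))  # within groups
SSb <- sum(nx * (tapply(tb, x, mean) - mw)^2)       # between groups
all.equal(SSw, SSi + SSb)                           # TRUE: the decomposition holds
Fstat <- (SSb / (3 - 1)) / (SSi / (sum(nx) - 3))    # compare with summary(aov(tb ~ x))
```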

Degrees of freedom ~ difference of No. parameters

2x2 table df=1

2x3 table df =2

When something has d degrees of freedom, it should be located in d-dimensional space.

In Chapter 13, tables and their statistics are displayed in d-dimensional space.

6 cells in 2x3 table

There are 3 column sums, 2 row sums, and 1 total sum; 3 + 2 + 1 = 6 sums.

6 cell values, i.e., 6 variables, that satisfy 6 equations; this can be expressed in a linear algebraic form.

\[ \begin{pmatrix} 1 & 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 1 \\ 1 & 0 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 & 1 \\ 1 & 1 & 1 & 1 & 1 & 1 \\ \end{pmatrix} \begin{pmatrix} r_1 \\ r_2 \\ r_3 \\ s_1 \\ s_2 \\ s_3 \end{pmatrix} = \begin{pmatrix} R \\ S\\ t_1 \\ t_2 \\ t_3 \\ T \end{pmatrix} \]

Can we solve this system?

The rank of the matrix tells whether it can be solved.

Degrees of freedom = number of variables − rank.

```
m<-matrix(c(1,1,1,0,0,0,
            0,0,0,1,1,1,
            1,0,0,1,0,0,
            0,1,0,0,1,0,
            0,0,1,0,0,1,
            1,1,1,1,1,1,  # 6 rows down to here are constraints
            1,0,0,0,0,0,  # (1,1) cell value is given
            0,1,0,0,0,0), # (1,2) cell value is given
          ncol=6,byrow=TRUE)
# qr() returns rank
6-qr(m[1:6,])$rank # with 6 row constraints...
6-qr(m[1:8,])$rank # when two cells' values are determined...
```

When independent,

they are mutually perpendicular (\(\frac{\pi}{2}\))

their inner product is zero

the joint probability is the product of the two probabilities

(R7-4.R).

```
Ns <- 10000; A <- rnorm(Ns); B <- rnorm(Ns) # Making two vectors independently
ip<-sum(A*B) # inner product
cos<- sum(A * B)/sqrt(sum(A * A) * sum(B * B)) # cosine
acos(cos)/pi # angle
```

k vectors span a k-dimensional space when they are linearly independent.

The vectors constitute a basis of the space.

When k linearly independent vectors are mutually perpendicular, they constitute an orthogonal basis.

When a matrix M represents an orthogonal basis and all of its columns are unit vectors, \(M M^T\) is the identity matrix.

It is called orthonormal basis.
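A quick numerical check (our own example): `qr.Q()` applied to a random square matrix yields an orthonormal basis, and \(M M^T\) is then the identity.

```r
set.seed(2)
M <- qr.Q(qr(matrix(rnorm(16), 4, 4)))  # orthonormal basis of 4-d space
round(M %*% t(M), 10)                   # numerically the 4x4 identity matrix
```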

An orthonormal basis matrix is generated by singular value / eigenvalue decomposition, which tells us a "good-looking" angle from which to see a multidimensional data set.

R7-5.R is such example.

Consider a situation in which many specimens are divided into multiple groups and data on a relatively large number of quantitative items are taken for them. The data are created and processed as follows.

```
#4 big subpopulations ~ 100 and 20 small subpopulations ~ 10
#100 genetic markers
Nm<-100
Ns<-c(rpois(4,100),rpois(20,10))
Npop<-length(Ns) # No. subpops
M<-NULL # genotypes
# means for each subpop
for(j in 1:Npop){
  tmpM<-matrix(rep(0,Nm*Ns[j]),ncol=Nm)
  for(i in 1:Nm){
    af<-rnorm(1)
    tmpM[,i]<-rnorm(Ns[j],af)
  }
  M<-rbind(M,tmpM)
}
# Standardize
wholemean<-mean(M)
M<-M-wholemean # Total mean = 0
mu<-apply(M,2,mean) # column means
M<-t(t(M)-mu) # column means = 0
# decomposition
svdout<-svd(M)
M2<-svdout$u%*%diag(svdout$d) # Post-decomposition data matrix
par(mfcol=c(1,2))
# plot
image(1:sum(Ns),1:Nm,M,xlab="Big sub pops -> small sub pops",ylab="variables")
image(1:sum(Ns),1:Nm,M2,xlab="Big sub pops -> small sub pops",
      ylab="post decomposition")
df1<-as.data.frame(M);df2<-as.data.frame(M2) # data frames
L<-1:5;par(mfcol=c(1,1))
plot(df1[,L]) # pre decomposition
plot(df2[,L]) # post decomposition
vM1<-apply(M,2,var) # variances pre
vM2<-apply(M2,2,var) # variances post
ylim<-c(min(vM1,vM2),max(vM1,vM2))
plot(vM1,ylim=ylim,type="b")
par(new=T)
plot(vM2,ylim=ylim,type="b",col="red")
```

The degrees of freedom tell the number of variables necessary to describe a data set.

Various ways to have variables.

Treat every variable equally

Order them and use the top ones

Hierarchical handling of variables

A variable set has some meaning.

To look at a data set from another aspect, you may change the variable set to a different set, which may have "biological meanings".

After taking out some featuring values from a data set, we want to “understand” what the values mean.

Distribution

Stochastic variable

Probability distribution

Probability is non negative

Support

Discrete values, “red” or “white”, “yes” or “no”, “1” or “2” or “3”….

The sum of the probability mass function is 1. Figure 8.1 (R8-1.R).

```
barplot(c(0.4,0.6),names.arg=c("red","white"),ylab="probability")
pie(c(0.3,0.2,0.4,0.1),labels=c("red","white","c3","c4")) # c3, c4: placeholders for two garbled category labels
```

Continuous.

Example: a variable taking non-negative real values.

The probability density becomes half when the value of the variable increases by \(L\).

Example: decrease in radioactivity.

\[ \begin{equation*} P(x+L)=\frac{1}{2}\times P(x) \end{equation*} \]

\[ \begin{equation*} Q(x+L)=log(P(x+L))=log(P(x))-log(2) = Q(x)-log(2) \end{equation*} \]

With \(\lambda=\frac{log(2)}{L}\)

\[ \begin{equation*} P(x)=P(0)\times e^{-\lambda x} \end{equation*} \]

\[ P(x)=P(0)\times exp(-\frac{x}{L}log(2)) \]

Because integration of \(P(x)\) over all possible values of x should be 1,

\[ \begin{equation*} \lim_{X\to \infty} \int_0^X P(x) dx = \lim_{X\to \infty} P(0) \bigl[ -\frac{1}{\lambda} e^{-\lambda x} \bigr] ^X_0 = P(0) \times \frac{1}{\lambda} =1 \end{equation*} \]

then \(P(0)=\lambda\) and

\[ \begin{equation*} P(x)=\lambda e^{-\lambda x};\lambda=\frac{log(2)}{L} \end{equation*} \]

This is the probability density function for it: the exponential distribution.
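A quick check of the derivation in R (L = 3 is an arbitrary choice): the density halves every L and agrees with the built-in `dexp()`.

```r
L <- 3                       # halving distance (arbitrary)
lambda <- log(2) / L
x <- seq(0, 10, by = 0.5)
p <- lambda * exp(-lambda * x)                     # the derived density
all.equal(p, dexp(x, rate = lambda))               # TRUE: it is the exponential density
all.equal(lambda * exp(-lambda * (x + L)), p / 2)  # TRUE: halves every L
```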

\[ Q(x+L)-Q(x)=-log(2) \]

\(Q(x)\) decreases in proportion to the increase in \(x\):

\[ Q(x)=Q(0) - \frac{x}{L} log(2) \]

Putting \(P(x)\) back into it, \[ log(P(x))=log(P(0))-\frac{x}{L} log(2), \] then, \[ P(x)=P(0)\times exp(-\frac{x}{L}log(2)) \]

If \(x\) can be negative, zero, or positive, the space becomes twice as large, so the probability density function becomes half: \[ \begin{equation*} P(x)=\frac{1}{2}\lambda e^{-\lambda |x|} \end{equation*} \]

When the decrease in \(Q(x)=log(P(x))\) is proportional to the distance in a direction, it is the exponential (Laplace) distribution.

When the decrease is proportional to the square of the distance, then,

\[ \begin{equation*} P(x)=C e^{-\lambda |x|^2} \end{equation*} \]

\[ \begin{equation*} \int_{0}^{\infty} P(x)dx=\frac{1}{2} \end{equation*} \] so, \[ \begin{equation*} P(x)=\frac{1}{\sqrt{\pi}}\lambda^{\frac{1}{2}} e^{-\lambda |x|^2} \end{equation*} \] \[ \begin{equation*} k=2,\lambda=\frac{1}{2\sigma^2} \end{equation*} \]

This is the formula of normal distribution.

\[ e^{-\lambda |x|^k} \]

k=1: exponential distribution (Laplace distribution) \(\frac{1}{2}\lambda e^{-\lambda |x|}\)

k=2 normal distribution \(\frac{1}{\sqrt{\pi}}\lambda^{\frac{1}{2}}e^{-\lambda x^2}\)

\(k \to 0\): at \(x=0\) the density factor is 1; elsewhere it is flat (uniform-like)

\(k \to \infty\): rectangular (uniform) distribution

\[ C \lambda^{\frac{1}{k}}e^{-\lambda x^k} \]

\[ \begin{equation*} P(x;k)=\frac{1}{2} \frac{1}{\Gamma(\frac{1}{k}+1)} \lambda^{\frac{1}{k}} e^{-\lambda |x|^k} \end{equation*} \] where \(\Gamma\) is gamma function.

This form is the generalized normal distribution.
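A numerical check that the generalized normal density integrates to 1 for several values of k (λ = 1 is an arbitrary choice):

```r
gnorm <- function(x, k, lambda = 1)
  0.5 / gamma(1 / k + 1) * lambda^(1 / k) * exp(-lambda * abs(x)^k)
sapply(c(1, 2, 4), function(k) integrate(gnorm, -Inf, Inf, k = k)$value)
# each integral is (numerically) 1
```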

Normal distribution

\[ \begin{equation*} \frac{1}{\sqrt{2\pi \sigma^2}} e^{-\frac{x^2}{2\sigma^2}} \end{equation*} \]

Standard normal distribution \(mean=0\),\(\sigma^2=1\) \[ \begin{equation*} \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}} \end{equation*} \]

The exponential distribution is the sum of the negative side and the positive side of the Laplace distribution.

Let's do the same thing for the normal distribution.

\[ \begin{equation*} 2\times \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}=\sqrt{\frac{2}{\pi}}e^{-\frac{x^2}{2}} \end{equation*} \] This is the \(\chi\) distribution with df = 1.

This sum of negative side and positive side means that we want to handle distance rather than coordinates.

In one dimensional space, there are two points that have the same distance from the origin.

Now we change our dimension from 1 to 2.

\[ \begin{equation*} 2\pi x \end{equation*} \]

This is the length of a circle with radius x, that is sort of “number of points whose distance from the origin is x”.

The following corresponds to the \(\chi\) distribution for 2-d space. \[ \begin{equation*} 2\pi x \times \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}= \sqrt{2\pi} x e^{-\frac{x^2}{2}} \end{equation*} \]

Double check this formula as below.

Because \(\frac{d e^{ax^2}}{dx} = 2ax e^{ax^2}\),

\[ \begin{equation*} \int_{0}^{\infty} \sqrt{2\pi} x e^{-\frac{x^2}{2}} dx = \sqrt{2\pi} \bigl[ -e^{-\frac{x^2}{2}} \bigr] _0^{\infty}=\sqrt{2\pi} \end{equation*} \]

Unfortunately the total sum is not equal to 1.

Dividing by \(\sqrt{2\pi}\) to adjust the sum,

\[ \begin{equation*} \frac{1}{\sqrt{2\pi}} \sqrt{2 \pi} x e^{-\frac{x^2}{2}}=x e^{-\frac{x^2}{2}} \end{equation*} \]

This corresponds to the formula of the \(\chi\) distribution with df = 2, \[ \begin{equation*} \frac{2^{1-\frac{k}{2}}}{\Gamma(\frac{k}{2})} x^{k-1} e^{-\frac{x^2}{2}} =x e^{-\frac{x^2}{2}};\ k=2. \end{equation*} \] In general the surface area of a k-dimensional sphere with radius x is, \[ \begin{equation*} S(x;k)=2\frac{1}{\Gamma(\frac{k}{2})} \pi ^ {\frac{k}{2}} x^{k-1} \end{equation*} \]

then, \(\chi\) distribution of df=k should be proportional to \[ \begin{equation*} S(x;k) \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}= \frac{1}{\sqrt{2\pi}}\frac{2\pi^{\frac{k}{2}}}{\Gamma(\frac{k}{2})} x^{k-1} e^{-\frac{x^2}{2}} \end{equation*} \]

or,

\[ \begin{align*} &C \times S(x;k)\frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}} = (\frac{1}{\sqrt{2\pi}})^k S(x;k) e^{-\frac{x^2}{2}}\\ &C=(\frac{1}{2\pi})^{\frac{k-1}{2}} \end{align*} \]

When you look at k-dimensional normal distribution formula, you will find the term of \((\frac{1}{\sqrt{2\pi}})^k\), because of the same reason.

The \(\chi\) distribution concerns distance.

The \(\chi^2\) distribution concerns the square of distance.

Transform the \(\chi\) distribution with \(X= x^2\).

\[ \begin{equation*} Pr(\chi^2=X)=\frac{1 }{2^{\frac{k}{2}} \Gamma(\frac{k}{2})} X^{\frac{k}{2}-1} e^{-\frac{X}{2}} \end{equation*} \]
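The derived formula can be checked against R's built-in `dchisq()`:

```r
X <- seq(0.1, 10, by = 0.1)
for (k in c(1, 2, 5)) {
  # the chi-squared density written out from the formula above
  p <- 1 / (2^(k / 2) * gamma(k / 2)) * X^(k / 2 - 1) * exp(-X / 2)
  print(all.equal(p, dchisq(X, df = k)))  # TRUE for each df
}
```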

Figure 8.5.

Assume 40 balls in a bag, 12 of which are labeled "A" and the rest "B" (R9-1.R, Figure 9.1).

Take out a ball 10 times from the bag.

Two ways to take out the balls: with replacement and without replacement.

There are 11 possible outcomes: the number of "A" balls can be 0, 1, …, 10.

```
# With replacement: binomial
p <- 0.3; x <- 0:10 # Fraction of A is 0.3
pr <- choose(10, x) * p^x * (1 - p)^(10 - x) # equivalently dbinom(x, 10, p)
A <- 12; B <- 28; S <- A + B
x <- 0:10; y <- 10 - x
# Without replacement: hypergeometric
pr2 <- exp(lgamma(10 + 1) + lgamma(30 + 1) + lgamma(A + 1) + lgamma(B + 1) - (lgamma(S + 1) + lgamma(x + 1) + lgamma(y + 1) + lgamma(A - x + 1) + lgamma(B - y + 1))) # equivalently dhyper(x, A, B, 10)
ylim <- c(0, 0.5)
plot(x-0.5, pr, ylim = ylim, type = "s")
par(new = T); plot(x - 0.5, pr2, ylim = ylim, type = "s", col = "red")
```

Assume 40 balls in a bag.

Assume the number of "A" balls varies from 0 to 40.

```
exProb<-function(m){ # exact probability of a 2x2 table
  m1<-apply(m,1,sum) # row sums
  m2<-apply(m,2,sum) # column sums
  s<-sum(m)
  exp(sum(lgamma(m1+1))+sum(lgamma(m2+1))-(lgamma(s+1)+sum(lgamma(m+1))))
}
# conditions
S<-40;A<-0:S;B<-S-A;N<-10;x<-0:N
probmat<-matrix(0,length(A),length(x))
for(i in A){
  for(j in x){
    y<-N-j; z<-i-j; w<-S-(j+y+z);
    if(j>=0 & y>=0 & z>=0 & w>=0){
      probmat[i+1,j+1]<-exProb(matrix(c(j,y,z,w),nrow=2))
    }
  }
}
phi<-80;theta<-0;shade<-0.3 # Parameters for drawing
persp(A,x,probmat,xlab="No. A in 10", ylab="No. A in 40",zlab="probability",phi=phi,theta=theta,shade=shade)
plot(A - 0.5, probmat[, 4], type = "s") # Probability no. "A" = 3 in the taken out 10 balls.
abline(h = 0, col = "red")
```

Assume a bag with 40 balls.

You took out 10 balls without replacement.

You found 3 “A” and 7 “B” balls.

What can you tell about no. “A” in the bag before you took out balls?

No. "A" should be 0, 1, …, 40, because the number of balls in the bag was 40.

No. "A" should be 3 or more because three "A"s were taken out, and 33 or fewer because seven "B"s were taken out.

See Figure 9.3.

Probability of an observation and likelihood of an observation correspond to mutually orthogonal sections of this surface: fixing the bag's contents and varying the observation gives probability, while fixing the observation and varying the contents gives likelihood.

Looking at the section for likelihood, no. "A" = 0, 1, 2, 34, …, 40 are impossible ~ their likelihood = 0.

\[ \begin{equation*} L(A=a|(3,7))=\frac{a!\,(40-a)!\,10!\,30!}{40!\,3!\,7!\,(a-3)!\,(33-a)!} \end{equation*} \]
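This is the hypergeometric probability of drawing 3 "A"s out of 10 when the bag contains a "A"s; it can be checked against `dhyper()`:

```r
a <- 3:33  # possible numbers of "A" in the bag
# the factorial formula above, evaluated via lgamma for numerical stability
L1 <- exp(lgamma(a + 1) + lgamma(40 - a + 1) + lgamma(10 + 1) + lgamma(30 + 1) -
          (lgamma(40 + 1) + lgamma(3 + 1) + lgamma(7 + 1) +
           lgamma(a - 3 + 1) + lgamma(33 - a + 1)))
all.equal(L1, dhyper(3, a, 40 - a, 10))  # TRUE
```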

You can check this with R9-3.R.

`sum(probmat[, 4]) `

After the observation of “A” = 3 and “B”= 7, you put them back to the bag.

Then again you took out 10 balles.

This time “A” = 7 and “B” = 3.

How can we calculate likelihood of no. “A” in the original bag? (R9-4.R).

```
# probmat[,4]:Likelihood for "A" = 3 and "B" = 7
# probmat[,8]:Likelihood for "A" = 7 and "B" = 3
# Multiply them because the two observations are independent of each other
plot(A - 0.5, probmat[, 4] * probmat[, 8], type = "s")
```

Combining the results of the two draws, the likelihood is largest when the 40 balls are half "A" and half "B" (Figure 9.4 (c)).

Multiplication: probabilities can be multiplied when events are mutually independent. The same holds for likelihoods.

\(\prod p \to \sum \log(p)\) : taking logarithms turns products into sums, which is easier to compute

The absolute value of a likelihood is not so important; the relative value is more important, because conditions/hypotheses are compared via their likelihoods.

It is a good idea to use relative likelihood values so that the sum/integral of likelihood over all conditions/hypotheses equals 1.

Prior and posterior distributions are such normalized likelihoods.

Observation transforms the prior distribution into the posterior distribution. This is "Bayesian" updating.

Two 2x2 tables; one is independent and the other is dependent.

B : b = 0.5 : 0.5 for both tables.

Without further information, you should anticipate the probability to get B is 0.5. B : b = 0.5 : 0.5 = 1 : 1.

When you know your sample has A, then does it have B or b?

Focus on the left column of each table. For the left table, B : b = 0.3 : 0.3 = 1 : 1. For the right, B : b = 0.4 : 0.2 = 2 : 1.
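The two tables can be sketched in R (the remaining cells, 0.2/0.2 and 0.1/0.3, are inferred from the stated margins B : b = 0.5 : 0.5 and the left columns above):

```r
# rows: B, b; columns: A, a
indep <- matrix(c(0.3, 0.2,
                  0.3, 0.2), nrow = 2, byrow = TRUE)  # independent table
dep   <- matrix(c(0.4, 0.1,
                  0.2, 0.3), nrow = 2, byrow = TRUE)  # dependent table
indep[1, 1] / sum(indep[, 1])  # P(B|A) = 0.5: knowing A changes nothing
dep[1, 1]   / sum(dep[, 1])    # P(B|A) = 2/3: knowing A updates B : b to 2 : 1
```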

Conditional probability

You can expand 2x2 table to a bigger one. Figure 9.6: 20 x 15

Figure 9.7.

No big difference from Figure 9.6.

2 stochastic variables

Conditional probability

Prior and posterior probability

Sensitivity and specificity

PPV and NPV

Every time you observe something, the information updates the likelihood: a new posterior density is generated by applying the information to the prior.

Linkage analysis:

Using phenotype information

Using genotype information

Using pedigree information

Using genetic marker location information

Considering recombination

Calculate the likelihood that markers are causative for the phenotype

Two types of linkage analysis

Parametric

Non-parametric

Pedigree is given.

Many patterns of chromosomal transmission are possible; enumerate them: \(2^n\) (Chapter 5). For a point along a chromosome, there should be a tree (Figure 10.1).

```
library(gtools)
permutations(2,4,c(0,1),repeats=TRUE)
```

For parents with two offspring, \(2^4=16\).

For L markers, \(16^L\) patterns.

When L=2, \(16^2\)

```
n<-4
m<-permutations(2,4,c(0,1),repeats=TRUE)
RecNumberMat<-as.matrix(dist(m,method="manhattan",diag=TRUE,upper=TRUE)) # no. transmission with recombination
NonRecNumberMat<-n-RecNumberMat # no. transmission without recombination
RecNumberMat
```

- Transmission patterns @ markers (Trees @ markers)
Genotypes of the members at the markers are given. The conditional probabilities of the 16 trees can be calculated. Some patterns are impossible (zero probability).

- Transmission patterns @ TRUE LOCUS (Trees @ True locus)
Using phenotype information and a genetic model assumption (dominant/recessive), some trees are likely and the others are unlikely as the transmission pattern of the "TRUE LOCUS". The likelihood/conditional probability can be calculated.

TRUE LOCUS may be @ a marker itself, or @ somewhere between markers.

(Tree @ marker 1) x (Tree @ marker 2) x (Tree @ TRUE LOCUS) : Very many combinations (tree trios).

Recombinations may be required between marker 1 and the true locus and between the true locus and marker 2 to realize a given combination of the three sites.

Likelihood function for trio of trees

\[ \begin{equation*} L(Pi,Pj,Pk)=\theta_{aG}^{Nrec_{i,k}}\times (1-\theta_{aG})^{Nnon_{i,k}} \times \theta_{Gb}^{Nrec_{k,j}}\times (1-\theta_{Gb})^{Nnon_{k,j}} \end{equation*} \] where \(Nrec_{i,j},Nnon_{i,j}\) are the numbers of transmissions with and without recombination and \(\theta_{pq}\) is the probability of recombination between p and q.

- All trios should be considered,

\[ \begin{equation*} L=\sum_{\text{all } Pi,Pj,Pk} l(Pi,Pj,Pk)\,p_i p_j p_k \end{equation*} \]

- Marker 1 — TrueLocus(G) — Marker 2

Relation among \(\theta_{ab},\theta_{aG},\theta_{Gb}\) : \[ \begin{equation*} \theta_{ab}=\theta_{aG}(1-\theta_{bG})+\theta_{bG}(1-\theta_{aG}) \end{equation*} \]

Using a parameter \(t\), \(0< t < \frac{\pi}{2}\), \[ \begin{align*} \theta_{aG}=0.5-\sqrt{\frac{(0.5-\theta_{12})\cos(t)}{2\sin(t)}}\\ \theta_{bG}=0.5-\sqrt{\frac{(0.5-\theta_{12})\sin(t)}{2\cos(t)}} \end{align*} \]
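A numerical check that this parameterization keeps \(\theta_{aG}(1-\theta_{bG})+\theta_{bG}(1-\theta_{aG})=\theta_{12}\) for any t in the valid range (θ12 = 0.4 as in the simulation below):

```r
theta12 <- 0.4
t <- seq(0.3, pi / 2 - 0.3, length.out = 50)  # range where both thetas are >= 0
thetaA <- 0.5 - sqrt((0.5 - theta12) * cos(t) / (2 * sin(t)))
thetaB <- 0.5 - sqrt((0.5 - theta12) * sin(t) / (2 * cos(t)))
all.equal(thetaA * (1 - thetaB) + thetaB * (1 - thetaA),
          rep(theta12, length(t)))            # TRUE for every t
```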

\(\theta_{12}\) is known from genetic map information.

Now the likelihood function \[ \begin{equation*} L=\sum_{\text{all } Pi,Pj,Pk} l(Pi,Pj,Pk)\,p_i p_j p_k \end{equation*} \] is a function of \(t\).

We can find maximum likelihood estimate of t.

A bit bigger familial data set is simulated and evaluated.

```
# Calculate likelihood of null and alternative hypotheses with the variables below
# n: number of branches in the tree
# m1,m2,G: probabilities for markers 1 and 2 and the true locus G
# theta: recombination rate between the two markers
# k: number of points between m1 and m2 where likelihood will be calculated
CalcLike<-function(n,m1,m2,G,theta,k){
  x<-seq(from=0,to=1,by=1/k)
  t<-x*pi/2
  theta1<--sqrt((0.5-theta)/(2*sin(t))*cos(t))+0.5
  theta2<--sqrt((0.5-theta)/(2*cos(t))*sin(t))+0.5
  range<-which(theta1>=0 & theta2>=0)
  x<-x[range]
  theta1<-theta1[range]
  theta2<-theta2[range]
  library(gtools)
  trvec<-permutations(2,n,c(0,1),repeats=TRUE)
  # numbers of transmissions with and without recombination
  RecNumberMat<-as.matrix(dist(trvec,method="manhattan",diag=TRUE,upper=TRUE))
  NonRecNumberMat<-n-RecNumberMat
  Lalt<-rep(0,length(x)) # likelihood when the alternative hypothesis is true
  for(i in 1:length(Lalt)){
    m1G<-m1%*%t(G) # likelihood between m1 and G
    # recombination (+)
    m1Gx<-m1G*theta1[i]^RecNumberMat*(1-theta1[i])^NonRecNumberMat
    m1Gx<-apply(m1Gx,2,sum) # tree patterns of m1 change into tree patterns of G because of recombination; the patterns of G should then be compared with the tree patterns of m2
    m2G<-m1Gx%*%t(m2) # likelihood between G and m2
    m2G<-m2G*theta2[i]^RecNumberMat*(1-theta2[i])^NonRecNumberMat # recombination between G and m2 should be considered
    Lalt[i]<-sum(m2G) # all tree-pattern trios should be integrated
  }
  # Null hypothesis
  m12<-m1%*%t(m2)
  m12x<-m12*theta^RecNumberMat*(1-theta)^NonRecNumberMat/(2^n)
  Lnull<-rep(sum(m12x),length(Lalt))
  list(logLikeAlt=Lalt,logLikeNull=Lnull,location=x)
}
n<-4 # number of transmissions in the trees
set.seed(65432) # seed for the simulation (you can change it yourself)
# generate 0/1 vectors of length n^2 for m1 and m2
m1<-sample(c(0,1),n^2,replace=TRUE,prob=c(0.8,0.2))
m1<-m1/sum(m1) # probability of trees for m1
m2<-sample(c(0,1),n^2,replace=TRUE,prob=c(0.8,0.2))
m2<-m2/sum(m2) # probability of trees for m2
G<-0.9*m1+0.1*m2
G<-G/sum(G) # probability of trees for the true locus, G
theta<-0.4 # theta between m1 and m2
k<-100 # number of points for the likelihood calculation
LL<-CalcLike(n,m1,m2,G,theta,k)
ylim<-c(0,max(LL$logLikeAlt,LL$logLikeNull))
plot(LL$location,LL$logLikeAlt,type="l",ylim=ylim,ylab="LogLike") # likelihood under the alternative hypothesis
par(new=T)
plot(LL$location,LL$logLikeNull,type="l",col="red",ylim=ylim,ylab="") # likelihood under the null hypothesis
```

The highest likelihood indicates the most likely location of true locus.

How large the likelihood under the alternative hypothesis is compared with the likelihood under the null hypothesis is used to judge statistical significance.

LOD score is logarithm of likelihood ratio.

The above is the basic concept of parametric linkage analysis. Because the calculation is very heavy if you attempt it for many markers, linkage-analysis software packages adopt many computational tricks.

Parametric linkage analysis uses genetic models or penetrance, which are parameters.

Non-parametric linkage analysis does not use them.

The non-parametric method models true locus (or loci) whose genotypes affect phenotype expression and compares the likelihood between the alternative hypothesis and the null hypothesis (no effect of genotypes on phenotype expression).

Non-parametric linkage analysis has frequently been used for sib-pair analysis, and in this section we study the method through sib-pair analysis.

Assume a biallelic locus with three genotypes.

Null hypothesis: \(GRR = 1\) for all genotypes.

Alternative hypothesis \(GRR \ne 1\) for some genotypes.

Consider a sib-pair. \(3^2=9\) genotype combinations are possible.

The probability that both of the pair are affected can be calculated for the 9 combinations.

For a sib-pair, there is one parent-pair, who has 4 alleles in total.

Each offspring has two alleles, and there are 4 ways to receive two out of the parents' four.

Considering both members of the sib-pair, there are \(4^2=16\) ways of transmission.

Assume a risk allele. The number of risk alleles among the parents' 4 alleles can be 0, 1, 2, 3, or 4 (5 ways).

See Table 10.1. You can calculate the probability that both sibs express the phenotype for each combination of parental risk allele number and transmission pattern to the two offspring.

The 16 ways of transmission to the two offspring can be grouped into three classes by the number of alleles shared identical by descent, IBD = 0, 1, or 2.
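This grouping can be checked by brute-force enumeration (a sketch; we label the parents' four alleles 1-4 so that sharing identical by descent is unambiguous):

```r
# each sib receives one paternal allele (1 or 2) and one maternal allele (3 or 4)
sib <- expand.grid(pat = 1:2, mat = 3:4) # the 4 possible transmissions per sib
pairs <- expand.grid(s1 = 1:4, s2 = 1:4) # 4^2 = 16 patterns for a sib-pair
ibd <- apply(pairs, 1, function(x) {
  a <- sib[x[1], ]; b <- sib[x[2], ]
  (a$pat == b$pat) + (a$mat == b$mat) # no. alleles shared identical by descent
})
table(ibd) # IBD = 0, 1, 2 in the ratio 4:8:4, i.e. 1/4, 1/2, 1/4
```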

Tables 10.3 and 10.4 show the same content in different forms: the probability/likelihood for each combination of parental pattern and IBD value.

\(\Delta_{p>q}\) stands for a value related to the difference in relative risk between genotypes p and q; such quantities have specific meanings in genetic models.

Using the information that both offspring of a sib-pair have the "affected" phenotype, the likelihood of each IBD number varies. IBD numbers for markers can be counted or estimated. When you have many sib-pairs, their likelihoods should be integrated (if pairs are independent, you can multiply them). Eventually you pick the most likely marker; you can evaluate positions between markers as well.

Indices tell you relative amount of something you are interested in.

The values of indices should be easy to interpret; ranges such as [0,1] or [0,100] work well in many cases.

F (Section 3.3.3), used in the evaluation of Hardy-Weinberg equilibrium, is an index.

Indices for linkage disequilibrium (LD) should take an "appropriate" value under linkage equilibrium, e.g., 0.

They should also take easily interpretable value(s) at the extremes of LD, e.g., 1.

As long as indices meet these requirements, they can be defined in various ways, and indeed there are multiple LD indices.
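As a sketch of such definitions (the function name is ours; haplotype frequencies are ordered as AB, aB, Ab, ab):

```r
# LD indices from the 4 haplotype frequencies h = (AB, aB, Ab, ab)
ldIndices <- function(h) {
  pA <- h[1] + h[3]; pB <- h[1] + h[2] # allele frequencies
  D <- h[1] - pA * pB # 0 under linkage equilibrium
  # maximum |D| attainable given the allele frequencies
  Dmax <- if (D >= 0) min(pA * (1 - pB), (1 - pA) * pB) else min(pA * pB, (1 - pA) * (1 - pB))
  c(D = D, Dprime = D/Dmax, r = D/sqrt(pA * (1 - pA) * pB * (1 - pB)))
}
ldIndices(c(0.5, 0, 0, 0.5)) # complete LD: D' = 1, r = 1
ldIndices(rep(0.25, 4)) # linkage equilibrium: all 0
```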

The p-value, ubiquitously used for hypothesis rejection, ranges from 0 to 1.

This is an index, too.

(R11-1.R).

```
df<-4;x <- seq(from = 0, to = 15, by = 0.1)
d <- dchisq(x, df);q <- pchisq(x, df);p <- pchisq(x, df, lower.tail = FALSE)
ylim<-c(0,1)
plot(x, d, ylim = ylim, type = "l")
par(new = T);plot(x, q, ylim = ylim, type = "l")
par(new = T);plot(x, p, ylim = ylim, type = "l")
```

Chapter 12 Estimation

Chapter 13 Testing and rejection

Chapter 14 Association and causal relation

Part Ⅳ deals with estimating the hidden truth you want to know from the data (estimation), judging the truth of hypotheses based on the data (testing and rejection), and considering whether a factor is the cause or the result of a phenomenon (causality).

Consider Bernoulli trials in which 6 out of 20 were "yes" and 14 were "no".

The likelihood that the probability of "yes" is 0.3 is:

\[ \begin{equation*} \frac{20!}{6!14!} 0.3^6 \times (1-0.3)^{14} \end{equation*} \]

Changing 0.3 to p,

\[ \begin{equation*} f(p)= \frac{20!}{6!14!} p^6 \times (1-p)^{14} \end{equation*} \]

This is the likelihood function.

We want to know the value of p that maximizes f(p).

\[ \begin{align*} f'(p)&= \frac{20!}{6!14!} \times (6 p^5 \times (1-p)^{14} - 14 \times p^6(1-p)^{13}) \\ &= \frac{20!}{6!14!} \times p^5 \times (1-p)^{13} (6(1-p)-14 p )\\ &= \frac{20!}{6!14!} \times p^5 \times (1-p)^{13} (6-20p) \end{align*} \]

\[ \begin{equation*} f'(p=0.3)=0 \end{equation*} \]

0.3 is the maximum likelihood estimate of p.
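This can be checked numerically (a sketch using `optimize` on the binomial likelihood):

```r
# maximize f(p) = C(20,6) p^6 (1-p)^14 over p in (0,1)
f <- function(p) dbinom(6, size = 20, prob = p)
opt <- optimize(f, interval = c(0, 1), maximum = TRUE)
opt$maximum # close to 6/20 = 0.3
```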

p takes value from 0 to 1.

0.3 is the most likely value, but other values are possible.

A density function that has the same shape as the likelihood function but whose area under the curve satisfies \(\int_0^1 f_c(p) dp =1\) is useful.

\[ \begin{equation*} f_c(p)=Cp^6\times(1-p)^{14} \end{equation*} \]

where \[ \begin{equation*} \int_{0}^{1} f_c(p) dp = \int_{0}^{1} C p^6 \times (1-p)^{14} dp =1 \end{equation*} \]

Actually, C is closely related to the binomial coefficient in

\[ \begin{equation*} \frac{20!}{6!14!}p^6\times (1-p)^{14} \end{equation*} \] and

\[ \begin{equation*} C=\frac{(20+1)!}{6!14!}=\frac{21!}{6!14!} \end{equation*} \]

Using \(\Gamma\) function and \(\beta\) function, \[ \begin{equation*} C=\frac{(N+1)!}{k!(N-k)!}= \frac{\Gamma((N+1)+1)}{\Gamma(k+1)\Gamma((N-k)+1)} = \frac{\Gamma( (k+1)+(N-k+1) )}{\Gamma(k+1)\Gamma((N-k)+1)} =\frac{1}{B(k+1,(N-k+1))} \end{equation*}\]
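The normalizing constant can be verified numerically (a sketch):

```r
k <- 6; N <- 20
# C = (N+1)!/(k!(N-k)!) computed via lgamma for numerical stability
C1 <- exp(lgamma(N + 2) - lgamma(k + 1) - lgamma(N - k + 1))
C2 <- 1/beta(k + 1, N - k + 1) # the same value via the beta function
# the normalized function integrates to 1
integrate(function(p) C1 * p^k * (1 - p)^(N - k), 0, 1)$value
```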

Now,

\[ \begin{equation*} f_c(p)=\frac{1}{B(k+1,(N-k+1))}p^k(1-p)^{N-k}=\beta(p;k+1,N-k+1) \end{equation*} \]

This is called the beta distribution.

Using the likelihood function or beta distribution, we can define an interval estimate of p: \[ \begin{equation*} \int_{L}^{H} f_c(p) dp : \int_{0}^{L} f_c(p) dp+ \int_{H}^1 f_c(p) dp = \alpha : (1-\alpha) \end{equation*} \]

If we impose the further constraint \[ \begin{equation*} \int_{0}^{L} f_c(p) dp = \int_{H}^1 f_c(p) dp \end{equation*} \]

we can determine the lower and upper bounds of the interval.

```
N <- 20; k <- 6 # observed data
p <- seq(from = 0, to = 1, by = 0.01) # list of values of p
v <- dbeta(p, k + 1, N - k + 1) # beta distribution
plot(p, v, type = "l")
abline(v = k/N) # MLE
cirange <- 0.95 # 0.95 - > 0.05/2 = 0.025
# quantiles of beta distribution
ci <- qbeta(c((1 - cirange)/2, 1 - (1 - cirange)/2), k + 1, N - k + 1)
abline(v = ci) # upper and lower bounds
```

The estimated distribution of p is a beta distribution:

MLE is \(\frac{\alpha-1}{\alpha-1+\beta-1}=0.3\) and mean (expected value) is \(\frac{\alpha}{\alpha+\beta}=0.318\)
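A quick check of these two values (with \(\alpha = k+1 = 7\) and \(\beta = N-k+1 = 15\)):

```r
a <- 7; b <- 15 # parameters of beta(p; k+1, N-k+1) for k = 6, N = 20
mode_p <- (a - 1)/(a - 1 + b - 1) # MLE (mode) = 6/20 = 0.3
mean_p <- a/(a + b) # mean = 7/22, about 0.318
c(mode_p, mean_p)
```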

Use the package “binom” (R12-2.R).

```
library(binom)
binom.confint(6, 20, conf.level = 0.95, prior.shape1 = 1, prior.shape2 = 1)
```

How do you classify these various methods? Which one corresponds to the interval we calculated with beta distribution?

- The Bayesian method uses observed information to update the prior distribution to the posterior distribution.

`binom.confint(6, 20, prior.shape1 = 1, prior.shape2 = 1)`

\[ \begin{equation*} \frac{1}{B(a,b)} p^{a-1}(1-p)^{b-1} \end{equation*} \] “a=prior.shape1,b=prior.shape2”

When a=1,b=1 \[ \begin{equation*} \frac{1}{B(1,1)}p^{1-1}(1-p)^{1-1}=\frac{\Gamma(1+1)}{\Gamma(1+1)\Gamma(1+1)}\times1=1 \end{equation*} \]

a=1,b=1 means k=0,N-k=0.

This means no observation.

Suppose you “know” more than the uniform distribution before you get the observation.

Because you know something in advance, you have some non-uniform distribution of p in mind.

If your “prior” distribution has the shape of a beta distribution, you can specify it with two parameter values.

\[ \begin{equation*} \frac{1}{B(x=1.2,y=1.5)}p^{x-1}(1-p)^{y-1} \end{equation*} \]

```
x <- 1.2; y <- 1.5; p<-seq(from=0,to=1,by=0.01)
v3 <- dbeta(p, x, y)
plot(p, v3, type = "l", ylim = c(0, 4))
v4 <- dbeta(p,x+6,y+14)
par(new=TRUE)
plot(p, v4, type = "l", ylim = c(0, 4))
```

Assume you then obtain the information that 20 trials turned out 6 vs. 14; then,

your posterior distribution should be proportional to \[ \begin{equation*} \frac{1}{B(x,y)} p^{x-1}(1-p)^{y-1} \times p^6 (1-p)^{14}=\frac{1}{B(x,y)} p^{x+6-1}(1-p)^{y+14-1} \end{equation*} \]

Normalizing, this becomes: \[ \begin{equation*} \frac{1}{B(x+6,y+14)} p^{x+6-1}(1-p)^{y+14-1} \end{equation*} \]

The change is, in general, \[ \begin{equation*} \frac{1}{B(x,y)}p^{x-1}(1-p)^{y-1} \to \frac{1}{B(x+k,y+(N-k))} p^{x+k-1}(1-p)^{y+(N-k)-1} \end{equation*} \]

You can summarize this change as change in parameter values of beta distribution; \[ \begin{equation*} (x,y) \to (x+k,y+(N-k)) \end{equation*} \]

Further information updates it in the same way: \[ \begin{equation*} (x,y) \to (x+k,y+(N-k)) \to (x+k+k', y+(N-k)+(N'-k')) \end{equation*} \]

This is easy and convenient.

The binomial and beta distributions work together: binomial observations update a beta prior to a beta posterior. This relationship is expressed by saying that the beta distribution is the conjugate prior for the binomial distribution.

The conjugate prior for the multinomial distribution is the Dirichlet distribution, and the one for the Poisson distribution is the gamma distribution.
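These conjugate updates can be sketched as one-line functions (the function names are ours):

```r
# beta prior (x, y) + binomial data (k successes out of N) -> beta posterior
updateBeta <- function(prior, k, N) prior + c(k, N - k)
# Dirichlet prior + multinomial counts -> Dirichlet posterior
updateDirichlet <- function(prior, counts) prior + counts
# gamma prior (shape, rate) + Poisson observations -> gamma posterior
updateGamma <- function(prior, obs) c(prior[1] + sum(obs), prior[2] + length(obs))
updateBeta(c(1, 1), k = 6, N = 20) # (7, 15)
updateDirichlet(c(1, 1, 1), c(10, 7, 3)) # (11, 8, 4)
```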

Binomial ==(generalization)==> Multinomial \[ \begin{equation*} \frac{N!}{\prod_{i=1}^k n_i!} \prod_{i=1}^k (p_i )^{n_i} \end{equation*} \] \(\sum_{i=1}^k n_i=N,n_i \ge 0\)

Binomial <—> Beta

Multinomial <—> Dirichlet

Fig 12.5 R12-sup2.R

Three categories can be visualized as a triangle in 2D (the simplex in Chapter 4).

Interval estimation is a bit more complicated: you may define the interval by a contour line of equal likelihood, such that the probability mass inside and outside the contour is \(\alpha : (1-\alpha)\).

- Maximum likelihood estimation using a pseudo-random number

Assume two SNPs (A/a and B/b).

We observe 9 combinatorial genotype counts.

We want to estimate the frequencies of the 4 haplotypes AB, Ab, aB, and ab.

We can tell for sure the number of haplotype alleles for eight out of nine combinatorial genotypes.

For the genotype (Aa, Bb), we are not sure (AB + ab) vs. (Ab + aB).

We have to estimate the ratio of the two possibilities, (AB + ab) vs. (Ab + aB).

Tables 12.1 and 12.2 indicate genotype frequencies under the HWE assumption.

Let the haplotype frequencies be \(f_1,f_2,f_3,f_4\), with \(\sum_{i=1}^4 f_i=1\).

\[ \begin{equation*} log(L(F))= \sum_{i=1}^3 \sum_{j=1}^3 log(F_{ij}^{n_{ij}}) = \sum_{i=1}^3 \sum_{j=1}^3 n_{ij}logF_{ij} \end{equation*} \]

We can generate random haplotype frequency vectors using a Dirichlet random generator.

```
library(MCMCpack)
Make3x3 <- function(f) { # from 4 haplotype frequencies to 9 genotype frequencies
  tmp <- f %*% t(f)
  matrix(c(tmp[1, 1], 2 * tmp[1, 2], tmp[2, 2],
           2 * tmp[1, 3], 2 * (tmp[1, 4] + tmp[2, 3]), 2 * tmp[2, 4],
           tmp[3, 3], 2 * tmp[3, 4], tmp[4, 4]), nrow = 3, byrow = TRUE)
}
# Log likelihood to observe genotype counts under haplotype frequency vector
CalcLike3x3 <- function(g = matrix(1, 3, 3), f = c(0.25, 0.25, 0.25, 0.25)) {
  F <- Make3x3(f)
  sum(g * log(F))
}
CalcR <- function(h) { # LD index r from a haplotype frequency vector
  p_A <- h[1] + h[3]
  p_B <- h[1] + h[2]
  (h[1] - p_A * p_B)/sqrt(p_A * (1 - p_A) * p_B * (1 - p_B))
}
ns <- matrix(c(2, 5, 2, 10, 18, 14, 4, 25, 20), nrow = 3, byrow = TRUE) # 9 genotype counts
N <- 10000 # no. iterations
p <- c(1, 1, 1, 1) #parameters for dirichlet random generator
sampled <- rdirichlet(N, p) # Dirichlet random vectors
loglikes <- apply(sampled,1,CalcLike3x3,g=ns)
maxset <- sampled[which(loglikes == max(loglikes)), ] # Select ML
maxp_A <- maxset[1] + maxset[3] # SNP A'S allele freq
maxp_B <- maxset[1] + maxset[2] # SNP B's allele freq
maxr <- CalcR(maxset) # LD index
maxset
```

- Maximum likelihood estimation with constraints

Because we can count the numbers of A, a, B, and b alleles in the samples, we can calculate allele frequencies as sample frequencies.

Actually, these values are the maximum likelihood estimates of the frequencies of A, a, B, and b.

We may accept these single-SNP allele frequencies as they are.

Then the four haplotype frequencies have only one degree of freedom left.

\[ \begin{align*} h_1=p_Ap_B+r\sqrt{p_Ap_ap_Bp_b}\\ h_2=p_ap_B-r\sqrt{p_Ap_ap_Bp_b}\\ h_3=p_Ap_b-r\sqrt{p_Ap_ap_Bp_b}\\ h_4=p_ap_b+r\sqrt{p_Ap_ap_Bp_b} \end{align*} \]

1-df optimization is simple and easy.

(R12-5.R).

```
# p_A,p_B
p_A <- 0.4;p_B <- 0.3
# haplotype freq under the assumption of LE
f1 <- p_A * p_B; f2 <- (1 - p_A) * p_B; f3 <- p_A * (1 - p_B); f4 <- (1 - p_A) * (1 - p_B)
rs <- seq(from = -1, to = 1, by = 0.001) # r
ds <- rs * sqrt(p_A * (1 - p_A) * p_B * (1 - p_B))
f1 <- ds + f1; f2 <- -ds + f2; f3 <- -ds + f3; f4 <- ds + f4
F <- matrix(c(f1, f2, f3, f4), nrow = length(f1))
minF <- apply(F, 1, min) # exclude negative frequency cases
rs <- rs[minF > 0]
F <- F[minF > 0, ]
loglikes2 <- apply(F,1,CalcLike3x3,g=ns)
maxset2 <- F[which(loglikes2 == max(loglikes2)), ]
maxp_A2 <- maxset2[1] + maxset2[3];maxp_B2 <- maxset2[1] + maxset2[2]
maxr2 <- CalcR(maxset2)
plot(rs,loglikes2,type="l")
```

This result is shown in Figure 12.6. The likelihood forms a peak, and the value of r at the peak is the maximum likelihood estimate of r. In fact, the maximum likelihood obtained here is higher than the one obtained previously with Dirichlet random numbers.

The EM algorithm gives another way to reach the MLE under the constraints of the A, a, B, b frequencies.

(R12-6.R).

```
EM2loci <- function(n, p = NULL, Niter = 1000) {
  if (is.null(p)) p <- rep(0.25, 4) # when no initial freq is given, start from uniform
  f <- p
  rs <- rep(0, Niter) # history of LD index values
  logLikes <- rep(0, Niter) # history of log-likelihoods
  fs <- matrix(0, Niter, 4) # history of 4 haplotype frequencies
  # deterministic numbers of haplotypes
  fixed <- c(n[1, 1] * 2 + n[1, 2] + n[2, 1], n[1, 3] * 2 + n[1, 2] + n[2, 3],
             n[3, 1] * 2 + n[2, 1] + n[3, 2], n[3, 3] * 2 + n[2, 3] + n[3, 2])
  for (i in 1:Niter) {
    # tentative assignment of double heterozygotes to haplotype pairs
    tmpratio <- f[1] * f[4]/(f[1] * f[4] + f[2] * f[3])
    tmp <- rep(0, 4)
    tmp[1] <- fixed[1] + n[2, 2] * tmpratio
    tmp[2] <- fixed[2] + n[2, 2] * (1 - tmpratio)
    tmp[3] <- fixed[3] + n[2, 2] * (1 - tmpratio)
    tmp[4] <- fixed[4] + n[2, 2] * tmpratio
    # record updated values (normalized to frequencies)
    fs[i, ] <- tmp/sum(tmp)
    logLikes[i] <- CalcLike3x3(n, fs[i, ])
    rs[i] <- CalcR(fs[i, ])
    f <- fs[i, ] # update the "current value"
  }
  list(f = f, r = rs[Niter], logLike = logLikes[Niter],
       fHistory = fs, rHistory = rs, logLikeHistory = logLikes)
}
emout <- EM2loci(ns)
maxp_A <- emout$f[1] + emout$f[3]; maxp_B <- emout$f[1] + emout$f[2]; maxr <- emout$r
plot(emout$logLikeHistory[1:20], type = "b") # plot up to 20 iterations
```

```
x<-1:10
s<-3*x
f<-7*x
mean<-lower<-upper<-matrix(0,length(x),11)
for(i in x){
tmp<-binom.confint(x=s[i],n=s[i]+f[i],prior.shape1=1,prior.shape2=1)
mean[i,]<-tmp[,4]
lower[i,]<-tmp[,5]
upper[i,]<-tmp[,6]
}
plotlist<-c(2,3,5,9)
collist<-c("black","red","blue","green")
ylim<-c(0,1)
for(i in 1:length(plotlist)){
plot(x,mean[,plotlist[i]],type="l",ylim=ylim,col=collist[i])
par(new=T)
plot(x,lower[,plotlist[i]],type="l",ylim=ylim,col=collist[i])
par(new=T)
plot(x,upper[,plotlist[i]],type="l",ylim=ylim,col=collist[i])
par(new=T)
}
abline(h=0.3)
```

Reject a hypothesis if the observation is too rare to believe under it (R13-1.R).

There are three categories.

Suppose 10, 7, 3 were the observed counts of the three categories.

\[ \begin{equation*} \frac{N!}{n_1!n_2!n_3!} (\frac{1}{3})^N \end{equation*} \]

```
dpoly <- function(n = c(1, 2, 3), p = NULL) { # multinomial probability of counts n
  N <- sum(n)
  if (is.null(p)) {
    p <- rep(1/length(n), length(n))
  }
  exp(lgamma(N + 1) - sum(lgamma(n + 1)) + sum(n * log(p)))
}
DrawHigherLower <- function(g = c(10, 7, 3), p = NULL) {
  N <- sum(g)
  x <- 0:N; y <- 0:N
  xy <- expand.grid(x, y) # combinations of all x and y values
  z <- -(xy[, 1] + xy[, 2]) + N
  xyz <- matrix(c(xy[, 1], xy[, 2], z), nrow = length(z))
  xyz <- xyz[apply(xyz, 1, min) >= 0, ] # non-negative counts only
  probs <- apply(xyz, 1, dpoly, p = p) # apply the function defined above
  probObs <- dpoly(g, p = p) # probability of the observation
  higher <- which(probs > probObs)
  lower <- which(probs <= probObs)
  xyz2 <- apply(xyz, 1, CoordTriangle)
  xlim <- c(0, 1); ylim <- c(0, 2/sqrt(3))
  # black: more probable than the observation; red: as rare or rarer
  plot(xyz2[1, higher]/N, xyz2[2, higher]/N, xlim = xlim, ylim = ylim, col = "black", pch = 15)
  par(new = T)
  plot(xyz2[1, lower]/N, xyz2[2, lower]/N, xlim = xlim, ylim = ylim, col = "red", pch = 15)
  DrawTriangleFrame()
}
par(mfcol=c(1,2))
DrawHigherLower(c(10,7,3))
DrawHigherLower(c(10,7,3),c(0.9,0.05,0.05))
par(mfcol=c(1,1))
```

Draw probabilities under two hypotheses.

```
DrawDirichletDensity <- function(prior = c(1, 2, 3), k = 0.01) {
  library(MCMCpack)
  xyz <- TriLattice(k = k)
  density <- Density(ddirichlet(xyz, prior))
  xyz2 <- apply(xyz, 1, CoordTriangle)
  xlim <- c(0, 1); ylim <- c(0, 2/sqrt(3))
  plot(xyz2[1, ], xyz2[2, ], xlim = xlim, ylim = ylim, col = gray(density), pch = 15)
  DrawTriangleFrame()
}
DrawTriangleFrame <- function() { # draw the triangle frame
  framePointsX <- c(1, 0, 0, 1)
  framePointsY <- c(1/2, 2/sqrt(3), 0, 1/2)
  s <- seq(length(framePointsX) - 1)
  segments(framePointsX[s], framePointsY[s], framePointsX[s + 1], framePointsY[s + 1])
}
CoordTriangle <- function(x) { # convert to triangle coordinates
  x2 <- x[1]; y2 <- 2/sqrt(3) * x[2] + 1/2 * x[1]
  c(x2, y2, 0)
}
TriLattice <- function(k = 0.01) {
  x <- y <- seq(from = 0, to = 1, by = k)
  xy <- expand.grid(x, y)
  z <- 1 - xy[, 1] - xy[, 2]
  xyz <- as.matrix(cbind(xy[, 1], xy[, 2], z))
  minval <- apply(xyz, 1, min)
  xyz[minval > 0, ]
}
Density <- function(d) {
  if (max(d) == min(d)) {
    rep(0.5, length(d))
  } else {
    (max(d) - d)/(max(d) - min(d))
  }
}
DrawDirichletDensity(c(1,1,1))
DrawDirichletDensity(c(10,7,3))
DrawCI <- function(p = c(11, 8, 4), ci = 0.95, N = 10000) {
  library(MCMCpack)
  rs <- rdirichlet(N, p)
  ds <- ddirichlet(rs, p)
  ord <- order(ds)
  OutPoints <- N * (1 - ci)
  rs2 <- apply(rs, 1, CoordTriangle)
  xlim <- c(0, 1); ylim <- c(0, 2/sqrt(3))
  plot(rs2[1, ord[1:OutPoints]], rs2[2, ord[1:OutPoints]], col = gray(6/8),
       xlim = xlim, ylim = ylim, pch = 15)
  par(new = T)
  plot(rs2[1, ord[(OutPoints + 1):N]], rs2[2, ord[(OutPoints + 1):N]],
       xlim = xlim, ylim = ylim, pch = 15)
  DrawTriangleFrame()
}
DrawCI(p = c(11, 8, 4), ci = 0.95, N = 10000)
```

2x2 table

Assume p is the probability of A. \[ \begin{equation*} \frac{n_{1.}!}{n_{11}! n_{12}!} \frac{n_{2.}!}{n_{21}!n_{22}!} p^{n_{.1}} (1-p)^{n_{.2}} \end{equation*} \]

Regardless of p, the probability of observing the 4 cell values is proportional to \[ \frac{n_{1.}!}{n_{11}! n_{12}!} \frac{n_{2.}!}{n_{21}!n_{22}!} \]

To make the probabilities of all possible tables sum to 1, it becomes: \[ \begin{equation*} \frac{n_{1.}!}{n_{11}! n_{12}!} \frac{n_{2.}!}{n_{21}!n_{22}!} \times \frac{n_{.1}! n_{.2}!}{n_{..}!} \end{equation*} \]

Using this, we can determine the p-value; this is the p-value of the exact probability test.
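A sketch of this computation for a single 2x2 table, using `lgamma` for numerical stability (the function name is ours); it agrees with the hypergeometric probability `dhyper` underlying `fisher.test`:

```r
# exact probability of one 2x2 table given its marginal counts
exactProb2x2 <- function(m) {
  rs <- rowSums(m); cs <- colSums(m); N <- sum(m)
  exp(sum(lgamma(rs + 1)) + sum(lgamma(cs + 1)) - lgamma(N + 1) - sum(lgamma(m + 1)))
}
m <- matrix(c(10, 20, 30, 40), nrow = 2)
exactProb2x2(m)
dhyper(m[1, 1], rowSums(m)[1], rowSums(m)[2], colSums(m)[1]) # the same value
```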

(R13-2.R).

The calculated values fit well to the formula below:

\[ \begin{equation*} a\times e^{-b \delta^2} \end{equation*} \]

The exact probability decreases with the square of the deviation from the expected value.

```
# expected table
makeExptable <- function(m = matrix(c(10, 20, 30, 40), nrow = 2)) {
  m1 <- apply(m, 1, sum); m2 <- apply(m, 2, sum); N <- sum(m)
  m1 %*% t(m2)/N
}
# Change a value of (1,1) from expected value by 1
ProbDistance <- function(m = matrix(c(10, 20, 30, 40), nrow = 2)) {
  # marginal counts
  m1 <- apply(m, 1, sum)
  m2 <- apply(m, 2, sum)
  etable <- makeExptable(m)
  N <- sum(etable)
  d <- seq(from = 0, to = 20, by = 1) # deviation from the expected table
  # all cells must be non-negative
  x <- d + etable[1, 1]; y <- -d + etable[1, 2]; z <- -d + etable[2, 1]; w <- d + etable[2, 2]
  mat <- matrix(c(x, y, z, w), ncol = 4); mins <- apply(mat, 1, min); matOK <- mat[mins > 0, ]
  # component of the exact probability calculated from marginal counts
  tmp <- sum(lgamma(m1 + 1)) + sum(lgamma(m2 + 1)) - lgamma(N + 1)
  # exact probability of all possible tables
  prob <- exp(-apply(lgamma(matOK + 1), 1, sum) + tmp)
  list(x = d[mins > 0], prob = prob)
}
d <- ProbDistance(m = matrix(c(10, 20, 30, 40), nrow = 2)) #c(10,20,30,40) is cell values of 2x2 table
ylim <- c(0, max(d$prob))
plot(d$x, d$prob, ylim = ylim)
# function to be minimized (sum of squared residuals)
fForOptim <- function(x) {
  a <- x[1]; b <- x[2]
  sum((d$prob - a * exp(-b * d$x^2))^2)
}
optout <- optim(c(1, 1), fForOptim) # initial coefficients set at (1, 1)
xfine <- seq(from = min(d$x), to = max(d$x), by = 0.01) # fine grid for the horizontal axis
optprob <- optout$par[1] * exp(-optout$par[2] * xfine^2) # fitted function values
par(new=T)
plot(xfine, optprob, ylim = ylim, col = "red", type = "l")
```

\[ \begin{equation*} \chi^2=\sum_{i,j} \frac{(n_{ij}-e_{ij})^2}{e_{ij}} \end{equation*} \]

```
chisqCalc <- function(m = matrix(c(10, 20, 30, 40), nrow = 2)) {
  etable <- makeExptable(m)
  chi2 <- sum((m - etable)^2/etable)
  df <- (nrow(m) - 1) * (ncol(m) - 1) # df = (N-1) x (M-1)
  p <- pchisq(chi2, df, lower.tail = FALSE)
  list(chi2 = chi2, p = p, df = df)
}
chisqCalc(m=matrix(c(10,20,30,40),nrow=2))
```

Compare the likelihood under the null hypothesis with the likelihood under the alternative hypothesis.

For the alternative hypothesis, we use the maximum likelihood values of the parameter(s) modeling it.

Alternative hypothesis, with maximum likelihood estimates \(\frac{n_{11}}{n_{1.}}\) and \(\frac{n_{21}}{n_{2.}}\):

\[ \begin{align*} L(G1\not = G2) &= \frac{n_{.1}!}{n_{11}!n_{21}!} \frac{n_{.2}!}{n_{12}!n_{22}!} (\frac{n_{11}}{n_{1.}})^{n_{11}} (\frac{n_{12}}{n_{1.}})^{n_{12}} (\frac{n_{21}}{n_{2.}})^{n_{21}} (\frac{n_{22}}{n_{2.}})^{n_{22}} \\ &=\frac{n_{.1}!}{n_{11}!n_{21}!} \frac{n_{.2}!}{n_{12}!n_{22}!} \frac{ n_{11}^{n_{11}} n_{12}^{n_{12}} n_{21}^{n_{21}} n_{22}^{n_{22}} } {n_{1.}^{n_{1.}} n_{2.}^{n_{2.}} } \end{align*} \]

Null hypothesis: \[ \begin{align*} L(G1= G2) &= \frac{n_{.1}!}{n_{11}!n_{21}!} \frac{n_{.2}!}{n_{12}!n_{22}!} (\frac{n_{.1}}{n_{..}})^{n_{11}+n_{21}} (\frac{n_{.2}}{n_{..}})^{n_{12}+n_{22}} \\ &= \frac{n_{.1}!}{n_{11}!n_{21}!} \frac{n_{.2}!}{n_{12}!n_{22}!} \frac{ n_{.1}^{n_{.1}} n_{.2}^{n_{.2}} }{n_{..}^{n_{..}} } \end{align*} \]

Likelihood ratio: the factorial components cancel out.

\[ \begin{equation*} \frac{L(G1\not = G2) }{L(G1= G2) }= \frac{n_{..}^{n_{..}} n_{11}^{n_{11}} n_{12}^{n_{12}} n_{21}^{n_{21}} n_{22}^{n_{22}} } {n_{1.}^{n_{1.}} n_{2.}^{n_{2.}}n_{.1}^{n_{.1}} n_{.2}^{n_{.2}} } \end{equation*} \]

In general \[ \begin{equation*} \frac{L(G1\not = G2) }{L(G1= G2) } = \prod_{i,j} (\frac{n_{ij}}{e_{ij}})^{n_{ij}} \end{equation*} \]

Twice the log-likelihood ratio can be treated as a \(\chi^2\) value (R13-4.R).

```
likelihoodRatioTest <- function(m = matrix(c(10, 20, 30, 40), nrow = 2)) {
  etable <- makeExptable(m)
  chi2 <- 2 * sum(log(m/etable) * m)
  df <- (nrow(m) - 1) * (ncol(m) - 1)
  p <- pchisq(chi2, df, lower.tail = FALSE)
  return(list(statistic = chi2, p.value = p, df = df))
}
likelihoodRatioTest(m=matrix(c(10,20,30,40),nrow=2))
```

We have three test methods for contingency tables.

They are similar, but of course differ from each other.

We should know when they agree, when they differ, and how.

- When sample size is small
- When sample size is large

Small table (3, 6, 9, 12)

Large table (100, 200, 300, 400)

Figures 13.3, 13.4, 13.5 and R13-supl.R

Symmetry?

The \(\chi^2\) distribution supports infinitely large values of distance.

The exact test does not accept negative cell values, meaning only a finite region of the df-dimensional space is covered.

Calculation load

Behaviour with small sample sizes

Symmetry

Exact vs. asymptotic

Discrete vs. continuous

Finite vs. infinite

Table 13.4.

2x3 table for case-control x SNP genotypes

Null hypothesis

Dominant model df =1

Recessive model df=1

Additive model df = 1

Any one of three genetic models 1 < df < 2

Any difference between case and control df = 2

Table 13.5 : RR constraints

- Test methods Table 13.6

```
#library(Rassoc)
# Make a table from coordinates in 2D space
casecontRTheta <- function(R, t, af, f, N) {
  popW <- c(af^2 + af * (1 - af) * f, 2 * af * (1 - af) * (1 - f), (1 - af)^2 + af * (1 - af) * f)
  table <- matrix(c(popW * N[1], popW * N[2]), nrow = 2, byrow = TRUE)
  table[1, 1] <- table[1, 1] - R/2 * cos(t) + sqrt(3)/2 * R * sin(t)
  table[2, 1] <- table[2, 1] + R/2 * cos(t) - sqrt(3)/2 * R * sin(t)
  table[1, 2] <- table[1, 2] + R * cos(t)
  table[2, 2] <- table[2, 2] - R * cos(t)
  table[1, 3] <- table[1, 3] - R/2 * cos(t) - sqrt(3)/2 * R * sin(t)
  table[2, 3] <- table[2, 3] + R/2 * cos(t) + sqrt(3)/2 * R * sin(t)
  return(table)
}
af<-0.4;f<-0;N<-c(100,100) # allele frequency,HWE-f,sample size
x<-y<-seq(from=-10,to=10,by=1)
# apply multiple test methods to the generated table
cattp<-max3p<-df2p<-domp<-recp<-matrix(rep(0,length(x)*length(y)),nrow=length(x))
for(i in 1:length(x)){
  for(j in 1:length(y)){
    R <- sqrt(x[i]^2 + y[j]^2)
    if(x[i] == 0){
      t <- pi/2
    }else{
      t <- atan(y[j]/x[i])
    }
    popdata <- casecontRTheta(R = R, t = t, af = af, f = f, N = N)
    if(min(popdata) > 0){
      catout <- CATT(popdata, x = 0.5) # Cochran-Armitage trend test with score 0.5
      domout <- CATT(popdata, x = 1) # Cochran-Armitage trend test with score 1
      recout <- CATT(popdata, x = 0) # Cochran-Armitage trend test with score 0
      max3out <- MAX3(popdata)
      df2chi2out <- chisq.test(popdata)
      cattp[i, j] <- catout$p
      max3p[i, j] <- max3out$p
      df2p[i, j] <- df2chi2out$p.value
      domp[i, j] <- domout$p
      recp[i, j] <- recout$p
    }
  }
}
filled.contour(x,y,-log(cattp,10),color=terrain.colors)
filled.contour(x,y,-log(max3p,10),color=terrain.colors)
filled.contour(x,y,-log(df2p,10),color=terrain.colors)
filled.contour(x,y,-log(domp,10),color=terrain.colors)
filled.contour(x,y,-log(recp,10),color=terrain.colors)
```

Figure 13.6.

When the likelihood for the alternative hypothesis has free parameter(s), twice the log-likelihood ratio can be interpreted with a \(\chi^2\) distribution whose df equals the number of free parameters.

However, when the alternative hypothesis does not have a continuous parameter, as in DNA identification, the likelihood ratio itself, rather than the p-value of a likelihood ratio test, should be used for judgement.

(R13-6.R).

```
#library(Rassoc)
Niter<-1000
Nca<-1000
Nco<-1000
chiP<-cattP<-domP<-recP<-rep(0,Niter)
for(i in 1:Niter){
  af <- runif(1) * 0.6 + 0.2
  gf <- c(af^2, 2 * af * (1 - af), (1 - af)^2) # HWE genotype frequencies
  case <- sample(c(0, 1, 2), Nca, replace = TRUE, prob = gf)
  cont <- sample(c(0, 1, 2), Nco, replace = TRUE, prob = gf)
  m <- matrix(c(length(which(case == 0)), length(which(case == 1)), length(which(case == 2)),
                length(which(cont == 0)), length(which(cont == 1)), length(which(cont == 2))),
              nrow = 2, byrow = TRUE)
  chiP[i] <- chisq.test(m, correct = FALSE)$p.value
  cattP[i] <- CATT(m, 0.5)$p
  domP[i] <- CATT(m, 1)$p
  recP[i] <- CATT(m, 0)$p
}
plot(as.data.frame(cbind(chiP, domP, cattP, recP)), cex = 0.1)
cormat <- cor(cbind(chiP, domP, cattP, recP))
cormat
```

When multiple tests are mutually dependent, multiple testing correction (Chapter 17) becomes more difficult.

Two variables are observed for samples.

The relation between the two variables is evaluated/tested in various ways.

The data type of variables determines the appropriate method.

- For details of the tests, investigate their R source code.

```
# NxM table
# Randomly generate tables with a given size (NxM), multiple tests are applied to them and compare their results
# NxM table
# Randomly generate tables of a given size (NxM), apply multiple tests to them, and compare the results
CompareTests <- function(N = 2, M = 2, Niter = 1000, Ns = 100, k1 = 10, k2 = 3){
  library(MCMCpack)
  library(clinfun)
  pearsonp <- trendp <- kwp <- jtp <- lmp <- rep(0, Niter)
  for(i in 1:Niter){
    # generate an NxM table at random
    fn <- rdirichlet(1, rep(k1, N))
    fm <- rdirichlet(1, rep(k2, M))
    first <- sample(1:N, Ns, prob = fn, replace = TRUE)
    second <- sample(1:M, Ns, prob = fm, replace = TRUE)
    t <- table(first, second) # convert to a contingency table
    pearsonp[i] <- chisq.test(t, correct = FALSE)$p.value # Pearson's independence test
    if(N == 2){
      trendp[i] <- prop.trend.test(t[1, ], t[1, ] + t[2, ], score = 1:M)$p.value # trend chi-square test with additive scores
    }
    kwp[i] <- kruskal.test(second ~ first)$p.value # Kruskal-Wallis
    jtp[i] <- jonckheere.test(second, first, alternative = "two.sided")$p.value # Jonckheere-Terpstra
    lmp[i] <- summary(lm(second ~ first))$coefficients[2, 4] # linear regression
  }
  databind <- cbind(pearsonp, kwp, jtp, lmp, trendp)
  plot(as.data.frame(databind))
  return(databind)
}
N<-2;M<-2;Niter<-100;Ns<-1000 # 2x2 table
ctout22<-CompareTests(N,M,Niter,Ns)
N<-2;M<-3;Niter<-100;Ns<-1000 # 2x3 table
ctout23<-CompareTests(N,M,Niter,Ns)
N<-2;M<-10;Niter<-100;Ns<-1000 # 2x10 table
ctout210<-CompareTests(N,M,Niter,Ns)
N<-3;M<-3;Niter<-100;Ns<-1000 # 3x3 table
ctout33<-CompareTests(N,M,Niter,Ns)
```

One axis is 2-categorical

When one axis has three or more categories without order in them

When both axes are ordered

Table 13.7

In R, values are stored as vectors, and R functions will accept them even when they are “not appropriate” from a statistician's standpoint.

What does “not appropriate” mean here?

```
m<-matrix(c(1,2,3,4)*3,nrow=2)
N<-sum(m)
x<-0:N
m1<-apply(m,1,sum)
m2<-apply(m,2,sum)
y<--x+m1[1]
z<--x+m2[1]
w<--(x+y+z)+N
data<-matrix(c(x,y,z,w),ncol=4)
min4<-apply(data,1,min)
dataOK<-data[min4>=0,]
L<-length(dataOK[,1])
pFisher<-pChiNoCorrect<-ChiNoCorrect<-pLRT<-ChiLRT<-rep(0,L)
for(i in 1:L){
  tmptable<-matrix(dataOK[i,],nrow=2)
  chisqout<-chisq.test(tmptable,correct=FALSE)
  LRTout<-likelihoodRatioTest(tmptable)
  pChiNoCorrect[i]<-chisqout$p.value
  ChiNoCorrect[i]<-chisqout$statistic
  pLRT[i]<-LRTout$p.value
  ChiLRT[i]<-LRTout$statistic
  pFisher[i]<-fisher.test(tmptable)$p.value
}
ylim<-c(0,1)
plot(x[min4>=0],pChiNoCorrect,ylim=ylim,type="l")
par(new=T)
plot(x[min4>=0],pLRT,ylim=ylim,col="red",type="l")
par(new=T)
plot(x[min4>=0],pFisher,ylim=ylim,col="blue",type="l")
ylim<-c(log(min(pFisher),10),1)
plot(x[min4>=0],log(pChiNoCorrect,10),ylim=ylim,type="l")
par(new=T)
plot(x[min4>=0],log(pLRT,10),ylim=ylim,col="red",type="l")
par(new=T)
plot(x[min4>=0],log(pFisher,10),ylim=ylim,col="blue",type="l")
ps<-cbind(pChiNoCorrect,pLRT,pFisher)
plot(as.data.frame(ps),cex=1)
plot(log(as.data.frame(ps),10),cex=1)
m<-matrix(c(1,2,3,4)*100,nrow=2)
N<-sum(m)
x<-0:N
m1<-apply(m,1,sum)
m2<-apply(m,2,sum)
y<--x+m1[1]
z<--x+m2[1]
w<--(x+y+z)+N
data<-matrix(c(x,y,z,w),ncol=4)
min4<-apply(data,1,min)
dataOK<-data[min4>=0,]
L<-length(dataOK[,1])
pFisher<-pChiNoCorrect<-ChiNoCorrect<-pLRT<-ChiLRT<-rep(0,L)
for(i in 1:L){
  tmptable<-matrix(dataOK[i,],nrow=2)
  chisqout<-chisq.test(tmptable,correct=FALSE)
  LRTout<-likelihoodRatioTest(tmptable)
  pChiNoCorrect[i]<-chisqout$p.value
  ChiNoCorrect[i]<-chisqout$statistic
  pLRT[i]<-LRTout$p.value
  ChiLRT[i]<-LRTout$statistic
  pFisher[i]<-fisher.test(tmptable)$p.value
}
ylim<-c(0,1)
plot(x[min4>=0],pChiNoCorrect,ylim=ylim,type="l")
par(new=T)
plot(x[min4>=0],pLRT,ylim=ylim,col="red",type="l")
par(new=T)
plot(x[min4>=0],pFisher,ylim=ylim,col="blue",type="l")
ylim<-c(log(min(pFisher),10),1)
plot(x[min4>=0],log(pChiNoCorrect,10),ylim=ylim,type="l")
par(new=T)
plot(x[min4>=0],log(pLRT,10),ylim=ylim,col="red",type="l")
par(new=T)
plot(x[min4>=0],log(pFisher,10),ylim=ylim,col="blue",type="l")
abline(v=120)
abline(v=20)
abline(v=220)
abline(h=log(pChiNoCorrect[21],10))
ps<-cbind(pChiNoCorrect,pLRT,pFisher)
plot(as.data.frame(ps),cex=0.1)
plot(log(as.data.frame(ps),10),cex=0.1)
```

Epidemiology can address causal relations.

Chronological order: prospective studies and retrospective studies

There are the following four patterns for the relation between A and B.

A is the cause of B.

B is the cause of A.

A and B are both results of a common cause C.

A and B together are the cause of C.

Figure 14.1.

Genotypes come first (chronologically).

Order or directional relation is expressed as directed edge in graphs or networks.

Cycles make it hard to handle causation with graphs, which is why an acyclic graph is often required for network analysis.

In particular, Bayesian networks are directed acyclic graphs.

Using the bnlearn package of R (R14-1.R).

```
library(bnlearn)
e <- empty.graph(LETTERS[1:5]) # 5 nodes
arc.set <- matrix(c("A", "C", "B", "D", "C", "E","D","A","A","E"), ncol = 2, byrow = TRUE)
arcs(e)<-arc.set # set edges
plot(e)
# generate data set
set.seed(123456)
Ns<-10000
A<-rpois(Ns,3);B<-rpois(Ns,3);C<-rpois(Ns,3);D<-rpois(Ns,3);E<-rpois(Ns,3)
B[1:(Ns/4)]<-A[1:(Ns/4)];C[(1+Ns/4):(Ns/2)]<-A[(1+Ns/4):(Ns/2)]
D[(1+Ns/2):(3*Ns/4)]<-A[(1+Ns/2):(3*Ns/4)];E[(1+3*Ns/4):Ns]<-A[(1+3*Ns/4):Ns]
d<-as.data.frame(cbind(A,B,C,D,E))+0.0
res<-gs(d) # Network estimation with Grow-Shrink method, one of multiple methods
plot(res)
# If networks are at random...
plot(random.graph(LETTERS[1:5], num = 1))
# Compare the estimated graph and random graphs for wellness in data-description
GSscore<-score(res, d) # Score the wellness of estimated graph
Niter<-100 # number of random graphs to generate
scores<-rep(0,Niter) # score the wellness of random graphs
for(i in 1:Niter){
  randomGraph<-random.graph(LETTERS[1:5], num = 1)
  scores[i]<-score(randomGraph,d)
}
ylim<-c(min(scores,GSscore),max(scores,GSscore))
plot(scores,ylim=ylim,pch=20)
abline(h=GSscore)
bn.fit(res,d)
```

Chapter 15 Enumeration

Chapter 16 Omission

Chapter 17 Many tests

Large-scale data science, including genomics, features huge amounts of data and many factors. Part Ⅴ focuses on enumerating the huge number of possibilities and on how to deal with their enormous scale.

- Permutation and combination

\[ \begin{equation*} P(N,k)=N(N-1)(N-2)...(N-(k-1))=\frac{N!}{(N-k)!}=\frac{\Gamma(N+1)}{\Gamma(N-k+1)} \end{equation*} \]

The \(\Gamma\) function is important here.

```
permN<-function(N=10,k=3){ exp(lgamma(N+1)-lgamma((N-k)+1)) }
permN(N=10,k=3)
```

Repeated permutation \[ \begin{equation*} \Pi(N,k)=N^k \end{equation*} \]

```
repPermN<-function(N=10,k=3){ N^k }
repPermN(N=10,k=3)
```

Consider k objects in total, each labeled with one of 1, 2, …, N, where label i appears \(n_i\) times. The number of ways of labeling is \[ \begin{equation*} \frac{k!}{\prod_{i=1}^N n_i!};\;\; \sum_{i=1}^N n_i=k \end{equation*} \]
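A sketch of this count using `lgamma` (the function name is ours):

```r
# number of ways to label k = sum(n) objects so that label i appears n[i] times
multiLabel <- function(n) exp(lgamma(sum(n) + 1) - sum(lgamma(n + 1)))
multiLabel(c(2, 1)) # 3!/(2!1!) = 3
multiLabel(c(1, 1, 1)) # 3! = 6
```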

For the 1st label set, \[ \begin{equation*} \frac{n_{..}!}{\prod_{i=1}^N n_{i.}!} \end{equation*} \] for the second, \[ \begin{equation*} \frac{n_{..}!}{\prod_{j=1}^M n_{.j}!} \end{equation*} \]

In combination, \[ \begin{equation*} X= \frac{n_{..}!}{\prod_{i=1}^N n_{i.}!} \times \frac{n_{..}!}{\prod_{j=1}^M n_{.j}!} \end{equation*} \]

We now make combination of labels, \(N\times M\) types. \[ \begin{equation*} Y=\frac{n_{..}!}{ \prod_{i=1}^N\prod_{j=1}^M n_{ij}!} \end{equation*} \]

The ratio of Y to X is the exact probability of a 2x2 table. \[ \begin{equation*} \frac{Y}{X}=\frac{n_{..}!}{ \prod_{i=1}^N\prod_{j=1}^M n_{ij}!} \frac {\prod_{i=1}^N n_{i.}!} {n_{..}!} \times \frac{\prod_{j=1}^M n_{.j}!}{n_{..}!} = \frac{\prod_{i=1}^N n_{i.}! \prod_{j=1}^M n_{.j}!}{n_{..}! \prod_{i=1}^N \prod_{j=1}^M n_{ij}!} \end{equation*} \]
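To check the formula numerically, the ratio \(Y/X\) for a concrete 2x2 table agrees with R's hypergeometric density dhyper() (the function name exact2x2 is ours, for illustration):

```r
# Exact probability of a 2x2 table:
# prod(row sums!) * prod(col sums!) / (total! * prod(cells!)), via lgamma
exact2x2<-function(t){
rs<-rowSums(t);cs<-colSums(t);n<-sum(t)
exp(sum(lgamma(rs+1))+sum(lgamma(cs+1))-lgamma(n+1)-sum(lgamma(t+1)))
}
t<-matrix(c(3,1,1,3),nrow=2,byrow=TRUE)
exact2x2(t)
dhyper(t[1,1],rowSums(t)[1],rowSums(t)[2],colSums(t)[1]) # same value
```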

\[ \begin{equation*} C(N,k)=\frac{N!}{k!(N-k)!}=\frac{P(N,k)}{k!} \end{equation*} \]

`choose(20,3)`

Repeated combination, \(H(N,k)\). \[ \begin{equation*} H(N,k)=C(N+k-1,k)= \sum_{i=0}^k (C(N,k-i) \times H(k-i,i)) = \sum_{i=0}^k (C(N,k-i) \times C(k-1,i)) \end{equation*} \]

This can be modeled as k items placed into N slots separated by N-1 bars; the bars are indistinguishable from each other, as are the items (the "stars and bars" model).

\[ \begin{equation*} H(N,k)=\frac{(N-1+k)!}{(N-1)!\,k!}=C(N+k-1,k) \end{equation*} \]

```
repCombN<-function(N=10,k=3){
choose(N+k-1,k)
}
repchoose<-function(n=10,k=3){
ret<-0
for(i in 0:k){
ret<-ret+choose(n,k-i)*repCombN(k-i,i)
}
ret
}
repCombN(N=10,k=3)
repchoose(n=10,k=3)
```

What is a partition?

Put N items into k groups, where every group must contain at least one item.

The Stirling number of the second kind, \(St2(N,k)\), is the number of such partitions.

It satisfies the following equation. \[ \begin{equation*} St2(N+1,k)=St2(N,k)\times k +St2(N,k-1) \end{equation*} \]

When k=2, it is related to the sum of the ways to choose i (i=1,2,…,N-1) items out of N: \[ \begin{align*} St2(N,2)&=\frac{1}{2} \sum_{i=1}^{N-1} C(N,i) \\ &= \frac{1}{2} (\sum_{i=0}^N C(N,i) -(C(N,0)+C(N,N)) )\\ &= \frac{1}{2} ((1+1)^N -2) = 2^{N-1}-1 \end{align*} \]

Partitioning N items into any number of groups gives the Bell number. \[ \begin{equation*} B(N)=\sum_{i=1}^N St2(N,i) \end{equation*} \] or \[ \begin{equation*} B(N+1)=\sum_{i=0}^N C(N,i) B(i) \end{equation*} \]

Figure 15.2 shows that one enumeration can be interpreted in multiple ways.

```
Stirling2N<-function(N=10,k=3){
ret<-0
if(k<=N && k>=1){
if(k==1){
ret<-1
}else{
ret<-Stirling2N(N-1,k)*k+Stirling2N(N-1,k-1)
}
}
ret
}
Stirling2N(N=10,k=3)
```

```
bellN<-function(N=10){
bs<-rep(0,N+1)
bs[1]<-1
if(N>=1){
for(i in 2:(N+1)){
for(j in 0:(i-2)){
bs[i]<-bs[i]+choose(i-2,j)*bs[j+1]
}
}
}
bs[2:length(bs)]
}
bellN(N=10)
```

When we have multiple categories, we may want to group them into fewer categories. That is partitioning.

For example, the dominant genetic model groups 3 genotypes into 2.

When the categories are ordered, partitioning them amounts to placing border(s) in the order.

Figure 15.3.

\[ \begin{equation*} B_{order}(N)=\sum_{i=1}^{N-1} C(N-1,i)=2^{N-1}-1=St2(N,2) \end{equation*} \]

In R (R15-7.R):

```
Border<-function(N){
2^{N-1}-1
}
Border(10)
```

When performing hierarchical clustering, it is necessary to know the number of ways to construct trees with N leaves.

One such count is the Catalan number.

It counts rooted (plane) trees with N edges, and also binary trees with N non-leaf vertices, i.e. binary trees with N+1 leaves.

```
CatalanN <- function(N = 10) {
exp(lgamma(2 * N + 1) - lgamma(N + 2) - lgamma(N + 1))
}
CatalanN(N=10)
```

Double-factorial number

\[ \begin{equation*} (2N-1)!!=1\times 3\times 5 \times ...\times (2N-1) \end{equation*} \]

The number of clustering trees (unrooted binary trees) with N leaves is: \[ \begin{equation*} (2N-5)!! \end{equation*} \]

(R15-9.R, Figure 15.5).

```
library(phangorn)
allTrees(5) # No. items = 5
trees<-allTrees(5)
plot(trees)
doubleFactorialN<-function(N=5){
if(N<3){
print("no answer")
}else{
ret<-1
for(i in 3:N){
ret<-ret*(2*(i-2)-1)
}
ret
}
}
doubleFactorialN(N=5)
```

Undirected graph:

\(\frac{N(N-1)}{2}\) is the number of possible edges.

\[ \begin{equation*} 2^{\frac{N(N-1)}{2}} \end{equation*} \]

is the number of undirected graphs without self-loops.

Directed graph: \[ \begin{equation*} 3^{\frac{N(N-1)}{2}} \end{equation*} \] because each pair of vertices is one of {no edge, a->b, a<-b}.

A directed acyclic graph (DAG) is the type of graph used for Bayesian networks: \[ \begin{align*} N_{acg}(0)&=1\\ N_{acg}(N)&=\sum_{k=1}^N (-1)^{k+1} C(N,k)\times 2^{k(N-k)} \times N_{acg}(N-k)\\ \end{align*} \]
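A minimal sketch of this recursion in R (NacgN is our name; the values can be checked against OEIS A003024: 1, 1, 3, 25, 543, …):

```r
# Number of labeled DAGs with N nodes, via the inclusion-exclusion recursion
NacgN<-function(N){
if(N==0) return(1)
ret<-0
for(k in 1:N){
ret<-ret+(-1)^(k+1)*choose(N,k)*2^(k*(N-k))*NacgN(N-k)
}
ret
}
sapply(0:4,NacgN) # 1 1 3 25 543
```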

You may find these integer sequences at http://www.research.att.com/~njas/sequences/

- permutation (factorial numbers): id A000142
- the second Stirling number (Stirling numbers of the second kind): id A008277
- Bell number (Bell or exponential numbers): id A000110
- Catalan number (Catalan numbers): id A000108
- double factorial number (double factorial numbers): id A001147
- directed acyclic graph number (number of acyclic digraphs (or DAGs) with n labeled nodes): id A003024

Pseudo-random number sequences (R16-1.R, Figure 16.1).

```
library(MCMCpack)
d<-rdirichlet(1000,c(1,1,1)) # Random sampling from dirichlet distribution
plot(d[,1],sqrt(3)/2*(d[,2]-d[,3]),xlim=c(0,1),ylim=c(-sqrt(3)/2,sqrt(3)/2))
segments(c(0,1,0),c(-sqrt(3)/2,0,sqrt(3)/2),c(1,0,0),c(0,sqrt(3)/2,-sqrt(3)/2)) # Triangle
```

- Resampling

We have samples.

We re-sample samples from the sample pool.

Replacement may or may not be allowed.

```
x<-1:10
#bootstrap
sample(x,replace=TRUE)
#permutation
sample(x,replace=FALSE)
```

Used in cross-validation and population-parameter estimation.

- Permutation

Changing the order of 1,2,…,n.

\(n!\) ways.

It is related to exact probability tests: an exact test enumerates all patterns with the marginal counts fixed, while a permutation test treats the observed set of samples as given.

The following is Monte Carlo permutation-based p-value estimation.

```
# Data set generation
A<-1:8
B<-c(1,3,2,6,4,7,8,5)
# All permutations vs. monte-carlo permutations.
PermutationTestJT<-function(A,B,MonteCarlo=TRUE,n=1000){# n : in case of monte-carlo
library(clinfun) # package of jonckheere.test()
ret<-0
Ps<-NULL
jtp<-jonckheere.test(A,B)$p.value # p value of observed
if(MonteCarlo || length(A)>10){ # if element size>10, monte-carlo.
Ps<-rep(0,n)
for(i in 1:n){
tmp<-sample(1:length(A),length(A))
Ps[i]<-jonckheere.test(A[tmp],B)$p.value
}
}else{ # full permutation
library(gtools) # package for permutations()
perms<-permutations(length(A),length(A)) # all permutations will return
Ps<-rep(0,length(perms[,1]))
for(i in 1:length(perms[,1])){
Ps[i]<-jonckheere.test(A[perms[i,]],B)$p.value
}
}
ret<-length(Ps[Ps<=jtp])/length(Ps)
list(originalp.value=jtp,permp.value=ret,Ps=Ps)
}
out1<-PermutationTestJT(A,B,MonteCarlo=FALSE) # full permutation result=exact test
out2<-PermutationTestJT(A,B,MonteCarlo=TRUE,n=100) # Monte Carlo result
out1$originalp.value
out1$permp.value
out2$permp.value
plot(sort(out1$Ps))
abline(h=out1$permp)
```

Random walk based on Markov chains

A random walk is supposed to wander around everywhere evenly.

A weighted random walk can make the procedure more efficient, e.g. Metropolis-Hastings or methods using conjugate priors.

```
nstep<-100
rwalk<-matrix(0,nstep,2)
rtheta<-rnorm(nstep-1)
stepx<-cos(rtheta)
stepy<-sin(rtheta)
for(i in 1:(nstep-1)){
rwalk[i+1,1]<-rwalk[i,1]+stepx[i]
rwalk[i+1,2]<-rwalk[i,2]+stepy[i]
}
plot(rwalk,type="l")
```

- Fit to a model

Exact tests and \(\chi^2\) tests are approximately the same.

A model function is set, and an optimization procedure estimates its coefficients; this is also an approximation.

Extreme value distributions are used as the model distribution of minimum p-values in multiple-testing settings.

\[ \begin{equation*} G(x)=e^{-(1+\zeta\frac{x-\mu}{\sigma})^{-\frac{1}{\zeta}}} \end{equation*} \]

The formula looks scary, but all we want is good estimates of its parameters.

```
# random chisq value (df=1)
N<-100000;M<-100;t<-rchisq(N,1)
# group into 100 groups
matt<-matrix(t,nrow=M)
# max of each set
maxt<-apply(matt,1,max)
#evd package
#Generalized Extreme Value Distribution parameter estimation
# Compare the estimated distribution and samples
library(evd)
gevest<-fgev(maxt)
qfromgev<-qgev(ppoints(length(maxt),a=0),loc=gevest$estimate[1],scale=gevest$estimate[2],shape=gevest$estimate[3])
plot(ppoints(length(maxt),a=0),sort(maxt),ylim=c(0,max(maxt)),type="p",cex=1,col=gray(5/8))
par(new=T)
plot(ppoints(length(maxt),a=0),qfromgev,ylim=c(0,max(maxt)),type="l")
```

- Fit to a format

Take out the big components and throw away the tiny ones.

Eigenvalue decomposition: each eigenvalue stands for the significance of the corresponding feature.

Polynomial approximation is the same idea. \[ \begin{equation*} f(x)=a_0+a_1 x + a_2 x^2 +... + a_k x^k \end{equation*} \]

```
sortedMaxt<-sort(maxt) # max chisq values, sorted
s<-ppoints(length(sortedMaxt),a=0)
# Least-squares fit of a degree-k polynomial y=a_0+a_1 x+...+a_k x^k,
# then plot the data (black) and the fitted curve (red)
polyFit<-function(k){
f<-function(x){ # sum of squared residuals
fitted<-rep(0,length(s))
for(j in 0:k) fitted<-fitted+x[j+1]*s^j
sum((sortedMaxt-fitted)^2)
}
optout<-optim(rep(1,k+1),f,method="BFGS")
ylim<-c(0,max(sortedMaxt))
plot(s,sortedMaxt,ylim=ylim,type="l")
optt<-rep(0,length(s))
for(i in 1:length(optt)){
optt[i]<-sum(optout$par * s[i]^(0:k))
}
par(new=T)
plot(s,optt,ylim=ylim,col="red",type="l")
}
polyFit(2) # for example y=a+bx+cx^2 approximation
####
# Optional: other degrees
for(k in c(1,3,4,5)) polyFit(k)
```

As the proverb says, "even a poor marksman will hit the target with enough shots": repeat something many times and an unusual event will eventually occur. The same holds for testing: a statistic unusual enough to give a small p-value is no longer unusual when measured repeatedly. This is why the interpretation of the p-value must change when many tests are performed, a problem called multiple testing. In this chapter, we describe how to change the interpretation of the p-value (correction methods).

### 17.1.2 Expected p-value in multiple test-setting

Expected value of the minimum p-value: \(E(\min(p)|k)=\frac{1}{k+1}\)

Expected value of the i-th minimum p value.

\[ \begin{equation*} E(i_{th} p | k)=\frac{i}{k+1} \end{equation*} \]

\footnote{ At least one of the k p-values is below a, i.e. the minimum is: \[Pr(\min p<a|k)=1-(1-a)^k\]

The minimum p is between a and \(a+\delta a\) with probability

\[Pr(\min p < a+\delta a|k)-Pr(\min p< a|k)\]

As \(\delta a \to 0\), the density is \[Pr(\min p=a|k)=\frac{d\, Pr(\min p<a|k)}{da}=k(1-a)^{k-1}\]

Expected value is: \[ \int_{0}^{1} a\, Pr(\min p=a|k)\, da = \int_{0}^{1} ka (1-a)^{k-1} da = k \left( \int_{0}^{1} (1-a)^{k-1} da -\int_{0}^{1} (1-a)^{k} da \right) = k\left(\frac{1}{k}-\frac{1}{k+1}\right)=\frac{1}{k+1}\]

The i-th smallest out of k p-values has density

\[\frac{k!}{(i-1)!(k-i)!} a^{i-1}(1-a)^{k-i}\] (a Beta(i, k-i+1) distribution), whose expected value is

\[\frac{i}{k+1}\]
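These expectations are easy to check by simulation; a small sketch (variable names ours):

```r
# Monte Carlo check: with k uniform p-values, the i-th smallest
# has expected value i/(k+1)
set.seed(12345)
k<-9;Niter<-50000
ps<-matrix(runif(k*Niter),nrow=Niter)
sorted<-t(apply(ps,1,sort))
colMeans(sorted) # approximately 1/10, 2/10, ..., 9/10
mean(sorted[,1]) # approximately 1/(k+1)=0.1
```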

Sidak’s method \[ \begin{equation*} 1-(1-a)^k=p_c \end{equation*} \] \[ \begin{equation*} a=1-(1-p_c)^{\frac{1}{k}} \end{equation*} \]

Bonferroni’s method \[ \begin{equation*} p_c=a\times k \end{equation*} \]
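A quick numeric comparison of the two corrections (the example values k=100 and p_c=0.05 are ours, for illustration); for small p_c the two thresholds nearly coincide:

```r
# Per-test alpha that keeps the familywise error at p_c over k tests
pc<-0.05;k<-100
sidak<-1-(1-pc)^(1/k)
bonferroni<-pc/k
c(sidak,bonferroni) # both close to 0.0005; Sidak is slightly larger
```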

Linkage disequilibrium

Structured population

Bonferroni and Sidak are too conservative

Permutation method to correct p might be good, but…

With a Monte Carlo permutation method with n iterations, the corrected p-value cannot be smaller than 1/n.

```
Ns<-100;Nm<-10;Niter<-500
Fx<-function(d){
t.test(d~P[shuffle])$p.value
}
P<-rbinom(Ns,1,0.5) # binomial phenotype
X<-matrix(rnorm(Ns*Nm),nrow=Ns)
meanPs<-matrix(0,Ns,Nm);Pminhistories<-matrix(0,Ns,Niter)
X<-matrix(rnorm(Ns*Nm),nrow=Ns)
r<-0.3
for(i in 2:Nm){
R<-sample(c(0,1),Ns,replace=TRUE,prob=c(r,1-r))
X[,i]<-X[,i-1]*(1-R)+X[,i]*R
}
shuffle<-1:Ns
obsPs<-apply(X,2,Fx)
obsMinP<-min(obsPs)
Plist<-matrix(0,Niter,Nm)
Phistory<-rep(0,Niter)
counter<-0
for(i in 1:Niter){
shuffle<-sample(1:Ns)
Plist[i,]<-sort(apply(X,2,Fx))
if(min(Plist[i,])<=obsMinP){
counter<-counter+1
}
Phistory[i]<-counter/i
}
plot(Phistory, type = "b")
```

When tests are mutually dependent, p-values tend to be similar.

That means p-values get clustered…

Structured population

```
set.seed(995599);Niter<-1000
#library(Rassoc)
st<-rep(0,Niter);p<-rep(0,Niter)
for(i in 1:Niter){
af<-runif(1)*0.6+0.2
delta<-rnorm(1)
af1<-af+af*0.05*delta; af2<-af-af*0.05*delta
case<-sample(c(0,1,2),1000,c(af1^2,2*af1*(1-af1),(1-af1)^2),replace=TRUE)
cont<-sample(c(0,1,2),1000,c(af2^2,2*af2*(1-af2),(1-af2)^2),replace=TRUE)
t<-matrix(c(length(case[case==0]),length(case[case==1]),length(case[case==2]),
length(cont[cont==0]),length(cont[cont==1]),length(cont[cont==2])),nrow=2,byrow=TRUE)
cattout<-CATT(t)
st[i]<-(cattout$statistic)^2; p[i]<-cattout$p
}
plot(ppoints(length(p),a=0),sort(p))
ylim=c(0,1)
plot(ppoints(length(p),a=0),sort(p),ylim=ylim,type="l")
for(i in 2:20){
par(new=T)
plot(ppoints(length(p),a=0),sort(pchisq(st/i,1,lower.tail=FALSE)),ylim=ylim,col="red",type="l")
}
lambda<-quantile(st,0.5)/qchisq(0.5,1)
stc<-st/lambda
pc<-pchisq(stc,1,lower.tail=FALSE)
par(new=T)
plot(ppoints(length(pc),a=0),sort(pc),type="l")
```

Only using the observed p-values

Correction with one parameter, \(\lambda\)

Non-central \(\chi^2\) distribution.

```
#library(Rassoc)
stB<-rep(0,Niter);pB<-rep(0,Niter)
af<-runif(1)*0.6+0.2 ;delta<-0.05;af1<-af+af*delta;af2<-af-af*delta
exptable<-matrix(c(af1^2,2*af1*(1-af1),(1-af1)^2,af2^2,2*af2*(1-af2),(1-af2)^2),nrow=2,byrow=TRUE)*1000
cattout<-CATT(exptable)
expSt<-(cattout$statistic)^2
for(i in 1:Niter){
case<-sample(c(0,1,2),1000,c(af1^2,2*af1*(1-af1),(1-af1)^2),replace=TRUE)
cont<-sample(c(0,1,2),1000,c(af2^2,2*af2*(1-af2),(1-af2)^2),replace=TRUE)
t<-matrix(c(length(case[case==0]),length(case[case==1]),length(case[case==2]),
length(cont[cont==0]),length(cont[cont==1]),length(cont[cont==2])),nrow=2,byrow=TRUE)
cattout<-CATT(t); stB[i]<-(cattout$statistic)^2; pB[i]<-cattout$p
}
lambdaB<-quantile(stB,0.5)/qchisq(0.5,1) ;stcB<-stB/lambdaB;pcB<-pchisq(stcB,1,lower.tail=FALSE)
plot(ppoints(length(pB),a=0),sort(pB),type="l",ylim=ylim)
par(new=T)
plot(ppoints(length(pcB),a=0),sort(pcB),type="l",col="red",ylim=ylim)
meanstB<-mean(stB);varstB<-var(stB)
ncp<-meanstB-1
ncpvalue<-qchisq(ppoints(length(pcB),a=0),1,ncp)
ylim<-c(0,max(stB,ncpvalue))
plot(ppoints(length(stB),a=0),sort(stB),ylim=ylim,col=gray(7/8))
par(new=T)
plot(ppoints(length(stB),a=0),ncpvalue,type="l",ylim=ylim)
```

Type 1 error

Type 2 error

Power

```
x<-seq(from=0,to=30,by=0.01)
df<-1;lambda<-ncp
chi<-dchisq(x,df,0)
ncChi<-dchisq(x,df,lambda)
ylim=c(0,0.2)
plot(x,chi,ylim=ylim,type="l")
par(new=T)
plot(x,ncChi,ylim=ylim,type="l",col="red")
abline(v=qchisq(0.05,1,lower.tail=FALSE))
pchisq(qchisq(0.05,1,lower.tail=FALSE),1,ncp,lower.tail=FALSE)
```

The components of the hypothesis

Sample size

Rejection level

Power

```
help(power.prop.test)
help(power.t.test)
help(power.anova.test)
power.prop.test(p1 = 0.5, p2 = 0.4, sig.level = 0.01, power = 0.9)
```

The output of R17-5.R is as follows: the parameter that was left unspecified, the sample size n, is calculated and shown along with the values of the specified parameters.

Observing many markers, the difference between the case and control populations becomes apparent.

Using the difference seen, correct the test results.

```
Nm<-1000 ;Npop<-4;
Ns<-c(100,200,200,200)
M<-NULL
for(j in 1:Npop){
tmpM<-matrix(rep(0,Nm*Ns[j]),nrow=Nm)
for(i in 1:Nm){
af<-runif(1)*0.8+0.1; f<-rnorm(1,sd=0.01); if(abs(f)>1) f=0
df<-c(af^2,2*af*(1-af),(1-af)^2)
df[1]<-df[1]+f/2*df[2]; df[3]<-df[3]+f/2*df[2]; df[2]<-1-df[1]-df[3]
tmpM[i,]<-sample(c(0,1,2),Ns[j],replace=TRUE,prob=df)
}
M<-cbind(M,tmpM)
}
```


Then, from this data, extract the major axes by principal component analysis, step by step (R17-7.R).

```
##PCA##
mu<-apply(M,1,mean)
M<-M-mu
M<-M/sqrt(mu/2*(1-mu/2))
X<-1/Nm*t(M)%*%M
eiout<-eigen(X)
plot(eiout$values)
```

```
eivect<-as.data.frame(eiout$vectors)
eilist<-1:(Npop+1)
plot(eivect[,eilist])
```

Then, sample cases and controls at different rates from the subpopulations of the structured population (R17-9.R).

`phenotype<-c(sample(c(0,1),sum(Ns)/2,replace=TRUE,prob=c(0.45,0.55)),sample(c(0,1),sum(Ns)/2,replace=TRUE,prob=c(0.55,0.45)))`

```
Chisq<-rep(0,Nm);CorrChisq<-rep(0,Nm);Ps<-rep(0,Nm);CorrPs<-rep(0,Nm)
L<-3
Emat<-eiout$vectors[,1:L]
Esqs<-apply(Emat*Emat,2,sum)
phenotype<-phenotype-mean(phenotype)
Gamma<-apply(Emat*phenotype,2,sum)/Esqs
corrphenotype<-phenotype-Emat%*%Gamma
for(i in 1:Nm){
genotype<-M[i,]
Gamma<-apply(Emat*genotype,2,sum)/Esqs
corrgenotype<-genotype-Emat%*%Gamma
Chisq[i]<-(sum(Ns))*cor(genotype,phenotype)^2
CorrChisq[i]<-(sum(Ns)-(L+1))*cor(corrgenotype,corrphenotype)^2
}
Ps<-pchisq(Chisq,1,lower.tail=FALSE)
CorrPs<-pchisq(CorrChisq,1,lower.tail=FALSE)
ylim<-c(0,1)
plot(ppoints(length(Ps),a=0),sort(Ps),ylim=ylim)
par(new=T)
plot(ppoints(length(Ps),a=0),sort(CorrPs),ylim=ylim,col=gray(6/8))
plot(Ps, CorrPs)
chivalue<- qchisq(Ps,1,lower.tail=FALSE)
lambda<-quantile(chivalue,0.5)/qchisq(0.5,1)
chivalueGC<-chivalue/lambda
pGC<-pchisq(chivalueGC,1,lower.tail=FALSE)
plot(Ps,pGC)
```

- False Discovery Rate (FDR)

```
N<-1000
peven<-p1<-ppoints(N,a=0)
p1<-peven^5
plot(peven, peven,xlim=c(0,1),ylim=c(0,1),type="l")
par(new=T)
plot(peven,p1,xlim=c(0,1),ylim=c(0,1),type="l")
abline(v=0.05)
```

```
N<-1000
peven<-ppoints(N,a=0)
p1<-peven^5
pfdr<-p.adjust(p1,"fdr")
b<-0.05
plot(peven,sort(p1,decreasing=TRUE),xlim=c(0,1),ylim=c(0,1),type="l")
par(new=T)
plot(peven,b*((length(p1)-(1:length(p1))+1)/length(p1)),xlim=c(0,1),ylim=c(0,1),type="l",col="red")
par(new=T)
plot(peven,sort(pfdr,decreasing=TRUE),xlim=c(0,1),ylim=c(0,1),type="l",col="blue")
abline(h=b)
abline(v=peven[length(which(pfdr>b))])
```

- Bayes factor

Consider two tests of independence, each using a 2x2 contingency table. Both tests examine the same question (the relation between factors A/a and B/b), but each collects its sample completely independently. When one question is examined with two sample sets in this way, one issue is how to integrate the results of the two sets. Suppose the pair of tests is repeated many times, with the samples redrawn each time. Under the null hypothesis, each Pearson chi-squared statistic follows a chi-squared distribution with one degree of freedom, so a scatter plot of the p-values from the two Pearson tests should look like Figure 17.13 (R17-13.R).

```
Nt<-100
df1<-1
df2<-1
chi1<-rchisq(Nt,df1)
chi2<-rchisq(Nt,df2)
plot(pchisq(chi1,df1,lower.tail=FALSE),pchisq(chi2,df2,lower.tail=FALSE))
chisum<-chi1+chi2
plot(ppoints(Nt,a=0),sort(pchisq(chisum,df=df1+df2,lower.tail=FALSE)),type="l")
```

```
t1<-t2<-matrix(c(10,10,10,10),nrow=2,byrow=TRUE)
m11<-m21<-apply(t1,1,sum);m12<-m22<-apply(t1,2,sum);M1<-M2<-sum(t1)
e1<-m11%*%t(m12)/M1;e2<-m21%*%t(m22)/M2
x11<-seq(from=-M1,to=M1,by=1);x12<--x11+m11[1];x21<--x11+m12[1];x22<--x12+m12[2]
xbind<-cbind(x11,x12,x21,x22);okx<-which(apply(xbind,1,min)>0)
x11<-x11[okx];x12<-x12[okx];x21<-x21[okx];x22<-x22[okx]
y11<-seq(from=-M2,to=M2,by=1);y12<--y11+m21[1];y21<--y11+m22[1];y22<--y12+m22[2];ybind<-cbind(y11,y12,y21,y22);oky<-which(apply(ybind,1,min)>0)
y11<-y11[oky];y12<-y12[oky];y21<-y21[oky];y22<-y22[oky]
chi1<-(x11-e1[1,1])^2/e1[1,1]+(x12-e1[1,2])^2/e1[1,2]+(x21-e1[2,1])^2/e1[2,1]+(x22-e1[2,2])^2/e1[2,2]
chi2<-(y11-e2[1,1])^2/e2[1,1]+(y12-e2[1,2])^2/e2[1,2]+(y21-e2[2,1])^2/e2[2,1]+(y22-e2[2,2])^2/e2[2,2]
z<-outer(chi1,chi2,FUN="+")
xlim<-ylim<-c(min(x11-e1[1,1],y11-e2[1,1]),max(x11-e1[1,1],y11-e2[1,1]))
contour(x11-e1[1,1],y11-e2[1,1],z,xlim=xlim,ylim=ylim)
abline(h=0);abline(v=0)
```

```
sum11<-outer(x11,y11,FUN="+");sum12<-outer(x12,y12,FUN="+")
sum21<-outer(x21,y21,FUN="+");sum22<-outer(x22,y22,FUN="+")
sume11<-e1[1,1]+e2[1,1];sume12<-e1[1,2]+e2[1,2];sume21<-e1[2,1]+e2[2,1];sume22<-e1[2,2]+e2[2,2]
sumz<-(sum11-sume11)^2/sume11+(sum12-sume12)^2/sume12+(sum21-sume21)^2/sume21+(sum22-sume22)^2/sume22
xlim<-ylim<-c(min(x11-e1[1,1],y11-e2[1,1]),max(x11-e1[1,1],y11-e2[1,1]))
contour(x11-e1[1,1],y11-e2[1,1],sumz,xlim=xlim,ylim=ylim)
par(new=T)
contour(x11-e1[1,1],y11-e2[1,1],z,xlim=xlim,ylim=ylim,col=gray(6/8))
abline(a=0,b=1)
```

```
library(rmeta)
n.case<-c(m11[1],m21[1])
n.ctrl<-c(m11[2],m21[2])
zMH<-z
zDSL<-z
for(i in 1:length(z[,1])){
for(j in 1:length(z[1,])){
mhout<-meta.MH(n.case,n.ctrl,c(x11[i],y11[j]),c(m12[1]-x11[i],m22[1]-y11[j]))
zMH[i,j]<-mhout$MHtest[1]
dslout<-meta.DSL(n.case,n.ctrl,c(x11[i],y11[j]),c(m12[1]-x11[i],m22[1]-y11[j]))
zDSL[i,j]<-dslout$test[1]^2
}
}
zlim<-c(0,max(sumz))
contour(x11-e1[1,1],y11-e2[1,1],sumz,xlim=xlim,ylim=ylim,zlim=zlim,nlevels=10,col=gray(6/8))
par(new=T)
contour(x11-e1[1,1],y11-e2[1,1],zMH,xlim=xlim,ylim=ylim,zlim=zlim,nlevels=10)
###
contour(x11-e1[1,1],y11-e2[1,1],z,xlim=xlim,ylim=ylim,zlim=zlim,nlevels=10,col=gray(6/8))
par(new=T)
contour(x11-e1[1,1],y11-e2[1,1],zDSL,xlim=xlim,ylim=ylim,zlim=zlim,nlevels=10)
#######
library(rmeta)
t1<-matrix(c(10,20,30,40),nrow=2,byrow=TRUE)
t2<-matrix(c(200,300,50,40),nrow=2,byrow=TRUE)
m11<-apply(t1,1,sum)
m12<-apply(t1,2,sum)
M1<-sum(t1)
m21<-apply(t2,1,sum)
m22<-apply(t2,2,sum)
M2<-sum(t2)
e1<-m11%*%t(m12)/M1
e2<-m21%*%t(m22)/M2
x11<-seq(from=-M1,to=M1,by=1)
y11<-seq(from=-M2,to=M2,by=1)
x12<--x11+m11[1]
x21<--x11+m12[1]
x22<--x12+m12[2]
xbind<-cbind(x11,x12,x21,x22)
okx<-which(apply(xbind,1,min)>0)
x11<-x11[okx]
x12<-x12[okx]
x21<-x21[okx]
x22<-x22[okx]
y12<--y11+m21[1]
y21<--y11+m22[1]
y22<--y12+m22[2]
ybind<-cbind(y11,y12,y21,y22)
oky<-which(apply(ybind,1,min)>0)
y11<-y11[oky]
y12<-y12[oky]
y21<-y21[oky]
y22<-y22[oky]
sum11<-outer(x11,y11,FUN="+");sum12<-outer(x12,y12,FUN="+")
sum21<-outer(x21,y21,FUN="+");sum22<-outer(x22,y22,FUN="+")
sume11<-e1[1,1]+e2[1,1];sume12<-e1[1,2]+e2[1,2];sume21<-e1[2,1]+e2[2,1];sume22<-e1[2,2]+e2[2,2]
sumz<-(sum11-sume11)^2/sume11+(sum12-sume12)^2/sume12+(sum21-sume21)^2/sume21+(sum22-sume22)^2/sume22
xlim<-ylim<-c(min(x11-e1[1,1],y11-e2[1,1]),max(x11-e1[1,1],y11-e2[1,1]))
chi1<-(x11-e1[1,1])^2/e1[1,1]+(x12-e1[1,2])^2/e1[1,2]+(x21-e1[2,1])^2/e1[2,1]+(x22-e1[2,2])^2/e1[2,2]
chi2<-(y11-e2[1,1])^2/e2[1,1]+(y12-e2[1,2])^2/e2[1,2]+(y21-e2[2,1])^2/e2[2,1]+(y22-e2[2,2])^2/e2[2,2]
z<-outer(chi1,chi2,FUN="+")
n.case<-c(m11[1],m21[1])
n.ctrl<-c(m11[2],m21[2])
zMH<-zDSL<-z
for(i in 1:length(z[,1])){
for(j in 1:length(z[1,])){
mhout<-meta.MH(n.case,n.ctrl,c(x11[i],y11[j]),c(m12[1]-x11[i],m22[1]-y11[j]))
zMH[i,j]<-mhout$MHtest[1]
dslout<-meta.DSL(n.case,n.ctrl,c(x11[i],y11[j]),c(m12[1]-x11[i],m22[1]-y11[j]))
zDSL[i,j]<-dslout$test[1]^2
}
}
zlim<-c(0,max(zMH,zDSL))
contour(x11-e1[1,1],y11-e2[1,1],sumz,xlim=xlim,ylim=ylim,zlim=zlim,nlevels=10,col=gray(6/8))
par(new=T)
contour(x11-e1[1,1],y11-e2[1,1],zMH,xlim=xlim,ylim=ylim,zlim=zlim,nlevels=10)
contour(x11-e1[1,1],y11-e2[1,1],z,xlim=xlim,ylim=ylim,zlim=zlim,nlevels=10,col=gray(6/8))
par(new=T)
contour(x11-e1[1,1],y11-e2[1,1],zDSL,xlim=xlim,ylim=ylim,zlim=zlim,nlevels=10)
```

```
a<-0.00001
k<-1:100000
FWER<-1-(1-a)^k
Bonferroni<-a*k
ylim<-c(0,1)
plot(k,FWER,type="l",ylim=ylim,col=gray(6/8))
par(new=T)
plot(k,Bonferroni,type="l",ylim=ylim)
FWER[1]
FWER[100000]
ylim<-c(log(min(FWER,Bonferroni),10),0)
plot(k,log(FWER,10),type="l",ylim=ylim,col=gray(6/8))
par(new=T)
plot(k,log(Bonferroni,10),type="l",ylim=ylim)
pc=0.01
As<-1-(1-pc)^(1/k)
Bfs<-pc/k
ylim<-c(log(min(As,Bfs),10),0)
plot(k,log(As,10),type="l",ylim=ylim,col=gray(6/8))
par(new=T)
plot(k,log(Bfs,10),type="l",ylim=ylim)
As[1]
As[100000]
```

```
Ns<-100
Nm<-10
Niter<-500
rs<-c(0,0.75,0.9,0.95,1)
Fx<-function(d){
t.test(d~P[shuffle])$p.value
}
P<-rbinom(Ns,1,0.5)
meanPs<-matrix(0,length(rs),Nm)
Pminhistories<-matrix(0,length(rs),Niter)
# For each correlation level r, generate correlated markers, then
# estimate the permutation-based minimum-p distribution
for(xx in 1:length(rs)){
X<-matrix(rnorm(Ns*Nm),nrow=Ns)
r<-rs[xx]
for(i in 2:Nm){
R<-sample(c(0,1),Ns,replace=TRUE,prob=c(r,1-r))
X[,i]<-X[,i-1]*(1-R)+X[,i]*R
}
shuffle<-1:Ns
obsPs<-apply(X,2,Fx)
obsMinP<-min(obsPs)
Plist<-matrix(0,Niter,Nm)
Phistory<-rep(0,Niter)
counter<-0
for(i in 1:Niter){
shuffle<-sample(1:Ns)
Plist[i,]<-sort(apply(X,2,Fx))
if(min(Plist[i,])<=obsMinP){
counter<-counter+1
}
Phistory[i]<-counter/i
}
meanP<-apply(Plist,2,mean)
Pminhistories[xx,]<-Phistory
meanPs[xx,]<-meanP
image(cor(cbind(P,X)))
if(xx==1) plot(Phistory,type="b")
}
xlim<-c(0,1)
ylim<-c(0,1)
plot(ppoints(length(meanPs[1,]),a=0),meanPs[1,],xlim=xlim,ylim=ylim,type="b")
for(i in 2:length(meanPs[,1])){
par(new=T)
plot(ppoints(length(meanPs[1,]),a=0),meanPs[i,],xlim=xlim,ylim=ylim,type="b")
}
```

```
Nt<-10000
Nk<-rpois(1,10)
dfs<-rpois(Nk,10)
chis<-matrix(0,Nt,Nk)
for(i in 1:Nk){
chis[,i]<-rchisq(Nt,dfs[i])
}
dfs
chisum<-apply(chis,1,sum)
dfsum<-sum(dfs)
plot(ppoints(Nt,a=0),sort(pchisq(chisum,df=dfsum,lower.tail=FALSE)),type="l")
```