# Preparation

Install the packages used in this book. Package gregmisc is no longer available; it is replaced by gplots in this document. Package Rassoc is also no longer available; all of its functions were copied from the archived version.

The Rassoc package is no longer maintained. You can get the archived version from http://cran.r-project.org/web/packages/Rassoc/index.html .

The following functions are copy-pasted from the archived version.

# Preface

## Statistical genetics and genetic statistics - genetics, genomics and statistics

This book was written for use as a textbook of statistical genetics, or genetic statistics: a field that interprets biological phenomena mathematically, with heredity as the central axis.

A parent and a child are similar because genes are transmitted from parent to child. Although they are very similar to each other, they can still be distinguished. Individuals in a population are also mutually similar, but there is variation in the population. Genetics aims to understand this variation-in-similarity through the transmission of genes and their functions.

The set of genes of a child consists of half of the mother’s set and half of the father’s set. Conversely, only half of each parent’s gene set is transmitted to the child. Since the choice of this half occurs at random, genetic phenomena are stochastic. Another field concerned with stochastic phenomena is “statistics”. “Genetics” and “statistics” share their basis, and in fact the root of “statistics” was “genetics”.

Currently, genetics is aiming to understand biological phenomena with genomic approach and from molecular genetics aspects using large scale data sets. On the other hand, statistics has broadened its usage to almost all academic fields (Figure 0.1).

Mathematical statistical approaches in genetics and genomics are very wide-ranging. This textbook covers gene-mapping-specific statistical issues and the basics of statistical thinking that are shared by the heterogeneous statistical methods in genetic statistics.

## Configuration of this document

This document handles a wide range of research on genetic phenomena and gene function; it is constructed by focusing on the things that are common when understanding them from the statistical side.

Part I deals with the biological matters related to heredity and genes.

Part II treats matters relating to the basis of data handling. It focuses on categorical data types because of the discreteness of genes. It also explains methods focusing on the populational aspects of samples, and methods focusing on individual relations among samples.

Part III describes descriptive statistics; dimensions and degrees of freedom, which are the basics of estimation and testing; probability; likelihood; distributions; and related indices. Likelihood and linkage analysis are also handled in this part.

Part IV describes how to derive meaning from data.

Part V is on the issues when data sets are large or complicated.

## How to use this book


As described above, this book focuses on what is common when understanding the statistical side of research on genetic phenomena and gene function. Conversely, we have deliberately chosen not to give commentary on the individual techniques used in gene mapping and gene function analysis, nor definitions of the various indices. Of course, one must use the individual approaches and indices properly; but even if you aim at understanding each of them one by one, that knowledge will not carry over to the new methods and indices that will continue to be developed and proposed. Rather, by putting weight on the concepts that are common to the various techniques and indices, we hope you will acquire the ability to apply them. The image is like the study of a language: by mastering the fundamentals, you aim to be able to read a text you meet for the first time. Accordingly, this book is not organized so that you can read one particular chapter if you are interested in mapping, and another if you want to know about clustering. Matters related to a single research theme come up here and there, so it is best to read while cross-referencing them.

In addition, terms and concepts that appear more than once are not necessarily explained carefully at their first appearance; each is described in the place where a careful description is considered most useful. This avoids making the first half of the book heavy with definitions and descriptions. Therefore, even if something remains somewhat vague, proceed with reading as long as the overall meaning comes through; it will become convincing when you read back over it. When a term highlighted in bold is difficult to understand as you read, look it up in the index, where you can see the pages on which it appears. When it appears in several places, the page with the most careful explanation is listed in bold, so read the description on that page. On that page, the term itself is set in a distinct bold font. In terms of a web article, the book is in a state where cross-links are stretched here and there; please use the index with the feeling of jumping through those links.

In addition, we make extensive use of figures, and we also include the source code, in the free statistical software R, used to draw them. We hope that you grasp the contents through sentences in natural language, form images through the figures, confirm the precise expressions through the formulas, and use the R sources to help your understanding. Since the R sources are often simulations, by all means run them and try outputting results under different conditions. Some figures are easier to understand in color; you can browse them at http://www.genome.med.kyoto-u.ac.jp/func-gen-photo/index.php?album=StatGenetTextbook. The R sources used in this book can be downloaded from http://www.genome.med.kyoto-u.ac.jp/StatGenet/lectures/2010/StatGenetTextbook/Rsrc.zip.

## People who helped me in the writing of this book


In writing this book, I became indebted to many people. In particular, I learned a great many things from the members of the private meeting who take on my everyday research consultations: Masato Inoue (Waseda University, Faculty of Science and Engineering, Department of Electrical Engineering and Bioscience), Seiya Imoto (University of Tokyo, Institute of Medical Science, Human Genome Center), Seiko Endo (Daiichi Sankyo Co., Ltd.), Jun Ohashi (University of Tsukuba, Graduate School of Comprehensive Human Sciences), and Natsuhiko Kumasaka (RIKEN Center for Genomic Medicine, statistical analysis research team). I would like to take this opportunity to thank them. Among these members, Jun Ohashi in particular gave me much advice over a wide range of topics while I was writing this book; I am truly grateful. In addition, Chikashi Terao (Kyoto University Hospital, Department of Clinical Immunology and Rheumatology) and Takahisa Kawaguchi (Kyoto University Graduate School of Medicine, Center for Genomic Medicine) checked the manuscript and the R sources, besides our usual joint research activities; Kawaguchi also wrote the R source of a demonstration program used in this book. From a position far from genetic statistics, my wife Junko went through the entire manuscript and gave me opinions on its structure, for which I am grateful. And above all, I thank everyone in the editing department of Ohmsha, who gave me the opportunity to publish a book in the field of statistical genetics, and who were patient even though the content and structure changed greatly as the writing advanced.

We hope that this book, which was completed thanks to these many people, will make many readers a little more interested in statistical genetics and genetic statistics, and will give people of diverse academic backgrounds an opportunity to work in this field.

## Reference book


Here we list books on genetics, statistics, and R that are related to this book. They are neither introductions that must be read beforehand to understand this book, nor technical books that must be read afterwards. Either may be read first, but reading both this book and them should deepen understanding.

# Chapter 1 Heredity — Similar or not-similar

## 1.1 Trait is inherited

### 1.1.1 What is heredity?

What is heredity? It is the sharing of certain features where there is a blood relationship. When a human child is born to human parents, the feature of being human is shared by the parents and the child. If this were all too obvious, humans would not have found it interesting and might not have created the word “heredity” to pinpoint the phenomenon. However, a child shares some features with its parents but not others; this attracted our ancestors’ interest, the idea of “heredity” was born, and that seems to have been the starting point of “genetics”.

### 1.1.2 Characteristics of an organism — Trait and phenotype

A feature of an organism is called a trait. Any feature can be a trait. Depending on which traits you are interested in, you determine the contents of a study; and by taking a new analytical viewpoint, you can also define a new trait. In that sense, it is useful to organize the viewpoints from which traits are considered. One way of classifying traits is by which of the five senses (sight, hearing, taste, smell, touch) assesses them. Another classification divides them into anatomical/structural, physiological/functional, molecular-biological, and pharmacological. You can also classify by whether a feature belongs to the individual itself or to the way the individual relates to others and to the environment (such as behavior or immune responses). Other useful viewpoints include whether the trait is a mathematical concept, whether it can be physically measured, and how its appearance changes. Data are obtained by observing these traits, and the observed data of a trait are the phenotype.

### 1.1.3 Identity and diversity

Genes, and their descent from parents to offspring, have two aspects.

• Offspring are made almost identical to their parents

• Offspring are made different from their parents

Heredity is a mixture of identity and diversity.

Genetics studies hereditary phenomena (a mixture of identity and diversity) as realizations of stochastic events in genes.

Fitness is an important idea.

(Figure 1.1 )

~ This section is auto-translation results ~

Organisms are characterized by leaving things like themselves to the next generation. In asexual reproduction, a child directly takes over its parent’s genes; in sexual reproduction, a child does not become at all the same individual, but takes over half of the genes of each of its two parents. In terms of fitness, since the parents have lived successfully, a strategy in which the child is the same as the parents might be expected to work well. On the other hand, focus not on the individual but on the species to which it belongs. If all individuals of a species were identical, then when the environment changes, all individuals might find it hard to survive at the same time; that is not a good idea. If there are diverse individuals, the possibility of extinction of the species when an environmental change happens is reduced, so being composed of diverse individuals is also a good idea. Genetic phenomena thus have a dual nature, securing both identity and diversity.

## 1.2 Gene

### 1.2.1 What is a gene?

What is a gene?

It is quite difficult to answer.

What is a genome?

That is a bit easier than “What is a gene?”.

### 1.2.2 Chromosomes

Basic knowledge of chromosomes is necessary.

## Use R

Try to draw a bar graph that represents the sizes of the human chromosomes (R1-1.R).

```r
# Lengths of chromosomes (bp)
# A vector of chromosome lengths
chromLen<-c(247249719,242951149,199501827,191273063,180857866,
170899992,158821424,146274826,140273252,135374737,134452384,
132349534,114142980,106368585,100338915,88827254,78774742,
76117153,63811651,62435964,46944323,49691432,154913754,57772954)
# barplot: labels of chromosomes on the x axis, their lengths on the y axis;
# the col argument specifies the color of the bars
barplot(chromLen,names=c(1:22,"X","Y"),col="black")
```

### 1.2.3 Locus, allele, haplotype, diplotype, phenotype

Define the terms below.

locus (loci), base, SNP, allele, haplotype, exon, intron, genotype, diplotype, phenotype

### 1.2.4 Diploid, homozygous, heterozygous, genotype, phenotype, mode of inheritance

Define the terms below.

diploid, homozygote, heterozygote, 2×3 table of a binary phenotype and the three genotypes of a biallelic locus

Autosomal chromosomes exist in pairs, one from each parent. This means that an individual has two sets of duplicated genetic information. Organisms with two such sets are called diploid. When an individual has two identical alleles at a locus, it is said to be homozygous at that locus; when it has two different alleles, it is said to be heterozygous. From here on, we handle diploid organisms only. Humans are also diploid. A diploid organism has two alleles at each autosomal locus. The combination of the two alleles an individual carries is its genotype. On the other hand, the appearance of an individual’s trait is its phenotype. The most clear-cut relationships between genotype and phenotype appear as the dominant and recessive modes of inheritance. Assume two alleles M and m, and a binary phenotype “no”/“yes” for a trait. Individuals have one of three genotypes: MM, Mm, and mm.

Assume the allele m brings the “yes” phenotype. Table: Relation between genotypes and phenotypes (2×3 table)

$$\begin{array}{|c|c|c|c|c|}
\hline
 & MM & Mm & mm & \text{sum} \\
\hline
\text{yes} & n_{10} & n_{11} & n_{12} & n_{1.} \\
\hline
\text{no} & n_{20} & n_{21} & n_{22} & n_{2.} \\
\hline
\text{sum} & n_{.0} & n_{.1} & n_{.2} & n_{..} \\
\hline
\end{array}$$

$$\sum_{i=1}^2 n_{ij}=n_{.j};\quad \sum_{j=0}^2 n_{ij}=n_{i.};\quad \sum_{i=1}^2 \sum_{j=0}^2 n_{ij}=n_{..}$$

The rates of the phenotype being “yes”, $$f_j=\frac{n_{1j}}{n_{.j}}, j=0,1,2$$, differ among the genotypes. Take MM as the reference genotype. Then each genotype’s “yes” fraction can be expressed as a fold change from MM’s fraction, $$\lambda_j = f_j / f_0$$; this is the genotype relative risk (GRR). By definition $$\lambda_0=1$$. When the GRR of the heterozygote (Mm) equals that of the homozygote mm ($$\lambda_1=\lambda_2$$), the impact of m on this phenotype is said to follow the dominant mode of inheritance. On the contrary, when $$\lambda_1=\lambda_0$$, we call the form recessive inheritance.
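As a worked sketch of the definitions above (the counts are invented for illustration, not from the book), the genotype-wise “yes” fractions $f_j$ and GRRs $\lambda_j$ can be computed in R:

```r
# hypothetical counts for the 2x3 table (rows: yes/no; columns: MM, Mm, mm)
tab <- matrix(c(10, 90,   # MM: yes, no
                30, 70,   # Mm
                60, 40),  # mm
              nrow = 2)
rownames(tab) <- c("yes", "no")
colnames(tab) <- c("MM", "Mm", "mm")
f <- tab["yes", ] / colSums(tab)  # f_j = n_1j / n_.j
grr <- f / f["MM"]                # lambda_j, so that lambda_0 = 1 for MM
grr  # 1, 3, 6 with these counts
```

With these made-up counts the GRR of Mm is 3 and that of mm is 6, so the effect is neither dominant nor recessive.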

There are intermediate forms that fit neither the dominant nor the recessive mode. There are two definitions. One is based on the formula $\lambda_1=x \lambda_2 + (1-x) \lambda_0$, in which x = 0.5 is the additive model, x = 0 is recessive, and x = 1 is dominant.

The other expression is $\lambda_1=\lambda_2^y \times \lambda_0^{1-y}$, where y = 0 is recessive, y = 0.5 is multiplicative, and y = 1 is dominant.

When x = 0.5, $\lambda_1=\frac{\lambda_0+\lambda_2}{2}$ means arithmetic mean.

When y=0.5, $\lambda_1=\sqrt{\lambda_0 \times \lambda_2}$ means geometric mean.

In addition, when the effect of having one copy of m does not lie between the effect of zero copies and the effect of two copies, it is called overdominance (super-dominance).

In the formulas for x and y, the intermediate effects are expressed by $0 < x, y < 1$, and all other effects, including overdominance, by $x, y < 0$ or $x, y > 1$.

$-\infty < x, y < \infty$ covers all inheritance models.
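The two parametrizations above can be compared with a small sketch (the GRR values $\lambda_0 = 1$ and $\lambda_2 = 4$ are arbitrary choices for illustration):

```r
lambda0 <- 1; lambda2 <- 4  # arbitrary GRRs for illustration
# lambda1 on the additive scale: x = 0 recessive, x = 0.5 additive, x = 1 dominant
lambda1.add <- function(x) x * lambda2 + (1 - x) * lambda0
# lambda1 on the multiplicative scale: y = 0 recessive, y = 0.5 multiplicative, y = 1 dominant
lambda1.mult <- function(y) lambda2^y * lambda0^(1 - y)
lambda1.add(0.5)   # arithmetic mean: (1 + 4) / 2 = 2.5
lambda1.mult(0.5)  # geometric mean: sqrt(1 * 4) = 2
```

Both formulas return $\lambda_0$ at the recessive end and $\lambda_2$ at the dominant end; they differ only in how $\lambda_1$ interpolates between them.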

# Chapter 2 DNA, RNA, protein, trait

## 2.1 DNA double strand

(Figure 2.1).

### 2.1.1 Replication, mutation, recombination

Mutations

Somatic mutations Germline mutations

Crossovers and recombination and recombinant

Figure 2.2

### 2.1.2 Origin is the same—IBD

Identity by descent (IBD) vs Identity by state (IBS)

IBD = {0,1,2}

Parent-offspring: $Pr(IBD=\{0,1,2\}) = (0,1,0)$. Siblings: $Pr(IBD=\{0,1,2\}) = (0.25,0.5,0.25)$, where $0.25 = 0.5 \times 0.5$.
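The sibling probabilities can be checked by a quick simulation (this sketch is not one of the book's R sources): label each parent's two chromosomes 1 and 2, draw one label per parent for each sib, and count the matches.

```r
set.seed(1)
Niter <- 100000
# label each parent's two chromosomes 1 and 2;
# each sib independently inherits one label from each parent
sib1.pat <- sample(1:2, Niter, replace = TRUE)
sib2.pat <- sample(1:2, Niter, replace = TRUE)
sib1.mat <- sample(1:2, Niter, replace = TRUE)
sib2.mat <- sample(1:2, Niter, replace = TRUE)
ibd <- (sib1.pat == sib2.pat) + (sib1.mat == sib2.mat)  # IBD count: 0, 1 or 2
table(ibd) / Niter  # close to (0.25, 0.5, 0.25)
```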

Table 2.1

Table 2.1: Relation between blood relationships and the probabilities of the IBD counts

$$\begin{array}{|l|ccc|c|cc|}
\hline
 & \Pr(\text{IBD}=2) & \Pr(\text{IBD}=1) & \Pr(\text{IBD}=0) & E[\text{IBD}] & \text{concordance mean} & \text{concordance variance} \\
\hline
\text{self / identical twins} & 1 & 0 & 0 & 2 & 1 & 0 \\
\text{parent-offspring} & 0 & 1 & 0 & 1 & \frac{1}{2} & 0 \\
\text{siblings} & \frac{1}{4} & \frac{1}{2} & \frac{1}{4} & 1 & \frac{1}{2} & \frac{1}{8} \\
\text{grandparent-grandchild} & 0 & \frac{1}{4} & \frac{3}{4} & \frac{1}{4} & \frac{1}{8} & \frac{3}{64} \\
\hline
\end{array}$$

### 2.1.3 It is easy to handle a single representative number — Expected value of IBD

The probabilities that the number of IBD alleles is 0, 1, or 2 are useful information on genetic closeness. However, it is inconvenient to handle three probability values, a probability vector with three elements. When the closeness of kinship is your interest, a single value representing it may be more convenient; the expected value of the number of IBD alleles is a candidate for such a value.

When the observable values are $$X=\{x_1,x_2,...\}$$ and their probability is $$P=\{p_1,p_2,...\} (\sum_{i} p_i =1)$$, the expected value is

$E(X)=\sum_{i} x_i p_i .$

In the case of identical twins, $$E(X) = 2$$. The relation between an individual and itself also gives $$E(X) = 2$$.

It might be reasonable to set the strength of the identity relation, the strongest blood relationship, to 1; then $$\frac{E(X)}{2}$$ might be a good index, which could be called the “concordance rate of alleles”. The concordance rates of various relations are shown in Table 2.1.

The concordance rate between parent and child and the rate between sibs are the same, but the IBD probability vectors are different ({0,1,0} vs. {0.25,0.5,0.25}). This means the concordance rate is convenient because of its simplicity, but that simplicity is achieved by losing the detailed information in the probability vectors.

1. Mean, variance, and moments

The expected value $$E(X)$$ of $$X=\{x_1,x_2,...\}$$ with $$P=\{p_1,p_2,...\} (\sum_{i} p_i =1)$$ is $E(X)=\sum_{i} x_i p_i$

This expected value can be called the “mean” of X.

Variance $$V(X)$$ is an index representing the variation in values: $V(X)=\sum_{i} p_i (x_i-E(X))^2 .$

Both mean and variance are indices to describe X and its distribution.

They can be named more generally: the mean is the 1st moment, and the variance is the 2nd moment around the mean. X and its distribution can be summarized with the k-th moments around zero or around the mean.

It is expressed as, $\mu_k(c) = \sum_{i} p_i (x_i-c)^k$ with c as a center.

Therefore the moments around zero are $\mu_k(0) = \sum_{i} p_i x_i^k$

and the moments around the mean are

$\mu_k(E(X))=\sum_{i} p_i(x_i-E(X))^k$

Let’s write a function to calculate moments, as an exercise in R.

```r
momentProb<-function (x, p, order = 1, center = FALSE)
{
	if(center)
		x <- x - momentProb(x,p,order=1,center=FALSE)
	sum(x^order*p)
}

datax<-c(1,2,3)
datap<-c(0.2,0.3,0.5)
momentProb(x=datax,p=datap,order=2,center=TRUE)
# same as momentProb() given the uniform p = rep(1,length(x))/length(x)
momentX<-function (x, order = 1, center = FALSE)
{
	momentProb(x,p=rep(1,length(x))/length(x),order=order,center=center)
}
```

Let’s apply the functions to the IBD vectors.

```r
x<-c(1,0.5,0) # concordance values for IBD = 2, 1, 0
pDoho<-c(1/4,1/2,1/4) # siblings' probabilities by IBD count
momentProb(x,pDoho,order=1,center=FALSE) # expected value (mean)
momentProb(x,pDoho,order=2,center=TRUE) # variance
```

The results are shown in Table 2.1. Using the expected value (mean) and variance, you can distinguish parent-child from siblings.

### 2.1.4 Concordance rate of allele in sibs

1. 23 pairs of chromosomes

There are 23 pairs of chromosomes in humans. Assume no recombination. One chromosome of a pair is from the mother and the other is from the father. Now we want to calculate how many pairs are IBD=2, how many are IBD=1, and how many are IBD=0.
When you compare identical twins, all 23 pairs are IBD=2. How about the case of siblings?
When you compare identical twins, all 23 paris are IBD=2. How about the case of siblings?

In total 46 = 2 x 23 chromosomes descend from parents to a offspring.

Each one of 46 can be shared by a sibling pair or not. Let 1 and 0 denote shared and not-shared, respectively.

For a chromosome pair, the probability of “11” is 0.25, that of “01” or “10” is 0.5, and that of “00” is 0.25.

There are 4 repeated permutations of length 2 over {0, 1}: 00, 01, 10, 11.

Each probability is $\frac{1}{2^{2}}$

For k chromosome pairs, there are $$2^{2k}$$ possible sequences over {0,1}, each with probability $\frac{1}{2^{2k}}$

The probability that all 46 chromosomes are IBD between a sib pair is the probability of “11…1”, which is $$\frac{1}{2^{2k}}$$ (with $$k=23$$, $$\frac{1}{2^{46}}$$).

The probability that i out of $$2k$$ chromosomes are IBD is the product of $\binom{2k}{i}=\frac{(2k)!}{i! (2k-i)!}$ and

$\frac{1}{2^{2k}} .$

Therefore it is $\frac{(2k)!}{i! (2k-i)!}\frac{1}{2^{2k}}.$ Figure 2.3 shows the probabilities of the numbers of IBD chromosomes.

When the number of pairs is bigger, the number of shared chromosomes is less variable. In other words, if the number of chromosomes is small, some sib pairs share many chromosomes and some share only a few; when the number of chromosomes is large, the variation stays in a narrow range.

```r
sibIdValue<-function(k=1){
	(0:(2*k))/(2*k) # i/(2k)
}

sibIdProb<-function(k=1){
	dbinom(0:(2*k),2*k,0.5)
}
numch<-1:23
means<-vars<-rep(0,length(numch))
for(i in 1:length(numch)){
	identity<-sibIdValue(numch[i])
	prob<-sibIdProb(numch[i])
	if(i>1){par(new=TRUE)}
	plot(identity,prob,type="b",ylim=c(0,1))
	means[i]<-momentProb(identity,prob,order=1,center=FALSE) # mean
	vars[i]<-momentProb(identity,prob,order=2,center=TRUE) # variance
}
plot(means)
plot(vars)
```
2. The lengths of the chromosomes vary

The lengths of the chromosomes were plotted earlier (Figure 1.2). The shared fraction when i chromosomes are shared differs between the case where those chromosomes are all long and the case where they are all short.

The following simulation takes the chromosome lengths into account under two scenarios: in one, all chromosomes have the same length; in the other, the lengths vary as in the human chromosomes.

The average concordance rate is 0.5 in both scenarios, and their variances are also similar (R2-5.R).

```r
SibSim<-function(f,Niter=10000){
	# one row per simulated sib pair: 1 if a chromosome is shared, 0 if not
	rs<-matrix(rbinom(Niter*length(f),1,0.5),nrow=Niter)
	rs%*%f # length-weighted shared fraction
}

chs<-rep(chromLen[1:23],2)
f<-chs/sum(chs)
Niter<-10000
simOut<-SibSim(f,Niter=Niter)

cheven<-rep(1,length(f))
feven<-cheven/sum(cheven)
simOutEven<-SibSim(feven,Niter=Niter)

plot(ppoints(Niter,a=0),sort(simOut),type="l",ylim=c(0,1))
plot(ppoints(Niter,a=0),sort(simOutEven),type="l",ylim=c(0,1))
```

Mean and variance convey some information on the distribution, and they are similar between the two scenarios. Figure 2.4 plots the sorted concordance rates. The lines look different, smooth vs. step-like: in the simulation faithful to the chromosome lengths (a), the curve is fairly smooth, whereas in the equal-length case (b) it forms a staircase pattern. With equal lengths, the distribution is discrete.

Visualization can reveal information that summary index values miss.

Three points from the example above:

• Without plotting data, some information might be missed

• Be careful when handling data with special conditions (“equal-length” was the condition in the example above).

• Discreteness affects the distribution

3. The 23 pairs should be divided into pieces - crossover and recombination: the Poisson process and the exponential distribution

Crossovers make recombinants, dividing chromosomes into segments. Segments descend from a parent to an offspring.

The number of crossovers per chromosome per generation is not large.

The number of crossovers per unit chromosomal length can be assumed to follow a Poisson distribution, a distribution for rare stochastic events.

See Figure 2.5. Crossover counts are generated in two different ways: one uses per-base Bernoulli trials rather than Poisson random numbers, and the other uses a Poisson random number generator.

```r
# crossover counts by per-base Bernoulli trials
RecombSim<-function(L=10000,r=0.001,Niter=1000){
	m<-matrix(rbinom((L-1)*Niter,1,r),nrow=Niter)
	apply(m,1,sum)
}
# crossover counts drawn directly from the Poisson distribution
RecombPois<-function(L=10000,r=0.001,Niter=1000){
	rpois(Niter,(L-1)*r)
}
L<-10000;r<-0.0001;Niter<-1000
NumSim<-RecombSim(L=L,r=r,Niter=Niter)
NumPois<-RecombPois(L=L,r=r,Niter=Niter)
ylim<-c(0,max(NumSim,NumPois))
plot(ppoints(Niter,a=0),sort(NumSim),ylim=ylim,col=gray(6/8))
par(new=T)
plot(ppoints(Niter,a=0),sort(NumPois),type="l",ylim=ylim)
```

Crossovers make mosaics of chromosomes as shown in Figure 2.6.

The shared (IBD) fraction should be the sum of the segments with the same color.

Crossovers make the segments descending to offspring shorter; this does not change the mean of the IBD fraction, but it makes the variance of the IBD fraction smaller (R2-7.R, Figure 2.7).

```r
Niter<-1000
L<-1000000
r<-0.0001

# crossover positions along one chromosome of length L
crosses<-sort(sample(1:(L-1),rpois(1,(L-1)*r),replace=FALSE))
A<-c(0,crosses)
B<-c(crosses,L)
C<-B-A # segment lengths between crossovers
# exponential random numbers with the same mean, for comparison
rexps<-rexp(length(C),1/mean(C))
ylim<-c(0,max(C,rexps))
plot(sort(C),ylim=ylim,cex=0.5,pch=15)
par(new=T)
plot(sort(rexps),col="red",ylim=ylim,type="l")
```

### 2.1.5 Fate of the mutant — Genetic drift

New alleles are born by mutation events. Their initial allele frequency is $$\frac{1}{2n}$$, where $$n$$ is the population size (number of diploid individuals). Their frequency changes over generations because some alleles are transmitted to the next generation more than others. This stochastic change in allele frequency is called genetic drift.

A newly born allele might go extinct, or it might eventually take over from the original allele.
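Before the exact calculations in the following subsections, a quick stochastic sketch of drift (binomial resampling of $2n$ chromosomes each generation; the population size and generation count are arbitrary choices, not from the book) shows that most new mutants go extinct while a few become fixed:

```r
set.seed(2)
# Wright-Fisher-style sketch: a population of n.ind diploid individuals,
# i.e. 2 * n.ind chromosomes, resampled binomially each generation
wright.fisher <- function(n.ind = 20, ngen = 200) {
  n.chr <- 2 * n.ind
  count <- 1  # a new mutant starts as a single chromosome, frequency 1/(2n)
  for (g in 1:ngen) {
    count <- rbinom(1, n.chr, count / n.chr)  # sample the next generation
    if (count == 0 || count == n.chr) break   # extinction or fixation
  }
  count / n.chr  # final mutant frequency
}
freqs <- replicate(1000, wright.fisher())
mean(freqs == 0)  # fraction of runs in which the mutant went extinct
mean(freqs == 1)  # fraction fixed; close to the initial frequency 1/40
```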

1. Contingency table

Figure 2.8.

The model is that the chromosomes form an expanded pool, because each chromosome can transmit more than one copy to the next generation. The next generation’s chromosome pool is a sample from this expanded pool.

This sampling process can be expressed as a contingency table.

This is expressed as a 2 × 2 table, and its exact occurrence probability is $$\frac{4! \, 44! \, 12! \, 36!}{48! \, 2! \, 10! \, 2! \, 34!}$$.
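That value is the hypergeometric probability of a 2 × 2 table with cells 2, 2, 10, 34 (margins 4, 44, 12, 36; total 48). Under one consistent reading (12 chromosomes drawn from a pool of 48, of which 4 are mutant, yielding 2 mutants), it can be checked in R:

```r
# exact probability of the 2x2 table via factorials (all counts fit in doubles)
p.fact <- factorial(4) * factorial(44) * factorial(12) * factorial(36) /
  (factorial(48) * factorial(2) * factorial(10) * factorial(2) * factorial(34))
# the same value from the hypergeometric distribution:
# 12 chromosomes drawn from 48, of which 4 are mutant, yielding 2 mutants
p.hyper <- dhyper(2, 4, 44, 12)
c(p.fact, p.hyper)  # identical values
```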

2. State transition

Assume a population with only two individuals. In total there are 4 chromosomes. The number of copies of the mutant allele can be 0, 1, 2, 3, or 4.

See Figure 2.9. The number of copies of the mutant allele will change in the next generation.

Figure 2.9 describes the state transition.

A state probability vector is a vector whose length is the number of states, whose elements are non-negative, and whose elements sum to 1.

From the state with 0 mutant allele, the number of mutant copy of next generation is 0 with probability 1.

From the state with 1 mutant allele, the next generation’s count can be 0, 1, 2, 3, or 4. The arrows in Figure 2.9 indicate the possible transitions.

(R2-8.R Figure 2.10).

```r
# Given the chromosome population size and the number of mutant chromosomes,
# compute the probability of each mutant chromosome count in the next generation
# N: chromosome population size, p: number of mutant chromosomes,
# k: maximum number of chromosomes each chromosome leaves to the next
# generation's pool; infty=TRUE makes the pool unlimited
probDrift02NInf<-function(N,p,k,infty=FALSE){
	if(infty){ # make the pool from which chromosomes are drawn infinite
		ret<-rep(0,(N+1)) # states with 0,1,2,...,N mutant chromosomes
		pr1<-p/N # frequency of mutant chromosomes
		pr2<-1-pr1 # frequency of non-mutant chromosomes
		if(pr1==0){ # no mutant chromosomes: the next generation has none either
			ret[1]<-1
		}else if(pr2==0){ # no non-mutant chromosomes: the next generation is all mutant
			ret[N+1]<-1
		}else{
			for(i in 0:N){ # probability that i out of N chromosomes are mutant
				ret[i+1]<-exp(lgamma(N+1)-lgamma(i+1)-lgamma(N-i+1)+i*log(pr1)+(N-i)*log(pr2))
			}
		}
		ret
	}else{ # the pool from which next-generation chromosomes are drawn is finite
		ret<-rep(0,(N+1))
		kN<-k*N
		kp<-k*p
		kNp<-kN-kp
		k1N<-(k-1)*N
		commonLN<-lgamma(kp+1)+lgamma(kNp+1)+lgamma(k1N+1)+lgamma(N+1)-lgamma(kN+1)
		for(i in 0:N){
			if(kp>=i){ # cannot draw more mutant chromosomes than the pool contains
				n11<-i
				n12<-kp-n11
				n21<-N-n11
				n22<-k1N-n12
				if(n11>=0 && n12>=0 && n21>=0 && n22>=0){
					ret[i+1]<-exp(commonLN-lgamma(n11+1)-lgamma(n12+1)-lgamma(n21+1)-lgamma(n22+1))
				}
			}
		}
		ret
	}
}
# For a given state, compute the probabilities of the next generation's
# mutant chromosome counts
nextGenExProb2<-function(x,k,infty){
	N<-length(x)-1 # chromosome population size
	ret<-rep(0,length(x))
	for(i in 1:length(x)){ # for each state 0,1,...,N, compute the transition probabilities
		p<-i-1
		tmpret<-probDrift02NInf(N,p,k,infty=infty)
		# the current probability of p mutants is x[i]; add the next-generation
		# state probabilities weighted by it
		ret<-ret+tmpret*x[i]
	}
	return(ret)
}

# k: maximum offspring chromosomes per chromosome, initial: initial number of
# mutant chromosomes, np: population size (individuals), ng: number of
# generations to simulate, infty: whether the pool is infinite
DriftSim3<-function(k=2,initial=1,np=20,ng=100,infty=FALSE){
	m<-matrix(rep(0,(np*2+1)*ng),nrow=ng) # probabilities of all states in all generations
	m[1,initial+1]<-1 # generation 1: probability 1 at the initial count
	for(i in 2:ng){ # simulate generation 2 onward
		m[i,]<-nextGenExProb2(m[i-1,],k=k,infty=infty)
	}
	return(m)
}
# 20 individuals (40 chromosomes), at most 2 offspring chromosomes each,
# 10 initial mutant chromosomes, 25 generations
out<-DriftSim3(k=2,np=20,initial=10,ng=25,infty=FALSE) # run
# bird's-eye view of the result
persp(out,theta=120,phi=30,shade=0.2,xlab="generation",ylab="No. mutants",zlab="probability")
```
3. Transition matrix

When the transition from one state to another is stochastic, the transitions among n states are represented by an n × n matrix, the transition matrix.

The i-th column of the matrix is the vector of next-generation probabilities of all states when the probability of the i-th state before the transition is 1.

Using this matrix, the next state probability vector is calculated by multiplying the matrix by the current state vector, and the calculation can be repeated.

The next state depends only on the state one step before; no earlier states have any additional effect. Such a process is called a Markov chain.

Markov chains will appear again in Chapter 8, on linkage analysis.

$pr_{i \to j}=\frac{(2N)! \, (2(k-1)N)! \, (ki)! \, (k(2N-i))!}{(2kN)! \, j! \, (2N-j)! \, (ki-j)! \, (2(k-1)N-(ki-j))!}$

The corresponding 2x2 table is shown in the Japanese version textbook.
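Since each transition probability is hypergeometric, the transition matrix can be sketched compactly with `dhyper` (this is an alternative to `probDrift02NInf()` under the same finite-pool model, not the book's code):

```r
# transition matrix for N2 chromosomes, each contributing k copies to the pool
driftTransMat <- function(N2, k = 2) {
  # column i+1 is the state with i mutant chromosomes;
  # row j+1 is the probability of j mutants in the next generation
  sapply(0:N2, function(i) dhyper(0:N2, k * i, k * (N2 - i), N2))
}
M <- driftTransMat(4, k = 2)  # 4 chromosomes (2 individuals), as in Figure 2.9
v <- c(0, 1, 0, 0, 0)         # start with exactly 1 mutant chromosome
v10 <- v
for (g in 1:10) v10 <- M %*% v10  # state distribution after 10 generations
sum(v10)  # still a probability vector: sums to 1
```

Each column of `M` sums to 1, and the states with 0 or all mutant chromosomes are absorbing, which the repeated multiplication makes visible.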

## 2.2 From DNA to RNA and Protein

DNA RNA protein codon

### 2.2.1 From DNA to RNA — Transcription

Genetic information recorded in DNA exerts its function after being copied to RNA. As shown in Figure 2.11, the information “CAGGTT” in the upper portion of the DNA is copied into RNA as “CAGGUU”. This is called transcription. DNA uses “A”, “C”, “G” and “T”, while RNA uses “U” instead of “T”. Many RNAs are further translated into protein; such an RNA is called a messenger RNA (mRNA). Some RNAs exert their function as RNA without being translated; such a gene is called a functional RNA gene.

### 2.2.2 From RNA to Protein — Translation

Translation is the process of reading the information of an mRNA to make a chain of amino acids (a protein). Twenty kinds of amino acids are used in translation. The shortest base-sequence length that can identify 20 amino acids using permutations of the 4 bases satisfies $$4^2 \le 20 \le 4^3$$; in fact, permutations of 3 bases correspond to the amino acids in translation. A sequence of 3 bases is called a codon; Figure 2.12 shows the correspondence between codons and amino acids. A special codon meaning the start of translation and specific codons meaning its end control the initiation and termination of translation. Since the initiation codon “AUG” corresponds to the specific amino acid “Met”, the first amino acid of a newly made protein is always “Met”. There are several stop codons; by corresponding to no amino acid at all, they complete the work of translation.
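As a minimal sketch of translation (only four codons are included, and the function name is ours, not the book's), codons can be looked up in a named vector:

```r
# tiny codon table for illustration (a few of the 64 codons; "Stop" marks a stop codon)
codon.table <- c(AUG = "Met", GCU = "Ala", UGG = "Trp", UAA = "Stop")
translate <- function(mrna) {
  # split the mRNA into consecutive codons of 3 bases
  codons <- substring(mrna, seq(1, nchar(mrna) - 2, by = 3),
                      seq(3, nchar(mrna), by = 3))
  aa <- codon.table[codons]
  aa[seq_len(match("Stop", aa) - 1)]  # translation stops at the stop codon
}
translate("AUGGCUUGGUAA")  # Met, Ala, Trp
```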

# Chapter 3 Aspects of diversity !

## 3.1 Nucleic acid, protein diversity

### 3.1.1 Diversity of DNA sequences, species differences, gene polymorphism !

Figures 3.1 and 3.2

substitution insertion/deletion repeat inversion translocation

microscopic vs. macroscopic variants

structural variants

Figure 3.3

{A,T,G,C} : $$4^n$$

{0,1} : $$2^n$$

### 3.1.2 Diversity of RNA and protein

splicing

splicing variants

Using R, let us make sequences of splicing variants (R3-1.R).

seq1<-sample(c("A","T","G","C"),100,replace=TRUE) # make a random DNA sequence of length 100
exonpattern1<-c(11:20,41:60,81:90) # mRNA1 has three exons; this is its exon pattern
exonpattern2<-c(11:20,41:60,66:77,81:90) # mRNA2 has an extra exon inserted between the second and third exons of mRNA1
seq1[seq1=="T"]<-"U" # in mRNA, "T" of DNA becomes "U"
mRNA1<-seq1[exonpattern1] # extract the bases of mRNA1
mRNA2<-seq1[exonpattern2]

DNA sequence variations ~ Genome variations

Other omics layer variations

(Figure 3.4).

## 3.2 Diversity and variance

### 3.2.1 Decomposition of variance — Variance, covariance

Diversity means that “values” vary. Therefore variance, a statistical index, is a good indicator of it.

Under some conditions, multiple variances can be summed up, and one variance can be decomposed into multiple pieces that are themselves variances of something.

Assume a population with subpopulations.

You can measure variance of a whole population. Also you can measure variance of each subpopulation and also variance of means of subpopulations.

These “variances” are mutually related, and their relation is informative.

Assume a variable $$X_s$$ and its $$N$$ samples $$\{x_{s,1},...,x_{s,N}\}$$.

Variance of $$X_s$$ is, $V(Xs)=\frac{1}{N}\sum_{i=1}^N (x_{s,i}-\mu_s)^2$ where $$\mu_s=\frac{1}{N}\sum_{i=1}^N x_{s,i}$$.

You can write this as, $\sigma_{s,s}=V(Xs,Xs)=\frac{1}{N}\sum_{i=1}^N (x_{s,i}-\mu_s) \times (x_{s,i}-\mu_s)$

Squaring is multiplying a value by itself.

If you replace one of the two $$s$$ subscripts with $$t$$,

$\sigma_{s,t}=\frac{1}{N}\sum_{i=1}^N (x_{s,i}-\mu_s) \times (x_{t,i}-\mu_t)$ where $$X_t$$ is another variable.

This quantity is related to variance but involves two variables; it is called covariance.

Assume $$X1,X2$$, such as test scores of physics and biology.

You are interested in the sum of two subjects, $$P=\{p_i=x_{1,i}+x_{2,i}\}$$

$\sigma_{P,P}=\sigma_{X1,X1}+\sigma_{X2,X2}+\sigma_{X1,X2}+\sigma_{X2,X1}=\sum_{s=1}^2 \sum_{t=1}^2 \sigma_{X_s,X_t}$ Covariance is 0 when the two variables are mutually independent; you can easily find a detailed description or proof of this.

Rather than repeating the regular textbook, here we simulate these with R.

nsample<-10000;X1<-rnorm(nsample);X2<-rnorm(nsample) # rnorm(), x1 and x2 are independent
p12<-X1+X2
cov12<-cov(cbind(X1,X2)) #X1, X2 variance covariance
cov12
vpp_2<-cov(p12,p12) #p12 variance
vpp_2-sum(cov12)  # p12 vs. sum of variance and covariance
plot(X1,X2)
# x1 and x2 are dependent
X3<-X1+0.1*rnorm(nsample)
p13<-X1+X3
cov13<-cov(cbind(X1,X3))
cov13
vpp_3<-cov(p13,p13)
vpp_3-sum(cov13)
plot(X1,X3) 

Number of variables is arbitrary, k.

$\sigma_{p,p}=\sum_{s=1}^k \sum_{t=1}^k \sigma_{s,t}$
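This identity can be checked numerically; below is a minimal sketch (our own example, with k = 5 independent normal variables):

```r
# check: the variance of a sum equals the sum of all variances and covariances
set.seed(1)
k <- 5; nsample <- 10000
X <- matrix(rnorm(nsample * k), ncol = k)   # k variables in columns
p <- apply(X, 1, sum)                        # their sum, per sample
v_sum <- var(p)                              # variance of the sum
v_decomp <- sum(cov(X))                      # sum over all sigma_{s,t}
v_sum - v_decomp                             # essentially 0
```

Because the decomposition is an algebraic identity, the difference is zero up to floating-point error, whether or not the variables are independent.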

### 3.2.2 Variance and heritability !

Phenotypes vary and have variance.

Genotypes vary and have variance.

Environmental factors vary and have variance.

We assume that phenotypic variance is the sum of genotypic variance and environmental variance.

$\sigma_{P,P}=\sigma_{G,G}+\sigma_{E,E}$

Heritability, the degree to which the phenotype is affected by genes, is defined as $\frac{\sigma_{G,G}}{\sigma_{P,P}}$
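As a toy sketch (our own simulation, not from the book), heritability can be recovered from simulated genetic and environmental values:

```r
# toy sketch: heritability as the genetic share of phenotypic variance
set.seed(1)
n <- 100000
G <- rnorm(n, sd = sqrt(2))   # genotypic values, variance 2
E <- rnorm(n, sd = 1)         # environmental values, variance 1
P <- G + E                    # phenotype = genotype + environment
var(G) / var(P)               # heritability, close to 2/3
```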

### 3.2.3 Hardy-Weinberg equilibrium (HWE) and variance !

Assume a biallelic polymorphism with 2 alleles, A and a, whose allele frequencies are p and 1-p.

G0, G1 and G2 (“aa”, “Aa”, and “AA”; the index is the number of “A” alleles) are the three diplotypes.

Under the assumption that the two alleles combine independently, \begin{align*} G2=p^2\\ G1=2p(1-p)\\ G0=(1-p)^2 \end{align*} The state under this assumption is called Hardy-Weinberg equilibrium.

When the assumption is not true, a term representing the deviation from HWE should be added.

\begin{align*} G2=p^2+\Delta\\ G1=2p(1-p)-2\times \Delta\\ G0=(1-p)^2+\Delta \end{align*}

This can be re-written as, \begin{align*} G2=p^2+p(1-p)F\\ G1=2p(1-p)(1-F)\\ G0=(1-p)^2+p(1-p)F. \end{align*}

Here F is an index called the inbreeding coefficient.

Giving the values 1 and 0 to “A” and “a”, we can treat the population numerically.

Mean and variance of the population of 2N chromosomes are, $\begin{equation*} \mu(h)=(p \times 1 +(1-p) \times 0) =p \end{equation*}$ $\begin{equation*} v(h)=p \times (1-p)^2+ (1-p) \times (0-p)^2=p(1-p) \end{equation*}$

Considering diploid population,

$\begin{equation*} \mu(d)=G0\times 0 +G1\times 1 + G2 \times 2=2p=2\mu(h) \end{equation*}$ \begin{align*} v(d)&=G0 \times (0-\mu(d))^2+G1\times (1-\mu(d))^2+G2\times (2-\mu(d))^2\\ &=2p(1-p)(1+F)\\ &=2v(h)+2\times F \times v(h) \end{align*}

When F=0, the variance of diploids is twice that of haploids.

Covariance is expressed as $Cov(d)=F \times v(h)$.

$F=\frac{Cov(d)}{v(h)}=\frac{Cov(d)}{p(1-p)}$ F is a measure of the deviation from HWE, and it is the covariance of the two alleles divided by p(1-p), i.e., their correlation.
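This interpretation can be checked with a small simulation (our own construction: the second allele copies the first with probability F and is an independent draw otherwise, which gives the genotype frequencies above):

```r
# sketch: F recovered as covariance of the two alleles divided by p(1-p)
set.seed(1)
n <- 200000; p <- 0.3; F <- 0.2
a1 <- rbinom(n, 1, p)                 # first allele (A=1, a=0)
ibd <- rbinom(n, 1, F)                # identical-by-descent indicator
a2 <- ifelse(ibd == 1, a1, rbinom(n, 1, p))  # second allele
cov(a1, a2) / (p * (1 - p))           # close to F = 0.2
```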

## 3.3 The ways to handle data and their effect on variance/covariance !

### 3.3.1 When handling alleles in two columns for HWE and LD !

See Figure 3.6.

Two columns for two variables with 0 or 1.

F for HWD and r for LD are proportional to the covariance of the two variables.

The difference between HWD and LD is the way (1,0) and (0,1) are treated.

### 3.3.2 Mode of inheritance (dominant, recessive) is represented by the third column

See Figure 3.7.

To handle a dominant model, we need a third column. The third column gives information on the heterozygotes.

The value of genotypes is the sum of three columns.

$\begin{equation*} Vd=G1(1-G1)d^2 \end{equation*} \begin{equation*} Cov1d=Cov2d=\frac{1}{2}G1(G2-G0)d \end{equation*}$

## 3.4 A lot of factors — Multifactorial inheritance !

Polygenic traits or complex genetic traits.

When many factors get together, the overall effects tend to take normal distribution.

R3-3.R models many loci that are in HWE and mutually in LE.

The phenotypic variance is the sum of each locus's variance and their covariances.

covXY<-function(x,y){ # covariance from two numeric vectors
mx<-momentX(x,order=1,center=FALSE) # momentX() is defined elsewhere in the book; order=1, center=FALSE gives the mean
my<-momentX(y,order=1,center=FALSE)
sum((x-mx)*(y-my))/length(x)
}
covMat<-function(m){ # covariance matrix over all pairs of column vectors
ret<-matrix(rep(0,length(m[1,])^2),nrow=length(m[1,]))
for(i in 1:length(m[1,])){
for(j in 1:length(m[1,])){
ret[i,j]=covXY(m[,i],m[,j])
}
}
ret
}
N<-1000;M<-100 # number of samples; number of biallelic polymorphisms
af<-runif(M) # allele frequencies; for runif(), see Appendix A.5 on probability density functions and pseudo-random numbers
r<-rnorm(M) # per-allele score at each marker
d<-rnorm(M) # per-marker deviation from the additive model
Xmat<-matrix(rep(0,N*M),nrow=N)
for(i in 1:M){ # compute each individual's score for every marker
x<-rep(0,N)
x<-x+rbinom(N,1,af[i])
x<-x+rbinom(N,1,af[i])
x[x==1]<-1+d[i] # add the heterozygote-specific value (deviation from the additive model)
x<-x*r[i] # multiply by the effect size of the locus
Xmat[,i]<-x
}
X<-apply(Xmat,1,sum) # each individual's total score over all polymorphisms
momentX(X,order=2,center=TRUE) # the variance of the total scores ...
sum(covMat(Xmat)) # ... agrees with the sum of the variance-covariance matrix entries
hist(X) # histogram of individuals' total scores

# Part II Data, sample, sample set !

Chapter 4: observation and assessment. Chapter 5: treating samples individually. Chapter 6: treating samples as a group. In order to understand the genotype-phenotype relationship of organisms, we observe genotypes and phenotypes and handle them as data. In Part II, after considering the types of data, which are the basis of data handling, we look at two approaches for handling sample data: one approach distinguishes the individual samples, and the other captures the samples as a group.

# Chapter 4 Observation and assessment !

## 4.1 The type and configuration of the data !

### 4.1.1 The type of data that was seen from the gene — Genotype and phenotype, the final trait and an intermediate trait !

Since genetic statistics organizes and interprets life phenomena in terms of genes, we also classify the types and structures of data from that point of view. Data are divided into two types:

• genotype

• phenotype (everything other than genotype)

Figure 4.1.

• Interactions between layers.

• One-way vs. two-ways.

• Intermediate phenotypes and terminal phenotypes.

### 4.1.2 The type of data as analyzed — Data Types

1. Data types
• Discrete vs. continuous

• Ordered vs. not-ordered (vs. partially-ordered)

• 2 categories and 3-or-more categories are different from order-standpoint

Table 4.1

1. Two-category type: it may or may not be ordered

A or non-A.

A is a subset of universe. non-A is complement of A.

Figure 4.2

Two categories can be handled as ordered. Figure 4.3.

1. Three-or-more-category type

Three categories can be drawn as subsets as shown in Figure 4.2.

Three genotypes of a SNP may be treated as ordered (Figure 4.3) when focusing on the number of copies of one allele.

Three categories’ symmetry can be represented as a triangle (Figure 4.3).

Ordered categories can be generated by dividing a continuous variable with thresholds (Figure 4.3).

Order can be placed in one dimensional space.

1. Continuous type

Variables of this type are realized on a line.

### 4.1.3 Partially ordered

When a line structure exists in a variable, its values are totally ordered. When some values of a variable are ordered but no order is defined between others, the variable is called partially ordered.

Combinatorial genotypes of 2 or more SNPs can be regarded as partially ordered.

\begin{align*} (0,0) < (0,1) < (0,2)\\ (0,0) < (1,0) < (2,0)\\ (0,0) < (1,1) < (2,2) \end{align*}

### 4.1.4 The combination of categories !

$\prod_{i=1}^N k_i =k_1\times k_2 \times ... \times k_N$

### 4.1.5 Only-one selection, duplicated selection !

Zero or more items might be selected from an item list. This may be found in questionnaires or in clinical criteria.

Some clinical criteria have a scoring system with a weight vector for the items.

There are multiple ways to record them as shown in tables in the Japanese textbook.

Combination $$\binom{N}{k}=\frac{N!}{k!(N-k)!}$$ or permutation $$\frac{N!}{(N-k)!}$$ is important.

In case of questionnaire, multiple items might be selected (combination) or they should be selected with the order of preference (permutation).
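A minimal sketch of counting these in R (choose() and factorial() are base R functions):

```r
# number of ways to pick k = 3 items from N = 10
N <- 10; k <- 3
choose(N, k)                      # combination: 120
factorial(N) / factorial(N - k)   # permutation: 720
```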

### 4.1.6 Special aspects in diploids — Hardy-Weinberg equilibrium exact test (HWE) !

Assume a biallelic variant, with A and T alleles. Genotypes are AA, AT and TT.

We genotyped N individuals, and $$n_0,n_1,n_2$$ individuals had the AA, AT and TT genotypes, respectively.

This observation makes an N x 2 table (shown in the Japanese version textbook).

The sum of the first column is the number of A alleles observed, and the sum of the second column is the number of T alleles.

All row sums are 2, because everybody is diploid.

Total sum is 2N.

This can be seen as the result of a repeated procedure that selects two from two categories; the two values can be the same or different.

Let’s test the null hypothesis where alleles are selected at random.

We perform the “exact probability test” for HWE. You may find some notes on exact probability tests in general in Chapter 13.

There are $$n_0+n_2$$ cells with value 2, $$n_0 + n_2$$ cells with value 0, and $$2\times n_1$$ cells with value 1; the $$N$$ row sums all equal 2, and $$n_A$$ and $$n_T$$ are the marginal column sums; therefore,

$\begin{equation*} \frac{(2!)^N n_A! n_T!}{(2N)! (2!)^{n_0+n_2} (0!)^{n_0+n_2}(1!)^{2\times n_1}}= \frac{2^{n_1} n_A! n_T!}{(2N)!} \end{equation*}$ Note $$0!=1,1!=1,2!=2$$, $$N=n_0+n_1+n_2$$.

The table treats all N rows separately, meaning the individuals are mutually distinguishable.

However the observation is on the numbers of people with three genotypes and no distinction of individuals.

The number of ways to form the $$N \times 2$$ table is $$\frac{N!}{n_0! n_1! n_2!}$$; therefore, the exact probability to observe $$n_0,n_1,n_2$$ under the assumption of HWE is $\begin{equation*} \frac{N!}{n_0!n_1!n_2!} \times \frac{2^{n_1} n_A! n_T!}{(2N)!} = \frac{2^{n_1} n_A! n_T! N!}{(2N)!n_0!n_1!n_2!} \end{equation*}$

hweExact<-function(g=c(813,182,5)){ # number of people with three genotypes
n<-sum(g) # total
nA<-2*g[1]+g[2] # No. allele A
na<-2*g[3]+g[2] # No. allele a
evod<-g[2]%%2 # heterozygote count is even or odd
maxAa<-min(nA,na) # max possible number of heterozygotes (same parity as evod)
Aa<-seq(from=evod,to=maxAa,by=2) # list of possible heterozygote counts
AA<-(nA-Aa)/2 # corresponding No. AA
aa<-(na-Aa)/2 # corresponding No. aa
obs<-(g[2]-evod)/2+1 # index of the observed heterozygote count
prob<-rep(0,length(Aa)) # exact probs of all cases
prob<-exp(n*lgamma(2+1)+lgamma(nA+1)+lgamma(na+1)-lgamma(2*n+1)-(AA*lgamma(2+1)+Aa*lgamma(1+1)+aa*lgamma(2+1))+lgamma(n+1)-(lgamma(AA+1)+lgamma(Aa+1)+lgamma(aa+1)))
p.value<-sum(prob[prob<=prob[obs]]) # p-value
# Aa No. heteros
# prob exact probs
# obsprob exact prob of observed table
# p.value
list(Aa=Aa,prob=prob,obsprob=prob[obs],p.value=p.value)
}
xx<-hweExact(c(813,182,5))
xx$p.value # test p-value
sum(xx$prob) # the sum of all exact probs is 1

### 4.1.7 Parent items and child items !

Only when a value of one variable is a particular value, some variables might be observed.

For example, item X is prompted to be answered only when answer to item Y is “yes”.

For example, genotype of a marker on Y chromosome is only available for males.

### 4.1.8 Placement of categories, non-independence between categories, regular simplex !

n mutually independent categories can be placed in (n-1)-dimensional space as the vertices of a regular (n-1)-simplex.

• The angle $$\theta$$ between the two vectors pointing at any two vertices satisfies $$cos\theta=-\frac{1}{k}$$.

Assume k+1 dimensional space and its orthogonal basis vectors $$e_i =(0,0,....,0,1,0,...,0); i=1,2,...,k+1$$.

They represent k+1 categories.

Take the subspace that is $\sum_{i=1}^{k+1} \beta_i e_i; \sum \beta_i = 1, \beta_i \ge 0$, which is k-simplex.

The center of k-simplex is,

$o = \frac{1}{k+1} \sum_{i=1}^{k+1} e_i = \frac{1}{k+1} (1,1,...,1)$

$f_i=e_i-o=(-\frac{1}{k+1},-\frac{1}{k+1},....,-\frac{1}{k+1},1-\frac{1}{k+1},-\frac{1}{k+1},...,-\frac{1}{k+1})$ are k+1 vectors in k-dimensional space.

$|f_i|=\sqrt{\frac{1}{(k+1)^2}(k*1+k^2)}=\sqrt{\frac{k}{k+1}}$

For any i and j ($$i \ne j$$), the angle $$\theta_{i,j}$$ satisfies $cos(\theta_{i,j})=\frac{f_i \cdot f_j}{|f_i||f_j|}=-\frac{1}{k+1} \times \frac{k+1}{k}=-\frac{1}{k}$

• 1-dimensional space (line): regular 1-simplex, a line segment

• 2-dimensional space (plane): regular 2-simplex, an equilateral triangle

• 3-dimensional space (space): regular 3-simplex, a regular tetrahedron

As k grows larger, $$cos(\theta_{i,j})=-\frac{1}{k} \to 0$$, meaning the categories approach a mutually independent (orthogonal) relation.
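The derivation above can be checked numerically, for example with k = 3 (a regular tetrahedron):

```r
# numeric check of the regular-simplex angle: cos(theta) = -1/k for k = 3
k <- 3
e <- diag(k + 1)                                    # k+1 orthonormal category vectors
o <- rep(1/(k + 1), k + 1)                          # center of the k-simplex
f1 <- e[1, ] - o; f2 <- e[2, ] - o                  # f_i = e_i - o
sum(f1 * f2) / (sqrt(sum(f1^2)) * sqrt(sum(f2^2))) # equals -1/3
```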

A complete graph is a graph in which all vertices are mutually connected, meaning all pairwise relations are the same; it is a graph representation of multiple mutually independent categories.

Figure 4.4.

## 4.2 Comparison !

### 4.2.1 Pairwise comparison - symmetrical relationship and asymmetrical relationship

Pairwise comparison, the comparison of two items, can be regarded as the evaluation of the relation between the two.

There are two ways to evaluate the relation of A and B: the relation of A to B and the relation of B to A.

$$A=B$$ or $$A \ne B$$ does not change after you exchange A and B.

When A “is larger than” B, B “is smaller than” A. In this case, after you exchange A and B, you have to change “larger” to “smaller”.

The former relation is symmetric and the latter is asymmetric relation.

### 4.2.2 Change an asymmetric relation to a symmetric one — Distance !

When A is 2 and B is 1, A is larger than B by 1 and B is smaller than A by 1.

“Larger” and “smaller” are different but “by 1” are common between two expressions.

“1” is distance between A and B and distance seems “symmetric”.

Distance can be regarded as a non-negative measure that quantifies the difference between two items.

It is better to define distance clearly.

Metric space is a set for which distances between all members of the set are defined.

1. Distance
• is defined between two of something (A, B)

• takes the value of the non-negative (greater than or equal to 0)

• should be symmetric (distance from A to B and distance from B to A are the same)

• When A=B, distance between A and B should be 0

• For three items, there are three distances. There should be a triangle whose edge lengths are the three distances; in some cases, the triangle degenerates to a segment with two vertices at the ends and the third vertex on the segment. This rule is called the triangle inequality.

1. triangle inequality

Figure 4.5. $\begin{equation*} b+c>a,c+a>b,a+b>c \end{equation*}$ then, \begin{align*} x=\frac{a+b+c}{2}-a = \frac{1}{2} (b+c-a) \ge 0\\ y=\frac{a+b+c}{2}-b = \frac{1}{2} (c+a-b) \ge 0\\ z=\frac{a+b+c}{2}-c = \frac{1}{2} (a+b-c) \ge 0\\ x+y=c, y+z=a, z+x=b \end{align*}

This means that when distances meeting the triangle inequality are defined for all pairs, there is a tree whose edge lengths (x, y, z above) realize the distances.

### 4.2.3 Euclidean distance and other distances !

Any formula that satisfies the above-mentioned rules defines a distance.

Euclidean distance is the familiar one: $\begin{equation*} d_E(A,B)=\sqrt{\sum_{i=1}^k (a_i-b_i)^2}=\sqrt{\sum_{i=1}^k \delta_i^2} \end{equation*}$

This is another one, called Manhattan distance: $\begin{equation*} d_M(A,B)=\sum_{i=1}^k |a_i-b_i|=\sum_{i=1}^k \delta_i \end{equation*}$

R’s dist() function supports multiple distances.
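A minimal sketch comparing the two definitions with dist():

```r
# the two distances for a simple pair of points
x <- rbind(A = c(0, 0), B = c(3, 4))
dist(x, method = "euclidean")   # sqrt(3^2 + 4^2) = 5
dist(x, method = "manhattan")   # |3| + |4| = 7
```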

### 4.2.4 Sequence differences and Manhattan distance

For example, the distance between DNA sequences can be the number of different bases.

• Human: cccggaCAcCgActtcccGgggctcatt

• Mouse: cccggaTGcAgGcttcccAgggctcatt

Five out of 28 bases are different. The distance is 5.
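Counting differing bases can be sketched in R (the sequences are the ones above; letter case only marks the differing positions, so the comparison ignores case):

```r
# count differing bases between the two sequences
human <- strsplit("cccggaCAcCgActtcccGgggctcatt", "")[[1]]
mouse <- strsplit("cccggaTGcAgGcttcccAgggctcatt", "")[[1]]
sum(toupper(human) != toupper(mouse))   # 5
```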

When there is an insertion/deletion, it is more difficult to measure the distance between them.

• Human: … cccggaCAcCgActtcccGgggctcattACcctCAc …

• Mouse: … cccggaTGcAgGcttcccAgggctcatt = Tcct = Tc …

To define distance between DNA sequences that are different with substitutions and ins/dels, we have to have a formula to score the difference taking both types of variants into account.

The BLAST method quantifies the distance between DNA sequences by the rarity of the biological events that would generate one sequence from the other; to quantify this rarity, it uses the extreme value distribution.

### 4.2.5 Angle as distance — Correlation coefficient

Distance expresses a symmetrical relationship with a value greater than or equal to 0, so it cannot quantitatively express relationships that involve negative values. When categories are placed in space as the vertices of a regular simplex, every pair of categories is in the same relationship, and it is the equality of the angles that expresses this equal relationship among the categories. Thus an angle can also represent a relationship between two things. If an angle $$\theta$$ represents the relationship, the quantity $$cos\theta$$ may be used as its measure. When two data items are given as vectors, it is computed from their inner product and lengths.

$$cos(\theta)=\frac{\sum_{i=1}^k x_i y_i}{|x||y|}$$

When the two vectors are centered (each has mean 0), this value is the correlation coefficient.
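A quick numerical check (our own sketch) that cor() equals the cosine of the angle between the centered vectors:

```r
# correlation coefficient as the cosine of the angle between centered vectors
set.seed(1)
x <- rnorm(100); y <- rnorm(100)
xc <- x - mean(x); yc <- y - mean(y)                      # center both vectors
costheta <- sum(xc * yc) / (sqrt(sum(xc^2)) * sqrt(sum(yc^2)))
costheta - cor(x, y)                                      # essentially 0
```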

## 4.3 Multiple samples, a lot of comparisons

### 4.3.1 one vs. (N - 1) and N vs. N !

So far, we compared two.

From here we compare N samples.

• You may be interested in one out of N and compare the one vs. (N-1) others.

• You may be interested in all N and compare $$\frac{N(N-1)}{2}$$ pairs. Or you may compare $$\frac{N(N+1)}{2}$$ pairs including the “one-and-itself” relations. When the pairwise relation is asymmetric, $$N^2$$ comparisons should be made.

See Figure 4.7.

### 4.3.2 Partial order !

When all items are located on a line, they are totally ordered.

Sometimes, some pairs have an order but others don’t.

Assume a power set of a set $$\{1,2,...,N\}$$, with N elements.

There are $$2^N$$ elements, the subsets of $$\{1,2,...,N\}$$, in the power set.

Between some elements in the power set an inclusion relation exists, but for some pairs it does not.

Figure 4.8 is a Hasse diagram representing inclusion relation of elements of power set with 4 elements.

They can be seen as a 4-dimensional cube projected onto a 2-D plane.

powerSetHasse<-function(k=4){
lev<-k+1
numelem<-rep(0,lev)
a<-0:k
numelem<-choose(k,a)

numnode<-2^k
nodesX<-rep(0,numnode)
nodesY<-rep(0,numnode)
vecs<-matrix(rep(0,numnode*k),ncol=k)
keta<-2^(0:(k-1))

counter<-rep(0,lev)
startval<--(numelem-1)/2
for(i in 2:numnode){
vecs[i,]<-vecs[i-1,]
vecs[i,1]<-vecs[i,1]+1
for(j in 1:(k-1)){
if(vecs[i,j]==2){
vecs[i,j]<-0
vecs[i,j+1]<-vecs[i,j+1]+1
}else{
break # no further carry to propagate
}
}
tmpsum<-sum(vecs[i,])
nodesY[i]<-tmpsum
nodesX[i]<-counter[tmpsum]+startval[tmpsum+1]
counter[tmpsum]<-counter[tmpsum]+1
}

values<-vecs%*%keta

edges<-matrix(rep(FALSE,numnode^2),ncol=numnode)
for(i in 1:(numnode-1)){
for(j in (i+1):numnode){
tmpdist<-sum(abs(vecs[i,]-vecs[j,]))
if(tmpdist==1){
edges[i,j]<-TRUE
}
}
}
Edges<-which(edges,arr.ind=TRUE)
plot(nodesX,nodesY,pch=15)
segments(nodesX[Edges[,1]],nodesY[Edges[,1]],nodesX[Edges[,2]],nodesY[Edges[,2]])
}

powerSetHasse(4)

### 4.3.3 Distance matrix and tree

When a distance is defined for all pairs of N elements, the information can be expressed as an N x N non-negative symmetric matrix, called a distance matrix.

Using nj() function in ape package of R (nj standing for neighbor joining method), you can generate a tree structure that satisfies the distance matrix.

Tree construction is one of hierarchical clustering methods, that will be a topic in Chapter 5.

Example output of dist(x,method="euclidean"), the lower triangle of the 5 x 5 distance matrix:

          1        2        3        4
2  4.725108
3  1.372943 4.600435
4  4.135837 2.220284 4.311534
5  2.020896 3.608861 1.844573 3.591590

# Ns samples in k dimensional space
Ns<-5;k<-5;x <- matrix(rnorm(Ns*k), nrow=Ns)
dist(x,method="euclidean") # Euclidean distance
library(ape) # ape package
treu<-nj(dist(x,method="euclidean")) # tree construction with euclidean distance
trman<-nj(dist(x,method="manhattan")) # with manhattan distance
par(mfcol=c(1,2))
plot(treu);plot(trman)
par(mfcol=c(1,1)) 

# Chapter 5 Treat samples individually

The ways to handle samples can be classified into two; one is to handle samples individually and the other is to study samples as a distribution.

This chapter covers the ways to handle samples individually.

The next chapter, Chapter 6, covers the ways to study them as a distribution.

## 5.1 Graph !

### 5.1.1 Definition of graph !

Graph is a pair of a set of vertices and a set of edges.

• Vertex

• Edge

• Directed vs. non-directed

• Connection

• Degree

• Neighbor

• Path

• Cycle

• Distance between vertex pair

• Complete graph

• Acyclic graph

• Tree graph

• Graph algorithms

Figure 5.1.

## 5.3 Tree graph !

### 5.3.1 Tree

$\text{No. edges} = \text{No. vertices} - 1$

• Root; rooted tree vs. unrooted tree

• leaf

Figure 5.3.

## 5.4 Understanding of data by making a tree - hierarchical clustering !

### 5.4.1 Phylogenetic tree

• Phylogenetic tree for evolutionary history

• Rooted tree

• Hierarchical clustering

### 5.4.2 Hierarchical clustering

Starting from a data set, hierarchical clusters are to be generated with repeated procedures.

Multiple algorithms of hierarchical clustering.

• Definition of distance

• Definition of distance between tentative clusters

• Rules in merging order

Figure 5.5.

Let us try hierarchical clustering with R's hclust() function (R5-1.R, Figure 5.6).

distMatrix<-dist(x,method="manhattan") # use the Manhattan distance
trclust<-hclust(distMatrix,method="ward.D") # use Ward's method ("ward.D"; plain "ward" in old R versions) for the between-cluster distance
plot(trclust)

## 5.5 Data in the shape of matrix !

### 5.5.1 Rearranging the order of elements of a matrix, both rows and columns should be rearranged — heat map

N samples and M variables make N x M data matrix.

Using the matrix data, N samples can be hierarchically clustered and M variables can be also hierarchically clustered as shown in Figure 5.7, a heat map.

m<-matrix(rbinom(120,1,0.5),20,6)
heatmap(m)

### 5.5.2 Order of elements is not changed — linkage disequilibrium index plot

A data set with M samples and N markers.

For N markers, $$\frac{N(N-1)}{2}$$ pair LD index values can be calculated.

Because the order of the N markers represents their locations, the order should not be changed; instead, evaluate the LD index pattern with the order kept.

LD index, r, is correlation coefficient.

cormatrix<-cor(m);rsqmatrix<-cormatrix^2
image(1:nrow(rsqmatrix),1:ncol(rsqmatrix),rsqmatrix,col=gray((100:0)/100))

### 5.5.3 Order of rows, order of columns

M rows x N columns data set.

Based on your interest, you may rearrange/not-rearrange order of columns / rows, as shown in Figure 5.9.

• Relation among N items,

• Relation among M items,

• Relation both among N items and among M items

## 5.6 Pedigree, allele genealogy - graph of individuals of a species

Figure 5.10.

Pedigree: Graph-like but not a graph

Transformation of pedigree to a graph

Cousin marriage as a closed loop

Directed graph

### 5.6.1 Graph of individuals and graph of chromosomes

1. The graph of the individual’s relationship

Autosomal recessive trait shown in Figure 5.11.

1. The graph of chromosomal relationship

Figure 5.12 is a graph representing chromosomal relationship.

Figure 5.12 is one of many such graphs satisfying Figure 5.11, meaning there are multiple “right answers” of chromosomal relations satisfying the individual relations.

Figure 5.13 shows two answers.

This is why we have to estimate the chromosomal relation pattern(s) (genetic relation) that should describe the individual relation pattern (phenotypic relation).

### 5.6.2 Graph of the chromosome transmission and recombination

Figure 5.14.

Chromosomal transmission graph changes its pattern at crossing over sites.

When focusing on a particular locus, no branchings in chromosomal graph.

When overlapping chromosomal graphs along chromosomes, every offspring chromosome is connected to paternal and maternal chromosome.

### 5.6.3 Going back to ancestors — Coalescent

Figure 5.15.

For each locus, multiple trees exist in a population.

Multiple leaves share a common ancestral chromosome.

When going back, they merge; “coalescent”

In brief, linkage analysis tries to identify the tree whose pattern best fits the pattern of the phenotype.

## 5.7 Network

Tree is a graph without cycle.

Network is a graph with cycle.

In general, graphs with cycles are difficult to handle and to interpret.

Cycles may represent “positive / negative feedback”.

ARGsim<-function(n=8,g=5,l=6,r=0.05){ # r: recombination fraction per interval
ids<-1:n
now<-1:n
parents<-rep(0,n*g*l)
data<-rep(0,n*g*l)
for(i in 1:n){
data[(1+(i-1)*l):(i*l)]<-i
}
count<-1+n*l
for(i in 2:g){
tmp<-sample(ids,replace=FALSE)
for(j in 1:(n/2)){
countinside<-1
for(x in 1:2){
first<-1
if(runif(1)>0.5){
first<-2
}
data[count]<-now[tmp[(j-1)*2+first]]
parents[count]<-tmp[(j-1)*2+first]
now[countinside]<-data[count]
count<-count+1
countinside<-countinside+1
for(k in 1:(l-1)){
if(runif(1)<r){
first<-first+1
if(first==3){
first<-1
}
}
data[count]<-now[tmp[(j-1)*2+first]]
parents[count]<-tmp[(j-1)*2+first]
now[countinside]<-data[count]
count<-count+1
countinside<-countinside+1
}
}
}
}

return(list(allele=matrix(data,nrow=l),parents=matrix(parents,nrow=l)))
}
n<-8
g<-10
l<-200
r<-0.05
d<-ARGsim(n=n,g=g,l=l,r=r)
mmm<-matrix(d$parents[1,],ncol=g)
plotSlice<-function(m){
points<-which(m>=0,arr.ind=TRUE)
plot(points[,1],points[,2])
for(i in 2:length(m[1,])){
for(j in 1:length(m[,1])){
segments(m[j,i],i-1,j,i)
}
}
}
locus<-1:l
PlotSerial<-function(m,locus,g){
for(i in locus){
mmm<-matrix(m[i,],ncol=g)
plotSlice(m=mmm)
}
}
PlotOverlap<-function(m,locus,g){
for(i in locus){
mmm<-matrix(m[i,],ncol=g)
par(new=T)
plotSlice(m=mmm)
}
}
PlotSerial(d$parents,locus=locus,g=g)
PlotOverlap(d$parents,locus=locus,g=g)
for(i in 1:l){
mmm<-matrix(d$parents[i,],ncol=g)
plotSlice(m=mmm)
}

locus<-1

plot(emout$logLikeHistory[1:20], type = "b") # plot up to 20 iterations
x<-1:10
s<-3*x
f<-7*x
mean<-lower<-upper<-matrix(0,length(x),11)
for(i in x){
tmp<-binom.confint(x=s[i],n=s[i]+f[i],prior.shape1=1,prior.shape2=1)
mean[i,]<-tmp[,4]
lower[i,]<-tmp[,5]
upper[i,]<-tmp[,6]
}
plotlist<-c(2,3,5,9)
collist<-c("black","red","blue","green")
ylim<-c(0,1)
for(i in 1:length(plotlist)){
plot(x,mean[,plotlist[i]],type="l",ylim=ylim,col=collist[i])
par(new=T)
plot(x,lower[,plotlist[i]],type="l",ylim=ylim,col=collist[i])
par(new=T)
plot(x,upper[,plotlist[i]],type="l",ylim=ylim,col=collist[i])
par(new=T)
}
abline(h=0.3)

# Chapter 13 Rejection of hypothesis and statistical test !

## 13.1 Rejection of hypothesis that is difficult to believe — Observation of 3 categories

Reject a hypothesis if the observation is too rare under it (R13-1.R).

There are three categories; 10, 7 and 3 were the observed counts of the three categories.

$\begin{equation*} \frac{N!}{n_1!n_2!n_3!} (\frac{1}{3})^N \end{equation*}$

dpoly<-function(n=c(1,2,3),p=NULL){ # function to calculate the multinomial probability
N<-sum(n)
if(is.null(p)){
p<-rep(1/length(n),length(n))
}
exp(lgamma(N+1)-sum(lgamma(n+1))+sum(n*log(p)))
}
DrawHigherLower<-function(g=c(10,7,3),p=NULL){
N<-sum(g)
x<-0:N;y<-0:N
xy<-expand.grid(x,y) # combination of all x and y values
z<--(xy[,1]+xy[,2])+N
xyz<-matrix(c(xy[,1],xy[,2],z),nrow=length(z))
xyz<-xyz[apply(xyz,1,min)>=0,] # non negative
probs<-apply(xyz,1,dpoly,p=p) # apply the above-made function to xyz
probObs<-dpoly(g,p=p) # probability of the observation
higher<-which(probs>probObs)
lower<-which(probs<=probObs)
xyz2<-apply(xyz,1,CoordTriangle)
x2<-xyz2[1,];y2<-xyz2[2,]
xlim<-c(0,1);ylim<-c(0,2/sqrt(3))
# Black vs. Red
plot(xyz2[1,which(probs>probObs)]/N,xyz2[2,which(probs>probObs)]/N,xlim=xlim,ylim=ylim,col = "black",pch=15)
par(new=T)
plot(xyz2[1,which(probs<=probObs)]/N,xyz2[2,which(probs<=probObs)]/N,xlim=xlim,ylim=ylim,col = "red",pch=15)
DrawTriangleFrame()
}
par(mfcol=c(1,2))
DrawHigherLower(c(10,7,3))
DrawHigherLower(c(10,7,3),c(0.9,0.05,0.05))
par(mfcol=c(1,1))

Draw the probabilities under the two hypotheses.

DrawDirichletDensity<-function(prior=c(1,2,3),k=0.01){
library(MCMCpack)
xyz<-TriLattice(k=k)
density <- Density(ddirichlet(xyz, prior))
xyz2 <- apply(xyz, 1, CoordTriangle)
x2 <- xyz2[1, ];y2 <- xyz2[2, ]
xlim <- c(0, 1);ylim <- c(0, 2/sqrt(3))
plot(xyz2[1,], xyz2[2,], xlim = xlim, ylim = ylim, col = gray(density), pch = 15)
DrawTriangleFrame()
}
DrawTriangleFrame <- function() { # draw the triangle frame
framePointsX <- c(1, 0, 0, 1)
framePointsY <- c(1/2, 2/sqrt(3), 0, 1/2)
s <- seq(length(framePointsX) - 1)
segments(framePointsX[s], framePointsY[s], framePointsX[s + 1], framePointsY[s + 1])
}
CoordTriangle <- function(x) { # convert to triangle coordinates
x2 <- x[1]; y2 <- 2/sqrt(3) * x[2] + 1/2 * x[1]
c(x2, y2, 0)
}
TriLattice<-function(k=0.01){
x <-y<- seq(from = 0, to = 1, by = k)
xy <- expand.grid(x, y)
z <- 1 - xy[1] - xy[2]
xyz <- cbind(xy[1], xy[2], z)
minval <- apply(xyz, 1, min)
xyz[minval > 0, ]
}
Density<-function(d){
if(max(d)==min(d)){
rep(0.5,length(d))
}else{
(max(d)-d)/(max(d)-min(d))
}
}
DrawDirichletDensity(c(1,1,1))
DrawDirichletDensity(c(10,7,3))
DrawCI <- function(p = c(11, 8, 4), ci = 0.95, N = 10000) {
library(MCMCpack)
rs <- rdirichlet(N, p)
ds <- ddirichlet(rs, p)
ord <- order(ds)
OutPoints <- N * (1 - ci)
rs2 <- apply(rs, 1, CoordTriangle)
xlim <- c(0, 1);ylim <- c(0, 2/sqrt(3))
plot(rs2[1, ord[1:OutPoints]], rs2[2, ord[1:OutPoints]], col = gray(6/8), xlim = xlim, ylim = ylim, pch = 15)
par(new = T)
plot(rs2[1, ord[(OutPoints + 1):N]], rs2[2, ord[(OutPoints + 1):N]], xlim = xlim, ylim = ylim, pch = 15)
DrawTriangleFrame()
}
DrawCI(p = c(11, 8, 4), ci = 0.95, N = 10000)

## 13.2 Contingency table test

2x2 table

Assume p is the probability of A.

$\begin{equation*} \frac{n_{1.}!}{n_{11}! n_{12}!} \frac{n_{2.}!}{n_{21}!n_{22}!} p^{n_{.1}} (1-p)^{n_{.2}} \end{equation*}$

Regardless of p, the probability to observe the 4 cell values should be proportional to $\frac{n_{1.}!}{n_{11}! n_{12}!} \frac{n_{2.}!}{n_{21}!n_{22}!}$

To make the probabilities of all cases sum to 1, it should be changed to: $\begin{equation*} \frac{n_{1.}!}{n_{11}! n_{12}!} \frac{n_{2.}!}{n_{21}!n_{22}!} \times \frac{n_{.1}! n_{.2}!}{n_{..}!} \end{equation*}$

Using this we can determine the p-value, which is the p-value of the exact probability test (R13-2.R).

The calculated value fits well to the formula below: $\begin{equation*} a\times e^{-b \delta^2} \end{equation*}$

The exact probability decreases with the square of the deviation from the expected value.

# expected table
makeExptable<-function(m = matrix(c(10, 20, 30, 40), nrow = 2)) {
m1 <- apply(m, 1, sum);m2 <- apply(m, 2, sum)
N <- sum(m);etable <- m1 %*% t(m2)/N
}
# Change the value of cell (1,1) from the expected value by 1 at a time
ProbDistance <- function(m = matrix(c(10, 20, 30, 40), nrow = 2)) {
# marginal counts
m1 <- apply(m, 1, sum)
m2 <- apply(m, 2, sum)
etable <- makeExptable(m)
N<-sum(etable)
d <- seq(from = 0, to = 20, by = 1) # deviation
# All cells should be non negative
x <- d + etable[1, 1]; y <- -d + etable[1, 2]; z <- -d + etable[2, 1]; w <- d + etable[2, 2]
mat <- matrix(c(x, y, z, w), ncol = 4);mins <- apply(mat, 1, min);matOK <- mat[mins > 0, ]
# component of the exact prob calculated with marginal counts
tmp <- sum(lgamma(m1 + 1)) + sum(lgamma(m2 + 1)) - lgamma(N + 1)
# exact prob of all possible tables
prob <- exp(-apply(lgamma(matOK+1),1,sum)+tmp)
list(x = d[mins > 0], prob = prob)
}
d <- ProbDistance(m = matrix(c(10, 20, 30, 40), nrow = 2)) # c(10,20,30,40) is cell values of the 2x2 table
ylim <- c(0, max(d$prob))
plot(d$x, d$prob, ylim = ylim)
# function to be used for optimization
fForOptim <- function(x) {
a <- x[1]; b <- x[2]
# value to be optimized
sum((d$prob - (a * exp(-b * d$x^2)))^2)
}
optout <- optim(c(1, 1), fForOptim) # initial coefficients are set at (1,1)
xfine <- seq(from = min(d$x), to = max(d$x), by = 0.01) # values for the horizontal axis
optprob <- optout$par[1] * exp(-optout$par[2] * xfine^2)# estimated function value
par(new=T)
plot(xfine, optprob, ylim = ylim, col = "red", type = "l")

### 13.2.1 Pearson’s independence test - chi-square test

$\begin{equation*} \chi^2=\sum_{i,j} \frac{(n_{ij}-e_{ij})^2}{e_{ij}} \end{equation*}$

chisqCalc<-function(m=matrix(c(10,20,30,40),nrow=2)){
etable<-makeExptable(m)
m1<-length(m[,1]);m2<-length(m[1,])
chi2<-sum((m-etable)^2/etable)
df<-(m1-1)*(m2-1) # df=(N-1)x(M-1)
p<-pchisq(chi2,df,lower.tail=FALSE) #
list(chi2=chi2,p=p,df=df)
}
chisqCalc(m=matrix(c(10,20,30,40),nrow=2))

### 13.3.2 Statistics comparing the null hypothesis and the maximum likelihood hypothesis — likelihood ratio test

Compare the likelihood under the null hypothesis with the likelihood under the alternative hypothesis.

For the alternative hypothesis, we take the maximum likelihood values of the parameter(s) modeling it.

Alternative hypothesis:

Maximum likelihood estimates:

$$\frac{n_{11}}{n_{1.}},\frac{n_{21}}{n_{2.}}$$

\begin{align*} L(G1\not = G2) &= \frac{n_{.1}!}{n_{11}!n_{21}!} \frac{n_{.2}!}{n_{12}!n_{22}!} (\frac{n_{11}}{n_{1.}})^{n_{11}} (\frac{n_{12}}{n_{1.}})^{n_{12}} (\frac{n_{21}}{n_{2.}})^{n_{21}} (\frac{n_{22}}{n_{2.}})^{n_{22}} \\ &=\frac{n_{.1}!}{n_{11}!n_{21}!} \frac{n_{.2}!}{n_{12}!n_{22}!} \frac{ n_{11}^{n_{11}} n_{12}^{n_{12}} n_{21}^{n_{21}} n_{22}^{n_{22}} } {n_{1.}^{n_{1.}} n_{2.}^{n_{2.}} } \end{align*}

Null hypothesis: \begin{align*} L(G1= G2) &= \frac{n_{.1}!}{n_{11}!n_{21}!} \frac{n_{.2}!}{n_{12}!n_{22}!} (\frac{n_{.1}}{n_{..}})^{n_{11}+n_{21}} (\frac{n_{.2}}{n_{..}})^{n_{12}+n_{22}} \\ &= \frac{n_{.1}!}{n_{11}!n_{21}!} \frac{n_{.2}!}{n_{12}!n_{22}!} \frac{ n_{.1}^{n_{.1}} n_{.2}^{n_{.2}} }{n_{..}^{n_{..}} } \end{align*}

Likelihood ratio: the factorial components cancel out.

$\begin{equation*} \frac{L(G1\not = G2) }{L(G1= G2) }= \frac{n_{..}^{n_{..}} n_{11}^{n_{11}} n_{12}^{n_{12}} n_{21}^{n_{21}} n_{22}^{n_{22}} } {n_{1.}^{n_{1.}} n_{2.}^{n_{2.}}n_{.1}^{n_{.1}} n_{.2}^{n_{.2}} } \end{equation*}$

In general $\begin{equation*} \frac{L(G1\not = G2) }{L(G1= G2) } = \prod_{i,j} (\frac{n_{ij}}{e_{ij}})^{n_{ij}} \end{equation*}$

Twice the log-likelihood ratio can be treated as a $$\chi^2$$ value (R13-4.R).

likelihoodRatioTest<-function(m=matrix(c(10,20,30,40),nrow=2)){
etable<-makeExptable(m)
chi2<-2*sum(log(m/etable)*m)
df<-(length(m[,1])-1)*(length(m[1,])-1)
p<-pchisq(chi2,df,lower.tail=FALSE)
return(list(statistic=chi2,p.value=p,df=df))
}
likelihoodRatioTest(m=matrix(c(10,20,30,40),nrow=2))

## 13.3 Comparison of the three methods — exact test, Pearson's independence test, and the likelihood ratio test

We have three test methods for contingency tables.

They are similar, but of course they differ from each other.

We should know when they agree, when they differ, and how.
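As a quick side-by-side check (a sketch of mine, not from the source code), the three p-values for a single 2x2 table can be computed with base R; `lrt2x2` is a small helper equivalent to the likelihood-ratio computation in the previous section:

```r
# Compare the exact test, Pearson's chi-square test (no correction) and the
# likelihood ratio test on a single 2x2 table; lrt2x2 is a helper for this sketch
lrt2x2 <- function(m) {
  e <- outer(rowSums(m), colSums(m)) / sum(m)  # expected table
  chi2 <- 2 * sum(m * log(m / e))              # twice the log-likelihood ratio
  pchisq(chi2, df = 1, lower.tail = FALSE)
}
m <- matrix(c(10, 20, 30, 40), nrow = 2)
pExact   <- fisher.test(m)$p.value
pPearson <- chisq.test(m, correct = FALSE)$p.value
pLRT     <- lrt2x2(m)
c(pExact, pPearson, pLRT)  # close to each other, but not identical
```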

### 13.3.1 When sample size is large and when sample size is small

1. When sample size is small
2. When sample size is large

Small table (3, 6, 9, 12)

Large table (100, 200, 300, 400)

Figures 13.3, 13.4 and 13.5, and R13-supl.R

Symmetry?

### 13.3.3 Difference between limited space and infinite space

The $$\chi^2$$ distribution supports infinitely large values of the statistic.

The exact test does not accept negative cell values, meaning that only a finite df-dimensional space is covered.
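This difference can be illustrated with a small sketch (mine, under assumed margins): with all margins of a 2x2 table fixed, the attainable $$\chi^2$$ statistic has a finite maximum, while the $$\chi^2$$ distribution's support is unbounded.

```r
# With all margins of a 2x2 table fixed (rows 40/60, columns 30/70, total 100),
# enumerate every valid table and record its chi-square statistic
chisqStat <- function(m) {
  e <- outer(rowSums(m), colSums(m)) / sum(m)
  sum((m - e)^2 / e)
}
stats <- sapply(0:30, function(a) {  # n11 ranges over 0..min(row1, col1)
  chisqStat(matrix(c(a, 30 - a, 40 - a, 30 + a), nrow = 2))
})
max(stats)  # a finite maximum, unlike the chi-square distribution's support
```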

### 13.3.5 A summary of the differences

• Behaviour with small sample sizes

• Symmetry

• Exact vs. asymptotic

• Discrete vs. continuous

• Finite vs. infinite

Table 13.4.

## 13.4 Testing hypotheses with constraints

2x3 table for case-control x SNP genotypes

• Null hypothesis

• Dominant model df =1

• Recessive model df=1

• Additive model df = 1

• Any one of three genetic models 1 < df < 2

• Any difference between case and control df = 2

Table 13.5: RR constraints

• Test methods Table 13.6
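Since the Rassoc package is archived, the Cochran-Armitage trend test used below can also be sketched with base R's `prop.trend.test`; `cattBaseR` is a hypothetical helper of mine, and the weight x = 0, 0.5, 1 of the heterozygote column corresponds to the recessive, additive and dominant scores:

```r
# Cochran-Armitage trend test for a 2x3 case-control x genotype table via base R;
# cattBaseR is a hypothetical helper; x is the score of the heterozygote column
cattBaseR <- function(m, x = 0.5) {
  prop.trend.test(m[1, ], m[1, ] + m[2, ], score = c(0, x, 1))$p.value
}
m <- matrix(c(30, 20, 50, 50, 20, 30), nrow = 2, byrow = TRUE)
cattBaseR(m, 0.5)  # additive model
cattBaseR(m, 1)    # dominant model
cattBaseR(m, 0)    # recessive model
```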

### 13.4.1 Application of various tests to one contingency table

#library(Rassoc)
# Make a table from coordinates in 2D space
casecontRTheta<-function(R,t,af,f,N){
popW<-c(af^2+af*(1-af)*f,2*af*(1-af)*(1-f),(1-af)^2+af*(1-af)*f)
table<-matrix(c(popW*N[1],popW*N[2]),nrow=2,byrow=TRUE)
table[1,1]<-table[1,1]-R/2*cos(t)+sqrt(3)/2*R*sin(t)
table[2,1]<-table[2,1]+R/2*cos(t)-sqrt(3)/2*R*sin(t)
table[1,2]<-table[1,2]+R*cos(t)
table[2,2]<-table[2,2]-R*cos(t)
table[1,3]<-table[1,3]-R/2*cos(t)-sqrt(3)/2*R*sin(t)
table[2,3]<-table[2,3]+R/2*cos(t)+sqrt(3)/2*R*sin(t)
return(table)
}
af<-0.4;f<-0;N<-c(100,100) # allele frequency,HWE-f,sample size
x<-y<-seq(from=-10,to=10,by=1)
# apply multiple test methods to the generated table
cattp<-max3p<-df2p<-domp<-recp<-matrix(rep(0,length(x)*length(y)),nrow=length(x))
for(i in 1:length(x)){
for(j in 1:length(y)){
R<-sqrt(x[i]^2+y[j]^2)
if(x[i]==0){
t<-pi/2
}else{
t<-atan(y[j]/x[i])
}

popdata<-casecontRTheta(R=R,t=t,af=af,f=f,N=N)
if(min(popdata)>0){
catout<-CATT(popdata,x=0.5) # Cochran-Armitage trend test with x=0.5
domout<-CATT(popdata,x=1) # Cochran-Armitage trend test with x=1
recout<-CATT(popdata,x=0) # Cochran-Armitage trend test with x=0
max3out<-MAX3(popdata)
df2chi2out<-chisq.test(popdata) # df=2 chi-square test
cattp[i,j]<-catout$p
max3p[i,j]<-max3out$p
df2p[i,j]<-df2chi2out$p.value
domp[i,j]<-domout$p
recp[i,j]<-recout$p
}
}
}
filled.contour(x,y,-log(cattp,10),color=terrain.colors)
filled.contour(x,y,-log(max3p,10),color=terrain.colors)
filled.contour(x,y,-log(df2p,10),color=terrain.colors)
filled.contour(x,y,-log(domp,10),color=terrain.colors)
filled.contour(x,y,-log(recp,10),color=terrain.colors)

Figure 13.6.

### 13.4.2 Comparison of the likelihood ratios at discrete hypothesis space

When the likelihood for the alternative hypothesis has free parameter(s), twice the log-likelihood ratio can be interpreted with the $$\chi^2$$ distribution with df = number of parameters. However, when the alternative hypothesis does not have a continuous parameter, as in DNA identification, the likelihood ratio itself, not the p-value of a likelihood ratio test, should be used for judgement.

## 13.5 Dependency among multiple tests

(R13-6.R).

#library(Rassoc)
Niter<-1000
Nca<-1000
Nco<-1000
chiP<-cattP<-domP<-recP<-rep(0,Niter)
for(i in 1:Niter){
af<-runif(1)*0.6+0.2
df<-c(af^2,2*af*(1-af),(1-af)^2) # genotype frequencies under HWE
case<-sample(c(0,1,2),Nca,prob=df,replace=TRUE)
cont<-sample(c(0,1,2),Nco,prob=df,replace=TRUE)
m<-matrix(c(length(which(case==0)),length(which(case==1)),length(which(case==2)),length(which(cont==0)),length(which(cont==1)),length(which(cont==2))),nrow=2,byrow=TRUE)
chiP[i]<-chisq.test(m,correct=FALSE)$p.value
cattP[i]<-CATT(m,0.5)$p
domP[i]<-CATT(m,1)$p
recP[i]<-CATT(m,0)$p
}
plot(as.data.frame(cbind(chiP,domP,cattP,recP)),cex=0.1)
cormat<-cor(cbind(chiP,domP,cattP,recP))
cormat

When multiple tests are mutually dependent, multiple testing correction (Chapter 17) of them becomes more difficult.

## 13.6 Tables with various size

### 13.6.1 Data in the shape of table

Two variables are observed for samples. The relation between the two variables can be evaluated/tested in various ways. The data types of the variables determine the appropriate method.

### 13.6.2 Test methods for ordered or non-ordered categories

The details of each test can be investigated from its R source code.

# NxM table
# Randomly generate tables of a given size (NxM), apply multiple tests to them and compare the results
CompareTests<-function(N=2,M=2,Niter=1000,Ns=100,k1=10,k2=3){
library(MCMCpack)
library(clinfun)
pearsonp<-trendp<-kwp<-jtp<-lmp<-rep(0,Niter)
for(i in 1:Niter){
# Generate an NxM table at random
fn<-rdirichlet(1,rep(k1,N))
fm<-rdirichlet(1,rep(k2,M))
first<-sample(1:N,Ns,prob=fn,replace=TRUE)
second<-sample(1:M,Ns,prob=fm,replace=TRUE)
t<-table(first,second) # convert to a contingency table
pearsonp[i]<-chisq.test(t,correct=FALSE)$p.value # Pearson's independence test
if(N==2){
trendp[i]<-prop.trend.test(t[1,],t[1,]+t[2,],score=1:M)$p.value # trend chi-square test with additive model
}
kwp[i]<-kruskal.test(second~first)$p.value # Kruskal-Wallis
jtp[i]<-jonckheere.test(second,first,alternative="two.sided")$p.value # Jonckheere-Terpstra
lmp[i]<-summary(lm(second~first))$coefficients[2,4] # linear regression
}
databind<-cbind(pearsonp,kwp,jtp,lmp,trendp)
plot(as.data.frame(databind))
return(databind)
}
N<-2;M<-2;Niter<-100;Ns<-1000 #  2x2 table
ctout22<-CompareTests(N,M,Niter,Ns)
N<-2;M<-3;Niter<-100;Ns<-1000 #  2x3 table
ctout22<-CompareTests(N,M,Niter,Ns)
N<-2;M<-10;Niter<-100;Ns<-1000 # 2x10 table
ctout22<-CompareTests(N,M,Niter,Ns)
N<-3;M<-3;Niter<-100;Ns<-1000 # 3x3 table
ctout22<-CompareTests(N,M,Niter,Ns)
1. One axis is 2-categorical

2. When one axis has three or more categories without order in them

3. When both axes are ordered

Table 13.7

### 13.6.3 Comparison of performance of various test methods

When values are stored as vectors in R, R functions may accept them even when they are "not appropriate" from a statistician's standpoint.

What does "not appropriate" mean here?

m<-matrix(c(1,2,3,4)*3,nrow=2)
N<-sum(m)
x<-0:N # candidate values of cell (1,1)
m1<-apply(m,1,sum)
m2<-apply(m,2,sum)
y<--x+m1[1] # cell (1,2) fixed by the row margin
z<--x+m2[1] # cell (2,1) fixed by the column margin
w<--(x+y+z)+N # cell (2,2) fixed by the total
data<-matrix(c(x,y,z,w),ncol=4)
min4<-apply(data,1,min)
dataOK<-data[min4>=0,]
L<-length(dataOK[,1])
pFisher<-pChiNoCorrect<-ChiNoCorrect<-pLRT<-ChiLRT<-rep(0,L)
for(i in 1:L){
tmptable<-matrix(dataOK[i,],nrow=2)
chisqout<-chisq.test(tmptable,correct=FALSE)
LRTout<-likelihoodRatioTest(tmptable)
pChiNoCorrect[i]<-chisqout$p.value
ChiNoCorrect[i]<-chisqout$statistic
pLRT[i]<-LRTout$p.value
ChiLRT[i]<-LRTout$statistic
pFisher[i]<-fisher.test(tmptable)$p.value
}
ylim<-c(0,1)
plot(x[min4>=0],pChiNoCorrect,ylim=ylim,type="l")
par(new=T)
plot(x[min4>=0],pLRT,ylim=ylim,col="red",type="l")
par(new=T)
plot(x[min4>=0],pFisher,ylim=ylim,col="blue",type="l")

ylim<-c(log(min(pFisher),10),1)
plot(x[min4>=0],log(pChiNoCorrect,10),ylim=ylim,type="l")
par(new=T)
plot(x[min4>=0],log(pLRT,10),ylim=ylim,col="red",type="l")
par(new=T)
plot(x[min4>=0],log(pFisher,10),ylim=ylim,col="blue",type="l")

ps<-cbind(pChiNoCorrect,pLRT,pFisher)
plot(as.data.frame(ps),cex=1)
plot(log(as.data.frame(ps),10),cex=1)

m<-matrix(c(1,2,3,4)*100,nrow=2)
N<-sum(m)
x<-0:N
m1<-apply(m,1,sum)
m2<-apply(m,2,sum)
y<--x+m1[1]
z<--x+m2[1]
w<--(x+y+z)+N
data<-matrix(c(x,y,z,w),ncol=4)
min4<-apply(data,1,min)
dataOK<-data[min4>=0,]
L<-length(dataOK[,1])
pFisher<-pChiNoCorrect<-ChiNoCorrect<-pLRT<-ChiLRT<-rep(0,L)
for(i in 1:L){
tmptable<-matrix(dataOK[i,],nrow=2)
chisqout<-chisq.test(tmptable,correct=FALSE)
LRTout<-likelihoodRatioTest(tmptable)
pChiNoCorrect[i]<-chisqout$p.value
ChiNoCorrect[i]<-chisqout$statistic
pLRT[i]<-LRTout$p.value
ChiLRT[i]<-LRTout$statistic
pFisher[i]<-fisher.test(tmptable)$p.value
}
ylim<-c(0,1)
plot(x[min4>=0],pChiNoCorrect,ylim=ylim,type="l")
par(new=T)
plot(x[min4>=0],pLRT,ylim=ylim,col="red",type="l")
par(new=T)
plot(x[min4>=0],pFisher,ylim=ylim,col="blue",type="l")

ylim<-c(log(min(pFisher),10),1)
plot(x[min4>=0],log(pChiNoCorrect,10),ylim=ylim,type="l")
par(new=T)
plot(x[min4>=0],log(pLRT,10),ylim=ylim,col="red",type="l")
par(new=T)
plot(x[min4>=0],log(pFisher,10),ylim=ylim,col="blue",type="l")
abline(v=120)
abline(v=20)
abline(v=220)
abline(h=log(pChiNoCorrect[21],10))

ps<-cbind(pChiNoCorrect,pLRT,pFisher)
plot(as.data.frame(ps),cex=0.1)

plot(log(as.data.frame(ps),10),cex=0.1)

# Chapter 14 Association and causal relation !

## 14.1 Cause and result and time

• Epidemiology can address causal relations.

• Chronological order, prospective studies and retrospective studies

There are the following four patterns for relation between A and B.

If A is the cause of B

If B is the cause of A

If both A and B are caused by C

If A and B together are the cause of C

Figure 14.1.

## 14.2 Genotypes that are causative

Genotypes come first: they are fixed at conception, so they precede phenotypes in time.

## 14.3 Directed graph, Bayesian network

Order or directional relation is expressed as directed edge in graphs or networks.

Cycles make causation hard to handle with a graph; that is why an acyclic graph is often required for network analysis.

In particular, Bayesian networks are directed acyclic graphs.

The following uses the bnlearn package of R (R14-1.R).

library(bnlearn)
e <- empty.graph(LETTERS[1:5]) # 5 nodes
arc.set <- matrix(c("A", "C", "B", "D", "C", "E","D","A","A","E"), ncol = 2, byrow = TRUE)
arcs(e)<-arc.set # set edges
plot(e)
# generate data set
set.seed(123456)
Ns<-10000
A<-rpois(Ns,3);B<-rpois(Ns,3);C<-rpois(Ns,3);D<-rpois(Ns,3);E<-rpois(Ns,3)

B[1:(Ns/4)]<-A[1:(Ns/4)];C[(1+Ns/4):(Ns/2)]<-A[(1+Ns/4):(Ns/2)]
D[(1+Ns/2):(3*Ns/4)]<-A[(1+Ns/2):(3*Ns/4)];E[(1+3*Ns/4):Ns]<-A[(1+3*Ns/4):Ns]
d<-as.data.frame(cbind(A,B,C,D,E))+0.0
res<-gs(d) # Network estimation with Grow-Shrink method, one of multiple methods
plot(res)
# If networks are at random...
plot(random.graph(LETTERS[1:5], num = 1))
# Compare the estimated graph and random graphs for goodness of data description
GSscore<-score(res, d) # score the goodness of the estimated graph
Niter<-100 # number of random graphs to generate
scores<-rep(0,Niter) # score the goodness of random graphs
for(i in 1:Niter){
randomGraph<-random.graph(LETTERS[1:5], num = 1)
scores[i]<-score(randomGraph,d)
}
ylim<-c(min(scores,GSscore),max(scores,GSscore))
plot(scores,ylim=ylim,pch=20)
abline(h=GSscore)
bn.fit(res,d)

# Part V Large-scale

Chapter 15 Enumeration

Chapter 16 Omitting something

Chapter 17 A lot of tests

Large-scale data science, including genome science, features huge amounts of data and many factors. Part V focuses on how to enumerate what is huge and how to deal with what is enormous.

# Chapter 15 Enumeration

## 15.1 Permutation, repeated permutation, exact probability of contingency table

1. Permutation and combination

$\begin{equation*} P(N,k)=N(N-1)(N-2)...(N-(k-1))=\frac{N!}{(N-k)!}=\frac{\Gamma(N+1)}{\Gamma(N-k+1)} \end{equation*}$

The $$\Gamma$$ function is important here.

permN<-function(N=10,k=3){ exp(lgamma(N+1)-lgamma((N-k)+1)) }
permN(N=10,k=3)

Repeated permutation $\begin{equation*} \Pi(N,k)=N^k \end{equation*}$

repPermN<-function(N=10,k=3){ N^k }
repPermN(N=10,k=3)

In total there are k objects, each labeled with one of 1,2,…,N; the count of label i should be $$n_i$$. $\begin{equation*} \frac{k!}{\prod_{i=1}^N n_i!};\;\; \sum_{i=1}^N n_i=k \end{equation*}$
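As a check (mine, not from the source), the multinomial count agrees with choosing the positions of each label in turn:

```r
# k!/(prod ni!) equals choosing the positions of each label in turn
multinomCount <- function(ns) {
  k <- sum(ns)
  exp(lgamma(k + 1) - sum(lgamma(ns + 1)))
}
ns <- c(3, 2, 1)  # n1, n2, n3 with k = 6
viaChoose <- choose(6, 3) * choose(3, 2) * choose(1, 1)
c(multinomCount(ns), viaChoose)  # both 60
```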

For the first labeling, $\begin{equation*} \frac{n_{..}!}{\prod_{i=1}^N n_{i.}!} \end{equation*}$ and for the second, $\begin{equation*} \frac{n_{..}!}{\prod_{j=1}^M n_{.j}!} \end{equation*}$

Combining the two, $\begin{equation*} X= \frac{n_{..}!}{\prod_{i=1}^N n_{i.}!} \times \frac{n_{..}!}{\prod_{j=1}^M n_{.j}!} \end{equation*}$

Now consider combined labels of $$N\times M$$ types. $\begin{equation*} Y=\frac{n_{..}!}{ \prod_{i=1}^N\prod_{j=1}^M n_{ij}!} \end{equation*}$

The ratio of Y to X is the exact probability of an NxM table. $\begin{equation*} \frac{Y}{X}=\frac{n_{..}!}{ \prod_{i=1}^N\prod_{j=1}^M n_{ij}!} \times \frac {\prod_{i=1}^N n_{i.}!} {n_{..}!} \times \frac{\prod_{j=1}^M n_{.j}!}{n_{..}!} = \frac{\prod_{i=1}^N n_{i.}! \prod_{j=1}^M n_{.j}!}{n_{..}! \prod_{i=1}^N \prod_{j=1}^M n_{ij}!} \end{equation*}$
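For a 2x2 table this ratio reduces to the hypergeometric probability, which can be checked against base R's `dhyper` (a check of my own; `exact2x2` is a helper defined here):

```r
# Exact probability of one 2x2 table from the factorial formula,
# compared with the hypergeometric density dhyper
exact2x2 <- function(m) {
  exp(sum(lgamma(rowSums(m) + 1)) + sum(lgamma(colSums(m) + 1)) -
      lgamma(sum(m) + 1) - sum(lgamma(m + 1)))
}
m <- matrix(c(10, 20, 30, 40), nrow = 2)
p1 <- exact2x2(m)
p2 <- dhyper(m[1, 1], rowSums(m)[1], rowSums(m)[2], colSums(m)[1])
c(p1, p2)  # identical
```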

### 15.1.2 Combination, repeated combination, number of diploid genotypes

$\begin{equation*} C(N,k)=\frac{N!}{k!(N-k)!}=\frac{P(N,k)}{k!} \end{equation*}$

choose(20,3)

Repeated combination: the number of diploid genotypes with N alleles is $$H(N,2)$$. $\begin{equation*} H(N,k)=C(N+k-1,k)= \sum_{i=0}^k (C(N,k-i) \times H(k-i,i)) = \sum_{i=0}^k (C(N,k-i) \times C(k-1,i)) \end{equation*}$

This can be modeled as follows: N bins are separated by N-1 bars, and k items are placed into the N bins. Neither the N-1 bars nor the k items can be distinguished from each other, so we count the arrangements of bars and items.

$\begin{equation*} H(N,k)=\frac{(N-1+k)!}{(N-1)!\;k!}=C(N+k-1,k) \end{equation*}$

repCombN<-function(N=10,k=3){
choose(N+k-1,k)
}
repchoose<-function(n=10,k=3){
ret<-0
for(i in 0:k){
ret<-ret+choose(n,k-i)*repCombN(k-i,i)
}
ret
}
repCombN(N=10,k=3)
repchoose(n=10,k=3)

## 15.2 Number of partitions — Stirling number and Bell number

What is a partition?

Put N items into k groups. Every one of the k groups should have one or more items.

The Stirling number of the second kind, $$St2(N,k)$$, is the number of ways of such partitioning.

It satisfies the following equation. $\begin{equation*} St2(N+1,k)=St2(N,k)\times k +St2(N,k-1) \end{equation*}$

When k=2, it is related with the sum of the ways to choose i (i=1,2,…N-1) out of N: \begin{align*} St2(N,2)&=\frac{1}{2} \sum_{i=1}^{N-1} C(N,i) \\ &= \frac{1}{2} (\sum_{i=0}^N C(N,i) -(C(N,0)+C(N,N)) )\\ &= \frac{1}{2} ((1+1)^N -2) = 2^{N-1}-1 \end{align*}
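The closed form can be checked against the recursion above (a quick check of mine; `st2` is a direct implementation):

```r
# Stirling number of the second kind via the recursion
# St2(N+1,k) = k*St2(N,k) + St2(N,k-1)
st2 <- function(N, k) {
  if (k < 1 || k > N) return(0)
  if (k == 1 || k == N) return(1)
  k * st2(N - 1, k) + st2(N - 1, k - 1)
}
sapply(2:8, function(N) st2(N, 2))  # 1, 3, 7, 15, 31, 63, 127
2^(1:7) - 1                         # the closed form gives the same values
```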

Partitioning N items into any number of groups gives the Bell number. $\begin{equation*} B(N)=\sum_{i=1}^N St2(N,i) \end{equation*}$ or $\begin{equation*} B(N+1)=\sum_{i=0}^N C(N,i) B(i) \end{equation*}$

Figure 15.2 shows that one enumeration can be interpreted with multiple meanings.

Stirling2N<-function(N=10,k=3){
ret<-0
if(k<=N && k>=1){
if(k==1){
ret<-1
}else{
ret<-Stirling2N(N-1,k)*k+Stirling2N(N-1,k-1)
}
}
ret
}
Stirling2N(N=10,k=3)
bellN<-function(N=10){
bs<-rep(0,N+1)
bs[1]<-1
if(N>=1){
for(i in 2:(N+1)){
for(j in 0:(i-2)){
bs[i]<-bs[i]+choose(i-2,j)*bs[j+1]
}
}
}
bs[2:length(bs)]
}
bellN(N=10)

## 15.3 Partition and integration of categories

### 15.3.1 When categories are non-ordered

When we have multiple categories, we may want to group them into fewer categories. That is partitioning.

For example, the dominant genetic model groups the 3 genotypes into 2 classes.

### 15.3.2 When categories are ordered

When categories have an order, partitioning them means placing border(s) along the order.

Figure 15.3.

$\begin{equation*} B_{order}(N)=\sum_{i=1}^{N-1} C(N-1,i)=2^{N-1}-1=St2(N,2) \end{equation*}$

(R15-7.R).

Border<-function(N){
2^{N-1}-1
}
Border(10)

## 15.4 Number of shapes of tree, number of graphs — tree, clustering, Bayesian network

### 15.4.1 Number of patterns of trees

When performing hierarchical clustering, it is necessary to know the number of ways to construct trees with N leaves.

One such count is the Catalan number, which counts, among other things:

• Rooted ordered trees with N edges

• Binary trees with N internal (non-leaf) vertices, i.e., with N+1 leaves

CatalanN <- function(N = 10) {
exp(lgamma(2 * N + 1) - lgamma(N + 2) - lgamma(N + 1))
}
CatalanN(N=10)

### 15.4.2 Number of patterns of clustering

Double-factorial number

$\begin{equation*} (2N-1)!!=1\times 3\times 5 \times ...\times (2N-1) \end{equation*}$
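As a sanity check (mine, not from the source), $(2N-1)!!$ equals $(2N)!/(2^N N!)$:

```r
# (2N-1)!! = 1*3*...*(2N-1) equals (2N)! / (2^N * N!)
oddDoubleFactorial <- function(N) prod(seq(1, 2 * N - 1, by = 2))
viaFactorial <- function(N) factorial(2 * N) / (2^N * factorial(N))
sapply(1:6, oddDoubleFactorial)  # 1, 3, 15, 105, 945, 10395
sapply(1:6, viaFactorial)        # the same
```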

The number of clustering trees with N leaves, i.e., unrooted binary trees, is: $\begin{equation*} (2N-5)!! \end{equation*}$

(R15-9.R, Figure 15.5).

library(phangorn)
allTrees(5) # No. items = 5
trees<-allTrees(5)
plot(trees)
doubleFactorialN<-function(N=5){
if(N<3){
1 # a single tree when N<3
}else{
ret<-1
for(i in 3:N){
ret<-ret*(2*(i-2)-1)
}
ret
}
}
doubleFactorialN(N=5)

### 15.4.3 Number of undirected graphs, number of directed graphs and number of acyclic directed graphs

Undirected graph:

$$\frac{N(N-1)}{2}$$ is the number of possible edges.

$\begin{equation*} 2^{\frac{N(N-1)}{2}} \end{equation*}$

is the number of undirected graphs without self-loops.

Directed graph: $\begin{equation*} 3^{\frac{N(N-1)}{2}} \end{equation*}$ because each pair of vertices is in one of the states {no edge, a->b, a<-b}.

The acyclic directed graph is the type of graph used for Bayesian networks: \begin{align*} N_{acg}(0)=1\\ N_{acg}(N)=\sum_{k=1}^N (-1)^{k+1} C(N,k)\times 2^{k(N-k)} \times N_{acg}(N-k)\\ \end{align*}
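A direct implementation of this recursion (a sketch of mine, not the book's code) reproduces the first terms of OEIS A003024:

```r
# Number of labeled acyclic digraphs on N nodes via the inclusion-exclusion recursion
nDAG <- function(N) {
  if (N == 0) return(1)
  s <- 0
  for (k in 1:N) {
    s <- s + (-1)^(k + 1) * choose(N, k) * 2^(k * (N - k)) * nDAG(N - k)
  }
  s
}
sapply(0:5, nDAG)  # 1, 1, 3, 25, 543, 29281
```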

You can find these integer sequences at http://www.research.att.com/~njas/sequences/

• Permutations (Factorial numbers): id A000142
• Stirling numbers of the second kind: id A008277
• Bell or exponential numbers: id A000110
• Catalan numbers: id A000108
• Double factorial numbers: id A001147
• Number of acyclic digraphs (or DAGs) with n labeled nodes: id A003024

# Chapter 16 Omitting something

## 16.1 Sample at random, Walk at random

### 16.1.1 Sample at random from a known distribution

Pseudo-random number sequences are used (R16-1.R, Figure 16.1).

library(MCMCpack)
d<-rdirichlet(1000,c(1,1,1)) # Random sampling from dirichlet distribution
plot(d[,1],sqrt(3)/2*(d[,2]-d[,3]),xlim=c(0,1),ylim=c(-sqrt(3)/2,sqrt(3)/2))
segments(c(0,1,0),c(-sqrt(3)/2,0,sqrt(3)/2),c(1,0,0),c(0,sqrt(3)/2,-sqrt(3)/2)) # Triangle

### 16.1.2 Sample at random from samples; resampling and permutation

1. Resampling

We have samples.

We re-sample from the sample pool, either with or without replacement.

x<-1:10
#bootstrap
sample(x,replace=TRUE)
#permutation
sample(x,replace=FALSE)

Used in cross-validation and population parameter estimation.
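For example, a bootstrap estimate of the standard error of the mean (a sketch of mine with simulated data, not from the source):

```r
# Bootstrap standard error of the mean: resample with replacement many times
set.seed(1)
x <- rnorm(50, mean = 10, sd = 2)  # simulated data for this sketch
bootMeans <- replicate(2000, mean(sample(x, replace = TRUE)))
sd(bootMeans)           # bootstrap estimate of the standard error
sd(x) / sqrt(length(x)) # analytic standard error, for comparison
```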

2. Permutation

Changing the order of 1,2,…,n.

$$n!$$ ways.

It is related to the exact probability test: the exact test enumerates all patterns with the marginal counts fixed, and permutation likewise treats the given set of samples as fixed.

The following is Monte-Carlo permutation-based p-value estimation.

# Data set generation
A<-1:8
B<-c(1,3,2,6,4,7,8,5)
# All permutations vs. monte-carlo permutations.
PermutationTestJT<-function(A,B,MonteCarlo=TRUE,n=1000){# n : in case of monte-carlo
library(clinfun) # package of jonckheere.test()
ret<-0
Ps<-NULL
jtp<-jonckheere.test(A,B)$p.value # p-value of the observed data
if(MonteCarlo || length(A)>10){ # if element size > 10, use Monte-Carlo
Ps<-rep(0,n)
for(i in 1:n){
tmp<-sample(1:length(A),length(A))
Ps[i]<-jonckheere.test(A[tmp],B)$p.value
}
}else{ # full permutation
library(gtools) # package for permutations()
perms<-permutations(length(A),length(A)) # all permutations will return
Ps<-rep(0,length(perms[,1]))
for(i in 1:length(perms[,1])){
Ps[i]<-jonckheere.test(A[perms[i,]],B)$p.value
}
}
ret<-length(Ps[Ps<=jtp])/length(Ps)
list(originalp.value=jtp,permp.value=ret,Ps=Ps)
}
out1<-PermutationTestJT(A,B,MonteCarlo=FALSE) # full permutation result = exact test
out2<-PermutationTestJT(A,B,MonteCarlo=TRUE,n=100) # Monte-Carlo result
out1$originalp.value
out1$permp.value
out2$permp.value
plot(sort(out1$Ps))
abline(h=out1$permp)

### 16.1.3 Random walk

Random walk based on a Markov chain.

A plain random walk is supposed to wander everywhere evenly.

A weighted random walk can make the procedure more efficient, using Metropolis-Hastings or methods based on conjugate priors.
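A minimal Metropolis-Hastings sketch (mine, for illustration): a weighted random walk whose stationary distribution is a chosen target, here the standard normal:

```r
# Metropolis-Hastings random walk targeting the standard normal density:
# propose a normal step, accept it with probability min(1, density ratio)
set.seed(123)
nstep <- 20000
xs <- numeric(nstep)
for (i in 2:nstep) {
  prop <- xs[i - 1] + rnorm(1)                        # proposed step
  ratio <- dnorm(prop) / dnorm(xs[i - 1])             # target density ratio
  xs[i] <- if (runif(1) < ratio) prop else xs[i - 1]  # accept or stay
}
c(mean(xs), sd(xs))  # approximately 0 and 1
```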

nstep<-100
rwalk<-matrix(0,nstep,2)
rtheta<-rnorm(nstep-1)
stepx<-cos(rtheta)
stepy<-sin(rtheta)
for(i in 1:(nstep-1)){
rwalk[i+1,1]<-rwalk[i,1]+stepx[i]
rwalk[i+1,2]<-rwalk[i,2]+stepy[i]
}
plot(rwalk,type="l")

## 16.2 Use a part that is principal

### 16.2.1 Approximation

1. Fit to a model

The exact test and the $$\chi^2$$ test are approximately the same.

Setting a model function and estimating its coefficients by an optimization procedure is also approximation.

Extreme value distributions are used as the model distribution of minimum p-values in multiple-testing settings.

$\begin{equation*} G(x)=\exp\left(-\left(1+\zeta\frac{x-\mu}{\sigma}\right)^{-\frac{1}{\zeta}}\right) \end{equation*}$

The formula looks scary, but all we want to do is obtain good estimates of its parameters.

# random chisq value (df=1)
N<-100000;M<-100;t<-rchisq(N,1)
# group into 100 groups
matt<-matrix(t,nrow=M)
# max of each set
maxt<-apply(matt,1,max)
#evd package
#Generalized Extreme Value Distribution parameter estimation
# Compare the estimated distribution and samples
library(evd)
gevest<-fgev(maxt)
qfromgev<-qgev(ppoints(length(maxt),a=0),loc=gevest$estimate[1],scale=gevest$estimate[2],shape=gevest$estimate[3])
plot(ppoints(length(maxt),a=0),sort(maxt),ylim=c(0,max(maxt)),type="p",cex=1,col=gray(5/8))
par(new=T)
plot(ppoints(length(maxt),a=0),qfromgev,ylim=c(0,max(maxt)),type="l")

2. Fit to a format

Take out the big components and throw away the tiny ones.

In eigenvalue decomposition, each eigenvalue stands for the significance of the corresponding feature.

Polynomial approximation is the same idea.

$\begin{equation*} f(x)=a_0+a_1 x + a_2 x^2 +... + a_k x^k \end{equation*}$

sortedMaxt<-sort(maxt) # max chisq values, sorted
s<-ppoints(length(sortedMaxt),a=0)
# for example, y=a+bx+cx^2 approximation
f2<-function(x){
sum((sortedMaxt-(x[1]*s^0+x[2]*s^1+x[3]*s^2))^2)
}
optout2<-optim(rep(1,3),f2,method="BFGS")
ylim<-c(0,max(sortedMaxt))
plot(s,sortedMaxt,ylim=ylim,type="l")
optt<-rep(0,length(s))
for(i in 1:length(optt)){
optt[i]<-sum(optout2$par*s[i]^(0:(length(optout2$par)-1)))
}
par(new=T)
plot(s,optt,ylim=ylim,col="red",type="l")

####
# Optional: polynomial fits of degrees 1, 3, 4 and 5, done in the same way
fitPoly<-function(k){
f<-function(x){ # sum of squared residuals of a degree-k polynomial
sum((sortedMaxt-sapply(s,function(si)sum(x*si^(0:k))))^2)
}
optout2<-optim(rep(1,k+1),f,method="BFGS")
plot(s,sortedMaxt,ylim=ylim,type="l")
optt<-rep(0,length(s))
for(i in 1:length(optt)){
optt[i]<-sum(optout2$par*s[i]^(0:k))
}
par(new=T)
plot(s,optt,ylim=ylim,col="red",type="l")
}
for(k in c(1,3,4,5)) fitPoly(k)