Preparation

Install the packages used in the book. Package gregmisc is no longer available; it is replaced by gplots in this document. Package Rassoc is no longer available; all of its functions were copied from the archived version.

The Rassoc package is no longer maintained. You can get the archived version from http://cran.r-project.org/web/packages/Rassoc/index.html .

The following functions are copy-pasted from the archived version.

Preface !

Statistical genetics and genetic statistics - genetics, genomics, and statistics

This book was written to be used as a textbook of statistical genetics, or genetic statistics. Statistical genetics is a field that interprets biological phenomena mathematically, with heredity as the central axis.

A parent and a child are similar because genes are transmitted from parent to child. Although they are very similar to each other, they can still be distinguished. Individuals in a population are also mutually similar, yet there is variation in the population. Genetics seeks to understand this variation in similarity through the transmission of genes and their functions.

The gene set of a child consists of half of the mother's set and half of the father's set. Conversely, only half of each parent's gene set is transmitted to the child. Since the choice of this half occurs at random, genetic phenomena are stochastic. Another field concerned with stochastic phenomena is "statistics". "Genetics" and "statistics" share their basis; in fact, one root of "statistics" was "genetics".

Currently, genetics aims to understand biological phenomena with genomic approaches and from molecular-genetic aspects, using large-scale data sets. Statistics, on the other hand, has broadened its usage to almost all academic fields (Figure 0.1).

Mathematical-statistical approaches in genetics and genomics are very wide-ranging. This textbook covers gene-mapping-specific statistical issues and the basics of statistical thinking that are shared by the heterogeneous statistical methods of genetic statistics.

Organization of this document

This document covers a wide range of research on genetic phenomena and gene function. It was constructed by focusing on what is common to these topics when understanding them from the statistical side.

Part I deals with biological matters related to heredity and genes.

Part II covers the basics of handling data. It focuses on categorical data types because of the discreteness of genes. It also explains methods that focus on population aspects of samples or on individual relations among samples.

Part III covers descriptive statistics and the basics of estimation and testing: dimensions and degrees of freedom, probability, likelihood, distributions, and related indices. Likelihood and linkage analysis are also handled in this part.

Part IV describes how to derive meaning from data.

Part V is on the issues when data sets are large or complicated.

How to use this book


As described above, this book was constructed by focusing on what is common when understanding the statistical aspects of research on genetic phenomena and gene function. Conversely, we deliberately chose not to treat detailed commentary on the various techniques used for gene mapping and gene-function analysis, or the definitions of the various indices. One can of course apply individual methods and use indices properly, but if the aim were only to learn about them one by one, readers would not build the strength to cope with the new methods and indices that will be developed and proposed in the future. Instead, by putting weight on the concepts that are common to a variety of techniques and indices, we hope readers will acquire the ability to apply them. The image is like studying a language at a desk: by mastering the foundations, one aims to be able to read through sentences seen for the first time. Accordingly, this document is not organized so that you read one chapter if you are interested in mapping and another chapter if you want to know about clustering; matters related to a single research theme come and go across chapters, and it is best to read while cross-referencing them.

In addition, terms and concepts that appear more than once are not necessarily explained carefully at their first appearance; they are described at the place where a careful description is considered most useful. This avoids making the first half of the book heavy with definitions and descriptions given at first appearance. Therefore, please proceed with reading even while some things remain somewhat vague, as long as the overall meaning comes through; they should become convincing when you read back over them. When reading ahead, if a term highlighted in bold type is difficult to understand, please look it up in the index, which shows the pages where the term appears. When a term appears in multiple places, the page that explains it most carefully is listed in bold, so read the description on that page; on that page the term itself is set in a distinct bold font. In terms of a web article, it is as if cross-links were stretched here and there; please use the index with the intention of jumping along them.

In addition, we make extensive use of figures, and we include the source code, in the free statistical software R, used to draw them. Grasp the content through the natural-language text, picture it through the figures, confirm the exact representation through the formulas, and use the R source to help understanding. Since the R source is often a simulation, we recommend that you run it and try outputting results under different conditions. Some figures may be easier to understand in color; you can browse them at http://www.genome.med.kyoto-u.ac.jp/func-gen-photo/index.php?album=StatGenetTextbook. The R source used in this document can be downloaded from http://www.genome.med.kyoto-u.ac.jp/StatGenet/lectures/2010/StatGenetTextbook/Rsrc.zip.

People who helped me in the writing of this book


In writing this book, I became indebted to many people. In particular, I learned a great deal from the members of the private meetings who took on my everyday research consultations: Masato Inoue (Waseda University, Faculty of Science and Engineering), Seiya Imoto (University of Tokyo Institute of Medical Science, Human Genome Center), Seiko Endo (Daiichi Sankyo Co., Ltd.), Jun Ohashi (University of Tsukuba, Graduate School of Comprehensive Human Sciences), and Natsuhiko Kumasaka (RIKEN Center for Genomic Medicine, statistical analysis research team) (in alphabetical order). I would like to take this opportunity to thank them. Among these members, Jun Ohashi in particular gave me a great deal of advice over a wide range of topics while I was writing this book; I am truly thankful. In addition, Chikashi Terao (Kyoto University Hospital, Department of Rheumatology and Clinical Immunology) and Takahisa Kawaguchi (Kyoto University Graduate School of Medicine, Center for Genomic Medicine) checked the document and the R source, besides being regular collaborators in research; Kawaguchi also wrote the R source of the demonstration programs used in this book. I am also grateful to my wife Junko who, from a position far from genetic statistics, looked through the entire manuscript and gave me opinions on its structure. And more than anything, I thank everyone in the editorial department of Ohmsha, who provided the opportunity to publish a book in the field of statistical genetics and who were patient even though the content and structure changed a great deal as the writing advanced.

This book was thus finished thanks to many people. We hope that it makes many readers even a little more interested in statistical genetics, and that it gives people of diverse academic backgrounds a chance to work in this field.

Reference books


Here we mention books on genetics, statistics, and R that are related to reading this book. This book is neither an introduction to be read before the books listed here, nor a technical book to be read only after them. Either may be read first, but reading both seems to deepen understanding.

Part I Genotypes and phenotypes !

Chapter 1 Heredity— Similar or not-similar !

1.1 Traits are inherited

1.1.1 What is heredity? !

What is heredity? It is the sharing of certain features where there is a connection of blood. When a human child is born from human parents, the feature of being human is shared by the parents and the child. If this were all too obvious, human beings would not have found it interesting and might not have created the word "heredity" to pinpoint the phenomenon. However, a child shares some features with its parents but not others; this attracted our ancestors' interest, the idea of "heredity" was born, and it seems to have been the starting point of "genetics".

1.1.2 Characteristics of an organism—Trait and phenotype !

A feature of an organism is called a trait. Any feature can be a trait. By choosing which traits to be interested in, you determine the contents of a study; by taking a new analytical point of view, you can also define a new trait. In that sense, it is useful to organize points of view for considering traits. One way to classify traits is by which of the five senses (sight, hearing, taste, smell, touch) assesses them. Another classification divides them into anatomical/structural, physiological/functional, molecular-biological, and pharmacological. One can also classify traits by whether they are features of the individual itself or features of the individual's relation to other organisms and the environment (such as behavior or immune response). Other useful points of view include whether the trait is a mathematical concept or something physically measurable, and how its appearance changes. Data are obtained by observing these traits. The observed data of a trait are the phenotype.

1.1.3 Identity and diversity !

Genes and their transmission from parents to offspring have two aspects:

  • making offspring almost identical to the parents

  • making offspring different from the parents

Heredity is a mixture of identity and diversity.

Genetics studies hereditary phenomena (a mixture of identity and diversity) as realizations of stochastic events in genes.

Fitness is an important idea.

(Figure 1.1 )


An organism is characterized by leaving something like itself to the next generation. In asexual reproduction, the offspring directly takes over the parent's genes; in sexually reproducing organisms, including humans, the offspring is not an identical individual at all, but takes over halves of the gene sets of its two parents. Speaking in terms of adaptation, since the parents are living well, it might be a good strategy for the child to be the same as the parents. On the other hand, focus not on the individual but on the species it belongs to. If all individuals of a species were identical, then when the environment changes, all of them might find it difficult to live, all at once; that can be thought of as a bad idea. If there is a variety of individuals, the possibility of annihilation of the species is reduced even if the environment changes, so being composed of a variety of individuals can be considered a good idea. In this way, genetic phenomena have a dual nature, securing both identity and diversity.

1.2 Gene !

1.2.1 What is gene?

What is gene?

It is quite difficult.

What is genome?

It is a bit easier than “what is gene?”.

1.2.2 Chromosomes

Basic knowledge on chromosomes is necessary.

Use R

Try to draw a bar graph that represents the lengths of the human chromosomes (R1-1.R).

# Length of chromosomes (bp)
# A vector of chromosome length
chromLen<-c(247249719,242951149,199501827,191273063,180857866,
170899992,158821424,146274826,140273252,135374737,134452384,
132349534,114142980,106368585,100338915,88827254,78774742,
76117153,63811651,62435964,46944323,49691432,154913754,57772954)
# barplot. Chromosome labels are on the X axis and their lengths on the Y axis. The col argument specifies the bar color.
barplot(chromLen,names=c(1:22,"X","Y"),col="black")

1.2.3 Locus, allele, haplotype, diplotype, phenotype

Define the terms below.

locus (plural loci), base, SNP, allele, haplotype, exon, intron, genotype, diplotype, phenotype

1.2.4 Diploid, homozygous, heterozygous, genotype, phenotype, mode of inheritance !

Define the terms below.

diploid, homozygote, heterozygote, the 2x3 table of a binary phenotype and the three genotypes of a biallelic locus

Autosomes exist in pairs, one from each parent. This means you have two duplicate sets of genetic information. Organisms with two such sets are called diploid. When an individual has two identical alleles at a locus, it is said to be homozygous for that locus; when it has two different alleles, it is heterozygous. From here on, we handle diploid organisms only. Humans are also diploid. Diploid organisms have two alleles at each locus on autosomes. The combination of the two alleles an individual has is its genotype. On the other hand, the appearance of an individual's traits is its phenotype. The most clear-cut relationships between genotype and phenotype appear as the dominant mode of inheritance and the recessive mode of inheritance. Assume two alleles M and m, and a trait with phenotypes "yes" and "no". Individuals have one of three genotypes: MM, Mm, and mm.

Assume that allele m brings the "yes" phenotype. The relation between genotypes and phenotypes is a \(2\times 3\) table:

\[
\begin{array}{|c|c|c|c|c|}
\hline
 & MM & Mm & mm & \text{sum}\\
\hline
\text{yes} & n_{10} & n_{11} & n_{12} & n_{1.} \\
\hline
\text{no} & n_{20} & n_{21} & n_{22} & n_{2.} \\
\hline
\text{sum} & n_{.0} & n_{.1} & n_{.2} & n_{..} \\
\hline
\end{array}
\]

\[
\sum_{i=1}^2 n_{ij}=n_{.j};\quad \sum_{j=0}^2 n_{ij}=n_{i.};\quad \sum_{i=1}^2 \sum_{j=0}^2 n_{ij}=n_{..}
\]

The rates of the phenotype being "yes", \(f_j=\frac{n_{1j}}{n_{.j}}, j=0,1,2\), differ between genotypes. Take MM as the reference genotype. Then each genotype's "yes" fraction can be expressed as a fold change from MM's fraction, \(\lambda_j = f_j / f_0\); this is the genotype relative risk (GRR). By definition \(\lambda_0=1\). When the GRR of the heterozygote (Mm) is equal to that of the homozygote (mm) (\(\lambda_1=\lambda_2\)), the mode of impact of m on this phenotype is called the dominant mode of inheritance. On the contrary, when \(\lambda_1=\lambda_0\), we call it the recessive mode of inheritance.

There are intermediate forms that fit neither the dominant nor the recessive mode. There are two parameterizations. One is based on the formula \[ \lambda_1=x \lambda_2 + (1-x) \lambda_0 \] in which x = 0.5 is the additive model, x = 0 is recessive, and x = 1 is dominant.

The other parameterization is \[ \lambda_1=\lambda_2^y \times \lambda_0^{1-y} \] in which y = 0 is recessive, y = 0.5 is multiplicative, and y = 1 is dominant.

When x = 0.5, \[ \lambda_1=\frac{\lambda_0+\lambda_2}{2}, \] the arithmetic mean of \(\lambda_0\) and \(\lambda_2\).

When y = 0.5, \[ \lambda_1=\sqrt{\lambda_0 \times \lambda_2}, \] their geometric mean.

In addition, when the effect of having one copy of m falls outside the range between the effects of zero copies and two copies, it is called overdominance.

In terms of x or y, intermediate effects are expressed by \(0 < x, y < 1\), and all other effects, including overdominance, are expressed by \(x, y < 0\) or \(x, y > 1\).

\(-\infty < x, y < \infty\) covers all inheritance models.
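As a quick illustration (a minimal sketch, with assumed values \(\lambda_0=1\) and \(\lambda_2=4\), chosen only for demonstration), the following R code computes \(\lambda_1\) under both parameterizations; x = 0.5 gives the arithmetic mean and y = 0.5 the geometric mean.

lambda0<-1 # assumed reference GRR
lambda2<-4 # assumed homozygote GRR
lambda1.additive<-function(x){x*lambda2+(1-x)*lambda0}
lambda1.multiplicative<-function(y){lambda2^y*lambda0^(1-y)}
lambda1.additive(c(0,0.5,1)) # recessive, additive (2.5 = arithmetic mean), dominant
lambda1.multiplicative(c(0,0.5,1)) # recessive, multiplicative (2 = geometric mean), dominant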

Chapter 2 DNA, RNA, protein, trait !

2.1 DNA double strand

(Figure 2.1).

2.1.1 Replication, mutation, recombination

Mutations

Somatic mutations Germline mutations

Crossovers and recombination and recombinant

Figure 2.2

2.1.2 Origin is the same—IBD

Identity by descent (IBD) vs Identity by state (IBS)

IBD = {0,1,2}

Parent-offspring \[ Pr(IBD=\{0,1,2\}) = (0,1,0) \] Siblings \[ Pr(IBD=\{0,1,2\}) = (0.25,0.5,0.25) \] \[ 0.25 = 0.5 \times 0.5 \]

Table 2.1 Relation between kinship and the probabilities of IBD counts

\[
\begin{array}{|c|c|c|c|c|c|c|}
\hline
 & \multicolumn{3}{|c|}{\text{Pr by IBD count}} & \text{E(IBD count)} & \multicolumn{2}{|c|}{\text{Concordance rate}}\\
\text{Relationship} & 2 & 1 & 0 & & \text{mean} & \text{variance}\\
\hline
\text{Self / identical twins} & 1 & 0 & 0 & 2 & 1 & 0 \\
\text{Parent-child} & 0 & 1 & 0 & 1 & \frac{1}{2} & 0 \\
\text{Siblings} & \frac{1}{4} & \frac{1}{2} & \frac{1}{4} & 1 & \frac{1}{2} & \frac{1}{8} \\
\text{Grandparent-grandchild} & 0 & \frac{1}{4} & \frac{3}{4} & \frac{1}{4} & \frac{1}{8} & \frac{3}{64} \\
\hline
\end{array}
\]

2.1.3 It is easy to handle a single representing number—Expected value of IBD !

The probabilities that the number of IBD alleles is 0, 1, or 2 are useful information on genetic closeness, but it is inconvenient to handle three probability values, a probability vector with three elements. When the closeness of kinship is your interest, it may be convenient to have a single value representing it; the expected value of IBD is a candidate for this purpose.

When the observable values are \(X=\{x_1,x_2,...\}\) and their probability is \(P=\{p_1,p_2,...\} (\sum_{i} p_i =1)\), the expected value is

\[ E(X)=\sum_{i} x_i p_i . \]

In the case of identical twins \(E (X) = 2\). The relation between an individual and itself is also \(E (X) = 2\).

It might be reasonable to set the strength of the identity relation, the strongest blood relationship, to 1; then \(\frac{E(X)}{2}\) might be a good index as well, which could be called the "concordance rate of alleles". The concordance rates of various relations are shown in Table 2.1.

The concordance rate between parent and child and the rate between sibs are the same, but the IBD probability vectors are different ({0,1,0} vs. {0.25,0.5,0.25}). This means the concordance rate is convenient because of its simplicity, but the simplicity is achieved by losing the detailed information in the probability vectors.

  1. Mean, variance, and moments

The expected value \(E(X)\) of \(X=\{x_1,x_2,...\}\) with \(P=\{p_1,p_2,...\} (\sum_{i} p_i =1)\) is \[ E(X)=\sum_{i} x_i p_i \]

This expected value can be called the "mean" of X.

Variance \(V(X)\) is an index representing the variation in values: \[ V(X)=\sum_{i} p_i (x_i-E(X))^2 . \]

Both mean and variance are indices to describe X and its distribution.

More generally, the mean is the 1st moment, and the variance is the 2nd moment around the mean. You can summarize X and its distribution with the k-th moments around zero or around the mean.

It is expressed as, \[ \mu_k(c) = \sum_{i} p_i (x_i-c)^k \] with c as a center.

Therefore moments around zero is, \[ \mu_k(0) = \sum_{i} p_i x_i^k \]

and moments around mean is,

\[ \mu_k(E(X))=\sum_{i} p_i(x_i-E(X))^k \]

Let's make a function to calculate moments in R, as an exercise.

momentProb<-function (x, p,order = 1, center = FALSE) 
{
 if(center) 
  x <- x - momentProb(x,p,order=1,center=FALSE)
 sum(x^order*p)
}

datax<-c(1,2,3)
datap<-c(0.2,0.3,0.5)
momentProb(x=datax,p=datap,order=2,center=TRUE)
momentX<-function (x, order = 1, center = FALSE) 
{
 # same as momentProb() given uniform probabilities p=rep(1,length(x))/length(x)
 momentProb(x,p=rep(1,length(x))/length(x),order=order,center=center) 
}

Let’s apply the functions to the IBD vectors.

x<-c(1,0.5,0) # concordance values for IBD counts 2, 1, 0
pDoho<-c(1/4,1/2,1/4) # sibling probabilities by IBD count
momentProb(x,pDoho,order=1,center=FALSE) # mean
momentProb(x,pDoho,order=2,center=TRUE) # variance

The results are shown in Table 2.1. Using the expected value (mean) and the variance, you can distinguish parent-child from siblings.

2.1.4 Concordance rate of allele in sibs

  1. 23 pairs of chromosomes

There are 23 pairs of chromosomes in humans. Assume no recombination. One chromosome of each pair is from the mother and the other is from the father. Now we want to calculate how many pairs are IBD=2, how many are IBD=1, and how many are IBD=0.

When you compare identical twins, all 23 pairs are IBD=2. How about siblings?

In total, 46 = 2 x 23 chromosomes descend from the parents to an offspring.

Each one of 46 can be shared by a sibling pair or not. Let 1 and 0 denote shared and not-shared, respectively.

For a chromosome pair, the probability of “11” is 0.25 and “01” or “10” is 0.5 and “00” is 0.25.

There are 4 ways to make repeated permutation of length 2; 00, 01, 10, 11.

Each probability is \[ \frac{1}{2^{2}} \]

For k chromosome pairs, there are \(2^{2k}\) ways of number sequence of {0,1}, each sequence probability is \[ \frac{1}{2^{2k}} \]

The probability that all 46 chromosomes are IBD between a sib pair is the probability of "11...1", which is \(\frac{1}{2^{2k}}\) with k = 23.

The probability that i out of \(2k\) chromosomes are IBD is the product of \[ \binom{2k}{i}=\frac{(2k)!}{i! (2k-i)!} \] and

\[ \frac{1}{2^{2k}} . \]

Therefore \[ \binom{2k}{i}\frac{1}{2^{2k}}=\frac{(2k)!}{i! (2k-i)!}\frac{1}{2^{2k}}. \] Figure 2.3 shows the probability of the number of IBD chromosomes.

When the number of pairs is bigger, the number of shared chromosomes is less variable. In other words, if the number of chromosomes is small, some sib pairs share many chromosomes and some share only a few; when the number of chromosomes is large, the variation falls in a narrow range.

sibIdValue<-function(k=1){ 
 (0:(2*k))/(2*k) # i/(2k)
}

sibIdProb<-function(k=1){ 
 dbinom(0:(2*k),2*k,0.5) 
}
numch<-1:23 
means<-vars<-rep(0,length(numch))
for(i in 1:length(numch)){
 identity<-sibIdValue(numch[i])
 prob<-sibIdProb(numch[i])
 if(i>1){par(new=TRUE)}
 plot(identity,prob,type="b",ylim=c(0,1))
 means[i]<-momentProb(identity,prob,order=1,center=FALSE) # mean
 vars[i]<-momentProb(identity,prob,order=2,center=TRUE) # variance
}
plot(means)
plot(vars)
  1. Chromosome lengths vary

Chromosome lengths were plotted earlier, as shown in Figure 1.2. The shared fraction of the genome when i chromosomes are shared differs depending on whether the shared chromosomes are long or short.

The following simulation takes the length of chromosomes into calculation for two scenarios; one scenario is that all chromosomes have the same length and the other scenario is that their lengths vary as human chromosomes.

The average concordance rate is 0.5 for both and the variance of them is also similar (R2-5.R).

SibSim<-function(f,Niter=10000){ # f: fraction of genome length per chromosome
 # each chromosome is shared (1) or not (0) with probability 0.5
 rs<-matrix(rbinom(Niter*length(f),1,0.5),nrow=Niter) 
 rs%*%f # shared genome fraction for each simulated sib pair
}

chs<-rep(chromLen[1:23],2) # 23 chromosome pairs: 46 chromosomes with human lengths
f<-chs/sum(chs) # fraction of genome length per chromosome
Niter<-10000 
simOut<-SibSim(f,Niter=Niter)

cheven<-rep(1,length(f)) # equal-length scenario
feven<-cheven/sum(cheven)
simOutEven<-SibSim(feven,Niter=Niter)

plot(ppoints(Niter,a=0),sort(simOut),type="l",ylim=c(0,1))
plot(ppoints(Niter,a=0),sort(simOutEven),type="l",ylim=c(0,1))

Mean and variance convey some information on the distribution, and they are similar between the two scenarios. Figure 2.4 is a plot of sorted concordance rates. The lines look different: smooth vs. step-like.

In the simulation faithful to the actual chromosome lengths (a), the curve is fairly smooth, whereas in the equal-length case (b) it takes a staircase pattern: with equal lengths the concordance rate can take only discrete values.

Visualization can reveal information that summary index values alone would miss.

Three points from the example above:

  • Without plotting data, some information might be missed

  • Be careful when handling data with special conditions (“equal-length” was the condition in the example above).

  • Discreteness affects the distribution

  1. The 23 pairs are divided into pieces - crossover and recombination: the Poisson process and the exponential distribution

Crossovers make recombinants, which break chromosomes into segments. Segments descend from a parent to an offspring.

The number of crossovers per chromosome per generation is not very large.

The number of crossovers per unit chromosomal length can be assumed to follow a Poisson distribution, a distribution for rare stochastic events.

See Figure 2.5. Crossover sites are generated in two different ways: one determines crossovers by a Bernoulli trial at each site, without Poisson random numbers, and the other uses a Poisson random number generator.

RecombSim<-function(L=10000,r=0.001,Niter=1000){ # Bernoulli trial at each of L-1 sites
 m<-matrix(rbinom((L-1)*Niter,1,r),nrow=Niter)
 apply(m,1,sum) # number of crossovers per simulation
}
RecombPois<-function(L=10000,r=0.001,Niter=1000){ # Poisson random number of crossovers
 rpois(Niter,(L-1)*r)
}
L<-10000;r<-0.0001;Niter<-1000
NumSim<-RecombSim(L=L,r=r,Niter=Niter)
NumPois<-RecombPois(L=L,r=r,Niter=Niter)
ylim<-c(0,max(NumSim,NumPois))
plot(ppoints(Niter,a=0),sort(NumSim),ylim=ylim,col=gray(6/8))
par(new=T)
plot(ppoints(Niter,a=0),sort(NumPois),type="l",ylim=ylim)

Crossovers make mosaics of chromosomes as shown in Figure 2.6.

The shared (IBD) fraction is the sum of the segments with the same color.

Crossovers make the segments descending to offspring shorter; this does not change the mean IBD fraction but makes its variance smaller (R2-7.R, Figure 2.7).

Niter<-1000 
L<-1000000 # chromosome length
r<-0.0001 # per-site crossover probability

# crossover sites: a Poisson number of sites sampled along the chromosome
crosses<-sort(sample(1:(L-1),rpois(1,(L-1)*r),replace=FALSE))
A<-c(0,crosses) # segment start positions
B<-c(crosses,L) # segment end positions
C<-B-A # segment lengths
rexps<-rexp(length(C),1/mean(C)) # exponential random lengths with the same mean
ylim<-c(0,max(C,rexps))
plot(sort(C),ylim=ylim,cex=0.5,pch=15) # sorted observed segment lengths
par(new=T)
plot(sort(rexps),col="red",ylim=ylim,type="l") # exponential distribution for comparison

2.1.5 Fate of the mutant — Genetic drift

New alleles are born with mutation events. Their initial allele frequency is \(\frac{1}{2n}\) where \(n\) is population size. Their frequency changes with generations because some alleles are transmitted to next generation more than other alleles. This stochastic change in allele frequency is called genetic drift.

A newly born allele might go extinct, or it might eventually take over the original allele.

  1. Contingency table

Figure 2.8.

The model is that chromosomes make an expanded pool, because each chromosome has a chance to transmit more than one copy to the next generation. The next-generation chromosome pool is then a sample from the expanded pool.

This sampling process can be expressed as a contingency table.

This is expressed in a 2 x 2 table, and its exact occurrence probability is \(\frac{4! 44! 12! 36!}{48! 2! 10! 2! 34!}\).
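As a numerical check (a sketch; the cell counts 2, 2, 10, 34 and margins 4, 44, 12, 36 are inferred from the factorials above), the probability can be computed with lgamma(), or equivalently with the hypergeometric density:

lf<-function(n){lgamma(n+1)} # log factorial
# exact probability of a 2 x 2 table with cells (2,2;10,34) and margins (4,44;12,36)
exp(lf(4)+lf(44)+lf(12)+lf(36)-lf(48)-lf(2)-lf(2)-lf(10)-lf(34))
dhyper(2,4,44,12) # the same value from the hypergeometric distribution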

  1. State transition

Assume a population with only two individuals. In total there are 4 chromosomes. The number of copies of the mutant allele can be 0, 1, 2, 3, or 4.

See Figure 2.9. The number of copies of the mutant allele changes in the next generation.

Figure 2.9 describes the state transition.

A state probability vector is a vector whose length is the number of states; its elements are non-negative and sum to 1.

From the state with 0 mutant alleles, the number of mutant copies in the next generation is 0 with probability 1.

From the state with 1 mutant allele, the next-generation value can be 0, 1, 2, 3, or 4. The arrows in Figure 2.9 indicate the possible transitions.

(R2-8.R Figure 2.10).

# Given the chromosome population size and the current number of mutant alleles,
# compute the probability of each mutant-allele count in the next generation.
# N: chromosome population size, p: number of mutant alleles,
# k: maximum number of copies a chromosome can leave to the next generation,
# infty=TRUE if there is no limit on the number of copies
probDrift02NInf<-function(N,p,k,infty=FALSE){ 
 if(infty){ # the pool from which next-generation chromosomes are drawn is infinite
  ret<-rep(0,(N+1)) # states with 0,1,2,...,N mutant chromosomes
  pr1<-p/N # mutant chromosome frequency
  pr2<-1-pr1 # non-mutant chromosome frequency
  if(pr1==0){ # no mutant chromosomes: the next generation has none either
   ret[1]<-1
  }else if(pr2==0){ # no non-mutant chromosomes: the next generation is all mutant
   ret[N+1]<-1
  }else{
   for(i in 0:N){ # probability that i out of N chromosomes are mutant
    ret[i+1]<-exp(lgamma(N+1)-lgamma(i+1)-lgamma(N-i+1)+i*log(pr1)+(N-i)*log(pr2))
   }
  }
  ret
 }else{ # the pool from which next-generation chromosomes are drawn is finite
  ret<-rep(0,(N+1))
  kN<-k*N
  kp<-k*p
  kNp<-kN-kp
  k1N<-(k-1)*N
  commonLN<-lgamma(kp+1)+lgamma(kNp+1)+lgamma(k1N+1)+lgamma(N+1)-lgamma(kN+1)
  for(i in 0:N){
   if(kp>=i){ # cannot draw more mutant chromosomes than the pool contains
    n11<-i
    n12<-kp-n11
    n21<-N-n11
    n22<-k1N-n12
    if(n11>=0 && n12>=0 && n21>=0 && n22>=0){
     ret[i+1]<-exp(commonLN-lgamma(n11+1)-lgamma(n12+1)-lgamma(n21+1)-lgamma(n22+1))
    }
   }
  }
  ret
 }
}
# Given a state probability vector, compute the probability of each
# mutant chromosome count in the next generation.
nextGenExProb2<-function(x,k,infty){
 N<-length(x)-1 # chromosome population size
 ret<-rep(0,length(x))
 for(i in 1:length(x)){ # for each state 0,1,...,N, transition probabilities to next-generation states
  p<-i-1
  tmpret<-probDrift02NInf(N,p,k,infty=infty) 
  ret<-ret+tmpret*x[i] # weight by x[i], the probability of currently having p mutant chromosomes
 }
 return(ret)
}

# k: maximum offspring chromosome copies, initial: initial number of mutant chromosomes,
# np: population size (individuals), ng: number of simulated generations, infty: infinite pool or not
DriftSim3<-function(k=2,initial=1,np=20,ng=100,infty=FALSE){
 m<-matrix(rep(0,(np*2+1)*ng),nrow=ng) # state probabilities of all generations
 m[1,initial+1]<-1 # generation 1: the initial mutant count has probability 1
 for(i in 2:ng){ # simulate generation 2 onward, one generation at a time
  m[i,]<-nextGenExProb2(m[i-1,],k=k,infty=infty)
 }
 return(m)
}
# 20 individuals (40 chromosomes), max offspring copies 2, 10 initial mutant chromosomes, 25 generations
out<-DriftSim3(k=2,np=20,initial=10,ng=25,infty=FALSE)
# bird's-eye (perspective) plot of the result;
# theta and phi set the viewpoint, shade adds shading for readability
persp(out,theta=120,phi=30,shade=0.2,xlab="generation",ylab="No. mutants",zlab="probability")
  1. Transition matrix

When the transition from one state to another is stochastic, the transitions among n states are represented by an n x n matrix, the transition matrix.

The i-th column of the matrix is the vector of next-generation probabilities of all states when the i-th state has probability 1 before the transition.

Using this matrix, the state probability vector can be updated by multiplying the matrix onto the state vector, and the calculation can be repeated.

The next state depends only on the current state; earlier states have no additional effect. This is called a Markov chain.

Markov chains will appear again in Chapter 8, linkage analysis.

The transition probability from i mutants to j mutants is \[ pr_{i,j}=\frac{(2N)!\,(2(k-1)N)!\,(ki)!\,(k(2N-i))!}{(2kN)!\,j!\,(2N-j)!\,(ki-j)!\,(2N(k-1)-(ki-j))!} \]

The corresponding 2x2 table is shown in the Japanese version textbook.
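As a sketch (using the probDrift02NInf() function defined above, with the 4-chromosome example of Figure 2.9 and maximum copy number k = 2 as assumed settings), the transition matrix can be built column by column and multiplied onto a state probability vector repeatedly:

N<-4;k<-2 # 4 chromosomes; each can leave at most 2 copies
# column p+1 is the next-generation distribution given p mutant chromosomes now
Tmat<-sapply(0:N,function(p){probDrift02NInf(N,p,k,infty=FALSE)})
v<-c(0,1,0,0,0) # state vector: one mutant chromosome with probability 1
Tmat%*%v # state probabilities after one generation
Tmat%*%(Tmat%*%v) # after two generations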

2.2 From DNA to RNA and Protein !

DNA RNA protein codon

2.2.1 From DNA to RNA — Transcription

Genetic information recorded in DNA exerts its function after being copied to RNA. As shown in Figure 2.11, the information "CAGGTT" in the upper portion of the DNA is copied as the RNA "CAGGUU". This copying is called transcription. DNA uses "A", "C", "G", and "T", while RNA uses "U" instead of "T". Many RNAs are further translated into proteins; such an RNA is called a messenger RNA (mRNA). An RNA can also exert a function as RNA without being translated; in that case, the gene is called a functional RNA gene.

2.2.2 From RNA to Protein — Translation

Translation is the process of reading the information of the mRNA and making a chain of amino acids (a protein). Twenty kinds of amino acids are used in translation. The shortest base-sequence length that can identify 20 amino acids with permutations of the 4 bases satisfies \(4^2 < 20 \le 4^3\), and in fact each permutation of 3 bases is translated into a corresponding amino acid. A sequence of 3 bases is called a codon; the correspondence between codons and amino acids is shown in Figure 2.12. A special codon that means the start of translation and specific codons that mean the end of translation control the start and end of the translation process. Since the initiation codon "AUG" corresponds to the specific amino acid "Met", the first amino acid of a newly made protein is always "Met". There are several stop codons; because they correspond to no amino acid, translation finishes there.
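As a toy sketch (a hypothetical mini codon table with only four codons, not the full table of Figure 2.12), an mRNA can be translated in R by splitting it into triplets:

codon2aa<-c(AUG="Met",UUU="Phe",GGU="Gly",UAA="Stop") # partial codon table for illustration
mRNA<-"AUGUUUGGUUAA" # assumed toy sequence
codons<-substring(mRNA,seq(1,nchar(mRNA)-2,by=3),seq(3,nchar(mRNA),by=3)) # split into triplets
codon2aa[codons] # Met Phe Gly Stop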

Chapter 3 Aspects of diversity !

3.1 Nucleic acid, protein diversity

3.1.1 Diversity of DNA sequences, species differences, gene polymorphism !

Figures 3.1 and 3.2

substitution, insertion/deletion, repeat, inversion, translocation

microscopic vs. macroscopic variants

structural variants

Figure 3.3

{A,T,G,C} : \(4^n\)

{0,1} : \(2^n\)

3.1.2 Diversity of RNA and protein

splicing

splicing variants

Using R, we make the sequences of splicing variants (R3-1.R).

seq1<-sample(c("A","T","G","C"),100,replace=TRUE) # make a random DNA sequence of length 100
exonpattern1<-c(11:20,41:60,81:90) # mRNA1 has three exons; this is its pattern
exonpattern2<-c(11:20,41:60,66:77,81:90) # mRNA2 has an extra exon between the second and third exons of mRNA1
seq1[seq1=="T"]<-"U" # in mRNA, DNA's "T" becomes "U"
mRNA1<-seq1[exonpattern1] # extract the mRNA1 sequence
mRNA2<-seq1[exonpattern2]

DNA sequence variations ~ Genome variations

Other omics layer variations

(Figure 3.4).

3.2 Diversity and variance

3.2.1 Decomposition of variance — Variance, covariance

Diversity means that "values" vary. Therefore variance, a statistical index, is a good indicator of it.

Under some conditions, multiple variances can be summed up, and one variance can be decomposed into multiple pieces that are themselves variances of something.

Assume a population with subpopulations.

You can measure the variance of the whole population. You can also measure the variance of each subpopulation, and the variance of the subpopulation means.

These “variances” are mutually related, and their relation is informative.

Assume a variable \(X_s\) and its \(N\) samples \(\{x_{s,1},...,x_{s,N}\}\).

Variance of \(X_s\) is \[ V(X_s)=\frac{1}{N}\sum_{i=1}^N (x_{s,i}-\mu_s)^2 \] where \(\mu_s=\frac{1}{N}\sum_{i=1}^N x_{s,i}\).

You can write this as \[ \sigma_{s,s}=V(X_s,X_s)=\frac{1}{N}\sum_{i=1}^N (x_{s,i}-\mu_s) \times (x_{s,i}-\mu_s) \]

A square is multiplication by itself.

If you replace one of the two \(s\) subscripts with \(t\),

\[ \sigma_{s,t}=\frac{1}{N}\sum_{i=1}^N (x_{s,i}-\mu_s) \times (x_{t,i}-\mu_t) \] where \(X_t\) is another variable.

This quantity is related to variance but involves two variables; it is called covariance.

Assume two variables \(X_1, X_2\), such as test scores in physics and biology.

You are interested in the sum of the two subjects, \(P=\{p_i=x_{1,i}+x_{2,i}\}\).

\[ \sigma_{P,P}=\sigma_{X_1,X_1}+\sigma_{X_2,X_2}+\sigma_{X_1,X_2}+\sigma_{X_2,X_1}=\sum_{s=1}^2 \sum_{t=1}^2 \sigma_{X_s,X_t} \] Covariance is 0 when two variables are mutually independent; you can easily find a detailed description or proof of this.

Rather than repeating the regular textbook, here we simulate these with R.

nsample<-10000;X1<-rnorm(nsample);X2<-rnorm(nsample) # rnorm(), x1 and x2 are independent
p12<-X1+X2
cov12<-cov(cbind(X1,X2)) # variance-covariance matrix of X1 and X2
cov12 
vpp_2<-cov(p12,p12) # variance of p12
vpp_2-sum(cov12)  # variance of p12 minus the sum of all variances/covariances: close to 0
plot(X1,X2)
# x1 and x2 are dependent
X3<-X1+0.1*rnorm(nsample) 
p13<-X1+X3
cov13<-cov(cbind(X1,X3))
cov13
vpp_3<-cov(p13,p13)
vpp_3-sum(cov13) 
plot(X1,X3) 

The number of variables can be arbitrary, say k.

\[ \sigma_{p,p}=\sum_{s=1}^k \sum_{t=1}^k \sigma_{s,t} \]

3.2.2 Variance and heritability !

Phenotypes vary and have variance.

Genotypes vary and have variance.

Environmental factors vary and have variance.

We assume that phenotypic variance is the sum of genotypic variance and environmental variance.

\[ \sigma_{P,P}=\sigma_{G,G}+\sigma_{E,E} \]

Heritability, the degree to which the phenotype is affected by genes, is defined as \[ \frac{\sigma_{G,G}}{\sigma_{P,P}} \]
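A minimal simulation sketch (assuming independent normally distributed genetic and environmental effects with variances 0.6 and 0.4, values chosen only for illustration):

nsample<-100000
G<-rnorm(nsample,0,sqrt(0.6)) # genetic effects, variance 0.6
E<-rnorm(nsample,0,sqrt(0.4)) # environmental effects, variance 0.4
P<-G+E # phenotype
var(G)/var(P) # estimated heritability, close to 0.6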

3.2.3 Hardy-Weinberg equilibrium (HWE) and variance !

Assume a biallelic polymorphism with 2 alleles, A and a, whose allele frequencies are p and 1-p.

G0, G1, and G2 denote the diplotypes "aa", "Aa", and "AA", the index counting the copies of allele A.

Under the assumption that the two alleles combine independently, \[ \begin{align*} G2=p^2\\ G1=2p(1-p)\\ G0=(1-p)^2 \end{align*} \] The state under this assumption is Hardy-Weinberg equilibrium.

When the assumption is not true, a term representing the deviation from HWE should be added.

\[ \begin{align*} G2=p^2+\Delta\\ G1=2p(1-p)-2 \Delta\\ G0=(1-p)^2+\Delta \end{align*} \]

This can be re-written, with \(\Delta=p(1-p)F\), as \[ \begin{align*} G2=p^2+p(1-p)F\\ G1=2p(1-p)(1-F)\\ G0=(1-p)^2+p(1-p)F. \end{align*} \]

Then F is an index called inbreeding coefficient.

Giving values 1 and 0 to "A" and "a", we can describe the population numerically.

Mean and variance of the population of 2N chromosomes are, \[ \begin{equation*} \mu(h)=(p \times 1 +(1-p) \times 0) =p \end{equation*} \] \[ \begin{equation*} v(h)=p \times (1-p)^2+ (1-p) \times (0-p)^2=p(1-p) \end{equation*} \]

Considering diploid population,

\[ \begin{equation*} \mu(d)=G0\times 0 +G1\times 1 + G2 \times 2=2p=2\mu(h) \end{equation*} \] \[ \begin{align*} v(d)&=G0 \times (0-\mu(d))^2+G1\times (1-\mu(d))^2+G2\times (2-\mu(d))^2\\ &=2p(1-p)(1+F)\\ &=2v(h)+2\times F \times v(h) \end{align*} \]

When F = 0, the variance of diploids is twice that of haploids.

The covariance of the two alleles within an individual is \[ Cov(d)=F \times v(h). \]

\[ F=\frac{Cov(d)}{v(h)}=\frac{Cov(d)}{p(1-p)} \] F is a measure of the deviation from HWE: the covariance of the two alleles standardized by the allele variance, i.e., their correlation.
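As a sketch (with assumed genotype counts; the estimator 1 - observed/expected heterozygosity follows from the G1 formula above):

n<-c(nAA=60,nAa=30,naa=10) # assumed genotype counts
N<-sum(n)
p<-(2*n["nAA"]+n["nAa"])/(2*N) # frequency of allele A
hetObs<-n["nAa"]/N # observed heterozygote frequency
hetExp<-2*p*(1-p) # heterozygote frequency expected under HWE
1-hetObs/hetExp # F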

3.3 The ways to handle data and their effect on variance/covariance !

3.3.1 When handling alleles in two columns for HWE and LD !

See Figure 3.6.

Two columns for two variables with 0 or 1.

F for HWD and r for LD are proportional to the covariance of the two variables.

The difference between HWD and LD lies in how (1,0) and (0,1) are treated.

3.3.2 Mode of inheritance (dominant, recessive) is represented by the third column

See Figure 3.7.

To handle the dominant model, we need a third column. The third column gives information on heterozygotes.

The value of a genotype is the sum of the three columns.

\[ V_d=G1(1-G1)d^2 \] \[ Cov_{1d}=Cov_{2d}=\frac{1}{2}G1(G2-G0)d \]

3.4 A lot of factors — Multifactorial inheritance !

Polygenic traits or complex genetic traits.

When many factors act together, the overall effect tends toward a normal distribution.

R3-3.R models many loci that are in HWE and mutually in LE.

The phenotypic variance is the sum of the variances of the individual loci and their covariances.

covXY<-function(x,y){ # covariance from two value vectors
 mx<-momentX(x,order=1,center=FALSE)
 my<-momentX(y,order=1,center=FALSE)
 sum((x-mx)*(y-my))/length(x)
}
covMat<-function(m){ # covariance for every pair of column vectors
 ret<-matrix(rep(0,length(m[1,])^2),nrow=length(m[1,]))
 for(i in 1:length(m[1,])){
  for(j in 1:length(m[1,])){
   ret[i,j]=covXY(m[,i],m[,j])
  }
 }
 ret
}
N<-1000;M<-100 # number of samples; number of biallelic polymorphisms
af<-runif(M) # allele frequencies; for runif() see Appendix A.5 on probability functions and pseudo-random numbers
r<-rnorm(M) # score per allele copy
d<-rnorm(M) # per-marker deviation from the additive model
Xmat<-matrix(rep(0,N*M),nrow=N)
for(i in 1:M){ # compute each individual's score for each marker
 x<-rep(0,N)
 x<-x+rbinom(N,1,af[i])
 x<-x+rbinom(N,1,af[i])
 x[x==1]<-1+d[i] # heterozygote-specific value (deviation from the additive model)
 x<-x*r[i] # multiply by the effect size of the locus
 Xmat[,i]<-x
}
X<-apply(Xmat,1,sum) # each individual's total score over all polymorphisms
momentX(X,order=2,center=TRUE) # variance of the total scores...
sum(covMat(Xmat)) # ...agrees with the sum of the variance-covariance matrix entries
hist(X) # distribution of individuals' total scores

Part II Data, sample, sample set !

Chapter 4 evaluates observations; Chapter 5 captures samples individually; Chapter 6 captures samples as a group. To understand the relationship between the genotype and the phenotype of organisms, we observe genotypes and phenotypes and handle them as data. In Part II, after thinking about the data types that are the root of data handling, we look at two approaches for handling sample data: one approach distinguishes individual samples, and the other captures samples as a group.

Chapter 4 Observation and assessment !

4.1 The type and configuration of the data !

4.1.1 The type of data that was seen from the gene — Genotype and phenotype, the final trait and an intermediate trait !

Since genetic statistics organizes and interprets life phenomena in terms of genes, data types and structures are also classified from that point of view. Data are divided into two:

  • genotype

  • phenotype, covering everything except the genotype

Figure 4.1.

  • Interactions between layers.

  • One-way vs. two-way.

  • Intermediate phenotypes and terminal phenotypes.

4.1.2 The type of data as analyzed — Data Types

  1. Data types
  • Discrete vs. continuous

  • Ordered vs. not-ordered (vs. partially-ordered)

  • Two categories and three-or-more categories differ from the standpoint of order

Table 4.1

  1. Two-category data may also be ordered

A or non-A.

A is a subset of the universe; non-A is the complement of A.

Figure 4.2

Two categories can be handled as ordered. Figure 4.3.

  1. Three-or-more-category data

Three categories can be drawn as subsets as shown in Figure 4.2.

The three genotypes of a SNP may be treated as ordered (Figure 4.3) when focusing on the number of copies of one allele.

The symmetry of three categories can be represented as a triangle (Figure 4.3).

Ordered categories can be generated by dividing a continuous variable with thresholds (Figure 4.3).

An order can be placed in one-dimensional space.

  1. Continuous type

Variables of this type are realized on a line.

4.1.3 Partially ordered

When a line structure exists among the values of a variable, they are totally ordered. When some values are ordered but no order is defined between others, they are called partially ordered.

Combinatorial genotypes of 2 or more SNPs can be said to be partially ordered.

\[ \begin{align*} (0,0) < (0,1) < (0,2)\\ (0,0) < (1,0) < (2,0)\\ (0,0) < (1,1) < (2,2) \end{align*} \]

4.1.4 The combination of categories !

\[ \prod_{i=1}^N k_i =k_1\times k_2 \times ... \times k_N \]

4.1.5 Only-one selection, duplicated selection !

Zero or more items might be selected from an item list. This may be found in questionnaires or in clinical criteria.

Some clinical criteria have a scoring system with a weight vector over the items.
There are multiple ways to record them as shown in tables in the Japanese textbook.

Combination \(\binom{N}{k}=\frac{N!}{k!(N-k)!}\) or permutation \(\frac{N!}{(N-k)!}\) is important.

In the case of a questionnaire, multiple items might be selected (combination), or they might be selected in order of preference (permutation).
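A quick check of the counts in R (illustrative numbers: choosing k = 3 items out of N = 10):

N<-10;k<-3
choose(N,k) # combinations: 120
factorial(N)/factorial(N-k) # permutations: 720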

4.1.6 Special aspects in diploids — Hardy-Weinberg equilibrium exact test (HWE) !

Assume a biallelic variant, with A and T alleles. Genotypes are AA, AT and TT.

We genotyped N individuals; \(n_0, n_1, n_2\) of them had AA, AT, and TT, respectively.

This observation makes an N x 2 table (shown in the Japanese textbook).

The sum of the first column is the number of A alleles observed, and the sum of the second column is the number of T alleles.

All row sums are 2, because everybody is diploid.

Total sum is 2N.

This can be seen as the result of a repeated procedure that selects two from two categories; the two values can be the same or different.

Let's test the null hypothesis under which alleles are selected at random.

We perform the "exact probability test" for HWE. You may find some general notes on exact probability tests in Chapter 13.

There are \(n_0+n_2\) cells with value 2, \(n_0 + n_2\) cells with value 0, and \(2 n_1\) cells with value 1; all \(N\) row-sum marginal counts are 2, and \(n_A\) and \(n_T\) are the column-sum marginal counts. Therefore,

\[ \frac{(2!)^N n_A! n_T!}{(2N)! (2!)^{n_0+n_2} (0!)^{n_0+n_2}(1!)^{2 n_1}}= \frac{2^{n_1} n_A! n_T!}{(2N)!} \] Note \(0!=1, 1!=1, 2!=2\), and \(N=n_0+n_1+n_2\).

The table treats all N rows separately, which means individuals are mutually distinguishable.

However, the observation is of the numbers of people with the three genotypes, with no distinction among individuals.

The number of ways to obtain the \(N \times 2\) table is \(\frac{N!}{n_0! n_1! n_2!}\); therefore, the exact probability of observing \(n_0,n_1,n_2\) under the assumption of HWE is \[ \begin{equation*} \frac{N!}{n_0!n_1!n_2!} \times \frac{2^{n_1} n_A! n_T!}{(2N)!} = \frac{2^{n_1} n_A! n_T! N!}{(2N)!n_0!n_1!n_2!} \end{equation*} \]

hweExact<-function(g=c(813,182,5)){ # number of people with three genotypes
 n<-sum(g) # total 
 nA<-2*g[1]+g[2] # No. allele A
 na<-2*g[3]+g[2] # No. allele a
 evod<-g[2]%%2 # parity (even/odd) of the heterozygote count
 maxAa<-min(nA,na)-evod
 Aa<-seq(from=evod,to=maxAa,by=2) # list of possible heterozygous people number
 AA<-(nA-Aa)/2 # corresponding No. AA
 aa<-(na-Aa)/2 # corresponding No. aa
 obs<-(g[2]-evod)/2+1 # index of the observed heterozygote count
 prob<-rep(0,length(Aa)) # exact probs of all cases
 prob<-exp(n*lgamma(2+1)+lgamma(nA+1)+lgamma(na+1)-lgamma(2*n+1)-(AA*lgamma(2+1)+Aa*lgamma(1+1)+aa*lgamma(2+1))+lgamma(n+1)-(lgamma(AA+1)+lgamma(Aa+1)+lgamma(aa+1)))
 p.value<-sum(prob[prob<=prob[obs]]) # p-value
 # Aa: possible heterozygote counts
 # prob: exact probabilities of all cases
 # obsprob: exact probability of the observed table
 # p.value: sum of probabilities not exceeding the observed one
 list(Aa=Aa,prob=prob,obsprob=prob[obs],p.value=p.value)
}
xx<-hweExact(c(813,182,5))
xx$p.value # test p-value
sum(xx$prob) # sum of all exact probs is 1

4.1.7 Parent items and child items !

Some variables may be observed only when another variable takes a particular value.

For example, item X is prompted to be answered only when the answer to item Y is "yes".

For example, genotype of a marker on Y chromosome is only available for males.

4.1.8 Placement of categories, non-independence between categories, regular simplex !

n mutually independent categories can be placed in (n-1)-dimensional space as the vertices of a regular (n-1)-simplex.

  • The angle \(\theta\) between two vectors pointing at two vertices satisfies \(\cos\theta=-\frac{1}{k}\).

Assume k+1 dimensional space and its orthogonal basis vectors \(e_i =(0,0,....,0,1,0,...,0); i=1,2,...,k+1\).

They represent k+1 categories.

Take the subset \[ \sum_{i=1}^{k+1} \beta_i e_i;\quad \sum \beta_i = 1,\ \beta_i \ge 0, \] which is a k-simplex.

The center of k-simplex is,

\[ o = \frac{1}{k+1} \sum_{i=1}^{k+1} e_i = \frac{1}{k+1} (1,1,...,1) \]

\[ f_i=e_i-o=(-\frac{1}{k+1},-\frac{1}{k+1},....,-\frac{1}{k+1},1-\frac{1}{k+1},-\frac{1}{k+1},...,-\frac{1}{k+1}) \] are k+1 vectors in k-dimensional space.

\[ |f_i|=\sqrt{\frac{1}{(k+1)^2}(k\cdot 1+k^2)}=\sqrt{\frac{k}{k+1}} \]

For any \(i \ne j\), \[ \cos(\theta_{i,j})=\frac{f_i \cdot f_j}{|f_i||f_j|}=-\frac{1}{k+1}\cdot\frac{k+1}{k}=-\frac{1}{k} \]

  • 1-dimensional space (line): regular 1-simplex, a line segment

  • 2-dimensional space (plane): regular 2-simplex, an equilateral triangle

  • 3-dimensional space (space): regular 3-simplex, a regular tetrahedron

As k grows, \(\cos(\theta_{i,j})=-\frac{1}{k} \to 0\), meaning the category vectors get closer to orthogonal, i.e., mutually independent relations.
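A quick numerical check of the construction above (a sketch with k = 3):

k<-3
e<-diag(k+1) # orthonormal basis vectors as rows
o<-rep(1/(k+1),k+1) # center of the k-simplex
f<-e-matrix(o,k+1,k+1,byrow=TRUE) # vectors from the center to the vertices
sqrt(sum(f[1,]^2)) # |f_i| = sqrt(k/(k+1))
sum(f[1,]*f[2,])/(sqrt(sum(f[1,]^2))*sqrt(sum(f[2,]^2))) # cos(theta) = -1/k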

A complete graph is a graph in which all vertices are mutually connected, so all pairwise relations are the same; it is a graph representation of multiple mutually independent categories.

Figure 4.4.

4.2 Comparison !

4.2.1 Pairwise comparison - symmetrical relationship and asymmetrical relationship

Pairwise comparison, the comparison of two things, can be seen as the evaluation of the relation between the two.

There are two ways to evaluate the relation of A and B: the relation of A to B and the relation of B to A.

\(A=B\) or \(A \ne B\) does not change after you exchange A and B.

When A "is larger than" B, B "is smaller than" A. In this case, after you exchange A and B, you have to change "larger" to "smaller".

The former relation is symmetric; the latter is asymmetric.

4.2.2 Change an asymmetric relation to a symmetric one — Distance !

When A is 2 and B is 1, A is larger than B by 1 and B is smaller than A by 1.

"Larger" and "smaller" are different, but "by 1" is common to the two expressions.

"1" is the distance between A and B, and distance is "symmetric".

Distance can be described as a non-negative measure quantifying the difference between two things.

It is better to define distance clearly.

Metric space is a set for which distances between all members of the set are defined.

  1. Distance
  • is defined between two of something (A, B)

  • takes the value of the non-negative (greater than or equal to 0)

  • should be symmetric (distance from A to B and distance from B to A are the same)

  • When A=B, distance between A and B should be 0

  • For three items, there are three distances. There should be a triangle whose edge lengths are the three distances; in the degenerate case, the triangle looks like a segment with two vertices at the ends and the third vertex on the segment. This rule is called the triangle inequality.

  1. Triangle inequality

Figure 4.5. \[ \begin{equation*} b+c>a,c+a>b,a+b>c \end{equation*} \] then, \[ \begin{align*} x=\frac{a+b+c}{2}-a = \frac{1}{2} (b+c-a) \ge 0\\ y=\frac{a+b+c}{2}-b = \frac{1}{2} (c+a-b) \ge 0\\ z=\frac{a+b+c}{2}-c = \frac{1}{2} (a+b-c) \ge 0\\ x+y=c, y+z=a, z+x=b \end{align*} \]

This means that when distances defined for all three pairs meet the triangle inequality, the three items can be placed on a star-shaped tree with non-negative branch lengths x, y, z that reproduce the distances.
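A small numerical sketch (assumed distances a = 3, b = 4, c = 5, which satisfy the triangle inequality):

a<-3;b<-4;c<-5 # pairwise distances
s<-(a+b+c)/2
x<-s-a;y<-s-b;z<-s-c # non-negative branch lengths
c(x+y,y+z,z+x) # reproduces c, a, b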

4.2.3 Euclidean distance and other distances !

Any formula that satisfies the above-mentioned rules defines a distance.

Euclidean distance is the familiar one: \[ d_E(A,B)=\sqrt{\sum_{i=1}^k (a_i-b_i)^2} \]

Another one is the Manhattan distance: \[ d_M(A,B)=\sum_{i=1}^k |a_i-b_i|=\sum_{i=1}^k \delta_i \]

R's dist() function offers multiple distance measures.

4.2.4 Sequence differences and Manhattan distance

For example, the distance between DNA sequences can be the number of different bases.

  • Human: cccggaCAcCgActtcccGgggctcatt

  • Mouse: cccggaTGcAgGcttcccAgggctcatt

Five out of 28 bases are different. The distance is 5.
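A small sketch counting the differing positions in R (using the two sequences above):

human<-unlist(strsplit(toupper("cccggaCAcCgActtcccGgggctcatt"),""))
mouse<-unlist(strsplit(toupper("cccggaTGcAgGcttcccAgggctcatt"),""))
sum(human!=mouse) # number of differing bases: 5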

When there is an insertion/deletion, it is more difficult to measure the distance between them.

  • Human: … cccggaCAcCgActtcccGgggctcattACcctCAc …

  • Mouse: … cccggaTGcAgGcttcccAgggctcatt-Tcct-Tc …

To define a distance between DNA sequences that differ by substitutions and insertions/deletions, we need a formula that scores the difference taking both types of variants into account.

The BLAST method quantifies the difference between DNA sequences by the rareness of the biological events that would generate one sequence from the other. When quantifying the rareness, it uses the extreme value distribution.

4.2.5 Angle as distance — Correlation coefficient

Distance expresses a symmetric relationship with a value greater than or equal to 0, so it cannot quantitatively express relationships that include negative values. When categories are placed in space as a regular simplex, all categories are in equal relationships with each other, which is reflected in the equal angles between their vectors; thus an angle can also represent the relationship between two things. The angle \(\theta\) itself can be the quantity representing the relationship, or \(\cos\theta\) may be used instead. When two data are given as vectors, \(\cos\theta\) is expressed by the inner product and the lengths of the vectors.

\(\cos(\theta)=\frac{\sum_{i=1}^k x_i y_i}{|x||y|}\)

When x and y are centered at their means, this value is the correlation coefficient.
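A quick check in R (a sketch; centering both vectors at their means makes the cosine equal to cor()):

x<-rnorm(10);y<-rnorm(10)
xc<-x-mean(x);yc<-y-mean(y) # center both vectors
sum(xc*yc)/(sqrt(sum(xc^2))*sqrt(sum(yc^2))) # cosine of the angle between xc and yc
cor(x,y) # the same value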

4.3 Multiple samples, a lot of comparisons

4.3.1 one vs. (N - 1) and N vs. N !

So far, we have compared two items.

From here, we compare N samples.

  • You may be interested in one out of N and compare the one vs. (N-1) others.

  • You may be interested in all N and compare \(\frac{N(N-1)}{2}\) pairs, or \(\frac{N(N+1)}{2}\) pairs including the "one-and-itself" relations. If the pairwise relation is asymmetric, \(N^2\) comparisons should be made.

See Figure 4.7.

4.3.2 Partial order !

When all items are located on a line, they are totally ordered.

Sometimes some pairs have an order but others don't.

Assume the power set of a set \(\{1,2,...,N\}\) with N elements.

The power set has \(2^N\) elements, the subsets of \(\{1,2,...,N\}\).

Between some elements of the power set an inclusion relation exists, but not between others.

Figure 4.8 is a Hasse diagram representing the inclusion relations of the power set of a 4-element set.

It can be seen as a 4-dimensional hypercube projected onto the 2-D plane.

powerSetHasse<-function(k=4){
 lev<-k+1 # number of levels (subset sizes 0..k)
 a<-0:k
 numelem<-choose(k,a) # number of subsets of each size

 numnode<-2^k # number of subsets
 nodesX<-rep(0,numnode)
 nodesY<-rep(0,numnode)
 vecs<-matrix(rep(0,numnode*k),ncol=k) # 0/1 indicator vector of each subset
 keta<-2^(0:(k-1))

 counter<-rep(0,lev)
 startval<--(numelem-1)/2 # leftmost x position per level, to center each level
 for(i in 2:numnode){
  # enumerate subsets as a binary counter with carry
  vecs[i,]<-vecs[i-1,]
  vecs[i,1]<-vecs[i,1]+1
  for(j in 1:(k-1)){
   if(vecs[i,j]==2){ # carry to the next digit
    vecs[i,j]<-0
    vecs[i,j+1]<-vecs[i,j+1]+1
   }else{
    break
   }
  }
  tmpsum<-sum(vecs[i,]) # subset size gives the level (y coordinate)
  nodesY[i]<-tmpsum
  nodesX[i]<-counter[tmpsum]+startval[tmpsum+1]
  counter[tmpsum]<-counter[tmpsum]+1
 }

 values<-vecs%*%keta # integer code of each subset (not used for plotting)

 # connect subsets that differ by exactly one element
 edges<-matrix(rep(FALSE,numnode^2),ncol=numnode)
 for(i in 1:(numnode-1)){
  for(j in (i+1):numnode){
   tmpdist<-sum(abs(vecs[i,]-vecs[j,]))
   if(tmpdist==1){
    edges[i,j]<-TRUE
   }
  }
 }
 Edges<-which(edges,arr.ind=TRUE)
 plot(nodesX,nodesY,pch=15)
 segments(nodesX[Edges[,1]],nodesY[Edges[,1]],nodesX[Edges[,2]],nodesY[Edges[,2]])
}

powerSetHasse(4)

4.3.3 Distance matrix and tree

When distance is defined for all pairs of N elements, the information can be expressed as an N x N non-negative symmetric matrix, called a distance matrix.

Using the nj() function in R's ape package (nj stands for the neighbor-joining method), you can generate a tree structure that fits the distance matrix.

Tree construction is one of hierarchical clustering methods, that will be a topic in Chapter 5.

An example dist() output for 5 samples:

          1        2        3        4
2  4.725108
3  1.372943 4.600435
4  4.135837 2.220284 4.311534
5  2.020896 3.608861 1.844573 3.591590

# Ns samples in k dimensional space
Ns<-5;k<-5;x <- matrix(rnorm(Ns*k), nrow=Ns)
dist(x,method="euclidean") # Euclidean distance
library(ape) # ape package
treu<-nj(dist(x,method="euclidean")) # tree construction with euclidean distance
trman<-nj(dist(x,method="manhattan")) # with manhattan distance
par(mfcol=c(1,2)) 
plot(treu);plot(trman) 
par(mfcol=c(1,1)) 

Chapter 5 Treat samples individually

The ways to handle samples can be classified into two; one is to handle samples individually and the other is to study samples as a distribution.

This chapter, Chapter 5, covers the ways to handle them individually.

The next chapter, Chapter 6, covers the ways to study them as a distribution.

5.1 Graph !

5.1.1 Definition of graph !

Graph is a pair of a set of vertices and a set of edges.

  • Vertex

  • Edge

  • Directed vs. non-directed

  • Connection

  • Degree

  • Neighbor

  • Path

  • Cycle

  • Distance between vertex pair

  • Perfect graph

  • Acyclic graph

  • Tree graph

  • Graph algorithms

5.2 Arrange samples in a line — the number line as a graph

Figure 5.1.

5.3 Tree graph !

5.3.1 Tree

\[ \text{No. edges} = \text{No. vertices} - 1 \]

  • Root; rooted tree vs. unrooted tree

  • leaf

5.3.2 Tree shape - topology

Figure 5.3.

5.4 Understanding of data by making a tree - hierarchical clustering !

5.4.1 Phylogenic tree

  • Phylogenic tree for evolutionary history

  • Rooted tree

  • Cladogram

  • Hierarchical clustering

5.4.2 Hierarchical clustering

Starting from a data set, hierarchical clusters are generated by a repeated procedure.

There are multiple hierarchical clustering algorithms, which differ in:

  • Definition of distance

  • Definition of distance between tentative clusters

  • Rules in merging order

Figure 5.5.

Try R's hclust() function (R5-1.R, Figure 5.6).

distMatrix<-dist(x,method="manhattan") # use Manhattan distance
trclust<-hclust(distMatrix,method="ward") # Ward's method for between-cluster distance
plot(trclust)

5.5 Data in the shape of matrix !

5.5.1 Rearranging the order of matrix elements; both rows and columns should be rearranged — heat map

N samples and M variables make N x M data matrix.

Using the matrix data, N samples can be hierarchically clustered and M variables can be also hierarchically clustered as shown in Figure 5.7, a heat map.

m<-matrix(rbinom(120,1,0.5),20,6)
heatmap(m)

5.5.2 Order of elements is not changed — linkage disequilibrium index plot

A data set with M samples and N markers.

For N markers, \(\frac{N(N-1)}{2}\) pairwise LD index values can be calculated.

Because the order of the N markers represents their locations, the order should not be changed; instead, evaluate the pattern of LD index values with the order kept.

The LD index r is the correlation coefficient.

cormatrix<-cor(m);rsqmatrix<-cormatrix^2
image(1:nrow(rsqmatrix),1:ncol(rsqmatrix),rsqmatrix,col=gray((100:0)/100))

5.5.3 Order of rows, order of columns

M rows x N columns data set.

Depending on your interest, you may rearrange or keep the order of columns and rows, as shown in Figure 5.9.

  • Relation among N items,

  • Relation among M items,

  • Relation both among N items and among M items

5.6 Pedigree, allele genealogy - graph of individuals of a species

Figure 5.10.

  • Pedigree: graph-like but not a graph

  • Transformation of a pedigree to a graph

  • Cousin marriage as a closed loop

  • Directed graph

5.6.1 Graph of individuals and graph of chromosomes

  1. The graph of the individual’s relationship

Autosomal recessive trait shown in Figure 5.11.

  2. The graph of chromosomal relationship

Figure 5.12 is a graph representing chromosomal relationship.

Figure 5.12 is one of many such graphs satisfying Figure 5.11, meaning there are multiple “right answers” of chromosomal relation satisfying the individual relation.

Figure 5.13 shows two answers.

This is why we have to estimate the chromosomal relation pattern(s) (genetic relation) that should describe the individual relation pattern (phenotypic relation).

5.6.2 Graph of the chromosome transmission and recombination

Figure 5.14.

Chromosomal transmission graph changes its pattern at crossing over sites.

When focusing on a particular locus, there is no branching in the chromosomal graph.

When overlapping the chromosomal graphs along the chromosome, every offspring chromosome is connected to a paternal and a maternal chromosome.

5.6.3 Going back to ancestors — Coalescent

Figure 5.15.

For each locus, multiple trees exist in a population.

Multiple leaves share a common ancestral chromosome.

When going back in time, they merge; this is the “coalescent”.

In brief, linkage analysis tries to identify the tree whose pattern fits the pattern of the phenotype best.

R code is downloadable: R5-sup1.R

5.7 Network

A tree is a graph without cycles.

A network is a graph with cycles.

In general, graphs with cycles are difficult to handle and to interpret.

Cycles may represent “positive / negative feedback”.

ARGsim<-function(n=8,g=5,l=6,r=0.05){ # n: chromosomes per generation, g: generations, l: loci, r: recombination rate
 ids<-1:n
 now<-1:n
 parents<-rep(0,n*g*l)
 data<-rep(0,n*g*l)
 for(i in 1:n){
  data[(1+(i-1)*l):(i*l)]<-i
 }
 count<-1+n*l
 for(i in 2:g){
  tmp<-sample(ids,replace=FALSE)
  for(j in 1:(n/2)){
   countinside<-1
   for(x in 1:2){
    first<-1
    if(runif(1)>0.5){
     first<-2
    }
    data[count]<-now[tmp[(j-1)*2+first]]
    parents[count]<-tmp[(j-1)*2+first]
    now[countinside]<-data[count]
    count<-count+1
    countinside<-countinside+1
    for(k in 1:(l-1)){
     if(runif(1)<r){
      first<-first+1
      if(first==3){
       first<-1
      }
     }
     data[count]<-now[tmp[(j-1)*2+first]]
     parents[count]<-tmp[(j-1)*2+first]
     now[countinside]<-data[count]
     count<-count+1
     countinside<-countinside+1
    }
   }
  }
 }

 return(list(allele=matrix(data,nrow=l),parents=matrix(parents,nrow=l)))
}
n<-8
g<-10
l<-200
r<-0.05
d<-ARGsim(n=n,g=g,l=l,r=r)
mmm<-matrix(d$parents[1,],ncol=g)


plotSlice<-function(m){ # plot parent links for one locus across generations
 points<-which(m>=0,arr.ind=TRUE) # one point per chromosome per generation
 plot(points[,1],points[,2])
 for(i in 2:length(m[1,])){
  for(j in 1:length(m[,1])){
   segments(m[j,i],i-1,j,i)
  }
 }

}
locus<-1:l

PlotSerial<-function(m,locus,g){
 for(i in locus){
  mmm<-matrix(m[i,],ncol=g)
  plotSlice(m=mmm)
 }
}
PlotOverlap<-function(m,locus,g){
 for(i in locus){
  mmm<-matrix(m[i,],ncol=g)
  par(new=T)
  plotSlice(m=mmm)
 }
}
PlotSerial(d$parents,locus=locus,g=g)
PlotOverlap(d$parents,locus=locus,g=g)

for(i in 1:l){
 mmm<-matrix(d$parents[i,],ncol=g)
 plotSlice(m=mmm)
}

locus<-1
mmm<-matrix(d$parents[locus,],ncol=g)
plotSlice(m=mmm)

Chapter 6 Capture the samples as a group !

In Chapter 5, we treated samples individually.

In this Chapter 6, we characterize the samples as a whole.

6.1 Distribution of samples !

6.1.1 One-dimensional space !

Assume samples with one quantitative variable. Let’s draw a box plot, a density plot and a cumulative distribution of them (R6-1.R).

n1<-1000;n2<-500 # No. samples
# Generate unimodal values (unimodal ~ one-peak)
popdata1<-rnorm(n1,0,0.5) # Random value generation from normal dist
par(mfcol=c(1,3)) # Divide a drawing area into three
plot(ecdf(popdata1)) # cumulative density
plot(density(popdata1)) # density
boxplot(popdata1) # boxplot
par(mfcol=c(1,1))
summary(popdata1) # summarization of samples
# bimodal values
popdata2<-c(rnorm(n1,0,0.5),rnorm(n2,5,1))
par(mfcol=c(1,3)) 
plot(ecdf(popdata2))  
plot(density(popdata2))  
boxplot(popdata2)  
par(mfcol=c(1,1))

Read unimodality from the density plots.

How do you read the same information from the cumulative plots?

Height in the density plot ~ gradient of the cumulative plot.

When it is unimodal, basic descriptive statistics such as minimum, maximum, quantiles, median and mean are informative.

Read bimodality from the density and cumulative plots.

How is the boxplot useful?

How are the basic statistics useful?

6.1.2 Two dimensional space !

What we should do for two-dimensional data sets is similar to what we did in one-dimensional space.

Let’s make a data set with three peaks in 2-dimensional space and draw a scatter plot, a perspective view and a contour diagram (R6-2.R, Figure 6.2).

# Use rnorm()
n1<-500;n2<-300;n3<-200;x<-c(rnorm(n1,0,0.5),rnorm(n2,5,1),rnorm(n3,8,2));y<-c(rnorm(n1,0,2),rnorm(n2,3,2),rnorm(n3,-3,1))
#library(gregmisc) # has a function hist2d()
library(gplots)
h2d <- hist2d(x,y, show=FALSE,same.scale=TRUE, nbins=c(10,10)) # taking 2d histogram information
plot(x,y) # scatter plot
filled.contour( h2d$x, h2d$y, h2d$counts, nlevels=9,col=gray((8:0)/8) ) # gray scale for density
persp( h2d$x, h2d$y, h2d$counts,ticktype="detailed", theta=60, phi=30,shade=0.5, col="cyan") # perspective view

When samples are in a higher-dimensional space, visualizing them is impossible (or difficult in the same way), but what we want to do for them is similar: we should grab their distribution and summarize it with descriptive statistics.

6.2 Non-hierarchical clustering !

Because the samples look like they have three peaks, let’s assign each sample to one of three cluster groups. The k-means method is one of the methods called non-hierarchical clustering methods.

First you specify the number of clusters, k.

Tentatively assign all samples to one of the k clusters. Calculate the center of each cluster based on the samples in the cluster.

For each sample, you have k center coordinates and the sample’s own coordinate. Based on these k+1 coordinates, you re-assign the sample to the cluster with the nearest center. After the reassignment, the center coordinates of the clusters are updated.

You repeat the reassignments until no reassignment is necessary.

R6-3.R draws Figure 6.2 (d).

# non-hierarchical clustering
m3<-matrix(c(x,y),ncol=2)
cl <- kmeans(m3, 3) # kmeans method, specifying no. clusters = 3
plot(m3, col = cl$cluster)
points(cl$centers,pch = 8, cex=10) # marking the center of clusters

6.3 Population genetics !

Population genetics is a field that studies populations of organisms from the genetic standpoint and is a big branch of statistical genetics.

6.3.1 Unevenness and disequilibrium !

Individuals in a population may interact with each other at random, but in reality their interactions are not always random.

Geography and other natural factors, as well as cultural factors (language, religion, ethnic identity, state), cause the non-randomness.

Random stochastic events affected by various non-random factors generate, over time, an uneven population.

6.3.2 Homogeneous populations and the Hardy-Weinberg equilibrium (HWE) — Mixture of homogeneous population !

When individuals mate at random in a population, the distribution of genotypes should be in the state called Hardy-Weinberg equilibrium.

It is easy to model an HWE population. Because of various factors violating HWE, populations in reality are not in HWE, or are in Hardy-Weinberg disequilibrium (HWD).

To model populations in HWD, you may mix multiple populations in HWE.

In what follows we model a population that is a mixture of two HWE populations.

Assume two populations and a biallelic polymorphism with A and a alleles. The allele frequencies of A in the two populations are p and q, respectively, and the total population is a mixture of the two with fraction ratio c vs. (1-c).

Two populations; \[ \begin{equation*} p^2,2p(1-p),(1-p)^2 \end{equation*} \] \[ \begin{equation*} q^2,2q(1-q),(1-q)^2 \end{equation*} \] Their mixture; \[ \begin{equation*} (g_{m1},g_{m2},g_{m3})=(cp^2+(1-c)q^2,\ 2cp(1-p)+2(1-c)q(1-q),\ c(1-p)^2+(1-c)(1-q)^2) \end{equation*} \]

Giving values 1 and 0 to A and a, respectively, you can express the state of each population with a mean and a variance. \[ \begin{align*} m_1=2p,v_1=2p(1-p)\\ m_2=2q,v_2=2q(1-q) \end{align*} \]

The mixture population’s allele frequency is \[ cp+(1-c)q, \] which is easy.

The mean of the mixture population is also easy. \[ \begin{equation*} m_m=2(cp+(1-c)q)=c m_1+(1-c)m_2 \end{equation*} \]

Variance is a bit difficult, because we have to consider covariance as we learned in Chapter 3.
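
For completeness, the standard mixture decomposition (a step the text leaves implicit) gives the variance of the mixture as \[ \begin{equation*} v_m=c v_1+(1-c)v_2+c(1-c)(m_1-m_2)^2, \end{equation*} \] where the last term plays the role of the covariance-like contribution that appears whenever the two component means differ.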

6.3.3 Chronological changes !

  1. Diffusion equation

Assume two isolated islands, A-island and B-island.

At some time point, all chromosomes had M allele in A-island and all had m in B-island.

Suddenly the two islands are connected somehow.

Since the connection was established, \(d\) individuals move from A-island to B-island per unit time and the same number \(d\) move from B-island to A-island, without change in the population sizes of both islands, \(P_A, P_B\).

Let \(Fa(t),Fb(t)\) denote allele frequency of M at time t in A and B islands, respectively, with t=0 being the time of establishment of connection.

When \(t \le 0\), \(Fa(t)=1, Fb(t)=0\).

The change from \(t=T\) to \(t=T+\delta\) is expressed as below; \[ \begin{equation*} 2P_A Fa(T+\delta)=2P_A Fa(T) -2d\delta Fa(T) + 2d \delta Fb(T) \end{equation*} \] \[ \begin{equation*} 2P_B Fb(T+\delta)=2P_B Fb(T) -2d\delta Fb(T) + 2d \delta Fa(T). \end{equation*} \] Considering \(P_A, P_B\), \[ \begin{equation*} 2P_A P_B (Fa(T+\delta)-Fb(T+\delta))=2P_A P_B (Fa(T)-Fb(T))-4d\delta (P_A+P_B)(Fa(T)-Fb(T)), \end{equation*} \] then, \[ \begin{equation*} (Fa(T+\delta)-Fb(T+\delta))-(Fa(T)-Fb(T))=- \frac{2d(P_A+P_B)}{P_A P_B}\delta\, (Fa(T)-Fb(T)). \end{equation*} \]

\[ \begin{equation*} G(T+\delta)-G(T)=- \frac{2d(P_A+P_B)}{P_A P_B}\delta\, G(T) \end{equation*} \] where \(G(T)=Fa(T)-Fb(T)\), and \[ \begin{equation*} \frac{d}{dt} G(t)=- \frac{2d(P_A+P_B)}{P_A P_B} G(t) \end{equation*} \] then, \[ \begin{equation*} G(t)= K e^{- \frac{2d(P_A+P_B)}{P_A P_B}t} \end{equation*} \]

Now \(G(0)=1\), therefore \(K=1\).

The differential equation of \(G(t)\) means that the change per unit time is proportional to the difference in allele frequency between the two islands.

Because \[ \begin{equation*} P_AFa(t)+P_BFb(t)=P_AFa(0)+P_BFb(0)=P_A, \end{equation*} \] therefore, \[ \begin{equation*} Fa(t)=\frac{P_A + P_B e^{- \frac{2d(P_A+P_B)}{P_A P_B}t}}{P_A+P_B} \end{equation*} \] \[ \begin{equation*} Fb(t)=\frac{P_A (1- e^{- \frac{2d(P_A+P_B)}{P_A P_B}t})}{P_A+P_B} \end{equation*} \]

Let’s draw this chronological change with R (R6-4.R, Figure 6.3).

pa<-9000;pb<-1000;d<-100;t<-0:100 # pa,pb: population sizes, d: number of immigrants per unit time, t: generation
fa<-(pa+pb*exp(-2*d*(pa+pb)/(pa*pb)*t))/(pa+pb)
fb<-(pa*(1-exp(-2*d*(pa+pb)/(pa*pb)*t)))/(pa+pb)
plot(t,fa,ylim=c(0,1),type="l",xlab="t",ylab="frequency")
par(new=T)
plot(t,fb,ylim=c(0,1),type="l",xlab="t",ylab="frequency")

Over time (horizontal axis), the allele frequencies of the two islands converge to the same value. The speed of convergence depends on the fraction of migrating individuals in the island populations.

Eventually the allele frequencies of the two islands become the same even with ongoing migration; this is an equilibrium.

The differential equation here is the simplest form of the equation called the diffusion equation in thermodynamics, a branch of physics.

  1. Represented by a transition matrix !

Assume five islands, A, B, C, D and E. There is no change in the population of each island. The fraction of movement from one island to the others is constant.

The number moving from island A to island X per unit time is \[ P_A m_{A\to X};\ X=B,C,D,E \]

and the fraction not moving from island A is \[ m_{A\to A}=1-\sum_{X\in \{B,C,D,E\}}m_{A \to X}, \] then

\[ f_A(T+1)=f_A(T)m_{A\to A} +\sum_{X \in \{B,C,D,E\}} f_{X}(T) m_{X \to A}=\sum_{X \in \{A,B,C,D,E\}} f_{X}(T) m_{X \to A} \]

The matrix representation of these is, \[ \begin{pmatrix} m_{A\to A}&m_{B\to A}& m_{C\to A}& m_{D\to A}& m_{E\to A}\\ m_{A\to B}&m_{B\to B}& m_{C\to B}& m_{D\to B}& m_{E\to B}\\ m_{A\to C}&m_{B\to C}& m_{C\to C}& m_{D\to C}& m_{E\to C}\\ m_{A\to D}&m_{B\to D}& m_{C\to D}& m_{D\to D}& m_{E\to D}\\ m_{A\to E}&m_{B\to E}& m_{C\to E}& m_{D\to E}& m_{E\to E} \end{pmatrix} \begin{pmatrix} f_A(T) \\ f_B(T) \\ f_C(T) \\ f_D(T) \\ f_E(T) \end{pmatrix} = \begin{pmatrix} f_A(T+1) \\ f_B(T+1) \\ f_C(T+1) \\ f_D(T+1) \\ f_E(T+1) \end{pmatrix} \]
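
A minimal sketch of this iteration in R (the migration fractions and initial frequencies are assumed values; with equal population sizes the frequency update is exactly this matrix product):

m <- matrix(0.02, 5, 5) # assumed migration fraction between every island pair
diag(m) <- 0
diag(m) <- 1 - colSums(m) # staying fraction; each column now sums to 1
f <- c(1, 0, 0, 0, 0) # initial allele frequencies f_A ... f_E
for (T in 1:200) f <- as.vector(m %*% f) # repeated transition
f # all five frequencies converge toward a common value (0.2 here)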

6.3.4 Movement in space !

In the example of Section 6.3.3, the positions of individuals within the islands were not considered. It means the islands were assumed to be points without size, or 0-dimensional objects.

Now we consider “space”.

As the simplest example, assume a straight line, a 1-dimensional space where individuals are located and move around.

They move because some locations in the space are better to live in.

Individuals tend to move to the best spot and eventually all individuals end up at the best spot.

This movement is expressed in the form of a diffusion equation.

The R source code to draw this figure can be downloaded (R6-sup1.R).

L<-100 # number of positions on a circular 1-dimensional space
ngen=100 # number of generations
X<-rep(0,L)
X[20]<-1 # everything starts at position 20
Peak<-60.3 # the best spot
Benefit<-L-abs((1:L)-Peak) # benefit decreases with distance from the peak
counter<-0
C<-10 # plot every C generations
frac<-0.5 # fraction moving to the better neighboring spot per generation
plot(X,ylim=c(0,1),type="l")
for(i in 2:ngen){
 tmpX<-rep(0,L)

 for(j in 1:L){
  pre<-j-1
  self<-j
  post<-j+1
  if(j==1) pre<-L
  if(j==L) post<-1
  tmpB<-c(Benefit[pre],Benefit[self],Benefit[post])
  maxid<-which(tmpB==max(tmpB))
  if(maxid==1){
   tmpX[pre]<-tmpX[pre]+X[self]*frac
  }else if(maxid==2){
   tmpX[self]<-tmpX[self]+X[self]*frac
  }else if(maxid==3){
   tmpX[post]<-tmpX[post]+X[self]*frac
  }
  tmpX[self]<-tmpX[self]+X[self]*(1-frac)
 }
 counter<-counter+1
 if(counter==C){
  par(new=T)
  plot(tmpX,ylim=c(0,1),type="l")
  counter<-0
 }
 X<-tmpX

}

PlotMove<-function(L=100,ngen=200,loc=20,Peak=60.3,C=10,frac=0.5){
X<-rep(0,L)
X[loc]<-1
Peak=Peak
Benefit<-L-abs((1:L)-Peak)
counter<-0
C<-C
frac<-frac
plot(X,ylim=c(0,1),type="l")
for(i in 2:ngen){
 tmpX<-rep(0,L)

 for(j in 1:L){
  pre<-j-1
  self<-j
  post<-j+1
  if(j==1) pre<-L
  if(j==L) post<-1
  tmpB<-c(Benefit[pre],Benefit[self],Benefit[post])
  maxid<-which(tmpB==max(tmpB))
  if(maxid==1){
   tmpX[pre]<-tmpX[pre]+X[self]*frac
  }else if(maxid==2){
   tmpX[self]<-tmpX[self]+X[self]*frac
  }else if(maxid==3){
   tmpX[post]<-tmpX[post]+X[self]*frac
  }
  tmpX[self]<-tmpX[self]+X[self]*(1-frac)
 }
 counter<-counter+1
 if(counter==C){
  par(new=T)
  plot(tmpX,ylim=c(0,1),type="l")
  counter<-0
 }
 X<-tmpX

}

}

PlotMove()

6.4 Thermodynamics, statistical mechanics, fluid dynamics !

6.4.1 Spatio-temporal, finite and infinite

So far, simple space and time have been handled. Three-dimensional space and time make four-dimensional spacetime.

Sometimes space is not “flat”. For example, a sphere is a three-dimensional object and its surface exists in three-dimensional space, but the surface itself is two-dimensional, and it is closed and finite.

Spatio-temporal space is handled, of course, in life science and genetics, but we also handle spaces of information, or spaces of observed variables, which can be very high-dimensional.

When we study informational spaces, they vary a lot.

There are two points for you to be sensitive about spaces.

  • Finite vs. infinite

  • Open vs. closed

These two points look similar but are somewhat different. For example, when a space is closed, the space itself is finite but movements in it might not be (Figure 6.5).

When we make a model, we often assume infinity, which usually makes the model easier.

Population size can be modeled as increasing infinitely; that might be enough for some cases but not when you have to consider saturation.

Point mutations can be modeled as happening on a DNA molecule of infinite length; that might be good for some scenarios but not for others.

6.4.2 Uniform, equilibrium, steady !

Let's look at space and time in general from the standpoints of physics and thermodynamics.

A world that is a point, without time.

Everything is located at the same spot and everything happens right now. HWE is achieved without delay.

A world that has some spatial extent but no time. A spatial distribution appears. The distribution might consist of discrete items. Sometimes it is easier to handle a material with density in the space: a “continuum”. In physics, it might be called a fluid.

Even with a spacious space, if there is no time, everything happens right now, which means the equilibrium is achieved without transition states. It might be called a steady state.

Assume a world with time.

Now we can have states that move from one to another. There are states that are periodic, and periodic conditions might be steady states.

In biology, steady but dynamic conditions seem to be dominant.

These descriptions required terms from thermodynamics. Thermodynamics handles a population of very small items as if the components were infinitesimally small.

Statistical mechanics is a branch of physics closely related to thermodynamics; it pays attention to individual components and their stochastic features, which allows us to study chaotic, non-linear phenomena.

That is why thermodynamics and statistical mechanics are useful for studying complicated phenomena in life science.

Part III Characterization of a collection of samples

Chapter 7 Scale, variable, degrees of freedom, dimension
Chapter 8 Statistics, index, probability, likelihood
Chapter 9 Probability and likelihood
Chapter 10 Likelihood and variables in linkage analysis
Chapter 11 Indices

In Part II, we discussed what data are, what the types of data are, and what should be taken into account in order to handle samples. Part III introduces the concepts necessary for handling actual data.

Chapter 7 Scale, variable, degrees of freedom, dimension !

7.1 Summarize a data set !

7.1.1 Summarize a contingency table !

  1. How many values are necessary to describe a data set
  • Total number

  • Marginal counts

  • Expected value of cells under the assumption of independence

  • Deviation from expected

  • Two out of 2x3=6 cells

  2. Relationship between the variable sets

\[ \begin{align*} &n_{..}=\sum_{i=1}^2 \sum_{j=1}^3 n_{ij}\\ &n_{1.}=\sum_{j=1}^3 n_{1j}, n_{.1}=\sum_{i=1}^2 n_{i1}, n_{.2}=\sum_{i=1}^2 n_{i2}\\ &\delta_{11}=n_{11}- \frac{ n_{1.}\times n_{.1}}{n_{..}}, \delta_{12}=n_{12}- \frac{ n_{1.}\times n_{.2}}{n_{..}} \end{align*} \]

  3. Rough vs. Fine
  • Total \(n_{..}\) and marginal counts were …

  • Rough vs. Fine

  • Rough: “No big difference from expected”

  • Fine : “Big difference from expected like …”

  4. Quantification of the rough report and the degrees of freedom
  • “How big” can be measured as some kind of “distance”, a statistical value.

  • e.g. \(\chi^2\) statistics.

  • \(\chi^2\) value is a function of deviation value of two cells.

  • \(\chi^2\) value tends to be bigger when it is a function of more variables.

  • No. variables: degrees of freedom

  • It is convenient to convert the \(\chi^2\) value, which depends on the degrees of freedom, into something independent of the degrees of freedom: the p-value (R7-1.R).

obtable2 <- matrix(c(25, 23, 12, 15, 17, 8), nrow = 2, byrow = TRUE) # 2x3 table
chisq.test(obtable2)
# Measuring "distance" as chi-sq and converting it to p-value independent from condition, df
  • Rough: “No big difference from expected” can be rephrased as

  • Rough: “No big difference from expected with p = 0.901”

  5. The degrees of freedom of the contingency table

2x3 table has 6 values.

Six values are necessary to tell the table in full.

No. variables ~ degrees of freedom of 2x3 table is 6.

But it is often said that the “degrees of freedom” of a 2x3 table is 2.

What is the difference between them?

7.1.2 Summarize quantitative data

  1. Variation
  • Total 100 patients, each taking one of three drugs

  • Three groups 40 + 40 + 20 = 100

  • A blood test

See Figure 7.1 (a).

  • 100 values to describe 100 patients test results.

  • The average of all patients together with 99 patients’ individual results: you can calculate the 100th patient’s result from the average and the other 99 values.

See Figure 7.1 (b).

The averages of the three groups are different. Telling each group’s average is a good idea.

Again 100 values are necessary:

  • 1 total average

  • 2 group averages (the third is determined by the total)

  • \(n_1-1\) individual values of group 1

  • \(n_2-1\) individual values of group 2

  • \(n_3-1\) individual values of group 3

  2. Small variation among samples in the same groups
  • Rough vs. fine

  • Total mean only

  • Total mean and total variance

  • When the differences among group means are big enough, group mean values should be added

  • Do variances differ among groups or not? Is the difference big enough?

SSw: sum of squared deviations around the total mean \(m_w\): \[ \begin{equation*} SSw=\sum_{i,j} (v_{ij}-m_w)^2 \end{equation*} \]

SSi: sum of squared deviations of each group around the group’s mean \[ \begin{equation*} SSi=\sum_{i=1}^3 \sum_{j=1}^{n_{i}} (v_{ij}-m_{x_i})^2 \end{equation*} \]

SSb: sum of squared deviations of group means around the total mean

\[ \begin{equation*} SSb=\sum_{i=1}^3 n_i \times (m_{x_i}-m_w)^2 \end{equation*} \]

Their relation:

\[ \begin{equation*} SSw=SSi+SSb \end{equation*} \]
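
A quick numerical check of this identity, as a minimal sketch with assumed group sizes and means:

v <- list(rnorm(40, 50), rnorm(40, 60), rnorm(20, 45)) # three groups (assumed)
mw <- mean(unlist(v)) # total mean
SSw <- sum((unlist(v) - mw)^2) # squared deviations around the total mean
SSi <- sum(sapply(v, function(g) sum((g - mean(g))^2))) # within groups
SSb <- sum(sapply(v, function(g) length(g) * (mean(g) - mw)^2)) # between groups
all.equal(SSw, SSi + SSb) # TRUE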

Given a data set, \(SSw\) is calculated.

When \(SSi\) is relatively bigger, \(SSb\) is relatively smaller, and it means no big difference among groups.

Because \(SSw\) is fixed by the data, once \(SSi\) is known, \(SSb\) is automatically determined.

In other words, only two out of three (\(SSw,SSi,SSb\)) are free.

Therefore either one of \[ \begin{equation*} \frac{SSi}{SSw},\frac{SSb}{SSw}, \frac{SSb}{SSi} \end{equation*} \] represents the difference among groups.

It is fine to use any one of three.

Considering degrees of freedom of \(SSi\) and \(SSb\), \[ \begin{equation*} F=\frac{\frac{SSb}{df_b}}{\frac{SSi}{df_i}} \end{equation*} \] is used to measure the intensity of difference among groups.

Statistical test using this F statistics is called analysis of variance (ANOVA).

As we converted the \(\chi^2\) value to a p-value, we convert the \(F\) value to a p-value based on the F-distribution.

# Per-group sample sizes, means and SDs for the 3 groups
nx <- c(40, 40, 20); mb<- c(50, 60, 45); sdb <- c(50, 50, 50)*0.01
# Generate data with normal pseudo-random numbers
t1 <- rnorm(nx[1], mb[1], sdb[1]); t2 <- rnorm(nx[2], mb[2], sdb[2]); t3 <- rnorm(nx[3], mb[3], sdb[3])
tb <- c(t1, t2, t3)
# Attach group labels to the samples
x <- c(rep("x1", nx[1]), rep("x2", nx[2]), rep("x3", nx[3]))
boxplot(tb ~ x)
# ANOVA
summary(aov(tb ~ x))

\(\frac{SSb}{df_b}\) and \(\frac{SSi}{df_i}\) are unbiased estimates of the inter-group and intra-group variances.

7.2 Dimension, independence and orthogonality

7.2.1 Degrees of freedom and dimension

Degrees of freedom ~ difference of No. parameters

2x2 table df=1

2x3 table df =2

When something has d degrees of freedom, it should be located in d-dimensional space.

In Chapter 13, tables and their statistics are displayed in d-dimensional space.

7.2.2 Degree of freedom and linear independence — matrix

6 cells in 2x3 table

There are 3 column sums, 2 row sums and 1 total sum; 3 + 2 + 1 = 6 sums.

6 cell values, 6 variables, that satisfy 6 equations; that can be expressed as a linear algebraic form.

\[ \begin{pmatrix} 1 & 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 1 \\ 1 & 0 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 & 1 \\ 1 & 1 & 1 & 1 & 1 & 1 \\ \end{pmatrix} \begin{pmatrix} r_1 \\ r_2 \\ r_3 \\ s_1 \\ s_2 \\ s_3 \end{pmatrix} = \begin{pmatrix} R \\ S\\ t_1 \\ t_2 \\ t_3 \\ T \end{pmatrix} \]

Can we solve this system?

Rank of matrix will tell the solution.

Degrees of freedom = number of variables - rank.

m<-matrix(c(1,1,1,0,0,0,
    0,0,0,1,1,1,
        1,0,0,1,0,0,
        0,1,0,0,1,0,
        0,0,1,0,0,1,
        1,1,1,1,1,1, # 6 rows down to here are constraints 
        1,0,0,0,0,0, # (1,1) cell value is given
        0,1,0,0,0,0), # (1,2) cell value is given
        ncol=6,byrow=TRUE)
# qr() returns rank
6-qr(m[1:6,])$rank # with 6 row constraints...
6-qr(m[1:8,])$rank # when two cells' values are determined...

7.2.3 Stochastic independence and orthogonality

When independent,

  • they are mutually perpendicular (angle \(\frac{\pi}{2}\))

  • their inner product is zero

  • joint probability should be multiplication of two probabilities

(R7-4.R).

Ns <- 10000; A <- rnorm(Ns); B <- rnorm(Ns) # Making two vectors independently
ip<-sum(A*B) # inner product
cos<- sum(A * B)/sqrt(sum(A * A) * sum(B * B)) # cosine
acos(cos)/pi # angle as a fraction of pi (about 0.5, i.e., perpendicular)

7.2.4 Linear independence and orthogonal basis

k vectors span k-dimensional space when they are linearly independent.

These vectors constitute a basis of the space.

When k linearly independent vectors are mutually perpendicular, they constitute an orthogonal basis.

When a matrix M represents an orthogonal basis and all its columns are unit vectors, then M %*% t(M) is the identity matrix.

It is called an orthonormal basis.

An orthonormal basis matrix is generated by singular value / eigenvalue decomposition, which tells us a “good-looking” angle from which to view a multidimensional data set.

R7-5.R is such an example.
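
As a minimal check (the 3x3 example below is an assumption, not taken from R7-5.R), the Q factor of a QR decomposition provides an orthonormal basis:

M <- matrix(rnorm(9), 3, 3)
Q <- qr.Q(qr(M)) # orthonormal basis spanning the same column space
round(Q %*% t(Q), 10) # identity matrix up to rounding error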

7.2.5 Taking out an orthonormal basis — eigenvalue decomposition !

Consider many specimens divided into multiple groups, with data taken on a relatively large number of quantitative items. The data are created and processed as follows.

#4 big subpopulations ~ 100 and 20 small subpopulations ~ 10
#100 genetic markers
Nm<-100 
Ns<-c(rpois(4,100),rpois(20,10)) 
Npop<-length(Ns) # No. subpops
M<-NULL # genotypes
# means for each subpop
for(j in 1:Npop){
 tmpM<-matrix(rep(0,Nm*Ns[j]),ncol=Nm)
 for(i in 1:Nm){ 
  af<-rnorm(1)
  tmpM[,i]<-rnorm(Ns[j],af) 
 }
 M<-rbind(M,tmpM)
}
# Standardize
wholemean<-mean(M)
M<-M-wholemean # Total mean = 0
mu<-apply(M,2,mean) # column mean
M<-t(t(M)-mu) # column means = 0
# decomposition
svdout<-svd(M)
M2<-svdout$u%*%diag(svdout$d) # Post decomposition data matrix
par(mfcol=c(1,2))
# plot
image(1:sum(Ns),1:Nm,M,xlab="Big sub pops -> small sub pops",ylab="variables")
image(1:sum(Ns),1:Nm,M2,xlab="Big sub pops -> small sub pops",
ylab="post decomposition")
df1<-as.data.frame(M);df2<-as.data.frame(M2) # data frame
L<-1:5;par(mfcol=c(1,1))
plot(df1[,L]) # pre decomposition
plot(df2[,L]) # post decomposition
vM1<-apply(M,2,var) # variances pre
vM2<-apply(M2,2,var) # variances post
ylim<-c(min(vM1,vM2),max(vM1,vM2))

plot(vM1,ylim=ylim,type="b")
par(new=T)
plot(vM2,ylim=ylim,type="b",col="red")

7.3 How to set variables

7.3.1 Structure of variables

Degree of freedom tells No. variables necessary to describe a data set.

Various ways to have variables.

  • Treat every variable equally

  • Order them and use tops

  • Hierarchical handling of variables

7.3.2 Variables chosen by meaning, variables chosen by data structure

A variable set carries some meaning.

To look at a data set from another aspect, you may change the variable set to a different one, which may have “biological meanings”.

Chapter 8 Statistics, index, probability, likelihood !

After taking out some featuring values from a data set, we want to “understand” what the values mean.

8.1 Probability distribution

8.1.1 Distribution

  • Distribution

  • Stochastic variable

  • Probability distribution

  • Probability is non negative

  • Support

8.1.2 Discrete probability distributions

Discrete values, “red” or “white”, “yes” or “no”, “1” or “2” or “3”….

The sum of the probability mass function is 1. Figure 8.1 (R8-1.R).

barplot(c(0.4,0.6),names.arg=c("red","white"),ylab="probability") # labels translated to English
pie(c(0.3,0.2,0.4,0.1),labels=c("red","white","blue","yellow")) # the last two labels are illustrative; the originals were garbled

8.1.3 Continuous probability distribution — Exponential distribution

Continuous.

Example: a variable taking non-negative real values.

The probability density is halved each time the value of the variable increases by L.

Decrease in radioactivity.

\[ \begin{equation*} P(x+L)=\frac{1}{2}\times P(x) \end{equation*} \]

\[ \begin{equation*} Q(x+L)=log(P(x+L))=log(P(x))-log(2) = Q(x)-log(2) \end{equation*} \]

With \(\lambda=\frac{log(2)}{L}\)

\[ \begin{equation*} P(x)=P(0)\times e^{-\lambda x} \end{equation*} \]

\[ P(x)=P(0)\times exp(-\frac{x}{L}log(2)) \]

Because integration of \(P(x)\) over all possible values of x should be 1,

\[ \begin{equation*} \lim_{X\to \infty} \int_0^X P(x) dx = \lim_{X\to \infty} P(0) \bigl[ -\frac{1}{\lambda} e^{-\lambda x} \bigr] ^X_0 = P(0) \times \frac{1}{\lambda} =1 \end{equation*} \]

then \(P(0)=\lambda\) and

\[ \begin{equation*} P(x)=\lambda e^{-\lambda x};\lambda=\frac{log(2)}{L} \end{equation*} \]

This is the probability density function. It is the exponential distribution.

\[ Q(x+L)-Q(x)=-log(2) \]
\(Q(x)\) decreases linearly as \(x\) increases.
\[ Q(x)=Q(0) - \frac{x}{L} log(2) \]
Putting \(P(x)\) back into it, \[ log(P(x))=log(P(0))-\frac{x}{L} log(2) \] then, \[ P(x)=P(0)\times exp(-\frac{x}{L}log(2)) \]
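
A quick numerical check with base R (L = 2 is an assumed value): the density integrates to 1 and is halved every L.

L <- 2; lambda <- log(2)/L
integrate(function(x) lambda * exp(-lambda * x), 0, Inf)$value # 1
dexp(3 + L, rate = lambda) / dexp(3, rate = lambda) # 0.5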

If x can be negative, zero or positive, \[ \begin{equation*} P(x)=\frac{\lambda}{2} e^{-\lambda |x|} \end{equation*} \]

Because the space becomes twice as large, the probability density function becomes half.

8.1.4 The difference between the exponential distribution and normal distribution

When the decrease in log probability is proportional to the distance in a direction, it is the exponential distribution.

When the decrease in log probability is proportional to the square of the distance, then,

\[ \begin{equation*} P(x)=C e^{-\lambda |x|^2} \end{equation*} \]

Since the density is symmetric, \[ \begin{equation*} \int_{0}^{\infty} P(x) dx=\frac{1}{2}, \end{equation*} \] so, \[ \begin{equation*} P(x)=\frac{1}{\sqrt{\pi}}\lambda^{\frac{1}{2}} e^{-\lambda |x|^2} \end{equation*} \] \[ \begin{equation*} k=2,\lambda=\frac{1}{2\sigma^2} \end{equation*} \]

This is the formula of normal distribution.

8.1.5 Uniform distribution, exponential distribution, normal distribution, rectangular distribution —General normal distribution !

\[ e^{-\lambda |x|^k} \]

  • k=1 exponential distribution (Laplace distribution) \(\frac{1}{2}\lambda e^{-\lambda x}\)

  • k=2 normal distribution \(\frac{1}{\sqrt{\pi}}\lambda^{\frac{1}{2}}e^{-\lambda x^2}\)

  • \(k \to 0\): density 1 at \(x=0\) and uniform (\(e^{-\lambda}\)) elsewhere

  • \(k \to \infty\), rectangular distribution

\[ C \lambda^{\frac{1}{k}}e^{-\lambda |x|^k} \]

\[ \begin{equation*} P(x;k)=\frac{1}{2} \frac{1}{\Gamma(\frac{1}{k}+1)} \lambda^{\frac{1}{k}} e^{-\lambda |x|^k} \end{equation*} \] where \(\Gamma\) is gamma function.

This form is the generalized normal distribution.
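
A minimal sketch plotting this density for several values of k (lambda = 1 is an assumed value), showing the Laplace, normal and near-rectangular shapes:

gnorm <- function(x, k, lambda = 1) { # generalized normal density above
 0.5 * lambda^(1/k) / gamma(1/k + 1) * exp(-lambda * abs(x)^k)
}
x <- seq(-3, 3, by = 0.01)
plot(x, gnorm(x, 1), type = "l", ylim = c(0, 1), ylab = "density") # k=1: Laplace
lines(x, gnorm(x, 2), col = "red") # k=2: normal shape
lines(x, gnorm(x, 8), col = "blue") # large k: approaching rectangular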

8.1.6 Normal distribution, chi distribution and dimension

Normal distribution

\[ \begin{equation*} \frac{1}{\sqrt{2\pi \sigma^2}} e^{-\frac{x^2}{2\sigma^2}} \end{equation*} \]

Standard normal distribution \(mean=0\),\(\sigma^2=1\) \[ \begin{equation*} \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}} \end{equation*} \]

The exponential distribution is the sum of the negative side and the positive side of the Laplace distribution.

Let’s do the same thing for normal distribution.

\[ \begin{equation*} 2\times \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}=\sqrt{\frac{2}{\pi}}e^{-\frac{x^2}{2}} \end{equation*} \] This is the \(\chi\) distribution with df = 1.

This sum of negative side and positive side means that we want to handle distance rather than coordinates.

In one dimensional space, there are two points that have the same distance from the origin.

Now we change our dimension from 1 to 2.

\[ \begin{equation*} 2\pi x \end{equation*} \]

This is the length of a circle with radius x, that is sort of “number of points whose distance from the origin is x”.

The following is the \(\chi\) distribution for 2d space. \[ \begin{equation*} 2\pi x \times \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}= \sqrt{2\pi} x e^{-\frac{x^2}{2}} \end{equation*} \]

Double check this formula as below.

Because \(\frac{d e^{ax^2}}{dx} = 2ax e^{ax^2}\),

\[ \begin{equation*} \int_{0}^{\infty} \sqrt{2\pi} x e^{-\frac{x^2}{2}} dx = \sqrt{2\pi} \bigl[ -e^{-\frac{x^2}{2}} \bigr] _0^{\infty}=\sqrt{2\pi} \end{equation*} \]

Unfortunately the total sum is not equal to 1.

Dividing by \(\sqrt{2\pi}\) to adjust the sum,

\[ \begin{equation*} \frac{1}{\sqrt{2\pi}} \times \sqrt{2 \pi} x e^{-\frac{x^2}{2}}=x e^{-\frac{x^2}{2}} \end{equation*} \]

This corresponds to the formula of the \(\chi\) distribution where df = 2, \[ \begin{equation*} \frac{2^{1-\frac{k}{2}}}{\Gamma(\frac{k}{2})} x^{k-1} e^{-\frac{x^2}{2}} = x e^{-\frac{x^2}{2}} \end{equation*} \] In general the surface area of the k-dimensional sphere of radius x is, \[ \begin{equation*} S(x;k)=2\frac{1}{\Gamma(\frac{k}{2})} \pi ^ {\frac{k}{2}} x^{k-1} \end{equation*} \]

then, \(\chi\) distribution of df=k should be proportional to \[ \begin{equation*} S(x;k) \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}= \frac{1}{\sqrt{2\pi}}\frac{2\pi^{\frac{k}{2}}}{\Gamma(\frac{k}{2})} x^{k-1} e^{-\frac{x^2}{2}} \end{equation*} \]

or,

\[ \begin{align*} &C \times S(x;k)\frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}} = (\frac{1}{\sqrt{2\pi}})^k S(x;k) e^{-\frac{x^2}{2}}\\ &C=(\frac{1}{2\pi})^{\frac{k-1}{2}} \end{align*} \]

When you look at k-dimensional normal distribution formula, you will find the term of \((\frac{1}{\sqrt{2\pi}})^k\), because of the same reason.

8.1.7 From Chi distribution to Chi-square distribution

  • The \(\chi\) distribution cares about distance.

  • The \(\chi^2\) distribution cares about the square of the distance.

Transforming the \(\chi\) distribution with \(X= x^2\),

\[ \begin{equation*} Pr(\chi^2=X)=\frac{1 }{2^{\frac{k}{2}} \Gamma(\frac{k}{2})} X^{\frac{k}{2}-1} e^{-\frac{X}{2}} \end{equation*} \]
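
A quick numerical check against R's dchisq() (df = 3 is an assumed value):

k <- 3; X <- seq(0.1, 10, by = 0.1)
manual <- 1/(2^(k/2) * gamma(k/2)) * X^(k/2 - 1) * exp(-X/2)
all.equal(manual, dchisq(X, df = k)) # TRUE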

8.1.8 Which \(\chi^2\) value is most likely to be observed?

Figure 8.5.

Chapter 9 Probability and likelihood !

9.1 Probability and likelihood

9.1.1 Probability

Assume 40 balls in a bag, 12 of which are labeled “A” and the rest “B” (R9-1.R, Figure 9.1).

Take out a ball 10 times from the bag.

Two ways to take out balls; with replacement and without replacement.

There are 11 possible outcomes: the number of “A” balls can be 0,1,…,10.

# With replacement
p <- 0.3; x <- 0:10 # Fraction of A is 0.3
pr <- choose(10, x) * p^x * (1 - p)^(10 - x)
A <- 12; B <- 28; S <- A + B
x <- 0:10; y <- 10 - x
# without replacement
pr2 <- exp(lgamma(10 + 1) + lgamma(30 + 1) + lgamma(A + 1) + lgamma(B + 1) -  (lgamma(S + 1) + lgamma(x + 1) + lgamma(y + 1) + lgamma(A - x + 1) + lgamma(B -  y + 1)))
ylim <- c(0, 0.5)
plot(x-0.5, pr, ylim = ylim, type = "s")
par(new = T); plot(x - 0.5, pr2, ylim = ylim, type = "s", col = "red")

Assume 40 balls in a bag.

Assume no. “A” balls is changed from 0 to 40.

exProb<-function(m){ # exact prob
 m1<-apply(m,1,sum) # row sum
 m2<-apply(m,2,sum) # col sum
 s<-sum(m)
 exp(sum(lgamma(m1+1))+sum(lgamma(m2+1))-(lgamma(s+1)+sum(lgamma(m+1))))
}
# conditions
S<-40;A<-0:S;B<-S-A;N<-10;x<-0:N

probmat<-matrix(0,length(A),length(x))
for(i in A){
 for(j in x){
  y<-N-j;  z<-i-j;  w<-S-(j+y+z);
  if(j>=0 & y>=0 & z>=0 & w>=0){
   data<-c(j,y,z,w)
   probmat[i+1,j+1]<-exProb(matrix(c(j,y,z,w),nrow=2))
  }
 }
}

phi<-80;theta<-0;shade<-0.3 # Parameters for drawing
persp(A,x,probmat,xlab="No. A in 10", ylab="No. A in 40",zlab="probability",phi=phi,theta=theta,shade=shade)
plot(A - 0.5, probmat[, 4], type = "s") # Probability no. "A" = 3 in the taken out 10 balls.
abline(h = 0, col = "red")

9.1.2 Likelihood

Assume a bag with 40 balls.

You took out 10 balls without replacement.

You found 3 “A” and 7 “B” balls.

What can you tell about no. “A” in the bag before you took out balls?

No. “A” should be between 0 and 40, because the number of balls in the bag was 40.

No. “A” should be 3 or more because 3 “A”s were taken out, and 33 or fewer because 7 “B”s were taken out.

See Figure 9.3.

The probability to observe something and the likelihood given an observation are mutually orthogonal sections of this surface.

Looking at the section for likelihood, no. “A” = 0,1,2,34,…,40 is impossible ~ their likelihood = 0.

\[ \begin{equation*} L(A=a|(3,7))=\frac{a!(40-a)!10!30!}{40!3!7!(a-3)!(33-a)!} \end{equation*} \]
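
The same likelihood can be computed directly with choose(); a minimal check that reproduces the hypergeometric column probmat[, 4] above:

a <- 0:40
lik <- choose(a, 3) * choose(40 - a, 7) / choose(40, 10) # zero outside 3..33
plot(a - 0.5, lik, type = "s")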

9.1.3 The sum of probability is 1. The sum of likelihood is not 1.

You can check this with R9-3.R.

sum(probmat[, 4]) 

After the observation of “A” = 3 and “B”= 7, you put them back to the bag.

Then again you took out 10 balls.

This time “A” = 7 and “B” = 3.

How can we calculate likelihood of no. “A” in the original bag? (R9-4.R).

# probmat[,4]:Likelihood for "A" = 3 and "B" = 7
# probmat[,8]:Likelihood for "A" = 7 and "B" = 3
# Multiply them because two observations are independent each other
plot(A - 0.5, probmat[, 4] * probmat[, 8], type = "s")

Combining the results of the two draws, the likelihood is largest when the 40 balls were half “A” and half “B” (Figure 9.4 (c)).

9.1.4 Logarithm of likelihood — Log-likelihood, prior and posterior probabilities

  • Multiplication: probabilities can be multiplied when events are mutually independent. The same holds for likelihoods.

  • \(\prod p \to \sum \log(p)\) : easy

  • The absolute value of likelihood is not so important; the relative value of likelihood matters more, because conditions/hypotheses should be compared by their likelihoods.

  • It is a good idea to rescale likelihood so that its sum/integral over all conditions/hypotheses becomes 1.

  • Prior and posterior distributions are such rescaled likelihoods.

  • Observation transforms the prior distribution into the posterior distribution. This is “Bayesian”.

9.2 Conditional probability, probability, likelihood, dependence

9.2.1 Categorical type

Two 2x2 tables; one is independent and the other is dependent.

  • B : b = 0.5 : 0.5 for both tables.

  • Without further information, you should anticipate the probability to get B is 0.5. B : b = 0.5 : 0.5 = 1 : 1.

  • When you know your sample has A, then does it have B or b?

  • Focus on the left column of the tables. For the left table, B : b = 0.3 : 0.3 = 1 : 1. For the right, B : b = 0.4 : 0.2 = 2 : 1.

  • Conditional probability

  • You can expand 2x2 table to a bigger one. Figure 9.6: 20 x 15

9.2.2 Quantitative type

Figure 9.7.

No big difference from Figure 9.6.

9.2.3 Prior probability, “originally expected”, and positive predictive value (PPV) and negative predictive value (NPV)

  • 2 stochastic variables

  • Conditional probability

  • Prior and posterior probability

  • Sensitivity and specificity

  • PPV and NPV

  • Every time you observe something, the information updates the likelihood; a new posterior density is generated by applying the information to the prior.

Chapter 10 Likelihood and variables in linkage analysis !

10.1 Trait gene mapping using likelihood — linkage analysis !

Linkage analysis:

  • Using phenotype information

  • Using genotype information

  • Using pedigree information

  • Using genetic marker location information

  • Considering recombination

  • Calculate the likelihood that markers are linked to the locus causative for the phenotype

Two types of linkage analysis

  • Parametric

  • Non-parametric

10.2 Parametric linkage analysis and likelihood !

10.2.1 Transmission tree of markers and transmission tree of responsible locus

  • Pedigree is given.

  • Many patterns of chromosomal transmission are possible; enumerate them (\(2^n\); Chapter 5). For a point along a chromosome, there should be a tree (Figure 10.1).

library(gtools)
permutations(2,4,c(0,1),repeats=TRUE)
  • For parents with two offspring, \(2^4=16\).

  • For L markers, \(16^L\) patterns.

  • When L=2, \(16^2\)

n<-4
m<-permutations(2,4,c(0,1),repeats=TRUE)
RecNumberMat<-as.matrix(dist(m,method="manhattan",diag=TRUE,upper=TRUE)) # no. transmission with recombination
NonRecNumberMat<-n-RecNumberMat # no. transmission without recombination
RecNumberMat
  • Transmission patterns @ markers (Trees @ markers)
  • Genotypes of the members at the markers are given. The conditional probabilities of the 16 trees can be calculated. Some patterns are impossible (zero probability).

  • Transmission patterns @ TRUE LOCUS (Trees @ True locus)
  • Using phenotype information and genetic model assumptions (dominant/recessive), some trees are likely and others are unlikely as the transmission pattern of the “TRUE LOCUS”. Likelihood/conditional probability can be calculated.

  • TRUE LOCUS may be @ a marker itself, or @ somewhere between markers.

  • (Tree @ marker 1) x (Tree @ marker 2) x (Tree @ TRUE LOCUS) : Very many combinations (tree trios).

  • Recombinations may be required between marker 1 and the true locus, and between the true locus and marker 2, to realize a combination of the three sites.

  • Likelihood function for trio of trees

\[ \begin{equation*} L(Pi,Pj,Pk)=\theta_{aG}^{Nrec_{i,k}}\times (1-\theta_{aG})^{Nnon_{i,k}} \times \theta_{Gb}^{Nrec_{k,j}}\times (1-\theta_{Gb})^{Nnon_{k,j}} \end{equation*} \] where \(Nrec_{i,j},Nnon_{i,j}\) are the numbers of transmissions with or without recombination and \(\theta_{pq}\) is the probability of recombination between p and q.

  • All trios should be considered,

\[ \begin{equation*} L=\sum_{\text{all } P_i,P_j,P_k} l(P_i,P_j,P_k)\, p_i p_j p_k \end{equation*} \]

10.2.2 Recombination between the marker and the causal locus

  • Marker 1 — True locus (G) — Marker 2

Relation among \(\theta_{ab},\theta_{aG},\theta_{Gb}\) : \[ \begin{equation*} \theta_{ab}=\theta_{aG}(1-\theta_{bG})+\theta_{bG}(1-\theta_{aG}) \end{equation*} \]

Using a parameter \(t\), \(0\le t \le \frac{\pi}{2}\), \[ \begin{align*} \theta_{aG}=0.5-\sqrt{\frac{0.5-\theta_{12}}{2}\frac{cos(t)}{sin(t)}}\\ \theta_{bG}=0.5-\sqrt{\frac{0.5-\theta_{12}}{2}\frac{sin(t)}{cos(t)}} \end{align*} \]

\(\theta_{12}\) is known from genetic map information.
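
A quick numerical check that this parametrization preserves the relation between the three recombination fractions (theta12 = 0.2 and t = 0.7 are assumed values):

theta12 <- 0.2; t <- 0.7
thetaAG <- 0.5 - sqrt((0.5 - theta12)/2 * cos(t)/sin(t))
thetaBG <- 0.5 - sqrt((0.5 - theta12)/2 * sin(t)/cos(t))
thetaAG * (1 - thetaBG) + thetaBG * (1 - thetaAG) # 0.2 = theta12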

Now, the likelihood function \[ \begin{equation*} L=\sum_{\text{all } P_i,P_j,P_k} l(P_i,P_j,P_k)\, p_i p_j p_k \end{equation*} \] is a function of t.

We can find maximum likelihood estimate of t.

A bit bigger familial data set is simulated and evaluated.

# Calculate likelihood of null and alternative hypotheses with variables below
# n: number of branches in tree
# m1,m2,G: probabilities for markers 1 and 2 and true locus G
# theta: recombination rate between two markers
# k: number of points between m1 and m2 where likelihood will be calculated
CalcLike<-function(n,m1,m2,G,theta,k){
 x<-seq(from=0,to=1,by=1/k)
 t<-x*pi/2
 theta1<--sqrt((0.5-theta)/(2*sin(t))*cos(t))+0.5
 theta2<--sqrt((0.5-theta)/(2*cos(t))*sin(t))+0.5
 range<-which(theta1>=0 & theta2>=0)
 x<-x[range]
 theta1<-theta1[range]
 theta2<-theta2[range]
 library(gtools)
  trvec<-permutations(2,n,c(0,1),repeats=TRUE)
 # number of recombination (+) transmission and number of recombination (-) transmission
 RecNumberMat<-as.matrix(dist(trvec,method="manhattan",diag=TRUE,upper=TRUE)) 
 NonRecNumberMat<-n-RecNumberMat 
 Lalt<-rep(0,length(x)) # Likelihood when alternative hypothesis is true
 for(i in 1:length(Lalt)){
  m1G<-m1%*%t(G) # likelihood between m1 and G
 # recombination (+)
  m1Gx<-m1G*theta1[i]^RecNumberMat*(1-theta1[i])^NonRecNumberMat 
  m1Gx<-apply(m1Gx,2,sum) # tree patterns of m1 change through recombination into tree patterns of G; patterns of G are then compared with tree patterns of m2
  m2G<-m1Gx%*%t(m2) # likelihood between G and m2
  m2G<-m2G*theta2[i]^RecNumberMat*(1-theta2[i])^NonRecNumberMat # recombination between G and m2 should be considered
  Lalt[i]<-sum(m2G) #All tree pattern trios should be integrated
 }
# Null hypothesis
 m12<-m1%*%t(m2)
 m12x<-m12*theta^RecNumberMat*(1-theta)^NonRecNumberMat/(2^n)
 Lnull<-rep(sum(m12x),length(Lalt))
 list(logLikeAlt=Lalt,logLikeNull=Lnull,location=x)
}
n<-4 # number of transmission in trees
set.seed(65432) # setting seed for simulation (you can change seed yourself)
# generate n^2 0/1 matrix for m1 and m2
m1<-sample(c(0,1),n^2,replace=TRUE,prob=c(0.8,0.2))
m1<-m1/sum(m1) # probability of trees for m1
m2<-sample(c(0,1),n^2,replace=TRUE,prob=c(0.8,0.2))
m2<-m2/sum(m2) # probability of trees for m2
G<-0.9*m1+0.1*m2
G<-G/sum(G) # probability of trees for true locus, G
theta<-0.4 # theta between m1 and m2
k<-100 # number of points for likelihood calculation
LL<-CalcLike(n,m1,m2,G,theta,k)
ylim<-c(min(LL$logLikeAlt,LL$logLikeNull),max(LL$logLikeAlt,LL$logLikeNull))
ylim<-c(0,max(LL$logLikeAlt,LL$logLikeNull))
plot(LL$location,LL$logLikeAlt,type="l",ylim=ylim,ylab="LogLike") # likelihood under the alternative hypothesis
par(new=T)
plot(LL$location,LL$logLikeNull,type="l",col="red",ylim=ylim,ylab="") # likelihood under the null hypothesis

The highest likelihood indicates the most likely location of true locus.

The magnitude of the likelihood under the alternative hypothesis relative to the likelihood under the null hypothesis is used to judge statistical significance.

The LOD score is the base-10 logarithm of the likelihood ratio.

The above is the basic concept of parametric linkage analysis. Because the calculation is very heavy if you attempt it for many markers, software packages for linkage analysis adopt many optimizations.

10.2.3 Hidden Markov models and likelihood calculation of linkage analysis

In the previous subsection, we enumerated all patterns for two markers and one true locus, and considered all combination trios along with a parameter moving from one marker to another. This task is really heavy.

In this subsection, a method called the hidden Markov model is introduced that lightens the task.

Each one of the three loci, m1, m2 and G, has k tree patterns.

If you handle their trio, \(k^3\) cases should be considered.

Instead of handling \(k^3\) cases,

  • handle \(k^2\) first,

  • then, the \(k^2\) cases are pooled into \(k\) patterns,

  • then after, another \(k^2\) cases will be handled.

This is possible because

  • three loci are in a line

  • when handling the combination of the 2nd and 3rd, you don’t have to care about the 1st, if you have already included information of the 1st and modified information of the 2nd before handling the combination between 2nd and 3rd.

Assume you are handling many items in a line.

You may start from one end and process items one by one, heading to the other end. The transition from one to the next is stochastic, the next state depends only on the current state, and you do not need information about the past.

This is called (simple) Markov chain.

Figure 10.4.

The transition of tree patterns is a Markov process. However, the transition (with or without recombination) is not observed; we estimate it based on genotype and phenotype data of the members who are connected in the pedigree. This means that what we are really interested in is not obvious but “hidden”.

10.3 Non-parametric linkage analysis — Affected sib-pair analysis

10.3.1 The relative risk as a variable

Parametric linkage analysis uses genetic models or penetrance, which are parameters.

Non-parametric linkage analysis does not use them.

A non-parametric method models a true locus (or loci) whose genotypes affect the expression of phenotypes, and compares the likelihood between the alternative hypothesis and the null hypothesis (no effect of genotypes on phenotype expression).

Non-parametric linkage analysis has frequently been used for sib-pair analysis, and in this section we study the method through sib-pair analysis.

Assume a biallelic locus with three genotypes.

  • Null hypothesis: \(GRR = 1\) for all genotypes.

  • Alternative hypothesis \(GRR \ne 1\) for some genotypes.

  • Consider a sib-pair. \(3^2=9\) genotype combinations are possible.

  • The probability that both of the pair are affected can be calculated for the 9 combinations.

10.3.2 Three cases; IBD = 0, 1, or 2

  • For a sib-pair, there is one parent-pair, who has 4 alleles in total.

  • Each offspring has two alleles and there are 4 ways to receive two out of the parents’ four.

  • Considering two of the sib-pair, \(4^2=16\) ways of transmission.

  • Assume a risk allele. No. of risk alleles in 4 parents’ alleles can be 0,1,2,3,4 (5 ways).

  • See Table 10.1. You can calculate the probability that the sibs express the phenotype for each condition of parental risk allele number and transmission patterns of the two offspring.

  • The 16 ways of transmission to the two offspring can be grouped into three classes using IBD = 0, 1 or 2.

  • Tables 10.3 and 10.4 show the same content in different forms. They indicate the probability/likelihood for parental patterns x IBD value.

  • \(\Delta_{p>q}\) stands for a value related to the difference of relative risks of genotypes p and q. These values have some meanings in genetic models.

  • Using the information that both offspring of the sib-pairs have the “affected” phenotype, the likelihood of each IBD number varies. The IBD number at markers can be counted or estimated. When you have many sib-pairs, their likelihoods should be integrated (if the pairs are independent, you can multiply them). Eventually you pick the most likely marker. You can evaluate positions between markers as well.

Chapter 11 Indices !

11.1 Indices indicate relative values

Indices tell you relative amount of something you are interested in.

The value of indices should be easily understandable; [0,1] or [0,100] is good for many cases.

11.2 Indices for disequilibrium

11.2.1 Hardy-Weinberg equilibrium

F (Section 3.3.3), used in the evaluation of Hardy-Weinberg equilibrium, is an index.

11.2.2 Linkage disequilibrium

Indices for linkage disequilibrium are required to take an “appropriate” value, e.g., 0, when in linkage equilibrium.

They should also take easily understandable value(s), e.g., 1, when LD is at one of its extremes.

As long as indices meet these requirements, they can be defined in various ways, and actually there are multiple definitions of LD indices.

11.2.3 p-value — Index for stochastic variables

p-value is ubiquitously used for hypothesis rejection, ranging from 0 to 1.

This is an index, too.

(R11-1.R).

df<-4;x <- seq(from = 0, to = 15, by = 0.1)
d <- dchisq(x, df);q <- pchisq(x, df);p <- pchisq(x, df, lower.tail = FALSE)
ylim<-c(0,1)
plot(x, d, ylim = ylim, type = "l")
par(new = T);plot(x, q, ylim = ylim, type = "l")
par(new = T);plot(x, p, ylim = ylim, type = "l")

Part IV Estimation, hypothesis, rejection, association, causal relation

Chapter 12 Estimation

Chapter 13 Rejection and test

Chapter 14 Association and causal relation

In Part IV, we estimate the truth you want to know that is hidden in the data (estimation), judge the authenticity of hypotheses based on the data (rejection and test), and consider whether something is the cause or the result of a phenomenon (causality).

Chapter 12 Estimation

12.1 Maximum likelihood estimation

Bernoulli trials: 6 out of 20 were “yes” and 14 were “no”.

The likelihood that the probability of “yes” is 0.3 is:

\[ \begin{equation*} \frac{20!}{6!14!} 0.3^6 \times (1-0.3)^{14} \end{equation*} \]

Changing 0.3 to p,

\[ \begin{equation*} f(p)= \frac{20!}{6!14!} p^6 \times (1-p)^{14} \end{equation*} \]

This is the likelihood function.

We want to know the value of p that maximizes f(p).

\[ \begin{align*} f'(p)&= \frac{20!}{6!14!} \left( 6 p^5 (1-p)^{14} - 14 p^6(1-p)^{13} \right) \\ &= \frac{20!}{6!14!} p^5 (1-p)^{13} (6(1-p)-14 p )\\ &= \frac{20!}{6!14!} p^5 (1-p)^{13} (6-20p) \end{align*} \]

\[ \begin{equation*} f'(p=0.3)=0 \end{equation*} \]

0.3 is the maximum likelihood estimate of p.
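
A quick numerical check using base R's optimize() and dbinom():

f <- function(p) dbinom(6, 20, p) # the likelihood function above
optimize(f, c(0, 1), maximum = TRUE)$maximum # ~0.3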

12.2 Confidence interval

p takes value from 0 to 1.

0.3 is the most likely value, but other values are possible.

The distribution function that has the same shape as the likelihood function, with area under the curve \(\int_0^1 f(p) dp =1\), is useful.

\[ \begin{equation*} f_c(p)=Cp^6\times(1-p)^{14} \end{equation*} \]

where \[ \begin{equation*} \int_{0}^{1} f_c(p) dp = \int_{0}^{1} C p^6 \times (1-p)^{14} dp =1 \end{equation*} \]

Actually C is related with

\[ \begin{equation*} \frac{20!}{6!14!}p^6\times (1-p)^{14} \end{equation*} \] and

\[ \begin{equation*} C=\frac{(20+1)!}{6!14!}=\frac{21!}{6!14!} \end{equation*} \]

Using \(\Gamma\) function and \(\beta\) function, \[ \begin{equation*} C=\frac{(N+1)!}{k!(N-k)!}= \frac{\Gamma((N+1)+1)}{\Gamma(k+1)\Gamma((N-k)+1)} = \frac{\Gamma( (k+1)+(N-k+1) )}{\Gamma(k+1)\Gamma((N-k)+1)} =\frac{1}{B(k+1,(N-k+1))} \end{equation*}\]

Now, \[ \begin{equation*} f_c(p)=\frac{1}{B(k+1,(N-k+1))}p^k(1-p)^{N-k}=\beta(p;k+1,N-k+1) \end{equation*} \]

This is called beta distribution.

Using the likelihood function or beta distribution, we can define interval estimate of p: \[ \begin{equation*} \int_{L}^{H} f_c(p) dp : \int_{0}^{L} f_c(p) dp+ \int_{H}^1 f_c(p) dp = \alpha : (1-\alpha) \end{equation*} \]

If we give further constraint, \[ \begin{equation*} \int_{0}^{L} f_c(p) dp = \int_{H}^1 f_c(p) dp \end{equation*} \]

We can tell lower and upper bound of the interval.

set.seed(.Random.seed[1]) # set seed for random values
N <- 20;k <- 6 # observation
p <- seq(from = 0, to = 1, by = 0.01) # list of values of p
v <- dbeta(p, k + 1, N - k + 1) # beta distribution
plot(p, v, type = "l")
abline(v = k/N) # MLE
cirange <- 0.95 # 0.95 -> (1 - 0.95)/2 = 0.025 in each tail
# quantiles of beta distribution
ci <- qbeta(c((1 - cirange)/2, 1 - (1 - cirange)/2), k + 1,   N - k + 1)
abline(v = ci) # upper and lower bounds

The estimated distribution of p is a beta distribution:

MLE is \(\frac{\alpha-1}{\alpha-1+\beta-1}=0.3\) and the mean (expected value) is \(\frac{\alpha}{\alpha+\beta}=0.318\)
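
Checking these two values directly (alpha = 7, beta = 15 from the observation above):

a <- 7; b <- 15 # beta parameters after observing 6 "yes" out of 20
(a - 1)/(a + b - 2) # mode = maximum likelihood estimate = 0.3
a/(a + b) # mean = 0.3181818...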

12.3 Various intervals

Use the package “binom” (R12-2.R).

library(binom)
binom.confint(6, 20, conf.level=0.95, prior.shape1 = 1, prior.shape2 = 1)

How do you classify these various methods? Which one corresponds to the interval we calculated with beta distribution?

12.3.1 Bayesian estimation — When nothing is observed

  • Bayesian estimation uses information to change the prior into the posterior distribution.

binom.confint(6, 20, prior.shape1 = 1, prior.shape2 = 1)

\[ \begin{equation*} \frac{1}{B(a,b)} p^{a-1}(1-p)^{b-1} \end{equation*} \] “a=prior.shape1,b=prior.shape2”

When a=1,b=1, \[ \begin{equation*} \frac{1}{B(1,1)}p^{1-1}(1-p)^{1-1}=\frac{\Gamma(2)}{\Gamma(1)\Gamma(1)}\times1=1 \end{equation*} \]

a=1,b=1 means k=0,N-k=0.

This means no observation.

12.3.2 Bayesian estimation, prior probability, conjugate prior distribution

You may “know” more than the uniform distribution before you get the observation results.

Because you know something before the new information arrives, you have in mind some distribution of p that is not uniform.

If your “prior” distribution is in the shape of a beta distribution, you can specify the shape with two parameter values.

\[ \begin{equation*} \frac{1}{B(x=1.2,y=1.5)}p^{x-1}(1-p)^{y-1} \end{equation*} \]

x <- 1.2; y <- 1.5; p<-seq(from=0,to=1,by=0.01)
v3 <- dbeta(p, x, y)
plot(p, v3, type = "l", ylim = c(0, 4))
v4 <- dbeta(p,x+6,y+14)
par(new=TRUE)
plot(p, v4, type = "l", ylim = c(0, 4))

Assume you obtained the information that the 20 trials turned out to be 6 vs. 14; then,

your posterior distribution should be proportional to \[ \begin{equation*} \frac{1}{B(x,y)} p^{x-1}(1-p)^{y-1} \times p^6 (1-p)^{14}=\frac{1}{B(x,y)} p^{x+6-1}(1-p)^{y+14-1} \end{equation*} \]

Standardizing to: \[ \begin{equation*} \frac{1}{B(x+6,y+14)} p^{x+6-1}(1-p)^{y+14-1} \end{equation*} \]

The change is, in general, \[ \begin{equation*} \frac{1}{B(x,y)}p^{x-1}(1-p)^{y-1} \to \frac{1}{B(x+k,y+(N-k))} p^{x+k-1}(1-p)^{y+(N-k)-1} \end{equation*} \]

You can summarize this change as change in parameter values of beta distribution; \[ \begin{equation*} (x,y) \to (x+k,y+(N-k)) \end{equation*} \]

Further information will act in the same way: \[ \begin{equation*} (x,y) \to (x+k,y+(N-k)) \to (x+k+k', y+(N-k)+(N'-k')) \end{equation*} \]

This is easy and convenient.

The binomial distribution and the beta distribution work together, the beta prior being updated to the posterior. This relation is expressed by saying that the beta distribution is the conjugate prior for the binomial distribution.

The conjugate prior distribution for the multinomial distribution is the Dirichlet distribution, and the one for the Poisson distribution is the gamma distribution.

12.3.3 Multinomial distribution and its conjugate prior distribution - Dirichlet distribution

  • Binomial ==(generalization)==> Multinomial \[ \begin{equation*} \frac{N!}{\prod_{i=1}^k n_i!} \prod_{i=1}^k (p_i )^{n_i} \end{equation*} \] \(\sum_{i=1}^k n_i=N,n_i \ge 0\)

  • Binomial <—> Beta

  • Multinomial <—> Dirichlet

Fig 12.5 R12-sup2.R

Three categories can be visualized in a triangle in 2D (the simplex in Chapter 4).

Interval estimation is a bit more complicated. You may define the interval as a contour line of equal likelihood such that the regions inside and outside the contour have probabilities in the ratio \(\alpha : (1-\alpha)\).

12.3.4 Maximum likelihood estimation and the haplotype frequency estimation - linkage disequilibrium coefficient estimation

  1. Maximum likelihood estimation using pseudo-random numbers

Assume two SNPs (A/a and B/b).

We observe 9 combinatorial genotype counts.

We want to estimate frequency of 4 haplotypes AB, Ab, aB, and ab.

We can tell for sure the number of haplotype alleles for eight out of nine combinatorial genotypes.

For the genotype (Aa, Bb), we are not sure (AB + ab) vs. (Ab + aB).

We have to determine the ratio of two (AB + ab) vs. (Ab + aB).

Tables 12.1 and 12.2 indicate genotype frequency in HWE assumption.

Haplotype frequency, \(f_1,f_2,f_3,f_4, \sum_{i=1}^4 f_i=1\)

\[ \begin{equation*} \log(L(F))= \sum_{i=1}^3 \sum_{j=1}^3 \log(F_{ij}^{n_{ij}}) = \sum_{i=1}^3 \sum_{j=1}^3 n_{ij}\log F_{ij} \end{equation*} \]

We can generate random haplotype frequency vectors using Dirichlet random generator.

library(MCMCpack)
Make3x3 <- function(f) { # from 4 haplotype freq to 9 genotype freq
    tmp <- f %*% t(f)
    matrix(c(tmp[1, 1], 2 * tmp[1, 2], tmp[2, 2], 2 * tmp[1, 
        3], 2 * (tmp[1, 4] + tmp[2, 3]), 2 * tmp[2, 4], tmp[3, 
        3], 2 * tmp[3, 4], tmp[4, 4]), nrow = 3, byrow = TRUE)
}
# Log likelihood to observe genotype counts under haplotype frequency vector
CalcLike3x3 <- function(g = matrix(1, 3, 3), f = c(0.25, 0.25, 0.25, 0.25)) {
    F <- Make3x3(f)
    sum(g * log(F))
}
CalcR <- function(h) { # LD index r from haplotype frequency vector
    p_A <- h[1] + h[3]
    p_B <- h[1] + h[2]
    (h[1] - p_A * p_B)/sqrt(p_A * (1 - p_A) * p_B * (1 - p_B))
}
ns <- matrix(c(2, 5, 2, 10, 18, 14, 4, 25, 20), nrow = 3,  byrow = TRUE) # 9 genotype counts
N <- 10000 # no. iterations
p <- c(1, 1, 1, 1) #parameters for dirichlet random generator
sampled <- rdirichlet(N, p) # Dirichlet random vectors
loglikes <- apply(sampled,1,CalcLike3x3,g=ns)
maxset <- sampled[which(loglikes == max(loglikes)), ] # Select ML 
maxp_A <- maxset[1] + maxset[3] # SNP A'S allele freq
maxp_B <- maxset[1] + maxset[2] # SNP B's allele freq
maxr <- CalcR(maxset) # LD index
maxset
  2. Maximum likelihood estimation with constraints

Because we can count the numbers of A, a, B, and b alleles in the samples, we can calculate the allele frequencies as sample frequencies.

Actually, these values are the maximum likelihood estimates of the frequencies of A, a, B, and b.

We may accept these single-SNP allele frequencies as they are.

Then the four haplotype frequencies have only 1 degree of freedom left.

\[ \begin{align*} h_1=p_Ap_B+r\sqrt{p_Ap_ap_Bp_b}\\ h_2=p_Ap_b-r\sqrt{p_Ap_ap_Bp_b}\\ h_3=p_ap_B-r\sqrt{p_Ap_ap_Bp_b}\\ h_4=p_ap_b+r\sqrt{p_Ap_ap_Bp_b} \end{align*} \]

1-df optimization is simple and easy.

(R12-5.R).

# p_A,p_B 
p_A <- 0.4;p_B <- 0.3
# haplotype freq under the assumption of LE
f1 <- p_A * p_B; f2 <- (1 - p_A) * p_B; f3 <- p_A * (1 - p_B); f4 <- (1 - p_A) * (1 - p_B)
rs <- seq(from = -1, to = 1, by = 0.001) # r

ds <- rs * sqrt(p_A * (1 - p_A) * p_B * (1 - p_B))
f1 <- ds + f1; f2 <- -ds + f2; f3 <- -ds + f3; f4 <- ds + f4
F <- matrix(c(f1, f2, f3, f4), nrow = length(f1))
minF <- apply(F, 1, min) # exclude negative frequency cases
rs <- rs[minF > 0]
F <- F[minF > 0, ]
loglikes2 <- apply(F,1,CalcLike3x3,g=ns)
maxset2 <- F[which(loglikes2 == max(loglikes2)), ]
maxp_A2 <- maxset2[1] + maxset2[3];maxp_B2 <- maxset2[1] + maxset2[2]
maxr2 <- CalcR(maxset2)
plot(rs,loglikes2,type="l") 

This result is shown in Figure 12.6. The likelihood makes a peak, and the value of r at the peak is the maximum likelihood estimate of r. In fact, this maximum likelihood is higher than the maximum likelihood obtained previously with the Dirichlet random vectors.
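As an alternative 1-df sketch, R's built-in optimize() can maximize the log-likelihood over r directly; this assumes CalcLike3x3(), ns, p_A and p_B from the code above, and hapFromR() is a hypothetical helper name:

hapFromR <- function(r, p_A, p_B) { # 4 haplotype freqs as a function of r
 d <- r * sqrt(p_A * (1 - p_A) * p_B * (1 - p_B))
 c(p_A * p_B + d, (1 - p_A) * p_B - d, p_A * (1 - p_B) - d, (1 - p_A) * (1 - p_B) + d)
}
objective <- function(r) { # log-likelihood; -Inf outside the feasible range
 f <- hapFromR(r, p_A, p_B)
 if (min(f) <= 0) return(-Inf)
 CalcLike3x3(ns, f)
}
optimize(objective, interval = c(-1, 1), maximum = TRUE)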

12.4 EM algorithm

Another way to reach the MLE under the constraints of the A, a, B, and b frequencies is the EM algorithm.

(R12-6.R).

 EM2loci <- function(n, p = NULL, Niter = 1000) {
     if (is.null(p)) p <- rep(0.25, 4) # when no initial freq is given, start from even values
     f <- p
     rs <- rep(0, Niter) # history of LD index value
     logLikes <- rep(0, Niter) # history of LogLike
     fs <- matrix(0, Niter, 4) # history of 4 haplotype freq
    # Deterministic no. haplotypes
     fixed<- c(n[1, 1] * 2 + n[1, 2] + n[2, 1], n[1, 3] *  2 + n[1, 2] + n[2, 3], n[3, 1] * 2 + n[2, 1] + 
            n[3, 2], n[3, 3] * 2 + n[2, 3] + n[3, 2])
     for (i in 1:Niter) {
# tentative assignment of double heterozygotes to haplotype pairs
         tmpratio<-f[1]*f[4]/(f[1]*f[4]+f[2]*f[3]) 
         tmp <- rep(0,4)
         tmp[1] <- fixed[1] + n[2, 2] * tmpratio
         tmp[2] <- fixed[2] + n[2, 2] * (1 - tmpratio)
         tmp[3] <- fixed[3] + n[2, 2] * (1 - tmpratio)
         tmp[4] <- fixed[4] + n[2, 2] * tmpratio
# record updated values (normalized to frequencies)
         fs[i, ] <- tmp/sum(tmp); logLikes[i] <- CalcLike3x3(n, tmp/sum(tmp)); rs[i] <- CalcR(tmp/sum(tmp))
         f <- tmp/sum(tmp) # update "current val" as frequencies
     }
     list(f = f, r = rs[Niter], logLike = logLikes[Niter], 
         fHistory = fs, rHistory = rs, logLikeHistory = logLikes)
 }
 emout <- EM2loci(ns)
 maxp_A <- emout$f[1] + emout$f[3]; maxp_B <- emout$f[1] + emout$f[2]; maxr <- emout$r
plot(emout$logLikeHistory[1:20], type = "b") # plot up to 20 iterations
# Supplementary: interval estimates from the various binom.confint methods
# as the observation count grows, keeping successes : failures = 3 : 7
x<-1:10
s<-3*x # successes
f<-7*x # failures
mean<-lower<-upper<-matrix(0,length(x),11) # 11 methods returned by binom.confint

for(i in x){
 tmp<-binom.confint(x=s[i],n=s[i]+f[i],prior.shape1=1,prior.shape2=1)
 mean[i,]<-tmp[,4]
 lower[i,]<-tmp[,5]
 upper[i,]<-tmp[,6]
}
plotlist<-c(2,3,5,9) # pick 4 of the 11 methods
collist<-c("black","red","blue","green")
ylim<-c(0,1)
for(i in 1:length(plotlist)){
 plot(x,mean[,plotlist[i]],type="l",ylim=ylim,col=collist[i])
 par(new=T)
 plot(x,lower[,plotlist[i]],type="l",ylim=ylim,col=collist[i])
 par(new=T)
 plot(x,upper[,plotlist[i]],type="l",ylim=ylim,col=collist[i])
 par(new=T)
}
abline(h=0.3) # the true value of p (3/(3+7))

Chapter 13 Rejection of hypothesis and statistical test !

13.1 Rejection of hypothesis that is difficult to believe — Observation of 3 categories

Reject a hypothesis if the observation is too rare to believe it. (R13-1.R).

There are three categories.

10,7,3 were the observed counts of three categories.

\[ \begin{equation*} \frac{N!}{n_1!n_2!n_3!} (\frac{1}{3})^N \end{equation*} \]

dpoly<-function(n=c(1,2,3),p=NULL){ # function to calculate probability
  N<-sum(n)
  if(is.null(p)){
   p<-rep(1/length(n),length(n))
  }
  exp(lgamma(N+1)-sum(lgamma(n+1))+sum(n*log(p)))
}
DrawHigherLower<-function(g=c(10,7,3),p=NULL){
 N<-sum(g)
 x<-0:N;y<-0:N
 xy<-expand.grid(x,y) # combination of all x and y values
 z<--(xy[,1]+xy[,2])+N
 xyz<-matrix(c(xy[,1],xy[,2],z),nrow=length(z))
 xyz<-xyz[apply(xyz,1,min)>=0,] # non negative

 probs<-apply(xyz,1,dpoly,p=p) # apply the above-made-function to xyz
 probObs<-dpoly(g,p=p) # probability of the observation
 higher<-which(probs>probObs) 
 lower<-which(probs<=probObs)
 xyz2<-apply(xyz,1,CoordTriangle)
 x2<-xyz2[1,];y2<-xyz2[2,]
xlim<-c(0,1);ylim<-c(0,2/sqrt(3))
# Black vs. Red
 plot(xyz2[1,which(probs>probObs)]/N,xyz2[2,which(probs>probObs)]/N,xlim=xlim,ylim=ylim,col = "black",pch=15)
 par(new=T)
 plot(xyz2[1,which(probs<=probObs)]/N,xyz2[2,which(probs<=probObs)]/N,xlim=xlim,ylim=ylim,col = "red",pch=15)
 DrawTriangleFrame()

}
# note: load the helper functions CoordTriangle() and DrawTriangleFrame() defined below before running
par(mfcol=c(1,2))
DrawHigherLower(c(10,7,3))
DrawHigherLower(c(10,7,3),c(0.9,0.05,0.05))
par(mfcol=c(1,1))

Draw probabilities under two hypotheses.

DrawDirichletDensity<-function(prior=c(1,2,3),k=0.01){
 library(MCMCpack)
 xyz<-TriLattice(k=k)
 density <- Density(ddirichlet(xyz, prior)) 
 xyz2 <- apply(xyz, 1, CoordTriangle)
 x2 <- xyz2[1, ];y2 <- xyz2[2, ]
 xlim <- c(0, 1);ylim <- c(0, 2/sqrt(3))
 plot(xyz2[1,], xyz2[2,], xlim = xlim, ylim = ylim, col = gray(density),  pch = 15)
 DrawTriangleFrame()

}

DrawTriangleFrame <- function() { # Drawing triangle
     framePointsX <- c(1, 0, 0, 1)
    framePointsY <- c(1/2, 2/sqrt(3), 0, 1/2)
    s <- seq(length(framePointsX) - 1)
    segments(framePointsX[s], framePointsY[s], framePointsX[s + 1], framePointsY[s + 1])
}
CoordTriangle <- function(x) { # coordinate in triangle coordinate
    x2 <- x[1]; y2 <- 2/sqrt(3) * x[2] + 1/2 * x[1]
   c(x2, y2, 0)
}
TriLattice<-function(k=0.01){
 x <-y<- seq(from = 0, to = 1, by = k)
 xy <- expand.grid(x, y)
 z <- 1 - xy[1] - xy[2]
 xyz <- cbind(xy[1], xy[2], z)
 minval <- apply(xyz, 1, min)
 xyz[minval > 0, ]
}
Density<-function(d){
 if(max(d)==min(d)){
  rep(0.5,length(d))
 }else{
  (max(d)-d)/(max(d)-min(d))
 }
}


DrawDirichletDensity(c(1,1,1))

DrawDirichletDensity(c(10,7,3))

DrawCI <- function(p = c(11, 8, 4), ci = 0.95, N = 10000) {
     library(MCMCpack)
     rs <- rdirichlet(N, p)
    ds <- ddirichlet(rs, p)
    ord <- order(ds)
    OutPoints <- N * (1 - ci)
    rs2 <- apply(rs, 1, CoordTriangle)
    xlim <- c(0, 1);ylim <- c(0, 2/sqrt(3))
    plot(rs2[1, ord[1:OutPoints]], rs2[2, ord[1:OutPoints]], col = gray(6/8), xlim = xlim, ylim = ylim, pch = 15)
    par(new = T)
    plot(rs2[1, ord[(OutPoints + 1):N]], rs2[2, ord[(OutPoints + 1):N]], xlim = xlim, ylim = ylim, pch = 15)
    DrawTriangleFrame()
}
DrawCI(p = c(11, 8, 4), ci = 0.95, N = 10000)

13.2 Contingency table test

2x2 table

Assume p is the probability of A. \[ \begin{equation*} \frac{n_{1.}!}{n_{11}! n_{12}!} \frac{n_{2.}!}{n_{21}!n_{22}!} p^{n_{.1}} (1-p)^{n_{.2}} \end{equation*} \]

Regardless of p, the probability to observe the 4 cell values should be proportional to \[ \frac{n_{1.}!}{n_{11}! n_{12}!} \frac{n_{2.}!}{n_{21}!n_{22}!} \]

To make the probabilities of all the cases sum to 1, it should be changed to: \[ \begin{equation*} \frac{n_{1.}!}{n_{11}! n_{12}!} \frac{n_{2.}!}{n_{21}!n_{22}!} \times \frac{n_{.1}! n_{.2}!}{n_{..}!} \end{equation*} \]

Using this, we can determine the p-value; this is the p-value of the exact probability test.

(R13-2.R).

The calculated values fit the formula below well:

\[ \begin{equation*} a\times e^{-b \delta^2} \end{equation*} \]

The exact probability decreases with the square of the deviation from the expected value.

# expected table
makeExptable<-function(m = matrix(c(10, 20, 30, 40), nrow = 2)) {
 m1 <- apply(m, 1, sum);m2 <- apply(m, 2, sum); N <- sum(m);etable <- m1 %*% t(m2)/N
}
# Move the (1,1) cell away from its expected value in steps of 1
 ProbDistance <- function(m = matrix(c(10, 20, 30, 40), nrow = 2)) { 
# marginal counts
   m1 <- apply(m, 1, sum)
   m2 <- apply(m, 2, sum)
   etable <- makeExptable(m)
   N<-sum(etable)
   d <- seq(from = 0, to = 20, by = 1) # deviation 
# All cells should be non negative
    x <- d + etable[1, 1]; y <- -d + etable[1, 2]; z <- -d + etable[2, 1]; w <- d + etable[2, 2]
    mat <- matrix(c(x, y, z, w), ncol = 4);mins <- apply(mat, 1, min);matOK <- mat[mins > 0, ]
# component of exact prob calculated with marginal counts
    tmp <- sum(lgamma(m1 + 1)) + sum(lgamma(m2 + 1)) - lgamma(N + 1)
# exact prob of all possible tables
    prob <- exp(-apply(lgamma(matOK+1),1,sum)+tmp)
    list(x = d[mins > 0], prob = prob) 
}
d <- ProbDistance(m = matrix(c(10, 20, 30, 40), nrow = 2)) #c(10,20,30,40) is cell values of 2x2 table
ylim <- c(0, max(d$prob))
plot(d$x, d$prob, ylim = ylim)
# function to be used for optimization
fForOptim <- function(x) {
     a <- x[1]; b <- x[2]
# sum of squared differences to be minimized
     sum((d$prob - (a * exp(-b * d$x^2)))^2) 
}
optout <- optim(c(1, 1), fForOptim) # initial coefficients are set at (1,1)
xfine <- seq(from = min(d$x), to = max(d$x), by = 0.01) # values for horizontal axis
optprob <- optout$par[1] * exp(-optout$par[2] * xfine^2)# estimated function value
par(new=T)
plot(xfine, optprob, ylim = ylim, col = "red", type = "l")

13.2.1 Pearson’s independence test - chi-square test

\[ \begin{equation*} \chi^2=\sum_{i,j} \frac{(n_{ij}-e_{ij})^2}{e_{ij}} \end{equation*} \]

chisqCalc<-function(m=matrix(c(10,20,30,40),nrow=2)){ 
 etable<-makeExptable(m)
 chi2<-sum((m-etable)^2/etable) 
 df<-(nrow(m)-1)*(ncol(m)-1) # df=(N-1)x(M-1)
 p<-pchisq(chi2,df,lower.tail=FALSE) # upper tail probability
 list(chi2=chi2,p=p,df=df)
}
chisqCalc(m=matrix(c(10,20,30,40),nrow=2))

13.2.2 Statistics comparing the null hypothesis and the maximum likelihood hypothesis — Likelihood ratio test

Compare the likelihood under the null hypothesis with the likelihood under the alternative hypothesis.

For the alternative hypothesis, we use the maximum likelihood hypothesis in terms of the parameter(s) modeling the alternative hypothesis.

Alternative hypothesis:

Maximum likelihood estimates:

\(\frac{n_{11}}{n_{1.}},\frac{n_{21}}{n_{2.}}\)

\[ \begin{align*} L(G1\not = G2) &= \frac{n_{.1}!}{n_{11}!n_{21}!} \frac{n_{.2}!}{n_{12}!n_{22}!} (\frac{n_{11}}{n_{1.}})^{n_{11}} (\frac{n_{12}}{n_{1.}})^{n_{12}} (\frac{n_{21}}{n_{2.}})^{n_{21}} (\frac{n_{22}}{n_{2.}})^{n_{22}} \\ &=\frac{n_{.1}!}{n_{11}!n_{21}!} \frac{n_{.2}!}{n_{12}!n_{22}!} \frac{ n_{11}^{n_{11}} n_{12}^{n_{12}} n_{21}^{n_{21}} n_{22}^{n_{22}} } {n_{1.}^{n_{1.}} n_{2.}^{n_{2.}} } \end{align*} \]

Null hypothesis: \[ \begin{align*} L(G1= G2) &= \frac{n_{.1}!}{n_{11}!n_{21}!} \frac{n_{.2}!}{n_{12}!n_{22}!} (\frac{n_{.1}}{n_{..}})^{n_{11}+n_{21}} (\frac{n_{.2}}{n_{..}})^{n_{12}+n_{22}} \\ &= \frac{n_{.1}!}{n_{11}!n_{21}!} \frac{n_{.2}!}{n_{12}!n_{22}!} \frac{ n_{.1}^{n_{.1}} n_{.2}^{n_{.2}} }{n_{..}^{n_{..}} } \end{align*} \]

Likelihood ratio: the factorial components cancel out.

\[ \begin{equation*} \frac{L(G1\not = G2) }{L(G1= G2) }= \frac{n_{..}^{n_{..}} n_{11}^{n_{11}} n_{12}^{n_{12}} n_{21}^{n_{21}} n_{22}^{n_{22}} } {n_{1.}^{n_{1.}} n_{2.}^{n_{2.}}n_{.1}^{n_{.1}} n_{.2}^{n_{.2}} } \end{equation*} \]

In general \[ \begin{equation*} \frac{L(G1\not = G2) }{L(G1= G2) } = \prod_{i,j} (\frac{n_{ij}}{e_{ij}})^{n_{ij}} \end{equation*} \]

Twice the log-likelihood ratio can be treated as a \(\chi^2\) value (R13-4.R).

likelihoodRatioTest<-function(m=matrix(c(10,20,30,40),nrow=2)){
 etable<-makeExptable(m)
 chi2<-2*sum(log(m/etable)*m)
 df<-(length(m[,1])-1)*(length(m[1,])-1)
 p<-pchisq(chi2,df,lower.tail=FALSE)
 return(list(statistic=chi2,p.value=p,df=df))
}
likelihoodRatioTest(m=matrix(c(10,20,30,40),nrow=2))

13.3 Comparison of the three methods - exact test, Pearson's independence test, and the likelihood ratio test

We have three test methods for contingency tables.

They are similar, but of course they differ from each other.

We should know when they agree, when they differ, and how.

13.3.1 When sample size is large and when sample size is small

  1. When sample size is small
  2. When sample size is large

Small table (3, 6, 9, 12)

Large table (100, 200, 300, 400)

Figures 13.3, 13.4, 13.5 and R13-sup1.R

13.3.2 Symmetry of the tests

Symmetry?

13.3.3 Difference between limited space and infinite space

The \(\chi^2\) distribution supports infinitely large values of the distance.

The exact test does not accept negative cell values, meaning that a finite df-dimensional space is covered.

13.3.4 The difference of the amount of calculation

13.3.5 A summary of the differences

  • Calculation load

  • Behaviour when small sample size

  • Symmetry

  • Exact vs. asymptotic

  • Discrete vs. continuous

  • Finite vs. infinite

Table 13.4.

13.4 Testing hypothesis with constraints

2x3 table for case-control x SNP genotypes

  • Null hypothesis

  • Dominant model df =1

  • Recessive model df=1

  • Additive model df = 1

  • Any one of three genetic models 1 < df < 2

  • Any difference between case and control df = 2

Table 13.5 : RR constraints

  • Test methods Table 13.6

13.4.1 Application of various tests to one contingency table

#library(Rassoc)
# Make a table from coordinates in 2D space
casecontRTheta<-function(R,t,af,f,N){
 popW<-c(af^2+af*(1-af)*f,2*af*(1-af)*(1-f),(1-af)^2+af*(1-af)*f)
 table<-matrix(c(popW*N[1],popW*N[2]),nrow=2,byrow=TRUE)
 table[1,1]<-table[1,1]-R/2*cos(t)+sqrt(3)/2*R*sin(t)
 table[2,1]<-table[2,1]+R/2*cos(t)-sqrt(3)/2*R*sin(t)
 table[1,2]<-table[1,2]+R*cos(t)
 table[2,2]<-table[2,2]-R*cos(t)
 table[1,3]<-table[1,3]-R/2*cos(t)-sqrt(3)/2*R*sin(t)
 table[2,3]<-table[2,3]+R/2*cos(t)+sqrt(3)/2*R*sin(t)
 return(table)
}
af<-0.4;f<-0;N<-c(100,100) # allele frequency,HWE-f,sample size
x<-y<-seq(from=-10,to=10,by=1)
# apply multiple test methods to the generated table
cattp<-max3p<-df2p<-domp<-recp<-matrix(rep(0,length(x)*length(y)),nrow=length(x))
for(i in 1:length(x)){
 for(j in 1:length(y)){
  R<-sqrt(x[i]^2+y[j]^2)
  if(x[i]==0){
   t<-pi/2
  }else{
   t<-atan(y[j]/x[i])
  }

  popdata<-casecontRTheta(R=R,t=t,af=af,f=f,N=N)
  if(min(popdata)>0){
   catout<-CATT(popdata,x=0.5) # Cochran-Armitage trend test with score 0.5
   domout<-CATT(popdata,x=1) # Cochran-Armitage trend test with score 1
   recout<-CATT(popdata,x=0) # Cochran-Armitage trend test with score 0
   max3out<-MAX3(popdata)
   df2chi2out<-chisq.test(popdata) # Pearson's test; df = 2 for the 2x3 table
   cattp[i,j]<-catout$p
   max3p[i,j]<-max3out$p
   df2p[i,j]<-df2chi2out$p.value
   domp[i,j]<-domout$p
   recp[i,j]<-recout$p
  }
 }
}
filled.contour(x,y,-log(cattp,10),color=terrain.colors)
filled.contour(x,y,-log(max3p,10),color=terrain.colors)
filled.contour(x,y,-log(df2p,10),color=terrain.colors)
filled.contour(x,y,-log(domp,10),color=terrain.colors)
filled.contour(x,y,-log(recp,10),color=terrain.colors)

Figure 13.6.

13.4.2 Comparison of the likelihood ratios in a discrete hypothesis space

When the likelihood for the alternative hypothesis has free parameter(s), twice the log-likelihood ratio can be interpreted with a \(\chi^2\) distribution with df = the number of parameters.

However, when the alternative hypothesis does not have a continuous parameter, as in DNA identification, the likelihood ratio itself, rather than the p-value of a likelihood ratio test, should be used for judgement.

13.5 Dependency among multiple tests

(R13-6.R).

#library(Rassoc)
Niter<-1000
Nca<-1000
Nco<-1000
chiP<-cattP<-domP<-recP<-rep(0,Niter)
for(i in 1:Niter){
 af<-runif(1)*0.6+0.2
 df<-c(af^2,2*af*(1-af),(1-af)^2)
 case<-sample(c(0,1,2),Nca,df,replace=TRUE)
 cont<-sample(c(0,1,2),Nco,df,replace=TRUE)
 m<-matrix(c(length(which(case==0)),length(which(case==1)),length(which(case==2)),length(which(cont==0)),length(which(cont==1)),length(which(cont==2))),nrow=2,byrow=TRUE)
 chiP[i]<-chisq.test(m,correct=FALSE)$p.value
 cattP[i]<-CATT(m,0.5)$p
 domP[i]<-CATT(m,1)$p
 recP[i]<-CATT(m,0)$p
}
plot(as.data.frame(cbind(chiP, domP, cattP, recP)), cex = 0.1)
cormat <- cor(cbind(chiP, domP, cattP, recP))
cormat

When multiple tests are mutually dependent, multiple testing correction of them (Chapter 17) becomes more difficult.

13.6 Tables with various size

13.6.1 Data in the shape of table

Two variables are observed for samples.

The relation between the two variables is evaluated/tested in various ways.

The data types of the variables determine the appropriate method.

13.6.2 Test methods for ordered or non-ordered categories

  1. Investigate the details of the tests in the R source below
# NxM table
# Randomly generate tables with a given size (NxM), multiple tests are applied to them and compare their results
CompareTests<-function(N=2,M=2,Niter=1000,Ns=100,k1=10,k2=3){
library(MCMCpack)
library(clinfun)
pearsonp<-trendp<-kwp<-jtp<-lmp<-rep(0,Niter)
for(i in 1:Niter){
#  (Generate NxM tables at random)
 fn<-rdirichlet(1,rep(k1,N))
 fm<-rdirichlet(1,rep(k2,M))
 first<-sample(1:N,Ns,prob=fn,replace=TRUE)
 second<-sample(1:M,Ns,prob=fm,replace=TRUE)
 t<-table(first,second) # convert to a contingency table
 pearsonp[i]<-chisq.test(t,correct=FALSE)$p.value #  Pearson's independence test
 if(N==2){
  trendp[i]<-prop.trend.test(t[1,],t[1,]+t[2,],score=1:M)$p.value # Trend chisquare test with additive model
 }
 kwp[i]<-kruskal.test(second~first)$p.value # Kruskal-Wallis
 jtp[i]<-jonckheere.test(second,first,alternative="two.sided")$p.value # Jonckheere-Terpstra
 lmp[i]<-summary(lm(second~first))$coefficients[2,4] #  Linear regression
}
databind<-cbind(pearsonp,kwp,jtp,lmp,trendp)
plot(as.data.frame(databind))
return(databind)
}
N<-2;M<-2;Niter<-100;Ns<-1000 #  2x2 table
ctout22<-CompareTests(N,M,Niter,Ns)
N<-2;M<-3;Niter<-100;Ns<-1000 #  2x3 table 
ctout22<-CompareTests(N,M,Niter,Ns)
N<-2;M<-10;Niter<-100;Ns<-1000 # 2x10 table
ctout22<-CompareTests(N,M,Niter,Ns)
N<-3;M<-3;Niter<-100;Ns<-1000 # 3x3 table
ctout22<-CompareTests(N,M,Niter,Ns)
  1. One axis is 2-categorical

  2. When one axis has three or more categories without order in them

  3. When both axes are ordered

Table 13.7

13.6.3 Comparison of performance of various test methods

When values are stored as vectors in R, R functions can accept them even if they are “NOT appropriate” from a statistician's standpoint.

What does “not appropriate” mean here?

m<-matrix(c(1,2,3,4)*3,nrow=2)
N<-sum(m)
x<-0:N
m1<-apply(m,1,sum)
m2<-apply(m,2,sum)
y<--x+m1[1]
z<--x+m2[1]
w<--(x+y+z)+N
data<-matrix(c(x,y,z,w),ncol=4)
min4<-apply(data,1,min)
dataOK<-data[min4>=0,]
L<-length(dataOK[,1])
pFisher<-pChiNoCorrect<-ChiNoCorrect<-pLRT<-ChiLRT<-rep(0,L)
for(i in 1:L){
 tmptable<-matrix(dataOK[i,],nrow=2)
 chisqout<-chisq.test(tmptable,correct=FALSE)
 LRTout<-likelihoodRatioTest(tmptable)
 pChiNoCorrect[i]<-chisqout$p.value
 ChiNoCorrect[i]<-chisqout$statistic
 pLRT[i]<-LRTout$p.value
 ChiLRT[i]<- LRTout$statistic
 pFisher[i]<-fisher.test(tmptable)$p.value
}
ylim<-c(0,1)
plot(x[min4>=0],pChiNoCorrect,ylim=ylim,type="l")
par(new=T)
plot(x[min4>=0],pLRT,ylim=ylim,col="red",type="l")
par(new=T)
plot(x[min4>=0],pFisher,ylim=ylim,col="blue",type="l")

ylim<-c(log(min(pFisher),10),1)
plot(x[min4>=0],log(pChiNoCorrect,10),ylim=ylim,type="l")
par(new=T)
plot(x[min4>=0],log(pLRT,10),ylim=ylim,col="red",type="l")
par(new=T)
plot(x[min4>=0],log(pFisher,10),ylim=ylim,col="blue",type="l")


ps<-cbind(pChiNoCorrect,pLRT,pFisher)
plot(as.data.frame(ps),cex=1)

ps<-cbind(pChiNoCorrect,pLRT,pFisher)
plot(log(as.data.frame(ps),10),cex=1)

m<-matrix(c(1,2,3,4)*100,nrow=2)
N<-sum(m)
x<-0:N
m1<-apply(m,1,sum)
m2<-apply(m,2,sum)
y<--x+m1[1]
z<--x+m2[1]
w<--(x+y+z)+N
data<-matrix(c(x,y,z,w),ncol=4)
min4<-apply(data,1,min)
dataOK<-data[min4>=0,]
L<-length(dataOK[,1])
pFisher<-pChiNoCorrect<-ChiNoCorrect<-pLRT<-ChiLRT<-rep(0,L)
for(i in 1:L){
 tmptable<-matrix(dataOK[i,],nrow=2)
 chisqout<-chisq.test(tmptable,correct=FALSE)
 LRTout<-likelihoodRatioTest(tmptable)
 pChiNoCorrect[i]<-chisqout$p.value
 ChiNoCorrect[i]<-chisqout$statistic
 pLRT[i]<-LRTout$p.value
 ChiLRT[i]<- LRTout$statistic
 pFisher[i]<-fisher.test(tmptable)$p.value
}
ylim<-c(0,1)
plot(x[min4>=0],pChiNoCorrect,ylim=ylim,type="l")
par(new=T)
plot(x[min4>=0],pLRT,ylim=ylim,col="red",type="l")
par(new=T)
plot(x[min4>=0],pFisher,ylim=ylim,col="blue",type="l")

ylim<-c(log(min(pFisher),10),1)
plot(x[min4>=0],log(pChiNoCorrect,10),ylim=ylim,type="l")
par(new=T)
plot(x[min4>=0],log(pLRT,10),ylim=ylim,col="red",type="l")
par(new=T)
plot(x[min4>=0],log(pFisher,10),ylim=ylim,col="blue",type="l")
abline(v=120)
abline(v=20)
abline(v=220)
abline(h=log(pChiNoCorrect[21],10))


ps<-cbind(pChiNoCorrect,pLRT,pFisher)
plot(as.data.frame(ps),cex=0.1)

plot(log(as.data.frame(ps),10),cex=0.1)

Chapter 14 Association and causal relation !

14.1 Cause and result and time

  • Epidemiology can address causal relations.

  • Chronological order and prospective study and retrospective study

There are the following four patterns for the relation between A and B.

A is the cause of B.

B is the cause of A.

Both A and B are results caused by C.

Both A and B are causes of C.

Figure 14.1.

14.2 Genotypes that are causative

Genotypes come first.

14.3 Directed graph, Bayesian network

Order or directional relation is expressed as directed edge in graphs or networks.

Cycles make it hard to handle causation with a graph; that is why an acyclic graph is often required for network analysis.

In particular, Bayesian networks are directed acyclic graphs.

Use the bnlearn package of R (R14-1.R).

library(bnlearn)
e <- empty.graph(LETTERS[1:5]) # 5 nodes
arc.set <- matrix(c("A", "C", "B", "D", "C", "E","D","A","A","E"), ncol = 2, byrow = TRUE)
arcs(e)<-arc.set # set edges
plot(e)
# generate data set
 set.seed(123456) 
Ns<-10000
A<-rpois(Ns,3);B<-rpois(Ns,3);C<-rpois(Ns,3);D<-rpois(Ns,3);E<-rpois(Ns,3)

B[1:(Ns/4)]<-A[1:(Ns/4)];C[(1+Ns/4):(Ns/2)]<-A[(1+Ns/4):(Ns/2)]
D[(1+Ns/2):(3*Ns/4)]<-A[(1+Ns/2):(3*Ns/4)];E[(1+3*Ns/4):Ns]<-A[(1+3*Ns/4):Ns]
d<-as.data.frame(cbind(A,B,C,D,E))+0.0
res<-gs(d) # Network estimation with Grow-Shrink method, one of multiple methods
plot(res)
# If networks are at random...
plot(random.graph(LETTERS[1:5], num = 1))
# Compare the estimated graph and random graphs for goodness of fit to the data
GSscore<-score(res, d) # score of the estimated graph
Niter<-100 # number of random graphs
scores<-rep(0,Niter) # scores of random graphs
for(i in 1:Niter){
 randomGraph<-random.graph(LETTERS[1:5], num = 1)
 scores[i]<-score(randomGraph,d)
}
ylim<-c(min(scores,GSscore),max(scores,GSscore))
plot(scores,ylim=ylim,pch=20)
abline(h=GSscore)
bn.fit(res,d)

Part V Large-scale

Chapter 15 Enumeration

Chapter 16 Omitting something

Chapter 17 A lot of tests

Large-scale data science, including genomics, features a huge amount of data and a lot of factors. Part V focuses on enumerating what is huge and on how to deal with such enormous scale.

Chapter 15 Enumeration !

15.1 Permutation, repeated permutation, exact probability of contingency table

  1. Permutation and combination

\[ \begin{equation*} P(N,k)=N(N-1)(N-2)...(N-(k-1))=\frac{N!}{(N-k)!}=\frac{\Gamma(N+1)}{\Gamma(N-k+1)} \end{equation*} \]

\(\Gamma\) function is important.

permN<-function(N=10,k=3){ exp(lgamma(N+1)-lgamma((N-k)+1)) }
permN(N=10,k=3)

Repeated permutation \[ \begin{equation*} \Pi(N,k)=N^k \end{equation*} \]

repPermN<-function(N=10,k=3){ N^k }
repPermN(N=10,k=3)

In total k objects, each labeled with 1, 2, …, or N; the number of objects with label \(i\) is \(n_i\). \[ \begin{equation*} \frac{k!}{\prod_{i=1}^N n_i!};\;\; \sum_{i=1}^N n_i=k \end{equation*} \]

For the 1st label set, \[ \begin{equation*} \frac{n_{..}!}{\prod_{i=1}^N n_{i.}!} \end{equation*} \] for the second, \[ \begin{equation*} \frac{n_{..}!}{\prod_{j=1}^M n_{.j}!} \end{equation*} \]

In combination, \[ \begin{equation*} X= \frac{n_{..}!}{\prod_{i=1}^N n_{i.}!} \times \frac{n_{..}!}{\prod_{j=1}^M n_{.j}!} \end{equation*} \]

We now make combination of labels, \(N\times M\) types. \[ \begin{equation*} Y=\frac{n_{..}!}{ \prod_{i=1}^N\prod_{j=1}^M n_{ij}!} \end{equation*} \]

The ratio of Y over X is the exact probability of the contingency table. \[ \begin{equation*} \frac{Y}{X}=\frac{n_{..}!}{ \prod_{i=1}^N\prod_{j=1}^M n_{ij}!} \frac {\prod_{i=1}^N n_{i.}!} {n_{..}!} \times \frac{\prod_{j=1}^M n_{.j}!}{n_{..}!} = \frac{\prod_{i=1}^N n_{i.}! \prod_{j=1}^M n_{.j}!}{n_{..}! \prod_{i=1}^N \prod_{j=1}^M n_{ij}!} \end{equation*} \]
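As a small sketch for the 2×2 case, the formula can be evaluated with lgamma() and checked against the hypergeometric point probability (exact2x2 is a hypothetical helper name and the counts are made up):

exact2x2 <- function(m){ # m: 2x2 matrix of counts
 rs <- rowSums(m); cs <- colSums(m); N <- sum(m)
 exp(sum(lgamma(rs + 1)) + sum(lgamma(cs + 1)) - lgamma(N + 1) - sum(lgamma(m + 1)))
}
m <- matrix(c(3, 7, 5, 5), nrow = 2)
exact2x2(m)
dhyper(m[1, 1], sum(m[1, ]), sum(m[2, ]), sum(m[, 1])) # same value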

15.1.2 Combination, repeated combination, number of diploid genotypes

\[ \begin{equation*} C(N,k)=\frac{N!}{k!(N-k)!}=\frac{P(N,k)}{k!} \end{equation*} \]

choose(20,3)

Repeated combination, \(H(N,k)\); for diploid genotypes, \(H(N,2)\). \[ \begin{equation*} H(N,k)=C(N+k-1,k)= \sum_{i=0}^k (C(N,k-i) \times H(k-i,i)) = \sum_{i=0}^k (C(N,k-i) \times C(k-1,i)) \end{equation*} \]

This can be modeled as follows: N spaces are separated by N-1 separating bars; put k items into the N spaces. The N-1 separating bars cannot be distinguished among themselves, nor can the k items.

\[ \begin{equation*} H(N,k)=\frac{(N-1+k)!}{(N-1)!\,k!}=C(N+k-1,k) \end{equation*} \]

repCombN<-function(N=10,k=3){
 choose(N+k-1,k)
}
repchoose<-function(n=10,k=3){
 ret<-0
 for(i in 0:k){
  ret<-ret+choose(n,k-i)*repCombN(k-i,i)
 }
 ret
}
repCombN(N=10,k=3)
repchoose(n=10,k=3)

15.2 Number of partitions — Stirling number and Bell number

What is a partition?

Put N items into k groups; all k groups should have one or more items.

The Stirling number of the second kind, \(St2(N,k)\), is the number of ways of such a partition.

It satisfies the following equation. \[ \begin{equation*} St2(N+1,k)=St2(N,k)\times k +St2(N,k-1) \end{equation*} \]

When k=2, it is related with the sum of the ways to choose i (i=1,2,…N-1) out of N: \[ \begin{align*} St2(N,2)&=\frac{1}{2} \sum_{i=1}^{N-1} C(N,i) \\ &= \frac{1}{2} (\sum_{i=0}^N C(N,i) -(C(N,0)+C(N,N)) )\\ &= \frac{1}{2} ((1+1)^N -2) = 2^{N-1}-1 \end{align*} \]

Partitioning N items into any number of groups gives the Bell number. \[ \begin{equation*} B(N)=\sum_{i=1}^N St2(N,i) \end{equation*} \] or \[ \begin{equation*} B(N+1)=\sum_{i=0}^N C(N,i) B(i) \end{equation*} \]

Figure 15.2 shows that one way of enumeration can be interpreted with multiple meanings.

Stirling2N<-function(N=10,k=3){
 ret<-0
 if(k<=N && k>=1){
  if(k==1){
    ret<-1
   }else{
    ret<-Stirling2N(N-1,k)*k+Stirling2N(N-1,k-1)
   }
 }
 ret
}
Stirling2N(N=10,k=3)
bellN<-function(N=10){
 bs<-rep(0,N+1)
 bs[1]<-1
 if(N>=1){
  for(i in 2:(N+1)){
   for(j in 0:(i-2)){
    bs[i]<-bs[i]+choose(i-2,j)*bs[j+1]
   }
  }
 }
 bs[2:length(bs)]
}
bellN(N=10)
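A quick check of the identities above with the functions just defined (a sketch):

Stirling2N(10, 2) # 511
2^(10 - 1) - 1 # = St2(10, 2)
sum(sapply(1:10, function(k) Stirling2N(10, k))) # Bell number B(10)
bellN(10)[10] # same value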

15.3 Partition and integration of categories

15.3.1 When categories are non-ordered

When we have multiple categories, we may want to group them into fewer categories. That is partitioning.

For example, the dominant genetic model groups the 3 genotypes into 2.

15.3.2 When categories are ordered

When categories have an order, partitioning them amounts to placing border(s) in the order.

Figure 15.3.

\[ \begin{equation*} B_{order}(N)=\sum_{i=1}^{N-1} C(N-1,i)=2^{N-1}-1=St2(N,2) \end{equation*} \]

(R15-7.R).

Border<-function(N){
 2^(N-1)-1
}
Border(10)

15.4 Number of shapes of tree, number of graphs — tree, clustering, Bayesian network

15.4.1 Number of patterns of tree

It is necessary to know the number of ways to construct trees with N leaves, for example when performing hierarchical clustering.

One such count of tree shapes is the Catalan number.

  • Rooted trees with N edges

  • Graphs with N non-leaf vertices = binary trees with N+1 leaves

CatalanN <- function(N = 10) {
 exp(lgamma(2 * N + 1) - lgamma(N + 2) - lgamma(N + 1))
}
CatalanN(N=10)

15.4.2 Number of patterns of clustering

Double-factorial number

\[ \begin{equation*} (2N-1)!!=1\times 3\times 5 \times ...\times (2N-1) \end{equation*} \]

The number of clustering trees (unrooted binary trees with N leaves) is \((2N-5)!!\), computed below.

(R15-9.R, Figure 15.5).

library(phangorn)
allTrees(5) # No. items = 5
trees<-allTrees(5)
plot(trees)
doubleFactorialN<-function(N=5){
 if(N<3){
  print("no answer")
 }else{
  ret<-1
  for(i in 3:N){
   ret<-ret*(2*(i-2)-1)
  }
  ret
 }
}
doubleFactorialN(N=5)

15.4.3 Number of undirected graphs, number of directed graphs and number of acyclic directed graphs

Undirected graphs:

\(\frac{N(N-1)}{2}\) is the number of possible edges.

\[ \begin{equation*} 2^{\frac{N(N-1)}{2}} \end{equation*} \]

is the number of undirected graphs without self-loops.

Directed graphs: \[ \begin{equation*} 3^{\frac{N(N-1)}{2}} \end{equation*} \] because each vertex pair is in one of {no edge, a->b, a<-b}.

Acyclic directed graphs, the type of graph used for Bayesian networks: \[ \begin{align*} N_{acg}(0)&=1\\ N_{acg}(N)&=\sum_{k=1}^N (-1)^{k+1} C(N,k)\times 2^{k(N-k)} \times N_{acg}(N-k) \end{align*} \]
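A minimal R sketch of this recursion (Nacg is a hypothetical name):

Nacg <- function(N) { # number of labeled acyclic directed graphs with N nodes
 if (N == 0) return(1)
 s <- 0
 for (k in 1:N) {
  s <- s + (-1)^(k + 1) * choose(N, k) * 2^(k * (N - k)) * Nacg(N - k)
 }
 s
}
sapply(0:4, Nacg) # 1 1 3 25 543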

You may find these integer sequences @ http://www.research.att.com/~njas/sequences/

  • permutations (factorial numbers): id A000142
  • Stirling numbers of the second kind: id A008277
  • Bell or exponential numbers: id A000110
  • Catalan numbers: id A000108
  • double factorial numbers: id A001147
  • number of acyclic digraphs (DAGs) with n labeled nodes: id A003024

Chapter 16 Omitting something

16.1 Sample at random, Walk at random

16.1.1 Sample at random from a known distribution

Pseudo-random number sequences (R16-1.R, Figure 16.1).

library(MCMCpack)
d<-rdirichlet(1000,c(1,1,1)) # Random sampling from dirichlet distribution
plot(d[,1],sqrt(3)/2*(d[,2]-d[,3]),xlim=c(0,1),ylim=c(-sqrt(3)/2,sqrt(3)/2))
segments(c(0,1,0),c(-sqrt(3)/2,0,sqrt(3)/2),c(1,0,0),c(0,sqrt(3)/2,-sqrt(3)/2)) # Triangle

16.1.2 Sample at random from samples; resampling and permutation

  1. Resampling

We have samples.

We re-sample from the sample pool.

Replacement may be allowed or not.

x<-1:10
#bootstrap
sample(x,replace=TRUE)
#permutation
sample(x,replace=FALSE)

Used in cross-validation and population parameter estimation.

  2. Permutation

Changing the order of 1,2,…,n.

\(n!\) ways.

It is related to the exact probability test: the exact probability test enumerates all patterns with the marginal counts fixed, and permutation likewise treats the given set of samples as fixed.

The following is Monte-carlo permutation-based p-value estimation.

# Data set generation
A<-1:8
B<-c(1,3,2,6,4,7,8,5)
# All permutations vs. monte-carlo permutations.
PermutationTestJT<-function(A,B,MonteCarlo=TRUE,n=1000){# n : in case of monte-carlo
 library(clinfun) # package of jonckheere.test() 
 ret<-0
 Ps<-NULL
 jtp<-jonckheere.test(A,B)$p.value # p value of observed
 if(MonteCarlo || length(A)>10){ # if element size>10, monte-carlo.
  Ps<-rep(0,n)
  for(i in 1:n){
   tmp<-sample(1:length(A),length(A))
   Ps[i]<-jonckheere.test(A[tmp],B)$p.value
  }
 }else{ # full permutation
  library(gtools) # package for permutations()
  perms<-permutations(length(A),length(A)) # all permutations will return
  Ps<-rep(0,length(perms[,1]))
  for(i in 1:length(perms[,1])){
   Ps[i]<-jonckheere.test(A[perms[i,]],B)$p.value
  }
 }
 ret<-length(Ps[Ps<=jtp])/length(Ps)
 list(originalp.value=jtp,permp.value=ret,Ps=Ps)
}
out1<-PermutationTestJT(A,B,MonteCarlo=FALSE) # full permutation result=exact test

out2<-PermutationTestJT(A,B,MonteCarlo=TRUE,n=100) # Monte-carlo result

out1$originalp.value 
out1$permp.value 
out2$permp.value 
plot(sort(out1$Ps))
abline(h=out1$permp)

16.1.3 Random walk

Random walks are based on Markov chains.

A plain random walk is supposed to walk around everywhere evenly.

A weighted random walk may make the procedure more efficient, e.g., Metropolis-Hastings or methods using conjugate priors.

nstep<-100
rwalk<-matrix(0,nstep,2) # positions of the walk
rtheta<-rnorm(nstep-1) # step directions (angles)
stepx<-cos(rtheta)
stepy<-sin(rtheta)
for(i in 1:(nstep-1)){
 rwalk[i+1,1]<-rwalk[i,1]+stepx[i]
 rwalk[i+1,2]<-rwalk[i,2]+stepy[i]
}
plot(rwalk,type="l")
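The walk above is unweighted. As a minimal sketch of a weighted random walk, here is a Metropolis-Hastings sampler whose target is, as an assumed example, the beta(7, 15) posterior from Chapter 12:

set.seed(1)
nstep <- 5000
p <- rep(0.5, nstep)
for (i in 2:nstep) {
 prop <- p[i - 1] + rnorm(1, sd = 0.1) # random-walk proposal
 if (prop <= 0 || prop >= 1) { p[i] <- p[i - 1]; next } # reject outside (0,1)
 ratio <- dbeta(prop, 7, 15)/dbeta(p[i - 1], 7, 15) # acceptance ratio
 p[i] <- ifelse(runif(1) < ratio, prop, p[i - 1])
}
hist(p[-(1:500)], breaks = 50, freq = FALSE) # discard burn-in
curve(dbeta(x, 7, 15), add = TRUE) # target density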

16.2 Use a part that is principal

16.2.1 Approximation

  1. Fit to a model

The exact test and the \(\chi^2\) test are approximately the same.

A model function is set, and an optimization procedure estimates its coefficients; this is also approximation.

Extreme value distributions are used as the model distribution of minimum p-values in the multiple-testing setting.

\[ \begin{equation*} G(x)=\exp\left( -\left( 1+\zeta \frac{x-\mu}{\sigma} \right)^{-\frac{1}{\zeta}} \right) \end{equation*} \]

The formula looks scary, but all we want to do is obtain good estimates of its parameters.

# random chisq value (df=1)
N<-100000;M<-100;t<-rchisq(N,1)
# group into 100 groups
matt<-matrix(t,nrow=M)
# max of each set
maxt<-apply(matt,1,max)
#evd package
#Generalized Extreme Value Distribution parameter estimation
# Compare the estimated distribution and samples
library(evd)
gevest<-fgev(maxt)
qfromgev<-qgev(ppoints(length(maxt),a=0),loc=gevest$estimate[1],scale=gevest$estimate[2],shape=gevest$estimate[3])
plot(ppoints(length(maxt),a=0),sort(maxt),ylim=c(0,max(maxt)),type="p",cex=1,col=gray(5/8))
par(new=T)
plot(ppoints(length(maxt),a=0),qfromgev,ylim=c(0,max(maxt)),type="l")
  2. Fit to a format

Take out the big components and throw away the tiny ones.

Eigenvalue decomposition: each eigenvalue stands for the significance of the corresponding feature.

Polynomial approximation is the same idea. \[ \begin{equation*} f(x)=a_0+a_1 x + a_2 x^2 +... + a_k x^k \end{equation*} \]

sortedMaxt<-sort(maxt) # max chisq values, sorted
s<-ppoints(length(sortedMaxt),a=0)
# for example y=a+bx+cx^2 approximation
f2 <- function(x) {
  sum((sortedMaxt-(x[1]*s^0+x[2]*s^1+x[3]*s^2))^2)
}
optout2<-optim(rep(1,3),f2,method="BFGS")
ylim<-c(0,max(sortedMaxt))
plot(s,sortedMaxt,ylim=ylim,type="l")
optt<-rep(0,length(s))
for(i in 1:length(optt)){
 optt[i]<-sum(optout2$par * s[i]^ (0:(length(optout2$par)-1)) )
}
par(new=T)
plot(s,optt,ylim=ylim,col="red",type="l")

####
# Optional
f1 <- function(x) {
  sum((sortedMaxt-(x[1]*s^0+x[2]*s^1))^2) 
}
optout2<-optim(rep(1,2),f1,method="BFGS") 
ylim<-c(0,max(sortedMaxt))
plot(s,sortedMaxt,ylim=ylim,type="l")
optt<-rep(0,length(s))
for(i in 1:length(optt)){
 optt[i]<-sum(optout2$par * s[i]^ (0:(length(optout2$par)-1)) )
}
par(new=T)
plot(s,optt,ylim=ylim,col="red",type="l")

#
f3 <- function(x) {
  sum((sortedMaxt-(x[1]*s^0+x[2]*s^1+x[3]*s^2+x[4]*s^3))^2)
}
optout2<-optim(rep(1,4),f3,method="BFGS") 
ylim<-c(0,max(sortedMaxt))
plot(s,sortedMaxt,ylim=ylim,type="l")
optt<-rep(0,length(s))
for(i in 1:length(optt)){
 optt[i]<-sum(optout2$par * s[i]^ (0:(length(optout2$par)-1)) )
}
par(new=T)
plot(s,optt,ylim=ylim,col="red",type="l")

#
f4 <- function(x) {
  sum((sortedMaxt-(x[1]*s^0+x[2]*s^1+x[3]*s^2+x[4]*s^3+x[5]*s^4))^2)
}
optout2<-optim(rep(1,5),f4,method="BFGS")
ylim<-c(0,max(sortedMaxt))
plot(s,sortedMaxt,ylim=ylim,type="l")
optt<-rep(0,length(s))
for(i in 1:length(optt)){
 optt[i]<-sum(optout2$par * s[i]^ (0:(length(optout2$par)-1)) )
}
par(new=T)
plot(s,optt,ylim=ylim,col="red",type="l")

#
f5 <- function(x) {
  sum((sortedMaxt-(x[1]*s^0+x[2]*s^1+x[3]*s^2+x[4]*s^3+x[5]*s^4+x[6]*s^5))^2)
}
optout2<-optim(rep(1,6),f5,method="BFGS")
ylim<-c(0,max(sortedMaxt))
plot(s,sortedMaxt,ylim=ylim,type="l")
optt<-rep(0,length(s))
for(i in 1:length(optt)){
 optt[i]<-sum(optout2$par * s[i]^ (0:(length(optout2$par)-1)) )
}
par(new=T)
plot(s,optt,ylim=ylim,col="red",type="l")

16.3 Select items with higher significance, discard items with lower significance

Chapter 17 A lot of tests

17.1 Multiple tests

17.1.1 Repeating many mutually independent tests

As the saying goes, “even a bad shot hits the mark with enough tries”: repeat something often enough and an unusual event will eventually occur. In the same way, an unusual statistic corresponding to a small p-value is no longer unusual if it is measured repeatedly. This is why the interpretation of the p-value must change when you perform a lot of tests; the problem is called multiple testing. In this chapter, we describe how to change the interpretation of the p-value (correction methods).

17.1.2 Expected p-values in the multiple-testing setting

Expected value of the minimum p-value: \(E(\min(p)|k)=\frac{1}{k+1}\)

Expected value of the i-th minimum p value.

\[ \begin{equation*} E(i_{th} p | k)=\frac{i}{k+1} \end{equation*} \]

Derivation: the probability that the minimum of k p-values is below \(a\) is \[Pr(\min p<a|k)=1-(1-a)^k\]

The probability that the minimum p is between \(a\) and \(a+\delta a\) is

\[Pr(\min p < a+\delta a|k)-Pr(\min p< a|k)\]

Letting \(\delta a \to 0\), the density of the minimum p is \[Pr(\min p=a|k)=\frac{d Pr(\min p<a|k)}{da}=k(1-a)^{k-1}\]

The expected value is: \[ \int_{0}^{1} a Pr(\min p=a|k) da = \int_{0}^{1} ka (1-a)^{k-1} da = k ( \int_{0}^{1} (1-a)^{k-1} da -\int_{0}^{1} (1-a)^{k} da ) = k(\frac{1}{k}-\frac{1}{k+1})=\frac{1}{k+1}\]

For the i-th smallest out of k, the probability that i of the k p-values are at or below \(a\) involves

\[\frac{k!}{i!(k-i)!} a^i(1-a)^{k-i}\]

and the expectation works out to

\[\frac{i}{k+1}\]
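A quick Monte Carlo check of these expectations (a sketch with k = 9 tests):

k <- 9
mean(replicate(10000, min(runif(k)))) # close to 1/(k+1) = 0.1
sortps <- replicate(10000, sort(runif(k)))
apply(sortps, 1, mean) # close to i/(k+1) = 0.1, 0.2, ..., 0.9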

17.1.3 Correction of minimum p-values

  1. Sidak’s method \[ \begin{equation*} 1-(1-a)^k=p_c \end{equation*} \] \[ \begin{equation*} a=1-(1-p_c)^{\frac{1}{k}} \end{equation*} \]

  2. Bonferroni’s method \[ \begin{equation*} p_c=a\times k \end{equation*} \] (see the sketch after this list)
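As a small sketch of both corrections (the numbers are made-up examples):

k <- 100; a <- 0.0004 # smallest raw p-value among k tests
1 - (1 - a)^k # Sidak-corrected p, about 0.039
min(1, a * k) # Bonferroni-corrected p, 0.04
1 - (1 - 0.05)^(1/k) # Sidak per-test level keeping the corrected level at 0.05
0.05/k # Bonferroni per-test level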

17.1.4 Repeating tests that are mutually dependent

Linkage disequilibrium

Structured population

Bonferroni and Sidak are too conservative

Permutation method to correct p might be good, but…

17.1.5 Multiple testing correction of p-values with Monte-Carlo permutation method

With the Monte-Carlo permutation method with the number of iterations = n, the corrected p-value cannot be smaller than 1/n.

Ns<-100;Nm<-10;Niter<-500 
Fx<-function(d){
 t.test(d~P[shuffle])$p.value
}
P<-rbinom(Ns,1,0.5) # binomial phenotype
X<-matrix(rnorm(Ns*Nm),nrow=Ns)
meanPs<-matrix(0,Ns,Nm);Pminhistories<-matrix(0,Ns,Niter)

X<-matrix(rnorm(Ns*Nm),nrow=Ns)
r<-0.3
for(i in 2:Nm){
 R<-sample(c(0,1),Ns,replace=TRUE,prob=c(r,1-r))
 X[,i]<-X[,i-1]*(1-R)+X[,i]*R
}
shuffle<-1:Ns
obsPs<-apply(X,2,Fx)
obsMinP<-min(obsPs)

Plist<-matrix(0,Niter,Nm)
Phistory<-rep(0,Niter)
counter<-0
for(i in 1:Niter){
 shuffle<-sample(1:Ns)
 Plist[i,]<-sort(apply(X,2,Fx))
 if(min(Plist[i,])<=obsMinP){
  counter<-counter+1
 }
 Phistory[i]<-counter/i
}
plot(Phistory, type = "b")

17.1.6 Minimum p-value when repeating tests that are mutually dependent

When tests are mutually dependent, p-values tend to be similar.

That means p-values get clustered…

17.2 When p-values are not in uniform distribution

17.2.1 When distribution of p-values shifts to the smaller side — Genomic control method

Structured population

set.seed(995599);Niter<-1000
#library(Rassoc)
st<-rep(0,Niter);p<-rep(0,Niter)
for(i in 1:Niter){
 af<-runif(1)*0.6+0.2 
 delta<-rnorm(1)
 af1<-af+af*0.05*delta; af2<-af-af*0.05*delta 
 case<-sample(c(0,1,2),1000,c(af1^2,2*af1*(1-af1),(1-af1)^2),replace=TRUE)
 cont<-sample(c(0,1,2),1000,c(af2^2,2*af2*(1-af2),(1-af2)^2),replace=TRUE)

 t<-matrix(c(length(case[case==0]),length(case[case==1]),length(case[case==2]),
             length(cont[cont==0]),length(cont[cont==1]),length(cont[cont==2])),nrow=2,byrow=TRUE)
 cattout<-CATT(t)
 st[i]<-(cattout$statistic)^2; p[i]<-cattout$p
}
plot(ppoints(length(p),a=0),sort(p)) 
ylim=c(0,1)
plot(ppoints(length(p),a=0),sort(p),ylim=ylim,type="l") 
for(i in 2:20){ 
par(new=T)
plot(ppoints(length(p),a=0),sort(pchisq(st/i,1,lower.tail=FALSE)),ylim=ylim,col="red",type="l")
}
lambda<-quantile(st,0.5)/qchisq(0.5,1)
stc<-st/lambda
pc<-pchisq(stc,1,lower.tail=FALSE)
par(new=T)
plot(ppoints(length(pc),a=0),sort(pc),type="l")
  • Only the observed p-values are used

  • Correction with 1 parameter, \(\lambda\)

17.2.2 When alternative hypothesis is true — non-central chi-square distribution

Non-central \(\chi^2\) distribution.

#library(Rassoc)
stB<-rep(0,Niter);pB<-rep(0,Niter) 
af<-runif(1)*0.6+0.2 ;delta<-0.05;af1<-af+af*delta;af2<-af-af*delta
exptable<-matrix(c(af1^2,2*af1*(1-af1),(1-af1)^2,af2^2,2*af2*(1-af2),(1-af2)^2),nrow=2,byrow=TRUE)*1000
cattout<-CATT(exptable)
expSt<-(cattout$statistic)^2 
for(i in 1:Niter){
 case<-sample(c(0,1,2),1000,c(af1^2,2*af1*(1-af1),(1-af1)^2),replace=TRUE)
 cont<-sample(c(0,1,2),1000,c(af2^2,2*af2*(1-af2),(1-af2)^2),replace=TRUE)
 t<-matrix(c(length(case[case==0]),length(case[case==1]),length(case[case==2]),
             length(cont[cont==0]),length(cont[cont==1]),length(cont[cont==2])),nrow=2,byrow=TRUE)
 cattout<-CATT(t); stB[i]<-(cattout$statistic)^2; pB[i]<-cattout$p
}
lambdaB<-quantile(stB,0.5)/qchisq(0.5,1) ;stcB<-stB/lambdaB;pcB<-pchisq(stcB,1,lower.tail=FALSE)
plot(ppoints(length(pB),a=0),sort(pB),type="l",ylim=ylim)
par(new=T)
plot(ppoints(length(pcB),a=0),sort(pcB),type="l",col="red",ylim=ylim)
meanstB<-mean(stB);varstB<-var(stB)
ncp<-meanstB-1
ncpvalue<-qchisq(ppoints(length(pcB),a=0),1,ncp)
ylim<-c(0,max(stB,ncpvalue))
plot(ppoints(length(stB),a=0),sort(stB),ylim=ylim,col=gray(7/8))
par(new=T)
plot(ppoints(length(stB),a=0),ncpvalue,type="l",ylim=ylim)

17.2.3 Power of test

Type 1 error

Type 2 error

Power

x<-seq(from=0,to=30,by=0.01)
df<-1;lambda<-ncp
chi<-dchisq(x,df,0) 
ncChi<-dchisq(x,df,lambda)
ylim=c(0,0.2)
plot(x,chi,ylim=ylim,type="l")
par(new=T)
plot(x,ncChi,ylim=ylim,type="l",col="red")
abline(v=qchisq(0.05,1,lower.tail=FALSE))
pchisq(qchisq(0.05,1,lower.tail=FALSE),1,ncp,lower.tail=FALSE) 
  • The components of the hypothesis

  • Sample size

  • Rejection level

  • Power

help(power.prop.test)
help(power.t.test)
help(power.anova.test)
power.prop.test(p1 = 0.5, p2 = 0.4, sig.level = 0.01, power = 0.9)

The output of R17-5.R is as follows: the number of samples n, which was not specified, is calculated and shown along with the values of the specified variables.
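power.prop.test() solves for whichever of its arguments is left unspecified; for example, fixing n and solving for power instead (a sketch):

power.prop.test(n = 500, p1 = 0.5, p2 = 0.4, sig.level = 0.01)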

17.3 Utilize distribution of results

17.3.1 Correct using principal component analysis

When many markers are observed, the differences between the case and control populations become apparent.

Using the differences seen, correct the test results.

Nm<-1000 ;Npop<-4; 
Ns<-c(100,200,200,200) 
M<-NULL
for(j in 1:Npop){
 tmpM<-matrix(rep(0,Nm*Ns[j]),nrow=Nm)
 for(i in 1:Nm){
  af<-runif(1)*0.8+0.1;  f<-rnorm(1,sd=0.01);  if(abs(f)>1) f=0
  df<-c(af^2,2*af*(1-af),(1-af)^2)
  df[1]<-df[1]+f/2*df[2];  df[3]<-df[3]+f/2*df[2];  df[2]<-1-df[1]-df[3]
  tmpM[i,]<-sample(c(0,1,2),Ns[j],replace=TRUE,prob=df)
 }
 M<-cbind(M,tmpM)
}

Then, from this data, take out the major axes by principal component analysis. Work through it step by step (R17-7.R).

##PCA##
mu<-apply(M,1,mean) # per-marker mean genotype
M<-M-mu # center each marker
M<-M/sqrt(mu/2*(1-mu/2)) # scale by the binomial SD implied by allele frequency mu/2
X<-1/Nm*t(M)%*%M # covariance-like matrix across samples
eiout<-eigen(X) # eigenvalue decomposition
plot(eiout$values) # scree plot of eigenvalues
eivect<-as.data.frame(eiout$vectors)
eilist<-1:(Npop+1)
plot(eivect[,eilist]) # pairwise plots of the top components

Then, sample cases and controls at different rates from the structured population (R17-9.R).

phenotype<-c(sample(c(0,1),sum(Ns)/2,replace=TRUE,prob=c(0.45,0.55)),sample(c(0,1),sum(Ns)/2,replace=TRUE,prob=c(0.55,0.45)))
Chisq<-rep(0,Nm);CorrChisq<-rep(0,Nm);Ps<-rep(0,Nm);CorrPs<-rep(0,Nm)
L<-3 
Emat<-eiout$vectors[,1:L] 
Esqs<-apply(Emat*Emat,2,sum) 
phenotype<-phenotype-mean(phenotype) 
Gamma<-apply(Emat*phenotype,2,sum)/Esqs 
corrphenotype<-phenotype-Emat%*%Gamma
for(i in 1:Nm){ 
 genotype<-M[i,]
 Gamma<-apply(Emat*genotype,2,sum)/Esqs
 corrgenotype<-genotype-Emat%*%Gamma
 Chisq[i]<-(sum(Ns))*cor(genotype,phenotype)^2
 CorrChisq[i]<-(sum(Ns)-(L+1))*cor(corrgenotype,corrphenotype)^2
}
Ps<-pchisq(Chisq,1,lower.tail=FALSE)
CorrPs<-pchisq(CorrChisq,1,lower.tail=FALSE)
ylim<-c(0,1)
plot(ppoints(length(Ps),a=0),sort(Ps),ylim=ylim)
par(new=T)
plot(ppoints(length(Ps),a=0),sort(CorrPs),ylim=ylim,col=gray(6/8))
plot(Ps, CorrPs)
chivalue<- qchisq(Ps,1,lower.tail=FALSE)
lambda<-quantile(chivalue,0.5)/qchisq(0.5,1)
chivalueGC<-chivalue/lambda
pGC<-pchisq(chivalueGC,1,lower.tail=FALSE)
plot(Ps,pGC)

17.3.2 When null hypotheses might not be always rejected

  1. False Discovery Rate (FDR)
N<-1000
peven<-p1<-ppoints(N,a=0) 
p1<-peven^5 
plot(peven, peven,xlim=c(0,1),ylim=c(0,1),type="l")
par(new=T)
plot(peven,p1,xlim=c(0,1),ylim=c(0,1),type="l")
abline(v=0.05) 
N<-1000
peven<-p1<-ppoints(N,a=0) 
p1<-peven^5 
pfdr<-p.adjust(p1,"fdr")
b<-0.05
plot(peven,sort(p1,decreasing=TRUE),xlim=c(0,1),ylim=c(0,1),type="l")
par(new=T)
plot(peven,b*((length(p1)-(1:length(p1))+1)/length(p1)),xlim=c(0,1),ylim=c(0,1),type="l",col="red")
par(new=T)
plot(peven,sort(pfdr,decreasing=TRUE),xlim=c(0,1),ylim=c(0,1),type="l",col="blue")
abline(h=b) 
abline(v=peven[length(which(pfdr>b))]) 
  2. Bayes factor (see the sketch below)
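As a minimal illustrative sketch (a made-up example, not from the text): a Bayes factor for k = 6 successes in N = 20, comparing H1 (p free with a uniform prior) against H0 (p = 0.5):

N <- 20; k <- 6
marg1 <- beta(k + 1, N - k + 1) # marginal likelihood under H1: integral of p^k (1-p)^(N-k)
marg0 <- 0.5^N # likelihood under H0; the binomial coefficient cancels in the ratio
marg1/marg0 # Bayes factor BF10, about 1.3 (weak evidence for H1)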

17.4 Integrate multiple results — Meta-analysis

17.4.1 Integrating the results of mutually independent tests

Consider two independence tests on 2x2 contingency tables. The two tests examine the same question (the relationship between factor A/a and factor B/b), but collect their samples completely independently. When you examine the question using two sample sets in this way, one issue is how to integrate the results of the two sets. Consider repeating the pair of tests many times with re-drawn samples. Under the null hypothesis, the chi-squared value of each Pearson's independence test follows the chi-squared distribution with one degree of freedom, so a scatter diagram of the p-values of the two tests should look like Figure 17.13 (R17-13.R).

Nt<-100
df1<-1
df2<-1
chi1<-rchisq(Nt,df1)
chi2<-rchisq(Nt,df2)
plot(pchisq(chi1,df1,lower.tail=FALSE),pchisq(chi2,df2,lower.tail=FALSE))
chisum<-chi1+chi2
plot(ppoints(Nt,a=0),sort(pchisq(chisum,df=df1+df2,lower.tail=FALSE)),type="l") 
t1<-t2<-matrix(c(10,10,10,10),nrow=2,byrow=TRUE)
m11<-m21<-apply(t1,1,sum);m12<-m22<-apply(t1,2,sum);M1<-M2<-sum(t1)
e1<-m11%*%t(m12)/M1;e2<-m21%*%t(m22)/M2
x11<-seq(from=-M1,to=M1,by=1);x12<--x11+m11[1];x21<--x11+m12[1];x22<--x12+m12[2]
xbind<-cbind(x11,x12,x21,x22);okx<-which(apply(xbind,1,min)>0)
x11<-x11[okx];x12<-x12[okx];x21<-x21[okx];x22<-x22[okx]
y11<-seq(from=-M2,to=M2,by=1);y12<--y11+m21[1];y21<--y11+m22[1];y22<--y12+m22[2];ybind<-cbind(y11,y12,y21,y22);oky<-which(apply(ybind,1,min)>0)
y11<-y11[oky];y12<-y12[oky];y21<-y21[oky];y22<-y22[oky]
chi1<-(x11-e1[1,1])^2/e1[1,1]+(x12-e1[1,2])^2/e1[1,2]+(x21-e1[2,1])^2/e1[2,1]+(x22-e1[2,2])^2/e1[2,2]
chi2<-(y11-e2[1,1])^2/e2[1,1]+(y12-e2[1,2])^2/e2[1,2]+(y21-e2[2,1])^2/e2[2,1]+(y22-e2[2,2])^2/e2[2,2]
z<-outer(chi1,chi2,FUN="+")
xlim<-ylim<-c(min(x11-e1[1,1],y11-e2[1,1]),max(x11-e1[1,1],y11-e2[1,1]))
contour(x11-e1[1,1],y11-e2[1,1],z,xlim=xlim,ylim=ylim)
abline(h=0);abline(v=0) 

17.4.2 Sum two tables into one table

sum11<-outer(x11,y11,FUN="+");sum12<-outer(x12,y12,FUN="+")
sum21<-outer(x21,y21,FUN="+");sum22<-outer(x22,y22,FUN="+")
sume11<-e1[1,1]+e2[1,1];sume12<-e1[1,2]+e2[1,2];sume21<-e1[2,1]+e2[2,1];sume22<-e1[2,2]+e2[2,2]
sumz<-(sum11-sume11)^2/sume11+(sum12-sume12)^2/sume12+(sum21-sume21)^2/sume21+(sum22-sume22)^2/sume22
xlim<-ylim<-c(min(x11-e1[1,1],y11-e2[1,1]),max(x11-e1[1,1],y11-e2[1,1]))
contour(x11-e1[1,1],y11-e2[1,1],sumz,xlim=xlim,ylim=ylim) 
par(new=T)
contour(x11-e1[1,1],y11-e2[1,1],z,xlim=xlim,ylim=ylim,col=gray(6/8)) 
abline(a=0,b=1)

17.4.3 Meta-analysis

library(rmeta)
n.case<-c(m11[1],m21[1])
n.ctrl<-c(m11[2],m21[2])
zMH<-z
zDSL<-z
for(i in 1:length(z[,1])){
 for(j in 1:length(z[1,])){
  mhout<-meta.MH(n.case,n.ctrl,c(x11[i],y11[j]),c(m12[1]-x11[i],m22[1]-y11[j]))
  zMH[i,j]<-mhout$MHtest[1]
  dslout<-meta.DSL(n.case,n.ctrl,c(x11[i],y11[j]),c(m12[1]-x11[i],m22[1]-y11[j]))
  zDSL[i,j]<-dslout$test[1]^2
 }
}
zlim<-c(0,max(sumz))
contour(x11-e1[1,1],y11-e2[1,1],sumz,xlim=xlim,ylim=ylim,zlim=zlim,nlevels=10,col=gray(6/8))
par(new=T)
contour(x11-e1[1,1],y11-e2[1,1],zMH,xlim=xlim,ylim=ylim,zlim=zlim,nlevels=10)

###

contour(x11-e1[1,1],y11-e2[1,1],z,xlim=xlim,ylim=ylim,zlim=zlim,nlevels=10,col=gray(6/8))
par(new=T)
contour(x11-e1[1,1],y11-e2[1,1],zDSL,xlim=xlim,ylim=ylim,zlim=zlim,nlevels=10)

#######


library(rmeta)

t1<-matrix(c(10,20,30,40),nrow=2,byrow=TRUE)
t2<-matrix(c(200,300,50,40),nrow=2,byrow=TRUE)
m11<-apply(t1,1,sum)
m12<-apply(t1,2,sum)
M1<-sum(t1)

m21<-apply(t2,1,sum)
m22<-apply(t2,2,sum)
M2<-sum(t2)

e1<-m11%*%t(m12)/M1
e2<-m21%*%t(m22)/M2

x11<-seq(from=-M1,to=M1,by=1)

y11<-seq(from=-M2,to=M2,by=1)

x12<--x11+m11[1]
x21<--x11+m12[1]
x22<--x12+m12[2]

xbind<-cbind(x11,x12,x21,x22)
okx<-which(apply(xbind,1,min)>0)

x11<-x11[okx]
x12<-x12[okx]
x21<-x21[okx]
x22<-x22[okx]

y12<--y11+m21[1]
y21<--y11+m22[1]
y22<--y12+m22[2]

ybind<-cbind(y11,y12,y21,y22)
oky<-which(apply(ybind,1,min)>0)

y11<-y11[oky]
y12<-y12[oky]
y21<-y21[oky]
y22<-y22[oky]


sum11<-outer(x11,y11,FUN="+");sum12<-outer(x12,y12,FUN="+")
sum21<-outer(x21,y21,FUN="+");sum22<-outer(x22,y22,FUN="+")
sume11<-e1[1,1]+e2[1,1];sume12<-e1[1,2]+e2[1,2];sume21<-e1[2,1]+e2[2,1];sume22<-e1[2,2]+e2[2,2]
sumz<-(sum11-sume11)^2/sume11+(sum12-sume12)^2/sume12+(sum21-sume21)^2/sume21+(sum22-sume22)^2/sume22
xlim<-ylim<-c(min(x11-e1[1,1],y11-e2[1,1]),max(x11-e1[1,1],y11-e2[1,1]))

chi1<-(x11-e1[1,1])^2/e1[1,1]+(x12-e1[1,2])^2/e1[1,2]+(x21-e1[2,1])^2/e1[2,1]+(x22-e1[2,2])^2/e1[2,2]
chi2<-(y11-e2[1,1])^2/e2[1,1]+(y12-e2[1,2])^2/e2[1,2]+(y21-e2[2,1])^2/e2[2,1]+(y22-e2[2,2])^2/e2[2,2]
z<-outer(chi1,chi2,FUN="+")

n.case<-c(m11[1],m21[1])
n.ctrl<-c(m11[2],m21[2])
zMH<-zDSL<-z
for(i in 1:length(z[,1])){
 for(j in 1:length(z[1,])){
  mhout<-meta.MH(n.case,n.ctrl,c(x11[i],y11[j]),c(m12[1]-x11[i],m22[1]-y11[j]))
  zMH[i,j]<-mhout$MHtest[1]
  dslout<-meta.DSL(n.case,n.ctrl,c(x11[i],y11[j]),c(m12[1]-x11[i],m22[1]-y11[j]))
  zDSL[i,j]<-dslout$test[1]^2
 }
}
zlim<-c(0,max(zMH,zDSL))
contour(x11-e1[1,1],y11-e2[1,1],sumz,xlim=xlim,ylim=ylim,zlim=zlim,nlevels=10,col=gray(6/8))
par(new=T)
contour(x11-e1[1,1],y11-e2[1,1],zMH,xlim=xlim,ylim=ylim,zlim=zlim,nlevels=10)

contour(x11-e1[1,1],y11-e2[1,1],z,xlim=xlim,ylim=ylim,zlim=zlim,nlevels=10,col=gray(6/8))
par(new=T)
contour(x11-e1[1,1],y11-e2[1,1],zDSL,xlim=xlim,ylim=ylim,zlim=zlim,nlevels=10)
# family-wise error rate 1-(1-a)^k vs the Bonferroni bound a*k as k grows
a<-0.00001
k<-1:100000
FWER<-1-(1-a)^k
Bonferroni<-a*k
ylim<-c(0,1)
plot(k,FWER,type="l",ylim=ylim,col=gray(6/8))
par(new=T)
plot(k,Bonferroni,type="l",ylim=ylim)
FWER[1]
FWER[100000]

ylim<-c(log(min(FWER,Bonferroni),10),0)
plot(k,log(FWER,10),type="l",ylim=ylim,col=gray(6/8))
par(new=T)
plot(k,log(Bonferroni,10),type="l",ylim=ylim)

# per-test rejection levels achieving corrected level pc
pc<-0.01
As<-1-(1-pc)^(1/k) # Sidak
Bfs<-pc/k # Bonferroni
ylim<-c(log(min(As,Bfs),10),0)
plot(k,log(As,10),type="l",ylim=ylim,col=gray(6/8))
par(new=T)
plot(k,log(Bfs,10),type="l",ylim=ylim)
As[1]
As[100000]
# repeat the Monte-Carlo permutation experiment for several dependence levels r
Ns<-100
Nm<-10
Niter<-500
rs<-c(0,0.75,0.9,0.95,1)
Fx<-function(d){
 t.test(d~P[shuffle])$p.value
}
P<-rbinom(Ns,1,0.5)
#X<-matrix(sample(c(0,1),Ns*Nm,0.5),nrow=Ns)
X<-matrix(rnorm(Ns*Nm),nrow=Ns)

meanPs<-matrix(0,length(rs),Nm)
Pminhistories<-matrix(0,length(rs),Niter)
xx<-1
X<-matrix(rnorm(Ns*Nm),nrow=Ns)
r<-rs[xx]
for(i in 2:Nm){
 R<-sample(c(0,1),Ns,replace=TRUE,prob=c(r,1-r))
 X[,i]<-X[,i-1]*(1-R)+X[,i]*R
}

shuffle<-1:Ns
obsPs<-apply(X,2,Fx)
obsMinP<-min(obsPs)


Plist<-matrix(0,Niter,Nm)
Phistory<-rep(0,Niter)
counter<-0
for(i in 1:Niter){
 shuffle<-sample(1:Ns)

 Plist[i,]<-sort(apply(X,2,Fx))

 if(min(Plist[i,])<=obsMinP){
  counter<-counter+1
 }
 Phistory[i]<-counter/i
}
meanP<-apply(Plist,2,mean)
#plot(ppoints(length(meanP),a=0),meanP,xlim=c(0,1),ylim=c(0,1))
Pminhistories[xx,]<-Phistory
meanPs[xx,]<-meanP

image(cor(cbind(P,X)))

plot(Phistory,type="b")

xx<-2
X<-matrix(rnorm(Ns*Nm),nrow=Ns)
r<-rs[xx]
for(i in 2:Nm){
 R<-sample(c(0,1),Ns,replace=TRUE,prob=c(r,1-r))
 X[,i]<-X[,i-1]*(1-R)+X[,i]*R
}

shuffle<-1:Ns
obsPs<-apply(X,2,Fx)
obsMinP<-min(obsPs)


Plist<-matrix(0,Niter,Nm)
Phistory<-rep(0,Niter)
counter<-0
for(i in 1:Niter){
 shuffle<-sample(1:Ns)


 Plist[i,]<-sort(apply(X,2,Fx))

 if(min(Plist[i,])<=obsMinP){
  counter<-counter+1
 }
 Phistory[i]<-counter/i
}
meanP<-apply(Plist,2,mean)
#plot(ppoints(length(meanP),a=0),meanP,xlim=c(0,1),ylim=c(0,1))
Pminhistories[xx,]<-Phistory
meanPs[xx,]<-meanP


image(cor(cbind(P,X)))

xx<-3
X<-matrix(rnorm(Ns*Nm),nrow=Ns)
r<-rs[xx]
for(i in 2:Nm){
 R<-sample(c(0,1),Ns,replace=TRUE,prob=c(r,1-r))
 X[,i]<-X[,i-1]*(1-R)+X[,i]*R
}

shuffle<-1:Ns

obsPs<-apply(X,2,Fx)
obsMinP<-min(obsPs)


Plist<-matrix(0,Niter,Nm)
Phistory<-rep(0,Niter)
counter<-0
for(i in 1:Niter){
 shuffle<-sample(1:Ns)

 Plist[i,]<-sort(apply(X,2,Fx))

 if(min(Plist[i,])<=obsMinP){
  counter<-counter+1
 }
 Phistory[i]<-counter/i
}
meanP<-apply(Plist,2,mean)
#plot(ppoints(length(meanP),a=0),meanP,xlim=c(0,1),ylim=c(0,1))
Pminhistories[xx,]<-Phistory
meanPs[xx,]<-meanP


image(cor(cbind(P,X)))

xx<-4
X<-matrix(rnorm(Ns*Nm),nrow=Ns)
r<-rs[xx]
for(i in 2:Nm){
 R<-sample(c(0,1),Ns,replace=TRUE,prob=c(r,1-r))
 X[,i]<-X[,i-1]*(1-R)+X[,i]*R
}


shuffle<-1:Ns
obsPs<-apply(X,2,Fx)
obsMinP<-min(obsPs)


Plist<-matrix(0,Niter,Nm)
Phistory<-rep(0,Niter)
counter<-0
for(i in 1:Niter){
 shuffle<-sample(1:Ns)


 Plist[i,]<-sort(apply(X,2,Fx))

 if(min(Plist[i,])<=obsMinP){
  counter<-counter+1
 }
 Phistory[i]<-counter/i
}
meanP<-apply(Plist,2,mean)
#plot(ppoints(length(meanP),a=0),meanP,xlim=c(0,1),ylim=c(0,1))
Pminhistories[xx,]<-Phistory
meanPs[xx,]<-meanP


image(cor(cbind(P,X)))

xx<-5
X<-matrix(rnorm(Ns*Nm),nrow=Ns)
r<-rs[xx]
for(i in 2:Nm){
 R<-sample(c(0,1),Ns,replace=TRUE,prob=c(r,1-r))
 X[,i]<-X[,i-1]*(1-R)+X[,i]*R
}
shuffle<-1:Ns
obsPs<-apply(X,2,Fx)
obsMinP<-min(obsPs)


Plist<-matrix(0,Niter,Nm)
Phistory<-rep(0,Niter)
counter<-0
for(i in 1:Niter){
 shuffle<-sample(1:Ns)


 Plist[i,]<-sort(apply(X,2,Fx))

 if(min(Plist[i,])<=obsMinP){
  counter<-counter+1
 }
 Phistory[i]<-counter/i
}
meanP<-apply(Plist,2,mean)
#plot(ppoints(length(meanP),a=0),meanP,xlim=c(0,1),ylim=c(0,1))
Pminhistories[xx,]<-Phistory
meanPs[xx,]<-meanP

image(cor(cbind(P,X)))

xlim<-c(0,1)
ylim<-c(0,1)

plot(ppoints(length(meanPs[1,]),a=0),meanPs[1,],xlim=xlim,ylim=ylim,type="b")
for(i in 2:length(meanPs[,1])){
 par(new=T)
 plot(ppoints(length(meanPs[1,]),a=0),meanPs[i,],xlim=xlim,ylim=ylim,type="b")
}
# the sum of independent chi-square statistics follows a chi-square distribution with the summed df
Nt<-10000
Nk<-rpois(1,10)
dfs<-rpois(Nk,10)
chis<-matrix(0,Nt,Nk)
for(i in 1:Nk){
 chis[,i]<-rchisq(Nt,dfs[i])
}
dfs
chisum<-apply(chis,1,sum)
dfsum<-sum(dfs)
plot(ppoints(Nt,a=0),sort(pchisq(chisum,df=dfsum,lower.tail=FALSE)),type="l")