solarius
R packageThis vignette is a tutorial on the R package solarius
. The document contains a brief description of the main statistical models (polygenic, association and linkage) implemented in SOLAR
and accessible via solarius
, installation instructions for both SOLAR
and solarius
, reproducible examples on synthetic data sets available within the solarius
package. The examples are intended to cover all the necessary steps in the quantitative genetic analysis with SOLAR
: data loading, data exploration, running statistical models and interpreting the obtained results by printing, summarizing and plotting the models in R.
In human genetics research, the study of a complex disorder is targeted to discover its underlying genetic architecture. Whereas simple Mendelian diseases are results of a mutation in a single gene, for example Huntington’s disease, complex disorders or phenotypes are the product of the action of multiple genes and environmental factors as well as their interactions. For instance, obesity is a complex disease which genetic basis might be better dissected via measurable quantitative variation in phenotypes, such as body-mass index, closely related to the disease risk.
One of the study designs to identify the quantitative trait loci (QTLs) that affect complex phenotypes is based on examination of related individuals grouped by families (pedigrees). Pedigree-based designs allow for both QTL linkage and association mappings. In addition, QTL analysis with pedigrees possess more power to detect rare variants, permits explicit control for population stratification in association studies, and suit for examination of parent-of-origin and sex-specificity effects.
The analysis of the pedigree data is traditionally based on variance component (VC) models, also known as linear mixed models. SOLAR
is one of the well-known and widely used tools that implements the variance component models and other routines needed for the QTL analysis in extended pedigrees (L. Almasy and Blangero 1998) (http://solar.txbiomedgenetics.org/). The name SOLAR
stands for Sequential Oligogenic Linkage Analysis Routines. The analyses available in SOLAR
include customized quantitative genetic analysis, advanced linkage analysis, SNP association analysis (QTN, QTLD, and MGA). Other operations are implemented for calculation of marker-specific or multipoint identity-by-descent (IBD) matrices in pedigrees of arbitrary size and complexity. The linkage analysis allows for multiple quantitative traits and/or discrete traits which might involve multiple loci (oligogenic analysis), dominance effects, household effects, and interactions.
There are other tools similar to SOLAR
, such as Mendel
, MMM
, GEMMA
and FAST-LMM
, but the R package solarius
presented here has been especially developed to work with SOLAR
.
solarius
provides an interface from the program SOLAR
to the environment for statistical computing R
. On one side, main SOLAR
procedures can be called in R
by one or a few commands from the solarius
package, that makes the analysis more user-friendly with the same power of computations implemented in SOLAR
. On the other side, the solarius
user benefits from the infrastructure available with the R environment and its packages, such as interactive data manipulation, data visualization and parallel computation.
Consider that a particular trait, such as body-mass index, is observed in \(n\) individuals grouped by families or clusters of related individuals. A \(n \times 1\) vector \(y\) contains the phenotypic values measured for the trait. One assumes that these observations are described adequately by a linear model with a \(p \times 1\) vector of fixed effects (\(\beta\)) and a \(q \times 1\) vector of random effects (\(u\)) (Lynch and Walsh 1998, Chapter 26, pp. 746). In matrix form,
\[ y = X \beta + Z u + e \]
where \(X\) and \(Z\) are respectively \(n \times p\) and \(n \times q\) incidence matrices, and \(e\) is a \(n \times 1\) vector of residual deviations assumed to be distributed independently of the random effects. Denote the \(n \times n\) covariance matrix for the vector \(e\) by \(R\) and the the \(q \times q\) covariance matrix for the vector \(u\) by \(G\).
The first element of the vector \(\beta\) is typically the population mean, and other elements may be age, gender, diet and so on. The elements of the incidence matrices are usually equal to 0 or 1, that indicate the effect contribution to the individuals’s phenotype. The model is referred to as a mixed model, because this model jointly accounts for fixed and random effects.
Giving the two terms of the random effects \(u\) and \(e\), the first term typically accounts for the contribution from random genetic effects such as additive genetic values. The second term accounts for the residual variance, and one usually assumes that residual errors have constant variance and are uncorrelated, so that \(R\) is a diagonal matrix, with \(R = \sigma_e^2 I\).
Two more equations for the means and variances of the component vectors of the mixed model can be derived on the basis of its definition. Since \(E(u) = E(e) = 0\) by definition,
\[ E(y) = X \beta \]
Excluding the difference among individuals due to fixed effects and calling the assumption that \(u\) and \(e\) are uncorrelated, the covariance matrix for the vector of observations \(y\) is
\[ Var(y) = V = Z G Z^{T} + R \]
Now the mixed model can be rewritten in a compact form
\[ y = X \beta + Z u + e\]
\[ \mbox{ where } \mbox{ } \mbox{ } u \sim (0, G) \mbox{ and } e \sim (0, R) \\ \mbox{ implying } \mbox{ } \mbox{ } y \sim (X \beta, V) = (X \beta, Z G Z^T + R) \]
For the mixed model, \(y\), \(X\) and \(Z\) are observed variables, while \(\beta\), \(u\), \(R\) and \(G\) are generally unknown variables. Thus, the analysis consists of two complementary estimation procedures: (1) estimation of the covariance matrices \(G\) and \(R\), and (2) estimation of the vectors of fixed and random effects, \(\beta\) and \(u\). The covariance matrices are usually functions of a few unknown parameters or variance components. For instance, the covariance matrix of the residuals errors typically has the form \(R = \sigma_e^2 I\), where \(\sigma_e^2\) is the parameter to be estimated.
The description of the two estimation procedures is out of scope of this document, and the reader is referred to (Lynch and Walsh 1998, Chapter 26-27), where the necessary theoretical material and accompanying examples are given. On the other hand, the implementation of the estimation procedures in SOLAR
is discussed elsewhere (L. Almasy and Blangero 1998).
The next three sections will present SOLAR
models, which follow the general definition of linear mixed model given above, but parametrization of the models and estimation procedures are implemented in a SOLAR
-specific way.
SOLAR
The polygenic model is a basis for any quantitative genetic analysis in SOLAR
. The model includes fixed effects (\(\beta\)) and three random effects (the genetic additive effect \(g\), the house-hold effect \(c\), and the residual effect \(e\)).
\[y = X\beta + g + c + e\]
\[\mbox{ where } \mbox{ } \mbox{ } g \sim (0, \sigma_g^2 A) \mbox{ and } \mbox{ } \mbox{ } c \sim (0, \sigma_c^2 H) \mbox{ and } \mbox{ } \mbox{ } e \sim (0, \sigma_e^2 I)\]
\[ \mbox{ implying } \mbox{ } \mbox{ } y \sim (X\beta, V) = (X\beta, \sigma_g^2 A + \sigma_c^2 H + \sigma_e^2 I)\]
The first \(g\) term accounts for the additive genetic effect with the covariance matrix \(A = 2 K\), where \(K\) is the kinship matrix. SOLAR
estimates the empirical kinship matrix based on the identifier (ID) fields in the pedigree data, such as individual ID id
, gender sex
, family ID fam
, father ID fa
, mother ID mo
, and indicator of identical individuals, e.g. monozygotic twins, mztwin
(SOLAR Documentation, Appendix 1, file-pedigree command).
The second \(c\) term represents an annotated environmental effect. It is named as a house-hold effect, but it is generally related to the sharing of any non-genetic factor among individuals. The house-hold effect is included in the analysis if the pedigree data contain a house-hold ID field (the expected field name is hhid
). SOLAR
automatically creates Pedigree-Household Groups
based on the hhid
identifiers. That means that pedigrees and household groups are merged into larger groups, which contain every individual in the same pedigree as well as every individual that has the same hhid
as any individual in the pedigree (SOLAR Documentation, Section 9.3 Household Group Analysis). As a consequence of this grouping operation by SOLAR
, the covariance matrix \(H\) of the \(c\) random effect contains the information related to the pedigree-household groups rather than relationships given by the hhid
field.
The third \(e\) term is the residual error with the diagonal covariance matrix. It is important to note that the residual
command in SOLAR
(SOLAR Documentation, Appendix 1, residual command) returns the covariate-free value of a trait under study, expressed as \(y - X \beta\), and the returned value is not the residual vector \(e\) given in the equation of the polygenic model.
The polygenic term \(g\) represents the additive genetic effect by default. The incorporation of the dominance effect into SOLAR
models is also possible, but it is attributed to advanced modeling topics (SOLAR Documentation, Section 9.4 Dominance Analysis). The SOLAR
’s developers suggest that the dominance variance component is unlikely useful in quantitative trait mapping, particularly in human genetic analysis, as only bi-lineal relatives contribute to this component.
SOLAR
is able to work with discrete traits, which possible values are coded as two consecutive integers, for example, 0
and 1
(SOLAR Documentation, Appendix 1, discrete-notes). SOLAR
uses a liability threshold model to handle discrete traits (Duggirala et al. 1997), and more details about implementation and practical notes are available in (SOLAR Documentation, Section 9.1 Discrete Traits). SOLAR
authors recommend to work with quantitative traits whenever it is possible. The user is able to disable the transformation of discrete traits, if it is necessary.
The likelihood-based statistics, in particular Likelihood Ratio Tests (LRTs), is used for the inference of the linear mixed models implemented in SOLAR
. The following table shows the alternative models used to test the significance of the random effects \(g\) and \(c\).
Alternative model | Free parameters | Restricted parameters |
---|---|---|
No polygenic effect | \(\beta\), \(\sigma_c^2\), \(\sigma_e^2\) | \(\sigma_g^2\) = 0 |
No house-hold effect | \(\beta\), \(\sigma_g^2\), \(\sigma_e^2\) | \(\sigma_c^2\) = 0 |
Under the assumption of multivariate normality, the LRT statistic is asymptotically distributed as a \(\frac{1}{2}:\frac{1}{2}\) mixture of of a \(\chi_1^2\) variable and a point mass at the origin (J. Blangero, Wiliiams, and Almasy 2000). This statement is applied to any LRT performed for parameters like \(\sigma_g^2\), which is greater or equal to 0
by definition.
SOLAR
takes the decision of rejecting the house-hold effect by considering two factors, the test results and the value of \(\sigma_c^2\) parameter. If value of \(\sigma_c^2\) is non-zero, the house-hold effect is retained in the polygenic model, nevertheless the effect is not significant. The user has an option to force the rejection of the house-hold effect if it is necessary.
The tests applied to fixed effects (covariates) are also performed by means of LRTs. In the table given below the alternative model for the test of significance of age
covariate is shown.
Alternative model | Free parameters | Restricted parameters |
---|---|---|
No age effect | \(\sigma_g^2\), \(\sigma_c^2\), \(\sigma_e^2\) | \(\beta_{age}\) = 0 |
SOLAR
In principle, one can estimate the multivariate polygenic model by simply performing the univariate analysis in an appropriate way. Consider two traits \(y_1\) and \(y_2\) measured in each of \(n\) individuals, assuming that each trait follows the same univariate polygenic analysis. Then the \(2 \times n\) dimensional column vector of observations is denoted by the stack of univariate vectors, and the polygenic model has the matrix form:
\[ \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}_{2n \times 1} = \begin{pmatrix} X_1 & 0 \\ 0 & X_2 \end{pmatrix}_{2n \times nk} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}_{2n \times 1} + \begin{pmatrix} g_1 \\ g_2 \end{pmatrix}_{2n \times 1} + \begin{pmatrix} e_1 \\ e_2 \end{pmatrix}_{2n \times 1} \]
\[ G_{2n \times 2n} = C_{2 \times 2} \otimes A_{n \times n} = \begin{pmatrix} c_{11} A & c_{12} A \\ c_{21} A & c_{21} A \end{pmatrix} = \begin{pmatrix} \sigma_{1}^2 A & \rho \sigma_{1} \sigma_{2} A \\ \rho \sigma_{1} \sigma_{2} A & \sigma_{2}^2 A \end{pmatrix}_{g} \]
\[ R_{2n \times 2n} = E_{2 \times 2} \otimes I_{n \times n} = \begin{pmatrix} e_{11} I & e_{12} I \\ e_{21} I & e_{21} I \end{pmatrix} = \begin{pmatrix} \sigma_{1}^2 I & \rho \sigma_{1} \sigma_{2} I \\ \rho \sigma_{1} \sigma_{2} I & \sigma_{2}^2 I \end{pmatrix}_{e} \]
The model can be generalized to \(k\) traits, and other random effects such as house-hold can be added in the similar way. The Kronecker product notation \(\otimes\) is convinient for denoting the covariance matrices. More details on the multivariate analysis are given in (Lynch and Walsh 1998, Chapter 26, Multiple Traits, pp. 774).
The multivariate analysis implemented in SOLAR
allows for any number of traits, but the tests for significance of the correlation coefficients are performed only for the bivariate models. The following table describes the main LRTs available in SOLAR
for the bivariate analysis.
Alternative model | Free parameters | Restricted parameters |
---|---|---|
No genetic correlation | \((\sigma_1^2, \sigma_2^2)_g\), \((\sigma_1^2, \sigma_2^2, \rho)_e\) | \(\rho_g = 0\) |
Complete pleiotropy | \((\sigma_1^2, \sigma_2^2)_g\), \((\sigma_1^2, \sigma_2^2, \rho)_e\) | \(\rho_g = 1\) |
No environmental correlation | \((\sigma_1^2, \sigma_2^2, \rho)_g\), \((\sigma_1^2, \sigma_2^2)_e\) | \(\rho_e = 0\) |
It is important to note that the transformation of discrete traits in the multivariate analysis is automatically performed by SOLAR
.
SOLAR
SOLAR
introduces sequential oligogenic linkage analysis, as stated in its name abbreviation. Consider the linkage model with a single QTL, which is described by another random effect \(q\).
\[y = X\beta + g + c + q + e\]
\[ \mbox{ where } \mbox{ } \mbox{ } g \sim (0, \sigma_g^2 A) \mbox{ and } \mbox{ } \mbox{ } c \sim (0, \sigma_c^2 H) \mbox{ and } \mbox{ } \mbox{ } q \sim (0, \sigma_q^2 M) \mbox{ and } \mbox{ } \mbox{ } e \sim (0, \sigma_e^2 I) \]
\[ \mbox{ implying } \mbox{ } \mbox{ } y \sim (X\beta, V) = (X\beta, \sigma_g^2 A + \sigma_c^2 H + \sigma_q^2 M + \sigma_e^2)\]
The QTL under study \(q\) corresponds to a specific position in the genome and an identity-by-descent (IBD) matrix. The genome-wide linkage analysis assumes that the highly informative markers, such as micro-satellites, are available in the study, in order to estimate the two-point or multipoint IBD matrices by SOLAR
or other specific program like MERLIN
. Once the IBD matrices are calculated, the genome-wide scan is conducted by sequentially running the linkage model for each genome position.
The alternative model for testing the significance of a single QTL effect is the same as the polygenic model.
Alternative model | Free parameters | Restricted parameters |
---|---|---|
No linkage | \(\beta, \sigma_g^2, \sigma_c^2\), \(\sigma_e^2\) | \(\sigma_q^2\) = 0 |
Traditionally, the logarithm (base 10) of the odds (LOD) values are used to score the QTLs. If the LRT statistics for the linkage model is denoted as \(\Lambda\), then the correspondent LOD score is equal to \(\Lambda / (2 ln(10))\).
SOLAR
also allows for multi-pass linkage scan, that means that any further scan can be conditioned to QTLs found in the previous scans. For example, the second-pass scan conditioned to the previously found QTL \(q_1\) has the form:
\[y = X\beta + g + c + q_1 + q + e\]
The second-pass LRT test now looks like in the table below.
Alternative model | Free parameters | Restricted parameters |
---|---|---|
No linkage | \(\beta, \sigma_g^2, \sigma_c^2\), \(\sigma_{q_1}^2\), \(\sigma_e^2\) | \(\sigma_q^2\) = 0 |
The linkage model is extended to deal with many traits or discrete traits in the same way as for the polygenic model.
Additional information related to the linkage modeling in SOLAR
can be found in related publications (L. Almasy and Blangero 1998), (L. Almasy, Dyer, and Blangero 1997), (Williams et al. 1999), the SOLAR
tutorial (SOLAR Documentation, Chapter 3 Tutorial), and the SOLAR
manual, for example, (SOLAR Documentation, Apendix 1, multipoint command).
SOLAR
A basic association model in SOLAR
is extended from the polygenic model by simply adding a new covariate (fixed effect) snp
.
\[ y = X\beta + \beta_{snp} * snp + g + c + e\]
\[\mbox{ where } \mbox{ } \mbox{ } g \sim (0, \sigma_g^2 A) \mbox{ and } \mbox{ } \mbox{ } c \sim (0, \sigma_c^2 H) \mbox{ and } \mbox{ } \mbox{ } e \sim (0, \sigma_e^2 I)\]
\[ \mbox{ implying } \mbox{ } \mbox{ } y \sim (X\beta, V) = (X\beta, \sigma_g^2 A + \sigma_c^2 H + \sigma_e^2 I)\]
In the genome-wide association scan SOLAR
passes through SNPs variants under the study one by one according to the equation given above. The p-value of association between trait \(y\) and SNP \(snp\) is again based on LRT test with the alternative model outlined in the table below. This alternative model is the same as the polygenic model.
Alternative model | Free parameters | Restricted parameters |
---|---|---|
No association | \(\beta, \sigma_g^2, \sigma_c^2\), \(\sigma_e^2\) | \(\beta_{snp} = 0\) |
More details on implementation of the association analysis in SOLAR
can be found in the SOLAR
manual, for example, (SOLAR Documentation, Apendix 1, mga and snp commands).
The idea of the proposed software (the R package solarius
) is to perform the calculus in SOLAR
at the back-end and to offer the user interface in R at the front-end. The table given below shows a basic description of the software via implementation of the main three models: polygenic, association and linkage.
Model | SOLAR command |
solarius function |
solarius S3 classes of returned object |
Parallel computation |
---|---|---|---|---|
Polygenic | polygenic |
solarPolygenic |
solarPolygenic |
None |
Association | mga |
solarAssoc |
solarAssoc , solarPolygenic |
Automatic or custom (by SNP file) |
Linkage | multipoint |
solarMultipoint |
solarMultipoint , solarPolygenic |
Custom (by chromosome) |
One can see how the software benefits from both SOLAR
and R sides. On the side of SOLAR
, the SOLAR
commands polygenic
, mga
and multipoint
do all the hard job on estimating the variance component models with control via a set of parameters and settings. On the side of R
, the solarius
functions solarPolygenic
, solarAssoc
and solarMultipoint
prepare the input data for the analysis, pass the arguments to the commands and set up other necessary settings, and finally read the output SOLAR
files and store the results into R objects of S3 classes in R.
The main advantage of running the SOLAR
models in solarius
instead of SOLAR
is that the analysis is conduced by only one function call, while the same analysis in SOLAR
could require typing more commands with appropriate options and settings (to be learned from the SOLAR
manual). The solarius
functions just automate the analysis flow and keep configuration options via the R function arguments, if such configuration is required. Another advantage of using the R package solarius
is that the results are stored in objects of especially designed S3 classes with associated methods like print, summary, plot and others.
The last column of the table given above contains the information related to parallel computation available for the models. The implementation of parallel calculations is straightforward in solarius
package, since the association and linkage analyses are embarrassingly parallel problems (easily divisible into different identical independent task) and the R environment offers a number of packages with parallel interfaces.
More information about the solarius
functions is available in the software manual of the package, and the practical guide of using the functions is presented in the next section dedicated to examples. Meantime, some points should to be stressed to describe some other features and drawbacks of the solarius
package.
solarius
functions, by default, do not create any directories visible to the user, though the analysis flow in SOLAR
is heavily based on storing the data in directories and files. However, the input phenotype and genotype data might be stored in CSV data format and passed to the functions. The user can specify a special directory (dir
argument for the solarPolygenic
, solarAssoc
and solarMultipoint
functions), where the SOLAR
output will be saved in addition to the output R object.tcl
scripts used by SOLAR
for automation are not visible to the user. The routine of executing either SOLAR
separate commands or a block of commands in a tcl
script is internally organized in the functions of solarius
.SOLAR
analysis is implemented in the functions in solarius
, and the user does not need to manually look for the analysis results.SOLAR
commands used to compute the analysis and some SOLAR
output files are saved in the returned objects. This feature might be necessary for those users who are familiar with SOLAR
. It is also helps to go back to SOLAR
and repeat or extend the analysis in the SOLAR
environment, if it is necessary.solarius
offers only three analyses via the most powerful SOLAR
commands polygenic
, mga
and multipoint
. If a more specific genetic analysis is required, the user needs to go back to SOLAR
.The installation process has three steps: install SOLAR
, register SOLAR
and install solarius
.
SOLAR
installationThe SOLAR
distribution is available on the official web page http://solar.txbiomedgenetics.org/ for the most popular operating systems (Linux, Mac OS, Windows).
In the case of Linux, the installation files are compressed in solar_linux.tar.gz
archive. Once the archive is downloaded to a local machine of the user and the installation files are unpacked, one can find the README
file and install_solar
script. In particular, the script copies the necessary library, binary and documentation files into an installation directory defined by the user. A special bash script called solar
is also created and copied to the system path defined by the user, e.g. /usr/local/bin/solar
, so that the program can be executed by simply typing solar
in the terminal.
An example of the installation command might be the following.
./install_solar /opt/appl/solar/7.6.6 /usr/local/bin solar
As a result of execution of the command, the library, binary and documentation files of SOLAR
version 7.6.6
will be copied to /opt/appl/solar/7.6.6
directory, and solar
execution script will be copied to /usr/local/bin
system directory.
After the installation process the new user is required to obtain a registration key, as it is explained by the message that appears when SOLAR
program starts. The registration key has to be stored in .solar_reg
file in the home directory of the user.
Once both installation and registration processes are finished, something similar to the following message will be displayed when SOLAR
is executed.
SOLAR Eclipse version 7.6.6 (Experimental), last updated on December 11, 2014
Copyright (c) 1995-2014 Texas Biomedical Research Institute
Enter help for help, exit to exit, doc to browse documentation.
solar>
SOLAR version for Windows. Currently, SOLAR does not have an appropriated version for Windows. As stated in the official web page of SOLAR http://solar.txbiomedgenetics.org/solarwindows.html:
Because SOLAR requires a Unix OS or equivalent, we have prepared a VMware Virtual Machine containing SOLAR plus the Xubuntu edition of linux which runs on Windows XP or later using the free VMware Player. Beware this requires a very large multi 1.4 Gb download, and will require more than 10 gigabytes free disc space during installation.
This is a clear limitation of solarius
package that comes from the dependency on SOLAR. Starting from version 3.*, solarius
is not supported for Windows. DESCRIPTION file has a line OS_type: unix
.
solarius
installationsolarius
package is available in the CRAN repository for R packages http://cran.r-project.org/web/packages/solarius/, so it can be install by calling the usual R function install.packages
.
install.packages(solarius)
The code repository of solarius
package is hosted in the public GitHub repository of the UGCD group https://github.com/ugcd/solarius. The R package devtools
allows to install the development version of the package hosted at GitHub. The installation commands are the following.
library(devtools)
install_github("ugcd/solarius")
solarius
has several R packages as dependencies (that can be seen in DESCRIPTION file). Installation of these packages should be straightforward using the standard R function install.packages
.
In the case of problems in the installation process, web pages of the packages are the main source of information.
Imported packages:
plyr
.ggplot2
.data.table
.
Installation of data.table R package. If the user experiences problems installing data.table
package, we recommend to upgrade the R environment to the latest stable version.
solarius
R package structuresolarius
is organized as a standard R package, and its structure can be seen in the code repository web site https://github.com/ugcd/solarius.
*.R
files with functions, classes and methods.*.RData
files.man
directory includes documentation organized in Rd
files (R documentation files).inst
directory will be copied to the installation directory of the package and, thus, will be available within the package via system.file
function.
*.R
scripts with code examples.SOLAR
data files for data sets of the package.*.R
scripts for testing units during development of the package (R package testthat
).Rmd
extension are the source files of the vignettes.The source files of documentation are stored in man
directory as Rd
files. R package staticdocs
converts this documentation into html pages. The documentation for solarius
is published on http://ugcd.github.io/solarius/doc/.
In the R environment the help for a function of the package, for example solarPolygenic
function, can be obtained by the standard ?solarPolygenic
R comand. The content of the documentation for solarPolygenic
function is the same as on http://ugcd.github.io/solarius/doc/solarPolygenic.html.
The vignetts are created by rmarkdown
R package and published as html pages on the web.
dat30
data set.SOLAR
, the code presented in the vignette is based on solarius
. The produced results are the same as those reported in the published articles.solarius
To start using solarius
package, the user needs to load the package with library
function.
library(solarius)
The version of solarius
loaded here is:
packageVersion("solarius")
## [1] '0.3.1'
Other R packages will be needed for the tutorial.
library(plyr)
The following code shows how the main functions (solarPolygenic
, solarMultipoint
and solarAssoc
) of the package work.
# load library
library(solarius)
# load data set
data(dat30)
# univariate polygenic model
mod1 <- solarPolygenic(trait1 ~ 1, dat30)
# bivariate polygenic model
mod2 <- solarPolygenic(trait1 + trait2 ~ 1, dat30,
polygenic.options = '-testrhoe -testrhog')
# specify directory with IBD matrices and run linkage model
mibddir <- system.file('extdata', 'solarOutput',
'solarMibdsCsv', package = 'solarius')
link <- solarMultipoint(trait1 ~ 1, dat30,
mibddir = mibddir, chr = 5)
# run association model in parallel
assoc <- solarAssoc(trait1 ~ 1, dat30, cores = 2,
snpcovdata = genocovdat30, snpmap = mapdat30)
The user can think of this block of code as a quick start in using solarius
. The material given further will provide more details and all necesseary information on the usage of solarius
.
The genome-wide QTL analysis typically involves large-scale data sets. Data sets distributed within solarius
package contain relatively small simulated data, which are used for reproducible demonstrations of how functions of the package work. There are three data sets available in solarius
:
multic
(Andrade et al. 2006) (polygenic and linkage models);
dat30
data set (polygenic, linkage and association models);dat50
from the R package FFBSKAT (polygenic model and association model, both with custom kinship matrix);SOLAR
(polygenic and linkage models).The simulated data set (in SOLAR
format) from multic
R package is placed in inst/extdata/solarOutput directory of solarius
package. 1200 individuals in 200 families are included in the study. The simulated phenotypes include two traits trait1
and trait2
and two covariates age
and sex
. The two traits possess a high genetic correlation. For the linkage analysis 6 IBD matrices for chromosome 5 6 IBD matrices for chromosome 2 are available. The six matrices of chromosome 5 are the original data from multic
package, while the six matrices of chromosome 2 are the exact copies of those of chromosome 5. Data for the two chromosomes are needed to test parallel calculation (separated by chromosome) of the linkage model.
A smaller subset of the simulated data from multic
package is provided in dat30
data set (dat30.RData
file in data directory). This smaller data set contains only 174 individuals in 29 families.
dat30
data frame) are the same as in the complete data described above.genocovdat30
matrix) were randomly generated to run the association analysis on these data (see the script inst/scripts/generate-genocovdat30.R).Documentation of dat30
data set is available on the page dat30.
The simulated data from the R package FFBSKAT
are stored in dat50
data set (dat50.RData
file in data directory). This data set is adapted from example.data
data set in the R package FFBSKAT
. The following variables are included.
phenodata
: A data.frame with the phenotypes trait
, sex
and age
given for 66 individuals (related and unrelated).genodata
: A matrix of 50 genetic variants (columns) given for 66 individuals (rows). The genotypes are coded in the format such as 1/1
, 1/2
and 2/2
.genocovdata
: A matrix of covariates derived from the genotype data (the additive model). The covariates are coded as integers 0, 1 and 2.snpdata
: A data.frame of annotation for genetic variants. The variables name of the variant name
, chromosome chrom
, gene name gene
and position in bp position
.kin
: A kinship matrix for 66 individuals.Documentation of dat50
data set is available on the page dat50.
The simulated data from the workshop GAW10 are described elsewhere (SOLAR Documentation, Chapter 3 Tutorial).
To load both data sets dat30
and dat50
, the user needs to use the standard R function data
A quick view to the data tables might be performed by means of str
function.
data(dat30)
str(dat30)
## 'data.frame': 174 obs. of 10 variables:
## $ famid : int 1 1 1 1 1 1 2 2 2 2 ...
## $ id : int 11 12 13 14 15 16 21 22 23 24 ...
## $ fa : int 0 0 11 11 11 11 0 0 21 21 ...
## $ mo : int 0 0 12 12 12 12 0 0 22 22 ...
## $ sex : int 1 2 1 2 1 1 1 2 2 1 ...
## $ affect: int 2 2 2 2 2 2 2 2 2 2 ...
## $ class : logi NA NA NA NA NA NA ...
## $ trait1: num 11.96 7.1 10.32 9.76 9.46 ...
## $ trait2: num 13.58 5.37 6.4 8.98 9.21 ...
## $ age : int 50 25 35 49 51 45 37 29 39 41 ...
str(genocovdat30)
## num [1:174, 1:100] 1.978 0.795 0.231 0.139 0.487 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:174] "11" "12" "13" "14" ...
## ..$ : chr [1:100] "snp_1" "snp_2" "snp_3" "snp_4" ...
str(mapdat30)
## 'data.frame': 100 obs. of 4 variables:
## $ SNP : Factor w/ 100 levels "snp_1","snp_10",..: 1 12 23 34 45 47 48 49 50 2 ...
## $ chr : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pos : num 2105324 2105467 2106094 2108138 2109262 ...
## $ gene: Factor w/ 12 levels "gene1","gene2",..: 1 1 1 1 1 1 1 1 1 2 ...
data(dat50)
str(phenodata)
## 'data.frame': 66 obs. of 4 variables:
## $ id : num 1 2 3 4 5 6 7 8 9 10 ...
## $ sex : int 0 1 0 0 0 1 0 1 1 0 ...
## $ age : int 80 77 56 44 75 79 75 82 77 76 ...
## $ trait: num -1.763 -1.423 -0.805 0.268 -1.334 ...
str(genodata)
## chr [1:66, 1:50] "1/1" "1/1" "1/1" "1/2" "1/2" "1/1" ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:66] "1" "2" "3" "4" ...
## ..$ : chr [1:50] "s1" "s2" "s3" "s4" ...
str(genocovdata)
## int [1:66, 1:50] 0 0 0 1 1 0 0 0 0 0 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:66] "1" "2" "3" "4" ...
## ..$ : chr [1:50] "s1" "s2" "s3" "s4" ...
loadMulticPhen
function in solarius
package loads the complete version of the data from the R package multic
(1200 individuals).
mdat <- loadMulticPhen()
str(mdat)
## 'data.frame': 1200 obs. of 10 variables:
## $ ID : chr "1001" "1002" "1003" "1004" ...
## $ affect: int 2 2 2 2 2 2 2 2 2 2 ...
## $ class : logi NA NA NA NA NA NA ...
## $ trait1: num 9.61 11.5 11.05 12.37 7.97 ...
## $ trait2: num 9.61 11.66 13.62 13.88 7.29 ...
## $ age : int 45 37 45 48 31 41 50 38 43 54 ...
## $ FAMID : chr "100" "100" "100" "100" ...
## $ FA : chr "0" "0" "1001" "1001" ...
## $ MO : chr "0" "0" "1002" "1002" ...
## $ SEX : chr "1" "2" "1" "2" ...
loadExamplesPhen
function returns the GAW10 data sets available in SOLAR
.
gawdat <- loadExamplesPhen()
str(gawdat)
## 'data.frame': 1497 obs. of 11 variables:
## $ ID : chr "1" "10" "100" "1000" ...
## $ AGE: int 56 37 80 51 50 39 36 68 69 57 ...
## $ EF : num NA 46.9 NA 29.5 24.7 ...
## $ Q1 : num NA 42.1 NA 25.9 18.4 ...
## $ Q2 : num NA 55.4 NA 32.4 29.6 ...
## $ Q3 : num NA 111.2 NA 65.8 39.3 ...
## $ Q4 : num NA 13.1 NA 11.5 10.1 ...
## $ Q5 : num NA 30.9 NA 27.7 24.6 ...
## $ FA : chr "0" "0" "0" "610" ...
## $ MO : chr "0" "0" "0" "612" ...
## $ SEX: chr "1" "2" "2" "1" ...
Both loadMulticPhen
and loadExamplesPhen
functions are based on a more general function readPhen
, which reads the phenotype data from text files of a table format. The data from the R package multic
can be also loaded by means of readPhen
function.
dat.dir <- system.file("extdata", "solarOutput", package = "solarius")
mdat <- readPhen(phen.file = file.path(dat.dir, "simulated.phen"), sep.phen = ",",
ped.file = file.path(dat.dir, "simulated.ped"), sep.ped = ",")
The two times kinship coefficients describe the relatedness among individuals in the family-based sample. Two functions solarKinship2
and plotKinship2
allows to compute and plot the kinship matirx of a given data set. solarKinship2
function runs SOLAR
for estimation of the kinship coefficients based on the ID fields in the data. The returned value of the function is two times kinship coefficients in a matrix form. plotKinship2
plots the two times kinship matrix by using the R package Matrix
.
For example, the kinship matrix for dat50
data sets has the following image.
plotKinship2(2*kin)
One can see that both unrelated and related individuals appear in dat50
data set.
The relationships among individuals in dat30
data set is examined in the following two lines of code.
kin2.dat30 <- solarKinship2(dat30)
plotKinship2(kin2.dat30)
The structure of patterns in the kinship matrix can be explored more closely if the first 30 individuals are plotted.
plotKinship2(kin2.dat30[1:30, 1:30])
It seems that the simulated data include only sib-pair relationships. Such an observation can be visually confirmed by histKinship2
function.
histKinship2(kin2.dat30)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
One see that all the related individuals have the twice kinship coefficient equal to 0.5
. The unique values of kin2.dat30
matrix can also be extracted by the following code.
unique(as.vector(kin2.dat30))
## [1] 1.0 0.0 0.5
The family trees are plotted by plotPed
, which reuses some routines from the R package kinship2
. The family with index 2
from dat30
data set is plotted in the following code.
plotPed(dat30, 2)
## Loading required package: kinship2
## Loading required package: Matrix
## Loading required package: quadprog
To show an example of trait transformation, a trait with an anomalous distribution exp_trait1
is created in dat30
data set.
dat30 <- mutate(dat30,
exp_trait1 = exp(trait1))
The list of available transforms in solarius
is returned by availableTransforms
function.
availableTransforms()
## [1] "none" "inormal" "log" "out"
The following code plots the distribution of the original values of exp_trait1
and its transformed values.
library(ggplot2)
## Find out what's changed in ggplot2 with
## news(Version == "1.0.1", package = "ggplot2")
##
## Attaching package: 'ggplot2'
##
## The following object is masked from 'package:solarius':
##
## annotate
library(gridExtra)
theme_set(theme_bw())
with(dat30, {
grid.arrange(
qplot(exp_trait1),
qplot(transformTrait(exp_trait1, "out")),
qplot(transformTrait(exp_trait1, "log")),
qplot(transformTrait(exp_trait1, "inormal"))
)
})
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
solarPolygenic
function runs the polygenic analysis in SOLAR
and returns back the results in R. The basic call looks like the standard call of lm
function with the formula interface.
M1 <- solarPolygenic(trait1 ~ age + sex, dat30)
M1
##
## Call: solarPolygenic(formula = trait1 ~ age + sex, data = dat30)
##
## File polygenic.out:
## Pedigree: dat.ped
## Phenotypes: dat.phe
## Trait: trait1 Individuals: 174
##
## H2r is 0.8061621 p = 6.1167535e-10 (Significant)
## H2r Std. Error: 0.1100465
##
##
## Proportion of Variance Due to All Final Covariates Is
## 0.0330070
##
## Loglikelihoods and chi's are in trait1/polygenic.logs.out
## Best model is named poly and null0
## Final models are named poly, spor, nocovar
##
## Residual Kurtosis is -0.3603, within normal range
The print method for the returned object of class solarPolygenic
shows the output produced by SOLAR
. The tables of the results are printed by summary
method.
summary(M1)
##
## Call: solarPolygenic(formula = trait1 ~ age + sex, data = dat30)
##
## File polygenic.out:
## Pedigree: dat.ped
## Phenotypes: dat.phe
## Trait: trait1 Individuals: 174
##
## H2r is 0.8061621 p = 6.1167535e-10 (Significant)
## H2r Std. Error: 0.1100465
##
##
## Proportion of Variance Due to All Final Covariates Is
## 0.0330070
##
## Loglikelihoods and chi's are in trait1/polygenic.logs.out
## Best model is named poly and null0
## Final models are named poly, spor, nocovar
##
## Residual Kurtosis is -0.3603, within normal range
##
## Loglikelihood Table:
## model loglik
## 1 sporadic -229.3348
## 2 polygenic -210.8689
##
## Covariates Table:
## covariate Estimate SE Chi pval
## 1 mean 7.7823346631 0.33112422 NA NA
## 2 age 0.0004591117 0.01483783 NA NA
## 3 sex -0.4559509166 0.31250447 NA NA
##
## Variance Components Table:
## varcomp Var SE pval
## 1 h2r 0.8061621 0.1100465 6.116753e-10
## 2 e2 0.1938379 0.1100465 NA
By default SOLAR
does not perform the LRT tests for the covariates. covtest
argument of solarPolygenic
function tells to SOLAR
to make this job.
M2 <- solarPolygenic(trait1 ~ age + sex, dat30, covtest = TRUE)
M2
##
## Call: solarPolygenic(formula = trait1 ~ age + sex, data = dat30, covtest = TRUE)
##
## File polygenic.out:
## Pedigree: dat.ped
## Phenotypes: dat.phe
## Trait: trait1 Individuals: 174
##
## H2r is 0.8061621 p = 6.1167535e-10 (Significant)
## H2r Std. Error: 0.1100465
##
## age p = 0.9753082 (Not Sig., but fixed)
## sex p = 0.1455341 (Not Sig., but fixed)
##
## Proportion of Variance Due to All Final Covariates Is
## 0.0330070
##
## Loglikelihoods and chi's are in trait1/polygenic.logs.out
## Best model is named poly and null0
## Final models are named poly, spor, nocovar
## Initial sporadic and polygenic models are s0 and p0
## Constrained covariate models are named no<covariate name>
##
## Residual Kurtosis is -0.3603, within normal range
The test statistics and p-values are stored in cf
slot of the returned object M2
.
M2$cf
## covariate Estimate SE Chi pval
## 1 mean 7.7823346631 0.33112424 NA NA
## 2 age 0.0004591117 0.01483782 0.0010 0.9753082
## 3 sex -0.4559509166 0.31250456 2.1184 0.1455341
The custom kinship matrix is passed by kinship
argument. The following code shows such an example of dat50
data set, which phenotype data are stored in phenodata
data.frame.
M3 <- solarPolygenic(trait ~ 1, phenodata, kinship = kin)
M3
##
## Call: solarPolygenic(formula = trait ~ 1, data = phenodata, kinship = kin)
##
## File polygenic.out:
## Pedigree: dat.ped
## Phenotypes: dat.phe
## Trait: trait Individuals: 66
##
## H2r is 0.1000000 p = 0.5000000 (Not Significant)
##
##
## Loglikelihoods and chi's are in trait/polygenic.logs.out
## Best model is named poly and null0
## Final models are named poly, spor
##
## Residual Kurtosis is -0.6742, within normal range
The polygenic and sporadic models have the same LRT statistics, that means SOLAR
was not able to make the polygenic analysis in a proper way.
M3$lf
## model loglik
## 1 sporadic -27.77137
## 2 polygenic -27.77137
This bug seems the case of unrelated individuals or the individuals withount such ID fields as fa
and mo
.
The bivariate polygenic analysis is run in the same way as the univariate one.
B1 <- solarPolygenic(trait1 + trait2 ~ 1, dat30)
B1
##
## Call: solarPolygenic(formula = trait1 + trait2 ~ 1, data = dat30)
##
## File polygenic.out:
## Pedigree: dat.ped
## Phenotypes: dat.phe
## Trait: trait1 trait2 Individuals: 174
##
## H2r(trait1) is 0.8218823
## H2r(trait1) Std. Error: 0.1053258
##
## H2r(trait2) is 0.6270026
## H2r(trait2) Std. Error: 0.1158107
##
## RhoE is 0.4120487
## RhoE Std. Error: 0.1969379
##
## RhoG is 0.9728759
## RhoG Std. Error: 0.0374442
##
## Derived Estimate of RhoP is 0.8045957
##
##
## Loglikelihoods and chi's are in trait1.trait2/polygenic.logs.out
## Best model is named poly and null0
## Final models are named poly, spor
The test of significance of environmental and genetic correlation coefficients is indicated by polygenic.options
argument of solarPolygenic
function.
B2 <- solarPolygenic(trait1 + trait2 ~ 1, dat30, polygenic.options = "-testrhoe -testrhog")
B2
##
## Call: solarPolygenic(formula = trait1 + trait2 ~ 1, data = dat30, polygenic.options = "-testrhoe -testrhog")
##
## File polygenic.out:
## Pedigree: dat.ped
## Phenotypes: dat.phe
## Trait: trait1 trait2 Individuals: 174
##
## H2r(trait1) is 0.8218823
## H2r(trait1) Std. Error: 0.1053258
##
## H2r(trait2) is 0.6270026
## H2r(trait2) Std. Error: 0.1158107
##
## RhoE is 0.4120487 p = 0.1424023
## RhoE Std. Error: 0.1969379
##
## RhoG is 0.9728759
## RhoG Std. Error: 0.0374442
##
##
## RhoG different from zero p = 2.2463156e-09
## RhoG different from 1.0 p = 0.2098041
## Derived Estimate of RhoP is 0.8045957
##
##
## Loglikelihoods and chi's are in trait1.trait2/polygenic.logs.out
## Best model is named poly and null0
## Final models are named poly, spor
The results are stored in vcf
slot of the returned object.
B2$vcf
## varcomp Estimate SE zscore
## 1 h2r(trait1) 0.8218823 0.10532575 7.803242
## 2 e2(trait1) 0.1781177 0.10532575 1.691112
## 3 h2r(trait2) 0.6270026 0.11581068 5.414031
## 4 e2(trait2) 0.3729974 0.11581068 3.220751
## 5 rhog 0.9728759 0.03744417 25.982039
## 6 rhoe 0.4120487 0.19693793 2.092277
It might happen that traits in the multivariate polygenic analysis have different covariates. Trait-specific covariates are passed to solarPolygenic
function according the specification of SOLAR
.
B3 <- solarPolygenic(trait1 + trait2 ~ sex + age(trait2), dat30)
B3
##
## Call: solarPolygenic(formula = trait1 + trait2 ~ sex + age(trait2),
## data = dat30)
##
## File polygenic.out:
## Pedigree: dat.ped
## Phenotypes: dat.phe
## Trait: trait1 trait2 Individuals: 174
##
## H2r(trait1) is 0.7956974
## H2r(trait1) Std. Error: 0.1129518
##
## H2r(trait2) is 0.6029008
## H2r(trait2) Std. Error: 0.1197632
##
## RhoE is 0.4655421
## RhoE Std. Error: 0.1736882
##
## RhoG is 0.9676250
## RhoG Std. Error: 0.0397582
##
## Derived Estimate of RhoP is 0.8027999
##
##
## Loglikelihoods and chi's are in trait1.trait2/polygenic.logs.out
## Best model is named poly and null0
## Final models are named poly, spor
In the results of this analysis, sex
covariate is common for both traits, while age
covariate is applied only to trait2
. The given model specification can be ckecked by looking at the model files from SOLAR
.
tail(B3$solar$files$model$poly.mod, 3)
## [1] "# mu = \\{t1*(<Mean(trait1)>+<bsex(trait1)>*Female) + t2*(<Mean(trait2)>"
## [2] "# +<bsex(trait2)>*Female+<bage(trait2)>*(age-39.87931034))\\}"
## [3] "loglike set -361.499294"
solarAssoc
function runs the association analysis in SOLAR
, in particular via mga command. The genotype data are passed to solarAssoc
function in the form of SNP genotypes (snpdata
argument) or covariates (snpcovdata
or genocov.files
arguments). The later case assumes that the user has converted the genotypes to covariates by some method (outside solarius
package) before computing the association model.
The function returns an object of class solarAssoc
. This class has additional methods such as annotate
, plotQQ
and plotManh
for exploration of the association results. The main table of the results is snpf
slot of the returned object, which class is data.table
from the R package data.table
.
The association analysis is shown with dat50
data set with 66 individuals and 50 SNP variants. The kinship matrix kin
of this data set is also used.
snpdata
argumentThe option of using snpdata
is recommended in the case when the small number of SNPs is to be tested.
A1 <- solarAssoc(trait ~ 1, phenodata, snpdata = genodata, kinship = kin)
A1
##
## Call: solarAssoc(formula = trait ~ 1, data = phenodata, kinship = kin,
## snpdata = genodata)
##
## Input SNP data:
## * 50 SNP genotypes passed by `snpdata` argument
##
## Output results of association:
##
## * Table of association results (first 5 out of 50 rows):
## SNP NAv chi pSNP bSNP bSNPse Varexp est_maf
## 1: s1 66 0.189994 0.662922 0.160860 0.369045 0.002875 0.053031
## 2: s10 66 0.050972 0.821380 -0.105065 0.465364 0.000772 0.015152
## 3: s11 66 0.050972 0.821380 -0.105065 0.465364 0.000772 0.015152
## 4: s12 66 0.909920 0.340136 -0.196598 0.206100 0.013692 0.204545
## 5: s13 66 0.182890 0.668901 0.118876 0.277971 0.002767 0.106061
## est_mac dosage_sd
## 1: 7.000026 0.307915
## 2: 1.999998 0.244311
## 3: 1.999998 0.244311
## 4: 27.000006 0.549856
## 5: 13.999986 0.408810
##
## CPU time on 1 core(s): 00:00:03
summary
method reports the number of significant SNPs based on Bonferroni correction.
summary(A1)
##
## Call: solarAssoc(formula = trait ~ 1, data = phenodata, kinship = kin,
## snpdata = genodata)
##
## Association model
## * Number of SNPs: 50
## * Input format: snpdata
## * Number of significal SNPs: 0 (Bonferroni correction with alpha 0.05)
The significance threshold can be passed as alpha
parameter of summary
method.
summary(A1, alpha = 1)
##
## Call: solarAssoc(formula = trait ~ 1, data = phenodata, kinship = kin,
## snpdata = genodata)
##
## Association model
## * Number of SNPs: 50
## * Input format: snpdata
## * Number of significal SNPs: 1 (Bonferroni correction with alpha 1)
## SNP NAv chi pSNP bSNP bSNPse Varexp est_maf est_mac
## 1: s43 66 6.504508 0.01076 -1.186139 0.465081 0.093852 0.030303 3.999996
## dosage_sd
## 1: 0.238606
snpcovdata
argumentSNP covariates are passed to the association model by snpcovdata
parameter.
A2 <- solarAssoc(trait ~ 1, phenodata, snpcovdata = genocovdata, kinship = kin)
A2
##
## Call: solarAssoc(formula = trait ~ 1, data = phenodata, kinship = kin,
## snpcovdata = genocovdata)
##
## Input SNP data:
## * 50 SNP covariates passed by `snpcovdata` argument
##
## Output results of association:
##
## * Table of association results (first 5 out of 50 rows):
## SNP NAv chi pSNP bSNP bSNPse Varexp est_maf
## 1: s1 66 0.189994 0.662922 0.160860 0.369045 0.002875 0.053031
## 2: s10 66 0.050972 0.821380 -0.105065 0.465364 0.000772 0.015152
## 3: s11 66 0.050972 0.821380 -0.105065 0.465364 0.000772 0.015152
## 4: s12 66 0.909920 0.340136 -0.196598 0.206100 0.013692 0.204545
## 5: s13 66 0.182890 0.668901 0.118876 0.277971 0.002767 0.106061
## est_mac dosage_sd
## 1: 7.000026 0.307915
## 2: 1.999998 0.244311
## 3: 1.999998 0.244311
## 4: 27.000006 0.549856
## 5: 13.999986 0.408810
##
## CPU time on 1 core(s): 00:00:02
The values of the SNP covariates can take any numerical values. That means dosage format of imputed SNP data is allowed.
The previous example is repeated for the SNP covariate data, where values of 2
are replaced by 1.9
.
genocovdata2 <- genocovdata
genocovdata2[genocovdata2 == 2] <- 1.9
A2 <- solarAssoc(trait ~ 1, phenodata, snpcovdata = genocovdata2, kinship = kin)
A2
##
## Call: solarAssoc(formula = trait ~ 1, data = phenodata, kinship = kin,
## snpcovdata = genocovdata2)
##
## Input SNP data:
## * 50 SNP covariates passed by `snpcovdata` argument
##
## Output results of association:
##
## * Table of association results (first 5 out of 50 rows):
## SNP NAv chi pSNP bSNP bSNPse Varexp est_maf
## 1: s1 66 0.189994 0.662922 0.160860 0.369045 0.002875 0.053031
## 2: s10 66 0.050972 0.821380 -0.110595 0.489857 0.000772 0.014394
## 3: s11 66 0.050972 0.821380 -0.110595 0.489857 0.000772 0.014394
## 4: s12 66 0.927276 0.335572 -0.201592 0.209348 0.013951 0.203031
## 5: s13 66 0.182890 0.668901 0.118876 0.277971 0.002767 0.106061
## est_mac dosage_sd
## 1: 7.000026 0.307915
## 2: 1.900008 0.232095
## 3: 1.900008 0.232095
## 4: 26.800026 0.541289
## 5: 13.999986 0.408810
##
## CPU time on 1 core(s): 00:00:02
genocov.files
(single value)genocov.files
argument is supported to pass the SNP covariate data in files of the appropriate SOLAR
format. This option suits for large-scale calculations, when the real genome-wide scan is run. Both arguments genocov.files
and snplists.files
are required, and snpmap.files
argument is optional.
dir <- package.file("extdata", "solarAssoc", package = "solarius")
genocov.files <- file.path(dir, "snp.genocov")
snplists.files <- file.path(dir, c("snp.geno-list1", "snp.geno-list2"))
A3 <- solarAssoc(trait ~ 1, phenodata,
genocov.files = genocov.files, snplists.files = snplists.files)
A3
##
## Call: solarAssoc(formula = trait ~ 1, data = phenodata, genocov.files = genocov.files,
## snplists.files = snplists.files)
##
## Input SNP data:
## * SNP covariates passed in 1 file(s) by `genocov.files` argument
##
## Output results of association:
##
## * Table of association results (first 5 out of 25 rows):
## SNP NAv chi pSNP bSNP bSNPse Varexp est_maf
## 1: s1 66 0.189994 0.662922 0.160860 0.369045 0.002875 0.053031
## 2: s10 66 0.050972 0.821380 -0.105065 0.465364 0.000772 0.015152
## 3: s11 66 0.050972 0.821380 -0.105065 0.465364 0.000772 0.015152
## 4: s12 66 0.909920 0.340136 -0.196598 0.206100 0.013692 0.204545
## 5: s13 66 0.182890 0.668901 0.118876 0.277971 0.002767 0.106061
## est_mac dosage_sd
## 1: 7.000026 0.307915
## 2: 1.999998 0.244311
## 3: 1.999998 0.244311
## 4: 27.000006 0.549856
## 5: 13.999986 0.408810
##
## CPU time on 1 core(s): 00:00:01
genocov.files
(many values)Many values can be passed to genocov.files
parameter, but this case is thought for parallel calculations.
dir <- package.file("extdata", "solarAssoc", package = "solarius")
genocov.files <- file.path(dir, c("snp.genocov1", "snp.genocov2"))
snplists.files <- file.path(dir, c("snp.geno-list1", "snp.geno-list2"))
A4 <- solarAssoc(trait ~ 1, phenodata,
genocov.files = genocov.files, snplists.files = snplists.files)
A4
##
## Call: solarAssoc(formula = trait ~ 1, data = phenodata, genocov.files = genocov.files,
## snplists.files = snplists.files)
##
## Input SNP data:
## * SNP covariates passed in 2 file(s) by `genocov.files` argument and 2 files(s) by `snplists.files` argument
##
## Output results of association:
##
## * Table of association results (first 5 out of 25 rows):
## SNP NAv chi pSNP bSNP bSNPse Varexp est_maf
## 1: s1 66 0.189994 0.662922 0.160860 0.369045 0.002875 0.053031
## 2: s10 66 0.050972 0.821380 -0.105065 0.465364 0.000772 0.015152
## 3: s11 66 0.050972 0.821380 -0.105065 0.465364 0.000772 0.015152
## 4: s12 66 0.909920 0.340136 -0.196598 0.206100 0.013692 0.204545
## 5: s13 66 0.182890 0.668901 0.118876 0.277971 0.002767 0.106061
## est_mac dosage_sd
## 1: 7.000026 0.307915
## 2: 1.999998 0.244311
## 3: 1.999998 0.244311
## 4: 27.000006 0.549856
## 5: 13.999986 0.408810
##
## CPU time on 1 core(s), 2 batches: 00:00:01
QQ and Manhattan plots help to explore the association results visually.
A5 <- solarAssoc(trait ~ 1, phenodata, snpdata = genodata, snpmap = snpdata, kinship = kin)
plot(A5)
plot(A5, "qq")
Parallelization of the association analysis is an easy task with solarius
. The user only needs to specify how many cores are to be used for the computation.
A6 <- solarAssoc(trait ~ 1, phenodata, snpdata = genodata, kinship = kin, cores = 2)
The gain in CPU time is not obvious, as the number of SNPs (50) is too small.
A1$assoc$tprofile$cputime.sec
## [1] 3.268
A6$assoc$tprofile$cputime.sec
## [1] 2.458
dat30
data set contains 100 SNPs (genocovdat30
matrix). The same test on speed up by parallel computing looks as following.
A7 <- solarAssoc(trait1 ~ 1, dat30,
snpcovdata = genocovdat30, snpmap = mapdat30)
A8 <- solarAssoc(trait1 ~ 1, dat30,
snpcovdata = genocovdat30, snpmap = mapdat30,
cores = 2)
A7$assoc$tprofile$cputime.sec
## [1] 5.655
A8$assoc$tprofile$cputime.sec
## [1] 3.884
solarMultipoint
function performs the linkage analysis by calling SOLAR
command multipoint. The function looks for multipoint IBD matrices in the directory pointed out by mibddir
argument. The analysis can be customized by combination of options via multipoint.options
and multipoint.settings
arguments.
Phenotypes in dat30
data set and IBD matrices are placed in inst/extdata/solarOutput directory of the package.
The basic model requires only the path to a directory with IBD matrices (mibddir
argument). By default, all SOLAR
specific files, which represent IBD matrices, will be used in the analysis.
mibddir <- package.file("extdata", "solarOutput", "solarMibdsCsv", package = "solarius")
list.files(mibddir)
## [1] "mibd.2.0.gz" "mibd.2.1.gz" "mibd.2.2.gz" "mibd.2.3.gz" "mibd.2.4.gz"
## [6] "mibd.2.5.gz" "mibd.5.0.gz" "mibd.5.1.gz" "mibd.5.2.gz" "mibd.5.3.gz"
## [11] "mibd.5.4.gz" "mibd.5.5.gz"
L1 <- solarMultipoint(trait1 ~ 1, dat30, mibddir = mibddir)
L1
##
## Call: solarMultipoint(formula = trait1 ~ 1, data = dat30, mibddir = mibddir)
##
## Input IBD data:
## * directory /home/andrey/git/ugcd/solarius/inst/extdata/solarOutput/solarMibdsCsv
##
## Output results of association:
##
## * Table of association results (first 5 out of 12 rows):
## chr pos LOD Loglike H2r H2q1
## 1 2 0 1.4888 -208.503 0.448464 0.403780
## 2 2 1 2.4232 -206.352 0.244560 0.614310
## 3 2 2 3.2759 -204.388 0.123184 0.742126
## 4 2 3 3.5569 -203.741 0.097187 0.763608
## 5 2 4 3.1610 -204.653 0.179701 0.668166
##
## CPU time on 1 core(s): 00:00:03
Methods print
, summary
and plot
help to explore the results of analysis. The data table of results is stored in lodf
slot of the returned object.
summary(L1)
##
## Call: solarMultipoint(formula = trait1 ~ 1, data = dat30, mibddir = mibddir)
##
## Multipoint model
## * Number of used markers: 12
## * Number of passes: 1
## * Maximum LOD score: 3.56
## -- chr: 2
## -- position: 3 cM
plot(L1)
The results for chromosomes 2 and 5 are identical, since the IBD matrices for chromosome 2 are duplicates of the matrices for chromosome 5.
The following code shows an example of a custom linkage analysis. In particular, the linkage is run on chromosome 5 with a 2 cM interval, Fine mapping is disabled and the second pass is enables via multipoint.options
argument.
L2 <- solarMultipoint(trait1 ~ 1, dat30, mibddir = mibddir, chr = 5, interval = 2,
multipoint.settings = "finemap off", multipoint.options = "3")
summary(L2)
##
## Call: solarMultipoint(formula = trait1 ~ 1, data = dat30, mibddir = mibddir,
## chr = 5, interval = 2, multipoint.options = "3", multipoint.settings = "finemap off")
##
## Multipoint model
## * Number of used markers: 3
## * Number of passes: 2
## * Maximum LOD score: 3.28
## -- chr: 5
## -- position: 2 cM
The LOD scores for the second path (stored in lodf2
) are plotted with the code given below.
plot(L2, pass = 2)
The bivariate linkage analysis is performed in the same way as the bivariate polygenic analysis.
L3 <- solarMultipoint(trait1 + trait2 ~ 1, dat30, mibddir = mibddir, chr = 5,
interval = 2, multipoint.settings = "finemap off")
summary(L3)
##
## Call: solarMultipoint(formula = trait1 + trait2 ~ 1, data = dat30,
## mibddir = mibddir, chr = 5, interval = 2, multipoint.settings = "finemap off")
##
## Multipoint model
## * Number of used markers: 3
## * Number of passes: 1
## * Maximum LOD score: 2.47
## -- chr: 5
## -- position: 2 cM
L4 <- solarMultipoint(trait1 + trait2 ~ age + sex(trait1), dat30,
mibddir = mibddir, chr = 5, interval = 2, multipoint.settings = "finemap off")
summary(L4)
##
## Call: solarMultipoint(formula = trait1 + trait2 ~ age + sex(trait1),
## data = dat30, mibddir = mibddir, chr = 5, interval = 2, multipoint.settings = "finemap off")
##
## Multipoint model
## * Number of used markers: 3
## * Number of passes: 1
## * Maximum LOD score: 2.68
## -- chr: 5
## -- position: 2 cM
Parallelization of the linkage analysis is again parametrized with cores
argument, but the chromosome values are expected to be a vector of values. That means that each core does the job on a single chromosome.
L5 <- solarMultipoint(trait1 ~ 1, dat30, mibddir = mibddir,
chr = c(2, 5), cores = 2)
L1$multipoint$tprofile$cputime.sec
## [1] 3.229
L5$multipoint$tprofile$cputime.sec
## [1] 2.153
SOLAR
referencesA list of references given on the official web page of SOLAR
http://solar.txbiomedgenetics.org/:
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.3 LTS
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] solarius_0.3.1 plyr_1.8.3 gait_0.1 data.table_1.9.6
## [5] rmarkdown_0.8 knitr_1.11 devtools_1.9.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.2 compiler_3.2.2 formatR_1.2.1 bitops_1.0-6
## [5] iterators_1.0.7 tools_3.2.2 digest_0.6.8 jsonlite_0.9.19
## [9] evaluate_0.8 memoise_0.2.1 gtable_0.1.2 foreach_1.4.2
## [13] curl_0.9.4 yaml_2.1.13 parallel_3.2.2 qqman_0.1.2
## [17] proto_0.3-10 roxygen2_5.0.1 stringr_1.0.0 httr_1.0.0
## [21] grid_3.2.2 R6_2.1.1 XML_3.98-1.3 ggplot2_1.0.1
## [25] reshape2_1.4.1 magrittr_1.5 scales_0.3.0 codetools_0.2-14
## [29] htmltools_0.2.6 rsnps_0.1.6 MASS_7.3-44 colorspace_1.2-6
## [33] labeling_0.3 stringi_1.0-1 RCurl_1.95-4.7 doParallel_1.0.8
## [37] munsell_0.4.2 chron_2.3-47
This document is licensed under the Creative Commons Attribution 4.0 International Public License.
Almasy, L, and J Blangero. 1998. “Multipoint quantitative-trait linkage analysis in general pedigrees.” American Journal of Human Genetics 62 (5): 1198–1211. doi:10.1086/301844.
Almasy, L, T D Dyer, and J Blangero. 1997. “Bivariate quantitative trait linkage analysis: pleiotropy versus co-incident linkages.” Genetic Epidemiology 14 (6): 953–8. doi:10.1002/(SICI)1098-2272(1997)14:6<953::AID-GEPI65>3.0.CO;2-K.
Andrade, Mariza De, Elizabeth J Atkinson, Christopher I Amos, and Jianfang Chen. 2006. Estimating Genetic Components of Variance for Quantitative Traits in Family Studies using the MULTIC routines.
Blangero, J. 2009. “Statistical genetic approaches to human adaptability.” Human Biology 81 (5-6): 523–46. doi:10.3378/027.081.0603.
Blangero, J, and L Almasy. 1997. “Multipoint oligogenic linkage analysis of quantitative traits.” Genetic Epidemiology 14 (6): 959–64. doi:10.1002/(SICI)1098-2272(1997)14:6<959::AID-GEPI66>3.0.CO;2-K.
Blangero, John, Jeff T Wiliiams, and Laura Almasy. 2000. “Robust LOD Scores for Variance Linkage Analysis.” Genetic Epidemiology 19 (Suppl.
Blangero, John, Jeff T Williams, and Laura Almasy. 2001. “Variance component methods for detecting complex trait loci.” Advances in Genetics 42: 151–81.
Duggirala, R, J T Williams, S Williams-Blangero, and J Blangero. 1997. “A variance component approach to dichotomous trait linkage analysis using a threshold model.” Genetic Epidemiology 14 (6): 987–92. doi:10.1002/(SICI)1098-2272(1997)14:6<987::AID-GEPI71>3.0.CO;2-G.
Lynch, Michael, and Bruce Walsh. 1998. Genetics and Analysis of Quantitative Traits. Vol. 1. Sinauer Sunderland.
SOLAR Documentation. “SOLAR version 6.6.2.” http://helix.nih.gov/Documentation/solar-6.6.2-doc/index.html.
Towne, B, R M Siervogel, and J Blangero. 1997. “Effects of genotype-by-sex interaction on quantitative trait linkage analysis.” Genetic Epidemiology 14 (6): 1053–8. doi:10.1002/(SICI)1098-2272(1997)14:6<1053::AID-GEPI82>3.0.CO;2-G.
Williams, J T, and J Blangero. 2004. “Power of variance component linkage analysis-II. Discrete traits.” Annals of Human Genetics 68 (Pt 6): 620–32. doi:10.1046/j.1529-8817.2004.00128.x.
Williams, J T, P Van Eerdewegh, L Almasy, and J Blangero. 1999. “Joint multipoint linkage analysis of multivariate qualitative and quantitative traits. I. Likelihood formulation and simulation results.” American Journal of Human Genetics 65 (4): 1134–47. doi:10.1086/302570.
Williams, Jeff T, and John Blangero. 1999. “Comparison of Variance Components and Sibpair-Based Approaches to Quantitative Trait Linkage Analysis in Unselected Samples” 134 (February 1998): 113–34.