1 Introduction

This vignette is a tutorial on the R package solarius. The document contains a brief description of the main statistical models (polygenic, association and linkage) implemented in SOLAR and accessible via solarius, installation instructions for both SOLAR and solarius, reproducible examples on synthetic data sets available within the solarius package. The examples are intended to cover all the necessary steps in the quantitative genetic analysis with SOLAR: data loading, data exploration, running statistical models and interpreting the obtained results by printing, summarizing and plotting the models in R.

In human genetics research, the study of a complex disorder is targeted to discover its underlying genetic architecture. Whereas simple Mendelian diseases are results of a mutation in a single gene, for example Huntington’s disease, complex disorders or phenotypes are the product of the action of multiple genes and environmental factors as well as their interactions. For instance, obesity is a complex disease which genetic basis might be better dissected via measurable quantitative variation in phenotypes, such as body-mass index, closely related to the disease risk.

One of the study designs to identify the quantitative trait loci (QTLs) that affect complex phenotypes is based on examination of related individuals grouped by families (pedigrees). Pedigree-based designs allow for both QTL linkage and association mappings. In addition, QTL analysis with pedigrees possess more power to detect rare variants, permits explicit control for population stratification in association studies, and suit for examination of parent-of-origin and sex-specificity effects.

The analysis of the pedigree data is traditionally based on variance component (VC) models, also known as linear mixed models. SOLAR is one of the well-known and widely used tools that implements the variance component models and other routines needed for the QTL analysis in extended pedigrees (L. Almasy and Blangero 1998) (http://solar.txbiomedgenetics.org/). The name SOLAR stands for Sequential Oligogenic Linkage Analysis Routines. The analyses available in SOLAR include customized quantitative genetic analysis, advanced linkage analysis, SNP association analysis (QTN, QTLD, and MGA). Other operations are implemented for calculation of marker-specific or multipoint identity-by-descent (IBD) matrices in pedigrees of arbitrary size and complexity. The linkage analysis allows for multiple quantitative traits and/or discrete traits which might involve multiple loci (oligogenic analysis), dominance effects, household effects, and interactions.

There are other tools similar to SOLAR, such as Mendel, MMM, GEMMA and FAST-LMM, but the R package solarius presented here has been especially developed to work with SOLAR.

solarius provides an interface from the program SOLAR to the environment for statistical computing R. On one side, main SOLAR procedures can be called in R by one or a few commands from the solarius package, that makes the analysis more user-friendly with the same power of computations implemented in SOLAR. On the other side, the solarius user benefits from the infrastructure available with the R environment and its packages, such as interactive data manipulation, data visualization and parallel computation.

2 Methodology

2.1 Linear mixed model

Consider that a particular trait, such as body-mass index, is observed in \(n\) individuals grouped by families or clusters of related individuals. A \(n \times 1\) vector \(y\) contains the phenotypic values measured for the trait. One assumes that these observations are described adequately by a linear model with a \(p \times 1\) vector of fixed effects (\(\beta\)) and a \(q \times 1\) vector of random effects (\(u\)) (Lynch and Walsh 1998, Chapter 26, pp. 746). In matrix form,

\[ y = X \beta + Z u + e \]

where \(X\) and \(Z\) are respectively \(n \times p\) and \(n \times q\) incidence matrices, and \(e\) is a \(n \times 1\) vector of residual deviations assumed to be distributed independently of the random effects. Denote the \(n \times n\) covariance matrix for the vector \(e\) by \(R\) and the the \(q \times q\) covariance matrix for the vector \(u\) by \(G\).

The first element of the vector \(\beta\) is typically the population mean, and other elements may be age, gender, diet and so on. The elements of the incidence matrices are usually equal to 0 or 1, that indicate the effect contribution to the individuals’s phenotype. The model is referred to as a mixed model, because this model jointly accounts for fixed and random effects.

Giving the two terms of the random effects \(u\) and \(e\), the first term typically accounts for the contribution from random genetic effects such as additive genetic values. The second term accounts for the residual variance, and one usually assumes that residual errors have constant variance and are uncorrelated, so that \(R\) is a diagonal matrix, with \(R = \sigma_e^2 I\).

Two more equations for the means and variances of the component vectors of the mixed model can be derived on the basis of its definition. Since \(E(u) = E(e) = 0\) by definition,

\[ E(y) = X \beta \]

Excluding the difference among individuals due to fixed effects and calling the assumption that \(u\) and \(e\) are uncorrelated, the covariance matrix for the vector of observations \(y\) is

\[ Var(y) = V = Z G Z^{T} + R \]

Now the mixed model can be rewritten in a compact form

\[ y = X \beta + Z u + e\]

\[ \mbox{ where } \mbox{ } \mbox{ } u \sim (0, G) \mbox{ and } e \sim (0, R) \\ \mbox{ implying } \mbox{ } \mbox{ } y \sim (X \beta, V) = (X \beta, Z G Z^T + R) \]

For the mixed model, \(y\), \(X\) and \(Z\) are observed variables, while \(\beta\), \(u\), \(R\) and \(G\) are generally unknown variables. Thus, the analysis consists of two complementary estimation procedures: (1) estimation of the covariance matrices \(G\) and \(R\), and (2) estimation of the vectors of fixed and random effects, \(\beta\) and \(u\). The covariance matrices are usually functions of a few unknown parameters or variance components. For instance, the covariance matrix of the residuals errors typically has the form \(R = \sigma_e^2 I\), where \(\sigma_e^2\) is the parameter to be estimated.

The description of the two estimation procedures is out of scope of this document, and the reader is referred to (Lynch and Walsh 1998, Chapter 26-27), where the necessary theoretical material and accompanying examples are given. On the other hand, the implementation of the estimation procedures in SOLAR is discussed elsewhere (L. Almasy and Blangero 1998).

The next three sections will present SOLAR models, which follow the general definition of linear mixed model given above, but parametrization of the models and estimation procedures are implemented in a SOLAR-specific way.

2.2 Polygenic model in SOLAR

The polygenic model is a basis for any quantitative genetic analysis in SOLAR. The model includes fixed effects (\(\beta\)) and three random effects (the genetic additive effect \(g\), the house-hold effect \(c\), and the residual effect \(e\)).

\[y = X\beta + g + c + e\]

\[\mbox{ where } \mbox{ } \mbox{ } g \sim (0, \sigma_g^2 A) \mbox{ and } \mbox{ } \mbox{ } c \sim (0, \sigma_c^2 H) \mbox{ and } \mbox{ } \mbox{ } e \sim (0, \sigma_e^2 I)\]

\[ \mbox{ implying } \mbox{ } \mbox{ } y \sim (X\beta, V) = (X\beta, \sigma_g^2 A + \sigma_c^2 H + \sigma_e^2 I)\]

The first \(g\) term accounts for the additive genetic effect with the covariance matrix \(A = 2 K\), where \(K\) is the kinship matrix. SOLAR estimates the empirical kinship matrix based on the identifier (ID) fields in the pedigree data, such as individual ID id, gender sex, family ID fam, father ID fa, mother ID mo, and indicator of identical individuals, e.g. monozygotic twins, mztwin (SOLAR Documentation, Appendix 1, file-pedigree command).

The second \(c\) term represents an annotated environmental effect. It is named as a house-hold effect, but it is generally related to the sharing of any non-genetic factor among individuals. The house-hold effect is included in the analysis if the pedigree data contain a house-hold ID field (the expected field name is hhid). SOLAR automatically creates Pedigree-Household Groups based on the hhid identifiers. That means that pedigrees and household groups are merged into larger groups, which contain every individual in the same pedigree as well as every individual that has the same hhid as any individual in the pedigree (SOLAR Documentation, Section 9.3 Household Group Analysis). As a consequence of this grouping operation by SOLAR, the covariance matrix \(H\) of the \(c\) random effect contains the information related to the pedigree-household groups rather than relationships given by the hhid field.

The third \(e\) term is the residual error with the diagonal covariance matrix. It is important to note that the residual command in SOLAR (SOLAR Documentation, Appendix 1, residual command) returns the covariate-free value of a trait under study, expressed as \(y - X \beta\), and the returned value is not the residual vector \(e\) given in the equation of the polygenic model.

The polygenic term \(g\) represents the additive genetic effect by default. The incorporation of the dominance effect into SOLAR models is also possible, but it is attributed to advanced modeling topics (SOLAR Documentation, Section 9.4 Dominance Analysis). The SOLAR’s developers suggest that the dominance variance component is unlikely useful in quantitative trait mapping, particularly in human genetic analysis, as only bi-lineal relatives contribute to this component.

SOLAR is able to work with discrete traits, which possible values are coded as two consecutive integers, for example, 0 and 1 (SOLAR Documentation, Appendix 1, discrete-notes). SOLAR uses a liability threshold model to handle discrete traits (Duggirala et al. 1997), and more details about implementation and practical notes are available in (SOLAR Documentation, Section 9.1 Discrete Traits). SOLAR authors recommend to work with quantitative traits whenever it is possible. The user is able to disable the transformation of discrete traits, if it is necessary.

The likelihood-based statistics, in particular Likelihood Ratio Tests (LRTs), is used for the inference of the linear mixed models implemented in SOLAR. The following table shows the alternative models used to test the significance of the random effects \(g\) and \(c\).

Alternative model Free parameters Restricted parameters
No polygenic effect \(\beta\), \(\sigma_c^2\), \(\sigma_e^2\) \(\sigma_g^2\) = 0
No house-hold effect \(\beta\), \(\sigma_g^2\), \(\sigma_e^2\) \(\sigma_c^2\) = 0

Under the assumption of multivariate normality, the LRT statistic is asymptotically distributed as a \(\frac{1}{2}:\frac{1}{2}\) mixture of of a \(\chi_1^2\) variable and a point mass at the origin (J. Blangero, Wiliiams, and Almasy 2000). This statement is applied to any LRT performed for parameters like \(\sigma_g^2\), which is greater or equal to 0 by definition.

SOLAR takes the decision of rejecting the house-hold effect by considering two factors, the test results and the value of \(\sigma_c^2\) parameter. If value of \(\sigma_c^2\) is non-zero, the house-hold effect is retained in the polygenic model, nevertheless the effect is not significant. The user has an option to force the rejection of the house-hold effect if it is necessary.

The tests applied to fixed effects (covariates) are also performed by means of LRTs. In the table given below the alternative model for the test of significance of age covariate is shown.

Alternative model Free parameters Restricted parameters
No age effect \(\sigma_g^2\), \(\sigma_c^2\), \(\sigma_e^2\) \(\beta_{age}\) = 0

2.2.1 Multivariate polygenic model in SOLAR

In principle, one can estimate the multivariate polygenic model by simply performing the univariate analysis in an appropriate way. Consider two traits \(y_1\) and \(y_2\) measured in each of \(n\) individuals, assuming that each trait follows the same univariate polygenic analysis. Then the \(2 \times n\) dimensional column vector of observations is denoted by the stack of univariate vectors, and the polygenic model has the matrix form:

\[ \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}_{2n \times 1} = \begin{pmatrix} X_1 & 0 \\ 0 & X_2 \end{pmatrix}_{2n \times nk} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}_{2n \times 1} + \begin{pmatrix} g_1 \\ g_2 \end{pmatrix}_{2n \times 1} + \begin{pmatrix} e_1 \\ e_2 \end{pmatrix}_{2n \times 1} \]

\[ G_{2n \times 2n} = C_{2 \times 2} \otimes A_{n \times n} = \begin{pmatrix} c_{11} A & c_{12} A \\ c_{21} A & c_{21} A \end{pmatrix} = \begin{pmatrix} \sigma_{1}^2 A & \rho \sigma_{1} \sigma_{2} A \\ \rho \sigma_{1} \sigma_{2} A & \sigma_{2}^2 A \end{pmatrix}_{g} \]

\[ R_{2n \times 2n} = E_{2 \times 2} \otimes I_{n \times n} = \begin{pmatrix} e_{11} I & e_{12} I \\ e_{21} I & e_{21} I \end{pmatrix} = \begin{pmatrix} \sigma_{1}^2 I & \rho \sigma_{1} \sigma_{2} I \\ \rho \sigma_{1} \sigma_{2} I & \sigma_{2}^2 I \end{pmatrix}_{e} \]

The model can be generalized to \(k\) traits, and other random effects such as house-hold can be added in the similar way. The Kronecker product notation \(\otimes\) is convinient for denoting the covariance matrices. More details on the multivariate analysis are given in (Lynch and Walsh 1998, Chapter 26, Multiple Traits, pp. 774).

The multivariate analysis implemented in SOLAR allows for any number of traits, but the tests for significance of the correlation coefficients are performed only for the bivariate models. The following table describes the main LRTs available in SOLAR for the bivariate analysis.

Alternative model Free parameters Restricted parameters
No genetic correlation \((\sigma_1^2, \sigma_2^2)_g\), \((\sigma_1^2, \sigma_2^2, \rho)_e\) \(\rho_g = 0\)
Complete pleiotropy \((\sigma_1^2, \sigma_2^2)_g\), \((\sigma_1^2, \sigma_2^2, \rho)_e\) \(\rho_g = 1\)
No environmental correlation \((\sigma_1^2, \sigma_2^2, \rho)_g\), \((\sigma_1^2, \sigma_2^2)_e\) \(\rho_e = 0\)

It is important to note that the transformation of discrete traits in the multivariate analysis is automatically performed by SOLAR.

2.3 Linkage model in SOLAR

SOLAR introduces sequential oligogenic linkage analysis, as stated in its name abbreviation. Consider the linkage model with a single QTL, which is described by another random effect \(q\).

\[y = X\beta + g + c + q + e\]

\[ \mbox{ where } \mbox{ } \mbox{ } g \sim (0, \sigma_g^2 A) \mbox{ and } \mbox{ } \mbox{ } c \sim (0, \sigma_c^2 H) \mbox{ and } \mbox{ } \mbox{ } q \sim (0, \sigma_q^2 M) \mbox{ and } \mbox{ } \mbox{ } e \sim (0, \sigma_e^2 I) \]

\[ \mbox{ implying } \mbox{ } \mbox{ } y \sim (X\beta, V) = (X\beta, \sigma_g^2 A + \sigma_c^2 H + \sigma_q^2 M + \sigma_e^2)\]

The QTL under study \(q\) corresponds to a specific position in the genome and an identity-by-descent (IBD) matrix. The genome-wide linkage analysis assumes that the highly informative markers, such as micro-satellites, are available in the study, in order to estimate the two-point or multipoint IBD matrices by SOLAR or other specific program like MERLIN. Once the IBD matrices are calculated, the genome-wide scan is conducted by sequentially running the linkage model for each genome position.

The alternative model for testing the significance of a single QTL effect is the same as the polygenic model.

Alternative model Free parameters Restricted parameters
No linkage \(\beta, \sigma_g^2, \sigma_c^2\), \(\sigma_e^2\) \(\sigma_q^2\) = 0

Traditionally, the logarithm (base 10) of the odds (LOD) values are used to score the QTLs. If the LRT statistics for the linkage model is denoted as \(\Lambda\), then the correspondent LOD score is equal to \(\Lambda / (2 ln(10))\).

SOLAR also allows for multi-pass linkage scan, that means that any further scan can be conditioned to QTLs found in the previous scans. For example, the second-pass scan conditioned to the previously found QTL \(q_1\) has the form:

\[y = X\beta + g + c + q_1 + q + e\]

The second-pass LRT test now looks like in the table below.

Alternative model Free parameters Restricted parameters
No linkage \(\beta, \sigma_g^2, \sigma_c^2\), \(\sigma_{q_1}^2\), \(\sigma_e^2\) \(\sigma_q^2\) = 0

The linkage model is extended to deal with many traits or discrete traits in the same way as for the polygenic model.

Additional information related to the linkage modeling in SOLAR can be found in related publications (L. Almasy and Blangero 1998), (L. Almasy, Dyer, and Blangero 1997), (Williams et al. 1999), the SOLAR tutorial (SOLAR Documentation, Chapter 3 Tutorial), and the SOLAR manual, for example, (SOLAR Documentation, Apendix 1, multipoint command).

2.4 Association model in SOLAR

A basic association model in SOLAR is extended from the polygenic model by simply adding a new covariate (fixed effect) snp.

\[ y = X\beta + \beta_{snp} * snp + g + c + e\]

\[\mbox{ where } \mbox{ } \mbox{ } g \sim (0, \sigma_g^2 A) \mbox{ and } \mbox{ } \mbox{ } c \sim (0, \sigma_c^2 H) \mbox{ and } \mbox{ } \mbox{ } e \sim (0, \sigma_e^2 I)\]

\[ \mbox{ implying } \mbox{ } \mbox{ } y \sim (X\beta, V) = (X\beta, \sigma_g^2 A + \sigma_c^2 H + \sigma_e^2 I)\]

In the genome-wide association scan SOLAR passes through SNPs variants under the study one by one according to the equation given above. The p-value of association between trait \(y\) and SNP \(snp\) is again based on LRT test with the alternative model outlined in the table below. This alternative model is the same as the polygenic model.

Alternative model Free parameters Restricted parameters
No association \(\beta, \sigma_g^2, \sigma_c^2\), \(\sigma_e^2\) \(\beta_{snp} = 0\)

More details on implementation of the association analysis in SOLAR can be found in the SOLAR manual, for example, (SOLAR Documentation, Apendix 1, mga and snp commands).

3 Software

The idea of the proposed software (the R package solarius) is to perform the calculus in SOLAR at the back-end and to offer the user interface in R at the front-end. The table given below shows a basic description of the software via implementation of the main three models: polygenic, association and linkage.

Model SOLAR command solarius function solarius S3 classes of returned object Parallel computation
Polygenic polygenic solarPolygenic solarPolygenic None
Association mga solarAssoc solarAssoc, solarPolygenic Automatic or custom (by SNP file)
Linkage multipoint solarMultipoint solarMultipoint, solarPolygenic Custom (by chromosome)

One can see how the software benefits from both SOLAR and R sides. On the side of SOLAR, the SOLAR commands polygenic, mga and multipoint do all the hard job on estimating the variance component models with control via a set of parameters and settings. On the side of R, the solarius functions solarPolygenic, solarAssoc and solarMultipoint prepare the input data for the analysis, pass the arguments to the commands and set up other necessary settings, and finally read the output SOLAR files and store the results into R objects of S3 classes in R.

The main advantage of running the SOLAR models in solarius instead of SOLAR is that the analysis is conduced by only one function call, while the same analysis in SOLAR could require typing more commands with appropriate options and settings (to be learned from the SOLAR manual). The solarius functions just automate the analysis flow and keep configuration options via the R function arguments, if such configuration is required. Another advantage of using the R package solarius is that the results are stored in objects of especially designed S3 classes with associated methods like print, summary, plot and others.

The last column of the table given above contains the information related to parallel computation available for the models. The implementation of parallel calculations is straightforward in solarius package, since the association and linkage analyses are embarrassingly parallel problems (easily divisible into different identical independent task) and the R environment offers a number of packages with parallel interfaces.

More information about the solarius functions is available in the software manual of the package, and the practical guide of using the functions is presented in the next section dedicated to examples. Meantime, some points should to be stressed to describe some other features and drawbacks of the solarius package.

3.1 Installation

The installation process has three steps: install SOLAR, register SOLAR and install solarius.

3.1.1 SOLAR installation

The SOLAR distribution is available on the official web page http://solar.txbiomedgenetics.org/ for the most popular operating systems (Linux, Mac OS, Windows).

In the case of Linux, the installation files are compressed in solar_linux.tar.gz archive. Once the archive is downloaded to a local machine of the user and the installation files are unpacked, one can find the README file and install_solar script. In particular, the script copies the necessary library, binary and documentation files into an installation directory defined by the user. A special bash script called solar is also created and copied to the system path defined by the user, e.g. /usr/local/bin/solar, so that the program can be executed by simply typing solar in the terminal.

An example of the installation command might be the following.

./install_solar /opt/appl/solar/7.6.6 /usr/local/bin solar

As a result of execution of the command, the library, binary and documentation files of SOLAR version 7.6.6 will be copied to /opt/appl/solar/7.6.6 directory, and solar execution script will be copied to /usr/local/bin system directory.

After the installation process the new user is required to obtain a registration key, as it is explained by the message that appears when SOLAR program starts. The registration key has to be stored in .solar_reg file in the home directory of the user.

Once both installation and registration processes are finished, something similar to the following message will be displayed when SOLAR is executed.

SOLAR Eclipse version 7.6.6 (Experimental), last updated on December 11, 2014
Copyright (c) 1995-2014 Texas Biomedical Research Institute
Enter help for help, exit to exit, doc to browse documentation.

solar>

3.1.2 Installation problems with SOLAR

SOLAR version for Windows. Currently, SOLAR does not have an appropriated version for Windows. As stated in the official web page of SOLAR http://solar.txbiomedgenetics.org/solarwindows.html:

Because SOLAR requires a Unix OS or equivalent, we have prepared a VMware Virtual Machine containing SOLAR plus the Xubuntu edition of linux which runs on Windows XP or later using the free VMware Player. Beware this requires a very large multi 1.4 Gb download, and will require more than 10 gigabytes free disc space during installation.

This is a clear limitation of solarius package that comes from the dependency on SOLAR. Starting from version 3.*, solarius is not supported for Windows. DESCRIPTION file has a line OS_type: unix.

3.1.3 solarius installation

solarius package is available in the CRAN repository for R packages http://cran.r-project.org/web/packages/solarius/, so it can be install by calling the usual R function install.packages.

install.packages(solarius)

The code repository of solarius package is hosted in the public GitHub repository of the UGCD group https://github.com/ugcd/solarius. The R package devtools allows to install the development version of the package hosted at GitHub. The installation commands are the following.

library(devtools)
install_github("ugcd/solarius")

3.1.4 Installation problems with solarius

solarius has several R packages as dependencies (that can be seen in DESCRIPTION file). Installation of these packages should be straightforward using the standard R function install.packages.

In the case of problems in the installation process, web pages of the packages are the main source of information.

Imported packages:

Installation of data.table R package. If the user experiences problems installing data.table package, we recommend to upgrade the R environment to the latest stable version.

3.2 solarius R package structure

solarius is organized as a standard R package, and its structure can be seen in the code repository web site https://github.com/ugcd/solarius.

  • R directory contains *.R files with functions, classes and methods.
  • data directory has data sets distributed with the package and stored in *.RData files.
  • man directory includes documentation organized in Rd files (R documentation files).
  • inst directory includes directories, where some extra data in any format can be stored for the R package. Sub-directories in inst directory will be copied to the installation directory of the package and, thus, will be available within the package via system.file function.
    • inst/examples has *.R scripts with code examples.
    • inst/extdata mostly contains SOLAR data files for data sets of the package.
    • inst/tests has *.R scripts for testing units during development of the package (R package testthat).
  • vignettes directory includes vignettes, which are known as a long-form documentation for an R package. Typically, vignettes has a form of a technical report or an academic paper, that describes the problem that your package is designed to solve. Files with Rmd extension are the source files of the vignettes.

3.3 Documentation

The source files of documentation are stored in man directory as Rd files. R package staticdocs converts this documentation into html pages. The documentation for solarius is published on http://ugcd.github.io/solarius/doc/.

In the R environment the help for a function of the package, for example solarPolygenic function, can be obtained by the standard ?solarPolygenic R comand. The content of the documentation for solarPolygenic function is the same as on http://ugcd.github.io/solarius/doc/solarPolygenic.html.

3.4 Vignettes

The vignetts are created by rmarkdown R package and published as html pages on the web.

3.5 Load solarius

To start using solarius package, the user needs to load the package with library function.

library(solarius)

The version of solarius loaded here is:

packageVersion("solarius")
## [1] '0.3.1'

Other R packages will be needed for the tutorial.

library(plyr)

3.6 Minimal example

The following code shows how the main functions (solarPolygenic, solarMultipoint and solarAssoc) of the package work.

# load library
library(solarius)

# load data set
data(dat30)

# univariate polygenic model
mod1 <- solarPolygenic(trait1 ~ 1, dat30)

# bivariate polygenic model
mod2 <- solarPolygenic(trait1 + trait2 ~ 1, dat30,
  polygenic.options = '-testrhoe -testrhog')

# specify directory with IBD matrices and run linkage model
mibddir <- system.file('extdata', 'solarOutput',
  'solarMibdsCsv', package = 'solarius') 
link <- solarMultipoint(trait1 ~ 1, dat30,
  mibddir = mibddir, chr = 5)

# run association model in parallel
assoc <- solarAssoc(trait1 ~ 1, dat30, cores = 2,
  snpcovdata = genocovdat30, snpmap = mapdat30)

The user can think of this block of code as a quick start in using solarius. The material given further will provide more details and all necesseary information on the usage of solarius.

3.7 Data sets

The genome-wide QTL analysis typically involves large-scale data sets. Data sets distributed within solarius package contain relatively small simulated data, which are used for reproducible demonstrations of how functions of the package work. There are three data sets available in solarius:

  • An adapted version of the simulated data from the R package multic (Andrade et al. 2006) (polygenic and linkage models);
    • A smaller subset of these data is given in dat30 data set (polygenic, linkage and association models);
  • An adapted version of the simulated data dat50 from the R package FFBSKAT (polygenic model and association model, both with custom kinship matrix);
  • GAW10 data distributed within SOLAR (polygenic and linkage models).

The simulated data set (in SOLAR format) from multic R package is placed in inst/extdata/solarOutput directory of solarius package. 1200 individuals in 200 families are included in the study. The simulated phenotypes include two traits trait1 and trait2 and two covariates age and sex. The two traits possess a high genetic correlation. For the linkage analysis 6 IBD matrices for chromosome 5 6 IBD matrices for chromosome 2 are available. The six matrices of chromosome 5 are the original data from multic package, while the six matrices of chromosome 2 are the exact copies of those of chromosome 5. Data for the two chromosomes are needed to test parallel calculation (separated by chromosome) of the linkage model.

A smaller subset of the simulated data from multic package is provided in dat30 data set (dat30.RData file in data directory). This smaller data set contains only 174 individuals in 29 families.

  • The phenotypes (dat30 data frame) are the same as in the complete data described above.
  • A hundred of synthetic SNPs (genocovdat30 matrix) were randomly generated to run the association analysis on these data (see the script inst/scripts/generate-genocovdat30.R).
  • Annotation information (mapdat30 data frame) also was generated, mainly in order to plot the association results with Manhattan plot.

Documentation of dat30 data set is available on the page dat30.

The simulated data from the R package FFBSKAT are stored in dat50 data set (dat50.RData file in data directory). This data set is adapted from example.data data set in the R package FFBSKAT. The following variables are included.

  • phenodata: A data.frame with the phenotypes trait, sex and age given for 66 individuals (related and unrelated).
  • genodata: A matrix of 50 genetic variants (columns) given for 66 individuals (rows). The genotypes are coded in the format such as 1/1, 1/2 and 2/2.
  • genocovdata: A matrix of covariates derived from the genotype data (the additive model). The covariates are coded as integers 0, 1 and 2.
  • snpdata: A data.frame of annotation for genetic variants. The variables name of the variant name, chromosome chrom, gene name gene and position in bp position.
  • kin: A kinship matrix for 66 individuals.

Documentation of dat50 data set is available on the page dat50.

The simulated data from the workshop GAW10 are described elsewhere (SOLAR Documentation, Chapter 3 Tutorial).

4 Examples

4.1 Loading data

To load both data sets dat30 and dat50, the user needs to use the standard R function data A quick view to the data tables might be performed by means of str function.

data(dat30)
str(dat30)
## 'data.frame':    174 obs. of  10 variables:
##  $ famid : int  1 1 1 1 1 1 2 2 2 2 ...
##  $ id    : int  11 12 13 14 15 16 21 22 23 24 ...
##  $ fa    : int  0 0 11 11 11 11 0 0 21 21 ...
##  $ mo    : int  0 0 12 12 12 12 0 0 22 22 ...
##  $ sex   : int  1 2 1 2 1 1 1 2 2 1 ...
##  $ affect: int  2 2 2 2 2 2 2 2 2 2 ...
##  $ class : logi  NA NA NA NA NA NA ...
##  $ trait1: num  11.96 7.1 10.32 9.76 9.46 ...
##  $ trait2: num  13.58 5.37 6.4 8.98 9.21 ...
##  $ age   : int  50 25 35 49 51 45 37 29 39 41 ...
str(genocovdat30)
##  num [1:174, 1:100] 1.978 0.795 0.231 0.139 0.487 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:174] "11" "12" "13" "14" ...
##   ..$ : chr [1:100] "snp_1" "snp_2" "snp_3" "snp_4" ...
str(mapdat30)
## 'data.frame':    100 obs. of  4 variables:
##  $ SNP : Factor w/ 100 levels "snp_1","snp_10",..: 1 12 23 34 45 47 48 49 50 2 ...
##  $ chr : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pos : num  2105324 2105467 2106094 2108138 2109262 ...
##  $ gene: Factor w/ 12 levels "gene1","gene2",..: 1 1 1 1 1 1 1 1 1 2 ...
data(dat50)
str(phenodata)
## 'data.frame':    66 obs. of  4 variables:
##  $ id   : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ sex  : int  0 1 0 0 0 1 0 1 1 0 ...
##  $ age  : int  80 77 56 44 75 79 75 82 77 76 ...
##  $ trait: num  -1.763 -1.423 -0.805 0.268 -1.334 ...
str(genodata)
##  chr [1:66, 1:50] "1/1" "1/1" "1/1" "1/2" "1/2" "1/1" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:66] "1" "2" "3" "4" ...
##   ..$ : chr [1:50] "s1" "s2" "s3" "s4" ...
str(genocovdata)
##  int [1:66, 1:50] 0 0 0 1 1 0 0 0 0 0 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:66] "1" "2" "3" "4" ...
##   ..$ : chr [1:50] "s1" "s2" "s3" "s4" ...

loadMulticPhen function in solarius package loads the complete version of the data from the R package multic (1200 individuals).

mdat <- loadMulticPhen()
str(mdat)
## 'data.frame':    1200 obs. of  10 variables:
##  $ ID    : chr  "1001" "1002" "1003" "1004" ...
##  $ affect: int  2 2 2 2 2 2 2 2 2 2 ...
##  $ class : logi  NA NA NA NA NA NA ...
##  $ trait1: num  9.61 11.5 11.05 12.37 7.97 ...
##  $ trait2: num  9.61 11.66 13.62 13.88 7.29 ...
##  $ age   : int  45 37 45 48 31 41 50 38 43 54 ...
##  $ FAMID : chr  "100" "100" "100" "100" ...
##  $ FA    : chr  "0" "0" "1001" "1001" ...
##  $ MO    : chr  "0" "0" "1002" "1002" ...
##  $ SEX   : chr  "1" "2" "1" "2" ...

loadExamplesPhen function returns the GAW10 data sets available in SOLAR.

gawdat <- loadExamplesPhen()
str(gawdat)
## 'data.frame':    1497 obs. of  11 variables:
##  $ ID : chr  "1" "10" "100" "1000" ...
##  $ AGE: int  56 37 80 51 50 39 36 68 69 57 ...
##  $ EF : num  NA 46.9 NA 29.5 24.7 ...
##  $ Q1 : num  NA 42.1 NA 25.9 18.4 ...
##  $ Q2 : num  NA 55.4 NA 32.4 29.6 ...
##  $ Q3 : num  NA 111.2 NA 65.8 39.3 ...
##  $ Q4 : num  NA 13.1 NA 11.5 10.1 ...
##  $ Q5 : num  NA 30.9 NA 27.7 24.6 ...
##  $ FA : chr  "0" "0" "0" "610" ...
##  $ MO : chr  "0" "0" "0" "612" ...
##  $ SEX: chr  "1" "2" "2" "1" ...

Both loadMulticPhen and loadExamplesPhen functions are based on a more general function readPhen, which reads the phenotype data from text files of a table format. The data from the R package multic can be also loaded by means of readPhen function.

dat.dir <- system.file("extdata", "solarOutput", package = "solarius")

mdat <- readPhen(phen.file = file.path(dat.dir, "simulated.phen"), sep.phen = ",", 
  ped.file = file.path(dat.dir, "simulated.ped"), sep.ped = ",")

4.2 Data exploration

4.2.1 Plot kinship matrix

The two times kinship coefficients describe the relatedness among individuals in the family-based sample. Two functions solarKinship2 and plotKinship2 allows to compute and plot the kinship matirx of a given data set. solarKinship2 function runs SOLAR for estimation of the kinship coefficients based on the ID fields in the data. The returned value of the function is two times kinship coefficients in a matrix form. plotKinship2 plots the two times kinship matrix by using the R package Matrix.

For example, the kinship matrix for dat50 data sets has the following image.

plotKinship2(2*kin)

One can see that both unrelated and related individuals appear in dat50 data set.

The relationships among individuals in dat30 data set is examined in the following two lines of code.

kin2.dat30 <- solarKinship2(dat30)
plotKinship2(kin2.dat30)

The structure of patterns in the kinship matrix can be explored more closely if the first 30 individuals are plotted.

plotKinship2(kin2.dat30[1:30, 1:30])

It seems that the simulated data include only sib-pair relationships. Such an observation can be visually confirmed by histKinship2 function.

histKinship2(kin2.dat30)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

One see that all the related individuals have the twice kinship coefficient equal to 0.5. The unique values of kin2.dat30 matrix can also be extracted by the following code.

unique(as.vector(kin2.dat30))
## [1] 1.0 0.0 0.5

4.2.2 Plot pedigrees

The family trees are plotted by plotPed, which reuses some routines from the R package kinship2. The family with index 2 from dat30 data set is plotted in the following code.

plotPed(dat30, 2)
## Loading required package: kinship2
## Loading required package: Matrix
## Loading required package: quadprog

4.2.3 Transform traits

To show an example of trait transformation, a trait with an anomalous distribution exp_trait1 is created in dat30 data set.

dat30 <- mutate(dat30,
  exp_trait1 = exp(trait1))

The list of available transforms in solarius is returned by availableTransforms function.

availableTransforms()
## [1] "none"    "inormal" "log"     "out"

The following code plots the distribution of the original values of exp_trait1 and its transformed values.

library(ggplot2)
## Find out what's changed in ggplot2 with
## news(Version == "1.0.1", package = "ggplot2")
## 
## Attaching package: 'ggplot2'
## 
## The following object is masked from 'package:solarius':
## 
##     annotate
library(gridExtra)

theme_set(theme_bw())

with(dat30, {
  grid.arrange(
    qplot(exp_trait1),
    qplot(transformTrait(exp_trait1, "out")),
    qplot(transformTrait(exp_trait1, "log")),
    qplot(transformTrait(exp_trait1, "inormal"))
  )
})
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

4.3 Polygenic model (univariate)

solarPolygenic function runs the polygenic analysis in SOLAR and returns back the results in R. The basic call looks like the standard call of lm function with the formula interface.

M1 <- solarPolygenic(trait1 ~ age + sex, dat30)
M1
## 
## Call: solarPolygenic(formula = trait1 ~ age + sex, data = dat30)
## 
## File polygenic.out:
##  Pedigree:    dat.ped 
##  Phenotypes:  dat.phe 
##  Trait:       trait1                Individuals:  174 
##  
##           H2r is 0.8061621  p = 6.1167535e-10  (Significant) 
##         H2r Std. Error:  0.1100465 
##  
##  
##  Proportion of Variance Due to All Final Covariates Is 
##                0.0330070 
##  
##  Loglikelihoods and chi's are in trait1/polygenic.logs.out 
##  Best model is named poly and null0 
##  Final models are named poly, spor, nocovar 
##  
##  Residual Kurtosis is -0.3603, within normal range

The print method for the returned object of class solarPolygenic shows the output produced by SOLAR. The tables of the results are printed by summary method.

summary(M1)
## 
## Call: solarPolygenic(formula = trait1 ~ age + sex, data = dat30)
## 
## File polygenic.out:
##  Pedigree:    dat.ped 
##  Phenotypes:  dat.phe 
##  Trait:       trait1                Individuals:  174 
##  
##           H2r is 0.8061621  p = 6.1167535e-10  (Significant) 
##         H2r Std. Error:  0.1100465 
##  
##  
##  Proportion of Variance Due to All Final Covariates Is 
##                0.0330070 
##  
##  Loglikelihoods and chi's are in trait1/polygenic.logs.out 
##  Best model is named poly and null0 
##  Final models are named poly, spor, nocovar 
##  
##  Residual Kurtosis is -0.3603, within normal range 
## 
##  Loglikelihood Table:
##       model    loglik
## 1  sporadic -229.3348
## 2 polygenic -210.8689
## 
##  Covariates Table:
##   covariate      Estimate         SE Chi pval
## 1      mean  7.7823346631 0.33112422  NA   NA
## 2       age  0.0004591117 0.01483783  NA   NA
## 3       sex -0.4559509166 0.31250447  NA   NA
## 
##  Variance Components Table:
##   varcomp       Var        SE         pval
## 1     h2r 0.8061621 0.1100465 6.116753e-10
## 2      e2 0.1938379 0.1100465           NA

4.3.1 Testing covariates

By default SOLAR does not perform the LRT tests for the covariates. covtest argument of solarPolygenic function tells to SOLAR to make this job.

M2 <- solarPolygenic(trait1 ~ age + sex, dat30, covtest = TRUE)
M2
## 
## Call: solarPolygenic(formula = trait1 ~ age + sex, data = dat30, covtest = TRUE)
## 
## File polygenic.out:
##  Pedigree:    dat.ped 
##  Phenotypes:  dat.phe 
##  Trait:       trait1                Individuals:  174 
##  
##           H2r is 0.8061621  p = 6.1167535e-10  (Significant) 
##         H2r Std. Error:  0.1100465 
##  
##                                       age  p = 0.9753082  (Not Sig., but fixed) 
##                                       sex  p = 0.1455341  (Not Sig., but fixed) 
##  
##  Proportion of Variance Due to All Final Covariates Is 
##                0.0330070 
##  
##  Loglikelihoods and chi's are in trait1/polygenic.logs.out 
##  Best model is named poly and null0 
##  Final models are named poly, spor, nocovar 
##  Initial sporadic and polygenic models are s0 and p0 
##  Constrained covariate models are named no<covariate name> 
##  
##  Residual Kurtosis is -0.3603, within normal range

The test statistics and p-values are stored in cf slot of the returned object M2.

M2$cf
##   covariate      Estimate         SE    Chi      pval
## 1      mean  7.7823346631 0.33112424     NA        NA
## 2       age  0.0004591117 0.01483782 0.0010 0.9753082
## 3       sex -0.4559509166 0.31250456 2.1184 0.1455341

4.3.2 Custom kinship matrix

The custom kinship matrix is passed by kinship argument. The following code shows such an example of dat50 data set, which phenotype data are stored in phenodata data.frame.

M3 <- solarPolygenic(trait ~ 1, phenodata, kinship = kin)
M3
## 
## Call: solarPolygenic(formula = trait ~ 1, data = phenodata, kinship = kin)
## 
## File polygenic.out:
##  Pedigree:    dat.ped 
##  Phenotypes:  dat.phe 
##  Trait:       trait                 Individuals:  66 
##  
##           H2r is 0.1000000  p = 0.5000000  (Not Significant) 
##  
##  
##  Loglikelihoods and chi's are in trait/polygenic.logs.out 
##  Best model is named poly and null0 
##  Final models are named poly, spor 
##  
##  Residual Kurtosis is -0.6742, within normal range

The polygenic and sporadic models have the same LRT statistics, that means SOLAR was not able to make the polygenic analysis in a proper way.

M3$lf
##       model    loglik
## 1  sporadic -27.77137
## 2 polygenic -27.77137

This bug seems the case of unrelated individuals or the individuals withount such ID fields as fa and mo.

4.4 Polygenic model (bivariate)

The bivariate polygenic analysis is run in the same way as the univariate one.

B1 <- solarPolygenic(trait1 + trait2 ~ 1, dat30)
B1
## 
## Call: solarPolygenic(formula = trait1 + trait2 ~ 1, data = dat30)
## 
## File polygenic.out:
##  Pedigree:    dat.ped 
##  Phenotypes:  dat.phe 
##  Trait:       trait1 trait2         Individuals:  174 
##  
##           H2r(trait1) is 0.8218823   
##         H2r(trait1) Std. Error:  0.1053258 
##  
##           H2r(trait2) is 0.6270026   
##         H2r(trait2) Std. Error:  0.1158107 
##  
##           RhoE is 0.4120487   
##         RhoE Std. Error:  0.1969379 
##  
##           RhoG is 0.9728759 
##         RhoG Std. Error:  0.0374442 
##  
##         Derived Estimate of RhoP is 0.8045957 
##  
##  
##  Loglikelihoods and chi's are in trait1.trait2/polygenic.logs.out 
##  Best model is named poly and null0 
##  Final models are named poly, spor

4.4.1 Testing correlations

The test of significance of environmental and genetic correlation coefficients is indicated by polygenic.options argument of solarPolygenic function.

B2 <- solarPolygenic(trait1 + trait2 ~ 1, dat30, polygenic.options = "-testrhoe -testrhog")
B2
## 
## Call: solarPolygenic(formula = trait1 + trait2 ~ 1, data = dat30, polygenic.options = "-testrhoe -testrhog")
## 
## File polygenic.out:
##  Pedigree:    dat.ped 
##  Phenotypes:  dat.phe 
##  Trait:       trait1 trait2         Individuals:  174 
##  
##           H2r(trait1) is 0.8218823   
##         H2r(trait1) Std. Error:  0.1053258 
##  
##           H2r(trait2) is 0.6270026   
##         H2r(trait2) Std. Error:  0.1158107 
##  
##           RhoE is 0.4120487  p = 0.1424023 
##         RhoE Std. Error:  0.1969379 
##  
##           RhoG is 0.9728759 
##         RhoG Std. Error:  0.0374442 
##  
##  
##         RhoG different from zero  p = 2.2463156e-09 
##         RhoG different from 1.0   p = 0.2098041 
##         Derived Estimate of RhoP is 0.8045957 
##  
##  
##  Loglikelihoods and chi's are in trait1.trait2/polygenic.logs.out 
##  Best model is named poly and null0 
##  Final models are named poly, spor

The results are stored in vcf slot of the returned object.

B2$vcf
##       varcomp  Estimate         SE    zscore
## 1 h2r(trait1) 0.8218823 0.10532575  7.803242
## 2  e2(trait1) 0.1781177 0.10532575  1.691112
## 3 h2r(trait2) 0.6270026 0.11581068  5.414031
## 4  e2(trait2) 0.3729974 0.11581068  3.220751
## 5        rhog 0.9728759 0.03744417 25.982039
## 6        rhoe 0.4120487 0.19693793  2.092277

4.4.2 Trait-specific covariates

It might happen that traits in the multivariate polygenic analysis have different covariates. Trait-specific covariates are passed to solarPolygenic function according the specification of SOLAR.

B3 <- solarPolygenic(trait1 + trait2 ~ sex + age(trait2), dat30)
B3
## 
## Call: solarPolygenic(formula = trait1 + trait2 ~ sex + age(trait2), 
##     data = dat30)
## 
## File polygenic.out:
##  Pedigree:    dat.ped 
##  Phenotypes:  dat.phe 
##  Trait:       trait1 trait2         Individuals:  174 
##  
##           H2r(trait1) is 0.7956974   
##         H2r(trait1) Std. Error:  0.1129518 
##  
##           H2r(trait2) is 0.6029008   
##         H2r(trait2) Std. Error:  0.1197632 
##  
##           RhoE is 0.4655421   
##         RhoE Std. Error:  0.1736882 
##  
##           RhoG is 0.9676250 
##         RhoG Std. Error:  0.0397582 
##  
##         Derived Estimate of RhoP is 0.8027999 
##  
##  
##  Loglikelihoods and chi's are in trait1.trait2/polygenic.logs.out 
##  Best model is named poly and null0 
##  Final models are named poly, spor

In the results of this analysis, sex covariate is common for both traits, while age covariate is applied only to trait2. The given model specification can be ckecked by looking at the model files from SOLAR.

tail(B3$solar$files$model$poly.mod, 3)
## [1] "# mu = \\{t1*(<Mean(trait1)>+<bsex(trait1)>*Female) + t2*(<Mean(trait2)>"
## [2] "# +<bsex(trait2)>*Female+<bage(trait2)>*(age-39.87931034))\\}"           
## [3] "loglike set -361.499294"

4.5 Association model

solarAssoc function runs the association analysis in SOLAR, in particular via mga command. The genotype data are passed to solarAssoc function in the form of SNP genotypes (snpdata argument) or covariates (snpcovdata or genocov.files arguments). The later case assumes that the user has converted the genotypes to covariates by some method (outside solarius package) before computing the association model.

The function returns an object of class solarAssoc. This class has additional methods such as annotate, plotQQ and plotManh for exploration of the association results. The main table of the results is snpf slot of the returned object, which class is data.table from the R package data.table.

The association analysis is shown with dat50 data set with 66 individuals and 50 SNP variants. The kinship matrix kin of this data set is also used.

4.5.1 SNP data by snpdata argument

The option of using snpdata is recommended in the case when the small number of SNPs is to be tested.

A1 <- solarAssoc(trait ~ 1, phenodata, snpdata = genodata, kinship = kin)
A1
## 
## Call: solarAssoc(formula = trait ~ 1, data = phenodata, kinship = kin, 
##     snpdata = genodata)
## 
##  Input SNP data:
##   *  50 SNP genotypes passed by `snpdata` argument
## 
##  Output results of association:
## 
##   *  Table of association results (first 5 out of 50 rows):
##    SNP NAv      chi     pSNP      bSNP   bSNPse   Varexp  est_maf
## 1:  s1  66 0.189994 0.662922  0.160860 0.369045 0.002875 0.053031
## 2: s10  66 0.050972 0.821380 -0.105065 0.465364 0.000772 0.015152
## 3: s11  66 0.050972 0.821380 -0.105065 0.465364 0.000772 0.015152
## 4: s12  66 0.909920 0.340136 -0.196598 0.206100 0.013692 0.204545
## 5: s13  66 0.182890 0.668901  0.118876 0.277971 0.002767 0.106061
##      est_mac dosage_sd
## 1:  7.000026  0.307915
## 2:  1.999998  0.244311
## 3:  1.999998  0.244311
## 4: 27.000006  0.549856
## 5: 13.999986  0.408810
## 
##  CPU time on 1 core(s): 00:00:03

summary method reports the number of significant SNPs based on Bonferroni correction.

summary(A1)
## 
## Call: solarAssoc(formula = trait ~ 1, data = phenodata, kinship = kin, 
##     snpdata = genodata)
## 
## Association model
##  * Number of SNPs: 50 
##  * Input format: snpdata 
##  * Number of significal SNPs: 0 (Bonferroni correction with alpha 0.05)

The significance threshold can be passed as alpha parameter of summary method.

summary(A1, alpha = 1)
## 
## Call: solarAssoc(formula = trait ~ 1, data = phenodata, kinship = kin, 
##     snpdata = genodata)
## 
## Association model
##  * Number of SNPs: 50 
##  * Input format: snpdata 
##  * Number of significal SNPs: 1 (Bonferroni correction with alpha 1)
##    SNP NAv      chi    pSNP      bSNP   bSNPse   Varexp  est_maf  est_mac
## 1: s43  66 6.504508 0.01076 -1.186139 0.465081 0.093852 0.030303 3.999996
##    dosage_sd
## 1:  0.238606

4.5.2 SNP data by snpcovdata argument

SNP covariates are passed to the association model by snpcovdata parameter.

A2 <- solarAssoc(trait ~ 1, phenodata, snpcovdata = genocovdata, kinship = kin)
A2
## 
## Call: solarAssoc(formula = trait ~ 1, data = phenodata, kinship = kin, 
##     snpcovdata = genocovdata)
## 
##  Input SNP data:
##   *  50 SNP covariates passed by `snpcovdata` argument
## 
##  Output results of association:
## 
##   *  Table of association results (first 5 out of 50 rows):
##    SNP NAv      chi     pSNP      bSNP   bSNPse   Varexp  est_maf
## 1:  s1  66 0.189994 0.662922  0.160860 0.369045 0.002875 0.053031
## 2: s10  66 0.050972 0.821380 -0.105065 0.465364 0.000772 0.015152
## 3: s11  66 0.050972 0.821380 -0.105065 0.465364 0.000772 0.015152
## 4: s12  66 0.909920 0.340136 -0.196598 0.206100 0.013692 0.204545
## 5: s13  66 0.182890 0.668901  0.118876 0.277971 0.002767 0.106061
##      est_mac dosage_sd
## 1:  7.000026  0.307915
## 2:  1.999998  0.244311
## 3:  1.999998  0.244311
## 4: 27.000006  0.549856
## 5: 13.999986  0.408810
## 
##  CPU time on 1 core(s): 00:00:02

The values of the SNP covariates can take any numerical values. That means dosage format of imputed SNP data is allowed.

The previous example is repeated for the SNP covariate data, where values of 2 are replaced by 1.9.

genocovdata2 <- genocovdata
genocovdata2[genocovdata2 == 2] <- 1.9

A2 <- solarAssoc(trait ~ 1, phenodata, snpcovdata = genocovdata2, kinship = kin)
A2
## 
## Call: solarAssoc(formula = trait ~ 1, data = phenodata, kinship = kin, 
##     snpcovdata = genocovdata2)
## 
##  Input SNP data:
##   *  50 SNP covariates passed by `snpcovdata` argument
## 
##  Output results of association:
## 
##   *  Table of association results (first 5 out of 50 rows):
##    SNP NAv      chi     pSNP      bSNP   bSNPse   Varexp  est_maf
## 1:  s1  66 0.189994 0.662922  0.160860 0.369045 0.002875 0.053031
## 2: s10  66 0.050972 0.821380 -0.110595 0.489857 0.000772 0.014394
## 3: s11  66 0.050972 0.821380 -0.110595 0.489857 0.000772 0.014394
## 4: s12  66 0.927276 0.335572 -0.201592 0.209348 0.013951 0.203031
## 5: s13  66 0.182890 0.668901  0.118876 0.277971 0.002767 0.106061
##      est_mac dosage_sd
## 1:  7.000026  0.307915
## 2:  1.900008  0.232095
## 3:  1.900008  0.232095
## 4: 26.800026  0.541289
## 5: 13.999986  0.408810
## 
##  CPU time on 1 core(s): 00:00:02

4.5.3 SNP data by genocov.files (single value)

genocov.files argument is supported to pass the SNP covariate data in files of the appropriate SOLAR format. This option suits for large-scale calculations, when the real genome-wide scan is run. Both arguments genocov.files and snplists.files are required, and snpmap.files argument is optional.

dir <- package.file("extdata", "solarAssoc", package = "solarius")
genocov.files <- file.path(dir, "snp.genocov")
snplists.files <- file.path(dir, c("snp.geno-list1", "snp.geno-list2"))

A3 <- solarAssoc(trait ~ 1, phenodata, 
  genocov.files = genocov.files, snplists.files = snplists.files)
A3
## 
## Call: solarAssoc(formula = trait ~ 1, data = phenodata, genocov.files = genocov.files, 
##     snplists.files = snplists.files)
## 
##  Input SNP data:
##   *  SNP covariates passed in 1 file(s) by `genocov.files` argument
## 
##  Output results of association:
## 
##   *  Table of association results (first 5 out of 25 rows):
##    SNP NAv      chi     pSNP      bSNP   bSNPse   Varexp  est_maf
## 1:  s1  66 0.189994 0.662922  0.160860 0.369045 0.002875 0.053031
## 2: s10  66 0.050972 0.821380 -0.105065 0.465364 0.000772 0.015152
## 3: s11  66 0.050972 0.821380 -0.105065 0.465364 0.000772 0.015152
## 4: s12  66 0.909920 0.340136 -0.196598 0.206100 0.013692 0.204545
## 5: s13  66 0.182890 0.668901  0.118876 0.277971 0.002767 0.106061
##      est_mac dosage_sd
## 1:  7.000026  0.307915
## 2:  1.999998  0.244311
## 3:  1.999998  0.244311
## 4: 27.000006  0.549856
## 5: 13.999986  0.408810
## 
##  CPU time on 1 core(s): 00:00:01

4.5.4 SNP data by genocov.files (many values)

Many values can be passed to genocov.files parameter, but this case is thought for parallel calculations.

dir <- package.file("extdata", "solarAssoc", package = "solarius")
genocov.files <- file.path(dir, c("snp.genocov1", "snp.genocov2"))
snplists.files <- file.path(dir, c("snp.geno-list1", "snp.geno-list2"))

A4 <- solarAssoc(trait ~ 1, phenodata, 
  genocov.files = genocov.files, snplists.files = snplists.files)
A4
## 
## Call: solarAssoc(formula = trait ~ 1, data = phenodata, genocov.files = genocov.files, 
##     snplists.files = snplists.files)
## 
##  Input SNP data:
##   *  SNP covariates passed in 2 file(s) by `genocov.files` argument and 2 files(s) by `snplists.files` argument
## 
##  Output results of association:
## 
##   *  Table of association results (first 5 out of 25 rows):
##    SNP NAv      chi     pSNP      bSNP   bSNPse   Varexp  est_maf
## 1:  s1  66 0.189994 0.662922  0.160860 0.369045 0.002875 0.053031
## 2: s10  66 0.050972 0.821380 -0.105065 0.465364 0.000772 0.015152
## 3: s11  66 0.050972 0.821380 -0.105065 0.465364 0.000772 0.015152
## 4: s12  66 0.909920 0.340136 -0.196598 0.206100 0.013692 0.204545
## 5: s13  66 0.182890 0.668901  0.118876 0.277971 0.002767 0.106061
##      est_mac dosage_sd
## 1:  7.000026  0.307915
## 2:  1.999998  0.244311
## 3:  1.999998  0.244311
## 4: 27.000006  0.549856
## 5: 13.999986  0.408810
## 
##  CPU time on 1 core(s), 2 batches: 00:00:01

4.5.5 Exploration of association results

QQ and Manhattan plots help to explore the association results visually.

A5 <- solarAssoc(trait ~ 1, phenodata, snpdata = genodata, snpmap = snpdata, kinship = kin)
plot(A5)

plot(A5, "qq")

4.5.6 Parallel computation

Parallelization of the association analysis is an easy task with solarius. The user only needs to specify how many cores are to be used for the computation.

A6 <- solarAssoc(trait ~ 1, phenodata, snpdata = genodata, kinship = kin, cores = 2)

The gain in CPU time is not obvious, as the number of SNPs (50) is too small.

A1$assoc$tprofile$cputime.sec
## [1] 3.268
A6$assoc$tprofile$cputime.sec
## [1] 2.458

dat30 data set contains 100 SNPs (genocovdat30 matrix). The same test on speed up by parallel computing looks as following.

A7 <- solarAssoc(trait1 ~ 1, dat30, 
  snpcovdata = genocovdat30, snpmap = mapdat30)
A8 <- solarAssoc(trait1 ~ 1, dat30, 
  snpcovdata = genocovdat30, snpmap = mapdat30,
  cores = 2)
A7$assoc$tprofile$cputime.sec
## [1] 5.655
A8$assoc$tprofile$cputime.sec
## [1] 3.884

4.6 Linkage model

solarMultipoint function performs the linkage analysis by calling SOLAR command multipoint. The function looks for multipoint IBD matrices in the directory pointed out by mibddir argument. The analysis can be customized by combination of options via multipoint.options and multipoint.settings arguments.

Phenotypes in dat30 data set and IBD matrices are placed in inst/extdata/solarOutput directory of the package.

4.6.1 Basic linkage model

The basic model requires only the path to a directory with IBD matrices (mibddir argument). By default, all SOLAR specific files, which represent IBD matrices, will be used in the analysis.

mibddir <- package.file("extdata", "solarOutput", "solarMibdsCsv", package = "solarius")  
list.files(mibddir)
##  [1] "mibd.2.0.gz" "mibd.2.1.gz" "mibd.2.2.gz" "mibd.2.3.gz" "mibd.2.4.gz"
##  [6] "mibd.2.5.gz" "mibd.5.0.gz" "mibd.5.1.gz" "mibd.5.2.gz" "mibd.5.3.gz"
## [11] "mibd.5.4.gz" "mibd.5.5.gz"
L1 <- solarMultipoint(trait1 ~ 1, dat30, mibddir = mibddir)
L1
## 
## Call: solarMultipoint(formula = trait1 ~ 1, data = dat30, mibddir = mibddir)
## 
##  Input IBD data:
##   *  directory /home/andrey/git/ugcd/solarius/inst/extdata/solarOutput/solarMibdsCsv
## 
##  Output results of association:
## 
##   * Table of association results (first 5 out of 12 rows):
##   chr pos    LOD  Loglike      H2r     H2q1
## 1   2   0 1.4888 -208.503 0.448464 0.403780
## 2   2   1 2.4232 -206.352 0.244560 0.614310
## 3   2   2 3.2759 -204.388 0.123184 0.742126
## 4   2   3 3.5569 -203.741 0.097187 0.763608
## 5   2   4 3.1610 -204.653 0.179701 0.668166
## 
##  CPU time on 1 core(s): 00:00:03

Methods print, summary and plot help to explore the results of analysis. The data table of results is stored in lodf slot of the returned object.

summary(L1)
## 
## Call: solarMultipoint(formula = trait1 ~ 1, data = dat30, mibddir = mibddir)
## 
## Multipoint model
##  * Number of used markers: 12 
##  * Number of passes: 1 
##  * Maximum LOD score: 3.56 
##   -- chr: 2 
##   -- position: 3 cM
plot(L1)

The results for chromosomes 2 and 5 are identical, since the IBD matrices for chromosome 2 are duplicates of the matrices for chromosome 5.

4.6.2 Linkage options

The following code shows an example of a custom linkage analysis. In particular, the linkage is run on chromosome 5 with a 2 cM interval, Fine mapping is disabled and the second pass is enables via multipoint.options argument.

L2 <- solarMultipoint(trait1 ~ 1, dat30, mibddir = mibddir, chr = 5, interval = 2, 
  multipoint.settings = "finemap off", multipoint.options = "3")
summary(L2)
## 
## Call: solarMultipoint(formula = trait1 ~ 1, data = dat30, mibddir = mibddir, 
##     chr = 5, interval = 2, multipoint.options = "3", multipoint.settings = "finemap off")
## 
## Multipoint model
##  * Number of used markers: 3 
##  * Number of passes: 2 
##  * Maximum LOD score: 3.28 
##   -- chr: 5 
##   -- position: 2 cM

The LOD scores for the second path (stored in lodf2) are plotted with the code given below.

plot(L2, pass = 2)

4.6.3 Bivariate linkage

The bivariate linkage analysis is performed in the same way as the bivariate polygenic analysis.

L3 <- solarMultipoint(trait1 + trait2 ~ 1, dat30, mibddir = mibddir, chr = 5, 
  interval = 2, multipoint.settings = "finemap off")
summary(L3)
## 
## Call: solarMultipoint(formula = trait1 + trait2 ~ 1, data = dat30, 
##     mibddir = mibddir, chr = 5, interval = 2, multipoint.settings = "finemap off")
## 
## Multipoint model
##  * Number of used markers: 3 
##  * Number of passes: 1 
##  * Maximum LOD score: 2.47 
##   -- chr: 5 
##   -- position: 2 cM
L4 <- solarMultipoint(trait1 + trait2 ~ age + sex(trait1), dat30, 
  mibddir = mibddir, chr = 5, interval = 2, multipoint.settings = "finemap off")
summary(L4)
## 
## Call: solarMultipoint(formula = trait1 + trait2 ~ age + sex(trait1), 
##     data = dat30, mibddir = mibddir, chr = 5, interval = 2, multipoint.settings = "finemap off")
## 
## Multipoint model
##  * Number of used markers: 3 
##  * Number of passes: 1 
##  * Maximum LOD score: 2.68 
##   -- chr: 5 
##   -- position: 2 cM

4.6.4 Parallel computation

Parallelization of the linkage analysis is again parametrized with cores argument, but the chromosome values are expected to be a vector of values. That means that each core does the job on a single chromosome.

L5 <- solarMultipoint(trait1 ~ 1, dat30, mibddir = mibddir,
  chr = c(2, 5), cores = 2)
L1$multipoint$tprofile$cputime.sec
## [1] 3.229
L5$multipoint$tprofile$cputime.sec
## [1] 2.153

5 Miscellaneous material

5.1 SOLAR references

A list of references given on the official web page of SOLAR http://solar.txbiomedgenetics.org/:

  • The main reference for SOLAR, including theoretical explanations of the variance component linkage method and the approximate multipoint IBD calculations in pedigrees, is (L. Almasy and Blangero 1998);
  • Bivariate Quantitative trait linkage is described in the following papers: (L. Almasy, Dyer, and Blangero 1997), (Williams et al. 1999);
  • The Liability Threshold model for discrete traits is described in the preceding paper as well as the following one: (Duggirala et al. 1997);
  • Gene By Environment Interaction is discussed in: (Towne, Siervogel, and Blangero 1997), (Blangero 2009);
  • An examination of LOD adjustment is given in: (J. Blangero, Wiliiams, and Almasy 2000), (J. Blangero, Williams, and Almasy 2001);
  • Additional references include: (Blangero and Almasy 1997), (Williams and Blangero 2004), (Williams and Blangero 1999).

6 R session info

sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.3 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] solarius_0.3.1   plyr_1.8.3       gait_0.1         data.table_1.9.6
## [5] rmarkdown_0.8    knitr_1.11       devtools_1.9.1  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.2      compiler_3.2.2   formatR_1.2.1    bitops_1.0-6    
##  [5] iterators_1.0.7  tools_3.2.2      digest_0.6.8     jsonlite_0.9.19 
##  [9] evaluate_0.8     memoise_0.2.1    gtable_0.1.2     foreach_1.4.2   
## [13] curl_0.9.4       yaml_2.1.13      parallel_3.2.2   qqman_0.1.2     
## [17] proto_0.3-10     roxygen2_5.0.1   stringr_1.0.0    httr_1.0.0      
## [21] grid_3.2.2       R6_2.1.1         XML_3.98-1.3     ggplot2_1.0.1   
## [25] reshape2_1.4.1   magrittr_1.5     scales_0.3.0     codetools_0.2-14
## [29] htmltools_0.2.6  rsnps_0.1.6      MASS_7.3-44      colorspace_1.2-6
## [33] labeling_0.3     stringi_1.0-1    RCurl_1.95-4.7   doParallel_1.0.8
## [37] munsell_0.4.2    chron_2.3-47

7 License

This document is licensed under the Creative Commons Attribution 4.0 International Public License.

Creative Commons License

References

Almasy, L, and J Blangero. 1998. “Multipoint quantitative-trait linkage analysis in general pedigrees.” American Journal of Human Genetics 62 (5): 1198–1211. doi:10.1086/301844.

Almasy, L, T D Dyer, and J Blangero. 1997. “Bivariate quantitative trait linkage analysis: pleiotropy versus co-incident linkages.” Genetic Epidemiology 14 (6): 953–8. doi:10.1002/(SICI)1098-2272(1997)14:6<953::AID-GEPI65>3.0.CO;2-K.

Andrade, Mariza De, Elizabeth J Atkinson, Christopher I Amos, and Jianfang Chen. 2006. Estimating Genetic Components of Variance for Quantitative Traits in Family Studies using the MULTIC routines.

Blangero, J. 2009. “Statistical genetic approaches to human adaptability.” Human Biology 81 (5-6): 523–46. doi:10.3378/027.081.0603.

Blangero, J, and L Almasy. 1997. “Multipoint oligogenic linkage analysis of quantitative traits.” Genetic Epidemiology 14 (6): 959–64. doi:10.1002/(SICI)1098-2272(1997)14:6<959::AID-GEPI66>3.0.CO;2-K.

Blangero, John, Jeff T Wiliiams, and Laura Almasy. 2000. “Robust LOD Scores for Variance Linkage Analysis.” Genetic Epidemiology 19 (Suppl.

Blangero, John, Jeff T Williams, and Laura Almasy. 2001. “Variance component methods for detecting complex trait loci.” Advances in Genetics 42: 151–81.

Duggirala, R, J T Williams, S Williams-Blangero, and J Blangero. 1997. “A variance component approach to dichotomous trait linkage analysis using a threshold model.” Genetic Epidemiology 14 (6): 987–92. doi:10.1002/(SICI)1098-2272(1997)14:6<987::AID-GEPI71>3.0.CO;2-G.

Lynch, Michael, and Bruce Walsh. 1998. Genetics and Analysis of Quantitative Traits. Vol. 1. Sinauer Sunderland.

SOLAR Documentation. “SOLAR version 6.6.2.” http://helix.nih.gov/Documentation/solar-6.6.2-doc/index.html.

Towne, B, R M Siervogel, and J Blangero. 1997. “Effects of genotype-by-sex interaction on quantitative trait linkage analysis.” Genetic Epidemiology 14 (6): 1053–8. doi:10.1002/(SICI)1098-2272(1997)14:6<1053::AID-GEPI82>3.0.CO;2-G.

Williams, J T, and J Blangero. 2004. “Power of variance component linkage analysis-II. Discrete traits.” Annals of Human Genetics 68 (Pt 6): 620–32. doi:10.1046/j.1529-8817.2004.00128.x.

Williams, J T, P Van Eerdewegh, L Almasy, and J Blangero. 1999. “Joint multipoint linkage analysis of multivariate qualitative and quantitative traits. I. Likelihood formulation and simulation results.” American Journal of Human Genetics 65 (4): 1134–47. doi:10.1086/302570.

Williams, Jeff T, and John Blangero. 1999. “Comparison of Variance Components and Sibpair-Based Approaches to Quantitative Trait Linkage Analysis in Unselected Samples” 134 (February 1998): 113–34.