`solarius`

R packageThis vignette is a tutorial on the R package `solarius`

. The document contains a brief description of the main statistical models (polygenic, association and linkage) implemented in `SOLAR`

and accessible via `solarius`

, installation instructions for both `SOLAR`

and `solarius`

, reproducible examples on synthetic data sets available within the `solarius`

package. The examples are intended to cover all the necessary steps in the quantitative genetic analysis with `SOLAR`

: data loading, data exploration, running statistical models and interpreting the obtained results by printing, summarizing and plotting the models in R.

In human genetics research, the study of a complex disorder is targeted to discover its underlying genetic architecture. Whereas simple Mendelian diseases are results of a mutation in a single gene, for example Huntingtonâ€™s disease, complex disorders or phenotypes are the product of the action of multiple genes and environmental factors as well as their interactions. For instance, obesity is a complex disease which genetic basis might be better dissected via measurable quantitative variation in phenotypes, such as body-mass index, closely related to the disease risk.

One of the study designs to identify the quantitative trait loci (QTLs) that affect complex phenotypes is based on examination of related individuals grouped by families (pedigrees). Pedigree-based designs allow for both QTL linkage and association mappings. In addition, QTL analysis with pedigrees possess more power to detect rare variants, permits explicit control for population stratification in association studies, and suit for examination of parent-of-origin and sex-specificity effects.

The analysis of the pedigree data is traditionally based on variance component (VC) models, also known as linear mixed models. `SOLAR`

is one of the well-known and widely used tools that implements the variance component models and other routines needed for the QTL analysis in extended pedigrees (L. Almasy and Blangero 1998) (http://solar.txbiomedgenetics.org/). The name `SOLAR`

stands for Sequential Oligogenic Linkage Analysis Routines. The analyses available in `SOLAR`

include customized quantitative genetic analysis, advanced linkage analysis, SNP association analysis (QTN, QTLD, and MGA). Other operations are implemented for calculation of marker-specific or multipoint identity-by-descent (IBD) matrices in pedigrees of arbitrary size and complexity. The linkage analysis allows for multiple quantitative traits and/or discrete traits which might involve multiple loci (oligogenic analysis), dominance effects, household effects, and interactions.

There are other tools similar to `SOLAR`

, such as `Mendel`

, `MMM`

, `GEMMA`

and `FAST-LMM`

, but the R package `solarius`

presented here has been especially developed to work with `SOLAR`

.

`solarius`

provides an interface from the program `SOLAR`

to the environment for statistical computing `R`

. On one side, main `SOLAR`

procedures can be called in `R`

by one or a few commands from the `solarius`

package, that makes the analysis more user-friendly with the same power of computations implemented in `SOLAR`

. On the other side, the `solarius`

user benefits from the infrastructure available with the R environment and its packages, such as interactive data manipulation, data visualization and parallel computation.

Consider that a particular trait, such as body-mass index, is observed in \(n\) individuals grouped by families or clusters of related individuals. A \(n \times 1\) vector \(y\) contains the phenotypic values measured for the trait. One assumes that these observations are described adequately by a linear model with a \(p \times 1\) vector of fixed effects (\(\beta\)) and a \(q \times 1\) vector of random effects (\(u\)) (Lynch and Walsh 1998, Chapter 26, pp.Â 746). In matrix form,

\[ y = X \beta + Z u + e \]

where \(X\) and \(Z\) are respectively \(n \times p\) and \(n \times q\) incidence matrices, and \(e\) is a \(n \times 1\) vector of residual deviations assumed to be distributed independently of the random effects. Denote the \(n \times n\) covariance matrix for the vector \(e\) by \(R\) and the the \(q \times q\) covariance matrix for the vector \(u\) by \(G\).

The first element of the vector \(\beta\) is typically the population mean, and other elements may be age, gender, diet and so on. The elements of the incidence matrices are usually equal to 0 or 1, that indicate the effect contribution to the individualsâ€™s phenotype. The model is referred to as a *mixed model*, because this model jointly accounts for fixed and random effects.

Giving the two terms of the random effects \(u\) and \(e\), the first term typically accounts for the contribution from random genetic effects such as additive genetic values. The second term accounts for the residual variance, and one usually assumes that residual errors have constant variance and are uncorrelated, so that \(R\) is a diagonal matrix, with \(R = \sigma_e^2 I\).

Two more equations for the means and variances of the component vectors of the mixed model can be derived on the basis of its definition. Since \(E(u) = E(e) = 0\) by definition,

\[ E(y) = X \beta \]

Excluding the difference among individuals due to fixed effects and calling the assumption that \(u\) and \(e\) are uncorrelated, the covariance matrix for the vector of observations \(y\) is

\[ Var(y) = V = Z G Z^{T} + R \]

Now the mixed model can be rewritten in a compact form

\[ y = X \beta + Z u + e\]

\[ \mbox{ where } \mbox{ } \mbox{ } u \sim (0, G) \mbox{ and } e \sim (0, R) \\ \mbox{ implying } \mbox{ } \mbox{ } y \sim (X \beta, V) = (X \beta, Z G Z^T + R) \]

For the mixed model, \(y\), \(X\) and \(Z\) are observed variables, while \(\beta\), \(u\), \(R\) and \(G\) are generally unknown variables. Thus, the analysis consists of two complementary estimation procedures: (1) estimation of the covariance matrices \(G\) and \(R\), and (2) estimation of the vectors of fixed and random effects, \(\beta\) and \(u\). The covariance matrices are usually functions of a few unknown parameters or variance components. For instance, the covariance matrix of the residuals errors typically has the form \(R = \sigma_e^2 I\), where \(\sigma_e^2\) is the parameter to be estimated.

The description of the two estimation procedures is out of scope of this document, and the reader is referred to (Lynch and Walsh 1998, Chapter 26-27), where the necessary theoretical material and accompanying examples are given. On the other hand, the implementation of the estimation procedures in `SOLAR`

is discussed elsewhere (L. Almasy and Blangero 1998).

The next three sections will present `SOLAR`

models, which follow the general definition of linear mixed model given above, but parametrization of the models and estimation procedures are implemented in a `SOLAR`

-specific way.

`SOLAR`

The polygenic model is a basis for any quantitative genetic analysis in `SOLAR`

. The model includes fixed effects (\(\beta\)) and three random effects (the genetic additive effect \(g\), the house-hold effect \(c\), and the residual effect \(e\)).

\[y = X\beta + g + c + e\]

\[\mbox{ where } \mbox{ } \mbox{ } g \sim (0, \sigma_g^2 A) \mbox{ and } \mbox{ } \mbox{ } c \sim (0, \sigma_c^2 H) \mbox{ and } \mbox{ } \mbox{ } e \sim (0, \sigma_e^2 I)\]

\[ \mbox{ implying } \mbox{ } \mbox{ } y \sim (X\beta, V) = (X\beta, \sigma_g^2 A + \sigma_c^2 H + \sigma_e^2 I)\]

The first \(g\) term accounts for the additive genetic effect with the covariance matrix \(A = 2 K\), where \(K\) is the kinship matrix. `SOLAR`

estimates the *empirical* kinship matrix based on the identifier (ID) fields in the pedigree data, such as individual ID `id`

, gender `sex`

, family ID `fam`

, father ID `fa`

, mother ID `mo`

, and indicator of identical individuals, e.g.Â monozygotic twins, `mztwin`

(SOLAR Documentation, Appendix 1, file-pedigree command).

The second \(c\) term represents an annotated environmental effect. It is named as a house-hold effect, but it is generally related to the sharing of any non-genetic factor among individuals. The house-hold effect is included in the analysis if the pedigree data contain a house-hold ID field (the expected field name is `hhid`

). `SOLAR`

automatically creates `Pedigree-Household Groups`

based on the `hhid`

identifiers. That means that pedigrees and household groups are merged into larger groups, which contain every individual in the same pedigree as well as every individual that has the same `hhid`

as any individual in the pedigree (SOLAR Documentation, Section 9.3 Household Group Analysis). As a consequence of this grouping operation by `SOLAR`

, the covariance matrix \(H\) of the \(c\) random effect contains the information related to the pedigree-household groups rather than relationships given by the `hhid`

field.

The third \(e\) term is the residual error with the diagonal covariance matrix. It is important to note that the `residual`

command in `SOLAR`

(SOLAR Documentation, Appendix 1, residual command) returns the covariate-free value of a trait under study, expressed as \(y - X \beta\), and the returned value is not the residual vector \(e\) given in the equation of the polygenic model.

The polygenic term \(g\) represents the additive genetic effect by default. The incorporation of the dominance effect into `SOLAR`

models is also possible, but it is attributed to advanced modeling topics (SOLAR Documentation, Section 9.4 Dominance Analysis). The `SOLAR`

â€™s developers suggest that the dominance variance component is unlikely useful in quantitative trait mapping, particularly in human genetic analysis, as only bi-lineal relatives contribute to this component.

`SOLAR`

is able to work with discrete traits, which possible values are coded as two consecutive integers, for example, `0`

and `1`

(SOLAR Documentation, Appendix 1, discrete-notes). `SOLAR`

uses a *liability threshold model* to handle discrete traits (Duggirala et al. 1997), and more details about implementation and practical notes are available in (SOLAR Documentation, Section 9.1 Discrete Traits). `SOLAR`

authors recommend to work with quantitative traits whenever it is possible. The user is able to disable the transformation of discrete traits, if it is necessary.

The likelihood-based statistics, in particular Likelihood Ratio Tests (LRTs), is used for the inference of the linear mixed models implemented in `SOLAR`

. The following table shows the alternative models used to test the significance of the random effects \(g\) and \(c\).

Alternative model | Free parameters | Restricted parameters |
---|---|---|

No polygenic effect | \(\beta\), \(\sigma_c^2\), \(\sigma_e^2\) | \(\sigma_g^2\) = 0 |

No house-hold effect | \(\beta\), \(\sigma_g^2\), \(\sigma_e^2\) | \(\sigma_c^2\) = 0 |

Under the assumption of multivariate normality, the LRT statistic is asymptotically distributed as a \(\frac{1}{2}:\frac{1}{2}\) mixture of of a \(\chi_1^2\) variable and a point mass at the origin (J. Blangero, Wiliiams, and Almasy 2000). This statement is applied to any LRT performed for parameters like \(\sigma_g^2\), which is greater or equal to `0`

by definition.

`SOLAR`

takes the decision of rejecting the house-hold effect by considering two factors, the test results and the value of \(\sigma_c^2\) parameter. If value of \(\sigma_c^2\) is non-zero, the house-hold effect is retained in the polygenic model, nevertheless the effect is not significant. The user has an option to force the rejection of the house-hold effect if it is necessary.

The tests applied to fixed effects (covariates) are also performed by means of LRTs. In the table given below the alternative model for the test of significance of `age`

covariate is shown.

Alternative model | Free parameters | Restricted parameters |
---|---|---|

No age effect | \(\sigma_g^2\), \(\sigma_c^2\), \(\sigma_e^2\) | \(\beta_{age}\) = 0 |

`SOLAR`

In principle, one can estimate the multivariate polygenic model by simply performing the univariate analysis in an appropriate way. Consider two traits \(y_1\) and \(y_2\) measured in each of \(n\) individuals, assuming that each trait follows the same univariate polygenic analysis. Then the \(2 \times n\) dimensional column vector of observations is denoted by the stack of univariate vectors, and the polygenic model has the matrix form:

\[ \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}_{2n \times 1} = \begin{pmatrix} X_1 & 0 \\ 0 & X_2 \end{pmatrix}_{2n \times nk} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}_{2n \times 1} + \begin{pmatrix} g_1 \\ g_2 \end{pmatrix}_{2n \times 1} + \begin{pmatrix} e_1 \\ e_2 \end{pmatrix}_{2n \times 1} \]

\[ G_{2n \times 2n} = C_{2 \times 2} \otimes A_{n \times n} = \begin{pmatrix} c_{11} A & c_{12} A \\ c_{21} A & c_{21} A \end{pmatrix} = \begin{pmatrix} \sigma_{1}^2 A & \rho \sigma_{1} \sigma_{2} A \\ \rho \sigma_{1} \sigma_{2} A & \sigma_{2}^2 A \end{pmatrix}_{g} \]

\[ R_{2n \times 2n} = E_{2 \times 2} \otimes I_{n \times n} = \begin{pmatrix} e_{11} I & e_{12} I \\ e_{21} I & e_{21} I \end{pmatrix} = \begin{pmatrix} \sigma_{1}^2 I & \rho \sigma_{1} \sigma_{2} I \\ \rho \sigma_{1} \sigma_{2} I & \sigma_{2}^2 I \end{pmatrix}_{e} \]

The model can be generalized to \(k\) traits, and other random effects such as house-hold can be added in the similar way. The *Kronecker product* notation \(\otimes\) is convinient for denoting the covariance matrices. More details on the multivariate analysis are given in (Lynch and Walsh 1998, Chapter 26, Multiple Traits, pp.Â 774).

The multivariate analysis implemented in `SOLAR`

allows for any number of traits, but the tests for significance of the correlation coefficients are performed only for the bivariate models. The following table describes the main LRTs available in `SOLAR`

for the bivariate analysis.

Alternative model | Free parameters | Restricted parameters |
---|---|---|

No genetic correlation | \((\sigma_1^2, \sigma_2^2)_g\), \((\sigma_1^2, \sigma_2^2, \rho)_e\) | \(\rho_g = 0\) |

Complete pleiotropy | \((\sigma_1^2, \sigma_2^2)_g\), \((\sigma_1^2, \sigma_2^2, \rho)_e\) | \(\rho_g = 1\) |

No environmental correlation | \((\sigma_1^2, \sigma_2^2, \rho)_g\), \((\sigma_1^2, \sigma_2^2)_e\) | \(\rho_e = 0\) |

It is important to note that the transformation of discrete traits in the multivariate analysis is automatically performed by `SOLAR`

.

`SOLAR`

`SOLAR`

introduces sequential oligogenic linkage analysis, as stated in its name abbreviation. Consider the linkage model with a single QTL, which is described by another random effect \(q\).

\[y = X\beta + g + c + q + e\]

\[ \mbox{ where } \mbox{ } \mbox{ } g \sim (0, \sigma_g^2 A) \mbox{ and } \mbox{ } \mbox{ } c \sim (0, \sigma_c^2 H) \mbox{ and } \mbox{ } \mbox{ } q \sim (0, \sigma_q^2 M) \mbox{ and } \mbox{ } \mbox{ } e \sim (0, \sigma_e^2 I) \]

\[ \mbox{ implying } \mbox{ } \mbox{ } y \sim (X\beta, V) = (X\beta, \sigma_g^2 A + \sigma_c^2 H + \sigma_q^2 M + \sigma_e^2)\]

The QTL under study \(q\) corresponds to a specific position in the genome and an identity-by-descent (IBD) matrix. The genome-wide linkage analysis assumes that the highly informative markers, such as micro-satellites, are available in the study, in order to estimate the two-point or multipoint IBD matrices by `SOLAR`

or other specific program like `MERLIN`

. Once the IBD matrices are calculated, the genome-wide scan is conducted by sequentially running the linkage model for each genome position.

The alternative model for testing the significance of a single QTL effect is the same as the polygenic model.

Alternative model | Free parameters | Restricted parameters |
---|---|---|

No linkage | \(\beta, \sigma_g^2, \sigma_c^2\), \(\sigma_e^2\) | \(\sigma_q^2\) = 0 |

Traditionally, the logarithm (base 10) of the odds (LOD) values are used to score the QTLs. If the LRT statistics for the linkage model is denoted as \(\Lambda\), then the correspondent LOD score is equal to \(\Lambda / (2 ln(10))\).

`SOLAR`

also allows for multi-pass linkage scan, that means that any further scan can be conditioned to QTLs found in the previous scans. For example, the second-pass scan conditioned to the previously found QTL \(q_1\) has the form:

\[y = X\beta + g + c + q_1 + q + e\]

The second-pass LRT test now looks like in the table below.

Alternative model | Free parameters | Restricted parameters |
---|---|---|

No linkage | \(\beta, \sigma_g^2, \sigma_c^2\), \(\sigma_{q_1}^2\), \(\sigma_e^2\) | \(\sigma_q^2\) = 0 |

The linkage model is extended to deal with many traits or discrete traits in the same way as for the polygenic model.

Additional information related to the linkage modeling in `SOLAR`

can be found in related publications (L. Almasy and Blangero 1998), (L. Almasy, Dyer, and Blangero 1997), (Williams et al. 1999), the `SOLAR`

tutorial (SOLAR Documentation, Chapter 3 Tutorial), and the `SOLAR`

manual, for example, (SOLAR Documentation, Apendix 1, multipoint command).

`SOLAR`

A basic association model in `SOLAR`

is extended from the polygenic model by simply adding a new covariate (fixed effect) `snp`

.

\[ y = X\beta + \beta_{snp} * snp + g + c + e\]

\[\mbox{ where } \mbox{ } \mbox{ } g \sim (0, \sigma_g^2 A) \mbox{ and } \mbox{ } \mbox{ } c \sim (0, \sigma_c^2 H) \mbox{ and } \mbox{ } \mbox{ } e \sim (0, \sigma_e^2 I)\]

\[ \mbox{ implying } \mbox{ } \mbox{ } y \sim (X\beta, V) = (X\beta, \sigma_g^2 A + \sigma_c^2 H + \sigma_e^2 I)\]

In the genome-wide association scan `SOLAR`

passes through SNPs variants under the study one by one according to the equation given above. The p-value of association between trait \(y\) and SNP \(snp\) is again based on LRT test with the alternative model outlined in the table below. This alternative model is the same as the polygenic model.

Alternative model | Free parameters | Restricted parameters |
---|---|---|

No association | \(\beta, \sigma_g^2, \sigma_c^2\), \(\sigma_e^2\) | \(\beta_{snp} = 0\) |

More details on implementation of the association analysis in `SOLAR`

can be found in the `SOLAR`

manual, for example, (SOLAR Documentation, Apendix 1, mga and snp commands).

The idea of the proposed software (the R package `solarius`

) is to perform the calculus in `SOLAR`

at the back-end and to offer the user interface in R at the front-end. The table given below shows a basic description of the software via implementation of the main three models: polygenic, association and linkage.

Model | `SOLAR` command |
`solarius` function |
`solarius` S3 classes of returned object |
Parallel computation |
---|---|---|---|---|

Polygenic | `polygenic` |
`solarPolygenic` |
`solarPolygenic` |
None |

Association | `mga` |
`solarAssoc` |
`solarAssoc` , `solarPolygenic` |
Automatic or custom (by SNP file) |

Linkage | `multipoint` |
`solarMultipoint` |
`solarMultipoint` , `solarPolygenic` |
Custom (by chromosome) |

One can see how the software benefits from both `SOLAR`

and R sides. On the side of `SOLAR`

, the `SOLAR`

commands `polygenic`

, `mga`

and `multipoint`

do all the hard job on estimating the variance component models with control via a set of parameters and settings. On the side of `R`

, the `solarius`

functions `solarPolygenic`

, `solarAssoc`

and `solarMultipoint`

prepare the input data for the analysis, pass the arguments to the commands and set up other necessary settings, and finally read the output `SOLAR`

files and store the results into R objects of S3 classes in R.

The main advantage of running the `SOLAR`

models in `solarius`

instead of `SOLAR`

is that the analysis is conduced by only one function call, while the same analysis in `SOLAR`

could require typing more commands with appropriate options and settings (to be learned from the `SOLAR`

manual). The `solarius`

functions just automate the analysis flow and keep configuration options via the R function arguments, if such configuration is required. Another advantage of using the R package `solarius`

is that the results are stored in objects of especially designed S3 classes with associated methods like print, summary, plot and others.

The last column of the table given above contains the information related to parallel computation available for the models. The implementation of parallel calculations is straightforward in `solarius`

package, since the association and linkage analyses are *embarrassingly parallel* problems (easily divisible into different identical independent task) and the R environment offers a number of packages with parallel interfaces.

More information about the `solarius`

functions is available in the software manual of the package, and the practical guide of using the functions is presented in the next section dedicated to examples. Meantime, some points should to be stressed to describe some other features and drawbacks of the `solarius`

package.

- The
`solarius`

functions, by default, do not create any directories visible to the user, though the analysis flow in`SOLAR`

is heavily based on storing the data in directories and files. However, the input phenotype and genotype data might be stored in CSV data format and passed to the functions. The user can specify a special directory (`dir`

argument for the`solarPolygenic`

,`solarAssoc`

and`solarMultipoint`

functions), where the`SOLAR`

output will be saved in addition to the output R object. - The
`tcl`

scripts used by`SOLAR`

for automation are not visible to the user. The routine of executing either`SOLAR`

separate commands or a block of commands in a`tcl`

script is internally organized in the functions of`solarius`

. - The work on parsing the output of
`SOLAR`

analysis is implemented in the functions in`solarius`

, and the user does not need to manually look for the analysis results. - The
`SOLAR`

commands used to compute the analysis and some`SOLAR`

output files are saved in the returned objects. This feature might be necessary for those users who are familiar with`SOLAR`

. It is also helps to go back to`SOLAR`

and repeat or extend the analysis in the`SOLAR`

environment, if it is necessary. - The package
`solarius`

offers only three analyses via the most powerful`SOLAR`

commands`polygenic`

,`mga`

and`multipoint`

. If a more specific genetic analysis is required, the user needs to go back to`SOLAR`

.

The installation process has three steps: install `SOLAR`

, register `SOLAR`

and install `solarius`

.

`SOLAR`

installationThe `SOLAR`

distribution is available on the official web page http://solar.txbiomedgenetics.org/ for the most popular operating systems (Linux, Mac OS, Windows).

In the case of Linux, the installation files are compressed in `solar_linux.tar.gz`

archive. Once the archive is downloaded to a local machine of the user and the installation files are unpacked, one can find the `README`

file and `install_solar`

script. In particular, the script copies the necessary library, binary and documentation files into an installation directory defined by the user. A special bash script called `solar`

is also created and copied to the system path defined by the user, e.g. `/usr/local/bin/solar`

, so that the program can be executed by simply typing `solar`

in the terminal.

An example of the installation command might be the following.

`./install_solar /opt/appl/solar/7.6.6 /usr/local/bin solar`

As a result of execution of the command, the library, binary and documentation files of `SOLAR`

version `7.6.6`

will be copied to `/opt/appl/solar/7.6.6`

directory, and `solar`

execution script will be copied to `/usr/local/bin`

system directory.

After the installation process the new user is required to obtain a registration key, as it is explained by the message that appears when `SOLAR`

program starts. The registration key has to be stored in `.solar_reg`

file in the home directory of the user.

Once both installation and registration processes are finished, something similar to the following message will be displayed when `SOLAR`

is executed.

```
SOLAR Eclipse version 7.6.6 (Experimental), last updated on December 11, 2014
Copyright (c) 1995-2014 Texas Biomedical Research Institute
Enter help for help, exit to exit, doc to browse documentation.
solar>
```

**SOLAR version for Windows**. Currently, SOLAR does not have an appropriated version for Windows. As stated in the official web page of SOLAR http://solar.txbiomedgenetics.org/solarwindows.html:

Because SOLAR requires a Unix OS or equivalent, we have prepared a VMware Virtual Machine containing SOLAR plus the Xubuntu edition of linux which runs on Windows XP or later using the free VMware Player. Beware this requires a very large multi 1.4 Gb download, and will require more than 10 gigabytes free disc space during installation.

This is a clear limitation of `solarius`

package that comes from the dependency on SOLAR. Starting from version 3.*, `solarius`

is not supported for Windows. DESCRIPTION file has a line `OS_type: unix`

.

`solarius`

installation`solarius`

package is available in the CRAN repository for R packages http://cran.r-project.org/web/packages/solarius/, so it can be install by calling the usual R function `install.packages`

.

`install.packages(solarius)`

The code repository of `solarius`

package is hosted in the public GitHub repository of the UGCD group https://github.com/ugcd/solarius. The R package `devtools`

allows to install the development version of the package hosted at GitHub. The installation commands are the following.

```
library(devtools)
install_github("ugcd/solarius")
```

`solarius`

has several R packages as dependencies (that can be seen in DESCRIPTION file). Installation of these packages should be straightforward using the standard R function `install.packages`

.

In the case of problems in the installation process, web pages of the packages are the main source of information.

Imported packages:

- https://github.com/hadley/plyr for
`plyr`

. - https://github.com/hadley/ggplot2 for
`ggplot2`

. - https://github.com/Rdatatable/data.table for
`data.table`

.- See also CRAN_Release.cmd document.

**Installation of data.table R package**. If the user experiences problems installing `data.table`

package, we recommend to upgrade the R environment to the latest stable version.

`solarius`

R package structure`solarius`

is organized as a standard R package, and its structure can be seen in the code repository web site https://github.com/ugcd/solarius.

- R directory contains
`*.R`

files with functions, classes and methods. - data directory has data sets distributed with the package and stored in
`*.RData`

files. `man`

directory includes documentation organized in`Rd`

files (**R****d**ocumentation files).- inst directory includes directories, where some extra data in any format can be stored for the R package. Sub-directories in
`inst`

directory will be copied to the installation directory of the package and, thus, will be available within the package via`system.file`

function.- inst/examples has
`*.R`

scripts with code examples. - inst/extdata mostly contains
`SOLAR`

data files for data sets of the package. - inst/tests has
`*.R`

scripts for testing units during development of the package (R package`testthat`

).

- inst/examples has
- vignettes directory includes vignettes, which are known as a long-form documentation for an R package. Typically, vignettes has a form of a technical report or an academic paper, that describes the problem that your package is designed to solve. Files with
`Rmd`

extension are the source files of the vignettes.

The source files of documentation are stored in `man`

directory as `Rd`

files. R package `staticdocs`

converts this documentation into html pages. The documentation for `solarius`

is published on http://ugcd.github.io/solarius/doc/.

In the R environment the help for a function of the package, for example `solarPolygenic`

function, can be obtained by the standard `?solarPolygenic`

R comand. The content of the documentation for `solarPolygenic`

function is the same as on http://ugcd.github.io/solarius/doc/solarPolygenic.html.

The vignetts are created by `rmarkdown`

R package and published as html pages on the web.

- http://ugcd.github.io/solarius/vignettes/tutorial.html: The vignette presents a tutotial of the package.
- http://ugcd.github.io/solarius/vignettes/minimal.html: The vignette gives a minial example of the package. It includes three analyses, polygenic, association and linkage, applied to
`dat30`

data set. - http://ugcd.github.io/solarius/vignettes/modelsGAIT1.html: The vignette reports results on the real family-based data set GAIT1. The traits under study are the FXI levels in blood, the BMI and the Thrombosis affection. Although the original calculus were conducted with
`SOLAR`

, the code presented in the vignette is based on`solarius`

. The produced results are the same as those reported in the published articles.

`solarius`

To start using `solarius`

package, the user needs to load the package with `library`

function.

`library(solarius)`

The version of `solarius`

loaded here is:

`packageVersion("solarius")`

`## [1] '0.3.1'`

Other R packages will be needed for the tutorial.

`library(plyr)`

The following code shows how the main functions (`solarPolygenic`

, `solarMultipoint`

and `solarAssoc`

) of the package work.

```
# load library
library(solarius)
# load data set
data(dat30)
# univariate polygenic model
mod1 <- solarPolygenic(trait1 ~ 1, dat30)
# bivariate polygenic model
mod2 <- solarPolygenic(trait1 + trait2 ~ 1, dat30,
polygenic.options = '-testrhoe -testrhog')
# specify directory with IBD matrices and run linkage model
mibddir <- system.file('extdata', 'solarOutput',
'solarMibdsCsv', package = 'solarius')
link <- solarMultipoint(trait1 ~ 1, dat30,
mibddir = mibddir, chr = 5)
# run association model in parallel
assoc <- solarAssoc(trait1 ~ 1, dat30, cores = 2,
snpcovdata = genocovdat30, snpmap = mapdat30)
```

The user can think of this block of code as a quick start in using `solarius`

. The material given further will provide more details and all necesseary information on the usage of `solarius`

.

The genome-wide QTL analysis typically involves large-scale data sets. Data sets distributed within `solarius`

package contain relatively small simulated data, which are used for reproducible demonstrations of how functions of the package work. There are three data sets available in `solarius`

:

- An adapted version of the simulated data from the R package
`multic`

(Andrade et al. 2006) (polygenic and linkage models);- A smaller subset of these data is given in
`dat30`

data set (polygenic, linkage and association models);

- A smaller subset of these data is given in
- An adapted version of the simulated data
`dat50`

from the R package FFBSKAT (polygenic model and association model, both with custom kinship matrix); - GAW10 data distributed within
`SOLAR`

(polygenic and linkage models).

The simulated data set (in `SOLAR`

format) from `multic`

R package is placed in inst/extdata/solarOutput directory of `solarius`

package. 1200 individuals in 200 families are included in the study. The simulated phenotypes include two traits `trait1`

and `trait2`

and two covariates `age`

and `sex`

. The two traits possess a high genetic correlation. For the linkage analysis 6 IBD matrices for chromosome 5 6 IBD matrices for chromosome 2 are available. The six matrices of chromosome 5 are the original data from `multic`

package, while the six matrices of chromosome 2 are the exact copies of those of chromosome 5. Data for the two chromosomes are needed to test parallel calculation (separated by chromosome) of the linkage model.

A smaller subset of the simulated data from `multic`

package is provided in `dat30`

data set (`dat30.RData`

file in data directory). This smaller data set contains only 174 individuals in 29 families.

- The phenotypes (
`dat30`

data frame) are the same as in the complete data described above. - A hundred of synthetic SNPs (
`genocovdat30`

matrix) were randomly generated to run the association analysis on these data (see the script inst/scripts/generate-genocovdat30.R). - Annotation information (mapdat30 data frame) also was generated, mainly in order to plot the association results with Manhattan plot.

Documentation of `dat30`

data set is available on the page dat30.

The simulated data from the R package `FFBSKAT`

are stored in `dat50`

data set (`dat50.RData`

file in data directory). This data set is adapted from `example.data`

data set in the R package `FFBSKAT`

. The following variables are included.

`phenodata`

: A data.frame with the phenotypes`trait`

,`sex`

and`age`

given for 66 individuals (related and unrelated).`genodata`

: A matrix of 50 genetic variants (columns) given for 66 individuals (rows). The genotypes are coded in the format such as`1/1`

,`1/2`

and`2/2`

.`genocovdata`

: A matrix of covariates derived from the genotype data (the additive model). The covariates are coded as integers 0, 1 and 2.`snpdata`

: A data.frame of annotation for genetic variants. The variables name of the variant`name`

, chromosome`chrom`

, gene name`gene`

and position in bp`position`

.`kin`

: A kinship matrix for 66 individuals.

Documentation of `dat50`

data set is available on the page dat50.

The simulated data from the workshop GAW10 are described elsewhere (SOLAR Documentation, Chapter 3 Tutorial).

To load both data sets `dat30`

and `dat50`

, the user needs to use the standard R function `data`

A quick view to the data tables might be performed by means of `str`

function.

```
data(dat30)
str(dat30)
```

```
## 'data.frame': 174 obs. of 10 variables:
## $ famid : int 1 1 1 1 1 1 2 2 2 2 ...
## $ id : int 11 12 13 14 15 16 21 22 23 24 ...
## $ fa : int 0 0 11 11 11 11 0 0 21 21 ...
## $ mo : int 0 0 12 12 12 12 0 0 22 22 ...
## $ sex : int 1 2 1 2 1 1 1 2 2 1 ...
## $ affect: int 2 2 2 2 2 2 2 2 2 2 ...
## $ class : logi NA NA NA NA NA NA ...
## $ trait1: num 11.96 7.1 10.32 9.76 9.46 ...
## $ trait2: num 13.58 5.37 6.4 8.98 9.21 ...
## $ age : int 50 25 35 49 51 45 37 29 39 41 ...
```

`str(genocovdat30)`

```
## num [1:174, 1:100] 1.978 0.795 0.231 0.139 0.487 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:174] "11" "12" "13" "14" ...
## ..$ : chr [1:100] "snp_1" "snp_2" "snp_3" "snp_4" ...
```

`str(mapdat30)`

```
## 'data.frame': 100 obs. of 4 variables:
## $ SNP : Factor w/ 100 levels "snp_1","snp_10",..: 1 12 23 34 45 47 48 49 50 2 ...
## $ chr : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pos : num 2105324 2105467 2106094 2108138 2109262 ...
## $ gene: Factor w/ 12 levels "gene1","gene2",..: 1 1 1 1 1 1 1 1 1 2 ...
```

```
data(dat50)
str(phenodata)
```

```
## 'data.frame': 66 obs. of 4 variables:
## $ id : num 1 2 3 4 5 6 7 8 9 10 ...
## $ sex : int 0 1 0 0 0 1 0 1 1 0 ...
## $ age : int 80 77 56 44 75 79 75 82 77 76 ...
## $ trait: num -1.763 -1.423 -0.805 0.268 -1.334 ...
```

`str(genodata)`

```
## chr [1:66, 1:50] "1/1" "1/1" "1/1" "1/2" "1/2" "1/1" ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:66] "1" "2" "3" "4" ...
## ..$ : chr [1:50] "s1" "s2" "s3" "s4" ...
```

`str(genocovdata)`

```
## int [1:66, 1:50] 0 0 0 1 1 0 0 0 0 0 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:66] "1" "2" "3" "4" ...
## ..$ : chr [1:50] "s1" "s2" "s3" "s4" ...
```

`loadMulticPhen`

function in `solarius`

package loads the complete version of the data from the R package `multic`

(1200 individuals).

```
mdat <- loadMulticPhen()
str(mdat)
```

```
## 'data.frame': 1200 obs. of 10 variables:
## $ ID : chr "1001" "1002" "1003" "1004" ...
## $ affect: int 2 2 2 2 2 2 2 2 2 2 ...
## $ class : logi NA NA NA NA NA NA ...
## $ trait1: num 9.61 11.5 11.05 12.37 7.97 ...
## $ trait2: num 9.61 11.66 13.62 13.88 7.29 ...
## $ age : int 45 37 45 48 31 41 50 38 43 54 ...
## $ FAMID : chr "100" "100" "100" "100" ...
## $ FA : chr "0" "0" "1001" "1001" ...
## $ MO : chr "0" "0" "1002" "1002" ...
## $ SEX : chr "1" "2" "1" "2" ...
```

`loadExamplesPhen`

function returns the GAW10 data sets available in `SOLAR`

.

```
gawdat <- loadExamplesPhen()
str(gawdat)
```

```
## 'data.frame': 1497 obs. of 11 variables:
## $ ID : chr "1" "10" "100" "1000" ...
## $ AGE: int 56 37 80 51 50 39 36 68 69 57 ...
## $ EF : num NA 46.9 NA 29.5 24.7 ...
## $ Q1 : num NA 42.1 NA 25.9 18.4 ...
## $ Q2 : num NA 55.4 NA 32.4 29.6 ...
## $ Q3 : num NA 111.2 NA 65.8 39.3 ...
## $ Q4 : num NA 13.1 NA 11.5 10.1 ...
## $ Q5 : num NA 30.9 NA 27.7 24.6 ...
## $ FA : chr "0" "0" "0" "610" ...
## $ MO : chr "0" "0" "0" "612" ...
## $ SEX: chr "1" "2" "2" "1" ...
```

Both `loadMulticPhen`

and `loadExamplesPhen`

functions are based on a more general function `readPhen`

, which reads the phenotype data from text files of a table format. The data from the R package `multic`

can be also loaded by means of `readPhen`

function.

```
dat.dir <- system.file("extdata", "solarOutput", package = "solarius")
mdat <- readPhen(phen.file = file.path(dat.dir, "simulated.phen"), sep.phen = ",",
ped.file = file.path(dat.dir, "simulated.ped"), sep.ped = ",")
```

The two times kinship coefficients describe the relatedness among individuals in the family-based sample. Two functions `solarKinship2`

and `plotKinship2`

allows to compute and plot the kinship matirx of a given data set. `solarKinship2`

function runs `SOLAR`

for estimation of the kinship coefficients based on the ID fields in the data. The returned value of the function is two times kinship coefficients in a matrix form. `plotKinship2`

plots the two times kinship matrix by using the R package `Matrix`

.

For example, the kinship matrix for `dat50`

data sets has the following image.

`plotKinship2(2*kin)`

One can see that both unrelated and related individuals appear in `dat50`

data set.

The relationships among individuals in `dat30`

data set is examined in the following two lines of code.

```
kin2.dat30 <- solarKinship2(dat30)
plotKinship2(kin2.dat30)
```

The structure of patterns in the kinship matrix can be explored more closely if the first 30 individuals are plotted.

`plotKinship2(kin2.dat30[1:30, 1:30])`

It seems that the simulated data include only sib-pair relationships. Such an observation can be visually confirmed by `histKinship2`

function.

`histKinship2(kin2.dat30)`

`## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.`

One see that all the related individuals have the twice kinship coefficient equal to `0.5`

. The unique values of `kin2.dat30`

matrix can also be extracted by the following code.

`unique(as.vector(kin2.dat30))`

`## [1] 1.0 0.0 0.5`

The family trees are plotted by `plotPed`

, which reuses some routines from the R package `kinship2`

. The family with index `2`

from `dat30`

data set is plotted in the following code.

`plotPed(dat30, 2)`

```
## Loading required package: kinship2
## Loading required package: Matrix
## Loading required package: quadprog
```

To show an example of trait transformation, a trait with an anomalous distribution `exp_trait1`

is created in `dat30`

data set.

```
dat30 <- mutate(dat30,
exp_trait1 = exp(trait1))
```

The list of available transforms in `solarius`

is returned by `availableTransforms`

function.

`availableTransforms()`

`## [1] "none" "inormal" "log" "out"`

The following code plots the distribution of the original values of `exp_trait1`

and its transformed values.

`library(ggplot2)`

```
## Find out what's changed in ggplot2 with
## news(Version == "1.0.1", package = "ggplot2")
##
## Attaching package: 'ggplot2'
##
## The following object is masked from 'package:solarius':
##
## annotate
```

```
library(gridExtra)
theme_set(theme_bw())
with(dat30, {
grid.arrange(
qplot(exp_trait1),
qplot(transformTrait(exp_trait1, "out")),
qplot(transformTrait(exp_trait1, "log")),
qplot(transformTrait(exp_trait1, "inormal"))
)
})
```

```
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
```

`solarPolygenic`

function runs the polygenic analysis in `SOLAR`

and returns back the results in R. The basic call looks like the standard call of `lm`

function with the formula interface.

```
M1 <- solarPolygenic(trait1 ~ age + sex, dat30)
M1
```

```
##
## Call: solarPolygenic(formula = trait1 ~ age + sex, data = dat30)
##
## File polygenic.out:
## Pedigree: dat.ped
## Phenotypes: dat.phe
## Trait: trait1 Individuals: 174
##
## H2r is 0.8061621 p = 6.1167535e-10 (Significant)
## H2r Std. Error: 0.1100465
##
##
## Proportion of Variance Due to All Final Covariates Is
## 0.0330070
##
## Loglikelihoods and chi's are in trait1/polygenic.logs.out
## Best model is named poly and null0
## Final models are named poly, spor, nocovar
##
## Residual Kurtosis is -0.3603, within normal range
```

The print method for the returned object of class `solarPolygenic`

shows the output produced by `SOLAR`

. The tables of the results are printed by `summary`

method.

`summary(M1)`

```
##
## Call: solarPolygenic(formula = trait1 ~ age + sex, data = dat30)
##
## File polygenic.out:
## Pedigree: dat.ped
## Phenotypes: dat.phe
## Trait: trait1 Individuals: 174
##
## H2r is 0.8061621 p = 6.1167535e-10 (Significant)
## H2r Std. Error: 0.1100465
##
##
## Proportion of Variance Due to All Final Covariates Is
## 0.0330070
##
## Loglikelihoods and chi's are in trait1/polygenic.logs.out
## Best model is named poly and null0
## Final models are named poly, spor, nocovar
##
## Residual Kurtosis is -0.3603, within normal range
##
## Loglikelihood Table:
## model loglik
## 1 sporadic -229.3348
## 2 polygenic -210.8689
##
## Covariates Table:
## covariate Estimate SE Chi pval
## 1 mean 7.7823346631 0.33112422 NA NA
## 2 age 0.0004591117 0.01483783 NA NA
## 3 sex -0.4559509166 0.31250447 NA NA
##
## Variance Components Table:
## varcomp Var SE pval
## 1 h2r 0.8061621 0.1100465 6.116753e-10
## 2 e2 0.1938379 0.1100465 NA
```

By default `SOLAR`

does not perform the LRT tests for the covariates. `covtest`

argument of `solarPolygenic`

function tells to `SOLAR`

to make this job.

```
M2 <- solarPolygenic(trait1 ~ age + sex, dat30, covtest = TRUE)
M2
```

```
##
## Call: solarPolygenic(formula = trait1 ~ age + sex, data = dat30, covtest = TRUE)
##
## File polygenic.out:
## Pedigree: dat.ped
## Phenotypes: dat.phe
## Trait: trait1 Individuals: 174
##
## H2r is 0.8061621 p = 6.1167535e-10 (Significant)
## H2r Std. Error: 0.1100465
##
## age p = 0.9753082 (Not Sig., but fixed)
## sex p = 0.1455341 (Not Sig., but fixed)
##
## Proportion of Variance Due to All Final Covariates Is
## 0.0330070
##
## Loglikelihoods and chi's are in trait1/polygenic.logs.out
## Best model is named poly and null0
## Final models are named poly, spor, nocovar
## Initial sporadic and polygenic models are s0 and p0
## Constrained covariate models are named no<covariate name>
##
## Residual Kurtosis is -0.3603, within normal range
```

The test statistics and p-values are stored in `cf`

slot of the returned object `M2`

.

`M2$cf`

```
## covariate Estimate SE Chi pval
## 1 mean 7.7823346631 0.33112424 NA NA
## 2 age 0.0004591117 0.01483782 0.0010 0.9753082
## 3 sex -0.4559509166 0.31250456 2.1184 0.1455341
```

The custom kinship matrix is passed by `kinship`

argument. The following code shows such an example of `dat50`

data set, which phenotype data are stored in `phenodata`

data.frame.

```
M3 <- solarPolygenic(trait ~ 1, phenodata, kinship = kin)
M3
```

```
##
## Call: solarPolygenic(formula = trait ~ 1, data = phenodata, kinship = kin)
##
## File polygenic.out:
## Pedigree: dat.ped
## Phenotypes: dat.phe
## Trait: trait Individuals: 66
##
## H2r is 0.1000000 p = 0.5000000 (Not Significant)
##
##
## Loglikelihoods and chi's are in trait/polygenic.logs.out
## Best model is named poly and null0
## Final models are named poly, spor
##
## Residual Kurtosis is -0.6742, within normal range
```

The polygenic and sporadic models have the same LRT statistics, that means `SOLAR`

was not able to make the polygenic analysis in a proper way.

`M3$lf`

```
## model loglik
## 1 sporadic -27.77137
## 2 polygenic -27.77137
```

This bug seems the case of unrelated individuals or the individuals withount such ID fields as `fa`

and `mo`

.

The bivariate polygenic analysis is run in the same way as the univariate one.

```
B1 <- solarPolygenic(trait1 + trait2 ~ 1, dat30)
B1
```

```
##
## Call: solarPolygenic(formula = trait1 + trait2 ~ 1, data = dat30)
##
## File polygenic.out:
## Pedigree: dat.ped
## Phenotypes: dat.phe
## Trait: trait1 trait2 Individuals: 174
##
## H2r(trait1) is 0.8218823
## H2r(trait1) Std. Error: 0.1053258
##
## H2r(trait2) is 0.6270026
## H2r(trait2) Std. Error: 0.1158107
##
## RhoE is 0.4120487
## RhoE Std. Error: 0.1969379
##
## RhoG is 0.9728759
## RhoG Std. Error: 0.0374442
##
## Derived Estimate of RhoP is 0.8045957
##
##
## Loglikelihoods and chi's are in trait1.trait2/polygenic.logs.out
## Best model is named poly and null0
## Final models are named poly, spor
```

The test of significance of environmental and genetic correlation coefficients is indicated by `polygenic.options`

argument of `solarPolygenic`

function.

```
B2 <- solarPolygenic(trait1 + trait2 ~ 1, dat30, polygenic.options = "-testrhoe -testrhog")
B2
```

```
##
## Call: solarPolygenic(formula = trait1 + trait2 ~ 1, data = dat30, polygenic.options = "-testrhoe -testrhog")
##
## File polygenic.out:
## Pedigree: dat.ped
## Phenotypes: dat.phe
## Trait: trait1 trait2 Individuals: 174
##
## H2r(trait1) is 0.8218823
## H2r(trait1) Std. Error: 0.1053258
##
## H2r(trait2) is 0.6270026
## H2r(trait2) Std. Error: 0.1158107
##
## RhoE is 0.4120487 p = 0.1424023
## RhoE Std. Error: 0.1969379
##
## RhoG is 0.9728759
## RhoG Std. Error: 0.0374442
##
##
## RhoG different from zero p = 2.2463156e-09
## RhoG different from 1.0 p = 0.2098041
## Derived Estimate of RhoP is 0.8045957
##
##
## Loglikelihoods and chi's are in trait1.trait2/polygenic.logs.out
## Best model is named poly and null0
## Final models are named poly, spor
```

The results are stored in `vcf`

slot of the returned object.

`B2$vcf`

```
## varcomp Estimate SE zscore
## 1 h2r(trait1) 0.8218823 0.10532575 7.803242
## 2 e2(trait1) 0.1781177 0.10532575 1.691112
## 3 h2r(trait2) 0.6270026 0.11581068 5.414031
## 4 e2(trait2) 0.3729974 0.11581068 3.220751
## 5 rhog 0.9728759 0.03744417 25.982039
## 6 rhoe 0.4120487 0.19693793 2.092277
```

It might happen that traits in the multivariate polygenic analysis have different covariates. Trait-specific covariates are passed to `solarPolygenic`

function according the specification of `SOLAR`

.

```
B3 <- solarPolygenic(trait1 + trait2 ~ sex + age(trait2), dat30)
B3
```

```
##
## Call: solarPolygenic(formula = trait1 + trait2 ~ sex + age(trait2),
## data = dat30)
##
## File polygenic.out:
## Pedigree: dat.ped
## Phenotypes: dat.phe
## Trait: trait1 trait2 Individuals: 174
##
## H2r(trait1) is 0.7956974
## H2r(trait1) Std. Error: 0.1129518
##
## H2r(trait2) is 0.6029008
## H2r(trait2) Std. Error: 0.1197632
##
## RhoE is 0.4655421
## RhoE Std. Error: 0.1736882
##
## RhoG is 0.9676250
## RhoG Std. Error: 0.0397582
##
## Derived Estimate of RhoP is 0.8027999
##
##
## Loglikelihoods and chi's are in trait1.trait2/polygenic.logs.out
## Best model is named poly and null0
## Final models are named poly, spor
```

In the results of this analysis, `sex`

covariate is common for both traits, while `age`

covariate is applied only to `trait2`

. The given model specification can be ckecked by looking at the model files from `SOLAR`

.

`tail(B3$solar$files$model$poly.mod, 3)`

```
## [1] "# mu = \\{t1*(<Mean(trait1)>+<bsex(trait1)>*Female) + t2*(<Mean(trait2)>"
## [2] "# +<bsex(trait2)>*Female+<bage(trait2)>*(age-39.87931034))\\}"
## [3] "loglike set -361.499294"
```

`solarAssoc`

function runs the association analysis in `SOLAR`

, in particular via mga command. The genotype data are passed to `solarAssoc`

function in the form of SNP genotypes (`snpdata`

argument) or covariates (`snpcovdata`

or `genocov.files`

arguments). The later case assumes that the user has converted the genotypes to covariates by some method (outside `solarius`

package) before computing the association model.

The function returns an object of class `solarAssoc`

. This class has additional methods such as `annotate`

, `plotQQ`

and `plotManh`

for exploration of the association results. The main table of the results is `snpf`

slot of the returned object, which class is `data.table`

from the R package `data.table`

.

The association analysis is shown with `dat50`

data set with 66 individuals and 50 SNP variants. The kinship matrix `kin`

of this data set is also used.

`snpdata`

argumentThe option of using `snpdata`

is recommended in the case when the small number of SNPs is to be tested.

```
A1 <- solarAssoc(trait ~ 1, phenodata, snpdata = genodata, kinship = kin)
A1
```

```
##
## Call: solarAssoc(formula = trait ~ 1, data = phenodata, kinship = kin,
## snpdata = genodata)
##
## Input SNP data:
## * 50 SNP genotypes passed by `snpdata` argument
##
## Output results of association:
##
## * Table of association results (first 5 out of 50 rows):
## SNP NAv chi pSNP bSNP bSNPse Varexp est_maf
## 1: s1 66 0.189994 0.662922 0.160860 0.369045 0.002875 0.053031
## 2: s10 66 0.050972 0.821380 -0.105065 0.465364 0.000772 0.015152
## 3: s11 66 0.050972 0.821380 -0.105065 0.465364 0.000772 0.015152
## 4: s12 66 0.909920 0.340136 -0.196598 0.206100 0.013692 0.204545
## 5: s13 66 0.182890 0.668901 0.118876 0.277971 0.002767 0.106061
## est_mac dosage_sd
## 1: 7.000026 0.307915
## 2: 1.999998 0.244311
## 3: 1.999998 0.244311
## 4: 27.000006 0.549856
## 5: 13.999986 0.408810
##
## CPU time on 1 core(s): 00:00:03
```

`summary`

method reports the number of significant SNPs based on Bonferroni correction.

`summary(A1)`

```
##
## Call: solarAssoc(formula = trait ~ 1, data = phenodata, kinship = kin,
## snpdata = genodata)
##
## Association model
## * Number of SNPs: 50
## * Input format: snpdata
## * Number of significal SNPs: 0 (Bonferroni correction with alpha 0.05)
```

The significance threshold can be passed as `alpha`

parameter of `summary`

method.

`summary(A1, alpha = 1)`

```
##
## Call: solarAssoc(formula = trait ~ 1, data = phenodata, kinship = kin,
## snpdata = genodata)
##
## Association model
## * Number of SNPs: 50
## * Input format: snpdata
## * Number of significal SNPs: 1 (Bonferroni correction with alpha 1)
## SNP NAv chi pSNP bSNP bSNPse Varexp est_maf est_mac
## 1: s43 66 6.504508 0.01076 -1.186139 0.465081 0.093852 0.030303 3.999996
## dosage_sd
## 1: 0.238606
```

`snpcovdata`

argumentSNP covariates are passed to the association model by `snpcovdata`

parameter.

```
A2 <- solarAssoc(trait ~ 1, phenodata, snpcovdata = genocovdata, kinship = kin)
A2
```

```
##
## Call: solarAssoc(formula = trait ~ 1, data = phenodata, kinship = kin,
## snpcovdata = genocovdata)
##
## Input SNP data:
## * 50 SNP covariates passed by `snpcovdata` argument
##
## Output results of association:
##
## * Table of association results (first 5 out of 50 rows):
## SNP NAv chi pSNP bSNP bSNPse Varexp est_maf
## 1: s1 66 0.189994 0.662922 0.160860 0.369045 0.002875 0.053031
## 2: s10 66 0.050972 0.821380 -0.105065 0.465364 0.000772 0.015152
## 3: s11 66 0.050972 0.821380 -0.105065 0.465364 0.000772 0.015152
## 4: s12 66 0.909920 0.340136 -0.196598 0.206100 0.013692 0.204545
## 5: s13 66 0.182890 0.668901 0.118876 0.277971 0.002767 0.106061
## est_mac dosage_sd
## 1: 7.000026 0.307915
## 2: 1.999998 0.244311
## 3: 1.999998 0.244311
## 4: 27.000006 0.549856
## 5: 13.999986 0.408810
##
## CPU time on 1 core(s): 00:00:02
```

The values of the SNP covariates can take any numerical values. That means dosage format of imputed SNP data is allowed.

The previous example is repeated for the SNP covariate data, where values of `2`

are replaced by `1.9`

.

```
genocovdata2 <- genocovdata
genocovdata2[genocovdata2 == 2] <- 1.9
A2 <- solarAssoc(trait ~ 1, phenodata, snpcovdata = genocovdata2, kinship = kin)
A2
```

```
##
## Call: solarAssoc(formula = trait ~ 1, data = phenodata, kinship = kin,
## snpcovdata = genocovdata2)
##
## Input SNP data:
## * 50 SNP covariates passed by `snpcovdata` argument
##
## Output results of association:
##
## * Table of association results (first 5 out of 50 rows):
## SNP NAv chi pSNP bSNP bSNPse Varexp est_maf
## 1: s1 66 0.189994 0.662922 0.160860 0.369045 0.002875 0.053031
## 2: s10 66 0.050972 0.821380 -0.110595 0.489857 0.000772 0.014394
## 3: s11 66 0.050972 0.821380 -0.110595 0.489857 0.000772 0.014394
## 4: s12 66 0.927276 0.335572 -0.201592 0.209348 0.013951 0.203031
## 5: s13 66 0.182890 0.668901 0.118876 0.277971 0.002767 0.106061
## est_mac dosage_sd
## 1: 7.000026 0.307915
## 2: 1.900008 0.232095
## 3: 1.900008 0.232095
## 4: 26.800026 0.541289
## 5: 13.999986 0.408810
##
## CPU time on 1 core(s): 00:00:02
```

`genocov.files`

(single value)`genocov.files`

argument is supported to pass the SNP covariate data in files of the appropriate `SOLAR`

format. This option suits for large-scale calculations, when the real genome-wide scan is run. Both arguments `genocov.files`

and `snplists.files`

are required, and `snpmap.files`

argument is optional.

```
dir <- package.file("extdata", "solarAssoc", package = "solarius")
genocov.files <- file.path(dir, "snp.genocov")
snplists.files <- file.path(dir, c("snp.geno-list1", "snp.geno-list2"))
A3 <- solarAssoc(trait ~ 1, phenodata,
genocov.files = genocov.files, snplists.files = snplists.files)
A3
```

```
##
## Call: solarAssoc(formula = trait ~ 1, data = phenodata, genocov.files = genocov.files,
## snplists.files = snplists.files)
##
## Input SNP data:
## * SNP covariates passed in 1 file(s) by `genocov.files` argument
##
## Output results of association:
##
## * Table of association results (first 5 out of 25 rows):
## SNP NAv chi pSNP bSNP bSNPse Varexp est_maf
## 1: s1 66 0.189994 0.662922 0.160860 0.369045 0.002875 0.053031
## 2: s10 66 0.050972 0.821380 -0.105065 0.465364 0.000772 0.015152
## 3: s11 66 0.050972 0.821380 -0.105065 0.465364 0.000772 0.015152
## 4: s12 66 0.909920 0.340136 -0.196598 0.206100 0.013692 0.204545
## 5: s13 66 0.182890 0.668901 0.118876 0.277971 0.002767 0.106061
## est_mac dosage_sd
## 1: 7.000026 0.307915
## 2: 1.999998 0.244311
## 3: 1.999998 0.244311
## 4: 27.000006 0.549856
## 5: 13.999986 0.408810
##
## CPU time on 1 core(s): 00:00:01
```

`genocov.files`

(many values)Many values can be passed to `genocov.files`

parameter, but this case is thought for parallel calculations.

```
dir <- package.file("extdata", "solarAssoc", package = "solarius")
genocov.files <- file.path(dir, c("snp.genocov1", "snp.genocov2"))
snplists.files <- file.path(dir, c("snp.geno-list1", "snp.geno-list2"))
A4 <- solarAssoc(trait ~ 1, phenodata,
genocov.files = genocov.files, snplists.files = snplists.files)
A4
```

```
##
## Call: solarAssoc(formula = trait ~ 1, data = phenodata, genocov.files = genocov.files,
## snplists.files = snplists.files)
##
## Input SNP data:
## * SNP covariates passed in 2 file(s) by `genocov.files` argument and 2 files(s) by `snplists.files` argument
##
## Output results of association:
##
## * Table of association results (first 5 out of 25 rows):
## SNP NAv chi pSNP bSNP bSNPse Varexp est_maf
## 1: s1 66 0.189994 0.662922 0.160860 0.369045 0.002875 0.053031
## 2: s10 66 0.050972 0.821380 -0.105065 0.465364 0.000772 0.015152
## 3: s11 66 0.050972 0.821380 -0.105065 0.465364 0.000772 0.015152
## 4: s12 66 0.909920 0.340136 -0.196598 0.206100 0.013692 0.204545
## 5: s13 66 0.182890 0.668901 0.118876 0.277971 0.002767 0.106061
## est_mac dosage_sd
## 1: 7.000026 0.307915
## 2: 1.999998 0.244311
## 3: 1.999998 0.244311
## 4: 27.000006 0.549856
## 5: 13.999986 0.408810
##
## CPU time on 1 core(s), 2 batches: 00:00:01
```

QQ and Manhattan plots help to explore the association results visually.

`A5 <- solarAssoc(trait ~ 1, phenodata, snpdata = genodata, snpmap = snpdata, kinship = kin)`

`plot(A5)`