library(genomicMateSelectR)
#> Loading required package: tibble
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> Loading required package: tidyr
#> Loading required package: purrr

Genetic models implemented

Additive-only models

Previously, in the quick start vignette, I showed how to use the predictCrosses() function with a single, additive genetic effect, by setting modelType="A". We supplied example marker-effects as input via the snpeffs= argument.

snpeffsA %>% dplyr::select(Trait,allelesubsnpeff)
#> # A tibble: 2 × 2
#>   Trait  allelesubsnpeff
#>   <chr>  <list>         
#> 1 Trait1 <dbl [100 × 1]>
#> 2 Trait2 <dbl [100 × 1]>

These marker effects (snpeffsA) were derived from standard genome-wide marker regression (i.e. RR-BLUP, a.k.a. SNP-BLUP).

Consider a diploid population with n individuals genotyped at p biallelic genomic loci.

The standard genomic mixed-model is of the form:

\[\boldsymbol{y} =\boldsymbol{X\beta} +\boldsymbol{Z\alpha} +\boldsymbol{\epsilon}\]

In this linear model, the $n \times 1$ vector of phenotypic observations, $\boldsymbol{y}$ is modeled according to a combination of genetic and non-genetic effects. Fixed experimental design-related effects estimates are given by $\boldsymbol{\beta}$ and its corresponding incidence matrix $\boldsymbol{X}$ ($[ n \times N_{fixed}]$) where $N_{fixed}$ is the number of fixed factors. The elements of the $[n \times p]$ matrix $\boldsymbol{Z}$ contains column-centered marker genotypes:

\[\begin{matrix} z_{ij} = \begin{cases} 2-2p_{j} & A_1A_1\\ 1-2p_{j} & A_1A_2\\ 0-2p_{j} & A_2A_2 \end{cases}\end{matrix}\]

Here, $p_j$ and $q_j$ are the population frequencies for the $A_1$ and $A_2$ alleles, respectively for the jth SNP marker.

The marker-effects ($\boldsymbol{\alpha}$) are fitted as independent and identically distributed (i.i.d.) random-effects with mean zero and effects-variance $\sigma^2_{\alpha}$, i.e. $\alpha \sim N(0,\sigma^2_{\alpha})$.

As specified $\alpha$ represent allele-substitution effects.

Genomic estimated breeding values (GEBV), which are individual-level (rather than marker-level) selection criteria, can therefore be predicted as $\boldsymbol{\hat{g}}_{BV} =\boldsymbol{Z\hat{\alpha}}$ based on this model.

Models including dominance

We focus in this package, thus far, on dominance effects. Epistasis could be added, in principle.

It is worth reviewing some subtle but important differences between statistical models for partitioning the effect of a locus on a phenotype into “additive” and “dominance” components. genomicMateSelectR’s higher-level functions (predictCrosses() and runGenomicPredictions() are careful about this; I want to make the user aware.

To really understand the matter, there is a strong, recent literature on non-additive effects in genomic prediction, I recommend reviewing: Vitezica, Varona, and Legarra (2013); Varona et al. (2018); Xiang et al. (2016); Wolfe et al. (2021).

There are two partitions of additive and dominance effects used by genomicMateSelectR, one for modelType="AD", the other relevant for modelType="DirDom".

Classical (statistical)

Vitezica et al. (2013) showed the “classical” model specification, which partitions the total genotypic effect into allele substitution and dominance deviations.

\[\boldsymbol{y} =\boldsymbol{X\beta} +\boldsymbol{Z\alpha} +\boldsymbol{Wd} +\boldsymbol{\epsilon}\]

$\boldsymbol{Z\alpha}$ are as in modelType="A", column-centered dosages and allele substitution effects.

The new matrix of column-centered dominance codings, $\boldsymbol{W}$ contains:

\[\begin{matrix} w_{ij} = \begin{cases} -2q_j^2 & A_1A_1\\ 2p_jq_j & A_1A_2\\ -2p_j^2 & A_2A_2 \end{cases}\end{matrix}\]

The dominance deviation marker-effects ($\boldsymbol{d}$) are also fitted as i.i.d. random, $d \sim N(0,\sigma^2_{d})$. The corresponding individual-level, genome-wide prediction, the genomic estimated dominance deviation (GEDD) is therefore $\boldsymbol{\hat{g}}_{DD} =\boldsymbol{W\hat{d}}$ and we can predict the total merit or performance (rather than just the breeding value) of an individual as GETGV = GEBV + GEDD.

This model is implemented by genomicMateSelectR when the user chooses modelType="AD".

Genotypic (biological)

There is a subtly different model models biological dominance effects, by using a 0, 1, 0 coding of genotypes $A_1A_1$, $A_1A_2$ and $A_2A_2$, respectively.

\[\boldsymbol{y} = \boldsymbol{X\beta} + \boldsymbol{Za} + \boldsymbol{\Gamma} \boldsymbol{d}^* + \boldsymbol{\epsilon}\] The dominance coding in the matrix $\boldsymbol{\Gamma}$ is

\[\begin{matrix} \gamma_{ij} = \begin{cases} (0-2p_{j}q_{j}) & A_1A_1\\ (1-2p_{j}q_{j}) & A_1A_2\\ (0-2p_{j}q_{j}) & A_2A_2 \end{cases} \end{matrix}\]

The coding for the additive term ($\boldsymbol{Z}$) is the same as all previous models. However, the resulting set of genotypic additive-effects ($\boldsymbol{a}$) and the corresponding genotypic dominance-effects $\boldsymbol{d^*}$ are not allele substitution or dominance deviation effects. In genomicMateSelectR the genomic estimated values predicted by this model are referred to as: GETGV = GEadd + GEdom.

Allele substitution effects ($\alpha$) and subsequently GEBV can be recovered from this model via:

\[\boldsymbol{\alpha} =\boldsymbol{a} +\boldsymbol{d}(\boldsymbol{q}-\boldsymbol{p})\]

However, this is inefficient. In fact, currently, genomicMateSelectR does not both to implement the “genotypic” model as it’s own modelType=; but its on the ‘possibly to-do’ list.

Instead the “genotypic” model is used with small extension in an implementation of Xiang et al. (-Xiang et al. (2016) )’s model with directional dominance.

Directional dominance

genomicMateSelectR includes a third modelType="DirDom" .

Genomic mixed-model with a genome-wide directional dominance term:

\[\boldsymbol{y} = \boldsymbol{X\beta} + \boldsymbol{f}b + \boldsymbol{Za} + \boldsymbol{\Gamma} \boldsymbol{d}^* + \boldsymbol{\epsilon}\]

This is the “genotypic” model with an additive fixed-effect covariate, $\textbf{f}$, a vector of the individual-level genome-wide proportion of loci that are homozygous. The fixed-effect estimate $\boldsymbol{b}$ is the estimated linear effect of overall homozygosity, interpreted as inbreeding depression or heterosis depending on its direction relative to each trait.

genomicMateSelectR actually incorporates the directional-dominance effects into predictions of GEBV, GETGV as well as cross-means and variances by dividing $b$ by the number of effects ($p$) and subtracting that value from the vector of dominance effects, to get $\boldsymbol{d} = \boldsymbol{d}^*-\frac{b}{p}$. The notation here isn’t perfect, because $\boldsymbol{d}$ is not the classical dominance deviation effect, which I notated the same. Instead, here $\boldsymbol{d}$ is the vector of genotypic dominance effects with a potentially non-zero mean dominance value; the directional-dominance term is factored in.

Implementation using `runGenomicPredictions()` in `genomicMateSelectR`

There are many options for fitting the mixed-models described above, using software both in (and out) of R.

IMHO, the best R package, for the frequentist, remains to be the sommer R package Covarrubias-Pazaran (2016). sommer is available on CRAN and has a number of excellent vignettes to get new users started. sommer’s primary mixed-model solver mmer() is almost as flexible as asreml, but free and open-source. Downside is, it is slower and more memory intensive.

genomicMateSelectR’s genomic prediction function uses sommer::mmer() under the hood. The function runGenomicPredictions() will implement, across potentially multiple traits, univariate genomic predictions, with error-variances weighted according to user-specific weights.

Input: BLUPs or other phenotypes

runGenomicPredictions() requires phenotypes (response data) for genomic predictions via the argument blups=.

An example input blups is shown below:

blups
#> # A tibble: 2 × 2
#>   Trait  TrainingData      
#>   <chr>  <list>            
#> 1 Trait1 <tibble [300 × 4]>
#> 2 Trait2 <tibble [300 × 4]>

The tibble blups has one row per trait, one column Trait with chr-string of trait labels, second column is called TrainingData and is list-type with each element containing a tibble. For example:

blups$TrainingData[[1]] %>% str
#> tibble [300 × 4] (S3: tbl_df/tbl/data.frame)
#>  $ GID    : chr [1:300] "1" "2" "3" "4" ...
#>  $ drgBLUP: num [1:300] -0.1 -0.99 -0.283 -1.567 -0.949 ...
#>  $ BLUP   : num [1:300] -0.1 -0.99 -0.283 -1.567 -0.949 ...
#>  $ WT     : num [1:300] 1 1 1 1 1 1 1 1 1 1 ...

Four columns, all required in each tibble inside blups$TrainingData:

GID: genotype ID, chr-string
drgBLUP: the primary response variable. Stands for de-regressed BLUPs, but users can put any single-value-per-GID that they chose as phenotypic training data.
BLUP: used by the cross-validation functions (runCrossVal() and runParentWiseCrossVal() as a slot for validation data. Typically I use the not-deregressed BLUPs. Users can simply have the columns drgBLUP==BLUP if they wish.
WT: Weights for the error-variance. Used directly as input for sommer::mmer(weights = ) argument. Users can simply input a column of 1’s for an un-weighted prediction.

Typically, I implement de-regressed BLUPs as

\[drgBLUP = \frac{BLUP}{REL}\]

where $REL$ is the reliability of the BLUP, $REL=1-\frac{PEV}{\sigma^2_g}$. $\sigma^2_g$ as estimated in the mixed-model.

and the weights as

$WT = \frac{1-H^2}{(0.1+\frac{1-REL}{REL})*H^2}$, according to Garrick, Taylor, and Fernando (2009).

As indicated above, users are free to supply alternative weights or use an unweighted analysis.

In the future, I can make this more flexible. Users wanting a single-stage or any other flavor of approach for obtaining marker effects and genomic BLUPs can do so and bypass runGenomicPredictions() and format their marker-effects for predictCrosses().

Input: genomic relationship matrices

Each of the models discussed above is a genome-wide marker regression; a RR-BLUP, aka SNP-BLUP model. Each model has an exact equivalent genomic-BLUP (G-BLUP model) formed by constructing a genomic relationship matrix (GRM) from column-centered additive and dominance coding matrices described above (Vitezica, Varona, and Legarra 2013; Varona et al. 2018). In fact, runGenomicPredictions() fits GBLUP-style models using user-supplied GRMs via the argument grms=. By default, the argument getMarkEffs=FALSE is set, a GBLUP model is fit, and genomic BLUP values, e.g. GEBV and GEDD are returned to the user without marker-effects.

doseMat %>% str
#>  int [1:300, 1:100] 1 2 2 1 1 0 1 1 1 2 ...
#>  - attr(*, "dimnames")=List of 2
#>   ..$ : chr [1:300] "1" "2" "3" "4" ...
#>   ..$ : chr [1:100] "1_2" "1_3" "1_6" "1_7" ...

doseMat[1:3,1:8]
#>   1_2 1_3 1_6 1_7 1_8 1_11 1_14 1_17
#> 1   1   0   2   0   0    0    1    1
#> 2   2   0   1   1   1    0    0    1
#> 3   2   0   2   0   0    0    1    1

With the doseMat we can easily construct the GRMs we need using the kinship() function built-into genomicMateSelectR.

A<-kinship(doseMat,type = "add")
dim(A)
#> [1] 300 300

A[1:5,1:5]
#>           1           2           3           4           5
#> 1 1.6929383  0.18936973  0.39372344  0.26763285  0.27315593
#> 2 0.1893697  1.01254266  0.12403295  0.03319602 -0.06704188
#> 3 0.3937234  0.12403295  1.28023545 -0.07973320  0.06680451
#> 4 0.2676329  0.03319602 -0.07973320  1.13381525 -0.09453974
#> 5 0.2731559 -0.06704188  0.06680451 -0.09453974  1.21536872

summary(diag(A))
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>  0.4602  0.7735  0.9456  0.9670  1.1120  1.6929

Fit an additive model, no marker effects

As is we have the needed input to runGenomicPredictions().

Kinship matrices should be formatted as a named-list of matrices. Depending on the model, either only an additive relationship matrix labeled “A,” or an additive and a dominance matrix (labeled “D”).

grms<-list(A=A)
grms %>% str
#> List of 1
#>  $ A: num [1:300, 1:300] 1.693 0.189 0.394 0.268 0.273 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:300] "1" "2" "3" "4" ...
#>   .. ..$ : chr [1:300] "1" "2" "3" "4" ...

SIwts<-c(0.75,0.25) %>% `names<-`(.,c("Trait1","Trait2"))
gpredsA<-runGenomicPredictions(modelType = "A", 
                               selInd = T, SIwts = SIwts,
                               blups = blups, grms = grms)
#> Loading required package: furrr
#> Loading required package: future
#> iteration    LogLik     wall    cpu(sec)   restrained
#>     1      -101.534   11:58:41      0           0
#>     2      -98.2273   11:58:41      0           0
#>     3      -97.4802   11:58:41      0           0
#>     4      -97.3541   11:58:41      0           0
#>     5      -97.3447   11:58:41      0           0
#>     6      -97.344   11:58:41      0           0
#> iteration    LogLik     wall    cpu(sec)   restrained
#>     1      -102.208   11:58:41      0           0
#>     2      -92.4952   11:58:41      0           0
#>     3      -90.2181   11:58:42      1           0
#>     4      -89.889   11:58:42      1           0
#>     5      -89.8724   11:58:42      1           0
#>     6      -89.8715   11:58:42      1           0

Here’s what runGenomicPredictions() returns by default.

gpredsA
#> # A tibble: 1 × 2
#>   gblups             genomicPredOut  
#>   <list>             <list>          
#> 1 <tibble [300 × 5]> <tibble [2 × 4]>

The first element gblups is a tibble with genomic BLUPs (GEBV in this case) for each trait, and since we supplied SIwts, the GEBV computed on the selection index, labeled SELIND.

gpredsA$gblups[[1]]
#> # A tibble: 300 × 5
#>    GID   predOf SELIND Trait1  Trait2
#>    <chr> <chr>   <dbl>  <dbl>   <dbl>
#>  1 1     GEBV   -0.994 -1.48   0.454 
#>  2 2     GEBV   -0.428 -1.07   1.49  
#>  3 3     GEBV   -0.474 -0.589 -0.130 
#>  4 4     GEBV   -0.735 -0.741 -0.719 
#>  5 5     GEBV   -0.432 -0.561 -0.0445
#>  6 6     GEBV   -0.731 -1.13   0.456 
#>  7 7     GEBV   -1.14  -1.57   0.163 
#>  8 8     GEBV    0.444  0.422  0.511 
#>  9 9     GEBV   -0.654 -1.42   1.64  
#> 10 10    GEBV   -0.656 -0.461 -1.24  
#> # … with 290 more rows

The other element of the output (genomicPredOut) contains data.frames of the fixed-effect estimates (fixeffs) and variance component estimates (varcomps) from the model. The gblups column here just has a more raw version of the GEBV.

gpredsA$genomicPredOut[[1]]
#> # A tibble: 2 × 4
#>   Trait  gblups             varcomps     fixeffs     
#>   <chr>  <list>             <list>       <list>      
#> 1 Trait1 <tibble [300 × 2]> <df [2 × 4]> <df [1 × 5]>
#> 2 Trait2 <tibble [300 × 2]> <df [2 × 4]> <df [1 × 5]>

Returning marker effects with `getMarkEffs=TRUE`

Setting getMarkEffs=TRUE in runGenomicPredictions() will generate RR-BLUP solutions for SNP-effects.

The helper function backsolveSNPeff() is used.

\[Z^T(ZZ^T)^{-1}g\]

Where $ are centered allele dosages or dominance genotype codings, and are the corresponding GBLUP solutions.

gpredsA<-runGenomicPredictions(modelType = "A", 
                               selInd = T, SIwts = SIwts,
                               getMarkEffs = TRUE,dosages = doseMat,
                               blups = blups, grms = grms)
#> iteration    LogLik     wall    cpu(sec)   restrained
#>     1      -101.534   11:58:42      0           0
#>     2      -98.2273   11:58:42      0           0
#>     3      -97.4802   11:58:43      1           0
#>     4      -97.3541   11:58:43      1           0
#>     5      -97.3447   11:58:43      1           0
#>     6      -97.344   11:58:43      1           0
#> iteration    LogLik     wall    cpu(sec)   restrained
#>     1      -102.208   11:58:43      0           0
#>     2      -92.4952   11:58:43      0           0
#>     3      -90.2181   11:58:43      0           0
#>     4      -89.889   11:58:43      0           0
#>     5      -89.8724   11:58:43      0           0
#>     6      -89.8715   11:58:43      0           0

gpredsA
#> # A tibble: 1 × 2
#>   gblups             genomicPredOut  
#>   <list>             <list>          
#> 1 <tibble [300 × 5]> <tibble [2 × 5]>

gpredsA$genomicPredOut[[1]]
#> # A tibble: 2 × 5
#>   Trait  gblups             varcomps     fixeffs      allelesubsnpeff
#>   <chr>  <list>             <list>       <list>       <list>         
#> 1 Trait1 <tibble [300 × 2]> <df [2 × 4]> <df [1 × 5]> <dbl [100 × 1]>
#> 2 Trait2 <tibble [300 × 2]> <df [2 × 4]> <df [1 × 5]> <dbl [100 × 1]>

Allele-substitution effects matrix for each trait is now in the list-column allelesubsnpeff.

Run non-additive genomic predictions

`modelType="AD"`

As mentioned above, modelType="AD" effects an additive-plus-dominance model using the “classical” partition. Predictions of GBLUPs (SNP-BLUPs / marker-effects) will correspond to GEBV (allele substitution effects).

We need first to add the correct dominance relationship matrix, constructed with kinship() to grms:

D<-kinship(doseMat,type = "domClassic")
grms[["D"]]<-D

gpredsAD<-runGenomicPredictions(modelType = "AD", 
                                selInd = T, SIwts = SIwts,
                                getMarkEffs = TRUE,dosages = doseMat,
                                blups = blups, grms = grms)
#> iteration    LogLik     wall    cpu(sec)   restrained
#>     1      -100.583   11:58:44      0           0
#>     2      -96.4141   11:58:44      0           0
#>     3      -95.5178   11:58:44      0           0
#>     4      -95.3651   11:58:44      0           0
#>     5      -95.353   11:58:44      0           0
#>     6      -95.3518   11:58:45      1           0
#>     7      -95.3517   11:58:45      1           0
#> iteration    LogLik     wall    cpu(sec)   restrained
#>     1      -98.116   11:58:45      0           0
#>     2      -84.3225   11:58:45      0           0
#>     3      -81.2501   11:58:45      0           0
#>     4      -80.7642   11:58:45      0           0
#>     5      -80.7346   11:58:45      0           0
#>     6      -80.7326   11:58:46      1           0
#>     7      -80.7325   11:58:46      1           0

gpredsAD$gblups[[1]] %>% head
#> # A tibble: 6 × 5
#>   GID   predOf SELIND Trait1 Trait2
#>   <chr> <chr>   <dbl>  <dbl>  <dbl>
#> 1 1     GEBV   -0.922 -1.48   0.753
#> 2 1     GETGV  -0.922 -1.48   0.753
#> 3 2     GEBV   -0.362 -1.09   1.82 
#> 4 2     GETGV  -0.362 -1.09   1.82 
#> 5 3     GEBV   -0.455 -0.558 -0.146
#> 6 3     GETGV  -0.455 -0.558 -0.146

Note that for each GID there are now two rows, distinguished by the predOf column, as either GEBV or GETGV (the GEBV + GEDD).

gpredsAD$genomicPredOut[[1]]
#> # A tibble: 2 × 6
#>   Trait  gblups             varcomps     fixeffs   allelesubsnpeff domdevsnpeff 
#>   <chr>  <list>             <list>       <list>    <list>          <list>       
#> 1 Trait1 <tibble [300 × 4]> <df [3 × 4]> <df [1 ×… <dbl [100 × 1]> <dbl [100 × …
#> 2 Trait2 <tibble [300 × 4]> <df [3 × 4]> <df [1 ×… <dbl [100 × 1]> <dbl [100 × …

Now two sets of marker effects are reported: allelesubsnpeff and domdevsnpeff.

`modelType="DirDom"`

The last model, implements an additive-plus-directional dominance model as detailed above. The “DirDom” model adds the potential for a genome-wide inbreeding depression (or heterosis effect).

To do this correctly, we need to switch to the “genotyped” parameterization of dominance (see above, and Vitezica, Varona, and Legarra (2013).

Simply change the type= argument to “domGenotypic” in the kinship() function to create a genotypic dominance relationship matrix

D<-kinship(doseMat,type = "domGenotypic")
# Replace the previous "D" matrix, but don't change the name
# grms list input must have names "A" or "D".
grms[["D"]]<-D

That’s the only different input that is needed.

gpredsDirDom<-runGenomicPredictions(modelType = "DirDom", 
                                    selInd = T, SIwts = SIwts,
                                    getMarkEffs = TRUE,dosages = doseMat,
                                    blups = blups, grms = grms)
#> iteration    LogLik     wall    cpu(sec)   restrained
#>     1      -99.0157   11:58:46      0           0
#>     2      -96.3744   11:58:46      0           0
#>     3      -95.732   11:58:47      1           0
#>     4      -95.6263   11:58:47      1           0
#>     5      -95.6193   11:58:47      1           0
#>     6      -95.6188   11:58:47      1           0
#> iteration    LogLik     wall    cpu(sec)   restrained
#>     1      -90.9376   11:58:47      0           0
#>     2      -81.2191   11:58:47      0           0
#>     3      -78.5672   11:58:47      0           0
#>     4      -78.0987   11:58:47      0           0
#>     5      -78.0659   11:58:48      1           0
#>     6      -78.0635   11:58:48      1           0
#>     7      -78.0633   11:58:48      1           0
#> Joining, by = "GID"
#> Joining, by = "GID"
#> Joining, by = "GID"
#> Joining, by = "GID"

gpredsDirDom$gblups[[1]] %>% head
#> # A tibble: 6 × 5
#>   GID   predOf   SELIND Trait1  Trait2
#>   <chr> <chr>     <dbl>  <dbl>   <dbl>
#> 1 1     GEBV     0.0433  32.9  -98.4  
#> 2 1     GETGV    2.88     3.54   0.878
#> 3 2     GEBV     4.51   -23.5   88.4  
#> 4 2     GETGV   -4.78    -7.38   3.00 
#> 5 3     GEBV   -11.8    -33.5   53.6  
#> 6 3     GETGV   -1.59    -1.99  -0.403

Note: no difference in structure of output gblups. For this model, rather than using the genomic BLUPs computed by the initial GBLUP model, runGenomicPredictions() first backsolves for the SNP-effects, then adds the genome-wide directional dominance effect. Because of this approach, we can also recover the allele substitution effects by doing $\alpha = a + d(q-p)$was added to the SNP-effects and then genomic BLUPs were calculated to get GEBV and GETGV that include this effect.

The estimated genome-wide directional dominance effect itself is available in the fixeffs column:

gpredsDirDom$genomicPredOut[[1]]$fixeffs[[1]]
#>     Trait      Effect    Estimate Std.Error     t.value
#> 1 drgBLUP (Intercept)  0.91088794  1.083581  0.84062719
#> 2 drgBLUP           f -0.08215501  1.521228 -0.05400573

The SNP effects themselves:

gpredsDirDom$genomicPredOut[[1]]
#> # A tibble: 2 × 8
#>   Trait  gblups    varcomps  fixeffs   allelesubsnpeff addsnpeff  domstar_snpeff
#>   <chr>  <list>    <list>    <list>    <list>          <list>     <list>        
#> 1 Trait1 <tibble … <df [3 ×… <df [2 ×… <dbl [100 × 1]> <dbl [100… <dbl [100 × 1…
#> 2 Trait2 <tibble … <df [3 ×… <df [2 ×… <dbl [100 × 1]> <dbl [100… <dbl [100 × 1…
#> # … with 1 more variable: domsnpeff <list>

Marker effects reported by the DirDom model:

allelesubsnpeff: allele substition effects with genome-wide directional dominance effect added
addsnpeff: genotypic additive effect
domstar_snpeff: genotypic dominance effect
domsnpeff: genotypic dominance effects with genome-wide directional dominance effect added

Cross predictions with `genomicMateSelectR`

Theory

See Wolfe et al. (2021) for full details.

Briefly:

Predict cross mean breeding values

\[\mu_{BV}=\frac{GEBV_{P1}+GEBV_{P2}}{2}\]

Predict cross mean total genetic value

(also called genomic prediction of cross performance or GPCP by Werner et al. (2020)) \[\mu_{TGV}=\sum\limits_{k=1}^p a_k(p_{ik} - q_{ik} - y_k) + d_k[2p_{ik}q_{ik} + y_k(p_{ik}-q_{ik})]\]

See (Toro and Varona 2010; Varona et al. 2018).

Here $p_{ik}$ and $q_{ik}$ are the allele frequencies of the counted (alternative) and the non-counted (reference-genome) allele, respectively, for one of the two parents (indexed by $i$). The difference in frequency between the parent one (indexed by $i$) and the parent two (indexed by $j$) is ,$y_k = p_{ik} - p_{jk}$ and the summation is over the $p$ markers indexed by $k$. Note that $a_k$ is the average effect and not the allele substitution effect, $\alpha$.

The genotypic parameterization for dominance should be to get marker effects to predict $\mu_{TGV}$ in this way. In fact, for this reason, predictCrosses() only returns $\mu_{TGV}$ predictions when modelType="DirDom".

Predict the genetic variance among offspring

\[\hat{\sigma}^{2}_{BV} = \boldsymbol{\alpha}^{T}\boldsymbol{D}\boldsymbol{\alpha}\] \[\hat{\sigma}^{2}_{DD} = \boldsymbol{d}^{T}\boldsymbol{D}^2\boldsymbol{d}\]

Where $\boldsymbol{D}^2 =\boldsymbol{D}\odot\boldsymbol{D}$, $\odot$ indicating element-wise (Hadamard) multiplication of $\boldsymbol{D}$, having the effect of squaring all elements.

\[\hat{\sigma}^{2}_{TGV} = \hat{\sigma}^{2}_{BV} +\hat{\sigma}^{2}_{DD}\]

The $p\times p$ variance-covariance matrix, $\boldsymbol{D}$, is the expected linkage disequilibrium among full-siblings by considering the expected pairwise recombination frequency and each parent’s haplotype phase.

\[\boldsymbol{D}_{P_1}^{gametes}=(1-2\boldsymbol{c})\odot\boldsymbol{D}_{P_1}^{haplos}\]

\[\boldsymbol{D}_{P_2}^{gametes}=(1-2\boldsymbol{c})\odot\boldsymbol{D}_{P_2}^{haplos}\]

\[\boldsymbol{D}_{P_1\times P_2}^{genotypes} =\boldsymbol{D}_{P_1}^{gametes} +\boldsymbol{D}_{P_2}^{gametes}\]

$\boldsymbol{D}_{P_1}^{haplos}$ and $\boldsymbol{D}_{P_2}^{haplos}$ are simply the $p\times p$ covariance matrices associated with each parent’s respective $2\times p$ haplotype matrix ($\boldsymbol{H}_{P_{1} or P_{2}}$), where elements are 1 if the counted allele is present, 0 otherwise.

\[\boldsymbol{D}^{haplos} =\frac{1}{2}\boldsymbol{H}^T\boldsymbol{H} -\boldsymbol{p}\boldsymbol{p}^T\] Where $\boldsymbol{p}$ is a vector of within-individual, per-SNP allele frequencies.

The $p\times p$ pairwise recombination frequencies matrix is $\boldsymbol{c}$ and can be derived from a genetic map. $\boldsymbol{D}_{P_1}^{gametes}$ and $\boldsymbol{D}_{P_2}^{gametes}$ are the covariance matrices for each parents pool of possible gametes, whose covariances sum to give the expected covariances genotypes in the cross, $\boldsymbol{D}_{P_1\times P_2}^{genotypes}$. The genetic variances $\hat{\sigma}^{2}_{BV}$ and $\hat{\sigma}^{2}_{DD}$ are thus predicted as above by using $\boldsymbol{D} = \boldsymbol{D}_{P_1\times P_2}^{genotypes}$.

See (Lehermeier, Teyssèdre, and Schön 2017).

Predicting the trait-trait co-variances among offspring

Consider two traits, $T1$ and $T2$

Variance for $T1$: \[\sigma^2_{T1}=\boldsymbol{\alpha}_{T1}^{T}\boldsymbol{D}\boldsymbol{\alpha}_{T1}\]

Variance for $T2$: \[\sigma^2_{T2}=\boldsymbol{\alpha}_{T2}^{T}\boldsymbol{D}\boldsymbol{\alpha}_{T2}\]

Covariance between $T1$ and $T2$: \[\sigma_{T1,T2}=\boldsymbol{\alpha}_{T1}^{T}\boldsymbol{D}\boldsymbol{\alpha}_{T2}\]

Apply to dominance by substituting $\boldsymbol{\alpha}$ with $\boldsymbol{d}$ and squaring elements of $\boldsymbol{D}$.

See (Neyhart, Lorenz, and Smith 2019).

Predict means and variances among offspring for the selection index

Selection index mean: \[\mu_{SI} =\boldsymbol{w}^T\hat{\boldsymbol{g}}_{BV}\]

Selection index variance: \[\sigma^2_{SI} =\boldsymbol{w}^T\boldsymbol{G}\boldsymbol{w}\]

The $n\times T$ matrix $\hat{\boldsymbol{g}}_{BV}$ contains the GEBV for each trait and the $T\times 1$ vector $\boldsymbol{w}$ are the index weights. The $T\times T$ matrix $\boldsymbol{G}$ is the additive (or total) genetic variance-covariance matrix for traits on the index.

\[\boldsymbol{G} = \begin{bmatrix} \sigma^2_{Trait_1,Trait_1} & \hdots & \sigma_{Trait_1,Trait_T}\\ \vdots & \ddots & \vdots \\ \sigma_{Trait_1,Trait_T} & \hdots & \sigma^2_{Trait_T,Trait_T}\end{bmatrix}\]

See (Bonk et al. 2016).

Predict the (selection-index) usefulness of the cross (the superior progeny mean)

\[\boldsymbol{UC}_{SI}=\boldsymbol{\mu}_{SI} +\boldsymbol{i}_{SI}\times\boldsymbol{\sigma}_{SI}\]

Distinguish two types of usefulness, concerning breeding value (allele substitution effects) versus total genetic value (total genetic effects, additive + dominance).

\[\hat{UC}_{parent} = \hat{\mu}_{BV}+i\times \hat{\sigma}_{BV}\] \[\hat{UC}_{variety} = \hat{\mu}_{TGV}+i\times \hat{\sigma}_{TGV}\] For simplicity, dropped the “SI” notation.

Example

In the previous vignette, I covered the formatting for inputs to predictCrosses() when the modelType="A". Below, I’ll highlight the features and implementation of the non-additive models.

# recombFreqMat
recombFreqMat<-genmap2recombfreq(genmap, nChr = 2)
recombFreqMat<-1-(2*recombFreqMat)


# Crosses to predict
set.seed(42);
parents<-sample(x = snpeffsA$gblups[[1]]$GID, size = 5, replace = F)
CrossesToPredict<-crosses2predict(parents)

`modelType=“AD”

data(doseMat)
data(haploMat)
predVarsAD<-predictCrosses(modelType = "AD",
                           selInd = TRUE, SIwts = SIwts,
                           CrossesToPredict=CrossesToPredict,
                           snpeffs=gpredsAD$genomicPredOut[[1]],
                           dosages=doseMat,
                           haploMat=haploMat,recombFreqMat=recombFreqMat)
#> [1] "Predicting cross variance parameters"
#> [1] "Done predicting fam vars. Took 0.12 mins for 15 crosses"
#> [1] "Predicting cross means"
#> [1] "Computing SELECTION INDEX means and variances."

predVarsAD$tidyPreds[[1]] %>% head
#> # A tibble: 6 × 9
#>   sireID damID Nsegsnps predOf Trait  predMean predVar predSD predUsefulness
#>   <chr>  <chr>    <int> <chr>  <chr>     <dbl>   <dbl>  <dbl>          <dbl>
#> 1 49     49          22 BV     SELIND    130.     334.   18.3          179. 
#> 2 49     49          22 BV     Trait1    203.     640.   25.3          271. 
#> 3 49     49          22 BV     Trait2    -89.2   1597.   40.0           17.5
#> 4 49     153         46 BV     SELIND    117.     479.   21.9          175. 
#> 5 49     153         46 BV     Trait1    180.    1064.   32.6          267. 
#> 6 49     153         46 BV     Trait2    -73.4   1815.   42.6           40.3

For modelType="AD" there should be two rows per cross per trait and in this case the column Trait also contains predictions on the selection index (SELIND).

predVarsAD$tidyPreds[[1]] %>% arrange(Trait,sireID,damID,predOf) %>% head
#> # A tibble: 6 × 9
#>   sireID damID Nsegsnps predOf Trait  predMean predVar predSD predUsefulness
#>   <chr>  <chr>    <int> <chr>  <chr>     <dbl>   <dbl>  <dbl>          <dbl>
#> 1 146    146         34 BV     SELIND     82.1    509.   22.6           142.
#> 2 146    146         34 TGV    SELIND     82.1    521.   22.8           143.
#> 3 153    146         48 BV     SELIND     92.5    567.   23.8           156.
#> 4 153    146         48 TGV    SELIND     92.5    579.   24.1           157.
#> 5 153    153         33 BV     SELIND    103.     624.   25.0           170.
#> 6 153    153         33 TGV    SELIND    103.     640.   25.3           170.

predVarsAD$rawPreds[[1]]$predMeans[[1]] %>% 
  arrange(Trait,sireID,damID,predOf) %>% head
#> # A tibble: 6 × 7
#>   Trait  sireID damID sireGEBV damGEBV predOf predMean
#>   <chr>  <chr>  <chr>    <dbl>   <dbl> <chr>     <dbl>
#> 1 Trait1 146    146       110.    110. MeanBV     110.
#> 2 Trait1 146    146       110.    110. TGV        110.
#> 3 Trait1 153    146       156.    110. MeanBV     133.
#> 4 Trait1 153    146       156.    110. TGV        133.
#> 5 Trait1 153    153       156.    156. MeanBV     156.
#> 6 Trait1 153    153       156.    156. TGV        156.

predVarsAD$rawPreds[[1]]$predVars[[1]] %>% 
  arrange(sireID,damID,predOf,Trait1,Trait2) %>% slice(1:8)
#> # A tibble: 8 × 8
#>   sireID damID Nsegsnps ComputeTime Trait1 Trait2 predOf predVar
#>   <chr>  <chr>    <int>       <dbl> <chr>  <chr>  <chr>    <dbl>
#> 1 146    146         34       0.483 Trait1 Trait1 VarBV  1232.  
#> 2 146    146         34       0.483 Trait1 Trait2 VarBV  -820.  
#> 3 146    146         34       0.483 Trait2 Trait2 VarBV  1976.  
#> 4 146    146         34       0.483 Trait1 Trait1 VarDD     9.97
#> 5 146    146         34       0.483 Trait1 Trait2 VarDD     9.04
#> 6 146    146         34       0.483 Trait2 Trait2 VarDD    51.3 
#> 7 153    146         48       0.468 Trait1 Trait1 VarBV  1360.  
#> 8 153    146         48       0.468 Trait1 Trait2 VarBV  -863.

`modelType=“DirDom”

For the DirDom model, not much change is needed.

One note: if you got marker effects using runGenomicPredictions(modelType="DirDom") then the SNP-effects generated, used below (gpredsDirDom$genomicPredOut[[1]]) are all you need. If you want to supply your own marker effects, they input must at a bare minimum contain the columns: Trait, allelesubsnpeff, addsnpeff and domsnpeff.

gpredsDirDom$genomicPredOut[[1]] %>% dplyr::select(Trait,allelesubsnpeff,addsnpeff,domsnpeff)
#> # A tibble: 2 × 4
#>   Trait  allelesubsnpeff addsnpeff       domsnpeff      
#>   <chr>  <list>          <list>          <list>         
#> 1 Trait1 <dbl [100 × 1]> <dbl [100 × 1]> <dbl [100 × 1]>
#> 2 Trait2 <dbl [100 × 1]> <dbl [100 × 1]> <dbl [100 × 1]>

predVarsDirDom<-predictCrosses(modelType = "DirDom",
                               selInd = TRUE, SIwts = SIwts,
                               CrossesToPredict=CrossesToPredict,
                               snpeffs=gpredsDirDom$genomicPredOut[[1]],
                               dosages=doseMat,
                               haploMat=haploMat,recombFreqMat=recombFreqMat)
#> [1] "Predicting cross variance parameters"
#> [1] "Done predicting fam vars. Took 0.12 mins for 15 crosses"
#> [1] "Done predicting fam vars. Took 0.12 mins for 15 crosses"
#> [1] "Predicting cross means"
#> [1] "Computing SELECTION INDEX means and variances."

Output in same format as modelType="AD", e.g.:

predVarsDirDom$tidyPreds[[1]] %>% str
#> tibble [90 × 9] (S3: tbl_df/tbl/data.frame)
#>  $ sireID        : chr [1:90] "49" "49" "49" "49" ...
#>  $ damID         : chr [1:90] "49" "49" "49" "153" ...
#>  $ Nsegsnps      : int [1:90] 22 22 22 46 46 46 48 48 48 53 ...
#>  $ predOf        : chr [1:90] "BV" "BV" "BV" "BV" ...
#>  $ Trait         : chr [1:90] "SELIND" "Trait1" "Trait2" "SELIND" ...
#>  $ predMean      : num [1:90] 113.2 178.4 -82.2 103.4 157.4 ...
#>  $ predVar       : num [1:90] 264 571 1296 400 954 ...
#>  $ predSD        : num [1:90] 16.3 23.9 36 20 30.9 ...
#>  $ predUsefulness: num [1:90] 157 242 14 157 240 ...

Additional `predictCrosses()` features

Parallel by family (ncores)
Multi-core matrix algrebra (OpenBLAS) usage (if your R compilation is linked to one) (nBLASthreads)
Toggle cross mean and cross variance predictions on-vs-off: predTheMeans and predTheVars.

Next vignettes

Cross-validation
Further under the hood: functions predCrossVars() and predCrossMeans() (used by predictCrosses())

References

Bonk, Sarah, Manuela Reichelt, Friedrich Teuscher, Dierck Segelke, and Norbert Reinsch. 2016. “Mendelian Sampling Covariability of Marker Effects and Genetic Values.” Genetics Selection Evolution 48 (1). https://doi.org/10.1186/s12711-016-0214-0.

Covarrubias-Pazaran, Giovanny. 2016. “Genome-Assisted Prediction of Quantitative Traits Using the R Package Sommer.” Edited by Aimin Zhang. PLOS ONE 11 (6): e0156744. https://doi.org/10.1371/journal.pone.0156744.

Garrick, Dorian J, Jeremy F Taylor, and Rohan L Fernando. 2009. “Deregressing Estimated Breeding Values and Weighting Information for Genomic Regression Analyses.” Genetics Selection Evolution 41 (1). https://doi.org/10.1186/1297-9686-41-55.

Lehermeier, Christina, Simon Teyssèdre, and Chris-Carolin Schön. 2017. “Genetic Gain Increases by Applying the Usefulness Criterion with Improved Variance Prediction in Selection of Crosses.” Genetics 207 (4): 1651–61. https://doi.org/10.1534/genetics.117.300403.

Neyhart, Jeffrey L, Aaron J Lorenz, and Kevin P Smith. 2019. “Multi-Trait Improvement by Predicting Genetic Correlations in Breeding Crosses.” G3 Genes|Genomes|Genetics 9 (10): 3153–65. https://doi.org/10.1534/g3.119.400406.

Toro, Miguel A, and Luis Varona. 2010. “A Note on Mate Allocation for Dominance Handling in Genomic Selection.” Genetics Selection Evolution 42 (1). https://doi.org/10.1186/1297-9686-42-33.

Varona, Luis, Andres Legarra, Miguel A. Toro, and Zulma G. Vitezica. 2018. “Non-Additive Effects in Genomic Selection.” Frontiers in Genetics 9 (March). https://doi.org/10.3389/fgene.2018.00078.

Vitezica, Zulma G, Luis Varona, and Andres Legarra. 2013. “On the Additive and Dominant Variance and Covariance of Individuals Within the Genomic Selection Scope.” Genetics 195 (4): 1223–30. https://doi.org/10.1534/genetics.113.155176.

Werner, Christian R., R. Chris Gaynor, Daniel J. Sargent, Alessandra Lillo, Gregor Gorjanc, and John M. Hickey. 2020. “Genomic Selection Strategies for Clonally Propagated Crops.” http://dx.doi.org/10.1101/2020.06.15.152017.

Wolfe, Marnin D, Ariel W Chan, Peter Kulakow, Ismail Rabbi, and Jean-Luc Jannink. 2021. “Genomic Mating in Outbred Species: Predicting Cross Usefulness with Additive and Total Genetic Covariance Matrices.” Genetics 219 (3). https://doi.org/10.1093/genetics/iyab122.

Xiang, Tao, Ole Fredslund Christensen, Zulma Gladis Vitezica, and Andres Legarra. 2016. “Genomic Evaluation by Including Dominance Effects and Inbreeding Depression for Purebred and Crossbred Performance with an Application in Pigs.” Genetics Selection Evolution 48 (1). https://doi.org/10.1186/s12711-016-0271-4.

Genomic predictions with non-additive effects