Run parent-wise cross-validation

Assess the accuracy of predictions for previously unobserved crosses, specifically the accuracy of predicting the mean and variance among family members in breeding values and total genetic values. The cross-validation procedure is described in detail in the manuscript https://www.biorxiv.org/content/10.1101/2021.01.05.425443v1; see "Details" below. The user supplies a pedigree, haplotypes and other inputs.

runParentWiseCrossVal(
  nrepeats,
  nfolds,
  seed = NULL,
  modelType,
  ncores = 1,
  nBLASthreads = NULL,
  outName = NULL,
  ped = ped,
  gid = "GID",
  blups,
  dosages,
  grms,
  haploMat,
  recombFreqMat,
  selInd,
  SIwts = NULL,
  ...
)

Arguments

nrepeats: number of repeats
nfolds: number of folds
seed: integer, makes the parent train-test folds reproducible
modelType: string, one of "A", "AD", "DirDom".
modelType="A": additive-only model; predicts family means and variances for breeding values (BVs).
modelType="AD": the "classic" add-dom model; predicts family means and variances for BVs based on allele substitution effects, and family variance-covariances for TGVs (BVs+DDs). Does not predict family mean TGV.
modelType="DirDom": the "genotypic" add-dom model with proportion homozygous fit as a fixed effect, to estimate a genome-wide inbreeding effect. Obtains add-dom effects and computes allele substitution effects (\(\alpha = a + d(q-p)\)). Predicts family means and covariances for BVs using the allele substitution effects, and predicts covariances AND means for TGVs using the add-dom effects directly. The estimated linear effect of overall homozygosity (b) is interpreted as inbreeding depression or heterosis depending on its direction relative to each trait (Xiang et al. 2016). This genome-wide fixed effect of homozygosity (b) can be incorporated into the predicted means and variances by first dividing it by the number of effects (p) and subtracting that value from the vector of dominance effects (\(d^{*}\)), to get \(d = d^{*} - \frac{b}{p}\).
ncores: number of cores
nBLASthreads: number of cores for each worker to use for multi-thread BLAS
outName: default=NULL (optional), name and path to save outputs
ped: data.frame, 3 columns, "GID" (or gid), "sireID", "damID" for male and female parent, respectively.
gid: string, variable name used for genotype IDs in e.g. blups (default="GID")
blups: nested data.frame with list-column "blups" containing
dosages: dosage matrix. required only for modelType=="DirDom". Assumes SNPs coded 0, 1, 2. Nind rows x Nsnp cols, numeric matrix, with rownames and colnames to indicate SNP/ind ID
grms: list of genomic relationship matrices (GRMs, aka kinship matrices). Any genotypes in the GRMs get predicted, with or without phenotypes. Each element is named either A or D. The matrices supplied must match those required by the chosen model (A, AD or DirDom), e.g. grms=list(A=A,D=D).
haploMat: matrix of phased haplotypes, 2 rows per sample, cols = loci, 0,1, rownames assumed to contain GIDs with a suffix, separated by "_" to distinguish haplotypes
recombFreqMat: a square symmetric matrix with values = (1-2*c1), where c1=matrix of expected recomb. frequencies. The choice to do 1-2c1 outside the function was made for computation efficiency; every operation on a big matrix takes time.
selInd: logical, TRUE/FALSE. If TRUE, selection index accuracy estimates are computed; requires input weights via SIwts
SIwts: required if selInd=TRUE, named vector of selection index weights, names match the "Trait" variable in blups
...
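
Two of the small computations mentioned in the argument descriptions above can be sketched in base R. This is a toy illustration under stated assumptions, not the package's internal code: all object names (c1, d_star, n_eff, a_eff, p_freq) and all numbers are hypothetical, and using the adjusted dominance effects d when computing \(\alpha\) is an assumption. Note that p in \(d = d^{*} - \frac{b}{p}\) is the number of effects, whereas p in \(\alpha = a + d(q-p)\) is an allele frequency.

```r
# Toy illustration only: all numbers and object names below are hypothetical.

# recombFreqMat: pre-compute 1 - 2*c1 once, outside the function.
# c1 holds expected recombination frequencies between 3 loci
# (symmetric, zero on the diagonal).
snps <- paste0("snp", 1:3)
c1 <- matrix(c(0.00, 0.10, 0.25,
               0.10, 0.00, 0.20,
               0.25, 0.20, 0.00),
             nrow = 3, byrow = TRUE, dimnames = list(snps, snps))
recombFreqMat <- 1 - 2 * c1

# DirDom: fold the genome-wide homozygosity effect b into the dominance effects.
d_star <- c(snp1 = 0.30, snp2 = -0.10, snp3 = 0.05)  # estimated dominance effects (d*)
b      <- -0.6                                        # fixed effect of homozygosity
n_eff  <- length(d_star)                              # number of effects ("p" in d* - b/p)
d      <- d_star - b / n_eff                          # d = d* - b/p

# Allele substitution effects: alpha = a + d * (q - p),
# where p and q are allele frequencies (p + q = 1).
a_eff  <- c(snp1 = 0.50, snp2 = 0.20, snp3 = -0.40)  # additive effects
p_freq <- c(snp1 = 0.20, snp2 = 0.50, snp3 = 0.70)   # assumed allele frequencies
q_freq <- 1 - p_freq
alpha  <- a_eff + d * (q_freq - p_freq)
```

Computing recombFreqMat once up front means the large matrix operation is not repeated for every fold and repeat.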

Value

tibble, one row, two list columns (basically a named two-element list of lists): meanPredAccuracy and varPredAccuracy, both of which contain tibbles. The column "AccuracyEst" holds the family-size weighted prediction accuracy estimates. If selInd=TRUE then the corresponding accuracy is labelled "SELIND" in the "Trait" columns. The column "predVSobs" is a list of tibbles, each containing the paired predicted and observed values for the given repeat-fold-trait.

Details

First, define a vector, \(\boldsymbol{P}\) of the parents listed in the pedigree. Define also a second vector \(\boldsymbol{C}\) listing the genotypes (clones) in the pedigree, including the parents (\(\boldsymbol{P}\subset\boldsymbol{C}\)).
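
As a minimal sketch of these two definitions, assuming a toy pedigree with the three columns described for the ped argument (the IDs are hypothetical, and treating founders that appear only as parents as members of \(\boldsymbol{C}\) is an assumption):

```r
# Toy pedigree; IDs are hypothetical. Columns follow the `ped` argument layout.
ped <- data.frame(
  GID    = c("c1", "c2", "c3", "c4", "c5"),
  sireID = c("pA", "pA", "pB", "c1", "c1"),
  damID  = c("pB", "pC", "pC", "pC", "c3"),
  stringsAsFactors = FALSE
)

# P: every genotype that appears as a sire or a dam
P <- union(ped$sireID, ped$damID)

# C: all genotypes in the pedigree, parents included, so that P is a subset of C
C <- union(ped$GID, P)

all(P %in% C)
```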

Conducts nrepeats replications of the following procedure:

Define parent-wise cross-validation folds: randomly assign the parents in \(\boldsymbol{P}\) to \(\textit{k}\) folds. Let \(\boldsymbol{P}_{TST}^k\) denote the list of "test" parents in the \(\textit{k}\)th fold.
For each of the k folds (sets of "test" parents), divide the clones vector \(\boldsymbol{C}\) into two mutually exclusive sets: "training" (\(\boldsymbol{C}_{TRN}\)) and "validation" (\(\boldsymbol{C}_{VLD}\)). From the set \(\boldsymbol{C}_{TRN}\), we exclude all descendants (offspring, grandchildren, great grandchildren, etc.) of \(\boldsymbol{P}_{TST}^k\). We include the \(\boldsymbol{P}_{TST}^k\) themselves (phenotyping the parents before predicting their offspring) and any non-descendants. Define \(\boldsymbol{C}_{VLD}\) simply as the set difference between \(\boldsymbol{C}\) and \(\boldsymbol{C}_{TRN}\).
Estimate marker effects independently by fitting mixed-models (see section below for further details) to \(\boldsymbol{C}_{VLD}\) and \(\boldsymbol{C}_{TRN}\) corresponding to each \(\boldsymbol{P}_{TST}^k\).
For each \(\boldsymbol{P}_{TST}^k\), define the set of crosses to predict, \(\boldsymbol{X}_{toPred}^k\) to include any of the 462 actual families (sire-dam pairs) in the pedigree, in which the \(\boldsymbol{P}_{TST}^k\) were involved. By construction, the real family members that have been observed for each of the \(\boldsymbol{X}_{toPred}^k\) were excluded from the model used to get marker effects for \(\boldsymbol{C}_{TRN}\), and included in the model for \(\boldsymbol{C}_{VLD}\). Predict the means, variances and covariances for each focal trait in each cross, \(\boldsymbol{X}_{toPred}^k\) using the \(\boldsymbol{C}_{TRN}\) marker effects only.
For each family in \(\boldsymbol{X}_{toPred}^k\), using all existing family members, compute the sample means, variances and covariances for GEBV and GETGV as predicted by the \(\boldsymbol{C}_{VLD}\) marker effects.
Calculate the accuracy of prediction for each mean (\(\overset{\mu_{T}}{\textbf{cor}}_{BV}\), \(\overset{\mu_{T}}{\textbf{cor}}_{TGV}\)), variance (\(\overset{\sigma^2_{t=t}}{\textbf{cor}}_{BV}\), \(\overset{\sigma^2_{t=t}}{\textbf{cor}}_{TGV}\)) and covariance (\(\overset{\sigma_{t \neq t}}{\textbf{cor}}_{BV}\), \(\overset{\sigma_{t \neq t}}{\textbf{cor}}_{TGV}\)) in terms of both BV and TGV. For \(\overset{\mu_{T}}{\textbf{cor}}\) we used the Pearson correlation between predicted and sample mean GEBV/GETGV. For \(\overset{\sigma^2_{t=t}}{\textbf{cor}}\) and \(\overset{\sigma_{t \neq t}}{\textbf{cor}}\), only families with more than two members could be included, and we weighted the correlation between the predicted and sample (co)variance of GEBV/GETGV according to family size (function cor.wt in the R package psych). For the sake of comparison, we also include accuracies in the supplement where predicted values are correlated to phenotypic (rather than genomic-predicted) BLUPs, e.g. \(\overset{\mu_{T}}{\textbf{cor}}_{BV,BLUP}\), \(\overset{\mu_{T}}{\textbf{cor}}_{TGV,BLUP}\), etc.
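
The fold construction, the training/validation split and the family-size weighted correlation in the steps above can be sketched in base R. This is a rough illustration under stated assumptions, not the package's implementation: the pedigree, the helper names get_descendants and split_clones, and the (co)variance values are all hypothetical, and stats::cov.wt is used here as a base-R stand-in for psych::cor.wt. Marker-effect estimation and cross prediction are omitted.

```r
# Toy sketch; IDs, helper names and numbers are hypothetical.
ped <- data.frame(
  GID    = c("c1", "c2", "c3", "c4", "c5"),
  sireID = c("pA", "pA", "pB", "c1", "c1"),
  damID  = c("pB", "pC", "pC", "pC", "c3"),
  stringsAsFactors = FALSE
)
P <- union(ped$sireID, ped$damID)  # parents
C <- union(ped$GID, P)             # all clones, parents included

# Randomly assign the parents to k folds.
set.seed(42)
k <- 2
fold_id <- sample(rep(seq_len(k), length.out = length(P)))
test_parents <- split(P, fold_id)

# Collect all descendants of a set of parents, one generation at a time.
get_descendants <- function(parents, ped) {
  found <- character(0)
  frontier <- parents
  while (length(frontier) > 0) {
    kids <- ped$GID[ped$sireID %in% frontier | ped$damID %in% frontier]
    frontier <- setdiff(kids, found)  # only newly discovered descendants
    found <- union(found, frontier)
  }
  found
}

# Training set: the test parents themselves plus all non-descendants.
# Validation set: everything else in C.
split_clones <- function(p_tst, ped, C) {
  desc <- get_descendants(p_tst, ped)
  trn <- union(p_tst, setdiff(C, desc))
  list(trn = trn, vld = setdiff(C, trn))
}
splits <- lapply(test_parents, split_clones, ped = ped, C = C)

# Family-size weighted correlation of predicted vs. sample variances,
# restricted to families with more than two members.
fam <- data.frame(
  predVar = c(1.2, 0.8, 1.5, 0.9, 1.1),
  obsVar  = c(1.0, 0.7, 1.6, 1.1, 0.4),
  famSize = c(12, 5, 20, 3, 2)
)
fam <- fam[fam$famSize > 2, ]
weighted_cor <- cov.wt(fam[, c("predVar", "obsVar")],
                       wt = fam$famSize, cor = TRUE)$cor[1, 2]
```

In this sketch a test parent stays in the training set even if it descends from another test parent in the same fold, following the statement that the test parents themselves are included.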
