Assess the accuracy of predictions for previously unobserved crosses, specifically the accuracy of predicting the family mean and the variance among family members, for both breeding values and total genetic values. The cross-validation procedure is described in detail in the manuscript https://www.biorxiv.org/content/10.1101/2021.01.05.425443v1; see "Details" below. The user supplies a pedigree, haplotypes and other inputs.

runParentWiseCrossVal(
  nrepeats,
  nfolds,
  seed = NULL,
  modelType,
  ncores = 1,
  nBLASthreads = NULL,
  outName = NULL,
  ped = ped,
  gid = "GID",
  blups,
  dosages,
  grms,
  haploMat,
  recombFreqMat,
  selInd,
  SIwts = NULL,
  ...
)

Arguments

nrepeats

number of repeats

nfolds

number of folds

seed

integer, seed that makes the parent train-test folds reproducible

modelType

string, one of "A", "AD", "DirDom". modelType="A": additive-only model; predicts family means and variances for breeding values (BVs). modelType="AD": the "classic" add-dom model; predicts family means and variances for BVs based on allele substitution effects, and predicts family variance-covariances for total genetic values (TGVs; BVs+DDs), but does not predict family mean TGV. modelType="DirDom": the "genotypic" add-dom model with proportion homozygous fit as a fixed effect, to estimate a genome-wide inbreeding effect. It obtains add-dom effects, computes allele substitution effects (\(\alpha = a + d(q-p)\)), predicts family means and covariances for BVs using the allele substitution effects, and predicts both the covariances and the means for TGVs directly from the add-dom effects. The estimated linear effect of overall homozygosity (b) is interpreted as inbreeding depression or heterosis, depending on its direction relative to each trait (Xiang et al. 2016). This genome-wide fixed effect (b) can be incorporated into the predicted means and variances by dividing it by the number of effects (p) and subtracting that value from the vector of estimated dominance effects (\(d^{*}\)), giving \(d = d^{*} - \frac{b}{p}\). Note that p here is the number of effects, whereas in the formula for \(\alpha\) p and q are allele frequencies.
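The two formulas above can be illustrated with a minimal base-R sketch. This is not the package's internal code: all values and the names `a`, `d_star`, `b`, `freq_p` and `n_effects` are made up for illustration. It assumes that the homozygosity adjustment is applied to the dominance effects before the allele substitution effects are computed, which the text does not state explicitly, and that `freq_p` is the frequency of the counted allele.

```r
# Hypothetical example values; none of these come from the package.
a      <- c(0.5, -0.2, 0.1)   # additive effects, one per SNP
d_star <- c(0.3, 0.0, -0.1)   # dominance effects as estimated by the model (d*)
b      <- -0.6                # genome-wide fixed effect of homozygosity
freq_p <- c(0.7, 0.5, 0.2)    # assumed: frequency of the counted allele per SNP
freq_q <- 1 - freq_p

# Fold the homozygosity effect into the dominance effects: d = d* - b / p,
# where p is the number of effects (SNPs), not an allele frequency.
n_effects <- length(d_star)
d <- d_star - b / n_effects

# Allele substitution effects: alpha = a + d * (q - p).
# Assumption: the adjusted d is used here.
alpha <- a + d * (freq_q - freq_p)
```

Naming the number of effects `n_effects` rather than `p` keeps it visually distinct from the allele frequency.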

ncores

number of cores

nBLASthreads

number of cores for each worker to use for multi-thread BLAS

outName

default=NULL (optional), name and path to save outputs

ped

data.frame, 3 columns, "GID" (or gid), "sireID", "damID" for male and female parent, respectively.

gid

string, variable name used for genotype IDs in e.g. blups (default="GID")

blups

nested data.frame with list-column "blups" containing

dosages

dosage matrix. required only for modelType=="DirDom". Assumes SNPs coded 0, 1, 2. Nind rows x Nsnp cols, numeric matrix, with rownames and colnames to indicate SNP/ind ID

grms

list of genomic relationship matrices (GRMs, aka kinship matrices). Any genotypes in the GRMs get predicted, with or without phenotypes. Each element is named either A or D. The matrices supplied must match those required by the chosen model (A, AD or DirDom), e.g. grms=list(A=A,D=D).

haploMat

matrix of phased haplotypes, 2 rows per sample, cols = loci, 0,1, rownames assumed to contain GIDs with a suffix, separated by "_" to distinguish haplotypes
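As a rough illustration of this layout, the base-R sketch below builds a toy haplotype matrix and collapses it to 0/1/2 dosages. The suffixes "HapA" and "HapB" and all IDs are hypothetical; the documentation only states that a suffix follows a "_" separator.

```r
# Hypothetical toy data: 2 individuals, 3 loci, alleles coded 0/1.
haploMat <- matrix(
  c(0, 1, 1,
    1, 1, 0,
    0, 0, 1,
    0, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("ind1_HapA", "ind1_HapB", "ind2_HapA", "ind2_HapB"),
                  c("snp1", "snp2", "snp3"))
)

# Recover the GID by dropping the final "_suffix" from each row name.
gids <- sub("_[^_]*$", "", rownames(haploMat))

# Summing the two haplotypes of each individual gives 0/1/2 dosages.
dosages <- rowsum(haploMat, group = gids)
```

Stripping only the last "_"-delimited piece keeps GIDs intact when they contain underscores themselves.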

recombFreqMat

a square symmetric matrix with values = (1-2*c1), where c1 is the matrix of expected recombination frequencies. The 1-2*c1 transformation is done outside the function for computational efficiency, since every operation on a big matrix takes time.
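A minimal sketch of that pre-computation, using made-up recombination frequencies for three loci. The name `c1` follows the description above; the SNP names, and the use of dimnames at all, are assumptions for illustration.

```r
# Hypothetical expected recombination frequencies between 3 loci (c1).
snps <- c("snp1", "snp2", "snp3")
c1 <- matrix(
  c(0.00, 0.10, 0.50,
    0.10, 0.00, 0.25,
    0.50, 0.25, 0.00),
  nrow = 3, byrow = TRUE,
  dimnames = list(snps, snps)
)

# Transform once, outside the function, and pass the result in.
recombFreqMat <- 1 - 2 * c1
```

With this transformation the diagonal equals 1 and a pair of unlinked loci (recombination frequency 0.5) gets the value 0.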

selInd

logical, TRUE/FALSE, whether to compute selection index accuracy estimates; if TRUE, requires input weights via SIwts

SIwts

required if selInd=TRUE, named vector of selection index weights, names match the "Trait" variable in blups

...

Value

tibble, one row, two list columns (basically a named two-element list of lists): meanPredAccuracy and varPredAccuracy, both containing tibbles. The column "AccuracyEst" holds the family-size weighted prediction accuracy estimates. If selInd=TRUE, the corresponding accuracy is labelled "SELIND" in the "Trait" columns. The column "predVSobs" is a list of tibbles, each containing the paired predicted and observed values for the given repeat-fold-trait.

Details

First, define a vector, \(\boldsymbol{P}\) of the parents listed in the pedigree. Define also a second vector \(\boldsymbol{C}\) listing the genotypes (clones) in the pedigree, including the parents (\(\boldsymbol{P}\subset\boldsymbol{C}\)).

Conducts nrepeats replications of the following procedure:

  1. Define parent-wise cross-validation folds: randomly assign the parents in \(\boldsymbol{P}\) to \(\textit{k}\) folds. Denote by \(\boldsymbol{P}_{TST}^k\) the list of "test" parents in the \(\textit{k}\)th fold.

  2. For each of the k folds (sets of "test" parents), divide the clones vector \(\boldsymbol{C}\) into two mutually exclusive sets: "training" (\(\boldsymbol{C}_{TRN}\)) and "validation" (\(\boldsymbol{C}_{VLD}\)). From the set \(\boldsymbol{C}_{TRN}\), we exclude all descendants (offspring, grandchildren, great-grandchildren, etc.) of \(\boldsymbol{P}_{TST}^k\). We include the \(\boldsymbol{P}_{TST}^k\) themselves (mimicking phenotyping the parents before predicting their offspring) and any non-descendants. Define \(\boldsymbol{C}_{VLD}\) simply as the set difference between \(\boldsymbol{C}\) and \(\boldsymbol{C}_{TRN}\).

  3. Estimate marker effects independently by fitting mixed models (see the manuscript for further details) to \(\boldsymbol{C}_{VLD}\) and \(\boldsymbol{C}_{TRN}\) corresponding to each \(\boldsymbol{P}_{TST}^k\).

  4. For each \(\boldsymbol{P}_{TST}^k\), define the set of crosses to predict, \(\boldsymbol{X}_{toPred}^k\) to include any of the 462 actual families (sire-dam pairs) in the pedigree, in which the \(\boldsymbol{P}_{TST}^k\) were involved. By construction, the real family members that have been observed for each of the \(\boldsymbol{X}_{toPred}^k\) were excluded from the model used to get marker effects for \(\boldsymbol{C}_{TRN}\), and included in the model for \(\boldsymbol{C}_{VLD}\). Predict the means, variances and covariances for each focal trait in each cross, \(\boldsymbol{X}_{toPred}^k\) using the \(\boldsymbol{C}_{TRN}\) marker effects only.

  5. For each family in \(\boldsymbol{X}_{toPred}^k\), using all existing family members, compute the sample means, variances and covariances for GEBV and GETGV as predicted by the \(\boldsymbol{C}_{VLD}\) marker effects.

  6. Calculate the accuracy of prediction for each mean (\(\overset{\mu_{T}}{\textbf{cor}}_{BV}\), \(\overset{\mu_{T}}{\textbf{cor}}_{TGV}\)), variance (\(\overset{\sigma^2_{t=t}}{\textbf{cor}}_{BV}\), \(\overset{\sigma^2_{t=t}}{\textbf{cor}}_{TGV}\)) and covariance (\(\overset{\sigma_{t \neq t}}{\textbf{cor}}_{BV}\), \(\overset{\sigma_{t \neq t}}{\textbf{cor}}_{TGV}\)) in terms of both BV and TGV. For \(\overset{\mu_{T}}{\textbf{cor}}\) we used the Pearson correlation between predicted and sample mean GEBV/GETGV. For \(\overset{\sigma^2_{t=t}}{\textbf{cor}}\) and \(\overset{\sigma_{t \neq t}}{\textbf{cor}}\), only families with more than two members could be included, and we weighted the correlation between the predicted and sample (co)variance of GEBV/GETGV according to family size (using the function cor.wt from the R package psych). For the sake of comparison, we also include accuracies in the supplement where predicted values are correlated with phenotypic (rather than genomic-predicted) BLUPs, e.g. \(\overset{\mu_{T}}{\textbf{cor}}_{BV,BLUP}\), \(\overset{\mu_{T}}{\textbf{cor}}_{TGV,BLUP}\), etc.
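Steps 1, 2 and 6 of the procedure above can be sketched in base R as follows. This is an illustrative sketch, not the package's implementation: the toy pedigree, the helper names `get_descendants` and `split_clones`, and the example accuracy data are all hypothetical. It also uses `stats::cov.wt(..., cor = TRUE)` as a dependency-free stand-in for `psych::cor.wt`.

```r
# Hypothetical toy pedigree; founders have NA parents.
ped <- data.frame(
  GID    = c("p1", "p2", "p3", "p4", "c1", "c2", "c3", "g1"),
  sireID = c(NA, NA, NA, NA, "p1", "p1", "p3", "c1"),
  damID  = c(NA, NA, NA, NA, "p2", "p2", "p4", "c3"),
  stringsAsFactors = FALSE
)
parents <- unique(na.omit(c(ped$sireID, ped$damID)))  # the vector P

# Step 1: randomly assign the parents to k folds.
set.seed(42)
k <- 2
fold_id <- sample(rep(seq_len(k), length.out = length(parents)))
folds <- split(parents, fold_id)

# Step 2: collect all descendants of the test parents by walking the
# pedigree one generation at a time until no new individuals turn up.
get_descendants <- function(ped, test_parents) {
  found <- character(0)
  frontier <- test_parents
  while (length(frontier) > 0) {
    kids <- ped$GID[ped$sireID %in% frontier | ped$damID %in% frontier]
    frontier <- setdiff(kids, found)
    found <- union(found, frontier)
  }
  found
}

# Training set = everyone except the descendants; validation = the rest.
split_clones <- function(ped, test_parents) {
  vld <- get_descendants(ped, test_parents)
  list(train = setdiff(ped$GID, vld), validation = vld)
}
example_split <- split_clones(ped, "p1")

# Step 6: family-size weighted correlation of predicted vs. observed
# variances (made-up numbers; families with > 2 members only).
pred_vs_obs <- data.frame(
  predVar = c(1.0, 2.0, 3.0, 4.0),
  obsVar  = c(1.2, 1.9, 3.5, 3.8),
  famSize = c(3, 10, 5, 8)
)
wt <- pred_vs_obs$famSize / sum(pred_vs_obs$famSize)
acc <- cov.wt(pred_vs_obs[, c("predVar", "obsVar")],
              wt = wt, cor = TRUE)$cor[1, 2]
```

In the toy pedigree, holding out `p1` as a test parent moves its offspring and grandchild into the validation set while `p1` itself stays in training.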

See also

Other CrossVal: runCrossVal()