Appendix

Last updated: 2021-05-10

Checks: 7 0

Knit directory: PredictOutbredCrossVar/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20191123)

The command set.seed(20191123) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 545ab6c

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 545ab6c. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    analysis/.DS_Store
    Ignored:    output/.DS_Store

Untracked files:
    Untracked:  Icon
    Untracked:  PredictOutbredCrossVarMS_ResponseToReviews_R1.gdoc
    Untracked:  analysis/DavisResearchSeminar.Rmd
    Untracked:  analysis/accuracies.png
    Untracked:  analysis/docs/
    Untracked:  analysis/exploratorypreds.png
    Untracked:  analysis/images/
    Untracked:  manuscript/
    Untracked:  output/crossPredictions/
    Untracked:  output/gblups_DirectionalDom_parentwise_crossVal_folds.rds
    Untracked:  output/gblups_geneticgroups.rds
    Untracked:  output/gblups_parentwise_crossVal_folds.rds
    Untracked:  output/mtMarkerEffects/

Unstaged changes:
    Modified:   README.md
    Modified:   analysis/NGCleadersCall.Rmd
    Modified:   data/blups_forawcdata.rds
    Modified:   data/genmap_awc_May2020.rds
    Modified:   data/parentwise_crossVal_folds.rds
    Modified:   data/ped_awc.rds
    Modified:   data/selection_index_weights_4traits.rds
    Modified:   output/CrossesToPredict_top100stdSI_and_209originalParents.rds
    Modified:   output/accuraciesMeans.rds
    Modified:   output/accuraciesUC.rds
    Modified:   output/accuraciesVars.rds
    Modified:   output/crossRealizations/realizedCrossMeans.rds
    Modified:   output/crossRealizations/realizedCrossMeans_BLUPs.rds
    Modified:   output/crossRealizations/realizedCrossMetrics.rds
    Modified:   output/crossRealizations/realizedCrossVars.rds
    Modified:   output/crossRealizations/realizedCrossVars_BLUPs.rds
    Modified:   output/crossRealizations/realized_cross_means_and_covs_traits.rds
    Modified:   output/crossRealizations/realized_cross_means_and_vars_selindices.rds
    Modified:   output/ddEffects.rds
    Modified:   output/gebvs_ModelA_GroupAll_stdSI.rds
    Modified:   output/obsVSpredMeans.rds
    Modified:   output/obsVSpredUC.rds
    Modified:   output/obsVSpredVars.rds
    Modified:   output/pmv_DirectionalDom_varcomps_geneticgroups.rds
    Modified:   output/pmv_varcomps_geneticgroups.rds
    Modified:   output/pmv_varcomps_geneticgroups_tidy_includingSIvars.rds
    Modified:   output/propHomozygous.rds
    Modified:   output/top100stdSI.rds

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/Appendix.Rmd) and HTML (docs/Appendix.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
html	3100a7d	wolfemd	2021-03-24	Build site.
Rmd	45e6b20	wolfemd	2021-03-24	Added the Appendix as a standalone HTML page. Include links on the Index page and Navbar (_site.yml).

Does the validation-data type (i.i.d. BLUPs vs. GBLUPs) make a difference? Most often, cross-validation done to test genomic prediction accuracy uses validation data (the stand-in for “truth”) consisting of adjusted values, (e.g. BLUPs or BLUEs) for total individual performance, not including genomic relatedness information. In our study we set-up cross-validation folds that enable us to predict the GEBV and GETGV (GBLUPs) of validation family-members, and to subsequently compute their sample means, variances and usefulness. This approach has the added advantage of expanding the available sample size of validation progeny with complete data across traits. Nevertheless, we made some comparison to results using BLUPs that do not incorporate genomic relatedness information; in other words, independent and identically distributed (i.i.d.) BLUPs.

Prediction accuracy for family means were nearly uniformly higher using GBLUPs compared to iidBLUPs (median 0.18 higher). The Spearman rank correlation between prediction accuracies based on iidBLUPs and GBLUPs was high (median 0.75, range 0.55-0.84). Similar to the means, accuracy using GBLUP-validation-data appeared mostly higher compared to iidBLUPs (median difference GBLUPs-iidBLUPs = 0.07, interquartile range -0.002-0.14). The Spearman rank correlations of iidBLUP and GBLUP-validation-based accuracies was positive for family (co)variances, but smaller compared to family means (mean correlation 0.5, range 0.04-0.89). Supplementary plots comparing validation-data accuracies for means and (co)variances were inspected (Figure S6-S7). Based on this, we conclude that we would reach similar though more muted conclusions about which trait variances and trait-trait covariances are best or worst predicted, if restricted to iidBLUPs for validation data.

What if we consider only families with greater than a threshold size? In our primary analysis, we computed (co)variance prediction accuracies with weighted correlations, considering any family with more than one member. We also considered a more conservative alternative approach of including only families with \(\geq\) 10 (n=112); we thought beyond that was too stringent as at \(\geq\) 20 only 22 families remain. The Spearman rank correlation between accuracy estimates when all vs. only families with more than 10 members was 0.89. There should therefore be good concordance with our primary conclusions, depending on the family size threshold we impose. The median difference in accuracy (“threshold size families” minus "all families’’) was 0.01. Considering only size 10 or greater families noticeably improved prediction accuracy for several trait variances and especially for two covariances (DM-TCHART and logFYLD-MCMDS) (Figure S8).

Comparing posterior mean variance (PMV) to variance of posterior mean (VPM) predictions: Variances and covariances were predicted with the computationally intensive PMV method. Population variance estimates based on PMV were consistently larger than VPM, but the correlation of those estimates is 0.98 (Figure S9). Using the predictions from the cross-validation results, we further observed that the PMV predictions were consistently larger and most notably that the correlation between PMV and VPM was very high (0.995). Some VPM prediction accuracies actually appear better than PMV predictions (Figure S10).

The critical point is that VPM and PMV predictions should have very similar rankings. In our primary analysis, we focus on the PMV results with the only exception being the exploratory predictions where we saved time/computation and used the VPM. If implementing mate selections via the usefulness criteria, choosing the VPM method would mostly have the consequence of shrinking the influence on selection decisions towards the mean.

Comparing the directional dominance to the “classic” model: Our focus in this article was not in finding the optimal or most accurate prediction model for obtaining marker effects. However, genome-wide estimates of directional dominance have not previously been made in cassava. For this reason, we make some brief comparison to the standard or “classic” additive-dominance prediction model, where dominance effects are centered on zero. Overall, the ranking of models and predictions between the two models were similar, as indicated by a rank correlation between model accuracy estimates of 0.98 for family means and 0.94 for variances and covariances. Three-quarters of family-mean and almost half of (co)variance accuracy estimates were higher using the directional dominance model. The most notably improved predictions were for the family-mean logFYLD TGV (Figure S11-S12). There was also an overall rank correlation of 0.98 between models in the prediction of untested crosses.

sessionInfo()

R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] workflowr_1.6.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6        whisker_0.4       knitr_1.32        magrittr_2.0.1   
 [5] R6_2.5.0          rlang_0.4.10      fansi_0.4.2       stringr_1.4.0    
 [9] tools_4.0.3       xfun_0.22         utf8_1.2.1        git2r_0.28.0     
[13] jquerylib_0.1.3   htmltools_0.5.1.1 ellipsis_0.3.1    rprojroot_2.0.2  
[17] yaml_2.2.1        digest_0.6.27     tibble_3.1.1      lifecycle_1.0.0  
[21] crayon_1.4.1      later_1.1.0.1     sass_0.3.1        vctrs_0.3.7      
[25] promises_1.2.0.1  fs_1.5.0          glue_1.4.2        evaluate_0.14    
[29] rmarkdown_2.7     stringi_1.5.3     bslib_0.2.4       compiler_4.0.3   
[33] pillar_1.6.0      jsonlite_1.7.2    httpuv_1.5.5      pkgconfig_2.0.3

Appendix

2021-March-25