Last updated: 2021-03-24

Checks: 7 passed, 0 failed

Knit directory: PredictOutbredCrossVar/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20191123) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 45e6b20. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
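For example (a minimal sketch; the file name and message are hypothetical), wflow_publish commits the R Markdown file and rebuilds the HTML in one step:

wflow_publish("analysis/Results.Rmd", message = "Update results")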


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    output/.DS_Store

Untracked files:
    Untracked:  Icon
    Untracked:  PredictOutbredCrossVarMS_ResponseToReviews_R1.gdoc
    Untracked:  figure/
    Untracked:  manuscript/
    Untracked:  output/crossPredictions/
    Untracked:  output/gblups_DirectionalDom_parentwise_crossVal_folds.rds
    Untracked:  output/gblups_geneticgroups.rds
    Untracked:  output/gblups_parentwise_crossVal_folds.rds
    Untracked:  output/mtMarkerEffects/

Unstaged changes:
    Modified:   analysis/NGCleadersCall.Rmd
    Modified:   code/fitDirectionalDomMtBRR.R
    Modified:   code/fitmtBRR.R
    Modified:   code/getDirectionalDomGenomicBLUPs.R
    Modified:   code/getDirectionalDomMtCrossMeanPreds.R
    Modified:   code/getDirectionalDomMtCrossVarBVpreds.R
    Modified:   code/getDirectionalDomMtCrossVarTGVpreds.R
    Modified:   code/getDirectionalDomVarComps.R
    Modified:   code/getGenomicBLUPs.R
    Modified:   code/getMtCrossMeanPreds.R
    Modified:   code/getMtCrossVarPreds.R
    Modified:   code/getUntestedMtCrossVarPreds.R
    Modified:   code/getVarComps.R
    Modified:   data/blups_forawcdata.rds
    Modified:   data/genmap_awc_May2020.rds
    Modified:   data/parentwise_crossVal_folds.rds
    Modified:   data/ped_awc.rds
    Modified:   data/selection_index_weights_4traits.rds
    Modified:   output/CrossesToPredict_top100stdSI_and_209originalParents.rds
    Modified:   output/accuraciesMeans.rds
    Modified:   output/accuraciesUC.rds
    Modified:   output/accuraciesVars.rds
    Modified:   output/crossRealizations/realizedCrossMeans.rds
    Modified:   output/crossRealizations/realizedCrossMeans_BLUPs.rds
    Modified:   output/crossRealizations/realizedCrossMetrics.rds
    Modified:   output/crossRealizations/realizedCrossVars.rds
    Modified:   output/crossRealizations/realizedCrossVars_BLUPs.rds
    Modified:   output/crossRealizations/realized_cross_means_and_covs_traits.rds
    Modified:   output/crossRealizations/realized_cross_means_and_vars_selindices.rds
    Modified:   output/ddEffects.rds
    Modified:   output/gebvs_ModelA_GroupAll_stdSI.rds
    Modified:   output/obsVSpredMeans.rds
    Modified:   output/obsVSpredUC.rds
    Modified:   output/obsVSpredVars.rds
    Modified:   output/pmv_DirectionalDom_varcomps_geneticgroups.rds
    Modified:   output/pmv_varcomps_geneticgroups.rds
    Modified:   output/pmv_varcomps_geneticgroups_tidy_includingSIvars.rds
    Modified:   output/propHomozygous.rds
    Modified:   output/top100stdSI.rds

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/Results.Rmd) and HTML (docs/Results.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
html 2e6702a wolfemd 2021-03-24 Build site.
Rmd 2ec2ff8 wolfemd 2021-03-24 Update graphics device so each figure saves to unique file name and displays properly in both HTML and PDF outputs.
html be1e9fc wolfemd 2021-03-24 Build site.
Rmd f73b05f wolfemd 2021-03-24 Update to match revised manuscript. Several comparisons moved to Appendix to streamline primary results, figures, etc.
html 4de1330 wolfemd 2021-02-01 Build site.
Rmd 883b1d4 wolfemd 2021-02-01 Update the syntax highlighting and code-block formatting throughout for
Rmd 6a10c30 wolfemd 2021-01-04 Submission and GitHub ready version.
html 6a10c30 wolfemd 2021-01-04 Submission and GitHub ready version.

library(tidyverse); library(magrittr); library(ragg); 

Initial summaries

Summary of the pedigree and germplasm

ped<-readRDS(here::here("data","ped_awc.rds"))
ped %>% 
  count(sireID,damID) %$% summary(n)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   2.000   4.000   6.924  10.000  72.000 
ped %>% 
  pivot_longer(cols=c(sireID,damID),names_to = "MaleOrFemale", values_to = "Parent") %>% 
  group_by(Parent) %>% 
  summarize(Ncontributions=n()) %$% summary(Ncontributions)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    4.00   16.00   30.61   36.00  256.00 

There were 3199 individuals comprising 462 families, derived from 209 parents in our pedigree. Parents were used an average of 31 (median 16, range 1-256) times as sire and/or dam in the pedigree. The mean family size was 7 (median 4, range 1-72).
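These headline counts can be recovered directly from the pedigree; a minimal sketch, assuming ped has one row per pedigree member with sireID and damID columns:

nrow(ped)                                 # 3199 pedigree members
ped %>% distinct(sireID,damID) %>% nrow() # 462 families
union(ped$sireID,ped$damID) %>% length()  # 209 parents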

propHom<-readxl::read_xlsx(here::here("manuscript","SupplementaryTables.xlsx"),sheet = "TableS14")
summary(propHom$PropSNP_homozygous)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.7639  0.8254  0.8364  0.8363  0.8476  0.9273 

The average proportion homozygous was 0.84 (range 0.76-0.93) across the 3199 pedigree members (computed over 33370 variable SNP; Table S14).

As expected for a population under recurrent selection, the homozygosity rate increases (though only fractionally) from C0 (mean 0.826) through C1 (0.835) and C2 (0.838) to C3 (0.839) (Figure S01).

propHom %>% 
  mutate(Group=ifelse(!grepl("TMS13|TMS14|TMS15", GID),"GG (C0)",NA),
         Group=ifelse(grepl("TMS13", GID),"TMS13 (C1)",Group),
         Group=ifelse(grepl("TMS14", GID),"TMS14 (C2)",Group),
         Group=ifelse(grepl("TMS15", GID),"TMS15 (C3)",Group)) %>% 
  group_by(Group) %>% 
  summarize(meanPropHom=round(mean(PropSNP_homozygous),3))
# A tibble: 4 x 2
  Group      meanPropHom
  <chr>            <dbl>
1 GG (C0)          0.826
2 TMS13 (C1)       0.835
3 TMS14 (C2)       0.838
4 TMS15 (C3)       0.839
propHom %>% 
  mutate(Group=ifelse(!grepl("TMS13|TMS14|TMS15", GID),"GG (C0)",NA),
         Group=ifelse(grepl("TMS13", GID),"TMS13 (C1)",Group),
         Group=ifelse(grepl("TMS14", GID),"TMS14 (C2)",Group),
         Group=ifelse(grepl("TMS15", GID),"TMS15 (C3)",Group)) %>% 
  ggplot(.,aes(x=Group,y=PropSNP_homozygous,fill=Group)) + geom_boxplot() + 
  theme_bw() + 
  scale_fill_viridis_d()

Version Author Date
be1e9fc wolfemd 2021-03-24
6a10c30 wolfemd 2021-01-04

Summary of the cross-validation scheme

## Table S2: Summary of cross-validation scheme
parentfold_summary<-readxl::read_xlsx(here::here("manuscript","SupplementaryTables.xlsx"),sheet = "TableS02")
parentfold_summary %>% 
  summarize_if(is.numeric,~ceiling(mean(.)))
# A tibble: 1 x 4
  Ntestparents Ntrainset Ntestset NcrossesToPredict
         <dbl>     <dbl>    <dbl>             <dbl>
1           42      1833     1494               167
parentfold_summary %>% 
  summarize_if(is.numeric,~ceiling(min(.))) %>% mutate(Value="Min") %>% 
  bind_rows(parentfold_summary %>% 
              summarize_if(is.numeric,~ceiling(max(.))) %>% mutate(Value="Max"))
# A tibble: 2 x 5
  Ntestparents Ntrainset Ntestset NcrossesToPredict Value
         <dbl>     <dbl>    <dbl>             <dbl> <chr>
1           41      1245     1003               143 Min  
2           42      2323     2081               204 Max  

Across the 5 replications of 5-fold cross-validation, the average number of samples was 1833 (range 1245-2323) for training sets and 1494 (range 1003-2081) for testing sets. The 25 training-testing pairs set up an average of 167 (range 143-204) crosses-to-predict (Table S02).

Summary of the BLUPs and selection indices

The correlation between phenotypic BLUPs for the two SI (stdSI and biofortSI; Table S01) was 0.43 (Figure S02). The correlation between DM and TCHART BLUPs, for which we had a priori expectations, was -0.29.

library(tidyverse); library(magrittr);
# Selection weights -----------
indices<-readxl::read_xlsx(here::here("manuscript","SupplementaryTables.xlsx"),sheet = "TableS01")

# BLUPs -----------
blups<-readRDS(here::here("data","blups_forawcdata.rds")) %>% 
  select(Trait,blups) %>% 
  unnest(blups) %>% 
  select(Trait,germplasmName,BLUP) %>% 
  spread(Trait,BLUP) %>% 
  select(germplasmName,all_of(c("DM","logFYLD","MCMDS","TCHART")))
blups %<>% 
  select(germplasmName,all_of(indices$Trait)) %>% 
  mutate(stdSI=blups %>% 
           select(all_of(indices$Trait)) %>% 
           as.data.frame(.) %>% 
           as.matrix(.)%*%indices$stdSI,
         biofortSI=blups %>% 
           select(all_of(indices$Trait)) %>% 
           as.data.frame(.) %>% 
           as.matrix(.)%*%indices$biofortSI)

Correlations among phenotypic BLUPs (including Selection Indices)

library(patchwork)
p1<-ggplot(blups,aes(x=stdSI,y=biofortSI)) + geom_point(size=1.25) + theme_bw()
corMat<-cor(blups[,-1],use = 'pairwise.complete.obs')
(p1 | ~corrplot::corrplot(corMat, type = 'lower', col = viridis::viridis(n = 10), diag = T,addCoef.col = "black")) + 
  plot_layout(nrow=1, widths = c(0.35,0.65)) +
  plot_annotation(tag_levels = 'A',
                  title = 'Correlations among phenotypic BLUPs (including Selection Indices)')

Version Author Date
be1e9fc wolfemd 2021-03-24
6a10c30 wolfemd 2021-01-04

Prediction accuracies

library(tidyverse); library(magrittr);
# Table S10: Accuracies predicting the mean
accMeans<-readxl::read_xlsx(here::here("manuscript","SupplementaryTables.xlsx"),sheet = "TableS10")

Primary analysis

Means

accMeansMain<-accMeans %>% 
  filter(ValidationData=="GBLUPs", grepl("DirDom",Model))
accMeansMain %>% 
  select(-Model,-ValidationData) %>% 
  spread(predOf,Accuracy) %>% 
  mutate(diffAcc=MeanTGV-MeanBV) -> diffAcc
diffAcc %$% summary(diffAcc)
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
-0.0826119 -0.0312055 -0.0171228  0.0018702 -0.0004026  0.2795664 
diffAcc %>% 
  summarise(quantile = scales::percent(c(0.05,0.25, 0.5, 0.75,0.95)),
            diffAcc = quantile(diffAcc, c(0.05,0.25, 0.5, 0.75,0.95))) %>% 
  spread(quantile,diffAcc)
# A tibble: 1 x 5
    `25%`    `5%`   `50%`     `75%` `95%`
    <dbl>   <dbl>   <dbl>     <dbl> <dbl>
1 -0.0312 -0.0656 -0.0171 -0.000403 0.176
diffAcc %>% 
  group_by(Trait) %>% 
  summarise(quantile = scales::percent(c(0.05,0.25, 0.5, 0.75,0.95)),
            diffAcc = quantile(diffAcc, c(0.05,0.25, 0.5, 0.75,0.95))) %>% 
  spread(quantile,diffAcc)
# A tibble: 6 x 6
# Groups:   Trait [6]
  Trait       `25%`     `5%`   `50%`    `75%`    `95%`
  <chr>       <dbl>    <dbl>   <dbl>    <dbl>    <dbl>
1 biofortSI -0.0375 -0.0526  -0.0237 -0.00934  0.0123 
2 DM        -0.0277 -0.0444  -0.0183 -0.0116   0.00924
3 logFYLD    0.0616 -0.00674  0.130   0.180    0.246  
4 MCMDS     -0.0580 -0.0791  -0.0391 -0.0157  -0.00193
5 stdSI     -0.0323 -0.0708  -0.0173  0.00685  0.0271 
6 TCHART    -0.0231 -0.0293  -0.0161 -0.0123  -0.00706

(Co)variances

library(tidyverse); library(magrittr);
## Table S11: Accuracies predicting the variances
accVars<-readxl::read_xlsx(here::here("manuscript","SupplementaryTables.xlsx"),sheet = "TableS11")
accVarsMain<-accVars %>% 
  filter(ValidationData=="GBLUPs", grepl("DirDom",Model),VarMethod=="PMV")

Across traits: the proportion of accuracy estimates above zero, plus the 10%, 50% (median) and 90% percentiles.

accVarsMain %>% 
  select(-Model,-ValidationData,-AccuracyCor) %>% 
  mutate(Component=ifelse(Trait1==Trait2,"Variance","Covariance"),
         TraitType=ifelse(grepl("SI",Trait1),"SI","ComponentTrait")) %>% 
  group_by(Component) %>% 
  summarize(propAboveZero=length(which(AccuracyWtCor>=0))/n(),
            quantile = scales::percent(c(0.1,0.5,0.9)),
            AccuracyWtCor = quantile(AccuracyWtCor, c(0.1,0.5,0.9))) %>% 
  pivot_wider(names_from = "quantile", values_from = "AccuracyWtCor")
# A tibble: 2 x 5
# Groups:   Component [2]
  Component  propAboveZero   `10%`  `50%` `90%`
  <chr>              <dbl>   <dbl>  <dbl> <dbl>
1 Covariance          0.7  -0.0812 0.0720 0.265
2 Variance            0.89 -0.0118 0.137  0.303

The same summary, broken down by trait variance and trait-trait covariance.

accVarsMain %>% 
  select(-Model,-ValidationData,-AccuracyCor) %>% 
  mutate(Component=ifelse(Trait1==Trait2,"Variance","Covariance"),
         TraitType=ifelse(grepl("SI",Trait1),"SI","ComponentTrait")) %>% 
  group_by(Component,TraitType,Trait1,Trait2) %>% 
  summarize(propAboveZero=length(which(AccuracyWtCor>=0))/n(),
            quantile = scales::percent(c(0.1,0.5,0.9)),
            AccuracyWtCor = quantile(AccuracyWtCor, c(0.1,0.5,0.9))) %>% 
  pivot_wider(names_from = "quantile", values_from = "AccuracyWtCor") %>% 
  arrange(desc(Component),TraitType,desc(`50%`))
# A tibble: 12 x 8
# Groups:   Component, TraitType, Trait1, Trait2 [12]
   Component  TraitType   Trait1  Trait2  propAboveZero    `10%`    `50%`  `90%`
   <chr>      <chr>       <chr>   <chr>           <dbl>    <dbl>    <dbl>  <dbl>
 1 Variance   ComponentT… MCMDS   MCMDS            0.92  0.0507   0.248   0.360 
 2 Variance   ComponentT… TCHART  TCHART           0.84 -0.0306   0.165   0.283 
 3 Variance   ComponentT… logFYLD logFYLD          1     0.0490   0.150   0.269 
 4 Variance   ComponentT… DM      DM               0.88 -0.0260   0.0651  0.164 
 5 Variance   SI          stdSI   stdSI            0.92  0.0623   0.172   0.283 
 6 Variance   SI          biofor… biofor…          0.78 -0.0501   0.0890  0.194 
 7 Covariance ComponentT… DM      TCHART           0.98  0.115    0.232   0.351 
 8 Covariance ComponentT… logFYLD MCMDS            0.9  -0.00151  0.157   0.297 
 9 Covariance ComponentT… MCMDS   TCHART           0.68 -0.0879   0.0565  0.156 
10 Covariance ComponentT… DM      MCMDS            0.76 -0.0988   0.0515  0.174 
11 Covariance ComponentT… logFYLD TCHART           0.5  -0.142   -0.00147 0.121 
12 Covariance ComponentT… DM      logFYLD          0.38 -0.100   -0.0265  0.0850

Difference in accuracy between VarTGV and VarBV?

accVarsMain %>% 
  select(-Model,-ValidationData,-AccuracyCor) %>% 
  mutate(Component=ifelse(Trait1==Trait2,"Variance","Covariance"),
         TraitType=ifelse(grepl("SI",Trait1),"SI","ComponentTrait")) %>% 
  spread(predOf,AccuracyWtCor) %>% 
  mutate(diffAcc=VarTGV-VarBV) -> diffAcc
diffAcc %$% round(summary(diffAcc),3)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -0.174  -0.033  -0.003  -0.002   0.028   0.196 
diffAcc %>% 
  summarise(quantile = scales::percent(c(0.1,0.5,0.9)),
            diffAcc = quantile(diffAcc, c(0.1,0.5,0.9))) %>% 
  spread(quantile,diffAcc)
# A tibble: 1 x 3
    `10%`    `50%`  `90%`
    <dbl>    <dbl>  <dbl>
1 -0.0634 -0.00303 0.0578
diffAcc %>% 
  group_by(Component,TraitType,Trait1,Trait2) %>% 
  summarise(quantile = scales::percent(c(0.1,0.5,0.9)),
            diffAcc = quantile(diffAcc, c(0.1,0.5,0.9))) %>% 
  spread(quantile,diffAcc) %>% 
  arrange(desc(Component),TraitType,desc(`50%`))
# A tibble: 12 x 7
# Groups:   Component, TraitType, Trait1, Trait2 [12]
   Component  TraitType      Trait1    Trait2      `10%`    `50%`     `90%`
   <chr>      <chr>          <chr>     <chr>       <dbl>    <dbl>     <dbl>
 1 Variance   ComponentTrait MCMDS     MCMDS     -0.0111  0.0134   0.0368  
 2 Variance   ComponentTrait TCHART    TCHART    -0.0439 -0.0275  -0.0107  
 3 Variance   ComponentTrait DM        DM        -0.0591 -0.0301  -0.000190
 4 Variance   ComponentTrait logFYLD   logFYLD   -0.0929 -0.0319   0.0908  
 5 Variance   SI             stdSI     stdSI     -0.0284  0.0184   0.0436  
 6 Variance   SI             biofortSI biofortSI -0.0426 -0.00653  0.0262  
 7 Covariance ComponentTrait logFYLD   TCHART    -0.0532  0.0533   0.145   
 8 Covariance ComponentTrait DM        MCMDS     -0.0137  0.0229   0.0563  
 9 Covariance ComponentTrait logFYLD   MCMDS     -0.0372  0.0211   0.0847  
10 Covariance ComponentTrait MCMDS     TCHART    -0.0351  0.0116   0.0384  
11 Covariance ComponentTrait DM        logFYLD   -0.0715  0.00321  0.0626  
12 Covariance ComponentTrait DM        TCHART    -0.0985 -0.0605  -0.0334  

Usefulness

The usefulness criteria, i.e. \(UC_{parent}\) and \(UC_{variety}\), are predicted by:

\[UC_{parent}=\mu_{BV} + (i_{RS} \times \sigma_{BV})\]

\[UC_{variety}=\mu_{TGV} + (i_{VDP} \times \sigma_{TGV})\]
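As a minimal sketch of these two formulas in R (the function name and example values are ours, not from the analysis code):

uc <- function(mu, sigma, i) mu + i * sigma
uc(0.5, 0.2, 1.5) # e.g. a UC_parent with predicted mean BV 0.5, SD 0.2 and i_RS = 1.5 gives 0.8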

Selection intensity

In order to combine predicted means and variances into a UC, we first calculated the realized intensity of within-family selection (\(i_{RS}\) and \(i_{VDP}\)). For \(UC_{parent}\) we computed \(i_{RS}\) based on the proportion of progeny from each family that themselves later appeared in the pedigree as parents. For \(UC_{variety}\) we computed \(i_{VDP}\) based on the proportion of family-members that had at least one plot at the penultimate stage of the VDP, the advanced yield trial (AYT). For completeness and as part of exploratory analysis, we computed the proportion selected at each of the VDP stages: clonal evaluation trial (CET), preliminary yield trial (PYT), advanced yield trial (AYT) and uniform yield trial (UYT). A conversion from proportion selected to intensity is sketched below.
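Under the usual truncation-selection approximation for a normally distributed trait, a proportion selected converts to a standardized intensity as in this small sketch (our code, not the project's):

propSelToIntensity <- function(p) dnorm(qnorm(1 - p)) / p
propSelToIntensity(0.05) # ~2.06 at 5% selected
propSelToIntensity(0.01) # ~2.67, the standardized intensity referenced later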

The table below provides a quick summary of the number of families available with realized selection observed at each stage, plus the corresponding mean selection intensity and proportion selected across families.

## Table S13: Realized within-cross selection metrics
crossmetrics<-readxl::read_xlsx(here::here("manuscript","SupplementaryTables.xlsx"),sheet = "TableS13")
left_join(crossmetrics %>%
            select(sireID,damID,contains("realIntensity")) %>%   distinct %>% 
            pivot_longer(cols = contains("realIntensity"),
                         names_to = "Stage", values_to = "Intensity",names_prefix = "realIntensity") %>%
            group_by(Stage) %>%
            summarize(meanIntensity=mean(Intensity, na.rm = T),
                      Nfam=length(which(!is.na(Intensity)))),
          crossmetrics %>% 
            select(sireID,damID,contains("prop")) %>%   distinct %>% 
            rename(propParent=propUsedAsParent,
                   propCET=propPhenotyped,
                   propPYT=propPastCET,
                   propAYT=propPastPYT,
                   propUYT=propPastAYT) %>% 
            pivot_longer(cols = contains("prop"),
                         values_to = "PropPast",names_to = "Stage",names_prefix = "propPast|prop") %>% 
            group_by(Stage) %>% 
            summarize(meanPropPast=mean(PropPast, na.rm = T))) %>%
  mutate(Stage=factor(Stage,levels=c("Parent","CET","PYT","AYT","UYT"))) %>% 
  arrange(Stage) %>% 
  select(Stage,Nfam,meanIntensity,meanPropPast) %>% mutate_if(is.numeric,~round(.,2))
# A tibble: 5 x 4
  Stage   Nfam meanIntensity meanPropPast
  <fct>  <dbl>         <dbl>        <dbl>
1 Parent    48          1.59         0.02
2 CET      415          0.26         0.76
3 PYT      288          0.9          0.3 
4 AYT      104          1.46         0.05
5 UYT       39          1.4          0.03

There were 48 families, with a mean intensity of 1.59 (mean 2% selected), whose members themselves appeared as parents in the pedigree. As expected, the number of available families and the proportion selected decreased (increasing the selection intensity) from CET to UYT. We chose to focus on the AYT stage, which had 104 families and a mean intensity of 1.46 (mean 5% selected).

library(tidyverse); library(magrittr);
## Table S9: Predicted and observed UC
predVSobsUC<-read.csv(here::here("manuscript","SupplementaryTable09.csv"),stringsAsFactors = F)

uc_cv_summary<-predVSobsUC %>% 
  filter(VarMethod=="PMV") %>% 
  group_by(Model,predOf,Stage,Trait,Repeat,Fold) %>% 
  summarize(Nfam=n(),
            meanFamSize=round(mean(FamSize),1)) %>% 
  ungroup() %>% 
  select(-Trait,-Model) %>% 
  distinct
uc_cv_summary %>% 
  group_by(predOf,Stage) %>% 
  summarize(minNfam=min(Nfam),
            meanNfam=mean(Nfam),
            maxNfam=max(Nfam),
            minMeanFamSize=min(meanFamSize),
            meanMeanFamSize=mean(meanFamSize),
            maxMeanFamSize=max(meanFamSize))
# A tibble: 7 x 8
# Groups:   predOf [2]
  predOf Stage          minNfam meanNfam maxNfam minMeanFamSize meanMeanFamSize
  <chr>  <chr>            <int>    <dbl>   <int>          <dbl>           <dbl>
1 BV     ConstIntensity     143    166.      204            5.9            6.85
2 BV     Parent               9     17.3      24            8.8           13.8 
3 TGV    AYT                 25     37.3      50           10.9           12.9 
4 TGV    CET                129    149.      192            6.3            7.48
5 TGV    ConstIntensity     143    166.      204            5.9            6.85
6 TGV    PYT                 69    103.      142            7.3            8.73
7 TGV    UYT                  8     14.1      20            7.3           11.8 
# … with 1 more variable: maxMeanFamSize <dbl>

Accuracy comparisons

library(tidyverse); library(magrittr);
## Table S12: Accuracies predicting the usefulness criteria
accUC<-readxl::read_xlsx(here::here("manuscript","SupplementaryTables.xlsx"),sheet = "TableS12")
accUCmain<-accUC %>% 
  filter(VarMethod=="PMV",grepl("DirDom",Model),Stage %in% c("Parent","AYT")) %>%
  select(-Model,-VarMethod,-AccuracyCor,-Stage)
# accMeansMain %>% count(Trait,predOf)
# accVarsMain %>% count(Trait1,Trait2,predOf)
# accUCmain %>% count(Repeat,Fold)

Proportion of UC accuracy estimates greater than zero.

length(which(accUC$AccuracyWtCor>0))/nrow(accUC)
[1] 0.952619

Compare Mean, Variance and UC accuracy:

meanVsUCacc<-accUCmain %>% 
  rename(AccUC=AccuracyWtCor) %>% 
  left_join(accMeansMain %>% 
              select(-ValidationData,-Model) %>% 
              mutate(predOf=gsub("Mean","",predOf)) %>% 
              rename(AccMean=Accuracy))
meanVsUCacc %<>% 
  left_join(accVarsMain %>% 
              filter(Trait1==Trait2) %>% 
              select(-ValidationData,-Model,-AccuracyCor,-VarMethod,-Trait2) %>% 
              mutate(predOf=gsub("Var","",predOf)) %>% 
              rename(AccVar=AccuracyWtCor,
                     Trait=Trait1))

Correlation between mean, Var and UC accuracy, overall

meanVsUCacc %>% 
  summarize(corMeanUCacc=cor(AccUC,AccMean),
            meanDiffMeanUC=mean(AccUC-AccMean),
            corMeanVarAcc=cor(AccVar,AccMean),
            corVarUCacc=cor(AccUC,AccVar))
# A tibble: 1 x 4
  corMeanUCacc meanDiffMeanUC corMeanVarAcc corVarUCacc
         <dbl>          <dbl>         <dbl>       <dbl>
1        0.745        -0.0929        -0.222      -0.236

Correlation between mean, Var and UC accuracy, by predOf (BV vs. TGV)

meanVsUCacc %>% 
  group_by(predOf) %>% 
  summarize(corMeanUCacc=cor(AccUC,AccMean),
            meanDiffMeanUC=mean(AccUC-AccMean),
            corMeanVarAcc=cor(AccVar,AccMean))
# A tibble: 2 x 4
  predOf corMeanUCacc meanDiffMeanUC corMeanVarAcc
  <chr>         <dbl>          <dbl>         <dbl>
1 BV            0.704        -0.0748        -0.177
2 TGV           0.802        -0.111         -0.275

Correlation between mean, Var and UC accuracy, by Trait

meanVsUCacc %>% 
  group_by(Trait) %>% 
  summarize(corMeanUCacc=cor(AccUC,AccMean),
            meanDiffMeanUC=mean(AccUC-AccMean),
            corMeanVarAcc=cor(AccVar,AccMean)) %>% arrange(desc(corMeanUCacc))
# A tibble: 6 x 4
  Trait     corMeanUCacc meanDiffMeanUC corMeanVarAcc
  <chr>            <dbl>          <dbl>         <dbl>
1 TCHART          0.494        -0.00356      -0.0951 
2 logFYLD         0.410        -0.0762       -0.0571 
3 MCMDS           0.358        -0.280         0.475  
4 stdSI           0.315        -0.0736        0.129  
5 DM              0.110        -0.0938        0.227  
6 biofortSI       0.0189       -0.0297        0.00484

Make an overall comparison of trait accuracies for UC

accUCmain %>% 
  mutate(TraitType=ifelse(grepl("SI",Trait),"SI","ComponentTrait")) %>% 
  group_by(Trait) %>% 
  summarize(propAboveZero=length(which(AccuracyWtCor>0))/n(),
            quantile = scales::percent(c(0.1,0.5,0.9)),
            AccuracyWtCor = quantile(AccuracyWtCor, c(0.1,0.5,0.9))) %>% 
  pivot_wider(names_from = "quantile", values_from = "AccuracyWtCor") %>% 
  mutate(across(where(is.numeric),~round(.,2))) %>% 
  arrange(desc(`50%`))
# A tibble: 6 x 5
# Groups:   Trait [6]
  Trait     propAboveZero `10%` `50%` `90%`
  <chr>             <dbl> <dbl> <dbl> <dbl>
1 TCHART             1     0.66 0.83   0.94
2 DM                 1     0.36 0.65   0.81
3 biofortSI          1     0.39 0.580  0.74
4 stdSI              0.96  0.16 0.49   0.65
5 logFYLD            0.88 -0.04 0.24   0.49
6 MCMDS              0.7  -0.18 0.1    0.41

Compare UC accuracy of TGV vs. BV, overall

accUCmain %>% 
  spread(predOf,AccuracyWtCor) %>% 
  mutate(diffAcc=TGV-BV) %$% round(summary(diffAcc),2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  -0.75   -0.17   -0.06   -0.03    0.06    1.01 

Proportion of accuracy estimates where BV>TGV.

accUCmain %>% 
  spread(predOf,AccuracyWtCor) %>% 
  mutate(diffAcc=TGV-BV) %$% (length(which(diffAcc<=0))/length(diffAcc))
[1] 0.62

Compare UC accuracy difference TGV minus BV, by Trait

accUCmain %>% 
  mutate(TraitType=ifelse(grepl("SI",Trait),"SI","ComponentTrait")) %>% 
  spread(predOf,AccuracyWtCor) %>% 
  mutate(diffAcc=TGV-BV) %>% 
  group_by(Trait,TraitType) %>% 
  summarize(quantile = scales::percent(c(0.1,0.5,0.9)),
            diffAcc = quantile(diffAcc, c(0.1,0.5,0.9))) %>% 
  pivot_wider(names_from = "quantile", values_from = "diffAcc") %>% 
  mutate(across(where(is.numeric),~round(.,2))) %>% 
  arrange(desc(`50%`))
# A tibble: 6 x 5
# Groups:   Trait, TraitType [6]
  Trait     TraitType      `10%` `50%` `90%`
  <chr>     <chr>          <dbl> <dbl> <dbl>
1 stdSI     SI             -0.15  0.1   0.44
2 logFYLD   ComponentTrait -0.26 -0.02  0.28
3 DM        ComponentTrait -0.24 -0.05  0.12
4 biofortSI SI             -0.19 -0.08  0.12
5 TCHART    ComponentTrait -0.17 -0.08  0.07
6 MCMDS     ComponentTrait -0.49 -0.17  0.12
In summary:
  • Mean: TGV<BV (median -0.017)
  • Var: no consistent / only small differences
  • UC: TGV<BV

Population estimates of the importance of dominance variance

In this study, our focus is mainly on distinguishing among crosses and on the accuracy of cross-metric predictions. Detailed analysis of the additive-dominance genetic variance-covariance structure in cassava (sub)populations is an important topic, which we mostly leave for future study. However, we briefly examine the genetic variance-covariance estimates associated with the overall population and its component genetic groups. We report all variance-covariance estimates in Table S15 and the complete BGLR output in the FTP repository associated with this study.

library(tidyverse); library(magrittr);
## Table S15: Variance estimates for genetic groups
varcomps<-readxl::read_xlsx(here::here("manuscript","SupplementaryTables.xlsx"),sheet = "TableS15") %>% 
  filter(VarMethod=="PMV", Method=="M2",Model %in% c("DirDomAD")) %>% 
  select(-VarMethod,-Method)
varcomps %>% 
  filter(Trait1==Trait2) %>% 
  group_by(Model) %>% 
  summarize(minPropDom=min(propDom),
            meanPropDom=mean(propDom),
            maxPropDom=max(propDom))
# A tibble: 1 x 4
  Model    minPropDom meanPropDom maxPropDom
  <chr>         <dbl>       <dbl>      <dbl>
1 DirDomAD       0.06       0.236       0.53

Over all genetic groups analyzed, across trait and SI variances, dominance accounted for an average of 24% (range 6-53%) of the genetic variance.
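Here propDom is presumably the dominance share of the total genetic variance (an assumption based on the variance-component columns of Table S15):

\[propDom = \frac{\sigma^2_D}{\sigma^2_A + \sigma^2_D}\]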

varcomps %>% 
  filter(Trait1==Trait2) %>% 
  group_by(Trait1,Trait2) %>% 
  summarize(minPropDom=min(propDom),
            meanPropDom=mean(propDom),
            maxPropDom=max(propDom)) %>% arrange(desc(meanPropDom))
# A tibble: 6 x 5
# Groups:   Trait1 [6]
  Trait1    Trait2    minPropDom meanPropDom maxPropDom
  <chr>     <chr>          <dbl>       <dbl>      <dbl>
1 logFYLD   logFYLD         0.41       0.462       0.53
2 biofortSI biofortSI       0.18       0.264       0.5 
3 stdSI     stdSI           0.14       0.232       0.35
4 DM        DM              0.11       0.18        0.27
5 MCMDS     MCMDS           0.08       0.172       0.3 
6 TCHART    TCHART          0.06       0.106       0.21

Dominance was most important (mean 46% of genetic variance) for yield (logFYLD) and least important for TCHART (mean 11%) (Figure 4).

Population estimates of inbreeding effects

## Table S16: Directional dominance effects estimates
ddEffects<-readxl::read_xlsx(here::here("manuscript","SupplementaryTables.xlsx"),sheet = "TableS16")
ddEffects %>% 
  select(-Repeat,-Fold,-InbreedingEffectSD) %>% 
  mutate(Dataset=ifelse(Dataset!="GeneticGroups",Group,Dataset)) %>% 
  select(-Group) %>% 
  group_by(Dataset,Trait) %>% 
  summarize_all(~round(mean(.),3)) %>% 
  spread(Dataset,InbreedingEffect)
# A tibble: 4 x 3
  Trait   GeneticGroups ParentwiseCV
  <chr>           <dbl>        <dbl>
1 DM             -4.82        -7.85 
2 logFYLD        -2.75        -3.88 
3 MCMDS           0.32         1.27 
4 TCHART         -0.004       -0.015

We found mostly consistent and significant (different from zero) effects of inbreeding depression, especially for logFYLD (mean effect -2.75 across genetic groups, -3.88 across cross-validation folds), but also for DM (-4.82 genetic groups, -7.85 cross-validation) and MCMDS (0.32 genetic groups, 1.27 cross-validation) (Figure 5, Table S16). This corresponds to higher homozygosity being associated with lower DM, lower yield and worse disease.

Exploring Untested Crosses

We made 8 predictions (2 SIs x 2 prediction targets [BV, TGV] x 2 criteria [Mean, UC = Mean + 2*SD]) for each of 47,083 possible crosses of 306 parents.

Correlations among predictions

First, quickly evaluate the multivariate decision space encompassed by predictions of mean, SD, UC for BV and TGV.

library(tidyverse); library(magrittr); 
predUntestedCrosses<-read.csv(here::here("manuscript","SupplementaryTable18.csv"),stringsAsFactors = F) %>% 
  filter(Model=="DirDom") %>% select(-Model)

Average correlations between BiofortSI and StdSI by prediction.

predUntestedCrosses %>% 
  spread(Trait,Pred) %>% 
  group_by(PredOf,Component) %>% 
  summarize(corSelIndices=cor(stdSI,biofortSI)) %>% 
  group_by(PredOf) %>% 
  summarize(meanCorSIs=mean(corSelIndices))
# A tibble: 3 x 2
  PredOf meanCorSIs
  <chr>       <dbl>
1 Mean        0.204
2 Sd          0.912
3 UC          0.136

TABLE: Correlations between predictions about each selection index (\(\overset{StdSI,BiofortSI}{\textbf{cor}}\)).

predUntestedCrosses %>% 
  spread(Trait,Pred) %>% 
  group_by(PredOf,Component) %>% 
  summarize(corSelIndices=cor(stdSI,biofortSI)) %>% 
  spread(Component,corSelIndices) %>% 
  arrange(PredOf) %>% 
  rmarkdown::paged_table()

TABLE: Correlations between predictions about each component, for each trait (\(\overset{BV,TGV}{\textbf{cor}}\)).

predUntestedCrosses %>% 
  spread(Component,Pred) %>% 
  group_by(Trait,PredOf) %>% 
  summarize(corBV_TGV=round(cor(BV,TGV),2)) %>% 
  spread(Trait,corBV_TGV)
# A tibble: 3 x 3
  PredOf biofortSI stdSI
  <chr>      <dbl> <dbl>
1 Mean        0.96  0.95
2 Sd          0.91  0.88
3 UC          0.95  0.93
predUntestedCrosses %>% 
  spread(PredOf,Pred) %>% 
  group_by(Trait,Component) %>% 
  summarize(corMeanSD=round(cor(Mean,Sd),2),
            corMeanUC=round(cor(Mean,UC),2),
            corSdUC=round(cor(Sd,UC),2)) %>% ungroup() %>% 
summarize(across(where(is.numeric),mean))
# A tibble: 1 x 3
  corMeanSD corMeanUC corSdUC
      <dbl>     <dbl>   <dbl>
1    -0.365     0.995  -0.258
predUntestedCrosses %>% 
  spread(PredOf,Pred) %>% 
  group_by(Trait,Component) %>% 
  summarize(corMeanSD=round(cor(Mean,Sd),2),
            corMeanUC=round(cor(Mean,UC),2),
            corSdUC=round(cor(Sd,UC),2)) %>% 
  rmarkdown::paged_table()

The mean and variance have a low but negative correlation. At the standardized intensity of 2.67 (1% selected), this leads to a small negative correlation between SD and UC. The crosses with the highest mean will mostly be those with the highest UC, and will also have a small tendency toward smaller variance.

Nevertheless, the biggest differences in decision space come from using the Mean alone vs. incorporating the SD via the UC.

Decision space - top 50 crosses?

What we next want to know is how different the selections of crosses-to-make would be if we used different criteria, particularly the mean vs. the UC.

For each of the 8 predictions of 47,083 crosses, select the top 50 ranked crosses.

top50crosses<-predUntestedCrosses %>% 
  filter(PredOf!="Sd") %>%
  group_by(Trait,PredOf,Component) %>% 
  slice_max(order_by = Pred,n=50) %>% ungroup()
top50crosses %>% distinct(sireID,damID) %>% nrow()
[1] 202

Number of distinct crosses selected per Trait

top50crosses %>% 
  distinct(Trait,sireID,damID) %>% 
  count(Trait)
# A tibble: 2 x 2
  Trait         n
  <chr>     <int>
1 biofortSI    90
2 stdSI       112

Number of Self vs. Outcross selected by Trait

top50crosses %>% 
  distinct(Trait,sireID,damID,IsSelf) %>% 
  count(Trait,IsSelf)
# A tibble: 4 x 3
  Trait     IsSelf     n
  <chr>     <lgl>  <int>
1 biofortSI FALSE     84
2 biofortSI TRUE       6
3 stdSI     FALSE    105
4 stdSI     TRUE       7

Only 202 unique crosses were selected based on at least one of the 8 criteria. Of those, 112 were selected for the StdSI (90 for the BiofortSI), including 7 selfs on the StdSI (6 on the BiofortSI).

top50crosses %>% 
  distinct(Trait,sireID,damID) %>% 
  mutate(Selected="Yes") %>% 
  spread(Trait,Selected) %>% 
  na.omit(.)
# A tibble: 0 x 4
# … with 4 variables: sireID <chr>, damID <chr>, biofortSI <chr>, stdSI <chr>

There were 0 crosses selected for both SI.

top50crosses %>% 
  distinct(CrossPrevMade,sireID,damID) %>% 
  count(CrossPrevMade)
# A tibble: 1 x 2
  CrossPrevMade     n
  <chr>         <int>
1 No              202

None of the selected crosses had previously been made.

Table: Summarize, by trait, the number of parents and the relative contributions (number of matings) proposed for each parent selected in the group of top crosses.

top50crosses %>% 
  mutate(Family=paste0(sireID,"x",damID)) %>% 
  select(Trait,Family,sireID,damID) %>% 
  pivot_longer(cols = c(sireID,damID), names_to = "Parent", values_to = "germplasmName") %>% 
  count(Trait,germplasmName) %>% 
  group_by(Trait) %>% 
  summarize(Nparents=length(unique(germplasmName)),
            minProg=min(n),maxProg=max(n),medianProg=median(n))
# A tibble: 2 x 5
  Trait     Nparents minProg maxProg medianProg
  <chr>        <int>   <int>   <int>      <dbl>
1 biofortSI       33       1      81          4
2 stdSI           44       1      70          3

Next, for each SI, we break down the criteria under which the “best” crosses are selected.

Quantify the number of unique crosses selected by:

Component (BV vs. TGV)

  • Are many crosses selected for both their BV and TGV?
  • Are selfs selected mostly by TGV?

Table: Number of crosses for each SI selected by BV, TGV or both, ignoring selection based on Mean vs. UC.

top50crosses %>% 
  distinct(Trait,sireID,damID,Component) %>% 
  mutate(Selected="Yes") %>% 
  spread(Component,Selected) %>% 
  mutate(across(everything(),~replace_na(.,"No"))) %>% 
  count(Trait,BV,TGV) %>% spread(Trait,n) %>% 
  rmarkdown::paged_table()

Table: Number of selfed crosses (IsSelf==TRUE) for each SI selected by BV, TGV or both, ignoring selection based on Mean vs. UC.

top50crosses %>% filter(IsSelf==TRUE) %>% 
  distinct(Trait,sireID,damID,Component) %>% 
  mutate(Selected="Yes") %>% 
  spread(Component,Selected) %>% 
  mutate(across(everything(),~replace_na(.,"No"))) %>% 
  count(Trait,BV) %>% 
  rmarkdown::paged_table()

Selfs were only selected based on BV.

Table: Compute the number of parents uniquely selected based on BV vs. TGV

top50crosses %>% 
  nest(families=c(-Trait,-Component)) %>% 
  spread(Component,families) %>% 
  mutate(NparentsBVunique=map2_dbl(BV,TGV,~length(union(.x$sireID,.x$damID) %>% .[!. %in% union(.y$sireID,.y$damID)])),
         NparentsTGVunique=map2_dbl(BV,TGV,~length(union(.y$sireID,.y$damID) %>% .[!. %in% union(.x$sireID,.x$damID)])),
         NparentsTot=map2_dbl(BV,TGV,~length(unique(c(.x$sireID,.x$damID,.y$sireID,.y$damID))))) %>% 
  select(-BV,-TGV) %>% arrange(Trait) %>% rmarkdown::paged_table()

Table: Compute the number of crosses uniquely selected based on Mean, UC or both

top50crosses %>% 
  distinct(Trait,sireID,damID,PredOf) %>% 
  mutate(Selected="Yes") %>% 
  spread(PredOf,Selected) %>% 
  mutate(across(everything(),~replace_na(.,"No"))) %>% 
  count(Trait,Mean,UC) %>%spread(Trait,n) %>% 
  rmarkdown::paged_table()

So which are the “BEST” crosses?

  • Chosen most times, for most criteria?
best50crosses<-top50crosses %>% 
  count(sireID,damID,Trait) %>% 
  group_by(Trait) %>% 
  slice_max(order_by = `n`,n = 50, with_ties = TRUE) %>% 
  rename(NtimesChosen=n) 
best50crosses %>% count(Trait)
# A tibble: 2 x 2
# Groups:   Trait [2]
  Trait         n
  <chr>     <int>
1 biofortSI    66
2 stdSI        58

If we use the number of times chosen as the criterion and do not break ties, there are 58 StdSI and 66 BiofortSI crosses to consider as the “best”.

### Are these crosses really "better"?
# bind_rows(predUntestedCrosses %>% 
#             filter(CrossPrevMade=="Yes") %>% 
#             mutate(Group=ifelse(grepl("TMS13",sireID) & grepl("TMS13",damID),"C1",
#                                 ifelse(grepl("TMS14",sireID) & grepl("TMS14",damID),"C2",
#                                        ifelse(grepl("TMS15",sireID) & grepl("TMS15",damID),"C3","C0")))),
#           predUntestedCrosses %>% 
#             semi_join(best50crosses) %>% 
#             mutate(Group="NewCrosses")) %>% 
#   group_by(Group,Trait,PredOf,Component) %>% 
#   summarize(meanPred=mean(Pred),
#             sePred=sd(Pred)/sqrt(n())) %>% 
#   ggplot(.,aes(x=Group,y=meanPred,fill=Component)) + 
#   facet_grid(PredOf~Trait, scales = 'free') + 
#   geom_bar(stat = 'identity', position='dodge')

Supplementary Questions

Does the validation type make a difference?

Most often, cross-validation to test genomic prediction accuracy uses validation data (the stand-in for “truth”) consisting of adjusted values (e.g. BLUPs or BLUEs) for total individual performance, not incorporating genomic relatedness information. In our study, we set up cross-validation folds that enabled us to predict the GEBV and GETGV (GBLUPs) of validation family-members, and to subsequently compute their sample means, variances and usefulness. This approach has the added advantage of expanding the available sample size of validation progeny with complete data across traits. Nevertheless, we made some comparisons to results using BLUPs that do not incorporate genomic relatedness information; in other words, independent and identically distributed (i.i.d.) BLUPs.
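To illustrate the distinction (a hedged sketch, not this study's pipeline): rrBLUP::kin.blup fits i.i.d. genetic effects when no kinship matrix is supplied and GBLUPs when a genomic kinship matrix is passed as K. Here d (a phenotype data.frame) and K are assumed objects.

library(rrBLUP)
iidBLUPs <- kin.blup(data = d, geno = "germplasmName", pheno = "DM")$g        # i.i.d. genetic effects
gBLUPs   <- kin.blup(data = d, geno = "germplasmName", pheno = "DM", K = K)$g # genomic-relationship BLUPs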

library(tidyverse); library(magrittr);
# Table S10: Accuracies predicting the mean
accMeans<-readxl::read_xlsx(here::here("manuscript","SupplementaryTables.xlsx"),sheet = "TableS10")
## Table S11: Accuracies predicting the variances
accVars<-readxl::read_xlsx(here::here("manuscript","SupplementaryTables.xlsx"),sheet = "TableS11") %>% 
  mutate(Component=ifelse(Trait1==Trait2,"Variance","Covariance"),
         TraitType=ifelse(grepl("SI",Trait1),"SI","ComponentTrait"))

Summarize accuracy differences between GBLUPs and iidBLUPs as validation data.

accMeans %>% 
  filter(grepl("DirDom",Model)) %>% 
  spread(ValidationData,Accuracy) %>% 
  mutate(diffAcc=GBLUPs-iidBLUPs) %$% summary(diffAcc)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.02019  0.14109  0.17690  0.18111  0.22389  0.38530 

Prediction accuracies for family means were nearly uniformly higher using GBLUPs compared to iidBLUPs (median 0.18 higher).

Compute Spearman rank correlations on a per Trait, per prediction type (BV vs TGV) basis.

accMeans %>% 
  filter(grepl("DirDom",Model)) %>% 
  spread(ValidationData,Accuracy) %>% 
  group_by(Trait,predOf) %>% 
  summarize(corAcc=cor(GBLUPs,iidBLUPs,method = 'spearman')) %$% summary(corAcc) # arrange(desc(corAcc))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.5469  0.6921  0.7512  0.7344  0.7852  0.8438 

The Spearman rank correlation between prediction accuracies based on iidBLUPs and GBLUPs was high (median 0.75, range 0.55-0.84).

Summarize accuracy differences between GBLUPs and iidBLUPs as validation data.

accVars %>% 
  filter(grepl("DirDom",Model),VarMethod=="PMV") %>% 
  select(-AccuracyCor) %>% 
  spread(ValidationData,AccuracyWtCor) %>% 
  mutate(diffAcc=GBLUPs-iidBLUPs) %$% summary(diffAcc)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.38849 -0.00209  0.07280  0.06568  0.13568  0.43723 

Similar to the means, accuracy using GBLUP-validation-data appeared mostly higher compared to iidBLUPs (median difference GBLUPs-iidBLUPs = 0.07, interquartile range -0.002-0.14).

Compute Spearman rank correlations on a per Trait, per prediction type (BV vs TGV) basis.

accVars %>% 
  filter(grepl("DirDom",Model),VarMethod=="PMV") %>% 
  select(-AccuracyCor) %>% 
  spread(ValidationData,AccuracyWtCor) %>% 
  group_by(Component,TraitType,Trait1,Trait2,predOf) %>% 
  summarize(corAcc=cor(GBLUPs,iidBLUPs,method = 'spearman')) %$% summary(corAcc) # arrange(desc(corAcc))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.04154 0.33269 0.52385 0.50785 0.66346 0.89154 

The Spearman rank correlations of iidBLUP- and GBLUP-validation-based accuracies were positive for family (co)variances, but smaller than for family means (mean correlation 0.5, range 0.04-0.89).

Supplementary plots comparing validation-data accuracies for means and (co)variances were inspected (Figures S??-S??). Based on this, we conclude that we would reach similar, though more muted, conclusions about which trait variances and trait-trait covariances are best or worst predicted if restricted to iidBLUPs for validation data.

Accuracy considering only families with 10+ members?

library(tidyverse); library(magrittr);
obsVSpredVars<-readRDS(here::here("output","obsVSpredVars.rds")) %>% 
  filter(grepl("DirDom",Model),ValidationData=="GBLUPs",VarMethod=="PMV") %>% 
  select(-Model,-ValidationData,-VarMethod)
obsVSpredVars %>% 
  distinct(sireID,damID,FamSize) %>% 
  filter(FamSize>=10) %>% nrow()
[1] 118
obsVSpredVars %>% 
  distinct(sireID,damID,FamSize) %>% 
  filter(FamSize>=20) %>% nrow()
[1] 22

In our primary analysis, we computed (co)variance prediction accuracies with weighted correlations, considering any family with more than one member. We also considered a more conservative alternative, including only families with ≥10 (n=112) or ≥20 (n=22) members.
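The weighted accuracy itself is the weighted correlation between predicted and observed variances, as in this minimal sketch (assuming a data.frame d with predVar, obsVar and CorrWeight columns):

library(psych); library(magrittr)
cor.wt(d[,c("predVar","obsVar")], w = d$CorrWeight) %$% r[1,2]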

Family Size >= 20

# Variances
compareAllFamsToBigFamsAccuracy<-obsVSpredVars %>% 
  drop_na(.) %>% 
  filter(FamSize>=20) %>% 
  select(-FamSize,-Nobs) %>% 
  nest(predVSobs=c(sireID,damID,predVar,obsVar,CorrWeight)) %>% 
  # NB: the "FamSize10plus" column name is reused from the >=10 analysis below; in this block it holds the FamSize>=20 accuracies
  mutate(FamSize10plus=map_dbl(predVSobs,~psych::cor.wt(.[,3:4],w = .$CorrWeight) %$% r[1,2])) %>% 
  select(-predVSobs) %>% 
  left_join(readxl::read_xlsx(here::here("manuscript","SupplementaryTables.xlsx"),sheet = "TableS11") %>% 
              filter(VarMethod=="PMV",ValidationData=="GBLUPs", grepl("DirDom",Model)) %>% 
              select(-Model,-ValidationData,-VarMethod) %>% 
              select(-AccuracyCor) %>% 
              rename(AllFams=AccuracyWtCor)) %>% 
  mutate(accDiff=FamSize10plus-AllFams)
compareAllFamsToBigFamsAccuracy$accDiff %>% summary
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-1.312734 -0.226769 -0.014017  0.003398  0.230687  1.088088 

Family Size >= 20

compareAllFamsToBigFamsAccuracy %>% 
  summarize(corAllVSBig=cor(FamSize10plus,AllFams,method = 'spearman',use='pairwise.complete.obs'))
# A tibble: 1 x 1
  corAllVSBig
        <dbl>
1       0.388

Family Size >= 10

# Variances
compareAllFamsToBigFamsAccuracy<-obsVSpredVars %>% 
  drop_na(.) %>% 
  filter(FamSize>=10) %>% 
  select(-FamSize,-Nobs) %>% 
  nest(predVSobs=c(sireID,damID,predVar,obsVar,CorrWeight)) %>% 
  mutate(FamSize10plus=map_dbl(predVSobs,~psych::cor.wt(.[,3:4],w = .$CorrWeight) %$% r[1,2])) %>% 
  select(-predVSobs) %>% 
  left_join(readxl::read_xlsx(here::here("manuscript","SupplementaryTables.xlsx"),sheet = "TableS11") %>% 
              filter(VarMethod=="PMV",ValidationData=="GBLUPs", grepl("DirDom",Model)) %>% 
              select(-Model,-ValidationData,-VarMethod) %>% 
              select(-AccuracyCor) %>% 
              rename(AllFams=AccuracyWtCor)) %>% 
  mutate(accDiff=FamSize10plus-AllFams)
compareAllFamsToBigFamsAccuracy$accDiff %>% summary
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.33655 -0.05449  0.01336  0.01144  0.07475  0.31545 

Family Size >= 10

compareAllFamsToBigFamsAccuracy %>% 
  summarize(corAllVSBig=cor(FamSize10plus,AllFams,method = 'spearman',use='pairwise.complete.obs'))
# A tibble: 1 x 1
  corAllVSBig
        <dbl>
1       0.893
compareAllFamsToBigFamsAccuracy %>% 
  group_by(Trait1,Trait2) %>% 
  summarize(medianDiffAccSize10PlusVsAllFams=round(median(FamSize10plus-AllFams),3)) %>% arrange(desc(medianDiffAccSize10PlusVsAllFams))
# A tibble: 12 x 3
# Groups:   Trait1 [6]
   Trait1    Trait2    medianDiffAccSize10PlusVsAllFams
   <chr>     <chr>                                <dbl>
 1 DM        TCHART                               0.129
 2 logFYLD   MCMDS                                0.08 
 3 DM        logFYLD                              0.045
 4 logFYLD   logFYLD                              0.039
 5 DM        DM                                   0.019
 6 MCMDS     TCHART                               0.017
 7 TCHART    TCHART                               0.008
 8 MCMDS     MCMDS                                0.004
 9 biofortSI biofortSI                           -0.002
10 stdSI     stdSI                               -0.028
11 logFYLD   TCHART                              -0.06 
12 DM        MCMDS                               -0.062

The Spearman rank correlation between accuracy estimates using all families vs. only families with at least 10 members was 0.89. There should therefore be good concordance with our primary conclusions regardless of the family-size threshold we impose.

The median difference in accuracy (“threshold size families” minus “all families”) was 0.01. Considering only size 10 or greater families noticeably improved prediction accuracy for several trait variances and especially for two covariances (DM-TCHART and logFYLD-MCMDS) (Figures S???).

Comparing PMV and VPM predictions

Comparing posterior mean variance (PMV) to variance of posterior mean (VPM) predictions:

Variance and covariance estimates and cross-validation predictions were made with the PMV method, which is computationally more intensive than the VPM.
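Conceptually, for a single trait the two quantities differ in where the variance is taken over the posterior; a hedged sketch, assuming effects is a (posterior draws x markers) matrix of sampled marker effects and Z the matching genotype matrix (both hypothetical objects):

gHat <- Z %*% colMeans(effects)  # genetic values at the posterior-mean marker effects
vpm  <- var(as.numeric(gHat))    # VPM: variance of the posterior means
pmv  <- mean(apply(effects, 1, function(b) var(as.numeric(Z %*% b)))) # PMV: posterior mean of the variance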

Overall Spearman rank correlation between VPM and PMV variance component estimates.

library(tidyverse); library(magrittr);
## Table S15: Variance estimates for genetic groups
varcomps<-readxl::read_xlsx(here::here("manuscript","SupplementaryTables.xlsx"),sheet = "TableS15")
varcomps %>% 
  filter(Method=="M2",grepl("DirDomAD",Model)) %>% 
  select(-propDom,-Model,-Method) %>% 
  pivot_longer(cols = c(VarA,VarD), names_to = "VarComp", values_to = "VarEst") %>% 
  filter(!is.na(VarEst)) %>% 
  spread(VarMethod,VarEst) %>% 
  summarize(corPMV_VPM=cor(PMV,VPM, use = 'pairwise.complete.obs',method = 'spearman'))
# A tibble: 1 x 1
  corPMV_VPM
       <dbl>
1      0.983

We observed that population variance estimates based on the PMV were consistently larger than the VPM, but the correlation between those estimates was 0.98 (**Figure __**).

Compute the correlation between PMV and VPM predictions.

obsVSpredVars<-readRDS(here::here("output","obsVSpredVars.rds")) %>% 
  filter(grepl("DirDom",Model),ValidationData=="GBLUPs") %>% 
  select(-ValidationData,-Model)

obsVSpredVars %>% 
  select(-obsVar,-FamSize,-Nobs,-CorrWeight) %>% 
  spread(VarMethod,predVar) %>% 
  #group_by(predOf) %>% 
  summarize(corPMV_VPM=cor(PMV,VPM, use = 'pairwise.complete.obs',method = 'spearman'))
# A tibble: 1 x 1
  corPMV_VPM
       <dbl>
1      0.995

Using the predictions from the cross-validation results, we further observed that the PMV predictions were consistently larger and, most notably, that the correlation between PMV and VPM was very high (0.995). Some VPM prediction accuracies actually appear better than the PMV predictions (Figure S???).

The critical point is that VPM and PMV predictions should have very similar rankings. In our primary analysis, we focus on the PMV results with the only exception being the exploratory predictions where we saved time/computation and used the VPM. If implementing mate selections via the usefulness criteria, choosing the VPM method would mostly have the consequence of shrinking the influence on selection decisions towards the mean.

Compare directional dominance to “classic” model

Our focus in this article was not on finding the optimal or most accurate prediction model for obtaining marker effects. However, genome-wide estimates of directional dominance have not previously been made in cassava. For this reason, we make a brief comparison to the standard or “classic” additive-dominance prediction model, in which dominance effects are centered on zero.
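A minimal single-trait sketch of the distinction in BGLR syntax (our illustration, not the multi-trait code used in this study): the fixed regression on a genome-wide homozygosity covariate f is what makes dominance directional, while the “classic” model simply omits that term. Here y, f, Zadd and Zdom are assumed objects.

library(BGLR)
fitDirDom <- BGLR(y = y,
                  ETA = list(list(X = f,    model = "FIXED"), # genome-wide inbreeding effect
                             list(X = Zadd, model = "BRR"),   # additive marker effects
                             list(X = Zdom, model = "BRR")),  # dominance marker effects
                  nIter = 30000, burnIn = 5000, verbose = FALSE)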

library(tidyverse); library(magrittr);
# Table S10: Accuracies predicting the mean
accMeans<-readxl::read_xlsx(here::here("manuscript","SupplementaryTables.xlsx"),sheet = "TableS10") %>% 
  filter(ValidationData=="GBLUPs") %>% 
  select(-ValidationData) %>% 
  mutate(Model=ifelse(grepl("DirDom",Model),"DirDom","Classic"))
## Table S11: Accuracies predicting the variances
accVars<-readxl::read_xlsx(here::here("manuscript","SupplementaryTables.xlsx"),sheet = "TableS11") %>% 
  mutate(Component=ifelse(Trait1==Trait2,"Variance","Covariance"),
         TraitType=ifelse(grepl("SI",Trait1),"SI","ComponentTrait")) %>% 
  filter(ValidationData=="GBLUPs",VarMethod=="PMV") %>% 
  select(-ValidationData,-VarMethod,-AccuracyCor) %>% 
  mutate(Model=ifelse(grepl("DirDom",Model),"DirDom","Classic"))

Compute rank correlations overall (DirDom vs Classic): Means

accMeans %>% 
  spread(Model,Accuracy) %>%   
  summarize(corMeanAcc=cor(DirDom,Classic,method = 'spearman'))
# A tibble: 1 x 1
  corMeanAcc
       <dbl>
1      0.984

Compute rank correlations overall (DirDom vs Classic): Variances

accVars %>% 
  spread(Model,AccuracyWtCor) %>% 
  summarize(corVarAcc=cor(DirDom,Classic,method = 'spearman'))
# A tibble: 1 x 1
  corVarAcc
      <dbl>
1     0.938

Overall, the rankings of prediction accuracies from the two models were similar, as indicated by rank correlations between model accuracy estimates of 0.98 for family means and 0.94 for variances and covariances.

Summarize accuracy differences between DirDom and Classic model: Means

accMeans %>% 
  spread(Model,Accuracy) %>% 
  mutate(diffAcc=DirDom-Classic) %$% summary(diffAcc)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-0.123522  0.002375  0.014843  0.024996  0.040209  0.233371 

Summarize accuracy differences between DirDom and Classic model: Variances

accVars %>% 
  spread(Model,AccuracyWtCor) %>% 
  mutate(diffAcc=DirDom-Classic) %$% summary(diffAcc)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-0.337929 -0.024895 -0.003922 -0.008563  0.013797  0.148698 

Three-quarters of family-mean and almost half of (co)variance accuracy estimates were higher using the directional dominance model.

# **Compute rank correlations of models - cor(DirDom,Classic) - by Trait, predOf - Means**
# accMeans %>% 
#   spread(Model,Accuracy) %>%   
#   group_by(Trait,predOf) %>% 
#   summarize(corAcc=cor(DirDom,Classic,method = 'spearman')) %>% spread(predOf,corAcc) # %$% summary(corAcc) # arrange(desc(corAcc))


# **Compute rank correlations of models - cor(DirDom,Classic) - by Trait, predOf - Variances**
#  accVars %>% 
#   spread(Model,AccuracyWtCor) %>% 
#   group_by(Component,TraitType,Trait1,Trait2,predOf) %>% 
#   summarize(corAcc=round(cor(DirDom,Classic,method = 'spearman'),3),
#             medianDiffAcc=round(median(DirDom-Classic),3)) %>% arrange(corAcc) %>% rmarkdown::paged_table()

The most notably improved predictions were for the family-mean logFYLD TGV (Figures S?? and S??).

Overall correlations between all Classic and Directional Dominance Predictions of Untested Crosses

library(tidyverse); library(magrittr); 
predUntestedCrosses<-read.csv(here::here("manuscript","SupplementaryTable18.csv"),stringsAsFactors = F) 
# predUntestedCrosses %>% 
#   spread(Model,Pred) %>% 
#   group_by(Trait,PredOf,Component) %>% 
#   summarize(corModels=round(cor(ClassicAD,DirDom,method = 'spearman'),2)) %>% 
#   spread(Component,corModels) %>% 
#   arrange(Trait,PredOf) %>% 
#   rmarkdown::paged_table()
predUntestedCrosses %>% 
  spread(Model,Pred) %>% 
  summarize(corModels=round(cor(ClassicAD,DirDom,method = 'spearman'),2))
  corModels
1      0.98

There was also an overall rank correlation of 0.98 between models in the prediction of untested crosses.

Convergence

Visual check of the trace and posterior distribution of the error (co)variance estimates, to assess convergence.
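A complementary numeric check (our addition, not part of the original pipeline) could use coda on one of the saved chains; the file name below is hypothetical:

library(coda)
chain <- mcmc(as.matrix(read.table(here::here("output","mtMarkerEffects","someRun_varE.dat"), header = FALSE))) # hypothetical file
effectiveSize(chain) # effective number of independent draws per parameter
geweke.diag(chain)   # z-scores comparing early vs. late segments of the chain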

library(tidyverse); library(magrittr); library(patchwork); library(ragg)
cv_brrIterHist<-list.files(here::here("output/mtMarkerEffects")) %>% 
  grep(".dat",.,value = T) %>% 
  grep("Repeat",.,value = T) %>% 
  tibble(Filename=.) %>% 
  separate(col = Filename,
           into = c("prefix","rep","fold","dataset","Model","extra"),
           sep = "_",remove = F) %>% 
  mutate(Dataset=paste(rep,fold,dataset,Model,sep = "_")) %>% 
  select(Dataset,Model,Filename)
gengroup_brrIterHist<-list.files(here::here("output/mtMarkerEffects")) %>% 
  grep(".dat",.,value = T) %>% 
  grep("Repeat",.,value = T, invert = T) %>% 
  tibble(Filename=.) %>% 
  separate(col = Filename,
           into = c("prefix","Dataset","Model","extra"),
           sep = "_",remove = F) %>% 
  mutate(Dataset=paste0(Dataset,"_",Model)) %>% 
  select(Dataset,Model,Filename)

brrIterHist<-bind_rows(gengroup_brrIterHist,cv_brrIterHist) %>% 
  mutate(iterHist=map(Filename,~read.table(paste0(here::here("output/mtMarkerEffects/"),.),
                                            header = F, stringsAsFactors = F) %>% 
                        mutate(Iteration=1:nrow(.)))) %>% 
  unnest(iterHist) %>% 
  pivot_longer(cols=c(V1:V10),values_to = "varR",names_to = "varCovarParam")
# brrIterHist$iterHist[[1]] %>% slice()
# brrIterHist %>% slice(1) %>% unnest(iterHist)
colors<-viridis::viridis(4)[1:3]

brrIterHist %>% 
  ggplot(.,aes(x=Iteration,y=varR,color=Model,group=Dataset)) + 
  geom_line(alpha=0.7,size=0.8) + 
  theme_bw() +
  facet_grid(varCovarParam~., scales = 'free_y') + 
 # facet_wrap(~varCovarParam,scales = 'free_y',ncol=1) + 
  theme_bw() +  
  theme(plot.title = element_text(face='bold'),
        axis.title.x = element_blank(), #element_text(face='bold',color = 'black'),
        axis.title.y = element_text(face='bold',color = 'black'),
        strip.text.x = element_text(face='bold',color='black'),
        axis.text.x = element_text(color='black'),
        axis.text.y = element_text(color='black'),
        legend.title = element_text(face = 'bold',color='black'),
        legend.text = element_text(face='bold'),
        legend.position = 'bottom',
        strip.background = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  scale_color_manual(values=colors) + 
  labs(x="Thinned Iteration", y = "Error (Co)variance")

Version Author Date
2e6702a wolfemd 2021-03-24
brrIterHist %>% 
  arrange(Dataset,Model,varCovarParam,Iteration) %>% 
  group_by(Dataset,Model,varCovarParam) %>% 
  slice(1001:6000) %>% 
  ungroup() %>% 
  arrange(Dataset,Model,varCovarParam,Iteration) -> x
colors<-viridis::viridis(10)
x %>%   
  ggplot(.,aes(x=varR,fill=Model)) + 
  geom_density(alpha=0.7,color='black') + 
  theme_bw() +
#  facet_grid(varCovarParam~Model, scales = 'free') + 
  facet_wrap(~varCovarParam,scales = 'free') + 
  theme_bw() +  
  theme(plot.title = element_text(face='bold'),
        axis.title.x = element_blank(), #element_text(face='bold',color = 'black'),
        axis.title.y = element_text(face='bold',color = 'black'),
        strip.text.x = element_text(face='bold',color='black'),
        axis.text.x = element_text(face = 'bold',color='black', angle=30, hjust=1),
        axis.text.y = element_text(face = 'bold',color='black'),
        legend.title = element_text(face = 'bold',color='black'),
        legend.text = element_text(face='bold'),
        legend.position = 'bottom',
        strip.background = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  scale_fill_manual(values=colors) + # the fill aesthetic is mapped, so use the fill scale
  labs(title="Distribution of Post burn-in Error (co)variance estimates")

Version Author Date
2e6702a wolfemd 2021-03-24

sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] patchwork_1.1.1 ragg_1.1.2      magrittr_2.0.1  forcats_0.5.1  
 [5] stringr_1.4.0   dplyr_1.0.5     purrr_0.3.4     readr_1.4.0    
 [9] tidyr_1.1.3     tibble_3.1.0    ggplot2_3.3.3   tidyverse_1.3.0
[13] workflowr_1.6.2

loaded via a namespace (and not attached):
 [1] viridis_0.5.1      httr_1.4.2         sass_0.3.1         jsonlite_1.7.2    
 [5] viridisLite_0.3.0  tmvnsim_1.0-2      here_1.0.1         modelr_0.1.8      
 [9] bslib_0.2.4        assertthat_0.2.1   highr_0.8          cellranger_1.1.0  
[13] yaml_2.2.1         corrplot_0.84      lattice_0.20-41    pillar_1.5.1      
[17] backports_1.2.1    glue_1.4.2         digest_0.6.27      promises_1.2.0.1  
[21] rvest_1.0.0        colorspace_2.0-0   htmltools_0.5.1.1  httpuv_1.5.5      
[25] psych_2.0.12       pkgconfig_2.0.3    broom_0.7.5        haven_2.3.1       
[29] scales_1.1.1       whisker_0.4        later_1.1.0.1      git2r_0.28.0      
[33] generics_0.1.0     farver_2.1.0       ellipsis_0.3.1     withr_2.4.1       
[37] cli_2.3.1          mnormt_2.0.2       crayon_1.4.1       readxl_1.3.1      
[41] evaluate_0.14      fs_1.5.0           fansi_0.4.2        nlme_3.1-152      
[45] xml2_1.3.2         textshaping_0.3.3  tools_4.0.3        hms_1.0.0         
[49] lifecycle_1.0.0    munsell_0.5.0      reprex_1.0.0       compiler_4.0.3    
[53] jquerylib_0.1.3    systemfonts_1.0.1  gridGraphics_0.5-1 rlang_0.4.10      
[57] grid_4.0.3         rstudioapi_0.13    labeling_0.4.2     rmarkdown_2.7     
[61] gtable_0.3.0       DBI_1.1.1          R6_2.5.0           gridExtra_2.3     
[65] lubridate_1.7.10   knitr_1.31         utf8_1.2.1         rprojroot_2.0.2   
[69] stringi_1.5.3      parallel_4.0.3     Rcpp_1.0.6         vctrs_0.3.6       
[73] dbplyr_2.1.0       tidyselect_1.1.0   xfun_0.22