7 Prepare phenotype data

Context and Purpose: In this step, we do quality control, clean and format training data for further analysis.
Upstream: Section 6 - training data download
Downstream: pretty much everything
Inputs: “Raw” field trial data
Expected outputs: “Cleaned” field trial data

7.1 Process Map

7.2 Read DB data

Load the phenotype and metadata downloads into R.

I built a function readDBdata that simply wraps around read.csv, reads and merges the metadata to the plot-basis data. The metadataFile= argument can be left NULL.

dbdata<-readDBdata(phenotypeFile = here::here("data","phenotype.csv"),
                   metadataFile = here::here("data","metadata.csv"))
#> Joining, by = c("studyYear", "programDbId", "programName", "programDescription", "studyDbId", "studyName", "studyDescription", "studyDesign", "plotWidth", "plotLength", "fieldSize", "fieldTrialIsPlannedToBeGenotyped", "fieldTrialIsPlannedToCross", "plantingDate", "harvestDate", "locationDbId", "locationName")

HINT: At any point in the manual, if I reference or use a custom function in the genomicMateSelectR, I encourage you to check out the reference page for that function, e.g. readDBdata(). Or look at the code yourself by typing e.g. readDBdata at the R console or heading to the GitHub repo.

7.3 Check experimental designs

Checklist: Are the data plot-basis, plant-basis or a mixture? If plant-basis data are present, should they be converted to plot-basis for further analysis?

dbdata %>% count(observationLevel)
#>   observationLevel    n
#> 1             plot 2533
# table(dbdata$observationLevel)

Only plot-basis in this case.

Checklist: What experimental designs are present? How are they represented by the variables in the dataset? Are all designs consistent with your expectations, for example relative to the reported “trialType,” “studyName” and/or “studyDesign?”

In this step, in the past, I have not been certain of the experimental designs of the trials I had downloaded. I was also not certain how the designs were represented in the column-names. For this reason, I developed an ad hoc custom code to “detect” the designs. I built the genomicMateSelectR function detectExptDesigns(). See an example here.

RECOMMENDATION: Each analyst needs to use exploratory data anlaysis, making summary statistics and plots as necessary to determine how the data should be modelled downstream. If there are missing or incorrectly represented trial design variables, get it corrected on the database (contact breeding program data manager, as necessary).

Because I have a small example dataset, it is possible to look at 9 trials and evaluate.

Often, many more trials are part of a genomic prediction. This is why it is essential that trial designs be consistent, and clear to the analyst. You may need to derive a strategy similar to the detectExptDesigns() function to semi-automate the process.

library(gt)
dbdata %>% 
     count(studyName,trialType, studyDesign, numberBlocks,numberReps,entryType) %>% 
     spread(entryType,n) %>% 
     gt()  %>% 
     tab_options(table.font.size = pct(75))

studyName	trialType	studyDesign	numberBlocks	numberReps	check	test
19.GS.C1.C2.C3.AYT.42.UB	NA	Alpha	NA	NA	15	110
19.GS.C2.UYT.36.setA.IB	Uniform Yield Trial	Alpha	6	2	10	58
19.GS.C2.UYT.36.setA.UB	Uniform Yield Trial	Alpha	6	2	10	62
19.GS.C2.UYT.36.setB.IB	Uniform Yield Trial	RCBD	6	2	10	56
19.GS.C2.UYT.36.setB.UB	Uniform Yield Trial	Alpha	6	2	10	62
19.GS.C4B.PYT.135.UB	Preliminary Yield Trial	Alpha	30	2	12	258
19.GS.C4B.PYT.140.IB	Preliminary Yield Trial	Alpha	28	2	31	242
19.GS.C4C.CE.822.IB	Clonal Evaluation	RCBD	42	1	132	645
19geneticgainUB	genetic_gain_trial	Augmented	11	1	NA	810

Summary table above shows:

trialType and studyDesign cannot be 100% relied upon, at least not here.
The only trial actually listed as having studyDesign=="Augmented" does not have “check” vs. “test” distinguished in the “entryType.”
A trialType=="Clonal Evaluation" with studyDesign=="RCBD" but actually only 1 replication.

Next, I’ll check if the replicate and blockNumber columns reliably distinguish complete and incomplete blocks in the data.

dbdata %>% 
     group_by(studyName) %>% 
     summarize(N_replicate=length(unique(replicate)),
               N_blockNumber=length(unique(blockNumber))) %>% 
     gt() %>% tab_options(table.font.size = pct(75))

studyName	N_replicate	N_blockNumber
19.GS.C1.C2.C3.AYT.42.UB	3	3
19.GS.C2.UYT.36.setA.IB	2	6
19.GS.C2.UYT.36.setA.UB	2	6
19.GS.C2.UYT.36.setB.IB	2	6
19.GS.C2.UYT.36.setB.UB	2	6
19.GS.C4B.PYT.135.UB	2	30
19.GS.C4B.PYT.140.IB	2	28
19.GS.C4C.CE.822.IB	1	42
19geneticgainUB	1	11

Here, I notice that except 1 trial (19.GS.C1.C2.C3.AYT.42.UB) has the same number of reps and blocks.

The question is, are complete replications of the experiment indicated by replicate and incomplete sub-blocks represented by blockNumber

dbdata %>% 
     group_by(studyName) %>% 
     summarize(N_replicate=length(unique(replicate)),
               N_blockNumber=length(unique(blockNumber)),
               doRepsEqualBlocks=all(replicate==blockNumber)) %>% 
     gt() %>% tab_options(table.font.size = pct(75))

studyName	N_replicate	N_blockNumber	doRepsEqualBlocks
19.GS.C1.C2.C3.AYT.42.UB	3	3	TRUE
19.GS.C2.UYT.36.setA.IB	2	6	FALSE
19.GS.C2.UYT.36.setA.UB	2	6	FALSE
19.GS.C2.UYT.36.setB.IB	2	6	FALSE
19.GS.C2.UYT.36.setB.UB	2	6	FALSE
19.GS.C4B.PYT.135.UB	2	30	FALSE
19.GS.C4B.PYT.140.IB	2	28	FALSE
19.GS.C4C.CE.822.IB	1	42	FALSE
19geneticgainUB	1	11	FALSE

So for 1 trial, there are 3 complete blocks, no sub-blocks. For 6 trials, there are 2 complete replications and nested sub-blocks represented by the blockNumber variable. For 2 trials, there are only incomplete blocks.

Next, I decided to check that the replicate column definitely means complete blocks. The below might look a bit complicated, but I basically merge two summaries: (1) he overall number of accessions per trial, and (2) the average number of accessions per replicate per trial.

# the overall number of accessions per trial
dbdata %>% 
     group_by(studyName) %>% 
     summarize(N_accession=length(unique(germplasmName))) %>% 
     # the average number of accessions per replicate per trial
     left_join(dbdata %>% 
                    group_by(studyName,replicate) %>% 
                    summarize(N_accession=length(unique(germplasmName))) %>% 
                    group_by(studyName) %>% 
                    summarize(avgAccessionsPerReplicate=ceiling(mean(N_accession)))) %>% 
     gt() %>% tab_options(table.font.size = pct(75))
#> `summarise()` has grouped output by 'studyName'. You can override using the `.groups` argument.
#> Joining, by = "studyName"

studyName	N_accession	avgAccessionsPerReplicate
19.GS.C1.C2.C3.AYT.42.UB	42	42
19.GS.C2.UYT.36.setA.IB	35	34
19.GS.C2.UYT.36.setA.UB	36	36
19.GS.C2.UYT.36.setB.IB	36	33
19.GS.C2.UYT.36.setB.UB	36	36
19.GS.C4B.PYT.135.UB	135	135
19.GS.C4B.PYT.140.IB	129	127
19.GS.C4C.CE.822.IB	657	657
19geneticgainUB	782	782

The numbers are very similar for all trials, indicating complete blocks.

One more: look at the min, mean and max number of accessions per blockNumber.

# the overall number of accessions per trial
dbdata %>% 
     group_by(studyName) %>% 
     summarize(N_accession=length(unique(germplasmName))) %>% 
     left_join(dbdata %>% 
     group_by(studyName,replicate,blockNumber) %>% 
     summarize(N_accession=length(unique(germplasmName))) %>% ungroup() %>% 
     group_by(studyName) %>% 
     summarize(minAccessionsPerBlock=ceiling(min(N_accession)),
               avgAccessionsPerBlock=ceiling(mean(N_accession)),
               maxAccessionsPerBlock=ceiling(max(N_accession)))) %>% 
     gt() %>% tab_options(table.font.size = pct(60))
#> `summarise()` has grouped output by 'studyName', 'replicate'. You can override using the `.groups` argument.
#> Joining, by = "studyName"

studyName	N_accession	minAccessionsPerBlock	avgAccessionsPerBlock	maxAccessionsPerBlock
19.GS.C1.C2.C3.AYT.42.UB	42	41	42	42
19.GS.C2.UYT.36.setA.IB	35	11	12	12
19.GS.C2.UYT.36.setA.UB	36	12	12	12
19.GS.C2.UYT.36.setB.IB	36	9	11	12
19.GS.C2.UYT.36.setB.UB	36	12	12	12
19.GS.C4B.PYT.135.UB	135	9	9	9
19.GS.C4B.PYT.140.IB	129	8	10	10
19.GS.C4C.CE.822.IB	657	10	19	20
19geneticgainUB	782	39	73	80

From this, you can see that except for studyName=="19.GS.C1.C2.C3.AYT.42.UB" the sub-blocks represented by blockNumber have only subsets of the total number of accessions in the trial, as expected.

Further, except for studyName=="19geneticgainUB" all trials have pretty consistently sized sub-blocks.

Now I will ad hoc create two variables (CompleteBlocks and IncompleteBlocks), indicating (TRUE/FALSE) whether to model using the replicate and/or blockNumber variable.

I also like to create explicitly nested design variables (yearInLoc, trialInLocYr, repInTrial, blockInRep).

dbdata %<>% 
     group_by(studyName) %>% 
     summarize(N_replicate=length(unique(replicate)),
               N_blockNumber=length(unique(blockNumber)),
               doRepsEqualBlocks=all(replicate==blockNumber)) %>% 
     ungroup() %>% 
     mutate(CompleteBlocks=ifelse(N_replicate>1,TRUE,FALSE),
            IncompleteBlocks=ifelse(N_blockNumber>1 & !doRepsEqualBlocks,TRUE,FALSE)) %>% 
     left_join(dbdata) %>% 
     mutate(yearInLoc=paste0(programName,"_",locationName,"_",studyYear),
            trialInLocYr=paste0(yearInLoc,"_",studyName),
            repInTrial=paste0(trialInLocYr,"_",replicate),
            blockInRep=paste0(repInTrial,"_",blockNumber))
#> Joining, by = "studyName"

Just to check:

dbdata %>% 
     count(studyName,CompleteBlocks,IncompleteBlocks) %>% 
     left_join(dbdata %>% 
                    group_by(studyName) %>% 
                    summarize(nRepInTrial=length(unique(repInTrial)),
                              nBlockInRep=length(unique(blockInRep)))) %>% 
     gt() %>% tab_options(table.font.size = pct(67))
#> Joining, by = "studyName"

studyName	CompleteBlocks	IncompleteBlocks	n	nRepInTrial	nBlockInRep
19.GS.C1.C2.C3.AYT.42.UB	TRUE	FALSE	125	3	3
19.GS.C2.UYT.36.setA.IB	TRUE	TRUE	68	2	6
19.GS.C2.UYT.36.setA.UB	TRUE	TRUE	72	2	6
19.GS.C2.UYT.36.setB.IB	TRUE	TRUE	66	2	6
19.GS.C2.UYT.36.setB.UB	TRUE	TRUE	72	2	6
19.GS.C4B.PYT.135.UB	TRUE	TRUE	270	2	30
19.GS.C4B.PYT.140.IB	TRUE	TRUE	273	2	28
19.GS.C4C.CE.822.IB	FALSE	TRUE	777	1	42
19geneticgainUB	FALSE	TRUE	810	1	11

7.4 Traits and Trait Abbreviations

Cassavabase downloads use very long column-names corresponding to the full trait-ontology name. For convenience, I replace these names with abbreviations, documented here. For eventual upload of analysis results, names will need to be restored to ontology terms.

I also use this opportunity to subselect traits.

traitabbrevs<-tribble(~TraitAbbrev,~TraitName,
        "CMD1S","cassava.mosaic.disease.severity.1.month.evaluation.CO_334.0000191",
        "CMD3S","cassava.mosaic.disease.severity.3.month.evaluation.CO_334.0000192",
        "CMD6S","cassava.mosaic.disease.severity.6.month.evaluation.CO_334.0000194",
        "DM","dry.matter.content.percentage.CO_334.0000092",
        "RTWT","fresh.storage.root.weight.per.plot.CO_334.0000012",
        "NOHAV","plant.stands.harvested.counting.CO_334.0000010")
traitabbrevs %>% gt()#rmarkdown::paged_table()

TraitAbbrev	TraitName
CMD1S	cassava.mosaic.disease.severity.1.month.evaluation.CO_334.0000191
CMD3S	cassava.mosaic.disease.severity.3.month.evaluation.CO_334.0000192
CMD6S	cassava.mosaic.disease.severity.6.month.evaluation.CO_334.0000194
DM	dry.matter.content.percentage.CO_334.0000092
RTWT	fresh.storage.root.weight.per.plot.CO_334.0000012
NOHAV	plant.stands.harvested.counting.CO_334.0000010

Run function renameAndSelectCols() to rename columns and remove unselected traits.

dbdata<-renameAndSelectCols(traitabbrevs,
                            indata=dbdata,
                            customColsToKeep = c("observationUnitName",
                                                 "CompleteBlocks",
                                                 "IncompleteBlocks",
                                                 "yearInLoc",
                                                 "trialInLocYr",
                                                 "repInTrial","blockInRep"))
#> Joining, by = "TraitName"

7.5 QC Trait Values

At this point in the pipeline, we should check the all trait values are in allowable ranges. Different ways to approach this. Feel free to make some plots of your data!

The database also has mechanisms to ensure trait values are only within allowable ranges.

Nevertheless, as a habit, I have an simple ad hoc approach to this:

# comment out the traits not present in this dataset
dbdata<-dbdata %>% 
     dplyr::mutate(CMD1S=ifelse(CMD1S<1 | CMD1S>5,NA,CMD1S),
                   CMD3S=ifelse(CMD3S<1 | CMD3S>5,NA,CMD3S),
                   # CMD6S=ifelse(CMD6S<1 | CMD6S>5,NA,CMD6S), 
                   # CMD9S=ifelse(CMD9S<1 | CMD9S>5,NA,CMD9S),
                   # CGM=ifelse(CGM<1 | CGM>5,NA,CGM),
                   # CGMS1=ifelse(CGMS1<1 | CGMS1>5,NA,CGMS1),
                   # CGMS2=ifelse(CGMS2<1 | CGMS2>5,NA,CGMS2),
                   DM=ifelse(DM>100 | DM<=0,NA,DM),
                   RTWT=ifelse(RTWT==0 | NOHAV==0 | is.na(NOHAV),NA,RTWT),
                   # SHTWT=ifelse(SHTWT==0 | NOHAV==0 | is.na(NOHAV),NA,SHTWT),
                   # RTNO=ifelse(RTNO==0 | NOHAV==0 | is.na(NOHAV),NA,RTNO),
                   NOHAV=ifelse(NOHAV==0,NA,NOHAV),
                   NOHAV=ifelse(NOHAV>42,NA,NOHAV)
                   # RTNO=ifelse(!RTNO %in% 1:10000,NA,RTNO)
     )

7.6 Post-QC: composite traits

Now that component traits are QC’d, it’s time to compute any composite traits.

By composite traits, I mean traits computed from combinations of other traits.

Examples for cassava: season-wide mean disease severity, harvest index, and fresh root yield.

7.6.1 Season-wide mean disease severity

# [NEW AS OF APRIL 2021]
## VERSION with vs. without CBSD
## Impervious to particular timepoints between 1, 3, 6 and 9 scores

# Without CBSD (West Africa)
dbdata<-dbdata %>% 
  mutate(MCMDS=rowMeans(.[,colnames(.) %in% c("CMD1S","CMD3S","CMD6S","CMD9S")], na.rm = T)) %>% 
  select(-any_of(c("CMD1S","CMD3S","CMD6S","CMD9S")))

# With CBSD (East Africa)
# dbdata<-dbdata %>% 
#   mutate(MCMDS=rowMeans(.[,colnames(.) %in% c("CMD1S","CMD3S","CMD6S","CMD9S")], na.rm = T),
#          MCBSDS=rowMeans(.[,colnames(.) %in% c("CBSD1S","CBSD3S","CBSD6S","CBSD9S")], na.rm = T)) %>% 
#   select(-any_of(c("CMD1S","CMD3S","CMD6S","CMD9S","CBSD1S","CBSD3S","CBSD6S","CBSD9S")))

7.6.2 Fresh root yield (FYLD)

RTWT (fresh root weight per plot in kg) –> FYLD (fresh root yield in tons per hectare)

\[FYLD = \frac{RTWT_{kg / plot}}{MaxHarvestedPlantsPerPlot \times PlantSpacing}\times10\] NOTE: MaxHarvestedPlantsPerPlot in formula above is to distinguish from the plantsPerPlot meta-data field, in case that a net-plot harvest is used. In other words, the value should be the total number of plants intended for harvest in a plot, assuming there were no missing plants in the plot.

PlantSpacing is the area in \(m^2\) per plant.

dbdata %>% 
     count(studyYear,studyName,studyDesign,plotWidth,plotLength,plantsPerPlot) %>% 
     mutate(plotArea=plotWidth*plotLength) %>% 
     gt() %>% tab_options(table.font.size = pct(67))

studyYear	studyName	studyDesign	plotWidth	plotLength	plantsPerPlot	n	plotArea
2019	19.GS.C1.C2.C3.AYT.42.UB	Alpha	4	4.0	NA	125	16.0
2019	19.GS.C2.UYT.36.setA.IB	Alpha	4	4.0	NA	68	16.0
2019	19.GS.C2.UYT.36.setA.UB	Alpha	4	4.0	NA	72	16.0
2019	19.GS.C2.UYT.36.setB.IB	RCBD	4	4.0	NA	66	16.0
2019	19.GS.C2.UYT.36.setB.UB	Alpha	4	4.0	NA	72	16.0
2019	19.GS.C4B.PYT.135.UB	Alpha	2	4.0	NA	270	8.0
2019	19.GS.C4B.PYT.140.IB	Alpha	3	2.5	NA	273	7.5
2019	19.GS.C4C.CE.822.IB	RCBD	1	2.5	NA	777	2.5
2019	19geneticgainUB	Augmented	1	2.5	NA	810	2.5

In the example trial data, the plantsPerPlot meta-data field is empty. To my knowledge, no meta-data field is available in BreedBase to represent a net-plot harvest.

RECOMMEND INPUTING plantsPerPlot meta-data to cassavabase for your breeding program!

Luckily, since there are only 9 trials and this is a tutorial, we will decisions manually.

Firstly noting that the trial 19geneticgainUB actually does not have phenotypes (for any trait). It will be excluded downstream. (~~I might find a substitute genetic gain trial, from an earlier year, for the sake of this example~~)

To decide what the real MaxHarvestedPlantsPerPlot and plantsPerPlot were likely to have been, I make two plots below and also compute the maximum NOHAV for each trial.

dbdata %>% 
     ggplot(.,aes(x=NOHAV, fill=studyName)) + geom_density(alpha=0.75)
#> Warning: Removed 921 rows containing non-finite values
#> (stat_density).

Maybe clearer to make a boxplot?

dbdata %>% 
     # plot area in meters squared
     mutate(plotArea=plotWidth*plotLength) %>% 
     ggplot(.,aes(x=plotArea,y=NOHAV, fill=studyName)) + 
     geom_boxplot() + theme(axis.text.x = element_blank())
#> Warning: Removed 921 rows containing non-finite values
#> (stat_boxplot).

plantsPerPlot_choices<-dbdata %>% 
     distinct(studyYear,studyName,plotWidth,plotLength,plantsPerPlot) %>% 
     left_join(dbdata %>% 
                    group_by(studyName) %>% 
                    summarize(MaxNOHAV=max(NOHAV, na.rm=T))) %>% 
          # plot area in meters squared
     mutate(plotArea=plotWidth*plotLength,
            # Number of plants per plot
            plantsPerPlot=MaxNOHAV,
            plantsPerPlot=ifelse(studyName=="19.GS.C2.UYT.36.setA.UB",20,plantsPerPlot)) %>% 
     # exclude the empty genetic gain trial
     filter(studyName!="19geneticgainUB") %>% 
     select(studyName,plotArea,MaxNOHAV,plantsPerPlot)
#> Warning in max(NOHAV, na.rm = T): no non-missing arguments
#> to max; returning -Inf
#> Joining, by = "studyName"
plantsPerPlot_choices %>% gt() #%>% tab_options(table.font.size = pct(67))

studyName	plotArea	MaxNOHAV	plantsPerPlot
19.GS.C1.C2.C3.AYT.42.UB	16.0	10	10
19.GS.C2.UYT.36.setA.IB	16.0	20	20
19.GS.C2.UYT.36.setA.UB	16.0	18	20
19.GS.C2.UYT.36.setB.IB	16.0	20	20
19.GS.C2.UYT.36.setB.UB	16.0	20	20
19.GS.C4B.PYT.135.UB	8.0	10	10
19.GS.C4B.PYT.140.IB	7.5	9	9
19.GS.C4C.CE.822.IB	2.5	7	7

For the sake of this example, it is ‘ok’ to make choices on the basis I have just done.

As a data generator, in-house at each breeding program, no reason not to get the correct answer and repair the metadata on the database!

dbdata %<>%
     # remove the empty genetic gain trial
     filter(studyName!="19geneticgainUB") %>% 
     select(-plantsPerPlot) %>% 
     # join plantsPerPlot_choices to the trial data
     left_join(plantsPerPlot_choices) %>% 
     # compute fresh root yield (FYLD) in tons per hectare
     mutate(PlantSpacing=plotArea/plantsPerPlot,
            FYLD=RTWT/(plantsPerPlot*PlantSpacing)*10)
#> Joining, by = "studyName"
dbdata %>% ggplot(.,aes(x=FYLD,fill=studyName)) + geom_density(alpha=0.75)
#> Warning: Removed 133 rows containing non-finite values
#> (stat_density).

Additional things to compute:

log-transform yield traits: this is a habit based on experience. Linear mixed-models should have normally distributed homoskedastic residuals, if they don’t log-transform the response variable often helps. For FYLD and related traits, I always log-transform.

# I log transform yield traits 
# to satisfy homoskedastic residuals assumption 
# of linear mixed models
dbdata %<>% 
     mutate(DYLD=FYLD*(DM/100),
            logFYLD=log(FYLD),
            logDYLD=log(DYLD),
            PropNOHAV=NOHAV/plantsPerPlot) 
# remove non transformed / per-plot (instead of per area) traits
dbdata %<>% select(-RTWT,-FYLD,-DYLD)
dbdata %>% ggplot(.,aes(x=logFYLD,fill=studyName)) + geom_density(alpha=0.75)
#> Warning: Removed 133 rows containing non-finite values
#> (stat_density).

Debatable whether this is better. Let’s not dwell on it. Onward!

SUGGESTION: For individuals working this manual, consider making different, or no transformations as you see fit, with your own data. Even better, set-up a direct comparison of results with- vs. without-transformation.*

7.7 Save “cleaned” phenotypes

saveRDS(dbdata,file=here::here("output","phenotypes_cleaned.rds"))

6 Download training data

8 Prepare genotypic data