3 Using this manual
3.1 Suggested workflow
Create a
workflowR
repository for your genomic prediction analysis, following instructions here.-
Follow along the following documents as templates and examples:
Genomic Selection Manual (THIS BOOK)
-
GS Process Maps (see process map section below, also placed strategically throughout the manual). There are four total:
Overview Process Map
Data Download and Preparation Process Map
Preliminary Analysis and Cross-validation Process Map
Genomic Mate Selection Process Map
GS Checklist (~TO BE LINKED HERE~)
Use some variant of these documents and code examples to complete a genomic prediction analysis and develop a report on the results.
-
Advice and best practices:
Choose your own data, traits, from cassavabase
-
Work through what the example code actually does for yourself.
- Follow-up functions you don’t know by going to the manual.
- I will strive to provide references to other tutorials, papers, etc. to give context and help you learn more detail where desired..
Inevitably, we will want divergences, alterations, bells-and-whistles on top of the process documented. SUGGEST altering and developing your own process maps and checklists as you go.
Use a combination of Rmarkdown (.Rmd) and Rscripts (.R) to document your analysis, as demonstrated.
Take the time to write commentary throughout. In full sentences, what do you intend to do? How do you interpret the results? What is the next step? Etc.
Take the time to think through the naming of datasets, files, folders, R objects, etc.
Use Git version control, made easy with Rstudio.
Publish your code to GitHub and a report on your results as a webpage using GitHub Pages. I will demonstrate using the package
workflowR
to manage these aspects.
3.2 Prerequisites
- You need to install R, Rstudio and relevant R packages in advance. Instructions are in the next section.
- You need to know at least some R syntax. Links to learning resources are also provided in a section below. If you’ve never used R before, you are going to have trouble following the coding aspects in this manual.
3.3 Install software and packages
3.3.1 R, Rstudio, R packages
-
Install R and Rstudio
-
Install packages
-
tidyverse
(includesdplyr
,tidyr
,ggplot2
,magrittr
and other really useful ones) genomicMateSelectR
workflowr
-
install.packages(c("tidyverse","workflowr", "sommer", "lme4"))
devtools::install_github("wolfemd/genomicMateSelectR", ref = 'master')
3.3.2 Create a GitHub account
We would like to teach you a reproducible, open-access approach to data science and genomic selection.
To start, please go to https://github.com/ and create a free account, if you don’t already have one. We will show you how to create web-based reports, like this, from your analyses!
3.3.3 Command-line Programs
Using Cassavabase-derived data mostly, but not entirely removes the need for command-line informatics tools. However, I was not able to totally avoid it in the example. Furthermore, this will be a valuable skill / experience to learn.
For the section on “preparing genotype data” and further downstream when we do some steps to check the validity of the pedigree we end up needing some bioinformatics tools.
3.3.3.1 Windows
The three possible Linux-emulator applications that colleagues have recommended to me for Windows users:
1. Windows Subsystem for Linux: https://docs.microsoft.com/en-us/windows/wsl/install
2. Git BASH for Windows: https://gitforwindows.org/
3. Cygwin: https://www.cygwin.com/
I can’t give you much advice beyond those links. Get googling for solutions!
I found this open-access google doc: http://bit.ly/2FSSjH6 which might provide guidance for Windows users.
BACK-UP PLAN: We might explore setting up a Cornell-based BioHPC node and allowing everyone to log-in to it remotely. The BioHPC has all the programs we could possibly want and gives access to more memory/compute cores than any single laptop.
3.3.3.2 Mac
You’ll already have access to most of the commands I’ll demonstrate, e.g. grep
, cut
. I recommend installing “Homebrew” which will enable you to easily install e.g. vcftools
and bcftools
by doing e.g. brew install vcftools
in the terminal.
3.4 Learning R and more
R for Data Science (https://r4ds.had.co.nz/):
As the Rstudio / Tidyverse team recommends here (https://www.tidyverse.org/learn/), this book is “the best place to start learning the tidyverse.” Just about anyone should be able to (1) read the brief introduction and (2) look over the table of contents and quickly find a starting point that meets their level / interest.-
Data Challenge Lab (https://datalab.stanford.edu/challenge-lab)
Where students develop their data skills by solving a progression of increasingly difficult challenges. I just discovered this. There is much here that I think is useful. Especially amongst the “Open Content” (https://dcl-docs.stanford.edu/home/). For example:Data Wrangling
https://dcl-wrangle.stanford.edu/manip-basics.htmlFunctional Programming
https://dcl-prog.stanford.edu/
https://dcl-prog.stanford.edu/purrr-mutate.html (I’ve been told my loops using purrr package functions are confusing… check this out).
Also:https://dcl-prog.stanford.edu/purrr-parallel.html
Learn the tidyverse
https://www.tidyverse.org/learn/
https://www.tidyverse.org/packages/APS 135: Introduction to Exploratory Data Analysis with R
(https://dzchilds.github.io/eda-for-bio/)
Has some useful intro sections on the very basics of R and Rstudio, seems to have come out of a Plant Science department, so there is that.Stat545: Data wrangling, exploration, and analysis with R
https://stat545.com/
Compendium of Learning Resources gDoc: This google doc contains all the links/references above. It contains even more, and I hope to maintain it as a growing, dynamic, more comprehensive annotated list of resources for learning R, Rstudio, data science and more.
3.5 R sessions, packages to load
I will use the tidyverse
and also genomicMateSelectR
packages throughout the pipeline.
There are others that may appear.
I recommend, for each pipeline segment, starting with a new R session. Begin each segment, with a step to load these R packages:
3.6 High performance and remote computing
The example in this manual is designed to work on a laptop… at least a new one. I’ve got 16-cores and 64GB RAM on the machine I developed it on.
In practice, with the large number of plots, clones and SNPs that we actually work with, we will not use a laptop for these computations.
At some point, perhaps at the end of the pipeline run-through we will want to cover the (remote) use of high performance computing machines to facilitate.