3 Using this manual

3.1 Suggested workflow

  • Create a workflowr repository for your genomic prediction analysis, following the instructions here.

  • Use the following documents as templates and examples:

    • Genomic Selection Manual (THIS BOOK)

    • GS Process Maps (see the process map section below; the maps are also placed strategically throughout the manual). There are four in total:

      1. Overview Process Map

      2. Data Download and Preparation Process Map

      3. Preliminary Analysis and Cross-validation Process Map

      4. Genomic Mate Selection Process Map

    • GS Checklist (~TO BE LINKED HERE~)

  • Use some variant of these documents and code examples to complete a genomic prediction analysis and develop a report on the results.

  • Advice and best practices:

    • Choose your own data and traits from Cassavabase.

    • Work through what the example code actually does for yourself.

      • Follow up on functions you don’t know by looking them up in their manual (help) pages.
      • I will strive to provide references to other tutorials, papers, etc. to give context and help you learn more detail where desired.
    • Inevitably, we will want divergences, alterations and bells-and-whistles on top of the process documented here. I suggest altering and developing your own process maps and checklists as you go.

    • Use a combination of R Markdown (.Rmd) and R scripts (.R) to document your analysis, as demonstrated.

    • Take the time to write commentary throughout. In full sentences, what do you intend to do? How do you interpret the results? What is the next step? Etc.

    • Take the time to think through the naming of datasets, files, folders, R objects, etc.

    • Use Git version control, made easy with RStudio.

    • Publish your code to GitHub and a report on your results as a webpage using GitHub Pages. I will demonstrate using the workflowr package to manage these aspects.
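
As a rough sketch of how the workflowr steps in this workflow fit together: the helper `start_gs_project()`, the directory name `myGSproject`, the file name and the GitHub user name below are hypothetical placeholders for illustration, not part of workflowr or of this manual's example. The `wflow_*` functions are from the workflowr package, which is assumed to be installed.

```r
# Sketch only: the workflowr call is wrapped in a helper so that nothing
# runs until you call it yourself.
start_gs_project <- function(directory = "myGSproject") {  # hypothetical name
  if (!requireNamespace("workflowr", quietly = TRUE)) {
    stop("Please install workflowr first: install.packages('workflowr')")
  }
  # Creates the project folder (analysis/, code/, data/, docs/, output/)
  # and initializes a Git repository in it.
  workflowr::wflow_start(directory)
}

# Typical steps around it (uncomment and adapt; names are placeholders):
# workflowr::wflow_git_config(user.name = "Your Name",
#                             user.email = "you@example.org")  # once per machine
# start_gs_project("myGSproject")
# workflowr::wflow_build()       # render the .Rmd files in analysis/ to docs/
# workflowr::wflow_publish("analysis/index.Rmd",
#                          message = "First analysis")  # commit and build
# workflowr::wflow_use_github("your-github-username")   # link to GitHub
# workflowr::wflow_git_push()                           # push to GitHub
```

Keeping the publishing steps commented out is deliberate: each one changes files or the Git history, so it is better to run them one at a time and check the result.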

3.2 Prerequisites

  1. You need to install R, RStudio and the relevant R packages in advance. Instructions are in the next section.
  2. You need to know at least some R syntax. Links to learning resources are provided in a section below. If you’ve never used R before, you will have trouble following the coding aspects of this manual.

3.3 Install software and packages

3.3.1 R, RStudio, R packages

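# devtools is needed for install_github(); gt is loaded later for tables
install.packages(c("tidyverse", "workflowr", "sommer", "lme4", "devtools", "gt"))
# genomicMateSelectR is installed from GitHub
devtools::install_github("wolfemd/genomicMateSelectR", ref = 'master')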
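
After installing, it can help to confirm that everything is actually available before starting the pipeline. The following is a minimal sketch using only base R; the package list is taken from the install commands in this section, plus gt, which is loaded in a later section. The variable names are my own.

```r
# Packages this manual asks for.
needed <- c("tidyverse", "workflowr", "sommer", "lme4",
            "genomicMateSelectR", "gt")

# requireNamespace() returns TRUE if a package can be loaded, without
# attaching it to the search path.
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Report anything that still needs to be installed.
missing_pkgs <- needed[!installed]
if (length(missing_pkgs) == 0) {
  message("All packages are installed.")
} else {
  message("Still missing: ", paste(missing_pkgs, collapse = ", "))
}
```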

3.3.2 Create a GitHub account

We would like to teach you a reproducible, open-access approach to data science and genomic selection.

To start, please go to https://github.com/ and create a free account, if you don’t already have one. We will show you how to create web-based reports, like this, from your analyses!

3.3.3 Command-line Programs

Using Cassavabase-derived data mostly, but not entirely, removes the need for command-line informatics tools. I was not able to avoid them completely in the example. Besides, command-line experience is a valuable skill to build.

For the section on “preparing genotype data”, and further downstream when we check the validity of the pedigree, we end up needing some bioinformatics tools.

3.3.3.1 Windows

Colleagues have recommended three possible Linux-emulator applications for Windows users:

1. Windows Subsystem for Linux: https://docs.microsoft.com/en-us/windows/wsl/install

2. Git BASH for Windows: https://gitforwindows.org/

3. Cygwin: https://www.cygwin.com/

I can’t give you much advice beyond those links. Get googling for solutions!

I found this open-access Google Doc, which might provide guidance for Windows users: http://bit.ly/2FSSjH6

BACK-UP PLAN: We might explore setting up a Cornell-based BioHPC node and allowing everyone to log in to it remotely. The BioHPC has all the programs we could possibly want and gives access to more memory/compute cores than any single laptop.

3.3.3.2 Mac

You’ll already have access to most of the commands I’ll demonstrate, e.g. grep and cut. I recommend installing “Homebrew”, which lets you easily install tools such as vcftools and bcftools by running, for example, brew install vcftools in the terminal.

3.3.3.3 Programs we might use

  • Bioinformatics command-line software tools:

    • vcftools

    • bcftools

    • plink1.9: When it comes up late in the pipeline, I describe how I got it working on my Mac laptop. In a pinch, you can download and unzip the pre-compiled plink program on your machine and then use ./plink to run it.

  • There are some other commands we might encounter that should come pre-installed, at least on the Mac and Linux command lines.
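
To see which of these programs are already visible on your machine, you can ask from within R. This is a small sketch using base R's `Sys.which()`; the executable names are assumptions (plink 1.9, for example, may be installed under a different name such as `plink1.9`, or live in a folder that is not on your PATH).

```r
# Command-line programs mentioned in this section (names are assumptions).
tools <- c("vcftools", "bcftools", "plink", "grep", "cut")

# Sys.which() returns the full path of each program found on the PATH,
# or an empty string if it is not found.
tool_paths <- Sys.which(tools)

# Summarize as a small table: one row per tool.
tool_status <- data.frame(tool  = tools,
                          found = unname(nzchar(tool_paths)),
                          path  = unname(tool_paths))
print(tool_status)
```

A tool reported as not found may still be usable by giving its full path, as with the ./plink approach described above.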

3.4 Learning R and more

Compendium of Learning Resources gDoc: This Google Doc contains all the links/references above, and more. I hope to maintain it as a growing, dynamic, more comprehensive annotated list of resources for learning R, RStudio, data science and more.

3.4.3 Hotkeys

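It is pretty critical to learn a few RStudio hotkeys, especially these (shown for Mac; on Windows and Linux, use Ctrl in place of CMD and Alt in place of Option):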

  • CMD+Option+I = create chunk
  • Shift+CMD+M = %>% pipe operator
  • CMD+Enter = submit (run) lines of code in your Rmd or R script to the console.
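
If the pipe operator is new to you, this short sketch shows what `%>%` does: it passes the result on its left as the first argument to the function on its right. The numbers are made up for illustration, and magrittr is loaded directly here only to keep the example small; the pipe is also available after `library(tidyverse)`.

```r
library(magrittr)  # provides %>%

x <- c(4, 9, 16)

# Without the pipe: read from the inside out.
nested <- round(mean(sqrt(x)), 1)

# With the pipe: read left to right, one step per line.
piped <- x %>%
  sqrt() %>%
  mean() %>%
  round(1)

# Both forms compute the same value.
identical(nested, piped)
```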

3.5 R sessions, packages to load

I will use the tidyverse and genomicMateSelectR packages throughout the pipeline.

Others may appear along the way.

I recommend starting each pipeline segment with a new R session. Begin each segment with a step to load these R packages:

library(tidyverse)
library(genomicMateSelectR)
library(gt) # just for the nice looking tables

3.6 High performance and remote computing

The example in this manual is designed to work on a laptop… at least a new one. The machine I developed it on has 16 cores and 64 GB of RAM.
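
To see how your own machine compares, you can check how many cores R detects. This is a minimal sketch using the parallel package that ships with R; leaving one core free is a common rule of thumb and my own assumption, not a requirement of the pipeline.

```r
# How many CPU cores does this machine report?
n_cores <- parallel::detectCores()

# detectCores() can return NA on some platforms, so fall back to 1.
if (is.na(n_cores)) n_cores <- 1L

# Conservative choice: leave one core free for the rest of the system.
cores_to_use <- max(1L, n_cores - 1L)
cores_to_use
```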

In practice, with the large number of plots, clones and SNPs that we actually work with, we will not use a laptop for these computations.

At some point, perhaps at the end of the pipeline run-through, we will want to cover the (remote) use of high performance computing machines to facilitate analyses at that scale.