# Introduction

In recent years, advances in next-generation sequencing have enabled the development of millions of new markers, which have been consistently used in studies of important agronomic traits (Edwards and Batley, 2010). Hence, breeders have focused on studies that associate markers with phenotypes of interest, predict performance, or investigate population structure and diversity.
After a large amount of genomic data has been generated, it must pass through quality control and imputation of missing genotypes. Some genomic selection (GS) models that take advantage of dimensionality reduction, such as GBLUP and RKHS, need relationship matrices to be constructed. Moreover, understanding population genetics parameters is important as well.
Therefore, genomic datasets need to be prepared in such a way that they can be easily applied in a wide range of studies. Hence, we propose the snpReady package, designed to get datasets ready to run genomic studies in leading genomic applications. The package covers the three primary needs faced before running genomic analyses: preparation and quality control of datasets, estimation of relationship matrices, and estimation of basic population genetics parameters.

# Quality control

The raw.data function was created to recode and reshape the matrices obtained from different SNP genotyping platforms and get them ready to be used in genomic analyses. Thus, it reshapes and recodes the dataset, performs quality control and imputes missing data. It also cleans the map based on the same thresholds used for the raw data set.
The marker matrix used as input can be organized in two ways. In the long format, there is one row for each combination of subject and SNP, together with its two alleles. Thus, if there are $$n$$ individuals and $$p$$ markers, the matrix is of order $$(n \times p) \times 4$$, where the columns hold the sample identification, the SNP identification and one column for each allele, in this particular order. To illustrate the process, we use a maize data set with 64 lines and 539 SNPs.

library(snpReady)
## Loading required package: Matrix
## Loading required package: matrixcalc
## Loading required package: stringr
## Loading required package: rgl
## Loading required package: impute
geno <- read.table("http://italo-granato.github.io/geno.txt", header = TRUE, na.strings = "NA")
head(geno)
##   sample     marker allele.1 allele.2
## 1    A01 PHM4468.13        G        G
## 2    A01 PHM2770.19        G        G
## 3    A01 PZA00485.2        A        A
## 4    A01 PZA00522.7        A        A
## 5    A01 PZA00627.1        G        G
## 6    A01 PZA00473.5        G        G
dim(geno)
## [1] 34496     4

Another format that can be used as input is the wide format, where samples are in rows and SNPs in columns. In this case, the matrix is of order $$n \times p$$.

x[1:10,1:5]
##     PHM10225.15 PHM10321.11 PHM10404.8 PHM10525.11 PHM10525.9
## A01 "CC"        "GG"        NA         "CC"        "TT"
## A02 "CG"        "CC"        "GG"       "TT"        "GG"
## A03 "CC"        "GG"        "GG"       "CC"        "TT"
## A04 "CC"        "GG"        "GG"       "TT"        "GG"
## A05 "CC"        "GG"        "CC"       NA          "GG"
## A06 "CC"        "GG"        "GG"       "TT"        "GG"
## A07 "CC"        "CC"        "CC"       "CC"        "GG"
## A08 "CC"        "GG"        "GG"       "CC"        "TT"
## A09 "CC"        "CC"        "CC"       "TT"        "GG"
## A10 "CC"        "GG"        "GG"       "TT"        "GG"
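
The relation between the two layouts can be sketched in base R. The toy data frame below is hypothetical and only mimics the four-column long format shown earlier; the object names (`long`, `wide`) are illustrative, and raw.data may well perform this reshaping in a different way internally.

```r
# Toy long-format data (hypothetical values): one row per sample x marker
long <- data.frame(
  sample   = rep(c("A01", "A02"), each = 3),
  marker   = rep(c("snp1", "snp2", "snp3"), times = 2),
  allele.1 = c("G", "A", "C", "G", NA, "C"),
  allele.2 = c("G", "A", "T", "C", NA, "C"),
  stringsAsFactors = FALSE
)

# Paste the two alleles into a genotype call, keeping missing calls as NA
long$call <- ifelse(is.na(long$allele.1) | is.na(long$allele.2),
                    NA, paste0(long$allele.1, long$allele.2))

# Fill an n x p character matrix: samples in rows, markers in columns
samples <- unique(long$sample)
markers <- unique(long$marker)
wide <- matrix(NA_character_, nrow = length(samples), ncol = length(markers),
               dimnames = list(samples, markers))
wide[cbind(long$sample, long$marker)] <- long$call
wide
```

The long table has $$n \times p = 6$$ rows, while the wide matrix has $$n = 2$$ rows and $$p = 3$$ columns.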

The raw data can be coded as nitrogenous bases (A, C, G and T) or as the standard A and B. However, if the data have already been recoded, this can be indicated through the base argument, and only quality control is performed. Thus, if base is FALSE, the dataset must be coded as 0, 1 and 2. Missing data should be set as NA.
Quality control (QC) for genomic data is based on removing individuals and markers with poor information. For individuals, this is generally associated with the amount of missing markers. Hence, samples that do not meet a given threshold of missing data can be removed through sweep.sample. For markers, QC is based on allele frequency and the amount of missing data. Markers with a low frequency of one of their alleles are usually non-informative and in some situations are considered monomorphic. Therefore, they can be removed. The same reasoning applies to missing data. Thus, the QC of markers is controlled through maf and call.rate.
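
The quantities behind these three thresholds can be sketched on a toy allele-count matrix. This is only an illustration under stated assumptions: the data are hypothetical, the thresholds are chosen for the toy example, and whether raw.data treats the thresholds as inclusive or exclusive is not stated here, so the `>=` comparisons are an assumption.

```r
# Toy allele-count matrix (hypothetical): 4 samples x 3 markers, NA = missing
M <- matrix(c(2, 2, 2, 2,     # snpA: monomorphic
              0, 1, NA, 2,    # snpB: one missing call
              NA, NA, 0, 2),  # snpC: half of the calls missing
            nrow = 4,
            dimnames = list(paste0("id", 1:4), c("snpA", "snpB", "snpC")))

# Call rate: proportion of non-missing genotypes per marker
call.rate <- colMeans(!is.na(M))

# Frequency of the counted allele and minor allele frequency per marker
p   <- colMeans(M, na.rm = TRUE) / 2
maf <- pmin(p, 1 - p)

# Proportion of missing markers per sample (the quantity sweep.sample targets)
miss.sample <- rowMeans(is.na(M))

# Markers that pass both filters for the thresholds chosen here (assumed inclusive)
keep <- call.rate >= 0.75 & maf >= 0.10
M[, keep, drop = FALSE]
```

In this toy case snpA is dropped because it is monomorphic (MAF of zero) and snpC because its call rate is only 0.5, so only snpB is kept.
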
Along with removing non-informative markers, imputation of missing data becomes necessary. We present two types of map-independent imputation. One is based on the Wright equilibrium. For a missing position, we assume that the probability of taking each value depends on both the allele frequency of the SNP and the level of homozygosity of the individual. Thus:

$P(x_{ij})=\left\{ \begin{array}{ll} P(x = 0) = (1 - p_j)^2+ p_j (1 - p_j ) F_i\\ P(x = 1) = 2p_j (1 - p_j )-2p_j (1 - p_j) F_i\\ P(x=2)= p_j^2+p_j (1 - p_j ) F_i \end{array} \right.$

where $$p_j$$ is the frequency of the major allele for SNP $$j$$, and $$F_i$$ is the level of homozygosity of individual $$i$$, estimated as the proportion of homozygous loci relative to the total number of loci. The other method of imputation currently implemented is based on the mean of each SNP. Each missing position of SNP $$j$$ is replaced by its mean. Thus:

$\bar{x}_j = 2p_j$

In this section we present a basic usage of raw.data on a maize data set. It is composed of 64 inbred lines genotyped with 539 SNPs. First, let’s run a basic quality control on this data set.

geno.ready <- raw.data(data = as.matrix(geno), frame = "long", base = TRUE, sweep.sample = 0.5, call.rate = 0.95, maf = 0.10, imput = FALSE)
M <- geno.ready$M.clean
M[1:10,1:5]
##     PHM10225.15 PHM10321.11 PHM10525.9 PHM11000.21 PHM11114.7
## A01           2           0          0           0          0
## A02           1           2          2           0          0
## A03           2           0          0           0          0
## A04           2           0          2           0          0
## A05           2           0          2           2          2
## A06           2           0          2           0          0
## A07           2           2          2           0          0
## A08           2           0          0           0          0
## A09           2           2          2           0          2
## A10           2           0          2           0          0

Above, we recode and clean the dataset according to some quality control parameters. Using a call rate of 0.95 means that only markers with at most 5% of missing data are accepted. A MAF of 0.1 means that markers with a minor allele frequency below 0.1 were removed. For individuals, the threshold used for sweep.sample was 0.5, which means that samples with more than 50% of missing data were removed. Besides the cleaned matrix, a report is also returned, summarizing how many and which markers were removed at each step of the quality control.

geno.ready$report
## $maf
## $maf$r
## [1] "67 Markers removed by MAF = 0.1"
##
## $maf$whichID
##  [1] "PHM10750.26" "PHM12693.8"  "PHM12904.7"  "PHM1506.23"  "PHM15871.11"
##  [6] "PHM16605.19" "PHM175.25"   "PHM18705.23" "PHM2159.8"   "PHM2177.85"
## [11] "PHM2770.19"  "PHM2773.30"  "PHM3612.19"  "PHM3637.15"  "PHM3688.14"
## [16] "PHM3690.23"  "PHM3691.15"  "PHM3691.18"  "PHM3931.17"  "PHM424.13"
## [21] "PHM424.16"   "PHM4313.17"  "PHM4339.79"  "PHM4349.6"   "PHM4469.13"
## [26] "PHM4552.6"   "PHM4662.153" "PHM4757.14"  "PHM4951.8"   "PHM5529.4"
## [31] "PHM5535.8"   "PHM5727.5"   "PHM574.14"   "PHM5740.9"   "PHM7616.35"
## [36] "PHM8074.6"   "PHM835.25"   "PHM9672.9"   "PZA00103.20" "PZA00192.6"
## [41] "PZA00213.19" "PZA00216.2"  "PZA00276.18" "PZA00344.10" "PZA00381.3"
## [46] "PZA00425.9"  "PZA00525.17" "PZA00573.3"  "PZA00615.3"  "PZA00658.19"
## [51] "PZA00730.2"  "PZA00804.1"  "PZA00878.2"  "PZA00881.1"  "PZA01790.1"
## [56] "PZA02151.3"  "PZA02167.2"  "PZA02820.17" "PZA02850.18" "PZA02921.9"
## [61] "PZA02923.7"  "PZA02949.22" "PZA02952.10" "PZA03011.6"  "PZA03035.5"
## [66] "PZA03063.18" "PZA03083.7"
##
##
## $cr
## $cr$r
## [1] "151 Markers removed by Call Rate = 0.95"
##
## $cr$whichID
##   [1] "PHM10404.8"   "PHM10525.11"  "PHM112.8"     "PHM1155.14"
##   [5] "PHM11985.27"  "PHM12992.5"   "PHM1307.11"   "PHM13094.8"
##   [9] "PHM13639.13"  "PHM13648.11"  "PHM13675.18"  "PHM13823.7"
##  [13] "PHM1438.34"   "PHM14618.11"  "PHM14671.9"   "PHM1506.23"
##  [17] "PHM15331.16"  "PHM1534.45"   "PHM15449.10"  "PHM15501.9"
##  [21] "PHM15961.13"  "PHM1684.20"   "PHM1745.16"   "PHM175.25"
##  [25] "PHM17698.8"   "PHM18513.156" "PHM1870.20"   "PHM18887.12"
##  [29] "PHM1899.157"  "PHM1932.51"   "PHM2006.57"   "PHM2177.85"
##  [33] "PHM229.15"    "PHM2343.25"   "PHM2350.14"   "PHM2487.6"
##  [37] "PHM2749.10"   "PHM3094.23"   "PHM3147.18"   "PHM3155.14"
##  [41] "PHM3171.5"    "PHM3309.8"    "PHM3463.18"   "PHM3512.186"
##  [45] "PHM3627.11"   "PHM3631.47"   "PHM3668.12"   "PHM3691.18"
##  [49] "PHM3844.14"   "PHM3852.15"   "PHM3963.33"   "PHM4196.27"
##  [53] "PHM4348.16"   "PHM4662.153"  "PHM4818.15"   "PHM4880.179"
##  [57] "PHM4905.6"    "PHM5296.6"    "PHM5480.17"   "PHM5572.19"
##  [61] "PHM5622.21"   "PHM563.9"     "PHM5637.15"   "PHM5798.39"
##  [65] "PHM5805.19"   "PHM5822.15"   "PHM597.18"    "PHM662.27"
##  [69] "PHM7417.21"   "PHM7898.10"   "PHM8074.6"    "PHM9162.135"
##  [73] "PHM9241.13"   "PHM9635.30"   "PHM9676.10"   "PZA00004.2"
##  [77] "PZA00049.12"  "PZA00058.5"   "PZA00061.1"   "PZA00084.2"
##  [81] "PZA00099.6"   "PZA00153.3"   "PZA00182.4"   "PZA00192.6"
##  [85] "PZA00216.2"   "PZA00219.7"   "PZA00220.11"  "PZA00289.11"
##  [89] "PZA00311.4"   "PZA00323.3"   "PZA00334.2"   "PZA00345.15"
##  [93] "PZA00395.2"   "PZA00399.10"  "PZA00405.7"   "PZA00423.16"
##  [97] "PZA00425.9"   "PZA00439.6"   "PZA00447.6"   "PZA00453.2"
## [101] "PZA00463.3"   "PZA00492.26"  "PZA00516.3"   "PZA00524.2"
## [105] "PZA00525.17"  "PZA00562.4"   "PZA00588.2"   "PZA00636.6"
## [109] "PZA00653.5"   "PZA00672.6"   "PZA00684.12"  "PZA00714.1"
## [113] "PZA00725.4"   "PZA00730.2"   "PZA00881.1"   "PZA00934.2"
## [117] "PZA00941.2"   "PZA01073.1"   "PZA01280.2"   "PZA01327.1"
## [121] "PZA01342.2"   "PZA01359.1"   "PZA01451.1"   "PZA01462.1"
## [125] "PZA01623.3"   "PZA01713.4"   "PZA01715.1"   "PZA01765.1"
## [129] "PZA01810.2"   "PZA01887.1"   "PZA02049.1"   "PZA02247.1"
## [133] "PZA02249.4"   "PZA02462.1"   "PZA02478.7"   "PZA02688.2"
## [137] "PZA02727.1"   "PZA02731.1"   "PZA02746.2"   "PZA02820.17"
## [141] "PZA02872.1"   "PZA02890.4"   "PZA02949.22"  "PZA02957.5"
## [145] "PZA02958.17"  "PZA03027.23"  "PZA03028.5"   "PZA03036.6"
## [149] "PZA03076.10"  "PZA03090.31"  "PZA03659.1"
##
##
## $sweep
## $sweep$r
## [1] "2 Samples removed by sweep.sample = 0.5"
##
## $sweep$whichID
## [1] "E08" "E09"
##
##
## $imput
## [1] "No data point was imputed"

For instance, 67 and 151 markers were removed by MAF and call rate, respectively. Moreover, two individuals were removed. Furthermore, the report shows the IDs of the SNPs and samples removed by each procedure. It is important to highlight that some SNPs may fail both filters, in which case their identification appears in both sections. Here, 204 markers were removed in total, of which 14 failed both QC filters (67 + 151 - 14 = 204).
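
Before turning to imputation within raw.data, the two map-independent rules described earlier can be sketched as follows. The helper `wright.prob` is hypothetical and not part of snpReady. The heterozygote class is written as $$2p_j(1-p_j)(1-F_i)$$ so that the three probabilities add up to one. Drawing a genotype at random from these probabilities is only one possible way of using them and is an assumption here.

```r
# Genotype probabilities for a missing call, given the allele frequency p
# of the marker and the homozygosity level Fi of the individual
# (hypothetical helper, not a snpReady function)
wright.prob <- function(p, Fi) {
  q <- 1 - p
  c("0" = q^2 + p * q * Fi,
    "1" = 2 * p * q * (1 - Fi),
    "2" = p^2 + p * q * Fi)
}

pr <- wright.prob(p = 0.7, Fi = 0.9)
pr
sum(pr)   # the three probabilities add up to 1

# One possible use (assumption): draw the imputed genotype at random
set.seed(1)
imputed <- sample(0:2, size = 1, prob = pr)

# Mean imputation instead replaces the missing call by 2 * p
mean.imputed <- 2 * 0.7
```

With a highly homozygous individual ($$F_i = 0.9$$), almost all of the probability mass falls on the two homozygous classes, which is the behaviour expected for inbred lines.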

Now, the quality control can be combined with imputation. When imputation is used, one of the supported methods must be chosen.

geno.ready2 <- raw.data(data = as.matrix(geno), frame = "long", base = TRUE, sweep.sample = 0.5, call.rate = 0.95, maf = 0.10, imput = TRUE, imput.type = "wright", outfile = "012")
Mwrth <- geno.ready2$M.clean
Mwrth[1:10,1:5]
##     PHM10225.15 PHM10321.11 PHM10525.9 PHM11000.21 PHM11114.7
## A01           2           0          0           0          0
## A02           1           2          2           0          0
## A03           2           0          0           0          0
## A04           2           0          2           0          0
## A05           2           0          2           2          2
## A06           2           0          2           0          0
## A07           2           2          2           0          0
## A08           2           0          0           0          0
## A09           2           2          2           0          2
## A10           2           0          2           0          0

Regarding Wright’s method, its accuracy is about 62%. Imputation based on the mean also has intermediate accuracy. These methods are less accurate than map-dependent methods, such as the one implemented in BEAGLE (Browning and Browning, 2016). However, as described by Rutkoski et al. (2013), when only low rates of missing data are allowed, a simple imputation is enough.

Concerning the output formats, the outfile argument is used to export the cleaned matrix in a suitable format. The default is the count of the reference allele, such that $$AA = 2$$, $$Aa = 1$$, $$aa = 0$$. Another format is coded as -1, 0, 1, which is the centered allele count for the case where $$p_j$$ is taken as 0.5. Finally, a special format is suitable for use in the STRUCTURE software (Pritchard et al., 2000). To generate this output, the raw data must be coded as nitrogenous bases.

geno.readySTR <- raw.data(data = as.matrix(geno), frame = "long", base = TRUE, sweep.sample = 0.5, call.rate = 0.95, maf = 0.10, imput = FALSE, outfile = "structure")
Mstr <- geno.readySTR$M.clean
Mstr[1:10,1:5]
##     PHM10225.15 PHM10321.11 PHM10525.9 PHM11000.21 PHM11114.7
## A01           2           3          4           3          4
## A01           2           3          4           3          4
## A02           2           2          3           3          4
## A02           3           2          3           3          4
## A03           2           3          4           3          4
## A03           2           3          4           3          4
## A04           2           3          3           3          4
## A04           2           3          3           3          4
## A05           2           3          3           1          2
## A05           2           3          3           1          2

In this output, each sample is split into two rows, one for each allele. Nitrogenous bases are then recoded to a specific number such that A is 1, C is 2, G is 3 and T is 4 and missing data are assigned as -9. Also, given that STRUCTURE can handle missing data, arguments related to imputation are ignored when this output is selected.
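
The recoding just described can be sketched in base R on a toy matrix of two-letter calls. The data and the helper `recode.base` are hypothetical; only the mapping (A = 1, C = 2, G = 3, T = 4, missing = -9) and the two-rows-per-sample layout come from the text, and the actual implementation in raw.data may differ.

```r
# Mapping stated in the text: A = 1, C = 2, G = 3, T = 4, missing = -9
base.code <- c(A = 1, C = 2, G = 3, T = 4)

# Toy wide matrix of two-letter calls (hypothetical values)
calls <- matrix(c("CC", "CG", "GG", NA), nrow = 2,
                dimnames = list(c("A01", "A02"), c("snp1", "snp2")))

# Recode a vector of single bases, sending missing values to -9
recode.base <- function(x) {
  out <- unname(base.code[x])
  out[is.na(out)] <- -9
  out
}

# First and second allele of every call, marker by marker
first  <- apply(calls, 2, function(g) recode.base(substr(g, 1, 1)))
second <- apply(calls, 2, function(g) recode.base(substr(g, 2, 2)))

# Interleave the two allele rows of every sample
str.mat <- rbind(first, second)[order(rep(seq_len(nrow(calls)), 2)), , drop = FALSE]
rownames(str.mat) <- rep(rownames(calls), each = 2)
str.mat
```

The heterozygous call of A02 at snp1 ("CG") ends up as 2 in one row and 3 in the other, while its missing call at snp2 becomes -9 in both rows.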

# Genomic relationship matrix

Genomic relationship matrices (GRM) are created through the G.matrix function. Genomic prediction models, especially G-BLUP, use these matrices. Different kinship parametrizations have been proposed with the aim of increasing the prediction accuracy of genomic selection. The G.matrix function estimates four types of additive and one dominance genomic relationship matrix.
The input matrix should be coded as allele counts, where $$AA = 2$$, $$Aa = 1$$, $$aa = 0$$. However, it also accepts continuous values for some SNPs. The second argument is the method used to generate the GRM. Four methods are currently implemented: the one proposed by VanRaden (2008), the two methods proposed by Yang et al. (2010), namely the UAR (Unified Additive Relationship) and the adjusted UAR, and the Gaussian kernel (GK; Pérez-Elizalde et al., 2015).

G <- G.matrix(M = Mwrth, method = "VanRaden", format = "wide")
Ga <- G$Ga
Ga[1:5,1:5]
##             A01         A02         A03         A04         A05
## A01  0.94814007 -0.17589223 -0.03158082 -0.07619916 -0.07775787
## A02 -0.17589223  0.95502441  0.02823502 -0.07678368 -0.02599541
## A03 -0.03158082  0.02823502  0.92553864 -0.05327300 -0.02664488
## A04 -0.07619916 -0.07678368 -0.05327300  0.92488918  0.06967097
## A05 -0.07775787 -0.02599541 -0.02664488  0.06967097  1.04659916
Gd <- G$Gd
Gd[1:5,1:5]
##           A01       A02       A03       A04       A05
## A01 0.9029512 0.3886324 0.4360494 0.4156504 0.5037025
## A02 0.3886324 0.9362374 0.3838944 0.4086660 0.4779729
## A03 0.4360494 0.3838944 0.8507627 0.4016978 0.4876244
## A04 0.4156504 0.4086660 0.4016978 0.8491409 0.5039872
## A05 0.5037025 0.4779729 0.4876244 0.5039872 1.0440042

Only for the VanRaden method, as shown above, are two matrices generated, one additive and one dominance. Otherwise, only the additive matrix is returned.
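
For intuition, the additive matrix of VanRaden (2008) can be sketched from its usual textbook form, $$G = ZZ' / (2\sum_j p_j(1-p_j))$$, where $$Z$$ is the allele-count matrix with each column centered by $$2p_j$$. The toy data below are hypothetical, and this is a sketch of the published formula rather than the code path of G.matrix, which may handle scaling, missing values or the dominance matrix differently.

```r
# Toy allele-count matrix (hypothetical): 4 individuals x 5 markers
M <- matrix(c(2, 0, 1, 2,
              0, 0, 1, 2,
              1, 2, 2, 0,
              2, 1, 0, 0,
              0, 2, 1, 1),
            nrow = 4,
            dimnames = list(paste0("id", 1:4), paste0("snp", 1:5)))

p  <- colMeans(M) / 2                          # allele frequency of each marker
Z  <- sweep(M, 2, 2 * p)                       # centre each column by 2 * p_j
Ga <- tcrossprod(Z) / (2 * sum(p * (1 - p)))   # Z Z' scaled by 2 * sum(p * q)
round(Ga, 3)
```

Because every column of `Z` is centered with frequencies estimated from the same individuals, each row of `Ga` sums to zero, which explains the mix of positive and negative off-diagonal values seen in the output above.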

G <- G.matrix(M = Mwrth, method = "UAR", format = "wide")
G[1:5,1:5]
##             A01         A02         A03         A04         A05
## A01  1.85403910 -0.28825604 -0.05735312 -0.15262638 -0.11931476
## A02 -0.28825604  1.82934078  0.01367647 -0.11610267 -0.07391783
## A03 -0.05735312  0.01367647  1.83686752 -0.09745287 -0.02504138
## A04 -0.15262638 -0.11610267 -0.09745287  1.84411847  0.13385173
## A05 -0.11931476 -0.07391783 -0.02504138  0.13385173  1.93787735

Two output formats are available. The first is the wide format, a matrix of order $$n \times n$$.

dim(G)
## [1] 62 62

The other is the long format, in which the inverse of the matrix is stored as a table suitable for ASREML-R, with three columns giving the row, the column and the respective value of each element of the lower triangle.

G <- G.matrix(M = Mwrth, method = "UAR", format = "long") 
## Warning in posdefmat(Ga): The matrix was adjusted for the nearest positive
## definite matrix
head(G)
##     row column    value
## 1     1      1 199967.8
## 63    2      1 208953.1
## 64    2      2 218371.2
## 125   3      1 211921.0
## 126   3      2 221559.7
## 127   3      3 225070.5

These two forms are suitable for use in many applications, such as BGLR (Pérez and De Los Campos, 2014), rrBLUP (Endelman, 2011) and ASREML-R (Butler et al., 2009).
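
The long layout can be reproduced by hand from any positive definite matrix. The sketch below uses a small hypothetical matrix and base R only; the adjustment to the nearest positive definite matrix mentioned in the warning above is not reproduced here.

```r
# Toy positive definite relationship matrix (hypothetical values)
G <- matrix(c(1.0, 0.2, 0.1,
              0.2, 1.1, 0.3,
              0.1, 0.3, 0.9), nrow = 3)

Ginv <- solve(G)   # inverse of the relationship matrix

# Keep the lower triangle (diagonal included) as row / column / value
idx  <- which(lower.tri(Ginv, diag = TRUE), arr.ind = TRUE)
long <- data.frame(row = idx[, "row"], column = idx[, "col"],
                   value = Ginv[idx])

# Sort row by row, matching the ordering of the table shown above
long <- long[order(long$row, long$column), ]
long
```

For $$n$$ individuals the table has $$n(n+1)/2$$ rows, so 6 rows for this toy matrix.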

# Population genetics summary

The popgen function aims to produce a summary about population genetics parameters. Thus, for any marker locus $$j$$ with alleles $$A_1$$ and $$A_2$$, the function estimates:

• Allele frequencies

\begin{aligned} f(A_1) = p_j = \frac{nA_1}{2N} \\ \\ f(A_2) = q_j = 1 - p_j \end{aligned}

• Minor allele frequency

$$maf = min(p_j, q_j)$$

• Observed heterozygosity

\begin{aligned} H_o=\frac{nH_j}{N} \end{aligned}

• Expected heterozygosity

$$H_e = 2p_jq_j$$

• Nei’s genetic diversity

$$DG = 1 - p_j^2 - q_j^2$$

• Polymorphic information content

$$PIC = 1-(p_j^2 + q_j^2) - (2p_j^2q_j^2)$$

• Missing rate

\begin{aligned} Miss_j=\frac{nNA_j}{N} \end{aligned}

• Hardy-Weinberg equilibrium statistic

\begin{aligned} \chi^2=\frac{1}{d}\sum_{k=1}^{3} \frac{(O_k - E_k)^2}{E_k} \end{aligned}

where $$nA_1$$ is the number of copies of the $$A_1$$ allele in the population, $$N$$ is the number of individuals, $$nH_j$$ is the number of heterozygous genotypes (of type $$A_1A_2$$ or $$A_2A_1$$) at locus $$j$$, $$nNA_j$$ is the number of missing genotypes at locus $$j$$, $$O_k$$ are the observed counts of the genotypes $$0$$, $$1$$ and $$2$$, and $$E_k$$ are the corresponding expected counts, namely $$N(1 - p_j)^2$$ for genotype $$0$$, $$2Np_j(1-p_j)$$ for genotype $$1$$ and $$Np_j^2$$ for genotype $$2$$.
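
The per-marker formulas above can be traced on a single toy locus. The genotype vector is hypothetical, and two choices are assumptions of this sketch rather than documented behaviour of popgen: frequencies are computed over the non-missing genotypes only, and the chi-square is the plain Pearson statistic, leaving out the $$1/d$$ factor.

```r
# Toy genotypes at one locus (hypothetical): copies of the A1 allele, NA = missing
x <- c(2, 2, 1, 0, 2, 1, 2, NA, 0, 2)

obs <- x[!is.na(x)]
N   <- length(obs)                 # individuals with a genotype call (assumption)

p   <- sum(obs) / (2 * N)          # f(A1)
q   <- 1 - p                       # f(A2)
maf <- min(p, q)                   # minor allele frequency
Ho  <- sum(obs == 1) / N           # observed heterozygosity
He  <- 2 * p * q                   # expected heterozygosity
GD  <- 1 - p^2 - q^2               # Nei's genetic diversity
PIC <- 1 - (p^2 + q^2) - 2 * p^2 * q^2
miss <- mean(is.na(x))             # missing rate over all individuals

# Pearson chi-square against Hardy-Weinberg proportions (no 1/d scaling)
O <- c(sum(obs == 0), sum(obs == 1), sum(obs == 2))
E <- N * c(q^2, 2 * p * q, p^2)
chiSq <- sum((O - E)^2 / E)

round(c(p = p, q = q, MAF = maf, He = He, Ho = Ho, GD = GD,
        PIC = PIC, Miss = miss, chiSq = chiSq), 3)
```

For a biallelic marker, $$1 - p_j^2 - q_j^2 = 2p_jq_j$$, so the expected heterozygosity and Nei’s genetic diversity coincide, and PIC is always the smaller of the two.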

Moreover, for any individual $$i$$, the function provides estimates of:

• Observed heterozygosity

\begin{aligned} H_{o_i} = \frac{nH_i}{m} \end{aligned}

• Inbreeding coefficient

\begin{aligned} F_i=\frac{O(H_i)-E(H)}{m-E(H)} \end{aligned}

where $$nH_i$$ is the number of heterozygous genotypes (of type $$A_1A_2$$ or $$A_2A_1$$) in individual $$i$$, $$m$$ is the number of markers, $$O(H_i)$$ is the observed homozygosity of individual $$i$$, and $$E(H) = \sum_{j} \left[1-2p_j(1-p_j)\right]$$ is the expected homozygosity across all SNPs.
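
The two per-individual estimates can be sketched on a toy matrix. The data are hypothetical, and $$O(H_i)$$ is taken here as the number of homozygous loci of the individual, which is an assumption made so that it is on the same scale as $$m$$ and $$E(H)$$.

```r
# Toy allele-count matrix (hypothetical): 3 individuals x 4 markers
M <- matrix(c(2, 1, 0,
              0, 1, 2,
              2, 2, 1,
              0, 1, 1),
            nrow = 3,
            dimnames = list(paste0("id", 1:3), paste0("snp", 1:4)))

m <- ncol(M)                         # number of markers
p <- colMeans(M) / 2                 # allele frequency per marker

Ho.ind <- rowSums(M == 1) / m        # observed heterozygosity per individual
O.hom  <- rowSums(M != 1)            # observed number of homozygous loci (assumption)
E.hom  <- sum(1 - 2 * p * (1 - p))   # expected homozygosity summed over markers
Fi     <- (O.hom - E.hom) / (m - E.hom)

round(cbind(Ho = Ho.ind, Fi = Fi), 3)
```

Individual id1 is homozygous at every locus, so its $$F_i$$ is exactly 1, whereas individuals with more heterozygous loci than expected get negative values.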

Then, based on the estimates described above, population-level parameters are provided: the means and the lower and upper bounds of DG, PIC, MAF, Ho, F and S. In addition, measures of variability are estimated:

• Effective population size

\begin{aligned} Ne =\left(\frac{1}{2\bar{F_i}} \right) N \end{aligned}

• Additive variance due to the allele frequencies

\begin{aligned} Va = \sum_{j=1}^{m} 2p_jq_j \end{aligned}

• Dominance variance due to the allele frequencies

\begin{aligned} Vd = \sum_{j=1}^{m} (2 p_{j} q_{j})^2 \end{aligned}

where $$\bar{F_i} = \frac{\sum_{i=1}^{N} F_i}{N}$$ is the mean of $$F_i$$ over the $$N$$ individuals.
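
These three measures follow directly from the allele frequencies and the individual inbreeding coefficients. The inputs below are hypothetical and only serve to show the arithmetic.

```r
# Hypothetical inputs: allele frequencies of m = 4 markers and
# inbreeding coefficients of N = 5 individuals
p  <- c(0.9, 0.3, 0.5, 0.2)
q  <- 1 - p
Fi <- c(0.95, 0.90, 0.85, 0.92, 0.88)
N  <- length(Fi)

Ne <- (1 / (2 * mean(Fi))) * N     # effective population size
Va <- sum(2 * p * q)               # additive variance due to allele frequencies
Vd <- sum((2 * p * q)^2)           # dominance variance due to allele frequencies

round(c(Ne = Ne, Va = Va, Vd = Vd), 4)
```

Plugging the rounded values of the maize example that follows ($$N = 62$$ and a mean $$F$$ of 0.91) into the same expression for Ne gives about 34.1, in line with the reported estimate.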

In general, the data set needs to be coded as allele counts, and missing data are accepted. Estimates for the whole population are returned in a list, with components for Genotypes, Markers, Population and Variability.

pop.gen <- popgen(M = Mwrth)
head(pop.gen$whole$Markers)
##                p    q  MAF   He   Ho   GD  PIC Miss  chiSq         pval
## PHM10225.15 0.90 0.10 0.10 0.19 0.02 0.19 0.17    0 51.802 6.138145e-13
## PHM10321.11 0.32 0.68 0.32 0.44 0.03 0.44 0.34    0 53.185 3.035085e-13
## PHM10525.9  0.73 0.27 0.27 0.40 0.00 0.40 0.32    0 62.000 3.434573e-15
## PHM11000.21 0.15 0.85 0.15 0.25 0.03 0.25 0.22    0 46.930 7.356560e-12
## PHM11114.7  0.44 0.56 0.44 0.49 0.03 0.49 0.37    0 54.131 1.875179e-13
## PHM11226.13 0.44 0.56 0.44 0.49 0.02 0.49 0.37    0 58.015 2.601867e-14
head(pop.gen$whole$Genotypes)
##       Ho    Fi
## A01 0.02 0.938
## A02 0.06 0.846
## A03 0.01 0.962
## A04 0.01 0.962
## A05 0.03 0.923
## A06 0.03 0.915
head(pop.gen$whole$Population)
##     mean lower upper
## GD  0.39  0.17  0.50
## PIC 0.31  0.16  0.38
## MAF 0.29  0.10  0.50
## Ho  0.04  0.00  0.14
## F   0.91  0.63  1.00
head(pop.gen$whole$Variability)
##                     estimate
## Ne                     34.10
## Va                    130.09
## Vd                     53.81
## number.of.genotypes    62.00
## number.of.markers     335.00

The popgen function also allows the assignment of subpopulations to individuals. It estimates the same population genetic parameters for each subpopulation, such as effective population size, components of additive and dominance variance, and inbreeding. In our example, let’s split the whole population into two subpopulations according to nitrogen use efficiency (NE): a high (HNE) and a low (LNE) group.

subgroups <- as.matrix(c(rep("HNE", 10), rep("LNE", 52)))
pop.gen <- popgen(M = Mwrth, subgroups = subgroups)

It produces for each group the same estimates described above. For example, in the HNE group:

head(pop.gen$bygroup$HNE$Markers)
##                p    q  MAF   He  Ho   GD  PIC Miss  chiSq        pval
## PHM10225.15 0.95 0.05 0.05 0.10 0.1 0.10 0.09    0  0.028 0.867814108
## PHM10321.11 0.30 0.70 0.30 0.42 0.0 0.42 0.33    0 10.000 0.001565402
## PHM10525.9  0.70 0.30 0.30 0.42 0.0 0.42 0.33    0 10.000 0.001565402
## PHM11000.21 0.10 0.90 0.10 0.18 0.0 0.18 0.16    0 10.000 0.001565402
## PHM11114.7  0.20 0.80 0.20 0.32 0.0 0.32 0.27    0 10.000 0.001565402
## PHM11226.13 0.55 0.45 0.45 0.50 0.1 0.50 0.37    0  6.368 0.011621498
head(pop.gen$bygroup$HNE$Genotypes)
##       Ho    Fi
## A01 0.02 0.930
## A02 0.06 0.826
## A03 0.01 0.957
## A04 0.01 0.957
## A05 0.03 0.913
## A06 0.03 0.904

and for LNE group:

head(pop.gen$bygroup$LNE$Markers)
##                p    q  MAF   He   Ho   GD  PIC Miss  chiSq         pval
## PHM10225.15 0.88 0.12 0.12 0.20 0.00 0.20 0.18    0 52.000 5.550063e-13
## PHM10321.11 0.33 0.67 0.33 0.44 0.04 0.44 0.34    0 43.308 4.676450e-11
## PHM10525.9  0.73 0.27 0.27 0.39 0.00 0.39 0.32    0 52.000 5.550063e-13
## PHM11000.21 0.15 0.85 0.15 0.26 0.04 0.26 0.23    0 37.771 7.954843e-10
## PHM11114.7  0.48 0.52 0.48 0.50 0.04 0.50 0.37    0 44.297 2.821839e-11
## PHM11226.13 0.42 0.58 0.42 0.49 0.00 0.49 0.37    0 52.000 5.550063e-13
head(pop.gen$bygroup$LNE$Genotypes)
##       Ho    Fi
## A11 0.01 0.962
## A12 0.02 0.954
## B01 0.02 0.954
## B02 0.07 0.815
## B03 0.09 0.762
## B04 0.04 0.892

Moreover, the separation into groups allows the identification of exclusive/absent and fixed alleles in the assigned subpopulations. For the HNE group:

head(pop.gen$bygroup$HNE$exclusive)
## [1] "There are no exclusive alleles for this group"
head(pop.gen$bygroup$HNE$fixed)
## [1] "PHM15278.6" "PHM1576.25" "PHM1962.33" "PHM2518.28" "PHM2672.19"
## [6] "PHM2885.31"

and in the LNE:

head(pop.gen$bygroup$LNE$exclusive)
## [1] "PHM15278.6" "PHM1576.25" "PHM1962.33" "PHM2518.28" "PHM2672.19"
## [6] "PHM2885.31"
head(pop.gen$bygroup$LNE$fixed)
## [1] "There are no fixed alleles for this group"

As can be noted, in the HNE group there are many fixed alleles and no exclusive alleles, whereas the opposite is observed in LNE. Hence, with two subpopulations, and considering that quality control was performed and monomorphic markers were removed, if an allele is fixed in one subpopulation, the alternative allele will be exclusive to the other.

In the presence of subgroups, popgen also produces Wright’s F statistics in order to measure genetic divergence between populations. $$F_{IS}$$ measures the average deficiency of heterozygotes within each subpopulation, $$F_{ST}$$ is related to the degree of gene differentiation among subpopulations, and $$F_{IT}$$ is the average deficiency of heterozygotes in the total population. The parameters are estimated as follows:

\begin{aligned} F_{IS} = 1 - \dfrac{H_I}{H_S} \end{aligned}

\begin{aligned} F_{ST} = 1 - \dfrac{H_S}{H_T} \end{aligned}

\begin{aligned} F_{IT} = 1 - \dfrac{H_I}{H_T} \end{aligned}

where $$H_I$$ is the weighted average of observed heterozygosity in the subpopulations, $$H_S$$ is the average expected heterozygosity estimated from each subpopulation and $$H_T$$ is the expected heterozygosity in the total population. Besides the estimates of the F parameters for each marker locus, estimates based on the means of $$H_I$$, $$H_T$$ and $$H_S$$ are produced for each population. Thus, there is a comparison considering all populations and a pairwise comparison in which each pair of populations is compared against each other.
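
The three statistics can be traced for a single locus and two subpopulations. The genotypes are hypothetical, and weighting both $$H_I$$ and $$H_S$$ by subpopulation size is an assumption of this sketch (with equal group sizes, as here, the weighting has no effect); popgen may average differently.

```r
# Toy allele counts at one locus for two subpopulations (hypothetical)
g1 <- c(2, 2, 1, 2, 2, 1)      # subpopulation 1
g2 <- c(0, 0, 1, 0, 2, 0)      # subpopulation 2
groups <- list(g1, g2)
n <- sapply(groups, length)

# H_I: observed heterozygosity, averaged over subpopulations (weighted by size)
H_I <- weighted.mean(sapply(groups, function(g) mean(g == 1)), n)

# H_S: expected heterozygosity within each subpopulation, averaged
p.sub <- sapply(groups, function(g) mean(g) / 2)
H_S <- weighted.mean(2 * p.sub * (1 - p.sub), n)

# H_T: expected heterozygosity in the pooled population
p.tot <- mean(c(g1, g2)) / 2
H_T <- 2 * p.tot * (1 - p.tot)

F_IS <- 1 - H_I / H_S
F_ST <- 1 - H_S / H_T
F_IT <- 1 - H_I / H_T
round(c(F_IS = F_IS, F_ST = F_ST, F_IT = F_IT), 3)
```

By construction, the three estimates satisfy $$(1 - F_{IT}) = (1 - F_{IS})(1 - F_{ST})$$, which is a convenient check on any implementation.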