Genomic prediction models for grain yield of spring bread wheat in diverse agro-ecological zones


Genomic and pedigree predictions for grain yield and agronomic traits were carried out using high-density molecular data on a set of 803 spring wheat lines that were evaluated at 5 sites characterized by several environmental co-variables. Seven statistical models were tested using two random cross-validation schemes. Two other prediction problems were studied, namely predicting the lines’ performance at one site from another (pairwise-site) and at untested sites (leave-one-site-out). Grain yield ranged from 3.7 to 9.0 t ha−1 across sites. The best predictability was observed when genotypic and pedigree data were included in the models together with their interactions with sites and the environmental co-variables. Leave-one-site-out prediction increased average prediction accuracy over pairwise-site prediction for all the traits, specifically from 0.27 to 0.36 for grain yield. Days to anthesis, maturity, and plant height had high heritability and gave the highest prediction accuracy. Genomic and pedigree models coupled with environmental co-variables gave high prediction accuracy due to high genetic correlation between sites. This study provides an example of model prediction considering climate data along with genomic and pedigree information. Such comprehensive models can be used to achieve rapid enhancement of wheat yield under current and future climate change scenarios.

Global wheat production is currently close to 700 million tons1, and the demand for wheat in developing countries is projected to increase 60% by 20502. Wheat grain yield is a complex trait that depends on multiple genes interacting with each other and the environment3,4. Although the effects of major genes regulating plant phenology and morphology and their influence on grain yield have been previously described5, quantitative trait loci (QTLs) for grain yield have had limited practical applications in breeding programs due to the small genetic variance accounted for by individual QTLs, the variation across environments4, and the influence of the genetic backgrounds.

Recent advances in sequencing technologies have enabled the generation of high-throughput, fast, and relatively inexpensive genotypic information, thereby facilitating the implementation of genomic prediction and genomic selection in plant and animal breeding6. Incorporation of genomic information through prediction models provides an alternative approach to indirect selection in breeding for crop varieties. As plant breeding programs have started to incorporate genomic information, parametric linear regression and non-parametric models have emerged as preferred methods7,8. However, the way genetic instructions translate into the full set of phenotypic traits, and ultimately into grain yield components, is affected by numerous interactions among pathways and with the environment. Genotype by environment interactions (G × E) can reduce trait heritability and the ability to statistically predict superior genotypes under contrasting environments9,10. For this reason, collecting phenotypic data from different environments continues to be a powerful predictor of important biological outcomes such as grain yield11. Although different genomic technologies are being utilized to breed suitable varieties, genomic selection provides the option of considering multiple variables simultaneously for predicting genetic yield potential10.

Pedigree information accounts for the proportion of predictive ability related to differences among families and increases prediction accuracy when used together with marker information (which accounts for Mendelian sampling) in genomic selection models12. Burgueño et al.9 demonstrated the superiority of pedigree plus genomic models over pedigree or genomic-based predictions alone when incorporating G × E in the genomic regression model. Jarquin et al.13 proposed a model that can use not only genomic information but also pedigree and environmental information for the prediction of unobserved genotypes. Data from multi-environment trials can also be used for predicting climate change scenarios and selecting suitable sites for evaluating promising germplasm. Including environmental covariables in genomic selection prediction models is expected to result in less biased estimation of effects, higher prediction accuracy, better precision and power, and increased heritability to explain grain yield variation14. This information facilitates selection of promising germplasm for use in crop breeding aimed at both population improvement and cultivar release.
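As an illustration of how marker and environmental information can enter one model, the sketch below builds covariance matrices in the way commonly used in reaction-norm type models: a linear kernel from markers, a linear kernel from environmental covariables, and their element-wise (Hadamard) product for the interaction. This section does not give the model equations, so the construction, the toy dimensions, and all names (`linear_kernel`, `X`, `W`, `K_GW`) are assumptions made for illustration and are not taken from the study. A pedigree relationship matrix could be expanded to the record level in the same way as the genomic one.

```python
import numpy as np

def linear_kernel(M):
    """Center and scale the columns of M, then return M M' / (number of columns kept)."""
    M = np.asarray(M, dtype=float)
    M = M - M.mean(axis=0)
    sd = M.std(axis=0)
    keep = sd > 0                      # drop constant columns (e.g. monomorphic markers)
    M = M[:, keep] / sd[keep]
    return M @ M.T / M.shape[1]

# Toy data (hypothetical sizes, random values)
rng = np.random.default_rng(1)
n_lines, n_sites, n_markers, n_covs = 4, 3, 50, 6
X = rng.integers(0, 2, size=(n_lines, n_markers))   # marker matrix, lines x markers
W = rng.normal(size=(n_sites, n_covs))              # environmental covariables, sites x covariables

G = linear_kernel(X)        # genomic relationship among lines
Omega = linear_kernel(W)    # environmental similarity among sites

# One phenotypic record per (line, site) combination
line_idx = np.repeat(np.arange(n_lines), n_sites)
site_idx = np.tile(np.arange(n_sites), n_lines)

# Expand both matrices to the record level (records x records)
K_G = G[np.ix_(line_idx, line_idx)]
K_W = Omega[np.ix_(site_idx, site_idx)]

# Interaction covariance: element-wise product of the two record-level matrices
K_GW = K_G * K_W
```

The three matrices `K_G`, `K_W`, and `K_GW` could then serve as covariance structures for random effects in a mixed or Bayesian kernel model. Adding or dropping each one corresponds to comparing models with and without the respective main effect or interaction.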

Cross-validation schemes are used in genomic prediction studies to estimate the accuracy with which predictions can be made for different traits and environments9,15,16,17,18,19,20,21,22. There are two basic cross-validation schemes used in genome-enabled prediction: (1) predicting the performance of a certain proportion of lines that have not been evaluated in any of the observed environments (CV1), and (2) predicting the performance of a proportion of lines that have been evaluated in some environments but not in others, also called sparse testing (CV2). A third prediction problem, which does not involve random cross-validation, is predicting one environment using another environment (pairwise environment). The fourth prediction problem consists of predicting one environment (i.e., site-year combination) that was not included in the usual set of testing environments in the evaluation system (leave-one-environment-out); the only available information on this untested environment could be certain characteristics that would have been previously collected, such as soil type, altitude, longitude, maximum and minimum temperature, precipitation during other cropping cycles, etc. It is expected that the performance of untested lines can be predicted with sufficient accuracy when there is knowledge about their relationships (pedigree relationship or genomic relationship). Similarly, the performance of lines in unobserved environments could be predicted if there is information about the environmental conditions17. The accuracy of predicting performance in unobserved environments would, however, depend on our ability to select the most appropriate environmental variables for inclusion in the prediction model. To our knowledge, this is the first study to assess the leave-one-environment-out prediction problem with real environmental data.
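The four prediction problems described above differ only in which records are held out. The following minimal sketch shows one way to generate the corresponding training and testing partitions for a table of line-by-site records. The toy layout (6 lines, 3 sites), the fold counts, and all function names are hypothetical and are not the partitioning code used in the study.

```python
import numpy as np

def cv1_folds(lines, n_folds=5, seed=0):
    """CV1: whole lines are assigned to folds, so a held-out line is unobserved at every site."""
    rng = np.random.default_rng(seed)
    unique_lines = np.unique(lines)
    fold_of_line = dict(zip(unique_lines, rng.permutation(len(unique_lines)) % n_folds))
    return np.array([fold_of_line[l] for l in lines])

def cv2_folds(lines, n_folds=5, seed=0):
    """CV2: records of the same line are spread over different folds, so a line
    held out at one site is still observed at other sites (sparse testing)."""
    rng = np.random.default_rng(seed)
    folds = np.empty(len(lines), dtype=int)
    for line in np.unique(lines):
        idx = np.flatnonzero(lines == line)
        order = rng.permutation(n_folds)          # shuffled fold order for this line
        folds[idx] = order[np.arange(len(idx)) % n_folds]
    return folds

def pairwise_site(sites):
    """Yield (train_site, test_site, train_mask, test_mask) for every ordered pair of sites."""
    for train in np.unique(sites):
        for test in np.unique(sites):
            if train != test:
                yield train, test, sites == train, sites == test

def leave_one_site_out(sites):
    """Yield (test_site, train_mask, test_mask), holding out one whole site at a time."""
    for s in np.unique(sites):
        yield s, sites != s, sites == s

# Toy layout: every line observed at every site (hypothetical site codes)
lines = np.repeat(np.arange(6), 3)
sites = np.tile(np.array(["CE", "DE", "OB"]), 6)

f1 = cv1_folds(lines, n_folds=3)
f2 = cv2_folds(lines, n_folds=5)
pairs = list(pairwise_site(sites))
loso = list(leave_one_site_out(sites))
```

In CV1 and CV2, fold `k` would serve as the testing set while the remaining folds train the model. In the pairwise and leave-one-site-out problems, the boolean masks select the training and testing records directly, and no random partition is involved.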

In light of the above, the objectives of the present study were: 1) to investigate the performance stability of wheat lines across a set of 5 Mexican environments; 2) to evaluate genomic prediction with high-density genotyping-by-sequencing (DArTseq) markers for agronomic traits and grain yield using different combinations of the effects of lines (L), sites (E), genomic data (G), pedigree data (A), and environmental covariables (W) and their interactions; and 3) to test a new problem that arises when predicting the performance of wheat lines in environments that have not been previously used (untested environments), where the only available information about them is their climate data.

Figure 1
Map was constructed using ESRI’s ArcGIS Desktop ArcMap 10.2.2 software (26).
Figure 2
Sites refer to Celaya; Delicias; Tepatitlán; Ciudad Obregón and Zaragoza.
Figure 3
The red codes represent each of the sites. The points represent each of the lines. The lines closest to the end of the site vector are the best performing lines for that specific site.
Figure 4
The heritability for each site within the pair is represented by a circle. For each pair of sites, the prediction is shown in both directions (direct and reciprocal). Abbreviations: CE: Celaya; DE: Delicias; TE: Tepatitlán; OB: Cd. Obregón; ZA: Zaragoza.