Comparison of variable selection procedures to model weather-pathogen relation in crops


Franco Marcelo Suarez
Cecilia Bruno
María de la Paz Giménez Pecci
Mónica Balzarini

Abstract

Large volumes of georeferenced climatic data are now easily accessible. These data can be used to model the relationship between climatic conditions and disease from multiple meteorological variables, which are usually correlated and redundant. Variable selection identifies a subset of relevant regressors with which to build predictive models. Stepwise, Boruta, and LASSO are variable selection procedures of different natures, and their relative performance has scarcely been explored. The objective of this work was to compare these methods, applied simultaneously, in the construction of regression models to predict disease risk from climatic data. Three georeferenced databases with presence/absence values of different pathogens in maize crops in Argentina were used. For each scenario, climatic variables covering the period from before sowing until harvest were obtained. All three variable selection methods yielded models with accuracy close to 70 %. However, LASSO produced the best predictive model, selecting an intermediate number of variables: more than Stepwise and fewer than Boruta. The results could be extended to other pathosystems and could inspire the construction of alarm systems based on climatic variables.
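
As a rough illustration of how a LASSO-type procedure selects variables for a presence/absence response, the sketch below fits an L1-penalized logistic regression by proximal gradient descent on synthetic data. It is not the authors' implementation: the data, the variable names (`temp`, `temp_copy`, `noise_a`, `noise_b`), the penalty values, and the fitting routine are all assumptions made for the example. The L1 penalty shrinks some coefficients exactly to zero, and the variables with nonzero coefficients are the selected ones.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, threshold):
    """Shrink value toward zero; anything within +/- threshold becomes exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso_logistic(rows, labels, penalty, step=0.1, n_iter=800):
    """L1-penalized logistic regression fitted by proximal gradient descent."""
    n, p = len(rows), len(rows[0])
    intercept, coefs = 0.0, [0.0] * p
    for _ in range(n_iter):
        # Residuals: predicted probability minus observed 0/1 label.
        residuals = [
            sigmoid(intercept + sum(c * x for c, x in zip(coefs, row))) - label
            for row, label in zip(rows, labels)
        ]
        # The intercept is not penalized: plain gradient step.
        intercept -= step * sum(residuals) / n
        # Each coefficient: gradient step followed by soft-thresholding.
        for j in range(p):
            grad_j = sum(r * row[j] for r, row in zip(residuals, rows)) / n
            coefs[j] = soft_threshold(coefs[j] - step * grad_j, step * penalty)
    return intercept, coefs


def selected(coefs):
    """Indices of the variables whose coefficient was not shrunk to zero."""
    return [j for j, c in enumerate(coefs) if c != 0.0]


# Hypothetical data: "temp" drives presence, "temp_copy" is a redundant,
# correlated copy of it, and the last two columns are pure noise.
names = ["temp", "temp_copy", "noise_a", "noise_b"]
rng = random.Random(0)
rows, labels = [], []
for _ in range(150):
    temp = rng.uniform(-1.0, 1.0)
    rows.append([temp, temp + rng.uniform(-0.2, 0.2),
                 rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)])
    labels.append(1 if rng.random() < sigmoid(3.0 * temp) else 0)

_, coefs_moderate = fit_lasso_logistic(rows, labels, penalty=0.05)
_, coefs_strong = fit_lasso_logistic(rows, labels, penalty=10.0)

print("moderate penalty keeps:", [names[j] for j in selected(coefs_moderate)])
print("strong penalty keeps:", [names[j] for j in selected(coefs_strong)])
```

The size of the penalty controls how many variables survive: a very large value removes every regressor, while smaller values retain more. In practice the penalty would typically be tuned, for example by cross-validation, rather than fixed by hand as in this sketch.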


How to Cite
Suarez, F. M., Bruno, C., Giménez Pecci, M. de la P., & Balzarini, M. (2024). Comparison of variable selection procedures to model weather-pathogen relation in crops. AgriScientia, 40(2), 37–48. https://doi.org/10.31047/1668.298x.v40.n2.40871
Section
Short communications
