Crop yield estimation using satellite images : comparison of linear and non-linear models

Development of models for crop yield prediction using remote sensing allows accurate, reliable and timely estimations over large areas. Particularly, this information is necessary to ensure the adequacy of a nation’s food supply as well as to aid policy makers and farmers. In Argentina, soybean (Glycine max (L.) Merr.) and corn (Zea mays L.) are the most important crops. The goal of this research was to develop and evaluate linear and non-linear models to estimate crop yield from satellite data. Particularly, we proposed and applied those models to obtain soybean and corn yield in the central region of Córdoba (Argentina) using Landsat and SPOT images. The models were designed taking into account all or some bands included in the images from one or both satellites. Results showed that models provided a good fit when all images are used, being superior the accuracy obtained by neural networks (NN). For soybean, the best estimation presented a coefficient of determination equal to 0.90 with NN and 0.82 with multiple linear regression models, and for corn 0.92 and 0.88, respectively. This study concludes that Landsat and SPOT images can be effectively used to predict, in early to mid-season crop growth stages, corn and soybean yield.


InTrODuCTIOn
National crop area estimates are typically obtained using ground data.Accurate and timely assessment of crop yield is an essential process to ensure the adequacy of food supply.It provides policy makers, governmental agencies and farmers the necessary information to better manage harvest, storage, import/export, transportation and marketing activities (Craig and Atkinson, 2013;Gusso, Ducati, Veronez, Arvor and Gonzaga da Silveira Jr, 2013) In the past, estimates of crop yield were done, in general, from the expertise of farmers or, as claimed by Geipel, Link and Claupein, (2014) "better estimations can be drawn from destructive sampling procedures in representative areas".Later, approaches using simple mathematical or statistical relationships were built with agronomic and meteorological data (Dadhwal and Ray, 2000), crop growth models (Thorp, De Jonge, Kaleita, Batchelor and Paz, 2008); some of those models were also applied in Argentina (Grassini et al. 2013;Milera and Crotti 2014).
In general, the processes and properties that regulate the crop yield vary in space and time.Therefore, the use of remote sensing techniques is an effective tool for assessing and monitoring crop yield because satellite data provide a spatial and periodic, comprehensive view of the real crop state (Geipel et al., 2014).The information provided by remote sensing allows obtaining field data at a lower cost compared to other methods, the coverage of large areas and the possibility of periodic repetitions (Bocco, Sayago and Willington, 2014).Several types of images can be used to monitor the agricultural surface.In cultivated areas where plots are small, due to its spatial resolution, data from Landsat and SPOT are very suitable; however, the temporal resolution of these satellites and the possibility of clouds during their passes are sometimes a constraint (Kuenzer, Dech and Wagner, 2015).In particular, Gusso et al. (2013) evaluated the Coupled Model performance, which is entirely based on MODIS images, to estimate soybean production prior to the crop harvest in Rio Grande do Sul (Brasil).For Argentina, Holzman, Rivas and Piccolo (2014) evaluated the capacity in estimating regional yield of soybean with linear and quadratic functions using MODIS.
Neural network (NN) methodology is an alternative modeling and simulation tool, which is specially designed for nonlinear systems and they do not require prior assumption about the statistical behavior or any specific relationship between variables (Bocco et al., 2014).Kaul, Hill and Walthall (2005) investigated the NN models performances in predicting corn (Zea mays L.) and soybean (Glycine max (L.) Merr.) yields for typical climatic conditions and Dai, Huo and Wang (2011) simulated sunflower (Helianthus annuus L.) yield using NN and multi-linear regression models.Green, Salas, Martinez and Erskine et al. (2007) presented a Spatial Analysis Neural Network (SANN) and Multiple Linear Regression (MLR) algorithms for analyzing grain yield for wheat (Triticum aestivum L.) in Colorado (USA).Alvarez (2009) proposed a regional analysis for wheat yield in the Argentine Pampas, using surface regression and NN methodologies.
Extensive agriculture is the main crop cultivation system used in productive land in the central area of Argentina.Soybean is the most important crop, taking into account the economic yield obtained by farmers and the sown area (20,480,000 ha in 2015-2016), followed by corn, with more than 6,904,000 ha in the same period.In particular, the province of Córdoba is the second largest producer of soybean and corn in Argentina, with approximately 27% and 28% of the sowed hectares, respectively (Sayago, Ovando, and Bocco, 2017;Ministerio de Agroindustria, 2017).
The goal of this work is to build linear and non-linear models to estimate crop yield, using Landsat 8 and SPOT 5 images and validate them with soybean and corn yield data from the central region of Córdoba (Argentina).

maTerIaL anD meTHODs study area
The study area is located in the central plains of Córdoba (Argentina), in the sub-region known as "Pampa Alta" (31º to 32º S; 63º to 64º W, approximately).As Bocco et al. (2014) state, its productive characteristics are very representative of the central region of Argentina (Figure 1).It presents a slightly undulating relief of hills developed on loessic material of silt loam texture with a slight slope to the east; soils in this area are classified as Entic and Typic Haplustoll characterized by Rollán and Bachmeier (2014).The climate in the study area is classified as dry sub-humid and the average annual rainfall is approximately 800 mm, concentrated in summer (Sayago et al., 2017).
Soybean is sown by direct seeding with maturity groups 3 and 4 of transgenic varieties resistant to glyphosate, with spacing between rows of 0.52 m, without fertilizer application.On the other hand, corn is sown by direct seeding also, with a distance between rows of 0.53 m and fertilized with urea (Bocco et al., 2014;Ferreyra, 2016) (Figure 1).Both crops were sown between October 15 and November 15 and harvested between April 20 and May 30.

Ground data
Yield prediction modeling was carried out using data collected from 26 plots with soybean (14) and corn (12), whose area varied between 15 and 40 ha.They were visited throughout the growing season (October -June).Immediately before harvesting (Figure 2), four samples were collected from each plot (within a circular area of 0.25 m 2 ).The samples were dried to constant weight in an oven, and the threshing was performed manually.To calculate the average yield of each plot the weight of the samples obtained was adjusted to 13% moisture.In order to validate the yields obtained from the field samples, all values were also compared with average data reported by the farm owners for each plot; both yields were equivalent.
Yield values obtained were in a range between 2,800 and 4,900 kg/ha for soybean plots and varied between 8,000 to 10,500 kg/ha for corn plots.As for the latter, there were, two plots with low yield (4,500 and 6,700 kg/ha, respectively), which were included because they allowed validating the robustness of the models, in presence of a significant variability.

satellite data
This study examined four images from Landsat 8 and SPOT 5.
Landsat 8: two images corresponding to December 28 (I1) and January 13 (I2), with seven bands (b1 to b7).These bands correspond to the Operational Land Imager (OLI) which together with the Thermal Infrared Sensor (TIRS) are the two sensors of Landsat 8.Both images have the same spatial (30 m), temporal (16 days) and radiometric resolutions.It was not possible to obtain images in February because the days were cloudy when images of the area were recorded by the satellite.
SPOT 5: two images corresponding to March 12 (I3) and March 22 (I4).These images have three All of them were obtained in clear sky days.Their specifications are described in Table 1.
In order to construct the models, digital numbers were converted to surface level reflectance with the method of dark object subtraction (Chavez, 1998).This correction and the rest of the image processing: cut of the study area, layer stacking, overlay regions of interest (ROI), were carried out using the software ENVI 4.6.1.ROIs covered the pixels with valid information without considering the borders.This was done because differences in the phenological stage of crops can occur and/or disturbed pixels from neighboring coverage could be included (Figure 3).
A subset of all the images was created to cover the study area, in order to make Landsat 8 and SPOT 5 spatial resolutions comparative.For this, the pixel size from Landsat 8 was adjust ed to SPOT 5 (each Landsat 8 pixel was divided into nine parts with the same attribute value).
The total number of data (pixels that conformed the ROIs) were 48,783 (soybean: 19,821 and corn: 28,962) and were randomized before being input into the models.Half of the reflectance data were used for the learning process, this is to obtain the models coefficients and weights, and the other half was used for the validation process, with Multiple   Linear Regression and NN methodologies.
In this work, we did not consider the algebraic relations between image bands (vegetation indices) as input for the models.On the one hand, as Ovando et al. (2016) and long before Doraiswamy et al. (2004) stated, the relationship between yield and Normalized Difference Vegetation Index (NDVI) may not be adequate in extreme meteorological conditions.This is because the difference between very good or average crop conditions can be masked by NDVI saturation.On the other hand, Li, Liang, Wang and Qin ( 2007) have already worked with these methodologies including this vegetation index; however, these authors used MODIS images with 1 km resolution.b-Neural networks were used to estimate crop yield: they are a structure of neurons joined by nodes that transmit information from one neuron to another, which produces a result by means of mathematical functions (Hilera González and Martínez Hernando, 2000).
In this work, multilayer perceptron neural networks (Figure 4) were designed with an input layer whose number of neurons was equal to the number of bands considered in each model; a hidden layer was designed with the same number of neurons than the respective input layer.The output layer was built with only one neuron that indicates the calculated crop yield.
The general steps that describe the training algorithm were executed according to Bocco et al. (2014).This process was repeated 3,000 times or until the mean square error was lower than a desired value (in these models equal to 0.01).
Table 2 summarizes all variables used, for each Multiple Linear Regression (MLRi) and Neural Network (NNi) models.They were designed taking into account all or some bands included in the images from one or both satellites.

Validation models and statistical analysis
A set of independent data (50% of total pixels, with known yields) was used for the validations of models.The errors were estimated using standard statistics: coefficient of determination (R 2 ) and root mean square error (RMSE); also scatter plots between measured and estimated values were considered.The equation for RMSE is given by: where n is the number of data, p is the number of parameters to be estimated, SSE is the sum of squared error.

resuLTs anD DIsCussIOn
All regression and neural network models developed to estimate yield provided a good fit with measured yield.Table 3 presents the R 2 and RMSE values, which show highly significant relationships between predicted and observed yield, for soybean.
Neural networks present the best fits when they are compared with MLR models, for soybean crop.Using all images and their bands, NN1 and MLR1 are the models that best estimate yield (Table 3).Models NN2 and MLR2, which did not include the SWIR (SPOT 5) and b6 -b7 bands (Landsat 8), also presented high values of coefficient of determination.It should be noted that if only early dates in the crop calendar are considered, i.e.only Landsat 8 images are used, with NN models, yield can be predicted with good accuracy (R 2 = 0.88).Estimates with multiple linear regression models present a lower coefficient of determination and RMSE increases about 50% (MLR4 and MLR5).Models that use the two bands included in the NDVI definition, and only dates corresponding to late phenological crop stages (SPOT images), present good fit with neural network (NN7).This result confirms what Bauer (1975) stated, that the infrared reflectance decreased and the red one increased with maturity after the crop had reached their maximum vegetative growth.
The results obtained are comparable to Gusso et al. (2013), who evaluated the Enhanced Vegetation Index, which is part of MOD13Q1-V005 product, to estimate soybean production in Rio Grande do Sul (Brasil).Results obtained using linear least squares regression analysis estimated production, at municipality and state level, with R 2 = 0.91 and R 2 = 0.82 respectively.Prasad, Chai, Singh and Kafatos (2006) improved yield estimation considering NDVI (from AVHRR-NOAA), soil moisture, surface temperature and rainfall.With these four inputs, at Iowa state level, they obtained a linear regression model with R 2 =0.86.Doraiswamy, Akhmedov, Beard, Stern and Mueller (2007) developed a 5-year multi-regression algorithm and showed, at state level, coefficients of determination of 0.88 (Iowa) and 0.44 (Illinois) for soybean.
Among the models that estimate corn yield, the best fit is also reached with NN methodology, using images from one or two satellites (Table 4).Using NN and exclusively images from one satellite, models that estimate corn yield with high precision included as input the bands that define the NDVI (NN12 and NN14).Using NN models that estimate corn yield, Li et al. (2007) informed for MLR models, R 2 values between 0.34 and 0.86 with RMSE values ranged from 1,290 to 2,500 kg/ ha and for NN models the R 2 values ranged from 0.53 to 0.94 with RMSE between 770 and 1,900 kg/ ha.The RMSE values obtained by these authors significantly exceed those obtained in this work.
All NN models presented estimations with better fit values than MLR models, when estimates were analysed at plot scale.It can be observed, in Figure 5, that for low or high yields, in both crops, the models present the greater over or underestimations respectively.accuracy as good as NN1 and NN8 models (Figure 6).This figure also shows that MLR methodology underestimates the high yields and overestimates yields lesser than 3,500 kg/ha significantly.

COnCLusIOns
This study shows that Landsat 8 and SPOT 5 images can be used to predict corn and soybean yield in early to mid season crop growth stages.Multiple linear regression and neural network models were generated to estimate crop yield using, all or some, spectral bands of one or both satellites.
Both methodologies were efficient in capturing the relationship between crop yield and spectral values, although neural network models have higher precision than the multiple linear regression Table 5 shows all coefficients for the best MLR models (two for soybean and two for corn, respectively).As previously stated the results of the work confirm the relevance of this methodology, although the coefficients are valid only for the study area.
Mixed models, which included all image data and both crops, showed very good results, their statistic values are presented in Table 6.
The mixed models, which can be used without previously discriminating the crop, achieved an   ones.However, the latter technique is a possible alternative due to its easy implementation.
If the crop type present in the plots is not known, the mixed models that include the presence of soybean and corn can accurately describe the crop yield.
The possibility of combining satellite images with climatologic or soil data to improve the performance of yield estimation, is the next step to be explored.

Figure 1 .
Figure 1.Composition of study area and main crops in Córdoba (Argentina): a) Soybean crop, b) SPOT image, and c) Corn plant

Figure 3 .
Figure 3. Photograph of a soybean plot with phenological differences between edge and inside.
variables are v i = surface reflectance of i band of SPOT/Landsat satellite and model constants are a i and b (this last includes the error term).

Figure 4 .
Figure 4. Scheme of the neural network of multilayer perceptron type.

*Figure 5 .
Figure 5. Soybean and corn yield, measured and estimated values for the best models, at plot level.

Table 1 .
Satellite and wavelength of each band, for images corresponding to study area

Table 2 .
Regression and Neural Network models: variables and images used

Table 4 .
Statistic values for corn yield estimation: Neural Network and Multiple Linear Regression models

Table 3 .
Statistic values for soybean yield estimation: Neural Network and Multiple Linear Regression models

Table 5 .
Regression coefficients for the best MLR models *

Table 6 .
Statistic values for mixed models that estimate yield