Ensemble Machine Learning Algorithms for Satellite Remote Sensing of Water Quality in Coastal Waters

Summary

I live near the Chesapeake Bay, the largest estuary in the United States. In 7th grade, I went on a field trip to the Bay and was shocked to learn the magnitude of its water quality problems. Because coastal oceans feature a rich variety of processes at different spatial and temporal scales, it is impractical and costly to rely on in-situ water sampling alone for monitoring water quality. Satellite remote sensing provides a synoptic view of the ocean surface at daily intervals and could offer a cost-effective alternative. The key to successful remote sensing is the development of algorithms that can infer inherent water properties such as chlorophyll or Total Suspended Sediment (TSS) concentration from reflectance measurements. Since the existing NASA ocean color algorithms do not work well in optically complex coastal waters, I decided to apply advanced machine learning algorithms. I hypothesized that these algorithms can discover the complex nonlinear relationships between satellite reflectance and chlorophyll or TSS concentration. I developed three algorithms: support vector regression, relevance vector machines, and artificial neural networks. The two kernel methods demonstrated skill in retrieving TSS and chlorophyll from satellite remote sensing. With further testing and validation, these algorithms can be applied to coastal oceans and estuaries worldwide. This may lead to a new monitoring tool for water quality and assist in the development of sound management strategies for coastal resources. Future work includes developing machine learning algorithms for detecting harmful algal blooms and for remote sensing by drones.

Question / Proposal

Coastal development and agricultural use of fertilizers have led to dramatic water quality declines in coastal systems worldwide. Excessive sediment loading smothers seagrass, oysters and other benthic organisms. Nutrient enrichment leads to excessive algae growth, hypoxia, and harmful algal blooms. This year large algal blooms devastated coastal Florida, creating a major human health crisis that attracted national media attention. The Florida Red Tides caused widespread fish kills and respiratory problems. Satellite remote sensing can provide a means to monitor water quality problems such as poor water clarity and toxic algal blooms. The Moderate Resolution Imaging Spectroradiometer (MODIS) onboard the Aqua satellite images the entire Earth every 1-2 days and measures water-leaving radiances in the visible to near-infrared range with a spatial resolution of 1 km. However, the current NASA algorithms designed to retrieve chlorophyll in the open ocean perform poorly in coastal waters. I sought to develop new algorithms that can overcome this limitation. I hypothesized that advanced machine learning algorithms can discover the complex nonlinear relationships between satellite reflectance and water properties such as chlorophyll and suspended sediment concentration. I expected that the algorithms could uncover relationships that would be hard or impossible to capture in regression models. I thought that the artificial neural network could fit the training data but might generalize poorly. On the other hand, kernel methods such as support vector machines were designed to minimize structural risk and would produce more robust algorithms.

Research

To use satellites or drones to monitor water quality in the ocean, algorithms must be developed to infer inherent water optical properties from reflectance measurements. In the open ocean (Case 1 water), various algorithms have been developed to infer phytoplankton chlorophyll from satellite remote sensing. They generally fall into two types: semi-analytical and empirical. Semi-analytical algorithms use analytic reflectance models that can be inverted to derive chlorophyll and absorption coefficients of other water constituents (Morel and Prieur 1977; Gordon et al. 1988). However, empirical formulae are needed in the parameterization of several terms in these models (Gordon and Morel 1983). On the other hand, empirical algorithms are based on a statistical regression of reflectance measurements versus in-situ chlorophyll measurements (Clark 1997). They use different formulations, including multiple regression and second-order and third-order polynomials (O’Reilly et al. 1998).
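As an illustration of the empirical type, the sketch below evaluates a polynomial in the logarithm of a blue-to-green reflectance ratio, which is the general shape of the band-ratio formulations mentioned above. The function name, the choice of two bands, and the coefficients are placeholders for illustration only; they are not NASA's operational values and were not fitted to any data.

```python
import math

def band_ratio_chlorophyll(rrs_blue, rrs_green, coeffs):
    """Empirical chlorophyll estimate from a blue/green reflectance ratio.

    log10(chl) is modeled as a polynomial in x = log10(rrs_blue / rrs_green).
    coeffs = [a0, a1, a2, ...] are regression coefficients (placeholders here).
    """
    x = math.log10(rrs_blue / rrs_green)
    log_chl = sum(a * x ** k for k, a in enumerate(coeffs))
    return 10.0 ** log_chl

# Hypothetical coefficients for a third-order polynomial (not fitted values).
example_coeffs = [0.3, -2.5, 1.0, -0.5]

# Equal blue and green reflectance gives x = 0, so only a0 contributes.
chl_equal = band_ratio_chlorophyll(0.004, 0.004, example_coeffs)

# A lower blue/green ratio (greener water) gives a higher estimate here.
chl_green = band_ratio_chlorophyll(0.002, 0.004, example_coeffs)
```

In practice the coefficients would come from regressing in-situ chlorophyll against matched reflectance ratios, which is why such formulas can break down when sediment and dissolved organic matter also shape the reflectance.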

Open-ocean remote sensing algorithms do not perform well in estuaries and coastal oceans (Case 2 waters) where the optical properties of seawater are affected by suspended sediment, phytoplankton, and colored dissolved organic matter (Mobley et al. 2004; Tzortziou et al. 2007). Machine learning algorithms such as the artificial neural network (ANN) have been proposed for Case 2 waters (Doerffer and Schiller 2006). Typically, ANN is based on the feedforward multilayer perceptron architecture (Mas and Flores 2008). Although ANN models are superior to regression models (Dransfeld et al. 2004), they have drawbacks. The design and training of ANN is a complex and costly task; minimization of training errors often leads to poor generalization performance.

Support vector machines, a family of supervised machine learning algorithms, provide an attractive alternative to ANN (Vapnik 1979; Cortes and Vapnik 1995), and have been used to classify remote sensing images (Mountrakis et al. 2011). Instead of minimizing training errors alone, support vector regression (SVR) models minimize structural risk, in which an upper bound on the generalization error is minimized (Smola and Schölkopf 2004). SVR has been used to retrieve oceanic chlorophyll concentrations (Zhan et al. 2003; Camps-Valls et al. 2006a, b). The relevance vector machine (RVM) is a Bayesian approach (Tipping 2000). It has an identical functional form to SVR but tends to lead to sparser solutions (less structural risk) with fewer basis functions. It has also been used in remote sensing studies of the open ocean (Camps-Valls et al. 2006a). However, neither SVR nor RVM has been applied to Case 2 coastal waters.

Robust and accurate algorithms for remote sensing of coastal waters are still lacking. In this study, I developed and tested three machine learning algorithms (ANN, SVR and RVM) for inferring chlorophyll and total suspended sediment. I selected the Chesapeake Bay as the study site because it has a long history of water quality problems. I explored a novel ensemble approach to improve the performance of the machine learning algorithms. These algorithms advance the science of remote sensing in Case 2 waters and may provide a powerful tool for monitoring water quality in estuaries and coastal oceans worldwide.

Method / Testing and Redesign

My project involves data assembly, algorithm training, and performance assessment.

Moderate Resolution Imaging Spectroradiometer (MODIS) reflectance data collected onboard the Aqua satellite are used as explanatory variables. They have a resolution of 1000 m and include ten wavelengths: 412, 443, 469, 488, 531, 547, 555, 645, 667, and 678 nm. The reflectance data were downloaded from NASA’s Ocean Color website (oceancolor.gsfc.nasa.gov). The in-situ chlorophyll or Total Suspended Sediment (TSS) data are used as the dependent variable. They were downloaded from the Chesapeake Bay Program (Figure 1, http://www.chesapeakebay.net/data). MODIS and in-situ measurements were paired by day and geographic location. In total, 1172 and 1268 data points were assembled for the chlorophyll and TSS datasets, respectively. They were randomly divided into a training dataset (80%) and an independent testing dataset (20%), and transformed to zero mean and unit variance.
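The split and scaling steps can be sketched as follows. This is a minimal Python illustration rather than the R code used in the project; the function names, the random seed, and the synthetic reflectance values are assumptions, as is the choice to compute the mean and standard deviation from the training portion only.

```python
import random
import statistics

def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle the paired records and split them into training and testing sets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_train = int(round(train_fraction * len(shuffled)))
    return shuffled[:n_train], shuffled[n_train:]

def fit_scaler(values):
    """Return the mean and (population) standard deviation of one variable."""
    return statistics.fmean(values), statistics.pstdev(values)

def standardize(values, mean, std):
    """Transform values to zero mean and unit variance using a fitted scaler."""
    return [(v - mean) / std for v in values]

# Synthetic stand-in for one reflectance band (1172 matched records).
rng = random.Random(1)
records = [rng.uniform(0.001, 0.02) for _ in range(1172)]

train, test = train_test_split(records)
mean, std = fit_scaler(train)            # assumption: scaler fitted on training data
train_z = standardize(train, mean, std)
test_z = standardize(test, mean, std)    # same transform applied to the test set
```

Fitting the scaler on the training portion keeps information from the testing dataset out of the training step; the same pattern would be repeated for each of the ten bands and for the dependent variable.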

Figure 1. Map of Chesapeake Bay. The red dots indicate the in-situ sampling locations. 

I trained three machine learning algorithms for TSS and chlorophyll using the R programming language. The programs that I wrote can be found here: ./Publication/ in Repository and Google Drive. (1) The Artificial Neural Network (ANN) is based on the feedforward multilayer perceptron architecture, consisting of an input layer, one or more hidden layers, and one output layer (Dransfeld et al. 2004; Mas and Flores 2008). I used three hidden layers with 30, 20, and 10 nodes, respectively. The backpropagation algorithm with weight backtracking was used, and the logistic function was chosen as the activation function. (2) The Support Vector Regression (SVR) seeks a function that deviates from the observed targets by at most 𝜺 and at the same time is as flat as possible (Smola and Schölkopf 2004). I adopted the soft-margin loss formulation (Cortes and Vapnik 1995). I used the radial basis kernel function to map the data to a higher-dimensional space where the data exhibit linear patterns. SVR has two hyperparameters: 𝜺 defines the width of the tolerance band, and C determines the trade-off between flatness and the penalty on deviations larger than 𝜺. I used a grid search to find the values of 𝜺 and C that minimize the mean-square error between the predicted and observed chlorophyll/TSS concentration. (3) The Relevance Vector Machine (RVM) has an identical functional form to the SVR but uses Bayesian inference (Tipping 2000; Camps-Valls et al. 2006a). RVM maximizes the logarithm of the likelihood of the weights.
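The SVR ingredients named above can be sketched in a few lines. This is an illustrative Python sketch, not the R code used in the project: the function names, the gamma parameter, the candidate grids, and the toy scoring function are all assumptions, and the toy score only stands in for "train an SVR with these settings and return its mean-square error".

```python
import itertools
import math

def rbf_kernel(x, z, gamma):
    """Radial basis function kernel: exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the +/- epsilon band, growing linearly outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

def grid_search(score, epsilons, costs):
    """Return the (epsilon, C) pair with the lowest score.

    `score` is any callable score(epsilon, C) -> float, for example the
    mean-square error of an SVR trained with those hyperparameters.
    """
    return min(itertools.product(epsilons, costs), key=lambda pair: score(*pair))

# Toy stand-in for "train SVR and return MSE" so the sketch runs on its own.
def toy_score(epsilon, C):
    return (epsilon - 0.1) ** 2 + (math.log10(C) - 1.0) ** 2

best_epsilon, best_C = grid_search(toy_score, [0.01, 0.1, 0.5], [1, 10, 100])
```

Errors smaller than 𝜺 cost nothing, which is what lets SVR ignore small noise, while C controls how heavily larger errors are penalized relative to flatness.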

Figure 2. Flow chart illustrating the training process for chlorophyll algorithms.

I used an ensemble approach to improve the performance of both the TSS and chlorophyll algorithms (Figure 2). I generated hundreds of new training datasets by omitting one data point from the original training dataset at a time. I trained ANN, SVR and RVM on each new training dataset and produced a prediction on the testing dataset. I then averaged the predictions from all the training datasets to create an ensemble mean prediction.
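The ensemble procedure can be sketched as a leave-one-out loop. The Python code below is a minimal illustration, not the project's R implementation: a least-squares line stands in for ANN, SVR, or RVM, and the function names and toy data are assumptions.

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least-squares line; a simple stand-in for ANN, SVR, or RVM."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def leave_one_out_ensemble(train_x, train_y, test_x, fit=fit_line):
    """Train one model per omitted training point and average the predictions."""
    member_predictions = []
    for i in range(len(train_x)):
        xs = train_x[:i] + train_x[i + 1:]   # training set without point i
        ys = train_y[:i] + train_y[i + 1:]
        model = fit(xs, ys)
        member_predictions.append([model(x) for x in test_x])
    # Ensemble mean: average over all members for each test point.
    return [statistics.fmean(column) for column in zip(*member_predictions)]

train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [1.0, 3.0, 5.0, 7.0, 9.0]          # toy data lying on y = 2x + 1
ensemble_mean = leave_one_out_ensemble(train_x, train_y, [5.0, 6.0])
```

Any model with a fit-then-predict interface can be passed as `fit`, so the same loop would serve all three algorithms.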

The research was conducted at the University of Maryland Center for Environmental Science and at home. The only equipment used was a desktop computer.

 

Results

The performance of the algorithms was examined using the independent testing dataset. Figure 3 shows scatter plots of the observed versus predicted chlorophyll. RVM and SVR show good agreement with the observations. ANN performs well on the training data but poorly on the testing data, indicating that ANN is over-fitted and generalizes poorly. The NASA chlorophyll algorithm OCM3 has the worst performance, with widely scattered points.

Figure 3. Observed versus predicted chlorophyll (log scale) for (a) NASA OCM3, (b) SVR, (c) RVM, and (d) ANN. Training data points are black and testing data points are red.

Diagnostics on the regression between the observed and predicted chlorophyll for RVM are shown in Figure 4. Residual errors plotted against the fitted values are small and scatter around zero, indicating that the variance of the residuals is the same for all fitted values (homoscedasticity). The points in the normal Q-Q plot fall close to the diagonal line, which suggests that the errors are normally distributed. The scale-location plot shows no significant patterns and confirms the homoscedasticity of the residuals. The leverage plot reveals no outsized influence of particular data points on the regression.

Figure 4. Regression diagnostics on RVM for chlorophyll: (a) residuals versus fitted values; (b) normal Q-Q plot; (c) scale location plot; (d) residuals versus leverage plot.

Four metrics were used to assess the prediction error: root-mean-square error (rmse) and mean-absolute-error (mae) to measure accuracy, mean-error (me) to measure bias, and the squared correlation coefficient (r-square) to measure the fit between the measured and predicted chlorophyll concentrations. The subscript 10 indicates that the metrics are calculated using the log-transformed data. For chlorophyll (Table 1), RVM performs best, with r-square of 0.51, rmse of 0.19 and bias of -0.01. SVR is the second best. NASA OCM3 performs worst, with r-square of 0.09 and rmse of 0.37. Another important metric is the sparsity, which measures the robustness of regression models. RVM uses a small number of relevance vectors relative to the training samples and has a low sparsity. SVR uses a large number of support vectors and has a sparsity of 0.49. ANN has a high value for the Akaike Information Criterion (AIC) and is not robust. Table 2 summarizes the performance metrics of the machine learning algorithms for TSS. Both RVM and SVR significantly outperform the previous models: the diffuse attenuation coefficient at 490 nm (Wang et al. 2009) and a third-degree polynomial (Ondrusek et al. 2012). ANN performs worst, with r-square of 0.05.
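For reference, the four error metrics can be computed on log10-transformed concentrations as sketched below. This Python sketch is illustrative only: the function name and toy values are assumptions, the bias is taken as predicted minus observed (the sign convention is not stated in the text), and r-square is computed as the squared Pearson correlation.

```python
import math

def log10_metrics(observed, predicted):
    """Error metrics computed on log10-transformed concentrations."""
    obs = [math.log10(v) for v in observed]
    pred = [math.log10(v) for v in predicted]
    n = len(obs)
    errors = [p - o for p, o in zip(pred, obs)]  # assumed sign: predicted minus observed

    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    me = sum(errors) / n

    # Squared Pearson correlation between observed and predicted values.
    mean_obs = sum(obs) / n
    mean_pred = sum(pred) / n
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(obs, pred))
    var_obs = sum((o - mean_obs) ** 2 for o in obs)
    var_pred = sum((p - mean_pred) ** 2 for p in pred)
    r_square = cov * cov / (var_obs * var_pred)

    return {"rmse": rmse, "mae": mae, "me": me, "r_square": r_square}

# Toy example: every prediction is exactly twice the observation.
observed = [1.0, 10.0, 100.0]
predicted = [2.0, 20.0, 200.0]
metrics = log10_metrics(observed, predicted)
```

In the toy example the predictions are perfectly correlated with the observations yet uniformly too high, so r-square is 1 while rmse, mae and me all equal log10(2). This is why bias and accuracy are reported alongside the correlation.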

Table 1. Performance comparison for chlorophyll algorithms. The metrics are r-square, rmse, mae, mean-absolute-percentage-error, me, and sparsity.

Table 2. Performance comparison of TSS algorithms.

The monthly climatology of chlorophyll distribution retrieved using RVM shows a seasonal progression of phytoplankton in excellent agreement with previous observations (Figure 5). During the cold winter months, chlorophyll concentration was low. As the water warmed and more nutrients were delivered from the rivers during the spring, an algal bloom developed in the estuary. Chlorophyll remained relatively high during the hot summer months but decreased in the fall.

Figure 5. Monthly chlorophyll climatology obtained using RVM.

 

Conclusion

Three machine learning algorithms were developed for retrieving chlorophyll and TSS from satellite remote sensing, and their predictive skills were compared against the NASA operational ocean color OCM3 algorithm and previous TSS algorithms. The OCM3 algorithm has a near-zero correlation coefficient, large root-mean-square error and large bias when used to retrieve chlorophyll in the Chesapeake Bay, and thus performs poorly in Case 2 coastal waters. ANN has a low correlation coefficient, large root-mean-square error and large bias in predicting TSS. Although ANN predicts chlorophyll mostly accurately on the training data, its performance on the testing data is greatly degraded (Figure 3). This indicates that ANN is over-fitted to the training data and has poor generalization performance. In comparison, RVM and SVR display very good skill in retrieving both chlorophyll and TSS in coastal waters. Their numerical metrics such as the correlation coefficient, the root-mean-square error and the bias are of similar magnitudes, indicating similar predictive skill. However, the number of relevance vectors used in RVM is 10-20% of the number of support vectors used in SVR, indicating that RVM is a much more robust and computationally efficient algorithm than SVR.

My numerical analysis supported the original hypothesis that machine learning algorithms can discover the complex nonlinear relationships between satellite reflectance and water properties such as chlorophyll and suspended sediment concentration. In particular, advanced kernel methods such as support vector regression and the relevance vector machine demonstrated superior skill in retrieving water properties in optically complex coastal waters. The artificial neural network model used multiple hidden layers to fit the training data closely and performed poorly on the independent testing data, showing poor generalization performance. In contrast, SVR and RVM were designed to minimize structural risk, and their model complexity was controlled through a regularization term, leading to accurate and robust algorithms that performed well on the independent testing data.

Overall, the relevance vector machine and support vector regression are two promising algorithms for satellite remote sensing of water quality in coastal waters. One limitation is that my algorithm development was based on in-situ data collected in one coastal region (the Chesapeake Bay). In order to apply these machine learning algorithms worldwide, I would need to test them against in-situ data collected in other geographic regions.

If the algorithms are validated in other coastal regions, they would represent major progress in satellite remote sensing of coastal waters. With the expected further improvements in satellite sensors, such as finer spatial resolution and hyperspectral cameras, we would gain the capability to monitor coastal oceans at unprecedented resolution. As drones carrying hyperspectral imaging cameras become widely available, we could deploy a new remote sensing tool at a moment’s notice to survey coastal sites suspected of an emerging problem such as toxic algal blooms. For future work, I would like to develop algorithms for hyperspectral remote sensing. I would also like to train algorithms for detecting harmful algal blooms and develop a warning system that can save lives.

About me

I became interested in math at an early age. I have taken math classes from Art of Problem Solving (AoPS) since fifth grade. This past summer I participated in the Ross Mathematics Program at Ohio State University and loved to “think deeply of simple things”. In middle school, I conducted a statistical analysis of cancer recurrence and published a paper (https://www.emerginginvestigators.org/articles/a-retrospective-statistical-analysis-of-second-primary-cancers-in-the-delmarva-peninsula-u-s-a). During a field trip to the Chesapeake Bay, I became concerned about the health of coastal oceans. I would like to help by working on interdisciplinary projects that bridge environmental science and computer science.

I admire Richard Feynman and Rachel Carson. I enjoyed reading Richard Feynman’s book Surely You’re Joking, Mr. Feynman. Not only was he a world-renowned scientist, he was also charismatic. Rachel Carson’s book Silent Spring exposed the harmful effects of pesticides. Her work initiated an environmental movement and led to the creation of the Environmental Protection Agency.

I plan to major in computer science and minor in environmental science in college. My long-term goal is to conduct machine learning research at Google or become a professor at a research university. I was fortunate to have won national science awards. I was a Broadcom MASTERS semifinalist (2017) and won a 4th place Grand Award at ISEF (2018). I was named one of the twenty Davidson Fellows in the nation (2018). Google conducts cutting-edge research in machine learning and genuinely cares about the environment. I would feel very honored to win a prize from the Google Science Fair.

Health & Safety

I worked at Horn Point Laboratory, University of Maryland Center for Environmental Science, and at home. My work was entirely computational, so no special health or safety guidelines applied. My adviser was Dr. Greg Silsbe. Here is his contact information:

Email: gsilsbe@umces.edu
Phone: 410-221-8247

Bibliography, references, and acknowledgements

I would like to acknowledge my mentor Dr. Greg Silsbe. He gave me the data I used. I independently developed the machine learning methodology and implemented the models on my own. Dr. Silsbe provided helpful tips and suggestions and taught me to write more efficient programs in the R programming language.

I worked at the Horn Point Lab, the University of Maryland Center for Environmental Science. No special equipment was used.

 

Camps-Valls, G., et al. (2006a) Retrieval of oceanic chlorophyll concentration with relevance vector machines. Remote Sensing of Environment, 105, 23-33.

Camps-Valls, G., Bruzzone, L., Rojo-Álvarez, J., & Melgani, F. (2006b) Robust support vector regression for biophysical variable estimation from remotely sensed images. IEEE Geoscience and Remote Sensing Letters, 3(3), 93-97.

Clark, C. (1997) Reconstructing the evolutionary dynamics of former ice sheets using multi-temporal evidence, remote sensing and GIS. Quaternary Science Reviews, 16(9), 1067-1092.

Cortes, C., & Vapnik, V. (1995) Support-vector networks. Machine Learning, 20(3), 273-297.

Doerffer, R., & Schiller, H. (2006) The MERIS Case 2 water algorithm. International Journal of Remote Sensing, 28(3-4), 517-535.

Dransfeld, S., Tatnall, A., Robinson, I., & Mobley, C. (2004). A comparison of multi-layer perceptron and multilinear regression algorithms for the inversion of synthetic ocean colour spectra. International Journal of Remote Sensing, 25(21), 4829-4834.

Gordon, H., & Morel, A. (1983) Remote assessment of ocean color for interpretation of satellite visible imagery. Berlin: Springer-Verlag.

Gordon, H., et al. (1988) A semianalytic radiance model of ocean color. Journal of Geophysical Research, 93(D9), 10909-10924.

Mas, J., & Flores, J. (2008) The application of neural networks to the analysis of remotely sensed data. International Journal of Remote Sensing, 29(3), 617-663.

Morel, A., & Prieur, L. (1977) Analysis of variations in ocean color. Limnology and Oceanography, 22(4), 709-722.

Mountrakis, G., Im, J., & Ogole, C. (2011) Support vector machines in remote sensing: A review. ISPRS Journal of Photogrammetry and Remote Sensing, 66(3), 247-259.

O’Reilly, J. et al. (1998) Ocean color chlorophyll algorithms for SeaWiFS. Journal of Geophysical Research, 103(C11), 24937-24953.

Ondrusek, M., Stengel, E., Kinkade, C., Vogel, R., Keegstra, P., Hunter, C., & Kim, C. (2012). The development of a new optical total suspended matter algorithm for the Chesapeake Bay. Remote Sensing of Environment, 119, 243-254.

Smola, A. J., & Schölkopf, B. (2004). A tutorial on support vector regression. Statistics and Computing, 14(3), 199-222.

Tzortziou, M. et al. (2007). Remote sensing reflectance and inherent optical properties in the mid Chesapeake Bay. Estuarine, Coastal and Shelf Science, 72(1-2), 348-362.

Vapnik, V. (1979) Estimation of dependences based on empirical data. Moscow: Nauka.

Wang, M., Son, S., & Harding, L. (2009) Retrieval of diffuse attenuation coefficient in the Chesapeake Bay and turbid ocean regions for satellite ocean color applications. Journal of Geophysical Research, 114(C10).

 

The R programming language and R libraries I used are cited below:

Bivand, R., Keitt, T., and Rowlingson, B. (2018). rgdal: Bindings for the 'Geospatial' Data Abstraction Library. R package version 1.3-6. https://CRAN.R-project.org/package=rgdal

Bivand, R., Pebesma, E., Gomez-Rubio, V. (2013). Applied spatial data analysis with R, Second edition. Springer, NY. http://www.asdar-book.org/

Bivand, R., and Rundel, C. (2018). rgeos: Interface to Geometry Engine - Open Source ('GEOS'). R package version 0.4-2. https://CRAN.R-project.org/package=rgeos

Fritsch, S., and Guenther, F. (2016). neuralnet: Training of Neural Networks. R package version 1.33. https://CRAN.R-project.org/package=neuralnet

Garnier, S. (2018). viridis: Default Color Maps from 'matplotlib'. R package version 0.5.1. https://CRAN.R-project.org/package=viridis

Hijmans, R. (2017). raster: Geographic Data Analysis and Modeling. R package version 2.6-7. https://CRAN.R-project.org/package=raster

Karatzoglou, A., Smola, A., Hornik, K., Zeileis, A. (2004). Kernlab - An S4 Package for Kernel Methods in R. Journal of Statistical Software 11(9), 1-20. URL http://www.jstatsoft.org/v11/i09/

Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A., and Leisch, F. (2018). e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. R package version 1.7-0. https://CRAN.R-project.org/package=e1071

Microsoft Corporation and Weston, S. (2017). doParallel: Foreach Parallel Adaptor for the 'parallel' Package. R package version 1.0.11. https://CRAN.R-project.org/package=doParallel

Pebesma, E., and Bivand, R. (2005). Classes and methods for spatial data in R. R News 5(2), https://cran.r-project.org/doc/Rnews/.

R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

Schnute, J., Boers, M., and Haigh, R. (2017). PBSmapping: Mapping Fisheries Data and Spatial Analysis Tools. R package version 2.70.4. https://CRAN.R-project.org/package=PBSmapping