Deepwound: Automated Postoperative Wound Assessment and Surgical Site Surveillance through Convolutional Neural Networks

Summary

Postoperative wound complications are a significant cause of expense for hospitals, doctors, and patients. Hence, an effective method to diagnose the onset of wound complications is strongly desired. Algorithmically classifying wound images is a difficult task due to the variability in the appearance of wound sites.

Convolutional neural networks (CNNs), a subgroup of artificial neural networks that have shown great promise in analyzing visual imagery, can be leveraged to categorize surgical wounds. I present a multi-label CNN ensemble, Deepwound, trained to classify wound images using only image pixels and corresponding labels as inputs.

My final computational model accurately identifies nine labels: wound presence, drainage, fibrinous exudate, granulation tissue, surgical site infection, open wound, staples, steri strips, and sutures. It achieves receiver operating characteristic (ROC) area under the curve (AUC), sensitivity, specificity, and F1 scores superior to prior work in this area.

Smartphones provide a means to deliver accessible wound care due to their increasing ubiquity. Paired with deep neural networks, they offer the capability to provide clinical insight to assist surgeons during postoperative care. I also present a mobile application frontend to Deepwound, named Theia, that assists patients in tracking their wound and surgical recovery from the comfort of their home.

The next step will be to embed small neural networks directly within smartphones, enabling physicians in areas without wireless access to monitor wounds as well. Detecting complications quickly will also help expedite treatment in these areas.

Question / Proposal

Currently, most wound findings are documented via visual assessment by surgeons. Patients revisit their surgeon a few days after the operation for this checkup, which takes up valuable time the surgeon could use to help other patients. Infections can also set in before the scheduled visit, and the delay until the checkup can exacerbate the issue. Moreover, surgical wounds are rarely quantified. An automated analysis of a wound image can provide a complementary opinion and draw the surgeon's attention to particular issues detected in a wound. Thus, a rapid and portable computer-aided diagnosis (CAD) tool for wound assessment would greatly assist surgeons in determining the status of a wound in a timely manner.

My project goal is to develop a multi-label convolutional neural network (CNN) for automated postoperative wound classification that predicts the onset of negative afflictions, such as surgical site infections (SSIs), granulation tissue, or fibrinous exudate. To make this accessible, I build a mobile application that provides a user-facing implementation of my CAD system. It includes clinically relevant features, such as daily documentation of patient health and generation of wound assessments. The app enables patients to generate wound analysis reports and send them to the surgeon regularly from a remote location, such as their home.

I will evaluate the performance of my machine learning models on each wound label using four metrics: sensitivity, specificity, F1 score, and the area under the receiver operating characteristic (ROC) curve. I will also use saliency maps as a qualitative check on which image regions the models attend to.

Research

Introduction

A critical issue in the healthcare industry is the effective management of postoperative wounds. The World Health Organization estimates that 266.2 to 359.5 million surgical operations were performed in 2012, an increase of 38% over the preceding eight years [1]. Surgeries expose patients to an array of possible afflictions at the surgical site. Surgical site infection (SSI) is an expensive healthcare-associated infection: the difference between the mean unadjusted costs for patients with and without SSI is approximately $21,000 [2]. Thus, individual SSIs have a significant financial impact on healthcare providers, patients, and insurers. SSIs occur in 2-5 percent of patients undergoing inpatient surgery in the United States, resulting in approximately 160,000 to 300,000 SSIs each year, as summarized by [3].

Advances in software and hardware, in the form of powerful algorithms and computing units, have allowed for deep learning algorithms to solve a wide variety of tasks which were previously deemed difficult for computers to tackle. Challenging problems such as playing strategic games like Go [4] and poker [5], and visual recognition [6] are now possible using modern computing. Convolutional neural networks (CNN) have demonstrated accurate image classification after being trained on a large dataset of samples [7]. In the past decade, research efforts have led to impressive results on medical tasks, such as skin lesion inspection [8] and X-ray based pneumonia identification [9].

Prior Work

While the applications of machine learning in healthcare are numerous, few have attempted to solve the problem of postoperative wound analysis and surgical site monitoring. I would like to summarize two key pieces of research that sought to build models similar to the one presented in this paper.

Wang et al. showcased a comprehensive pipeline for wound analysis, from wound segmentation to infection scoring and healing prediction [10]. For binary infection classification, they obtained an F1 score of 0.348 and accuracy of 95.7% with a Kernel Support Vector Machine (SVM) trained on CNN generated features. Their data set consisted of 2,700 images with 150 cases positive for SSI.

Sanger et al. used classical machine learning to predict the onset of SSI in a wound; their model is trained on baseline risk factors (BRFs), such as preoperative labs (e.g., blood tests) and the type of operation [11].

While the infection scoring model presented by Wang et al. does achieve an accuracy of 95.7%, I believe this metric is insufficient due to the class imbalance in their dataset. Sensitivity, specificity, F1 score, and ROC curves are better metrics that address this issue, and my work improves upon them relative to Wang et al. While Sanger et al. built a predictive methodology based on BRFs, my approach leverages pixel data from wound images. Thus, my research complements any analysis using BRFs.

According to my literature search, no prior work has modeled dressing identification, or ailments other than SSIs, using computational techniques. Thus, I believe I have built the most robust and comprehensive wound classification algorithm to date.

Method / Testing and Redesign

Prior to this research, a dataset of 1,335 smartphone wound images was collected, primarily from patients and surgeons at the Palo Alto VA Hospital and the Washington University Medical Center in St. Louis. The dataset also includes images gathered from internet searches to counteract class imbalance. All images were anonymized and cropped into identical squares.

The images are highly diverse, ranging from open wounds with infections to closed wounds with sutures. Table I shows the breakdown of the entire dataset.

Many tools went into the development of this research. My models were engineered using the Keras [12], OpenCV [13], and Scikit-learn [14] frameworks in the Python 3.7 programming language.

The figure below summarizes the four steps in the development of my model.

Data Preprocessing

Once the data is loaded into memory, images are resized to 224 by 224 pixels to fit the input of my CNN architecture. The input layer is 224 by 224 by 3, the final dimension accounting for the three color channels. 80% of the data is used to train my model and 20% is held out for evaluation and testing.

A critical component of the preprocessing stage is to compensate for the vast differences in lighting and position found in smartphone images. To address this, I apply contrast limited adaptive histogram equalization (CLAHE) to each image [15].
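As an illustration, the sketch below shows one way these preprocessing steps could be implemented with OpenCV and scikit-learn. The CLAHE parameters, the lightness-channel strategy, and the variable names are assumptions made for this example, not the exact settings used in my pipeline.

import cv2
import numpy as np
from sklearn.model_selection import train_test_split

def preprocess(image_bgr):
    # Resize to the 224 x 224 x 3 input expected by the network.
    img = cv2.resize(image_bgr, (224, 224))
    # Apply CLAHE to the lightness channel only, so colors are preserved.
    lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    lab = cv2.merge((clahe.apply(l), a, b))
    return cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)

# images: list of BGR arrays; labels: (N, 9) binary matrix of wound labels
# X = np.stack([preprocess(im) for im in images]).astype("float32") / 255.0
# X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)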

Model Generation

I take the preprocessed images and generate three slightly different CNNs based on the WoundNet architecture. WoundNet is derived by adjusting a state-of-the-art CNN, VGG-16 [16], to better suit my specific problem; the resulting model is illustrated below.

WoundNet has reduced computational complexity compared to VGG-16 and is capable of multi-label classification. Rather than creating nine individual binary classifiers, I train each neural network to label images with all nine classes. This enables the model to find inter-label correlations through representations shared within the network.
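As a rough Keras sketch of what such a multi-label head looks like, the snippet below attaches a nine-unit sigmoid output to a VGG-16 base. The label ordering, the size of the dense layer, and the dropout rate are illustrative assumptions; the actual WoundNet modifications are described above and in the figure.

from keras.applications import VGG16
from keras.layers import Dense, Dropout, Flatten
from keras.models import Model

# Hypothetical label order for the nine classes; the real ordering may differ.
LABELS = ["wound", "drainage", "fibrinous_exudate", "granulation_tissue",
          "infection", "open_wound", "staples", "steri_strips", "sutures"]

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
x = Flatten()(base.output)
x = Dense(256, activation="relu")(x)   # smaller head than VGG-16's 4096-unit layers
x = Dropout(0.5)(x)
# Sigmoid (not softmax) output: each of the nine labels is predicted independently.
outputs = Dense(len(LABELS), activation="sigmoid")(x)
model = Model(inputs=base.input, outputs=outputs)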

Model Training via Transfer Learning

Transfer Learning

In practice, it is very difficult to train a CNN end-to-end starting with randomly initialized weights. Furthermore, huge datasets with upwards of a million images are necessary to successfully train an accurate neural network from scratch. Too little data, as in my case, would cause a model to overfit.

I employ transfer learning [17], in the form of fine-tuning, to leverage features pre-learned on the ImageNet database. The original VGG-16 model was trained end-to-end using approximately 1.3 million images (1,000 object classes) from the ImageNet Large Scale Visual Recognition Challenge. I use the weights and layers of the original VGG-16 model as a starting point for training my models.
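Continuing the sketch above, one common fine-tuning recipe is to freeze the early convolutional blocks and train only the later layers together with the new classification head. The number of frozen layers, the optimizer, and the loss shown here are assumptions for illustration, not necessarily the exact configuration used for WoundNet.

# Freeze the early convolutional blocks so their generic ImageNet features are reused.
for layer in base.layers[:15]:
    layer.trainable = False

# Binary cross-entropy treats each of the nine labels as an independent yes/no decision.
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])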

Data Augmentation

In order to make the most of my training set, I apply aggressive data augmentation before feeding images into the CNN. Data augmentation improves the generalization and performance of a deep neural network by copying images in the training set and applying a variety of random perturbations to them.
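A minimal way to express this in Keras is with an ImageDataGenerator that applies random flips, rotations, shifts, and zooms on the fly during training. The specific ranges below are placeholder values, not the exact augmentation settings I used.

from keras.preprocessing.image import ImageDataGenerator

# Each epoch sees randomly perturbed copies of the training images rather than the originals.
augmenter = ImageDataGenerator(rotation_range=30,
                               width_shift_range=0.1,
                               height_shift_range=0.1,
                               zoom_range=0.2,
                               horizontal_flip=True,
                               vertical_flip=True)

# model.fit_generator(augmenter.flow(X_train, y_train, batch_size=32),
#                     steps_per_epoch=len(X_train) // 32, epochs=50)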

Results

I now present results for my computational model. I evaluate my CNN ensemble by calculating a variety of classification metrics (e.g., accuracy, sensitivity, specificity, and F1 score), analyzing receiver operating characteristic curves, and generating saliency maps.

Numerical Metrics

I use several metrics to evaluate the performance of the ensemble as a whole: accuracy, sensitivity, specificity, and F1 score (Equations 1-4).
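For reference, these are the standard definitions in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN):

Accuracy = (TP + TN) / (TP + TN + FP + FN)     (1)
Sensitivity = TP / (TP + FN)     (2)
Specificity = TN / (TN + FP)     (3)
F1 = 2TP / (2TP + FP + FN)     (4)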

The latter three are more reliable than accuracy for this problem because they reflect how well the model discerns both the presence and the absence of a particular ailment. Table II displays all of my scores.

The area under the curve (AUC) of a receiver operating characteristic (ROC) curve is a useful metric for assessing the performance of a binary classifier. ROC curves graphically represent the trade-off between sensitivity and specificity at every possible cutoff. Better classifiers have higher AUC values for their ROC curves, while worse classifiers have lower AUC values. The red, dashed line along the center of the chart represents 50% probability, or random chance. Since the AUC values are significantly higher than 0.5, it is clear that the CNNs are capturing key information within the dataset. With a larger dataset, I am confident the models can attain surgeon-level accuracy on this problem.
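As an example of how these per-label curves can be produced, the sketch below uses scikit-learn on the held-out test set; it reuses the hypothetical names from the earlier snippets and is not the exact evaluation script.

from sklearn.metrics import roc_auc_score, roc_curve

# y_test: (N, 9) binary ground truth; y_prob: (N, 9) sigmoid outputs from the model
y_prob = model.predict(X_test)
for i, name in enumerate(LABELS):
    auc = roc_auc_score(y_test[:, i], y_prob[:, i])
    fpr, tpr, _ = roc_curve(y_test[:, i], y_prob[:, i])   # points for plotting the ROC curve
    print("{}: AUC = {:.3f}".format(name, auc))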

Visual Metrics

When analyzing digital images using machine learning, it is important to understand why a classifier works. Saliency maps are an established way to visualize the inner workings of a CNN: a heat map highlights the features within the image that the classifier focuses on [20]. I generate saliency maps from one of my CNNs on several images in the validation set to ensure that my classifiers identify the regions of interest for a particular label in an image. I can confirm that the attention of the model is drawn to the correct regions in the images.
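A gradient-based saliency map in the spirit of [20] can be computed with a few lines of Keras backend code: take the gradient of a label's score with respect to the input pixels and collapse the color channels. This is a sketch that assumes the graph-mode TensorFlow backend used by Keras at the time, not my exact visualization code; with TensorFlow 2 one would use tf.GradientTape instead.

import numpy as np
from keras import backend as K

def saliency_map(model, image, label_index):
    # Gradient of the chosen label's sigmoid score with respect to the input pixels [20].
    score = model.output[0, label_index]
    grads = K.gradients(score, model.input)[0]
    compute = K.function([model.input], [grads])
    g = compute([image[None, ...]])[0][0]
    # Collapse the color channels; large values mark pixels the model attends to.
    return np.max(np.abs(g), axis=-1)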

Mobile Application: Theia

With a predicted 6.8 billion smartphones in the world by 2022 [19], mobile health monitoring platforms can be leveraged to provide the right care at the right time. A mobile application is a natural way to deliver my Deepwound model to patients and providers. Theia is a proof of concept of how Deepwound can assist physicians and patients in postoperative wound surveillance. The first component of the app is the "Quick Test": physicians or patients can quickly photograph a wound and generate a wound assessment, which provides a positive or negative value for each label affiliated with a wound.

With permission from the patient, this app can also be used to collect wound images to add to my dataset. The enlarged dataset can be used to further improve my deep learning algorithms. As more patients and surgeons use the app, more image data can be collected. This newly accumulated data can be used to train my CNNs even further, leading to a virtuous cycle of improving accuracy.

Conclusion

In summary, my work describes a new machine learning based approach using CNNs to analyze an image of a wound and document its wellness. My research extends prior work by introducing models that are capable of identifying multiple characteristics present within a wound. My implementation achieves scores that improve upon the image classification pipeline introduced by Wang et al. and is much easier to use than the baseline risk factor (BRF) approach suggested by Sanger et al.

I acknowledge that my data set size is small and has some imbalance. This is a common problem in medical research as the data needs to be gathered over a sustained period of time with health compliant processes. I overcome these hurdles through the use of aggressive data augmentation, transfer learning, and an ensemble of three CNNs.

My approach to analysis and delivery with a smartphone is a unique contribution. It enables several key benefits: tracking a patient remotely, easing communication with the medical team, and detecting the early onset of infection. Widespread use can also enable automated data collection and classification at a lower cost, which in turn can improve the machine learning algorithm through re-training with a larger dataset. My mobile app can also generate comprehensive wound reports that can be used for billing insurers, saving surgeons time.

Future Work

There are many ways to improve my algorithm. On a larger scale, it is necessary to gather more images for both training and testing. Creating a robust corpus of images will enable me to improve the performance of my method, since more labeled data generally leads to better performance in deep learning. A larger dataset would also let me try different CNN architectures, such as ResNet and DenseNet, or even experiment with custom architectures.

I would like to add blur detection before analyzing an image. If the image is too blurry, the app can notify the user and request a clearer picture. There are many well-known techniques to measure blur within an image; one example is sketched below.
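One common heuristic (my example here, not a technique named in the original plan) is the variance of the Laplacian: a blurry photo has few sharp edges, so the Laplacian response has low variance. The threshold below is an arbitrary placeholder that would need tuning on real wound photos.

import cv2

def is_blurry(image_bgr, threshold=100.0):
    # Low variance of the Laplacian indicates few edges, i.e. a likely blurry image.
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var() < threshold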

I would also like to look into embedding my model into mobile devices directly without the need for a server. This will drastically increase speed for users and enable them to use the app in locations without access to the internet. Finally, I would like to extend my wound assessment framework by developing a computational model to track the healing of a wound using a time-series of images which can be collected using the current version of the mobile app.

About me

Hi! I'm Varun Shenoy, a senior at Cupertino High School in Cupertino, CA. I've always been a maker with a deep desire to build things to empower others, often by combining product development skills with cutting-edge research in the realms of artificial intelligence and healthcare. 

I was introduced to research during my sophomore year when I met Dr. Oliver Aalami, a surgeon at the Palo Alto Veterans Affairs Hospital and clinical professor at Stanford, at a local hackathon. After one successful project, inspired by the veterans in the hospital, we collaborated on developing an automated tool to assess postoperative wounds over the next summer. Winning the Google Science Fair would validate this research.

Since then, I've worked alongside Prof. Kurt Keutzer's group in the UC Berkeley Artificial Intelligence Research Lab on developing computational algorithms to automatically segment cancerous regions within a brain MRI. Building on my new interest in medical imaging technology, I consulted with Dr. Bao Do's computational radiology group at Stanford. I created a web interface for bone radiograph annotation, showcased in this video.

I've built multiple mobile apps cumulatively amassing over 20,000 downloads worldwide. Research and app development have fulfilled many of my childhood dreams, from attending the Consumer Electronics Show to the Worldwide Developers Conference.

My future plans include attaining a comprehensive education in computer science with a focus on artificial intelligence and entrepreneurship. I cite Alan Turing, Randy Pausch, David Pogue, my family, and teachers as key influences on my development as a scientist and engineer. 

Health & Safety

Dr. Oliver Aalami, my research mentor, worked closely with medical students to collect HIPAA-compliant data by onboarding patients at the Palo Alto Veterans Affairs Hospital onto this project. Data collection occurred after obtaining IRB approval from Stanford University.

I worked with de-identified data provided by Dr. Aalami.

 

Mentor Contact Details

Dr. Oliver Aalami

aalami@stanford.edu

(650) 315-3236

Clinical Associate Professor of Surgery at Stanford University and Education Site Director at the Veterans Affairs Palo Alto Health Care System

Bibliography, references, and acknowledgements

Acknowledgments

This research would not have been possible without the advice and in-person/email mentorship from the following individuals.

Dr. Oliver Aalami from Stanford University and Palo Alto Veterans Affairs Hospital guided me on all medical aspects of the project. He led the effort for the collection of data and its curation. As the sole programmer on our team, I had the opportunity to work on the entire computational area of the project, presented in this submission.

Dr. Mohammed Zayed from Washington University in St. Louis assisted in collecting images at the hospital and merging them with the rest of the dataset.

Dr. Andre Esteva from Stanford University graciously spent some time out of his busy schedule to answer any questions about artificial neural networks, convolutional neural networks, data augmentation, and transfer learning.

Thank you to the patients and students who assisted Dr. Aalami in collecting and labeling images for the dataset used in this research.

Thank you to my phenomenal teachers and Cupertino High School for fostering my passion for learning, be it in STEM or the humanities.

Finally, I would like to thank my family for supporting me throughout my research.

 

References

[1] T. G. Weiser, A. B. Haynes, G. Molina, S. R. Lipsitz, M. M. Esquivel, T. Uribe-Leitz, R. Fu, T. Azad, T. E. Chao, W. R. Berry and A. A. Gawande, ”Size and distribution of the global volume of surgery in 2012,” Bulletin of the World Health Organization, 2016.

[2] M. L. Schweizer, J. J. Cullen, E. N. Perencevich and M. S. Vaughan Sarrazin, ”Costs Associated With Surgical Site Infections in Veterans Affairs Hospitals,” JAMA Surg, vol. 146, no. 9, pp. 575-581, 2014.

[3] D. J. Anderson, K. Podgorny, S. I. Berrios-Torres, D. W. Bratzler, E. P. Dellinger, L. Greene, A. C. Nyquist, L. Saiman, D. S. Yokoe, L. L. Maragakis and K. S. Kaye, ”Strategies to Prevent Surgical Site Infections in Acute Care Hospitals: 2014 Update,” Infect Control Hosp Epidemiol, vol. 35, no. 6, pp. 605- 627, 2014.

[4] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel and D. Hassabis, ”Mastering the game of Go without human knowledge,” Nature, vol. 550, no. 7676, p. 354, 2017.

[5] N. Brown and T. Sandholm, ”Superhuman AI for heads-up no-limit poker: Libratus beats top professionals,” Science, 2017.

[6] J. Deng, W. Dong, R. Socher, L. J. Li, L. Kai and L. Fei-Fei, ”ImageNet: A large-scale hierarchical image database,” IEEE Conference on Computer Vision and Pattern Recognition, 2009.

[7] Y. LeCun, Y. Bengio and G. Hinton, ”Deep learning,” Nature, vol. 521, pp. 436-444, 2015.

[8] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau and S. Thrun, ”Dermatologist-level classification of skin cancer with deep neural networks,” Nature, vol. 542, pp. 115-118, 2017.

[9] P. Rajpurkar, J. Irvin, K. Zhu, B. Yang, H. Mehta, T. Duan, D. Ding, A. Bagul, C. Langlotz, K. Shpanskaya, M. P. Lungren and A. Y. Ng, ”CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning,” arXiv preprint arXiv:1711.05225, 2017.

[10] C. Wang, X. Yan, M. Smith, K. Kochhar, M. Rubin, S. M. Warren, J. Wrobel and H. Lee, ”A unified framework for automatic wound segmentation and analysis with deep convolutional neural networks,” 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Milan, 2015.

[11] P. C. Sanger, G. H. van Ramshorst, E. Mercan, S. Huang, A. L. Hartzler, C. A. Armstrong, R. J. Lordon, W. B. Lober and H. L. Evans, ”A Prognostic Model of Surgical Site Infection Using Daily Clinical Wound Assessment,” J Am Coll Surg, vol. 223, no. 2, pp. 259-270, 2016.

[12] F. Chollet, ”Keras,” 2015. [Online]. Available: https://github.com/keras-team/keras.

[13] G. Bradski, ”The OpenCV Library,” Dr. Dobb’s Journal: Software Tools for the Professional Programmer, vol. 25, no. 11, pp. 120-123, 2000.

[14] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay, ”Scikit-learn: Machine Learning in Python,” Journal of Machine Learning Research, vol. 12, no. Oct, pp. 2825-2830, 2011.

[15] S. M. Pizer, E. P. Amburn, J. D. Austin, R. Cromartie, A. Geselowitz, T. Greer, B. ter Haar Romeny, J. B. Zimmerman and K. Zuiderveld, ”Adaptive Histogram Equalization and its Variations,” Computer Vision, Graphics and Image Processing, vol. 39, pp. 355-368, 1987.

[16] K. Simonyan and A. Zisserman, ”Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.

[17] A. Karpathy, ”Transfer Learning,” Stanford University. [Online]. Available: http://cs231n.github.io/transfer-learning/.

[18] D. P. Kingma and J. Ba, ”Adam: A Method for Stochastic Optimization,” arXiv preprint arXiv:1412.6980, 2017.

[19] Ericsson, Inc., ”Ericsson Mobility Report: June 2017”, June 2017. [Online]. Available: https://www.ericsson.com/assets/local/mobilityreport/documents/2017/ericsson-mobility-report-june-2017.pdf.

[20] K. Simonyan, A. Vedaldi and A. Zisserman, ”Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps,” arXiv preprint arXiv:1312.6034, 2014.