Applications of Deep Learning: Convolutional Neural Network Models in the Healthcare Industry: Part 2

Rishav Agarwal
6 min read · Mar 31, 2022


Deep Learning in Hand Segmentation and Detection

Deep Learning Applications in Healthcare

Introduction

We will be focusing on two papers:

  1. Zhang, M., Cheng, X., Copeland, D., Desai, A., Guan, M. Y., Brat, G. A., & Yeung, S. (2020). Using computer vision to automate hand detection and tracking of surgeon movements in videos of open surgery. In AMIA Annual Symposium Proceedings (Vol. 2020, p. 1373). American Medical Informatics Association.

The work of M. Zhang et al. uses computer vision to bring an automated approach to video analysis, detecting and tracking surgeons' hand movements in videos of open surgery. A state-of-the-art convolutional neural network architecture for object detection, RetinaNet, was used to identify operating hands in open surgery recordings.

2. Haque, A., Guo, M., Alahi, A., Yeung, S., Luo, Z., Rege, A., … & Fei-Fei, L. (2017, November). Towards vision-based smart hospitals: a system for tracking and monitoring hand hygiene compliance. In Machine Learning for Healthcare Conference (pp. 75–87). PMLR.

According to the paper by A. Haque et al., a vision-based, non-intrusive technique is the path toward a smart hospital. The work aims to reduce hospital-acquired infections by monitoring hand hygiene compliance without close-proximity sensors. It uses depth images to differentiate between a person actually performing the hand hygiene act and a person who is merely in the vicinity.

A Deep-Dive into Paper 1

Dataset

  • Videos of open surgery from YouTube, curated using search terms compiled by the Southern Cross Health Society.
  • A global set: 188 videos, with 70 in the breast section, 88 in the gastrointestinal section, and 30 in the head-and-neck surgery section.
  • Using the global set and the annotated labels, a dataset of 1,880 labeled images was created, where 940 were allocated for training, 380 for validation, and 560 for testing.
Creating Bounding Boxes

Methodology: RetinaNet Detection Model | Transfer Learning | SORT Algorithm

  • The RetinaNet model is a 3-part, single neural network detection model. If we look at the image below, we can see that it starts with a ResNet backbone, which is known for its ability to extract visual features.
  • On top of that backbone sits a Feature Pyramid Network, allowing for multi-scale feature processing and object detection.
RetinaNet Architecture
  • Now, the pyramid has multiple levels, each of which feeds into two fully convolutional subnetworks: a class subnet and a box subnet. Their purpose is to classify the objects and to regress their spatial boundaries.
  • RetinaNet produces its predictions by first dividing the input into reference boxes called "anchors," which are essentially sliding-window positions, as seen in standard CNN detectors.
  • Each ground-truth bounding box is intersected with these anchors. If the intersection-over-union (IoU) is above 0.5, the anchor is labeled foreground; otherwise it is treated as background (in the original RetinaNet formulation, anchors with IoU below 0.4 are background and those in between are ignored). A small IoU sketch follows the figure below.
  • The paper proposed pre-training their model (transfer learning) on the following existing hand datasets: EgoHands and Oxford Hands.
Pre-trained RetinaNet detection model performance
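To make the anchor-labeling rule concrete, here is a minimal, self-contained IoU sketch. The boxes are hypothetical, and this is illustrative code rather than the paper's implementation:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    # Coordinates of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Hypothetical anchor and ground-truth hand box.
anchor = (40.0, 40.0, 120.0, 120.0)
gt_hand = (60.0, 50.0, 140.0, 130.0)
print(iou(anchor, gt_hand))  # ~0.49: below 0.5, so not foreground
```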
  • The Simple Online and Realtime Tracking (SORT) algorithm was applied after training, on top of the RetinaNet predictions. It enables tracking of multiple objects from bounding-box inputs alone, without identity labels.
  • The novel approach taken by the paper was to modify SORT so that objects are not deleted when they move out of the frame, since a surgeon's hands move a great deal and can fluctuate between frames. Instead, the existing track is updated with the current value in a first-in, first-out manner. A minimal sketch of this retention idea follows.
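A minimal sketch of that track-retention idea, reusing the iou() helper from the sketch above. The real SORT also runs a Kalman filter per track; the names and thresholds here are my assumptions for illustration, not the authors' code:

```python
from itertools import count

class Track:
    _ids = count(1)  # each genuinely new hand gets a fresh identity

    def __init__(self, box):
        self.id = next(Track._ids)
        self.box = box       # last known (x1, y1, x2, y2)
        self.misses = 0      # consecutive frames without a match

def update_tracks(tracks, detections, iou_threshold=0.3):
    """Greedily match detections to tracks by IoU; unlike stock SORT,
    unmatched tracks are kept alive instead of being deleted."""
    unmatched = list(detections)
    for track in tracks:
        best = max(unmatched, key=lambda d: iou(track.box, d), default=None)
        if best is not None and iou(track.box, best) >= iou_threshold:
            track.box = best      # hand moved or re-entered: update in place
            track.misses = 0
            unmatched.remove(best)
        else:
            track.misses += 1     # out of frame, but the track survives
    tracks.extend(Track(d) for d in unmatched)  # new hands get new IDs
    return tracks
```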

Results

Hand tracking during surgery videos
  • The model identifies hands consistently through time, even in frames depicting multiple left and right hands (5c, 5d).
  • Actions requiring a steady hand, such as excising tissue with an electrocautery, produced very little overall tracked hand movement (cyan in 5b; yellow in 5c, 5d).
  • For techniques that involved larger lateral movements, such as tying a knot with suture, trajectories were smooth and efficient (top hand, cyan in 5a).
  • Finally, the mapped trajectories also highlight instances of highly controlled dexterity, such as in Figure 5d where the bottom left hand relies on minor finger adjustments as opposed to larger motions to apply counter-traction (blue), while the right hand uses electrocautery to divide tissue (yellow).
Surgery detection performance across annotated hand datasets.

A Deep-Dive into Paper 2

Dataset

Dataset of Depth Images
  • A depth image is an image whose pixels do not represent a colour (RGB); instead, each pixel denotes how far away the corresponding point is in the real world, measured in meters. A toy example follows this list.
  • The data sources used are images from an acute care pediatric unit and an adult intensive care unit. Sensors were installed in the two participating hospitals, covering areas with hand-sanitizer dispensers.
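A toy illustration with synthetic data (not the hospital sensors' actual output): a depth image is just a single-channel array of distances, so no identifiable appearance such as faces or skin color is captured:

```python
import numpy as np

# 240x320 depth frame: every pixel is a distance in meters.
depth = np.full((240, 320), 4.0, dtype=np.float32)  # back wall ~4 m away
depth[80:200, 120:180] = 1.5                        # a person ~1.5 m away
print(depth.shape, float(depth.min()), float(depth.max()))  # (240, 320) 1.5 4.0
```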

Methodology

  • The work aims to promote hand-hygiene compliance across hospitals, and it suggests a three-part technique to achieve this.
  • First, detect the healthcare staff; then track them through the physical world; and finally, classify their hand hygiene behavior.
  • To detect the healthcare staff, a pedestrian-detection technique is proposed. This involves locating each person's 3D position within the field of view of the sensors; the equation they solve for this is shown below (left). To track people through the real world, the paper formulates a linear integer program, finding the flow that minimizes a cost function (right). A generic sketch of such a flow objective follows the figure.
(Left): Locating 3D position | (Right): Cost Function
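The equations themselves appear only as an image above, so as a rough, generic sketch (my notation and cost terms, not necessarily the paper's exact formulation), a min-cost network-flow tracker solves something like:

```latex
\min_{f}\; \sum_{i} C_{\mathrm{in}}(i)\,f_{\mathrm{in}}(i)
         + \sum_{i,j} C(i,j)\,f(i,j)
         + \sum_{i} C_{\mathrm{out}}(i)\,f_{\mathrm{out}}(i)
\quad \text{s.t. flow is conserved at each detection and } f \in \{0,1\}
```

Here C(i, j) is the cost of linking detections i and j in adjacent frames, and the binary flow variables select which links are chained together into each person's trajectory.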
  • Lastly, and most importantly, a hand hygiene activity classifier is built using CNN models:
  1. The AlexNet architecture consists of 5 convolutional layers, 3 max-pooling layers, 2 normalization layers, 2 fully connected layers, and 1 softmax layer. Each convolutional layer consists of convolutional filters followed by a nonlinear ReLU activation. The input size is fixed due to the presence of the fully connected layers. Overall, AlexNet has about 60 million parameters; a quick parameter-count check follows the figure below.
AlexNet Architecture
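As a quick sanity check on that parameter count (an assumption on my part: this uses torchvision's reference AlexNet with a two-class head, not the paper's own code, and the paper's single-channel depth input would also require adapting the first layer):

```python
import torch
from torchvision.models import alexnet

model = alexnet(num_classes=2)  # binary label: hand-hygiene event or not
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # ~57M with a 2-class head; ~61M with 1000

# The fully connected layers fix the expected input size (224x224 crops).
x = torch.randn(1, 3, 224, 224)
print(model(x).shape)  # torch.Size([1, 2])
```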

2. The VGG16 CNN model is one of the most prominent vision architectures and has been employed in countless applications. Its convolutional layers use small 3x3 filters with padding (so spatial size is preserved), followed by 2x2 max pooling that halves the resolution. A one-stage sketch follows the figure below.

VGG16 Architecture
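A minimal sketch of a single VGG16 stage, showing the pattern just described: stacked 3x3 convolutions with padding=1, then a 2x2 max pool. The channel widths mirror VGG16's first stage; this is illustrative, not the paper's training code:

```python
import torch
import torch.nn as nn

vgg_stage = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),  # halves resolution: 224 -> 112
)

x = torch.randn(1, 3, 224, 224)
print(vgg_stage(x).shape)  # torch.Size([1, 64, 112, 112])
```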

3. A Residual Neural Network (ResNet) is one of the standard artificial neural network architectures, often likened to the pyramidal cells in the cerebral cortex of the human brain. Residual neural networks achieve this with skip connections: shortcuts that jump directly over one or more layers to a later layer. The original ResNet models were built from two- and three-layer blocks, with ReLU nonlinearities and batch normalization between the convolutions. A minimal residual-block sketch follows the figure below.

ResNet Architecture
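A minimal residual block illustrating the skip connection described above: the input is added back to the block's output, so gradients can bypass the convolutions entirely. This mirrors ResNet's basic two-layer block and is an illustration, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # the skip connection: jump past both convs

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```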

Results

Figure (Left): Top-down view of tracks. Blue rectangles are doors, orange squares are dispensers, and black lines are walls. Different track colors denote different people | Figure (Right): Examples before and after the spatial transformation. (Top) Input images with green bounding boxes from the grid generator. (Bottom) Transformed inputs. The model stretches, skews, and interpolates the input differently, depending on the scene contents.
Classifier ablation study: Effect of different inputs and architectures. D, F, and P denote depth, foreground, and pose inputs, respectively. STN denotes the spatial transformer network. The training and test sets are balanced with a 50/50 class-split.

As the table above shows, ResNet gives the best accuracy, with a value of 95.5%, followed by AlexNet and VGG16. One important thing to note is that this paper captures data in a non-intrusive way, preserving privacy by using depth images rather than colored ones. The authors also compare the model they present against traditional RFID-based methods of measuring hand-hygiene compliance.
