
A new machine learning technique developed by researchers at Edge Impulse, called FOMO (Faster Objects, More Objects), makes it possible to run real-time object detection on devices with very limited compute and memory capacity. The new deep learning architecture could unlock new computer vision applications.

Most deep learning object-detection models have memory and compute requirements that are beyond the capacity of small processors. FOMO, in contrast, requires only a few hundred kilobytes of memory, which makes it a great technique for TinyML, a subfield of machine learning focused on running models on microcontrollers and other resource-constrained devices.

Image classification vs. object detection

TinyML has made great progress in image classification, where the machine learning model must only predict the presence of a certain type of object in an image. Object detection, on the other hand, requires the model to identify more than one object, as well as the bounding box of each instance.

elephants-object-detection

Object detection models require more memory and are more complex than image classification networks.

Edge Impulse added computer vision support to its platform back in 2020 and has since seen a tremendous pickup in applications.

Many of these applications use image classification. For example, a security camera can use TinyML image classification to determine whether a person is in the frame. But much more can be done.

Being limited to these very basic classification tasks was a big nuisance. There is a lot more value in knowing that there are three people in the frame, and where they are, than in getting a single label in the top-left corner.

Earlier object detection models had to process the input image several times to locate objects, which made them slow and computationally expensive. More recent architectures such as YOLO use single-shot detection to provide near real-time object detection, but their memory requirements are still large. Even models designed for edge applications are hard to run on small devices.

YOLOv5 and MobileNet SSD are large networks that will never fit on a microcontroller and are not compatible with this class of devices.

These models also need a lot of training data, and they are bad at detecting small objects. For instance, YOLOv5 recommends more than 10,000 training instances per object class.

FOMO is based on the insight that not all object-detection applications need the high-precision output that state-of-the-art deep learning models provide. By finding the right tradeoff between accuracy, speed, and memory, you can shrink a deep learning model to a very small size.

Accordingly, instead of detecting bounding boxes, FOMO predicts each object's center (its centroid). Many object detection applications only care about the location of objects in the frame, not their sizes, and detecting centroids requires much less data and compute than predicting bounding boxes.
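
To make that difference concrete, here is a toy Python sketch (the coordinates and function name are illustrative, not from Edge Impulse) showing that a bounding-box label carries four values, while the centroid it reduces to needs only two:

```python
# Toy illustration: a bounding-box label is (x, y, width, height),
# while a centroid label is just the box's center point.
def box_to_centroid(x: float, y: float, w: float, h: float) -> tuple:
    """Reduce a bounding box to its center point."""
    return (x + w / 2, y + h / 2)

# A hypothetical 20x10 box whose top-left corner is at (40, 30):
print(box_to_centroid(40, 30, 20, 10))  # (50.0, 35.0)
```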

sheep-object-detection-bounding-box-vs-centroid

Redefining object detection deep learning architectures

To predict centroids instead of bounding boxes, FOMO makes a structural change to traditional deep learning architectures.

A single-shot object detector is composed of a set of convolutional layers that extract features, followed by fully connected layers that predict the bounding boxes and classes. Convolutional layers pull visual features from the image in a hierarchical way. The first layer detects simple things, such as lines and edges in different directions. Each convolutional layer is usually followed by a pooling layer, which reduces the size of the layer's output while keeping the most prominent features in each area.

The pooling layer's output is fed to the next convolutional layer, which extracts higher-level features, such as corners, arcs, and circles. As more layers are added, the feature maps can detect more complicated things, such as faces and entire objects.
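
As a rough illustration of this stacking pattern, here is a minimal Keras sketch (the framework, input size, and filter counts are assumptions for illustration): the spatial resolution shrinks at each pooling step while the feature depth grows.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Minimal convolution + pooling stack: each stage halves the spatial
# size while increasing the number of feature channels.
model = models.Sequential([
    layers.Input(shape=(96, 96, 3)),                          # small input image
    layers.Conv2D(16, 3, padding="same", activation="relu"),  # early layer: lines, edges
    layers.MaxPooling2D(2),                                   # keep the most prominent features
    layers.Conv2D(32, 3, padding="same", activation="relu"),  # mid layer: corners, arcs, circles
    layers.MaxPooling2D(2),
    layers.Conv2D(64, 3, padding="same", activation="relu"),  # deeper layer: complex shapes
])
model.summary()  # spatial size: 96 -> 48 -> 24; channels: 16 -> 32 -> 64
```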

neural-networks-layers-visualization
Each layer of the neural network encodes specific features from the input image.

Finally, when the image has passed through all the convolutional layers, the output of the final layer is flattened and fed to fully connected layers, which predict the class and bounding box of each object.

FOMO removes the last few layers of the network. This turns the output of the neural network into a scaled-down version of the input image, with each output value representing a small patch of the input. The network is then trained with a special loss function so that each output unit predicts the class probabilities for its corresponding patch in the input image. The output effectively becomes a heat map of object types and their locations.

fomo-heatmap
FOMO’s output layer produces a heatmap of class probabilities for each corresponding area in the input image.
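
The following Keras sketch shows what such a modification could look like. The backbone cut point, input size, and head are assumptions for illustration; this is not Edge Impulse's published implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 3  # hypothetical number of object classes

# Standard MobileNetV2 classification backbone, without its top layers.
backbone = tf.keras.applications.MobileNetV2(
    input_shape=(96, 96, 3), include_top=False, weights=None)

# Cut the network early, so each output cell corresponds to a small
# patch of the input (here a 12x12 grid over a 96x96 image).
cut = backbone.get_layer("block_6_expand_relu").output

# A 1x1 convolution maps each cell's features to per-class probabilities
# (one extra channel for "no object"), producing the heat map.
heatmap = layers.Conv2D(NUM_CLASSES + 1, 1, activation="softmax")(cut)

fomo_like = models.Model(backbone.input, heatmap)
fomo_like.summary()  # output shape: (12, 12, NUM_CLASSES + 1)
```

The 1x1 convolution at the end is what replaces the flattening and fully connected layers: instead of one prediction for the whole image, the network makes one small prediction per patch.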

There are several benefits to this approach. First, FOMO is compatible with existing architectures. For example, it can be applied to MobileNetV2, a popular deep learning model for image classification on edge devices.

Second, by dramatically reducing the size of the neural network, FOMO lowers the memory and compute requirements of object detection. According to Edge Impulse, it is 30 times faster than MobileNet SSD and can run on devices with less than 200 kilobytes of RAM.

The following video shows a FOMO neural network using a little over 200 kilobytes of memory to detect objects at 30 frames per second. FOMO can detect objects at up to 60 frames per second, compared to the 2 frames per second of MobileNet SSD.

Mat Kelcey, principal engineer at Edge Impulse, had previously worked on a neural network architecture for counting bees.

Traditional object detection methods are bad at this type of problem, so he designed a custom architecture that works well for it.

FOMO's output can be configured based on the application, and the model can detect many instances of objects in a single image.

FOMO-bee-detection
FOMO can detect many small object instances in an image.

Limits of FOMO

The benefits of FOMO come with some tradeoffs. It works best when the detected objects are all of similar size. The model's output works like a grid of squares, each of which detects one object. Therefore, if there are many small objects in the background and one large object covering the foreground, it will not work well.

And when objects are too close to each other, they will occupy the same grid square, which reduces the accuracy of the object detector. You can overcome this limit to some degree by reducing the cell size or increasing the image resolution.
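
A toy calculation shows why this works; the cell size and centroid coordinates below are made-up numbers for illustration, not FOMO's actual parameters.

```python
# Toy illustration of the grid limit: centroids that land in the same
# output cell collapse into one detection; a finer grid separates them.
def cell_of(x: int, y: int, cell: int) -> tuple:
    """Map a pixel-space centroid to its output-grid cell."""
    return (x // cell, y // cell)

a, b = (10, 10), (14, 12)  # two objects close together (hypothetical)

# With 8x8-pixel cells, both centroids share cell (1, 1): one object is lost.
print(cell_of(*a, cell=8), cell_of(*b, cell=8))  # (1, 1) (1, 1)

# Halving the cell size (or doubling the resolution) separates them.
print(cell_of(*a, cell=4), cell_of(*b, cell=4))  # (2, 2) (3, 3)
```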

FOMO is especially useful when the camera is in a fixed location.

The Edge Impulse team plans to further improve the model in the future.

This article was originally written by Ben Dickson and published on TechTalks, a publication that examines trends in technology, how they affect the way we live and do business, and the problems they solve. We also discuss the evil side of technology, the darker implications of new tech, and what we need to look out for. You can read the original article here.