How to Implement Object Detection in TensorFlow
Key Insights
- Object detection combines classification and localization to identify multiple objects in images, making it more complex than simple image classification but essential for real-world computer vision applications.
- TensorFlow’s Object Detection API provides pre-trained models that can be fine-tuned on custom datasets in hours rather than days, dramatically reducing development time.
- Converting your model to TensorFlow Lite can reduce inference time by 3-5x and model size by up to 75%, making deployment on edge devices practical.
Introduction to Object Detection
Object detection goes beyond image classification by not only identifying what objects are present in an image, but also where they are located. While a classifier might tell you “this image contains a dog,” an object detector tells you “there’s a dog at coordinates (120, 45) with width 200 and height 180, and a cat at (350, 120).”
This capability powers autonomous vehicles that need to locate pedestrians and other cars, security systems that track individuals across camera feeds, and retail analytics platforms that count customers and monitor shelf inventory. The technology has matured significantly, with architectures like YOLO (You Only Look Once) prioritizing speed, SSD (Single Shot Detector) balancing speed and accuracy, and Faster R-CNN optimizing for maximum detection precision.
For most production applications, you’ll want to start with a pre-trained model and fine-tune it on your specific dataset. This approach leverages transfer learning to achieve strong results with limited data and compute resources.
Setting Up Your TensorFlow Environment
Start by installing TensorFlow and the Object Detection API. The API isn’t available via pip, so you’ll need to clone the repository and install dependencies manually.
# Install TensorFlow
pip install tensorflow==2.13.0
# Clone the TensorFlow Models repository
git clone https://github.com/tensorflow/models.git
cd models/research
# Install required packages
pip install protobuf==3.20.3
pip install pillow lxml matplotlib
# Compile protocol buffers
protoc object_detection/protos/*.proto --python_out=.
# Install the Object Detection API
cp object_detection/packages/tf2/setup.py .
python -m pip install .
Verify your installation with this test script:
import tensorflow as tf
from object_detection.utils import label_map_util
from object_detection.utils import visualization_utils as viz_utils
print(f"TensorFlow version: {tf.__version__}")
print("Object Detection API installed successfully")
Preparing Your Dataset
Your dataset needs two components: images and annotations. Annotations define bounding boxes around objects using formats like Pascal VOC (XML files) or COCO (JSON). For this guide, we’ll use Pascal VOC format.
Each XML annotation file should look like this:
<annotation>
  <filename>image001.jpg</filename>
  <size>
    <width>800</width>
    <height>600</height>
  </size>
  <object>
    <name>cat</name>
    <bndbox>
      <xmin>100</xmin>
      <ymin>150</ymin>
      <xmax>300</xmax>
      <ymax>400</ymax>
    </bndbox>
  </object>
</annotation>
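Before converting anything, it helps to sanity-check that your annotation files parse as expected. A minimal sketch using Python's built-in ElementTree against the sample annotation above (the inline XML string here is just for illustration):

```python
import xml.etree.ElementTree as ET

# Sample annotation matching the structure shown above
SAMPLE_XML = """<annotation>
  <filename>image001.jpg</filename>
  <size><width>800</width><height>600</height></size>
  <object>
    <name>cat</name>
    <bndbox>
      <xmin>100</xmin><ymin>150</ymin>
      <xmax>300</xmax><ymax>400</ymax>
    </bndbox>
  </object>
</annotation>"""

def parse_annotation(xml_text):
    """Return (width, height, [(name, xmin, ymin, xmax, ymax), ...])."""
    root = ET.fromstring(xml_text)
    width = int(root.find('size/width').text)
    height = int(root.find('size/height').text)
    boxes = []
    for obj in root.findall('object'):
        b = obj.find('bndbox')
        boxes.append((
            obj.find('name').text,
            int(b.find('xmin').text), int(b.find('ymin').text),
            int(b.find('xmax').text), int(b.find('ymax').text),
        ))
    return width, height, boxes

width, height, boxes = parse_annotation(SAMPLE_XML)
print(width, height, boxes)  # 800 600 [('cat', 100, 150, 300, 400)]
```

Running a check like this over every XML file before conversion catches malformed boxes (e.g. xmax less than xmin) early, when they are cheap to fix.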
TensorFlow requires data in TFRecord format. Here’s a conversion script:
import os
import xml.etree.ElementTree as ET

import tensorflow as tf
from object_detection.utils import dataset_util

# Map class names to the IDs defined in your label map
LABEL_MAP = {'cat': 1, 'dog': 2}

def create_tf_example(image_path, xml_path):
    with tf.io.gfile.GFile(image_path, 'rb') as fid:
        encoded_image = fid.read()

    tree = ET.parse(xml_path)
    root = tree.getroot()
    width = int(root.find('size/width').text)
    height = int(root.find('size/height').text)
    filename = root.find('filename').text.encode('utf8')

    xmins, xmaxs, ymins, ymaxs = [], [], [], []
    classes_text, classes = [], []
    for obj in root.findall('object'):
        # Normalize box coordinates to [0, 1]
        xmins.append(float(obj.find('bndbox/xmin').text) / width)
        xmaxs.append(float(obj.find('bndbox/xmax').text) / width)
        ymins.append(float(obj.find('bndbox/ymin').text) / height)
        ymaxs.append(float(obj.find('bndbox/ymax').text) / height)
        class_name = obj.find('name').text
        classes_text.append(class_name.encode('utf8'))
        classes.append(LABEL_MAP[class_name])

    tf_example = tf.train.Example(features=tf.train.Features(feature={
        'image/height': dataset_util.int64_feature(height),
        'image/width': dataset_util.int64_feature(width),
        'image/filename': dataset_util.bytes_feature(filename),
        'image/encoded': dataset_util.bytes_feature(encoded_image),
        'image/format': dataset_util.bytes_feature(b'jpg'),
        'image/object/bbox/xmin': dataset_util.float_list_feature(xmins),
        'image/object/bbox/xmax': dataset_util.float_list_feature(xmaxs),
        'image/object/bbox/ymin': dataset_util.float_list_feature(ymins),
        'image/object/bbox/ymax': dataset_util.float_list_feature(ymaxs),
        'image/object/class/text': dataset_util.bytes_list_feature(classes_text),
        'image/object/class/label': dataset_util.int64_list_feature(classes),
    }))
    return tf_example

# Create TFRecord file
writer = tf.io.TFRecordWriter('train.tfrecord')
for image_file in os.listdir('images/train'):
    if not image_file.endswith('.jpg'):
        continue
    xml_file = image_file.replace('.jpg', '.xml')
    tf_example = create_tf_example(
        f'images/train/{image_file}',
        f'annotations/train/{xml_file}'
    )
    writer.write(tf_example.SerializeToString())
writer.close()
Split your data 80/20 for training and validation, creating separate TFRecord files for each.
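The split itself can be scripted. A minimal sketch (the directory layout is an assumption; adjust the extension filter and paths to match yours):

```python
import os
import random

def split_dataset(image_dir, train_ratio=0.8, seed=42):
    """Shuffle image filenames and split them into train/val lists."""
    images = sorted(f for f in os.listdir(image_dir) if f.endswith('.jpg'))
    random.Random(seed).shuffle(images)  # seeded so the split is reproducible
    cut = int(len(images) * train_ratio)
    return images[:cut], images[cut:]
```

Feed each returned list to the conversion script above, writing train.tfrecord and val.tfrecord separately. Seeding the shuffle matters: it keeps the split stable across runs, so validation images never leak into training when you regenerate the records.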
Configuring and Training a Pre-trained Model
Download a pre-trained model from the TensorFlow Model Zoo. SSD MobileNet V2 offers a good balance of speed and accuracy:
wget http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_mobilenet_v2_320x320_coco17_tpu-8.tar.gz
tar -xvf ssd_mobilenet_v2_320x320_coco17_tpu-8.tar.gz
Modify the pipeline configuration file (pipeline.config):
# Key sections to update in pipeline.config:
# 1. Number of classes
num_classes: 2 # Change to your number of classes
# 2. Batch size
batch_size: 16 # Adjust based on GPU memory
# 3. Fine-tune checkpoint
fine_tune_checkpoint: "ssd_mobilenet_v2_320x320_coco17_tpu-8/checkpoint/ckpt-0"
fine_tune_checkpoint_type: "detection"
# 4. Training data paths
train_input_reader: {
  label_map_path: "label_map.pbtxt"
  tf_record_input_reader {
    input_path: "train.tfrecord"
  }
}
# 5. Validation data paths
eval_input_reader: {
  label_map_path: "label_map.pbtxt"
  tf_record_input_reader {
    input_path: "val.tfrecord"
  }
}
Create a label map file (label_map.pbtxt):
item {
  id: 1
  name: 'cat'
}
item {
  id: 2
  name: 'dog'
}
Start training:
python models/research/object_detection/model_main_tf2.py \
    --pipeline_config_path=pipeline.config \
    --model_dir=training/ \
    --alsologtostderr
Training typically requires 10,000-50,000 steps depending on dataset size. Monitor progress with TensorBoard:
tensorboard --logdir=training/
Running Inference and Visualizing Results
Training writes checkpoints rather than a SavedModel, so export one first with the API's exporter_main_v2.py script (set --pipeline_config_path, --trained_checkpoint_dir=training/, and an --output_directory such that the resulting saved_model directory matches the path below). Then load the model and run inference:
import tensorflow as tf
import numpy as np
from PIL import Image
from object_detection.utils import label_map_util
from object_detection.utils import visualization_utils as viz_utils
# Load the model
detect_fn = tf.saved_model.load('training/saved_model')
# Load label map
category_index = label_map_util.create_category_index_from_labelmap(
    'label_map.pbtxt'
)

def detect_objects(image_path):
    # Load and preprocess image
    image_np = np.array(Image.open(image_path))
    input_tensor = tf.convert_to_tensor(image_np)
    input_tensor = input_tensor[tf.newaxis, ...]

    # Run detection
    detections = detect_fn(input_tensor)

    # Extract detection results
    num_detections = int(detections.pop('num_detections'))
    detections = {key: value[0, :num_detections].numpy()
                  for key, value in detections.items()}
    detections['num_detections'] = num_detections
    detections['detection_classes'] = detections['detection_classes'].astype(np.int64)

    # Visualize results
    image_np_with_detections = image_np.copy()
    viz_utils.visualize_boxes_and_labels_on_image_array(
        image_np_with_detections,
        detections['detection_boxes'],
        detections['detection_classes'],
        detections['detection_scores'],
        category_index,
        use_normalized_coordinates=True,
        max_boxes_to_draw=20,
        min_score_thresh=0.5,
        agnostic_mode=False
    )
    return image_np_with_detections

# Run detection
result = detect_objects('test_image.jpg')
Image.fromarray(result).save('output.jpg')
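The API returns detection_boxes in normalized [ymin, xmin, ymax, xmax] order. If you need pixel coordinates for cropping or downstream logic, a small NumPy sketch (the function name is mine, not part of the API):

```python
import numpy as np

def boxes_to_pixels(boxes, image_height, image_width):
    """Convert normalized [ymin, xmin, ymax, xmax] boxes to integer pixel coords."""
    boxes = np.asarray(boxes, dtype=np.float32)
    scale = np.array([image_height, image_width, image_height, image_width],
                     dtype=np.float32)
    return (boxes * scale).round().astype(np.int32)

# A box covering the top-left quarter of a 600x800 image
pixels = boxes_to_pixels([[0.0, 0.0, 0.5, 0.5]], 600, 800)
# pixels is [[0, 0, 300, 400]]: (ymin, xmin, ymax, xmax) in pixels
```

Note the y-before-x ordering: it mirrors the normalized layout the API uses, which differs from the (x, y, w, h) convention in the Pascal VOC annotations you started with.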
Performance Optimization and Deployment
For production deployment, convert your model to TensorFlow Lite. Detection models need a TFLite-friendly export first: run the API's export_tflite_graph_tf2.py script, then point the converter at the SavedModel it produces:
import tensorflow as tf
# Load saved model
converter = tf.lite.TFLiteConverter.from_saved_model('training/saved_model')
# Apply optimizations
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
# Convert model
tflite_model = converter.convert()
# Save TFLite model
with open('model.tflite', 'wb') as f:
    f.write(tflite_model)
For even faster inference, apply quantization:
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
This can reduce model size from 20MB to 5MB and improve inference speed by 3-5x on mobile devices.
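The quantization snippet above references a representative_dataset_gen that is never defined; the converter calls it to observe typical activation ranges during calibration. A minimal sketch that yields a few preprocessed training images (the 320x320 size matches the SSD MobileNet V2 input used earlier; the glob path and sample count are assumptions):

```python
import glob

import numpy as np
from PIL import Image

def representative_dataset_gen():
    """Yield a small sample of preprocessed images for INT8 calibration."""
    for path in glob.glob('images/train/*.jpg')[:100]:
        image = Image.open(path).convert('RGB').resize((320, 320))
        # Scale to [0, 1] float32 and add a batch dimension
        array = np.asarray(image, dtype=np.float32) / 255.0
        yield [array[np.newaxis, ...]]
```

A hundred or so images drawn from the training set is usually enough for calibration; the preprocessing here must match whatever the float model expects at inference time, or the computed quantization ranges will be wrong.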
Conclusion and Next Steps
You now have a complete pipeline for implementing object detection in TensorFlow: from environment setup and data preparation through training and deployment. The pre-trained model approach gets you to production quickly, while the TFLite conversion ensures your model runs efficiently on edge devices.
For further optimization, explore custom training schedules, data augmentation strategies, and architecture modifications. Consider the TensorFlow Model Optimization Toolkit for advanced quantization techniques, and investigate TensorFlow Serving for scalable cloud deployment. The TensorFlow Object Detection API documentation provides excellent resources for training custom architectures and fine-tuning hyperparameters for specific use cases.