Discover our new platform: Learn more

Turing Segment: A high-performance Cellpose Algorithm for Cell Segmentation

BioTuring AI & Algorithm team
BioTuring AI & Algorithm team
December 9, 2024
1

TLDR: We optimized Cellpose to be 300x faster and 23x more memory-efficient in 2D without compromising accuracy. We also developed an efficient algorithm to stitch series of 2D segmentations into 3D segmentations, and obtained 400x speedup in time, and 300x less memory consumption.

We named the new tool as TuringSegment

What is Cellpose?

Cellpose is a popular deep learning algorithm for cell segmentation across a wide range of different tissue types, cultures, and imaging techniques. However, Cellpose faces a computational hurdle: for a standard size of pathology slide, it takes hours to process and consumes hundreds of GB RAM

Turing Segment: An Optimized Cellpose Algorithm

To address these issues, we have developed an optimized version of Cellpose that aims to improve its running time and memory performance while maintaining Cellpose’s accuracy. We obtain this by revising a new dynamic programming algorithm for post-processing, tile processing, and parallelization.

For 3D segmentation, we developed a new stitching algorithm that utilized the Sort-Tile-Recursive Tree (STRTree) data structure to reduce the number of IOU calculations needed between segmentation pairs.

Dynamic Programming Algorithm for Post-processing

The Cellpose algorithm consists of two phases: AI model inference and post-processing. Post-processing is the primary bottleneck of the algorithm, typically requiring up to 200 times longer than the model inference phase.

During the post-processing phase, the most time-consuming task is the dynamic process of grouping pixels into individual cell instances and it represents a significant opportunity for performance optimization.

The dynamic process has a key property: the subsequent positions of a pixel depend solely on its current position. This characteristic allows us to reformulate the problem into a series of optimal subproblems, which can be efficiently addressed using Dynamic Programming techniques.

Tiled Processing

When working with high-resolution or whole-slide images, processing the entire image at once can often lead to memory constraints. To address this challenge, Turing Segment employs a tiled processing approach. This method divides the image into equally sized tiles, allowing each tile to be processed independently to obtain segmentations while significantly reducing memory consumption.

One issue that arises with tiled processing is that segmentations at the boundaries of tiles may be incomplete. To mitigate this, TuringSegment introduces a margin around each tile, ensuring that overlapping areas are included in the segmentation process. After processing all tiles, duplicate segmentations are removed.

Parallelization

Turing Segment is designed with extensive parallelization in mind, leveraging both CPU and GPU resources to achieve accelerated processing speeds.

Model inference and post-processing operations run in parallel, ensuring that both GPU and CPU resources are utilized effectively without idling. This simultaneous processing keeps the workflow efficient. Post-processing is further enhanced through parallelization. A pool of workers is utilized to process inference results for tiles as they become available.

The deduplication algorithm also benefits from parallel processing, as segmentation pairs to check are evenly distributed across multiple processes. This approach significantly speeds up the duplicate removal process.

Benchmark

All benchmarks are generated with the following specifications:

  • GPU: NVIDIA RTX 4090
  • CPU: AMD Ryzen Threadripper PRO 7985WX 64-Cores
  • Libraries: PyTorch 2.4, CUDA Runtime 11.8, cuDNN 9.1.0.70

To test the effectiveness of the mentioned improvements, we compare the TuringSegment tool to the original Cellpose algorithm.

We benchmarked the processing time and memory between TuringSegment and the original Cellpose using 16 processes. The plots below show the processing time for the original Cellpose and TuringSegment. All images are squares (dxd) with dimension d is represented in the horizontal axis).

TuringSegment outperforms the original Cellpose with different image sizes. The difference is more significant as the image size increases. For the image with size 40,000 x 40,000 TuringSegment is 294 times faster than the original Cellpose, reducing the processing time from a few hours to less than 1 minute.

The plots below shows the memory consumption between the two tools:

Thanks to our tiled processing, Turing Segment’s memory consumption is reduced rapidly. For the image with size 40000×40000, our tool consumes 23 times less memory than the original Cellpose. This improvement reduces the hardware requirements and allows us to process even larger images (discussed in the next session).

While achieving impressive speedup, Turing Segment does not sacrifice accuracy. The plot below shows the accuracy of the TuringSegment and the original Cellpose.

Stitching segmentations into 3D cells

Cellpose can combine cell segmentations from adjacent z-slices to reconstruct 3D cells. This reconstruction process requires calculating the intersection over union (IOU) between all possible cell pairs in consecutive slices, making it computationally expensive in terms of both processing time and memory usage.

To improve computational efficiency, we developed a new stitching algorithm that utilized the Sort-Tile-Recursive Tree (STRTree) data structure to reduce the number of IOU calculations needed between segmentation pairs. The algorithm’s performance is further enhanced through parallel processing of IOU computations across multiple processes.

We conducted benchmarks comparing our implementation in Turing Segment (running with 16 parallel processes) against Cellpose’s original stitching algorithm, measuring both processing speed and memory usage. Our benchmark results demonstrate that Cellpose’s original algorithm is limited to processing images no larger than 40000×40000 pixels due to memory constraints.

Our optimized algorithm demonstrates significant memory and performance improvements over the original Cellpose implementation. When processing a 40,000 x 40,000 pixel image, our algorithm uses approximately 300 times less memory while executing 400 times faster than Cellpose.

The algorithm also shows good parallel scaling characteristics. We observed nearly linear performance improvements as we increased the number of parallel processes up to 16, indicating efficient parallelization of the workload.

Learn More

Check out the installation and usage instruction for Turing Segment: https://github.com/bioturing-org/turing_segment

BioTuring AI & Algorithm Team

1 comment