How can I set up and run docling-serve on a MacBook Pro using Docker, and what performance and stability considerations should I be aware of?

To set up and run docling-serve on a MacBook Pro using Docker, follow these steps:

Install Docker: Install Docker Desktop, using the Apple Silicon (ARM64) build if your MacBook Pro has an M-series chip (M1, M2, or later).
Download Required Models Locally:
```
pip install docling-tools
docling-tools models download --all -o models
```
This creates a models directory with all necessary models.
Increase Docker Memory Allocation: In Docker Desktop, go to Settings (called Preferences in older versions) > Resources > Memory and allocate at least 8 GB for stability.
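
To confirm the new allocation took effect, you can ask the Docker daemon how much memory it sees. This is a minimal sketch: the helper name `check_docker_memory` is made up for illustration, and it assumes `docker info --format '{{.MemTotal}}'` reports the VM memory in bytes. The value is rounded to the nearest GiB because Docker Desktop typically reports slightly less than the configured amount.

```shell
# Hypothetical helper: warn if the Docker VM has less memory than required.
# Assumes `docker info --format '{{.MemTotal}}'` prints total memory in bytes.
check_docker_memory() {
  required_gib=${1:-8}
  total_bytes=$(docker info --format '{{.MemTotal}}') || return 2
  # Round to the nearest GiB (add half a GiB before integer division).
  total_gib=$(( (total_bytes + 536870912) / 1073741824 ))
  if [ "$total_gib" -lt "$required_gib" ]; then
    echo "Docker has ${total_gib} GiB; allocate at least ${required_gib} GiB" >&2
    return 1
  fi
  echo "Docker has ${total_gib} GiB available"
}
```

Run `check_docker_memory 8` after restarting Docker Desktop; a non-zero exit status means the allocation is still below the recommended minimum.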

Run the Docker Container:

```
docker run -p 5001:5001 \
  -v $(pwd)/models:/opt/app-root/src/models \
  -e DOCLING_SERVE_ARTIFACTS_PATH="/opt/app-root/src/models" \
  -e DOCLING_SERVE_ENABLE_UI=true \
  -e UVICORN_WORKERS=1 \
  quay.io/docling-project/docling-serve
```

UVICORN_WORKERS=1 is important for stability on macOS.
To use more CPU threads for a single document, add -e DOCLING_NUM_THREADS=4 (or up to your CPU core count).
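
One way to keep the thread count within the core count is to compute it before starting the container. The sketch below is illustrative: `pick_threads` is a hypothetical helper, `sysctl -n hw.ncpu` is the macOS way to read the logical core count, and `getconf` is a fallback for other systems. Note that the container can only use the CPUs allocated to Docker Desktop, which may be fewer than the host has.

```shell
# Hypothetical helper: choose a DOCLING_NUM_THREADS value no larger than the core count.
pick_threads() {
  wanted=${1:-4}
  # sysctl works on macOS; getconf is a fallback for other systems.
  cores=$(sysctl -n hw.ncpu 2>/dev/null || getconf _NPROCESSORS_ONLN)
  if [ "$wanted" -gt "$cores" ]; then
    echo "$cores"
  else
    echo "$wanted"
  fi
}
```

The result can then be passed to the container, for example with `-e DOCLING_NUM_THREADS="$(pick_threads 8)"` added to the `docker run` command.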

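Access the UI: Open http://localhost:5001 in your browser. If the root path does not show the UI, try http://localhost:5001/ui, where the demo UI is served when DOCLING_SERVE_ENABLE_UI=true.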
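
The server can take a while to load models before it answers requests, so it may help to wait for it before opening the browser. This is a rough sketch, not part of docling-serve: `wait_for_server` is a hypothetical helper that polls the URL from the step above with `curl`, and the attempt count and delay are arbitrary defaults.

```shell
# Hypothetical helper: poll a URL until it answers or the attempts run out.
wait_for_server() {
  url=${1:-http://localhost:5001}
  attempts=${2:-30}
  delay=${3:-2}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f: fail on HTTP errors, -s: quiet, -o /dev/null: discard the body
    if curl -fs -o /dev/null "$url"; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$(( i + 1 ))
  done
  echo "server at $url did not respond" >&2
  return 1
}
```

Calling `wait_for_server` with no arguments polls http://localhost:5001 up to 30 times, two seconds apart.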

Performance:

On a modern Apple Silicon MacBook Pro (12 cores, 32 GB RAM), expect CPU-only processing of about 5 PDFs (127 pages, 6.2 MB total) to take roughly 10 minutes. Increasing CPU cores, RAM, or worker count does not significantly improve throughput due to Python's concurrency limitations.
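
To turn those figures into per-page numbers for estimating your own workload, the arithmetic can be sketched as follows. The function name `throughput` is illustrative, and the inputs are just the benchmark figures quoted above; real throughput depends heavily on document content.

```shell
# Convert a benchmark (pages, documents, minutes) into per-unit rates.
throughput() {
  awk -v p="$1" -v d="$2" -v m="$3" 'BEGIN {
    printf "%.1f pages/min, %.1f s/page, %.1f min/PDF\n", p / m, m * 60 / p, m / d
  }'
}

# Figures from the benchmark above: 127 pages across 5 PDFs in about 10 minutes.
throughput 127 5 10
# prints: 12.7 pages/min, 4.7 s/page, 2.0 min/PDF
```

At roughly 5 seconds per page, a 1,000-page batch would take on the order of 80 minutes on the same machine.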

Stability and Troubleshooting:

Use ARM64 images and ensure all models are present to avoid runtime errors.
Memory management has been improved through the integration of mimalloc (a high-performance memory allocator), which significantly reduces memory growth during document processing. While earlier versions experienced memory leaks, this optimization addresses those concerns. Still, monitor memory usage for long-running sessions and restart the container if you observe unusual resource consumption.

Memory Debugging Endpoints:

Docling Serve provides built-in memory debugging endpoints to help diagnose memory issues in long-running sessions or containerized environments:
- /v1/memory/stats - Returns memory statistics including RSS, anonymous memory, file-backed memory, slab memory, and cgroup total memory (in MB)
- /v1/memory/counts - Returns garbage collection statistics, object counts, asyncio task counts, and the top 20 most common object types
These endpoints are controlled by the DOCLING_SERVE_ENABLE_MANAGEMENT_ENDPOINTS environment variable (defaults to false). When disabled, these endpoints return a 403 Forbidden error.

To enable the memory debugging endpoints, add the environment variable when running Docker:
```
docker run -p 5001:5001 \
  -v $(pwd)/models:/opt/app-root/src/models \
  -e DOCLING_SERVE_ARTIFACTS_PATH="/opt/app-root/src/models" \
  -e DOCLING_SERVE_ENABLE_UI=true \
  -e DOCLING_SERVE_ENABLE_MANAGEMENT_ENDPOINTS=true \
  -e UVICORN_WORKERS=1 \
  quay.io/docling-project/docling-serve
```
Access the endpoints using curl:
```
curl http://localhost:5001/v1/memory/stats
curl http://localhost:5001/v1/memory/counts
```
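
For long-running sessions, a single request is less useful than a series of samples. The following sketch logs the stats endpoint at a fixed interval so memory growth can be compared over time. `log_memory_stats` is a hypothetical helper; it prints the response unchanged because the exact JSON field names are not assumed here.

```shell
# Hypothetical helper: sample the memory stats endpoint a fixed number of times.
# The response is logged as-is, prefixed with a UTC timestamp.
log_memory_stats() {
  base_url=${1:-http://localhost:5001}
  samples=${2:-5}
  interval=${3:-60}
  n=0
  while [ "$n" -lt "$samples" ]; do
    stats=$(curl -fsS "$base_url/v1/memory/stats") || {
      echo "request failed; is DOCLING_SERVE_ENABLE_MANAGEMENT_ENDPOINTS=true set?" >&2
      return 1
    }
    printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$stats"
    n=$(( n + 1 ))
    # Sleep between samples, but not after the last one.
    if [ "$n" -lt "$samples" ]; then
      sleep "$interval"
    fi
  done
  return 0
}
```

For example, `log_memory_stats http://localhost:5001 10 300 >> memory.log` records ten samples, five minutes apart.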
For file download issues in the UI, set DOCLING_SERVE_SCRATCH_PATH and optionally GRADIO_TEMP_DIR to a writable volume. Adjust Gradio cache settings if you want downloads to remain available longer.
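
As a sketch of what that could look like, the wrapper below mounts a host directory and points both variables at it. The function name `run_with_scratch` and the container path `/opt/app-root/src/scratch` are illustrative choices, not documented defaults, and the mounted directory is assumed to be writable by the container user.

```shell
# Hypothetical wrapper: start docling-serve with a host directory mounted as scratch space.
# The container path below is an illustrative choice, not a documented default.
run_with_scratch() {
  host_dir=${1:-$(pwd)/scratch}
  container_dir=/opt/app-root/src/scratch
  # Create the scratch directory and a subdirectory for Gradio temp files.
  mkdir -p "$host_dir/gradio"
  docker run -p 5001:5001 \
    -v "$(pwd)/models:/opt/app-root/src/models" \
    -v "$host_dir:$container_dir" \
    -e DOCLING_SERVE_ARTIFACTS_PATH="/opt/app-root/src/models" \
    -e DOCLING_SERVE_SCRATCH_PATH="$container_dir" \
    -e GRADIO_TEMP_DIR="$container_dir/gradio" \
    -e DOCLING_SERVE_ENABLE_UI=true \
    -e UVICORN_WORKERS=1 \
    quay.io/docling-project/docling-serve
}
```

Calling `run_with_scratch` from the directory that contains `models` keeps converted files under `./scratch` on the host, where they can be inspected if a UI download fails.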

For large workloads or high throughput, docling-serve is designed for distributed/cloud setups rather than a single MacBook Pro.