Using Docling's REST API to Convert PDFs to Markdown with AI-Generated Image Descriptions#
This guide explains how to use Docling's REST API to convert a PDF to Markdown, including how to generate image descriptions with a vision-language model (such as an OpenAI-hosted or local model) and how to control the alt text for images in the Markdown output.
1. Start Docling-Serve#
Install and run the server:
pip install "docling-serve[ui]"
docling-serve run --enable-ui
Enable remote services if needed:
export DOCLING_SERVE_ENABLE_REMOTE_SERVICES=true
2. Choose the API Endpoint#
- Use /v1/convert/file for file uploads (multipart/form-data).
- Use /v1/convert/source for JSON payloads (URLs or base64-encoded files).
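For the JSON variant, the PDF bytes go into the payload base64-encoded. A minimal sketch of assembling that body in Python (the file bytes and filename are placeholders):

```python
import base64
import json

def build_source_payload(pdf_bytes: bytes, filename: str) -> str:
    """Build a JSON body for POST /v1/convert/source from raw PDF bytes."""
    payload = {
        "options": {"to_formats": ["md"]},
        "file_sources": [{
            # The API expects the file content as a base64 string.
            "base64_string": base64.b64encode(pdf_bytes).decode("ascii"),
            "filename": filename,
        }],
    }
    return json.dumps(payload)

body = build_source_payload(b"%PDF-1.7 ...", "your.pdf")
```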
3. Configure Your Request#
PDF Backend#
- The default PDF backend is now docling_parse (previously dlparse_v4).
- You can select the backend with the pdf_backend option. Allowed values: docling_parse, pypdfium2, dlparse_v1, dlparse_v2, dlparse_v4.
Vision-Language Model (VLM) for Image Description#
Recommended: Use the new preset or custom config fields for vision-language models and image description. The old fields (picture_description_api, picture_description_local, etc.) are deprecated.
📝 Server-Side Configuration
Administrators can control which VLM presets and engines are available through server-side settings (DOCLING_SERVE_ALLOWED_VLM_PRESETS, DOCLING_SERVE_ALLOWED_VLM_ENGINES). If restrictions are configured, only specific presets or engine types may be usable.
Additionally, custom VLM presets can be defined server-side via DOCLING_SERVE_CUSTOM_VLM_PRESETS, allowing administrators to create reusable named configurations. The default VLM preset can be customized via DOCLING_SERVE_DEFAULT_VLM_PRESET (default is "granite_docling").
For complex configurations, administrators can use YAML or JSON configuration files via DOCLING_SERVE_CONFIG_FILE. See the configuration documentation for complete details on preset management.
Key Options#
- do_picture_description: Enable image description (boolean).
- picture_description_preset: Use a preset VLM configuration (e.g., granite_vision, lightonocr, default). Available presets may be restricted by server configuration.
  - lightonocr: Uses the LightOnOCR-2-1B model (1B parameters), a lightweight model optimized for OCR and Markdown conversion tasks.
- picture_description_custom_config: Advanced: provide a custom VLM config dict (see below). Requires administrator permission.
- image_alt_mode: Controls Markdown image alt text:
  - static (default): Always "Image"
  - caption: Use the image caption if available
  - description: Use the AI-generated description if available
- picture_description_area_threshold: Minimum area ratio for images to be described (set to 0 to describe all images).
Filtering Picture Descriptions by Classification#
You can control which images receive descriptions based on their predicted classification and confidence scores:
- classification_allow: List of picture classification labels. Only images whose predicted class is in this allow-list will be described.
- classification_deny: List of picture classification labels. Images whose predicted class is in this deny-list will not be described.
- classification_min_confidence: Float value specifying the minimum classification confidence required before an image can be described.
These filters apply to both preset and custom VLM configurations. When using classification_allow, only explicitly allowed classes will be described. When using classification_deny, all classes except those denied will be described. The classification_min_confidence filter is applied in addition to allow/deny filters.
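The combined filtering semantics can be sketched as a small predicate. This is an illustrative re-implementation of the documented behavior, not Docling's actual code:

```python
def should_describe(label, confidence, allow=None, deny=None, min_confidence=None):
    """Decide whether an image gets a description, mirroring the documented
    allow/deny/min-confidence semantics (illustrative only)."""
    # Allow-list: only explicitly allowed classes pass.
    if allow is not None and label not in allow:
        return False
    # Deny-list: every class passes except those denied.
    if deny is not None and label in deny:
        return False
    # Confidence gate applies on top of allow/deny.
    if min_confidence is not None and confidence < min_confidence:
        return False
    return True
```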
Deprecated Fields#
picture_description_api, picture_description_local, vlm_pipeline_model, vlm_pipeline_model_local, and vlm_pipeline_model_api are deprecated. Use picture_description_preset or picture_description_custom_config instead.
Example: multipart/form-data (curl)#
curl -X POST "http://localhost:5001/v1/convert/file" \
-H "accept: application/json" \
-F "files=@your.pdf;type=application/pdf" \
-F "to_formats=md" \
-F "do_picture_description=true" \
-F "picture_description_preset=granite_vision" \
-F "picture_description_area_threshold=0" \
-F "image_alt_mode=description"
Example: JSON (base64 file)#
{
"options": {
"to_formats": ["md"],
"do_picture_description": true,
"picture_description_preset": "granite_vision",
"picture_description_area_threshold": 0,
"image_alt_mode": "caption"
},
"file_sources": [{
"base64_string": "<base64-encoded-pdf>",
"filename": "your.pdf"
}]
}
POST this to /v1/convert/source.
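The same request can be prepared from Python with the standard library. A sketch, assuming the default docling-serve port 5001 (the request is only built here, not sent):

```python
import json
import urllib.request

def build_convert_request(payload: dict,
                          base_url: str = "http://localhost:5001") -> urllib.request.Request:
    """Prepare (but do not send) a POST to /v1/convert/source."""
    return urllib.request.Request(
        url=f"{base_url}/v1/convert/source",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "accept": "application/json"},
        method="POST",
    )

req = build_convert_request({
    "options": {"to_formats": ["md"], "do_picture_description": True},
    "file_sources": [],
})
# Sending it would be: urllib.request.urlopen(req)
```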
Example: Using Classification Filters#
{
"options": {
"to_formats": ["md"],
"do_picture_description": true,
"picture_description_preset": "granite_vision",
"classification_allow": ["Figure", "Chart", "Diagram"],
"classification_min_confidence": 0.7,
"image_alt_mode": "description"
},
"file_sources": [{
"base64_string": "<base64-encoded-pdf>",
"filename": "your.pdf"
}]
}
This configuration will only describe images classified as "Figure", "Chart", or "Diagram" with at least 70% confidence.
Advanced: Custom VLM Configuration#
⚠️ Administrator Permission Required
Custom configuration parameters require administrator permission to use. By default, these features are disabled for security reasons.
To enable custom configurations, your administrator must set these environment variables:
- DOCLING_SERVE_ALLOW_CUSTOM_PICTURE_DESCRIPTION_CONFIG=true: Enables picture_description_custom_config
- DOCLING_SERVE_ALLOW_CUSTOM_VLM_CONFIG=true: Enables vlm_pipeline_custom_config
- DOCLING_SERVE_ALLOW_CUSTOM_CODE_FORMULA_CONFIG=true: Enables code_formula_custom_config
When set to false (the default), only presets are accepted. If these environment variables are not enabled, attempting to use the respective custom configuration parameters will result in a 422 error.
Contact your Docling Serve administrator if you need access to custom configuration options.
If you need to specify a custom model or engine, use picture_description_custom_config:
{
"options": {
"to_formats": ["md"],
"do_picture_description": true,
"picture_description_custom_config": {
"engine_options": {"engine_type": "api"},
"model_spec": {
"name": "OpenAI GPT-4 Vision",
"default_repo_id": "openai/gpt-4-vision-preview",
"prompt": "Describe the image in three sentences. Be concise and accurate.",
"response_format": "text"
},
"prompt": "Describe the image in three sentences. Be concise and accurate.",
"generation_config": {"max_tokens": 200},
"api_overrides": {
"api": {
"url": "https://api.openai.com/v1/chat/completions",
"headers": {"Authorization": "Bearer YOUR_OPENAI_API_KEY"},
"params": {"model": "gpt-4-vision-preview"},
"timeout": 90
}
}
},
"classification_deny": ["Logo", "Icon"],
"picture_description_area_threshold": 0,
"image_alt_mode": "description"
},
"file_sources": [{
"base64_string": "<base64-encoded-pdf>",
"filename": "your.pdf"
}]
}
This example uses a custom OpenAI model configuration and excludes logos and icons from description.
Parameter Precedence for API Engines#
When using API-based engines (engine_type: "api"), user-specified parameters in api_overrides.api.params take precedence over engine defaults. This means you can override parameters like temperature, max_tokens, or other model-specific settings to suit your requirements.
Azure OpenAI Compatibility: Azure OpenAI requires the max_completion_tokens parameter instead of max_tokens. When you specify max_completion_tokens in your API parameters, it will automatically take precedence and any conflicting max_tokens will be removed to ensure compatibility.
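The documented precedence can be mirrored as a simple merge: user params overwrite engine defaults, and a user-supplied max_completion_tokens displaces any max_tokens. This is an illustrative sketch of the rules above, not Docling's implementation:

```python
def merge_api_params(engine_defaults: dict, user_params: dict) -> dict:
    """Merge API params per the documented precedence: user params win,
    and max_completion_tokens displaces max_tokens (Azure compatibility)."""
    merged = {**engine_defaults, **user_params}
    if "max_completion_tokens" in user_params:
        # Azure OpenAI rejects requests that carry both token-limit fields.
        merged.pop("max_tokens", None)
    return merged
```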
Example: Azure OpenAI Configuration
{
"options": {
"to_formats": ["md"],
"do_picture_description": true,
"picture_description_custom_config": {
"engine_options": {"engine_type": "api"},
"model_spec": {
"name": "Azure GPT-4 Vision",
"default_repo_id": "gpt-4-vision",
"prompt": "Describe the image in detail.",
"response_format": "text"
},
"api_overrides": {
"api": {
"url": "https://YOUR_RESOURCE.openai.azure.com/openai/deployments/YOUR_DEPLOYMENT/chat/completions?api-version=2024-02-15-preview",
"headers": {"api-key": "YOUR_AZURE_API_KEY"},
"params": {
"max_completion_tokens": 4096,
"temperature": 1.0
},
"timeout": 90
}
}
},
"picture_description_area_threshold": 0,
"image_alt_mode": "description"
},
"file_sources": [{
"base64_string": "<base64-encoded-pdf>",
"filename": "your.pdf"
}]
}
vLLM Engine Options#
When using the vLLM engine (engine_type: "vllm"), you can configure additional performance options in engine_options:
Available vLLM engine options:
- tensor_parallel_size: Number of GPUs for tensor parallelism
- gpu_memory_utilization: Fraction of GPU memory to use (0.0–1.0)
- trust_remote_code: Allow execution of custom code from the model repository
- cudagraph_mode: CUDA graph capture mode for vLLM v1 engines (controls performance vs. flexibility trade-offs)
cudagraph_mode values:
CUDA graphs reduce kernel-launch overhead by replaying a recorded sequence of CUDA operations. The mode you choose affects startup time, memory usage, and throughput:
- NONE: Disable CUDA graphs entirely; everything runs in eager mode. Fastest startup, lowest steady-state throughput. Best for short-lived processes, notebooks, and debugging.
- FULL: Capture the entire forward pass as one monolithic CUDA graph. Maximum coverage but requires static execution shapes; may fail with some models or dynamic workloads.
- PIECEWISE (default): Capture segments of the model (e.g., transformer blocks) as multiple smaller graphs. Handles dynamic shapes better than FULL while still accelerating most of the forward pass.
- FULL_AND_PIECEWISE: Hybrid mode using FULL graphs for decode-only batches and PIECEWISE graphs for prefill and mixed prefill+decode batches. Usually the best throughput option for typical LLM serving workloads.
- FULL_DECODE_ONLY: FULL CUDA graphs only for decode batches; prefill and mixed batches run in eager mode. Dramatically reduces graph-capture time and memory footprint compared to FULL_AND_PIECEWISE while still accelerating token generation.
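Because a bad engine_options value surfaces only after the server tries to start the engine, it can help to sanity-check the dict client-side first. A hypothetical helper based on the option descriptions above (the value ranges are assumptions, not Docling's validation rules):

```python
VALID_CUDAGRAPH_MODES = {
    "NONE", "FULL", "PIECEWISE", "FULL_AND_PIECEWISE", "FULL_DECODE_ONLY",
}

def validate_vllm_engine_options(opts: dict) -> list:
    """Return a list of problems found in a vLLM engine_options dict
    (illustrative client-side checks only)."""
    errors = []
    mode = opts.get("cudagraph_mode", "PIECEWISE")
    if mode not in VALID_CUDAGRAPH_MODES:
        errors.append(f"unknown cudagraph_mode: {mode}")
    util = opts.get("gpu_memory_utilization", 0.9)
    if not 0.0 < util <= 1.0:
        errors.append("gpu_memory_utilization must be in (0.0, 1.0]")
    tp = opts.get("tensor_parallel_size", 1)
    if not (isinstance(tp, int) and tp >= 1):
        errors.append("tensor_parallel_size must be a positive integer")
    return errors
```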
Example with vLLM engine options:
{
"options": {
"to_formats": ["md"],
"do_picture_description": true,
"picture_description_custom_config": {
"engine_options": {
"engine_type": "vllm",
"cudagraph_mode": "PIECEWISE",
"tensor_parallel_size": 1,
"gpu_memory_utilization": 0.9,
"trust_remote_code": false
},
"model_spec": {
"name": "My Custom VLM",
"default_repo_id": "my-org/my-vlm-model",
"prompt": "Describe this image.",
"response_format": "text"
},
"generation_config": {"max_tokens": 8192},
"api_overrides": {
"api": {
"params": {
"temperature": 0.0,
"skip_special_tokens": false
}
}
}
},
"image_alt_mode": "description"
},
"file_sources": [{
"base64_string": "<base64-encoded-pdf>",
"filename": "your.pdf"
}]
}
4. Get the Markdown Output#
The response JSON will include document.md_content with the Markdown output. Image tags will use alt text according to the image_alt_mode setting:
- static: alt text is always "Image"
- caption: uses the image caption (falls back to "Image" if no caption)
- description: uses the AI-generated description (falls back to "Image" if no description)
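The fallback behavior can be summarized as a small function. This mirrors the documented rules only and is not Docling's actual export code:

```python
def pick_alt_text(mode, caption=None, description=None):
    """Choose Markdown image alt text per the documented image_alt_mode
    fallback rules (illustrative only)."""
    if mode == "caption" and caption:
        return caption
    if mode == "description" and description:
        return description
    # static mode, or the requested caption/description is unavailable
    return "Image"
```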
Tips & Caveats#
- For multipart/form-data, picture_description_custom_config and picture_description_preset must be JSON strings.
- The OpenAI API response must return a plain string or a top-level description field for Docling to attach the description.
- If a PDF image already has a caption, Docling may prioritize the caption over the generated description depending on image_alt_mode.
- If neither caption nor description is available, alt text defaults to "Image".
- When using vLLM for VLM pipelines: set skip_special_tokens: false, max_tokens: 8192, and temperature: 0.0 in API params to avoid empty or malformed responses.
- Full API docs and a UI playground are available at /docs and /ui on your server.
References#
- REST API usage and options
- Image description enrichment via OpenAI API
- Known Markdown export caveats
- Markdown image alt text modes
Summary of image_alt_mode values#
- static: Always uses "Image" as alt text.
- caption: Uses the image's caption if present; otherwise falls back to "Image".
- description: Uses the AI-generated description if present; otherwise falls back to "Image".
Use the mode that best fits your accessibility and documentation needs.
Note: For the full list of available options and model configuration schemas, see the usage documentation. Deprecated fields are still supported for backward compatibility but will be removed in a future release. Migrate to the new *_preset and *_custom_config fields for all VLM and image description tasks.