Local AI Pipeline for Technical PDF Processing (PMS/IDC)#
A student can build this system using Docling + ChromaDB + Ollama + LangChain + Streamlit, all running locally with no cloud API calls. The full pipeline is:
PDF (PMS/IDC)
↓ [Docling + OCR + TableFormer]
Structured Text (JSON/Markdown)
↓ [HybridChunker]
Chunks
↓ [Sentence-Transformers]
Embeddings (vectors)
↓ [ChromaDB]
Vector Database
↓ [LangChain + Ollama LLM]
Intelligent Answers / Datasheets
↓ [Streamlit]
Web UI
Requirements#
- Python 3.10+ (Docling does NOT work on Python 3.7/3.8/3.9: it raises a `SyntaxError` on the walrus operator `:=`)
- Windows users should use PowerShell with a virtual environment
Installation#
cd C:\your\project\folder
py -3.11 -m venv venv
.\venv\Scripts\activate
pip install docling langchain-docling langchain-community langchain-chroma chromadb sentence-transformers streamlit torch
⚠️ Always activate the venv before running scripts:
.\venv\Scripts\activate
Step 1 — Extract PDF to JSON/Markdown (Docling)#
Docling runs 100% locally. No data is sent externally by default.
from docling.document_converter import DocumentConverter
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.base_models import InputFormat
from docling.document_converter import PdfFormatOption
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from pathlib import Path
pdf_options = PdfPipelineOptions()
pdf_options.do_ocr = True
pdf_options.do_table_structure = True
pdf_options.generate_page_images = False # saves RAM
pdf_options.generate_picture_images = False
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            backend=PyPdfiumDocumentBackend,
            pipeline_options=pdf_options
        )
    }
)
# Process pages 1–10 only (to avoid std::bad_alloc on low-RAM machines)
result = converter.convert("documents/PMS.pdf", page_range=(1, 10))
Path("outputs").mkdir(exist_ok=True)
result.document.save_as_json("outputs/PMS.json")
md = result.document.export_to_markdown()
Path("outputs/PMS.md").write_text(md, encoding="utf-8")
⚠️ On machines with only 8GB RAM, processing all pages at once causes `std::bad_alloc` errors. Use `page_range` to process in batches of 10 pages and use `PyPdfiumDocumentBackend` to reduce memory usage. A sketch of this batching loop follows.
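Building on the converter configured above, a loop over `page_range` windows keeps memory bounded. This is a minimal sketch; the 10-page window, page count, and output file names are illustrative:

# Process a long PDF in 10-page windows to keep memory bounded (sketch).
from pathlib import Path

BATCH = 10
TOTAL_PAGES = 74  # adjust to your document's page count

Path("outputs").mkdir(exist_ok=True)
for start in range(1, TOTAL_PAGES + 1, BATCH):
    end = min(start + BATCH - 1, TOTAL_PAGES)
    result = converter.convert("documents/PMS.pdf", page_range=(start, end))
    md = result.document.export_to_markdown()
    Path(f"outputs/PMS_p{start:03d}-{end:03d}.md").write_text(md, encoding="utf-8")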
Step 2 — Chunking (split document into smart pieces)#
The LLM cannot read 74 pages at once. HybridChunker splits the document into ~512-token pieces while keeping tables intact and preserving section context.
from docling.chunking import HybridChunker
chunker = HybridChunker(max_tokens=512)
chunks = list(chunker.chunk(dl_doc=result.document))
enriched_texts = [chunker.contextualize(chunk=c) for c in chunks]
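A quick sanity check on the result, printing the chunk count and a preview of the first contextualized chunk:

# Inspect the chunking output (illustrative).
print(f"{len(chunks)} chunks produced")
print(enriched_texts[0][:500])  # first 500 characters of the first enriched chunk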
Step 3–4 — Embeddings + Vector Database (ChromaDB)#
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_chroma import Chroma
from langchain.schema import Document
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
langchain_docs = [
    Document(page_content=text, metadata={"source": "PMS.pdf"})
    for text in enriched_texts
]
vectorstore = Chroma.from_documents(langchain_docs, embeddings, persist_directory="./chroma_db")
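Before wiring up the LLM, you can verify retrieval directly against the store; the query text below is just an example:

# Check that semantically relevant chunks come back from ChromaDB.
hits = vectorstore.similarity_search("pipe class A1 material", k=3)
for doc in hits:
    print(doc.metadata["source"], "->", doc.page_content[:120])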
Step 5 — RAG with Local LLM (Ollama)#
RAG (Retrieval-Augmented Generation) works as follows:
- You ask a question
- The system finds the relevant chunks from your documents
- Those chunks + your question are sent to the LLM
- The LLM answers based on your actual document content
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA
# Install Ollama from https://ollama.ai, then: ollama pull phi3:mini
llm = Ollama(model="phi3:mini") # lightweight for 8GB RAM
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True
)
result = qa_chain.invoke({"query": "What is the material for pipe class A1?"})
print(result['result'])
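Because `return_source_documents=True`, the chunks behind the answer are returned as well, which is useful for citing the relevant sections:

# Show which chunks the answer was grounded in.
for doc in result["source_documents"]:
    print(doc.metadata.get("source"), "->", doc.page_content[:100])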
Step 6 — Streamlit Web UI#
streamlit run app.py
# Opens at http://localhost:8501
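The command above only runs the app; a minimal `app.py` could look like the sketch below. It assumes the `chroma_db/` folder built in Steps 3–4 already exists and that Ollama is running locally with `phi3:mini` pulled; the page title and prompt strings are illustrative.

# app.py: minimal sketch, assumes chroma_db/ exists and an Ollama server is running
import streamlit as st
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_chroma import Chroma
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA

@st.cache_resource  # build the chain once per session instead of on every rerun
def load_chain():
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)
    llm = Ollama(model="phi3:mini")
    return RetrievalQA.from_chain_type(
        llm=llm,
        retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
        return_source_documents=True,
    )

st.title("PMS/IDC Document Assistant")
question = st.text_input("Ask a question about your documents")

if question:
    with st.spinner("Searching documents..."):
        result = load_chain().invoke({"query": question})
    st.write(result["result"])
    with st.expander("Source chunks"):
        for doc in result["source_documents"]:
            st.markdown(f"**{doc.metadata.get('source', 'unknown')}**")
            st.text(doc.page_content[:300])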
Recommended LLM models (local, via Ollama)#
| Model | RAM needed | Use case |
|---|---|---|
| mistral | ~6 GB | Best quality, needs more RAM |
| phi3:mini | ~3 GB | Good balance for 8GB RAM |
| tinyllama | ~2 GB | Very low RAM, lower quality |
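To switch models, pull the new model with `ollama pull <name>` and change the model name where the LLM is created in Step 5, for example:

# Swap in a larger model: run `ollama pull mistral` first, then point the chain at it.
llm = Ollama(model="mistral")  # ~6 GB RAM; keep phi3:mini or tinyllama on tighter machines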
Privacy#
Docling is fully local by default. All OCR, layout detection, and table extraction run on your machine. Remote services are only activated if you explicitly set enable_remote_services=True, which you should never do for confidential documents.
To run completely offline, pre-download all models:
pip install docling-tools
docling-tools models download --all
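Once everything is cached, you can additionally force the Hugging Face libraries into offline mode so later runs never touch the network. The commented `artifacts_path` line for pointing Docling at the pre-downloaded models is an assumption to verify against your installed Docling version:

# Offline sketch: assumes the embedding model is already in the local HF cache
# and the Docling models were downloaded with docling-tools (see above).
import os

os.environ["HF_HUB_OFFLINE"] = "1"        # block Hugging Face Hub downloads
os.environ["TRANSFORMERS_OFFLINE"] = "1"  # same for the transformers library

# Assumption: recent Docling versions accept artifacts_path on PdfPipelineOptions
# to locate pre-downloaded models; verify before relying on it.
# pdf_options = PdfPipelineOptions(artifacts_path="C:/path/to/docling-models")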
Project Folder Structure#
project/
├── venv/
├── documents/ # Your PMS.pdf, IDC.pdf files
├── outputs/ # Extracted JSON and Markdown
├── chroma_db/ # Vector database
├── test_extraction.py
├── rag_pipeline.py
└── app.py