How can a student build a local, open-source, privacy-preserving AI system to extract and query technical PDF documents (PMS/IDC) on a PC with limited resources?
Type
Answer
Status
Published
Created
May 13, 2026
Updated
May 13, 2026
Created by
Dosu Bot
Updated by
Dosu Bot

Local AI Pipeline for Technical PDF Processing (PMS/IDC)#

A student can build this system using Docling + ChromaDB + Ollama + LangChain + Streamlit, all running locally with no cloud API calls. The full pipeline is:

PDF (PMS/IDC)
    ↓ [Docling + OCR + TableFormer]
Structured Text (JSON/Markdown)
    ↓ [HybridChunker]
Chunks
    ↓ [Sentence-Transformers]
Embeddings (vectors)
    ↓ [ChromaDB]
Vector Database
    ↓ [LangChain + Ollama LLM]
Intelligent Answers / Datasheets
    ↓ [Streamlit]
Web UI

Requirements#

  • Python 3.10+ (Docling does NOT work on Python 3.7/3.8/3.9; older interpreters fail with a SyntaxError on newer language syntax that Docling relies on)
  • Windows users should use PowerShell with a virtual environment

Installation#

cd C:\your\project\folder
py -3.11 -m venv venv
.\venv\Scripts\activate
pip install docling langchain langchain-docling langchain-community langchain-chroma chromadb sentence-transformers streamlit torch

⚠️ Always activate the venv before running scripts: .\venv\Scripts\activate
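
Optionally, a quick sanity check that the core libraries import correctly (run inside the activated venv; this snippet is not part of the pipeline itself):

# confirm the main packages are importable
import docling
import chromadb
import sentence_transformers
import langchain

print("environment OK")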


Step 1 — Extract PDF to JSON/Markdown (Docling)#

Docling runs 100% locally. No data is sent externally by default.

from docling.document_converter import DocumentConverter
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.base_models import InputFormat
from docling.document_converter import PdfFormatOption
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from pathlib import Path

pdf_options = PdfPipelineOptions()
pdf_options.do_ocr = True
pdf_options.do_table_structure = True
pdf_options.generate_page_images = False # saves RAM
pdf_options.generate_picture_images = False

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            backend=PyPdfiumDocumentBackend,
            pipeline_options=pdf_options
        )
    }
)

# Process pages 1–10 only (to avoid std::bad_alloc on low-RAM machines)
result = converter.convert("documents/PMS.pdf", page_range=(1, 10))

Path("outputs").mkdir(exist_ok=True)
result.document.save_as_json("outputs/PMS.json")
md = result.document.export_to_markdown()
Path("outputs/PMS.md").write_text(md, encoding="utf-8")

⚠️ On machines with only 8GB RAM, processing all pages at once causes std::bad_alloc errors. Use page_range to process in batches of 10 pages and use PyPdfiumDocumentBackend to reduce memory usage.
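
A minimal sketch of that batching approach (the total page count and output file naming here are illustrative; it reuses the converter and Path import from the code above):

total_pages = 74   # example: a 74-page PMS document
batch_size = 10    # pages per conversion run; lower this if memory errors persist

Path("outputs").mkdir(exist_ok=True)
for start in range(1, total_pages + 1, batch_size):
    end = min(start + batch_size - 1, total_pages)
    result = converter.convert("documents/PMS.pdf", page_range=(start, end))
    # one JSON file per batch, e.g. outputs/PMS_p1-10.json
    result.document.save_as_json(f"outputs/PMS_p{start}-{end}.json")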


Step 2 — Chunking (split document into smart pieces)#

The LLM cannot read 74 pages at once. HybridChunker splits the document into ~512-token pieces while keeping tables intact and preserving section context.

from docling.chunking import HybridChunker

chunker = HybridChunker(max_tokens=512)
chunks = list(chunker.chunk(dl_doc=result.document))
enriched_texts = [chunker.contextualize(chunk=c) for c in chunks]

Step 3–4 — Embeddings + Vector Database (ChromaDB)#

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_chroma import Chroma
from langchain.schema import Document

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

langchain_docs = [
    Document(page_content=text, metadata={"source": "PMS.pdf"})
    for text in enriched_texts
]

vectorstore = Chroma.from_documents(langchain_docs, embeddings, persist_directory="./chroma_db")
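
Because the vectors are persisted in ./chroma_db, a later script (for example app.py) can reopen the store without re-embedding the PDF every run. A minimal sketch, assuming the same embedding model is reused:

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_chroma import Chroma

# reopen the existing vector database instead of rebuilding it
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)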

Step 5 — RAG with Local LLM (Ollama)#

RAG (Retrieval-Augmented Generation) works as follows:

  1. You ask a question
  2. The system finds the relevant chunks from your documents
  3. Those chunks + your question are sent to the LLM
  4. The LLM answers based on your actual document content

from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA

# Install Ollama from https://ollama.ai, then: ollama pull phi3:mini
llm = Ollama(model="phi3:mini") # lightweight for 8GB RAM

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True
)

result = qa_chain({"query": "What is the material for pipe class A1?"})
print(result['result'])
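
Because return_source_documents=True, the answer comes back together with the chunks it was based on, which lets you check the LLM against the original PMS/IDC text:

# inspect which chunks the answer was grounded in
for doc in result["source_documents"]:
    print(doc.metadata.get("source"), "->", doc.page_content[:100])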

Step 6 — Streamlit Web UI#

streamlit run app.py
# Opens at http://localhost:8501
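
A minimal app.py sketch for this step; it assumes the ./chroma_db built in steps 3–4 exists and that Ollama is running with phi3:mini pulled (widget labels are placeholders):

# app.py - minimal local chat UI over the indexed documents
import streamlit as st
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_chroma import Chroma
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA

st.title("PMS/IDC Document Assistant")

@st.cache_resource
def load_chain():
    # load the persisted vector store and local LLM once per session
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)
    llm = Ollama(model="phi3:mini")
    return RetrievalQA.from_chain_type(
        llm=llm,
        retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
        return_source_documents=True,
    )

question = st.text_input("Ask a question about your documents")
if question:
    answer = load_chain()({"query": question})
    st.write(answer["result"])
    with st.expander("Sources"):
        for doc in answer["source_documents"]:
            st.write(doc.metadata.get("source"), doc.page_content[:200])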

Which Ollama model to pull depends on how much RAM is available:

Model     | RAM needed | Use case
mistral   | ~6 GB      | Best quality, needs more RAM
phi3:mini | ~3 GB      | Good balance for 8GB RAM
tinyllama | ~2 GB      | Very low RAM, lower quality

Privacy#

Docling is fully local by default. All OCR, layout detection, and table extraction run on your machine. Remote services are only activated if you explicitly set enable_remote_services=True, which you should never do for confidential documents.
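
For reference, that switch lives on the pipeline options used in step 1 and defaults to False; a short sketch of leaving it off:

from docling.datamodel.pipeline_options import PdfPipelineOptions

pdf_options = PdfPipelineOptions()
pdf_options.enable_remote_services = False  # default; keep False for confidential PMS/IDC files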

To run completely offline, pre-download all models:

# the docling-tools CLI is installed together with the docling package
docling-tools models download --all

Project Folder Structure#

project/
├── venv/
├── documents/ # Your PMS.pdf, IDC.pdf files
├── outputs/ # Extracted JSON and Markdown
├── chroma_db/ # Vector database
├── test_extraction.py
├── rag_pipeline.py
└── app.py