How can a student build a local, open-source LLM-based system to extract and analyze technical PDF documents (PMS/IDC) and generate instrumentation datasheets, using Python on a PC with limited resources?
Type
Answer
Status
Published
Created
May 13, 2026
Updated
May 13, 2026
Created by
Dosu Bot
Updated by
Dosu Bot

Local LLM-Based PDF Analysis System for Technical Documents (PFE Project)#

PDF (PMS/IDC) → Docling (extraction) → structured JSON → Vector DB → local LLM → Datasheets

Key Components & Tools#

| Component | Tool |
| --- | --- |
| PDF Extraction | docling (with OCR + TableFormer) |
| OCR Engine | EasyOCR (built into Docling) |
| Local LLM | Ollama + Mistral 7B / Phi-3 Mini / TinyLlama |
| Vector Database | ChromaDB (local) |
| RAG Framework | LangChain or LlamaIndex |
| UI | Streamlit |

Privacy & Local Operation#

Docling runs 100% locally — no data is sent to external servers. All OCR, layout analysis, and table extraction happen on your machine. You can pre-download all models for fully offline operation:

pip install docling   # the docling package also installs the docling-tools CLI
docling-tools models download --all

The only exception is when enable_remote_services=True is set explicitly; it is False by default.
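For a fully offline setup you can also point the pipeline at the downloaded models explicitly. A minimal sketch, where the artifacts path is an example location from your own setup, not a fixed Docling default:

from docling.datamodel.pipeline_options import PdfPipelineOptions

# load models from a local folder instead of fetching them at first use
pdf_options = PdfPipelineOptions(artifacts_path="/home/user/docling-models")
pdf_options.enable_remote_services = False  # already the default; shown for clarity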


Step-by-Step Implementation#

1. Set up the environment:

mkdir ~/mon-projet-pfe && cd ~/mon-projet-pfe
python3 -m venv venv && source venv/bin/activate

2. Install dependencies:

pip install "docling[tesserocr]" langchain-docling langchain-community \
    langchain-chroma chromadb sentence-transformers streamlit transformers torch

3. Install Ollama + a local LLM model:

curl -fsSL https://ollama.com/install.sh | sh
ollama pull phi3:mini # Lightweight (~8 GB RAM needed)
# or
ollama pull tinyllama # Ultra-light (~4 GB RAM needed)
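
A quick sanity check that the model actually runs (the prompt is arbitrary):

ollama list   # confirm the model was pulled
ollama run phi3:mini "Summarize what a pressure transmitter does."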

4. Extract PDFs to structured JSON (minimal test):

from pathlib import Path

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    EasyOcrOptions,
    PdfPipelineOptions,
    TableFormerMode,
    TableStructureOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

pdf_options = PdfPipelineOptions()
pdf_options.do_ocr = True
pdf_options.ocr_options = EasyOcrOptions(lang=["fr", "en"], use_gpu=False)
pdf_options.do_table_structure = True
# use the accurate TableFormer model and match cells back to the PDF layout
pdf_options.table_structure_options = TableStructureOptions(
    do_cell_matching=True, mode=TableFormerMode.ACCURATE
)

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pdf_options)}
)

result = converter.convert("documents/your_document.pdf")

Path("outputs").mkdir(exist_ok=True)  # make sure the output folder exists
result.document.save_as_json(Path("outputs/output.json"))
md = result.document.export_to_markdown()
Path("outputs/output.md").write_text(md, encoding="utf-8")
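
Since the datasheets depend mostly on tabular data, it helps to inspect what TableFormer extracted before going further. A short follow-up sketch, assuming pandas is installed (export_to_dataframe may emit a deprecation warning on recent Docling versions that prefer an explicit doc argument):

# dump each extracted table to CSV for a quick visual check
for i, table in enumerate(result.document.tables):
    df = table.export_to_dataframe()
    print(f"Table {i}: {df.shape[0]} rows x {df.shape[1]} columns")
    df.to_csv(f"outputs/table_{i}.csv", index=False)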

5. Build a RAG pipeline:

from docling.chunking import HybridChunker
from langchain_core.documents import Document
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_chroma import Chroma
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA

chunker = HybridChunker(max_tokens=512)
chunks = list(chunker.chunk(dl_doc=result.document))

# wrap the Docling chunks as LangChain documents before indexing
documents = [Document(page_content=chunk.text) for chunk in chunks]

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(documents, embeddings, persist_directory="./chroma_db")

llm = Ollama(model="phi3:mini")
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True,
)
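
To check that the chain answers from the indexed documents, a quick test (the question is an arbitrary example):

question = "What is the measuring range of the pressure transmitter?"
response = qa_chain.invoke({"query": question})
print(response["result"])
for doc in response["source_documents"]:
    print("-", doc.page_content[:100])  # preview the retrieved passages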

6. Launch the Streamlit web UI:

streamlit run app.py
# Opens at http://localhost:8501
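
The guide does not show app.py itself; the file below is a minimal hypothetical version (the title and widget labels are illustrative). It reloads the Chroma index persisted in step 5 rather than re-parsing the PDFs on every rerun:

# app.py
import streamlit as st
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_chroma import Chroma
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA

@st.cache_resource
def build_qa_chain():
    # reload the index persisted in step 5 instead of rebuilding it each rerun
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)
    llm = Ollama(model="phi3:mini")
    return RetrievalQA.from_chain_type(
        llm=llm,
        retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
        return_source_documents=True,
    )

st.title("PMS/IDC Document Assistant")
question = st.text_input("Ask a question about the documents")
if question:
    response = build_qa_chain().invoke({"query": question})
    st.write(response["result"])
    with st.expander("Sources"):
        for doc in response["source_documents"]:
            st.write(doc.page_content[:300])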

Hardware Constraints & Solutions#

| Constraint | Solution |
| --- | --- |
| Low available RAM (< 2 GB free) | Close all other programs; use quantized models (GGUF 4-bit) via Ollama |
| Limited disk space | Use generate_page_images=False in pipeline options to reduce memory |
| Scanned PDF checkboxes | OCR + regex post-processing, or a Vision Language Model (VLM) |
| Complex tables (merged cells) | Use TableFormer V2 or Granite Vision Table |
| Very large PDFs | Process in page batches (see the sketch below) |
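
For the last row, one possible approach, assuming a recent Docling release whose convert() accepts a page_range argument (check your version; reuses converter and Path from step 4, and the batch size and page count are arbitrary):

batch_size = 20
total_pages = 200  # known or probed beforehand; hardcoded here for illustration
for start in range(1, total_pages + 1, batch_size):
    end = min(start + batch_size - 1, total_pages)
    # convert only a slice of pages at a time to cap peak memory use
    result = converter.convert("documents/your_document.pdf", page_range=(start, end))
    result.document.save_as_json(Path(f"outputs/pages_{start}_{end}.json"))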

Note: If the PC has only ~1.6 GB of available RAM, even Docling alone may struggle. It is recommended to free RAM by closing background applications, or to offload heavy processing to Google Colab (free GPU) while keeping the extraction of confidential documents local.