Converting PDF Pages to JPG Images Using Python — A Practical Automation G

Aior · Jun 23, 2023

Why Convert PDF Pages to JPG?

PDF is the gold standard for written documents, but several workflows need each page as a separate image: generating preview thumbnails, preparing training data for machine learning, pre-processing for an OCR (Optical Character Recognition) pipeline, importing assets into a content management system, or doing detailed visual quality control. This guide shows how to do that conversion in Python quickly, reliably and at batch scale.

At AIOR we use this method regularly for document management systems, e-invoice image processing and digital archiving of technical book collections. The code examples below are production-ready and tested.

Which Library?

The Python ecosystem offers several mature options for PDF handling. pdf2image wraps the Poppler binaries, which have to be installed separately, and produces high-quality renders. PyMuPDF (imported as fitz) is a Python binding for the MuPDF C library rather than a pure-Python package; it installs from a single wheel with no external binaries and is usually the faster of the two. For most use cases we recommend PyMuPDF: easy install, strong performance and built-in access to page metadata.

Install:

Code:

pip install PyMuPDF Pillow

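Pillow (the maintained fork of PIL) handles JPG writing. PyMuPDF's own save function defaults to PNG, so the scripts below hand the rendered pixels to Pillow, which also gives explicit control over JPEG quality.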
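
The pip package names differ from the import names (PyMuPDF is imported as fitz, Pillow as PIL), so a quick check after installation can save some confusion. The following is only a sketch using the standard library; the helper name missing_modules is hypothetical and not part of either package.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # PyMuPDF provides the module "fitz"; Pillow provides the module "PIL".
    missing = missing_modules(["fitz", "PIL"])
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("PyMuPDF and Pillow are importable.")
```

If anything is reported as missing, rerun the pip command inside the same virtual environment that runs the script.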

The Basic Conversion Script

A simple script that converts one PDF into per-page JPGs:

Code:

import os

import fitz  # PyMuPDF
from PIL import Image

def pdf_to_jpg(pdf_path, output_dir, dpi=200, quality=85):
    """Render every page of pdf_path to a JPEG in output_dir; return the file paths."""
    os.makedirs(output_dir, exist_ok=True)
    zoom = dpi / 72  # PDF page units are points, 72 to the inch
    mat = fitz.Matrix(zoom, zoom)
    pages = []
    with fitz.open(pdf_path) as doc:  # closes the file even if a page fails
        for page_num in range(len(doc)):
            page = doc.load_page(page_num)
            # alpha=False yields 3-channel RGB samples, which is what JPEG needs
            pix = page.get_pixmap(matrix=mat, alpha=False)
            img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
            out_path = os.path.join(output_dir, f"page_{page_num + 1:03d}.jpg")
            img.save(out_path, "JPEG", quality=quality, optimize=True)
            pages.append(out_path)
            print(f"  saved: {out_path}")
    return pages

# Usage
pdf_to_jpg("document.pdf", "output/", dpi=200, quality=85)

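With these defaults each page is rendered at 200 DPI. The zoom factor is dpi / 72 because PDF page dimensions are expressed in points, 72 to the inch. 150 DPI is enough for web previews; 300 DPI suits print. JPEG quality 85 strikes a good balance between visual fidelity and file size.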
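
To make the DPI-to-zoom relationship concrete, here is a small worked calculation. It is a sketch with a hypothetical helper, pixel_size, and the result is approximate: the pixmap PyMuPDF actually produces may differ by a pixel because of how it rounds the page rectangle.

```python
def pixel_size(width_pt, height_pt, dpi):
    """Approximate rendered size in pixels for a page measured in PDF points."""
    zoom = dpi / 72  # the same scale factor passed to fitz.Matrix(zoom, zoom)
    return round(width_pt * zoom), round(height_pt * zoom)


# US Letter is 612 x 792 points (8.5 x 11 inches).
for dpi in (72, 150, 300):
    width_px, height_px = pixel_size(612, 792, dpi)
    print(f"{dpi} DPI -> {width_px} x {height_px} px")
```

At 300 DPI a US Letter page comes out at 2550 x 3300 pixels, which is simply 8.5 and 11 inches multiplied by 300.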

Batch Conversion — All PDFs in a Directory

One script to process every PDF in a folder:

Code:

import glob
import os
from pathlib import Path

def batch_convert(input_dir, output_root, dpi=200, quality=85):
    # sorted() keeps the processing order reproducible between runs
    pdfs = sorted(glob.glob(os.path.join(input_dir, "*.pdf")))
    print(f"Found {len(pdfs)} PDFs")
    for pdf in pdfs:
        # Each PDF gets its own subfolder, named after the file without extension
        name = Path(pdf).stem
        out_dir = os.path.join(output_root, name)
        pdf_to_jpg(pdf, out_dir, dpi, quality)
    print("All conversions complete.")

batch_convert("input_pdfs/", "output_images/", dpi=200, quality=85)

This approach can work through archives of hundreds of PDFs in a few hours. Depending on hardware, DPI and page complexity, expect roughly 1,000-5,000 pages per hour.
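
The "*.pdf" glob pattern only looks at the top level of the folder, and on case-sensitive file systems it skips files ending in ".PDF". If an archive has nested folders or mixed-case extensions, file discovery could be sketched as below. Both helper names, find_pdfs and output_dir_for, are hypothetical, and mirroring the subfolder layout in the output is an assumption made here so that two PDFs with the same name in different subfolders do not share an output directory.

```python
from pathlib import Path


def find_pdfs(input_dir):
    """Return all PDF files under input_dir, matching the extension case-insensitively."""
    root = Path(input_dir)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def output_dir_for(pdf_path, input_dir, output_root):
    """Mirror the PDF's location relative to input_dir beneath output_root."""
    relative = Path(pdf_path).relative_to(Path(input_dir))
    return Path(output_root) / relative.with_suffix("")
```

Each discovered path can then be passed to pdf_to_jpg(str(pdf), str(output_dir_for(pdf, input_dir, output_root))).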

Parallel Processing

For large PDF collections, use multiprocessing to leverage multiple cores:

Code:

from multiprocessing import Pool

def process_pdf(args):
    pdf, out_root, dpi, quality = args
    name = Path(pdf).stem
    out_dir = os.path.join(out_root, name)
    pdf_to_jpg(pdf, out_dir, dpi, quality)
    return pdf

def parallel_batch(input_dir, output_root, workers=4, dpi=200, quality=85):
    pdfs = glob.glob(os.path.join(input_dir, "*.pdf"))
    args = [(pdf, output_root, dpi, quality) for pdf in pdfs]
    with Pool(workers) as pool:
        pool.map(process_pdf, args)

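# The guard is required wherever worker processes are spawned (Windows, macOS):
# without it each worker re-imports this file and tries to start its own pool.
if __name__ == "__main__":
    parallel_batch("input_pdfs/", "output_images/", workers=8)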

With 8 workers on an 8-core server you see roughly 6-7× speedup (linear scaling is rarely fully achieved because of I/O).
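
One limitation of pool.map is that a single unreadable PDF raises in the parent and the results of the whole batch are lost. A variant that records failures per file might look like the sketch below. It swaps in concurrent.futures from the standard library instead of multiprocessing.Pool; the function name convert_all and its parameters are hypothetical, and worker stands for any top-level function that takes one path, such as a thin wrapper around pdf_to_jpg.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed


def convert_all(pdfs, worker, max_workers=4, executor_cls=ProcessPoolExecutor):
    """Run worker(pdf) for every path and collect successes and failures.

    Returns (succeeded, failed): a sorted list of paths that completed and a
    dict mapping each failed path to a text description of its exception.
    """
    succeeded, failed = [], {}
    with executor_cls(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, pdf): pdf for pdf in pdfs}
        for future in as_completed(futures):
            pdf = futures[future]
            try:
                future.result()  # re-raises whatever the worker raised
                succeeded.append(pdf)
            except Exception as exc:
                failed[pdf] = repr(exc)
    return sorted(succeeded), failed
```

With the default process pool, worker must be defined at module level so it can be pickled, and the call belongs under an if __name__ == "__main__": guard. The failed dict can be logged or fed into a retry pass.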

Image Quality Tuning

Higher DPI captures more detail, but the pixel count grows with the square of the DPI, and file size and render time rise with it. Recommended settings by use case:

- Web thumbnail (75-100 DPI): fast load, small file
- Page preview (150 DPI): ideal for most CMS
- High quality (200-300 DPI): OCR pipeline or print
- Archive quality (400+ DPI): long-term storage, downsampling possible later
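
To get a feel for what these tiers cost, the uncompressed size of one rendered page can be estimated with simple arithmetic. This is a rough sketch with a hypothetical helper, raw_rgb_bytes; it assumes 3 bytes per pixel (RGB, no alpha). The JPEG written to disk is far smaller than this figure, but it tends to follow the same trend.

```python
def raw_rgb_bytes(width_pt, height_pt, dpi):
    """Uncompressed size in bytes of one rendered RGB page (3 bytes per pixel)."""
    zoom = dpi / 72
    width_px = round(width_pt * zoom)
    height_px = round(height_pt * zoom)
    return width_px * height_px * 3


# A4 is roughly 595 x 842 points.
for dpi in (100, 150, 300, 400):
    mib = raw_rgb_bytes(595, 842, dpi) / 2**20
    print(f"{dpi:>3} DPI: {mib:6.1f} MiB uncompressed")
```

Doubling the DPI quadruples the pixel buffer, which is worth keeping in mind when choosing the worker count for parallel runs.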

JPEG quality 75-90 is optimal; lower values introduce visible artefacts and higher values balloon file size with little visual gain.

OCR Pipeline Usage

PDF → JPG conversion is a fundamental pre-processing step for OCR. The flow:

Code:

PDF files → pdf_to_jpg() → JPG pages →
Tesseract OCR / Cloud Vision API → text output →
Elasticsearch indexing → searchable results

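In this pipeline use DPI 300 and quality 95. The quality setting sits above the general 75-90 range on purpose: compression artefacts around character edges hurt recognition, and OCR accuracy is directly tied to image quality.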
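
The middle of that flow, from JPG pages to text records, could be wired up roughly as follows. This is a sketch only: ocr_pages is a hypothetical helper, the OCR engine is passed in as a plain callable so that no particular library is assumed, and the record layout is an illustration rather than a required schema. The page number is recovered from the page_NNN.jpg naming used by pdf_to_jpg.

```python
import re
from pathlib import Path

PAGE_RE = re.compile(r"page_(\d+)\.jpg$")


def ocr_pages(image_paths, ocr):
    """Run ocr(path) on each page image and return one record per page."""
    records = []
    # Zero-padded file names sort in page order
    for path in sorted(str(p) for p in image_paths):
        match = PAGE_RE.search(Path(path).name)
        page_no = int(match.group(1)) if match else None
        records.append({"page": page_no, "image": path, "text": ocr(path)})
    return records
```

With Tesseract, the ocr argument could be pytesseract.image_to_string, which accepts an image file path; that assumes both the pytesseract package and the Tesseract binary are installed. The resulting records are what would be sent on to the indexing step.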

Common Issues

"DLL load failed" on Windows: PyMuPDF's compiled extension depends on the Microsoft Visual C++ Redistributable; install it from Microsoft's website.

Memory errors on huge PDFs: process page by page rather than loading the whole doc into RAM. The code above already does this.

Blank or corrupt JPGs: the PDF may be encrypted. After opening it, call doc.authenticate("password") before loading any pages. If the file is not protected, the PDF itself may be damaged; try repairing it with pikepdf.
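
The encryption check can be made explicit so that a protected file fails loudly instead of producing blank output. The helper below is a sketch; the name unlock is hypothetical, and it relies on two members of a PyMuPDF Document: the needs_pass attribute and authenticate(), which returns 0 when the password is rejected.

```python
def unlock(doc, password=None):
    """Authenticate an opened document if it is password-protected.

    doc is expected to behave like a PyMuPDF Document: needs_pass is truthy
    for encrypted files and authenticate() returns 0 for a wrong password.
    """
    if not doc.needs_pass:
        return doc  # nothing to do for unprotected files
    if password is None or not doc.authenticate(password):
        raise PermissionError("PDF is encrypted and the password was missing or wrong")
    return doc
```

Inside pdf_to_jpg it would be called right after fitz.open(pdf_path), before the page loop starts.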

AIOR Production Scenario

For one AIOR customer's document archive, 50,000+ PDFs (5-100 pages each) were converted to JPG with a Python script. The pipeline ran for 12 hours on an 8-core server and produced 1.2 million images. All assets were uploaded to S3-compatible storage and made searchable via Elasticsearch. For projects like this AIOR offers combined project management, infrastructure setup and ongoing maintenance.
