Goldziher/kreuzberg: A text extraction library supporting PDFs, images, office documents and more

Kreuzberg is a Python library for text extraction from documents. It provides a unified async interface for extracting text from PDFs, images, office documents, and more.

Simple and Hassle-Free: Clean API that just works, without complicated configuration
Local Processing: No external API calls or cloud dependencies required
Resource Efficient: Lightweight processing without GPU requirements
Lightweight: Has few curated dependencies and a minimal footprint
Format Support: Comprehensive support for documents, images, and text formats
Modern Python: Built with async/await, type hints, and a functional-first approach
Permissive OSS: Kreuzberg and its dependencies have a permissive OSS license

Kreuzberg was built for RAG (Retrieval Augmented Generation) applications, focusing on local processing with minimal dependencies. It's designed for modern async applications, serverless functions, and dockerized applications.

1. Install the Python Package

2. Install System Dependencies

Kreuzberg requires two system level dependencies: Pandoc (for document conversion) and Tesseract OCR (for image and scanned PDF text recognition).

Please install these using their respective installation guides.

Kreuzberg integrates:

PDF Processing:
- pdfium2 for searchable PDFs
- Tesseract OCR for scanned content
Document Conversion:
- Pandoc for many document and markup formats
- python-pptx for PowerPoint files
- html-to-markdown for HTML content
- calamine for Excel spreadsheets (with multi-sheet support)
Text Processing:
- Smart encoding detection
- Markdown and plain text handling

PDF (.pdf, both searchable and scanned)
Microsoft Word (.docx)
PowerPoint presentations (.pptx)
OpenDocument Text (.odt)
Rich Text Format (.rtf)
EPUB (.epub)
DocBook XML (.dbk, .xml)
FictionBook (.fb2)
LaTeX (.tex, .latex)
Typst (.typ)

HTML (.html, .htm)
Plain text (.txt) and Markdown (.md, .markdown)
reStructuredText (.rst)
Org-mode (.org)
DokuWiki (.txt)
Pod (.pod)
Troff/Man (.1, .2, etc.)

Data and Research Formats

Spreadsheets (.xlsx, .xls, .xlsm, .xlsb, .xlam, .xla, .ods)
CSV (.csv) and TSV (.tsv) files
OPML files (.opml)
Jupyter Notebooks (.ipynb)
BibTeX (.bib) and BibLaTeX (.bib)
CSL-JSON (.json)
EndNote and JATS XML (.xml)
RIS (.ris)

JPEG (.jpg, .jpeg, .pjpeg)
PNG (.png)
TIFF (.tiff, .tif)
BMP (.bmp)
GIF (.gif)
JPEG 2000 family (.jp2, .jpm, .jpx, .mj2)
WebP (.webp)
Portable anymap formats (.pbm, .pgm, .ppm, .pnm)

Kreuzberg provides both async and sync APIs for text extraction, including batch processing. The library exports the following main functions:

Single Item Processing:
- extract_file(): Async function to extract text from a file (accepts string path or pathlib.Path)
- extract_bytes(): Async function to extract text from bytes (accepts a byte string)
- extract_file_sync(): Synchronous version of extract_file()
- extract_bytes_sync(): Synchronous version of extract_bytes()
Batch Processing:
- batch_extract_file(): Async function to extract text from multiple files concurrently
- batch_extract_bytes(): Async function to extract text from multiple byte contents concurrently
- batch_extract_file_sync(): Synchronous version of batch_extract_file()
- batch_extract_bytes_sync(): Synchronous version of batch_extract_bytes()

All extraction functions accept the following optional parameters for configuring OCR and performance:

language (default: “eng”): Specifies the language model for Tesseract OCR. This affects text recognition accuracy for non-English documents. Examples:
- “eng” for English
- “deu” for German
- “fra” for French

Consult the Tesseract documentation for more information.

psm (Page Segmentation Mode, default: PSM.AUTO): Controls how Tesseract analyzes page layout. In most cases you do not need to change this to a different value.

Performance Configuration

max_processes (default: CPU count / 2): Maximum number of concurrent processes for Tesseract and Pandoc. Higher values can improve performance, but may cause resource exhaustion and deadlocks (especially for Tesseract).
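
The trade-off behind a concurrency cap can be illustrated without Kreuzberg itself. The following is a generic asyncio sketch, not the library's implementation of max_processes: `run_bounded` and `fake_ocr_page` are hypothetical names, and the stand-in job simply sleeps instead of running OCR. It shows how a semaphore keeps at most a fixed number of jobs in flight while results still come back in input order.

```python
import asyncio


async def run_bounded(jobs, max_concurrent: int):
    """Run coroutine factories with at most `max_concurrent` in flight.

    Results are returned in the same order as `jobs`.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(job):
        # Each job waits here until one of the slots is free.
        async with semaphore:
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs))


async def fake_ocr_page(page: int) -> str:
    # Hypothetical stand-in for an expensive OCR call; not part of Kreuzberg.
    await asyncio.sleep(0.01)
    return f"text of page {page}"


def demo() -> list:
    # Bind each page number eagerly so every job gets its own value.
    jobs = [lambda p=p: fake_ocr_page(p) for p in range(5)]
    return asyncio.run(run_bounded(jobs, max_concurrent=2))
```

Raising `max_concurrent` shortens the total wait only while the machine has spare capacity, which is why starting low and increasing gradually is the safer approach.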

from pathlib import Path
from kreuzberg import extract_file
from kreuzberg.extraction import ExtractionResult
from kreuzberg._tesseract import PSMMode, SupportedLanguage


# Basic file extraction
async def extract_document():
    # Extract from a PDF file with default settings
    pdf_result: ExtractionResult = await extract_file("document.pdf")
    print(f"Content: {pdf_result.content}")

    # Extract from an image with German language model
    img_result = await extract_file(
        "scan.png",
        language="deu",  # German language model
        psm=PSMMode.SINGLE_BLOCK,  # Treat as single block of text
        max_processes=4  # Limit concurrent processes
    )
    print(f"Image text: {img_result.content}")

    # Extract from Word document with metadata
    docx_result = await extract_file(Path("document.docx"))
    if docx_result.metadata:
        print(f"Title: {docx_result.metadata.get('title')}")
        print(f"Author: {docx_result.metadata.get('author')}")

from kreuzberg import extract_bytes
from kreuzberg.extraction import ExtractionResult


async def process_upload(file_content: bytes, mime_type: str) -> ExtractionResult:
    """Process uploaded file content with known MIME type."""
    return await extract_bytes(
        file_content,
        mime_type=mime_type,
    )


# Example usage with different file types
async def handle_uploads(docx_bytes: bytes, pdf_bytes: bytes, image_bytes: bytes):
    # Process PDF upload
    pdf_result = await process_upload(pdf_bytes, mime_type="application/pdf")
    print(f"PDF content: {pdf_result.content}")
    print(f"PDF metadata: {pdf_result.metadata}")

    # Process image upload (will use OCR)
    img_result = await process_upload(image_bytes, mime_type="image/jpeg")
    print(f"Image text: {img_result.content}")

    # Process Word document upload
    docx_result = await process_upload(
        docx_bytes,
        mime_type="application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    )
    print(f"Word content: {docx_result.content}")
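
Because extract_bytes() needs a MIME type, upload handlers often have to derive one from the original filename. A minimal sketch using only the standard library's `mimetypes` module is shown below; `guess_mime_type` and `FALLBACK_MIME_TYPE` are hypothetical names introduced here, and the fallback value is an assumption rather than something Kreuzberg defines.

```python
import mimetypes
from pathlib import Path

# Assumption: a generic fallback for unknown extensions.
FALLBACK_MIME_TYPE = "application/octet-stream"


def guess_mime_type(filename: str) -> str:
    """Guess a MIME type from the file extension, with a generic fallback."""
    # Only the final path component matters for extension-based guessing.
    mime_type, _encoding = mimetypes.guess_type(Path(filename).name)
    return mime_type or FALLBACK_MIME_TYPE
```

Extension-based guessing is only a hint. Since unsupported or undetectable MIME types are reported as validation errors, callers may prefer to reject uploads that hit the fallback before attempting extraction.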

Kreuzberg supports efficient batch processing of multiple files or byte contents:

from pathlib import Path
from kreuzberg import batch_extract_file, batch_extract_bytes, batch_extract_file_sync


# Process multiple files concurrently
async def process_documents(file_paths: list[Path]) -> None:
    # Extract from multiple files
    results = await batch_extract_file(file_paths)
    for path, result in zip(file_paths, results):
        print(f"File {path}: {result.content[:100]}...")


# Process multiple uploaded files concurrently
async def process_uploads(contents: list[tuple[bytes, str]]) -> None:
    # Each item is a tuple of (content, mime_type)
    results = await batch_extract_bytes(contents)
    for (_, mime_type), result in zip(contents, results):
        print(f"Upload {mime_type}: {result.content[:100]}...")


# Synchronous batch processing is also available
def process_documents_sync(file_paths: list[Path]) -> None:
    results = batch_extract_file_sync(file_paths)
    for path, result in zip(file_paths, results):
        print(f"File {path}: {result.content[:100]}...")

Features:

Ordered results
Concurrent processing
Error handling per item
Async and sync interfaces
Same options as single extraction
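
How ordered results and per-item error handling fit together can be sketched with plain asyncio. This is an illustration under stated assumptions, not a description of how the batch functions behave internally: `fake_extract` and `extract_all` are hypothetical stand-ins, and the failure is simulated.

```python
import asyncio


async def fake_extract(name: str) -> str:
    # Hypothetical stand-in for an extraction call; fails for ".bad" inputs.
    await asyncio.sleep(0)
    if name.endswith(".bad"):
        raise ValueError(f"cannot process {name}")
    return f"content of {name}"


async def extract_all(names):
    """Return (name, content, error) triples in input order."""
    # return_exceptions=True keeps one failure from cancelling the others.
    outcomes = await asyncio.gather(
        *(fake_extract(name) for name in names), return_exceptions=True
    )
    report = []
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, Exception):
            report.append((name, None, str(outcome)))
        else:
            report.append((name, outcome, None))
    return report
```

Pairing each input with either a result or an error message lets the caller log failures and still use every successful item.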

Kreuzberg employs a smart approach to PDF text extraction:

Searchable Text Detection: First attempts to extract text directly from searchable PDFs using pdfium2.
Text Validation: Extracted text is analyzed for corruption by checking for:
- Control and non-printable characters
- Unicode replacement characters (�)
- Zero-width spaces and other invisible characters
- Empty or whitespace-only content
Automatic OCR Fallback: If the extracted text appears corrupted or if the PDF is image-based, automatically falls back to OCR using Tesseract.

This approach works well for searchable PDFs and standard text documents. For complex OCR (e.g., handwriting, photographs), use a specialized tool.
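
The validation checks listed above can be approximated with a short heuristic. The sketch below is not Kreuzberg's actual implementation: the function name `looks_corrupted`, the set of zero-width characters, and the 5% threshold are all assumptions chosen for illustration.

```python
import unicodedata

REPLACEMENT_CHAR = "\ufffd"
# Assumption: a small, non-exhaustive set of invisible characters.
ZERO_WIDTH_CHARS = {"\u200b", "\u200c", "\u200d", "\ufeff"}
# Ordinary whitespace controls are not treated as suspicious.
ALLOWED_CONTROL_CHARS = {"\n", "\r", "\t"}


def looks_corrupted(text: str, max_suspicious_ratio: float = 0.05) -> bool:
    """Return True if the text is empty or has too many suspicious characters."""
    if not text.strip():
        return True  # Empty or whitespace-only content
    suspicious = 0
    for ch in text:
        if ch == REPLACEMENT_CHAR or ch in ZERO_WIDTH_CHARS:
            suspicious += 1
        elif ch not in ALLOWED_CONTROL_CHARS and unicodedata.category(ch) == "Cc":
            suspicious += 1  # Control character other than newline/tab
    return suspicious / len(text) > max_suspicious_ratio
```

A caller could use such a check to decide whether directly extracted text is trustworthy or whether an OCR pass is warranted; the threshold would need tuning against real documents.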

You can control PDF processing behavior using optional parameters:

from kreuzberg import extract_file


async def process_pdf():
    # Default behavior: auto-detect and use OCR if needed
    # By default, max_processes=1 for safe operation
    result = await extract_file("document.pdf")
    print(result.content)

    # Force OCR even for searchable PDFs
    result = await extract_file("document.pdf", force_ocr=True)
    print(result.content)

    # Control OCR concurrency for large documents
    # Warning: High concurrency values can cause system resource exhaustion
    # Start with a low value and increase based on your system's capabilities
    result = await extract_file(
        "large_document.pdf",
        max_processes=4  # Process up to 4 pages concurrently
    )
    print(result.content)

    # Process a scanned PDF (automatically uses OCR)
    result = await extract_file("scanned.pdf")
    print(result.content)

All extraction functions return an ExtractionResult or a list thereof (for batch functions). The ExtractionResult object is a NamedTuple:

content: The extracted text (str)
mime_type: Output format (“text/plain” or “text/markdown” for Pandoc conversions)
metadata: A metadata dictionary. Currently this dictionary is only populated when extracting documents using Pandoc.

from kreuzberg import extract_file, ExtractionResult, Metadata

async def process_document(path: str) -> tuple[str, str, Metadata]:
    # Access as a named tuple
    result: ExtractionResult = await extract_file(path)
    print(f"Content: {result.content}")
    print(f"Format: {result.mime_type}")

    # Or unpack as a tuple
    content, mime_type, metadata = await extract_file(path)
    return content, mime_type, metadata
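
For readers unfamiliar with NamedTuple, the access patterns above can be tried without installing anything. The `Result` class below is a hypothetical stand-in that mirrors the three documented fields; it is not the library's ExtractionResult, and `summarize` is an illustrative helper.

```python
from typing import Any, NamedTuple


class Result(NamedTuple):
    # Hypothetical stand-in mirroring the three documented fields.
    content: str
    mime_type: str
    metadata: dict[str, Any]


def summarize(result: Result) -> str:
    """Build a one-line summary using tuple unpacking."""
    content, mime_type, metadata = result
    # Metadata may be empty, so fall back to a placeholder title.
    title = metadata.get("title", "untitled")
    return f"{title} ({mime_type}): {len(content)} chars"
```

Since the metadata dictionary is only populated for some formats, reading it with `.get()` and a default avoids a KeyError on the rest.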

Kreuzberg provides comprehensive error handling through several exception types, all inheriting from KreuzbergError. Each exception includes helpful context information for debugging.

from kreuzberg import extract_file
from kreuzberg.exceptions import (
    ValidationError,
    ParsingError,
    OCRError,
    MissingDependencyError
)

async def safe_extract(path: str) -> str:
    try:
        result = await extract_file(path)
        return result.content

    except ValidationError as e:
        # Input validation issues
        # - Unsupported or undetectable MIME types
        # - Missing files
        # - Invalid input parameters
        print(f"Validation failed: {e}")

    except OCRError as e:
        # OCR-specific issues
        # - Tesseract processing failures
        # - Image conversion problems
        print(f"OCR failed: {e}")

    except MissingDependencyError as e:
        # System dependency issues
        # - Missing Tesseract OCR
        # - Missing Pandoc
        # - Incompatible versions
        print(f"Dependency missing: {e}")

    except ParsingError as e:
        # General processing errors
        # - PDF parsing failures
        # - Format conversion issues
        # - Encoding problems
        print(f"Processing failed: {e}")

    return ""

All exceptions include:

Error message
Context in the context attribute
String representation
Exception chaining
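
The combination of a context attribute, a readable string form, and exception chaining can be sketched with plain Python. The classes below are hypothetical stand-ins written for illustration; they are not the library's exception classes, and the exact string format is an assumption.

```python
from typing import Optional


class DemoError(Exception):
    """Hypothetical base error that carries a context dictionary."""

    def __init__(self, message: str, context: Optional[dict] = None) -> None:
        super().__init__(message)
        self.context = context or {}

    def __str__(self) -> str:
        # Include the context so log lines are self-explanatory.
        return f"{type(self).__name__}: {self.args[0]} (context: {self.context})"


class DemoParsingError(DemoError):
    """Hypothetical subclass for parsing problems."""


def parse_number(raw: str) -> int:
    try:
        return int(raw)
    except ValueError as exc:
        # "from exc" preserves the original error as __cause__.
        raise DemoParsingError("could not parse number", {"raw": raw}) from exc
```

Catching the base class and inspecting `context` and `__cause__` gives both the high-level failure and the underlying error in one place.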

This library is open to contribution. Feel free to open issues or submit PRs. It's better to discuss issues before submitting PRs to avoid disappointment.

Clone the repo
Install the system dependencies
Install the full dependencies with uv sync

Install the pre-commit hooks with:

pre-commit install && pre-commit install --hook-type commit-msg

Make your changes and submit a PR

This library uses the MIT license.
