Goldziher/kreuzberg: A text extraction library supporting PDFs, images, office documents and more




Kreuzberg is a Python library for text extraction from documents. It provides a unified async interface for extracting text from PDFs, images, office documents, and more.

  • Simple and Hassle-Free: Clean API that just works, without complicated configuration
  • Local Processing: No external API calls or cloud dependencies required
  • Resource Efficient: Lightweight processing without GPU requirements
  • Lightweight: Has few curated dependencies and a minimal footprint
  • Format Support: Comprehensive support for documents, images, and text formats
  • Modern Python: Built with async/await, type hints, and a functional-first approach
  • Permissive OSS: Kreuzberg and its dependencies have a permissive OSS license

Kreuzberg was built for RAG (Retrieval Augmented Generation) applications, focusing on local processing with minimal dependencies. It is designed for modern async applications, serverless functions, and dockerized applications.

1. Install the Python Package

    pip install kreuzberg

2. Install System Dependencies

Kreuzberg requires two system-level dependencies:

  • Pandoc, for document format conversion
  • Tesseract OCR, for image and PDF OCR

Please install these using their respective installation guides.
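Before running any extraction it can help to confirm that both tools are reachable. The following is a minimal sketch using only the standard library; the helper name `missing_system_dependencies` is hypothetical and not part of Kreuzberg, and it assumes the executables are named `pandoc` and `tesseract` on your PATH.

```python
import shutil


def missing_system_dependencies(binaries=("pandoc", "tesseract")):
    """Return the names from `binaries` that cannot be found on PATH."""
    return [name for name in binaries if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_system_dependencies()
    if missing:
        print(f"Missing system dependencies: {', '.join(missing)}")
    else:
        print("Pandoc and Tesseract were both found on PATH.")
```

This only checks that the executables exist; it does not verify their versions.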

Kreuzberg integrates:

  • PDF Processing:
    • pdfium2 for searchable PDFs
    • Tesseract OCR for scanned content
  • Document Conversion:
    • Pandoc for many document and markup formats
    • python-pptx for PowerPoint files
    • html-to-markdown for HTML content
    • calamine for Excel spreadsheets (with multi-sheet support)
  • Text Processing:
    • Smart encoding detection
    • Markdown and plain text handling

Document Formats

  • PDF (.pdf, both searchable and scanned)
  • Microsoft Word (.docx)
  • PowerPoint presentations (.pptx)
  • OpenDocument Text (.odt)
  • Rich Text Format (.rtf)
  • EPUB (.epub)
  • DocBook XML (.dbk, .xml)
  • FictionBook (.fb2)
  • LaTeX (.tex, .latex)
  • Typst (.typ)
  • HTML (.html, .htm)
  • Plain text (.txt) and Markdown (.md, .markdown)
  • reStructuredText (.rst)
  • Org-mode (.org)
  • DokuWiki (.txt)
  • Pod (.pod)
  • Troff/Man (.1, .2, etc.)

Data and Research Formats

  • Spreadsheets (.xlsx, .xls, .xlsm, .xlsb, .xlam, .xla, .ods)
  • CSV (.csv) and TSV (.tsv) files
  • OPML files (.opml)
  • Jupyter Notebooks (.ipynb)
  • BibTeX (.bib) and BibLaTeX (.bib)
  • CSL-JSON (.json)
  • EndNote and JATS XML (.xml)
  • RIS (.ris)

Image Formats

  • JPEG (.jpg, .jpeg, .pjpeg)
  • PNG (.png)
  • TIFF (.tiff, .tif)
  • BMP (.bmp)
  • GIF (.gif)
  • JPEG 2000 family (.jp2, .jpm, .jpx, .mj2)
  • WebP (.webp)
  • Portable anymap formats (.pbm, .pgm, .ppm, .pnm)

Kreuzberg provides both async and sync APIs for text extraction, including batch processing. The library exports the following main functions:

  • Single Item Processing:

    • extract_file(): Async function to extract text from a file (accepts a string path or pathlib.Path)
    • extract_bytes(): Async function to extract text from bytes (accepts a byte string)
    • extract_file_sync(): Synchronous version of extract_file()
    • extract_bytes_sync(): Synchronous version of extract_bytes()
  • Batch Processing:

    • batch_extract_file(): Async function to extract text from multiple files concurrently
    • batch_extract_bytes(): Async function to extract text from multiple byte contents concurrently
    • batch_extract_file_sync(): Synchronous version of batch_extract_file()
    • batch_extract_bytes_sync(): Synchronous version of batch_extract_bytes()
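To illustrate how an async function and its synchronous counterpart can relate, here is a small standard-library sketch. It is an assumption about the general pattern, not a description of how Kreuzberg implements its `_sync` functions; `extract_text` and `extract_text_sync` are hypothetical stand-ins.

```python
import asyncio


async def extract_text(data: bytes) -> str:
    """Hypothetical async extractor: yields to the event loop, then decodes."""
    await asyncio.sleep(0)
    return data.decode("utf-8")


def extract_text_sync(data: bytes) -> str:
    """Blocking counterpart that runs the coroutine to completion."""
    return asyncio.run(extract_text(data))
```

A wrapper like this cannot be called from inside an already running event loop, which is one reason to prefer the async functions in async applications.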

All extraction functions accept the following optional parameters for configuring OCR and performance:

  • language (default: “eng”): Specifies the language model for Tesseract OCR. This affects text recognition accuracy for non-English documents. Examples:
    • “eng” for English
    • “deu” for German
    • “fra” for French

Consult the Tesseract documentation for more information.

  • psm (Page Segmentation Mode, default: PSM.AUTO): Controls how Tesseract analyzes page layout. In most cases you do not need to change this to a different value.

Performance Configuration

  • max_processes (default: CPU count / 2): Maximum number of concurrent processes for Tesseract and Pandoc. Higher values can improve performance, but may cause resource exhaustion and deadlocks (especially for Tesseract).
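The idea of capping concurrency can be sketched with a semaphore. This is an illustration only, not Kreuzberg's internal code: `default_max_processes` and `run_bounded` are hypothetical names, and rounding the default down to a minimum of 1 is an assumption.

```python
import asyncio
import os


def default_max_processes() -> int:
    """Half the CPU count, never below 1 (the rounding is an assumption)."""
    return max(1, (os.cpu_count() or 1) // 2)


async def run_bounded(jobs, max_processes=None):
    """Run zero-argument coroutine functions with at most `max_processes` in flight."""
    semaphore = asyncio.Semaphore(max_processes or default_max_processes())

    async def guarded(job):
        # Each job waits here until a slot is free.
        async with semaphore:
            return await job()

    # gather() returns results in the same order as `jobs`.
    return await asyncio.gather(*(guarded(job) for job in jobs))
```

Starting with a low limit and raising it gradually makes it easier to spot the point where extra workers stop helping.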

from pathlib import Path
from kreuzberg import extract_file
from kreuzberg.extraction import ExtractionResult
from kreuzberg._tesseract import PSMMode, SupportedLanguage


# Basic file extraction
async def extract_document():
    # Extract from a PDF file with default settings
    pdf_result: ExtractionResult = await extract_file("document.pdf")
    print(f"Content: {pdf_result.content}")

    # Extract from an image with German language model
    img_result = await extract_file(
        "scan.png",
        language="deu",  # German language model
        psm=PSMMode.SINGLE_BLOCK,  # Treat as a single block of text
        max_processes=4  # Limit concurrent processes
    )
    print(f"Image text: {img_result.content}")

    # Extract from Word document with metadata
    docx_result = await extract_file(Path("document.docx"))
    if docx_result.metadata:
        print(f"Title: {docx_result.metadata.get('title')}")
        print(f"Author: {docx_result.metadata.get('author')}")


# Processing uploaded bytes
from kreuzberg import extract_bytes
from kreuzberg.extraction import ExtractionResult


async def process_upload(file_content: bytes, mime_type: str) -> ExtractionResult:
    """Process uploaded file content with known MIME type."""
    return await extract_bytes(
        file_content,
        mime_type=mime_type,
    )


# Example usage with different file types
async def handle_uploads(docx_bytes: bytes, pdf_bytes: bytes, image_bytes: bytes):
    # Process PDF upload
    pdf_result = await process_upload(pdf_bytes, mime_type="application/pdf")
    print(f"PDF content: {pdf_result.content}")
    print(f"PDF metadata: {pdf_result.metadata}")

    # Process image upload (will use OCR)
    img_result = await process_upload(image_bytes, mime_type="image/jpeg")
    print(f"Image text: {img_result.content}")

    # Process Word document upload
    docx_result = await process_upload(
        docx_bytes,
        mime_type="application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    )
    print(f"Word content: {docx_result.content}")
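Since extract_bytes() needs a MIME type, an upload handler that only has a file name must derive one first. Here is a minimal sketch using the standard `mimetypes` module. The helper `guess_mime_type` and its fallback value are assumptions for illustration, and nothing here guarantees that Kreuzberg supports every type the module can return.

```python
import mimetypes


def guess_mime_type(filename: str, fallback: str = "application/octet-stream") -> str:
    """Guess a MIME type from the file name, using `fallback` when unknown."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type or fallback
```

The guessed value could then be passed as the mime_type argument; rejecting the fallback value early avoids sending unknown content to the extractor.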

Kreuzberg supports efficient batch processing of multiple files or byte contents:

from pathlib import Path
from kreuzberg import batch_extract_file, batch_extract_bytes, batch_extract_file_sync


# Process multiple files concurrently
async def process_documents(file_paths: list[Path]) -> None:
    # Extract from multiple files
    results = await batch_extract_file(file_paths)
    for path, result in zip(file_paths, results):
        print(f"File {path}: {result.content[:100]}...")


# Process multiple uploaded files concurrently
async def process_uploads(contents: list[tuple[bytes, str]]) -> None:
    # Each item is a tuple of (content, mime_type)
    results = await batch_extract_bytes(contents)
    for (_, mime_type), result in zip(contents, results):
        print(f"Upload {mime_type}: {result.content[:100]}...")


# Synchronous batch processing is also available
def process_documents_sync(file_paths: list[Path]) -> None:
    results = batch_extract_file_sync(file_paths)
    for path, result in zip(file_paths, results):
        print(f"File {path}: {result.content[:100]}...")

Features:

  • Ordered results
  • Concurrent processing
  • Error handling per item
  • Async and sync interfaces
  • Same options as individual extraction
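Ordered results combined with per-item error handling can be sketched with `asyncio.gather`. The text above does not say whether the batch functions return or raise per-item errors, so treat this as one possible design rather than Kreuzberg's behavior; `process_all` and `demo_worker` are hypothetical.

```python
import asyncio


async def process_all(items, worker):
    """Run `worker` on every item concurrently.

    Results come back in input order; a failed item yields its exception
    object instead of aborting the whole batch.
    """
    return await asyncio.gather(
        *(worker(item) for item in items), return_exceptions=True
    )


async def demo_worker(text: str) -> str:
    """Hypothetical worker: upper-cases text and rejects empty input."""
    await asyncio.sleep(0)
    if not text:
        raise ValueError("empty input")
    return text.upper()
```

Callers can then check each result with isinstance(result, Exception) and report failures alongside the item that caused them.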

Kreuzberg employs a smart approach to PDF text extraction:

  1. Searchable Text Detection: First attempts to extract text directly from searchable PDFs using pdfium2.

  2. Text Validation: Extracted text is analyzed for corruption by checking for:

    • Control and non-printable characters
    • Unicode replacement characters (�)
    • Zero-width spaces and other invisible characters
    • Empty or whitespace-only content
  3. Automatic OCR Fallback: If the extracted text appears corrupted or if the PDF is image-based, automatically falls back to OCR using Tesseract.
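The validation and fallback steps above can be sketched as follows. This is a heuristic written for illustration, not Kreuzberg's actual check: the names `looks_corrupted` and `extract_with_fallback`, the set of invisible characters, and the 5% threshold are all assumptions.

```python
import unicodedata

# Characters treated as invisible in this sketch (zero-width space, joiners, BOM).
INVISIBLE = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}
REPLACEMENT = "\ufffd"


def looks_corrupted(text: str, threshold: float = 0.05) -> bool:
    """Return True if text is empty or has too many suspicious characters."""
    if not text.strip():
        return True
    suspicious = 0
    for char in text:
        if char in "\n\r\t":
            continue  # ordinary whitespace control characters are fine
        if char == REPLACEMENT or char in INVISIBLE:
            suspicious += 1
        elif unicodedata.category(char) in {"Cc", "Cf"}:
            suspicious += 1  # other control or format characters
    return suspicious / len(text) > threshold


def extract_with_fallback(read_text, run_ocr):
    """Use the direct text unless it looks corrupted, then fall back to OCR."""
    text = read_text()
    return run_ocr() if looks_corrupted(text) else text
```

Here `read_text` and `run_ocr` are caller-supplied callables standing in for the direct extraction and OCR steps.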

This approach works well for searchable PDFs and standard text documents. For complex OCR (e.g., handwriting, photographs), use a specialized tool.

You can control PDF processing behavior using optional parameters:

from kreuzberg import extract_file


async def process_pdf():
    # Default behavior: auto-detect and use OCR if needed
    # By default, max_processes=1 for safe operation
    result = await extract_file("document.pdf")
    print(result.content)

    # Force OCR even for searchable PDFs
    result = await extract_file("document.pdf", force_ocr=True)
    print(result.content)

    # Control OCR concurrency for large documents
    # Warning: High concurrency values can cause system resource exhaustion
    # Start with a low value and increase based on your system's capabilities
    result = await extract_file(
        "large_document.pdf",
        max_processes=4  # Process up to 4 pages concurrently
    )
    print(result.content)

    # Process a scanned PDF (automatically uses OCR)
    result = await extract_file("scanned.pdf")
    print(result.content)

All extraction functions return an ExtractionResult or a list thereof (for batch functions). The ExtractionResult object is a NamedTuple:

  • content: The extracted text (str)
  • mime_type: Output format (“text/plain” or “text/markdown” for Pandoc conversions)
  • metadata: A metadata dictionary. Currently this dictionary is only populated when extracting documents using Pandoc.

from kreuzberg import extract_file, ExtractionResult, Metadata


async def process_document(path: str) -> tuple[str, str, Metadata]:
    # Access as a named tuple
    result: ExtractionResult = await extract_file(path)
    print(f"Content: {result.content}")
    print(f"Format: {result.mime_type}")

    # Or unpack as a tuple
    content, mime_type, metadata = await extract_file(path)
    return content, mime_type, metadata
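Because the result is a NamedTuple, the usual tuple conveniences apply. The sketch below uses a stand-in class with the three documented fields so it runs without the library installed; `Result` is hypothetical and is not Kreuzberg's own class.

```python
import json
from typing import Any, NamedTuple


class Result(NamedTuple):
    """Stand-in with the three documented fields; not the library's class."""

    content: str
    mime_type: str
    metadata: dict[str, Any]


result = Result(content="Hello", mime_type="text/plain", metadata={})

# Attribute access and positional unpacking read the same data.
content, mime_type, metadata = result

# _asdict() gives a plain dict, which is convenient for JSON serialization.
serialized = json.dumps(result._asdict())
```

Serializing this way assumes the metadata values are themselves JSON-compatible.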

Kreuzberg provides comprehensive error handling through several exception types, all inheriting from KreuzbergError. Each exception includes helpful context information for debugging.

from kreuzberg import extract_file
from kreuzberg.exceptions import (
    ValidationError,
    ParsingError,
    OCRError,
    MissingDependencyError,
)


async def safe_extract(path: str) -> str:
    try:
        result = await extract_file(path)
        return result.content

    except ValidationError as e:
        # Input validation issues
        # - Unsupported or undetectable MIME types
        # - Missing files
        # - Invalid input parameters
        print(f"Validation failed: {e}")

    except OCRError as e:
        # OCR-specific issues
        # - Tesseract processing failures
        # - Image conversion problems
        print(f"OCR failed: {e}")

    except MissingDependencyError as e:
        # System dependency issues
        # - Missing Tesseract OCR
        # - Missing Pandoc
        # - Incompatible versions
        print(f"Dependency missing: {e}")

    except ParsingError as e:
        # General processing errors
        # - PDF parsing failures
        # - Format conversion issues
        # - Encoding problems
        print(f"Processing failed: {e}")

    return ""

All exceptions include:

  • Error message
  • Context in the context attribute
  • String representation
  • Exception chaining
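These four properties can be illustrated with a small self-contained hierarchy. The classes below are hypothetical stand-ins and do not reproduce KreuzbergError; the exact shape of the context mapping and the string format are assumptions.

```python
from typing import Any, Optional


class BaseExtractionError(Exception):
    """Hypothetical base class carrying a `context` mapping."""

    def __init__(self, message: str, context: Optional[dict[str, Any]] = None) -> None:
        super().__init__(message)
        self.context = context or {}

    def __str__(self) -> str:
        base = super().__str__()
        return f"{base} (context: {self.context})" if self.context else base


class ParseFailure(BaseExtractionError):
    """Hypothetical subclass for parsing problems."""


def parse_number(raw: str) -> int:
    """Parse an integer, wrapping failures in ParseFailure."""
    try:
        return int(raw)
    except ValueError as exc:
        # `from exc` keeps the original error available as __cause__.
        raise ParseFailure("could not parse number", {"raw": raw}) from exc
```

Catching the base class handles every subclass at once, while the context attribute and the chained cause preserve the details needed for debugging.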

This library is open to contribution. Feel free to open issues or submit PRs. It is better to discuss issues before
submitting PRs to avoid disappointment.

  1. Clone the repo

  2. Install the system dependencies

  3. Install the full dependencies with uv sync

  4. Install the pre-commit hooks with:

    pre-commit install && pre-commit install --hook-type commit-msg
  5. Make your changes and submit a PR

This library uses the MIT license.
