Kreuzberg is a Python library for text extraction from documents. It provides a unified async interface for extracting text from PDFs, images, office documents, and more.
- Simple and Hassle-Free: Clean API that just works, without complex configuration
- Local Processing: No external API calls or cloud dependencies required
- Resource Efficient: Lightweight processing without GPU requirements
- Lightweight: Has few curated dependencies and a minimal footprint
- Format Support: Comprehensive support for documents, images, and text formats
- Modern Python: Built with async/await, type hints, and a functional-first approach
- Permissive OSS: Kreuzberg and its dependencies have a permissive OSS license
Kreuzberg was built for RAG (Retrieval Augmented Generation) applications, focusing on local processing with minimal dependencies. It is designed for modern async applications, serverless functions, and dockerized applications.
Kreuzberg requires two system-level dependencies:

- Pandoc, used for document format conversion
- Tesseract OCR, used for image and PDF OCR

Please install these using their respective installation guides.
Kreuzberg integrates:

- PDF Processing:
  - `pdfium2` for searchable PDFs
  - Tesseract OCR for scanned content
- Document Conversion:
  - Pandoc for many document and markup formats
  - `python-pptx` for PowerPoint files
  - `html-to-markdown` for HTML content
  - `calamine` for Excel spreadsheets (with multi-sheet support)
- Text Processing:
  - Smart encoding detection
  - Markdown and plain text handling

Supported document formats:

- PDF (`.pdf`, both searchable and scanned)
- Microsoft Word (`.docx`)
- PowerPoint presentations (`.pptx`)
- OpenDocument Text (`.odt`)
- Rich Text Format (`.rtf`)
- EPUB (`.epub`)
- DocBook XML (`.dbk`, `.xml`)
- FictionBook (`.fb2`)
- LaTeX (`.tex`, `.latex`)
- Typst (`.typ`)

Supported markup and text formats:

- HTML (`.html`, `.htm`)
- Plain text (`.txt`) and Markdown (`.md`, `.markdown`)
- reStructuredText (`.rst`)
- Org-mode (`.org`)
- DokuWiki (`.txt`)
- Pod (`.pod`)
- Troff/Man (`.1`, `.2`, etc.)

Supported data and research formats:

- Spreadsheets (`.xlsx`, `.xls`, `.xlsm`, `.xlsb`, `.xlam`, `.xla`, `.ods`)
- CSV (`.csv`) and TSV (`.tsv`) files
- OPML files (`.opml`)
- Jupyter Notebooks (`.ipynb`)
- BibTeX (`.bib`) and BibLaTeX (`.bib`)
- CSL-JSON (`.json`)
- EndNote and JATS XML (`.xml`)
- RIS (`.ris`)

Supported image formats:

- JPEG (`.jpg`, `.jpeg`, `.pjpeg`)
- PNG (`.png`)
- TIFF (`.tiff`, `.tif`)
- BMP (`.bmp`)
- GIF (`.gif`)
- JPEG 2000 family (`.jp2`, `.jpm`, `.jpx`, `.mj2`)
- WebP (`.webp`)
- Portable anymap formats (`.pbm`, `.pgm`, `.ppm`, `.pnm`)

Kreuzberg provides both async and sync APIs for text extraction, including batch processing. The library exports the following main functions:

- Single Item Processing:
  - `extract_file()`: Async function to extract text from a file (accepts a string path or `pathlib.Path`)
  - `extract_bytes()`: Async function to extract text from bytes (accepts a byte string)
  - `extract_file_sync()`: Synchronous version of `extract_file()`
  - `extract_bytes_sync()`: Synchronous version of `extract_bytes()`
- Batch Processing:
  - `batch_extract_file()`: Async function to extract text from multiple files concurrently
  - `batch_extract_bytes()`: Async function to extract text from multiple byte contents concurrently
  - `batch_extract_file_sync()`: Synchronous version of `batch_extract_file()`
  - `batch_extract_bytes_sync()`: Synchronous version of `batch_extract_bytes()`

All extraction functions accept the following optional parameters for configuring OCR and performance:

- `language` (default: "eng"): Specifies the language model for Tesseract OCR. This affects text recognition accuracy for non-English documents. Examples:
  - "eng" for English
  - "deu" for German
  - "fra" for French

  Consult the Tesseract documentation for more information.
- `psm` (Page Segmentation Mode, default: `PSMMode.AUTO`): Controls how Tesseract analyzes page layout. In most cases you do not need to change this to a different value.
- `max_processes` (default: CPU count / 2): Maximum number of concurrent processes for Tesseract and Pandoc. Higher values can improve performance, but may cause resource exhaustion and deadlocks (especially for Tesseract).

    from pathlib import Path

    from kreuzberg import extract_file
    from kreuzberg.extraction import ExtractionResult
    from kreuzberg._tesseract import PSMMode, SupportedLanguage


    # Basic file extraction
    async def extract_document():
        # Extract from a PDF file with default settings
        pdf_result: ExtractionResult = await extract_file("document.pdf")
        print(f"Content: {pdf_result.content}")

        # Extract from an image with the German language model
        img_result = await extract_file(
            "scan.png",
            language="deu",  # German language model
            psm=PSMMode.SINGLE_BLOCK,  # Treat as a single block of text
            max_processes=4,  # Limit concurrent processes
        )
        print(f"Image text: {img_result.content}")

        # Extract from a Word document with metadata
        docx_result = await extract_file(Path("document.docx"))
        if docx_result.metadata:
            print(f"Title: {docx_result.metadata.get('title')}")
            print(f"Author: {docx_result.metadata.get('author')}")

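
The effect of a process limit such as `max_processes` can be illustrated without Tesseract at all. The sketch below is not Kreuzberg's implementation: `default_max_processes`, `run_bounded`, and `fake_ocr_page` are hypothetical names, and an `asyncio.Semaphore` stands in for whatever worker pool the library actually uses. It only shows how a cap keeps the number of simultaneously running jobs at or below the chosen value, assuming the documented default of half the CPU count.

```python
import asyncio
import os
from typing import Optional


def default_max_processes() -> int:
    """Mirror the documented default: half the CPU count, but at least 1."""
    return max(1, (os.cpu_count() or 1) // 2)


async def run_bounded(pages: list[str], max_processes: Optional[int] = None) -> tuple[list[str], int]:
    """Process pages with at most `max_processes` jobs in flight.

    Returns the results in input order plus the peak concurrency observed.
    """
    limit = max_processes or default_max_processes()
    semaphore = asyncio.Semaphore(limit)
    running = 0
    peak = 0

    async def fake_ocr_page(page: str) -> str:
        # Hypothetical stand-in for an expensive OCR call.
        nonlocal running, peak
        async with semaphore:
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)  # simulate work
            running -= 1
            return page.upper()

    results = await asyncio.gather(*(fake_ocr_page(p) for p in pages))
    return list(results), peak


if __name__ == "__main__":
    texts, peak_seen = asyncio.run(run_bounded(["a", "b", "c", "d"], max_processes=2))
    print(texts, "peak concurrency:", peak_seen)
```

Starting with a low limit and raising it while watching memory and CPU usage is usually safer than starting high.
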
Kreuzberg supports efficient batch processing of multiple files or byte contents.

Features:
- Ordered results
- Concurrent processing
- Error handling per item
- Async and sync interfaces
- Same options as single extraction

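
Two of the features listed above, ordered results and per-item error handling, can be sketched with plain `asyncio`. This is an illustration under stated assumptions, not the library's code: `fake_extract` and `batch_extract_sketch` are hypothetical stand-ins, and the real batch functions may report failures differently (for example by raising).

```python
import asyncio
from typing import Union


async def fake_extract(source: str) -> str:
    """Hypothetical stand-in for a single-item extraction call."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    if not source.endswith((".pdf", ".docx", ".png")):
        raise ValueError(f"unsupported file type: {source}")
    return f"text of {source}"


async def batch_extract_sketch(sources: list[str]) -> list[Union[str, BaseException]]:
    """Run all extractions concurrently; results keep the input order.

    With return_exceptions=True a failing item yields its exception in
    place, so one bad input does not hide the other results.
    """
    return await asyncio.gather(
        *(fake_extract(source) for source in sources),
        return_exceptions=True,
    )


if __name__ == "__main__":
    outcomes = asyncio.run(batch_extract_sketch(["a.pdf", "b.xyz", "c.docx"]))
    for outcome in outcomes:
        if isinstance(outcome, BaseException):
            print("failed:", outcome)
        else:
            print("ok:", outcome)
```
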
Kreuzberg uses a smart approach to PDF text extraction:

- Searchable Text Detection: First tries to extract text directly from searchable PDFs using `pdfium2`.
- Text Validation: Extracted text is analyzed for corruption by checking for:
  - Control and non-printable characters
  - Unicode replacement characters (�)
  - Zero-width spaces and other invisible characters
  - Empty or whitespace-only content
- Automatic OCR Fallback: If the extracted text appears corrupted or if the PDF is image-based, automatically falls back to OCR using Tesseract.

This approach works well for searchable PDFs and standard text documents. For complex OCR (e.g., handwriting, photographs), use a specialized tool.

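
The validation checks described above can be approximated in a few lines of standard-library Python. This is a minimal sketch, not Kreuzberg's actual heuristic: the function names, the set of invisible characters, and the 5% `threshold` are all assumptions chosen for illustration.

```python
import unicodedata

# Characters that render as nothing but often indicate a broken text layer.
INVISIBLE_CHARS = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}
# Ordinary whitespace controls that are expected in extracted text.
ALLOWED_CONTROL_CHARS = {"\n", "\r", "\t"}


def looks_corrupted(text: str, threshold: float = 0.05) -> bool:
    """Return True if `text` is empty or has too many suspicious characters.

    `threshold` (an assumed value) is the tolerated share of suspicious
    characters relative to the total length.
    """
    if not text.strip():
        return True  # empty or whitespace-only content

    suspicious = 0
    for char in text:
        if char in ALLOWED_CONTROL_CHARS:
            continue
        if char == "\ufffd" or char in INVISIBLE_CHARS:
            suspicious += 1  # replacement or invisible character
        elif unicodedata.category(char) in {"Cc", "Cn", "Co", "Cs"}:
            suspicious += 1  # control, unassigned, private-use, or surrogate
    return suspicious / len(text) > threshold


def extract_with_fallback(text_layer: str, run_ocr) -> str:
    """Keep the embedded text layer unless it looks corrupted, else call OCR."""
    return run_ocr() if looks_corrupted(text_layer) else text_layer
```

Here `run_ocr` is any zero-argument callable, so the expensive OCR step only runs when the text layer fails validation.
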
You can control PDF processing behavior using optional parameters:

    from kreuzberg import extract_file


    async def process_pdf():
        # Default behavior: auto-detect and use OCR if needed
        # By default, max_processes=1 for safe operation
        result = await extract_file("document.pdf")
        print(result.content)

        # Force OCR even for searchable PDFs
        result = await extract_file("document.pdf", force_ocr=True)
        print(result.content)

        # Control OCR concurrency for large documents
        # Warning: High concurrency values can cause system resource exhaustion
        # Start with a low value and increase based on your system's capabilities
        result = await extract_file(
            "large_document.pdf",
            max_processes=4,  # Process up to 4 pages concurrently
        )
        print(result.content)

        # Process a scanned PDF (automatically uses OCR)
        result = await extract_file("scanned.pdf")
        print(result.content)

All extraction functions return an `ExtractionResult` or a list thereof (for batch functions). The `ExtractionResult` object is a `NamedTuple` with these fields:

- `content`: The extracted text (str)
- `mime_type`: Output format ("text/plain" or "text/markdown" for Pandoc conversions)
- `metadata`: A metadata dictionary. Currently this dictionary is only populated when extracting documents using Pandoc.

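
Because the result is a `NamedTuple`, it supports both attribute access and tuple unpacking. The snippet below is a sketch using a stand-in class, `ExtractionResultSketch`, which is a hypothetical name that only mirrors the three documented fields; the field types and sample values are assumptions.

```python
from typing import Any, NamedTuple


class ExtractionResultSketch(NamedTuple):
    """Illustrative stand-in mirroring the three documented fields."""

    content: str
    mime_type: str
    metadata: dict[str, Any]


result = ExtractionResultSketch(
    content="# Title\n\nBody text",
    mime_type="text/markdown",
    metadata={"title": "Title"},
)

# Field access by name
print(result.mime_type)

# Tuple unpacking works because a NamedTuple is a tuple subclass
content, mime_type, metadata = result
print(metadata.get("title"))
```
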
Kreuzberg provides comprehensive error handling through several exception types, all inheriting from `KreuzbergError`. Each exception includes helpful context information for debugging.

All exceptions include:
- Error message
- Context in the `context` attribute
- String representation
- Exception chaining

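
The four properties listed above can be illustrated with a small exception hierarchy. The classes below are hypothetical (`SketchError`, `SketchParsingError`, and `parse_page_count` are not part of Kreuzberg); they only show one plausible way a base error can carry a `context` mapping, render it in its string form, and preserve the original cause.

```python
from typing import Any, Optional


class SketchError(Exception):
    """Hypothetical base error carrying a `context` mapping."""

    def __init__(self, message: str, context: Optional[dict[str, Any]] = None) -> None:
        super().__init__(message)
        self.context = context or {}

    def __str__(self) -> str:
        # Include the class name and, when present, the debugging context.
        base = f"{type(self).__name__}: {super().__str__()}"
        return f"{base} (context: {self.context})" if self.context else base


class SketchParsingError(SketchError):
    """Hypothetical subclass for failures while parsing a document."""


def parse_page_count(raw: str, file_path: str) -> int:
    """Parse a page count, wrapping low-level errors with context."""
    try:
        return int(raw)
    except ValueError as exc:
        # `from exc` keeps the original error reachable via __cause__
        raise SketchParsingError(
            "could not parse page count",
            context={"file_path": file_path, "raw": raw},
        ) from exc
```

Catching the base class is enough to handle every subclass, while `err.context` and `err.__cause__` remain available for logging.
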
This library is open to contribution. Feel free to open issues or submit PRs. It is better to discuss issues before submitting PRs to avoid disappointment.

- Clone the repo
- Install the system dependencies
- Install the full dependencies with `uv sync`
- Install the pre-commit hooks with: `pre-commit install && pre-commit install --hook-type commit-msg`
- Make your changes and submit a PR

This library uses the MIT license.