Liberating Structured Data from PDF Prisons

TWIX is an open-source data extraction tool that reconstructs structured data from documents accurately, at scale, and at low cost by inferring the visual template the documents share

Interactive LLM-Powered Data Processing with DocWrangler

DocWrangler is an IDE that provides instant feedback, visual exploration tools, and AI assistance for building and iterating on LLM-powered data processing pipelines

Reimagining LLM-Powered Unstructured Data Analysis with DocETL

DocETL is an open-source system for building LLM-powered data processing pipelines, offering declarative operators and powerful optimization for complex document analysis tasks

Lightweight Nudges for More Accurate Retrieval in RAG Pipelines

Make your retrieval pipelines more effective with this novel and lightweight fine-tuning approach