Agentic Document Parsing for Real-World Enterprise Documents

  • Type: Bachelor's thesis / Master's thesis
  • Date: Immediately
  • Supervisor: Moritz Diener

    Motivation

    Document parsing is a core building block for AI systems that work with real-world documents. However, parsing real enterprise documents remains challenging: tables may contain merged cells and hierarchical headers, charts require exact data-point extraction, multi-column layouts break reading order, and semantically meaningful formatting such as strikethrough or superscripts is often lost. These issues are particularly critical when parsed documents are used by downstream AI systems for reasoning, search, or automation.

    Recent benchmarks highlight that current parsers often perform well on some dimensions while failing on others. This suggests that robust document parsing may require more adaptive systems that can analyze a document, identify its challenges, and dynamically choose suitable extraction strategies.

    Background

    This thesis will use the ParseBench dataset and benchmark framework as its empirical basis. The benchmark focuses on parsing real-world enterprise documents and evaluates dimensions such as tables, charts, content faithfulness, semantic formatting, and visual grounding. It therefore provides a realistic foundation for studying the weaknesses of existing systems and developing improved parsing approaches.

    Goal

    The goal of this thesis is to develop and evaluate an improved approach for parsing and extracting information from real-world documents. The work should use the ParseBench dataset and benchmark framework to identify weaknesses of existing approaches and to design, implement, and evaluate an improved parsing pipeline.

    A particular focus of this thesis can be on agentic document parsing, i.e. parsing systems that do not rely on a single fixed extraction step, but instead decompose the task into multiple reasoning and processing steps. Such a system could, for example, detect the relevant document regions, classify the type of content, select specialized tools or models for tables, charts, or text blocks, and iteratively refine the extracted result.
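
    As a rough illustration of this idea, the following minimal Python sketch routes each detected region to a specialized extractor and retries until a confidence threshold is reached. All names (Region, Extraction, extract_text, extract_table, parse_document), the region labels, and the confidence values are hypothetical placeholders chosen for illustration; they are not part of ParseBench or of any existing parser, and real extractors would call OCR, layout, or vision-language models instead.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Region:
    kind: str      # hypothetical label, e.g. "table", "chart", "text"
    content: str   # stand-in for a cropped page region


@dataclass
class Extraction:
    kind: str
    output: str
    confidence: float


def extract_text(region: Region, attempt: int) -> Extraction:
    # Placeholder: a real extractor would call an OCR or layout model.
    return Extraction("text", region.content.strip(), 1.0)


def extract_table(region: Region, attempt: int) -> Extraction:
    # Placeholder: splits rows on "|" and pretends that each retry
    # (e.g. with a stronger model) yields a higher confidence.
    rows = [line.split("|") for line in region.content.splitlines()]
    return Extraction("table", repr(rows), min(1.0, 0.5 + 0.25 * attempt))


# Tool selection: map each content type to a specialized extractor.
EXTRACTORS: Dict[str, Callable[[Region, int], Extraction]] = {
    "text": extract_text,
    "table": extract_table,
}


def parse_region(region: Region, threshold: float = 0.7,
                 max_attempts: int = 3) -> Extraction:
    """Refine one region until the confidence threshold is met."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    # Unknown content types fall back to plain text extraction.
    extractor = EXTRACTORS.get(region.kind, extract_text)
    best = extractor(region, 0)
    for attempt in range(1, max_attempts):
        if best.confidence >= threshold:
            break
        candidate = extractor(region, attempt)
        if candidate.confidence > best.confidence:
            best = candidate
    return best


def parse_document(regions: List[Region]) -> List[Extraction]:
    """Parse every region in reading order."""
    return [parse_region(region) for region in regions]
```

    In this sketch the "agentic" behavior is reduced to two decisions: which extractor to use for a region and whether to retry. A thesis could replace either decision with a learned or LLM-based component and measure the effect on the benchmark.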

    The exact scope can be adapted depending on the student’s interests and background.

    Possible Thesis Directions

    1. Table Parsing for Real-World Documents
    This option focuses on improving parsing quality for complex tables, for example tables with merged cells, hierarchical headers, or multi-page structures. The thesis may develop a specialized method for table extraction, normalization, or post-processing.

    2. Chart and Figure Extraction for AI-ready Document Understanding
    This option focuses on extracting structured information from charts and visual elements. The goal is to improve exact data-point extraction, label matching, and conversion into machine-readable representations.

    3. Holistic Document Parsing System
    This option takes a broader view and aims to build or improve a system that balances multiple parsing capabilities at once, such as tables, charts, content faithfulness, and visual grounding.

    4. Agentic / Agent-based Document Parsing
    This option investigates whether an agentic system can improve document parsing on complex real-world documents. Possible approaches include:

    • decomposing parsing into multiple subtasks such as layout analysis, region selection, table extraction, chart interpretation, and result verification,
    • dynamically choosing between specialized tools or models depending on document type or detected content,
    • iteratively refining uncertain extraction results,
    • using feedback or self-evaluation steps to improve faithfulness and structural correctness.

    The central research question is whether such an adaptive, agent-based approach can outperform static end-to-end pipelines, either on a specific subproblem or across multiple dimensions of the benchmark.

    5. Benchmark-driven Error Analysis and Parser Improvement
    This option focuses on systematically analyzing failure modes of existing parsers and deriving targeted improvement strategies based on benchmark results.

    Expected Contribution

    Depending on the chosen direction, the thesis is expected to contribute one or more of the following:

    • an implemented and evaluated prototype for improved document parsing,
    • a specialized method for one challenging subproblem such as tables or charts,
    • an agentic or agent-based parsing pipeline for adaptive document understanding,
    • a holistic system that balances several parsing dimensions,
    • a structured error analysis of current document parsing approaches on real-world documents.

    Methodological Approaches

    Possible methods include:

    • implementation and evaluation of parsing pipelines,
    • benchmark-based experimentation on ParseBench,
    • development of post-processing, reranking, or verification strategies,
    • comparison of model-based, rule-based, hybrid, and agentic approaches,
    • iterative multi-step extraction pipelines,
    • error analysis and ablation studies.

    For a Bachelor’s thesis, the work would usually focus on one clearly scoped subproblem or one agentic component. For a Master’s thesis, the topic could be extended toward a broader adaptive system or a more comprehensive evaluation across several benchmark dimensions.

    Requirements / Recommended Profile

    The topic is suitable for students with an interest in one or more of the following areas:

    • document AI, OCR, or information extraction,
    • machine learning and multimodal AI,
    • natural language processing,
    • agentic systems and LLM-based orchestration,
    • evaluation and benchmarking of AI systems,
    • software development and experimentation.

    Experience with Python is expected. Prior experience with machine learning, document analysis, vision-language models, or agent-based systems is helpful but not strictly required.

    Contact

    If you are interested, please send a current transcript of records, a short CV, and a brief statement of motivation to Moritz Diener (moritz.diener@kit.edu). The exact thesis scope can be discussed and adapted based on your interests and methodological background.

    Starting Point

    • https://www.llamaindex.ai/blog/parsebench
    • https://huggingface.co/datasets/llamaindex/ParseBench
    • https://arxiv.org/pdf/2604.08538