Optimizing Data Selection for Fine-tuning Language Models on Specialized Tasks

Background

Large language models (LLMs) have revolutionized natural language
processing, demonstrating impressive performance across diverse
applications. While these models offer tremendous potential for
organizations, many specialized tasks remain beyond the capabilities of
off-the-shelf LLMs. For example, automated contract analysis requires
extracting specific clauses and identifying non-standard terms—
capabilities not adequately addressed by general models. Fine-tuning
open-source models presents a viable solution, but this approach faces a
critical challenge: the necessary training data is often fragmented across
organizational databases and public datasets, hindering effective model
adaptation.

Objective

This thesis aims to develop effective methods for identifying and selecting the most suitable data from fragmented sources to create targeted datasets for fine-tuning language models on specialized organizational tasks.

To this end, it will investigate techniques to evaluate, filter, and combine data from various sources into optimized training datasets. The research will focus on how different data selection strategies affect model performance on specific organizational use cases. By systematically analyzing the relationship between data characteristics and model performance, we aim to establish guidelines for efficient data curation that maximizes task-specific capabilities while minimizing training resources.

In this thesis you will:

This work will contribute to bridging the gap between general-purpose LLMs and their effective deployment in specialized organizational applications. Potential approaches to the thesis are to:

  • Conduct a thorough review of active learning approaches, combining these insights with retrieval-augmented generation (RAG) systems to source potential training instances.
  • Design a data selection framework that automates the identification and selection of the most relevant training examples for a given task, incorporating both task representation and instance suitability (see the sketch after this list).
  • Implement this framework in a controlled environment, enabling multiple experiments and systematic analysis of data selection strategies.
  • Assess the performance of the proposed framework on key metrics (e.g., accuracy, propensity for hallucination) and compare the results against baseline methods, such as training on the entire dataset or on random subsets.
  • Quantitatively and qualitatively analyze the trade-offs between different data selection strategies to determine the most impactful approaches.
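
To make the framework idea concrete, below is a minimal sketch of similarity-based instance selection in Python. Everything here is an illustrative assumption rather than the method to be developed in the thesis: the hashed bag-of-words embedding is a stand-in for a real sentence encoder, and the seed examples, candidate pool, and function names are invented for the example.

    import random

    import numpy as np

    def embed(text: str, dim: int = 512) -> np.ndarray:
        """Toy hashed bag-of-words embedding; a stand-in for a real encoder."""
        vec = np.zeros(dim)
        for token in text.lower().split():
            vec[hash(token) % dim] += 1.0
        norm = np.linalg.norm(vec)
        return vec / norm if norm > 0 else vec

    def select_top_k(candidates, seeds, k):
        """Rank candidate training instances by similarity to a task centroid."""
        # Task representation: mean embedding of a few seed examples.
        task_repr = np.mean([embed(s) for s in seeds], axis=0)
        # Instance suitability: dot product with the centroid; the embeddings
        # are L2-normalized, so this ranks candidates by cosine similarity.
        scores = [float(embed(c) @ task_repr) for c in candidates]
        ranked = sorted(zip(scores, candidates), reverse=True)
        return [text for _, text in ranked[:k]]

    # Hypothetical usage for the contract-analysis example above.
    seeds = ["identify the termination clause", "extract the liability cap"]
    pool = [
        "find the indemnification clause in this agreement",
        "recipe for sourdough bread",
        "flag non-standard payment terms",
        "weather forecast for tomorrow",
    ]
    print(select_top_k(pool, seeds, k=2))
    print(random.sample(pool, k=2))  # random-subset baseline, as above

In the thesis itself, the toy embedding would likely be replaced by a pretrained encoder, and the suitability score could combine similarity with diversity or data-quality signals.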

Details

  • Start: Immediately
  • Duration: 6 months
  • Language: preferably English (German language skills will help when conducting interviews)

How to Apply

We offer you a challenging research topic, close supervision from two research associates, and the opportunity to develop both practical and theoretical skills. If you are interested, please send a current transcript of records, a short CV, and a brief statement of motivation (2-3 sentences) to Daniel Hendriks and Leopold Müller: