About Me
As a PhD candidate at Radboud University, under the supervision of Faegheh Hasibi (Radboud University) and Evangelos Kanoulas (University of Amsterdam), I study LLMs and their role in Retrieval-Augmented Generation (RAG) systems. My research is part of the LESSEN project, which aims to develop a Dutch national chatbot to be deployed by major companies such as KPN, bol.com, and Albert Heijn. I am also a guest researcher at KPN, where I work on applying uncertainty estimation techniques to improve the trustworthiness of RAG systems in real-world scenarios.
My research interests lie in data generation using LLMs, as well as enhancing the reliability and interpretability of RAG through uncertainty estimation. I have over eight years of experience in data engineering, back-end development, and software engineering, and have published multiple papers at prestigious conferences during my PhD and MSc.
I recently began a collaboration with Mohammad Aliannejadi (University of Amsterdam) and Hamed Zamani (University of Massachusetts Amherst) to create challenging benchmarks for evaluating RAG systems. My passion lies in pushing the boundaries of information retrieval and LLM-based systems, while also ensuring their practical impact in real-world applications.
For additional information about my background, see my complete CV here.
Current Focus
One of our primary goals is to propose an uncertainty estimation method for RAG-based reasoning systems. In such systems, the model generates a sequence of reasoning steps and search queries that ultimately lead to a final answer. Our objective is to define an uncertainty score for the final answer, which can serve as a proxy for evaluating the system’s reliability.
Benchmark for High Recall RAG
In progress ...
Our aim is to create a RAG benchmark dataset that poses the challenge of total or high-recall retrieval. The dataset should contain queries that can be answered correctly only by retrieving all relevant and necessary documents. This project is currently in its early stages.
Publications
LLMs are valued for their strong performance across various tasks, but they also produce inaccurate or misleading outputs. Uncertainty Estimation (UE) quantifies the model’s confidence and helps users assess response reliability. However, existing UE methods have not been thoroughly examined in scenarios like RAG, where the input prompt includes non-parametric knowledge. This paper shows that current UE methods cannot reliably assess correctness in the RAG setting. We further propose an axiomatic framework to identify deficiencies in existing methods and guide the development of improved approaches. Our framework introduces five constraints that an effective UE method should meet after incorporating retrieved documents into the LLM’s prompt. Experimental results reveal that no existing UE method fully satisfies all the axioms, explaining their suboptimal performance in RAG. We further introduce a simple yet effective calibration function based on our framework, which not only satisfies more axioms than baseline methods but also improves the correlation between uncertainty estimates and correctness.
LLMs memorize a vast amount of factual knowledge, exhibiting strong performance across diverse tasks and domains. However, it has been observed that the performance diminishes when dealing with less-popular or low-frequency concepts and entities, for example in domain-specific applications. The two prominent approaches to enhance the performance of LMs on low-frequency topics are RAG and fine-tuning (FT) over synthetic data. This paper explores and evaluates the impact of RAG and FT on customizing LMs in handling low-frequency entities on question answering tasks. We conduct extensive experiments on twelve LMs of varying size and type and different FT methods, data augmentation, and retrieval models. Our findings indicate that while FT boosts the performance across entities of varying popularity, RAG surpasses FT by a large margin, particularly for the least popular factual knowledge. Additionally, the success of both RAG and FT approaches is amplified by improving retrieval and data augmentation techniques. Fine-tuning, while beneficial for small LMs, requires extensive resources. To address this issue, we propose the new Stimulus RAG approach that surpasses the effectiveness of fine-tuning-based approaches, thereby eliminating the need for the costly data augmentation and fine-tuning step for enriching LMs with less popular factual knowledge.
Recent advancements in conversational systems have significantly enhanced human-machine interactions across various domains. However, training these systems is challenging due to the scarcity of specialized dialogue data. Traditionally, conversational datasets were created through crowdsourcing, but this method has proven costly, limited in scale, and labor-intensive. As a solution, the development of synthetic dialogue data has emerged, utilizing techniques to augment existing datasets or convert textual resources into conversational formats, providing a more efficient and scalable approach to dataset creation. In this survey, we offer a systematic and comprehensive review of multi-turn conversational data generation, focusing on three types of dialogue systems: open domain, task-oriented, and information-seeking. We categorize the existing research based on key components like seed data creation, utterance generation, and quality filtering methods, and introduce a general framework that outlines the main principles of conversation data generation systems. Additionally, we examine the evaluation metrics and methods for assessing synthetic conversational data, address current challenges in the field, and explore potential directions for future research. Our goal is to accelerate progress for researchers and practitioners by presenting an overview of state-of-the-art methods and highlighting opportunities to further research in this area.
Incorporating information from other languages can improve the results of tasks in low-resource languages. A powerful method of building functional NLP systems for low-resource languages is to combine multilingual pre-trained representations with cross-lingual transfer learning. In general, however, shared representations are learned separately, either across tasks or across languages. This paper proposes a meta-learning approach for natural language inference in Persian. The meta-learning alternately uses information from a different task (such as QA in Persian) or from another language (such as natural language inference in English). We also investigate the role of a task augmentation strategy for forming additional high-quality tasks. We evaluate the proposed method using four languages and an auxiliary task. The proposed model consistently outperforms the baseline approach, improving accuracy by roughly six percent. We also examine the effect of finding appropriate initial parameters using zero-shot evaluation and CCA similarity.
Tutorials & Talks
SIGIR 2025
Enhancing Knowledge Injection in Large Language Models for Efficient and Trustworthy Responses
I presented my doctoral consortium paper at SIGIR 2025, where we discussed that LLMs excel across various tasks but often generate incorrect or inappropriate outputs due to knowledge limitations in specific domains. Augmenting LLMs with additional knowledge is a promising solution for generating correct responses. This can be achieved through parametric methods, which involve integrating knowledge directly into the model’s parameters, or non-parametric methods, which incorporate external information such as retrieved text chunks. However, challenges persist regarding the methods, content, and timing of integrating this knowledge. This study aims to identify optimal strategies for incorporating knowledge into LLMs to ensure trustworthy responses. Key research questions include comparing parametric and non-parametric knowledge injection methods, utilizing uncertainty to predict the accuracy of LLM-generated responses, and determining when LLMs should incorporate external knowledge. By addressing these areas, our research seeks to enhance LLM response generation and pave the way for conversational information-seeking systems that meet users’ information needs with reliable and transparent responses.
We presented the second version of our tutorial at WWW 2024, where we discussed that advancements in conversational systems have revolutionized information access, surpassing the limitations of single queries. However, developing dialogue systems requires a large amount of training data, which is a challenge in low-resource domains and languages. Traditional data collection methods like crowdsourcing are labor-intensive and time-consuming, making them ineffective in this context. Data augmentation (DA) is an effective approach to alleviate the data scarcity problem in conversational systems. This tutorial provides a comprehensive and up-to-date overview of DA approaches in the context of conversational systems. It highlights recent advances in conversation augmentation, open domain and task-oriented conversation generation, and different paradigms of evaluating these models. We also discuss current challenges and future directions to help researchers and practitioners further advance the field.
We presented the first version of our tutorial at CIKM 2023.
I presented our SIGIR-AP 2024 paper titled “Fine tuning vs. retrieval augmented generation for less popular knowledge” at ICTOpen in April 2025.
A Little More About Me
To be written!