Developing a Domain-Specific Scientific Literature Chatbot

Mentor Areas

Machine learning, materials science, energy and sustainability

Description:

The rapid advancement of large language models (LLMs) presents a transformative opportunity for scientific research, particularly in how researchers access and interact with scientific literature. Inspired by a recent paper on domain-specific chatbots (K. Yager, Digital Discovery, 2023, 2, 1850-1861, DOI: https://doi.org/10.1039/D3DD00112A), this research project aims to recreate and extend that approach to aid research efforts at the Laboratory for Research on the Structure of Matter.

The core objective is to develop a chatbot that can search scientific publications and extract relevant information from them. Drawing directly on the methodology outlined in the Digital Discovery paper, the project will focus on implementing a text-embedding-based approach to contextual scientific document retrieval. Students will explore how scientific documents are converted into semantically meaningful vector representations, which allow passages to be retrieved by meaning rather than by exact keyword match.

Our primary research questions will investigate the effectiveness of text embedding techniques in scientific document analysis. We will examine how different chunking strategies impact the chatbot's ability to retrieve relevant information, test various embedding models, and assess the system's performance across different scientific domains. Students will have the opportunity to replicate the paper's key experiments, including semantic space visualization using t-SNE and comparative analysis of document retrieval methods.
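As an illustration of how chunking strategies or embedding models might be compared, the sketch below computes recall@k over a set of test queries. The choice of recall@k, the function names, and the chunk identifiers are assumptions made for this example; the posting does not specify which evaluation metric the project will use.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear among the top-k retrieved chunks."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    relevant = set(relevant_ids)
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(runs, k):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs, one pair per query."""
    scores = [recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs]
    return sum(scores) / len(scores)


# Hypothetical results for two test queries: ranked chunk IDs returned by the
# retriever, paired with the chunk IDs a human judged to be relevant.
example_runs = [
    (["c1", "c4", "c2"], ["c1", "c2"]),
    (["c9", "c3"], ["c3"]),
]
print(mean_recall_at_k(example_runs, k=2))
```

Running the same set of queries against indexes built with different chunk sizes or embedding models would give one score per configuration, which is one simple way to make the comparison quantitative.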

The methodology will closely follow the original paper's approach, involving several critical steps. Students will first implement a PDF parsing system using Grobid to extract clean text from scientific publications. They will then develop a text chunking mechanism with overlapping segments, compute text embeddings using models like OpenAI's text-embedding-ada-002, and create a database for semantic retrieval. An additional focus will be on experimenting with different prompt engineering techniques, including summarization strategies and context-augmentation methods.
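The chunking and retrieval steps described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the word-based chunk sizes are arbitrary, and `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model such as text-embedding-ada-002, used here only so the example runs without external services.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=50, overlap=10):
    """Split text into word-based chunks; consecutive chunks share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


def toy_embed(text):
    """ASSUMPTION: stand-in for a real embedding model (sparse word-count vector)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, chunks, top_k=3):
    """Return the top_k (score, chunk) pairs most similar to the query."""
    query_vec = toy_embed(query)
    scored = [(cosine_similarity(query_vec, toy_embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical example chunks standing in for parsed publication text.
example_chunks = [
    "x-ray scattering of polymer films",
    "battery electrode degradation",
    "neutron scattering instrument",
]
best_score, best_chunk = retrieve("polymer x-ray scattering", example_chunks, top_k=1)[0]
print(best_chunk)
```

In a real pipeline, the chunks would come from the parsed PDF text, the embeddings would be computed once and stored in a database, and the retrieved chunks would be inserted into the LLM prompt as context.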

Beyond text analysis, the project will also explore the image embedding component demonstrated in the paper. Students will implement CLIP-based image similarity search, allowing semantic retrieval of scientific figures and experimental images. This aspect offers an exciting opportunity to work with multi-modal AI technologies and understand how visual semantic understanding can be applied to scientific research.

Preferred Qualifications

Knowledge of Python and large language models.

Project Website

Learn more about the researcher and/or the project here.
https://doi.org/10.1039/D3DD00112A

Details:

Preferred Student Year

Second-Year, Junior, Senior

Academic Term

Fall, Spring

I prefer to have students start during the above term(s).

Volunteer

Yes

Yes indicates that faculty are open to volunteers.

Paid

Yes indicates that faculty are open to paying students they engage in their research, regardless of their work-study eligibility.