Exploring NLP in healthcare: An interview with Klaudia Kantor, Data Scientist at Roche
Search Blog

Exploring NLP in healthcare: An interview with Klaudia Kantor, Data Scientist at Roche
Mar 14, 2025
This interview with Klaudia Kantor, a data scientist at Roche, explores how NLP is transforming clinical trials and patient care—bridging biomedical expertise with cutting-edge AI to make healthcare smarter and more accessible.
Introduction
The intersection of healthcare and data science is becoming increasingly important, especially in the realm of Natural Language Processing (NLP). To explore this evolving field, we spoke with Klaudia, a data scientist specializing in NLP at Roche, whose journey from biomedical engineering to data science highlights the dynamic career paths available in this space.
From biomedical engineering to NLP: A unique career path
Klaudia’s academic background began in biomedical engineering, driven by her dual passion for medicine and the exact sciences, such as physics. However, during her studies, she realized that designing medical implants or devices was not the direction she wanted to pursue. Instead, she found herself drawn to programming, data analysis, and machine learning:
“To deepen my knowledge, I completed a Data Science bootcamp, where I learned the basics of machine learning. At the same time, I was finishing my studies and actively searching for an internship that would allow me to merge my interests in biomedicine and data analysis. That’s how I found Roche, where I started a one-year internship as a Data Scientist. I enjoyed it so much that I’m still with the company today.”
She began her NLP journey with the somewhat tedious work of parsing PDFs and writing regular expressions. Wanting to explore more exciting aspects of NLP, she independently experimented with, text classification, sentiment analysis, vectorization, and models such as Word2Vec and FastText:
“I was fascinated by the idea of representing text in vector space, where words with similar meanings are positioned close to each other. I saw the immense potential of NLP and its wide range of applications. Unlike image analysis, which felt more abstract, NLP was tangible—we interact with text daily, making it easier to manipulate and interpret model results.”
Note: While traditional vector-based models like Word2Vec and FastText have played a foundational role in NLP and remain useful in some applications, recent advancements have led to a growing focus on transformer-based models (e.g., BERT and its biomedical variants, or sentence transformers). However, for some use cases that do not require such advanced embeddings, simpler approaches like even TF-IDF remain effective and widely used. These newer methods provide better contextual understanding and improved accuracy, making them the preferred choice for modern NLP applications.
Trends in NLP for healthcare and clinical trials
Given Klaudia’s extensive work in NLP, we asked about the most significant trends shaping data science in the medical field, particularly in clinical trials. She highlighted the rise of Generative AI (GenAI) and large language models (LLMs) as transformative forces across industries, including healthcare.
“Many traditional NLP approaches, such as encoders and vector-based classification models, are now being replaced by generative models because they often yield significantly better results. However, in the biomedical sector, data privacy regulations pose a major barrier to widespread adoption. The best models are hosted in the cloud by providers and accessible via API, but transmitting sensitive biomedical data externally is problematic.”
Because of this, organizations are actively seeking secure deployment strategies for LLMs, such as internal hosting to maintain compliance while leveraging advanced NLP capabilities.
NLP’s potential in clinical trials is particularly promising. Clinical trial protocols are lengthy and complex documents containing essential information about study objectives, procedures, and eligibility criteria. NLP models could help structure, analyze, and even generate these documents more efficiently.
“One of the most exciting applications is automated patient matching for clinical trials. Right now, this is one of the biggest challenges in research. NLP tools could search patient databases and match them to relevant trials based on eligibility criteria, significantly streamlining recruitment.”
Other key NLP applications in clinical trials include:
Generating clinical study reports automatically
Analyzing scientific literature for relevant insights
Comparing trial results with past studies
Traditional machine learning models had limited success in these areas due to the difficulty of collecting sufficient training data. In biomedical research, data scarcity is a significant hurdle, requiring expert curation. However, Generative AI models are changing the game by offering more flexibility and better performance in low-data environments.
Looking ahead, LLMs in biomedicine will likely focus on controlled implementations that balance innovation with privacy concerns. Retrieval-Augmented Generation (RAG) and fine-tuning models on proprietary datasets will play a crucial role in this evolution.
“Privacy and regulatory challenges will remain critical, so future advancements will focus on more controlled, in-house deployments within pharmaceutical companies and research institutions.”
NLP in Action: Improving clinical trial processes
Klaudia has worked on multiple projects applying NLP to enhance clinical trials. One example involved structuring patient eligibility criteria, a major pain point in clinical research.
“Currently, matching patients to clinical trials is a significant challenge because eligibility criteria are written in unstructured text formats, making them difficult to process and compare against patient records.”
To address this, Klaudia developed a tool where patients fill out a health questionnaire, and an algorithm searches an active clinical trial database to match them with relevant studies. A key component of this solution was converting unstructured eligibility criteria into structured, logical expressions.
“I used a GPT-based model to extract key entities from eligibility criteria and map them to core medical concepts. This structured approach makes criteria more uniform and easier to analyze.”
Although the project has yet to be deployed, it holds great potential for accelerating and improving patient recruitment.
Another upcoming pilot project applies NLP to medical device quality reporting:
“At Roche, we must document incidents related to medical devices we produce. These reports must meet strict regulatory requirements so that subsequent support teams can handle cases appropriately. We are developing an NLP system that automatically generates standardized incident summaries based on preliminary notes, which users can then review and approve.”
This solution enhances documentation quality and speeds up case processing, demonstrating the power of NLP beyond patient data analysis.
Advice for aspiring NLP specialists in healthcare and use case
For data scientists looking to enter healthcare-focused NLP roles, Klaudia emphasizes building a strong foundation in fundamental concepts, even in the age of Generative AI:
“Core principles like text embeddings, language models, and similarity estimation remain essential, even as LLMs become dominant. Understanding attention mechanisms, causal language modeling, and vector search techniques is crucial, especially for Retrieval-Augmented Generation (RAG).”
Technical skills include:
Named Entity Recognition (NER) – critical for extracting key terms from medical texts, such as drug names, diagnoses, and procedures.
Normalization & Standardization – medical texts often contain different variations of the same concept, so unifying terminology is essential for accurate analysis.
Explainability & Interpretability – transparency in NLP models is crucial in healthcare, as their decisions can impact patient outcomes. Techniques like SHAP (Shapley Additive Explanations) and LIME (Local Interpretable Model-agnostic Explanations) help explain why a model made a specific prediction, ensuring trust and regulatory compliance.
But soft skills are equally important:
Understanding clinical needs by working closely with doctors and researchers
Patience in handling complex texts, as medical documentation is highly specialized
Collaboration and communication, as NLP projects require interdisciplinary teamwork
Familiarity with modern tools is important, but even more crucial is understanding how they work and when to use them. Otherwise, one may encounter situations where off-the-shelf solutions fail, and without theoretical knowledge, finding a workaround can be challenging.
A great starting point is the NLP course on Coursera by DeepLearning.AI, and for those looking to deepen their understanding of the fundamentals, an excellent resource is the book 'Speech and Language Processing' by Jurafsky and Martin.
'Speech and Language Processing' by Jurafsky and Martin
NLP course on Coursera by deeplearning.ai
The fusion of NLP and healthcare is unlocking new opportunities to enhance clinical trials, patient care, and regulatory processes. With advancements in LLMs, secure AI deployments, and structured medical data processing, the future of NLP in biomedicine is poised for significant growth.
For aspiring data scientists, balancing technical expertise with industry-specific knowledge is the key to making an impact in this evolving field.