Research Scientist at the Deep Learning & NLP Group, specializing in Large Language Models (LLMs), Data Mining, and Low-Resource NLP for African and Arabic languages.
My research lies at the intersection of NLP, Data Mining, and Scalable Data Management, aiming to democratize AI for underrepresented communities.
Developing novel techniques to mine and manage massive-scale datasets for training LLMs. Creator of models like ARBERT, MARBERT, and SERENGETI.
Pioneering benchmarks for African, Arabic, and Indigenous languages. Leading the "Voice of a Continent" initiative to map speech technology frontiers.
Analyzing billions of data points using PySpark and custom pipelines. Automating data cleaning and validation for high-stakes domains.
Recent work accepted at EMNLP and ACL
F. Alwajih, S. Magdy, A. Mekki... A. Elmadany... M. Abdul-Mageed
I. Adebara, H. Toyin... A. Elmadany, M. Abdul-Mageed
ACL 2025 - Vienna, Austria
For the "Palm" dataset: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs.
IRCAI-UNESCO, May 2023
Afrocentric-NLP ranked among the global Top 10 outstanding projects.
OSACT5, June 2022
For "TURJUMAN: A Public Toolkit for Neural Arabic Machine Translation."
Google & AMD
Secured Google Education Research Grant (2025) and TPU/GPU compute grants for large-scale model training.