Scientific Publications
Scientific publications from my work at ETH Zurich and CSEM, focusing on machine learning and natural language processing. Featured work includes my arXiv article on extending state-of-the-art large language models to Polish.
Efficient Language Adaptive Pre-training: Extending State-of-the-Art Large Language Models for Polish
This study explores the potential of fine-tuning English foundation Large Language Models (LLMs) to generate Polish text.
The first step is Language Adaptive Pre-training (LAPT) on a high-quality 3.11 GB dataset of 276 million Polish tokens,
followed by further fine-tuning to solve nine tasks from the KLEJ benchmark.
Our trained model, Curie-7B-v1, not only achieves the lowest perplexity (3.02) among decoder-based Polish models when generating Polish text,
but also closely rivals the best Polish encoder-decoder models, trailing them by less than 2% on 8 of the 9 tasks.
Curie-7B-v1 needed only about 2-3% of a typical Polish training dataset to learn the language, and the LAPT completed in under five days
on a consumer GPU, highlighting the method's efficiency.
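As a rough illustration of the LAPT step, the sketch below shows how continued causal-language-model pre-training of an English foundation model on a Polish corpus might look with Hugging Face transformers. The base model name, corpus file, and hyperparameters are illustrative assumptions and do not reflect the exact configuration used for Curie-7B-v1.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Hypothetical base model and corpus file; not the paper's exact setup.
base_model = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base_model)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for padded batches
model = AutoModelForCausalLM.from_pretrained(base_model)

# Polish corpus as plain text, one document per line.
raw = load_dataset("text", data_files={"train": "polish_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

train_ds = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# Causal LM objective: the collator copies input_ids into labels (mlm=False).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="lapt-polish",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,  # large effective batch on a single consumer GPU
    num_train_epochs=1,
    learning_rate=2e-5,
    bf16=True,
    logging_steps=100,
    save_steps=5000,
)

Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=collator,
).train()
```

Perplexity, the metric quoted above, is the exponential of the mean token-level cross-entropy loss on held-out Polish text, so lower values mean the model predicts Polish tokens more confidently.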