Papers/Events | Authors | Time |
---|
I will present a systematic exploration of strategies for pretraining generative Large Language Models (LLMs) within the Galician-Portuguese diasystem. We investigate the impact of combining versus separating linguistic varieties during continued pretraining, the trade-offs between large-scale noisy data and smaller high-quality corpora, and the potential gains from incorporating instruction-based data during the training phase instead of in post-training (e.g., instruction tuning). In sum, I will offer some hints on how to improve an LLM, taking into account factors such as the quality and size of the corpus, the language varieties used, and the ability to understand instructions.
Pablo Gamallo defended his Linguistics thesis in 1998 at the Université Blaise Pascal (France), and since 2004 he has been working at the University of Santiago de Compostela (Spain), first as a Ramón y Cajal research fellow and now as Full Professor. He was a promoter and founding partner of Cilenis, a spin-off of the University of Santiago de Compostela focused on language technologies. He carries out his research as a member of the Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS). His main scientific interests are Natural Language Processing and Information Extraction. At present, he is the Coordinator of the European project HYBRIDS, an MSC Doctoral Network with nine European Beneficiaries, and is one of the Principal Investigators of Proxecto NÓS, an ambitious project aimed at building linguistic resources (corpora, datasets, and language models) for the Galician language.
After four decades in the study of (formal) languages, the time has come to pause, reflect on the insights gained, and weave an ontology that elegantly connects the myriad facets of this rich domain, laying the foundation for a deeper characterization. In this talk, I invite you to explore topics such as knowledge representations, language types and their subclasses, language affinities, blended language paradigms, problem-solving, programming, reasoning frameworks, computational thinking, grammars, quality, language processing, interpretation, compilation, and static analysis, among others. Through this journey, I will adopt a structured lens, presenting concise slices of the ontology. My aim is not to reach a singular destination but to meander thoughtfully through these interwoven themes, embracing the exploration itself as the reward.
Pedro Rangel Henriques holds a PhD in Formal Languages and Attribute Grammars from the University of Minho (UM), where he serves as a Full Professor in the Informatics Department. A dedicated researcher at the Algoritmi Research Center and a member of LASI, he leads the Language Processing Group. His teaching spans a diverse array of Computer Science courses, including Programming Languages and Paradigms, Compilers, Language and Grammar Engineering, Markup Languages for Document Annotation, Ontologies, and Introduction to Informatics. With an extensive supervisory record, Pedro has guided 19 PhD dissertations, over 100 master’s theses, and more than 100 undergraduate projects. His mentorship focuses on areas such as language and document processing, code analysis, program visualization and comprehension, computational thinking, ontologies, natural language processing, data mining, and data cleaning. A prolific scholar, he has co-authored one book, contributed over 15 book chapters, published more than 35 journal articles and 100 conference papers, and participated in 28 R&D projects, advancing the frontiers of language processing and related fields.