Gerardo Sierra
Engineering Institute, Universidad Nacional Autónoma de México (UNAM)
Abstract
In Natural Language Processing (NLP), identifying textual similarity, and paraphrase detection in particular, poses challenges for applications such as plagiarism detection, question answering, textual entailment, summarization, and machine translation evaluation, among others. To tackle this, numerous NLP techniques have been developed, including vector space models (based on terms), text alignment (based on linguistic knowledge), n-gram overlap (based on strings), machine learning algorithms, and deep learning architectures.
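As a rough illustration of two of the simpler techniques named above, the following Python sketch computes a term-based cosine similarity and a string-based n-gram overlap for a pair of texts. It is only a minimal sketch under stated assumptions: whitespace tokenization, raw term counts, and Jaccard overlap are choices made here for brevity, and the function names (tokens, cosine_similarity, ngram_overlap, similarity_scores) are hypothetical and do not come from the talk.

```python
import math
from collections import Counter


def tokens(text):
    # Assumption: lowercase whitespace tokenization; real systems use a proper tokenizer.
    return text.lower().split()


def cosine_similarity(a, b):
    # Term-based vector space model: each text becomes a vector of raw term counts.
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def ngram_overlap(a, b, n=2):
    # String-based overlap: Jaccard coefficient over the sets of word n-grams.
    def grams(text):
        toks = tokens(text)
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    ga, gb = grams(a), grams(b)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


def similarity_scores(a, b):
    # One score per type of similarity: cosine, unigram overlap, bigram overlap.
    return [cosine_similarity(a, b), ngram_overlap(a, b, 1), ngram_overlap(a, b, 2)]
```

Calling similarity_scores("the cat sat on the mat", "a cat sat on a mat") returns three values between 0 and 1, one per measure. Scores of this kind compare two texts as wholes and rely on surface forms only.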
Most of the datasets used for detecting and quantifying semantic textual similarity rely on pairs of texts treated as feature vectors, with each feature representing a score corresponding to a specific type of similarity. However, paraphrases can take different forms beyond sentence pairs, leading to a wide range of variations. Examples include the mixing or splitting of sentences, or even the deletion of certain elements while combining others.
As a result, paraphrase detection models need to take into account datasets that encompass more complex forms of paraphrasing. It also becomes necessary to account for other levels of linguistic analysis, such as discourse or stylometric analysis.
Short Bio
Gerardo Sierra is a researcher and head of the Language Engineering Group at Universidad Nacional Autónoma de México (UNAM). His work focuses on research and development in corpus linguistics and computational lexicography. In the former area, he has published the book "Introduction to corpus linguistics", which is a reference work in the linguistics and language technology community. He is the researcher who has made the most corpora available online in Mexico, using in-house technology that includes the GECO corpus manager. Among these corpora are the Corpus of Sexualities in Mexico, the RST Spanish Treebank and the Parallel Corpus of Mexican Languages. In computational lexicography, his work on onomasiological dictionaries, terminology extraction systems and definitional contexts is recognized worldwide.