Computer Science/Discrete Mathematics Seminar I

Language Generation in the Limit

Although current large language models are complex, the most basic specifications of the underlying language generation problem itself are simple to state: given a finite set of training samples from an unknown language, produce valid new strings from the language that don't already appear in the training data. Here we ask what we can conclude about language generation using only this specification, without any further properties or distributional assumptions. In particular, we consider models in which an adversary enumerates the strings of an unknown target language that is known only to come from a possibly infinite list of candidate languages, and we show that it is possible to give certain non-trivial guarantees for language generation in this setting. The resulting guarantees contrast dramatically with negative results due to Gold and Angluin in a well-studied model of language learning where the goal is to identify an unknown language from samples; the difference between these results suggests that identifying a language is a fundamentally different problem from generating from it.
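To make the interaction model concrete, here is a minimal toy sketch (not the algorithm from the talk) of the adversary/generator protocol. It assumes a simple hypothetical candidate family — the "threshold" languages L_k = {k, k+1, k+2, ...} over the integers — chosen so that a trivially correct generator exists; the names `generator` and `adversary_enumeration` are illustrative, not from the source.

```python
# Toy sketch of the generation-in-the-limit protocol, under an assumed
# candidate family: L_k = {k, k+1, k+2, ...} for k >= 0. The adversary
# enumerates an unknown target L_k; after each new sample, the generator
# must output a valid string (here, an integer) not yet seen.

def generator(samples):
    # For this particular family, any integer larger than every sample
    # seen so far is guaranteed to lie in the target language and to be
    # unseen -- so this trivial rule suffices for the toy example.
    return max(samples) + 1

def adversary_enumeration(k):
    # One possible adversarial enumeration of L_k = {k, k+1, ...}.
    n = k
    while True:
        yield n
        n += 1

samples = []
enum = adversary_enumeration(5)   # unknown target: L_5
for _ in range(4):
    samples.append(next(enum))
    g = generator(samples)
    # At each step, the output is valid and not among the samples so far.
    assert g >= 5 and g not in samples
```

The point of the sketch is only the shape of the protocol: samples arrive one at a time from an adversarial enumeration, and correctness is judged per step. For an arbitrary countable candidate list, no rule this simple works, which is what makes the positive result in the talk non-trivial.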

The talk will cover joint work with Sendhil Mullainathan and with Fan Wei.


Date & Time

April 21, 2025 | 10:30am – 11:30am

Location

Simonyi Hall 101 and Remote Access

Speakers

Jon Kleinberg, Cornell University