I am*	↓ ↓
E-mail*	↓ ↓

Back to search results

Garranties statistiques des méthodes de 'policy gradient' pour les bandits et l'apprentisage par renforcement // Statistical guarantees for Policy Gradient Methods in Bandits and Reinforcement Learning

Ref. ABG-139302 ADUM-75336	Thesis topic
2026-05-27

Université Grenoble Alpes

Workplace

Saint-Martin-d'Hères - Auvergne-Rhône-Alpes - France

Topic title

Garranties statistiques des méthodes de 'policy gradient' pour les bandits et l'apprentisage par renforcement // Statistical guarantees for Policy Gradient Methods in Bandits and Reinforcement Learning

Scientific expertise

Computer science

Keywords

apprentissage par renformcement, descente de gradient, approximation stochastique, bandits multi-bras
reinforcement learning, gradeitn descent, stochastric approximation, bandits

Topic description

Les méthodes de gradient de politique font partie des approches les plus utilisées en apprentissage par renforcement. Elles optimisent directement une règle de décision paramétrée au moyen de mises à jour par gradient, ce qui les rend flexibles et capables de passer à l'échelle. Pourtant, malgré leur succès empirique, leur compréhension théorique reste incomplète. Il est encore difficile de caractériser précisément quand ces méthodes explorent efficacement, convergent de manière fiable et obtiennent de bonnes garanties de performance à long terme.

L'objectif de ce projet de thèse est de faire progresser la compréhension théorique des méthodes de gradient de politique dans les bandits et l'apprentissage par renforcement. Le projet se concentrera sur des modèles simplifiés mais mathématiquement précis, afin d'identifier les mécanismes qui expliquent la convergence, la stabilité, l'exploration et le regret. Une question centrale est de déterminer si l'exploration induite par le gradient est suffisante, ou si des mécanismes supplémentaires sont nécessaires pour éviter des dynamiques d'apprentissage lentes ou instables.

Le cadre des bandits multi-bras constituera un point de départ naturel. Il fournit un modèle minimal de décision séquentielle sous incertitude, tout en capturant le compromis fondamental entre exploration et exploitation. Un exemple clé est l'algorithme Stochastic Gradient Bandit, qui peut être vu comme une méthode simple de gradient de politique avec paramétrisation softmax. Des travaux récents ont montré qu'il possède de bonnes propriétés de convergence, mais que son comportement en regret est plus délicat. Selon le pas d'apprentissage, il peut atteindre un regret logarithmique ou subir un regret polynomial. Cela met en évidence une distinction importante : un algorithme peut converger vers l'action optimale tout en explorant trop souvent des actions sous-optimales pour être efficace.

Une première direction de la thèse consistera à caractériser le regret des algorithmes de gradient de politique dans les bandits. Le projet étudiera comment le pas d'apprentissage, la paramétrisation, la régularisation et la structure des récompenses influencent la concentration de la politique sur les actions optimales. L'objectif sera d'identifier les régimes où ces algorithmes sont efficaces, inefficaces, ou limités par rapport aux méthodes classiques de bandits.

Une deuxième direction sera algorithmique. La thèse étudiera des modifications des méthodes de gradient de politique permettant d'améliorer leur stabilité et leurs garanties en regret, tout en conservant leur simplicité. Les pistes possibles incluent des mises à jour régularisées, des pas adaptatifs, des paramétrisations alternatives à softmax, ainsi que des méthodes hybrides combinant apprentissage par gradient et exploration explicite.

Une troisième direction consistera à étendre les enseignements obtenus dans les bandits vers l'apprentissage par renforcement. Les processus de décision markoviens introduisent des difficultés supplémentaires, car les actions influencent à la fois les récompenses immédiates et les états futurs. Le projet cherchera à déterminer si les phénomènes observés dans les bandits, comme la sensibilité au pas d'apprentissage, l'exploration implicite insuffisante ou le rôle stabilisateur de la régularisation, apparaissent aussi dans des environnements simples d'apprentissage par renforcement.

Dans l'ensemble, ce projet vise à clarifier quand et pourquoi les méthodes de gradient de politique fonctionnent, quand elles échouent, et comment elles peuvent être améliorées. En combinant des outils issus de l'apprentissage en ligne, de l'approximation stochastique, de l'optimisation et de la théorie de l'apprentissage par renforcement, la thèse cherchera à fournir de nouveaux éclairages théoriques et de nouveaux principes algorithmiques pour l'apprentissage par gradient dans les problèmes de décision séquentielle.
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Policy gradient methods are among the most widely used approaches in modern reinforcement learning. They directly optimize a parameterized decision rule through gradient-based updates, which makes them flexible and scalable. However, despite their empirical success, their theoretical understanding remains incomplete. In particular, it is still difficult to characterize when these methods explore efficiently, converge reliably, and achieve strong long-term performance guarantees.

The objective of this PhD project is to advance the theoretical understanding of policy gradient methods in bandits and reinforcement learning. The project will focus on simplified but mathematically precise models, with the goal of identifying the mechanisms that explain convergence, stability, exploration, and regret. A central question is whether gradient-based exploration is sufficient to obtain strong performance guarantees, or whether additional algorithmic ingredients are needed to avoid slow or unstable learning.

The multi-armed bandit setting will serve as a natural starting point. Bandits provide a minimal model of sequential decision-making under uncertainty, while already capturing the fundamental exploration-exploitation trade-off that appears in reinforcement learning. A key example is the Stochastic Gradient Bandit algorithm, which can be interpreted as a simple softmax policy gradient method applied to bandits. Recent works have shown that this algorithm can have favorable convergence properties, but its regret behavior is more delicate. Depending on the learning-rate regime, it may achieve logarithmic regret or suffer from much slower polynomial regret. This highlights an important distinction between convergence and regret: an algorithm may eventually converge to the optimal action, while still exploring suboptimal actions too often to be efficient over time.

The first direction of the thesis will be to characterize the regret behavior of policy gradient algorithms in bandit models. The project will study how the learning rate, policy parameterization, regularization, and reward structure influence the concentration of the policy on optimal actions. The goal is to identify regimes in which gradient-based bandit algorithms are efficient, inefficient, or fundamentally limited compared with classical bandit algorithms.

The second direction will be algorithmic. The thesis will investigate modifications of policy gradient methods that improve stability and regret guarantees while preserving simplicity and scalability. Possible directions include regularized updates, adaptive learning-rate schemes, alternative policy parameterizations beyond softmax, and hybrid methods combining gradient-based learning with explicit exploration mechanisms.

The third direction will be to extend the insights obtained in bandits toward reinforcement learning. Markov Decision Processes introduce additional difficulties, since actions affect both immediate rewards and future states. The project will investigate whether phenomena observed in bandits, such as learning-rate sensitivity, insufficient implicit exploration, and the stabilizing role of regularization, also appear in simple reinforcement learning environments.

Overall, the project aims to clarify when and why policy gradient methods work, when they fail, and how they can be improved. By combining tools from online learning, stochastic approximation, optimization, and reinforcement learning theory, the thesis will seek to provide both new theoretical insights and new algorithmic principles for gradient-based learning in sequential decision-making problems.
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Début de la thèse : 01/10/2026

Funding category

Funding further details

Concours allocations

Presentation of host institution and host laboratory

Université Grenoble Alpes

Institution awarding doctoral degree

Université Grenoble Alpes

Graduate school

217 MSTII - Mathématiques, Sciences et technologies de l'information, Informatique

Candidate's profile

compétence de hait niveau en apprentissage machine, probabilités et optimisation.
high-level expertise in machine learning, probability and optimization.

Application deadline

2026-06-09

Partager via

Apply

Vous avez déjà un compte ?

Nouvel utilisateur ?

Mr/Mrs*	↓ ↓
First name*	↓ ↓
Last name*	↓ ↓
E-mail*	↓ ↓
Confirm your e-mail*	↓ ↓
Password*	8 characters minimum, including at least one figure, one lower case letter and one uppercase letter. ↓ ↓
Please confirm password*	↓ ↓