Adaptive Policy Smoothing in Reinforcement Learning: Applications to Wavefront Sensorless Adaptive Optics and Robotics
Publisher
Université d'Ottawa / University of Ottawa
Abstract
Optical communication between low-Earth orbit (LEO) satellites and the ground is an emerging form of free-space data transmission. It offers significantly faster data transfer and supports higher bandwidths than radio frequency communication. However, atmospheric turbulence distorts the optical beam wavefront, leading to reduced data transfer rates. Adaptive Optics (AO) can correct these distortions by using real-time control commands, informed by data from a wavefront sensor, to adjust a deformable mirror.
Traditional AO systems, however, suffer from high complexity and cost, with a significant portion of the cost attributed to wavefront sensors. Additionally, wavefront sensors have limited dynamic range, consume a fraction of the incident beam's intensity, and introduce latency between measurement and actuation of the deformable mirror. These factors can cause discrepancies between the measured and actual atmospheric characteristics as the satellite traverses the sky.
This thesis demonstrates that reinforcement learning (RL) can serve as a viable, low-cost, and low-latency alternative by eliminating the wavefront sensor and its associated processing electronics. This is accomplished by learning a control policy through direct interaction with the system, using the cost-effective, ultra-fast readout of a low-dimensional photodetector array rather than a wavefront phase-profiling camera.
Inspired by the application of an RL-based wavefront sensorless AO system for optical LEO satellite-to-ground communication downlinks, this thesis identifies and addresses a general limitation of standard deep RL controllers: policies trained with standard deep RL algorithms exhibit high-frequency components in the control signal, which can lead to oscillations that increase actuator amplitudes. This reduces correction speed and leaves the system unable to keep pace with rapidly changing wavefronts. Existing action regularization methods mitigate these oscillations but often reduce performance, particularly in fast-evolving dynamic environments, where they prevent the policy from adjusting quickly to large state changes.
To address this challenge, a novel State-Adaptive Proportional Policy Smoothing (SAPPS) method is proposed for RL. SAPPS reduces high-frequency components in the control signal in continuous environments through policy smoothing proportionally to the magnitude of state changes, ensuring the policy remains responsive to environmental changes without compromising performance.
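The core idea above can be illustrated with a minimal sketch: a smoothness penalty on consecutive actions whose weight shrinks as the state-change magnitude grows, so smoothing is strongest when the environment is nearly static and relaxed when it evolves rapidly. The function name, the inverse weighting form, and all parameters below are hypothetical illustrations of this principle, not the thesis's actual SAPPS formulation.

```python
import numpy as np

def state_adaptive_smoothness_penalty(a_t, a_prev, s_t, s_prev, base_weight=1.0):
    """Hypothetical state-adaptive smoothness penalty.

    Penalizes the difference between consecutive actions, but scales the
    penalty down when the state itself changed a lot, so the policy stays
    responsive to large environmental changes.
    """
    action_delta = np.linalg.norm(a_t - a_prev)
    state_delta = np.linalg.norm(s_t - s_prev)
    # Larger state change -> smaller smoothing weight -> more responsiveness.
    weight = base_weight / (1.0 + state_delta)
    return weight * action_delta ** 2
```

In training, a term of this form would be added to the policy loss; with a fixed `weight` it reduces to a standard (non-adaptive) action-smoothness regularizer.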
The proposed SAPPS method is integrated with the on-policy deep RL algorithm Proximal Policy Optimization (PPO) and compared against standard PPO and the state-of-the-art smoothing methods CAPS and LipsNet, both implemented within the PPO framework for fair comparison. To assess the generality of the proposed approach, it is evaluated across standard MuJoCo continuous-control tasks, which serve as widely used benchmarks for RL, and a real-world quadcopter hovering experiment that demonstrates its hardware applicability. In addition, the method is evaluated on a complex wavefront sensorless AO system for optical satellite communication, highlighting its effectiveness in highly dynamic environments. To address the challenges of AO systems and facilitate the evaluation of RL algorithms, an RL environment for a simulated wavefront sensorless AO system is also developed.
The results show that PPO+SAPPS performs comparably to PPO+CAPS, and both outperform standard PPO in MuJoCo tasks; specifically, PPO+SAPPS improves performance by 11% and policy smoothness by 28% relative to standard PPO. In the physical quadcopter experiment, PPO+SAPPS outperforms both PPO+CAPS and standard PPO, achieving a 17% higher average return and a 29% improvement in policy smoothness. In the wavefront sensorless AO system, PPO+SAPPS achieves performance comparable to PPO+CAPS, reaching the maximum performance attained by the Shack-Hartmann wavefront sensor in a quasi-static atmosphere. Moreover, under dynamic, high-velocity conditions, PPO+SAPPS surpasses standard PPO, PPO+CAPS, and PPO (LipsNet).
These findings highlight the contribution of the proposed SAPPS method in reducing high-frequency control fluctuations in proportion to environmental changes, while enabling more responsive performance compared to state-of-the-art smoothing methods across multiple simulated and real-world systems. In particular, it demonstrates strong potential for optical satellite communication, which could effectively serve rural and remote communities by enabling high-speed connectivity with reduced costs.
Keywords
Adaptive Policy Smoothing, Policy Regularization, Smooth Control, Reinforcement Learning, Adaptive Optics, Wavefront Sensorless Adaptive Optics, Optical Satellite Communication Downlinks, Fiber Coupling
