Imitation-Initialized Reinforcement Learning for Real-World Robot Manipulation

Aayush Dulal
Learning-Based Robotics • Real-Hardware Deployment

This project presents a practical learning-based manipulation pipeline implemented on the SO-101 robotic arm using a hybrid imitation-learning and reinforcement-learning approach. The system enables language-conditioned pick-and-place behavior on real hardware under limited data and noisy actuation.

A compact Vision-Language-Action (VLA) model is first trained via imitation learning on real-world demonstrations to provide safe, task-relevant behavior. This policy then initializes reinforcement learning, allowing the robot to improve robustness and task success through interaction.
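A minimal sketch of this warm-start idea, using a toy linear policy fit by least squares and synthetic data as stand-ins for SmolVLA and the real demonstrations (all array shapes and noise levels here are illustrative assumptions, not the project's actual configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for demonstration data: observation -> action pairs.
# In the real pipeline these come from teleoperated episodes on the arm.
obs = rng.normal(size=(500, 8))                          # 8-D observation features
true_W = rng.normal(size=(8, 4))                         # unknown expert mapping
acts = obs @ true_W + 0.01 * rng.normal(size=(500, 4))   # 4-D actions, slight noise

# Imitation-learning step: fit a policy to the demos by least squares
# (a toy substitute for fine-tuning a VLA model on recorded episodes).
W_bc, *_ = np.linalg.lstsq(obs, acts, rcond=None)

# RL initialization: the imitation weights become the mean of a Gaussian
# exploration policy, so early rollouts stay close to expert-like actions.
def policy(o, W=W_bc, sigma=0.05, rng=rng):
    return o @ W + sigma * rng.normal(size=W.shape[1])

a = policy(obs[0])  # one stochastic action near the imitated behavior
```

Because exploration noise is centered on the imitation policy's output, the first RL rollouts resemble the demonstrations rather than random flailing, which is what makes on-hardware training viable.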

By constraining reinforcement-learning exploration around expert-like actions, the system avoids unsafe behavior while learning recovery strategies, correcting for hardware imperfections, and adapting to perception noise. The final policy demonstrates improved grasp stability and reliable execution on low-cost robotic hardware.

The imitation-learning stage fine-tunes Hugging Face's SmolVLA model on 90 recorded demonstration episodes. SmolVLA is intended to run with both a top-down camera and a camera attached to the end-effector. In this version, only the end-effector camera is used, so the robot's visual input changes with every frame and there is no stable top-down view for the policy to rely on. As a result, the motion shows jitter, and the success rate is seven out of ten attempts.
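One way to realize "exploration constrained around expert-like actions" is to clip the RL action to a small box around the imitation policy's action and to penalize deviation from it in the reward. The sketch below illustrates that mechanism only; the helper names, the box width `max_dev`, and the penalty weight `beta` are assumptions, not the project's actual hyperparameters:

```python
import numpy as np

def constrain_action(a_rl, a_bc, max_dev=0.1):
    """Clip the RL action to a box centered on the imitation policy's
    action, so exploration never leaves an expert-like region."""
    return np.clip(a_rl, a_bc - max_dev, a_bc + max_dev)

def shaped_reward(r_task, a_rl, a_bc, beta=1.0):
    """Task reward minus a squared penalty on deviation from the
    imitation action: small corrections are cheap, large unsafe
    excursions are discouraged."""
    return r_task - beta * float(np.sum((a_rl - a_bc) ** 2))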

Key Skills Demonstrated: Reinforcement Learning, Imitation Learning, Robot Manipulation, Vision Language Action Models, Real World Deployment