Building a Vision-Language-Action system for the Unitree G1 humanoid robot taught me more about robotics engineering than any course could. Here are the key lessons, condensed from 10 weeks of development across 6 project phases.
Environment Setup
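Torque actuators are not position actuators. When the MuJoCo model uses torque-controlled joints, the control value is interpreted as a torque, so a position target sent directly just produces an arbitrary force instead of moving the joint to that angle. You need a PD controller: τ = Kp(q_des - q) - Kd * q̇ + τ_gravity.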
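Gravity compensation matters from day one. Without it, the robot's arms fall limp or settle below their targets. Compute gravity torques via mj_inverse on a state with zero velocity and zero acceleration (at zero velocity the bias force in qfrc_bias reduces to the same gravity term) and add them to your control signal.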
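To make the control law concrete, here is a minimal sketch on a toy one-joint arm rather than the G1 model. The mass, link length, gains, and the hand-written gravity_torque function are all assumptions for illustration; in MuJoCo the gravity term would come from the simulator instead.

```python
import math

# Hypothetical single-joint arm: a point mass on a massless link.
MASS = 1.0      # kg
LENGTH = 0.5    # m
G = 9.81        # m/s^2
INERTIA = MASS * LENGTH ** 2


def gravity_torque(q):
    """Torque needed to hold the link still at angle q (q = 0 hangs straight down)."""
    return MASS * G * LENGTH * math.sin(q)


def pd_torque(q_des, q, qd, kp, kd, compensate=True):
    """tau = Kp (q_des - q) - Kd * qd + tau_gravity."""
    tau = kp * (q_des - q) - kd * qd
    if compensate:
        tau += gravity_torque(q)
    return tau


def simulate(q_des, kp=20.0, kd=4.0, compensate=True, dt=0.001, steps=5000):
    """Integrate the joint for steps * dt seconds and return the final angle."""
    q, qd = 0.0, 0.0
    for _ in range(steps):
        tau = pd_torque(q_des, q, qd, kp, kd, compensate)
        qdd = (tau - gravity_torque(q)) / INERTIA  # gravity pulls the link back down
        qd += qdd * dt                             # semi-implicit Euler step
        q += qd * dt
    return q


if __name__ == "__main__":
    print("with compensation:   ", simulate(1.0))
    print("without compensation:", simulate(1.0, compensate=False))
```

With the gravity term the joint settles on the 1.0 rad target. Without it, the same gains leave the joint sagging well short of the target, which is the limp-arm behavior described above.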
Demo Generation
Kinematic IK is fragile. Iterative Jacobian-based IK can diverge if your step size is too large or your target is at the edge of the workspace. Always validate generated demos before feeding them to training.
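The sketch below shows one common way to tame this, on a toy two-link planar arm. Damped least squares with a clamped step size is a technique I am swapping in for illustration, not necessarily what the original solver did. Link lengths, damping, and tolerances are assumed values. The solver returns a success flag so a demo generator can discard targets that never converge.

```python
import math

L1, L2 = 0.3, 0.25  # hypothetical link lengths in meters


def forward(q):
    """End-effector position of a two-link planar arm."""
    x = L1 * math.cos(q[0]) + L2 * math.cos(q[0] + q[1])
    y = L1 * math.sin(q[0]) + L2 * math.sin(q[0] + q[1])
    return x, y


def jacobian(q):
    s1, c1 = math.sin(q[0]), math.cos(q[0])
    s12, c12 = math.sin(q[0] + q[1]), math.cos(q[0] + q[1])
    return [[-L1 * s1 - L2 * s12, -L2 * s12],
            [L1 * c1 + L2 * c12, L2 * c12]]


def solve_ik(target, q0=(0.3, 0.6), damping=0.05, max_step=0.1,
             tol=1e-4, max_iters=500):
    """Damped least-squares IK. Returns (joint angles, converged flag)."""
    q = list(q0)
    for _ in range(max_iters):
        x, y = forward(q)
        ex, ey = target[0] - x, target[1] - y
        if math.hypot(ex, ey) < tol:
            return q, True
        J = jacobian(q)
        # A = J J^T + damping^2 I, which is always invertible for damping > 0.
        a = J[0][0] ** 2 + J[0][1] ** 2 + damping ** 2
        b = J[0][0] * J[1][0] + J[0][1] * J[1][1]
        d = J[1][0] ** 2 + J[1][1] ** 2 + damping ** 2
        det = a * d - b * b
        vx = (d * ex - b * ey) / det
        vy = (-b * ex + a * ey) / det
        # dq = J^T A^-1 e
        dq = [J[0][0] * vx + J[1][0] * vy, J[0][1] * vx + J[1][1] * vy]
        # Clamp the joint-space step so one bad iteration cannot fling the arm.
        norm = math.hypot(dq[0], dq[1])
        if norm > max_step:
            dq = [c * max_step / norm for c in dq]
        q[0] += dq[0]
        q[1] += dq[1]
    return q, False


if __name__ == "__main__":
    print(solve_ik((0.35, 0.2)))   # inside the workspace
    print(solve_ik((1.0, 0.0)))    # beyond the arm's 0.55 m reach
```

Checking the flag (and the final position error) for every waypoint is a cheap way to reject bad demos before they reach the training set.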
Weld constraints are your friend for data collection. During scripted demonstrations, temporarily welding the object to the hand avoids physics artifacts that would confuse the policy during training.
ACT Training
Action chunking is the key insight. Predicting 20 timesteps at once (an action chunk) instead of single-step actions dramatically improves temporal consistency. The policy learns smoother trajectories.
Freeze early ResNet layers. Training the full vision encoder from scratch on small robotics datasets leads to overfitting. Freezing layers 0-6 of ResNet18 and only training the later layers works much better.
Temporal ensembling smooths execution. Overlapping action chunks with exponential decay weighting produces smoother motion than executing chunks sequentially.
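Here is a minimal sketch of the ensembling step, assuming the policy is queried every timestep and each query returns a chunk of future actions. The function name, the dictionary layout, and the decay value are hypothetical. The weighting follows the convention I understand ACT to use, where index 0 is the oldest prediction and receives the largest weight; treat that as an assumption to check against your own implementation.

```python
import math


def ensemble_action(predictions, t, decay=0.01):
    """Blend every chunk prediction that covers timestep t.

    predictions maps the timestep at which a chunk was predicted to that
    chunk, stored as a list of action vectors (lists of floats).
    """
    candidates = []
    for start in sorted(predictions):        # oldest chunk first
        chunk = predictions[start]
        offset = t - start
        if 0 <= offset < len(chunk):
            candidates.append(chunk[offset])
    if not candidates:
        raise ValueError(f"no chunk covers timestep {t}")

    # w_i = exp(-decay * i), with i = 0 for the oldest prediction.
    weights = [math.exp(-decay * i) for i in range(len(candidates))]
    total = sum(weights)
    dim = len(candidates[0])
    return [
        sum(w * action[k] for w, action in zip(weights, candidates)) / total
        for k in range(dim)
    ]


if __name__ == "__main__":
    # Two overlapping chunks of length 3 for a one-dimensional action.
    preds = {0: [[0.0], [0.0], [0.0]], 1: [[1.0], [1.0], [1.0]]}
    for step in range(4):
        print(step, ensemble_action(preds, step, decay=0.5))
```

A smaller decay spreads the weight more evenly across overlapping chunks, which gives smoother motion at the cost of reacting more slowly to new observations.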
Bimanual Manipulation
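Friction-based grasping is harder than weld-based grasping. When both hands must squeeze an object using only friction (no cheating with weld constraints), you need careful force balancing and compliance control: too little squeeze and the object slips, too much or unequal force and it gets pushed out of the grasp.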
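As a rough starting point for the force balance, the sketch below estimates the squeeze force each hand needs. It assumes a simplified model: two opposing flat contacts, Coulomb friction, and an object that only has to be held against gravity. The function name and the safety factor are hypothetical, and a real grasp also has to handle accelerations and torques.

```python
def min_squeeze_force(mass_kg, friction_coeff, safety_factor=1.5, g=9.81):
    """Normal force per hand so that friction at two contacts carries the weight.

    Static balance: 2 * mu * N >= m * g, so N >= m * g / (2 * mu).
    """
    if friction_coeff <= 0:
        raise ValueError("friction coefficient must be positive")
    return safety_factor * mass_kg * g / (2.0 * friction_coeff)


if __name__ == "__main__":
    # Hypothetical 0.5 kg box with a friction coefficient of 0.5.
    print(min_squeeze_force(0.5, 0.5))
```

Both hands should apply this force in opposite directions so the net horizontal force on the object is zero; any imbalance shows up as the object sliding toward the weaker hand.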
Leg drift is a real problem. Even with a fixed base, uncontrolled leg joints accumulate drift during long episodes. Hold the leg joints at their initial positions with a separate PD controller.
The Bigger Picture
Building this system reinforced that robotics engineering is fundamentally about making things work reliably in the real world, not about having the fanciest algorithm. The gap between a working demo and a robust system is where most of the engineering happens.