each tailored to MLX. The double-critic approach, inspired by previous work, reduces overestimation of value (Q) estimates, and soft updates of the target networks keep policy learning stable.
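The two ideas can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's MLX implementation: the function names, the scalar inputs, and the default values for `gamma` and `tau` are all assumptions chosen for readability. The first function builds a temporal-difference target from the smaller of two target-critic estimates; the second blends online parameters into the target parameters by Polyak averaging.

```python
def clipped_double_q_target(reward, done, q1_next, q2_next, gamma=0.99):
    """TD target built from the smaller of two target-critic estimates.

    All arguments are plain floats here (hypothetical simplification);
    `done` is 1.0 for a terminal transition and 0.0 otherwise.
    """
    # Taking the minimum counteracts the upward bias of a single critic.
    min_q_next = min(q1_next, q2_next)
    # No bootstrapping past a terminal state.
    return reward + gamma * (1.0 - done) * min_q_next


def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: target <- tau * online + (1 - tau) * target.

    Parameters are flat lists of floats (hypothetical simplification);
    a new list is returned and the inputs are left unchanged.
    """
    return [
        tau * p + (1.0 - tau) * t
        for t, p in zip(target_params, online_params)
    ]
```

With a small `tau`, the target networks trail the online networks slowly, so the targets the critics regress toward change gradually between updates. In an array framework the same arithmetic would be applied element-wise to each parameter tensor.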