Analyzing the Code#

Let’s analyze this code to connect the idea of the MDP framework to Isaac Lab’s Manager-based workflow, along with some of the math that aids our training process. Each one of these major ideas or “Managers” is defined by a Python @configclass.

Open the following Python file, and let’s take a closer look at how the MDP elements are defined in code: source/Cartpole/Cartpole/tasks/manager_based/cartpole/cartpole_env_cfg.py

Termination#

When does a training episode end? An episode terminates when the cart strays too far from where it started, or after a timeout. Once things are that far off track, it’s best to reset and move on.

@configclass
class TerminationsCfg:
    """Termination terms for the MDP."""

    # (1) Time out
    time_out = DoneTerm(func=mdp.time_out, time_out=True)
    # (2) Cart out of bounds
    cart_out_of_bounds = DoneTerm(
        func=mdp.joint_pos_out_of_manual_limit,
        params={"asset_cfg": SceneEntityCfg("robot", joint_names=["slider_to_cart"]), "bounds": (-3.0, 3.0)},
    )
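
Each termination term’s func returns one boolean per parallel environment. As a rough idea of what such a function does, here is a simplified sketch of a bounds check like joint_pos_out_of_manual_limit (an illustration only, not the actual Isaac Lab source, which lives in the mdp package):

import torch

def joint_pos_out_of_bounds_sketch(env, asset_cfg, bounds) -> torch.Tensor:
    """Return True for every environment whose selected joints leave the given bounds."""
    asset = env.scene[asset_cfg.name]
    joint_pos = asset.data.joint_pos[:, asset_cfg.joint_ids]  # shape: (num_envs, num_joints)
    # an environment terminates if any of its selected joints is out of bounds
    return torch.logical_or(joint_pos < bounds[0], joint_pos > bounds[1]).any(dim=1)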

Action#

An effort (a force on the slider joint) can be applied to push the cart left or right.

@configclass
class ActionsCfg:
    """Action specifications for the MDP."""

    joint_effort = mdp.JointEffortActionCfg(asset_name="robot", joint_names=["slider_to_cart"], scale=100.0)
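
The scale parameter multiplies the raw action coming out of the policy before it is applied to the joint. A quick illustration of the idea (assuming no additional offset or clipping is configured):

import torch

# Raw policy outputs are typically small values around [-1, 1];
# scale=100.0 turns them into usable joint efforts.
raw_actions = torch.tensor([[-0.5], [0.0], [1.0]])  # (num_envs, num_action_dims)
applied_effort = 100.0 * raw_actions
print(applied_effort)  # tensor([[-50.], [0.], [100.]])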

Observations#

We observe the relative positions and velocities of the robot’s joints: the cart’s position along the rail, the pole’s angle, and their velocities.

@configclass
class ObservationsCfg:
    """Observation specifications for the MDP."""

    @configclass
    class PolicyCfg(ObsGroup):
        """Observations for policy group."""

        # observation terms (order preserved)
        joint_pos_rel = ObsTerm(func=mdp.joint_pos_rel)
        joint_vel_rel = ObsTerm(func=mdp.joint_vel_rel)
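
Each observation term produces a tensor with one row per environment, and the observation group concatenates the terms in order along the last dimension. A rough illustration of the shapes involved (assuming both joints are included in each term):

import torch

num_envs = 3
joint_pos_rel = torch.zeros(num_envs, 2)  # slider_to_cart and cart_to_pole positions
joint_vel_rel = torch.zeros(num_envs, 2)  # the matching velocities
policy_obs = torch.cat([joint_pos_rel, joint_vel_rel], dim=-1)
print(policy_obs.shape)  # torch.Size([3, 4]) -> 4 observation values per cartpole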

Rewards#

We reward several things, with different weights. Before you read on, take a second and think about how you might describe balancing the cartpole in physics terms. It’s okay if you’re not sure; we’ll explain it all below.

Here’s what rewards worked for this task:

  • Reward the robot simply for staying alive. Balancing is a continuous task with no definite end, unlike, say, dropping a ball into a bucket.

  • Penalize strongly for terminating. If the cart flies off the screen, that’s bad.

  • Reward keeping the pole upright, meaning staying close to the target joint position of 0.0 for this particular mechanism.

  • Reward a lower cart velocity. In other words, encourage slow movements to balance.

  • Reward a lower pole angular velocity.

@configclass
class RewardsCfg:
    """Reward terms for the MDP."""

    # (1) Constant running reward
    alive = RewTerm(func=mdp.is_alive, weight=1.0)
    # (2) Failure penalty
    terminating = RewTerm(func=mdp.is_terminated, weight=-2.0)
    # (3) Primary task: keep pole upright
    pole_pos = RewTerm(
        func=mdp.joint_pos_target_l2,
        weight=-1.0,
        params={"asset_cfg": SceneEntityCfg("robot", joint_names=["cart_to_pole"]), "target": 0.0},
    )
    # (4) Shaping tasks: lower cart velocity
    cart_vel = RewTerm(
        func=mdp.joint_vel_l1,
        weight=-0.01,
        params={"asset_cfg": SceneEntityCfg("robot", joint_names=["slider_to_cart"])},
    )
    # (5) Shaping tasks: lower pole angular velocity
    pole_vel = RewTerm(
        func=mdp.joint_vel_l1,
        weight=-0.005,
        params={"asset_cfg": SceneEntityCfg("robot", joint_names=["cart_to_pole"])},
    )

Note that in this “Manager workflow” we’re using, the function that actually calculates the reward is passed into the func parameter. See mdp/rewards.py or read on to learn more about that function.
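
Because func is just a Python callable, you could also plug in a term of your own. Here is a hypothetical example, purely for illustration (it is not part of the tutorial code, and the joint index is an assumption):

import torch

def cart_near_center(env) -> torch.Tensor:
    """Hypothetical reward term: 1.0 for each environment whose cart is within 0.5 m of the origin."""
    asset = env.scene["robot"]
    cart_pos = asset.data.joint_pos[:, 0]  # assumes joint index 0 is slider_to_cart
    return (cart_pos.abs() < 0.5).float()

# It could then be registered inside RewardsCfg, for example:
#   near_center = RewTerm(func=cart_near_center, weight=0.1)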

Let’s look at the reward function for the pole position, joint_pos_target_l2. Critically, this function doesn’t calculate the reward for just one environment, but for all of them at once! Remember, we are training many cartpoles at the same time for accelerated, parallel training, leveraging PyTorch tensors.

The steps inside this function are:

  • Get the Articulation from the scene. An articulation is the collection of physics joints that define the cartpole mechanism.

  • Wrap the joint positions so they always fall within -pi to pi radians.

  • Compute the squared deviation from the target, (joint_pos - target)^2, and sum it over the joint dimension of the tensor, returning one reward value per environment. (A tensor is a data container that can hold values in one or more dimensions.)

Note

If you looked closely, you might’ve noticed multiple reward functions, but only one is explicitly defined in rewards.py. Why?

That’s because Isaac Lab provides some common reward functions for us, so we don’t have to redefine them across projects. These come from the mdp package; see the Isaac Lab documentation for a full list of available functions.

def joint_pos_target_l2(env: ManagerBasedRLEnv, target: float, asset_cfg: SceneEntityCfg) -> torch.Tensor:
    """Penalize joint position deviation from a target value."""
    # extract the used quantities (to enable type-hinting)
    asset: Articulation = env.scene[asset_cfg.name]
    # wrap the joint positions to (-pi, pi)
    joint_pos = wrap_to_pi(asset.data.joint_pos[:, asset_cfg.joint_ids])
    # compute the reward
    return torch.sum(torch.square(joint_pos - target), dim=1)
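
To make the last line concrete, here is a small worked example of the shapes involved. With three environments and one pole joint, squaring the deviation and summing over the joint dimension (dim=1) yields one scalar per environment:

import torch

joint_pos = torch.tensor([[0.1], [0.0], [-0.3]])  # (num_envs=3, num_joints=1)
target = 0.0
reward = torch.sum(torch.square(joint_pos - target), dim=1)
print(reward)  # tensor([0.0100, 0.0000, 0.0900]) -> one value per environment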

Scene and Environment Configuration#

We’ve looked at the major components in isolation: the managers, and the underlying math of the reward function. Now let’s look at how it all gets tied together into a training scene.

First, there’s a Scene Config.

Here we see how the training environment is set up, from the ground plane to the robot to the dome light. Isaac Lab lets us customize the training environment itself, so we don’t have to put lights or other elements that are only needed for training directly into our robot’s USD file. If we needed other props for training, such as objects to grasp, we could add them here in a similar way; see the sketch after the configuration below.

@configclass
class CartpoleSceneCfg(InteractiveSceneCfg):
    """Configuration for a cart-pole scene."""

    # ground plane
    ground = AssetBaseCfg(
        prim_path="/World/ground",
        spawn=sim_utils.GroundPlaneCfg(size=(100.0, 100.0)),
    )

    # robot
    robot: ArticulationCfg = CARTPOLE_CFG.replace(prim_path="{ENV_REGEX_NS}/Robot")

    # lights
    dome_light = AssetBaseCfg(
        prim_path="/World/DomeLight",
        spawn=sim_utils.DomeLightCfg(color=(0.9, 0.9, 0.9), intensity=500.0),
    )
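
As an example of such a prop, a simple cube could be declared as another attribute of CartpoleSceneCfg, in the same style as the ground plane and the dome light. This is a hypothetical addition for illustration only (the cartpole task doesn’t need it, and the exact spawner parameters should be checked against the Isaac Lab spawner configs):

# hypothetical extra prop (illustration only), declared inside CartpoleSceneCfg
cube = AssetBaseCfg(
    prim_path="{ENV_REGEX_NS}/Cube",
    spawn=sim_utils.CuboidCfg(size=(0.2, 0.2, 0.2)),
)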

And now, finally, this is where everything comes together, in the Environment Configuration.

All of the managers defined above are instantiated here. We also configure the simulator: the decimation, the time step, and the episode length. The important part is noting how Isaac Lab controls these aspects of Isaac Sim to perform our training.

##
# Environment configuration
##

@configclass
class CartpoleEnvCfg(ManagerBasedRLEnvCfg):
    # Scene settings
    scene: CartpoleSceneCfg = CartpoleSceneCfg(num_envs=4096, env_spacing=4.0)
    # Basic settings
    observations: ObservationsCfg = ObservationsCfg()
    actions: ActionsCfg = ActionsCfg()
    events: EventCfg = EventCfg()
    # MDP settings
    rewards: RewardsCfg = RewardsCfg()
    terminations: TerminationsCfg = TerminationsCfg()

    # Post initialization
    def __post_init__(self) -> None:
        """Post initialization."""
        # general settings
        self.decimation = 2
        self.episode_length_s = 5
        # viewer settings
        self.viewer.eye = (8.0, 0.0, 5.0)
        # simulation settings
        self.sim.dt = 1 / 120
        self.sim.render_interval = self.decimation
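
It’s worth working out what those timing values mean. The physics simulation steps every 1/120 s, and a decimation of 2 means the policy acts once every 2 physics steps, so the control loop runs at 60 Hz; a 5-second episode therefore spans 300 control steps:

sim_dt = 1 / 120                          # physics step: 120 Hz
decimation = 2                            # policy acts every 2 physics steps
control_dt = sim_dt * decimation          # 1/60 s -> the policy runs at 60 Hz
episode_length_s = 5
steps_per_episode = round(episode_length_s / control_dt)
print(control_dt, steps_per_episode)      # 0.01666... 300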

And because Isaac Lab is controlling the simulator, we have a special ability to “step” the simulation through time. Instead of letting it run freely, we can pause, do some work, advance the simulation by a fixed amount of time, and repeat.
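
As a rough sketch of what that stepping loop looks like (an illustration only: it assumes the isaaclab.envs import path, assumes the Isaac Sim app has already been launched via AppLauncher, and uses a placeholder zero action instead of a trained policy):

import torch
from isaaclab.envs import ManagerBasedRLEnv

env = ManagerBasedRLEnv(cfg=CartpoleEnvCfg())  # CartpoleEnvCfg imported from the task module
obs, _ = env.reset()
for _ in range(300):  # 300 control steps = one 5-second episode
    actions = torch.zeros(env.num_envs, 1, device=env.device)  # placeholder "policy"
    obs, rew, terminated, truncated, extras = env.step(actions)
env.close()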

Okay, now are you ready to see our trained policy in action? Let’s try it out!