Configuring the managers#

Going back to the Markov Decision Process framework, let’s apply it to the problem at hand.

Inspecting the code#

  1. Open the file: /Reach/tasks/manager_based/reach/reach_env_cfg.py

    • This is where we’ll configure all the various managers for our task. First, let’s set up our environment with the robot, a ground plane, and a light.

There are two types of functions we’ll engage with:

Built-in MDP functions

  • These ship with Isaac Lab and cover the common cases, such as mdp.joint_pos_rel for joint positions relative to their defaults

  • Learn more about the built-in MDP functions for creating Observation Terms in the docs here.

Custom functions

  • When a built-in function doesn’t exist, we write one. It’s a common practice in Isaac Lab manager workflows to create an mdp folder to house the functions.
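To give a sense of what these look like, below is a minimal, hypothetical sketch of a custom observation term. Every manager term is just a Python function that takes the environment (plus any parameters from the config) and returns a tensor with one row per parallel environment. The function name, file name, and import paths are illustrative only; older Isaac Lab releases use the omni.isaac.lab namespace instead of isaaclab.

# mdp/observations.py -- hypothetical example of a custom observation term
from __future__ import annotations

import torch

from isaaclab.assets import Articulation
from isaaclab.envs import ManagerBasedRLEnv
from isaaclab.managers import SceneEntityCfg


def root_height(env: ManagerBasedRLEnv, asset_cfg: SceneEntityCfg = SceneEntityCfg("robot")) -> torch.Tensor:
    """Height of the asset's root link above the world origin, shape (num_envs, 1)."""
    asset: Articulation = env.scene[asset_cfg.name]
    return asset.data.root_pos_w[:, 2].unsqueeze(-1)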

Let’s work on our main class where we define all the managers, then we’ll come back and fill in the custom MDP functions.

Actions#

These define what the agent can do. In our case, that's moving all 6 revolute joints of the robot. We'll be leveraging the built-in JointPositionActionCfg class. The joint names are taken from our robot's joints in the USD file. We don't have to provide prim paths for these; they'll be located for us.

Tip

Joint names can also be given as regular expressions. So if your robot had similarly named joints such as joint_1, joint_2, and so on, you could use joint_.* to select them all; the snippet below shows the same idea applied to this robot's joint names. Here we're using explicit names for clarity.
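For example, the pattern form below selects the same six joints as the explicit list used in the class that follows:

# Equivalent joint selection using regular expressions (selects the same six joints)
joint_names=["shoulder_.*", "elbow_joint", "wrist_.*"],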

Replace this class in your reach_env_cfg.py file.

@configclass
class ActionsCfg:
    """Action specifications for the MDP."""

    arm_action: ActionTerm = mdp.JointPositionActionCfg(
        asset_name="robot",
        joint_names=[
            "shoulder_pan_joint",
            "shoulder_lift_joint",
            "elbow_joint",
            "wrist_1_joint",
            "wrist_2_joint",
            "wrist_3_joint",
        ],
        scale=1.0,
        use_default_offset=True,
        debug_vis=True,
    )
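A quick note on the numeric options, based on how JointPositionAction is documented to behave (worth double-checking for your Isaac Lab version): with use_default_offset=True, the policy's output is treated as an offset from the robot's default joint positions, and scale multiplies that output before the offset is applied. Conceptually:

# Conceptual mapping from policy output to joint targets (illustrative, not library code)
joint_position_target = default_joint_pos + scale * policy_action

With scale=1.0, the policy therefore commands joint-angle offsets in radians around the default pose.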

Commands#

This sounds a lot like actions, and the two are related. But commands are higher-level task specifications: what to achieve, rather than how to achieve it (actions). Commands are the goal; actions are the means to the goal.

Add this class to your reach_env_cfg.py file.

@configclass
class CommandsCfg:
    """Command terms for the MDP."""

    ee_pose = mdp.UniformPoseCommandCfg(
        asset_name="robot",
        body_name="ee_link",  # This is the body in the USD file
        resampling_time_range=(4.0, 4.0),
        debug_vis=True,
        # These are the ranges of poses that can be commanded for the end of the robot during training
        ranges=mdp.UniformPoseCommandCfg.Ranges(
            pos_x=(0.35, 0.65),
            pos_y=(-0.2, 0.2),
            pos_z=(0.15, 0.5),
            roll=(0.0, 0.0),
            pitch=(math.pi / 2, math.pi / 2),
            yaw=(-3.14, 3.14),
        ),
    )
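To make the numbers concrete, here is a conceptual sketch (not the library's actual implementation) of what this command term does every time it resamples, which with the resampling_time_range above is every 4 seconds of simulated time:

# Conceptual sketch of uniform pose-command resampling (illustrative only)
import math
import random

def sample_ee_pose_command():
    """Draw one goal pose for the end effector from the configured ranges."""
    return {
        "pos": (
            random.uniform(0.35, 0.65),   # x, meters in front of the robot base
            random.uniform(-0.2, 0.2),    # y
            random.uniform(0.15, 0.5),    # z
        ),
        # roll fixed at 0, pitch fixed at 90 degrees, yaw anywhere in roughly (-pi, pi)
        "rpy": (0.0, math.pi / 2, random.uniform(-3.14, 3.14)),
    }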

Observations#

These are the measurements taken from the environment that the policy needs in order to solve the task. In this case they are:

  • The positions of the robot's joints (relative to their defaults)

  • The velocities of the robot's joints

  • The commanded end-effector pose, i.e. the current goal

  • The action taken on the previous step

With these observations, the policy can tell how smooth its motion is and how close the end effector is to its goal pose.

Observations can be grouped together, and in this case we simply have observations relevant for the policy.

Finally, the self.enable_corruption = True line allows observations to be “corrupted” by noise. In the code below, the noise is added as a parameter to the joint_pos and joint_vel observation terms. This noise helps simulate real-world sensor noise, improve policy robustness, and prevent overfitting to clean data.

Replace this class in your reach_env_cfg.py file.

@configclass
class ObservationsCfg:
    """Observation specifications for the MDP."""

    @configclass
    class PolicyCfg(ObsGroup):
        """Observations for policy group."""

        # observation terms (order preserved)
        joint_pos = ObsTerm(func=mdp.joint_pos_rel, noise=Unoise(n_min=-0.01, n_max=0.01))
        joint_vel = ObsTerm(func=mdp.joint_vel_rel, noise=Unoise(n_min=-0.01, n_max=0.01))
        pose_command = ObsTerm(func=mdp.generated_commands, params={"command_name": "ee_pose"})
        actions = ObsTerm(func=mdp.last_action)

        def __post_init__(self):
            self.enable_corruption = True
            self.concatenate_terms = True

    # observation groups
    policy: PolicyCfg = PolicyCfg()
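As noted above, enable_corruption lets the manager perturb these observations. Conceptually, the uniform-noise model attached to joint_pos and joint_vel does something like the following (a sketch, not the library's exact code):

# Conceptual sketch of uniform observation corruption (illustrative only)
import torch

def corrupt(obs: torch.Tensor, n_min: float = -0.01, n_max: float = 0.01) -> torch.Tensor:
    """Add independent uniform noise in [n_min, n_max] to every observation element."""
    return obs + torch.empty_like(obs).uniform_(n_min, n_max)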

Terminations#

What defines "done" for a training episode? It might be success, or a situation where we know we should abort rather than waste training time. Here we'll use a simple timeout that stops the episode after a fixed duration; the actual value comes from the episode length we set later.

Replace this class in your reach_env_cfg.py file.

@configclass
class TerminationsCfg:
    """Termination terms for the MDP."""

    time_out = DoneTerm(func=mdp.time_out, time_out=True)

Consider the Following#

If our task were more complicated, what kinds of terminations might you use to make training more efficient? For example, we might treat certain poses, collisions with other objects, or excessive joint velocities as termination conditions; a sketch of one such term follows.
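As a thought experiment, here is a hypothetical sketch of the velocity case. This function is not part of Isaac Lab; it is the kind of custom term you would place in your mdp folder, and the names and threshold are illustrative only:

# mdp/terminations.py -- hypothetical custom termination term
import torch

from isaaclab.assets import Articulation
from isaaclab.envs import ManagerBasedRLEnv
from isaaclab.managers import SceneEntityCfg


def joint_vel_above_limit(
    env: ManagerBasedRLEnv, max_vel: float, asset_cfg: SceneEntityCfg = SceneEntityCfg("robot")
) -> torch.Tensor:
    """Return True for every environment whose fastest joint exceeds ``max_vel`` (rad/s)."""
    asset: Articulation = env.scene[asset_cfg.name]
    return torch.any(torch.abs(asset.data.joint_vel) > max_vel, dim=1)


# ...and the corresponding term inside TerminationsCfg:
# too_fast = DoneTerm(func=mdp.joint_vel_above_limit, params={"max_vel": 10.0})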

Events#

For our project, we'll only handle the "reset" event. A reset event happens when a training episode ends. By configuring this EventTerm, we call a function to reset the joints with a bit of randomness.

More specifically, in this case we’re using the built-in mdp.reset_joints_by_scale function, which resets the robot joints by scaling the default position and velocity by the given ranges.

Replace this class in your reach_env_cfg.py file.

@configclass
class EventCfg:
    """Configuration for events."""

    reset_robot_joints = EventTerm(
        func=mdp.reset_joints_by_scale,
        mode="reset",
        params={
            "position_range": (0.75, 1.25),
            "velocity_range": (0.0, 0.0),
        },
    )
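Conceptually, given the ranges above, each reset does something like the following for one environment (a sketch of the behavior, not the library's actual code):

# Conceptual sketch of mdp.reset_joints_by_scale (illustrative only)
import random

def reset_joints(default_pos, default_vel):
    """Scale default joint positions by random factors in [0.75, 1.25]; the (0.0, 0.0) range pins velocities to zero."""
    new_pos = [p * random.uniform(0.75, 1.25) for p in default_pos]
    new_vel = [v * random.uniform(0.0, 0.0) for v in default_vel]
    return new_pos, new_vel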

Consider the Following#

Remember when we talked about how difficult this kind of training would be with a physical robot? Consider now how these few lines of code allow you to simply reset the robot and the environment.

Reward#

This is one of the key aspects of our challenge: how do we define our reward function? In fact, we use several reward terms that together shape the behavior we want.

There’s a lot of code in this class, so let’s break it down.

  1. Track end-effector position - where is the end of the robot?

  2. Track end-effector position, fine-grained

  3. Penalize action rate

  4. Penalize joint velocity

Replace this class in your reach_env_cfg.py file.

@configclass
class RewardsCfg:
    """Reward terms for the MDP."""

    # task terms
    end_effector_position_tracking = RewTerm(
        func=mdp.position_command_error,
        weight=-0.2,
        params={"asset_cfg": SceneEntityCfg("robot", body_names=["ee_link"]), "command_name": "ee_pose"},
    )
    end_effector_position_tracking_fine_grained = RewTerm(
        func=mdp.position_command_error_tanh,
        weight=0.1,
        params={"asset_cfg": SceneEntityCfg("robot", body_names=["ee_link"]), "std": 0.1, "command_name": "ee_pose"},
    )

    # action penalty
    action_rate = RewTerm(func=mdp.action_rate_l2, weight=-0.0001)
    joint_vel = RewTerm(
        func=mdp.joint_vel_l2,
        weight=-0.0001,
        params={"asset_cfg": SceneEntityCfg("robot")},
    )

Note that some of these functions, such as mdp.position_command_error and mdp.position_command_error_tanh, are not built-in, so we'll need to define them soon.

Also keep in mind that Isaac Lab handles calculating the rewards for a given state, based on the configuration data we provide.
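To anticipate what those custom terms need to do, here is a rough sketch of the position-tracking pair. Treat it as a plausible shape only; we'll write the real versions when we fill in the mdp folder, and the import paths assume a recent Isaac Lab release.

# mdp/rewards.py -- rough sketch of the custom position-tracking terms (illustrative only)
import torch

from isaaclab.assets import Articulation
from isaaclab.envs import ManagerBasedRLEnv
from isaaclab.managers import SceneEntityCfg
from isaaclab.utils.math import combine_frame_transforms


def position_command_error(env: ManagerBasedRLEnv, command_name: str, asset_cfg: SceneEntityCfg) -> torch.Tensor:
    """Distance (m) between commanded and current end-effector positions; used with a negative weight."""
    asset: Articulation = env.scene[asset_cfg.name]
    command = env.command_manager.get_command(command_name)  # goal pose in the robot's base frame
    # transform the commanded position into the world frame before comparing
    goal_pos_w, _ = combine_frame_transforms(asset.data.root_pos_w, asset.data.root_quat_w, command[:, :3])
    ee_pos_w = asset.data.body_pos_w[:, asset_cfg.body_ids[0], :]
    return torch.norm(ee_pos_w - goal_pos_w, dim=1)


def position_command_error_tanh(
    env: ManagerBasedRLEnv, std: float, command_name: str, asset_cfg: SceneEntityCfg
) -> torch.Tensor:
    """Same error mapped through 1 - tanh(error / std): close to 1 near the goal, near 0 far away."""
    return 1.0 - torch.tanh(position_command_error(env, command_name, asset_cfg) / std)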

Curriculum#

These give us a way to adjust the training setup as training progresses, a schedule of changes known as a curriculum. Here we use it to increase the action-rate and joint-velocity penalty weights after 4500 steps, so the policy can focus on reaching the goal first and is then nudged toward smoother motion.

Add this class to your reach_env_cfg.py file.

@configclass
class CurriculumCfg:
    """Curriculum terms for the MDP."""

    action_rate = CurrTerm(
        func=mdp.modify_reward_weight, params={"term_name": "action_rate", "weight": -0.005, "num_steps": 4500}
    )

    joint_vel = CurrTerm(
        func=mdp.modify_reward_weight, params={"term_name": "joint_vel", "weight": -0.001, "num_steps": 4500}
    )
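For intuition, the built-in mdp.modify_reward_weight behaves roughly like this (a conceptual sketch, not the library's code): once training has run for num_steps environment steps, it overwrites the weight of the named reward term.

# Conceptual sketch of mdp.modify_reward_weight (illustrative only)
def modify_reward_weight(env, env_ids, term_name: str, weight: float, num_steps: int):
    """After ``num_steps`` steps, set the weight of the reward term ``term_name`` to ``weight``."""
    if env.common_step_counter > num_steps:
        term_cfg = env.reward_manager.get_term_cfg(term_name)
        term_cfg.weight = weight
        env.reward_manager.set_term_cfg(term_name, term_cfg)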