Extending the benchmark¶
This guide explains how to extend the benchmark, e.g. to add a new IRL algorithm. We cover four possible extensions: adding new environments, adding new IRL algorithms, adding new RL algorithms, and adding new metrics.
We are happy to add new extensions to the main benchmark. If you are interested in collaborating, please see our collaboration guide.
Environments¶
TODO: also describe here how expert trajectories are collected. The expert trajectories need to contain features, potentially added by wrapping the environment in a FeatureWrapper.
IRL algorithms¶
All IRL algorithms have to extend the abstract base class BaseIRLAlgorithm:
from irl_benchmark.irl.algorithms.base_algorithm import BaseIRLAlgorithm

class ExampleIRL(BaseIRLAlgorithm):
    """An example IRL algorithm"""
Initializing¶
If your algorithm class implements its own __init__ method, make sure that the base class __init__ is called as well; this is necessary so that the passed config is preprocessed correctly. Please use the same parameters for the new __init__ and don't add additional ones. Any additional parameters required by your algorithm should go into the config dictionary.
def __init__(self, env: gym.Env, expert_trajs: List[Dict[str, list]],
             rl_alg_factory: Callable[[gym.Env], BaseRLAlgorithm],
             config: dict):
    """ docstring ... """
    super(ExampleIRL, self).__init__(env, expert_trajs, rl_alg_factory, config)
Let’s go over the four parameters that are always passed to an IRL algorithm when it is created:
env is an OpenAI Gym environment, at least wrapped in a RewardWrapper. The reward wrapper makes sure that the environment's true reward function is not accidentally leaked to the IRL algorithm. If required, the true reward can still be read from the info dictionary returned by the environment's step function as follows:
state, reward, done, info = env.step(action)
print(info['true_reward'])
expert_trajs is a list of trajectories collected from the expert. Each trajectory is a dictionary with keys ['states', 'actions', 'rewards', 'true_rewards', 'features']. Each value in the dictionary is a list, containing e.g. all states ordered by time. The states list will have one more element than the others, since it contains both the initial and the final state. In the case of expert trajectories, true_rewards will be an empty list. See collect_trajs, which defines how trajectories are generated.
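For illustration only, the snippet below inspects one expert trajectory according to the structure described above; expert_trajs and the example discount factor are assumed to be available in scope:

first_traj = expert_trajs[0]
# states contains both the initial and the final state, so it is one longer:
assert len(first_traj['states']) == len(first_traj['actions']) + 1
# for expert trajectories, true_rewards is an empty list:
assert first_traj['true_rewards'] == []

# e.g. compute the discounted return of this trajectory:
gamma = 0.9  # example discount factor
discounted_return = sum(
    gamma ** t * r for t, r in enumerate(first_traj['rewards']))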
rl_alg_factory is a function which takes an environment and returns a newly initialized reinforcement learning algorithm. This is used to keep the IRL algorithms flexible about which concrete RL algorithm they can be used with. If your IRL algorithm requires a specific RL algorithm (such as in guided cost learning), simply overwrite self.rl_alg_factory in your __init__ after calling the base class __init__.
def __init__(self, env: gym.Env, expert_trajs: List[Dict[str, list]],
             rl_alg_factory: Callable[[gym.Env], BaseRLAlgorithm],
             config: dict):
    """ docstring ... """
    super(ExampleIRL, self).__init__(env, expert_trajs, rl_alg_factory, config)

    # enforce use of specific RL algorithm:
    def specific_rl_alg_factory(env: gym.Env):
        return SpecificRlAlg(env, {'hyperparam': 42})

    self.rl_alg_factory = specific_rl_alg_factory
config is a dictionary containing algorithm-specific hyperparameters. To make sure we can call IRL algorithms in a unified way, you have to specify which hyperparameters your algorithm can take, as well as legal ranges and defaults. This is done as follows:
from irl_benchmark.config import IRL_CONFIG_DOMAINS
from irl_benchmark.irl.algorithms.base_algorithm import BaseIRLAlgorithm

class ExampleIRL(BaseIRLAlgorithm):
    """An example IRL algorithm"""
    # implementation here
    # ...

IRL_CONFIG_DOMAINS[ExampleIRL] = {
'gamma': {
'type': float,
'min': 0.0,
'max': 1.0,
'default': 0.9,
},
'hyperparam1': {
'type': 'categorical',
'values': ['a', 'b'],
'default': 'a',
},
'temperature': {
'type': float,
'optional': True, # allows value to be None
'min': 1e-10,
'max': float('inf'),
'default': None
}
}
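As a hypothetical usage sketch (not taken from the benchmark), an algorithm declared this way could then be instantiated with a partial config; presumably the base class preprocessing fills in the declared defaults for omitted keys. Here env, expert_trajs, and rl_alg_factory are assumed to exist already:

# 'temperature' is omitted and would presumably default to None,
# as declared in IRL_CONFIG_DOMAINS[ExampleIRL] above.
config = {'gamma': 0.99, 'hyperparam1': 'b'}
example_irl = ExampleIRL(env, expert_trajs, rl_alg_factory, config)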
Training¶
The BaseIRLAlgorithm class provides the abstract method train as the interface through which IRL algorithms are run. You have to override this method in your own implementation. The required parameters are:
- no_irl_iterations: an integer specifying for how many iterations the algorithm should be run.
- no_rl_episodes_per_irl_iteration: an integer specifying how many episodes the RL agent is allowed to run in each iteration.
- no_irl_episodes_per_irl_iteration: an integer specifying how many episodes the IRL algorithm is allowed to run in addition to the RL episodes in each iteration. This can be used to collect empirical information with the trained agent, e.g. feature counts from the currently optimal policy.
The train method returns a tuple containing the current reward function estimate in first position and the trained agent in second position.
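The following is only a rough sketch of how such a train method might use these parameters; the RL agent's train(no_episodes) method, self.env, and self.reward_function are assumptions made for illustration, not the actual interface (see the TODO below):

def train(self, no_irl_iterations: int,
          no_rl_episodes_per_irl_iteration: int,
          no_irl_episodes_per_irl_iteration: int):
    """Run the IRL algorithm (illustrative sketch only)."""
    agent = None
    for _ in range(no_irl_iterations):
        # Train a fresh RL agent on the current reward estimate.
        # NOTE: agent.train(no_episodes) is an assumed interface here.
        agent = self.rl_alg_factory(self.env)
        agent.train(no_rl_episodes_per_irl_iteration)
        # Use up to no_irl_episodes_per_irl_iteration additional episodes to
        # collect data with the trained agent and update the reward function
        # estimate (algorithm-specific, omitted here).
    # NOTE: storing the estimate as self.reward_function is an assumption.
    return self.reward_function, agent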
TODO: link here to a description of the interface provided by the RL algorithm. Show code example.
Useful methods¶
The BaseIRLAlgorithm class comes with some useful methods that can be used in different subclasses.
- There is a method to calculate discounted feature counts: feature_count
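A hypothetical call from inside a subclass might look as follows; the argument names and the attributes self.expert_trajs and self.config are assumptions, so check the actual signature of feature_count:

# Sketch only: argument names and attributes are assumptions.
expert_feature_counts = self.feature_count(
    trajs=self.expert_trajs, gamma=self.config['gamma'])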