RL Competition 2009

News

Slides from the workshop presentations are now available.

Results are now available. Congratulations to our winners.

Testing round closed. Thank you to all of our competitors!

Updated Testing application (R15) is now available HERE.

Proving application is now available HERE.

The rules, schedule, and prizes have been announced.

GAME ON! The software is now available.

Stay Informed

Sign up for our mailing list to receive important announcements about the competition.

Acrobot

The acrobot is a two-link, underactuated robot roughly analogous to a gymnast swinging on a high bar. The first joint (corresponding to the gymnast's hands on the bar) cannot exert torque, but the second joint (corresponding to the gymnast bending at the waist) can. The system has four continuous state variables: two joint positions and two joint velocities.


The version that we present here generalizes the original acrobot: it changes some aspects of the dynamics but (we hope) presents interesting challenges to Reinforcement Learning algorithms and practitioners.

Domain Description

The acrobot is an episodic task.

Observation Space: The system has four continuous observation variables: two joint positions and two joint velocities.

  1. θ1 ∈ [-π, π] angle between the horizontal base and the first link
  2. θ2 ∈ [-π, π] angle between the first and the second link
  3. θ1_dot ∈ [-4π, 4π] angular velocity of θ1
  4. θ2_dot ∈ [-9π, 9π] angular velocity of θ2
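
The ranges above can be written down directly as bounds. The following is a minimal sketch, assuming an agent wants each variable rescaled to [-1, 1] before feeding it to a function approximator; the names OBS_BOUNDS and normalize_observation are illustrative and not part of the competition software.

```python
import math

# (low, high) for each observation variable, copied from the list above.
OBS_BOUNDS = [
    (-math.pi, math.pi),          # theta1
    (-math.pi, math.pi),          # theta2
    (-4 * math.pi, 4 * math.pi),  # theta1_dot
    (-9 * math.pi, 9 * math.pi),  # theta2_dot
]


def normalize_observation(obs):
    """Map each variable linearly from its range to [-1, 1].

    Values outside the stated range are clipped to the nearest bound first.
    This helper is hypothetical; the competition does not prescribe it.
    """
    scaled = []
    for value, (low, high) in zip(obs, OBS_BOUNDS):
        clipped = min(max(value, low), high)
        scaled.append(2.0 * (clipped - low) / (high - low) - 1.0)
    return scaled
```

For example, normalize_observation([0.0, 0.0, 0.0, 0.0]) places every variable at the midpoint of its range.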

Action Space: In order to make the problem more interesting, we have changed the behaviour of the actions from the typical acrobot problem.

  • There are 8 discrete actions [0,1,...,7].
  • Actions will apply a torque of -1, 0, or +1 to the joint between the first and second link.
  • The effect of an action is an unknown function (f : S × A ➝ {-1,0,1}) that depends on the current state. For example, action 0 at state x may apply a torque of -1, while action 0 at state y may apply a torque of +1.
  • Tip: because the same action can have different effects in different states, the agent needs to balance exploration and exploitation carefully.
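
To make the state-dependent action mapping concrete, here is a small sketch. The function hypothetical_torque is an invented stand-in for the unknown f and is certainly not the mapping used in the competition; state_region is an assumed integer label for a coarse region of the state space. The epsilon_greedy helper shows one common way (not the only one) to trade off exploration against exploitation.

```python
import random

TORQUES = (-1, 0, 1)
NUM_ACTIONS = 8  # actions 0..7, as listed above


def hypothetical_torque(state_region, action):
    """Invented example of f : S x A -> {-1, 0, 1}.

    The real mapping is unknown to the agent and is NOT this function.
    The same action yields different torques in different state regions.
    """
    return TORQUES[(action + state_region) % 3]


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action,
    otherwise pick the action with the highest estimated value."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Under this made-up mapping, action 0 applies a torque of -1 in region 0 but +1 in region 2, so action values learned in one part of the state space need not carry over to another.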

Reward:

  • -1 per time step
  • 0 when goal state reached
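
As a worked example of this reward scheme, the sketch below computes the undiscounted return of one episode. It assumes that the step on which the goal is reached receives 0 and every earlier step receives -1; that reading, and the name episode_return, are assumptions rather than something the competition software defines.

```python
def episode_return(steps_to_goal):
    """Undiscounted return of an episode that reaches the goal
    on step number steps_to_goal (counting from 1).

    Assumes -1 on every step before the goal and 0 on the goal step.
    """
    if steps_to_goal < 1:
        raise ValueError("an episode has at least one step")
    rewards = [-1] * (steps_to_goal - 1) + [0]
    return sum(rewards)
```

Under this assumption, an agent that reaches the goal faster obtains a return closer to 0, so maximizing return is the same as minimizing the number of steps.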

References

see [http://www.cs.ualberta.ca/~sutton/book/ebook/node110.html]