Submission error, no output

Hi,
I have been having trouble with submitting my code, and cant make head or tail of where it is actually failing. No where does it show that my scripts are failing, although i’m pretty certain that it has got to do with something in my code.

The error_report.txt says:

Submission failed with the following error:
<class 'trifingerpro_runner.data_runner.DataRunner'>

the stdout.txt says:

/beegfs/pool2/shared/scripts/subrob_run.sh: Mon Jul 19 18:35:34 CEST 2021 @ roboch3
[SUBMISSION INFO 2021-07-19 18:35:34,912] Run bash /shared/scripts/run_scripts/rrc2021/run_submission.sh
Set ROS Domain ID to 3.
[INFO] [1626712538.917752397] [trifinger_launcher]: Starting...
[INFO] [1626712551.256440846] [trifinger_launcher]: Goal: [[0, [0.07866100092503843, -0.050372920269793786, 0.08806933544062867]], [30000, [0.016859485000966686, 0.01694575669574699, 0.08345118282914502]], [40000, [0.031437499800654495, -0.025054352965863783, 0.08316305812214558]], [50000, [0.11707277814061176, 0.005274245327753821, 0.09563246276576348]], [60000, [0.01383340398852101, 0.06053858461265269, 0.04911745107109536]], [70000, [-0.08209615522348125, -0.014516901367696417, 0.09121500744583602]], [80000, [0.04444416503769732, -0.07515544536299136, 0.07944166635354832]], [90000, [-0.11851966113100951, 0.0010157043018426794, 0.089484872035157]], [100000, [0.042354937521218154, 0.06846986327972195, 0.03854694974055467]], [110000, [-0.05900746543507233, 0.06125452352786036, 0.08681097352648617]]]
[INFO] [1626712551.271647799] [trifinger_launcher]: Start data node
[INFO] [1626712555.767109737] [trifinger_data]: Start camera data
Traceback (most recent call last):
  File "/opt/blmc_ei/install/robot_fingers/lib/robot_fingers/trifinger_data_backend", line 149, in <module>
    sys.exit(main())
  File "/opt/blmc_ei/install/robot_fingers/lib/robot_fingers/trifinger_data_backend", line 77, in main
    camera_data = tricamera.MultiProcessData(
RuntimeError: Permission denied
/beegfs/pool2/shared/scripts/subrob_run.sh: [POST-FLIGHT] Mon Jul 19 18:35:57 CEST 2021 @ roboch3
Post submission processing successful.
{"mode":"full","isActive":false}

and the stderr.txt says:

pybullet build time: Dec 23 2020 01:48:06
CRITICAL:root:FAILURE: <class 'trifingerpro_runner.data_runner.DataRunner'>
Traceback (most recent call last):
  File "/beegfs/pool2/shared/scripts/run_scripts/rrc2021/trifingerpro_runner/run_submission.py", line 751, in run
    node.run(config, ws_dir, backend_type)
  File "/beegfs/pool2/shared/scripts/run_scripts/rrc2021/trifingerpro_runner/run_submission.py", line 570, in run
    config, ws_dir, backend_type, goal, backend_kwargs
  File "/beegfs/pool2/shared/scripts/run_scripts/rrc2021/trifingerpro_runner/run_submission.py", line 498, in _run_robot
    data_runner, lambda: self.data_node_ready
  File "/beegfs/pool2/shared/scripts/run_scripts/rrc2021/trifingerpro_runner/run_submission.py", line 252, in _wait_until_node_is_ready
    raise NodeUnexpectedTerminationError(str(type(runner)))
trifingerpro_runner.run_submission.NodeUnexpectedTerminationError: <class 'trifingerpro_runner.data_runner.DataRunner'>
{"mode":"full","isActive":false}

there is no user_stderr.txt user_stdout.txt or report.json created. Any idea what I’m doing wrong.
Thanks in advance

I’m very sorry, this was my fault. I left some of the robots in a bad state after running the evaluation.
It should be fixed now.

Ok, the user_stderr.txt and the user_stdout.txt files have been generated, yet the same errors remain.

To give you an update,
I have this in my stderr.txt:

pybullet build time: Dec 23 2020 01:48:06
ERROR:root:User code exited with non-zero exist status: -9
[ERROR] [1626776289.671907578] [trifinger_launcher]: Finished with error.

I have this in my stdout.txt:

/beegfs/pool2/shared/scripts/subrob_run.sh: Tue Jul 20 12:14:51 CEST 2021 @ roboch3
[SUBMISSION INFO 2021-07-20 12:14:51,244] Run bash /shared/scripts/run_scripts/rrc2021/run_submission.sh
Set ROS Domain ID to 3.
[INFO] [1626776095.199261417] [trifinger_launcher]: Starting...
[INFO] [1626776106.414208733] [trifinger_launcher]: Goal: [[0, [-0.03394670573073381, 0.052045098044664694, 0.054060251899048554]], [30000, [-0.13184681397466017, 0.025293071084366917, 0.04625902568555944]], [40000, [-0.07151435799834593, -0.10594201093891471, 0.07203309041983133]], [50000, [0.06720465242811723, 0.07727312408085596, 0.04927916162392177]], [60000, [0.1343893525307558, -0.007320505576058287, 0.07709437422465021]], [70000, [-0.09239774157114984, 0.02500877379928202, 0.049908398994807174]], [80000, [-0.05232210707042553, -0.053568489791761265, 0.059825496261361785]], [90000, [0.040993561791286985, -0.11257753174712565, 0.04560681574221171]], [100000, [-0.12607199774705302, -0.01818986116176487, 0.053035900617761395]], [110000, [0.01331133063574705, -0.033558108248437774, 0.06491763718554802]]]
[INFO] [1626776106.432779204] [trifinger_launcher]: Start data node
[INFO] [1626776111.029424067] [trifinger_data]: Start camera data
[INFO] [1626776111.036913921] [trifinger_data]: Start robot data
Using multi process time series.
[INFO] [1626776111.057758209] [trifinger_data]: Initialize camera logger with buffer size 1800
[INFO] [1626776111.058450740] [trifinger_data]: Data backend is ready
[INFO] [1626776111.129015222] [trifinger_launcher]: Data node is ready
[INFO] [1626776130.126354030] [trifinger_backend]: Start camera backend
[INFO] [1626776135.161826195] [trifinger_backend]: Camera backend ready.
[INFO] [1626776135.163147695] [trifinger_backend]: Start robot backend
Using multi process time series.
Load robot driver configuration from file '/etc/trifingerpro/trifingerpro.yml'
Configuration:
	 can_ports: can0 can1 can2 can3 can4 can5
	 max_current_A: 2.2
	 has_endstop: 1
	 move_to_position_tolerance_rad: 0.1
	 homing_method: endstop_release
	 calibration:
		 endstop_search_torques_Nm:  0.3  0.3 -0.2  0.3  0.3 -0.2  0.3  0.3 -0.2
		 move_steps: 500
	 safety_kd: 0.09 0.09 0.05 0.09 0.09 0.05 0.09 0.09 0.05
	 position_control_gains:
		 kp: 9 9 9 9 9 9 9 9 9
		 kd: 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
	 hard_position_limits_
		 lower:    -1 -1.67  -2.8    -1 -1.67  -2.8    -1 -1.67  -2.8
		 upper:  1.5 1.67  2.8  1.5 1.67  2.8  1.5 1.67  2.8
	 soft_position_limits_
		 lower: -0.33     0  -2.7 -0.33     0  -2.7 -0.33     0  -2.7
		 upper:    1 1.57    0    1 1.57    0    1 1.57    0
	 home_offset_rad: -2.136 -2.442   2.71 -2.136 -2.442   2.71 -2.136 -2.442   2.71
	 initial_position_rad:    0  0.9 -1.7    0  0.9 -1.7    0  0.9 -1.7
	 shutdown_trajectory:
		 - target:    0  0.9 -1.7    0  0.9 -1.7    0  0.9 -1.7
		   move_steps: 3000
		 - target: 0.0161 0.8998 -0.977 0.0161 0.8998 -0.977 0.0161 0.8998 -0.977
		   move_steps: 1000
	 run_duration_logfiles:
		 - /log/run_duration_f0.log
		 - /log/run_duration_f120.log
		 - /log/run_duration_f240.log

[INFO] [1626776153.495448513] [trifinger_backend]: Robot backend is ready
[INFO] [1626776153.499985357] [trifinger_launcher]: Backend node is ready
[INFO] [1626776153.510559414] [trifinger_launcher]: Monitor nodes...
waiting for board and motors to be ready 
board and motors are ready 
waiting for board and motors to be ready 
board and motors are ready 
waiting for board and motors to be ready 
board and motors are ready 
waiting for board and motors to be ready 
board and motors are ready 
waiting for board and motors to be ready 
board and motors are ready 
waiting for board and motors to be ready 
board and motors are ready 
Start homing.
Reached end stop.
Finished homing.Error: First action was not provided in time
Robot is shut down.
Write run duration log /log/run_duration_f0.log
Write run duration log /log/run_duration_f120.log
Write run duration log /log/run_duration_f240.log
[FATAL] [1626776270.552528609] [trifinger_backend]: Expected termination reason 2 (MAXIMUM_NUMBER_OF_ACTIONS_REACHED) but got -2
[INFO] [1626776273.607191454] [trifinger_launcher]: State (Data: RUNNING, Robot: RUNNING, User: RUNNING) --> (Data: RUNNING, Robot: TERMINATED_ERROR, User: RUNNING)
[INFO] [1626776283.616472204] [trifinger_launcher]: Kill user node.
[INFO] [1626776286.659770595] [trifinger_launcher]: State (Data: RUNNING, Robot: TERMINATED_ERROR, User: RUNNING) --> (Data: RUNNING, Robot: TERMINATED_ERROR, User: TERMINATED_ERROR)
[INFO] [1626776286.662430457] [trifinger_launcher]: Shut down data node.
[INFO] [1626776286.667131610] [trifinger_data]: Node trifinger_data received shutdown request.
[INFO] [1626776286.669893796] [trifinger_data]: Start camera logging
[INFO] [1626776286.671788205] [trifinger_data]: Save recorded camera data to file /output/camera_data.dat
[INFO] [1626776286.755994197] [trifinger_data]: Save robot data to file /output/robot_data.dat
[INFO] [1626776289.668695695] [trifinger_launcher]: State (Data: RUNNING, Robot: TERMINATED_ERROR, User: TERMINATED_ERROR) --> (Data: TERMINATED_SUCCESS, Robot: TERMINATED_ERROR, User: TERMINATED_ERROR)
/beegfs/pool2/shared/scripts/subrob_run.sh: [POST-FLIGHT] Tue Jul 20 12:18:09 CEST 2021 @ roboch3
Post submission processing successful.
{"mode":"full","isActive":false}

I have this in my user_stderr.txt:

pybullet build time: Dec 23 2020 01:48:06
WARNING:root:Argument blacklist is deprecated. Please use denylist.
WARNING:root:Argument blacklist is deprecated. Please use denylist.
WARNING:root:Argument whitelist is deprecated. Please use allowlist.
WARNING:root:Argument whitelist is deprecated. Please use allowlist.
WARNING:root:Argument whitelist is deprecated. Please use allowlist.
WARNING:root:Argument blacklist is deprecated. Please use denylist.
WARNING:root:Argument blacklist is deprecated. Please use denylist.
WARNING:root:Argument blacklist is deprecated. Please use denylist.
WARNING:root:Argument blacklist is deprecated. Please use denylist.
WARNING:root:Argument blacklist is deprecated. Please use denylist.
WARNING:root:Argument blacklist is deprecated. Please use denylist.
WARNING:root:Argument blacklist is deprecated. Please use denylist.
WARNING:root:Argument blacklist is deprecated. Please use denylist.
WARNING:root:Argument blacklist is deprecated. Please use denylist.
WARNING:root:Argument blacklist is deprecated. Please use denylist.
WARNING:root:Argument blacklist is deprecated. Please use denylist.
WARNING:root:Argument blacklist is deprecated. Please use denylist.
WARNING:root:Argument blacklist is deprecated. Please use denylist.
WARNING:root:Argument blacklist is deprecated. Please use denylist.
WARNING:root:Argument blacklist is deprecated. Please use denylist.
{"mode":"full","isActive":false}

and this in my user_stdout.txt:

Using multi process time series.
{"mode":"full","isActive":false}

I am very sure that there is something in my script that is wrong. I did put a lot of print statements in my scripts, but none of them seem to be printed. Advice?

Note: its the same situation while im testing in simulation as well

Thank you :smiley:

This error is different from the one in the previous post. The stdout.txt contains the following line, which gives some information on what went wrong:

Error: First action was not provided in time

Once started, the backend has a timeout of 2 minutes within which the first action as to be provided by the user code, otherwise it shuts down with this error. Maybe your script is hanging somewhere in the beginning before sending the first action.

This is unfortunately a bit annoying to debug, I would suggest adding even more prints (starting at the very top of your script, unless you already have this) to find out at which point the execution stops. You should be able to do this in simulation, which should be faster then submitting to the real robots.

I don’t know if this is the reason in your case but since it’s a common pitfall: Keep in mind that the backend only starts sending observations after the first action has been sent. This means you have to first send an action before you can get any observation.
Doing something like platform.get_robot_observation(0) before sending an action will block forever. If you first need an observation to determine your action, you can initially send a zero-torque action to start the loop (see also here).

Got it, thanks for that. I seemed to have solved that issue now. Although i seem to be getting another issue in terms of the number of actions I am sending

my user_stderr.txt is as follows:

Traceback (most recent call last):
  File "/ws/install/rrc21/lib/rrc21/real_trajectory_example_with_gym", line 11, in <module>
    load_entry_point('rrc21==2.0.0', 'console_scripts', 'real_trajectory_example_with_gym')()
  File "/ws/install/rrc21/lib/python3.8/site-packages/rrc_example_package/scripts/real_trajectory_example_with_gym.py", line 87, in main
    observation, reward, is_done, info = env.step(action)
  File "/ws/install/rrc21/lib/python3.8/site-packages/env/wrappers.py", line 225, in step
    return self.env.step(action)
  File "/ws/install/rrc21/lib/python3.8/site-packages/env/wrappers.py", line 191, in step
    return self.env.step(action['position'])
  File "/usr/local/lib/python3.8/dist-packages/gym/core.py", line 268, in step
    observation, reward, done, info = self.env.step(action)
  File "/ws/install/rrc21/lib/python3.8/site-packages/env/cube_env.py", line 268, in step
    t = self.real_platform.append_desired_action(robot_action)
RuntimeError: Backend Error: Maximum number of actions reached.

This happens when you send an action after the backend already reached the end of the episode. You can avoid this by checking the time index t and stopping when t <= EPISODE_LENGTH - 1. This can be a bit unreliable, though, due to race conditions.

The error is not really critical, though, since at this point the end of the episode is already reached and the backend is shutting down anyway, so you can also just catch the exception.

I dont think this is the case, because on printing the time steps, this error pops up roughly ard 80,000 time steps, where as episode length is 120,000. Am I missing something here?

That’s weird. Can you check the resulting robot_data.dat, how many steps are logged there?

Also when printing the time steps, are you using the value returned by append_desired_action or are you using a custom counter?

1 Like

Ah, you were right, I still had my simulation step counter on. Thanks for all the help.