Recovery

When using NSML, a session may have to stop and be recovered for a number of reasons. The recovery method depends on whether you use a function provided by the NSML library or an NSML command. The recovery methods are as follows:

  1. If you need to get a model from an existing session

    • Within the code, you can load the model of the target session with nsml.load.

  2. When you can regenerate a session using NSML commands

    • The nsml fork command allows you to copy an existing session to resume learning from the last saved model, or from a specific model.
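Method 1 can be sketched in Python as follows. This is a minimal sketch, not the library's reference usage: the helper name restore_from_session, the session name "user/dataset/1", and the checkpoint name "best" are hypothetical, and the keyword arguments passed to nsml.load (checkpoint, session) are an assumption that should be checked against your NSML version. It also assumes the model's save and load functions have already been registered with NSML before the call.

```python
try:
    import nsml  # only importable inside an NSML environment
except ImportError:
    nsml = None


def restore_from_session(session, checkpoint, loader=None):
    """Load `checkpoint` of `session` into the current model.

    `loader` defaults to nsml.load. The keyword names used below
    (checkpoint=, session=) are an assumption about its signature.
    """
    if loader is None:
        if nsml is None:
            raise RuntimeError("nsml is not installed; pass a loader explicitly")
        loader = nsml.load
    return loader(checkpoint=checkpoint, session=session)


# Hypothetical usage inside a training script:
# restore_from_session("user/dataset/1", "best")
```

Passing the loader as a parameter keeps the helper usable outside an NSML environment, for example in a local dry run with a stand-in function.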

Three kinds of situations require recovery. Each situation and its countermeasures are described below.

  • If the session exits abnormally
    • The session dies without notice, for example because of an out-of-memory (OOM) error. It can be recovered with method 1 or 2 above.

  • If the session is not responding
    • Commands that access the session, such as nsml [rm, stop, logs, …], do not respond. Recover with the methods above, or report the problem to the NSML admin.

  • If an ‘NSML warning’ appears in the session log
    • An error occurred while communicating with the NSML server. If you report the log contents to the NSML admin, you can get a recovery method suited to the situation.

    • If saving the model fails because of a storage problem, the model cannot be recovered (if it was saved within the session, it can be recovered with method 1). To avoid this situation, you need to manage your model storage. You can check the size of the current session directly with nsml ps -a. If the available capacity is running low, the NSML admin will send you an email in advance.