Recovery
When using NSML, a session may have to stop and be recovered for a number of reasons. The recovery method depends on whether you use a function provided by the NSML library or an NSML command. The recovery methods are as follows:
If you need to get a model from an existing session
Within the code, you can load the model of the target session with nsml.load.
If you can recreate the session using NSML commands
The nsml fork command allows you to copy an existing session to resume learning from the last saved model, or from a specific model.
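As a rough illustration of the first method, the sketch below wraps a loader such as nsml.load in a small helper. It assumes that the loader accepts `checkpoint` and `session` keyword arguments; that signature, the helper name `restore_model`, and the session and checkpoint names in the usage comment are assumptions made for illustration, so check them against your NSML version before relying on them.

```python
from typing import Any, Callable


def restore_model(load: Callable[..., Any], session: str, checkpoint: str) -> Any:
    """Load a saved model from an existing session through the given loader.

    ASSUMPTION: `load` behaves like nsml.load and accepts the keyword
    arguments `checkpoint` and `session`. This helper is hypothetical and
    is not part of the NSML library.
    """
    if not session:
        raise ValueError("session must name the session to load from")
    if not checkpoint:
        raise ValueError("checkpoint must name the saved model to load")
    # Delegate to the loader; any save/load functions bound earlier in the
    # training script are expected to restore the actual model state.
    return load(checkpoint=checkpoint, session=session)


# Hypothetical usage inside a training script:
#   import nsml
#   restore_model(nsml.load, session="user/dataset/1", checkpoint="10")
```

Passing the loader in as an argument keeps the helper testable outside an NSML environment, where the nsml module is not available.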
Three types of situations require recovery. Each situation and its countermeasure are described below.
- If the session exits abnormally
The session dies without notice, for example because of an out-of-memory (OOM) error. It can be recovered using the methods described above.
- If the session is not responding
This is the case where commands that access the session, such as nsml [rm, stop, logs, …], do not respond. Resolve it with the methods described above, or report it to the NSML admin.
- If an ‘NSML warning’ appears in the session log
An error occurred while communicating with the NSML server. If you report the log contents to the NSML admin, you can get a recovery method suited to the situation.
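The three situations above can be summarized as a small lookup table. This is only a sketch for use in your own tooling: the `Situation` enum, the `RECOVERY_STEPS` table, and the `recovery_steps` function are hypothetical names and are not part of NSML.

```python
from enum import Enum
from typing import Dict, List


class Situation(Enum):
    """The three situations described in this section."""
    ABNORMAL_EXIT = "session exited abnormally"
    NOT_RESPONDING = "session is not responding"
    NSML_WARNING = "NSML warning in the session log"


# Recommended actions per situation, taken from the text above.
RECOVERY_STEPS: Dict[Situation, List[str]] = {
    Situation.ABNORMAL_EXIT: [
        "load the model from the session with nsml.load",
        "resume from a saved model with nsml fork",
    ],
    Situation.NOT_RESPONDING: [
        "load the model from the session with nsml.load",
        "resume from a saved model with nsml fork",
        "report to the NSML admin",
    ],
    Situation.NSML_WARNING: [
        "report the log contents to the NSML admin",
    ],
}


def recovery_steps(situation: Situation) -> List[str]:
    """Return a copy of the recommended actions for the given situation."""
    return list(RECOVERY_STEPS[situation])
```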
If saving the model fails because of a storage problem, the model cannot be recovered (unless it was saved in the session, in which case it can be recovered with the first method above). To avoid this situation, you need to manage your model storage. You can check the size of the current session directly with nsml ps -a. If the available capacity is running low, the NSML admin will send you an email in advance.