
A practical guide to efficiently managing GPU memory in TensorFlow 2

2024-07-08


Preface

When using TensorFlow 2 for training or prediction, it is crucial to properly manage GPU memory. Failure to manage and release GPU memory effectively can lead to memory leaks, which in turn affect subsequent computing tasks. In this article, we explore several methods for releasing GPU memory effectively, including both conventional methods and methods for handling forcibly terminated tasks.

1. Conventional GPU memory management methods
1. Reset the default graph

Each time you run a new TensorFlow graph, you can call tf.keras.backend.clear_session() to clear the current graph state and release the memory it holds.

import tensorflow as tf
tf.keras.backend.clear_session()
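A common pattern is to call clear_session() between successive model builds, for example in a hyperparameter sweep, so that graph state from earlier iterations does not accumulate. A minimal sketch (the layer sizes and the values swept over are placeholders):

import tensorflow as tf

for units in [32, 64, 128]:  # placeholder sweep values
    tf.keras.backend.clear_session()  # drop graph state left over from the previous iteration
    model = tf.keras.models.Sequential([
        tf.keras.layers.Dense(units, activation='relu', input_shape=(32,)),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    # ... train and evaluate the model here, then let it go out of scope ...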
2. Limit GPU memory usage

By setting a GPU memory usage policy, you can avoid allocating more GPU memory than a task needs. Two options are shown below.

  • Grow GPU memory usage on demand

    import tensorflow as tf
    
    gpus = tf.config.experimental.list_physical_devices('GPU')
    if gpus:
        try:
            # allocate GPU memory incrementally instead of reserving it all up front
            for gpu in gpus:
                tf.config.experimental.set_memory_growth(gpu, True)
        except RuntimeError as e:
            print(e)  # memory growth must be set before the GPU is initialized
    
  • Cap GPU memory usage at a fixed limit (see the non-experimental variant after this list)

    import tensorflow as tf
    
    gpus = tf.config.experimental.list_physical_devices('GPU')
    if gpus:
        try:
            # create a virtual device capped at a fixed amount of GPU memory
            tf.config.experimental.set_virtual_device_configuration(
                gpus[0],
                [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4096)])  # limit to 4096 MB
        except RuntimeError as e:
            print(e)  # virtual devices must be configured before the GPU is initialized
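
Recent TF 2.x releases also expose this functionality under non-experimental names; a sketch assuming tf.config.set_logical_device_configuration and tf.config.LogicalDeviceConfiguration are available in your TensorFlow version:

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        # same effect as above, via the non-experimental API
        tf.config.set_logical_device_configuration(
            gpus[0],
            [tf.config.LogicalDeviceConfiguration(memory_limit=4096)])  # limit to 4096 MB
    except RuntimeError as e:
        print(e)  # virtual devices must be configured before the GPU is initialized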
    
3. Manually release GPU memory

After training or prediction, you can combine Python's gc module with TensorFlow's memory management functions to manually release GPU memory.

import tensorflow as tf
import gc

tf.keras.backend.clear_session()  # drop TensorFlow's graph state
gc.collect()                      # force collection of unreachable Python objects
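To check what this actually freed, TF 2.5+ offers tf.config.experimental.get_memory_info. Note that TensorFlow's allocator typically keeps freed memory reserved in its own pool rather than returning it to the operating system, so nvidia-smi may still report high usage; the figures below reflect TensorFlow's internal view. A sketch assuming at least one visible GPU:

import tensorflow as tf

# requires TF 2.5+ and a visible GPU
info = tf.config.experimental.get_memory_info('GPU:0')
print(f"current: {info['current']} bytes, peak: {info['peak']} bytes")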
4. Use a with statement to manage the context

Wrapping training or prediction code in a with statement scopes the resources it uses, so they can be released automatically when the block exits.

import numpy as np
import tensorflow as tf

def train_model():
    with tf.device('/GPU:0'):
        model = tf.keras.models.Sequential([
            tf.keras.layers.Dense(64, activation='relu', input_shape=(32,)),
            tf.keras.layers.Dense(10, activation='softmax')
        ])
        model.compile(optimizer='adam', loss='categorical_crossentropy')
        # placeholder training data; substitute your real X_train and y_train
        X_train = np.random.rand(1000, 32).astype('float32')
        y_train = tf.keras.utils.to_categorical(np.random.randint(10, size=1000), num_classes=10)
        model.fit(X_train, y_train, epochs=10)

train_model()
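
Note that tf.device mainly scopes where the ops inside the block are placed; to drop the model and its graph state after train_model() returns, it can be combined with methods 1 and 3 above:

import gc
import tensorflow as tf

# after train_model() returns, release what it left behind
tf.keras.backend.clear_session()  # drop the graph state it created
gc.collect()                      # collect lingering Python references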
2. GPU memory management when forcibly terminating a task

Sometimes we need to forcibly terminate a TensorFlow task to free its GPU memory. In this case, Python's multiprocessing or os module can manage the resources reliably.

1. Use the multiprocessing module

By running the TensorFlow task in a separate process, the entire process can be terminated when needed, which releases its GPU memory.

import multiprocessing as mp
import numpy as np
import tensorflow as tf
import time

def train_model():
    model = tf.keras.models.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(32,)),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    # placeholder training data; substitute your real X_train and y_train
    X_train = np.random.rand(1000, 32).astype('float32')
    y_train = tf.keras.utils.to_categorical(np.random.randint(10, size=1000), num_classes=10)
    model.fit(X_train, y_train, epochs=10)

if __name__ == '__main__':
    p = mp.Process(target=train_model)
    p.start()
    time.sleep(60)  # e.g. wait 60 seconds
    p.terminate()   # kill the child process; its GPU memory is released with it
    p.join()        # wait for the process to terminate completely
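A slightly gentler variant (reusing train_model and the imports above) gives the process a window to finish on its own and only forces termination if it is still alive:

if __name__ == '__main__':
    p = mp.Process(target=train_model)
    p.start()
    p.join(timeout=60)  # wait up to 60 seconds for a clean exit
    if p.is_alive():    # still running after the deadline
        p.terminate()   # force it down; its GPU memory is released with the process
        p.join()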
2. Use the os module to terminate the process

By recording the process ID and using the os module, the TensorFlow process can be forcibly killed.

import os
import signal
import time
import multiprocessing as mp
import numpy as np
import tensorflow as tf

def train_model():
    # record this process's PID so another process can kill it later
    pid = os.getpid()
    with open('pid.txt', 'w') as f:
        f.write(str(pid))

    model = tf.keras.models.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(32,)),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    # placeholder training data; substitute your real X_train and y_train
    X_train = np.random.rand(1000, 32).astype('float32')
    y_train = tf.keras.utils.to_categorical(np.random.randint(10, size=1000), num_classes=10)
    model.fit(X_train, y_train, epochs=10)

if __name__ == '__main__':
    p = mp.Process(target=train_model)
    p.start()
    time.sleep(60)  # e.g. wait 60 seconds
    with open('pid.txt', 'r') as f:
        pid = int(f.read())
    os.kill(pid, signal.SIGKILL)  # forcibly kill the training process
    p.join()
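
Note that signal.SIGKILL is only defined on Unix-like systems; on Windows, use p.terminate() instead. Killing the process works in either case because the CUDA context, and therefore the GPU memory it holds, is torn down together with the process.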

Summary

When using TensorFlow 2 for training or prediction, it is important to properly manage and release GPU memory. Resetting the graph with clear_session(), limiting how much memory TensorFlow may allocate, triggering garbage collection manually, and scoping work with a with statement all help avoid GPU memory leaks. When a task must be terminated forcibly, the multiprocessing and os modules ensure that the GPU memory is released promptly. Together, these methods keep GPU resources used efficiently and improve the stability and performance of computing tasks.