
python - How to load large numpy file without memory dump into kaggle notebook?

I am working with a dataset to train a Keras deep learning model on a Kaggle notebook with a GPU. The dataset has a CSV that contains an id, referring to a .tif image in another directory, and a label, 1 or 0. I balanced the data and saved it using numpy.save() (see Code 1). This works fine, and afterwards I download the files and re-upload them as a dataset. However, when I try to use this dataset in a different notebook using numpy.load() (see Code 2), I get the following error:

Your notebook tried to allocate more memory than is available. It has restarted.

I have been following this tutorial, which is a few years old, so I might be doing unnecessary steps while building the array. How do I fix this memory issue?

I have already tried using pickle to save the array produced by Code 1, but it runs into the same memory error.

Code 1:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import cv2
import random

# Load the labels CSV and balance the classes: all positives plus an equal
# number of randomly sampled negatives, then 50,000 of each class
df = pd.read_csv("../input/histopathologic-cancer-detection/train_labels.csv")

ones_subset = df.loc[df["label"] == 1, :]
num_ones = len(ones_subset)

zeros_subset = df.loc[df["label"] == 0, :]
sampled_zeros = zeros_subset.sample(num_ones)

print(num_ones)
print(sampled_zeros)

df = pd.concat([ones_subset, sampled_zeros], ignore_index=True)
df = df.groupby("label").sample(50000).sample(frac=1).reset_index(drop=True)
print(df)

DATADIR = "../input/histopathologic-cancer-detection/train"
IMG_SIZE = 96
training_data = []
# Read each .tif as a grayscale array and pair it with its label
def create_training_data():
    for i in range(0,len(df.index)):
        img_path = df.iloc[i,0]+".tif"
        path = os.path.join(DATADIR, img_path)
        img_array = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        class_num = df.iloc[i,1]
        training_data.append([img_array,class_num])
        print(i)
create_training_data()

print(len(training_data))
random.shuffle(training_data)

X = []
y = []
count = 0

for features, label in training_data:
    X.append(features)
    y.append(label)
    count += 1
    print(count)

# Stack everything into a single (N, 96, 96, 1) array and save both arrays as .npy files
X = np.array(X).reshape(-1, IMG_SIZE, IMG_SIZE, 1)

np.save('image_arrays.npy', X)
np.save('labels.npy', y)
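Not something the question asked about directly (Code 1 runs fine for the asker), but if the saving step itself ever runs out of memory, one way to avoid holding every image in a Python list is to write them straight into a disk-backed .npy file. This is only a sketch: it reuses df, DATADIR and IMG_SIZE from Code 1 and assumes every image loads successfully.

import os
import cv2
import numpy as np
from numpy.lib.format import open_memmap

n = len(df.index)

# Disk-backed .npy file; rows are written to disk as they are filled
X_out = open_memmap('image_arrays.npy', mode='w+',
                    dtype=np.uint8, shape=(n, IMG_SIZE, IMG_SIZE, 1))
y_out = np.empty(n, dtype=np.uint8)

for i in range(n):
    path = os.path.join(DATADIR, df.iloc[i, 0] + ".tif")
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)   # (96, 96) uint8
    X_out[i, :, :, 0] = img
    y_out[i] = df.iloc[i, 1]

X_out.flush()                 # make sure everything is written to disk
np.save('labels.npy', y_out)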

Code 2:

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten, Conv2D, MaxPooling2D
from tensorflow.keras.callbacks import TensorBoard

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time
import shutil

# The second positional argument of np.load() is mmap_mode, so "r" already
# memory-maps the files instead of reading them into RAM at once
X = np.load("../input/metastasis-cancer-100000/image_arrays.npy", "r")
y = np.load("../input/metastasis-cancer-100000/labels.npy", "r")

# Dividing by 255.0 pulls the whole array into RAM as float64, and np.array()
# makes another copy on top of that -- this is the step that exceeds memory
X = np.array(X/255.0)
y = np.array(y)
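For scale, a rough back-of-the-envelope calculation (using the ~100,000 balanced samples of 96×96 grayscale images produced by Code 1) shows why the X / 255.0 line in particular blows past the notebook's limit: the saved uint8 array is under 1 GB, but dividing by a float promotes it to float64, and np.array() can briefly copy that result again, while Kaggle notebooks only provide roughly 13-16 GB of RAM.

n, h, w = 100_000, 96, 96          # sample count and image size from Code 1

uint8_gb   = n * h * w * 1 / 1e9   # as saved to .npy            ~0.92 GB
float64_gb = n * h * w * 8 / 1e9   # after X / 255.0 (float64)   ~7.4 GB

print(uint8_gb, float64_gb)        # np.array() can briefly double the
                                   # float64 figure again while copying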

UPDATE 1:
I tried np.memmap() from a comment suggestion. I'm not very familiar with it, so I'm not sure I'm using it correctly, but I tried the following code and the same memory error occurred:

# Note: np.memmap() maps the file as raw bytes (uint8 by default) and does not
# parse the .npy header, so dtype, shape and an offset would have to be given
# explicitly for the mapping to line up with the saved array
X = np.memmap("../input/metastasis-cancer-100000/image_arrays.npy", mode="r")
y = np.memmap("../input/metastasis-cancer-100000/labels.npy", mode="r")


1 Reply

The problem is solved by using:

X = np.load('../input/metastasis-cancer-100000/image_arrays.npy', mmap_mode='r')

This behaves like a normal NumPy array, but it is backed by disk rather than RAM: only the slices you actually access are read into memory. Keep in mind that it will be slower than an array held entirely in RAM.
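As a follow-up, note that running X = np.array(X / 255.0) on the memory-mapped array would pull everything back into RAM and defeat the purpose. One option, sketched below (the class name and batch size are illustrative), is to scale one batch at a time with a tf.keras.utils.Sequence and pass that to model.fit():

import math
import numpy as np
import tensorflow as tf

X = np.load('../input/metastasis-cancer-100000/image_arrays.npy', mmap_mode='r')
y = np.load('../input/metastasis-cancer-100000/labels.npy', mmap_mode='r')

class MemmapBatches(tf.keras.utils.Sequence):
    """Yields (images, labels) batches, scaling only one batch at a time."""
    def __init__(self, X, y, batch_size=64):
        self.X, self.y, self.batch_size = X, y, batch_size

    def __len__(self):
        return math.ceil(len(self.X) / self.batch_size)

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        batch_x = np.asarray(self.X[sl], dtype=np.float32) / 255.0  # copies just this slice
        batch_y = np.asarray(self.y[sl])
        return batch_x, batch_y

# model.fit(MemmapBatches(X, y), epochs=10)   # instead of model.fit(X, y, ...)

Only one batch ever exists in RAM as float32, so the multi-gigabyte float array from Code 2 is never materialised.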

