
python - How to load large numpy file without memory dump into kaggle notebook?

I am working with a dataset to train a Keras deep learning model on a Kaggle notebook with a GPU. The dataset has a CSV that contains an id, referring to a .tif image in another directory, and a label, 1 or 0. I balanced the data and saved it using numpy.save() (see Code 1). This works fine, and afterwards I download the files and re-upload them as a dataset. However, when I try to use this dataset in a different notebook using numpy.load() (see Code 2), I get the following error:

Your notebook tried to allocate more memory than is available. It has restarted.

I have been following this tutorial, which is a few years old, so I might be doing unnecessary steps while building the array. How do I fix this memory issue?

I have already tried using pickle to save the array produced by Code 1, but it runs into the same memory error.

Code 1:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import cv2
import random

# Load the labels CSV and balance the classes: all positives plus an equal
# number of randomly sampled negatives, then 50,000 of each class
df = pd.read_csv("../input/histopathologic-cancer-detection/train_labels.csv")

ones_subset = df.loc[df["label"] == 1, :]
num_ones = len(ones_subset)

zeros_subset = df.loc[df["label"] == 0, :]
sampled_zeros = zeros_subset.sample(num_ones)

print(num_ones)
print(sampled_zeros)

df = pd.concat([ones_subset, sampled_zeros], ignore_index=True)
df = df.groupby("label").sample(50000).sample(frac=1).reset_index(drop=True)
print(df)

DATADIR = "../input/histopathologic-cancer-detection/train"
IMG_SIZE = 96
training_data = []
# Read each .tif as a grayscale array and pair it with its label
def create_training_data():
    for i in range(0,len(df.index)):
        img_path = df.iloc[i,0]+".tif"
        path = os.path.join(DATADIR, img_path)
        img_array = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        class_num = df.iloc[i,1]
        training_data.append([img_array,class_num])
        print(i)
create_training_data()

print(len(training_data))
random.shuffle(training_data)

X = []
y = []
count = 0

for features, label in training_data:
    X.append(features)
    y.append(label)
    count += 1
    print(count)

# Stack everything into a single (N, 96, 96, 1) array and save both arrays as .npy files
X = np.array(X).reshape(-1, IMG_SIZE, IMG_SIZE, 1)

np.save('image_arrays.npy', X)
np.save('labels.npy', y)
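Not something the question asked about directly (Code 1 runs fine for the asker), but if the saving step itself ever runs out of memory, one way to avoid holding every image in a Python list is to write them straight into a disk-backed .npy file. This is only a sketch: it reuses df, DATADIR and IMG_SIZE from Code 1 and assumes every image loads successfully.

import os
import cv2
import numpy as np
from numpy.lib.format import open_memmap

n = len(df.index)

# Disk-backed .npy file; rows are written to disk as they are filled
X_out = open_memmap('image_arrays.npy', mode='w+',
                    dtype=np.uint8, shape=(n, IMG_SIZE, IMG_SIZE, 1))
y_out = np.empty(n, dtype=np.uint8)

for i in range(n):
    path = os.path.join(DATADIR, df.iloc[i, 0] + ".tif")
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)   # (96, 96) uint8
    X_out[i, :, :, 0] = img
    y_out[i] = df.iloc[i, 1]

X_out.flush()                 # make sure everything is written to disk
np.save('labels.npy', y_out)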

Code 2:

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten, Conv2D, MaxPooling2D
from tensorflow.keras.callbacks import TensorBoard

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time
import shutil

# The second positional argument of np.load() is mmap_mode, so "r" already
# memory-maps the files instead of reading them into RAM at once
X = np.load("../input/metastasis-cancer-100000/image_arrays.npy", "r")
y = np.load("../input/metastasis-cancer-100000/labels.npy", "r")

# Dividing by 255.0 pulls the whole array into RAM as float64, and np.array()
# makes another copy on top of that -- this is the step that exceeds memory
X = np.array(X/255.0)
y = np.array(y)
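For scale, a rough back-of-the-envelope calculation (using the ~100,000 balanced samples of 96×96 grayscale images produced by Code 1) shows why the X / 255.0 line in particular blows past the notebook's limit: the saved uint8 array is under 1 GB, but dividing by a float promotes it to float64, and np.array() can briefly copy that result again, while Kaggle notebooks only provide roughly 13-16 GB of RAM.

n, h, w = 100_000, 96, 96          # sample count and image size from Code 1

uint8_gb   = n * h * w * 1 / 1e9   # as saved to .npy            ~0.92 GB
float64_gb = n * h * w * 8 / 1e9   # after X / 255.0 (float64)   ~7.4 GB

print(uint8_gb, float64_gb)        # np.array() can briefly double the
                                   # float64 figure again while copying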

UPDATE 1:
I tried np.memmap() from a comment suggestion. I'm not very familiar with it, so I'm not sure I'm using it correctly, but I tried the following code and the same memory error occurred:

# Note: np.memmap() maps the file as raw bytes (uint8 by default) and does not
# parse the .npy header, so dtype, shape and an offset would have to be given
# explicitly for the mapping to line up with the saved array
X = np.memmap("../input/metastasis-cancer-100000/image_arrays.npy", mode="r")
y = np.memmap("../input/metastasis-cancer-100000/labels.npy", mode="r")


1 Reply

The problem is solved by using:

X = np.load('../input/metastasis-cancer-100000/image_arrays.npy', mmap_mode='r')

This behaves like a normal NumPy array, but it is backed by disk rather than RAM: only the slices you actually access are read into memory. Keep in mind that it will be slower than an array held entirely in RAM.
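As a follow-up, note that running X = np.array(X / 255.0) on the memory-mapped array would pull everything back into RAM and defeat the purpose. One option, sketched below (the class name and batch size are illustrative), is to scale one batch at a time with a tf.keras.utils.Sequence and pass that to model.fit():

import math
import numpy as np
import tensorflow as tf

X = np.load('../input/metastasis-cancer-100000/image_arrays.npy', mmap_mode='r')
y = np.load('../input/metastasis-cancer-100000/labels.npy', mmap_mode='r')

class MemmapBatches(tf.keras.utils.Sequence):
    """Yields (images, labels) batches, scaling only one batch at a time."""
    def __init__(self, X, y, batch_size=64):
        self.X, self.y, self.batch_size = X, y, batch_size

    def __len__(self):
        return math.ceil(len(self.X) / self.batch_size)

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        batch_x = np.asarray(self.X[sl], dtype=np.float32) / 255.0  # copies just this slice
        batch_y = np.asarray(self.y[sl])
        return batch_x, batch_y

# model.fit(MemmapBatches(X, y), epochs=10)   # instead of model.fit(X, y, ...)

Only one batch ever exists in RAM as float32, so the multi-gigabyte float array from Code 2 is never materialised.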

