
python - How come the input pickle file for my neural network is 19 GB?

So I am trying to train a neural network for the knapsack problem, and to do that I have created an input pickle file that is a list of 5,000 elements. Each element has 100 weights and values, which are floats, plus the total weight of the items, expressed as an int. There is also a solution as a list of 1s and 0s, with a 1 meaning the item at that index is in the sack. Each element has the format:

((tensor(weights), tensor(values), tensor(total weight), int(capacity percentage)), (list(solution), float(solution time)))

For some reason the file is 19 GB, which is abnormally large for something with only 5,000 elements. Doing some deeper digging, I used sys.getsizeof() on one of the elements, which returned a value of 56 bytes, meaning that 5,000 elements should equate to only about 280 KB. However, when I tried pickling one of the elements on its own, it came out to 3.9 MB, meaning that if I pickled all 5,000 elements separately they would add up to the 19 GB. Why is the pickle file so large? Is there anything I can do to reduce the size of the file?
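A note on the measurement: sys.getsizeof() only reports the shallow size of the outer tuple, not the tensors it references, which is why it returns 56 bytes regardless of how large the tensors are; len(pickle.dumps(...)) is a much closer proxy for on-disk size. A minimal sketch, with made-up element contents that follow the format above:

import pickle
import sys

import torch

# Hypothetical element contents matching the format described above.
weights = torch.rand(100)
values = torch.rand(100)
total_weight = torch.tensor(10000)
element = ((weights, values, total_weight, 50), ([0, 1] * 50, 0.42))

print(sys.getsizeof(element))      # ~56 bytes: shallow size of the tuple only
print(len(pickle.dumps(element)))  # the actual serialized size in bytes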


1 Reply


When creating the pickle file for the inputs and outputs of all 5,000 elements, I would merge each input element file with its solution file into one large file. I have realized that the input file for each element is around 4 MB while the output file is only about 2 KB, so I assume the input files are what is contributing to the massive size of the pickle file.

So this is the function that creates the data set:

import torch

def uniform_data_set(minimum=80, maximum=120, size=(5000, 100)):
    # Integer-valued floats drawn uniformly from [minimum, maximum).
    values = ((maximum - minimum) * torch.rand(size)).floor() + minimum
    weights = ((maximum - minimum) * torch.rand(size)).floor() + minimum
    # One total weight per sample: number of items times the mean item weight.
    total_weights = torch.ones(size=(size[0], 1)) * (((maximum - minimum) / 2 + minimum) * size[1])
    return (weights, values, total_weights)
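As a quick sanity check, calling it with the defaults gives:

weights, values, total_weights = uniform_data_set()
print(weights.shape)        # torch.Size([5000, 100])
print(values.shape)         # torch.Size([5000, 100])
print(total_weights.shape)  # torch.Size([5000, 1])
print(total_weights[0])     # tensor([10000.]): 100 items * mean weight of 100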

And this is the program that creates the individual element input files:

from data_generator import uniform_data_set
import argparse
import pickle

def data_generator(in_dir, out_dir, submit_file_path):
    capacity_percentages = [1, 5, 10, 25, 50, 75, 90, 95, 99]

    with open(submit_file_path, 'a+') as submit_file:
        for capacity_percentage in capacity_percentages:
            uni_weights, uni_values, uni_total_weights = uniform_data_set()

            num_samples = len(uni_weights)

            for i in range(num_samples):
                # One pickle file per (sample, capacity percentage) pair.
                with open(f'{in_dir}/data_{capacity_percentage}_{i}.p', 'wb') as f:
                    pickle.dump((uni_weights[i], uni_values[i], uni_total_weights[i], capacity_percentage), f)
                submit_file.write(f'uni_{i}, {in_dir}/data_{capacity_percentage}_{i}.p, {out_dir}/data_{capacity_percentage}_{i}.p\n')

if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    parser.add_argument('in_dir', type=str, help="Input directory")
    parser.add_argument('out_dir', type=str, help="Output directory")
    parser.add_argument('submit_file_path', type=str, help="Condor submit file path")
    args = parser.parse_args()
    
    in_dir = args.in_dir
    out_dir = args.out_dir
    submit_file_path = args.submit_file_path

    data_generator(in_dir, out_dir, submit_file_path)
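Assuming the script is saved as, say, generate_data.py (a file name made up here), it would be run along these lines:

python generate_data.py ./inputs ./outputs submit.txt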

The program shown above stores the weights, values, and total weight as torch tensors. However, I tested converting the tensors to lists with the tensor.tolist() method, and the individual input element pickle file ended up being only 2 KB. I should have stored the weights, values, and total weight as lists from the beginning, because when loading the pickle file into the solver program I had to convert them to lists anyway.

How come storing the dataset in torch tensors causes the pickle file to be substantially larger?
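What is most likely happening here (an explanation based on how PyTorch serializes tensors, not something stated in the thread): indexing a tensor, as in uni_weights[i], returns a view that shares the entire underlying storage of the 5000x100 tensor, and pickling a tensor serializes its whole storage rather than just the visible slice. Each per-element input file therefore carries all 5,000 rows of weights and all 5,000 rows of values (5,000 × 100 float32 values ≈ 2 MB each), which lines up with the ~3.9 MB measured per element. Converting with .tolist() sidesteps this, and so does calling .clone() on the slice before dumping. A minimal sketch of the difference:

import pickle

import torch

data = torch.rand(5000, 100)

row_view = data[0]          # a view that shares the full 5000x100 storage
row_copy = data[0].clone()  # an independent 100-element tensor

# Pickling the view serializes the whole ~2 MB backing storage;
# pickling the clone serializes only the 100 floats (~1 KB).
print(len(pickle.dumps(row_view)))
print(len(pickle.dumps(row_copy)))

With that in mind, dumping (uni_weights[i].clone(), uni_values[i].clone(), uni_total_weights[i].clone(), capacity_percentage) in data_generator should shrink each input file to roughly the size of the list version.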

