When creating the pickle file of inputs and outputs for all 5,000 elements, I merge each element's input file with its solution file into one large file. I noticed that each element's input file is around 4 MB while its output is only about 2 KB, so I assume the input files account for the massive size of the merged pickle file.
So this is the function that creates the dataset:
import torch

def uniform_data_set(minimum=80, maximum=120, size=(5000, 100)):
    # Integer-valued weights and values drawn uniformly from [minimum, maximum)
    values = ((maximum - minimum) * torch.rand(size)).floor() + minimum
    weights = ((maximum - minimum) * torch.rand(size)).floor() + minimum
    # One capacity per sample: the expected total weight of all items
    total_weights = torch.ones(size=(size[0], 1)) * (((maximum - minimum) / 2 + minimum) * size[1])
    return (weights, values, total_weights)
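For reference, each call returns three tensors with the shapes below (a quick check of my own, assuming the function lives in data_generator.py as in the import further down):

from data_generator import uniform_data_set

weights, values, total_weights = uniform_data_set()
print(weights.shape)        # torch.Size([5000, 100])
print(values.shape)         # torch.Size([5000, 100])
print(total_weights.shape)  # torch.Size([5000, 1])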
And this is the program that creates the individual element input files:
from data_generator import uniform_data_set
import argparse
import pickle

def data_generator(in_dir, out_dir, submit_file_path):
    capacity_percentages = [1, 5, 10, 25, 50, 75, 90, 95, 99]
    with open(submit_file_path, 'a+') as submit_file:
        for capacity_percentage in capacity_percentages:
            uni_weights, uni_values, uni_total_weights = uniform_data_set()
            num_samples = len(uni_weights)
            for i in range(num_samples):
                # One pickle file per element: (weights, values, total weight, capacity %)
                with open(f'{in_dir}/data_{capacity_percentage}_{i}.p', 'wb') as f:
                    pickle.dump((uni_weights[i], uni_values[i], uni_total_weights[i], capacity_percentage), f)
                submit_file.write(f'uni_{i}, {in_dir}/data_{capacity_percentage}_{i}.p, {out_dir}/data_{capacity_percentage}_{i}.p\n')

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('in_dir', type=str, help="Input directory")
    parser.add_argument('out_dir', type=str, help="Output directory")
    parser.add_argument('submit_file_path', type=str, help="Condor submit file path")
    args = parser.parse_args()
    data_generator(args.in_dir, args.out_dir, args.submit_file_path)
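For reference, loading one of the generated files back looks like this (the filename is just an example matching the pattern above):

import pickle

with open('inputs/data_50_0.p', 'rb') as f:
    weights, values, total_weight, capacity_percentage = pickle.load(f)

# The first three come back as torch tensors, which I then convert in the solver
print(type(weights), weights.shape)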
The program shown above stores the weights, values, and total weight as torch tensors. However, when I tested converting each tensor to a list with the tensor.tolist() method before dumping, the individual element's input pickle file ended up being only 2 KB. I should have stored the weights, values, and total weight as lists from the beginning, because when loading the pickle file into the solver program I had to convert them to lists anyway.
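To illustrate, this is roughly the comparison I ran (a minimal sketch; the tensor shape matches the dataset above, and the output file names are just placeholders):

import os
import pickle
import torch

data = torch.rand(5000, 100)  # same shape as one generated dataset tensor
row = data[0]                 # a single 100-element sample, like the ones pickled per file

with open('as_tensor.p', 'wb') as f:
    pickle.dump(row, f)           # tensor version ends up far larger on disk
with open('as_list.p', 'wb') as f:
    pickle.dump(row.tolist(), f)  # list version is only about 2 KB

print(os.path.getsize('as_tensor.p'), os.path.getsize('as_list.p'))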
Why does storing the dataset in torch tensors cause the pickle files to be substantially larger?