TensorFlow Import Your Own Data

Dec 17 2019


This is all taken from sentdex's excellent TensorFlow tutorial series. Check it all out here:
https://www.youtube.com/playlist?list=PLQVvvaa0QuDfhTox0AjmQ6tvTgMBZBEXN


In this post I'll briefly note how to import your own data into a Jupyter notebook. The venv setup is the same as in the previous post.


Then pip install numpy and pip install opencv-python (which provides cv2) if you're going to be working with images.
Set up a data directory with paths to your relevant training data/categories. Here we have two separate categories, Dog and Cat images, and we check that we've connected the paths correctly by displaying one image:

import os
import cv2
import matplotlib.pyplot as plt

DATADIR = "C:/Users/Greg Sukochev/Desktop/PetImages"
CATEGORIES = ["Dog", "Cat"]

for category in CATEGORIES:
    path = os.path.join(DATADIR, category)  # path to the dogs or cats dir
    for img in os.listdir(path):
        img_array = cv2.imread(os.path.join(path, img), cv2.IMREAD_GRAYSCALE)
        plt.imshow(img_array, cmap="gray")
        plt.show()
        break
    break


Since the images are all different sizes, we need to standardize them and have a look at the result:

IMG_SIZE = 50
new_array = cv2.resize(img_array, (IMG_SIZE, IMG_SIZE))
plt.imshow(new_array, cmap="gray")
plt.show()

Then we can create our training data, skipping (via pass) any broken image that raises an error while loading (in practice you should probably delete such files):

training_data = []

def create_training_data():
    for category in CATEGORIES:
        path = os.path.join(DATADIR, category)  # path to the dogs or cats dir
        class_num = CATEGORIES.index(category)
        for img in os.listdir(path):
            try:
                img_array = cv2.imread(os.path.join(path, img), cv2.IMREAD_GRAYSCALE)
                new_array = cv2.resize(img_array, (IMG_SIZE, IMG_SIZE))
                training_data.append([new_array, class_num])
            except Exception:
                pass

create_training_data()


The class_num assigns an actual number to each label: since it's the index into CATEGORIES, a dog image gets 0 and a cat image gets 1.
We then shuffle our data (the dataset should already be balanced, roughly 50:50 dogs to cats):

import random
random.shuffle(training_data)
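As a sanity check, you can count the labels after shuffling to confirm the data really is balanced. A minimal sketch, using stand-in entries in place of the real training_data (the None placeholders substitute for image arrays):

```python
import random
from collections import Counter

# Stand-in for the real training_data: 500 "dog" (0) and 500 "cat" (1) entries.
training_data = [[None, 0] for _ in range(500)] + [[None, 1] for _ in range(500)]

random.shuffle(training_data)  # shuffles the list in place

# Shuffling changes the order, not the label counts.
counts = Counter(label for _, label in training_data)
print(counts)  # 500 of each label
```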

Now we pack our data into the variables we're going to feed to the network. x has to be a numpy array in order to work with Keras. x holds our images (each just an array of pixel values), y the label for each image (here a 0 or 1). We also reshape x: the -1 is a catch-all for however many samples we have, each image has shape IMG_SIZE by IMG_SIZE, and the final 1 is the single channel for grayscale (it would be 3 for colour).

x = []
y = []

for features, label in training_data:
    x.append(features)
    y.append(label)

x = np.array(x).reshape(-1, IMG_SIZE, IMG_SIZE, 1)
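To see what the reshape does, here's a small self-contained sketch where random grayscale "images" stand in for the real data; the resulting shape is the point:

```python
import numpy as np

IMG_SIZE = 50

# Fake training set: 10 random 50x50 grayscale images with alternating 0/1 labels.
rng = np.random.default_rng(0)
training_data = [[rng.integers(0, 256, (IMG_SIZE, IMG_SIZE)), i % 2]
                 for i in range(10)]

x = []
y = []

for features, label in training_data:
    x.append(features)
    y.append(label)

# -1 lets numpy infer the sample count; the trailing 1 is the grayscale channel.
x = np.array(x).reshape(-1, IMG_SIZE, IMG_SIZE, 1)
print(x.shape)  # (10, 50, 50, 1): samples, height, width, channels
```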


Finally, we use pickle to save this data, because we don't want to regenerate it every time, particularly once we start tweaking the model:

import pickle

pickle_out = open("x.pickle", "wb")
pickle.dump(x, pickle_out)
pickle_out.close()

pickle_out = open("y.pickle", "wb")
pickle.dump(y, pickle_out)
pickle_out.close()


And load it for use:

pickle_in = open("x.pickle", "rb")
x = pickle.load(pickle_in)
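A quick round-trip sketch (with a plain list standing in for the real image data) confirms that pickle restores exactly what was saved; it also uses with blocks, which close the files automatically:

```python
import pickle

data = [[1, 2, 3], 0]  # stand-in for one [image, label] pair

with open("x.pickle", "wb") as pickle_out:  # write binary
    pickle.dump(data, pickle_out)

with open("x.pickle", "rb") as pickle_in:  # read binary
    loaded = pickle.load(pickle_in)

print(loaded == data)  # True: the round trip is lossless
```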

In the next post I'll use this data in the actual neural network via:

x = np.asarray(pickle.load(open("x.pickle", "rb")))
y = np.asarray(pickle.load(open("y.pickle", "rb")))