What is AI Data Collection? Why is it important in AI Training and AI Models?


Dec. 16, 2020

Data collection is the first step in the process of generating a dataset for use in machine learning and computer vision training. Performing good data collection is essential to success: the quality of an AI model can only be as good as the quality of the dataset it’s trained on.

For example, suppose that we want to train a computer vision AI model to recognize images of different dog breeds. To do so, we would need to collect a robust dataset of image data for each breed we want to recognize. After being acquired, each image needs to be labeled with the correct category. Images may also require post-processing, such as cropping or resizing, to prepare them for use in training. (In fact, the Chooch AI app recognizes many things, including dogs.)

When performing data collection, we should keep the following guidelines in mind:

  • For best results, the dataset should have a roughly balanced sample of images from each category. If we want to recognize 10 different dog breeds, for example, images of each breed should make up roughly 10% of the dataset.
  • Getting a diverse sample of each category is essential. In computer vision, for example, the same object may appear very different from image to image, depending on factors such as angle, size, lighting conditions, and background.
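
The balance guideline above can be checked with a few lines of code before training starts. The following is a minimal sketch, not part of any particular tool: the function names `class_shares` and `underrepresented`, the `tolerance` parameter, and the sample breed labels are all assumptions made for illustration.

```python
from collections import Counter


def class_shares(labels):
    """Return each category's share of the dataset as a fraction between 0 and 1."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = len(labels)
    return {category: count / total for category, count in counts.items()}


def underrepresented(labels, tolerance=0.5):
    """List categories whose share is below `tolerance` times the even share.

    With 10 categories the even share is 10%, so the default tolerance
    flags any category that makes up less than 5% of the dataset.
    The threshold is an illustrative choice, not a standard value.
    """
    shares = class_shares(labels)
    even_share = 1 / len(shares)
    return sorted(c for c, share in shares.items() if share < tolerance * even_share)


# Hypothetical labels for a three-breed dataset that is short on huskies.
labels = ["beagle"] * 45 + ["poodle"] * 45 + ["husky"] * 10
print(class_shares(labels))
print(underrepresented(labels))
```

A category flagged this way is a signal to collect more images for it before moving on to training.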

Once the data is collected, it needs to be separated into three sets:

  • The training set is used to initially train and fit the model.
  • The validation set is used to tweak different parts of the model’s configuration (known as hyperparameters).
  • The test set is used to evaluate the model’s performance on fresh, unseen data to provide an idea of how it will do in the real world.

Data should be randomly divided among the three sets, which helps each category be adequately represented in every set. The training set usually makes up 60-80% of the dataset’s total size, leaving 10-20% each for the validation and test sets.
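
The split described above can be sketched in plain Python. This example uses a stratified variant of the random split: it shuffles within each category before cutting, so every category appears in every set in the same proportion. The function name `stratified_split`, the 70/15/15 ratios, the fixed seed, and the file names are assumptions chosen for illustration.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, train=0.7, val=0.15, seed=0):
    """Shuffle within each category, then cut it into train/val/test parts.

    Whatever remains after the train and validation cuts goes to the test set.
    """
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    splits = {"train": [], "val": [], "test": []}
    for label, items in by_label.items():
        rng.shuffle(items)
        n_train = round(len(items) * train)
        n_val = round(len(items) * val)
        splits["train"] += [(s, label) for s in items[:n_train]]
        splits["val"] += [(s, label) for s in items[n_train:n_train + n_val]]
        splits["test"] += [(s, label) for s in items[n_train + n_val:]]
    return splits


# Hypothetical dataset: 100 image file names for each of two breeds.
samples = [f"img_{i:03d}.jpg" for i in range(200)]
labels = ["beagle" if i < 100 else "poodle" for i in range(200)]

splits = stratified_split(samples, labels)
print({name: len(rows) for name, rows in splits.items()})
```

Splitting per category matters most when some categories are small, since a purely random split over the whole dataset could leave a rare category with few or no images in the validation or test set.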

