Dec. 16, 2020
Data collection is the first step in the process of generating a dataset for use in machine learning and computer vision training. Performing good data collection is essential to success: the quality of an AI model can only be as good as the quality of the dataset it’s trained on.
For example, suppose that we want to train a computer vision AI model to recognize images of different dog breeds. To do so, we would need to collect a robust dataset of image data for each breed we want to recognize. After being acquired, each image needs to be labeled with the correct category. Images may also require post-processing, such as cropping or resizing, to prepare them for use in training. (In fact, the Chooch AI app recognizes many things, including dogs.)
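As a minimal sketch of the labeling step, the snippet below assumes a hypothetical folder layout in which each image sits in a directory named after its breed (for example `dogs/beagle/001.jpg`). The layout, the file names, and the helper names are illustrative assumptions, not part of any particular tool. It derives a label for each image and counts the images per category, which is a quick way to spot breeds that are underrepresented.

```python
from collections import Counter
from pathlib import PurePosixPath


def label_from_path(path: str) -> str:
    """Return the parent folder name, used here as the category label.

    Assumes the hypothetical layout <root>/<breed>/<file>.
    """
    return PurePosixPath(path).parent.name


def label_images(paths):
    """Pair every image path with the label derived from its folder."""
    return [(path, label_from_path(path)) for path in paths]


def images_per_label(labeled):
    """Count how many images were collected for each label."""
    return Counter(label for _, label in labeled)


if __name__ == "__main__":
    # Made-up example paths; a real dataset would be listed from disk.
    example_paths = [
        "dogs/beagle/001.jpg",
        "dogs/beagle/002.jpg",
        "dogs/pug/001.jpg",
    ]
    labeled = label_images(example_paths)
    for breed, count in sorted(images_per_label(labeled).items()):
        print(breed, count)
```

Folder names are only one possible source of labels; labels may equally come from a spreadsheet or from manual annotation.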
When performing data collection, we should keep the following guidelines in mind:
Once the data is collected, it needs to be separated into three sets: the training set, the validation set, and the test set.
Data should be divided randomly among the three sets so that each set is representative of the whole and each category is adequately represented in the AI training process. The training set usually makes up 60% to 80% of the dataset’s total size, leaving 10% to 20% each for the validation and test sets.
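The split described above can be sketched in a few lines of Python. This is an illustrative example, not the method used by any specific platform: it assumes the dataset is a list of `(item, label)` pairs, uses a 70/15/15 split chosen from within the ranges mentioned, and shuffles within each category (a stratified split) so that every category appears in every set. The function name and parameters are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(samples, train_frac=0.7, val_frac=0.15, seed=0):
    """Randomly split (item, label) pairs into train, validation, and test lists.

    Shuffling happens separately inside each label group, so each label is
    divided in roughly the same proportions. Whatever is left after the
    training and validation shares goes to the test set.
    """
    if not (0 < train_frac < 1 and 0 <= val_frac < 1 and train_frac + val_frac < 1):
        raise ValueError("fractions must be positive and sum to less than 1")

    rng = random.Random(seed)  # fixed seed keeps the split reproducible

    # Group the samples by label.
    by_label = defaultdict(list)
    for item, label in samples:
        by_label[label].append((item, label))

    train, val, test = [], [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = round(len(group) * train_frac)
        n_val = round(len(group) * val_frac)
        train.extend(group[:n_train])
        val.extend(group[n_train:n_train + n_val])
        test.extend(group[n_train + n_val:])
    return train, val, test


if __name__ == "__main__":
    # Made-up dataset: 20 file names for each of three breeds.
    data = [(f"{breed}_{i}.jpg", breed)
            for breed in ("beagle", "poodle", "pug")
            for i in range(20)]
    train, val, test = stratified_split(data)
    print(len(train), len(val), len(test))
```

With very small categories, rounding can leave the validation or test share of a category empty, so it is worth checking the per-category counts after splitting.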
The Chooch AI platform provides tools for data collection and AI training. Learn more about how to use Chooch AI tools.