1 Comment

Interesting as always. Monitoring datasets is very important, so it is crucial that they are open (and preferably open source). However, I disagree on the copyright issue, for various reasons. One is that it's also important to be able to feed data freely into training; otherwise AI will inherit the biases of whatever data companies acquire the rights to (from other companies). Regarding the influence of the dataset on outputs, as you rightly say, we must keep in mind that it is very dangerous for content to end up in unrelated categories. A scene from World War II, if placed in the right category, 'teaches' correctly; the problem arises if it is labeled as 'justice'. Of course, illegal material should not exist in the dataset (and more generally, I'm concerned that those images, even if removed from the dataset, remain online somewhere).
