I asked a question about this on the course website – how are people doing their mean subtraction and normalization? Before, or after the test/train/validation split?
Currently, I split first, then calculate the mean of the training data, and subtract that training mean from the train, test, and validation sets. I think this is the easiest and most statistically-okay way to do this. If you use the mean of the whole dataset, you're introducing information from the test/validation into training.
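To make that concrete, here's a minimal NumPy sketch of what I mean. The arrays here are made-up stand-ins for the actual splits; the point is just that only the training mean is ever computed, and it's applied to all three sets:

```python
import numpy as np

# Hypothetical arrays standing in for the real train/val/test splits.
rng = np.random.default_rng(0)
X_train = rng.normal(5.0, 2.0, size=(100, 3))
X_val = rng.normal(5.0, 2.0, size=(20, 3))
X_test = rng.normal(5.0, 2.0, size=(20, 3))

# Compute the mean on the training split only...
train_mean = X_train.mean(axis=0)

# ...and subtract that same mean from all three splits, so no
# information leaks from val/test into the training statistics.
X_train_c = X_train - train_mean
X_val_c = X_val - train_mean
X_test_c = X_test - train_mean
```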
To normalize, I calculate the min and max for each of the training, test, and validation sets. The training data is rescaled based on the training min and max. For the test and val sets, I use the smallest min and largest max between each set and training (separately for test and val, of course).
Otherwise, if the test/val data happened to have values higher than the training data, I would clip them out, or I wouldn't be taking the training data's range into account at all. I can imagine cases where you wouldn't want to do this (e.g. where you think your test/val have some kind of bias), but I think it's safe to assume in this case that train, test, and val all come from the same distribution, since they're from the same file.
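For reference, here's a sketch of that widened-bounds rescaling in NumPy (again with made-up arrays in place of the real splits): training data uses its own min/max, while each of val/test widens the training bounds with its own extremes so out-of-range values aren't clipped.

```python
import numpy as np

rng = np.random.default_rng(1)
X_train = rng.normal(5.0, 2.0, size=(100, 3))
X_val = rng.normal(5.0, 2.0, size=(20, 3))
X_test = rng.normal(5.0, 2.0, size=(20, 3))

def rescale(X, lo, hi):
    # Linearly map [lo, hi] onto [0, 1], per feature.
    return (X - lo) / (hi - lo)

# Training data is rescaled with its own bounds.
train_lo, train_hi = X_train.min(axis=0), X_train.max(axis=0)
X_train_n = rescale(X_train, train_lo, train_hi)

# Val/test: take the smallest min and largest max between that set
# and training, so nothing falls outside [0, 1].
val_lo = np.minimum(train_lo, X_val.min(axis=0))
val_hi = np.maximum(train_hi, X_val.max(axis=0))
X_val_n = rescale(X_val, val_lo, val_hi)

test_lo = np.minimum(train_lo, X_test.min(axis=0))
test_hi = np.maximum(train_hi, X_test.max(axis=0))
X_test_n = rescale(X_test, test_lo, test_hi)
```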
I wonder if there isn't a better way to do this preprocessing, though, particularly for the test and validation sets. Maybe something as simple as subtracting mean(mean_of_train, mean_of_test), if you thought your test data were biased? But it seems like, in general, there should be something more intelligent to do.