We know the dataset that we have. We’ll always know what feature we’re trying to predict or calculate. In the training phase, if we’re working with supervised learning, we have the complete dataset and we’ll obviously know what we want to achieve. So we can split the dataset such that we have all the independent variables in one set (say x) and the dependent variable in another (say y).