Tuning the parameters of gradient boosting algorithms is not easy. Unlike random forest, boosting is inherently sequential: each tree is fitted to the residual errors of the previous ones, so the trees cannot be trained in parallel. In other words, the search for the right parameters can take a long time.
Let us go through the most important parameters (many of them are shared with the RF algorithm):
loss: the cost function that we aim to minimise. It is an important choice in general, but in our case this parameter is fixed.
n_estimators: the number of estimators, i.e. the number of boosting iterations. To set it, watch the train and test errors as estimators are added: as long as you are not overfitting, you can keep increasing it. Hundreds of estimators are common (see the sketch after this list).
max_depth: the depth of the individual trees. Boosted trees are kept shallow compared to a random forest, typically between three and eight in practice. You can already get reliable results with trees of depth one (decision stumps), which make only one split!
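To illustrate how the train and test errors guide the choice of n_estimators, here is a minimal sketch (not the code from our example) using scikit-learn's GradientBoostingClassifier; the synthetic dataset and the parameter values are only placeholders.

# Sketch: monitor train/test error at every boosting iteration
# to decide how many estimators are actually needed.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import zero_one_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbt = GradientBoostingClassifier(n_estimators=500, max_depth=3, random_state=0)
gbt.fit(X_train, y_train)

# staged_predict yields the predictions after each boosting iteration,
# so the whole error curve is obtained from a single fit.
train_errors = [zero_one_loss(y_train, p) for p in gbt.staged_predict(X_train)]
test_errors = [zero_one_loss(y_test, p) for p in gbt.staged_predict(X_test)]

best_iter = min(range(len(test_errors)), key=test_errors.__getitem__) + 1
print(f"Lowest test error reached with {best_iter} estimators")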
Another important boosting-specific hyperparameter is:
learning_rate: the step-size shrinkage, also called eta or the shrinkage parameter. At each boosting step, the contribution of the newly added tree is multiplied by this factor, which makes the boosting process more conservative and helps prevent overfitting; smaller values generally require more estimators.
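The trade-off between learning_rate and n_estimators can be seen with a small experiment. The following sketch (again with placeholder data and values, assuming scikit-learn's GradientBoostingClassifier) fits the same model with different learning rates:

# Sketch: with a fixed number of estimators, a smaller learning rate
# shrinks each tree's contribution, so more iterations are usually
# needed to reach the same accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for lr in (1.0, 0.1, 0.01):
    gbt = GradientBoostingClassifier(learning_rate=lr, n_estimators=200,
                                     max_depth=3, random_state=0)
    gbt.fit(X_train, y_train)
    print(f"learning_rate={lr:<5} test accuracy={gbt.score(X_test, y_test):.3f}")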
In our example we define a grid containing candidate values for the maximum depth of the trees and the number of estimators.
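As a sketch of what such a grid could look like with scikit-learn's GridSearchCV (the values below are placeholders, not necessarily those used in our example):

# Sketch: grid search over max_depth and n_estimators.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, random_state=0)

param_grid = {
    "max_depth": [2, 3, 5, 8],        # shallow trees, as discussed above
    "n_estimators": [100, 300, 500],  # hundreds of estimators are common
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=5,
    n_jobs=-1,  # the grid search itself parallelises, even if boosting does not
)
search.fit(X, y)
print(search.best_params_)

Note that while each boosting model is trained sequentially, the different cells of the grid are independent, so the search itself can be run in parallel.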