I am playing around with different problem formulations. After finally reading smooth_dual one more time and expanding the equivalent of the prox call on paper, I think I understand how this is equivalent to eq. (6) in the user guide, IF the affine transform is the identity. Doing trivial changes, like setting A to -A, changes behavior markedly. My understanding right now is that the gradient indicated by the smooth part really goes the wrong way then. With a more general A - trying to put all the expensive parts in the linear operator, I encounter runaway scenarios of very high objective function values.

This is not (really) the scaling issue mentioned in 4.5. It would be good if someone could just confirm that I am not missing something and that the problem definition is not exact anymore if A is different from identity.