My biggest takeaway here is that carefully choosing the context length and, to a lesser extent, the temperature matters a lot for reducing hallucinations. I expected model families to vary widely among themselves, but I honestly didn't expect context length to have such a massive impact.
Based on this, it seems like keeping the context length small should be best practice in applications where the model doesn't actually need to hold a very large amount of context at once, no?
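To make concrete what I mean by "keeping the context length small": here's a rough sketch (not from the post above) that trims a chat history down to a token budget before sending it to the model. The ~4-chars-per-token estimate, the `max_context_tokens` value, and the message format are all placeholder assumptions on my part; you'd swap in the real tokenizer and budget for whatever model family you're using.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate (~4 chars/token); real tokenizers differ per model."""
    return max(1, len(text) // 4)


def trim_history(messages: list[dict], max_context_tokens: int = 2000) -> list[dict]:
    """Keep only the newest messages that fit within max_context_tokens.

    `messages` is assumed to be a list of {"role": ..., "content": ...} dicts,
    oldest first, as in typical chat-completion APIs.
    """
    kept: list[dict] = []
    budget = max_context_tokens
    for msg in reversed(messages):      # walk from newest to oldest
        cost = estimate_tokens(msg["content"])
        if cost > budget:
            break
        kept.append(msg)
        budget -= cost
    return list(reversed(kept))         # restore chronological order


# Usage: send trim_history(conversation) to the model instead of the full log,
# tuning max_context_tokens down to the smallest budget the task can tolerate.
```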