It is very common for general purpose game engines to use quaternions for describing objects' rotations.
Quaternions are used somewhat like rotation matrices, but have fewer components you'll multiply quaternions by quaternions to apply player input, and convert quaternions to matrices to render with. Instead, you should represent your camera/player orientation as a quaternion, a mathematical structure that is good for representing arbitrary rotations. However, this approach (“Euler angles”) is both tricky to compute with and has numerical stability issues (“gimbal lock”). The minimal solution to this is to add a roll component to your camera state.
As a consequence, no matter how you implement the controls, you will find that in some orientations the camera rolls strangely, because the effect of trying to do the math with this information is that every frame the roll is picked/reconstructed based on the pitch and yaw. Two numbers can represent a look-direction vector but they cannot represent the third component of camera orientation, called roll (rotation about the “depth” axis of the screen). The problem is that two numbers, pitch and yaw, provide insufficient degrees of freedom to represent consistent free rotation behavior in space without any “horizon”.