Simply put, a packet is a block of data.
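Packet size is mostly capped by the network rather than by bandwidth: on a typical home network the limit (the MTU) is around 1500 bytes per packet. What changes with bandwidth is how much data gets sent per second. If the device has limited internet speeds, or it's a phone with a choppy signal, the stream will generally drop to a lower bitrate, and usually a lower resolution. If it's a desktop with dedicated service, the bitrate can be quite a bit higher.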
A frame can be thought of as one cel of animation, but these days, because of compression, most frames aren't complete images. They'll send one keyframe, an actual full image, once every few seconds or so, and every frame in between only carries the delta: data describing what has changed since an earlier frame.
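To make the keyframe-plus-delta idea concrete, here's a minimal sketch in Python. It's a toy, not how a real codec works: real encoders predict blocks of pixels with motion compensation, while this just records which pixel values changed. The function names `encode_delta` and `apply_delta` and the 8-pixel "frames" are made up for illustration.

```python
def encode_delta(prev, curr):
    """Return {pixel_index: new_value} for every pixel that changed."""
    return {i: new for i, (old, new) in enumerate(zip(prev, curr)) if old != new}


def apply_delta(prev, delta):
    """Rebuild a frame from the previous frame plus its delta."""
    frame = list(prev)
    for i, value in delta.items():
        frame[i] = value
    return frame


# A tiny 8-pixel "video": one full keyframe, then two frames that barely change.
keyframe = [10, 10, 10, 10, 10, 10, 10, 10]
frame2 = [10, 10, 99, 10, 10, 10, 10, 10]
frame3 = [10, 10, 99, 10, 10, 42, 10, 10]

# The sender transmits the keyframe in full, then only the deltas.
delta2 = encode_delta(keyframe, frame2)
delta3 = encode_delta(frame2, frame3)

# The receiver rebuilds each frame from the one before it.
rebuilt2 = apply_delta(keyframe, delta2)
rebuilt3 = apply_delta(rebuilt2, delta3)

print(delta2, delta3)      # each delta holds one changed pixel, not eight
print(rebuilt3 == frame3)  # True
```

Each delta here carries one pixel instead of eight, which is the whole point: when little changes between frames, there's very little to send.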
So yeah, let's say your packet size is 1024 bytes. A compressed frame is usually bigger than that, so it gets split across several packets, and the resolution you get is limited by how much data per second the stream can carry rather than by the size of one packet. They might send one frame per packet to keep it simple when the frames are small enough, but I don't think there's anything that absolutely guarantees that. The data stream is reconstructed from those packets, which often arrive out of order, and the frames are decoded from the deltas once the packets are pieced back together.
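Here's a rough sketch of that split-and-reassemble step, assuming each packet carries a sequence number so the receiver can put things back in order. The helper names `packetize` and `reassemble` and the 2560-byte "frame" are hypothetical; real protocols (RTP, for example) carry a sequence number in a header, but the details differ.

```python
import random


def packetize(data: bytes, payload_size: int):
    """Split data into (sequence_number, chunk) pairs of at most payload_size bytes."""
    return [
        (seq, data[offset:offset + payload_size])
        for seq, offset in enumerate(range(0, len(data), payload_size))
    ]


def reassemble(packets):
    """Sort packets by sequence number and join their payloads."""
    ordered = sorted(packets, key=lambda packet: packet[0])
    return b"".join(chunk for _, chunk in ordered)


# Pretend this is one compressed frame: 2560 bytes, larger than a single packet.
frame_data = bytes(range(256)) * 10

packets = packetize(frame_data, 1024)  # chunks of 1024, 1024 and 512 bytes
random.shuffle(packets)                # simulate out-of-order arrival

restored = reassemble(packets)
print(len(packets), restored == frame_data)  # 3 True
```

This ignores lost packets entirely; a real receiver also has to decide how long to wait for a missing sequence number before giving up on the frame.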
Audio takes up much less space than video, so they might only need to send one audio packet for every 50 or so video packets. The exact ratio depends on the bitrates of the two streams.
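As a back-of-the-envelope check on that ratio, here's the arithmetic in Python. The bitrates (128 kbit/s audio, 5 Mbit/s video) and the assumption that both streams fill 1024-byte packets are illustrative guesses, not figures from any particular service.

```python
def packets_per_second(bitrate_bps: float, payload_bytes: int) -> float:
    """Packets needed per second to carry bitrate_bps in payload_bytes-sized packets."""
    return bitrate_bps / 8 / payload_bytes


PAYLOAD_BYTES = 1024         # assumed payload size for both streams
AUDIO_BITRATE = 128_000      # assumed: 128 kbit/s audio
VIDEO_BITRATE = 5_000_000    # assumed: 5 Mbit/s video

audio_pps = packets_per_second(AUDIO_BITRATE, PAYLOAD_BYTES)
video_pps = packets_per_second(VIDEO_BITRATE, PAYLOAD_BYTES)
ratio = video_pps / audio_pps

print(f"audio: {audio_pps:.1f} packets/s")   # audio: 15.6 packets/s
print(f"video: {video_pps:.1f} packets/s")   # video: 610.4 packets/s
print(f"video packets per audio packet: {ratio:.1f}")  # 39.1
```

With these made-up numbers you get roughly 39 video packets per audio packet, so "one in 50" is the right ballpark. In practice audio is often sent in small packets at fixed intervals, so the real ratio can look quite different.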
I know these guys did a few clips on video-streams being recombined from packets, on their channel -- https://www.youtube.com/watch?v=DkIhI59ysXI