The memory actually allocated can be larger than the size you calculate, either because padding is added to achieve optimum address alignment or because of minimum block sizes.
For the two examples you give, the data sizes are compatible with natural alignment sizes and boundaries, so you probably won't see much difference between calculated and actual memory used. There may still be some variation, though, if cudaMalloc() uses a suballocator, that is, if it allocates a large block from the OS (or device) and then subdivides that block into smaller blocks to fill cudaMalloc() requests.
If a suballocator is involved, then the OS will show the actual memory use as considerably larger than your calculated use, but the reported use will remain stable even as your app makes multiple small allocations (which can be filled from the already allocated large block).
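To make the suballocator idea concrete, here is a minimal host-side sketch of a bump suballocator. It is only an illustration: the class name BumpSuballocator, the 256-byte default alignment, and the use of a std::vector as the backing block are all assumptions made for the example, and nothing here describes how cudaMalloc() is actually implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical bump suballocator: one large backing block is reserved up
// front, and small requests are carved out of it one after another.
// Illustrative only; this is not how cudaMalloc() is known to work.
class BumpSuballocator {
public:
    explicit BumpSuballocator(std::size_t pool_bytes)
        : pool_(pool_bytes), offset_(0) {}

    // Returns a pointer into the pool, or nullptr if the request does not fit.
    // `alignment` must be a power of two; it is applied to the offset within
    // the pool, not to the absolute address.
    void* allocate(std::size_t bytes, std::size_t alignment = 256) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        // Round the current offset up to the next aligned position.
        std::size_t start = (offset_ + alignment - 1) & ~(alignment - 1);
        if (start > pool_.size() || bytes > pool_.size() - start) {
            return nullptr;
        }
        offset_ = start + bytes;
        return pool_.data() + start;
    }

    // What an outside observer (the "OS" in this analogy) sees as in use.
    std::size_t reserved() const { return pool_.size(); }

    // What the application has consumed so far, including alignment padding.
    std::size_t used() const { return offset_; }

private:
    std::vector<unsigned char> pool_;
    std::size_t offset_;
};
```

With a 64K pool, a 3K request followed by a 100-byte request leaves reserved() unchanged at 64K while used() grows to 3172 bytes, which is the "large but stable" pattern described above.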
Similarly, the hardware typically has a minimum allocation size, which is usually the same as the memory page size. If the smallest chunk of memory that can be allocated from hardware is, say, 64K, then when you ask for 3K you've got 61K that's allocated but not being used. This is where a suballocator is useful: it makes sure you use as much as you can of the memory blocks you allocate.
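The arithmetic behind that 3K versus 64K example can be sketched as follows. The helper names round_up_to_granularity and wasted_bytes are made up for this illustration, and the 64K granularity is the hypothetical figure from the paragraph above, not a value measured on any particular device.

```cpp
#include <cassert>
#include <cstddef>

// Rounds `requested` up to the next multiple of `granularity`, which is what
// an allocator with a minimum block size effectively does to each request.
std::size_t round_up_to_granularity(std::size_t requested,
                                    std::size_t granularity) {
    assert(granularity > 0);
    return ((requested + granularity - 1) / granularity) * granularity;
}

// Bytes that are reserved for the request but never used by the caller.
std::size_t wasted_bytes(std::size_t requested, std::size_t granularity) {
    return round_up_to_granularity(requested, granularity) - requested;
}
```

For example, wasted_bytes(3 * 1024, 64 * 1024) evaluates to 61 * 1024, matching the 61K of slack in the example, while a request of exactly 64K wastes nothing.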