Typically when working with audio on a computer, you will be working with audio in the time domain, in the format of PCM samples. That is, many times per second, the pressure level at that point of time will be measured an quantified into a number. If you are working with CD-quality audio, 44,1000 samples per second is the sample rate. The number is often quantified into 16-bit integers. (-32,767 to 32,768). (Other sample rates, bit depths, and quantization are out there and often used, this is just an example.)
If you want to mix two audio streams of the same sample rate, it is possible to simply add the values of each sample together. If you think about it, if you were to hear sound from two sources, their pressure levels would affect each other in much the same way. Sometimes they will cancel each other out, sometimes they will add to each other. You mentioned clipping... you can do this, but you will be introducing distortion into the mix. When a sound is too loud to be quantified, it is clipped at the maximum and minimums of the quantifiable range, causing audible clicks, pops, and poor quality sound. If you want to avoid this problem, you can cut the level of each in half, guaranteeing that even with both streams at their maximum level, they will be within the appropriate range.
Now, your question is about mixing audio with offset. It's absolutely no different. If you want to start mixing 5 seconds in, then 5 * 44,100 = 220500
, meaning align sample zero of one stream to sample 220500
of the other stream and mix.