It's not a good idea to use a counter in C to implement PWM, or anything time-critical really. Although the compiler turns your C code into specific machine code, you don't really know how much time that code will take to execute.
Your code does not translate to:
make port B high 400 times (PORTB |= (1<<7);)
make port B low 400 times (PORTB &= ~(1<<7);)
but rather to something like this (simplified and human-readable):
load variable cnt1 to memA;
load 399 to memB
compare memA to memB
put result to memC
if memC eq "something indicating <=" do PORTB |= (1<<7);
if memC something else do PORTB &= ~(1<<7);
load cnt1 to memD and increment;
write memD to cnt1;
load 800 to memE
load cnt1 to memF
compare memF to memE
put result to memG
if memG eq "something indicating <=" do memF = 0, write memF to cnt1;
if memG something else go to start;
Looking at this from the C point of view, each pass through the loop has to do at least the following:
1. compare cnt1 with 399
2. branch on the result (if / else)
3. port high / port low
4. add one to cnt1
5. compare cnt1 and 800
It then depends on your compiler how good it is at optimizing all the loads and writes (usually quite good).
You can control what the delays will be if you really know your compiler and don't use too much optimization (optimized output is usually too complex to follow), or by writing the code in assembler. But then you will have to use logic similar to my explanation of the machine code above (assembler is essentially human-readable machine code).
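To make the five steps concrete, here is a minimal sketch of what such a counting loop presumably looks like. It is a guess, not your actual code: the exact comparisons, the names pwm_step and count_high_passes, and the PORTB_sim variable (a stand-in for the real PORTB register so that the sketch also runs on a PC) are all assumptions.

```c
#include <stdint.h>

/* Stand-in for the real PORTB register (assumption, for running on a PC). */
static volatile uint8_t PORTB_sim;

static uint16_t cnt1;

/* One pass through the assumed main-loop body. */
static void pwm_step(void)
{
    if (cnt1 <= 399)                      /* steps 1-2: compare and branch */
        PORTB_sim |= (1 << 7);            /* step 3: pin high */
    else
        PORTB_sim &= (uint8_t)~(1 << 7);  /* step 3: pin low */

    cnt1++;                               /* step 4: add one to cnt1 */

    if (cnt1 >= 800)                      /* step 5: compare cnt1 and 800 */
        cnt1 = 0;
}

/* Run n passes from cnt1 = 0 and count how many leave the pin high. */
static unsigned count_high_passes(unsigned n)
{
    unsigned high = 0;

    cnt1 = 0;
    for (unsigned i = 0; i < n; i++) {
        pwm_step();
        if (PORTB_sim & (1 << 7))
            high++;
    }
    return high;
}
```

The loop fixes only the ratio: the pin is high for 400 of every 800 passes. How long a single pass lasts, and therefore the output frequency, depends entirely on the instructions the compiler generates.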
I think the solution for you is timer interrupts. There's a good tutorial for the ATmega128 here.
Also, what do you mean by:
I tried to generate 20Khz signal (50 us)with 25us duty cycle.
Do you mean a 20 kHz signal with a 50% duty cycle, so 25 us low and 25 us high?
If so, you can do this with one timer interrupt and one (binary) counter.
This is exactly the "8 bit timer example" you can read about in the linked tutorial.
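As a rough sketch of that approach (not the tutorial's code): let an 8-bit timer in CTC mode raise an interrupt every 25 us and toggle the pin in the handler, so the pin itself acts as the one-bit counter. The 16 MHz clock, the prescaler of 8, and the helper ctc_compare_value are my assumptions. The register and vector names are the ATmega128 Timer0 ones as I remember them from avr-libc, so check them against the datasheet for your part.

```c
#include <stdint.h>

/* Assumed values, not taken from the question. */
#define CPU_HZ     16000000UL
#define PRESCALER  8UL

/* Compare value so that a CTC-mode timer matches every period_us
   microseconds: (timer ticks per period) - 1. */
static uint32_t ctc_compare_value(uint32_t cpu_hz, uint32_t prescaler,
                                  uint32_t period_us)
{
    uint64_t ticks_per_second = cpu_hz / prescaler;
    return (uint32_t)(ticks_per_second * period_us / 1000000ULL) - 1;
}

#ifdef __AVR__
#include <avr/io.h>
#include <avr/interrupt.h>

/* Runs every 25 us; toggling gives 25 us high, 25 us low = 20 kHz. */
ISR(TIMER0_COMP_vect)
{
    PORTB ^= (1 << 7);
}

int main(void)
{
    DDRB |= (1 << 7);                      /* PB7 as output */
    TCCR0 = (1 << WGM01) | (1 << CS01);    /* CTC mode, clk/8 */
    OCR0 = (uint8_t)ctc_compare_value(CPU_HZ, PRESCALER, 25);
    TIMSK |= (1 << OCIE0);                 /* enable compare-match interrupt */
    sei();

    for (;;) {
        /* nothing to do here: the timer hardware keeps the timing */
    }
}
#endif
```

With these assumed values the compare register comes out as 49, i.e. 50 timer ticks of 0.5 us each per interrupt. The timing now comes from the timer hardware rather than from how the compiler translated a loop, apart from a small, roughly constant interrupt latency.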