Looks right.
You should really calculate the L1D index bits the same way you do for L2: log2(32 KiB / (64 B * 2)) = log2(256) = 8 bits.
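As a quick illustration (a minimal sketch; the variable names are just for this example):

```python
from math import log2

cache_size = 32 * 1024  # 32 KiB L1D
block_size = 64         # 64 B lines
ways = 2                # 2-way set associative

sets = cache_size // (block_size * ways)  # 32768 / 128 = 256 sets
index_bits = int(log2(sets))              # log2(256) = 8 index bits
offset_bits = int(log2(block_size))       # log2(64)  = 6 block-offset bits
print(index_bits, offset_bits)            # 8 6
```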
Calculating the L1 index bits as page offset - block offset only works because your diagram shows that your cache has the desirable property that all the index bits are page-offset bits. (So for aliasing behaviour, it's like a PIPT cache: homonyms and synonyms are impossible. You get VIPT speed without any of the aliasing downsides of virtual caches.)
So I guess calculating it both ways and comparing is a good sanity check: i.e. check that the result matches the diagram, or that the diagram matches the other parameters.
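For example, a cross-check under the parameters above (assuming 16 KiB pages, as your diagram implies):

```python
from math import log2

cache_size, block_size, ways = 32 * 1024, 64, 2
page_size = 16 * 1024  # 16 KiB pages => 14 page-offset bits

# Method 1: from the cache geometry
index_bits = int(log2(cache_size // (block_size * ways)))    # 8

# Method 2: from the diagram, page offset minus block offset
offset_bits = int(log2(block_size))
index_bits_from_diagram = int(log2(page_size)) - offset_bits  # 14 - 6 = 8

assert index_bits == index_bits_from_diagram
```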
It's also not required that the L1D index+offset bits "use up" all the page-offset bits: e.g. increasing L1D associativity would leave 1 or more page-offset bits as part of the tag. (This is fine and wouldn't introduce aliasing problems; it just means your L1D isn't as big as it could be for a given associativity and page size.)
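To make that concrete (same hypothetical 16 KiB-page parameters, just varying the associativity):

```python
from math import log2

page_size, cache_size, block_size = 16 * 1024, 32 * 1024, 64

for ways in (2, 4, 8):
    index_bits = int(log2(cache_size // (block_size * ways)))
    offset_bits = int(log2(block_size))
    tag_bits_inside_page = int(log2(page_size)) - (index_bits + offset_bits)
    print(f"{ways}-way: {tag_bits_inside_page} page-offset bit(s) end up in the tag")
# 2-way: 0, 4-way: 1, 8-way: 2
```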
It is common to build caches this way, though, especially with smaller page sizes. For example, x86 has 4 KiB pages, and Intel CPUs have used a 32 KiB / 8-way L1D for over a decade (32 KiB / 8 = 4 KiB). Making it larger (64 KiB) would also require making it 16-way associative, because changing the page size is not an option, and that would start to get too expensive for a low-latency, high-throughput cache with parallel tag + data fetch.

Earlier CPUs like Pentium III had a 16 KiB / 4-way L1D, and Intel was able to scale that up to 32 KiB / 8-way, but I don't think we should expect a larger L1D unless something fundamental changes. With your hypothetical CPU architecture with 16 KiB pages, though, a small + fast L1D with more associativity is certainly plausible. (Your diagram is pretty clear that the index goes all the way up to the page split, but other designs are possible without giving up the VIPT benefits.)
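Put another way, the largest aliasing-free VIPT cache is page size × associativity. A hypothetical helper to express that (the function name is mine, not standard terminology):

```python
def max_vipt_size(page_size: int, ways: int) -> int:
    """Largest L1 size where all index bits still fall inside the page offset."""
    return page_size * ways

print(max_vipt_size(4 * 1024, 8))   # 32768: Intel's 32 KiB / 8-way with 4 KiB pages
print(max_vipt_size(16 * 1024, 2))  # 32768: your 32 KiB / 2-way with 16 KiB pages
print(max_vipt_size(16 * 1024, 4))  # 65536: 16 KiB pages leave room for 64 KiB at 4-way
```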
See also "Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?" for more about the "VIPT hack" and why multi-level caches are necessary to get a combination of low latency and large capacity in practical designs. (And note that current Intel L1D caches are pipelined and multi-ported (2 reads and 1 write per clock) for access widths up to 32 bytes, or even all 64 bytes of a line with AVX-512; see "How can cache be that fast?". So making L1D larger and more highly associative would cost a lot of power.)