tl;dr: Running the elasticsearch-shard utility when your Elasticsearch instance is dockerized does not seem possible. If that's true, how can we fix the occasional corrupted-translog errors that crash ES?
I have had Elasticsearch (ES) running nicely locally via Docker using docker-compose for some time, but today it started crashing on startup with this error:
TranslogCorruptedException[translog from source [/usr/share/elasticsearch/data/nodes/0/indices/0eNM-3niSvS0BUwAHf9M0w/0/translog/translog-175.tlog] is corrupted
Some googling revealed that this issue can be solved by running the utility bin/elasticsearch-shard remove-corrupted-data. The problem is that this utility requires ES to be shut down, but the container hosting the ES instance only stays alive while ES is running. This means there is no way to reach elasticsearch-shard inside the environment where the data and the Elasticsearch instance actually live.
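For reference, the invocation I'm trying to run looks roughly like this. The index name and shard id below are placeholders, not values from my cluster, and the exact flags may differ between 7.x versions:

```
# Placeholder index name and shard id; the tool must be run on the node
# holding the shard, with Elasticsearch stopped.
bin/elasticsearch-shard remove-corrupted-data --index my-index --shard-id 0
```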
I have verified that the container won't stay alive by stopping ES from a shell inside the container, like so:
## get into the docker container
docker exec -it 43146ff2a50c bash
## kill elasticsearch
pkill -f elasticsearch
and it immediately kills the container and kicks me out of the shell.
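I believe this happens because, in the official image, the Elasticsearch process is the container's main process (PID 1), so killing it ends the container. Overriding the entrypoint should at least give a shell with the data volume mounted and ES not running — sketched below, untested on my setup, with es01 being the service name from my compose file:

```
# Start the ES service's container but run bash instead of Elasticsearch,
# so the data volume is mounted while ES itself is stopped.
docker-compose run --rm --entrypoint bash es01
```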
I tried to see whether another container with access to the same data volumes, but not based on an ES image (so that it could stay alive while ES is off), could run the utility and fix the data on disk. I made a new docker-compose entry with a new Dockerfile, kept all the settings the same, but based the build on an Ubuntu image (ignore the environment variables other than ES_01_DATA_VOLUME; they aren't relevant):
docker-compose.yml
es01-truncate-corrupted-shards:
  build:
    context: .
    dockerfile: Elasticsearch.TruncateCorruptedShards.Dockerfile
    args:
      - CERTS_DIR=${CERTS_DIR}
  container_name: es01-truncate-corrupted-shards
  environment:
    - node.name=es01
    - cluster.name=es-docker-cluster
    - discovery.seed_hosts=es02,es03
    - cluster.initial_master_nodes=es01,es02,es03
    - bootstrap.memory_lock=true
    - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    - xpack.license.self_generated.type=basic
    - xpack.security.enabled=true
    - xpack.security.http.ssl.enabled=true
    - xpack.security.http.ssl.key=$CERTS_DIR/es01/es01.key
    - xpack.security.http.ssl.certificate_authorities=$CERTS_DIR/ca/ca.crt
    - xpack.security.http.ssl.certificate=$CERTS_DIR/es01/es01.crt
    - xpack.security.transport.ssl.enabled=true
    - xpack.security.transport.ssl.verification_mode=certificate
    - xpack.security.transport.ssl.certificate_authorities=$CERTS_DIR/ca/ca.crt
    - xpack.security.transport.ssl.certificate=$CERTS_DIR/es01/es01.crt
    - xpack.security.transport.ssl.key=$CERTS_DIR/es01/es01.key
  ulimits:
    memlock:
      soft: -1
      hard: -1
  volumes:
    - ${ES_01_DATA_VOLUME}
    - ${CERTS_VOLUME}
  ports:
    - ${ES_01_PORT}
  mem_limit: ${SINGLE_NODE_MEM_LIMIT}
Elasticsearch.TruncateCorruptedShards.Dockerfile
FROM ubuntu:rolling
RUN apt-get update \
    && apt-get install -y curl gnupg \
    && curl -fsSL https://artifacts.elastic.co/GPG-KEY-elasticsearch | apt-key add - \
    && echo "deb https://artifacts.elastic.co/packages/7.x/apt stable main" | tee -a /etc/apt/sources.list.d/elastic-7.x.list \
    && apt-get update \
    && apt-get install -y elasticsearch
RUN /usr/share/elasticsearch/bin/elasticsearch-shard remove-corrupted-data
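I build and start this service with something like the following (assuming the default docker-compose.yml file name):

```
docker-compose up --build es01-truncate-corrupted-shards
```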
When I run this it installs everything correctly and attempts to use the utility, but then errors like so:
#6 1.265 WARNING: Elasticsearch MUST be stopped before running this tool.
#6 1.265
#6 1.360 Exception in thread "main" ElasticsearchException[no node folder is found in data folder(s), node has not been started yet?]
#6 1.363 at org.elasticsearch.cluster.coordination.ElasticsearchNodeCommand.processDataPaths(ElasticsearchNodeCommand.java:148)
#6 1.363 at org.elasticsearch.cluster.coordination.ElasticsearchNodeCommand.execute(ElasticsearchNodeCommand.java:168)
#6 1.363 at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:77)
#6 1.363 at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:112)
#6 1.363 at org.elasticsearch.cli.MultiCommand.execute(MultiCommand.java:95)
#6 1.363 at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:112)
#6 1.363 at org.elasticsearch.cli.Command.main(Command.java:77)
#6 1.363 at org.elasticsearch.index.shard.ShardToolCli.main(ShardToolCli.java:24)
which leads me to believe that, despite having access to the ES_01_DATA_VOLUME volume, the tool can't see a node folder. My suspicion is that this is because the tool runs in a RUN step at image build time (note the #6 build-step prefixes in the output), and compose volumes are only mounted when a container runs, not during the build.
Ultimately I'm not too concerned with how the corrupted translog gets fixed as long as it's possible, but it seems to me that with the constraints of the Docker environment it's not. Do I need to install ES on the host machine, point it at the data files, and have it modify them? That seems like the same idea as the second, non-ES container trick I tried, so it would presumably fail the same way. It also defeats the purpose of the containerized environment.
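One shape of a fix I can imagine, if the problem really is that the RUN step executes at build time before volumes are mounted: defer the tool to container start by replacing the final RUN in the Dockerfile with a CMD, so it executes with the data volume attached. A sketch (untested; the tool may still need index/shard arguments or interactive confirmation):

```
# In Elasticsearch.TruncateCorruptedShards.Dockerfile, instead of
# "RUN /usr/share/elasticsearch/bin/elasticsearch-shard remove-corrupted-data",
# run the tool when the container starts, i.e. when ${ES_01_DATA_VOLUME}
# is actually mounted:
CMD ["/usr/share/elasticsearch/bin/elasticsearch-shard", "remove-corrupted-data"]
```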
I am stumped and would be so grateful for any help. It's hard to imagine that fixing something like corrupted data files is impossible, or that the ES team overlooked it!