Sometimes I get this error in my TCP server:

data = connection.recv(4096).decode("utf-8-sig")

File "/usr/lib/python3.6/encodings/utf_8_sig.py", line 23, in decode
    (output, consumed) = codecs.utf_8_decode(input, errors, True)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 5: invalid continuation byte

This is the code:

    server_address = ('xx.xx.xx.xx', 10000)
    print('starting up on %s port %s' % server_address)
    sock.bind(server_address)
    # Listen for incoming connections; queue up to 25 pending connections
    sock.listen(25)
    while True:
        # Wait for a connection
        print('waiting for a connection')
        try:
            connection, client_address = sock.accept()
            print('connection from', client_address)
            # Receive the data in small chunks and retransmit it
            while True:
                # decode() converts bytes to str; the default codec is utf-8
                data = connection.recv(4096).decode("utf-8-sig")

If I remove the .decode("utf-8-sig") call, I get this error instead:

TypeError: a bytes-like object is required, not 'str'

How can I prevent this? Previously I used utf-8, and the error occurred more often than it does with utf-8-sig.

TCP is a byte-stream protocol and can split the data you send, so unless you also check that you've received a complete message, you may not have all the bytes of a multi-byte UTF-8 sequence. Are you sure you are sending UTF-8 data? You have no minimal reproducible example that shows what data is transmitted. - Mark Tolonen

1 Answer

0xe0 (binary 11100000) is an invalid continuation byte since it starts with the bit pattern 111 rather than 10 (see here). That almost certainly means you have a mismatch between the data you're receiving and the data you expect to receive.
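To make the bit-pattern argument concrete, here is a minimal sketch of the check a UTF-8 decoder effectively performs on each continuation byte. The helper name is_continuation_byte is my own, chosen for illustration; it is not part of the codecs module.

```python
def is_continuation_byte(value: int) -> bool:
    """Return True if value has the UTF-8 continuation pattern 10xxxxxx."""
    # Mask off the top two bits and compare them against binary 10.
    return (value & 0b11000000) == 0b10000000


print(format(0xE0, "08b"))         # 11100000
print(is_continuation_byte(0xE0))  # False: top bits are 11, not 10
print(is_continuation_byte(0xA9))  # True: 0xA9 is 10101001
```

A byte like 0xe0 can only start a three-byte sequence, so it's flagged as soon as it, or whatever follows it, breaks the expected lead/continuation layout.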

The best thing to do would probably be to dump, as debug information, the data you're reading in, before you try to decode it. That could be done with something like:

data = connection.recv(4096)
print("DEBUG", data)
data = data.decode("utf-8-sig")

This will let you see what's actually being received, so you can confirm it's of the desired format.


And, based on what you showed in a comment, it's definitely not UTF-8:

b'\x03\x00\x00/*\xe0\x00\x00\x00\x00\x00Cookie: mstshash=Administr\r\n\x01\x00\x08\x00\x03\x00\x00\x00'
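As a quick sanity check, decoding that dump should reproduce the error from the question. This is only a sketch: the variable name payload is mine, and the bytes are copied from the dump above.

```python
payload = (
    b'\x03\x00\x00/*\xe0\x00\x00\x00\x00\x00'
    b'Cookie: mstshash=Administr\r\n'
    b'\x01\x00\x08\x00\x03\x00\x00\x00'
)

try:
    payload.decode("utf-8-sig")
    error = None
except UnicodeDecodeError as exc:
    # Keep the exception so its details can be inspected below.
    error = exc

if error is not None:
    # Offending byte, its offset in the data, and the decoder's reason.
    print(hex(payload[error.start]), error.start, error.reason)
```

The offending byte sits at offset 5, right after the "/*", which matches the "position 5" in the traceback.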

Interestingly, there are some links, here and here, that describe sessions containing that mstshash=Administr string as possible RDP hacking attempts. So you may have to put some effort into seeing where these sessions are originating from, and potentially hardening your network a little more.
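As for keeping the server from crashing on traffic like this, one option is to treat undecodable input as a reason to drop the client instead of letting the exception propagate. The sketch below is a rough illustration, not a drop-in fix: make_decoder and decode_chunk are hypothetical helper names, and it assumes your real clients send UTF-8 text. It uses an incremental decoder so that a multi-byte character split across two recv() calls, as mentioned in the comment on the question, is not mistaken for bad data.

```python
import codecs


def make_decoder():
    # An incremental decoder buffers an incomplete multi-byte sequence
    # until the remaining bytes arrive in a later chunk.
    return codecs.getincrementaldecoder("utf-8-sig")(errors="strict")


def decode_chunk(decoder, chunk):
    """Return the decoded text, or None if the chunk is not valid UTF-8."""
    try:
        return decoder.decode(chunk)
    except UnicodeDecodeError:
        return None


# A two-byte character (0xc3 0xa9) split across two chunks still decodes.
decoder = make_decoder()
first = decode_chunk(decoder, b"caf\xc3")
second = decode_chunk(decoder, b"\xa9")

# The start of the dump above is rejected instead of raising.
bad = decode_chunk(make_decoder(), b"\x03\x00\x00/*\xe0\x00")
print(repr(first), repr(second), bad)
```

In the receive loop you would create one decoder per connection, pass each connection.recv(4096) result through decode_chunk, and close the connection when it returns None.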