I have been struggling for a while to get my program to run stably. It hits hard faults while running, and I am going in circles.
My project:
- Nucleo F446ze (STM32F446ze)
- An LTE modem connected to uart2
- My PC connected to uart3 (for logging only).
- FreeRTOS downloaded from git, using their STM port
- LWIP 2.1.0 downloaded from git
Some FreeRTOS configs:
- configASSERT enabled
- configCHECK_FOR_STACK_OVERFLOW set to 2
- configUSE_MALLOC_FAILED_HOOK set to 1
- configTOTAL_HEAP_SIZE set to 30k (I have 10k left when I query remaining heap size)
- INCLUDE_uxTaskGetStackHighWaterMark set to 1 (all tasks are within stack limits)
- SysTick is dedicated to FreeRTOS. I use TIM6 at 1 kHz to increment the HAL tick.
- All NVIC interrupt priorities are set to 5 and higher, and again, configASSERT is enabled, so I am pretty sure "interrupt management" is covered.
I also use these defines to map the FreeRTOS interrupt handlers to the CMSIS names:
#define vPortSVCHandler SVC_Handler
#define xPortPendSVHandler PendSV_Handler
#define xPortSysTickHandler SysTick_Handler
My program does the following in sequence:
- setup clocks and peripherals
- enable interrupts
- create "StartLwIP" task
- start FreeRTOS scheduler
Then "StartLwIP" does:
- Send commands via uart2 to LTE modem to enable data mode
- Initialize LwIP stack (negotiate ppp with peer)
- Start a new "Test" task
The "Test" task does:
- Open connection to a TCP server on the internet
- Send a message
- Close socket
- vTaskDelay [100|10|-]
- repeat
When I use vTaskDelay(100), the program can run without problems for hours (ran it over night, no issues).
When I use vTaskDelay(10), the program runs for a while (between 1 and 5 minutes). Then it crashes and ends up in the hard fault handler.
When I remove the vTaskDelay (which would be the preferred solution), it crashes even faster. Again, it varies, but somewhere between a few seconds and a minute.
I am 99% sure the problem is not heap / stack related. The high water marks and heap consumption look perfectly fine, not even close to exceeding the heap or any stack.
Memory management in LWIP is somewhat confusing to me, but since I am only opening and closing connections over and over, I can't believe I am running out of PBUFs in LWIP. I increased the pool sizes anyway.
I have been struggling for weeks and eventually started to doubt the STM HAL. Then I stumbled upon the __HAL_LOCK in the peripheral libraries (uart in my case), for example in HAL_UART_Transmit_IT:
HAL_StatusTypeDef HAL_UART_Transmit_IT(UART_HandleTypeDef *huart, uint8_t *pData, uint16_t Size)
{
  /* Check that a Tx process is not already ongoing */
  if (huart->gState == HAL_UART_STATE_READY)
  {
    if ((pData == NULL) || (Size == 0U))
    {
      return HAL_ERROR;
    }
    /* Process Locked */
    __HAL_LOCK(huart); <<<<======
    huart->pTxBuffPtr = pData;
    huart->TxXferSize = Size;
    huart->TxXferCount = Size;
    huart->ErrorCode = HAL_UART_ERROR_NONE;
    huart->gState = HAL_UART_STATE_BUSY_TX;
    /* Process Unlocked */
    __HAL_UNLOCK(huart); <<<<======
    /* Enable the UART Transmit data register empty Interrupt */
    __HAL_UART_ENABLE_IT(huart, UART_IT_TXE);
    return HAL_OK;
  }
  else
  {
    return HAL_BUSY;
  }
}
When I go to the definition of the lock macro I got a bit worried:
#if (USE_RTOS == 1U)
/* Reserved for future use */
#error "USE_RTOS should be 0 in the current HAL release"
#else
#define __HAL_LOCK(__HANDLE__) \
I've read several threads on this, here and here for example. Many topics also claim that the locking mechanism is poorly implemented and not thread safe at all. That is interesting, because even without an RTOS, using interrupts would then be a potential problem.
I downloaded the latest STMCube version to check whether this has been solved by now, but it is all still in the same state. STM HAL doesn't seem to do much with their USE_RTOS macro.
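The thread-safety concern can be reproduced on a PC with a small model. This is only a sketch: the macro definition above is cut off, so the model assumes a plain, non-atomic check-then-set on a lock field. All names (`model_handle_t`, `model_try_lock`, `other_context`, `model_shows_double_acquire`) are hypothetical and not part of the HAL.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the HAL lock state and handle. */
typedef enum { MODEL_UNLOCKED = 0, MODEL_LOCKED = 1 } model_lock_t;
typedef struct { volatile model_lock_t lock; } model_handle_t;

/* Models a preemption (task switch or ISR) between the check and the store. */
typedef void (*preempt_fn)(model_handle_t *h);

/* Assumed shape of the lock: check, then set, with nothing making it atomic.
 * Returns true if the caller believes it now owns the lock. */
bool model_try_lock(model_handle_t *h, preempt_fn preempt)
{
    if (h->lock == MODEL_LOCKED) {
        return false;            /* the real code would report "busy" here */
    }
    if (preempt != NULL) {
        preempt(h);              /* window in which another context can run */
    }
    h->lock = MODEL_LOCKED;
    return true;
}

static bool second_context_got_lock = false;

/* The preempting context runs the same sequence without being interrupted. */
void other_context(model_handle_t *h)
{
    second_context_got_lock = model_try_lock(h, NULL);
}

/* Returns true when both contexts end up believing they hold the lock. */
bool model_shows_double_acquire(void)
{
    model_handle_t h = { MODEL_UNLOCKED };
    second_context_got_lock = false;
    bool first = model_try_lock(&h, other_context);
    return first && second_context_got_lock;
}
```

In this model `model_shows_double_acquire()` returns true, which only shows that a check-then-set lock of this shape cannot exclude a second context. It does not prove that this is what causes the hard fault.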
In my program, I am using different tasks that read and write over the same uart instance. The LWIP TCP thread will send data, while the LWIP RX thread will constantly read from uart. My uart receives data in interrupt mode (passing byte by byte to a ring buffer).
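For reference, the receive path described above could look roughly like the following single-producer, single-consumer ring buffer. This is a minimal sketch, not my exact code: the names (`ring_buffer_t`, `rb_put`, `rb_get`) and the size `RB_SIZE` are made up for illustration, and it assumes a single-core target where the ISR is the only writer of `head` and one task is the only writer of `tail`.

```c
#include <stdbool.h>
#include <stdint.h>

#define RB_SIZE 256u   /* assumed capacity; power of two so indices can be masked */

typedef struct {
    volatile uint16_t head;            /* written only by the producer (RX ISR) */
    volatile uint16_t tail;            /* written only by the consumer (task)   */
    volatile uint8_t  data[RB_SIZE];
} ring_buffer_t;

/* Producer side: call from the UART RX interrupt with the received byte.
 * Returns false when the buffer is full and the byte had to be dropped. */
bool rb_put(ring_buffer_t *rb, uint8_t byte)
{
    uint16_t head = rb->head;
    uint16_t next = (uint16_t)((head + 1u) & (RB_SIZE - 1u));
    if (next == rb->tail) {
        return false;                  /* full: one slot is always left empty */
    }
    rb->data[head] = byte;
    rb->head = next;                   /* publish only after the byte is stored */
    return true;
}

/* Consumer side: call from the reading task.
 * Returns false when no byte is available. */
bool rb_get(ring_buffer_t *rb, uint8_t *out)
{
    uint16_t tail = rb->tail;
    if (tail == rb->head) {
        return false;                  /* empty */
    }
    *out = rb->data[tail];
    rb->tail = (uint16_t)((tail + 1u) & (RB_SIZE - 1u));
    return true;
}
```

Because each index has exactly one writer, this shape needs no lock between the ISR and the reading task. It holds at most `RB_SIZE - 1` bytes, and a second reader or writer would break that assumption.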
Finally my questions:
Is it possible that this locking mechanism is the root cause of my hard faults? I tried to find somebody who experienced the same problem, but couldn't find any "proof" that would confirm this. So maybe the "horrible locking mechanism" isn't the best implementation, but is also not the root cause of my problem.
Are there "steps" to take to get more details out of a hard fault? I would really like to find the offending line of code. I found this page that explains how to continue, but I don't know how to obtain the pc (I am using VScode, I can break in the while(1) loop, but then what...?).
It always crashes here:
HardFault_Handler
prvPortStartFirstTask
xPortStartScheduler
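On the second question, a common approach on Cortex-M is to read the exception stack frame that the core pushes on fault entry, since it contains the pc of the faulting instruction. The sketch below only shows the decoding step in portable C. It assumes the handler has already obtained a pointer to the stacked frame (taken from MSP or PSP depending on bit 2 of the EXC_RETURN value in lr), and the names `fault_frame_t`, `decode_fault_frame`, `frame_is_on_psp` and `cfsr_reports_imprecise_bus_fault` are hypothetical.

```c
#include <stdint.h>

/* Order in which the core stacks registers on exception entry.
 * With the FPU active an extended frame follows, but these eight
 * words still come first. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12;
    uint32_t lr;    /* link register at the time of the fault */
    uint32_t pc;    /* address of the faulting instruction (for precise faults) */
    uint32_t psr;
} fault_frame_t;

/* Copy the eight stacked words into named fields for inspection. */
fault_frame_t decode_fault_frame(const uint32_t *stacked)
{
    fault_frame_t f;
    f.r0  = stacked[0];
    f.r1  = stacked[1];
    f.r2  = stacked[2];
    f.r3  = stacked[3];
    f.r12 = stacked[4];
    f.lr  = stacked[5];
    f.pc  = stacked[6];
    f.psr = stacked[7];
    return f;
}

/* EXC_RETURN bit 2 selects the stack the frame was pushed on:
 * 0 = main stack (MSP), 1 = process stack (PSP, used by tasks). */
int frame_is_on_psp(uint32_t exc_return)
{
    return (exc_return & 0x4u) != 0u;
}

/* CFSR (at 0xE000ED28) bit 10 is IMPRECISERR: when set, the stacked pc
 * may point past the instruction that actually caused the bus fault. */
int cfsr_reports_imprecise_bus_fault(uint32_t cfsr)
{
    return (cfsr & (1u << 10)) != 0u;
}
```

With the decoded pc, the offending source line can be looked up in the ELF file, for example with addr2line or with `info line *<address>` in the gdb console. Whether the frame is on the PSP matters here because the tasks run on the process stack while the handler itself runs on the main stack.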
Sorry for the lengthy question, but I wanted to be thorough at least and hope that somebody can confirm some things, or maybe even help me in the right direction to get past this....
Many thanks in advance!