1
votes

I am working on a customized/proprietary RTOS provided by my client.

The RTOS uses round robin scheduling with priority preemption.

The scenario is:

  1. The Renesas H8S controller is running at 20 MHz.
  2. I have configured an interrupt for Ethernet (a LAN9221 chip raises it).
  3. An OS task that reads data from the LAN controller runs at the highest priority in the OS.
  4. Another OS task, TCP, is the second-highest-priority task in the system.
  5. An OS task refreshes the watchdog.

I generated network traffic to simulate a flood on the network. The problem is that at high data rates (more than 500 packets/second, each raising the Ethernet ISR), the watchdog, which is configured for 1 second, fires.

The watchdog is configured to be serviced by a lower-priority OS task so that it can detect problems in OS functionality.

I suspect that the frequency of the ISR and the higher-priority tasks are not letting the watchdog task be scheduled. To confirm this, I serviced the watchdog in the ISR itself and found that the system then works up to 2000 packets/second.

Could you please suggest how to handle this situation so that the watchdog does not fire even at higher data/interrupt rates?

The watchdog is refreshed in an OS task running at normal OS priority, which helps in catching endless loops.

The task at the highest OS priority is the Ethernet packet reading task. A hardware interrupt is raised when the Ethernet controller receives a packet, and in the ISR we schedule the waiting Ethernet packet reading task.

Also, in my system the OS is not driven by a timer interrupt (as other OSes are). The OS is round robin and tasks relinquish control voluntarily. So raising the watchdog task's priority above normal is not possible: the OS would always find it ready at a higher priority (the watchdog is refreshed in an infinite loop that does not wait for any event), and other tasks would not get time to execute. Only tasks that wait on some event can have high priorities.

So the problem is that the watchdog task is not getting time to refresh the watchdog because of frequent interrupts and continuous scheduling of the high-priority task (Ethernet packet reading).

4
"A task which reads the data from LAN controller is running at highest priority in OS." How long does this task take to complete? I do believe that the watchdog is not getting enough time to time-out. Try setting it to highest priority and see what happens. My guess is your problem will go away. Since the watchdog should only run very briefly it might not affect your scheduling. - RedX

4 Answers

4
votes

Try to give your watchdog a higher priority.

This might seem wrong at first glance. A watchdog shouldn't get a high priority, but that's only true for systems which aren't under heavy load. Under heavy load, the scheduler will push the watchdog back (it's low priority after all), which can cause spurious timeouts.

Giving the watchdog a high priority should not have a big impact on performance (it's a small task, runs not very often, triggered by an interrupt) but makes sure it can't starve.

The disadvantage is that you can't catch endless loops anymore (since the loop can now be interrupted by the watchdog).

You should also consider badly designed hardware or a bad mapping of interrupts. Maybe you can give the watchdog IRQ a higher priority than the network card. That would allow the watchdog to process its interrupts in a timely fashion without you having to give the task a higher priority.

Or you can try to increment a counter when a network packet has been processed. A new, high priority watchdog thread could watch this counter and re-configure the low-prio watchdog task not to fire as long as the counter changes.
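The counter idea could look roughly like the sketch below. Every name here (`packets_processed`, `supervisor_poll`, `hw_watchdog_kick`) is made up for illustration and is not part of the poster's RTOS; the hardware refresh is replaced by a plain counter so the logic can be tried on a host. It assumes the high-priority supervisor is woken by some periodic event rather than spinning.

```c
#include <stdbool.h>
#include <stdint.h>

/* All names are hypothetical; none come from the poster's RTOS. */
static volatile uint32_t packets_processed; /* bumped by the packet task */
static uint32_t last_seen;                  /* value seen at the previous poll */
static unsigned hw_kicks;                   /* stand-in for the real watchdog refresh */

static void hw_watchdog_kick(void)
{
    hw_kicks++; /* on target: write the watchdog refresh register instead */
}

/* Called by the packet task each time one packet has been fully handled. */
void packet_task_on_packet_done(void)
{
    packets_processed++;
}

/*
 * Called from a high-priority task on a periodic event.
 * Kicks the watchdog if the low-priority task checked in, or if the
 * network path made progress since the last poll. Returns true if kicked.
 */
bool supervisor_poll(bool low_prio_task_checked_in)
{
    uint32_t now = packets_processed;
    bool network_progress = (now != last_seen);

    last_seen = now;
    if (low_prio_task_checked_in || network_progress) {
        hw_watchdog_kick();
        return true;
    }
    return false;
}
```

With this shape, an endless loop that stops both packet handling and the low-priority check-in is still caught, while a busy but healthy network path is not treated as a hang.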

1
votes

In any form of real-time application you need, by definition, to be 100% aware of what is going on. You must know how much time each task consumes. Measure the time needed for each task with an oscilloscope by toggling a pin. Then calculate these times for the whole system. If the higher priority tasks take too much time, well, then obviously the dog will starve.
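As a rough illustration of the "calculate these times" step, the helper below turns a measured per-packet busy time into a cycle budget and a CPU load figure. The function names are invented for this sketch, and the busy time has to come from your own scope measurement; nothing here is a measured value from the question.

```c
#include <stdint.h>

/* CPU cycles available per packet at a given packet rate.
 * packets_per_sec must be non-zero. */
uint32_t cycles_per_packet(uint32_t cpu_hz, uint32_t packets_per_sec)
{
    return cpu_hz / packets_per_sec;
}

/* CPU load in percent caused by packet handling alone.
 * busy_us_per_packet is the measured ISR + task time for one packet. */
uint32_t cpu_load_percent(uint32_t busy_us_per_packet, uint32_t packets_per_sec)
{
    uint64_t busy_us_per_sec = (uint64_t)busy_us_per_packet * packets_per_sec;

    /* busy_us_per_sec / 1,000,000 us * 100 % */
    return (uint32_t)(busy_us_per_sec / 10000u);
}
```

For the numbers in the question, `cycles_per_packet(20000000, 500)` gives 40000 cycles per packet, and at 2000 packets/second the budget shrinks to 10000 cycles. If the measured load approaches 100%, the lower-priority watchdog task has no time left to run.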

If this is too complex to measure because of acyclic or non-deterministic behavior, the program needs to be fixed. If the watchdog sits in a high priority task, you have pretty much disabled it for any task with lower prio. You might as well shut the watchdog off entirely then.

Trial & error patches, giving the watchdog higher prio, or increasing the CPU clock until the bug goes away is simply not a professional approach.

But then of course, the hardware might not be sufficient to service such a high data load as you expect. Then you may have no other option but to either use dirty patches or re-design the product from scratch with a suitable MCU.

0
votes

It is probably not a matter of telling you how to do it; the architecture you described should work. What you need to do is discover why the watchdog is not serviced.

If your RTOS does not have instrumentation or tools for debugging and testing, you could add I/O toggling in the watchdog loop and watch it with a scope. All the periods where it stops toggling are where higher-priority tasks or interrupts are running; if that happens for more than one second, the watchdog will trigger. You might then add similar instrumentation to your other tasks and ISRs to see what is taking the time.

Is it possible that you are deadlocking under high load, so that the system is in fact failing? That would be a situation where the watchdog firing is entirely valid. You don't want to stop it firing if it is in fact detecting a system failure; you want to fix the system failure.
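A minimal sketch of such instrumentation follows. The "port" is an ordinary variable standing in for a memory-mapped output register; on the real H8S you would point it at a spare port from the hardware manual. The pin masks and function names are assumptions made for this example.

```c
#include <stdint.h>

/* Stand-in for a memory-mapped output port (hypothetical; on target this
 * would be a pointer to a spare port data register). */
static volatile uint8_t debug_port;

#define WDT_TASK_PIN 0x01u /* toggles once per watchdog-task pass */
#define ETH_ISR_PIN  0x02u /* high while the Ethernet ISR is running */

static inline void debug_pin_toggle(uint8_t mask) { debug_port ^= mask; }
static inline void debug_pin_set(uint8_t mask)    { debug_port |= mask; }
static inline void debug_pin_clear(uint8_t mask)  { debug_port &= (uint8_t)~mask; }

/* One pass of the watchdog task loop. */
void watchdog_task_iteration(void)
{
    debug_pin_toggle(WDT_TASK_PIN);
    /* ...refresh the watchdog and yield to the OS here... */
}

/* Body of the Ethernet ISR, bracketed so its duration shows on the scope. */
void ethernet_isr_body(void)
{
    debug_pin_set(ETH_ISR_PIN);
    /* ...acknowledge the LAN controller and wake the reader task here... */
    debug_pin_clear(ETH_ISR_PIN);
}
```

On the scope, gaps in the first trace show when the watchdog task is starved, and the second trace shows how much of that time is spent in the ISR.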

0
votes

If the task that handles network packets consumes so much time that it prevents the task responsible for refreshing the watchdog from getting CPU time, then the system is unable to handle high networking load. The watchdog problem is only a symptom of this "unable to handle high network load" problem.

The solution is to use a faster CPU, slow down the network, reduce the overhead of handling packets, or some combination of these options, so that the system can handle high network load (and so that the task that refreshes the watchdog does get run). Note that "handling high network load" may include dropping packets, which is the normal/established approach for handling network congestion.
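One possible way to drop packets under load is to cap how many packets the reader task handles each time it is woken and discard the rest, so the task returns to waiting and lower-priority tasks get CPU time. This is only a sketch: the budget value, the `rx_stats_t` type and the simulated `pending` count are all assumptions, not the LAN9221 driver's actual interface.

```c
#include <stdint.h>

#define MAX_PACKETS_PER_WAKE 8u /* hypothetical budget; tune from measurements */

typedef struct {
    uint32_t pending; /* packets waiting in the receive FIFO (simulated) */
    uint32_t handled; /* packets passed up the stack */
    uint32_t dropped; /* packets discarded because the budget ran out */
} rx_stats_t;

/* One wake-up of the packet reading task. */
void packet_task_wake(rx_stats_t *rx)
{
    uint32_t budget = MAX_PACKETS_PER_WAKE;

    /* Handle packets until the FIFO is empty or the budget is spent. */
    while (rx->pending > 0u && budget > 0u) {
        rx->pending--;
        rx->handled++;
        budget--;
    }

    /* Discard whatever is left so the task can go back to waiting. */
    rx->dropped += rx->pending;
    rx->pending = 0u;
}
```

Counting the dropped packets makes the overload visible in diagnostics instead of showing up only as a watchdog reset.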