Error on QMI_WWAN in an embedded linux

Hi

I am attaching Quectel EC25-E module to an ebmbedded board, which hosts a yocto-based linux (kernel version 4.19.0-xilinx-v2019.2).
The kernel is configured to detect the module as a QMI_WWAN and registers it. I have cross-compiled quectel-CM for the board, and it is able to run, detecting the module and connection to the interne through it, registering a wwan interface.

But when I run an in-house application which transmits data on udp (rather high speed, in the range of 1 Mbps or so), every once in a while I get this error. (The wwan is registered and works afterwards)
What could be the cause of this error?

[  372.301571] ------------[ cut here ]------------
[  372.306199] NETDEV WATCHDOG: wwan0 (qmi_wwan): transmit queue 0 timed out
[  372.313028] WARNING: CPU: 2 PID: 0 at net/sched/sch_generic.c:461 dev_watchdog+0x2c8/0x2d0
[  372.321281] Modules linked in: al5d(O) al5e(O) allegro(O) xlnx_vcu_clk xlnx_vcu xlnx_vcu_core mali(O) uio_pdrv_genirq dmaproxy(O)
[  372.332950] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G           O      4.19.0-xilinx-v2019.2 #1
[  372.341548] Hardware name: xlnx,zynqmp (DT)
[  372.345717] pstate: 80000005 (Nzcv daif -PAN -UAO)
[  372.350500] pc : dev_watchdog+0x2c8/0x2d0
[  372.354501] lr : dev_watchdog+0x2c8/0x2d0
[  372.358500] sp : ffffff8008013d40
[  372.361799] x29: ffffff8008013d40 x28: 0000000000000020 
[  372.367102] x27: 00000000ffffffff x26: ffffffc879d1d3b8 
[  372.372406] x25: 0000000000000140 x24: ffffffc879ee0880 
[  372.377709] x23: ffffffc879d1d39c x22: ffffffc879d1cfb8 
[  372.383013] x21: ffffffc879d1d000 x20: ffffff80090a6000 
[  372.388316] x19: 0000000000000000 x18: 0000000000000010 
[  372.393620] x17: 0000000000000000 x16: 0000000000000000 
[  372.398923] x15: ffffffffffffffff x14: ffffff80090a8648 
[  372.404227] x13: ffffff8089165f07 x12: ffffff8009165f0f 
[  372.409530] x11: ffffff80090ba000 x10: ffffff8008013a40 
[  372.414834] x9 : 00000000ffffffd0 x8 : 756575712074696d 
[  372.420138] x7 : 736e617274203a29 x6 : ffffffc87ff9a998 
[  372.425441] x5 : ffffffc87ff9a998 x4 : 0000000000000004 
[  372.430744] x3 : 0000000000000103 x2 : 0000000000000102 
[  372.436048] x1 : e29bace31751b500 x0 : 0000000000000000 
[  372.441352] Call trace:
[  372.443785]  dev_watchdog+0x2c8/0x2d0
[  372.447440]  call_timer_fn+0x20/0x78
[  372.451006]  expire_timers+0xa8/0xb8
[  372.454573]  run_timer_softirq+0xac/0x198
[  372.458576]  __do_softirq+0x14c/0x248
[  372.462230]  irq_exit+0xb0/0xd0
[  372.465364]  handle_IPI+0x140/0x178
[  372.468843]  gic_handle_irq+0xb0/0xb8
[  372.472498]  el1_irq+0xb0/0x128
[  372.475633]  cpuidle_enter_state+0x80/0x230
[  372.479807]  cpuidle_enter+0x18/0x20
[  372.483374]  call_cpuidle+0x1c/0x38
[  372.486854]  do_idle+0x258/0x280
[  372.490074]  cpu_startup_entry+0x20/0x28
[  372.493990]  secondary_start_kernel+0x14c/0x160
[  372.498510] ---[ end trace 736d80a9afff1eb4 ]---

what is the application? Could you help to share more information about your test steps ? Just form the log you provide, it is hard to confirm the reason. It is better to debug module debug log to confirm it. You can email to support@quectel.com to get the tool to catch the module debug log. Thanks!

I send a UDP stream through a socket binded to the interface. What do you mean by what is the application?

I mean what is in-house application. In order to confirm your issue, it is better to provide the whole test log of quectel-CM and your test log. You know just the above log cannot confirm the root reason. Thanks!

My program is some gstreamer application streaming RTP packets on the device. It’s log would include lots of unrelated data.
Would a pcap, or a debug log of the udpsrc connected to the device good enough for this to be found out?

If that, it may need to catch module debug log to confirm it. You can email to support@quectel.com to get the tool to catch module debug log. Thanks!

I‘ve seen some case from the Internet same with yours.

https://bugzilla.kernel.org/show_bug.cgi?id=199549

Seems the runtime suspend feature matters. I think you can try ,
try 1, transmit "usbcore.autosuspend=-1 " to your kernel to disable
try 2, echo -1>/sys/bus/usb/devices/usbX/power/autosuspend_delay_ms to disable autosleep.
try 3, set supports_autosuspend = -1 in qmi_wwan_q.c
try 4 , disable the usb autosleep through echo on>/sys/bus/usb/devices/usb1/power/control.

This warning is expected once in a while if you push packets at a higher rate than the system as a whole, for host cpu to radio network, is able to handle. It looks a bit ugly because of the full stack trace, but it is merely a warning that the Linux network device was unable to push packets “fast enough” compared to some limit.

The qmi_wwan driver inherits this limit from the usbnet framework. It is hard coded to 5 seconds here:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/usb/usbnet.c#n57

I have considered changing this value to eliminate the unnecessary warnings, but concluded that the current value is as good as any. Occasional timeouts cannot be completely avoided with a USB connected radio interface. There will always be situations where the modem is unable to transmit anything for a short period, for example due to intermittent interference or a handover failure or other radio problems we have to anticipate. The 5 seconds timeout is pretty much standard for all types of Linux network devices. But it’s quite a lot easier to hit for a radio adapter than for an ethernet adapter.

You can see the code emitting the warning here:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/sched/sch_generic.c#n413

As you can see, it’s really much more harmless than it might look. It’s just signalling that the tx queue had to be stopped for more than 5 seconds.

2 Likes