Fix: Cursor stuck off server screen (Wayland, attempt)

Fix cursor sometimes disappearing when switching screens on Wayland

tl;dr

The libeis implementation (read: the Wayland compositor) can send a start_emulating event to a device that is already emulating. This is a violation of the libei/libeis protocol. The client<–>server connection is severed (which is correct behavior), but deskflow never releases the captured xdg-desktop-portal-input-capture, so the cursor remains stuck indefinitely. The fix is to release the portal-input-capture when deskflow handles the EI_EVENT_DISCONNECT event.

Detailed write-up

The issue as mentioned in #8005

The bug causes the mouse cursor to become stuck at the edge of the server screen (most often the top, but sometimes the sides) when using Deskflow on Wayland. When this happens, all mouse and keyboard input on the server is frozen, and the only way to recover is to SSH in and kill the deskflow-server process. Occasionally, instead of freezing, the cursor jumps from one edge to another or moves along the edge but remains trapped. Logs show the system repeatedly trying and failing to move the cursor offscreen.

Reproducing the issue

I am running KDE Plasma on Arch Linux.

For me, the bug happens very rarely (about once per day or less), both in a production build and in a debug build of deskflow. So I first looked for a way to reproduce the issue more reliably.

At some point in my investigation I decided to attach valgrind to deskflow to look for memory leaks. Valgrind slows down the execution of deskflow, and by chance this makes the issue more reproducible on my system. When valgrind is attached to deskflow-server, the transmission of events to the client is slowed.

An interesting finding along the way: I suspect there is some blocking code in deskflow, and while it executes deskflow does not answer the client’s heartbeats. Explanation: with short heartbeat times (<1 s) configured on the deskflow-server, the client often disconnects because the server does not respond within the heartbeat interval, and the server’s DEBUG2 level logs show that it is busy processing incoming mouse events and unable to respond. In this configuration the bug cannot be reproduced at all, but deskflow is barely usable due to the frequent resets.

With longer heartbeats, like “5000” (= 5 s), and valgrind attached to a debug build of deskflow to slow the server down a bit, one can reproduce the rare issue by “wiggling” the cursor with small (or circular) movements back and forth between the two screens. It then shows up as follows: sometimes the cursor is stuck; sometimes it is not stuck, but when trying to enter the client’s screen the server and client connection is cut, and the server idles and logs the characteristic log lines “event: motion on primary” and “event queue write result: 1”.

Understanding the root cause of this bug (well, the one that shows up in my valgrind debug build) took me some time. My setup is a debug build of deskflow, executed with the command

DEBUGINFOD_URLS="https://debuginfod.archlinux.org/" valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --log-file=valgrind.log --verbose ./build/bin/deskflow-server -f --debug DEBUG2 --name linux --log deskflow.log --disable-client-cert-check --address 0.0.0.0:24800 --no-daemon --config ./deskflow-server.conf

From the DEBUG2 level logs of deskflow I could see that for this reproducible variant of the bug, the log lines

[2025-04-16T14:26:40] DEBUG1: ei: dispatching ei_device.start_emulating() on object 0xff00000000000007
	/home/dustin/git/personalForks/deskflow/src/lib/platform/EiScreen.cpp:96
[2025-04-16T14:26:40] ERROR: ei: Invalid device state 3 for a start_emulating event
	/home/dustin/git/personalForks/deskflow/src/lib/platform/EiScreen.cpp:105
[2025-04-16T14:26:40] WARNING: ei: Connection error: Invalid device state 3 for a start_emulating event.

are always logged.

This is where it gets messy: I don’t know why the other people reporting the bug in the bug ticket did not get these log lines. @pruriggro here has gotten two subsequent EI_EVENT_DEVICE_START_EMULATING events, which could be consistent with what I observed. My logs were more verbose, though. Maybe libei does not log as part of deskflow on other people’s builds? I don’t fully understand this. Hoping that I am not chasing a different bug here, I attribute it to my quite custom environment.

OK, back to the topic: device state 3 is “EMULATING” (see enum here https://gitlab.freedesktop.org/libinput/libei/-/blob/main/src/libei-device.h?ref_type=heads#L36 ), and the libei(s) spec states that calling start_emulating() on a device that is already emulating is a protocol violation (see https://libinput.pages.freedesktop.org/libei/interfaces/ei_device/index.html#ei_devicestart_emulating). So this should never happen. I still couldn’t fully wrap my head around the inner workings of xdg-desktop-portal-InputCapture and libei(s), but as I understand it, the xdg-desktop-portal tells libeis (read: the Wayland compositor) via a (gtk) callback to start capturing the input device. At this point deskflow is not really involved, but deskflow captures and uses the file descriptors of the xdg-desktop-portal and handles the libei (client-side) implementation. Reading the cautious comments in EiEventQueueBuffer.cpp, I think that the “awkward” and specifically not order-preserving handling of the queue/buffer there might get us into the situation of start_emulating() being called on a device that is already emulating…
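To make the rule concrete, here is a minimal sketch of the state check that produces the logged error. It is not libei’s code: the `DeviceState` enum, the `Device` struct and `handleStartEmulating` are hypothetical names, and only the value 3 for the emulating state and the error text are taken from the log above. The other enum values, and the choice to accept start_emulating only in a resumed state, are simplifying assumptions.

```cpp
#include <cassert>
#include <string>

// Hypothetical mirror of the device states. Only Emulating == 3 is taken
// from the log above; the other names and values are assumptions.
enum class DeviceState { New = 0, Paused = 1, Resumed = 2, Emulating = 3 };

struct Device {
  DeviceState state = DeviceState::Resumed;
};

// Returns an empty string on success, or the error text on a violation.
// Assumption: start_emulating is only valid on a resumed device.
std::string handleStartEmulating(Device &dev) {
  if (dev.state != DeviceState::Resumed) {
    return "Invalid device state " +
           std::to_string(static_cast<int>(dev.state)) +
           " for a start_emulating event";
  }
  dev.state = DeviceState::Emulating;
  return {};
}
```

Calling `handleStartEmulating` twice on the same device succeeds the first time and returns the error text with state 3 the second time, which is the situation seen in the logs.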

In any case, the deskflow-server code does not itself call start_emulating() on the device; in the end it is the Wayland compositor that causes this libei(s) protocol violation. So the best deskflow-server can do at this point is to handle it properly, and this is where I think something is not right.

One key step for understanding the behaviour was obtaining debug logs from the xdg-desktop-portal by running dbus-monitor --session "interface='org.freedesktop.portal.InputCapture'" in a separate terminal. This showed that when the issue (calling start_emulating() on an already emulating device) happens, the xdg-portal-input-capture, which causes the pointer to vanish from the server’s screen, is never released. It stays in the “activated” state indefinitely. See this video that I awkwardly took with my phone because I screwed up my ssh connection:

https://github.com/user-attachments/assets/a27e8ca4-dd60-438b-a93b-e6b26e03f7d5 (note in the video that I can’t click at all, but when I wiggle the cursor the plasma/KDE animation [huge cursor] still triggers)

In the video, one can see that the xdg-portal-InputCapture was activated with a negative (?) pixel value, which so far has been consistent for me whenever the cursor on the server screen disappears. I tried to find out whether negative values are explicitly (dis)allowed or reasonable, but didn’t find conclusive evidence. Maybe someone here knows how this works in xdg/wayland? According to https://wayland.app/protocols/xdg-shell#xdg_positioner , xdg_positioner disallows negative values. However, wayland subsurfaces seem to allow negative values (see https://wayland-client-d.dpldocs.info/wayland.client.protocol.wl_subsurface_set_position.html). I am a bit confused here tbh.

After reading a lot of code, it occurred to me that when the protocol violation occurs, the libeis (read: wayland compositor-side) semi-gracefully stops the connection to the libei-client (read: deskflow), by sending the following signals/events:

EI_DEVICE_STATE_EMULATING → EI_DEVICE_STATE_REMOVED_FROM_CLIENT → EI_DEVICE_STATE_DEAD → EI_EVENT_DISCONNECT

This is where I see the root cause of the bug. When deskflow (the libei side) handles EI_EVENT_DISCONNECT, the xdg-portal-InputCapture is not released, and thus the cursor stays stuck. If we release it as part of the handling of EI_EVENT_DISCONNECT, together with some other minor code improvements there, deskflow handles the severed EI<–>EIS connection gracefully. The mouse/keyboard on the server stays usable, and deskflow will try to re-establish the InputCapture (on KDE/plasma popping up the dialog window again) when one tries to enter the client screen.
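The idea of the fix can be sketched in isolation as follows. This is not the code in this PR: `PortalCapture`, `EiEvent`, `Screen` and `cursorFreedAfterDisconnect` are hypothetical stand-ins that only model the active/released state of the capture, and the real change has to go through the portal session that deskflow holds.

```cpp
#include <cassert>

// Hypothetical stand-in for the InputCapture portal session.
class PortalCapture {
public:
  void activate() { m_active = true; }   // portal reported "activated"
  void release() { m_active = false; }   // hand input back to the server
  bool isActive() const { return m_active; }

private:
  bool m_active = false;
};

// Hypothetical subset of the events the screen has to handle.
enum class EiEvent { DeviceStartEmulating, Disconnect };

class Screen {
public:
  explicit Screen(PortalCapture &capture) : m_capture(capture) {}

  void handleEvent(EiEvent event) {
    if (event == EiEvent::Disconnect) {
      m_connected = false;
      // The point of the fix: never leave the capture active once the
      // EI connection is gone, otherwise the cursor stays trapped.
      if (m_capture.isActive()) {
        m_capture.release();
      }
    }
  }

  bool isConnected() const { return m_connected; }

private:
  PortalCapture &m_capture;
  bool m_connected = true;
};

// Usage: capture is active, then the connection is severed.
bool cursorFreedAfterDisconnect() {
  PortalCapture capture;
  Screen screen(capture);
  capture.activate();
  screen.handleEvent(EiEvent::Disconnect);
  return !capture.isActive() && !screen.isConnected();
}
```

Releasing only when the capture is active keeps the disconnect handler safe to run when the connection drops while the cursor is still on the server screen.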

Further issues identified

One time (when the bug discussed above was not hit at all, actually), valgrind caught a segfault in deskflow (debug build, without any code changes from my side). So it is still possible that I caught a different bug that is only visible in slow debug builds, and that the actual “cursor stuck” bug simply comes from undefined behavior due to a corrupted heap. This was a KDE/plasma wayland server with a windows client. For the record, here is what valgrind logged before dying due to the segfault:

==31020== Invalid read of size 8
==31020==    at 0x25086A: ClientProxy1_6::handleClipboardSendingEvent(Event const&, void*) (ClientProxy1_6.cpp:54)
==31020==    by 0x250D51: TMethodEventJob<ClientProxy1_6>::run(Event const&) (TMethodEventJob.h:49)
==31020==    by 0x13E633: EventQueue::dispatchEvent(Event const&) (EventQueue.cpp:242)
==31020==    by 0x13DD68: EventQueue::loop() (EventQueue.cpp:107)
==31020==    by 0x1B8578: ServerApp::mainLoop() (ServerApp.cpp:750)
==31020==    by 0x1B8B88: ServerApp::standardStartup(int, char**) (ServerApp.cpp:801)
==31020==    by 0x1BEDAC: standardStartupStatic(int, char**) (AppUtilUnix.cpp:36)
==31020==    by 0x1B89B4: ServerApp::runInner(int, char**, int (*)(int, char**)) (ServerApp.cpp:782)
==31020==    by 0x1BEDF7: AppUtilUnix::run(int, char**) (AppUtilUnix.cpp:41)
==31020==    by 0x15B1EB: App::run(int, char**) (App.cpp:99)
==31020==    by 0x131908: main (deskflow-server.cpp:58)
==31020==  Address 0x9419ad0 is 0 bytes inside a block of size 456 free'd
==31020==    at 0x48498DD: operator delete(void*, unsigned long) (vg_replace_malloc.c:1181)
==31020==    by 0x251378: ClientProxy1_8::~ClientProxy1_8() (ClientProxy1_8.h:15)
==31020==    by 0x23FC41: Server::handleClientDisconnected(Event const&, void*) (Server.cpp:1336)
==31020==    by 0x24B4E9: TMethodEventJob<Server>::run(Event const&) (TMethodEventJob.h:49)
==31020==    by 0x13E633: EventQueue::dispatchEvent(Event const&) (EventQueue.cpp:242)
==31020==    by 0x13DD68: EventQueue::loop() (EventQueue.cpp:107)
==31020==    by 0x1B8578: ServerApp::mainLoop() (ServerApp.cpp:750)
==31020==    by 0x1B8B88: ServerApp::standardStartup(int, char**) (ServerApp.cpp:801)
==31020==    by 0x1BEDAC: standardStartupStatic(int, char**) (AppUtilUnix.cpp:36)
==31020==    by 0x1B89B4: ServerApp::runInner(int, char**, int (*)(int, char**)) (ServerApp.cpp:782)
==31020==    by 0x1BEDF7: AppUtilUnix::run(int, char**) (AppUtilUnix.cpp:41)
==31020==    by 0x15B1EB: App::run(int, char**) (App.cpp:99)
==31020==  Block was alloc'd at
==31020==    at 0x4845F93: operator new(unsigned long) (vg_replace_malloc.c:487)
==31020==    by 0x214526: ClientProxyUnknown::initProxy(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, int) (ClientProxyUnknown.cpp:197)
==31020==    by 0x214958: ClientProxyUnknown::handleData(Event const&, void*) (ClientProxyUnknown.cpp:240)
==31020==    by 0x214F09: TMethodEventJob<ClientProxyUnknown>::run(Event const&) (TMethodEventJob.h:49)
==31020==    by 0x13E633: EventQueue::dispatchEvent(Event const&) (EventQueue.cpp:242)
==31020==    by 0x291805: StreamFilter::filterEvent(Event const&) (StreamFilter.cpp:89)
==31020==    by 0x26DF77: PacketStreamFilter::filterEvent(Event const&) (PacketStreamFilter.cpp:182)
==31020==    by 0x291859: StreamFilter::handleUpstreamEvent(Event const&, void*) (StreamFilter.cpp:94)
==31020==    by 0x291993: TMethodEventJob<StreamFilter>::run(Event const&) (TMethodEventJob.h:49)
==31020==    by 0x13E633: EventQueue::dispatchEvent(Event const&) (EventQueue.cpp:242)
==31020==    by 0x13DD68: EventQueue::loop() (EventQueue.cpp:107)
==31020==    by 0x1B8578: ServerApp::mainLoop() (ServerApp.cpp:750)
==31020== 
==31020== 
==31020== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==31020==  Bad permissions for mapped region at address 0x29F6D8
==31020==    at 0x29F6D8: ??? (in /home/REDACTED_USERNAME/git/personalForks/deskflow/build/bin/deskflow-server)
==31020==    by 0x250D51: TMethodEventJob<ClientProxy1_6>::run(Event const&) (TMethodEventJob.h:49)
==31020==    by 0x13E633: EventQueue::dispatchEvent(Event const&) (EventQueue.cpp:242)
==31020==    by 0x13DD68: EventQueue::loop() (EventQueue.cpp:107)
==31020==    by 0x1B8578: ServerApp::mainLoop() (ServerApp.cpp:750)
==31020==    by 0x1B8B88: ServerApp::standardStartup(int, char**) (ServerApp.cpp:801)
==31020==    by 0x1BEDAC: standardStartupStatic(int, char**) (AppUtilUnix.cpp:36)
==31020==    by 0x1B89B4: ServerApp::runInner(int, char**, int (*)(int, char**)) (ServerApp.cpp:782)
==31020==    by 0x1BEDF7: AppUtilUnix::run(int, char**) (AppUtilUnix.cpp:41)
==31020==    by 0x15B1EB: App::run(int, char**) (App.cpp:99)
==31020==    by 0x131908: main (deskflow-server.cpp:58)

I would really hope that this actually is not the real root cause, and we can simply ignore this one-time find for now, as debugging this heap corruption will be fairly painful imho…
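For context, the trace reads like a handler (ClientProxy1_6::handleClipboardSendingEvent) being dispatched for an object that Server::handleClientDisconnected had already deleted. The following sketch shows that general pattern and one common guard against it, namely dropping pending handlers in the destructor. `MiniQueue`, `Proxy` and `demo` are hypothetical and much simpler than deskflow’s EventQueue; whether this is what actually happens in deskflow is an assumption I have not verified.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical miniature event queue; not deskflow's EventQueue.
class MiniQueue {
public:
  using Handler = std::function<void()>;

  // Queue a handler together with the object it will touch.
  void post(const void *target, Handler handler) {
    m_pending.push_back({target, std::move(handler)});
  }

  // Drop every pending handler that belongs to the given target.
  void removeFor(const void *target) {
    m_pending.erase(
        std::remove_if(m_pending.begin(), m_pending.end(),
                       [target](const Pending &p) { return p.target == target; }),
        m_pending.end());
  }

  // Run everything that is still pending; returns how many handlers ran.
  int dispatchAll() {
    std::vector<Pending> pending = std::move(m_pending);
    m_pending.clear();
    int count = 0;
    for (Pending &p : pending) {
      p.handler();
      ++count;
    }
    return count;
  }

private:
  struct Pending {
    const void *target;
    Handler handler;
  };
  std::vector<Pending> m_pending;
};

class Proxy {
public:
  Proxy(MiniQueue &queue, int &counter) : m_queue(queue), m_counter(counter) {}

  // Guard: without this line, a pending handler would later run with a
  // dangling `this`, which is a use-after-free like the one valgrind reports.
  ~Proxy() { m_queue.removeFor(this); }

  void requestClipboard() {
    m_queue.post(this, [this] { ++m_counter; });
  }

private:
  MiniQueue &m_queue;
  int &m_counter;
};

// The proxy dies while its event is still queued; nothing must run afterwards.
int demo() {
  MiniQueue queue;
  int counter = 0;
  {
    Proxy proxy(queue, counter);
    proxy.requestClipboard();
  }
  return queue.dispatchAll();
}
```

In `demo` the queued handler is removed when the proxy is destroyed, so the later dispatch runs no handlers at all.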

If you read all of this, thanks for your attention guys, let me know if you have further thoughts/ideas!

/claim #8005

Suggested next steps

I have some confidence that this PR fixes an actual (albeit rare) bug. With my changes I have not personally observed the stuck cursor on the Wayland server again, but since the bug was rare for me on the release build with KDE/plasma anyway, I can’t be sure. So I suggest merging and test-driving this PR to see if it fixes the bug :–)