Tuesday, September 16, 2014

Screenshots, SendKeys, and Sixty-Four Bits

There are a couple of issues with the Internet Explorer driver that have been around since IE10 was released. They're pretty annoying when people encounter them, and the report for the first issue goes something like this:
I'm using Internet Explorer 10 (or 11), and when I call the sendKeys method, the keystrokes happen very slowly. Like one keystroke every 5 seconds. I'm on a 64-bit version of Windows, and I'm using the 64-bit IEDriverServer.exe. If I use the 32-bit version of the driver, the problem doesn't occur, but I really need to test using the 64-bit version because I need to test 64-bit IE. What's the deal?
The report for the second issue usually reads as follows:
I'm using Internet Explorer 10 (or 11), and even though I'm on 64-bit Windows, I'm using the 32-bit IEDriverServer.exe, because I was having problems with sendKeys being slow. Now, though, when I take a screen shot, it only shows the visible portion of the page. How can I take full page screen shots like I could when I use the 64-bit IEDriverServer.exe?
Both of these issues are fully documented in the Selenium issue tracker (#5116 for the sendKeys issue, and #5876 for the screenshot issue). A comment in each issue mentions that any fix would require "a massive rearchitecture of the IE driver's binary components, [so] no timeline is (or will be) available" for the delivery of a fix. What causes these issues? How are they related? Why would a fix be so darned difficult? The answers to those questions can all be summed up with a simple answer: "Windows Hooks." 

What is a Windows Hook?

All Windows applications have a routine in them called a "message loop." The message loop repeatedly calls the GetMessage API function, and processes messages sent to the application as they arrive in its queue. Hooks are a feature of the Windows message handling system that allow a developer to intercept, examine, and modify the message being sent to the application. By installing a hook, a developer could, for example, validate that a certain message was processed by the window being hooked. Or they could modify a message sent to the window to represent that the operating system could do things it actually can't. It's a clever mechanism, but it does have a few requirements.

First of all, the code being run when the hook is called (the "hook procedure") must exist in a dynamic-link library (DLL). That is, it cannot be simply a function exported from a compiled executable. The reason for this is that the code is actually going to be loaded into two applications, the application installing the hook, and the application being hooked. Using a DLL is the only way to avoid certain conflicts that would arise with loading one executable into the process space of another.

Secondly, the DLL must be of the same "bitness" of the process being hooked. In Windows, a 32-bit executable cannot load a 64-bit DLL. The converse is also true, that a 64-bit executable cannot load a 32-bit DLL. Incidentally, this is the root reason that there are two versions of IEDriverServer.exe, but that's another story for another time.

Windows Hooks and the IE Driver

The IEDriverServer.exe uses hooks for its implementation of a couple of features. The first use of a hook is in the processing of keystrokes. By default, the driver uses the Windows PostMessage API function to simulate keystrokes. It does this by sending a WM_KEYDOWN message, followed by WM_CHAR and WM_KEYUP messages for each key. However, PostMessage is asynchronous, so the driver has to wait to make sure that the WM_KEYDOWN message is processed before sending the other messages, otherwise keystrokes could be sent out of order, making key sequences garbled. The driver does this by installing a hook into IE's window procedure, and listening for the WM_KEYDOWN to be processed before proceeding. It also puts in a timeout of about five seconds waiting for the message to be processed to make sure that the driver doesn't wait forever. Note that the code path is slightly different if you're using the requireWindowFocus capability, using the SendInput API function instead, but the driver still uses a hook to make sure messages are processed before moving on.

The second place the driver uses a hook is when taking screenshots. The IE driver takes screenshots using the PrintWindow API function. PrintWindow can only take a screenshot of the visible portion of any given window, which means that in order to get a full-page screenshot (as required by the WebDriver API), the window must be sized large enough to display the entire page without scroll bars. However, Windows does not allow the window to be resized larger than the visible screen resolution. When we ask IE to resize itself, a WM_GETMINMAXINFO message is sent on a resize event so the IE can figure how large a window can be. By intercepting that message with a hook, and modifying the max values, we can trick IE into thinking that a window can be sized greater than the screen resolution would otherwise allow.

Since the IE driver makes use of hook procedures, the bulk of the IE driver is actually implemented in a DLL. So as to avoid having to manage multiple files when using the IE driver, this DLL is embedded as a resource inside the IEDriverServer executable, and extracted to the temp directory at runtime. Once extracted, it's loaded into memory by IEDriverServer, and the main entry point of the DLL is called. This gives the driver a way to inject itself into the IE process using hooks to accomplish what it needs to. This worked great up to and including IE 9.

What Happened in IE10?

When IE 10 was first released, we started to see reports of the two aforementioned issues coming in. Since version 7 of Internet Explorer, there has been the notion of multiple processes for a single "instance" of IE. There was the notion of a "manager" or "broker" process, which managed the outer, top-level window of Internet Explorer. All HTML rendering and ActiveX controls are managed by a "content" process. Through version 9 of Internet Explorer, these processes were the same bitness. That is, running 64-bit IE meant you got a 64-bit manager process, and 64-bit content processes. Running 32-bit IE meant you were using a 32-bit manager process, and 32-bit content processes. This all changed with IE 10.

One major change in IE 10 is that the manager process on 64-bit versions of Windows will always be a 64-bit process. By default, though, content processes remained 32-bit. This allowed the main process to be a 64-bit process, but still allowed the browser to remain compatible with all of the existing browser plug-ins for IE, which are overwhelmingly 32-bit. There are other reasons for this change, and I'm oversimplifying the architecture a bit. If you're truly interested in the deep details of the architecture, I encourage you to read Eric Lawrence's blog post that lays out the implications in great detail. Anyway, there are ways to force 64-bit content processes for IE 10 and above, but these will break the IE driver due to Protected Mode issues.

So for IE 10 and above, the situation is that we have a 64-bit process, which handles the main outer window, and a 32-bit process which owns the inner window where HTML content is rendered. The driver's window hook procedure for taking screenshots must be attached to the main outer window; the driver's hook procedure for verification of message processing while simulating keystrokes must be attached to the content window being automated. Remember that since the driver executable can only be 32-bit or 64-bit, but not both, the DLL in which the window hook procedures reside can only be the same bitness as the executable. Let's explore the implications of this.

For the 32-bit IEDriverServer, the hook procedure can be successfully attached to the (32-bit) content window for use with sendKeys. Attempting to install the window hook for screenshots is attempting to install a window hook into the manager process, which is 64-bit, which can't load the 32-bit DLL into its process space, so the hook installation fails, and screenshots are truncated.

Conversely, for the 64-bit IEDriverServer, the hook can be successfully installed in the top-level window for use in taking screenshots, because the process owning that window is a 64-bit process. However, when the driver attempts to install the hook into the (32-bit) content window to detect message processing during sendKeys, the DLL is 64-bit, and can't be loaded by the 32-bit executable which owns the content window. This means that the timeout is invoked for every keystroke, with sendKeys waiting about five seconds for each key.

Why Would This Be So Hard To Fix?

By now, I hope we have a better understanding of what the root cause of both issues is. What would it take to fix the issue? A naive implementation would just attempt to bundle both 32- and 64-bit DLLs in with a version of the server and be done with it. However, this won't work because the DLL where the hook procedure lives must be loaded by both executables, IEDriverServer.exe, and the IE process owning whichever window is being hooked.

The only way to completely and correctly resolve the issue would be to create a pair of executables, with a related pair of DLLs, and have the two executables establish some way of working together via an interprocess communication channel. With this approach, now we'd be asking a user to download and manage two executables instead of one, or to use an installer of some kind. Since the project aims for an "xcopy deploy" without requiring the use of an installer, that's a larger burden than one can expect users of the IE driver to undertake.

While it's tempting to simply suggest creating a second executable and embedding it as a resource for extraction at runtime just like we do with the DLL, that approach is flawed as well. Many antivirus and malware monitors will happily allow any application to place a DLL in the temp directory and let an executable call LoadLibrary on it, but having an executable file magically appear and attempt to run in the temp directory throws up all kinds of flags. Making it a requirement to disable your antimalware software before using the IE driver is not something I'd be comfortable with.

Creating a second executable, and its attendant DLL for using the hook procedure, and figuring out some way for the two executables to communicate with each other amounts to a massive rearchitecture of the IE driver. With Microsoft's announcement of the sunset of support for all legacy versions of IE in January 2016, and with the creation of their own WebDriver implementation, it's not clear that the benefit of making these intrusive changes will outweigh the cost of doing so.


  1. Thanks for the detailed explanation, Jim - excellent article.

    Perhaps it would be worth adding an update since you fixed the screenshot issue in v2.47.0.3?

    It would also be good if #5876 could be updated with this fixed information too since it's one of the first links that appear in a Google search for issues with Selenium screenshots.

    (for others reading this page, see https://github.com/SeleniumHQ/selenium/blob/master/cpp/iedriverserver/CHANGELOG for fix details)

  2. I might suggest that it might be seen as convenient to have a single executable to allow the web driver to work. I am sure it is. However I think that for a lot of people who work in the test automation space, having a Selenium WebDriver that works reliably and predictably is considerably more important than how many files it consists of or whether it happens to use an installer or if it can run as a windows service.

    I can understand that it is a lot of work and IE is going to be sunset in the future. The reality is that with IE in general that a lot of organisations get stuck on old versions for a long time relative to the other browsers.

    Re-architecting the driver might also not be as bad if someone were willing to give you a hand...maybe..

  3. Hello Good article.

    All you write is based on sloppy programming style. If the driver would be programmed correctly it would notice that the hook could not be set and throw an exception "Please use the 32 bit driver instead on this computer." instead of typing 1 character per 5 seconds!

    It shows once again that the Selenium programmers are far from being Windows experts. What a nonsense to send keystrokes with PostMessage() and then implement a windows hook to check if the keystroke has been received! That is complete nonsense!

    The correct way would be to use SendMessage() instead. If they use PostMessage() for the reason that they think an event handler in Internet Explorer could eventually block the SendMessage() call, then there is a simple solution: SendMessageTimeout().

    Remove this nonsense window hook and use SendMessageTimeout() and all is fine. This is very easy to fix! There is no need to hook another DLL into IE for something as basic as sending keystrokes to another process!

    1. In response to your first assertion, how, exactly do you anticipate determining the hook can't be set? In the case of attempting to set a cross-bitness hook, SetWindowsHook *doesn't fail*. If it did, doing as you suggest would, in fact, be easy, and would have been implemented long ago. As it stands, the current implementation does check for the possibility of not being able to set the hook, and does log the issue. Whether the proper thing to do is rudely throw an exception (which would be a surprising change in existing behavior) is a matter for discussion in a different forum. I will not undertake the discussion here.

      As to your second assertion, the behavior you describe is a legacy of when the code in question was shared with the Firefox native events implementation. Now that it no longer is, perhaps changing to use SendMessageTimeout might be appropriate. If, y'know, you can get around the COM restriction from calling SendMessage or its derivatives from inside a COM message handler.

      On a personal note, I love comments that rudely call me incompetent. Honestly, it's uncalled for in a public place, and would be far better addressed in private. If you get paid to write code, I do hope your sense of professionalism is better at your place of employment than you've displayed here.

    2. Hey @Elmu: Consider toning it down some. Most of your comments are coming across as extraordinarily rude. Perhaps you don't realize the tone.

      Regardless, why don't you spend some time contributing your ideas to the WebDriver project? In case you haven't noticed, all the project stakeholders happily accept reasonable pull requests.

  4. Hi to all , just choose IEDriverServer_Win32_2.52.1