Tuesday, September 16, 2014

Screenshots, SendKeys, and Sixty-Four Bits

There are a couple of issues with the Internet Explorer driver that have been around since IE10 was released. They're pretty annoying when people encounter them, and the report for the first issue goes something like this:
I'm using Internet Explorer 10 (or 11), and when I call the sendKeys method, the keystrokes happen very slowly. Like one keystroke every 5 seconds. I'm on a 64-bit version of Windows, and I'm using the 64-bit IEDriverServer.exe. If I use the 32-bit version of the driver, the problem doesn't occur, but I really need to test using the 64-bit version because I need to test 64-bit IE. What's the deal?
The report for the second issue usually reads as follows:
I'm using Internet Explorer 10 (or 11), and even though I'm on 64-bit Windows, I'm using the 32-bit IEDriverServer.exe, because I was having problems with sendKeys being slow. Now, though, when I take a screen shot, it only shows the visible portion of the page. How can I take full page screen shots like I could when I use the 64-bit IEDriverServer.exe?
Both of these issues are fully documented in the Selenium issue tracker (#5116 for the sendKeys issue, and #5876 for the screenshot issue). A comment in each issue mentions that any fix would require "a massive rearchitecture of the IE driver's binary components, [so] no timeline is (or will be) available" for the delivery of a fix. What causes these issues? How are they related? Why would a fix be so darned difficult? The answers to those questions can all be summed up with a simple answer: "Windows Hooks." 

What is a Windows Hook?

All Windows applications have a routine in them called a "message loop." The message loop repeatedly calls the GetMessage API function, and processes messages sent to the application as they arrive in its queue. Hooks are a feature of the Windows message handling system that allow a developer to intercept, examine, and modify the message being sent to the application. By installing a hook, a developer could, for example, validate that a certain message was processed by the window being hooked. Or they could modify a message sent to the window to represent that the operating system could do things it actually can't. It's a clever mechanism, but it does have a few requirements.

First of all, the code being run when the hook is called (the "hook procedure") must exist in a dynamic-link library (DLL). That is, it cannot be simply a function exported from a compiled executable. The reason for this is that the code is actually going to be loaded into two applications, the application installing the hook, and the application being hooked. Using a DLL is the only way to avoid certain conflicts that would arise with loading one executable into the process space of another.

Secondly, the DLL must be of the same "bitness" of the process being hooked. In Windows, a 32-bit executable cannot load a 64-bit DLL. The converse is also true, that a 64-bit executable cannot load a 32-bit DLL. Incidentally, this is the root reason that there are two versions of IEDriverServer.exe, but that's another story for another time.

Windows Hooks and the IE Driver

The IEDriverServer.exe uses hooks for its implementation of a couple of features. The first use of a hook is in the processing of keystrokes. By default, the driver uses the Windows PostMessage API function to simulate keystrokes. It does this by sending a WM_KEYDOWN message, followed by WM_CHAR and WM_KEYUP messages for each key. However, PostMessage is asynchronous, so the driver has to wait to make sure that the WM_KEYDOWN message is processed before sending the other messages, otherwise keystrokes could be sent out of order, making key sequences garbled. The driver does this by installing a hook into IE's window procedure, and listening for the WM_KEYDOWN to be processed before proceeding. It also puts in a timeout of about five seconds waiting for the message to be processed to make sure that the driver doesn't wait forever. Note that the code path is slightly different if you're using the requireWindowFocus capability, using the SendInput API function instead, but the driver still uses a hook to make sure messages are processed before moving on.

The second place the driver uses a hook is when taking screenshots. The IE driver takes screenshots using the PrintWindow API function. PrintWindow can only take a screenshot of the visible portion of any given window, which means that in order to get a full-page screenshot (as required by the WebDriver API), the window must be sized large enough to display the entire page without scroll bars. However, Windows does not allow the window to be resized larger than the visible screen resolution. When we ask IE to resize itself, a WM_GETMINMAXINFO message is sent on a resize event so the IE can figure how large a window can be. By intercepting that message with a hook, and modifying the max values, we can trick IE into thinking that a window can be sized greater than the screen resolution would otherwise allow.

Since the IE driver makes use of hook procedures, the bulk of the IE driver is actually implemented in a DLL. So as to avoid having to manage multiple files when using the IE driver, this DLL is embedded as a resource inside the IEDriverServer executable, and extracted to the temp directory at runtime. Once extracted, it's loaded into memory by IEDriverServer, and the main entry point of the DLL is called. This gives the driver a way to inject itself into the IE process using hooks to accomplish what it needs to. This worked great up to and including IE 9.

What Happened in IE10?

When IE 10 was first released, we started to see reports of the two aforementioned issues coming in. Since version 7 of Internet Explorer, there has been the notion of multiple processes for a single "instance" of IE. There was the notion of a "manager" or "broker" process, which managed the outer, top-level window of Internet Explorer. All HTML rendering and ActiveX controls are managed by a "content" process. Through version 9 of Internet Explorer, these processes were the same bitness. That is, running 64-bit IE meant you got a 64-bit manager process, and 64-bit content processes. Running 32-bit IE meant you were using a 32-bit manager process, and 32-bit content processes. This all changed with IE 10.

One major change in IE 10 is that the manager process on 64-bit versions of Windows will always be a 64-bit process. By default, though, content processes remained 32-bit. This allowed the main process to be a 64-bit process, but still allowed the browser to remain compatible with all of the existing browser plug-ins for IE, which are overwhelmingly 32-bit. There are other reasons for this change, and I'm oversimplifying the architecture a bit. If you're truly interested in the deep details of the architecture, I encourage you to read Eric Lawrence's blog post that lays out the implications in great detail. Anyway, there are ways to force 64-bit content processes for IE 10 and above, but these will break the IE driver due to Protected Mode issues.

So for IE 10 and above, the situation is that we have a 64-bit process, which handles the main outer window, and a 32-bit process which owns the inner window where HTML content is rendered. The driver's window hook procedure for taking screenshots must be attached to the main outer window; the driver's hook procedure for verification of message processing while simulating keystrokes must be attached to the content window being automated. Remember that since the driver executable can only be 32-bit or 64-bit, but not both, the DLL in which the window hook procedures reside can only be the same bitness as the executable. Let's explore the implications of this.

For the 32-bit IEDriverServer, the hook procedure can be successfully attached to the (32-bit) content window for use with sendKeys. Attempting to install the window hook for screenshots is attempting to install a window hook into the manager process, which is 64-bit, which can't load the 32-bit DLL into its process space, so the hook installation fails, and screenshots are truncated.

Conversely, for the 64-bit IEDriverServer, the hook can be successfully installed in the top-level window for use in taking screenshots, because the process owning that window is a 64-bit process. However, when the driver attempts to install the hook into the (32-bit) content window to detect message processing during sendKeys, the DLL is 64-bit, and can't be loaded by the 32-bit executable which owns the content window. This means that the timeout is invoked for every keystroke, with sendKeys waiting about five seconds for each key.

Why Would This Be So Hard To Fix?

By now, I hope we have a better understanding of what the root cause of both issues is. What would it take to fix the issue? A naive implementation would just attempt to bundle both 32- and 64-bit DLLs in with a version of the server and be done with it. However, this won't work because the DLL where the hook procedure lives must be loaded by both executables, IEDriverServer.exe, and the IE process owning whichever window is being hooked.

The only way to completely and correctly resolve the issue would be to create a pair of executables, with a related pair of DLLs, and have the two executables establish some way of working together via an interprocess communication channel. With this approach, now we'd be asking a user to download and manage two executables instead of one, or to use an installer of some kind. Since the project aims for an "xcopy deploy" without requiring the use of an installer, that's a larger burden than one can expect users of the IE driver to undertake.

While it's tempting to simply suggest creating a second executable and embedding it as a resource for extraction at runtime just like we do with the DLL, that approach is flawed as well. Many antivirus and malware monitors will happily allow any application to place a DLL in the temp directory and let an executable call LoadLibrary on it, but having an executable file magically appear and attempt to run in the temp directory throws up all kinds of flags. Making it a requirement to disable your antimalware software before using the IE driver is not something I'd be comfortable with.

Creating a second executable, and its attendant DLL for using the hook procedure, and figuring out some way for the two executables to communicate with each other amounts to a massive rearchitecture of the IE driver. With Microsoft's announcement of the sunset of support for all legacy versions of IE in January 2016, and with the creation of their own WebDriver implementation, it's not clear that the benefit of making these intrusive changes will outweigh the cost of doing so.

Wednesday, September 10, 2014

Using the Internet Explorer WebDriver Implementation from Microsoft

Microsoft recently delivered an implementation of an Internet Explorer driver. This is great news, and should be a real help to users of the Selenium project. Concurrent with that release, the IEDriverServer.exe has been updated to take advantage of this new implementation.

The integration with IEDriverServer has been implemented as a new command-line switch on IEDriverServer.exe. By launching IEDriverServer.exe using the --implementation=<value> switch, you can force the executable to use a specific driver implementation. Valid values for the switch are:
  • LEGACY - Uses the existing open-source driver implementation
  • VENDOR - Forces the driver to use the Microsoft implementation, regardless of whether prerequisites are met, or whether the installed version of Internet Explorer is the proper version (will throw an exception when creating a new session if the prerequisites are not installed and configured properly)
  • AUTODETECT - Uses the Microsoft implementation for IE 11, if the prerequisites are installed, falling back to the open-source implementation if the components are not present (still under development in the IEDriverServer.exe code)
If no value is specified, or if the value passed in is not one of those listed above, IEDriverServer.exe will use the existing open-source implementation. This is only a temporary default; the intent is for the default to shift to be the Microsoft implementation as the specification on which it is based matures.


In order to use the Microsoft implementation, you'll need a few prerequisites.

Caveats and Provisos

This integration with IEDriverServer.exe should be considered experimental at the time of this writing. First, please realize that the Microsoft implementation only supports IE11. There are no announced plans for Microsoft to support other versions of Internet Explorer with this WebDriver implementation.

Also, the Microsoft implementation strictly follows the W3C WebDriver Specification. Since the spec is currently an editors' draft, it does not completely describe the WebDriver API. In other words, there are some features that are implemented in the open-source implementation that are not documented yet in the spec, which in turn means that they are not implemented in the Microsoft implementation.

Additionally, there are some small differences in the objects sent back and forth across the JSON Wire Protocol between the spec and the open-source implementation. The implications are that there may need to be changes made to the individual language bindings to pass the proper JSON payload across to the Microsoft implementation. There is considerable pressure not to update the language bindings, since, again, the spec is currently an editors' draft, and there have been bugs filed against it to have it's protocol more closely match the existing open-source language bindings' implementations.

In the interest of allowing users to be able to experiment with the Microsoft implementation, the .NET bindings have had all of the necessary protocol changes grafted in, with explicit comments in the source code to have the changes removed when the spec is finalized and all implementations are consistent. The Java bindings have a partial implementation, and can launch IEDriverServer.exe with the proper command-line parameters to enable use of the Microsoft implementation, but the protocol changes have not yet been implemented. Unfortunately, there is no timetable for this work for other language bindings.


So here's an example of what the code looks like to enable and use the Microsoft implementation of the IE driver, using C# code. It's pretty straightforward.

public static void DriveIEUsingMicrosoftImplementation()
    InternetExplorerDriverService service =
    service.Implementation = InternetExplorerDriverEngine.Vendor;
    IWebDriver driver = new InternetExplorerDriver(service);

Conversely, one could launch IEDriverServer.exe with the appropriate command-line switch and use RemoteWebDriver to talk to that running instance.