In order to understand how to use browser technologies to automate pages that use some form of authentication, it is useful to know what happens when you browse to such a page. What's actually happening when your browser prompts you for some form of credentials, usually a user name and password, before it will let you access a given resource on the web?
At the risk of dropping down to a ridiculously low level, let's talk about how browsers transfer data for browsing websites. First, an obligatory disclaimer. I'm going to deliberately gloss over using pages served via secure HTTP ("https"), and I'm going to ignore mostly-binary protocols like HTTP/2 for this series. Those items, while important, and may impact the outcomes you see here, are beyond the scope of this series.
Most of the time, a browser is using the Hypertext Transfer Protocol (or HTTP) to communicate with a given web server. When you type in a URL in your browser's address bar, your browser sends off an HTTP request (that's what the "http://" means at the beginning of the URL), and receives a response from the server. For the following examples, we'll be using Dave Haeffner's excellent Selenium-focused testing site, The Internet, which is designed to provide examples of challenging things a user might encounter when automating web pages with Selenium, and a hosted version of which is available at http://the-internet.herokuapp.com. Here's what a typical exchange might look like:
Browser sends:
GET http://the-internet.herokuapp.com/checkboxes HTTP/1.1
Host:the-internet.herokuapp.com
User-Agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language:en-US,en;q=0.5
Accept-Encoding:gzip, deflate
Connection:keep-alive
Upgrade-Insecure-Requests:1
Browser receives back:
HTTP/1.1 200 OK
Connection: keep-alive
Content-Type: text/html;charset=utf-8
Content-Length: 2008
Server: WEBrick/1.3.1 (Ruby/2.2.5/2016-04-26)
Date: Thu, 23 May 2019 23:44:54 GMT
Via: 1.1 vegur
<body of HTML page here>
This is what happens for virtually every time a browser makes a request for a resource. The important thing to note is in that first line of the response. The "200 OK" bit means that the server had the resource and was sending it in response to the request. Now let's look at a request for a resource that is protected by authentication:
Browser sends:
GET http://the-internet.herokuapp.com/basic_auth HTTP/1.1
Host: the-internet.herokuapp.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:67.0) Gecko/20100101 Firefox/67.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Browser receives back:
HTTP/1.1 401 Unauthorized
Connection: keep-alive
Content-Type: text/html;charset=utf-8
Www-Authenticate: Basic realm="Restricted Area"
Content-Length: 15
Server: WEBrick/1.3.1 (Ruby/2.2.5/2016-04-26)
Date: Thu, 23 May 2019 23:52:24 GMT
Via: 1.1 vegur
Note the all-important first line of the response, which says "401 Unauthorized". That tells us that we have a page that requires authentication. Note that if you asked your browser to browse to the page http://the-internet.herokuapp.com/basic_auth, you would have been prompted for a user name and password. Note in the response the line that says Www-Authenticate: Basic realm="Restricted Area". That tells the browser that the "Basic" authentication scheme is expected, and that the user's user name and password are required, and so the browser prompts you, and then it re-sends the request to the server, but with an additional header. If you used the proper credentials for the aforementioned URL (user name: admin, password: admin), you'd see something like the following:
Browser sends:
GET http://the-internet.herokuapp.com/basic_auth HTTP/1.1
Host: the-internet.herokuapp.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:67.0) Gecko/20100101 Firefox/67.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Authorization: Basic YWRtaW46YWRtaW4=
Browser receives back:
HTTP/1.1 200 OK
Connection: keep-alive
Content-Type: text/html;charset=utf-8
Content-Length: 1643
Server: WEBrick/1.3.1 (Ruby/2.2.5/2016-04-26)
Date: Thu, 23 May 2019 23:59:31 GMT
Via: 1.1 vegur
<body of HTML page here>
Clearly, that additional header that says Authorization: Basic YWRtaW46YWRtaW4= tells us that the browser must've done something with those credentials we gave it. If only we had a way to intercept the unauthorized response, calculate what needs to go into that authorization header, and resend the request before the browser had the chance to prompt us for credentials, we'd be golden. As luck (and technology) would have it, we do have exactly that ability, by using a web proxy. Every browser supports proxies, and Selenium makes it incredibly easy to use them with browsers being automated by it. The next post in this series will outline how to get that set up and working with Selenium.
At the risk of dropping down to a ridiculously low level, let's talk about how browsers transfer data for browsing websites. First, an obligatory disclaimer. I'm going to deliberately gloss over using pages served via secure HTTP ("https"), and I'm going to ignore mostly-binary protocols like HTTP/2 for this series. Those items, while important, and may impact the outcomes you see here, are beyond the scope of this series.
Most of the time, a browser is using the Hypertext Transfer Protocol (or HTTP) to communicate with a given web server. When you type in a URL in your browser's address bar, your browser sends off an HTTP request (that's what the "http://" means at the beginning of the URL), and receives a response from the server. For the following examples, we'll be using Dave Haeffner's excellent Selenium-focused testing site, The Internet, which is designed to provide examples of challenging things a user might encounter when automating web pages with Selenium, and a hosted version of which is available at http://the-internet.herokuapp.com. Here's what a typical exchange might look like:
Browser sends:
GET http://the-internet.herokuapp.com/checkboxes HTTP/1.1
Host:the-internet.herokuapp.com
User-Agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language:en-US,en;q=0.5
Accept-Encoding:gzip, deflate
Connection:keep-alive
Upgrade-Insecure-Requests:1
Browser receives back:
HTTP/1.1 200 OK
Connection: keep-alive
Content-Type: text/html;charset=utf-8
Content-Length: 2008
Server: WEBrick/1.3.1 (Ruby/2.2.5/2016-04-26)
Date: Thu, 23 May 2019 23:44:54 GMT
Via: 1.1 vegur
<body of HTML page here>
This is what happens for virtually every time a browser makes a request for a resource. The important thing to note is in that first line of the response. The "200 OK" bit means that the server had the resource and was sending it in response to the request. Now let's look at a request for a resource that is protected by authentication:
Browser sends:
GET http://the-internet.herokuapp.com/basic_auth HTTP/1.1
Host: the-internet.herokuapp.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:67.0) Gecko/20100101 Firefox/67.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Browser receives back:
HTTP/1.1 401 Unauthorized
Connection: keep-alive
Content-Type: text/html;charset=utf-8
Www-Authenticate: Basic realm="Restricted Area"
Content-Length: 15
Server: WEBrick/1.3.1 (Ruby/2.2.5/2016-04-26)
Date: Thu, 23 May 2019 23:52:24 GMT
Via: 1.1 vegur
Note the all-important first line of the response, which says "401 Unauthorized". That tells us that we have a page that requires authentication. Note that if you asked your browser to browse to the page http://the-internet.herokuapp.com/basic_auth, you would have been prompted for a user name and password. Note in the response the line that says Www-Authenticate: Basic realm="Restricted Area". That tells the browser that the "Basic" authentication scheme is expected, and that the user's user name and password are required, and so the browser prompts you, and then it re-sends the request to the server, but with an additional header. If you used the proper credentials for the aforementioned URL (user name: admin, password: admin), you'd see something like the following:
Browser sends:
GET http://the-internet.herokuapp.com/basic_auth HTTP/1.1
Host: the-internet.herokuapp.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:67.0) Gecko/20100101 Firefox/67.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Authorization: Basic YWRtaW46YWRtaW4=
Browser receives back:
HTTP/1.1 200 OK
Connection: keep-alive
Content-Type: text/html;charset=utf-8
Content-Length: 1643
Server: WEBrick/1.3.1 (Ruby/2.2.5/2016-04-26)
Date: Thu, 23 May 2019 23:59:31 GMT
Via: 1.1 vegur
<body of HTML page here>
Clearly, that additional header that says Authorization: Basic YWRtaW46YWRtaW4= tells us that the browser must've done something with those credentials we gave it. If only we had a way to intercept the unauthorized response, calculate what needs to go into that authorization header, and resend the request before the browser had the chance to prompt us for credentials, we'd be golden. As luck (and technology) would have it, we do have exactly that ability, by using a web proxy. Every browser supports proxies, and Selenium makes it incredibly easy to use them with browsers being automated by it. The next post in this series will outline how to get that set up and working with Selenium.
great blog man love it
ReplyDelete