Selenium Testing at ZocDoc
Selenium tests are a core part of our test arsenal here at ZocDoc. Ideally we would be able to test exhaustively using only unit tests and other lightweight tests (we have many of those as well), but by their very nature those tests simply cannot find failures caused by interactions between different subsystems that an end user would experience while using our site. So in an imperfect world, we must deal with the many flaws, higher maintenance cost, and inherent flakiness of Selenium tests, because at the end of the day they do provide significant value.
This blog post will show the details of our Selenium testing setup here at ZocDoc and how we achieve (for the most part) fast and detailed feedback to developers as well as a clear pass/fail signal while handling test flakiness and intermittent failures.
We host the hardware for our Selenium infrastructure ourselves in a data center. Our web servers are a mix of VMs and physical hardware, depending on the environment. The machines actually running the Selenium tests are simple Windows 7 VMs (currently around 250 of them).
While we do not necessarily want to maintain this hardware ourselves, it was the prudent decision: moving this infrastructure into the cloud – while certainly buzz-worthy – would cost us much more, mainly because we have a large number of tests, we test very often, and we want to provide fast feedback.
We evaluated both AWS/EC2 as well as integrated test providers such as SauceLabs, but for our purposes self-hosting provides much better value.
In addition to the cost advantage, self-hosting also allowed us to build out our own software infrastructure on top of it that enables optimizations specific to our site and the way we test.
All of our VMs that run Selenium tests (bots) are based on the same image, which is just a plain OS with a few optimizations to address bottlenecks.
The two bottlenecks on bots we were able to address to some degree are CPU and file I/O, which together limit the number of bot VMs we can host on the same physical hardware.
We achieved another interesting performance improvement by reducing the I/O produced by actually running Selenium tests. Each Selenium test by default creates a new browser profile and caches pages and resources it downloads on disk. This means a lot of I/O. At the same time, we do want the test isolation a new browser profile brings.
The solution was configuring a small RAM disk on each bot and using this for the browser profile, cached files, and our own video recording. This brought down overall I/O on the host machines by almost a factor of 20, to levels where we can comfortably host a large number of bots on spindle disks. Even with SSDs, which we are still using in some environments, this is important because of write endurance, which limits their lifetime.
Figure 1: Disk I/O on host machine before introducing RAM disks
Figure 2: Disk I/O on host machine after introducing RAM disks
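As an illustration, here is a minimal sketch of how a test's browser can be pointed at a RAM disk using the .NET Selenium bindings. The R:\ drive letter and directory layout are assumptions, and the preference shown is a standard Firefox cache setting rather than our exact configuration:

```csharp
using OpenQA.Selenium.Firefox;

public static class BotDriverFactory
{
    // Creates a fresh, isolated browser profile whose disk cache lives on
    // the bot's RAM disk (assumed here to be mounted as R:\).
    public static FirefoxDriver Create()
    {
        var profile = new FirefoxProfile(); // new profile per test = isolation
        profile.SetPreference("browser.cache.disk.parent_directory",
                              @"R:\browser-cache");

        var options = new FirefoxOptions { Profile = profile };
        return new FirefoxDriver(options);
    }
}
```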
Architecture Overview - Running Selenium Tests
Selenium tests are slow since there is a lot of overhead involved – we need to send out the test assemblies to our bots, open a browser, simulate end user actions that result in HTTP requests, and record video.
On the plus side, they are easily parallelizable. This means that the overall test duration is determined by the degree of parallelism – how many tests are running concurrently against your web server(s) – and how long each individual test takes.
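As a back-of-the-envelope model (the 5,000 tests and 60-second average below are purely illustrative numbers; only the bot count comes from this post):

$$T_{\text{run}} \approx \frac{N \cdot \bar{t}}{P}, \qquad \text{e.g.}\quad \frac{5000 \times 60\,\mathrm{s}}{250} = 1200\,\mathrm{s} = 20\ \text{minutes},$$

where $N$ is the number of tests, $\bar{t}$ the average test duration, and $P$ the number of bots running concurrently.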
We have deployed a custom Selenium grid at ZocDoc that satisfies a few very specific requirements while addressing efficiency and speed across our different testing environments:
- Ability to pass all test assemblies and dependencies (.NET dlls) to bots over the wire in the form of a zip file
- Support for different testing environments with different priorities and configurable pool size
- Ability to set Selenium test order (random, by average run time, failed tests first)
- Dynamic control of parameters that may be different per test run or environment, such as degree of parallelism.
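To make these requirements concrete, a test run request in such a grid might carry roughly the following shape. All names here are hypothetical, not our actual API:

```csharp
public enum TestOrder { Random, ByAverageRunTime, FailedTestsFirst }

// Hypothetical request a requestor submits to the grid; field names
// are illustrative only.
public class TestRunRequest
{
    public string Environment { get; set; }        // which environment to test against
    public int Priority { get; set; }              // higher wins when bots are contended
    public int TargetPoolSize { get; set; }        // bots this run would like to use
    public TestOrder Order { get; set; }           // how tests are scheduled
    public byte[] TestAssembliesZip { get; set; }  // test dlls + dependencies, sent over the wire
}
```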
In our Selenium grid we distinguish between three different roles:
- Central Command (CC): Maintain a pool of bots and make them available to requestors based on priority and availability
- Server: Define the set of tests that should be run and send test assemblies to bots that were assigned to it by CC, report test results as a whole to the CI server
- Client: Run Selenium tests and report individual test results
Figure 3: Interactions between Server, Central Command and Client
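In code, the division of labor might look roughly like the interfaces below – a hypothetical sketch of the interactions in Figure 3, not our actual implementation (it reuses the TestRunRequest shape sketched above; Bot, TestRunSummary, and TestResult are placeholders):

```csharp
using System.Collections.Generic;

public interface ICentralCommand
{
    // A Server asks CC for bots; CC assigns them by priority and availability.
    IReadOnlyList<Bot> RequestBots(TestRunRequest request);
    void ReleaseBot(Bot bot);
}

public interface IServer
{
    void SendTestAssemblies(Bot bot, byte[] zip);      // push the zipped dlls to a bot
    void ReportRunToCiServer(TestRunSummary summary);  // aggregate result to CI
}

public interface IClient
{
    TestResult RunTest(string fullyQualifiedTestName); // run one Selenium test
}

public class Bot { public string Name; }
public class TestRunSummary { public int Passed, Failed; }
public class TestResult { public string TestName; public bool Passed; }
```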
Our setup also enables us to assign different priorities for different test runs. This, combined with a target pool size (the number of bots a test run ideally wants to use concurrently), enables us to support multiple concurrent test runs without sacrificing performance. Besides our “main” CI Selenium test runs, we also run Selenium tests on-demand for feature branches that developers are testing before they are ready to merge into the master branch.
These on-demand Selenium test runs are lower priority. When there is a lot of contention for bots (e.g. when there are multiple on-demand runs at the same time) these runs may use fewer bots, but if there are enough bots for everyone, they will run at “full speed.”
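A minimal sketch of what priority-aware assignment could look like – purely illustrative, our actual scheduler is more involved:

```csharp
using System.Collections.Generic;
using System.Linq;

public static class BotScheduler
{
    // Hand each idle bot to the highest-priority run that still has pending
    // tests and has not yet reached its target pool size.
    public static void AssignIdleBots(List<ActiveRun> runs, Queue<Bot> idleBots)
    {
        while (idleBots.Count > 0)
        {
            var next = runs
                .Where(r => r.PendingTests > 0 && r.AssignedBots < r.TargetPoolSize)
                .OrderByDescending(r => r.Priority)
                .ThenBy(r => (double)r.AssignedBots / r.TargetPoolSize) // fairness among equals
                .FirstOrDefault();

            if (next == null) break;          // every run is satisfied or done
            next.Assign(idleBots.Dequeue());  // bot starts pulling tests from this run
        }
    }
}

public class Bot { }
public class ActiveRun
{
    public int Priority, TargetPoolSize, AssignedBots, PendingTests;
    public void Assign(Bot bot) { AssignedBots++; }
}
```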
Metrics and Monitoring
ZocDoc is a very metrics-driven company, and our Continuous Integration environment is no exception – it’s very hard to improve if you do not know where you are. For Selenium tests we record almost anything we can get our hands on:
- Full test log
- Test outcome success/failure
- Start/end time, duration
- The bot the test ran on
- The host machine
- The TeamCity build ID this test run is associated with
- The server that requested the test run
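Conceptually, every test execution lands as a row like the following (a hypothetical record shape; field names are assumed):

```csharp
using System;

// Hypothetical shape of a per-test record written to the results DB.
public class SeleniumTestRecord
{
    public string TestName;
    public string FullTestLog;
    public bool Passed;
    public DateTime StartTimeUtc, EndTimeUtc; // duration = End - Start
    public string BotName;                    // the VM the test ran on
    public string HostMachine;                // physical host of that VM
    public string TeamCityBuildId;            // CI build this run belongs to
    public string RequestingServer;           // server that requested the run
}
```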
Additionally, we track aggregate data for test runs so we can follow trends over time, e.g. full test run duration.
In addition to tracking outcomes, we also keep track of what our bots are doing at any time so we can optimize their usage. This information is provided by our Central Command service and written to a DB.
Many past improvements (for example, test run priorities) were spawned by visually examining common usage patterns on our internal CI metrics web site:
Figure 4: Tracking bot usage
Providing Developer Feedback
One of the core goals for the Automation team at ZocDoc is to make our developers more productive (the other, obviously, is to stop bad code from making it to Production). This means providing as much detail as possible about the circumstances of Selenium test failures to make them easier to reproduce.
Every failed Selenium test at ZocDoc produces the following artifacts:
- A failure log. All steps the Selenium test ran through up until it failed, including exception details where applicable. The developer writing the Selenium test is responsible for adding logging statements with appropriate granularity.
- A screenshot of the web page taken when the test failed
- An HTML page. The raw HTML file of the page the test failed on
- A video of the failing test from start until 5 seconds after the failure occurred
- A HAR file with the complete HTTP network traffic
All of these artifacts are linked to in the failure log that we make available through our CI server.
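As an illustration, the screenshot and HTML artifacts can be captured with the standard .NET Selenium bindings roughly like this (the paths and naming are assumptions; the video and HAR recorders are separate pieces):

```csharp
using System.IO;
using OpenQA.Selenium;

public static class FailureArtifacts
{
    // Capture a screenshot and the raw HTML at the moment of failure.
    public static void Capture(IWebDriver driver, string artifactDir, string testName)
    {
        Directory.CreateDirectory(artifactDir);

        // Screenshot of the page as the user would have seen it.
        var screenshot = ((ITakesScreenshot)driver).GetScreenshot();
        File.WriteAllBytes(Path.Combine(artifactDir, testName + ".png"),
                           screenshot.AsByteArray);

        // Raw HTML of the page the test failed on.
        File.WriteAllText(Path.Combine(artifactDir, testName + ".html"),
                          driver.PageSource);
    }
}
```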
Figure 5: Selenium test failure log with linked test artifacts
Most of these artifacts are self-explanatory, but I want to add some more detail on the HAR file, which has proven very useful. The HAR format, short for HTTP Archive, is a standardized file format that stores web browser interactions as JSON. It is used in one form or another in Chrome, Firebug, and Fiddler.
Since we are a .NET shop, we integrated Fiddler Core into our Selenium testing setup on our bots. Fiddler Core allows you to programmatically set the bot’s web proxy and intercept any outgoing or incoming traffic by simply subscribing to a callback (FiddlerApplication.BeforeRequest). We record all of this traffic and write out a HAR file in the case of a test failure.
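A condensed sketch of that integration – FiddlerApplication.Startup, BeforeRequest, and AfterSessionComplete are real FiddlerCore APIs, while WriteHar stands in for our HAR serialization and is hypothetical here:

```csharp
using System.Collections.Generic;
using Fiddler;

public static class TrafficRecorder
{
    static readonly List<Session> Sessions = new List<Session>();

    public static void Start()
    {
        FiddlerApplication.BeforeRequest += session =>
        {
            // The outgoing request is visible here and can be inspected or tagged.
        };
        FiddlerApplication.AfterSessionComplete += session =>
        {
            lock (Sessions) Sessions.Add(session); // request + response now complete
        };

        // Listen on an arbitrary local port; the Default flags register
        // FiddlerCore as the system proxy, so the browser's traffic flows through it.
        FiddlerApplication.Startup(8877, FiddlerCoreStartupFlags.Default);
    }

    public static void SaveOnFailure(string harPath)
    {
        lock (Sessions) WriteHar(Sessions, harPath); // hypothetical HAR serializer
    }

    static void WriteHar(List<Session> sessions, string path) { /* ... */ }
}
```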
Since Fiddler itself natively supports HAR files, developers can then open the generated file in Fiddler (File | Import Sessions… | HTTP Archive) and review all browser interactions, HTTP return codes, and all other data.
Figure 6: Reviewing test failure HTTP network traffic with Fiddler
Because Selenium tests are full end-to-end tests they can fail for many reasons:
- The web site feature is broken: The test did its job – yay!
- The test itself is broken or brittle: I consider a test brittle if it depends on “optimal conditions” and is not resilient to outliers – for example, because it uses sleep statements or waits for Ajax requests to finish instead of waiting for expected elements to appear on the page (see the sketch after this list).
- Concurrency problems: We run many, many tests in parallel, so occasionally we do get SQL deadlocks. This typically is not the test’s problem but a problem with the feature, and a sign that this bottleneck might eventually be a problem in Production when we scale up. If we see the same type of deadlock recurring, we will look into changing the underlying queries, indexes, etc.
- An external dependency is currently unavailable
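For illustration, here is the difference between a brittle wait and a robust one with the standard .NET Selenium bindings (the element ID and timeout are arbitrary, and `driver` is assumed to be the test's IWebDriver instance):

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI;

// Brittle: hopes the Ajax request always finishes within five seconds.
// System.Threading.Thread.Sleep(5000);

// Robust: poll until the element the Ajax call renders actually appears.
var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(30));
IWebElement results = wait.Until(d => d.FindElement(By.Id("search-results")));
```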
While we do want to let developers know if their tests are flaky, we still want to maintain a clear pass/fail signal and not be “red” just because of a flaky test or other circumstances beyond our control.
To facilitate stable tests and ongoing development we allow tagging Selenium tests as “in development”. Tests under development are run separately and will not break the build. Only when a test has been passing reliably for some time in this mode is it then promoted to a test that breaks the build. Tests can be demoted as well – if a test has become flaky, it hurts us more than it helps, so at that point we demote it back to “in development” until the team that owns the test can fix it.
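One way to implement such tagging, assuming NUnit (this post does not name the test framework we use for this, so treat the attribute as illustrative):

```csharp
using NUnit.Framework;

[TestFixture]
public class BookingFlowTests
{
    // CI runs "in development" tests in a separate, non-breaking pass;
    // removing the category promotes the test to build-breaking status.
    [Test, Category("InDevelopment")]
    public void NewInsuranceFilterShowsOnlyMatchingDoctors()
    {
        // ... Selenium steps ...
    }
}
```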
We also allow a small number of tests to be re-run in each test run (currently < 0.5% of our total tests), but any single test may fail only once. If these tests pass on a rerun, they will not break the build. Regardless of whether a test breaks the build or not, we track and report these failures and send an automated email to the owner of the test, indicating that it has been failing.
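In pseudocode, the rerun policy amounts to something like this (purely illustrative):

```csharp
using System;

public static class RerunPolicy
{
    // A test may fail at most once per run: one automatic retry, and a
    // second failure breaks the build. Failures are reported either way.
    public static bool PassesWithOneRetry(Func<bool> runTest, Action reportFailure)
    {
        if (runTest()) return true;

        reportFailure();   // track the flaky failure + email the test owner
        return runTest();  // retry once; false here breaks the build
    }
}
```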
Because we store all test results in a database we have good statistics on our flaky tests and have a page on our metrics web site that lists these tests sorted by the number of times they failed in a given time frame. These tests can then be prioritized for fixing by the owning team and/or be demoted back to in-development.
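Given per-test records like those described in the Metrics section, the flaky-test report is essentially a grouping query, sketched here with LINQ (`records` is assumed to be an enumerable of the hypothetical record type from earlier; the time window is illustrative):

```csharp
// Failures per test over the last week, flakiest first.
var flakiest = records
    .Where(r => !r.Passed && r.StartTimeUtc > DateTime.UtcNow.AddDays(-7))
    .GroupBy(r => r.TestName)
    .Select(g => new { Test = g.Key, Failures = g.Count() })
    .OrderByDescending(x => x.Failures)
    .ToList();
```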
In the end, testing is done for a purpose: to make sure our site works as expected and that we don’t introduce bugs into Production. Intelligently leveraging our testing environment means we can not only run tests very fast, but also reduce the false negatives introduced by brittle or unstable tests, and help developers reproduce failures using the captured test artifacts. Selenium tests are not 100% reliable, but they are a very valuable tool in the arsenal of tests we use to achieve this goal.