Engineering Blog

We’re the ZocDoc engineering team, and we love sweet code, dogs, jetpacks, TF2, robots, and lasers... all at once. Read about our adventures here, or apply for a career at ZocDoc!

Experiment Design

1. Follow The Checklist Manifesto

A great experiment seems so simple! It’s mostly common sense – we all probably learned some version of this in grade school science class. But often, we end up in situations where we don’t have the data we want, we haven’t collected enough samples for analysis, the results are vague, and we’re scrambling. That’s why it’s worth it to take the time before we start an experiment to have a plan.

The sections that follow form a checklist for experiment design. A checklist is a simple way to cover your bases and a proven way to reduce mistakes – that’s the core idea of The Checklist Manifesto. If you can answer all the questions in the checklist, you’ll be in good shape (even though things can and likely will change).

2. State the Research Question

Basically, be sure to gather data with a question in mind or an idea of what you want to accomplish. Be as specific as possible. Without that, you may find yourself with an important question to answer but without the right data, or with data of poor quality.

Example 1: Take something that seems as simple as measuring customer retention by join date. First, what defines a join date? What if their contract changes or you upsell them? And what about different segments of the population (contract types, age/gender of the customer, etc.)? It seems straightforward, but when you don’t ask a specific question, it causes confusion and wastes time – especially when several parties are involved. You’ll end up in a rabbit hole of problem definition.

Another common issue is the analytical rabbit hole – when you’re trying to figure out a stopping point. There’s always another way to cut the data, transform it, or model it. If you have a clear idea of what you want to accomplish, this should lend itself to a timeframe and a measure of success or failure.

Stating your research question up front will also force you to analyze any relevant existing data you have – it will force you to size the problem and the prize.

3. Identify the Population

What population are we trying to describe or analyze? This could be all users of your service, or it could just be a subset.

Example 1: We may want to target weekday and weekend users differently. Other commonly used subpopulations are gender, age, market, and referrer (browser type, device, etc.).

Example 2: Another consideration is misleading data, especially when aggregating over different subpopulations. The classic trap is known as Simpson’s Paradox. In the table below – a real example comparing two treatments for small and large kidney stones – treatment A performs better for both subpopulations, but when we aggregate over them, treatment B performs better. This is because there was bias in the traffic to each group – in other words, the traffic was not split evenly across groups: larger kidney stones were typically given the stronger treatment A, causing the bias.
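
For concreteness, here are the numbers from that kidney stone study (Charig et al., 1986), shown as success rates:

                     Treatment A      Treatment B
    Small stones     93% (81/87)      87% (234/270)
    Large stones     73% (192/263)    69% (55/80)
    Both             78% (273/350)    83% (289/350)

Treatment A wins within each subgroup, yet treatment B wins in aggregate, purely because of how unevenly patients were distributed across the groups.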

The lesson here is really to make sure that subpopulations (if they exist) are separated or are evenly distributed across treatment groups.

4. Define the study protocol

Study protocols are statistics 101, so it’s worthwhile to quickly review.

First, we have prospective vs. retrospective studies. In prospective studies, we’re collecting subjects and then looking at subsequent events. In retrospective studies, we’re studying past events.

Second, we have experimental vs. observational studies. Experimental studies are ones in which the researcher controls the assignment of subjects to treatment groups – these are typically prospective studies. When you hear the terms single blind, double blind, etc., they’re talking about randomized assignments and about experimental studies. One nice advantage of randomizing assignment is that we should be less susceptible to bias (recall Simpson’s paradox), in the sense that our subpopulations should be evenly split between treatment groups (e.g. old vs. young, men vs. women, etc.). Observational studies are ones in which (for whatever reason) assignment isn’t possible. For example, studies on race or sex (not groups you can select into) where people are prospectively monitored for disease, etc.

Lastly, we can split into longitudinal studies, where data is collected over an extended time period, and cross-sectional studies, where data is collected at a single point in time.

Given this, we can make a decision on study protocol based on money, time, etc.

Example 1: Imagine we want to determine whether customer reviews are causal to or simply correlated with sales of a particular product. We could conduct a prospective, experimental study to control review numbers and examine subsequent impact on sales. The pain points are the time needed to run the experiment, the cost of sales lost and, depending on the sample size, the number of users affected. An alternative option would be a retrospective study to examine the correlation between number of reviews and views per sale.

Example 2: A/B tests are prospective, experimental and cross-sectional. In A/B tests, we use the system infrastructure to help choose between alternate features for the website. The idea is to randomly split traffic between two different variants – the A configuration and the B configuration – and examine which does better in terms of revenue, conversion rate, or whatever else we choose. Using this method, we can optimize our site and the business.
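
As a minimal sketch of what “randomly split traffic” can look like in practice (illustrative only, not our production framework), you can hash a stable user identifier so each user consistently sees the same variant:

    // Deterministic 50/50 assignment: the same user always lands in the
    // same bucket, so their experience is stable across visits.
    public static string AssignVariant(Guid userId)
    {
        int bucket = (userId.GetHashCode() & 0x7fffffff) % 100;
        return bucket < 50 ? "A" : "B";
    }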

5. Identify the Data to Collect

Next, identify the data to collect. Think about the data you want. Is it categorical, numerical, censored, ranked, etc.? This affects the downstream analysis method and probably the sample size. Understand your data sources and how they work. What are the sources? How is the data structured? How was it collected? What’s the quality of data you are measuring? In other words, know the nuances of the data.

Example 1: For data science studies, data is often collected from disparate sources, aggregated, and then analyzed. An obvious but common problem is making sure this process actually runs smoothly. Having databases that go stale or go out of sync is more common than anyone likes to admit.

A good tip is to figure out ways to profile or check your data quality. Generally, if something is off and you do a good job with this, you’ll see it.
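
As a sketch of what such a check might look like (the table, column, and threshold here are hypothetical), a scheduled probe can catch a stale feed before it poisons an analysis:

    using System;
    using System.Data.SqlClient;

    public static void CheckFeedFreshness(string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("SELECT MAX(UpdatedUtc) FROM dbo.Appointments", conn))
        {
            conn.Open();
            var lastUpdate = (DateTime)cmd.ExecuteScalar();
            if (DateTime.UtcNow - lastUpdate > TimeSpan.FromHours(6))
            {
                // The source has gone stale or out of sync – tell a human.
                Console.Error.WriteLine("Appointments feed stale since " + lastUpdate);
            }
        }
    }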

Example 2: Know what the data means. Very typically, there are several different ways of measuring important variables; depending on the analysis, we may care more about different segments of the population.

The better you know your data, the more reliable your analysis will be. People who know the data won’t trust your analysis unless they feel confident that you know it as well.

6. Set the metric and method for measuring success

In most of our cases, this metric will be something like purchase rate or conversion rate. To whatever degree possible, make sure that all other variables in the experiment are controlled (or fixed) besides the chosen measurement metric(s); the other steps in the checklist will help to ensure this. Have an analysis method in mind to be sure you’re collecting the right data and the right amount of data (i.e. the required sample size). In order to understand sample size determination, we first need to understand how populations are compared – optimally, the answer is statistical hypothesis testing. ‘Statistical hypothesis testing’ refers to the process of using observed, collected data to choose between competing hypotheses.

Example 1: For A/B testing, we try to falsify the hypothesis that there is no difference in conversion rate between the A and B variants. For this, we can use a two-proportion test to calculate statistical significance. With this information, we can work backwards to the required sample size. You can read more about sample size calculation for A/B tests here or here.
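
To make that reverse calculation concrete, here is a minimal sketch of the standard two-proportion sample size formula, with the z-values hard-coded for the common choice of a 5% two-sided significance level and 80% power:

    using System;

    public static int SamplesPerVariant(double baselineRate, double minDetectableLift)
    {
        const double zAlpha = 1.96;   // two-sided alpha = 0.05
        const double zBeta = 0.8416;  // power = 0.80

        double p1 = baselineRate;
        double p2 = baselineRate + minDetectableLift;
        double pBar = (p1 + p2) / 2.0;

        double numerator = zAlpha * Math.Sqrt(2.0 * pBar * (1.0 - pBar))
                         + zBeta * Math.Sqrt(p1 * (1.0 - p1) + p2 * (1.0 - p2));

        return (int)Math.Ceiling(numerator * numerator / (minDetectableLift * minDetectableLift));
    }

For example, SamplesPerVariant(0.10, 0.01) comes out to roughly 14,750 users per variant to detect a lift from a 10% to an 11% conversion rate.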

Example 2: Often in user testing, we ask users to rate alternate designs of the site or a feature for different characteristics. The appropriate analysis method here is conjoint analysis. Conjoint analysis is a statistical technique used in market research to determine how people value different features that make up an individual product or service. The objective of conjoint analysis is to determine what combination of a limited number of attributes is most influential on respondent choice or decision making. A controlled set of potential products or services is shown to respondents and by analyzing their preferences between these products, the implicit valuation of the individual elements making up the product or service can be determined. These implicit valuations (utilities or part-worths) can be used to create market models that estimate market share, revenue, and even profitability of new designs.

As you can see, there are many different methods for analysis. The right one is highly dependent on the data you are collecting and the study design. If you are unsure of the right method, consult a statistician or data scientist.

And while sample size calculation is very important to ensure that we run a study for long enough, we can likely still make inference even if we aren’t able to hit a target. With A/B testing, we’re in a world of plenty in terms of samples. But in most medical studies, for example, this is not the case. This is because of cost and resources, and people understand that. You’ll see studies published in good journals with as few as 10-15 samples in each treatment group. In the absence of the ideal, there are rules of thumb for minimum sample sizes; we can make inference on the samples we have, and use them as a jumping off point. To summarize, we do not want to “run to significance” – what’s the point of that? And we don’t want to run forever either.

7. Review and Peer Review

After this is done, you’ll want to present your results – and a big part of presentation is making people feel comfortable that you’ve covered your bases. There’s no secret here: be your own harshest critic, clarify your assumptions, and check your work. Things will change as you go along, but the more you have planned upfront, the better off you will be. And don’t forget to have your study peer reviewed. Others may point out things you haven’t thought of, or data you didn’t know was available.

As you execute the study, documentation and publication are critical. You will likely have to make assumptions you didn’t anticipate – keep a log to track them! You and others might misinterpret and misuse results if you haven’t documented your assumptions. Also, keep track of unexpected findings. They may seem like a problem, but they’re an important source of future questions to answer – and they may be relevant to other parties.

8. Avoid The Pitfalls of Experiment Design

Lastly, for any study, you’ll get questions about reliability. These are the pitfalls of experiment design.

  • Confounders: These are cases where you haven’t accounted for a correlating factor that is the true relation. For example, if we were to study people with lung cancer, we may see a correlation with drinking because many smokers drink.
  • Correlation vs. causation: Causation can be hard to show with retrospective studies, as in the case of the relationship between consumer reviews and sales previously discussed.
  • Dependence between samples: If you assume independence and it isn’t true, this affects the power of your study. An example here is treating repeated measurements over time as independent instead of related.
  • Bias
    • Selection bias: Individuals or groups selected as a sample may not be representative of the population studied.
    • Detection bias: This occurs when a phenomenon is more likely to be observed for a particular set of study subjects. For instance, doctors may be more likely to look for diabetes in obese patients than in thinner patients, leading to an inflation in diabetes among obese patients because of skewed detection efforts.
    • Reporting bias: This involves a skew in the availability of data, such that observations of a certain kind are more likely to be reported.
    • Exclusion bias: This arises due to the systematic exclusion of certain individuals from the study.
    • And many more!

9. In Action: The Checklist for an A/B Test

Pop Quiz! Below is the checklist in action for an A/B test. Did you guess the right answers?

Selenium Testing at ZocDoc

Selenium tests are a core part of our test arsenal here at ZocDoc. Ideally we would be able to test exhaustively using only unit and other lightweight tests (we have many of those as well), but those tests, by their very nature, cannot find failures caused by interactions between the different subsystems an end user exercises while using our site. So, in an imperfect world, we must deal with the many flaws, higher maintenance cost, and inherent flakiness of Selenium tests, because at the end of the day they provide significant value.

This blog post will show the details of our Selenium testing setup here at ZocDoc and how we achieve (for the most part) fast and detailed feedback to developers as well as a clear pass/fail signal while handling test flakiness and intermittent failures.

Hosting

We host the hardware for our Selenium infrastructure ourselves in a data center. Our web servers are a mix of VMs and physical hardware, depending on the environment. The machines actually running the Selenium tests are simple Windows 7 VMs (currently around 250 of them).

While we do not necessarily want to maintain this hardware ourselves, it was just a prudent decision to do so because moving this infrastructure into the cloud – while certainly buzz-worthy – would cost us much more. This is mainly because we have a large number of tests, we test very often and we want to provide fast feedback.

We evaluated both AWS/EC2 as well as integrated test providers such as SauceLabs, but for our purposes self-hosting provides much better value.

In addition to the cost advantage, self-hosting also allowed us to build out our own software infrastructure on top of it that enables optimizations specific to our site and the way we test.

Physical Infrastructure

All of our VMs that run Selenium tests (bots) are based on the same image, which is just a plain OS with a few optimizations to address bottlenecks.

The two bottlenecks on bots we were able to address to some degree are CPU and file I/O, which together limit the number of bot VMs we can host on the same physical hardware.

Naturally, a browser running a Selenium test consumes CPU on any client operation (browser rendering, JavaScript, etc.), but the CPU consumption is aggravated by simultaneously recording video. Initially we used camstudio_cl, a stripped-down command-line version of CamStudio optimized for screen capture, but this encoder still uses a significant amount of CPU. So after a brief comparison of codecs we settled on the Expression Encoder Screen Capture Codec, which uses around 50% less CPU and produces output files that are less than 20% of the original size. The latter is a significant improvement in terms of I/O. This codec is free but Windows-only, so it may or may not work for your testing environment.

We achieved another interesting performance improvement by reducing the I/O produced by actually running Selenium tests. Each Selenium test by default creates a new browser profile and caches pages and resources it downloads on disk. This means a lot of I/O. At the same time, we do want the test isolation a new browser profile brings.

The solution was configuring a small RAM disk on each bot and using this for the browser profile, cached files, and our own video recording. This brought down overall I/O on the host machines by almost a factor of 20, to levels where we can comfortably host a large number of bots on spindle disks. Even with SSDs, which we are still using in some environments, this is important because of write endurance, which limits their lifetime.
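
Pointing the browser at the RAM disk is straightforward. Here is a simplified sketch (assuming the RAM disk is mounted as R: and Chrome as the browser) in which each test still gets a fresh, isolated profile – it just lives in RAM instead of hitting the host’s disks:

    using System;
    using OpenQA.Selenium.Chrome;

    var profileDir = @"R:\selenium\profiles\" + Guid.NewGuid();
    var options = new ChromeOptions();
    options.AddArgument("--user-data-dir=" + profileDir);               // browser profile
    options.AddArgument("--disk-cache-dir=" + profileDir + @"\cache");  // page/resource cache

    using (var driver = new ChromeDriver(options))
    {
        driver.Navigate().GoToUrl("https://www.zocdoc.com");
        // ... run the test; the profile evaporates with the RAM disk ...
    }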

Figure 1: Disk I/O on host machine before introducing RAM disks

Figure 2: Disk I/O on host machine after introducing RAM disks

Architecture Overview - Running Selenium Tests

Selenium tests are slow since there is a lot of overhead involved – we need to send out the test assemblies to our bots, open a browser, simulate end user actions that result in HTTP requests, and record video.

On the plus side, they are easily parallelizable. This means that the overall test duration is determined by the degree of parallelism - how many tests are running concurrently against your web server(s) – and how long each individual test takes.
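
To pick round, hypothetical numbers: 3,000 tests averaging two minutes each, spread across 250 bots, finish in roughly (3,000 × 2) / 250 = 24 minutes of wall-clock time – so doubling the bots or halving the average test duration pays off directly.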

We have deployed a custom Selenium grid at ZocDoc that satisfies a few very specific requirements while addressing efficiency and speed across our different testing environments:

  • Ability to pass all test assemblies and dependencies (.NET dlls) to bots over the wire in the form of a zip file
  • Support for different testing environments with different priorities and configurable pool size
  • Ability to set Selenium test order (random, by average run time, failed tests first)
  • Dynamic control of parameters that may be different per test run or environment, such as degree of parallelism.

In our Selenium grid we distinguish between three different roles:

  • Central Command (CC): Maintain a pool of bots and make them available to requestors based on priority and availability
  • Server: Define the set of tests that should be run and send test assemblies to bots that were assigned to it by CC, report test results as a whole to the CI server
  • Client: Run Selenium tests and report individual test results

Figure 3: Interactions between Server, Central Command and Client

Our setup also enables us to assign different priorities for different test runs. This, combined with a target pool size (the number of bots a test run ideally wants to use concurrently), enables us to support multiple concurrent test runs without sacrificing performance. Besides our “main” CI Selenium test runs, we also run Selenium tests on-demand for feature branches that developers are testing before they are ready to merge into the master branch.

These on-demand Selenium test runs are lower priority. When there is a lot of contention for bots (e.g. when there are multiple on-demand runs at the same time) these runs may use fewer bots, but if there are enough bots for everyone, they will run at “full speed.”

Metrics and Monitoring

ZocDoc is a very metrics driven company, and our Continuous Integration environment is no exception – it’s very hard to improve if you do not know where you are. For Selenium tests we record almost anything we can get our hands on:

  • Full test log
  • Test outcome success/failure
  • Start/end time, duration
  • The bot the test ran on
  • The host machine
  • The TeamCity build ID this test run is associated with
  • The server that requested the test run

Additionally, we track aggregate data for test runs so we can follow trends over time, e.g. full test run duration.

In addition to tracking outcomes, we also keep track of what our bots are doing at any time so we can optimize their usage. This information is provided by our Central Command service and written to a DB.

Many past improvements (for example, test run priorities) were spawned by visually examining common usage patterns on our internal CI metrics web site:

Figure 4: Tracking bot usage

Providing Developer Feedback

One of the core goals for the Automation team at ZocDoc is to make our developers more productive (the other, obviously, is to stop bad code from making it to Production). This means providing as much detail as possible about the circumstances of Selenium test failures to make them easier to reproduce.

Every failed Selenium test at ZocDoc produces the following artifacts:

  • A failure log. All steps the Selenium test ran through up until it failed, possibly exception details. The developer writing the Selenium test is responsible for adding logging statements with appropriate granularity.
  • A screenshot of the web page taken when the test failed
  • An HTML page. The raw HTML file of the page the test failed on
  • A video of the failing test from start until 5 seconds after the failure occurred
  • A HAR file with the complete HTTP network traffic

All of these artifacts are linked in the failure log that we make available through our CI server.

Figure 5: Selenium test failure log with linked test artifacts

Most of these artifacts are self-explanatory, but I want to add some more detail on the HAR file, which has proven very useful. The HAR format, short for HTTP Archive, is a standardized file format for storing web browser interactions as JSON. It is used in one form or another by Chrome, Firebug, and Fiddler.

Since we are a .NET shop we integrated Fiddler Core into our Selenium testing setup on our bots. Fiddler Core basically allows you to programmatically set the bot’s web proxy and intercept any outgoing or incoming traffic by simply subscribing to a callback (FiddlerApplication.BeforeRequest). We record all of this traffic and write out a HAR file in the case of a test failure.

Since Fiddler itself natively supports HAR files, developers can then open up the generated file in Fiddler (File|Import Sessions…| HTTP Archive) and review all browser interactions, HTTP return codes, and all other data.
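
The wiring looks roughly like this – a simplified sketch, where WriteHar stands in for our HAR serialization and the failure flag for the test runner’s outcome:

    using System.Collections.Generic;
    using Fiddler;

    var capturedSessions = new List<Session>();

    FiddlerApplication.BeforeRequest += session =>
    {
        // Hook for tagging or rewriting outgoing requests if a test needs it.
    };
    FiddlerApplication.AfterSessionComplete += session =>
    {
        lock (capturedSessions) { capturedSessions.Add(session); }  // keep the full request/response pair
    };

    // Register FiddlerCore as the bot's proxy so all browser traffic flows through it.
    FiddlerApplication.Startup(8877,
        FiddlerCoreStartupFlags.Default | FiddlerCoreStartupFlags.RegisterAsSystemProxy);

    // ... run the Selenium test ...

    bool testFailed = true;  // in reality, taken from the test outcome
    if (testFailed)
    {
        WriteHar(capturedSessions, "failure.har");
    }
    FiddlerApplication.Shutdown();

    // Serializes the captured sessions as HAR JSON (body omitted for brevity).
    static void WriteHar(List<Session> sessions, string path) { /* ... */ }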

Figure 6: Reviewing test failure HTTP network traffic with Fiddler

Handling Failures

Because Selenium tests are full end-to-end tests they can fail for many reasons:

  • The web site feature is broken: The test did its job – yay!
  • The test itself is broken or brittle: I consider a test brittle if it depends on “optimal conditions” and is not resistant to outliers – for example, because it uses sleep statements or waits for Ajax requests to finish instead of waiting for expected elements to appear on the page.
  • Concurrency problems: We run many, many tests in parallel, so occasionally we do get SQL deadlocks. This typically is not the test’s problem but a problem with the feature, and a sign that this bottleneck might eventually become a problem in Production when we scale up. If we see the same type of deadlock re-occurring, we will look into changing the underlying queries/indexes, etc.
  • An external dependency is currently unavailable

While we do want to let developers know if their tests are flaky, we still want to maintain a clear pass/fail signal and not be “red” just because of a flaky test or other circumstances beyond our control.

To facilitate stable tests and ongoing development we allow tagging Selenium tests as “in development”. Tests under development are run separately and will not break the build. Only when a test has been passing reliably for some time in this mode is it then promoted to a test that breaks the build. Tests can be demoted as well – if a test has become flaky, it hurts us more than it helps, so at that point we demote it back to “in development” until the team that owns the test can fix it.
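
In practice, the tagging itself can be as simple as a test category. Here is a sketch of how that might look with NUnit – the category name and the CI behavior around it are our convention, not anything built into NUnit or Selenium:

    using NUnit.Framework;

    [TestFixture]
    public class InsuranceFilterSeleniumTests
    {
        [Test]
        [Category("InDevelopment")]  // remove this tag to promote the test to build-breaking
        public void FilteringByInsurance_ShowsOnlyMatchingDoctors()
        {
            // ... drive the browser with Selenium and assert on the page ...
        }
    }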

We also allow re-running a number of tests for each test run (currently < 0.5% of our total tests), but any single test may only fail once. If these tests pass on a rerun, they will not break the build. Regardless of whether a test breaks the build or not, we track and report these failures and send an automated email to the owner of the test, indicating that it has been failing.

Because we store all test results in a database we have good statistics on our flaky tests and have a page on our metrics web site that lists these tests sorted by the number of times they failed in a given time frame. These tests can then be prioritized for fixing by the owning team and/or be demoted back to in-development.

Final Words

In the end, testing is done for a purpose: to make sure our site works as expected, and that we don’t introduce bugs into Production. Intelligently leveraging our testing environment means we can not only run tests very fast, but also reduce false alarms raised by brittle or unstable tests, and help with reproducing failures using appropriate test artifacts. Selenium tests are not 100% reliable, but they are a very valuable tool in the arsenal of tests we use to achieve this goal.

Zeus

As ZocDoc engineers, we care a lot about performance. From optimizing SQL queries to fine-tuning our search algorithm to tweaking our JavaScript, we’re always looking for ways to make our site just a little bit faster. Our patients’ and doctors’ time is valuable, so we strive to deliver a snappy experience for them on ZocDoc.

But we also care about our developers’ experience building ZocDoc; our time is valuable, too. With so many exciting projects to work on, the last thing we want is to waste time on repetitive tasks like starting up our local web servers.

The workflow for getting ZocDoc running on our dev machines involves compiling code, deploying assemblies, resetting IIS, and warming up local caches. In the olden days, that meant sitting around twiddling our thumbs (i.e. surfing reddit) for five minutes while Visual Studio froze trying to build the solution. Once Visual Studio finally calmed down, we’d switch to the command line to reset IIS. And then we would pull open a web browser to load the caches. In the end, this process took about ten minutes and required manual intervention every step of the way.

Last fall, we finally decided to automate that process with some nifty PowerShell scripts. And so ZEUS (Zoc-tastically Excellent User Scripts) was born. With ZEUS, that workflow is now entirely automated by a single command. No more frozen Visual Studio. No more manual intervention. We just type “build -warmup” and forget about it.

Since then, we’ve used ZEUS to automate a number of other tedious tasks. We can open pull requests on GitHub, execute unit and Selenium tests, and queue runs of our continuous integration system all with just a few keystrokes.

That leaves our engineers to focus on more interesting problems like changing modern healthcare and other important stuff.

Zocron

It seems like every web application needs to have some facility for running tasks periodically. Whether you’re maintaining caches, asynchronously consuming queues, or doing dataset analysis, automating work on a schedule is pretty much ubiquitous.

ZocDoc is no exception! We have hundreds of jobs that run throughout the day, so having a service to run these reliably is important. There are plenty of job scheduler options available. Cron is a popular choice, but problematic for us since we mostly use Microsoft technologies. SQL Server Jobs are reliable and proven, but getting SQL to interact with our CLR codebase is not realistic. Windows Scheduled Tasks are hard to configure if you don’t have production credentials. So, for a variety of reasons, we made our own job scheduler which we’ve affectionately named “Zocron”.

Design Philosophy

ZocDoc’s requirements in a job scheduler were not peculiar, but they drove our decision-making when implementing Zocron. These requirements were:

Suitable for a Variety of Tasks

With Zocron, we wanted to put an end to the various, sometimes one-off, job schedulers that had cropped up at ZocDoc. Bringing these all under one roof would give us a single system to test and would allow all jobs to enjoy enhancements as they’re introduced. The benefits of unifying the scheduling infrastructure are obvious, but we also wanted to avoid making all jobs suffer from cruft that only a few require. For example, we didn’t want jobs that simply ping an HTTP endpoint periodically to carry the overhead required by a heavy-duty data processing job.

Consistently Available

ZocDoc has an aggressive code push schedule, meaning we can deploy new code as frequently as once a day. On the website, we need to provide the same functionality to our patients even when code is deploying and services are shut down. The same attitude prevailed for our job scheduler: the services depending on jobs running reliably should not have to worry about Zocron downtime.

Expressive in Scheduling

What good is a job scheduler if you can’t schedule a job to run a fortnight after the first full moon preceding the vernal equinox? Well, maybe that’s too expressive. But we certainly needed something more sophisticated than “run every n minutes”.

Easy to Monitor

A system is only as good as the tools that visualize its activity. Zocron needed to report what it was doing and when, all in an unambiguous way.

With those needs in mind, we went to work developing the infrastructure for running jobs. While developing, we tried to stick to two simple ideals. Firstly, the scheduling infrastructure should be agnostic about the nature of the tasks it runs. Zocron shouldn’t behave differently for long-running jobs and short-lived ones, and it shouldn’t have special cases. Secondly, tasks themselves should be ignorant about what else might be running at the same time. We feel that, by sticking to these ideals, Zocron has given us a platform to do scheduled work that will serve us for a long while.

Figure 1: Real-time job monitoring

Architectural Overview

The Zocron application itself is a .NET application that is installed as a Windows Service. We’ve deployed it to several machines in our production environment, and we push new code for it every day along with our web servers. Each instance of the Zocron application is called an “agent.” Some agents run on heavy-duty machines and maintain large in-memory caches for important datasets, while other agents are more minimally provisioned. Together, this pool of Zocron agents shares the workload of our scheduled jobs.

Configuring Zocron jobs is a matter of updating values in a normalized, relational database. Jobs are defined in a ZocronJob table. Each job is composed of one or more tasks, themselves defined in a ZocronTask table and related through a ZocronJobStep table. And, since we love the expressiveness of Cron expressions, we store those in a ZocronJobSchedule table. The relational database is what gives Zocron much of its flexibility: a shared initialization task can be reused in many jobs, multiple Cron expressions can combine into a rich schedule, and resource-demanding jobs can be assigned to more powerful agents. There’s plenty of other minutiae that can be controlled in these tables, and they are all editable from a web interface that maintains an auditable change log.

The database is also where we log execution information. Each time a job is locked by an agent, its ZocronJobId and scheduled time are inserted into a ZocronJobRunLog table. Similarly, as tasks start and end, information is recorded in a ZocronTaskRunLog table. This gives us details about when something starts, when it finishes, and what the result was. We then expose this information to the dev team through web tools, making it easy to monitor the status of our jobs in real-time.

Synchronization

When a Zocron agent starts, it executes a stored procedure in the database to lock some number of eligible jobs. The stored procedure combines information about a job’s schedule (ZocronJobSchedule table) with its history (ZocronJobRunLog table). The query is basically asking, “Are there any jobs that are scheduled to run now that aren’t already running?” It does this with the help of a clever cross apply, a CLR function, and some covering indexes. The magic happens in the code snippet below.

SELECT zj.ZocronJobId,
       ISNULL(CAST(x.ScheduledDateEt as datetime), @minDateThresholdEt) as LastRunScheduledDateEt
INTO #jobHistory
FROM ZocronJob zj
outer apply (
    SELECT top 1 zjrl.ScheduledDateEt, zjrl.EndDateUtc
    FROM ZocronJobRunLog zjrl with (updlock, holdlock)
    WHERE zj.ZocronJobId = zjrl.ZocronJobId
    ORDER BY zjrl.ScheduledDateEt DESC
) x

INSERT INTO ZocronJobRunLog (ZocronJobId, ScheduledDateEt, ZocronAgentId, StartDateUtc)
SELECT top (@numberToLock)
       x.ZocronJobId,
       x.ScheduleToLock as ScheduledDateEt,
       @zocronAgentId as ZocronAgentId,
       @nowUtc as StartDateUtc
FROM (
    SELECT zjs.ZocronJobId,
           csLast.Occurrence as ScheduleToLock,
           ROW_NUMBER() over (
               PARTITION BY zjs.ZocronJobId
               ORDER BY csLast.Occurrence DESC
           ) as rn
    FROM #jobHistory jh
    inner join ZocronJobSchedule zjs on jh.ZocronJobId = zjs.ZocronJobId
    cross apply CrontabLastOccurrence_fn(zjs.CronExpression, jh.LastRunScheduledDateEt, @nowEt) csLast
) x
WHERE x.rn = 1

The first statement gets the last run for each job and stores it in a temp table. The second statement then gets the Cron expression schedules for each job and parses them with a CLR function. This function returns the last occurrence of the given expression if one exists within the given date range. Because we use a cross apply, the jobs that aren’t scheduled to run right now get filtered out. The ROW_NUMBER() function is used because a single job can have multiple schedules, so we should select at most one row per job. Finally, we insert the jobs we want to lock into the run log table.

The ZocronJobRunLog table serves two important purposes here. First, it is a record of which jobs executed and when (which will come in handy for our monitoring tools). Second, it indicates that a job scheduled to run at a certain time has been assigned to a specific agent. Once a job instance is assigned to an agent, no other agent will attempt to grab that same scheduled instance. Because we can guarantee this, each agent can execute the stored procedure on its own and jobs will get dealt out among them.

SQL Server’s transactions and locks make it possible to interleave stored procedure calls without any additional synchronization: the updlock hint on the first statement acquires an update lock on the rows it reads, while the holdlock hint persists that lock until the transaction commits. Therefore, if two agents execute the stored procedure at the same time, one will block the other.

Learnings

ZocDoc has been using Zocron for about a year now. During that time, we’ve learned a lot about job scheduling and made modest enhancements to Zocron.

We learned early that exposing the proper monitoring signals demystifies how Zocron is working. We’ve got a dashboard that lists the jobs that are currently running along with jobs that are scheduled to run but haven’t yet. We look for blocking chains to make sure no agent is holding database locks for long periods of time. And comparing job run durations against previous runs helps us identify gradual and sudden slowdowns.

Also, we’ve come to really appreciate the flexibility database-backed configuration gives us. We can open up a web page, change job settings, and see the effects immediately, all without recycling processes or redeploying binaries. If something is not working as expected, that behavior is brought to our attention immediately and we can turn jobs off. Though we can’t foresee everything that might happen when new code is pushed, having everything in the database gives us options to respond to almost any situation.

Lastly, one of the biggest problems that arose for us was maintaining good quality of service when a single job started consuming large amounts of system resources. Our expectation was that jobs running concurrently would be isolated from each other, but it turned out that if one job loaded a dataset large enough to page to disk, all jobs on that agent would suffer huge performance penalties. Our engineers took a good long look at the problematic jobs and applied these good citizenship strategies:

Consume from Queues

When you have a job scheduler, it is tempting to schedule something for midnight that loads every appointment ever and calculates something about each of them. However, that kind of naïve dataset processing can’t scale forever, especially when you are going nationwide. Our best jobs, therefore, consume from a change queue. For any dataset, when an entity is added or changed, a row is inserted into a change log table. Each change is assigned an auto-incrementing identifier. Zocron jobs can then track the value of the last change identifier processed. On the next run of the job, Zocron will pull back the changes with identifier values greater than that of the last processed one. It is a simple but reliable way of detecting which entities have changed and should be reprocessed.

Bound the Working Set Size

Even when tracking changes, though, it is possible to be overwhelmed. If a burst of changes comes in, it may not be safe to load all of them and reprocess them. Therefore, whenever a Zocron job is getting items to work on, we always cap the number that is pulled back. So, if there have been 100,000 changed entities since the last time a Zocron job ran, it will only process the first 10,000 of them. A subsequent run of the job can take care of the next 10,000. The job will just keep restarting until all the changes have been consumed. By batching up work like this, not only can we allow garbage collection to free up the memory from earlier changes, but we also save our work incrementally, making it easier to recover when failures arise.
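
Put together, a sketch of the two strategies above might look like this (table and column names are hypothetical): each run resumes from a persisted high-water mark and pulls back at most one bounded batch.

    using System.Data.SqlClient;

    public static void ProcessPendingChanges(string connectionString, long lastProcessedId)
    {
        const int batchSize = 10000;  // cap the working set per run
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(
            @"SELECT TOP (@batchSize) ChangeId, EntityId
              FROM dbo.AppointmentChangeLog
              WHERE ChangeId > @lastProcessedId
              ORDER BY ChangeId", conn))
        {
            cmd.Parameters.AddWithValue("@batchSize", batchSize);
            cmd.Parameters.AddWithValue("@lastProcessedId", lastProcessedId);
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    // Reprocess the changed entity, then persist the new
                    // high-water mark so the next run picks up where this one left off.
                }
            }
        }
    }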

More Frequent, Small Runs

But if you limit the amount of work a job can do, will it ever finish doing what it is supposed to do? It will if you schedule it frequently enough. We’ve found that if a job is scheduled to run with small batches frequently throughout the day, it can easily process everything it needs to. Plus, you avoid having a slug of heavy-duty tasks all kicking off at midnight.

After changing our larger jobs to be friendlier to the health of the agent, the reliability of all jobs improved. Today, we’re looking at ways to further isolate jobs from each other, including wild ideas like running jobs in separate app domains.

Hello, ZocDoc – Training New Developers

As ZocDoc has grown we’ve scaled our systems massively. But it’s not just our technology that has had to scale – our technology organization has as well.

While we pride ourselves on only hiring the best developers that we can find, they all have one fatal flaw: they weren’t born with an innate knowledge of the various technologies we use to build ZocDoc. Unfortunately, modern science has yet to develop the ability to upload new knowledge directly into people’s brains. So while we wait for that to happen, our only choice is to train them.

The Before Times

Back in the day, “training” consisted of giving a new developer a computer, showing them how to open Visual Studio, and then throwing them off the proverbial deep end. Okay, perhaps it wasn’t so dire, but there really was not a formal or well defined training process. The typical experience for new developers on day one was basically: get a cursory overview of our codebase from another developer and then be assigned an introductory project which would allow them to dip their toes into the source code. As developers worked their way through the project, a natural form of training would occur as they sought answers to the questions that inevitably arise.

Sure, it was informal and ad hoc, but it worked. And so long as we remained small, it would continue to work. Unfortunately, our own insistence on growth rendered this approach untenable.

The Now Times

Eventually, we hired a big enough wave of new devs that we were faced with a problem. Training them all in the manner we were accustomed to would have been a nightmare; we would waste resources repeating the same information, and when questions arose only one person would benefit from the answer.

The solution was clear: we needed to train them simultaneously. If that solution seems eye-rollingly obvious, it’s because it should be. Nothing about how we train our developers is particularly revelatory, nor should it be. Teaching always works the same way at a fundamental level: have an expert talk about it, answer questions people have, and then let them practice it. Rinse, repeat. However, we consciously made two philosophical decisions that guided how we approached the topics we would cover: breadth over depth, and no magic.

Breadth over Depth

Breadth over depth means we cover many topics a little rather than a few topics a lot. This sounds like the adage, “jack of all trades, master of none,” but it is ideal given the circumstances. First, training is “only” 2-3 weeks, so as a practical consideration we just can’t cover topics to the level of detail that you would in a semester-long college course. And second, we don’t need to: our new hires are self-learners, and will naturally seek out that depth themselves. Therefore, the goal of training isn’t really to educate per se, but to simply show our trainees all the various things they may not know so that in the future when they encounter the topic again they’re not completely ignorant. Of course they’re not completely alone; they have their peers and experienced developers to guide them.

No Magic

If you’ve done software development you’ve probably heard the term “magic.” “Magic” is pejorative jargon meaning any software functionality whose implementation may not be readily understood by a competent developer. Magic is bad because when something goes wrong (and it inevitably does) it’s hard to diagnose the problem. You have to painstakingly deconstruct it while things are breaking (possibly in production!).

However, if you’re an inexperienced developer learning a new technology stack, a lot of things seem like magic to you. Good training demystifies complex systems so that they don’t appear magical.

The Process

Besides those philosophical points, there shouldn’t be anything too surprising about training itself. Long before our first official “class” of trainees arrived, a small group of developers broke down the various topics we felt it was important to cover. The final list in no particular order was:

  • ZocDoc Business
  • ZocDoc Architecture
  • .NET/C#
  • SQL
  • The Internet/Web
  • HTML/CSS
  • JavaScript
  • ASP.NET MVC
  • UX
  • Git
  • Developer Tools
  • Software Engineering
  • Team Fortress 2
  • Testing
  • Release Engineering
  • Interviewing

Those are a lot of topics to cover in two weeks, especially considering that certain topics (.NET, SQL, JavaScript) have both basic and advanced sections. But again, breadth over depth. Some of the topics, like HTML, may seem really basic, but if you’ve never actually made a web page before, we’re not going to hold that against you. It is better that some trainees be bored than others miss crucial information.

A few of these topics are noteworthy. For example, ZocDoc Business and ZocDoc Architecture function as bookends to our training. On the very first day we’ll teach ZocDoc Business, a course which covers all the things about ZocDoc that you probably can’t glean from the outside: our processes, our structure, and cool new things that we’re working on that aren’t ready to be shared with the public just yet. ZocDoc Architecture happens at the very end of our training, and is really the only course that’s specifically about development at ZocDoc. Again, we do this for breadth over depth. Our developers will be learning ZocDoc-specific stuff anyway, so we give them as much general knowledge as possible.

After we broke down the topics, courses were assigned to volunteer developers and designers who put together materials. Once materials were completed, they were reviewed and presented to existing developers. Feedback was given and materials were revised.

Finally, when our first wave of new hires arrived, they were all seated together in ZocDoc University, our largest conference room. They were then sequestered there with no food, no water, and only a single monitor for the full duration of training (okay, they had food and water – free lunch, in fact). As each course was presented it was recorded and its materials committed to source control for posterity. The end goal of training was to have all the new hires build a simple feedback web application we call “ZocTweets”.

All trainees worked towards the same goal so they could help and learn from each other. When training was done, all the trainees underwent code reviews. Experienced developers came in, critiqued the trainees’ code, made suggestions, and asked questions. The trainees were rewarded with a desk with four monitors and maybe a little bit of extra knowledge on top of the experience.

The After Times

At the end of training, new developers aren’t going to be as productive as an experienced developer, and we don’t expect them to be. They still have a lot to learn, especially around the more intricate parts of our architecture. The training simply prepares them to solve the problems they’ll face.

Most of the lessons learned since rolling out this process have to do with gauging the difficulty of a given topic (which is tricky when you live and breathe this stuff). Our recommendation for anyone designing similar training is to go even more basic than your gut feeling when it comes to your material.

And that’s the nature of the beast: We’re continually refining our training, and we hope it gets even better in the future. It’s a learning experience. You might even say we’re in training…

October Hackathon Session 1

We just wrapped up our sixth hackathon, and suffice it to say we had a blast! This was our biggest and best hackathon yet, and the next one can’t come soon enough.

The first day started out like every hackathon should: We took a hot tub break. Just kidding, we actually started working right off the bat. (We hit the hot tub about twenty minutes later.)

Dinner was served by a familiar face - Carl! We ate, watched the Giants game (everyone at ZocDoc is a Giants fan), then powered up Team Fortress 2, AKA TF2, AKA the best game ever, AKA why aren’t you playing TF2 right now? We played until Carson gave everyone a spy knife and the game was no longer fun.

By day two, everyone was already showing real progress with their projects. It was great to get away from all the hustle and bustle of the city and think about what goals we want to accomplish in the coming months. Hacking poolside is especially relaxing with the cool Long Island breeze and fresh air making their way through the spacious accommodations.

Around midday, the rest of our crew arrived, and we got in some Frisbee and full court basketball. Don’t think of basketball as a nerd game? The skill and trash-talking would probably surprise you. We do tire easily, though, so we put in a few more hours of hacking before some midnight TF2 and Settlers of Catan.

Most memorable exchange from day two: “Anyone interested in going to ocean?” “Nope. Freezing. So cold.” “Huh? It’s warm. Fish live there.”

Only in the Hamptons can ZocDoc devs be free enough to ponder the temperature of the water in their vicinity.

Day three: A new game has been invented! We dub thee, soccer tennis volleyball. (Editor’s note: This “new game” has been invented on at least one previous occasion. Watch for the lawsuit.) You start with a soccer ball, rally up a few devs, split up the teams on a tennis court, and proceed to play volleyball; same rules as volleyball except with a lower net and using your feet ONLY. Best game ever!

More TF2 ensued (notice a trend?). One of our devs decided his only goal was to run around as a Scout and club people with the baseball bat, all game, every game. Now, TF2 is a strategic game with checks and balances. It requires team coordination to achieve the winning objectives. Imagine a random crazy person running around with the senseless goal of whacking people. It actually proved quite an effective distraction, and his team won. Another board game was whipped out at this point to provide an alternative to Settlers of Catan: Dominion (a card game).

Day four: One last session of TF2 must be played before we pack up and head home. The takeaway for the weekend? The new ZocDoc class can dominate the veterans at TF2!

Spring/Summer 2012 Hackathon Report

Another season, another ZocDoc Hackathon. Since the last hackathon this winter, ZocDoc’s development team has nearly doubled in size (because, of course, ZocDoc is hiring!), requiring us to split the spring/summer Hackathon into two four-day weekends. So this time, we have two Hackathon reports, one from software engineer Matt, a.k.a. “Goose”, the other from User Experience lead Chris, a.k.a. “Turbo”.

GOOSE: Hackathon starts with packing everything we need into vans and heading out to the Hack House. Packing this time around was a bit unusual thanks to the split. In any given development room, half of the people were trying to code while the other half were noisily taking apart workstations, putting everything that fit into trash bags and bubble wrapping the rest. I’m pretty sure we actually had to send Deansy out for a bubblewrap resupply.

TURBO: While the engineers bubblewrapped their 30-pound tower computers and loaded the vans (in a process known as “packathon”), the user experience design contingent, of which I am a member, cruised out to the Hamptons with nothing but our sleek MacBooks Air and a small library of inspirational design books.

Also, the UX team had the advantage of making our journey in Chris Hlavaty’s sweet, sweet van:

Now, when you think “Hackathon”, you probably think “computer programmers.” But ZocDoc’s development team is a combined group of both engineers and user experience designers. So while the developers were coding away, we were banging on Photoshop and OmniGraffle. But it’s really the same thing: we’re all just hunkering down and geeking out over an idea that we think we can kick ass on for four straight days. What’s great about bringing design and technology together is that we all have a chance to collaborate in an ad hoc way and produce even better Hackathon results.

GOOSE: Once we’ve got all the vans packed it becomes an unspoken race to get to the Hack House (first van there will get first choice in beds). Once we’re all there we quickly set up the essentials…

We packed a bunch of folding chairs to work on, but it pays to be creative. I pulled a table up against a couch and used that as a desk. (Best office chair I’ve ever used.) Then we settled down and got hacking.

TURBO: After unpacking, I got to work doing a deep dive on the UX for a really complicated new ZocDoc feature, creating flowcharts and Photoshop mockups, and rapidly iterating the designs through small group reviews every few hours with my teammates, both designers and developers.

Basically, Hackathon consists of three activities: Hacking, games (of both the indoor and outdoor varieties), and eating.

The challenge of feeding twenty nerds glued to their computers isn’t a small one. Thankfully, Hack House comes equipped with two great chefs, Carl and Anna, who prepared delicious lunch and dinner for the hackathoners every day. With the food catered, we could focus on the tough problems we came to solve, while still having the energy to take breaks and clear our heads.

A few of us took a quick jaunt to the nearby beach for a dip in the cool water, and another day we checked out an estate sale next door. I thought I’d find some sweet mid-century modern furniture, but instead all I got was an inflatable crocodile to wrestle in the pool.

GOOSE: Unfortunately, there was no gator wrasslin’ at the first week, but we still had plenty of fun. Thanks to Team Fortress 2 and League of Legends, we didn’t even have to leave our coding nests during breaks. Of course, some of our more active ranks got outside and enjoyed the sun. Plenty of basketball, tennis and soccer went down. We even invented a game that was a cross between volleyball, tennis and soccer (TURBO: Which ZocDoc CTO Nick Ganju quickly mastered).

Hack House has a pool table in the basement which strangely had 25 balls. Naturally, we had some intense games of 25-ball cut throat. Table Tennis, our company’s unofficial pastime, was also played. Some of us even looked up at night and saw the big dipster.

TURBO: After much weaving of code and pushing of pixels, and a lot of kicking of soccer balls over a tennis court net, we presented our projects to each other. Our teams focused a lot on improving our internal tools, especially on some sweet data visualization tools to help us keep a better eye on the pulse of ZocDoc. For my project, we were able to make a lot of progress by changing the design pretty radically a few times, but I still had time to hunker down and make a fairly polished finished product by the end.

Concurrent Current: Concurrency’s Currency’s as Current as Currants

You like concurrency, right? Everybody likes concurrency. More and more stuff is happening at once – and when a $300 laptop comes with 16 cores, it’s just a waste to not put them to good use.

Well, there’s good news and bad news. The good news is that .NET 4.0 comes with a sweet set of collections and classes that tremendously help with handling concurrency. You just declare a Concurrent<InsertYourCollectionHere> variable, and it promises to handle your thread safety correctly. That counts for a lot in the web world, when you’re getting a million hits per minute, your users are posting videos to your timeline, patients are booking doctor appointments, doctors are updating their schedules, etc. It’s awesome, right?

Right – but concurrency comes with costs. That’s the bad news. Handling concurrency is hard. Not even NP-hard. Just super-hard. It can elude the greatest minds when handled carelessly (see the infamous heisen-bugs). It’s just not something you want to code unprepared for.

So harken, children, to the tale of ConcurrentDictionary. It’s the story of a hero with a dark side. At ZocDoc, we use ConcurrentDictionary a lot. It’s a lifesaver sometimes. You can declare a static ConcurrentDictionary of things, and add to it, modify it, and read from it in a thread-safe manner. Our internal caching tools leverage this class pretty extensively.

Sometimes we have a two-part key. Let’s say (int, int) or (long, long), but we need access to all keys with the first part, too. In short, we have something like this:

ConcurrentDictionary<int, ConcurrentDictionary<int, TValue>> – a two-level index.

It performed well for a long time (since we migrated to .NET 4.0), but eventually we started noticing glitches in GC activity on our production servers. Obviously, we’ve been growing at a crazy pace, and we’ve always given the GC a hard time – but it always caught up. Then, suddenly, it started coughing more often than usual. Dr. House wasn’t around to give us a deus ex machina diagnosis, so we had to investigate.

As it turned out, we had a problem with our dumps. Err, rather: taking and analyzing several particularly large dumps revealed the problem. Our preferred divination tool, WinDbg, revealed we had way too many Gen2 objects. The poor garbage collector had to walk a massive graph of objects to do its job – and what it was really choking on were the objects in the Large Object Heap, which – albeit short-lived – are born old (the LOH is only collected together with Gen2).

For some reason we had a lot of objects in Gen2 – a lot of objects we hadn’t realized existed. And they were just objects or object arrays. (Weird!) As usual, we dug in deeper and discovered the one to blame. His name was ConcurrentDictionary. The knight in shining armor turned out to have some nasty gangrene under the hood.

ConcurrentDictionary allocates an array of 4 × concurrencyLevel objects that are used for synchronization purposes. The default level of concurrency (when you don’t pass it in the constructor) is equal to the number of logical processing cores on the machine. So your regular 4-core Xeon (at two threads per core, i.e. 8 logical cores) allocates an array of 32 objects. And what if you have 1.5 million of those dictionaries? Smooth move, buddy – you just added 48 million objects for the GC to walk. And your concurrency level is definitely not 48 million…
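
(One partial mitigation, if you create dictionaries in bulk, is to pass the concurrency level to the constructor yourself rather than taking the default:

    using System.Collections.Concurrent;

    // Far fewer lock objects per dictionary than the core-count-based default.
    var inner = new ConcurrentDictionary<int, string>(concurrencyLevel: 2, capacity: 16);

But with 1.5 million dictionaries, even a couple of lock objects apiece adds up to millions of extra objects.)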

Clearly, we had to come up with a better solution. One way would be to take a database approach and store everything in a sorted dictionary (e.g. red / black tree), sacrificing a little bit of speed for better memory efficiency and range queries.

But when we examined our codebase, it turned out we didn’t really need range queries (apart from occasional time searching). What we needed was a hashtable of hashsets / lists / dictionaries. Sometimes we can get away with ConcurrentDictionary<Tuple<TKey1, TKey2>, TValue> if we don’t need access by the first part of the key. Tuples are neat, but they are also classes. If you write your own tuple as a struct with the same behavior as Tuple, you gain a little extra speed and GC efficiency. This is especially true when using it with a regular Dictionary (which uses structs as internal storage, as opposed to the classes in ConcurrentDictionary).

Eventually, we settled on a technique called lock striping. Instead of allocating one object for locking purposes, we now allocate an array of those and partition the set of accessible elements into roughly equivalent subsets. This means that one object in the “Lock array” is responsible for only parts of the information stored in the data structure.

Introducing

ConcurrentTwoLevelDictionary<TKey1, TKey2, TValue>
    where TKey1 : IEquatable<TKey1>, IComparable<TKey1>
    where TKey2 : IEquatable<TKey2>, IComparable<TKey2>

It offers all the methods you would expect from a decent dictionary interface, and it also covers the overloads you’d expect from ConcurrentDictionary, like:

public void Add(TKey1 key1, TKey2 key2, TValue value)
public void Add(KeyValuePair<STuple<TKey1, TKey2>, TValue> item)
public bool ContainsKey(TKey1 key1, TKey2 key2)
public bool TryGetValue(TKey1 key1, TKey2 key2, out TValue value)
public bool TryUpdate(TKey1 key1, TKey2 key2, TValue newValue, TValue comparisonValue)
public TValue AddOrUpdate(TKey1 key1, TKey2 key2, Func<TKey1, TKey2, TValue> addFactory, Func<TKey1, TKey2, TValue, TValue> updateFactory)

But to really make our lives easier, we built methods that utilize only the first part of the key, like this:

public IEnumerable<TValue> this[TKey1 key1]
public bool Remove(TKey1 key)
public bool ContainsKey(TKey1 key)

The internal implementation uses a ConcurrentDictionary at the top level and regular dictionaries at the lower level. Each operation (apart from reads and writes to the top level, which are atomic – i.e. provided by ConcurrentDictionary) uses something like this:

lock (locks[GetLockNumber(key1)])
{
    // stuff
}

The GetLockNumber function returns an integer in the range [0, locks.Length). We use key1’s hash function to give us a roughly even distribution across locks, and to ensure that accesses to the same key always take the same lock.
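A minimal sketch of such a function:

private int GetLockNumber(TKey1 key1)
{
    // strip the sign bit so the modulus stays non-negative,
    // then map the hash into [0, locks.Length)
    return (key1.GetHashCode() & 0x7fffffff) % locks.Length;
}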

Occasionally, we need to utilize a function like this:

private void AcquireAllLocks()
{
    for (int i = 0; i < locks.Length; i++)
    {
        Monitor.Enter(this.locks[i]);
    }
}
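…and its ReleaseAllLocks counterpart, which is the mirror image (a minimal sketch):

private void ReleaseAllLocks()
{
    // exit every monitor entered by AcquireAllLocks
    for (int i = 0; i < locks.Length; i++)
    {
        Monitor.Exit(this.locks[i]);
    }
}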

Those operations are needed when we clear out the dictionary, get a Count, or perform other tasks like GetKeys that operate on the whole dictionary. And that’s it! We replaced a few instances of ConcurrentDictionary<TKey1, ConcurrentDictionary<TKey2, TValue>> and killed, oh, 50 million objects on our Gen2 list. (The fewer objects there, the better.) Hooray, concurrency FTW!

GOTO Where You Need To Go

[Happy April Fools’ Day! We hope you liked our little joke.]

You’re sitting at your desk, coding along, when you find yourself in this pickle: you need to reuse some code. You’ve got a nice little snippet that does exactly what you want, but it’s already being used in some other method. You stare longingly at that piece of code, drooling and mumbling, "I want to go to there."

Why not just goto that code? Have we forgotten our roots? Have we forsaken what some might call literally the oldest trick in the book? We at ZocDoc haven’t forgotten where we came from. So we started using gotos. At first just a little. Then we got hooked on the sweet, sweet nectar that is the goto statement. Then we started thinking…

We have lots of new developers starting in June and C# has some really complicated syntax. There are for loops, while loops, foreach loops, and recursion. How many different kinds of loops do you really need?! Then there are function calls, classes, inheritance, etc. You shouldn’t have to be an expert programmer just to do a simple thing like adding two numbers and printing the output to the console!

Using goto solves these problems. It simplifies the language, cuts down on training time, and makes your codebase easier to understand. It’s a no-brainer! The advantages of this new system are endless. Our code is less indented. Our code is flexible – a goto can go anywhere in the current scope. No need for parameters – everything is statically shared. Oh, and say goodbye to nasty stack-overflow exceptions! Our goto pattern is marginally faster than regular method calls, and it’s immune to stack overflows. (See our code examples below.)

Additionally, we’ve found a plugin (goto.js) to add goto functionality to javascript, which means that developers now only have to learn one language. You can literally copy and paste back-end code into the front-end!

Some developers were worried that we’d abandon some of our most popular coding conventions when we made this switch. Fear not, devs. We still fully encourage the use of x1, x2, x3, x4 as label names for your gotos to stop the proliferation of those hard-to-read prose-y names.

Unfortunately, we weren’t able to get rid of all if statements and function calls, but we did succeed in getting rid of most control structures. If only C# had a comefrom statement, we could get rid of function calls! Despite this obvious limitation, we still feel this discovery is a big win… so much so that we’re calling this simplified language C## (some of us are calling it Dmaj; there isn’t a clear winner yet).

Some people say that goto is bad, or makes your code “hard to follow.” Well, why do you think breakpoints were invented? If goto is good enough for the Linux kernel, it’s good enough for ZocDoc. What can we say?

Haters Gonna Hate

goto-it people!

For Loops the Archaic Way

    for (int i = 0; i < 10; i++) 
    {
        //wow it's so hard to remember all this pesky syntax. 
        //  was it semicolons? 
        //  how many? 
        //  what order do the expressions go in? 
        //  omg!
        Console.WriteLine(i);
    }  

… and Simplified to use goto

    int i = 0;
    x1:
    Console.WriteLine(i);
    i++;
    if (i < 10)
    {
        goto x1;
    }

Fancy OO-Style Inheritance

class Greeter
{
    public virtual void SayHello(Friend f)
    {
        f.HighFive();
        Console.WriteLine("hello");
    }
}

class FrenchGreeter : Greeter
{
    public override void SayHello(Friend f)
    {
        f.Kisses();
        Console.WriteLine("Bonjour!");
    }
}

…Simplified to use goto

void Greetings()
{
    Console.WriteLine("are you french? y/n");
    bool isFrench = bool.Parse(Console.ReadLine());

    goto greet;

    franch:
    //because c# doesn't have a comefrom statement, 
    //    it's not practical to use goto:kisses;
    //    the most pragmatic thing for us to do was to allow function calls 
    //    in very limited circumstances.
    //    however, comefrom syntax would look like this:
    //        goto: kisses;
    //        comefrom: kisses;

    kisses();
    Console.WriteLine("Bonjour");
    goto done;

    mercan:
    highFives();
    Console.WriteLine("Hello");
    goto done;

    greet:
    if (isFrench)
    {
        goto franch;
    }
    else
    {
        goto mercan;
    }

    done:
    Console.WriteLine("greetings have concluded");
}  

Winter Hackathon

Every few months the ZocDoc dev team has a weekend hackathon for developers to drink and play Team Fortress 2 (er, work on really cool projects) outside the scope of everyday tasks. Because the hackathon is offsite, it starts with packathon. We pack up tables, chairs, food, drinks, computers and monitors. Some of us (Dan) tend to be extra careful with our packing.

bubble wrap

With everything safely stored in cargo vans, we buckled up and headed for the Hamptons. Four hours and multiple Crunchwraps Supreme later, we pulled into an icy driveway; the temperature couldn’t have been above 10 degrees.

it's cold!

But our chef Carl had beaten us to the house by a few hours, so we walked inside to find a bowl of freshly made guacamole and the smell of a steak dinner cooking. We unpacked everything in record time, and (once again) transformed an elegant living room into a hacker’s paradise, setting up folding tables and filling every room with monitors.

the room

With all of the developers crammed into one room, the house started buzzing with project brainstorming sessions, the sound of aggressive typing and some nice jams courtesy of turntable.fm. A lot of us worked on infrastructure projects that weren’t necessarily glamorous, but make us nerds happy and our web servers happier (more to come on performance improvements in a future post).

Some of us worked on new internal tools for our sales, marketing, and operations teams. Others worked on a fun Kinect-powered map display of searches and/or appointments (which you can move or zoom with hand gestures) that will run on a TV in our New York office. Since ZocDoc has some awesome data, we love to build tools for visualizing it, so a few developers worked on a new generic graphing tool that can filter, group, split, aggregate, and display any data set. The data visualization application does all of its work client-side with javascript, so you can manipulate your data faster than a lightning kick.

But the real work began around 10 PM every night, when we fired up TF2. Though we’re all “engineers” at ZocDoc, we found that only a handful of us are true engineers. The rest can’t resist the glory of flamethrowing someone’s face off or ubercharging into a melee of BLU to capture the last point. Quick protip for the TF2 rookies (cough, cough, Carson cough): when you see an Ubercharge coming your way, aim for the medic’s feet. You can launch him into the air, and if he flies far enough he’ll lose his lock on the heavy – then it’s a free for all.

hard work

A few of us even braved the cold to play some basketball on the court in the front yard, and we also organized a massive four-hour game of poker. (This is important, because every developer secretly believes he could win the World Series of Poker if he played seriously for a few months.)

As hackathon starts with packathon, so it ends with packathon. On Sunday, we packed up the NASA control center we’d created and returned the Hamptons mansion to its natural state. We headed back to the city – all slightly better at TF2, slightly worse at sleeping, and ready to implement the awesome projects we’d come up with at WNTRHCKTHN 2012.

the crew

How To Win With Chun-Li

On the ZocDoc dev team, we have two video games that are kind of a big deal. First, there’s Team Fortress 2. TF2 is fun, but we don’t take it that seriously – it’s mostly a relaxing activity at the end of a day of hacking. Then there’s Street Fighter II: Champion Edition. This game is serious business. Nick is the undisputed champion (at least when he’s playing Ken), but competition among the rest of us can get fierce.

I’m here today to talk about Chun-Li, the lone female character in this version of the game. She’s my strongest character, and she’s underappreciated. Ken is the “default” character, Ryu is a bit faster, Bison is haxx, and Dhalsim has the flashy moves. But Chun-Li is fast and has the moves to keep your opponents off-balance – if you use them unpredictably enough. According to Wikipedia, she’s also the first playable female character in a fighting game.

Here are a few of my favorite techniques:

  • Jump in with a short or medium kick. (If you use medium, make sure you’re not pressing down at the time – otherwise you’ll do a head-stomp.) Follow it up with one of these:
    • A throw (towards and fierce) – works even through a block.
    • A low kick into lightning kick – usually they’ll block, but lightning kick does a lot of damage through block. And if they miss the block, you can chew up half their health this way.
    • A neck-breaker (the kick where you flip over them, which is towards and roundhouse)
  • Neck-breaker to knock them down, then neck-breaker again as they get up. Repeat until they remember which direction you have to hold to block the neck-breaker, then throw when they’ve blocked it.
  • Air-throw. This is a very powerful move, if you have the reflexes to pull it off. All you have to do is towards + punch when you’re near them in the air. This can also be used if you’re stuck in a corner and they do a jumping attack – just jump straight up and toss them to the ground.
  • Note that the head-stomp (down + medium kick in the air) goes through several attacks that normally damage anyone that touches them, such as Blanka’s electricity and Bison’s psycho crusher. It doesn’t do much damage, but it can throw the opponent off-balance.
  • Special note for facing Ken / Ryu / Sagat: try to walk towards them to lure them into doing a dragon punch, but back out just in time. When they come down from the dragon punch, throw them.

Overall, stay in the air a lot (except when facing Blanka, Vega, or Guile), focus on your kick and your range, and try to deny them the opportunity to touch you. Don’t neglect your throws.

Now go for that perfect!

Measuring and Optimizing SQL Performance

At ZocDoc, we like performance. A lot. (We also like performance alot. He’s the one wearing tap shoes.) Performance was an especially important word for us during 2011. We expanded across the country and saw search volume increase by several orders of magnitude – all while reducing the running time of our search algorithm by orders of magnitude (and making it smarter to boot).

But scaling this rapidly wasn’t entirely seamless. While few of our public-facing pages even touch SQL Server, we have many internal pages and processes that do. All this means that the database (which talks with several web servers and other applications) is our main obstacle to horizontal scalability. Unfortunately, your ability to optimize database code is only as good as your ability to measure it. And ZocDoc’s measurement tools just weren’t good enough.

That’s why we developed ZDSqlCommand. SqlCommand is the class in the .NET Framework responsible for talking to SQL Server. ZDSqlCommand wraps it, exposes the functions on SqlCommand that people actually use, and does some tracking for each.

Let’s get a little nitty-gritty. If you introduce a wrapper like ZDSqlCommand, you now need everyone to use it, right? That turns out to be easy when you have SqlHelper (ZocDoc’s micro-ORM, which uses ZDSqlCommand under the hood). ZDSqlCommand is a drop-in replacement for SqlCommand, so it was simple to update our remaining non-SqlHelper code that was written against SqlCommand directly.

Finally, we set up FxCop and created a rule so that all “new SqlCommand()s” break the build. If you break the build, you have to put a dollar in Allen’s Steve McQueen lunchbox. That lunchbox is going to pay for a sweet keg party someday, or so he says. Anyway, all code that runs SQL statements in ZocDoc now goes through ZDSqlCommand in one way or another. Problem solved!

How does ZDSqlCommand help us measure and optimize? For each command it executes, ZDSqlCommand creates an object called SqlLogItem, which records how long the command took to execute, the stack trace, the query text, when the query was executed, the bytes sent and received, and the parameters. Note that the stack trace (with line numbers) uniquely identifies the web page and code path that invoked a SQL statement.

// Example of a ZDSqlCommand method
public object ExecuteScalar()
{
    var stack = GetAndCheckStackTrace();
    ResetConnectionStats();
    var sw = new Stopwatch();
    sw.Start();

    // delegate to the wrapped SqlCommand
    var scalar = mySqlCommand.ExecuteScalar();
    sw.Stop();

    // connection statistics give us bytes sent/received for this command
    var connStats = connection.RetrieveStatistics();
    SqlLogItem.Record(this, sw.Elapsed, connStats, stack);

    return scalar;
}
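SqlLogItem itself is essentially a bag of those measurements. Roughly (a sketch – the real field names may differ):

public class SqlLogItem
{
    public DateTime ExecutedAtUtc { get; set; }
    public TimeSpan Duration { get; set; }
    public string StackTrace { get; set; }
    public string QueryText { get; set; }
    public long BytesSent { get; set; }
    public long BytesReceived { get; set; }
    public string Parameters { get; set; }
}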

And thus begins the long, strange journey of SqlLogItem. We serialize it to disk for offline processing, and we store it in a list in the HttpContext.Items collection (a key-value store tied to a single web request). On the Request_End event, we check for this list and do two things with it.

First, we check whether anyone called a query inside a loop. This is trivial: if 5 or more log items share the same stack trace and their query text doesn’t contain Insert or Update (which can legitimately be called from a loop), it’s deemed a query in a loop. When this happens, the entire dev team gets an email and someone goes and fixes it.
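In code, the check boils down to something like this (a sketch against the SqlLogItem shape above; logItems is the per-request list, EmailDevTeam is a hypothetical helper, and it requires System.Linq):

var queriesInLoops = logItems
    .GroupBy(item => item.StackTrace)
    .Where(g => g.Count() >= 5 && g.All(item =>
        item.QueryText.IndexOf("insert", StringComparison.OrdinalIgnoreCase) < 0 &&
        item.QueryText.IndexOf("update", StringComparison.OrdinalIgnoreCase) < 0));

foreach (var group in queriesInLoops)
{
    EmailDevTeam(group.Key, group.Count()); // shame the offending stack trace
}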

Second, each web server has a static Dictionary of SqlStat objects. These are the aggregates of all the individual log items, keyed by their stack traces. SqlStat is an object with TotalCalls, TotalQueryDuration, StackTrace, and QueryText. The most useful report is simply to sort these by TotalQueryDuration. Because many of the same queries are run over and over through the day, TotalQueryDuration typically increases over time, and it’s difficult to figure out if a query is running normally or spiraling out of control. So we divide TotalQueryDuration by ServerUptime to get time percentage, which is a more useful metric. If the query runs at the same rate throughout the day, the percentage will stay relatively constant. If it runs at the same rate but begins executing slowly (for whatever reason), the percentage will go up. If the query is really heavy but only runs when you first boot the server, the percentage amortizes over the day.
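The arithmetic itself is one line (a sketch – stat and serverUptime are stand-ins):

// percent time: the share of the server's wall-clock life spent in this query
double percentTime = 100.0 * stat.TotalQueryDuration.TotalSeconds
                           / serverUptime.TotalSeconds;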

You can see here that each column is a different web server. Each row is a different query, and the top of each column is the total for that server. At the very top we sum up the values for all the web servers and show the overall totals. The big number is the percent time (query time / server uptime), which can go over 100 because web servers are multi-threaded. To optimize, we simply work our way down the list. Because we have a lot of expensive queries that only run once at startup, our graph looks a little crazy for the first 30 minutes after booting a server, but afterwards it stabilizes and shows an accurate picture of where the DB load is coming from. This page is displayed on the dev room wall-monitors, and it automatically refreshes every minute.

Additionally, when any query goes over 10 percent, it turns red on this chart and our team gets a text message about it. This has come in handy a few times. Once, we found that SqlServer wasn’t using the correct index for a certain query because that index’s statistics were skewed by an unusual load pattern. Another time, we realized an automated process was running a heavy command once a minute instead of once every 30 minutes.

Finally, we get a nightly email of the stats page in case we ever want to know how a query has behaved historically. It’s boss! This is optimization at its finest – leafing through data, finding the kinks, and making everything work a little smoother. It’s so easy a dog could do it!


Barkley keeping an eye on the SQL stats

Client-Server Data Persistence with Backbone.js and .NET MVC

Although Backbone was originally conceived to sit on top of a Rails backend, its RESTful nature makes it easy to use with .NET MVC (or any RESTful backend). This post will explain how to implement client-server data persistence using .NET MVC and Backbone.

Backbone is a fantastic little JavaScript framework for providing MVC-like structure to client-side applications. There are already plenty of resources and examples available that demonstrate how to write an application in Backbone. This example will bypass most of Backbone’s client-side features, and instead focus on binding Models and Collections to a .NET MVC application.

The Interface

Our example app will consist of a list of doctors. Doctors can be added and removed from the list, and their Zociness can be toggled on or off. Backbone assumes a RESTful backend, so we’ll need to write our .NET MVC Controller to handle requests to GET, POST, PUT and DELETE doctors. The interface will look like this:

  • HTTP GET to /zoc/docs
  • HTTP POST to /zoc/docs
  • HTTP PUT to /zoc/docs/{id}
  • HTTP DELETE to /zoc/docs/{id}

Creating the Service

We’ll start by creating a Doctor view model. This class defines a doctor and will be shared by .NET and Backbone.

NOTE: Backbone expects all models returned by the server to contain a (case-sensitive) field called id.

public class Doctor
{
    public Guid id { get; set; }
    public string name { get; set; }
    public bool isZocd { get; set; }
}

Next, we’ll create a ZocController with a Docs action. This method retrieves a list of Doctors from a database (using an arbitrary GetDocs method) and returns a JSON serialized response.

public class ZocController : Controller
{
    public ActionResult Docs()
    {
        return Json(GetDocs(), JsonRequestBehavior.AllowGet);
    }
}

With this method in place, an HTTP GET to /zoc/docs might return an example response like this:

[
    {
        "id": "eb12a423-3e79-4d5a-807b-bc8ebec22654",
        "name": "Dr. Hibbert",
        "isZocd": true
    },
    {
        "id": "a9002b2e-3bcd-4674-be20-fa801c308bef",
        "name": "Dr. Nick",
        "isZocd": false
    }
]
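(GetDocs itself is arbitrary. A hard-coded stand-in like this sketch would produce the response above; a real implementation would query the database.)

// hypothetical stand-in for the arbitrary GetDocs method
private static List<Doctor> GetDocs()
{
    return new List<Doctor>
    {
        new Doctor { id = Guid.Parse("eb12a423-3e79-4d5a-807b-bc8ebec22654"),
                     name = "Dr. Hibbert", isZocd = true },
        new Doctor { id = Guid.Parse("a9002b2e-3bcd-4674-be20-fa801c308bef"),
                     name = "Dr. Nick", isZocd = false }
    };
}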

.NET MVC allows us to overload the Docs action by filtering on the HTTP verb. This is great; however, each overloaded method must have a different argument list. To get around this, we’ll create three distinct methods to handle POST, PUT and DELETE requests and alias the action name back to “Docs” (hat tip).

[ActionName("Docs")]
[HttpPost]
public ActionResult HandlePostDoc(Doctor doc)
{
}

[ActionName("Docs")]
[HttpPut]
public ActionResult HandlePutDoc(Doctor doc)
{
}

[ActionName("Docs")]
[HttpDelete]
public ActionResult HandleDeleteDoc(Guid id)
{
}

Let’s take a closer look at the arguments accepted by these methods. The HandlePostDoc and HandlePutDoc methods both accept the strongly-typed Doctor model we defined earlier. Our Backbone application will send the JSON serialized model in the request body, and .NET will automatically bind it to the specified type. Pretty cool! The completed service looks something like this:

public class ZocController : Controller
{
    public ActionResult Docs()
    {
        return Json(GetDocs(), JsonRequestBehavior.AllowGet);
    }

    [ActionName("Docs")]
    [HttpPost]
    public ActionResult HandlePostDoc(Doctor doc)
    {
        doc.id = Guid.NewGuid();

        CreateDoc(doc);

        return Json(doc);
    }

    [ActionName("Docs")]
    [HttpPut]
    public ActionResult HandlePutDoc(Doctor doc)
    {
        UpdateDoc(doc);

        return new EmptyResult();
    }

    [ActionName("Docs")]
    [HttpDelete]
    public ActionResult HandleDeleteDoc(Guid id)
    {
        DeleteDoc(id);

        return new EmptyResult();
    }
}

NOTE: Backbone uses the id of a model to check for server persistence. In our HandlePostDoc method, we set the id and return the serialized model to let Backbone know that the state has been saved on the server.

Binding Backbone to the Server

Now we’ll define our Doctor on the client using Backbone’s Model and Collection classes. Because Backbone assumes a conventional REST API, the Collection’s url attribute is all that’s needed to sync the client and server. (If you’re not using a RESTful service, it’s pretty straightforward to override Backbone’s sync method with one of your own.)

window.Doctor = Backbone.Model;

window.Doctors = Backbone.Collection.extend({

    model: Doctor,

    url: '/zoc/docs'

});

Manipulating Data on the Client

If everything is implemented correctly, our application will perform full CRUD operations asynchronously on the client and persist state to the server. For example, we can request a list of doctors:

var doctors = new Doctors();

doctors.fetch();

GET /zoc/docs

Create a new doctor:

doctors.create({
    name: 'Dr. Zoidberg',
    isZocd: false
});

POST /zoc/docs

Get an existing doc and modify some properties:

var doctor = doctors.get('4a93fa1e-aa91-4c9f-966a-035f3575d563');

doctor.save({
    isZocd: true
});

PUT /zoc/docs/4a93fa1e-aa91-4c9f-966a-035f3575d563

And delete a doctor completely:

doctor.destroy();

DELETE /zoc/docs/4a93fa1e-aa91-4c9f-966a-035f3575d563

To MVC or Not To MVC

That was definitely the question we were facing here at ZocDoc. On one hand, we could continue to develop features on our aging ASP.NET WebForms architecture. On the other hand, we could migrate to ASP.NET MVC, which comes with a lot of benefits: better separation of concerns, more fine-grained control of the user interface, a powerful URL routing system and a compact view engine (Razor). But the amount of overhead involved in migrating our application onto an entirely new framework was staggering. With a complete rewrite out of the question and the thought of bringing another .aspx file into the world way too depressing to think about, the only solution was a hybrid ASP.NET WebForms/MVC application.

After some initial apprehension about the potential ugliness of a hybrid .NET app, we arrived at a surprisingly simple solution. The URL routing system that ships with .NET MVC is sophisticated enough not to interfere with requests to the WebForms application. By registering new routes with MVC (while taking care to avoid collisions with legacy routes), we are able to serve requests through MVC without interfering with the existing application. This approach has allowed us to take advantage of all the benefits of MVC development without rewriting the entire application. And over time we can replace the old aspx pages with MVC pages.

Here is a rundown of how we added MVC to our ASP.NET web site project. (For more information, check out Hanselman’s comprehensive blog post on how to mix MVC with a WebForms application. Our “Web Site” project type is a bit rickety, so we had to take a more manual approach – but the steps are essentially the same.) First, reference the MVC assemblies in web.config:

<compilation debug="true" targetFramework="4.0">
    <assemblies>
        <add assembly="System.Web.Abstractions, Version=4.0.0.0, Culture=neutral, PublicKeyToken=31BF3856AD364E35" />
        <add assembly="System.Web.Helpers, Version=1.0.0.0, Culture=neutral, PublicKeyToken=31BF3856AD364E35" />
        <add assembly="System.Web.Routing, Version=4.0.0.0, Culture=neutral, PublicKeyToken=31BF3856AD364E35" />
        <add assembly="System.Web.Mvc, Version=3.0.0.0, Culture=neutral, PublicKeyToken=31BF3856AD364E35" />
        <add assembly="System.Web.WebPages, Version=1.0.0.0, Culture=neutral, PublicKeyToken=31BF3856AD364E35" />
    </assemblies>
</compilation>

Next, register the MVC routes when the application starts (in Global.asax):

protected void Application_Start()
{
   /* snip */
   RegisterRoutes(RouteTable.Routes);
}

This is the only tricky part, because you have to be careful to avoid collisions with existing routes. The router that comes with MVC ignores files that exist on disk, so most of our *.aspx pages don’t need to be ignored explicitly. Unfortunately, we have a whole bunch of legacy routing code that can collide, so we have to consider these cases when writing new components in MVC. Controller and Action names must be chosen wisely to avoid collisions.

public static void RegisterRoutes(RouteCollection routes)
{
    routes.IgnoreRoute("{resource}.axd/{*pathInfo}");
    routes.MapRoute(
        "Default", // The name of the route
        "{controller}/{action}/{id}", // The URL with parameters
        new { controller = "Home", action = "Index", id = UrlParameter.Optional } // The parameter defaults
    );
}
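When a legacy path does collide, it can be carved out explicitly before mapping the default route (a hypothetical example – the path is made up):

// keep MVC's router away from a legacy WebForms area
routes.IgnoreRoute("legacyadmin/{*pathInfo}");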

Here is a very basic Controller/View implementation.

~/Controllers/DoctorController.cs

public class DoctorController : Controller
{
    // serves urls like /doctor/index and /doctor/ (Index is the default)
    // returns a view based on the file ~/Views/Doctor/Index.cshtml
    public ActionResult Index()
    {
        return View(GetDoctorIndexView());
    }

    // serves urls like /doctor/profile/456 – 456 is passed as the id parameter
    // returns a view based on the file ~/Views/Doctor/Profile.cshtml
    public ActionResult Profile(string id)
    {
        return View(GetDoctorProfileView(id));
    }
}

Because we specified a default Action of “Index” in our default route definition, a request to “/doctor/” will be routed to the Index Action in DoctorController and serve the view ~/Views/Doctor/Index.cshtml. Similarly, a request to “/doctor/profile/12345” will be routed to the Profile Action, with the id parameter already parsed from the URL, and serve the view ~/Views/Doctor/Profile.cshtml.

To use Razor syntax within a view, inherit from WebViewPage and use a .cshtml file extension. For a full overview of Razor, read Scott Gu’s lengthy post that inexplicably uses images for code samples so you can’t copy and paste anything. Also, it’s worth noting that Gu’s blog post appears to be the only source of documentation on Razor – there is no official documentation for some bizarre reason.

In the view below, List<Doctor> must correspond to the argument passed to View() by the Index action method. Within the view, this object is accessed via the Model property.

In general, all code blocks are prefixed with the @ symbol. Multi-line blocks should be wrapped in curly braces, e.g. @{ }, and multi-token statements should be wrapped in parentheses, e.g. @().

~/Views/Doctor/Index.cshtml

@inherits WebViewPage<List<Doctor>>
@{
    Layout = "~/Views/Doctor/Layout.cshtml";
    ViewData["PageTitle"] = "Doctor Index";
}
@if (Model.Count > 0) {
    <ol>
        @foreach (var doctor in Model) {
            <li>
                <a href="@Doctor.Url">@doctor.Name</a>
            </li>
        }
    </ol>
} else {
    <div>There aren't any doctors!</div>
}

Here’s an example of a detail view.

~/Views/Doctor/Detail.cshtml

@inherits WebViewPage<Doctor>
@{
   Layout = "~/Views/Doctor/Layout.cshtml";
   ViewData["PageTitle"] = "Doctor Detail";
}
<h1>@Model.Name</h1>
<img src="@Model.Image" alt="Portrait of @Model.Name" />
<address>@Model.Address</address>

Both of these views specify a Layout file. Layout files are useful for defining markup and content that should be present on every page. In this example, we define the basic DOM structure, but this could be expanded to include things like global headers, footers, navigation, etc. The contents of the View take the place of the @RenderBody() call. Additionally, .NET provides the ViewData dictionary for passing data from the Controller to the Layout. Here we use it to set the value of the <title> tag.

~/Views/Doctor/Layout.cshtml

<!doctype html>
<html>
<head>
    <meta charset="utf-8" />
    <title>@((string)ViewData["PageTitle"])</title>
</head>
<body>
    @RenderBody()
</body>
</html>

That’s it! As you can see, it’s easy to get started with MVC if your app is heavily WebForms-based. You can still create new aspx pages if needed and have MVC running side by side. At ZocDoc, we’re doing all new work in MVC and making efforts to convert old aspx pages to MVC. Check back soon for more discussion of the MVC migration, including:

  • Custom base controller
  • ActionFilters and Custom ActionFilters for handling forms, ajax requests, etc.
  • Validation, client validation, custom data annotations
  • Backbone.js integration

What Is A Hackathon, And Why Is It So Awesome?

They say everybody wants to go to heaven, but nobody wants to do what it takes to get there. Fortunately for the ZocDoc dev team, the closest place to heaven is out on Long Island – and getting there just requires a van!

But if you’re thinking of throwing your own hackathon, don’t stop with a van. You’ll also need computers. And beer. And Frisbees. For your convenience, I decided to lay out this step-by-step guide to your very own superfun weekend of coding, partying, and everything in-between.

  1. Gather your buddies.
    On Thursday evening, someone yelled, “Code Burnt Umber!” The entire engineering team jumped into matching blue boiler suits, dashed into a secret passage in our equipment closet, slid down a 400-foot fireman’s pole, charged up the flux capacitor, and piled into the ZocMobile. Then we drove through the Queens Midtown Tunnel.

  2. Go to a secret hideout.
    The idea of the hackathon is to put aside our normal, day-to-day bug fixes and requests, and focus on pet projects – stuff we’ve had in our heads, but never could find the time to do. We traveled out to scenic Long Island and holed up in a beautiful old house. I’m pretty sure there was a bumping party across the bay in East Egg, but we had bigger things on our minds.

  3. Code, code, code!
    This is where the magic happens. Drink a beer, chase it with a Red Bull, and pound out that code. At ZocDoc, we’re all part of the team because we love our work. If you’re not yelling smack about how hard you reduced the load on the server, you’re doing it wrong.

  4. Eat, drink and be merry.
    Make sure to get seriously loose after every marathon code sesh. We punctuated ours with amazing meals from our chef (thanks, Carl!), tennis matches, Team Fortress 2 tournaments (I can’t tell you how many times I heard, “Spy! Carson’s a spy!”), and some crazy-competitive Frisbee games. Don’t be fooled by our nerdiness – we can hurl plastic discs with the best of them.

  5. Be a little sad to leave, but feel awesome about what you’ve accomplished.
    As much fun as we had – and we had tons – the reward is in going home with a job well done. At ZocDoc, our highest priority is to the patients we serve, and every little thing we do to give them better access to healthcare feels fantastic.

So that’s how we run our hackathons, and this is the sort of fun you can expect on a regular basis when you work at a place as amazing as ZocDoc. If this recipe works for your team, tweet at us or hit us up on Facebook and let us know!