Web services performance testing: a pilgrim’s progress. Part 3

This particular blog series is probably going to take me as long to finish as it took the medieval Muslim residents of al-Andalus to make the Hajj. So the “pilgrim” in the series title is apt.

It’s easy enough to set up a “load test” in a testing tool. It’s a little more challenging to frame the questions you want to ask about performance.

The services whose performance I was testing request data on policies that vary by insured risk state. The policy data resides in an IBM z/OS mainframe system and in Datacom databases. The architecture works something like this (a rough client-side sketch, with placeholder names, follows the list):

  • .NET Web service request for policy data is made
  • Request is routed through middleware Web services that scrape mainframe screens or query Datacom
  • Middleware Web service returns policy data, or a SOAP fault, to .NET
  • .NET passes back the data to the requestor as XML
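
To make that flow concrete, here is a minimal Groovy sketch of the client side of the exchange: post a SOAP request to the .NET service and check whether the Body carries policy data or a Fault. The endpoint URL, SOAPAction, and payload element names are placeholders I made up for illustration, not the real service contract.

```groovy
// Hedged sketch: call the policy service and tell a data response apart
// from a SOAP fault. Endpoint, action, and element names are placeholders.
// XmlSlurper (groovy.util) is auto-imported in Groovy 2 and 3.
def endpoint = new URL('https://example.internal/PolicyService.asmx')   // placeholder URL
def requestXml = '''<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <GetPolicyData xmlns="http://example.internal/policy">
      <PolicyNumber>PLACEHOLDER</PolicyNumber>
    </GetPolicyData>
  </soap:Body>
</soap:Envelope>'''

def conn = endpoint.openConnection()
conn.requestMethod = 'POST'
conn.doOutput = true
conn.setRequestProperty('Content-Type', 'text/xml; charset=utf-8')
conn.setRequestProperty('SOAPAction', 'http://example.internal/policy/GetPolicyData') // placeholder
conn.outputStream.withWriter('UTF-8') { it << requestXml }

// SOAP 1.1 faults usually come back with an HTTP 500, so read the error stream then.
def status = conn.responseCode
def responseText = (status >= 400 ? conn.errorStream : conn.inputStream).getText('UTF-8')

def envelope = new XmlSlurper().parseText(responseText)
def fault = envelope.Body.Fault
if (fault.size() > 0) {
    println "SOAP fault: ${fault.faultstring.text()}"
} else {
    println "Policy data returned (${responseText.length()} characters)"
}
```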

The team was especially concerned about the performance of the components that scrape the screens. Screen scraping can be slow, and our code would be sharing the screen-scraping subsystem with a finicky Java messaging framework. The middleware in question is also very much due for an upgrade.

After thinking through these issues and consulting with the project team, I designed my performance tests to record the following (a recording sketch in Groovy follows the list):

  • Response time in milliseconds per request
  • The risk state of the policy data being requested
  • The number of policy terms in the response: a successful request usually returns two terms but can return as few as one. When two terms are present, the number of screens that need to be scraped doubles.
  • The size of the response in bytes. A single-term response can be as large as, or larger than, a two-term response, depending on how much data each mainframe screen holds and whether certain screens were used for that policy.
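
As a rough illustration of how those four measurements might be captured, here is a minimal Groovy sketch that writes one CSV row per request. The element name PolicyTerm, the sample risk state, and the file name are assumptions of mine; in the actual tests this kind of logic ran inside a SoapUI Groovy step, which I’ll get to in a later post.

```groovy
// Hedged sketch: record elapsed time, risk state, term count, and response
// size for one request. PolicyTerm and policy-perf.csv are placeholders.
// XmlSlurper (groovy.util) is auto-imported in Groovy 2 and 3.
def recordObservation(File csv, long elapsedMillis, String riskState, String responseXml) {
    def parsed = new XmlSlurper().parseText(responseXml)
    def termCount = parsed.'**'.findAll { it.name() == 'PolicyTerm' }.size()  // one or two terms
    def sizeBytes = responseXml.getBytes('UTF-8').length                      // response size in bytes
    csv << "${System.currentTimeMillis()},${elapsedMillis},${riskState},${termCount},${sizeBytes}\n"
}

def csv = new File('policy-perf.csv')
if (!csv.exists()) {
    csv << 'timestamp_ms,elapsed_ms,risk_state,term_count,size_bytes\n'
}

// Example observation for a hypothetical two-term response:
recordObservation(csv, 1543L, 'WI', '<Policy><PolicyTerm/><PolicyTerm/></Policy>')
```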

Because I was testing in production (see part 1 of this series), I made a few more decisions:

  • Requests would be made every 30 to 60 seconds over a couple of hours, for a total of about 200 (see the pacing sketch after this list). Earlier tests at a higher frequency did not always go well; the aforementioned Java messaging framework was a rather feeble canary in the coal mine.
  • The team felt that the volume of requests (one or two a minute) was a realistic prediction of actual production load. The fixed pacing was not realistic, but my concern was again to avoid interfering with other production usage.
  • I felt that a sample size of 200 was “decent.” Since prod support staff had to monitor the production systems as I ran the test, a run time of anything over a couple of hours would not have been reasonable.
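
Here is a minimal Groovy sketch of that pacing, assuming a stand-in sendRequest closure in place of the real SoapUI test step: about 200 requests, each followed by a randomized 30-to-60-second pause, which works out to roughly two to three hours of run time.

```groovy
// Hedged pacing sketch: ~200 requests, 30-60 seconds apart.
// sendRequest is a placeholder for whatever actually fires the SOAP call.
def random = new Random()
def totalRequests = 200

def sendRequest = {
    // placeholder: fire the SOAP request and return the response here
}

totalRequests.times { i ->
    def start = System.currentTimeMillis()
    sendRequest()
    def elapsedMillis = System.currentTimeMillis() - start
    println "Request ${i + 1} of ${totalRequests}: ${elapsedMillis} ms"

    // Pause 30 to 60 seconds so the run does not lean on shared production systems.
    Thread.sleep((30 + random.nextInt(31)) * 1000L)
}
```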

In the next posts, I’ll review how I recorded and reported on the data using SoapUI, Groovy, and the R statistics language.