Web accessibility metrics – “What are they for then?”


Yesterday I participated in the W3C Web Accessibility Initiative’s (WAI) Website Accessibility Metrics Online Symposium. Details of the symposium, with access to its papers and presentations, are available at: http://www.w3.org/WAI/RD/2011/metrics/.

This blog post is not an attempt to give a comprehensive report of the symposium but to air some of my thinking about it and how it relates to ongoing work I am involved in at the Open University where I am employed as a Senior Research Fellow with an internal consultancy role on accessibility.

Personal basis for interest in web metrics

I have been working on technology for people with disabilities since 1991. Since 1998, when I joined the Open University, that work has focused on technology that enables teaching and learning. My academic background is in cybernetics and I usually describe myself as a systems engineer. So my main interests are in access to systems and systems behaviours that can be enabling. Most systems today have web-based interfaces so web accessibility is an important issue. In the interdisciplinary teams I have led or been part of, and in the accessibility work of the Institute of Educational Technology for the rest of the university, our evaluation of accessibility has put highest value on user (disabled student) evaluations. These are normally based on observational studies with participants interacting with functioning prototypes, followed up by structured interviews. For pragmatic reasons extensive expert evaluations supplement these end-user evaluations (early in development and for procurement assessments they are often the best method). However these expert evaluations are not based on the automated or semi-automated evaluation tools, often associated with the metrics reported in this symposium, that evaluate against the web accessibility standards. Rather they are based on heuristic methods, interacting with the prototypes using a range of assistive technologies (ATs) and access techniques to, in effect, emulate users with different disabilities. This is to answer two key research questions for a range of different users:

  • Can the disabled user undertake the actions intended by the design?
  • What will the end-user experience be (compared with a user not deploying AT or access approaches)?
So the web accessibility guidelines were and still are not core to our evaluation work, although the access principles are the same in both. (Note we have been involved in more conventional accessibility evaluation against the standards too but not in methodological development here.) Where the standards have been important is in communicating to developers what needs to be done and in supporting their QA practices.
I became aware of the work on web accessibility metrics sometime around 2002, especially the work that was subsequently sponsored by the European Commission. I was part of the accessibility community (and there was a sizeable number of us) who were quite sceptical. I could not see the value (for my work) of a single score for the accessibility of a set of web pages. What I wanted to know was: who could access the site/interface with few problems, who would have significant problems, where the deficits were and what could be done about them. I needed fine-grained information, not an overall metric. I have only just envisaged a possible use for metrics in my work, in facilitating a systems behaviour in e-learning that I point to at the end of this blog post. This was my motivation for taking a detailed look at the state of the art of accessibility metrics at this stage and hence my participation in the symposium.
I am currently undertaking an internal standards review project, due to complete before Christmas. I am reviewing all internal web accessibility policy statements and standards (we historically have had a silo situation which we are seeking to rectify) against WCAG 2.0 and the British Standards Institution’s BS8878 “Web Accessibility Code of Practice”. So my attention is currently on the standards at some level of detail.
I have given this rather long preamble so you can judge the perspective for my comments below.

A definition of Web Accessibility Metrics

Web metrics in general quantify a result for assessments of properties of web pages and their use; they might include:

  • Web usage and patterns
  • User supplied data
  • Transactions
  • Site performance
  • Usability
  • Financial analysis (ROI)

Web accessibility metrics try to give an assessment of the level of accessibility against a given standard, e.g. WCAG 2.0.

What are they for?

Three basic questions about any metric:

  • What should you measure?
  • How do you measure it?
  • What do you do with the data once you have it?
Most people in the field would argue for Web Accessibility Metrics as a measure of the degree of accessibility of a web page or collection of web pages. The main school of thought has been defining accessibility in terms of conformance to web accessibility guidelines like WCAG 1.0 or WCAG 2.0. A lot of the research in the field has been in terms of defining the form of the measure that makes up the particular metric concerned, implementing tools to automate its application and then researching the validity and reliability of the metric. However, from my perspective on the field, the 3rd question as to what you actually do with the metric is much neglected by the web accessibility metrics research community.
What do you do with relative rankings of the accessibility of web sites?
  • Large scale comparative studies: It seems to me that the most obvious use case, and the one in which such metrics have had most impact to date, is the large-scale comparative study of websites in a particular domain, with the possibility of doing so over time.
If a credible, stable metric of web accessibility were to be established (at the moment we have many with differing properties) it would enable investigations of the form: What is the overall level of web accessibility in UK public-sector websites?; Accessibility in on-line shopping sites: an improving situation?; … etc. Such studies can be important in informing high level policy and legislation.
  • Litigation: [I will confine myself to the UK legal situation here.] In the UK we have anti-discrimination legislation not accessibility legislation. This is now based on the Equality Act 2010, which builds on the Disability Discrimination Act (DDA) last amended 2005.
It is unlawful for any provider of services to the public, or educational establishment (in my case), etc. to discriminate in that provision against a person with a disability on the basis of their disability. What is more, they are required to make “reasonable adjustments” to meet the needs of disabled people and to be anticipatory in so doing.
Now, websites are not specifically mentioned in this UK legislation but they are in the codes of practice that accompany it. I always argue that, if you think about it from the outset, dealing with web accessibility is reasonable. However this is yet to be tested in a court of law. (There was a case some years ago when the RNIB began court proceedings against a major supermarket chain because of the inaccessibility of their on-line shopping site. The case was settled out of court, RNIB worked with the company concerned to improve their site and everyone won, except the legal position on web accessibility was not clarified.)
If a case of web accessibility did ever go to UK courts it is most likely that expert witnesses would be called for both the prosecution and the defence to establish, firstly, whether the person(s) concerned was substantially disadvantaged and, if so, whether this was because the site in question was inaccessible (see note below). Then, if it was inaccessible, would it have been reasonable to expect the provider of the web site to have made it accessible? I could see a role here for web accessibility metrics and large-scale studies of numerous sites. Then with an evaluation of the site in question using the same metric a “score” could be given as to its level of accessibility and comparisons made with other sites. However would any of the current metrics and the body of research around them stand up in a court of law? (I would never appear for the defence in such a case but feel that if I did I could knock some holes in the existing metrics to try and discredit them; if I could, others would be able to too.)
[Note: I had a recent exchange on LinkedIn with the Accessibility Expert who appeared for the prosecution in the famous case (for those of us in the field then) when the web site for the 2000 Summer Olympics in Sydney was taken to court under Australian law for poor accessibility. He made the point he had to tell the court whether the site was accessible or not, i.e. a binary assessment. My reaction was “if that’s the law then the law’s an ass” [Charles Dickens’ Oliver Twist]. Metrics can have a role here in educating the law and the wider world that accessibility is not a binary property. Indeed it is a property that will be different for different users but I fear metrics are less helpful here.]
  • Remedial Action: It seems to me that web accessibility metrics are poor tools for identifying where remedial action is required. However in the final section I allude to a future scenario where they may have a role.
  • Others? … Please feel free to suggest some in comments to this blog post.

Further Questions:

I will leave a few other questions undiscussed but they are informing my thinking about web accessibility metrics:

  • What are web accessibility guidelines for?
  • What does a metric try to give a measure of (how does it relate to the guidelines)?
  • Who are they for: who are the users of the tools that produce the metrics, and who are the consumers of the resulting metrics?
  • What are they for (in addition to the points raised above)?

Specific examples of schemes of web metrics

I just list here the specific schemes of web metrics mentioned in the papers of the symposium. I try and give a defining characteristic for some but make no attempt at a comparative study.

  • WAB Score [Paper 1] The Web Accessibility Barrier (WAB) score metric was proposed by Parmanto and Zeng (2005). It is a method that enables identification and quantification of accessibility trends across Web 1.0 websites and Web 2.0 websites. The WAB score formula tests 25 WCAG 1.0 criteria that can be evaluated automatically.
  • Failure rate [Paper 1], [Paper 6] The failure-rate metric computes the ratio of the number of accessibility violations to the number of potential failure points. First proposed by Sullivan and Matson (in 2000), possibly the start of web accessibility metrics.
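
The failure rate is simple enough to sketch in a few lines of Python. This is only an illustration of the ratio described above, not code from any of the symposium papers: the function name, the handling of a page with no failure points and the example counts are all assumptions.

```python
def failure_rate(violations: int, failure_points: int) -> float:
    """Return violations / potential failure points, a value between 0 and 1."""
    if violations < 0 or failure_points < 0:
        raise ValueError("counts must be non-negative")
    if violations > failure_points:
        raise ValueError("violations cannot exceed potential failure points")
    if failure_points == 0:
        # Assumption: a page with nothing that could fail scores 0 (no barriers).
        return 0.0
    return violations / failure_points


# Hypothetical page: 20 images, 5 of them without a text alternative.
print(failure_rate(violations=5, failure_points=20))  # 0.25
```

Note that the result says how often a check failed, not which users are affected or how badly.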

  • UWEM score: Part of the Unified Web Evaluation Methodology (UWEM) developed in 3 linked EU projects. Based on WCAG 1.0. A migration strategy to WCAG 2.0 has been published but not yet executed, see Paper 11. The UWEM score function is used for presenting large-scale web accessibility monitoring results. The calculation yields a continuous ratio with a minimum of 0, when no barriers are found. If all tests fail each time they are applied the score reaches its maximum value of 1.

  • Barriers Impact Factor, BIF [Paper 2]

BIF reports, for each error detected in evaluating against WCAG 2.0, the list of assistive technologies/disabilities affected by such an error. The score is then calculated as:

$$\mathrm{BIF}(i) = \sum_{\text{error}} \#\mathrm{error}(i) \times \mathrm{weight}(i)$$

The total BIF is $\mathrm{tBIF} = \sum_i \mathrm{BIF}(i)$ and the average BIF is $\mathrm{aBIF} = \mathrm{tBIF} / \#\text{pages}$, where:

  • i represents the assistive technologies/disabilities affected by detected errors;
  • BIF(i) is the Barrier Impact Factor affecting the i-th assistive technology/disability;
  • #error(i) represents the number of detected errors which affect the i-th assistive technology/disability;
  • weight(i) represents the weight which has been assigned to the i-th assistive technology/disability.
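
To make the BIF arithmetic concrete, here is a minimal Python sketch of the three quantities as they are defined above. It is not the implementation from Paper 2: the error names, the assistive technology groups, the weights and the page count are all made up for illustration.

```python
from collections import defaultdict


def barrier_impact_factors(errors, weights):
    """Compute BIF(i) for each assistive technology/disability i.

    errors:  list of (error_name, affected_groups, count) tuples (hypothetical format)
    weights: dict mapping each group i to weight(i)
    """
    bif = defaultdict(float)
    for _name, affected_groups, count in errors:
        for group in affected_groups:
            # Each detected error adds count * weight(i) for every group it affects.
            bif[group] += count * weights[group]
    return dict(bif)


def total_bif(bif):
    """tBIF: the sum of BIF(i) over all groups i."""
    return sum(bif.values())


def average_bif(bif, n_pages):
    """aBIF: tBIF divided by the number of pages evaluated."""
    return total_bif(bif) / n_pages


# Illustrative data only: names, counts and weights are assumptions.
weights = {"screen reader": 3, "braille display": 3, "screen magnifier": 2}
errors = [
    ("missing text alternative", ["screen reader", "braille display"], 4),
    ("low contrast", ["screen magnifier"], 2),
]

bif = barrier_impact_factors(errors, weights)
print(bif)                  # per-group impact
print(total_bif(bif))       # 28.0
print(average_bif(bif, 4))  # 7.0
```

The per-group dictionary is the part that speaks to who is affected; the total and average collapse that back into a single number.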

  • WAQM: a fully automatic metric designed to measure conformance (currently to WCAG 1.0) in percentage terms. The ratio of actual violations to potential failure-points, that is, the failure-rate, is computed for all checkpoints that can be tested automatically. The severity of checkpoint violations is considered (by WCAG 1.0 priorities) and each failure-rate is weighted by this severity. (Interesting but in my view inconclusive comparison between evaluations undertaken by WAQM (based on WCAG 1.0) vs expert? evaluations against WCAG 2.0 pointed to in Paper 6.)
  • SAMBA [Paper 4] A Semi-Automatic Method for measuring Barriers of Accessibility (SAMBA). It integrates manual and automatic evaluations on the strength of barrier harshness and of tool error rates.

  • BITV-Test: a semi-automated web-based accessibility evaluation tool employing a rating approach. It undertakes page-level rating and aggregation of page-level ratings into an overall test score. BITV-Test’s 50 checkpoints map to WCAG level AA. Each checkpoint has a weight of 1, 2 or 3 points, depending on criticality.

When testing a page per checkpoint, evaluators assess the total pattern or the set of instances and apply a graded Likert-type scale with five rating levels:

1. pass (100%)
2. marginally acceptable (75%)
3. partly acceptable (50%)
4. marginally unacceptable (25%)
5. fail (0%)

Ratings reflect both the frequency and criticality of flaws. For ratings other than a full “pass”, a percentage of the weight is recorded. Page level rating values are aggregated over the entire page sample. At a total score of 90 points or more, the site is considered accessible.
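
The rating scheme just described can be sketched roughly as follows. This is not the BITV-Test implementation. Two things here are assumptions of mine: that page scores are aggregated by a plain mean, and that scores are normalised to a 0 to 100 scale so that the 90-point threshold can be applied to a toy set of three checkpoints. The checkpoint names and weights are hypothetical.

```python
from statistics import mean

# Share of a checkpoint's weight recorded for each of the five rating levels.
RATING_SHARE = {
    "pass": 1.00,
    "marginally acceptable": 0.75,
    "partly acceptable": 0.50,
    "marginally unacceptable": 0.25,
    "fail": 0.00,
}


def page_score(ratings, weights):
    """Score one page from 0 to 100 given {checkpoint: rating} and {checkpoint: weight}."""
    earned = sum(weights[cp] * RATING_SHARE[rating] for cp, rating in ratings.items())
    possible = sum(weights[cp] for cp in ratings)
    # Assumption: normalise to 100 so a small checkpoint set can use the 90-point threshold.
    return 100 * earned / possible


def site_score(pages, weights):
    """Aggregate over the page sample. Assumption: a simple mean of page scores."""
    return mean(page_score(ratings, weights) for ratings in pages)


def is_accessible(score, threshold=90.0):
    """Apply the 90-point threshold mentioned in the text."""
    return score >= threshold


# Hypothetical checkpoints with weights of 1, 2 or 3 points.
weights = {"cp_alt_text": 3, "cp_headings": 2, "cp_language": 1}
pages = [
    {"cp_alt_text": "pass", "cp_headings": "partly acceptable", "cp_language": "fail"},
    {"cp_alt_text": "pass", "cp_headings": "pass", "cp_language": "pass"},
]

score = site_score(pages, weights)
print(round(score, 1), is_accessible(score))  # 83.3 False
```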

The final BITV-Test (they also have self-assessment and design support versions) is a tandem test; in other words, two qualified evaluators test independently of each other and harmonise their results only once they have finished their respective test runs.

  • eChecker [Paper 8], not a metric but an automated web page accessibility tool that evaluates according to UWEM and was used in Paper 8 in a comparative study with eXaminator.

  • eXaminator [Paper 8] eXaminator has its roots in manual evaluations made by experts (since 2000). Unlike metrics such as WAQM, which seeks to achieve a failure rate for each page, or UWEM, which seeks a failure rate for each checkpoint, eXaminator assigns a score to a specific occurrence in a page. The metric (the authors argue) is faithful to the definition of WCAG’s compliance and the unit of conformity: the page.

  • Logic Scoring of Preference (LSP) method [Paper 9]

LSP is an aggregation model (based on neural networks) that computes a global score from the intermediate scores (Dujmovic, 1996). These intermediate scores consist of failure-rates or the absolute number of accessibility problems. (Paper 9 reports using this approach in both Device-Tailored and User-Tailored metrics.)

  • eGovMon Project [Paper 11] The paper reported on the issues uncovered by this Norwegian project in trying to update UWEM to a new metric based on WCAG 2.0 (a non-trivial task, as discussed in the paper).

A critique of Web Accessibility Metrics (Martyn’s views)

How much do they help developers find and fix accessibility deficits? My thinking to date is that, for my context, very little. However I am open to being persuaded otherwise by others’ experience (so please add a comment). A possible role for them in systems-wide accessibility review in an eLearning context is envisaged in the final section of this blog post.

A good thing recognised in almost all Web Accessibility Metrics approaches is that accessibility is not a binary issue. Web sites are not either accessible or not but have degrees of accessibility. In fact they have degrees of accessibility for different users and this is not recognised in any of the approaches known to me (but happy to be corrected). So few if any of the approaches enable statements like, “this site, while reasonably accessible to screen-reader users, would be problematic for those with a hearing impairment or who were colour blind”, to be directly and correctly deduced.

None of the web accessibility metrics considered in [Paper 3] directly addresses the developers’ efforts needed to correct the accessibility problems. That paper went on to consider which accessibility deficits were due to deficits in the templates used in the authoring of the sets of web pages under review. However, this raises the more general question from the perspective of the manager or web developer: what does the accessibility metric tell me about the cost (in terms of time and effort) of improving that metric to a given level, for a given set of web resources? I would argue that none of the existing metrics facilitate this, although the data collected in calculating the metric will also be helpful in evaluating the cost of remedial action. Is this a feature facilitated in the automated and semi-automated tools created to calculate web metrics? I.e. do the tools make available the useful data? Estimates of the cost of remedial action here are thus mostly facilitated by automated/semi-automated evaluation techniques, not the metric. The one thing the metric may give is a scale on which to be able to say: how much will it cost to improve by so much and then by a degree further? However I have never heard managers of web resources frame the question this way. It is usually: what is it going to cost to address the deficits to meet WCAG 2.0 Level AA (for example)? I am not sure metrics help here.

Where are the users? I find this the most disturbing situation around accessibility metrics (well and around web accessibility standards too). I am yet to encounter any work (and I would be delighted to have it pointed out to me) where attempts have been made to verify if the metrics correlate to the access experience of disabled people. I know that such a study would be difficult and costly to do because it would have to be done at scale and involve a large diversity of users to be meaningful. However until such work is done then we are just in a self-referential circle convincing ourselves we have something of real worth. This follows from the fact that the correlations that have been done are between expert evaluations and the metrics generated by various tools both working to the same standards which, as far as I am aware, have not undergone large-scale assessment against the experience of diverse users of web sites where they have been rigorously applied. [I am not questioning the validity of WCAG 2.0 here – I might elsewhere 😉 just asserting the importance of user evaluation in ensuring validity.]

The other users to consider here are the consumers of the metrics. Are the metrics meeting their needs? Are the metrics well understood by those that use them?

The importance of context: Context is very important to the evaluation of user experiences. This is a long-established principle in evaluations undertaken by my Institute (established long before I was there). The web accessibility metrics reviewed here, for the most part, remove context. This issue was raised and discussed in the paper by Markel Vigo, of the University of Manchester, entitled “Context-Tailored Web Accessibility Metrics” [Paper 9].

Accessibility as process

BS 8878 provides a framework that allows definition – and measurement – of the process undertaken by organisations to procure an optimally accessible web site, but is at present a copyrighted work and not freely available. In comparison to a purely technical WCAG conformance report, the nature of the data being gathered for measurement means that inevitably the measurement process is longer; but it also provides a richer set of data giving context – and therefore justification – to current levels of accessibility.

[David Sloan, Brian Kelly Paper 10]

This paper, entitled “Web Accessibility Metrics For A Post Digital World”, rather than presenting results of previous work, was more a position paper presenting a perspective on possible future directions for metrics, and it stood out as distinct from the other papers. It was closely aligned to my own views, but that is perhaps not surprising as I am a regular follower of Brian’s blog. (I know David and Brian quite well and respect them both.)

I commend Brian Kelly’s blog, which covers broader issues than accessibility (he has beaten me to getting up a post relating to this metrics workshop): http://ukwebfocus.wordpress.com/

One theme of the paper is that measuring accessibility should not be restricted to web pages. Rather, interpreting it in the OU’s context, it should evaluate to what extent disabled students can achieve the same learning goals as other students. This may be through alternative learning activities, or by using alternative online resources, or resources in alternative formats. This has been a major theme in my work for the last 10 years in the development of the AccessForAll metadata-based approach for managing alternatives and implementations of it in EU4ALL. There has always been a tension in evaluating for accessibility between those that assume a universal accessibility approach (one size fits all) and those that seek to facilitate flexibility and adaptability via alternatives and personalisation. It is always easier to measure something tightly defined and unchanging but that may not be the best access solution.

One of the strengths of BS8878 is that it has the perspective of embedding accessibility considerations in a company or organisation. (Note the link is to the BSI shop to order a paid copy. UK universities may be able to obtain a copy without further charge if their libraries subscribe to BSI online.) BS8878 has a 16 step model of web product development from the pre-start to post-launch of the web product. It is noteworthy that only 4 steps reference WCAG 2.0.

What I understand David Sloan and Brian Kelly to be suggesting is that there could be a role for metrics across such a process. BS8878 provides a framework against which “measurement” could be made. While currently reflecting on how BS8878 might be applied across the university, and meeting this proposal, I am left with the questions:

  • What would be the nature of measurements against BS8878’s 16 step model?
  • Would there be any value in a metric that somehow aggregated these measurements?

Under “Major Difficulties” the paper raises the following point:

The obvious difficulties in defining and implementing an accessibility metric that incorporates quality of user experience and the quality of the process undertaken to provide that experience are the complexity of the environment to be measured – i.e. not just a collection of resources that enable an experience, but also evidence of organisational activity taken to enhance inclusion.
[David Sloan, Brian Kelly Paper 10]
They cite the TechDis Accessibility Passport as one possible way forward. Within the Open University a programme called Securing Greater Accessibility (SeGA) is embedding accessibility considerations across our processes (c.f. BS8878) and providing the mechanisms to record what steps have been undertaken to “enhance inclusion” at both the Module level and the web asset level.

The link between web standards, web metrics and Learner Analytics within a University.

Some ideas are just beginning to emerge in my mind that might suggest a role for accessibility metrics within the OU’s eLearning context. This was triggered by a presentation last week on another internal project on Learner Analytics. This might be the only bit in this 4,500+ word blog post that is original to me. However if that is not the case and anyone knows of a similar idea please flag it. If colleagues give me the confidence that it is an idea worth exploring I will write it up as a briefing paper in the New Year.

The Open University has about 13,000 disabled students. It uses a Virtual Learning Environment (VLE), based on Moodle, that manages the timely presentation of on-line resources to students as they undertake their studies. (It does more besides and there are other systems integrated with it and alongside it but that description will suffice for this discussion.) The Learner Analytics project is exploring what data about the student’s experience of their studies can be readily extracted from the VLE and other systems and what could be meaningfully deduced from it. I have raised the possibility that whatever can be analysed could potentially be factored across disability types or even, my preference but more challenging, functional abilities. (There are some technical and some data protection issues here yet to be explored.)

For example, if comparing student completion rates across different modules (across all modules if you wish), it would be possible to detect if there were any different patterns for students with disabilities and then if it was different for students with a particular disability or ideally a particular access requirement.

Drop-out rates are a challenge for any university: funding is often linked to them, and even if not they are a key measure of the university’s success in its teaching and learning. Disabled students traditionally have had higher drop-out rates than students who have not declared a disability. So reducing drop-out rates among disabled students is a highly desirable goal. In the above example it will be possible, from the Learner Analytics, to identify which Modules are apparently presenting significant barriers to students with disabilities (there could be other explanations).

Identifying the Module only gets us so far. A Module may be made up of hundreds of assets. The barriers to learning could be diverse and at the teaching and learning level or the technical level, or could be population selection effects, etc. However it seems to me reasonable to want to undertake an accessibility audit of the assets of this module. To be able to do so in an easy automated way, at least for the first pass, seems highly desirable. This is where there is a possible role for accessibility metrics. An accessibility metric, based on an agreed standard like WCAG 2.0 AA, could be assessed for all assets on their production and travel with them in their metadata or be stored in a database. This could be part of the “passport” approach. However even if this were not the case, when a set of assets to be investigated has been identified as suggested, automated testing of just those assets could be undertaken. If the metrics indicated that core elements of the course had major access challenges for the students who were dropping out then an intervention point has been identified and some information about its nature collected. Thus data for possible future Learner Analytics is generated. Ideally this accessibility perspective on drop-out could be checked against other data the university collects on reasons for drop-out, possibly supplemented with interviews of a sample of the students concerned.
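
As a very rough sketch of how these two steps might fit together, the Python below flags modules where the completion rate for students who have declared a disability lags behind that of other students, then lists the assets in those modules whose stored metric falls below a cut-off. Everything in it is hypothetical: the module codes, the counts, the 0 to 1 asset score (1 being best), and both thresholds. A raw gap is not a statistical test and, as noted above, there could be other explanations for it.

```python
from dataclasses import dataclass


@dataclass
class ModuleStats:
    module: str
    completed_disabled: int
    enrolled_disabled: int
    completed_other: int
    enrolled_other: int


def completion_gap(m: ModuleStats) -> float:
    """Completion rate of other students minus that of disabled students."""
    return m.completed_other / m.enrolled_other - m.completed_disabled / m.enrolled_disabled


def flag_modules(stats, min_gap=0.10):
    """Step 1: modules whose completion gap is at least min_gap (an arbitrary cut-off)."""
    return [m.module for m in stats if completion_gap(m) >= min_gap]


def assets_to_audit(asset_scores, flagged, max_score=0.5):
    """Step 2: for flagged modules, assets whose stored metric is below max_score."""
    return {
        module: [asset for asset, score in asset_scores[module].items() if score < max_score]
        for module in flagged
    }


# Made-up data for two modules.
stats = [
    ModuleStats("M101", completed_disabled=60, enrolled_disabled=100,
                completed_other=800, enrolled_other=1000),
    ModuleStats("M202", completed_disabled=45, enrolled_disabled=50,
                completed_other=460, enrolled_other=500),
]
asset_scores = {
    "M101": {"week1.html": 0.9, "quiz3.html": 0.3},
    "M202": {"intro.html": 0.8},
}

flagged = flag_modules(stats)
print(flagged)                                 # ['M101']
print(assets_to_audit(asset_scores, flagged))  # {'M101': ['quiz3.html']}
```

The output is only a shortlist for a first-pass audit; expert and end-user evaluation would still be needed to find out what the barriers actually are.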

It must be stated that we have very little understanding as yet of the experience of OU students (and students in general) when studying on-line. There is another internal OU project that will be looking at that to some degree in the New Year. So for example we have no sense of the balance between possible reasons for drop-out among disabled students and therefore what is the correlation between access issues in Module assets and drop-out. Nor, how this issue compares in significance with others such as health issues, time demands, family issues, etc. However we can say that as more of the university’s teaching and learning goes on-line, accessibility is going to become of increasing importance to meeting the learning goals of our disabled students and managing it efficiently is going to be vital for the university. This approach in part addresses both those drivers.

References (not linked to above)

Dujmovic, J. J. (1996). A Method for Evaluation and Selection of Complex Hardware and Software Systems. International Computer Measurement Group Conference, 368-378.

Parmanto, B., & Zeng, X. M. (2005). Metric for web accessibility evaluation. Journal of the American Society for Information Science and Technology, 56(13), 1394-1404.

4 thoughts on “Web accessibility metrics – “What are they for then?”

  1. Our Prioritization Model includes a weight for each best practice called “tractability” – the degree of remediation required to fix a violation. This, combined with the use case results which are delivered as part of the report, assists developers to identify issues which have a high impact on the disabled user experience that can be fixed with minimal effort.

    1. Thanks Sam,

      I only included in my blog post metrics that were specifically mentioned in the papers of the W3C’s (WAI) Website Accessibility Metrics Online Symposium. I am sure there are others out there. I will include your AMP Prioritization Model in a more detailed review I will be undertaking for the Open University in the New Year.

      Thanks again for bringing this to my attention,


  2. Hi Martyn,

    a very interesting read! I think reservations about the value of metrics are quite justified. I have always wondered why one central main outcome of the EU projects in the WAB Cluster, UWEM, was practically unknown and/or completely neglected by accessibility experts. Could be down to bad dissemination but I feel it may also say something about its relevance. Right now, I take part in developing WCAG-EM, the WCAG Evaluation Methodology, and I have to admit that I am not sure that the eventual outcome will have much impact.

    Benchmarks based only on criteria that can be checked automatically are not reliable (they can however indicate that there is probably ignorance / absence of effort with sites scoring badly on the 1/3 of criteria half-way amenable to automatic testing). The larger amount (about 2/3) of success criteria require human assessment and simply cannot be automated. And for any metric to be meaningful, it has to take into account the criticality of the problem. A site may be quite good overall, but completely inaccessible due to a few critical problems (CAPTCHA without alternative mode, keyboard trap, and the like).

    It is true that a score which would yield more differentiated information (e.g., quite good for screen reader users but abysmal for users with visual impairments) would be desirable, and it can be produced. It just needs a mapping of success criteria to types of disability, often a one-to-several mapping as many criteria affect several classes of impairment.

    Politically, however, differentiating by classes of impairment seems rather unpopular; the emphasis is instead on the “design-for-all” approach. WCAG tries to wrap up all requirements for all classes into one set of criteria. However, you can argue that a very large group (visually impaired people, elderly people) is badly served by WCAG since several requirements critical for them (focus visibility, good contrast, scalability) are only included on Level AA, not on the base Level A. (Some have argued that this was due to an underrepresentation of this group in the WCAG working group – I don’t know.)

    The problem with involving users in accessibility evaluations is that they are varied and on top of that vastly different, especially regarding skills in using assistive technology. Then they use different makes and versions of AT, only some of which will make content accessible. A control study of metrics would be desirable but a monumental effort, especially if you want a large enough sample of sites.

    Human assessment based on checkpoints can take information on accessibility support into account and translate them into evaluation guidance that can be applied also by non-disabled testers. If our screen reader tests find that for a significant part of the population, a particular technique will not do the trick, the checkpoint rating advice can reflect that. But this means that checkpoint suites have to be continuously updated as newer generations of user agents and assistive technology will at some point interpret the techniques that had a too small installed base to begin with (WAI-ARIA is an example).

    Finally, many or most people evaluating sites *do* produce remedial comments in addition to whatever score or conformance result they produce. This is vital for the commissioner who wants a list of things to remedy. This is central for most experts I am aware of – how it is done will differ, as will ranking approaches. The web accessibility metrics papers focused on metrics because that was the topic. This does not mean however that on the whole, people who are ‘into metrics’ are not interested in actual users or practical tests, or a recognition of use context.

    Regards, Detlev (I was co-author of paper 7 in that event, by the way).
