OFSTED, moderation of inspection grades, and what we can learn from universities

I’ll begin this post by writing about how assessment works in higher education, and then explain how this shows us potential shortcomings in the OFSTED system of grading schools, with reference to one Lead Inspector and a sample of 50 inspections.

In universities, we do a lot of high-stakes marking, so we have to be very careful that our processes are transparent and defensible at all times. There are different ways of approaching this, for example second marking (when a colleague marks papers to check the original marks make sense), blind second marking (where said colleague has no idea what the original marks might have been) and even third marking (where the first two colleagues are in dispute, and another view is felt to be appropriate). These approaches generally apply to essay-based answers, projects, dissertations and reports. Marking is then double-checked by an external examiner, to ensure conformity with national standards and with comparable courses, and any anomalies are investigated.

For subjects where answers involve more technical or numerical content, such as the sciences or mathematics, techniques such as scaling are typically used. This means that if a cohort of, say, 200 students all score uncharacteristically badly on a question, the pass mark for that question can be raised or lowered if it is felt that the original question was pitched wrongly in the context of the overall examination, and of what is usually expected at a particular level. That way students get a fair and reasonable result, and standards do not fluctuate wildly if there happens to have been a change in the assessment team, for example. Again, marking is checked by an external examiner, and any scaling has to be justified to him/her in the context of national standards.

If an examination board really wants to probe assessment standards, it is possible to track how individual colleagues or groups of colleagues assess over time, in terms of a particular set of criteria, or statistical norm, depending on the subject under consideration and the group size. This then feeds into ongoing staff training. Academics tend to see assessment standards as an ongoing work in progress, routinely checked and altered, and underpinned by principles of fairness and parity. People take it very, very seriously indeed. This is why I am not being my usual jokey self in this blog post.

With this in mind, I recently spent some time looking at how OFSTED inspection grades vary amongst inspectors, and which factors influence this. To that end, I would like to present a case study of an individual inspector to give an example of quite how variable grades can be in comparison with a national norm. I am not saying this is the case for everyone, or making a pronouncement about OFSTED in general. I am just saying that, in this instance, there is a case for OFSTED/Serco/Tribal to moderate grades internally, as part of ensuring professional standards are met. Then, and only then, can the public have real confidence in inspection findings.

My approach

  1. I carried out an internet search to locate all OFSTED officially published inspection reports with the same Lead Inspector (n=50), who has inspected primary schools for two different subcontracting agencies, but who does not appear to have been an HMI (a centrally employed inspector).
  2. I have gone to considerable lengths to find as complete a data set as possible, by digitally searching published OFSTED reports, with triangulation against relevant newspaper articles reporting school inspections. I was also allowed access to the Watchstead website to check the data, which was very useful (thank you, Watchstead).
  3. I logged the overall inspection grade given by this Lead Inspector in each case.
  4. I calculated the overall proportion of inspection grades in each category given by this Lead Inspector, in percentage terms.
  5. I compared this percentage to the officially published OFSTED average grades overall for all inspectors in each category, which were available on the OFSTED website.
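Steps 3 to 5 amount to a tally followed by a comparison. The following is a minimal Python sketch of that calculation, not the method actually used for this post. It assumes the individual grades can be reconstructed from the rounded percentages in the table further down (so 10% of 50 reports is taken to be 5 reports, and so on), and the names `inspector_grades`, `grade_percentages` and `ratio_to_average` are made up for illustration.

```python
from collections import Counter

# Assumption: grades reconstructed from the rounded percentages in the table
# (50 inspections, so 10% = 5 reports, 30% = 15, 40% = 20, 20% = 10).
inspector_grades = [1] * 5 + [2] * 15 + [3] * 20 + [4] * 10

# OFSTED average percentages, copied from the "OFSTED 05-14" column of the table.
ofsted_average = {1: 13, 2: 50, 3: 34, 4: 7}


def grade_percentages(grades):
    """Return the percentage of inspections at each level (1 to 4)."""
    counts = Counter(grades)
    total = len(grades)
    return {level: 100 * counts.get(level, 0) / total for level in (1, 2, 3, 4)}


def ratio_to_average(inspector_pct, national_pct):
    """Return, per level, how many times the national share the inspector's share is."""
    return {level: inspector_pct[level] / national_pct[level] for level in inspector_pct}


inspector_pct = grade_percentages(inspector_grades)
ratios = ratio_to_average(inspector_pct, ofsted_average)

for level in sorted(inspector_pct):
    print(
        f"Level {level}: inspector {inspector_pct[level]:.0f}%, "
        f"OFSTED {ofsted_average[level]}%, ratio {ratios[level]:.2f}"
    )
```

A ratio above 1 means the inspector awards that grade more often than the published average; a ratio below 1 means less often.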


  • Some caution is required in interpreting the data, as with a sample of 50 inspections one or two unusual cases may skew the findings more than they would in a larger sample of, say, 100 inspections.
  • The table below lists the 50 inspections carried out by the Lead Inspector over the last 9 years, and the overall inspection grades awarded in each case. I have removed the names of the schools, as they would identify the Lead Inspector concerned very easily and that is not the point of the exercise here.
  • The figure below demonstrates the pattern of the Lead Inspector’s inspection grades over time. It tells us that in recent years, the inspector has become considerably more likely to award Level 4 grades to schools. This corresponds to a decreased frequency of inspections carried out by this Lead Inspector during the period 2010-2014, when the new regulations applied.
  • As stated above, the inspection regulations changed after 2010, but the overall OFSTED proportions of schools getting Level 2 or Level 4 have stayed roughly the same during that period.
  • In the case of this Lead Inspector, the table below shows the proportion of grades given during the period 2005-2013. The second column represents the OFSTED average for the same period. I have listed the 2010-2014 OFSTED averages in column 3, but I have not done so for the Lead Inspector, as we only have data for fifty school inspections in total and splitting them further seems unhelpful. Note: columns may not add up to exactly 100% due to rounding.

             Inspector    OFSTED 05-14    OFSTED 10-14
Level 1      10%          13%             10%
Level 2      30%          50%             50%
Level 3      40%          34%             36%
Level 4      20%          7%              6%

  • It would be fair therefore to conclude, with the caveat that this is a relatively small sample of 50 inspections, that on the basis of the publicly available data, this Lead Inspector appears to be around three times more likely to give a Level 4 grade to a school than the overall OFSTED average (20% against 7%).
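To put a rough number on the small-sample caveat, here is a hedged Python sketch of two standard checks. Neither was part of the original analysis. It assumes that 20% of 50 inspections means 10 Level 4 grades, and that each inspection is an independent draw with the national 7% chance of a Level 4, which ignores any targeting of inspectors at particular schools. The function names `wilson_interval` and `binomial_tail` are illustrative.

```python
from math import comb, sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval (about 95% for z=1.96) for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Assumption: 20% of 50 inspections corresponds to 10 Level 4 grades.
level4_count, inspections = 10, 50
national_rate = 0.07  # OFSTED 05-14 average for Level 4, from the table

low, high = wilson_interval(level4_count, inspections)
tail = binomial_tail(level4_count, inspections, national_rate)

print(f"Plausible range for this inspector's Level 4 rate: {low:.1%} to {high:.1%}")
print(f"Chance of 10 or more Level 4s in 50 at the national rate: {tail:.2%}")
```

Under these assumptions the interval runs from roughly 11% to 33%, so even its lower end sits above the 7% average, and the chance of a single inspector reaching 10 or more Level 4 grades by luck alone is well under 1%. That is a statement about one inspector taken in isolation: across a large pool of inspectors a few such outliers would still be expected, and the calculation cannot distinguish a harsh inspector from one who is sent to weaker schools.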

The problem with this is that we don’t know:

1. If this inspector is being specifically sent to schools in trouble, which would explain the lower grades. However, it is usually directly employed HMIs who are sent to schools in trouble, as I understand it, rather than a sub-contracted inspector, as in this case (I am sure someone will correct me if I am wrong).

2. If this inspector has become more or less reliable in terms of judgements over time, compared to the OFSTED guidelines and the opinions of inspection peers (I found many incidences where this inspector was working alone, in small primary schools).

3. How inspection grades are defended internally by inspectors to one another. And if we don’t know this, then we have no idea how accountable inspectors are for their decisions.

This is why OFSTED needs to tell us more about how its moderation processes work, or if it has none, then simply to implement some as soon as possible. Otherwise wild vacillations and inconsistencies will continue to make parents, teachers and pupils very nervous indeed. If surgeons can publish their personal outcomes, then surely so can inspectors?


4 thoughts on “OFSTED, moderation of inspection grades, and what we can learn from universities”

  1. I would imagine that the contracted ISPs could well have inspectors that specialise in schools with poor academic data (therefore more likely to be failing); they would certainly be allowed to have such a thing.

    Ofsted does have a moderation procedure, with another inspector looking through the evidence files and proofing the report (I am fairly certain this is how it works), although this is not always perfectly robust, as the farce of January 2014 will attest.

    I would be inclined to check the geographical and size element. I vaguely remember that small schools are more likely to be judged inadequate (surely that can’t just be down to this inspector), and if they were also in a poor region, that could explain much of the variation you observe, I suspect.

    The other thing I would mention is that it is not just your sample size of inspections that matters; it is the sample size of inspectors. If we studied all inspectors, we would expect some to just observe outstanding practice and others to just observe inadequate practice (even assuming there was no targeting of where inspectors go). It would be highly unlikely for all inspectors to follow the national distribution for inspection grades, and we should expect outliers.

    Sorry for the geeky reply.

    1. Don’t worry, geeky is good. My argument really hinges on the fact that we are expected to take a lot of this on the say-so with little qualifying evidence. That doesn’t engender a great deal of confidence in the system, in my view.

  2. To write a proper answer I need some surrounding information (I am Russian and don’t know the structure of OFSTED thoroughly). My basic question is whether there is a hierarchy among inspectors (you write of “double checking” by an external examiner) and whether the first three checks exist on the same level. Then, does the data in the tables contain a comparison of all levels of OFSTED checking with that of your control inspector? Now my views and proposals: for checking, inspectors should choose the core cases in education on the one hand, and some fragments from the ordinary ongoing teaching process on the other; then it is advisable to check one and the same case but at different times (to exclude chance influences on the checking); try checking both with and without warning teachers (but creating an atmosphere of co-operation and friendliness); when being checked, a teacher should not be afraid of not being up to the mark; and when checking, I think it is sometimes reasonable to give several attempts.
    Best regards, Tinyakova Elena

