Concerted efforts have been made in recent years to promote better practices in the evaluation of EU legislation and programs. The related guidance on the content and conduct of evaluations has been revised and extended, and processes have changed. A strengthened ‘evaluation cycle’ has been designed to provide the evidence needed at each stage of the policy cycle (see Figure 1).
The Regulatory Scrutiny Board (RSB) provides early-stage advice to Commission services on evaluation design. It assesses the evaluations against six parameters and gives them either a positive or negative rating in a check and feedback system, with the goal of fostering good practice.
It is widely recognized that more progress is required to deliver consistently strong evaluative evidence on the performance of EU policies and programs. Evaluation designs are often not capable of generating robust estimates of impact in ways that give confidence that the intervention caused the observed change, confirming ‘what worked.’
Evaluation specifications are often loaded with more questions than can be addressed with the time or resources available. The data needed to achieve deep insight are often scarce.
In its role as a supplier of evaluation services to directorates-general across the Commission and EU agencies, ICF receives, assesses, and responds to dozens of evaluation specifications every year. We see the same challenges repeatedly appearing in different policy contexts. There are many examples of good practice in individual policy areas, but there is a need to mainstream these across the evaluation community.
As evaluation professionals, we believe that further improvement in the performance of the EU’s ‘evaluation system’ is achievable. Targeted adjustments could help the insights gained at each stage inform the decisions made at the next step of the policymaking process, facilitating progress towards our collective goal of better public policy. These adjustments are captured in our six-point plan, as summarized below.
Ambition 1: An intervention logic and theory of change should be prepared and published soon after the measure is adopted
Too many evaluations still start with the development of an intervention logic for the policy or program that the study is to address. It may be that no intervention logic was prepared when the measure was designed some years previously. Alternatively, an intervention logic may have been developed as part of the ex-ante impact assessment, but changes made during the political process meant that it no longer fit the measure that was adopted.
Retrospectively building an intervention logic several years after the policy was adopted is not optimal. The assumptions of the original architects of the policy about how and why it would work, and what would happen in the absence of a new intervention, are rarely well documented.
Contemporary evaluators find themselves projecting their perceptions of the original problem onto the past, and developing post hoc assumptions about how the measure was supposed to function.
This problem could be addressed by requiring an intervention logic as part of the post-adoption protocols for any new policy or program. The intervention logic should be accompanied by a description of the theory of change—i.e., a narrative explanation of how the intervention is expected to work—including the assumptions about matters such as the wider conditions within which it is implemented and the potential risks.
If it is not already well documented in the impact assessment, the documentation should also include an explicit discussion of the ‘do nothing’ scenario that was the alternative to the policy or program, as projected into the future.
Ambition 2: An evaluation plan should be prepared and published before the measure is implemented
Another problem that bedevils ex-post evaluation is that the information needed to address the questions that matter has not been collected. In many instances, the ex-post evaluation comes too late for accurate data to be recovered. Implementation costs were not measured, unsuccessful program applicants have disengaged, and memories of how decisions were influenced have faded.
It would be better to ensure from the outset that the monitoring system is aligned to the evaluation needs, and that information is collected at the right time. Preparation of an evaluation framework that identifies the critical questions and the knowledge that will be needed to address them would do this.
The intervention logic specified above should be accompanied by an evaluation plan that defines the key questions to be addressed by a future evaluation, and identifies the data that will need to be collected to address them. It could build on the monitoring and evaluation plan that should be included as part of the ex-ante impact assessment—updated and expanded as required to capture the impacts of any changes made during the legislative process.
It should identify data sources and where additional research effort is likely to be needed. Requirements for the collection of monitoring data by program participants, Member State authorities, etc. should be specified in detail to ensure that the figures are comparable. The plan should include measures to check, on an ongoing basis, that the data collected are fit for purpose.
In an EU context, this advance planning increases the scope to align Member States’ evaluation plans with the evaluations conducted at EU level, and specify more closely the requirements for data collection at Member State level. It would give more power to the collective EU and national investment in evaluation.
It can also ensure that research effort is focused on the expected impact ‘hot spots’. If, for instance, significant adjustment costs are expected when new legislation comes into force, then arrangements should be made for specific research at that point.
For it to be fit for purpose, this evaluation plan may require scoping studies that examine the options and their respective merits in more detail. Such studies can create space for innovative thinking and testing the application of new evaluation approaches.
A requirement for an independent quality check on evaluation plans could help to ensure a high standard of development. There could be a tiered approach in which the RSB assessed plans linked to more significant legislation.
Ambition 3: Make more use of robust evaluation designs that test for causal links between the intervention and the observed changes
Many EU evaluations use theory-based approaches to explore whether the measure worked as intended. Their appraisal of impacts usually involves comparison of the ‘before’ situation (which is often not well documented) with the ‘after’ situation, some years after the measure was adopted.
As an approach to identifying attributable impacts this evaluation design has severe limitations—it is not possible to isolate the effects of the intervention from all the other influences on the situation of concern. If factors unrelated to the measure may have contributed to the observed changes in the data, it is hard to make a robust case for what the actual effects of the policy were.
It can be possible to generate evidence of a higher standard if the impact evaluation design is specified early on, and ideally before the measure is implemented. Early preparation means that we do not miss opportunities to use an evaluation approach that provides robust evidence of a causal link between the intervention and the observed changes in the indicators of interest.
The Maryland Scientific Methods Scale (SMS), and variants thereof, is often referenced in this context. A representation of the scale, which originated in the evaluation of crime prevention measures in the U.S. but has since been applied to many other fields, is provided in Figure 2.
It describes a series of levels of increasingly robust types of evidence on the impact of an intervention. There are many variations of the Maryland scale, but all share the same basic structure. The ‘before and after’ comparisons typical of much EU evaluation activity today are situated at the lowest level. Evaluations that compare outcomes for those affected by the program with a counterfactual or control group will qualify for Level 3. When the ‘treatment’ applied by the program is randomly assigned and there are robust control groups, the evidence qualifies as Level 5 (a randomized controlled trial, or RCT).
In a public policy environment, the higher levels of the scale are often out of reach. Randomized application of EU legislation would not be desirable or feasible. But in programs, there are often opportunities to aim for Level 3.
For this to be achieved, planning must occur in advance so that the data needed on the control, or counterfactual, are provided for in the monitoring plan. A variety of analytical techniques are available to help in situations where counterfactuals may be hard to identify.
There are examples of counterfactual impact evaluation used for EU program evaluations, and the European Commission’s Joint Research Centre has a Centre for Research on Impact Evaluation. But studies of this design are still relatively uncommon in EU policy.
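To illustrate why data on a comparison group matter, the sketch below contrasts a simple ‘before and after’ estimate with a difference-in-differences estimate, one commonly used counterfactual technique. It is an illustrative assumption rather than a method prescribed here: the function names and all figures are hypothetical, and the approach is only valid if the treated and comparison groups would have followed parallel trends in the absence of the program.

```python
from statistics import mean


def before_after(treated_before, treated_after):
    """Naive estimate: change in the treated group only."""
    return mean(treated_after) - mean(treated_before)


def difference_in_differences(treated_before, treated_after,
                              control_before, control_after):
    """Change in the treated group minus change in the comparison group.

    Assumes both groups would have followed parallel trends
    had the program not been implemented.
    """
    treated_change = mean(treated_after) - mean(treated_before)
    control_change = mean(control_after) - mean(control_before)
    return treated_change - control_change


# Hypothetical outcome indicator for four participants and four
# comparable non-participants, measured before and after the program.
treated_before = [10.0, 12.0, 11.0, 13.0]
treated_after = [15.0, 16.0, 14.0, 17.0]
control_before = [10.0, 11.0, 12.0, 11.0]
control_after = [12.0, 13.0, 14.0, 13.0]

naive = before_after(treated_before, treated_after)
did = difference_in_differences(treated_before, treated_after,
                                control_before, control_after)

print(f"Before-and-after estimate: {naive}")       # 4.0
print(f"Difference-in-differences estimate: {did}")  # 2.0
```

In this made-up example, half of the observed change also occurs in the comparison group, so the before-and-after figure overstates the effect attributable to the program. The comparison-group figures can only be used if the monitoring plan provided for their collection from the outset.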
Beyond Brussels, some evaluation commissioners are now setting SMS Level 3 as the minimum standard for impact evaluations. The EU evaluation system is not yet ready for such a rule to be applied to European policies and programs, but there is certainly scope to set an ambition to do better.
In a context where there is ever-increasing pressure to demonstrate the impact of public spending, robust evidence of impact has significant value.
Ambition 4: Use scoping studies to help bring design innovation and new techniques into more substantial evaluations
Designing a robust assessment of a large, complex program with multiple objectives applied across 28 countries is a non-trivial challenge. Development of a creative evaluation design that makes the best use of available information and methods requires time and space and may benefit from a diversity of expert inputs.
Existing evaluation procurement makes innovation difficult. The incentives on both commissioner and contractor favor reduction of risk rather than methodological experimentation.
The evaluation study terms of reference typically prescribe the method in great detail. On the Commission’s evaluation framework contracts, contractors generally are given 10 or 15 working days from issue of tender to submission of their offer. They will often have had no advance notice of the tender’s release, so they have to focus much of their efforts within that period on finding background research and relevant subject matter expertise. They typically have little visibility of the existing data available to them.
As noted above, scoping studies can have an instrumental role in advance evaluation planning. Scoping studies, used selectively, can be equally useful in ex-post evaluations. Commissioned before the primary evaluation, they provide space to identify and test options, in addition to exploring ideas and taking advice from a wider variety of methodological and other experts.
We propose that scoping studies be used more extensively, with a focus on large and complex EU programs. These scoping studies should not be procured via the main evaluation contracts, and could perhaps involve the Joint Research Centre (JRC). The contractor and advisors involved in the scoping study should be prohibited from undertaking the subsequent evaluation to ensure a fair, competitive procurement of the follow-on main study.
Ambition 5: Ensure that impact assessment and evaluation research capture a representative sample of the target stakeholders
A common issue with both ex-ante impact assessments and ex-post evaluations is that they are not equipped to provide the robust estimates required by the specifications.
Rather than survey statistically representative samples of the affected stakeholders, they rely on engagement with intermediaries, such as industry associations, or a non-representative (and often small) sample of stakeholders. A public consultation does not provide equivalent results. The respondents are self-selected, and there is no way to know whether they are representative of the wider population of the target stakeholder group.
This issue frequently arises in the assessment of potential or actual impacts on small and medium-sized enterprises (SMEs), a constituency of specific concern in the development of EU policy.
In many sectors, SMEs are numerous. It is often the case that individual firms are not actively involved with representative organizations. Available SME panels have been limited in their coverage.
Collecting data from a large enough number of SMEs in all Member States for the results to be safely taken as representative of the broader population is not straightforward, and not without cost.
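As a rough indication of the scale involved, the sketch below applies the standard sample-size formula for estimating a proportion, with a finite population correction. It is a minimal illustration under stated assumptions: simple random sampling, full response, and the most conservative proportion of 0.5. The function name and the population figures are hypothetical.

```python
import math
from statistics import NormalDist


def required_sample_size(population, margin_of_error=0.05,
                         confidence=0.95, proportion=0.5):
    """Completed responses needed to estimate a proportion.

    Assumes simple random sampling and full response; real surveys
    need a larger issued sample to allow for non-response.
    """
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an effectively infinite population.
    n0 = z ** 2 * proportion * (1 - proportion) / margin_of_error ** 2
    # Finite population correction.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


# Hypothetical numbers of SMEs in a sector in three Member States.
sme_populations = {"Country A": 500, "Country B": 2_000, "Country C": 200_000}

for country, population in sme_populations.items():
    print(country, required_sample_size(population))
```

The required number of responses falls only slowly as the population shrinks, so results that are representative at Member State level need a few hundred completed responses in each country, which is what drives the cost.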
In some sectors, firms are not easily engaged via online surveys; telephone or even face-to-face methods may be required. And for ex-post evaluations, more than one wave of research may be needed.
The problem also arises where policy targets individual people, such as programs relating to employment, integration or health, or large-scale communication campaigns. Without direct, properly sampled evidence from those affected, it is much harder to develop meaningful assessments of impact.
If the specification and resourcing of ex-ante impact assessments and evaluations do not address this issue, there will continue to be higher levels of uncertainty attached to the estimates produced by impact assessments, especially those relating to measures that affect a large number of stakeholders.
A proportionate approach should be taken. Where there is a risk of substantial impacts on a large number of firms or individuals, the impact assessments should be resourced to assess them accurately.
The collection of data through large-scale surveys using traditional methods is expensive. For evaluations, there may be scope to cascade the sampling obligation into the agreement reached with the Member States on the evaluation plan. Looking ahead, there is also a need to find new sources of ‘ready-made’ data (i.e., big data) so that vast quantities of information can be gathered and analyzed cost-effectively for EU evaluation purposes. New data sources, including online data, and new methodologies will be needed to provide affordable, reliable insights about human behavior in this digital age.
Ambition 6: Develop a cross-cutting program to capture and disseminate lessons from impact assessments and evaluations to help improve evaluation practice
A large number of evaluations and impact assessment studies are completed each year by contractors working on behalf of the Commission and the EU agencies. Collectively, these represent a significant investment in policy analysis and a substantial body of evaluation practice. Yet little is done to capture the lessons available from individual studies, and from the portfolio as a whole, for evaluation practice and future policy development.
This could easily be resolved by:
- including a ‘lessons learned’ requirement in the terms of reference of evaluations. The evaluator would need to provide a commentary on the success of the applied methodologies, the lessons learned, and the utility of the program’s monitoring system and evaluation planning.
- supporting an analytical program that looks across the stock of evaluations and monitors the ongoing flow of new reports to identify common learning points, then using these learnings to provide practical advice to evaluation commissioners and practitioners for future application.
By taking this systematic approach, the program would be able to identify gaps in methodologies and resources that, if addressed, could help to improve evaluation practice.
The program could also look at the evidence on the efficacy of different policy measures, and on ‘what works’ in different areas of policy, to identify where there are critical gaps that could be addressed by commissioned research. In impact assessments, this cross-cutting approach would help to identify areas where more guidance would improve the consistency across impact assessment studies (e.g., in the use of standardized cost factors).
In this way, a small additional investment could significantly increase the added value of overall evaluation spending and help foster learning and continuous improvement in evaluation practice and policy design.