Validation of Performance Assessments
Norashid Bin DARNI
INTRODUCTION
Authentic assessment is a form of alternative assessment that differs in nature from standardized testing. As mentioned by Wellington, Thomas, Powell, and Clarke (2002), authentic assessment embodies a whole raft of alternative or non-traditional assessment techniques. These forms of assessment are required as a means of validating students’ ability to apply knowledge and skills to real-world situations. Posner (1994) also mentioned that authentic assessment should attempt to determine what a student knows and can do in real-life contexts that have purposes besides assessment and that, therefore, cut across subject matter boundaries. Authentic assessment is expected to stimulate students to develop skills or competencies relevant to their future workplace. But authenticity is not an ‘objective’ quality as such; it is subjective and depends on who is judging it. The authenticity of assessment is argued to be important for preparing students for an unpredictable global economy.
The concept of authentic assessment clearly has its roots in performance assessment, particularly of work skills requiring error-free performance of complex psychomotor and cognitive tasks with speed and accuracy. Authentic assessment has two essential elements. First, authentic tasks supply valid direction, intellectual coherence and motivation for day-to-day work. Second, performer-friendly feedback and guidance allow the learner to improve his/her performance over time.
There are many criteria or features of authentic assessment. Wiggins (1998) stated that authentic tasks should be realistic, require judgment and innovation, allow students to do the subject, and replicate or simulate the workplace; students must be able to use a repertoire of knowledge and skills and must be given ample opportunity to rehearse, practice, consult and get feedback. He also reasoned that authentic tasks must have evidence to validate skills. Newmann and Archbald (1992) also indicate that a major goal of authentic achievement is to cultivate the kind of higher-order thinking and problem-solving capacities useful to both individuals and society, so that the mastery gained in school is likely to transfer more readily to life beyond school. This is consistent with Wiggins’s (1998) point about simulating the workplace.
Authentic and direct assessments of performance are examined in the light of contrasting functions and purposes that have implications for validation, especially with respect to the need for specialized validity criteria tailored to performance-based assessment. The absence of reliable and valid performance-based assessment gives rise to issues in standards. Thus, there is a need to ascertain the reliability and validity of performance-based assessments so that the assessment results benefit students’ development.
Cumming (1997) stated that performance assessments or authentic assessments are related to learning that is performance-situated, problem-based and competence-focused. These are the various interpretations of authentic assessment that have been used by many authors. In fact, authentic assessments encompass all of these elements. Wiggins’s (1993) definition of performance assessment is that performance is the execution of tasks or processes which have to be assessed through actual demonstration, which is a productive activity. The construction of authenticity assumes that assessment of performance involves only direct observation of the performance. It is then necessary to assess the student within the relevant learning context. According to Cumming (1997), authentic assessments must be complex enough to allow students to exercise a repertoire of knowledge and skills.
The skills needed to perform tasks as a technician or as support personnel require a repertoire of competencies; no single skill is sufficient to perform a job function. For example, the scope of the IT Backup Specialist requires the person to understand networks, backup and disaster recovery, system administration, network security and storage networking. As such, mastery of theoretical knowledge alone is not adequate for vocational institute students to perform this job function well. Hands-on experience and technical skills are equally important for them to carry out the various related tasks well. On the job, errors or ignorance will not be tolerated, whereas the classroom is where we allow students to make mistakes, relearn, redo, reflect and make corrections. Therefore, an authentic learning environment should provide such avenues for our students to experience a simulated workplace within a classroom setting. Hence, based on the views of Cumming (1997) and Wiggins (1993), assessment of authentic learning would be best carried out through performance assessment. It is believed that practical lessons conducted in an authentic environment, coupled with performance assessment, would be ideal for vocational institutions, as the aim of authentic learning is to equip students with the skills to perform in industry.
In the context of vocational education, Gulikers, Bastiaens, and Kirschner (2006) pointed out that authenticity is expected to be crucial in preparing students for the dynamic world of work that characterizes current society. Authentic performance assessment is expected to stimulate students to develop skills or competencies relevant to their future professional workplace. They also mentioned that the concept of authenticity can be viewed from two different angles: the theoretical and the practical. The practical angle focuses on examining what determines authenticity in the perception of different users. The gaps between teaching in school and the real world, and between assessment tasks and what occurs in the real world, are major problems in education. School standards are not aligned to the expectations of the real world. They suggested that to foster authentic learning and to improve student achievement, it is imperative that authentic assessment is aligned to authentic instruction as well as to real-world expectations. By focusing on resembling real-world performance, performance assessment is likely to cover the relevant aspects of that performance. Authenticity is thus argued to be important for preparing students for the unpredictable world of work. They further contended that since education is to prepare students for the real world, at least in vocational education, the authenticity of an assessment should be defined by its resemblance to students’ current or future professional practice.
Formative and Summative Assessments
At the Institute of Technical Education (ITE), we create a learning environment based on the understanding that we need to prepare students to meet the needs of the workforce. As such, the assessment framework at ITE provides directions for our students to acquire technical knowledge and skills and to apply them in real-world contexts. The assessment framework follows a sequential format.

Figure 1. Assessment Framework
In Figure 1, formative assessments refer to our use of the assessment information from the authentic performance tasks to provide formative feedback to students based on the criteria and standards in the related rubrics. In other words, in the new assessment framework, teachers are expected to provide opportunities for students to receive formative feedback and improve the quality of their work over time. This is akin to bite-sized formative assessment, as we incorporate frequent, descriptive feedback after each performance task. There are four performance tasks, and each of them is given to students after a series of practical lessons. The final piece of work, representing students’ culminating performances, is a summative assessment. Therefore, assessment of competence is not carried out merely on the basis of a single snapshot of performance (i.e., summative assessment). Rather, feedback is given over time to inform students about the next steps to take in order to improve their performances (i.e., formative assessment).
Such an assessment framework gives lecturers or raters the ability to observe patterns of success and failure and the reasons behind them. We also created an authentic environment by purchasing up-to-date equipment of the kind currently used by companies. One of the key principles for the authenticity of performance assessment is that it must allow students to do the subject and to carry out exploration and work within the discipline (Wiggins, 1998). As such, we created an authentic environment so that students will have first-hand experience administering and managing the real equipment that they will encounter when they join the workforce.
Validation of Performance Assessment
The literature reviewed above points to the importance of performance assessment in preparing students for their future professional workplace. This has provided directions for vocational institutes to follow in redesigning assessments so that the tasks given to students are authentic to their learning. Wiggins (1998) suggests several indicators of authentic assessment, including observable performance dealing with ill-structured and complex issues. Developing an authentic performance assessment should start with an analysis of the professional practice and situation to find out what kind of knowledge, skills and attitudes/dispositions experts use when handling the situation and how they use them. This analysis provides an up-to-date performance standard for developing a performance assessment or for evaluating the authenticity of a performance assessment.
In vocational institutes or occupational types of education, where students are educated for a specific profession or are more practically than theoretically oriented, performance assessment should also be aligned to more occupation-specific content and values as well as to performance standards that define the knowledge, skills and attitudes/dispositions of a specific field of work. At ITE, we designed a curriculum that is well aligned with the standards needed for the future workplace, and hence well-developed performance assessments are needed to provide students with ample opportunities to learn and perform in an authentic environment. One of the clear claims made for introducing performance assessment is that it will contribute to improved student achievement. This is also the aim of the institution, which is to provide technical education to our students. Performance assessment is thus believed to be more consistent with a re-conceptualization of teaching and learning as a richer, context-bound experience that does not rely solely on rote learning and regurgitation of technical terms and procedural skills.
The validity of performance assessments is very important for curriculum developers, who need to examine the accuracy of the focus and context of the assessment. In addition, evidence for the validity of the assessment data derived from the performance assessments will increase public confidence in the use and interpretation of the data for making important decisions about students. According to Messick (1994), there are six aspects of construct validity that need to be examined in the validation of performance assessments. Each of them is described below:
Content
The content aspect of construct validity examines evidence about content, typically through an analysis of the content of the tasks, the curriculum and the domain theory. This aspect of construct validation relies on expert judgments about the boundaries of the curriculum, skills and content measured by the tasks. Content standards are expected to provide the basis for both curriculum and assessment specifications. Content analysis can occur in two phases: (1) initial task(s)/assessment development and (2) review after the assessment has been constructed. Phase 1 looks into developing assessment blueprints for the content and skills to be assessed. A blueprint consists of specific instructional objectives for the skills and the task(s) that must be carried out, and these must be consistent with the intent of the module. The documentation must also provide clear, specific standardization for the scoring of the assessment, in the form of rubrics. It must provide justification for the scores that assessors award for students’ performance during the assessment. The relative weightage and criticality of each specific task must also be identified very clearly. Phase 2 provides evidence regarding the judgments of the assessment developers. Curriculum development has to be balanced and focused on the characteristics of the task(s). The assessment criteria cannot be too few, as this will lead to a construct interpretation that is too narrow (i.e., construct underrepresentation). Therefore, curriculum developers must look at the breadth of the task and examine the relationship between the task and the domain objectives of the program.
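The role of relative weightage and criticality in a Phase 1 blueprint can be sketched in code. The following is a minimal illustration only: the criterion names, weights, rubric levels and the rule that every critical criterion must reach a minimum level are hypothetical assumptions, not the actual ITE rubric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float    # relative weightage; weights are assumed to sum to 1.0
    max_level: int   # highest rubric level for this criterion
    critical: bool   # True if the criterion must be met regardless of total score


def score_task(blueprint, levels, critical_min=1):
    """Return (weighted percentage, critical criteria met) for one student.

    `levels` maps each criterion name to the rubric level awarded.
    `critical_min` is an assumed minimum level for critical criteria.
    """
    total_weight = sum(c.weight for c in blueprint)
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("criterion weights must sum to 1.0")
    percentage = 100.0 * sum(
        c.weight * levels[c.name] / c.max_level for c in blueprint
    )
    critical_met = all(
        levels[c.name] >= critical_min for c in blueprint if c.critical
    )
    return percentage, critical_met


# Hypothetical blueprint for a backup-and-restore performance task.
BLUEPRINT = [
    Criterion("configure_backup_job", 0.4, 4, True),
    Criterion("verify_restore", 0.4, 4, True),
    Criterion("document_procedure", 0.2, 4, False),
]

if __name__ == "__main__":
    awarded = {"configure_backup_job": 4, "verify_restore": 2, "document_procedure": 4}
    print(score_task(BLUEPRINT, awarded))
```

Keeping the weights and the criticality flags in the blueprint itself, rather than in each assessor's head, is one way to make the justification for an awarded score explicit and reviewable in Phase 2.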
Substantive
This aspect of construct validity examines the thinking skills that underlie performance on the task(s). Just as tasks need to be representative of and relevant to content and skill specifications, the processes used in completing the tasks need to be representative of and relevant to the processes that constitute the construct of interest. Messick (1996) stated the need to acquire empirical evidence that the ostensibly sampled processes are actually engaged by respondents in task performance. This includes examining responses using think-aloud protocols or empirical investigations of process models. A think-aloud is a qualitative method of judging the process used to solve a problem by asking students to solve the problem orally. The processes expected by content standards can also be signalled by words such as “explain”, “describe” and “analyze”. Apart from reviewing the tasks, the review of scoring rubrics needs to show that scores are based on the successful completion of a process.
Structural
This aspect of construct validity examines the scoring system (i.e., rubrics) as it relates to the construct domain. It is also crucial to obtain comparable scores, especially when multiple raters judge students’ performances. Hence, inter-rater reliability is essential to ensure the consistency of teachers’ judgement. There are several models for reliable scoring, which include multiple readers, anchors or benchmarks, adjudication, training and calibration checks, and social and statistical moderation. These can lead to high levels of rater consistency (Brennan, 1996; Miller, 1998; Shavelson, Baxter, & Gao, 1993).
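One common way to quantify the rater consistency discussed above is to compute the proportion of exact agreement together with Cohen's kappa, which corrects that proportion for agreement expected by chance. The sketch below assumes two raters who each award one rubric level per student; the function names and the sample ratings are illustrative and are not drawn from the ITE study.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Proportion of students given the same rubric level by both raters."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of students")
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters."""
    n = len(rater_a)
    p_o = percent_agreement(rater_a, rater_b)
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    # Chance agreement: product of the two raters' marginal proportions per level.
    p_e = sum(count_a[level] * count_b[level] for level in count_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical level")
    return (p_o - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Hypothetical rubric levels (1 to 4) awarded to ten students.
    first_rater = [3, 3, 2, 4, 1, 2, 3, 4, 2, 3]
    second_rater = [3, 2, 2, 4, 1, 2, 3, 3, 2, 3]
    print(percent_agreement(first_rater, second_rater))
    print(cohens_kappa(first_rater, second_rater))
```

Kappa treats rubric levels as unordered categories; for ordered rubric levels a weighted kappa or an intraclass correlation may be more appropriate, which is a design choice left open here.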
Generalizability
This aspect of construct validity concerns the consistency of assessment results across levels of random facets of the assessment. Generalizability studies treat raters and tasks as sources of error in examining the consistency of assessment procedures under different conditions of raters and tasks. For performance assessments, generalizability has been a focal point in the examination of validity. Major concerns have centred on raters' scoring. Most scoring methods have been refined many times so that they increase inter-rater reliability. Moderation between raters is also an element that increases the reliability of scoring. Shavelson, Baxter, and Gao (1993) mentioned that only two solutions can reduce the error associated with task heterogeneity: (1) the number of tasks can be increased and (2) the construct and task can be defined more narrowly. However, the first solution may not be practical because of assessment time limitations; students may not be able to complete too many performance-based tasks. The second solution may make the inferences from the task scores too narrow, as the construct of the assessment becomes narrower than it was previously. Miller (1998) reported that the number of tasks needed to attain comparable levels of generalizability varies by the type of task. Longer, more complex tasks require fewer tasks for adequate levels of generalizability. In examining generalizability, there is also a need for a greater understanding of the sources of person-by-task interaction and how it can be reduced as an error source for performance assessments. In short, generalizability studies are important for understanding the use of performance assessment as an accountability measure (i.e., summative assessment).
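The trade-off between the number of tasks and the number of raters can be made concrete with the relative generalizability coefficient for a fully crossed person x task x rater design, in which person variance is divided by person variance plus the person-by-task, person-by-rater and residual variances, each averaged over the number of tasks and raters. The sketch below assumes that design; the variance components are hypothetical values chosen for illustration and were not estimated from ITE data (in practice they would come from a generalizability study).

```python
def g_coefficient(var_p, var_pt, var_pr, var_ptr_e, n_tasks, n_raters):
    """Relative generalizability coefficient for a crossed p x t x r design.

    var_p     : variance among persons (the signal of interest)
    var_pt    : person-by-task interaction variance
    var_pr    : person-by-rater interaction variance
    var_ptr_e : residual variance (three-way interaction plus error)
    """
    if n_tasks < 1 or n_raters < 1:
        raise ValueError("at least one task and one rater are required")
    relative_error = (
        var_pt / n_tasks
        + var_pr / n_raters
        + var_ptr_e / (n_tasks * n_raters)
    )
    return var_p / (var_p + relative_error)


if __name__ == "__main__":
    # Hypothetical components with a large person-by-task interaction.
    components = dict(var_p=0.30, var_pt=0.40, var_pr=0.02, var_ptr_e=0.28)
    for n_tasks in (1, 2, 4, 8):
        for n_raters in (1, 2):
            coefficient = g_coefficient(**components, n_tasks=n_tasks, n_raters=n_raters)
            print(f"tasks={n_tasks} raters={n_raters} G={coefficient:.3f}")
```

With components like these, where person-by-task variance dominates, adding tasks raises the coefficient far more than adding raters does, which is the pattern behind the first of the two solutions described above.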
External
This aspect of construct validity examines the relationship of task scores to variables external to the assessment tasks, which provides another important source of validity evidence. This type of evidence is used to examine whether the observed relationships are consistent with the knowledge base and theory of the construct. Student performance during the performance assessments should be related to instructional effects. Wiggins (1989) mentioned that the impetus for performance assessments is that they should mirror the teaching and learning process and provide a better measure of accountability.
Blank and Engler (1992) reported increases in achievement across time on performance assessment. However, this type of data does not examine whether students do better on particular tasks depending on the form or content of the assessment, or whether learning really occurs on the broader construct. Firstly, all achievement measures should be sensitive to instructional effects; much of the impetus for performance assessments is that they should mirror the teaching and learning process and provide a better measure of accountability. Secondly, construct variance should be higher than method variance, as there is a range of methods that can be used to measure the same construct. Thirdly, the assessment should be fair and should not give any subpopulation an advantage based on construct-irrelevant factors such as gender or ethnicity.
Consequential
This aspect of construct validity focuses on the intended and unintended consequences of the use of assessment information and the impact on score interpretation and use. Inter-rater reliability is the extent to which two or more individuals (coders or raters) agree. It addresses the consistency of the implementation of a rating system. It involves the presence of two or more assessors during an assessment and aims to establish a common understanding of the standards of assessment among them. Different assessors may award different scores based on their own judgment. Hence, inter-rater reliability guards against bias and uneven standards.
Shepard (1993) argued that consequences are an integral part of validity because they affect the overall construct use and interpretation of test scores or assessment information. A distinction needs to be drawn between consequences of assessments that do not affect the inferences and uses, and consequences of assessments that do affect the inferences or uses. The former are not a part of validity, whereas the latter are. Consequences can be intended or unintended. Intended consequences might include changes in the instructional and curricular practices of teachers that lead to better learning environments for students (Linn & Baker, 1996). Unintended consequences might include bias in the assessment, leading to misinterpretations for some subpopulations (Bond, 1995).
In addition to Messick’s (1994) six aspects of construct validity, Linn, Baker and Dunbar (1991) discussed eight validation criteria for performance assessments. Four of these eight criteria are identical to those discussed by Miller and Linn (2000). The four that are distinct from those above are:
Fairness
Fairness looks into the equality of opportunities, without bias towards any race, gender, or language. It also looks into the fairness of the selection of performance-based tasks and of the scoring of responses. Stiggins (1987) stated that it is critical that the scoring procedures are designed to assure that performance ratings reflect the true capabilities of a student and are not a function of the perceptions and biases of the judges evaluating the performance based on the characteristics of the student.
Content Quality
This criterion looks into the quality of the content, which must be consistent with the best current understanding of the field and at the same time reflective of what are judged to be aspects of quality. Moreover, the tasks selected to assess a given content domain should themselves be worthy of the time and effort of students and raters. One strategy to assure content quality is to involve subject matter experts not only in task review but also in curriculum design.
Content Coverage
This criterion refers to the comprehensiveness, or scope, of content coverage. Collins, Hawkins, and Frederiksen (1990) noted that if there are gaps in coverage, teachers and students are likely to underemphasize those parts of the content domain that are excluded from the assessment. Inadequate content coverage may lead not only to misleadingly high scores, but also to a distortion of the instruction provided.
Meaningfulness
This criterion refers to the meaningfulness of the tasks to students. The rationale is that students get to deal with meaningful problems that provide worthwhile educational experiences. Task analysis can provide some information relevant to this criterion. Studies such as the NAEP (National Assessment of Educational Progress) have shown that meaningfulness of tasks during practical lessons leads to high levels of motivation. Ames (1992) stated that tasks are dimensions within a classroom that influence a student's level of motivation.
The above-mentioned criteria are very useful in providing a direction for us to validate our performance assessment, which is currently being pursued.
Effectiveness of Formative Assessment
Importance of Feedback
In our interviews, some of the ITE students who had gone through the performance assessment at ITE stated that receiving feedback during the formative assessments was helpful, as it provided them with useful information to enhance their performances over time. They stated that their teachers provided them with the necessary guidance and assistance during the reviews of their performance. This helped them discover their mistakes and made them more open to improving their performances.
Increased Confidence Level
The students stated that the formative assessments also increased their confidence for the summative assessment. They shared that because they had opportunities to model their work after their teachers and to practice the skills taught, they became more confident in the summative assessment. They also mentioned that knowing what to expect from the assessment rubrics helped alleviate their worries.
Relevance of Formative Assessments to the Summative Assessment
The students found that knowing how they had performed, and what actions or steps to take based on the formative feedback over time, helped them to be persistent on tasks. They also found the knowledge and skills they acquired over the multiple performance tasks to be very relevant to the knowledge and skills needed in the summative assessment.
REFLECTION ON THE BENEFITS AND CHALLENGES
Messick’s (1994) six aspects of construct validity provide an excellent framework for the validation study of the performance assessment at ITE. The framework also allows us to identify the strengths and limitations of the implementation of the performance assessment.
The analysis above revealed that inter-rater reliability is not currently established. In the current practice, a single assessor conducts the assessment single-handedly and awards the scores. It can prove very challenging for one assessor to handle a class of 35 students alone. The presence of a second rater would make the scoring more comparable. The scores of the first and second raters would be compared for any discrepancy, and the raters would need to arrive at a consensus score on each criterion of the assessment rubrics. This procedure helps ensure that the task scores given to the students are reliable and valid, as it helps minimize rater bias.
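The two-rater procedure described above (compare scores, find discrepancies, then settle a consensus score per criterion) can be sketched as follows. This is an illustrative assumption about how such a check might be recorded, not a description of an existing ITE system; the criterion names, the `tolerance` parameter and the function names are hypothetical.

```python
def flag_discrepancies(first, second, tolerance=0):
    """Return criteria whose two ratings differ by more than `tolerance` levels.

    `first` and `second` map criterion names to the rubric level awarded
    by the first and second rater respectively.
    """
    if first.keys() != second.keys():
        raise ValueError("both raters must score the same criteria")
    return {
        criterion: (first[criterion], second[criterion])
        for criterion in first
        if abs(first[criterion] - second[criterion]) > tolerance
    }


def final_scores(first, second, consensus):
    """Combine ratings: agreed scores stand, discrepant ones use the consensus.

    `consensus` holds the level the raters settled on after discussion for
    every criterion on which their ratings were not identical.
    """
    flagged = flag_discrepancies(first, second)
    missing = set(flagged) - set(consensus)
    if missing:
        raise ValueError(f"no consensus recorded for: {sorted(missing)}")
    return {
        criterion: consensus[criterion] if criterion in flagged else first[criterion]
        for criterion in first
    }


if __name__ == "__main__":
    # Hypothetical ratings for one student on a three-criterion rubric.
    rater_1 = {"setup": 3, "restore": 2, "report": 4}
    rater_2 = {"setup": 3, "restore": 4, "report": 4}
    print(flag_discrepancies(rater_1, rater_2))
    print(final_scores(rater_1, rater_2, consensus={"restore": 3}))
```

Recording which criteria needed a consensus discussion also gives a simple running indicator of where the rubric descriptors may be ambiguous.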
A major challenge that educators in vocational institutes face is ever-changing technology. In the 21st century, as technology changes rapidly, the competencies required to perform a job function expand. Hence, educators must continually equip themselves with current skills and competencies before they can impart them to students. We also note from the interviews conducted that formative assessments are critical in building students’ confidence and equipping them with the necessary knowledge and skills to attempt the summative assessment.
Students also highlighted the importance of feedback in helping them know their areas for improvement. Formative feedback informs them how to improve the quality of their work. According to the students interviewed, feedback and continuous learning through formative assessments enhanced their performance during the summative assessment. Performance assessment also supports alignment with authentic learning, which is very important for students to experience the real-world skills needed in the future workforce. We have also seen that formative assessment is important in developing and scaffolding the competencies required for students to perform a task. Hence, it is critical that we place greater emphasis on the formative assessment component in the design and implementation of performance assessments at ITE.
The importance of this study lies in validating the current practices in the deployment of performance assessment. As we are not aware of any other post-secondary institution that has performed such a validation study, it is a good step for us to reflect on our current practice and seek ways to improve it. The study helps us gather feedback from students on the implementation of the performance assessment and formative assessment, which focus on assessing both process and product.
Educators at vocational institutions must see the value of performance assessments and what they aim to achieve. The importance of feedback and of formative and summative assessments must be embraced by the teaching staff at vocational institutes. Feedback on students’ performance after the formative assessments could be given in many creative ways, whether individually or in groups. Feedback given to students aims not only to highlight their strengths and weaknesses but also to suggest further steps they need to take to improve the quality of their work.
High-quality professional development should be given to ITE lecturers to educate them on curriculum design and the development of performance assessments. ITE lecturers who are new to formative assessment practice need to be trained in the design and implementation of formative assessment, which is embedded in the performance-based tasks. In short, the task of improving the assessment literacy of ITE lecturers through effective professional development is as important as it is for teachers in schools.
References
Ames, C. (1992). Classrooms: Goals, structures, and student motivation. Journal of Educational Psychology, 84(3), 261-271.
Anastasi, A. (1950). The concept of validity in the interpretation of test scores. Educational and Psychological Measurement, 10, 67-78.
Baker, F.B. (1964). An intersection of test score interpretation and item analysis. Journal of Educational Measurement, 1(1), 23-28.
Brown, J.D. (2004). Performance assessment: Existing literature and directions for research. Second Language Studies, 22(2), 91-139.
Buckendahl, C., Smith, R., Impara, J., & Plake, B. (2002). A comparison of Angoff and Bookmark standard setting methods. Journal of Educational Measurement, 39(3), 253-263.
Cizek, G. (1993). Reconsidering standards and criteria. Journal of Educational Measurement, 30(2), 93-106.
Freebody, P. (2005). Queensland Curriculum, Assessment and Reporting Framework. Brisbane: The University of Queensland.
Gulikers, J. T. M., Bastiaens, Th. J., & Kirschner, P. A. (2006). Authentic assessment, student and teacher perceptions: The practical value of the five-dimensional framework. Journal of Vocational Education and Training, 58, 337-357.
Haladyna, T.M., Nolen, S.B., & Haas, N. (1991). Raising standardized achievement test scores and the origins of test score pollution. Educational Researcher, 20(5), 2-7.
Hambleton, R.K. (1999). Setting performance standards on educational assessments and criteria for evaluating the process. University of Massachusetts at Amherst.
Impara, R., & Plake, B. (1997). Standard setting: An alternative approach. Journal of Educational Measurement, 34(4), 353-366.
Jackson, T., Graugalis, J., Slack, M., & Zachry, W. (2002). Validation of performance assessment: A process suited for Rasch modelling. American Journal of Pharmaceutical Education, 66.
Kane, M. (1994). Validating the performance standards associated with passing scores. Review of Educational Research, 64(3), 425-461.
Kane, M., Crooks, T., & Cohen, A. (1999). Validating measures of performance. Dunedin: University of Wisconsin and University of Otago.
Kelly, T. L. (1921). The reliability of test scores. The Journal of Educational Research, 3(5), 370-379.
Linn, R. (1994). Performance assessment: Policy promises and technical measurement standards. Educational Researcher, 23(9), 4-14.
Linn, R., Baker, E., & Dunbar, S. (1991). Complex, performance-based assessment: Expectations and validation criteria. Educational Researcher, 20(8), 15-21.
Linn, R., & Baker, E. (1996). Can performance-based student assessments be psychometrically sound? In J. B. Baron & D. P. Wolf (Eds.), Performance-based student assessment: Challenges and possibilities. Ninety-fifth Yearbook of the National Society for the Study of Education, (pp. 84-103). Chicago: University of Chicago Press.
Messick, S. (1989). Meaning and values in test validation: The science and ethics of assessment. Educational Researcher, 18(2), 5-11.
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13-23.
Messick, S. (1998). Test validity: A matter of conscience. Social Indicators Research, 45, 35-44.
McMullen, C., & Braithwaite, I. (2005). “Authentic” assessment: Workplace based formal learning setting the stage for ongoing informal learning. Charles Sturt University.
Miller, M., & Linn, R. (2000). Validation of performance-based assessments. University of Florida and University of Colorado.
Moss, P.A., Girard, P.J., & Haniford, L.C. (2006). Validity in educational assessment. Review of Research in Education, 30, 109-162.
Newmann, A., & Archbald, D. (1992). The nature of authentic academic achievement. SUNY Press, 71-84.
Posner, G. (1994). The role of assessment in curriculum reform. Peabody Journal of Education, 69(4), 91-99.
Primo, M.A.R., Baxter, G.P., & Shavelson, R.J. (1993). On the stability of performance assessments. Journal of Educational Measurement, 30(1), 41-53.
Shavelson, R., Baxter, G., & Gao, X. (1993). Sampling variability of performance assessments. Journal of Educational Measurement, 30(3), 215-232.
Shepard, L. (2000). The role of assessment in a learning culture. Educational Researcher, 29(7), 4-14.
Spencer, B.D. (1983). On interpreting test scores as social indicators: Statistical considerations. Journal of Educational Measurement, 20(4).
Stiggins, R. (1987). Design and development of performance assessments. Educational Measurement: Issues and Practice, 6(3), 33-42.
Terwilliger, J. (1997). Semantics, psychometrics and assessment reform: A closer look at authentic assessments. Educational Researcher, 26(8), 24-27.
Walker, D.F., & Schaffarzick, J. (1974). Comparing curricula. Review of Educational Research, 44(1), 83-111.
Wellington, P., Thomas, I., Powell, I., & Clarke, B. (2002). Authentic assessment applied to engineering and business undergraduates consulting teams. International Journal of Engineering Education, 18(2), 168-179.
Wiggins, G. (1998). Educative assessment: Designing assessments to inform and improve student performance. San Francisco: Jossey-Bass.
Willingham, W.W., Pollack, J.M., & Lewis, C. (2002). Grades and test scores: Accounting for observed differences. Journal of Educational Measurement, 39(1), 1-37.