Focusing on a number of separate language sub-skills or categories can also help to improve marker reliability: the assessor gives each test taker a separate mark for each category. These separate marks are then combined to give the overall score, a process that involves weighting. In most oral tests or test tasks, some categories are emphasized more than others according to the test purpose(s), so a weighting system is used, as shown in the following example taken from Underhill (1987, p. 97).

Grammar: marked out of 10, then multiplied by 3

Vocabulary: marked out of 10, then multiplied by 3

Pronunciation: marked out of 10, then multiplied by 2

Fluency: marked out of 10, then multiplied by 1

Content: marked out of 10, then multiplied by 1

Total score 10
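
The arithmetic behind such a weighting system can be sketched in a few lines of Python. This is only an illustration: the category weights are those in the example above, while the function name `weighted_total` and the sample marks are hypothetical and do not come from Underhill. With five categories each marked out of 10 and weights summing to 10, the highest possible weighted total is 100.

```python
# Weights taken from the example above (Underhill, 1987, p. 97).
WEIGHTS = {
    "grammar": 3,
    "vocabulary": 3,
    "pronunciation": 2,
    "fluency": 1,
    "content": 1,
}
MAX_RAW_MARK = 10  # each category is marked out of 10


def weighted_total(marks):
    """Return the weighted sum of the per-category marks (hypothetical helper)."""
    if set(marks) != set(WEIGHTS):
        raise ValueError("marks must cover exactly the weighted categories")
    for category, mark in marks.items():
        if not 0 <= mark <= MAX_RAW_MARK:
            raise ValueError(f"{category} mark {mark} is outside 0..{MAX_RAW_MARK}")
    return sum(WEIGHTS[c] * marks[c] for c in WEIGHTS)


# Hypothetical marks for one test taker (illustrative only, not from the source).
sample_marks = {
    "grammar": 7,
    "vocabulary": 6,
    "pronunciation": 8,
    "fluency": 5,
    "content": 9,
}
total = weighted_total(sample_marks)
maximum = MAX_RAW_MARK * sum(WEIGHTS.values())
print(f"{total} out of {maximum}")  # prints "69 out of 100"
```

Keeping the weights in one table makes the test purpose explicit: changing the emphasis of the test means changing a weight, not the marking procedure for each category.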

In sum, the marking key plays an essential role in the design of language tests in general, and of oral language tests in particular, in ensuring reliability. It must be involved in the whole process of test development from the beginning. Language teachers or test developers should thus give careful consideration to how test performance will be marked throughout the test construction process.

In operationalizing an oral test, consequently, language teachers or test designers must take great care not only over the selection of test types and proper elicitation techniques for the intended test tasks but also over the design of a marking key for each test task.

2.5 qualities of a good test

The previous three sections are concerned with the techniques and procedures for developing oral language tests, whereas Section 2.5 deals with the qualities of a good test, i.e. whether the test results can reveal test takers’ actual ability to use the language orally. A test used to elicit test takers’ actual language proficiency must possess such qualities as validity, reliability and practicality.

2.5.1 validity

Test validity is generally concerned with the degree to which a test actually measures what it is supposed to measure. In other words, it refers to the correspondence between the abilities to be assessed and the real indication of these abilities in a test, so a test is said to be invalid when there is no relationship between them. The concept of validity includes such aspects as content validity, construct validity and predictive validity. A test is said to have content validity if its content represents a sample of the language skills, structures, etc. with which it is meant to be concerned (Hughes, 1989). When embarking on test construction, a test writer should first draw up a table of test specifications, describing in clear and precise terms the particular language skills and areas to be included in the test. No less important is the construct validity of a test. A test with construct validity is capable of measuring certain specific characteristics in accordance with a theory of language behaviour and learning. In other words, construct validity ‘examines whether the instrument permit inferences about underlying abilities.’ (Cohen, 1996). According to Hughes (1989), the word ‘construct’ refers to ‘any underlying ability or trait which is hypothesised in a theory of language ability’. This ability or trait is defined by Bachman and Palmer (1996) as ‘the domain of generalisation to which our score interpretations generalize’. Certain learning theories or constructs are believed to underlie the acquisition of abilities and skills. The third aspect, predictive validity, concerns the degree of agreement between the results of the test and those provided by some important task at some future point.
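
One common way to quantify the agreement between test results and a later criterion measure is a correlation coefficient. The sketch below computes Pearson's r in plain Python. The choice of Pearson's r, the function name `pearson_r`, and both score lists are assumptions made purely for illustration. They are not taken from the sources cited above.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length, at least 2 items each")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squared deviations from the means.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when one list has no variation")
    return cross / sqrt(ss_x * ss_y)


# Hypothetical data: oral test scores and ratings on a later real-world task
# for the same five test takers (invented for illustration only).
oral_test_scores = [69, 55, 82, 40, 73]
later_task_ratings = [72, 50, 85, 45, 70]

r = pearson_r(oral_test_scores, later_task_ratings)
print(f"agreement between test and later task: r = {r:.2f}")
```

A coefficient close to 1 would suggest that the test ranks test takers much as the later task does, while a value near 0 would suggest little predictive relationship.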

