The Present Participle: February 2009

My first essay is DONE!

A Discussion and Evaluation of Essential Quality Standards in Approaches to the Psychometric Assessment of Personality

Luke Fullagar
Monash University

Abstract

Diagnostic and predictive value of personality assessment is primarily contingent on the psychometric quality standards of standardisation, reliability and validity. Standardisation enforces uniformity in scoring and administration, and normalises test performance against identifiable samples. Reliability studies measure consistency between testing contexts such as time intervals, between test items, on alternate test forms and between different examiners. Three predominant validity studies evaluate a test’s accuracy in assessing an intended construct: construct (whether a test accurately captures the construct), content (whether the test adequately, accurately and proportionately covers critical aspects of the construct), and criterion-related validity (a comparison of test scores to relevant criterion measures, measured either contemporaneously or at interval. Syncretic integration of theoretical and empirical approaches is preferred for future development of scientific personalty assessment.

A Discussion and Evaluation of Essential Quality Standards in Approaches to the Psychometric Assessment of Personality.

While subjectively-focussed approaches to personality such as psychoanalytic observation, humanistic psychotherapeutic models and projective techniques have sought to apply and further generate personalty theories through idiographic means, strong criticism has been levelled at these approaches for deficiency in the provision of empirically substantiated conclusions (Meier, 1994). Over the last century, nomothetic methods have developed to subject personality theory to scientific scrutiny, generate original theoretical frameworks (e.g. Cattell’s 16PF (Cattell, 1946), five factor personality model (McRae & Costa, 1987)) and develop exacting objective tests for the measurement of personality constructs in both general settings (e.g. the Minnesota Multiphasic Personality Inventory (MMPI) or NEO Personality Inventory which relies on the five-factor model (Costa & McRae, 1985)) and clinical diagnostic settings (e.g. the Millon Clinical Multiaxial Inventory, now closely aligned with personality disorder constructs in the DSM-IV (Millon & Meagher, 2004)) (Coolidge & Segal, 2004).
While myriad methods of personalty testing are in operation across clinical and public settings, including objective questionaries, checklists, sentence completion tests, interviews, self-report and peer-reported ratings, thematic apperception tests, figure drawing and Rorschach or Holtzman inkbot tests (Eysenck, 2004), the diagnostic and predictive value of these methods hinge on the extent to which they meet the psychometric quality standards of standardisation, reliability and validity.

Standardisation

Test standardisation serves two main purposes: to generate appropriately controlled conditions for scientific observation via uniformity in administration and scoring procedure (Anastasi & Urbina, 1997), and to generate interpretive value by ensuring measurement of a respondent's relationship with test constructs is not absolute, but rather, relatively compared to normal or average performance of an identifiable sample group (Coolidge & Segal, 2004; Kline, 1993).
Uniformity in procedure appropriately limits a test’s diagnostic and predictive relevance to standardisation sample profiles and situational contexts assessed in test development. Standardising procedural controls advances the likelihood that a respondent's relationship to the test construct will be the central independent variable of measurement (Anastasi & Urbina, 1997). In practice, test manuals manage subtle factors influencing performance through formulised administration directions (Coolidge & Segal, 2004) covering variables as disparate as time limits, required and prohibited materials, oral instructions, preliminary demonstrations, handling methods for test taker queries, speaking pace, voice tone and inflection, pauses, and facial expressions (Anastasi & Urbina).
Since no true zero on measurement scales arises in psychological testing, norms are necessary to evaluate a test taker’s performance relative to both the average performance of standardisation samples (large, representative samples with profiles (e.g. race, age, gender) consistent with the test’s intended design) and any standard deviations from this norm (frequency of varying degrees of deviation above and below the average) (Anastasi & Urbina, 1997).
Size and representativeness are the key variables in assessing adequacy of standardisation samples. Samples must reliably reflect the intended population and be sufficiently large to render standard errors of descriptive statistics (i.e. mean, standard deviation and distribution) to appropriately low levels (Kline, 1993). Because differing degrees of homogeny arise within samples, systematic statistical methods of random and stratified sampling are employed to generate representativeness (Shum, O‘Gorman & Myors, 2006). Samples are considered random if an equal chance exists that any individual in a given population can be selected, and drawing one member does not influence the likelihood of any other member being drawn (Shum, O‘Gorman & Myors, 2006). Although random number tables and sampling by interval are adopted to stimulate random samples, each is contingent on the quality and relevance of source inventory (e.g. census data). Kline (1993) argues that due practical limitations (cost, substantial sample size requirements) random sampling be limited to circumstances where critical target population categories evade developers (and therefore preclude preliminary stratified sampling). Stratified sampling expands from the practical limitations of random sampling by dividing heterogenous populations into smaller, more homogenous populations (e.g. age, sex) relevant to test scores, and proportionately combining these results to form representative samples of the wider population (Kline, 1993). Cattell, Eber and Tatsuoka (1970) argue that due to qualitative reductions in generalisation biases, size for size, stratified sampling is more effective than random sampling at generating representative samples.

Reliability

Reliability is not a property of the test itself, but rather, a relative measure of test consistency in particular contexts (Thompson, 2003; Thompson & Vacha-Haase, 2003; Shum, O’Gorman & Myors, 2006) such as: between test takers at different times (test-retest reliability), between test items (internal consistency) or different sets of test items (alternate-form and split-half reliability), and between different examiners or scorers (inter-rater reliability). To meet brevity requirements, this paper does not evaluate inter-rater reliability.
Test-Retest Reliability
Test-retest reliability measures stability in scores across time. Assuming constructs to be stable over test occasions (Coolidge & Segal, 2004), a correlation coefficient is commonly adopted to statistically express degrees of similarity between score sets on a scale from +1 (highest correlation) to -1 (0 being no agreement), and performance differences across tests are interpreted (according to classical test theory) as a reflection of administration and measurement error (Friedenberg, 1995). Coefficient scores above 0.8 are acceptable, and exceptional above 0.9 (Coolidge & Segal, 2004). These scores are squared to produce percentage-based expressions of score set agreement, and by negative inference, measurement error (e.g. 0.7 squared illustrates 49% difference in individual characteristics and 51% measurement error) (Kline, 1993). Intervals are determined by inferences drawn regarding construct stability, and the literature on accepted intervals is inconsistent (e.g. Kline (1993) suggests a three month minimum, cf. Segal and Coolidge (1994) and Friedenberg (1995) who report intervals of one week (for less-stable constructs) to one month).
Statistical analysis does not illumine the qualitative nature of error proportions which include: chance variables (e.g. subject’s mood, health), measurement error (e.g. poorly standardised scoring and test instructions) and factors which boost or otherwise distort measurement (e.g. effects of memory and practice during short intervals, insufficient sample size and representativeness to manage standard error) (Friedenberg, 1995; Kline, 1993).
Carry-over memory and practice effects are sometimes mitigated by alternate form administration, which tests different items from the same domain of construct characteristics. To account for effects of varying test items, correlation coefficients assess ‘alternate-form reliability’, that is, score consistency between form administrations. Differences are interpreted as measurement error in selecting test items (however, again, coefficient results do not illumine qualitative causes of variance, and additional studies (e.g. counterbalancing techniques) further exploring causation) (Friedenberg, 1995). High alternate-form reliability also assists in drawing inferences about domain scope by substantiating additional test item sets capable of illustrating the given domain.
Internal Consistency/Scale Reliability
To mitigate the potential for both estimation and relevance errors in choosing representative samples, internal consistency measures the stability between test items in assessing the relevant domain of characteristics that infer test constructs. The domain-sampling model (Nunnally, 1967) statistically scores the standard error of measurement (SEM) (broadly, a standard deviation analysis) (Shum, O’Gorman & Myors, 2006). Mathematical derivation of SEM is beyond this paper, but important is its reliance on a ‘reliability coefficient’, reflecting proportions of observed score variance due to true score variance, where high true score variance renders error score variance, and therefore propensity for measurement error, low (Shum, O’Gorman & Myors).
Reliability coefficients are measured in three main ways. Classical psychometric theory employs ‘split-half reliability’ testing that divides completed tests into two half tests (usually via odd/even splitting item number splitting) (Friedenberg, 1995; Kline, 1993) which are correlated similar to the test-retest coefficient (cf. where split-halves have largely different variances, an alternative correlation equation proposed by Guttmann (1945) may be preferred (Friedenberg)). Logic underlying split-half reliability infers that items equally representative of a domain should produce similar split-test performance. To account for statistical effects that non-linearly decrease scores of shortened test lengths, the Spearman-Brown prophecy or prediction formula is commonly applied to upwardly adjust correlation results (Aiken & Groth-Marnat, 2006).
The other two main methods, the eminent Cronbach’s (1951) alpha (Alpha) and the KR20 (& lesser used KR21) formulas (Kuder & Richardson, 1937) are separately derived, yet cover largely the same domain. The KR20 and KR21 only apply where scores are dichotomous (e.g. 0 or 1) (Thompson, 2003), whereas Alpha also applies to multiple scale items. Alpha extends the two-scale limitation of split-half testing by calculating average score correlations between an item and all other items. Scores above 0.8 are generally considered appropriately reliable (Segal and Coolidge, 1994). Alpha values are, however, limited by two variables: they are misleadingly lowered on short test and inflated on long tests, and are dependent on a high first factor concentration which affects interpretive correlation against multi-concept scales (Segal and Coolidge, 1994).
In addition to quantifying reliability, opinion largely favours high internal consistency as reflective of high validity (Guilford, 1956; Nunnally, 1978). However, Cattell and Kline (1977) dissent, arguing that since tests measure breadth in variables, and test items must be more specific than these variables, high correlations in consistency between items will render the test variables narrow (what Cattell refers to as ‘bloated specifics’), and thus not valid. Notwithstanding the merits of this theoretical argument, little evidence exists of tests in which items correlate well with the criterion score but not with each other (including Cattell’s own 16PF personality test (Barret & Kline,1982)) and on this basis, high internal consistency is perhaps best considered a necessary, but not sufficient, indicator of reliability (Kline, 1993).
With significant success (Thompson, 2003b), Generalisablity Theory (Cronbach, et al., 1972) and Item-Response Theory (Lord, 1980) have each sought to expand classical test theory to account for multiple and simultaneous sources of error and explicitly connect reliability measurement to the contextual purpose of measurement.

Validity

Validity questions the extent to which a test accurately assesses the intended construct within its contemplated context, and is determined by reference to situational evidence from a number of validation strategies (Cohen & Swerdlik, 2005). Validity provides evidence-based judgement of the usefulness and meaning of inferences drawn from test applications (Coolidge & Segal, 2004). Whereas reliability indicates a test’s success in producing consistent scores of stable constructs, validity indicates which stable constructs a test measures (Freidenberg, 1995).
Classically, three validation strategies are considered the primary ‘trinitarian’ model for validity: construct (the extent to which a test accurately captures a specific theoretical construct it was designed to measure), content (whether the test adequately, accurately and proportionately covers the aspects of this construct as evidenced in the literature), and criterion-related (comparison of test scores to relevant criterion measures, measured contemporaneously (concurrent validity) or at interval (predictive validity)). This evaluative delineation is adopted here notwithstanding endorsement of Messick’s (1995, as cited in Cohen, & Swerdlik, 2005) criticism that trinitarian demarcation is an ultimately artificial, arbitrary separation of a multi-method, yet, unitary concept, that may properly include concepts like face validity (the subjective estimate of what a test purports to measure ‘on its face’), socio-cultural implications of test scores, and consequences of use (Cohen & Swerdlik).
Construct Validity
Construct validity is increasingly accepted as an ‘umbrella’ validity test (Anastasi, 1992; Kagan, 1988; Cohen & Swerdlik, 2005) which utilises evidence from a network of rational and statistical sources, including content and criterion-related investigations, to assess a test’s ability to predict or diagnose the test construct in the context it is operating (Meier, 1994). For example, Cronbach (1955, 1989) alone considered construct validity discernable from such wide ranging phenomena as group differences, correlation matricies, factor analysis, item inter-correlations, change over occasions, studies of individual test performance, examination of items, score stability, and varying test procedures experimentally (Meier, 1994). Murphy and Davidshofer (1998, p.103) wryly concluded that “almost any information gathered in the process of developing or using a test is relevant to its validity”. Armed with diverse data, weighting methods, especially between contradictory or incorrect items, are key.
Campbell and Fiske (1959) elucidate a theoretical methodology emphasising assessment of both convergent and discriminant validity (ie. that valid tests should correlate with other tests measuring the similar constructs and simultaneously diverge from those measuring different constructs). Seeking to mitigate the effects of method variance (that data collection methods influences data collected), they proposed a view of test constructs as interconnected ‘trait-method units’ capable of evaluation through a multitrait-multimethod (MTMM) correlation matrix including at least one test of both a hypothetically convergent and divergent construct, and evidence of diverse testing methods.
In building an MTMM matrix, the mathematical procedures collectively referred to as factor analysis, are often employed to obtain evidence of both discriminant and convergent validity (Cohen & Swerdlik, 2005). Factor analysis reductively extracts strongly correlated variables (‘factors’), from psychometric data which are interpreted (once empirically confirmed) as qualities of fundamental difference. By correlating against other established similar (convergent) or distinguishable (discriminant) factors the extent to which the factor determines the test score (or ‘loading’) is assessed. Of course, for factor analysis to produce meaningful statistics, it is essential that any influences on the method of measurement (for example, halo effects (Thorndike, 1920) (a form of positive rater bias) or racial and sex-based rater bias (Landy & Farr, 1980)) are also investigated and accounted for (Cohen & Swerdlik).
Content Validity
Central to content validity assessment is the adequacy domain-specific content sampling from the universe of items inferring the construct. This investigation is critical to ensuring that generalisable inferences drawn about the applicability of the construct to the relevant test scores are accurate. Content validity is less relevant for personality testing than other measures (e.g. intelligence) (Aiken & Groth-Marnat, 2006; Kline, 1993) and discussion here is consequently limited. Nonetheless, important strategies for establishing personalty test content validity include: seeking expert opinion on item range and evidence of relevant literature reviews during all development phases (Aiken & Groth-Marnat). Moreover, salient socio-cultural biases should be considered in establishing the context of content domain boundaries (Meier, 1994).
Criterion-Related Validity
Concurrent validity is more pertinent than predictive validity for personality testing (Aiken & Groth-Marnat, 2006), and is often utilised in test administrations intended to distinguish between average scores of different classes (e.g. socio-economic or clinical diagnostic groups). Each are assessed on two types of statistical evidence: a validity coefficient (usually the Pearson product moment correlation coefficient), which assesses score correlations between criterion and predictor measures, and expectancy data, which illustrate the likelihood of where scores will arise on a criterion measure (Cohen & Swerdlik, 2005).
Validity coefficients assess the standard error of estimate (SOE) being the standard deviation between predicted and actual criterion scores (Freidenberg, 1985), and regression techniques use results to generate banks of highly correlated predictors (Meier, 1994). When more than one predictor is introduced, developers minimise overlap by testing ‘incremental validity’, which questions the extent to which additional predictors illumine aspects of criterion measures deficient in the prior predictor arrangement. Expectancy tables compare score percentiles against essential criterion criteria (e.g. pass/fail, valid/not valid) (Cohen & Swerdlik, 2005) such that illustrative correlations between scales and volume intervals can be interpreted.
Criterion measures must also be relevant, valid and uncontaminated (Cohen & Swerdlik, 2005), meaning criterion measures must be germane to the test’s intended measure, criterion measure validity indicators must match predictor context and purpose, and the criterion measure itself must not be based, in whole or in part, on the predictor measures (which erroneously results in self prediction) (Cohen & Swerdlik).

Conclusion

Although empirical methods have been clearly demonstrated to contribute to psychometric quality assurance, Epstein’s criticism that “self-report personality research has had a strong emphasis on empiricism to the partial or total exclusion of theory” (Epstein, 1979, pp.364, 377) illustrates the imprudence of imbalanced focus on empiricism. Indeed, positions which consider psychometric assessment as “barest of empirical science“ aimed at “nothing but description and prediction” and rendering theoretical questions “illumination by way of metaphor and similes” (Spearman quoted in Gould, 1981, p.268) have been strongly criticised in the literature on the basis that they employ statistically based descriptions in the absence of causal explanations (Meier, 1994; see for instance comments about the five-factor model by McRae & Costa, 1987; Lamiell, 1990). Meier (1994) further suggests scientific cultural trends among research psychologists favouring empirical methods have overemphasised the aggregation of data at the expense of any focus on individual task or item; the construct of traits over states and situations; and linear relations between tests and criteria over non-linear associations. These comments illustrate the encouraging outcomes that may arise from theoretical and empirical approaches being utilised concurrently, and to syncretically inform each other, in the construction and interpretation of personalty testing and assessment.

References

Aiken, L. R., & Groth-Marnat, G. (2006) Psychological testing and assessment. Boston: Pearson Education Group.
Anastasi, A. (1992). A century of psychological science. American Psychologist, 47, 842.
Anastasi, A., & Urbina, S. (1997). Nature and use of psychological tests. In Psychological testing (7th ed., pp.2-31). Upper Saddle River, NJ: Prentice Hall.
Barrett, P. and Kline, P. (1982). An item and radical parcel analysis of the 16PF Test. Personality and Individual Differences, 3, 259-270.
Campbell, D. T. & Fiske, D. W. (1959). Convergent and discriminant validity by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.
Cattell, R. B. (1946). The description and measurement of personality. New York: Harcourt, Brace, & World.
Cattell, R. B., Eber, H. W. & Tatsuoka, M. M. (1970). Handbook for the Sixteen Personality Factor Questionnaire. Champaign, IL: Institute for Personality and Ability Testing.
Cattell, R. B. and Kline, P. (1977). The scientific analysis of personality and motivation. London, Academic Press.
Cohen, R. J., & Swerdlik, M. E. (2005). Validity. In Psychological Testing and Assessment (6th ed., 156-189) Boston, MA: McGraw-Hill.
Coolidge, F. L., & Segal, D. L. (2004) Objective assessment of personality in psychopathology: An overview. In Hilsenroth, M. J. & Segal, D. L. (Eds.), Comprehensive handbook of psychological assessment: Volume 2 – Personality assessment (pp.3-13). Hoboken, New Jersey: John Wiley & Sons.
Costa, P. T., Jr., & McRae, R. R. (1985). The NEO Personality Inventory manual. Odessa, FL: Psychological Assessment Resources.
Cronbach, L. J. (1951) Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.
Cronbach, L. J. (1989). Construct validation after thirty years. In R. L. Linn (Ed.), Intelligence: Measurement, theory and public policy (pp.147-171). Urbana: University of Illinois Press.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281-302.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioural measurements: Theory of generalizabilty for scores and profiles. New York: Wiley.
Epstein, S. (1997). The stability of behaviour: on predicting most of the people much of the time. Journal of Personality and Social Psychology, 37, 1079-1126.
Eysenck, M. (2004). Personality. In Psychology: An international perspective (pp.445-481). Have, UK: Psychological Press.
Freidenberg, L. (1995). Psychological Testing: Design, Analysis and Use. Boston: Allyn and Bacon.
Gould, S. J. (1981). The mismeasure of man. Norton: New York.
Guilford, J. P. (1956). Psychometric Methods. New York, McGraw-Hill.
Guttman, L. (1945). A basis for analysing test-retest reliability. Psychometrika, 10, 255-282.
Kagan, J. (1988). The meanings of personality predicates. American Psychologist, 43, 614-620.
Kline, P. (1993) The Handbook of Psychological Testing. London, England: Routledge.
Kuder, G. F., & Richarson, M. W. (1937). The theory of the estimation of reliability. Psychometrika, 2, 151-160.
Lamiell, J. T. (1990). Explanation in the psychology of personalty. Annals of Theoretical Psychology, 6, 153-192.
Landy, F. J., & Farr, J. L. (1980). Performance Rating. Psychological Bulletin, 87, 72-107.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum
McRae, R. R., & Costa, P. T., Jr. (1987). Validation of the five-factor model of personality across instruments and observers. Journal of Personality and Social Psychology, 56, 586-595.
Meier, S. T. (1994). The Chronic Crisis in Psychological Measurement and Assessment: A Historical Survey. San Diego, CA: Academic Press, Inc.
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist. 50(9), 741-749.
Millon, T., & Meagher, S. E. (1994). The Millon Clinical Multiaxial Inventory-II (MCMI-III). In Hilsenroth, M. J. & Segal, D. H. (Eds.), Comprehensive handbook of psychological assessment: Volume 2 – Psychological assessment. Hoboken, New Jersey: John Wiley & Sons.
Murphy, K . R., & Davidshofer, C. O. (1988). Psychological testing. Englewood Cliffs, NJ: Prentice Hall.
Nunnally, J. C. (1967). Psychometric theory. New York: McGraw-Hill
Nunnally, J. C. (1978). Psychometric theory. New York: McGraw-Hill
Shum, D., O’Gorman, J., & Myors, B. (2006) Psychological testing and assessment. Melbourne, Victoria: Oxford University Press.
Thompson, B. (2003). Understanding Reliablity and Coefficient alpha, really. In Thompson, B (Ed.), Score Reliability: Contemporary Thinking on Reliablity Issues (pp.3-23). Thousand Oaks, CA: Sage Publications.
Thompson, B. (2003b). A brief introduction to Generalisablity Theory. In Thompson, B (Ed.), Score Reliability: Contemporary Thinking on Reliablity Issues (pp.43-58). Thousand Oaks, CA: Sage Publications.
Thompson, B., & Vacha-Haase, T. (2003). Psychometrics is datametrics: The test is not reliable. In Thompson, B (Ed.), Score Reliability: Contemporary Thinking on Reliablity Issues (pp.123-148). Thousand Oaks, CA: Sage Publications.
Thorndike, E. L. (1920). A constant error in psychological ratings. Journal of Applied Psychology, 4, 25-29.

Monday, February 23, 2009

"What human beings can be, they must be".

Thursday, February 19, 2009

Is the End Nigh?

Inexplicable Oddity

Wednesday, February 18, 2009

A Discussion and Evaluation of Essential Quality Standards in Approaches to the Psychometric Assessment of Personality

Thursday, February 12, 2009

Of droughts and flooding rains...

My Sites

Blog Archive

Photos

Twitter Tweets

Twitter Updates

Friends

My Favourite Sites

Like Us On Facebook