A Survey of Tactics for Supporting Test Use Without Large Validation Samples

Six Strategies for Supporting Personality Assessment Without Large Validation Samples


Paschal Baute

Institute for Human Responsiveness, Inc.

Alan D. Mead

Institute for Personality and Ability Testing



Paper presented at the 14th Annual Conference of the Society for Industrial and Organizational Psychology, Inc., April 1999, Atlanta, GA.  Correspondence concerning this paper should be addressed to Paschal Baute, 6200 Winchester Road, Lexington, KY 40509, or Alan Mead, 1801 Woodfield Drive, Savoy, IL 61874.




Few I/O practitioners will dispute the need for some sort of a priori support of the use of an assessment before it is used for selecting employees.  In the ideal situation, organizations would collect the necessary test and performance data so that a high-quality validation study could be conducted. 

However, from long experience, we know that there are many instances in which an organization cannot or will not (or does not believe that it can or will)  perform ideal research.  The organization may even have talked to another consultant with a more expedient approach.

This paper will examine a few of the reasons organizations do not commit the resources needed to perform ideal research and then discuss six alternatives to the traditional means of supporting test use.  We will focus on the specifics of personality assessment, which is in many ways more complex than ability testing.

Just how many people are we talking about?  Sample sizes sufficient for a high-powered validation study are surprisingly large.  Assume that the true validity of an intelligence test is .51 (Schmidt & Hunter, 1998) but that it is observed at only .25; with a one-tailed test and the common significance level of .05, you need 150 individuals to have a 90% chance of detecting the validity coefficient as significant (Cohen, 1988).  In many selection situations there are far fewer than 150 total incumbents.  What can a practitioner do in this situation?
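The sample-size reasoning above can be sketched with the common Fisher z approximation.  This is an illustration under stated assumptions, not a reproduction of Cohen's (1988) tables: the function name n_for_correlation is hypothetical, and the approximation need not return exactly the figure cited in the text.

```python
import math
from statistics import NormalDist


def n_for_correlation(r, alpha=0.05, power=0.90, one_tailed=True):
    """Approximate N needed to detect a population correlation r.

    Uses the Fisher z approximation: N = ((z_alpha + z_power) / atanh(r))^2 + 3.
    """
    if not 0 < abs(r) < 1:
        raise ValueError("r must be non-zero and strictly between -1 and 1")
    norm = NormalDist()
    tail_area = alpha if one_tailed else alpha / 2
    z_alpha = norm.inv_cdf(1 - tail_area)   # critical value for the significance test
    z_power = norm.inv_cdf(power)           # normal deviate for the desired power
    fisher_z = math.atanh(abs(r))           # Fisher z transform of the expected r
    return math.ceil(((z_alpha + z_power) / fisher_z) ** 2 + 3)


# Required N rises steeply as the expected observed validity falls.
for r in (0.10, 0.25, 0.51):
    print(f"r = {r:.2f}: N of roughly {n_for_correlation(r)}")
```

The point of the sketch is the steep trade-off: halving the expected observed validity roughly quadruples the required sample.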

The practitioner might be able to support the use of cognitive ability testing on the basis of accumulated wisdom regarding the “general-purpose-predictor” nature of general mental ability (GMA; see Schmidt & Hunter, 1998).  However, personality has not enjoyed the same unanimous research support for the same length of time.  Besides, there is the problem of setting cut-scores or otherwise determining who will “pass” and who will “fail.”

Alternatively, the practitioner could simply not use a psychometric assessment, relying instead on interviews or qualifications checks, or could rely solely on tests of GMA.  However, there are several reasons why these options are not always practical (e.g., interviews are expensive and probably not highly reliable; GMA generally has significant adverse impact).  This paper will assume the practitioner has good reason to use personality measures and is motivated to muster some form of support other than a traditional, large-scale validation.

This talk will discuss six strategies for supporting the use of an assessment without a large validation sample: 

·        other forms of validation evidence:  content & construct,

·        other sources of validity data:  pooling validity data across jobs or organizations,

·        generalizing or transporting validity from elsewhere,

·        gathering some data: contrasting groups and “success profiling,”

·        expert judgment, and

·        literature review coupled with job analysis

Alternatives to criterion-related validity

There are certainly alternatives to a traditional criterion-related validation study for supporting assessment use.  Sackett and Arvey (1987) describe several of these with regard to ability testing.  This paper attempts to extend those means, and some new ones, to the realm of personality testing.

Obviously these strategies may be pursued in parallel to provide additional assurances.

Content validity.  Content validation is a popular alternative to criterion validation even when a sufficient sample might be assembled.   Lawshe (1975) has published a quantitative method with which to evaluate the results, so the procedure need not be strictly qualitative.  Some consultants feel that content validation, as a rational process, is more easily managed and more reliable than are criterion-related validation studies.
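As a brief illustration of the quantitative approach mentioned above, Lawshe's (1975) content validity ratio is commonly written as CVR = (n_e - N/2) / (N/2), where n_e is the number of panelists rating an item “essential” and N is the panel size.  The sketch below assumes that form; the function name and the item counts are hypothetical, and the critical values needed to judge whether a CVR exceeds chance are not reproduced here.

```python
def content_validity_ratio(n_essential, n_panelists):
    """Lawshe-style CVR: ranges from -1 (no one says essential) to +1 (everyone does)."""
    if n_panelists <= 0 or not 0 <= n_essential <= n_panelists:
        raise ValueError("need 0 <= n_essential <= n_panelists and n_panelists > 0")
    half = n_panelists / 2
    return (n_essential - half) / half


# Hypothetical panel of 10 SMEs; values are counts of "essential" ratings per item.
essential_counts = {"item_1": 9, "item_2": 5, "item_3": 2}
for item, n_e in essential_counts.items():
    print(item, round(content_validity_ratio(n_e, 10), 2))
```

An item endorsed by exactly half the panel receives a CVR of zero, so only items with clear majority support contribute positively.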

In terms of personality assessment, content validation has some difficulties.  For example, the items of personality tests are often less face valid.  In the context of an ability test used for selecting programmers, one could ask a performance question.  Personality items cannot usually be written as performance items; typically, items ask what the candidate would do, and the candidate may explicitly misreport or may be self-deceived about his or her potential.  Many personality items are more oblique, which leads directly to a lessened degree of face validity.

Level of analysis becomes important.  Different results may be obtained for analyses of items, narrow scales, and broad (big-five) scales.  For example, a sales job may require a high degree of social boldness to talk to strangers, but performance may be impeded by a correspondingly high degree of interpersonal warmth and concern for others’ well-being, both of which are components of the broad factor Extraversion.  A law enforcement role might require a healthy respect for rules and laws and yet not require organization and attention to detail, although both dimensions are strongly related to the global factor of Self-Control (known as Conscientiousness in most discussions of the big-five).

Further, non-linearities may cause problems which may or may not be detected in the content validation.  For example, we have observed samples of law officers in which a healthy respect for the law was ubiquitous but higher degrees were actually negatively related to ratings of performance.  This most likely arose from a curvilinear relation between rule-consciousness and performance, such that performance rises with rule-consciousness to a point and then declines as a greater degree of concern for rules makes an officer less likely to use good judgment and to prioritize in enforcing the greater good.

In a content analysis, this may come up as subject matter experts (SME’s) say something like, “Yeah, each of these items is related to the job but I don’t want people who answer ‘extremely true’ to all of them—they would be too rigid!” 

Or it might arise that the SME’s can imagine different ways in which any answer to the question might indicate better performance.  In job analysis situations, we have observed SME’s endorse statements to the effect that incumbents must “take charge and provide leadership” and also “follow orders and be a good follower,” along with similar, seemingly opposing, statements.

Pool data across jobs.  Research has suggested that some personality traits, such as conscientiousness, might be predictive across all jobs (Barrick & Mount, 1991).  Further, many organizations have company-wide mandates such as quality or customer service, and these mandates seem to require behaviors which personality traits, such as conscientiousness or agreeableness, might best predict.  In these situations, it seems plausible that one component of the selection system might be a personality assessment to predict aspects of performance common to all (or many) jobs in an organization.

This approach is bolstered by current “character” thinking such as the Josephson Institute’s “Six Pillars of Character” (Josephson Institute of Ethics, 1997), which hypothesizes that six traits/values are key to all activities for all people, regardless of their other beliefs.  These “pillars” are: trustworthiness, respect, responsibility, justice and fairness, caring, and citizenship.  Note that many of these seem closely related to Barrick and Mount’s (1991) results.

The logical extreme of this is synthetic validity (Primoff, 1975), in which a test is evaluated against small elements of job performance which are shared to a degree across many different jobs.  The crux of synthetic validity is to produce a composite validity estimate based on the importance of the individual job elements.  Hollenbeck and Whitener (1988) present one solution to the problem.  Note that this still requires that a number of employees be involved in the research, but they need not be from a single job (or even related jobs).

Pool data across organizations.  While some organizations (e.g., Coca-Cola and Pepsi) have such adversarial relations that cooperation is unthinkable, firms in other industries frequently cooperate to some degree on professional projects.  The well-known LOMA and LIMRA organizations are two examples of such industry-wide cooperation.  Public sector organizations are particularly likely candidates for this sort of shared cost-cutting.

Numerous problems can prevent or hinder this approach.  The most obvious is politics; many organizations, even those which would seem to the outside observer to be likely partners, will have political or other agendas which prevent them from cooperating.  For example, the timing of hiring may prevent police jurisdictions from being able to cooperate; one jurisdiction may need to test before the end of the year while another jurisdiction may have years of life in its current roster and so feel no urgency to participate.

Even if seemingly comparable performance measures can be found or constructed, they will likely be affected at least slightly by the host organization.  The effect due to the organization may be difficult or impossible to assess.  What do you do if the officers of one department are rated a full standard deviation higher, on average, than those of another department?  Are the first officers really better performers or was the difference really in the judgment, orientation, or biases of their supervisors (who provided ratings)? In extreme cases, it becomes necessary to deal with organization effects in some way, perhaps by standardizing within organization before combining data.
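One way to standardize within organization before combining data is to convert each organization's ratings to z-scores using that organization's own mean and standard deviation.  The following is a minimal sketch; the function name, department labels, and ratings are hypothetical.  Note the cost of this choice: it removes any real difference in average performance between organizations along with any rater bias.

```python
from statistics import mean, stdev


def standardize_within_org(ratings_by_org):
    """Return z-scores computed separately within each organization."""
    z_by_org = {}
    for org, ratings in ratings_by_org.items():
        org_mean = mean(ratings)
        org_sd = stdev(ratings)  # sample SD; needs at least two ratings
        if org_sd == 0:
            raise ValueError(f"no rating variance in {org}")
        z_by_org[org] = [(x - org_mean) / org_sd for x in ratings]
    return z_by_org


# Hypothetical supervisor ratings; dept_a is rated higher on average.
ratings = {"dept_a": [4.0, 4.5, 5.0], "dept_b": [2.5, 3.0, 3.5]}
z = standardize_within_org(ratings)
pooled = [score for scores in z.values() for score in scores]
print(pooled)
```

After standardization the pooled criterion has no between-organization mean difference, so any validity estimated from it reflects only within-organization covariation.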

If you follow this approach, let us advise you to resist the temptation to provide results based on individual organizations.  For example, do not show how one firm’s salespeople “stack up” against the total group based on four organizations.  Such comparisons serve a different purpose than to support the use of a selection instrument and may cause problems in obtaining honest participation in future projects.

Generalize or transport validity.  In some cases previous research exists which is of such quality and quantity that local validation cannot hope to compete and can only lead to more, rather than less, uncertainty regarding an instrument’s use.  For example, given the broad support of Barrick and Mount’s (1991) large-scale meta-analysis, one wonders whether there is any point in local validation of a measure of conscientiousness.  It would be no great surprise to find a significant positive result; and, if one were to obtain a non-significant or negative relation, would any local validation be sufficient to counter Barrick and Mount’s results?

What is the difference between generalization and transportation of validity?  The first refers to attempts to show that the specifics of a job do not matter and the results hinge on the variability of the corrected coefficients across the studied jobs.  The latter refers to a less ambitious procedure whereby one shows that a target job is substantially the same as one for which validity has already been established.  Thus transportability rests on similarity of KSAO’s across the previously researched and target job.
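To make the role of variability across studies concrete, the sketch below computes a “bare-bones” summary: the sample-size-weighted mean validity, the observed variance of the coefficients, and the variance expected from sampling error alone.  It is a simplification of the generalization procedure described above, because it corrects only for sampling error and not for unreliability or range restriction.  The function name and study values are hypothetical.

```python
def bare_bones_summary(studies):
    """studies: list of (n, r) pairs. Returns weighted mean r and variance components."""
    total_n = sum(n for n, _ in studies)
    mean_r = sum(n * r for n, r in studies) / total_n
    # Sample-size-weighted variance of the observed coefficients.
    observed_var = sum(n * (r - mean_r) ** 2 for n, r in studies) / total_n
    # Variance expected from sampling error alone, using the average sample size.
    average_n = total_n / len(studies)
    sampling_var = (1 - mean_r ** 2) ** 2 / (average_n - 1)
    residual_var = max(observed_var - sampling_var, 0.0)
    return {
        "mean_r": mean_r,
        "observed_var": observed_var,
        "sampling_var": sampling_var,
        "residual_var": residual_var,
    }


# Hypothetical small local studies of one trait across five jobs: (sample size, observed r).
studies = [(40, 0.10), (60, 0.25), (50, 0.30), (80, 0.18), (45, 0.05)]
summary = bare_bones_summary(studies)
print({name: round(value, 4) for name, value in summary.items()})
```

In this made-up example the spread of the observed coefficients is smaller than what sampling error alone would produce, so the residual variance is zero; that is the pattern one would cite in arguing that the specifics of the job do not matter.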

Small-N data designs.  Small samples do not preclude research but they require different designs and expectations.  A well-known generalization among statisticians is that estimation of the higher moments of a distribution requires many more cases than estimation of the lower moments.  A validation method which depends on higher moments will require many more participants than one which contents itself with lower moments.  As a very rough rule of thumb, assume that estimating a mean might typically require a sample of 20 to 30 while estimating a correlation (which requires estimation of two means, two standard deviations, and a covariance) might require 90 to 200 participants.

One widely-used small-N design requires managers to nominate 20-30 high-performing incumbents.  These incumbents are tested and assessed and their mean profile becomes the “template” used to evaluate later candidates.  For example, one might seek candidates who are within one or two SEM’s (computed by the test publisher on a very large general population sample) of the mean profile.  One could even compute a distance measure for each candidate (see Chapter 11 of Cattell, Eber, & Tatsuoka, 1970).  Unless one uses higher-order information, this design provides no information about which traits distinguish high from low performance.
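The template design just described can be sketched as follows.  The scale labels, scores, and SEM values are placeholders, and plain Euclidean distance is used as a stand-in; it is not necessarily the distance measure discussed by Cattell, Eber, and Tatsuoka (1970).

```python
import math
from statistics import mean


def mean_profile(profiles):
    """Average each scale across the nominated high performers."""
    scales = profiles[0].keys()
    return {scale: mean(p[scale] for p in profiles) for scale in scales}


def within_band(candidate, template, sem, k=2.0):
    """True if the candidate is within k SEMs of the template on every scale."""
    return all(abs(candidate[s] - template[s]) <= k * sem[s] for s in template)


def profile_distance(candidate, template):
    """Euclidean distance between a candidate profile and the template."""
    return math.sqrt(sum((candidate[s] - template[s]) ** 2 for s in template))


# Hypothetical scores on three scales for four nominated incumbents.
top_performers = [
    {"A": 7, "G": 8, "Q3": 6},
    {"A": 8, "G": 7, "Q3": 7},
    {"A": 6, "G": 8, "Q3": 6},
    {"A": 7, "G": 9, "Q3": 5},
]
template = mean_profile(top_performers)
sem = {"A": 1.0, "G": 1.0, "Q3": 1.0}  # placeholder SEMs, not publisher values
candidate = {"A": 6, "G": 7, "Q3": 8}

print(template)
print(within_band(candidate, template, sem, k=2.0))
print(round(profile_distance(candidate, template), 3))
```

The band rule gives a pass/fail screen, while the distance gives a single number that can be used to order candidates by closeness to the template.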

A slight variation of this “best-group” design requires 10 to 15 best performers and 10 to 15 worst performers.  Using this design, t-tests can be used to determine which personality dimensions distinguish between best and worst performers.  This design attempts to overcome the small sample size by maximizing power through the magnification of the effect size (picking the very best and worst employees).  This design has been criticized (CITE) and it cannot completely overcome the small sample size.  However, it seems to us an excellent fall-back if a larger sample cannot be had, or a useful pilot test before a predictive study in which the test is used in a “live” interim study.
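The contrasting-groups comparison can be sketched with a two-sample t statistic.  The text does not specify which t-test is intended; Welch's unequal-variance version is assumed here, and the scores are hypothetical.  The statistic and its degrees of freedom would then be compared against a t table or a statistics package.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic and approximate degrees of freedom for two independent groups."""
    se2_a = variance(group_a) / len(group_a)  # squared standard error of mean A
    se2_b = variance(group_b) / len(group_b)  # squared standard error of mean B
    t = (mean(group_a) - mean(group_b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(group_a) - 1) + se2_b ** 2 / (len(group_b) - 1)
    )
    return t, df


# Hypothetical scores on one personality scale for 10 best and 10 worst performers.
best = [7, 8, 6, 7, 9, 8, 7, 6, 8, 7]
worst = [5, 6, 4, 5, 6, 5, 7, 4, 5, 6]
t_stat, df = welch_t(best, worst)
print(round(t_stat, 3), round(df, 1))
```

Because the groups are deliberately extreme, the resulting effect size overstates what would be seen in the full range of employees; the test only indicates which dimensions are worth attention.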

Expert judgment.  Studies (e.g., Schmidt, Hunter, Croll, & McKenzie, 1983) have suggested that experienced (see Hirsh, Schmidt, & Hunter, 1986) industrial psychologists can predict the validity of a test better than a small sample might.  While this approach may seem “soft” or “subjective” to many numbers-oriented I/O psychologists, it need not be, nor must it appear this way to business people or legal professionals.  For example, advanced accounting is largely a study of estimating quantities, often through expert judgment.  Expert testimony is extremely widespread in courts and, although courts have recently restricted expert testimony, this action affirms its fundamental importance and worth.  In other words, using expert judgment may not seem at all strange to hiring managers or corporate counsel.

In order to use expert opinion, it is important that the opinions be free from conflict of interest.  Outside, independent experts should be sought who have experience with the job in question.  They should be given as clear a goal as possible and their estimates should be combined (using a mean, consensus, etc.).  There are no rules about how many experts are needed; however, more is clearly better, and we recommend that four or more experts be consulted whenever possible.

As an example process, IPAT recently collected expert judgments for several jobs, including customer service.  We found seven experienced I/O practitioners who averaged 1.4 validation studies and 9.3 years of work with this population.  Table 1 shows the mean profile derived from these seven experts.  Starting at the bottom, the global scales Extraversion, Anxiety, and Self-Control seem most related to performance (Anxiety negatively).  The primary scales suggest that one wants a person who cares about others and wants to help them (A=.21) and is cheerfully energetic (F=.18), but also rule-abiding (G=.18), organized (Q3=.09), and practical (M=-.07).  Anxiety is reflected in the validity estimates for emotional stability (C=.14), apprehension (O=-.09), and frustrated tension (Q4=-.08).  Some of these validities are quite small, but they accumulate in composites built from several primaries because the primaries have low intercorrelations.
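The claim that small validities accumulate in a composite can be illustrated with the standard formula for a unit-weighted composite of k standardized scales: the composite validity equals the sum of the individual validities divided by the square root of k + k(k - 1) times the mean intercorrelation.  The sketch uses four of the judged validities quoted above; the intercorrelation values are assumptions chosen for illustration, not 16PF statistics.

```python
import math


def unit_weighted_composite_validity(validities, mean_intercorrelation):
    """Validity of an equally weighted sum of standardized scales."""
    k = len(validities)
    composite_variance = k + k * (k - 1) * mean_intercorrelation
    return sum(validities) / math.sqrt(composite_variance)


# Judged validities for A, F, G, and C as quoted in the text.
judged = [0.21, 0.18, 0.18, 0.14]

# Mean intercorrelations below are assumed values, not taken from the test manual.
for r_bar in (0.0, 0.2, 0.5):
    validity = unit_weighted_composite_validity(judged, r_bar)
    print(f"mean intercorrelation {r_bar:.1f}: composite validity {validity:.3f}")
```

With uncorrelated scales the composite validity exceeds every individual validity; as the scales overlap more, the gain shrinks toward the average single-scale validity.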

Job analysis.  In many ways, job analysis can be thought of as a structured expert judgment process, so job analysis should be considered at least as viable an option as expert judgments of validity.  Virtually all I/O psychologists would say that a thorough job analysis is necessary, but is it sufficient?

We believe the answer can be yes.  A traditional criterion-related validation process has four steps: job analysis, selection/development of assessments, validation, and implementation.  If the validation step cannot be completed, it might be more worthwhile to skip it on the basis of a strong job analysis than to substitute a less common method.  Interim use of a measure, while data for a predictive validation study are being collected, is supported by the Uniform Guidelines.

It might be prudent to do a “re-translation” of the job analysis results.  Similar to content validation of the personality dimensions, this would involve describing the KSAO’s resulting from the job analysis to job experts and having them rate the job relevance of these dimensions.

Caveats and Problems

Criterion-related validity is often seen as the “king” of validation efforts and the linchpin of many selection systems.  So the first caveat is that there must be a degree of risk in using alternative designs.  Although many of these are related to traditional techniques or are logical outgrowths of them, they will be unfamiliar even to knowledgeable HR directors, lawyers, and judges.  On the other hand, many of these approaches can be described in terms which make every bit as much logical sense as criterion-related validity.

How would one select people?  Without a traditional predictor composite, one cannot rank all candidates and hire from the top.  How would selection actually occur?

All of these methods could be used to express a validity-coefficient-like weight for each dimension.  Using these weights, one could derive a predictor composite based on regression.  Alternatively, one could compute the deviation of each candidate from an ideal profile.
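A weighted composite of this kind can be sketched as follows.  Using the judged validities directly as weights is a simplification; true regression weights would also take the scale intercorrelations into account.  The candidate labels and standardized scores are hypothetical.

```python
def composite_score(z_scores, weights):
    """Weighted sum of a candidate's standardized scale scores."""
    return sum(weights[scale] * z_scores[scale] for scale in weights)


# Judged validities quoted in the text, used here as weights (an assumption).
weights = {"A": 0.21, "F": 0.18, "G": 0.18, "C": 0.14}

# Hypothetical candidates with standardized (z) scores on each scale.
candidates = {
    "cand_1": {"A": 1.0, "F": 0.5, "G": 0.0, "C": -0.5},
    "cand_2": {"A": -0.5, "F": 0.0, "G": 1.0, "C": 1.0},
    "cand_3": {"A": 0.0, "F": -1.0, "G": -0.5, "C": 0.0},
}

ranked = sorted(
    candidates,
    key=lambda name: composite_score(candidates[name], weights),
    reverse=True,
)
for name in ranked:
    print(name, round(composite_score(candidates[name], weights), 3))
```

Once every candidate has a composite score, top-down selection proceeds exactly as it would with a traditionally derived predictor composite.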

The second approach may be particularly appropriate for personality data.  Although not widely publicized in the I/O literature, many personality theorists hypothesize that personality may well have curvilinear relationships with evaluative dimensions like job performance.  For example, a certain degree of dominance may be required of a manager but too much may be damaging, preventing him or her from delegating, damaging coworker relations, and weakening obedience to superiors.  We have observed these sorts of relations several times in our own work.

Some selection systems require cut-scores.  If the personality test scores are linear with performance, cut-scores can be set using a variation of the Angoff (19xx) method.  In this procedure, SME’s pick the scores likely to be obtained by minimally competent candidates and these are combined into cutting scores.  If linearity is not an appropriate assumption, two well-chosen cutting scores (a lower and an upper bound) can make the resulting decision linear with performance.
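The two cut-score ideas above can be sketched briefly.  Averaging the SME judgments is one simple way of combining them and is an assumption here, as are the function names and scores; the second function shows how a lower and an upper cut together screen out both extremes of a curvilinear trait.

```python
from statistics import mean


def combined_cut_score(sme_judgments):
    """Combine SME estimates of a minimally competent candidate's score by averaging."""
    return mean(sme_judgments)


def passes_band(score, lower_cut, upper_cut):
    """Pass only candidates whose score falls between the two cutting scores."""
    return lower_cut <= score <= upper_cut


# Hypothetical SME judgments for the lowest and highest acceptable scale scores.
lower_cut = combined_cut_score([4, 5, 4, 5])
upper_cut = combined_cut_score([8, 9, 8, 9])

for score in (3, 6, 10):
    print(score, passes_band(score, lower_cut, upper_cut))
```

In the dominance example, the lower cut would exclude candidates with too little dominance to manage and the upper cut would exclude those with so much that delegation suffers.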


Barrick, M. R., & Mount, M. K.  (1991).  The big-five personality dimensions and job performance:  A meta-analysis.  Personnel Psychology, 44, 1-26.

Cattell, R. B., Eber, H. W., & Tatsuoka, M. M.  (1970).  Handbook for the Sixteen Personality Factor Questionnaire (16PF).  Champaign, IL: Institute for Personality and Ability Testing.

Cohen, J.  (1988).  Statistical power analysis for the behavioral sciences.  Hillsdale, NJ:  Erlbaum Associates.

Hirsh, H. R., Schmidt, F. L., & Hunter, J. E.  (1986).  Estimation of employment validities by less experienced judges.  Personnel Psychology, 39, 337-344.

Hollenbeck, J. R., & Whitener, E. M.  (1988).  Criterion-related validation for small sample contexts: An integrated approach to synthetic validity.  Journal of Applied Psychology, 73, 536-544.

Josephson Institute of Ethics.  (1997).  The six pillars of character.  http://longwood.cs.ucf.edu/~MidLink/cc/pillars.html.

Primoff, E. S.  (1975).  How to prepare and conduct job element examinations.  Washington:  Research Section, Personnel Research and Development Center, U. S. Civil Service Commission.

Sackett, P. R., & Arvey, R. D.  (1987).  Selection in small-N settings.  In N. Schmitt & W. C. Borman (Eds.), Personnel selection in organizations (pp. 432-439).

Schmidt, F. L., & Hunter, J. E.  (1998).  The validity and utility of selection methods in personnel psychology:  Practical and theoretical implications of 85 years of research findings.  Psychological Bulletin, 124, 262-274.

Schmidt, F. L., Hunter, J. E., Croll, P. R., & McKenzie, R. C.  (1983).  Estimation of employment test validities by expert judgment.  Journal of Applied Psychology, 68, 590-601.

Table 1.

16PF mean profile derived from expert judgment.

16PF Primary Scale        Mean Judged Validity

Emotional Stability

Social Boldness

Openness to Change

Global Scales