Why use ability tests in recruitment

Almost a century of evidence suggests that cognitive ability is the strongest predictor of job performance in cognitively complex work (Schmidt & Hunter, 1998). Moreover, the more complex the job, the greater the predictive validity of cognitive ability (Bertua et al., 2005). This makes cognitive ability testing more integral to quality of hire in professional, technical, and managerial work than any other form of assessment.

Research also shows ability testing to be a fair way to screen applicants. From a neurodiversity perspective, cognitive assessments show small group differences, especially when reasonable adjustments are provided (Camden et al., 2024). Additionally, psychometricians have many tools at their disposal to ensure the fairness of ability tests, including adverse impact analysis and differential item functioning (DIF) analysis.

[Chart: differential item functioning]

We believe the scientific argument for using ability tests in recruitment is overwhelmingly strong, and thus have developed several suites of ability tests for use in selection and assessment.

The general factor of cognitive ability (g)

The general factor of cognitive ability (also known as g) is the predominant theoretical model for ability testing, and Test Partnership’s ability assessments follow this model.

[Diagram: general cognitive ability]

Research has shown repeatedly and without exception that batteries of ability tests intercorrelate with each other, creating a positive manifold (McDaniel, 2010). This suggests that an underlying factor links these specific abilities, and when extracted through factor analysis, this factor represents the majority of the reliable variance in cognitive assessment scores.

The g-factor is a mathematical representation of cognitive ability itself, which can be defined as “a very general mental capability that, among other things, involves the ability to reason, plan, solve problems, think abstractly, comprehend complex ideas, learn quickly and learn from experience” (Gottfredson, 1994). The more a test (or battery of tests) loads onto the g-factor, the more effectively it measures cognitive ability, and thus the stronger its predictive validity.

It is therefore imperative that test providers conduct factor analyses on their assessments, ensuring that they are suitably g-loaded. Aside from criterion-related validity, this represents the most powerful form of validation available for cognitive assessments, and is core to the theoretical understanding of cognitive ability itself. Moreover, given the ubiquity of g whenever studied, no plausible alternative model for cognitive ability exists, and those which have been presented have very little support (Visser, 2006).
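For illustration, here is a minimal sketch (not Test Partnership’s actual analysis pipeline) of how g-loadings might be estimated from a battery’s correlation matrix, using the first principal component as a rough proxy for the general factor; the subtests and correlations below are synthetic.

```python
import numpy as np

# Synthetic correlation matrix for a four-test battery (illustrative values only).
subtests = ["numerical", "verbal", "inductive", "mechanical"]
R = np.array([
    [1.00, 0.55, 0.50, 0.45],
    [0.55, 1.00, 0.48, 0.40],
    [0.50, 0.48, 1.00, 0.42],
    [0.45, 0.40, 0.42, 1.00],
])

# Eigendecomposition of the correlation matrix; the first component's loadings
# serve as a rough proxy for g-loadings in this single-factor sketch.
eigenvalues, eigenvectors = np.linalg.eigh(R)
first = np.argmax(eigenvalues)
loadings = np.sqrt(eigenvalues[first]) * np.abs(eigenvectors[:, first])

for name, loading in zip(subtests, loadings):
    print(f"{name:>10}: g-loading ≈ {loading:.2f}")
```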

Test Partnership has conducted hundreds of factor analyses on its assessments, ensuring that our tests are sufficiently g-loaded, and thus suitable for high-stakes selection. Consequently, we are highly confident in the construct validity of our assessment suites, and have undertaken the most rigorous analyses possible to ensure their quality.

Specific aptitude tests and test batteries

Test Partnership offers a range of specific aptitude assessments, designed to measure numerical reasoning, verbal reasoning, inductive reasoning, critical thinking, and mechanical reasoning.

We believe the best way to measure general cognitive ability is through batteries of specific aptitude tests, rather than stand-alone assessments.

Stand-alone assessments typically employ a variety of item types, each measuring different aspects of cognitive ability. This approach makes legitimate practice and preparation harder, and thus encourages guessing or cheating behaviours instead.

Instead, we recommend using a battery of specific aptitude tests, providing a consistent testing experience which is easy to prepare for. Moreover, when specific aptitude test scores are averaged, the aptitude-specific variance cancels out, leaving a purer and more g-loaded measure of cognitive ability.
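To see why averaging helps, consider a simple sketch under assumed values: if each standardised subtest loads λ on g and the specific factors are mutually uncorrelated, the g-loading of an equally weighted composite of k subtests is kλ / √(k + k(k − 1)λ²), which rises as k increases.

```python
import numpy as np

def composite_g_loading(k: int, loading: float) -> float:
    """g-loading of an equally weighted composite of k standardised subtests,
    each loading `loading` on g, with mutually uncorrelated specific factors."""
    numerator = k * loading                       # Cov(composite, g)
    variance = k + k * (k - 1) * loading ** 2     # Var(composite)
    return numerator / np.sqrt(variance)

# Assumed g-loading of 0.70 per subtest (illustrative, not an empirical value).
for k in (1, 2, 3, 4):
    print(f"{k} subtest(s): composite g-loading ≈ {composite_g_loading(k, 0.70):.2f}")
```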

[Chart: maximum information provided]

This approach is followed in the clinical and educational spaces, with full-scale IQ tests measuring a multitude of different aptitudes (Silva, 2008). However, in the occupational and commercial space, 2–4 aptitude tests are typically sufficient for an adequately g-loaded aggregate score.

Test Partnership’s Suite of Ability Tests

Test Partnership offers a range of different ability tests, each measuring distinct aptitudes within general cognitive ability. This includes:

  • Insights Numerical Reasoning: This numerical reasoning test is designed to measure a person’s ability to solve problems and make decisions using quantitative information.
  • Insights Verbal Reasoning: This verbal reasoning test evaluates a person's ability to solve problems and make decisions using verbal (written/spoken) information.
  • Insights Inductive Reasoning: This inductive reasoning test is designed to evaluate a candidate's ability to identify patterns, work with logical processes, and think abstractly.
  • Concepts Critical Thinking: This critical thinking test measures a candidate’s ability to conceptualise, apply, analyse, and evaluate complex written information.
  • Concepts Data Analysis: Concepts Data Analysis measures a candidate’s ability to work with complex quantitative data, both statistical and financial.
  • Mechanical Reasoning: Our mechanical reasoning test evaluates a candidate’s ability to work with mechanical principles and solve physical problems.
  • Dynamic Assessment Suite: Test Partnership has developed a suite of six gamified ability tests, measuring verbal, numerical, and logical reasoning in a dynamic, game-based environment.

Comprehensive technical manuals, fact sheets, and white papers have been written to support these assessments and can be found readily on our website.

How gamification improves cognitive ability measurement

Although gamification is typically associated with a better candidate experience and reduced test anxiety (Leutner et al., 2023), we have good reason to believe that gamification also improves cognitive ability measurement.

Firstly, gamified assessments tend to be very quick, allowing a wider range of aptitudes to be measured per unit of time (Pedersen et al., 2023). This means greater content validity, and ultimately greater predictive validity than traditional assessments.

Additionally, tasks built around game mechanics can be more cognitively complex than traditional item formats, which are more limited. Consequently, gamified tasks can be designed to measure multiple aptitudes simultaneously, boosting validity.

Gamified assessments also allow us to measure aptitudes which are largely inaccessible via traditional assessments. Reaction speed and working memory can’t be measured effectively with static tests; gamification makes it possible, again boosting content validity.

Given the peripheral benefits of gamified assessments, including their resistance to AI-based cheating, we expect gamified and dynamic assessments to become the norm in future.

How Test Partnership measures reliability

Historically, assessments designed using classical test theory (CTT) would rely on internal consistency measures, such as Cronbach’s alpha, Kuder-Richardson Formula 20, or even split-half reliability. While these statistics are adequate for fixed-form assessments, they cannot handle missing data, and are unsuitable for randomised, linear-on-the-fly testing (LOFT), or computer adaptive testing (CAT).
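For context, here is a minimal sketch of Cronbach’s alpha on a complete, fixed-form response matrix (the data below are simulated): the statistic assumes every candidate answered every item, which is precisely why it breaks down under randomised or adaptive delivery.

```python
import numpy as np

def cronbachs_alpha(responses: np.ndarray) -> float:
    """Cronbach's alpha for a complete persons x items score matrix."""
    n_items = responses.shape[1]
    item_variances = responses.var(axis=0, ddof=1)
    total_variance = responses.sum(axis=1).var(ddof=1)
    return (n_items / (n_items - 1)) * (1 - item_variances.sum() / total_variance)

# Synthetic fixed-form data: 200 candidates x 20 dichotomous items.
rng = np.random.default_rng(0)
ability = rng.normal(size=(200, 1))
difficulty = rng.normal(size=(1, 20))
p_correct = 1 / (1 + np.exp(-(ability - difficulty)))
data = (rng.random((200, 20)) < p_correct).astype(int)

print(f"alpha ≈ {cronbachs_alpha(data):.2f}")
```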

Because Test Partnership employs CAT, item response theory (IRT) is required to model our assessments, and thus we rely on a more advanced set of statistics to quantify reliability.

When items are trialled for the first time, CAT cannot be used until their difficulties have been established. Instead, items are administered randomly to participants, ensuring an even spread of completions across the entire item bank. Thousands of participants are required for this process, and Test Partnership ensures that large samples are recruited for trialling. At a minimum, 250–500 participants per item will be recruited, with many items seeing substantially more than this.

Test Partnership uses the Rasch model, a special case of the 1-Parameter Logistic (1PL) model. Rasch analysis allows us both to calculate the reliability of observed data and to forecast the reliability of eventual CATs. Rasch person reliability is a rough analogue of CTT reliability, and is interpreted in the same way (minimum 0.7 standard).

Using the standard deviation of person measures, the number of items to administer, and the item targeting of the eventual CAT, we are able to forecast the level of reliability a CAT will achieve. Using this approach, we ensure that every assessment exceeds the reliability threshold by design.
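The exact forecasting formula is not reproduced here, but a common Rasch-based approximation illustrates the idea: a well-targeted dichotomous item contributes at most 0.25 units of Fisher information, so the forecast standard error after n items is roughly 1/√(0.25n), and forecast reliability is true person variance divided by true person variance plus error variance. The targeting fraction and example values below are assumptions for illustration only.

```python
import math

def forecast_cat_reliability(person_sd: float, n_items: int, targeting: float = 0.8) -> float:
    """Rough Rasch-based forecast of CAT reliability.

    person_sd  -- standard deviation of (true) person measures, in logits
    n_items    -- number of items the CAT will administer
    targeting  -- average item information as a fraction of the 0.25 maximum
                  (1.0 = perfectly targeted items; 0.8 is an illustrative assumption)
    """
    info_per_item = 0.25 * targeting              # Fisher information per dichotomous item
    se = 1 / math.sqrt(info_per_item * n_items)   # forecast standard error of measurement
    return person_sd ** 2 / (person_sd ** 2 + se ** 2)

# Example: person SD of 1.2 logits and a 15-item CAT (values chosen for illustration).
print(f"forecast reliability ≈ {forecast_cat_reliability(1.2, 15):.2f}")
```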

Post-launch, reliability and standard error (SE) are then tracked directly, ensuring that the assessments meet our stringent reliability requirements both in theory and practice.

Our approach to item analysis

Again, historical approaches to item analysis hinged on CTT-based statistics, particularly item-total correlations. Analysts would investigate the point-biserial correlations between question responses and total score, with higher correlations indicating greater levels of internal consistency.
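As a point of comparison, here is a small CTT-style sketch of the corrected item-total (point-biserial) correlation, computed against the total score with the focal item removed; the response data are simulated.

```python
import numpy as np

def corrected_item_total(responses: np.ndarray) -> np.ndarray:
    """Point-biserial correlation of each item with the rest-score (total minus the item)."""
    totals = responses.sum(axis=1)
    correlations = []
    for i in range(responses.shape[1]):
        rest_score = totals - responses[:, i]
        correlations.append(np.corrcoef(responses[:, i], rest_score)[0, 1])
    return np.array(correlations)

# Synthetic 0/1 response matrix: 300 candidates x 10 items.
rng = np.random.default_rng(1)
ability = rng.normal(size=(300, 1))
difficulty = np.linspace(-1.5, 1.5, 10)
data = (rng.random((300, 10)) < 1 / (1 + np.exp(-(ability - difficulty)))).astype(int)

for i, r in enumerate(corrected_item_total(data), start=1):
    print(f"item {i:2d}: corrected item-total r ≈ {r:.2f}")
```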

This approach doesn’t work as well with IRT-based assessments, as total score doesn’t adequately quantify performance. Instead, the Rasch model gives us a range of statistics to help analyse items from a quality control perspective.

Firstly, we investigate potential mis-keys in the dataset, rekeying questions which have been keyed erroneously. Secondly, we examine each item's INFIT statistic, a useful quality control statistic quantifying fit to the Rasch model. Items with overly high INFIT statistics (noisy, underfitting items) will be investigated and either edited for retrial or dropped. This process is repeated until the assessment acceptably fits the Rasch model.
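For readers unfamiliar with Rasch fit statistics, the sketch below shows how infit and outfit mean-squares are conventionally computed for a single dichotomous item from model residuals, given estimated person measures and an item difficulty; the data are simulated placeholders.

```python
import numpy as np

def rasch_item_fit(responses: np.ndarray, abilities: np.ndarray, difficulty: float):
    """Infit and outfit mean-square statistics for one dichotomous item.

    responses  -- 0/1 scores for each person on this item
    abilities  -- estimated person measures (logits)
    difficulty -- estimated item difficulty (logits)
    """
    expected = 1 / (1 + np.exp(-(abilities - difficulty)))  # Rasch model probabilities
    variance = expected * (1 - expected)                     # binomial variance per response
    residuals_sq = (responses - expected) ** 2
    infit = residuals_sq.sum() / variance.sum()              # information-weighted mean-square
    outfit = np.mean(residuals_sq / variance)                # unweighted, outlier-sensitive
    return infit, outfit

# Placeholder data: a well-targeted item answered by 500 simulated candidates.
rng = np.random.default_rng(2)
theta = rng.normal(size=500)
x = (rng.random(500) < 1 / (1 + np.exp(-(theta - 0.2)))).astype(int)

infit, outfit = rasch_item_fit(x, theta, 0.2)
print(f"infit ≈ {infit:.2f}, outfit ≈ {outfit:.2f}")  # values near 1.0 indicate good fit
```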

Additionally, a distractor analysis will be conducted, investigating the effectiveness of each item's distractors. If an item shows pseudo-keys (distractors that attract high-scoring candidates), the item will again be edited for retrial or dropped from the item bank. For polytomous data, category and threshold ordering would also be investigated.
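A simple way to screen for pseudo-keys (an illustrative sketch rather than the exact procedure) is to compare the mean total score of the candidates choosing each option: a distractor whose choosers score as high as, or higher than, those choosing the key deserves review.

```python
import numpy as np

def distractor_analysis(choices: np.ndarray, total_scores: np.ndarray, key: str) -> None:
    """Print the selection rate and mean total score for each answer option."""
    for option in sorted(set(choices)):
        mask = choices == option
        flag = " (key)" if option == key else ""
        print(f"option {option}{flag}: chosen by {mask.mean():.0%}, "
              f"mean total score {total_scores[mask].mean():.1f}")

# Placeholder data: option chosen per candidate and their total test score.
choices = np.array(["A", "B", "C", "B", "D", "B", "A", "C", "B", "A"])
total_scores = np.array([12, 25, 14, 27, 10, 22, 15, 11, 26, 13])
distractor_analysis(choices, total_scores, key="B")
```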

Once a final set of items has been arranged, we would then look to ensure broader fit to the Rasch model. We would look for instances of local dependence, multidimensionality, and excessive guessing in the questions. Differential item functioning (DIF) analysis is also conducted at this stage, helping to identify instances of bias in the questions and allowing us to edit or remove the offending items.
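DIF can be evaluated in several ways; one widely used approach, shown here as a sketch rather than necessarily the method we apply, is the Mantel-Haenszel common odds ratio computed across matched score strata for a reference and a focal group.

```python
import numpy as np

def mantel_haenszel_odds_ratio(correct, group, strata) -> float:
    """Mantel-Haenszel common odds ratio for one item.

    correct -- 0/1 item scores
    group   -- 'ref' or 'focal' for each candidate
    strata  -- matching variable (e.g. rest-score band) for each candidate
    """
    correct, group, strata = map(np.asarray, (correct, group, strata))
    num, den = 0.0, 0.0
    for s in np.unique(strata):
        in_s = strata == s
        a = np.sum(in_s & (group == "ref") & (correct == 1))    # reference, correct
        b = np.sum(in_s & (group == "ref") & (correct == 0))    # reference, incorrect
        c = np.sum(in_s & (group == "focal") & (correct == 1))  # focal, correct
        d = np.sum(in_s & (group == "focal") & (correct == 0))  # focal, incorrect
        n = a + b + c + d
        if n > 0:
            num += a * d / n
            den += b * c / n
    return num / den  # 1.0 means no DIF; ETS delta = -2.35 * ln(odds ratio)

# Placeholder usage with simulated, DIF-free data.
rng = np.random.default_rng(3)
group = rng.choice(["ref", "focal"], size=1000)
strata = rng.integers(0, 5, size=1000)              # score bands
p = 0.3 + 0.1 * strata                              # same difficulty profile for both groups
correct = (rng.random(1000) < p).astype(int)
print(f"MH odds ratio ≈ {mantel_haenszel_odds_ratio(correct, group, strata):.2f}")
```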

Once these have been satisfied, a final set of item difficulties will be generated, and the item analysis stage of development is complete. Investigations into adverse impact, validity, and user experience can then commence, moving to the next stage of development.

Ongoing item bank development

One of the chief advantages of item banking and IRT-based assessment is the ease with which new items can be developed. Items can be trialled alongside an existing item bank and, following a successful item analysis, added to the bank to expand it. This means that questions can be added or removed to improve the overall quality of the item bank.

Adding items enhances the benefits of item banking in general, improving the quality of assessments. In the context of CATs, bigger item banks improve item targeting, ensuring that candidates are given the statistically optimal questions for them. Having a wide range of possible questions increases the probability of this happening, improving measurement.
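To show what “statistically optimal” means under the Rasch model, here is a simplified item selector (ignoring exposure control and content balancing, and using a placeholder item bank): item information peaks where difficulty matches the candidate’s current ability estimate, so the selector picks the unused item closest to the running theta.

```python
import numpy as np

def select_next_item(theta: float, difficulties: np.ndarray, used: set) -> int:
    """Return the index of the unused item with maximum Rasch information at theta."""
    p = 1 / (1 + np.exp(-(theta - difficulties)))
    information = p * (1 - p)                 # Rasch item information, maximal at p = 0.5
    information[list(used)] = -np.inf         # exclude items already administered
    return int(np.argmax(information))

# Placeholder bank of 10 item difficulties (logits) and a running ability estimate.
bank = np.array([-2.0, -1.5, -1.0, -0.5, 0.0, 0.4, 0.8, 1.2, 1.6, 2.0])
theta_hat = 0.6
administered = set()

for _ in range(3):
    nxt = select_next_item(theta_hat, bank, administered)
    administered.add(nxt)
    print(f"administer item {nxt} (difficulty {bank[nxt]:+.1f})")
```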

Test Partnership carefully monitors the quality of its item banks, removing items upon finding evidence of DIF or item drift. Concurrently, new items can be created and added to the item bank, improving the difficulty coverage and improving the content as a whole.

Additionally, new item banks can be spun out of existing ones, for example when creating a bespoke set of ability tests. Rather than reinvent the wheel and start from scratch, Test Partnership can trial a set of items against the original item bank, and then give that expanded item bank exclusively to a particular client. This allows us to rapidly create new bespoke assessments by building on previous work, rather than starting from the beginning.

How Test Partnership Adheres to Item Writing Best Practices

Test Partnership adheres to a range of item writing best practices, helping us to develop quality assessments for selection and development. In particular, a number of key rules underpin our item writing practice, increasing the probability of reliability, fairness, and validity, as well as meeting the assumptions of item response theory (IRT).

Trick questions

We explicitly avoid writing trick questions, and seek to remove them from item banks if they are detected. Trick questions display several psychometric properties which make them unsuitable for use in high-stakes selection.

When it comes to cognitive, knowledge, and skills tests, the objective is to identify people with higher levels of the desired ability, and those are the individuals who should score highest. Trick questions violate this principle, as even the highest performers may answer them incorrectly, weakening item discrimination.

Lastly, trick questions reduce the level of trust between the assessment provider and the test takers. If trick questions are employed, candidates will inevitably feel undermined by the assessment provider, as if they were being set up for failure. Candidates will then second-guess every question they are given, reducing performance and harming the candidate experience.

Unidimensionality

Unidimensionality is a key requirement for IRT modelling (unless explicitly multidimensional), and this is reflected in our item writing guidelines. Multidimensionality violates the assumptions of IRT, as performance on the assessment could depend on variables other than the intended psychological construct.

An example of this could be a numerical reasoning question that requires very specific mathematical knowledge. Ideally, the assessment should measure numerical reasoning, not mathematical knowledge, which would add unwanted variance to the scores. Instead, items should be pure measures of the intended psychological construct, without interference from additional constructs.

To achieve this, questions must focus specifically on the construct of interest, and not require any prior knowledge or experience. For cognitive ability tests, questions should be answerable by anyone with sufficiently high levels of cognitive ability, without requiring any specific knowledge or training.

With personality questionnaire items, avoiding double-barrelled questions is one way we avoid multidimensionality. Double-barrelled questions often end up measuring multiple psychological constructs, weakening the questions and increasing the likelihood of dropping them after item analysis.

Lastly, Test Partnership is of the opinion that multidimensional IRT models are still in their infancy, and are not ready for use in high-stakes selection and assessment. Moreover, given the unidimensional nature of general cognitive ability and the specific aptitudes that underpin it, unidimensional IRT should continue to be the default option for high-stakes psychometric testing.

Avoid local dependence

Local dependence is a psychometric phenomenon where questions interfere with one another. For example, if two questions share a set of otherwise unique distractors, the distractor a candidate chooses on the first question could help them identify the correct answer to the second.

This makes performance on specific questions highly intercorrelated, causing similar issues to multidimensionality. Consequently, local dependence violates IRT’s assumptions, and would require that these questions be dropped.

In practice, candidates who receive the pair of locally dependent items will be given an unfair advantage (or disadvantage) over candidates who do not, reducing the validity of the assessment. This is particularly true for item-banked and adaptive assessments, as is the case for Test Partnership’s tests.

When writing items, we are careful to avoid linking questions together and creating instances of local dependence. Each question should be entirely standalone, even when several questions share a common stem. If questions fail to meet this guideline, local dependence can be diagnosed during item analysis and the offending items removed.
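One common diagnostic for local dependence, sketched below with simulated data, is Yen’s Q3 statistic: correlate the Rasch residuals of each item pair and flag pairs whose residual correlation is markedly positive.

```python
import numpy as np

def q3_matrix(responses: np.ndarray, abilities: np.ndarray, difficulties: np.ndarray) -> np.ndarray:
    """Yen's Q3: correlations between item residuals under the Rasch model."""
    expected = 1 / (1 + np.exp(-(abilities[:, None] - difficulties[None, :])))
    residuals = responses - expected
    return np.corrcoef(residuals, rowvar=False)   # items x items residual correlations

# Placeholder data: 400 candidates x 6 locally independent items.
rng = np.random.default_rng(4)
theta = rng.normal(size=400)
b = np.linspace(-1.0, 1.0, 6)
x = (rng.random((400, 6)) < 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))).astype(int)

q3 = q3_matrix(x, theta, b)
pairs = np.argwhere(np.triu(q3, k=1) > 0.2)       # flag pairs with residual r > .2 (rule of thumb)
print("flagged pairs:", pairs.tolist() if pairs.size else "none")
```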

Readability and reading age

Readability is essential to good assessment design, as candidates cannot properly engage with content they do not understand. Even in assessments which evaluate a candidate's verbal reasoning skills, unnecessary complexity, ambiguity, and verbosity reduce the effectiveness of the assessment; they do not improve it.

To ensure readability, we aim for a level of reading complexity that is appropriate for anyone who has finished secondary education. Even for assessments which are targeted at graduate level or above, the reading level remains at a secondary education level, ensuring that language is simple enough to understand.

Additionally, item writers are advised to avoid being overly specific or under-specific, reducing both verbosity and ambiguity. Content should convey only the necessary information, without omitting anything essential to the item. This means that candidates can read the required information within the time limit, and no key details required for clarity are missing.

Ambiguous items are likely to be flagged during item trialling, and thus removed from the item bank. Similarly, overly specific or verbose items are likely to be flagged, as many participants will fail to answer within the allotted time limit. By following these guidelines, more items will make it through trialling successfully, increasing the size of eventual item banks.
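Reading level can also be checked automatically. The sketch below uses the Flesch-Kincaid grade-level formula with a crude syllable heuristic; the example item stem and the idea of flagging items above secondary-school level are assumptions for illustration.

```python
import re

def count_syllables(word: str) -> int:
    """Very rough syllable count: runs of vowels, minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

item_stem = ("The regional office reported a 12 percent increase in quarterly sales. "
             "Based on the table, which product contributed most to this increase?")
grade = flesch_kincaid_grade(item_stem)
print(f"estimated grade level ≈ {grade:.1f}")  # flag items well above secondary-school level
```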

Varied question topics

The majority of Test Partnership’s assessments are designed to be industry-agnostic, and thus applicable for almost any role. This is the case with our off-the-shelf cognitive ability tests, which employ computer adaptive testing (CAT) to support their use in any context.

Consequently, question topics must be varied, and not be overly skewed towards one industry, topic, sector, or domain. Naturally, overexposure to specific topics could confuse candidates and harm the candidate experience, causing them to believe they have been given the wrong assessment.

That being said, we do want to include some content that is recognisably related to certain industries, roles, and sectors. When writing items, we aim to keep the context as commercially relevant as possible, such that candidates applying to particular industries may well encounter some content which is directly related to their role.

However, Test Partnership does offer fully bespoke assessments, in which all content is designed with a specific role, industry, and company in mind. Questions could all be related to law, finance, consultancy, or any chosen industry. Items could even be about the actual company commissioning the bespoke assessment and its history. This greatly enhances candidate and stakeholder buy-in, and is a strong option for volume assessment.

Full range of difficulties

The benefits of CAT are contingent on having large and varied item banks. Questions of varying difficulties are required to effectively assess candidates across the ability continuum, and assessment reliability is contingent on item difficulty coverage.

As Test Partnership uses the Rasch model for assessments, item difficulties are of key importance when administering items to candidates. Consequently, a wide range of easy, moderate, and difficult items will be required for each assessment, and this is reflected in item writing practice.

Item writers are instructed to write a range of question difficulties, both within item banks and within testlets. Inevitably, item difficulties tend to follow a normal distribution, as does candidate ability. Given that organisations tend to set their pass mark close to the average score, extra precision around the mean difficulty makes logical sense.

However, any gaps in a Wright map can be detected during trialling, and ongoing item bank expansion can improve item coverage over time. This ensures that candidates at the extreme ends of the ability continuum are effectively assessed, increasing reliability, validity, and fairness.
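A simple way to spot such gaps, sketched below with simulated measures and assumed bin width and minimum item count, is to bin person measures and item difficulties on the same logit scale, as on a Wright map, and flag ability bands that contain candidates but few items.

```python
import numpy as np

def coverage_gaps(person_measures, item_difficulties, bin_width=0.5, min_items=3):
    """Flag logit bands that contain candidates but fewer than `min_items` items."""
    person_measures = np.asarray(person_measures)
    item_difficulties = np.asarray(item_difficulties)
    lo = np.floor(person_measures.min())
    hi = np.ceil(person_measures.max())
    edges = np.arange(lo, hi + bin_width, bin_width)
    persons_per_bin, _ = np.histogram(person_measures, bins=edges)
    items_per_bin, _ = np.histogram(item_difficulties, bins=edges)
    for left, n_persons, n_items in zip(edges[:-1], persons_per_bin, items_per_bin):
        if n_persons > 0 and n_items < min_items:
            print(f"band {left:+.1f} to {left + bin_width:+.1f} logits: "
                  f"{n_persons} candidates but only {n_items} items")

# Placeholder measures: candidates normally distributed, items clustered near the mean.
rng = np.random.default_rng(5)
coverage_gaps(rng.normal(0, 1.2, 2000), rng.normal(0, 0.6, 120))
```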

Discuss the science with our psychometricians

Find out more about how our assessment platform is validated through robust science.