European Journal of Mathematics and Science Education
Research Article

Use of Item Response Curves to Evaluate the Validity of the Force Concept Inventory in Secondary Schools in Uganda and Comparison with Other Educational Contexts

Kent Robert Kirya , K. K. Mashood , Lakhan Lal Yadav

  • Pub. date: June 15, 2025
  • Online Pub. date: May 07, 2025
  • Pages: 79-95

Abstract:

This study examines the validity of the Force Concept Inventory (FCI) in Ugandan secondary schools using Item Response Curves (IRCs) and provides a comparative evaluation of its effectiveness across different educational contexts. The survey focused on Senior Four students preparing for the Uganda Certificate of Education (UCE) examinations, with a representative sample of 941 students (aged 15–17) selected through a multi-stage sampling technique. The initial analysis employed Classical Test Theory (CTT) metrics before the detailed analysis of IRCs for the FCI items. CTT evaluates item-level and whole-test statistics such as item difficulty, discrimination index, and reliability. The CTT indices revealed that the FCI was highly challenging, with an average score of 5.76 out of 30 and a low reliability coefficient (α = 0.15). Additionally, 73.3% of the items showed poor discrimination, and some distractors were ineffective. The detailed analysis of IRCs showed that several FCI items are inefficient in the context of the Ugandan education system. The IRCs also demonstrated a widespread choice of distractors for many items, with overall scores falling below the threshold indicative of a generally agreed-upon understanding of Newtonian physics. Comparative analysis with studies from other global contexts suggests that language barriers, curriculum differences, and instructional methods influence student performance. These findings underscore the necessity of adapting the FCI tool to better fit local educational contexts and implementing additional instructional strategies to enhance conceptual understanding. A more culturally and contextually adapted diagnostic tool may improve physics education and better assess students’ conceptual comprehension of force and motion within the region.

Keywords: Concept inventories, conceptual understanding, cross-cultural comparison, force concept inventory, item response curves.


Introduction

A concept inventory is a set of multiple-choice questions (MCQs) used to diagnose students’ alternative conceptions and their conceptual understanding of a physics topic. Each MCQ is built around an essential physics idea, with one correct answer and three to four distractors based on students’ alternative conceptions (Sands et al., 2018). Several studies have tried to identify and explain students' alternative conceptions, which are inconsistent with the views accepted by the scientific community, impede learning, and hinder students from becoming experts (Bani-Salameh, 2016; McDermott & Redish, 1999; Savinainen et al., 2005; Smith et al., 1994). Researchers and physics educators take great care when thinking about the situations to present and when developing plausible distractors that represent a range of partially correct understandings and students' alternative conceptions for an inventory.

Despite its widespread adoption, the validity and applicability of the FCI can vary significantly across different educational systems. Studies conducted in the United States, China, and Finland have revealed differences in students’ performance based on factors such as instructional methods, curriculum content, and linguistic influences (Bao & Redish, 2006; Hake, 1998; Savinainen et al., 2005). For instance, research in China has shown that students excel in procedural problem-solving but struggle with conceptual understanding, while studies in the United States indicate that interactive engagement methods can reduce, but not eliminate, misconceptions about Newtonian mechanics. However, there is limited research on how the FCI functions in developing countries, particularly in Africa, where instructional approaches often emphasize rote memorization and teacher-centred methods. Given these pedagogical differences, it is crucial to evaluate how the FCI performs in different educational contexts to determine whether conceptual difficulties are universal or context-dependent. By comparing students' responses across various global settings, including Uganda, this study aims to assess the effectiveness of the FCI as a diagnostic tool and explore the need for localized adaptations to enhance its validity and reliability. Such insights will contribute to the broader effort of improving physics education worldwide by ensuring that assessment instruments are both culturally and pedagogically relevant.

Despite the widespread use of the FCI as a diagnostic tool in physics education, its validity and appropriateness for diverse educational settings remain underexplored, particularly in developing countries such as Uganda (Kirya et al., 2021a). The effectiveness of any assessment tool depends on its ability to accurately measure students’ conceptual understanding within specific cultural and educational contexts. The Ugandan secondary school curriculum differs significantly from those in Western educational systems, where the FCI was initially developed. The structure and content of a national physics curriculum play a significant role in shaping how students understand and internalize fundamental concepts such as force and motion. The Ugandan secondary school curriculum, as outlined by the National Curriculum Development Centre (NCDC, 2019), places a strong emphasis on theoretical knowledge and routine exercises but provides limited opportunities for hands-on, interactive engagement and inquiry-based learning. The mismatch between Uganda’s dominant instructional methods and the pedagogical environments in which the FCI was originally designed raises concerns about its validity in this context.

Although force and motion are well-known concepts in science, with applications in everyday life, most students of all ages, regardless of race, have been found to hold alternative conceptions that conflict with Newtonian concepts (Laverty & Caballero, 2018; McDermott & Redish, 1999). Multiple-choice constructs (items) have proved to be useful tools for evaluating students' alternative conceptions, which is why the FCI was chosen as the diagnostic instrument in this study. Although the FCI has been administered and analyzed in many countries across the globe, no such survey has been conducted in the Ugandan education context. Establishing the suitability of the FCI in this context is therefore critical, and this study uses item response curves (IRCs) to determine the inventory’s appropriateness.

Item Response Curves (IRCs), which give specific information about how students answer individual items, provide a reliable technique for evaluating the validity of assessment instruments like the FCI. IRCs can highlight trends in student answers that clarify whether an item is misinterpreted or whether a concept is persistently difficult for a variety of student populations (Morris et al., 2006, 2012). This analytical method is especially useful in situations where the educational environment may be very different from the one in which the evaluation instrument was created. By applying IRCs to the FCI data gathered from secondary schools in Uganda, this study assesses the tool's efficacy in gauging students' comprehension of Newtonian physics in this setting. By plotting response patterns across different ability levels, IRCs reveal whether high-performing students consistently select correct answers while lower-performing students choose distractors. This approach helps identify problematic items, such as those that fail to differentiate between students’ abilities or are frequently misinterpreted. Given the linguistic and instructional differences in Uganda’s education system, IRCs can determine whether FCI items are culturally or conceptually misleading. Through IRC analysis, this study addresses the following objectives and research questions:

Research Objectives

To analyze the validity of the FCI in Ugandan secondary schools using IRCs.

To compare insights from the FCI in Ugandan secondary schools with those from other global educational contexts.

Research Questions

How do IRCs reveal the effectiveness of the FCI in assessing students’ understanding of Newtonian mechanics in Ugandan secondary schools?

How does the performance of Ugandan secondary school students on the FCI compare to that of students from other educational and cultural settings?

This study is important for enhancing physics instruction in Uganda because it will offer evidence-based suggestions for applying the FCI and offer adjustments to improve its accuracy and relevance. Teachers can create focused interventions to enhance learning outcomes by better understanding students' conceptual difficulties and ensuring that assessment instruments are valid and reliable. Revealing students' alternative conceptions and patterns will make it possible to create assessment instruments that will provide insight into students' thinking and a relationship between concept items and their regulated notions. Additionally, aligning assessment instruments with Uganda’s cultural and instructional realities will contribute to broader efforts to strengthen science education in developing regions (Kirya et al., 2021a, 2022).

Literature Review

The FCI has been used in a variety of ways; it has helped educators and researchers examine their teaching-learning practices and instruction (Kirya et al., 2021b). Problem-solving skills, attitudinal measures, and physics knowledge frameworks correlate with post-instruction FCI performance (Stoen et al., 2020). Hestenes and Halloun (1995) have suggested that the FCI can also be utilized as a placement examination, but some studies have concluded otherwise (C. Henderson, 2002).

The FCI’s significance as a vital instrument in physics education research and practice is reinforced by several studies conducted in different countries and cultures. For example, a study conducted in Malaysia validated the instrument's capacity to differentiate between different levels of student comprehension (Murshed et al., 2020), showing that high school physics students’ conceptual comprehension of force and motion can be measured with the FCI with high item reliability. Students throughout the world share similar misconceptions about basic physics concepts like force, despite variations in their performance. One common misconception, which violates Newton's first law of motion, is that an object requires a constant force to remain in motion. Research has shown that students in a variety of educational settings, including the USA, China, and other countries, hold this misconception (McDermott & Redish, 1999). It endures because of everyday experiences that reinforce intuitive but incorrect notions about motion and force (Hestenes et al., 1992). The Alternative Mechanics Survey (AMS) provides an open-ended approach to assessing student understanding, offering an alternative to traditional multiple-choice assessments while still ensuring efficient automated scoring (Parker et al., 2023). It demonstrates that free-response versions of Force Concept Inventory (FCI) questions work effectively.

Several studies point out shortcomings of research-based assessments, particularly the FCI. According to research carried out by Madsen et al. (2016), many faculty involved in the study opined that research-based assessments "do not measure many of the things they care about, or are not applicable in their classes." Laverty and Caballero (2018) analyzed whether the FCI and other common physics CIs align with the "framework of three-dimensional learning" (3DL), which combines "scientific practices", "core ideas" and "cross-cutting concepts". They found that the FCI mainly addresses one-dimensional questions (related to "core ideas") at 63%, with a small fraction (17%) addressing two dimensions and 20% addressing no dimensions. Le et al. (2025) identified and examined student models of four cross-content skills in physics, namely conceptual relationships, applications of algebra, vectors, and visualizations, in the context of the FCI and other mechanics CIs, and found that these CIs do not sufficiently address all of these skills. A meta-analysis of thirty-eight international studies examined gender differences in the FCI and found a significant difference in mean FCI scores in favour of males (Santoso et al., 2024).

Countries differ greatly in how well their high school physics students perform because of differences in curricula, methods of instruction, and evaluation techniques. Furthermore, Morley et al. (2023) claim that research-based assessments (RBAs) lack validation across different populations. This conclusion was reached by examining the measurement invariance of the Eaton and Willoughby five-factor model of the FCI across gender and racial intersections. Establishing measurement invariance across gender and race demonstrates that the FCI can accurately assess latent variables like Newtonian thinking and physics identity across groups. This validation supports the application of the FCI in equity research in physics education. The authors emphasize that intersectionality matters in equity research and that more validation is needed to make sure evaluations do not mask injustices (Morley et al., 2023).

Indian pre-university students showed a bimodal performance distribution on some of the well-known concept inventories in physics, unlike the single-peak patterns observed in American and Chinese students (Mashood & Singh, 2019). This difference stems from two learning experiences: the integrated mode (IM), where schools collaborate with private coaching centres, and the non-integrated mode (NIM), which lacks such support. IM students performed better, while NIM students had weaker conceptual understanding, highlighting educational inequalities driven by economic disparities. Despite this, Indian students' scientific reasoning skills were comparable to those of their American and Chinese peers, suggesting that content knowledge does not directly impact reasoning ability. Additionally, academic self-efficacy is influenced by students' immigrant backgrounds and social environments. In Germany, students of the former Soviet Union and Turkish descent rely more on verbal and social encouragement outside school, whereas non-immigrant students show a stronger school-based self-efficacy link. This emphasizes the role of cultural background in shaping self-efficacy and the need for culturally responsive teaching (Gebauer et al., 2021).

Such validation initiatives, by improving the efficacy and equity of physics education, will eventually aid the creation of more inclusive teaching strategies and evaluation instruments. Rigorous validation procedures in educational evaluations promote fair learning environments.

Approaches to addressing these issues differ between nations and educational frameworks. For example, a study in Japan that used a modified item response curve (MIRC) for one FCI item found that the item did not adequately assess students’ understanding of atmospheric pressure (Shoji et al., 2021).

To reduce the test time, Han et al. (2016) validated half-length FCIs (HFCIs), which could be used when the main assessment goals are total scores or score gains. Yasuda et al. (2021) used computerized adaptive testing for the FCI (FCI-CAT) to shorten the test time by more than 50%. Parker et al. (2023) developed and tested a modified version of the force concept inventory, which used computer-marked free-response items.

The FCI is written in plain English with no scientific or mathematical terminology. However, it has not been well received in the African context, and little effort has been made to develop and validate physics CIs in Africa (Kirya et al., 2022; Mbwile & Ntivuguruzwa, 2023). Moreover, despite being the least computationally demanding of the multiple-choice question analysis approaches, the IRC methodology has yet to be empirically tested in Africa, specifically in Uganda, in the context of the FCI (Kirya et al., 2021b). An IRC is a graph that displays, for test-takers at various points in the ability range of an assessment tool, the proportion of correct and incorrect responses to an item. The practice of analyzing items for physics topic knowledge is now widely used in physics education (Reyes & Rakkapao, 2020). Several researchers use different techniques to examine concept inventories, including the FCI, to understand their respective domain ideas. The following literature review situates these PER trends and explores the IRC approach to item analysis.

Morris et al. (2006) provided a straightforward method for assessing multiple-choice items and their responses that goes beyond the conventional metrics of difficulty and distractor effectiveness. By examining student responses to FCI items, Morris et al. created an IRC analysis and demonstrated how to utilize it. The strategy uses item response theory as a foundation for educational measurement and entails the construction and qualitative analysis of IRCs. Researchers and educators use this item analysis and assessment technique to enhance the test items currently used to gauge student comprehension and to develop better tests suited to various student ability levels. IRC analysis is considered a useful technique for assessing multiple-choice items like those found on diagnostic tests, and it helps researchers create and assess items and response options used to test specific ideas. Morris et al. (2012) evaluated the FCI after the IRC concept was established in 2006, indicating that the technique is easy to apply and provides similar results when examining item performance. IRC analysis allows for more insights than more advanced and complicated item response theory investigations since it characterizes both correct and wrong answer options, going beyond the usual dichotomous-scoring approach of item response theory.

Morris et al. (2012) claim, using the FCI as an example, that IRC analysis offers detailed information to complement other approaches and better represents the FCI creators’ objectives. Student interviews and teacher experiences are used to tailor the FCI distractors to the specific physical misunderstandings that have been found in the population. The IRC analysis provides examples illustrating how different distractors appeal to students of various levels in an easy-to-understand graphical format (Morris et al., 2012). Because distractors are never equally "wrong," some erroneous responses imply a more profound understanding than others; treating them differently, in effect awarding partial credit, allows ability to be assessed more sensitively. Therefore, it is crucial to assess the effectiveness of items and other response options in terms of how well they gauge student understanding and misunderstandings. Morris et al. contend that IRC analysis is among the most suitable methods for formative assessments such as concept tests in PER. Researchers and educators can use IRC analysis to create a test bank of conceptual assessments that is free of flawed items and answer choices, enabling a more thorough and equitable evaluation of students.

Mashood (2014), who had followed the preceding researchers’ theoretical arguments, agreed that learning from IRCs was beneficial (Morris et al., 2006). The IRCs for all 39 items are displayed; the right choice curves for all items have a positive slope and are favourably associated with ability level (total score). The level of difficulty of the items and their discriminating value are vividly shown, providing visual confirmation of the insights acquired through statistical analysis. The IRCs helped dig deeper into the quality of a few items for which the indices were not working. These items’ IRCs revealed beneficial attributes that might have otherwise been overlooked.

Student perceptions of force and motion are shaped by their own experiences, and students' physics education varies depending on language, culture, and educational system (Ishimoto et al., 2017). Japanese and American students' perspectives on force and motion have been compared using IRCs of the Force and Motion Conceptual Evaluation (FMCE). The comparison is warranted because the FMCE instrument was developed utilizing studies of American students exploring their perspectives on force and motion; as a result, the items may operate differently for Japanese students. IRCs were utilized by Ishimoto et al. (2017) to investigate item-by-item comparisons thoroughly. When the proportion of each response was displayed as a function of the overall score, individual items were seen to function well across all performance levels. Although a few IRC plots showed differences between populations, most IRCs displayed patterns that were common to both correct and incorrect answers. Students often behave the same way when interacting with FMCE items, despite differences in culture, language, and educational background.

Richardson et al. (2021) used an independent dataset of more than 6,500 American students' FMCE responses to generate IRCs and assess the claims made by Ishimoto et al. (2017). This demonstrated that both Japanese and American students are equally prone to select incorrect responses to items. After converting the IRCs to vectors and performing a quantitative comparison of each answer option using dot-product analysis, Ishimoto et al.'s (2017) results were found to be consistent for the majority of FMCE items. A pedagogical benefit of IRCs is that they are simple to build and can be examined qualitatively and intuitively for diverse populations (Morris et al., 2006, 2012). They offer data on all potential responses and on student thinking across all ability levels. When the survey instrument used to collect the data is translated, IRCs can be used to compare data from very different populations (Ishimoto et al., 2017). By establishing whether the translated assessment yields equivalent results, a closer look at item functioning assists the test validation process.

IRCs have been used more frequently in the validation process for learning tests such as the FCI. They give a graphical depiction of how students with varying ability levels respond to specific test items, providing insights into how these items work with a range of student demographics. IRCs can be useful in locating aspects of the FCI that are problematic or might not work as planned for various student groups (Wang & Bao, 2010). The technique has proved especially helpful in identifying linguistic or cultural biases that could compromise the FCI’s accuracy in non-Western contexts. To investigate whether the FCI is a reliable instrument for evaluating Ugandan students' comprehension of Newtonian physics, this study builds on earlier research by using IRCs to analyze FCI data from secondary schools in Uganda.

Apart from the technical features of IRCs and their implementation, it is crucial to consider the broader background of physics education in Uganda. The education system in Uganda faces distinct challenges, such as large class sizes, limited resources, and varying degrees of teacher preparation. These elements may affect how well students engage with and comprehend physics material, which in turn may affect how well they score on tests such as the FCI. Prior studies have drawn attention to the necessity of culturally sensitive teaching methods that take into consideration the distinctive features of Ugandan classrooms (Opolot-Okurut, 2010). The purpose of this study is to determine whether the FCI, which was created in a different educational and cultural setting, is capable of accurately capturing Ugandan students' grasp of motion and force.

In conclusion, this review highlights the necessity of contextualizing educational assessments within the cultural and academic contexts in which they are used. Although the FCI has been validated in several foreign contexts, a thorough assessment of its validity in the Ugandan education system is necessary because of that system's distinct features. This work intends to contribute to the larger area of physics education research by using IRCs to examine FCI data from secondary schools in Uganda, shedding light on how assessment instruments can be modified or validated to ensure their efficacy in various educational situations. This study has implications for educational evaluations used in non-Western cultures more generally, as well as for the FCI’s use in Uganda.

Methodology

Population and Sampling Technique

This survey focused on senior four students in Ugandan secondary schools, including both male and female students. The FCI received responses from a representative sample of senior four UCE (Uganda Certificate of Education) students aged 15 to 17. A multi-stage sampling technique was used to ensure a diverse and representative sample. Six secondary schools were randomly selected from each of Uganda's four geographic regions (Creswell, 2014). At each of the 24 selected schools, classroom physics instructors randomly chose 60 students to complete the FCI, resulting in a total of 1,440 students. The study addressed potential sampling biases, such as gender and geographic imbalance, by ensuring both genders were represented and schools were randomly chosen across regions. This approach aimed to reflect the diversity of the student population and enhance the generalizability of the results. By involving physics instructors in the random selection of students, the study minimized selection bias, ensuring a representative and unbiased sample, which strengthened its credibility.

Students’ Newtonian Mechanics Background

The Ministry of Education and Sports, through the National Curriculum Development Centre (NCDC, 2019), develops Uganda's physics syllabus. The syllabus is regularly reviewed and updated to align with global standards and recent advancements in the field. The main objective is to create a society that values physics, which includes understanding both natural and artificial phenomena as well as students' interpretations. One of the general topics in the UCE teaching syllabus is Newtonian mechanics. The subtopics that define its scope include measurements; vector and scalar quantities; states of matter; introduction to forces; Newton’s laws of motion; density; turning effect of forces and centre of gravity; machines; motion; linear momentum; friction between solids; mechanical energy; work, energy and power; pressure; properties of matter; Archimedes' principle; fluid flow; properties of materials under stress; and structures. These subtopics appropriately address the material covered by the FCI of Hestenes et al. (1992), described in the following section.

Force Concept Inventory

The FCI is a collection of 30 multiple-choice items frequently used in introductory physics courses to assess conceptual learning and students' conceptual understanding of Newtonian mechanics (Hestenes et al., 1992). Five response options are provided for each item, one of which corresponds to the correct physical conception and the other four to typical unscientific alternative explanations. To assess the construct and content validity of the FCI items in Uganda, the authors engaged eight physics lecturers (as experts) from Ugandan universities. We ensured that these experts validated the content using a systematic approach that considers the key components of content validity (Rusticus, 2014). The experts evaluated the FCI items to determine whether they addressed the ideas that the instrument was intended to cover. In this procedure, the FCI's theoretical construct was compared with the force concept learning objectives of the physics curriculum.

The FCI was administered in its original form because English is the primary medium of instruction in Ugandan schools. Content validity was assessed using the following criteria: i) clarity of phrasing, ii) suitability of items, iii) use of acceptable English language, iv) deletion of small words or phrases, v) item display format, and vi) clarity of instructions (Creswell, 2014). The panel of national experts deemed the FCI test instrument’s content validity to be suitable. The physics experts agreed that the test is valid and comprehensive enough to be used as a genuine assessment of force ideas in classroom physics, and all the FCI items were deemed appropriate and related to the objectives of the Ugandan physics education syllabus.

Administration Procedures

Several crucial steps were taken in the validation process to ensure the Force Concept Inventory (FCI) was appropriate for the local context before its administration to senior four students in Ugandan secondary schools. First, a thorough review of the FCI was conducted, examining its relevance to the Ugandan curriculum and the students' level of understanding of physics concepts. Expert validation was sought from five physics educators and three educational officers from the NCDC to ensure that the items accurately assessed the intended concepts. The total number of FCI items was maintained, and a few minor adjustments were made based on the findings to improve the clarity and relevance of the items before the final administration of the FCI to the target population.

The study treated the FCI as a post-test, and it was administered to the students by the physics teachers in each of the selected schools. In the studied schools, the physics teachers contributed to the development of a secure and safe testing environment. The teachers instructed students to answer the MCQs according to what they thought was right. Students were advised not to copy their peers' work and were discouraged from doing so through seating arrangements that followed the Uganda National Examinations Board's (UNEB) standards and guidelines. We did not adhere rigorously to the instrument's time limit. We gathered all the response forms from participating students in the corresponding sample schools nationwide for analysis. Before being arranged for coding and analysis, the survey responses were scanned and cleaned; response sheets that were not fully completed were eliminated during this procedure. Out of the 1,440 students who participated, 941 provided complete responses, which the authors coded and analyzed. The incomplete responses were largely due to students having difficulty understanding the FCI questions, especially since they had not taken this kind of assessment before. The data analysis techniques are explained in the next section.

Analysis of Data

The ITEMAN software was used to perform a statistical analysis of the FCI response data. When run, the software generates descriptive statistics, providing key output variables that help assess the quality of the test items. The analysis focused on specific item characteristics, including the difficulty level, which indicates how easy or hard each item was for students, and the average score, which shows the overall performance on each item. The standard error of measurement was also calculated to assess the accuracy of the test scores. The efficiency of the distractors (incorrect answer choices) was analyzed to determine how well they attracted incorrect responses, and the reliability coefficient was calculated to assess the overall consistency of the test. Additionally, the discriminatory power of each item was determined to evaluate how well each item distinguished between high- and low-performing students. These parameters helped ensure that the FCI items were suitable for measuring students' understanding effectively; educational measurement theory uses these criteria to determine the appropriateness of a test item (Al-Khadher & Albursan, 2017). The IRC technique used in this study is a valuable research tool for constructing and evaluating test items and response options by displaying the qualities of each alternative across total scores. It was employed to assess the FCI items by plotting, for each potential response option, the percentage of students at each ability level against their total scores (Mashood, 2014; Morris et al., 2006, 2012; Reyes & Rakkapao, 2020).
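Since ITEMAN's internal computations are not reproduced in the paper, the following is a minimal sketch of the CTT summary statistics reported in Table 1, assuming the responses have already been scored dichotomously into a students-by-items matrix; the function name and data layout are illustrative assumptions, not ITEMAN's interface.

```python
import numpy as np

def ctt_summary(scored):
    """Basic CTT statistics from a 0/1 scored matrix (rows: students, columns: items)."""
    totals = scored.sum(axis=1)                    # each student's total score
    n_items = scored.shape[1]
    item_var = scored.var(axis=0, ddof=1)          # variance of each item
    total_var = totals.var(ddof=1)                 # variance of total scores
    alpha = (n_items / (n_items - 1)) * (1 - item_var.sum() / total_var)  # Cronbach's alpha
    sd = totals.std(ddof=1)
    return {
        "mean": totals.mean(),                     # mean total score
        "mean_p": scored.mean(),                   # mean item difficulty (proportion correct)
        "sd": sd,
        "max": totals.max(),
        "min": totals.min(),
        "alpha": alpha,
        "sem": sd * np.sqrt(1 - alpha),            # standard error of measurement
    }
```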

Results

At the initial stage of the study, the Force Concept Inventory (FCI) was assessed using Classical Test Theory (CTT), which considers multiple statistical indices to evaluate the effectiveness of test items. The CTT analysis focused on four key metrics: item difficulty (p), discrimination index, reliability coefficient (alpha), and efficiency of distractors. To assess the appropriateness of the FCI items and response alternatives, we further examined IRCs. Table 1 presents a statistical summary of students’ performance on the FCI test.

Table 1. Summary of the Scores

Statistic                  Scored Items
Number of Students (n)     941
Items                      30
Mean                       5.76
Mean P                     0.19
SD                         2.28
Max Score                  25
Min Score                  0
Alpha                      0.15
SEM                        2.1

Table 1 summarizes the analysis of the 30-item test taken by 941 students, a large sample size that makes the results more reliable. On average, students answered only 5.76 items correctly, with 19% of the test items answered correctly overall, indicating that the test was very challenging and that students had difficulty understanding the concepts. Most scores were clustered near the average, although the range from 0 to 25 shows wide variation in performance.


Figure 1. Marks distribution of Students' Scores

One student obtained the highest score, with 25 FCI items answered correctly, which is an outlier. Since it does not correspond with the overall pattern shown among the other students, this score is not displayed in Figure 1.

Figure 1 reveals that a score of 5 was obtained by the highest percentage of students, which is much lower than the desired average. The overall distribution and the concentration of scores at the lower end indicate that most students found the FCI items challenging. This suggests that there may be gaps in the students’ conceptual comprehension of the force and motion concepts being assessed.

Analysis of Student Response Patterns

Students had difficulty selecting the right answers from the available multiple-choice options (see Table 2). For each item on the FCI, the percentage distribution of student responses among the five multiple-choice options (A, B, C, D, and E) is displayed in Table 2. Analyzing these percentages provides insight into how students comprehended or misunderstood the concept items being examined.

Table 2. Percentage of Responses to the FCI Items

FCI alternatives for Items 1–15 (left) and Items 16–30 (right)
Item A B C D E Item A B C D E
1 23 22 16 25 14 16 21 24 28 20 07
2 14 28 19 29 10 17 28 16 18 24 14
3 13 42 17 09 19 18 13 22 30 21 14
4 46 22 08 13 11 19 25 15 14 33 13
5 13 16 42 15 14 20 30 25 27 08 09
6 36 36 11 09 09 21 17 42 32 13 06
7 22 32 12 11 23 22 17 35 15 24 09
8 23 28 10 20 19 23 21 12 27 35 05
9 14 27 20 12 28 24 24 19 28 15 14
10 11 21 14 43 11 25 19 29 18 22 12
11 12 24 34 20 10 26 21 28 25 16 10
12 10 32 25 09 24 27 31 28 23 10 08
13 18 38 28 09 07 28 17 33 18 27 10
14 43 33 10 11 03 29 25 30 22 10 13
15 17 29 30 20 04 30 12 25 28 19 16

Note. The underlined bold figure corresponds to the correct alternative

There is a noticeable spread of responses among most of the options for most of the items, suggesting that students' choices were frequently split (Table 2). For example, in Item 2, responses were almost evenly distributed across B (28%), D (29%), and C (19%), indicating a lack of clarity in student understanding. Similarly, for Item 6, 36% of students selected A and another 36% selected B, suggesting the presence of shared misconceptions. Several other items show popular alternatives that are in line with students' misconceptions. In Item 14, for instance, a sizable proportion (43%) selected option A, while 33% selected option B, and the remaining options were largely disregarded, implying that many students shared a common misconception. Item 19 likewise shows a dominant answer, with 33% of students selecting D, indicating a shared misconception.

On the other hand, some items, like Item 21, have a significant concentration of right answers (the correct answer B of Item 21 received 42% of responses). This pattern suggests that many students have a good understanding of the concept. The distribution of responses across items, however, indicates that students did not always have an easy time with the inventory. They frequently divided their answers among several options instead of settling on the right response, which suggests that there are common misconceptions or challenges regarding the concepts being tested.

The CTT Indices Analysis

Difficulty Level

The difficulty level is determined by the percentage of students who choose the correct option in response to a test item. In this study, we calculated the item difficulty index using the Ding and Beichner formula (Ding & Beichner, 2009; Mashood & Singh, 2019). Furthermore, the absence of items with an easy difficulty index (0.75 to 1.00) suggests that the test did not include any item that could be used to accurately measure students' foundational knowledge or give a gradient of difficulty (see Figure 2).
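As an illustration of the proportion-correct notion of difficulty described above, a brief sketch follows; the difficulty bands mirror those referred to in this paper, and the exact Ding and Beichner formulation is not reproduced here.

```python
def difficulty_indices(scored):
    """Item difficulty p: fraction of students answering each item correctly."""
    return scored.mean(axis=0)

def difficulty_band(p):
    """Bands referred to in this paper: < 0.30 difficult, 0.30-0.70 moderate (desired), 0.75-1.00 easy."""
    if p < 0.30:
        return "difficult"
    if p <= 0.70:
        return "moderate"
    if p >= 0.75:
        return "easy"
    return "between moderate and easy"
```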


Figure 2. Difficulty Index of the FCI item

The Discrimination Index

The discrimination index assesses how well an item separates students in the top quartile from those in the bottom quartile. We used the students' scores with Ding and Beichner’s formula to calculate the discrimination index of the FCI items. A discrimination index of at least +0.3 is generally considered acceptable, and higher values are desired (Ding & Beichner, 2009; Mashood & Singh, 2019). However, the discrimination index alone does not always reflect item quality: items with discrimination indices below 0.30 may still be considered if their other CTT indices are acceptable. Several factors can contribute to an item's low discriminating power (Ding & Beichner, 2009; Wu et al., 2016). In this context, twenty-two FCI items (73.3%) had a discrimination index of less than 0.3, and eight items had a discrimination index below 0.0 (see Figure 3). A negative discrimination index means that a student who typically performs worse than average is more likely to answer the item correctly than a student who typically does well. This is an unexpected outcome, and it usually indicates that the item is flawed in some way (or that the students were simply guessing).
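A minimal sketch of a quartile-based discrimination index of the kind described above, assuming the same 0/1 scored matrix; the quartile grouping follows the conventional approach and may differ in detail from the exact formula used in the analysis.

```python
import numpy as np

def discrimination_index(scored, item, frac=0.25):
    """Discrimination of one item: difference in correct-response rates
    between the top and bottom quartiles of total scorers."""
    totals = scored.sum(axis=1)
    order = np.argsort(totals)                     # students sorted by total score
    k = max(1, int(frac * len(totals)))            # quartile group size
    bottom, top = order[:k], order[-k:]
    return scored[top, item].mean() - scored[bottom, item].mean()
```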


Figure 3. Discrimination Index of the FCI items

Reliability and Standard Error

The test’s reliability coefficient (alpha = 0.15) was significantly lower than the acceptable threshold for educational assessments. The standard error of measurement (SEM = 2.1) also suggests that observed scores contain a considerable amount of random error, further reducing test validity. The FCI test results revealed much lower values of different CTT indices than their desired values (Wu et al., 2016).
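The reported SEM is consistent with the standard CTT relation between the score standard deviation and the reliability coefficient; as a check against the values in Table 1, assuming that relation,

\[
\mathrm{SEM} = \mathrm{SD}\sqrt{1-\alpha} = 2.28\sqrt{1-0.15} \approx 2.10 .
\]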

Effectiveness of Distractors

Distractor efficacy is measured as part of the item analysis process, which inventory developers use to determine the credibility and functionality of the distractors. In this study, we calculated the frequency of students' responses to each FCI item alternative (see Table 2). Three distractors, corresponding to three items (Items 14, 15, and 23), were inappropriate for use in the concept inventory; two of these were chosen by less than 5% of students, suggesting that even students with lower ability levels did not consider these alternatives plausible. Such nonfunctional distractors lower the item's overall quality because they do not help the item distinguish between students who understand the concept and those who do not. When distractors are too simple to identify as incorrect, they may unintentionally make it easier for students to guess the correct answer, lowering the item's discrimination index.
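To illustrate the kind of distractor check described above, here is a short sketch that flags distractors selected by fewer than 5% of students; the data layout (a students-by-items array of chosen option letters) and the answer-key variable are illustrative assumptions.

```python
import numpy as np

def weak_distractors(responses, key, threshold=0.05):
    """Return {item number: [distractors chosen by less than `threshold` of students]}."""
    flagged = {}
    for i, correct in enumerate(key):
        weak = [opt for opt in "ABCDE"
                if opt != correct and np.mean(responses[:, i] == opt) < threshold]
        if weak:
            flagged[i + 1] = weak                  # 1-based item numbering, as in Table 2
    return flagged
```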

Item Response Curves (IRC) Analysis

We describe our IRC analysis method and apply it to examine all 30 FCI test items in this educational setting (Mashood, 2014; Morris et al., 2006). Plotting the curves for the FCI items involved comparing the proportion of students who selected each response choice with the overall score. IRCs are used to evaluate the effectiveness of questions and their response choices and to present visual evidence for each FCI item. To test effectiveness, we mainly used discrimination and the performance of the item response choices. An answer choice is considered sufficiently discriminating when its IRC changes rapidly across student ability levels, which can be seen from its shape, and when the choice is selected by a considerable number of students. We employed the logistic response function as a model to conduct a goodness-of-fit study of the correct responses to the FCI items (Kucharavy & De Guio, 2015). To construct an IRC graph for an FCI item, we first sorted the students according to ability (total score) and then determined, at each ability level, the percentage of students choosing each response option. To present visual evidence for the FCI items studied, we categorized items as efficient, moderately efficient, or inefficient, following the definitions of Morris et al. (2006). Efficient items are those whose answer choices are discriminating and are chosen by a considerable number of students; that is, high-ability students perform well and low-ability students perform poorly, resulting in a steep, well-defined curve for the correct response. Moderately efficient items show some discrimination but with a less pronounced curve, indicating overlap between high- and low-ability students. Inefficient items have flat or poorly defined curves, suggesting they do not effectively distinguish between different ability levels. In our study, we selected FCI Items 1, 4, 9, and 21 for detailed analysis, as illustrated in Figures 4–7.
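As a concrete sketch of the IRC construction described above, assuming responses are stored as a students-by-items array of option letters and the answer key is available; the two-parameter logistic fit mirrors the goodness-of-fit step in spirit, though the exact model specification used in the study is not reproduced.

```python
import numpy as np
from scipy.optimize import curve_fit

def item_response_curves(responses, key, item):
    """For one item, the proportion of students at each total score choosing each option."""
    scored = (responses == np.array(key)).astype(int)
    totals = scored.sum(axis=1)                    # ability proxy: total FCI score
    levels = np.arange(totals.min(), totals.max() + 1)
    curves = {}
    for option in "ABCDE":
        curves[option] = np.array([
            np.mean(responses[totals == s, item] == option)
            if np.any(totals == s) else np.nan
            for s in levels
        ])
    return levels, curves

def logistic(score, slope, location):
    """Two-parameter logistic model for the correct-response IRC."""
    return 1.0 / (1.0 + np.exp(-slope * (score - location)))

# Illustrative goodness-of-fit check for the correct option of one item:
# levels, curves = item_response_curves(responses, key, item=0)
# correct_curve = curves[key[0]]
# mask = ~np.isnan(correct_curve)
# params, _ = curve_fit(logistic, levels[mask], correct_curve[mask], p0=[1.0, levels.mean()])
```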


Figure 4. IRCs of the FCI Item 1


Figure 5. IRCs of the FCI Item 4


Figure 6. IRCs of the FCI Item 9


Figure 7. IRCs of the FCI Item 21

For the FCI items, trace lines depict the percentage of students who selected each option at each ability level (see Figures 4–7). We discuss a few of the items (1, 4, and 9) in further depth to show the benefits of IRCs.

Discussion

The reliability of the test was very low, meaning the items were not consistent in measuring the same concepts, possibly because they did not match the students’ knowledge levels. Additionally, the high standard error of measurement suggests the test scores were not very precise. The statistical output from the ITEMAN analysis emphasizes the need for test revision to align better with students’ understanding, enhance item consistency, and improve reliability. Even though the genders of the 941 students were not tabulated, the study was not gender-blind. The students' average score is 19.2% (5.76 out of 30), with a standard deviation of 2.28.

To respond to our research questions, at the initial stage, we used CTT to analyze item difficulty (p), discrimination index, reliability coefficient (alpha), and efficiency of distractors. To thoroughly test the effectiveness of the FCI items and response choices, we performed a detailed analysis of IRCs. We compared our findings with those of other international studies and other cross-cultural and educational contexts.

The FCI Item Indices

The difficulty index: Students found the FCI test items challenging, with only 26.7% of the items categorized as having a moderate difficulty level. The FCI items were categorized as difficult because none of the students achieved the desired percentage level of mastery, which is 60% (Hestenes & Halloun, 1995). For the overall difficulty of the FCI items to fall within the desired range of 0.3 to 0.7 in this educational context, it appears that either the assessment should be modified or more instruction should be provided to better prepare students for the ideas assessed by the FCI.

The discrimination index: The results demonstrate that, in this educational setting, students with varying levels of knowledge cannot be effectively distinguished by the FCI items because of their poor discrimination indices (Al-Khadher & Albursan, 2017; Ding & Beichner, 2009). Because of their low discrimination indices, the items may require significant content and phrasing revisions to better fit the local context and the students' backgrounds. The initial conclusion that the FCI, in its current form, is not a valid tool for assessing students’ knowledge in this setting is further supported by the clustering of students’ marks at the lower end of the scale, as shown in Figure 1. Eight items exhibited negative discrimination indices, meaning they failed to differentiate student ability and should be revised or removed.

The reliability index: These values are well below the required levels of reliability, which normally call for higher alpha values to signify strong internal consistency between test items. The low alpha value raises concerns about the assessment's overall quality because it implies that the FCI items may not consistently measure the same underlying concept (see Table 1). A plausible cause, according to Hestenes et al. (1992), may be the difficulty in understanding the Newtonian mechanics conceptual framework.

The distractor effectiveness: The analysis did reveal, however, that the majority of the FCI distractors employed in this study were functional; that is, they were chosen by enough learners to be taken seriously as viable options. This implies that most of the items were able to challenge students, apart from a few ineffective distractors. Overall, most of the FCI distractors functioned satisfactorily in this context, although a few items may need to be revised or have weak distractors removed. Item response curves (IRCs) have been used to examine the functional distractors and correct responses further and to provide a more comprehensive understanding of the effectiveness of the FCI items.

The item response curves: We examined the IRCs of the FCI items to analyze the effectiveness of the different response options in more detail, classifying items as efficient, moderately efficient, or inefficient, as defined by Morris et al. (2006).

Figure 4 shows the IRCs of all choices of Item 1. A clear visual depiction of item performance is given by the trace lines in Figure 4, which show the percentage of students at each ability level who selected each option. Fitting the IRC of the correct answer provides clearer information for discussion and comprehension than the CTT analysis. The sigmoid curve for the correct response C of FCI Item 1 shows that a small percentage of low-ability students and a high percentage of high-ability students chose the correct response. The IRCs for alternatives A, D, and E show that a high percentage of low-ability students and a negligibly small percentage of high-ability students chose these alternatives, while choice B was chosen by a high percentage of low- and middle-ability students and a negligibly small percentage of high-ability students. In addition, each choice was made by a considerable number of students (14 per cent or more). This pattern is indicative of an effective item because it conforms to the behaviour expected of an effective item (Morris et al., 2006). The IRCs of Item 1 also show that more low-ability students than high-ability students hold alternative conceptions, since their incorrect responses are more frequent. The patterns of the IRCs for Item 1 are similar to those reported by Morris et al. (2012).

Figure 5 shows the IRCs of all choices of Item 4. FCI Item 4 evaluates students’ understanding of collision forces (Hestenes et al., 1992). The difficulty and discrimination indices for Item 4 are very low. Table 2 reveals that the correct option E was chosen by only 11% of students, indicating that the item was not easy. For this item, the correct response rate of the students in this study is far below that reported by Morris et al. (2006) for students at three universities in the USA. Choices A (a truck exerts more force), B (a greater amount of force is exerted on the truck than the car), C (none of the vehicles exerts a force), and D (the car exerts a force, but the truck does not) appeal to 89% of students. The correct response E displays a continuous curve on the IRC plot, starting at a low frequency for low-scoring students and increasing for moderately scoring students (monotonically increasing as ability levels improve). Alternative A is the most common wrong response option and is chosen by a high percentage of low- and middle-ability students over a large range of total scores (1–9) and by a very low percentage of high-ability students; hence, it is a moderately discriminating response choice. The IRCs for the other alternatives are virtually flat throughout an extensive range of total scores (2–12), which makes them uninformative: the IRC graphs of options B, C, and D are practically linear and flat. As a result, these response options appear to appeal to students of various ability levels, which could be due to students guessing and selecting responses at random. Hence, FCI Item 4 is classified as an inefficient item. The IRCs presented here have some similarities to and some differences from the results of Morris et al. (2006): for Ugandan students, options B, C, and D do not function very well, choice A is moderately discriminating, and the item is ineffective, while for American students, options A and D do not function very well, and the item is moderately effective. Learning Newton's third law is one of the most complex and challenging aspects of Newtonian physics. Studies of learning curves in human training (Hamade, 2011) and neural networks (Hanson et al., 2024), and of conceptual difficulties (Feldman, 2003; Savinainen et al., 2005, and references therein), have shown that complex concepts are difficult to learn; learning can be made easier by simplifying concepts, combining features according to complexity, decomposing concepts into categories by complexity and connectedness, and using rules, examples, analogies, and bridging representations, all of which require extended exposure and time to reach a threshold for fast learning. Learning Newton’s third law requires a thorough conceptual grasp of different aspects of the concept. This can happen when the student considers force in terms of ‘interaction and process’ instead of ‘object property,’ which can be facilitated using combinations of bridging representations (Savinainen et al., 2005).

FCI Item 9 tests students’ understanding of the effect of an impulse on an object in uniform motion. The frequency of correct responses (28%, option E) indicates that FCI Item 9 is of moderate difficulty. For Item 9, the correct response rate of the students in this study is far below that of the study conducted by Morris et al. (2006) in the context of the USA. Figure 6 shows the IRCs of all choices of Item 9. The IRC for Item 9’s correct option E provides no meaningful information for distinguishing students of varying abilities and fails to differentiate between them; the functioning of the distractors is more instructive than the correct option E. For a wide range of scores, the IRCs for Item 9 are nearly flat, indicating that the options are ambiguous, complex, and non-discriminatory. Therefore, Item 9 appears to be inefficient, since its IRC did not exhibit the anticipated discriminating pattern. The fitted curve for the correct response did not match the logistic model well, indicating that there may be difficulties with the item's content or distractor choices that prevent it from operating as intended. This item is similar to FCI Item 21, which shows comparable IRC plots with many incorrect answers at all student ability levels. Neither item shows the expected rising sigmoid curve for the correct response. As a result, these items are inefficient and uninformative, and they should be revised to fit this context.

A study conducted in the USA by Wang and Bao (2010) using IRT showed that Items 4, 9, and 21 are more difficult, while Item 1 is easy. Items 1 and 4 are more discriminating than Item 21, while Items 9 and 21 had a higher probability of a correct guess than Item 4. They also showed that some FCI items do not function very well. Planinic et al. (2010) used a Rasch model to administer the FCI to Newtonian and non-Newtonian populations of students in Croatia and found that the FCI functions differently for these two categories of students. They found problems with the effectiveness of some FCI items and suggested changes for improvement. In deciding the functionality of an item, the above-mentioned studies did not use the set of wrong response choices, whereas IRC analysis uses all response choices when judging the effectiveness of an item. Having used the IRCs to examine the effectiveness of the FCI items and to characterize students' performance and comprehension of force concepts in the Ugandan setting, it is important to compare the findings with those from different educational settings.

The FCI’s performance has been found to vary across different populations due to cultural, educational, and linguistic differences (Bao et al., 2009; Gebauer et al., 2021; R. Henderson & Stewart, 2017; Mashood & Singh, 2019). Previous studies indicate that non-Newtonian populations, gender differences, and variations in physics instruction methods can all impact FCI results (Bao et al., 2009; R. Henderson & Stewart, 2017; Mashood & Singh, 2019; Mears, 2019; Planinic et al., 2010; Shoji et al., 2021; Traxler et al., 2018; Yasuda et al., 2021; Yasuda & Taniguchi, 2013). Such factors should be considered when interpreting findings and adapting the FCI for diverse educational settings, as discussed in detail in the next section.

Comparative Analysis of the Cross-cultural and Educational Contexts

Studies by Mashood and Singh (2019) and Bao et al. (2009) indicate variations in FCI performance across different countries. Ugandan secondary school students performed similarly to students in India's non-integrated mode (NIM) setting but significantly below students in China, the USA, and India’s integrated mode (IM). A study by Planinic et al. (2010) in Croatia similarly found that secondary students scored below the Newtonian comprehension threshold, while university students performed better in the FCI. Planinic et al. (2010) also found problems with the functionality of several items and suggested some changes for improvement in the FCI.

Another study, conducted in the UK, found that the average FCI scores of both male and female groups increased with increasing levels of qualification in physics, and that males outperformed females at each level (Mears, 2019). A study in the USA examined differences in average FCI scores between genders and across racial and ethnic groups (R. Henderson & Stewart, 2017). It revealed gender gaps and ethnic biases in average FCI scores: males outperformed females, while Caucasian students outperformed both Hispanic and African American students. Another USA study found that at least 8 FCI items showed substantial gender bias; of these, 2 items were biased in favour of females, while 6 were biased against them (Traxler et al., 2018). After the removal of these problematic questions, the gender gap in FCI scores decreased by about 50%. A recent meta-analysis (Santoso et al., 2024) of 38 studies (21 American, 11 Asian, 3 European, 2 African, and 1 Australian) showed that male students performed significantly better than female students on the FCI, with a moderate effect size, in both North American (NA) and non-NA regions, represented by 21 and 17 studies, respectively.

A study in Japan administered the FCI to analyse students’ understanding, employing a modified item response curve (MIRC) to demonstrate the relationship between responses to FCI Q.29 (Item 29) and overall student ability (Shoji et al., 2021). The trend indicated that, in the Japanese educational context, Q.29 did not adequately assess students’ comprehension of atmospheric pressure. Some students who grasped the idea chose incorrect answers after instruction, while many students who initially selected the correct answer did not fully understand atmospheric pressure. As a result, the students’ correct response rate for FCI Q.29 dropped from the pre-test to the post-test. Our study also found that FCI Q.29 was difficult for Ugandan secondary school students, as only 30% of students responded correctly to this question. The discrimination index for this question was 0.05, far below the desired range, and the difficulty index was 0.3, the lower limit of the moderate difficulty range; values below this indicate a highly difficult item.
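For clarity, the CTT indices quoted above can be computed as in the following minimal sketch, assuming dichotomously scored (0/1) responses; the upper/lower 27% split used for the discrimination index is one common convention, not necessarily the exact procedure used in this study.

```python
# Minimal sketch: CTT item difficulty and discrimination indices for one item.
# `scores` is a (students x items) 0/1 array; all values below are simulated.
import numpy as np

def item_difficulty(scores: np.ndarray, item: int) -> float:
    """Difficulty index p = proportion of students answering the item correctly."""
    return scores[:, item].mean()

def item_discrimination(scores: np.ndarray, item: int, frac: float = 0.27) -> float:
    """Discrimination index D = p_upper - p_lower, comparing the top and
    bottom `frac` of students ranked by total score."""
    total = scores.sum(axis=1)
    order = np.argsort(total)
    n = max(1, int(round(frac * len(total))))
    lower, upper = order[:n], order[-n:]
    return scores[upper, item].mean() - scores[lower, item].mean()

# Illustrative usage with simulated data (941 students, 30 items).
rng = np.random.default_rng(1)
scores = (rng.random((941, 30)) < 0.2).astype(int)
print(item_difficulty(scores, 28), item_discrimination(scores, 28))  # Item 29, 0-indexed
```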

The Japanese version of the FCI, translated by members of the Tokyo University of Science and Tokyo Gakugei University, was administered to evaluate students’ understanding and to identify issues with false negatives, false positives, and guessing arising from translation inadequacies (Yasuda et al., 2011). The Japanese FCI had significant issues indicating misunderstandings caused by ambiguities in the items themselves rather than by the students’ knowledge. The inadequacies identified were attributed to cultural and educational differences between Japan and the original context of the FCI. This highlighted the need for additional research to guarantee the diagnostic tool’s accuracy and reliability across languages and educational systems. To promote more accurate assessments of students’ conceptual comprehension of physics, the authors advise international researchers to work on translating and developing educational diagnostic tools (Yasuda & Taniguchi, 2013; Yasuda et al., 2011). The study thus offers guidance for future translations and for the development of diagnostic instruments both domestically and abroad.

Cultural and educational factors contribute significantly to the persistence of misconceptions about force in many countries. In the USA, interactive engagement methods such as peer instruction and the use of technology in classrooms aim to address misconceptions in physics, but their implementation is not yet uniform across all schools. Misconceptions about force remain prevalent, with many students believing that a continuous force is needed to maintain motion (McDermott & Redish, 1999). This misunderstanding stems partly from everyday language and experiences that contradict scientific principles, and the diverse educational landscape results in varying approaches to science education, contributing to inconsistencies in students’ conceptual understanding. In China, similar misconceptions persist: students often hold the erroneous belief that an object in motion requires a continuous force to remain in motion (Bao & Redish, 2006).

Understanding and effectively addressing prevalent misconceptions is essential to improving physics instruction worldwide. In the USA, students’ misconceptions are addressed and corrected in real time through interactive engagement techniques such as responding to questions with clickers and peer instruction (Mazur, 1997). Such educational approaches may help address these misconceptions, but they are not uniformly implemented across countries.

In Uganda's secondary schools, we administered the FCI as a standardized multiple-choice test. We mainly focused on the quantitative survey assessing the feasibility of the FCI in this educational setting. The FCI has been utilized globally for various purposes, including assessing conceptual knowledge, gauging the efficacy of instruction, drawing out students' preconceptions regarding the force, etc., in the study of physics education. This emphasizes the importance of force concepts in physics education in school, college, and university curricula. Nevertheless, it is noteworthy that the use of the FCI at the secondary level is not as common as it is at the college/university level. Higher gains were noted for honours than regular courses in high schools (Hake, 1998). This explains why there is a substantial gap in literature because it is not a suitable instrument for the secondary level, particularly in the Ugandan context. These findings are like those from some other educational settings as discussed above.

Conclusion

The study evaluated the Force Concept Inventory (FCI) using Classical Test Theory (CTT) to determine its effectiveness in assessing students' conceptual understanding of Newtonian mechanics. The findings indicate that the FCI was highly challenging for students, with an overall mean score of 5.76 out of 30 and an item difficulty index suggesting that no items were categorized as "easy." Furthermore, the discrimination index analysis revealed that most items failed to distinguish between high- and low-performing students, with 73.3% of the items scoring below the acceptable threshold of 0.3. The reliability coefficient (α = 0.15) was considerably lower than the desired benchmark, raising concerns about the consistency of the test results. Additionally, distractor analysis highlighted several ineffective answer choices that did not sufficiently attract lower-ability students, thereby weakening the test’s ability to assess conceptual understanding. Item Response Curves (IRCs) provided further insights into student responses, demonstrating that certain FCI items performed efficiently, while others required refinement. The study also acknowledges that cross-cultural and educational contexts may influence students' performance, as prior research suggests variations in FCI effectiveness based on gender, culture, and educational background. Overall, these findings underscore the need to modify the FCI to better align with the cognitive and instructional realities of secondary school students in Uganda.

Recommendations

Based on the study’s findings, the following recommendations are proposed to improve the validity and effectiveness of the FCI in the Ugandan context.

Future research should focus on adapting and validating a customized FCI for Ugandan secondary schools, following different stages of the validation process. This tool should incorporate insights from Item Response Curves (IRCs) to identify item strengths and weaknesses, ensuring alignment with local educational standards, cultural contexts, and instructional practices, ultimately contributing to broader physics education advancements.

Using the validated, customized FCI, future studies should examine how different instructional approaches, including inquiry-based learning, active learning, and technology-enhanced classroom instruction, improve students’ conceptual grasp of Newtonian physics. This can help determine the most suitable practices for addressing misconceptions in the Ugandan setting.

The efficacy and performance of the FCI should be compared in future studies across various groups and educational environments. This will aid in the identification of both context-specific and universal problems, advancing the creation of evaluation instruments that are both locally useful and globally adaptable.

Limitations

This study has several limitations that should be acknowledged. It focused on secondary school students in Uganda, which limits the generalizability of the findings to other educational contexts with different curricula and instructional methods. Factors such as language barriers, variations in teaching methodologies, and students’ prior exposure to physics concepts were not controlled and may have influenced the outcomes. Future research should address these limitations to enhance the robustness of FCI assessments in similar contexts.

Ethics Statements

This study adhered to ethical guidelines for research involving human subjects, with approval from the University of Rwanda – College of Education and the Uganda National Council for Science and Technology. All participants provided informed consent after a detailed explanation of the study’s purpose, methods, and participants’ rights. Confidentiality and anonymity were strictly maintained, and data were securely stored for research use only. Participation was voluntary, with no penalties for withdrawal. The study followed the principles of the Declaration of Helsinki.

Acknowledgements

We extend our gratitude to ACEITLMS for their invaluable support. We also sincerely thank the students and physics teachers from secondary schools across Uganda for their cooperation, dedication, and significant contributions of time and effort to this study.

Authorship Contribution Statement

The study was conceived and designed by Kirya, who also developed the research methodology and carried out the data analysis. The co-authors assisted with the literature review, data collection, and preliminary data analysis; they helped validate the data and offered crucial suggestions for improving the theoretical framework. Kirya drafted the paper, while the other authors reviewed and edited it for clarity and accuracy. All authors actively engaged in the writing process, read and approved the final draft of the manuscript, and agreed to take responsibility for all aspects of the work.

 

References

Al-Khadher, M. M. A., & Albursan, I. S. (2017). Accuracy of measurement in the classical and the modern test theory: An empirical study on a children intelligence test. International Journal of Psychological Studies, 9(1), 71-80. https://doi.org/10.5539/ijps.v9n1p71

Bani-Salameh, H. N. (2016). How persistent are the misconceptions about force and motion held by college students? Physics Education, 52(1), Article 014003. https://doi.org/10.1088/1361-6552/52/1/014003

Bao, L., Cai, T., Koenig, K., Fang, K., Han, J., Wang, J., Liu, Q., Ding, L., Cui, L., Luo, Y., Wang, Y., Li, L., & Wu, N. (2009). Learning and scientific reasoning. Science, 323(5914), 586-587. https://doi.org/10.1126/science.1167740  

Bao, L., & Redish, E. F. (2006). Model analysis: Representing and assessing the dynamics of student learning. Physical Review Special Topics - Physics Education Research, 2(1), Article 010103. https://doi.org/bpc2nb

Creswell, J. W. (2014). Research design: Qualitative, quantitative and mixed methods approaches (4th ed.). Sage.  

Ding, L., & Beichner, R. (2009). Approaches to data analysis of multiple-choice questions. Physical Review Special Topics - Physics Education Research, 5(2), Article 020103. https://doi.org/cc69r7

Feldman, J. (2003). The simplicity principle in human concept learning. Current Directions in Psychological Science, 12(6), 227-232. https://doi.org/10.1046/j.0963-7214.2003.01267.x

Gebauer, M. M., McElvany, N., Köller, O., & Schöber, C. (2021). Cross-cultural differences in academic self-efficacy and its sources across socialization contexts. Social Psychology of Education, 24, 1407-1432. https://doi.org/10.1007/s11218-021-09658-3

Hake, R. R. (1998). Interactive-engagement versus traditional methods: A six-thousand-student survey of mechanics test data for introductory physics courses. American Journal of Physics, 66(1), 64-74. https://doi.org/10.1119/1.18809

Hamade, R. F. (2011). Learning curves for CAD competence building of novice trainees. In M. Y. Jaber (Ed.), Learning curves: Theory, models, and applications (pp. 411-423). CRC Press.   

Han, J., Koenig, K., Cui, L., Fritchman, J., Li, D., Sun, W., Fu, Z., & Bao, L. (2016). Experimental validation of the half-length Force Concept Inventory. Physical Review Physics Education Research, 12, Article 020122. https://doi.org/j4dq

Hanson, S. J., Yadav, V., & Hanson, C. (2024). Dense sample deep learning. Neural Computation, 36(6), 1228-1244. https://doi.org/10.1162/neco_a_01666

Henderson, C. (2002). Common concerns about the Force Concept Inventory. The Physics Teacher, 40(9), 542-547. https://doi.org/10.1119/1.1534822

Henderson, R., & Stewart, J. (2017). Racial and ethnic bias in the Force Concept Inventory. In L. Ding, A. Traxler, & Y. Cao (Eds.), 2017 PERC Proceedings, (pp. 172-175). American Association of Physics Teachers. https://doi.org/10.1119/perc.2017.pr.038

Hestenes, D., & Halloun, I. (1995). Interpreting the force concept inventory: A response to March 1995 critique by Huffman and Heller. The Physics Teacher, 33(8), 502-506. https://doi.org/10.1119/1.2344278

Hestenes, D., Wells, M., & Swackhamer, G. (1992). Force Concept Inventory. The Physics Teacher, 30(3), 141-158. https://doi.org/10.1119/1.2343497

Ishimoto, M., Davenport, G., & Wittmann, M. C. (2017). Use of item response curves of the force and motion conceptual evaluation to compare Japanese and American students' views on force and motion. Physical Review Special Topics - Physics Education Research, 13(2), Article 020135. https://doi.org/gdzz33

Kirya, K. R., Mashood, K. K., & Yadav, L. L. (2021a). A methodological analysis for the development of a circular-motion concept inventory in a Ugandan context by using the Delphi technique. International Journal of Learning, Teaching and Educational Research, 20(10), 61-82. https://doi.org/10.26803/ijlter.20.10.4

Kirya, K. R., Mashood, K. K., & Yadav, L. L. (2021b). Review of research in student conception studies and concept inventories- Exploring PER threads relevant to the Ugandan context. African Journal of Educational Studies in Mathematics and Sciences, 17(1), 37-60. https://doi.org/10.4314/ajesms.v17i1.3

Kirya, K. R., Mashood, K. K., & Yadav, L. L. (2022). Development of a circular motion concept item inventory for use in Ugandan science education. Journal of Turkish Science Education, 19(4), 1312-1327. https://doi.org/10.36681/tused.2022.176

Kucharavy, D., & De Guio, R. (2015). Application of logistic growth curve. Procedia Engineering, 131, 280-290. https://doi.org/10.1016/j.proeng.2015.12.390

Laverty, J. T., & Caballero, M. D. (2018). Analysis of the most common concept inventories in physics: What are we assessing? Physical Review Physics Education Research, 14(1), Article 010123. https://doi.org/gc93mz

Le, V., Nissen, J. M., Tang, X., Zhang, Y., Mehrabi, A., Morphew, J. W., Chang, H. H., & Van Dusen, B. (2025). Applying cognitive diagnostic models to mechanics concept inventories. Physical Review Physics Education Research, 21(1), Article 010103. https://doi.org/pd2r

Madsen, A., McKagan, S. B., Martinuk, M. S., Bell, A., & Sayre, E. C. (2016). Research-based assessment affordances and constraints: Perceptions of physics faculty. Physical Review Physics Education Research, 12, Article 010115. https://doi.org/pf58

Mashood, K. K. (2014). Development and evaluation of a concept inventory in rotational kinematics [Doctoral dissertation, Tata Institute of Fundamental Research]. Homi Bhabha Centre for Science Education. https://bit.ly/4l7lhFG

Mashood, K. K., & Singh, V. A. (2019). Preuniversity science education in India: Insights and cross cultural comparison. Physical Review Physics Education Research, 15, Article 013103. https://doi.org/gfw67q  

Mazur, E. (1997). Peer instruction: Getting students to think in class. AIP Conference Proceedings, 399(1), 981–988. https://doi.org/10.1063/1.53199

Mbwile, B., & Ntivuguruzwa, C. (2023). Impact of practical work in promoting learning of kinematics graphs in Tanzanian teachers’ training colleges. International Journal of Education and Practice, 11(3), 320-338. https://doi.org/10.18488/61.v11i3.3343

McDermott, L. C., & Redish, E. F. (1999). Resource letter: PER-1: Physics education research. American Journal of Physics, 67(9), 755-767. https://doi.org/10.1119/1.19122

Mears, M. (2019). Gender differences in the Force Concept Inventory for different educational levels in the United Kingdom. Physical Review Physics Education Research, 15(2), Article 020135. https://doi.org/ggn3dt

Morley, A., Nissen, J. M., & Van Dusen, B. (2023). Measurement invariance across race and gender for the Force Concept Inventory. Physical Review Physics Education Research, 19(2), Article 020102. https://doi.org/pd2s  

Morris, G. A., Branum-Martin, L., Harshman, N., Baker, S. D., Mazur, E., Dutta, S., Mzoughi, T., & McCauley, V. (2006). Testing the test: Item response curves and test quality. American Journal of Physics, 74(5), 449-453. https://doi.org/10.1119/1.2174053

Morris, G. A., Harshman, N., Branum-Martin, L., Mazur, E., Mzoughi, T., & Baker, S. D. (2012). An item response curves analysis of the force concept inventory. American Journal of Physics, 80(9), 825-831. https://doi.org/10.1119/1.4731618

Murshed, M., Phang, F. A., Bunyamin, M. A. H. B., & Binti, I. J. (2020). The reliability analysis for force concept inventory. International Journal of Psychosocial Rehabilitation, 24(5), 143-151. http://doi.org/10.37200/IJPR/V24I5/PR201677

National Curriculum Development Centre. (2019). Physics syllabus. https://ncdc.go.ug/books/physics-syllabus/

Opolot-Okurut, C. (2010). Classroom learning environment and motivation towards mathematics among secondary school students in Uganda. Learning Environments Research, 13, 267-277. https://doi.org/10.1007/s10984-010-9074-7

Parker, M. A. J., Hedgeland, H., Jordan, S. E., & Braithwaite, N. S. J. (2023). Establishing a physics concept inventory using computer marked free-response questions. European Journal of Science and Mathematics Education, 11(2), 360-375. https://doi.org/10.30935/scimath/12680

Planinic, M., Ivanjek, L., & Susac, A. (2010). Rasch model based analysis of the Force Concept Inventory. Physical Review Special Topics - Physics Education Research, 6(1), Article 010103. https://doi.org/10.1103/PhysRevSTPER.6.010103

Reyes, M. G., & Rakkapao, S. (2020). Item response curve analysis of Likert scale on learning attitudes towards physics. European Journal of Physics, 41, Article 045703. https://doi.org/10.1088/1361-6404/ab805c

Richardson, C. J., Smith, T. I., & Walter, P. J. (2021). Replicating analyses of item response curves using data from the Force and Motion Conceptual Evaluation. Physical Review Physics Education Research, 17(2), Article 020127. https://doi.org/pd2t

Rusticus, S. (2014). Content validity. In A. C. Michalos (Ed.), Encyclopedia of quality of life and well-being research (pp. 1261-1262). Springer. https://doi.org/10.1007/978-94-007-0753-5_553

Sands, D., Parker, M., Hedgeland, H., Jordan, S., & Galloway, R. (2018). Using concept inventories to measure understanding, Higher Education Pedagogies, 3(1), 173-182. https://doi.org/10.1080/23752696.2018.1433546

Santoso, P. H., Setiaji, B., Wahyudi, Syahbrudin, J., Bahri, S., Fathurrahman, Ananda, A. S. R., & Sodhiqin, Y.  (2024). Exploring gender differences in the Force Concept Inventory using a random effects meta-analysis of international studies. Physical Review Physics Education Research, 20(1), Article 010601. https://doi.org/pd2v

Savinainen, A., Scott, P., & Viiri, J. (2005). Using a bridging representation and social interactions to foster conceptual change: Designing and evaluating an instructional sequence for Newton’s third law. Science Education, 89(2), 175-195. https://doi.org/10.1002/sce.20037

Shoji, Y., Munejiri, S., & Kaga, E. (2021). Validity of the Force Concept Inventory evaluated by students’ explanations and confirmation using a modified item response curve. Physical Review Physics Education Research, 17(2), Article 020120. https://doi.org/jw9m

Smith, J. P., III, DiSessa, A. A., & Roschelle, J. (1994). Misconceptions reconceived: A constructivist analysis of knowledge in transition. The Journal of the Learning Sciences, 3(2), 115-163. https://doi.org/10.1207/s15327809jls0302_1

Stoen, S. M., McDaniel, M. A., Frey, R. F., Hynes, K. M., & Cahill, M. J. (2020). Force Concept Inventory: More than just conceptual understanding. Physical Review Physics Education Research, 16(1), Article 010105. https://doi.org/gg2mzc

Traxler, A., Henderson, R., Stewart, J., Stewart, G., Papak, A., & Lindell, R. (2018). Gender fairness within the Force Concept Inventory. Physical Review Physics Education Research, 14(1), Article 010103. https://doi.org/gctkcj 

Wang, J., & Bao, L. (2010). Analyzing force concept inventory with item response theory. American Journal of Physics78(10), 1064-1070. https://doi.org/10.1119/1.3443565

Wu, M., Tam, H. P., & Jen, T.-H. (2016). Educational measurement for applied researchers: Theory into practice. Springer. https://doi.org/10.1007/978-981-10-3302-5

Yasuda, J.-I., Mae, N., Hull, M. M., & Taniguchi, M.-A. (2021). Optimizing the length of computerized adaptive testing for the force concept inventory. Physical Review Physics Education Research, 17(1), Article 010115. https://doi.org/gq776v   

Yasuda, J.-I., & Taniguchi, M.-A. (2013). Validating two questions in the Force Concept Inventory with subquestions. Physical Review Special Topics - Physics Education Research, 9(1), Article 010113. https://doi.org/pd2w

Yasuda, J.-I., Uematsu, H., & Nitta, H. (2011). Validating a Japanese version of the force concept inventory. Journal of the Physics Education Society of Japan, 59(2), 90-95. https://bit.ly/4l8ciUS

...