Summer 1998

Inside This Issue

Highlights in Assessment

New Standards Science

Mathematics Assessment Resource Service (MARS)

Performance Links in Science

Third International Mathematics and Science Study



National Science Foundation
4201 Wilson Boulevard #885
Arlington, VA 22230







Highlights in Assessment

Which question would you prefer to answer? Which is easier? For whom?

2. A rectangular picture is pasted to a sheet of white paper as shown.

What is the area of the white paper not covered by the picture?

165 cm²
500 cm²
1900 cm²
2700 cm²

2. The length of a rectangle is 6 cm, and its perimeter is 16 cm. What is the area of the rectangle in square centimeters?


Reproduced from TIMSS Population 2 Item Pool. Copyright © 1994 by IEA, The Hague.
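
As a point of comparison between the two formats, the open-response item can be worked in two steps. This is a brief sketch; the symbols l, w, P, and A (length, width, perimeter, and area) are labels introduced here and are not part of the original item.

```latex
\begin{align*}
P &= 2(l + w) \;\Rightarrow\; 16 = 2(6 + w) \;\Rightarrow\; w = 2 \text{ cm} \\
A &= l \times w = 6 \times 2 = 12 \text{ cm}^2
\end{align*}
```

With no answer choices to select from, a student must recall the perimeter relationship before computing the area.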

Striking an appropriate balance among assessment tools is no easy task. For example, traditional assessment relies heavily on multiple-choice tests. These tests, however, may place more emphasis on facts than processes and sometimes oversimplify complex ideas. They do not always indicate what students really know: almost every student has benefited from a "lucky guess" on a multiple-choice exam or missed the point of a question entirely.

Newly developed performance assessments have weaknesses too. Some fail to meet psychometric standards of reliability and validity. Some do not take into consideration the contexts and nuances of instruction. New performance assessments are more expensive to develop and score.

When assessing student learning, it is important to understand and continually ask questions about assessment tools. It is also important to identify an appropriate mix of assessment tools and strategies.

Unlike multiple-choice exams, many standards-led assessments require students to demonstrate a broad range of problem-solving skills by performing "authentic" tasks that call for the same kinds of skills students will need for future success.

Assessing Student Learning in the '90s at NSF

In 1990, ESIE (at that time the Division of Materials Development, Research, and Informal Science Education) developed a special solicitation that initiated a series of seventeen assessment projects. The solicitation was part of NSF's response to the call from President Bush and the State Governors to be first in the world in mathematics and science by the year 2000.

The objectives of the assessment solicitation were

Many of the projects supported under the 1990 solicitation focused on developing performance-based or extended-response assessments (e.g., Balanced Assessment's packages across the K-12 mathematics curriculum that formed the basis of the New Standards mathematics reference exam and tests); others provided the research on alternative forms of assessment.

These projects (most have been completed) have paved the way for developing the new assessment tools described in this newsletter.

Today, the ESIE assessment portfolio reflects the importance of using high-quality, accurate student learning assessment in gauging the effects of education reforms. New instructional materials, new partnerships, and changes in teacher enhancement all strive to improve student learning in mathematics and science. Assessing the student learning outcomes resulting from these activities is an integral part of the process.

Instructional Materials are Fundamental to Education Reform

All projects funded under the Instructional Materials Development Program include embedded assessments as part of the development process. Gathering data on teachers' use of embedded assessments is therefore essential. For example, an award to the University of California, Berkeley, examines how embedded student assessment enhances the Science Education for Public Understanding (SEPUP) Project materials. These science materials provide a yearlong science program for middle school students, stressing issue-oriented science and the use of scientific evidence and risk-benefit analysis in making decisions. Assessments were developed alongside the materials. Now the focus is on studies of teacher use and student outcomes.

Technology is Everywhere, Even in Assessment

Researchers at the University of California, Irvine, have developed a conceptual framework and accompanying software for assessing student understanding of mathematics concepts. The software includes a sophisticated process to diagnose students' understanding of concepts in arithmetic, algebra, and geometry (using extended-response and multiple-choice questions), along with materials for additional practice, so that students and teachers gain a better understanding of student weaknesses.

In addition, Videodiscovery is field-testing a CD-ROM software engine that delivers a collection of science performance assessments for grades 5-8. The content of the tasks is aligned with the national science education standards and includes video clips, graphics, virtual tools for information gathering, and simulation techniques. A scoring system is embedded in each performance event, eliminating the need for time-consuming teacher scoring.


The MOSAIC, A Quilt of Many Colors
Steven Klein

Using assessment tools, the Mosaic project examines a question that is central to reform efforts in mathematics and science instruction: do students achieve more when their teachers use the reform practices than when they use more traditional practices?

The Mosaic project is gathering data at ten sites across the nation. Four sites are states (Connecticut, Louisiana, Massachusetts, and South Carolina) while six are school districts or combinations of districts (Columbus, OH; Detroit, MI; El Paso, Socorro, and Ysleta, TX; Fresno, CA; San Francisco, CA; and Philadelphia, PA). One or two additional sites will participate in 1998.

The project focuses on mathematics, or science, or both mathematics and science, depending on the site. On average, 20 schools per site participate in the study, with one or two grade levels per school. By the end of the study, it is expected that over 1,000 classrooms and 25,000 students will have participated in one or more phases of the project.

The same basic research design is used at each site. The first step involves identifying approximately 10 schools at the site where mathematics and/or science reforms have been implemented, and identifying 10 similar schools (in terms of student demographic characteristics) where the reforms have yet to be implemented. The second step assesses student performance at each school in mathematics and/or science with both multiple choice and open-ended measures, including the use of hands-on tasks in science. The assessment measures used are the same as those normally administered at the site (such as part of a statewide testing program) as well as those tests chosen by the research team. The third step involves administering a questionnaire to teachers at all of the participating schools to better understand classroom practices.

RAND conducts statistical analyses of the tests and teacher questionnaire data to examine the following questions: (1) do students in the classrooms where the reform practices are used have higher test scores than do the students in other classrooms, (2) are the differences in mean scores between these two groups of students greater on the open-ended and hands-on tests than they are on the multiple choice measures, and (3) are the answers to these questions related to site, grade level, subject (mathematics or science), gender, race, or ethnicity?

The analyses conducted to answer these questions control for possible differences in student background characteristics between groups (such as the percentage of students enrolled in free or reduced lunch programs).
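
The first two research questions amount to comparing group differences across test formats. The following is a minimal sketch of that comparison, not RAND's actual analysis: the function name `effect_size`, the score lists, and the choice of a pooled-standard-deviation effect size are all assumptions made for illustration, and the sketch omits the adjustment for student background characteristics described above.

```python
from statistics import mean, stdev


def effect_size(reform, comparison):
    """Standardized mean difference between two groups (pooled SD)."""
    n1, n2 = len(reform), len(comparison)
    s1, s2 = stdev(reform), stdev(comparison)
    pooled_sd = (((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)) ** 0.5
    return (mean(reform) - mean(comparison)) / pooled_sd


# Hypothetical scores, invented for illustration only (not project data).
scores = {
    "multiple_choice": {
        "reform": [62, 70, 75, 68, 80],
        "comparison": [60, 66, 74, 65, 79],
    },
    "open_ended": {
        "reform": [55, 68, 72, 61, 77],
        "comparison": [48, 59, 66, 52, 70],
    },
}

# Question 1: do reform classrooms score higher on each measure?
gaps = {
    name: effect_size(groups["reform"], groups["comparison"])
    for name, groups in scores.items()
}

# Question 2: is the gap larger on open-ended than on multiple-choice measures?
larger_on_open_ended = gaps["open_ended"] > gaps["multiple_choice"]

for name, gap in gaps.items():
    print(f"{name}: standardized difference = {gap:.2f}")
print("Gap larger on open-ended measure:", larger_on_open_ended)
```

Expressing each gap in standard-deviation units puts tests with different score scales on a common footing. A fuller analysis would also model background variables such as the percentage of students in free or reduced lunch programs.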

Although a single study cannot provide a definitive answer to the question of whether the reforms are effective, using over ten diverse sites across the nation will allow for the construction of a "mosaic" of evidence about reforms. The research may also provide insight regarding the circumstances under which the reforms are and are not successful in improving student performance.

Findings from the first wave of data collection are expected to be available in the summer of 1998. The Mosaic project is conducted by RAND, a non-profit research institute, in collaboration with Horizon Research, under a grant from NSF.


New Standards Science
Megan Martin

New Standards began in 1991 as a partnership of states and districts interested in developing an assessment system that would allow them to measure student progress in attaining the content standards emerging from the national professional organizations.

The first set of voluntary national content standards came from the National Council of Teachers of Mathematics' Curriculum and Evaluation Standards for School Mathematics (NCTM, 1989). They called for students to exhibit skills, conceptual understanding, and problem solving. It was expected that the standards in other subject areas would follow suit. It was also clear that the assessment system would have to include multiple measures, including performance examinations and collections of student work.

New Standards started work in mathematics and developed examinations requiring students to respond to the full range of tasks that can be accomplished in a limited time period. In the case of mathematics and science, leadership for the work has come from the University of California Office of the President in Oakland, CA. The development of performance standards involved teachers and university faculty from across the United States. The examinations are called "reference examinations" since they provide data with reference to meeting national standards.

The Science Reference Exam

With National Science Foundation funding, science examination development started in November 1996, with the goals of good science, good technical quality, and good test design. An important strategy for New Standards and for the science examination development was to build upon existing work and expertise. Thus, staffs were assembled from successful state and national assessment programs. Potential examination items were solicited from over 300 sources including 50 state and foreign assessment agencies, universities, systemic initiative awardees, and professional testing contractors. Approximately 25 outstanding teachers, science education experts and assessment experts from classrooms, universities, and state departments of education were convened to offer advice on the most effective examination design to measure and report valid student achievement data referenced to the standards.

Item Development

Item prototypes are piloted in a small number of classrooms representing diverse student backgrounds and achievement levels. Students and teachers are interviewed, survey data are analyzed, and items are revised as many times as necessary to get an item to work well. In addition, at each step in the development and piloting process, an independent scientific and equity review is coordinated by staff of the American Association for the Advancement of Science and incorporated into the revision process.

Staff at the Learning Research and Development Center carry out the technical quality assurance. In addition to addressing the usual psychometric concerns about sample sizes, background data, and scoring accuracy, they designed and carried out special studies to make the examination as fair as possible. An initial study of item format compared student performance on two successive constructed-response items on related topics with performance on a multiple-choice item followed by a constructed-response question on a related topic. Research questions included the following:

Results from a small sample are interesting enough to warrant further investigation on a larger scale this year.

Since one wants the examination to measure what students know and can do in science, rather than their general reasoning ability, curriculum validity studies are also being conducted. One would like an average student who works hard in a good science program to do well on our test and a bright student who doesn't study any science to do poorly on our test. Work is being done with district and systemic initiative leaders to identify classrooms characterized as "teaching for understanding toward the standards" so that we can compare student performance in those classrooms with the performance of students in classrooms where "traditional" practice is taking place.

Next Steps

Piloting and field testing, equity and scientific review, and technical studies will continue through 1998, and a test will be available for widespread use in spring 1999. For further information, please contact Megan Martin, Director of Science Exam Development.


Sample New Standards Assessments:

Reproduced with permission for ESIE Access


Mathematics Assessment Resource Service (MARS)
Sandra Wilcox

It is widely recognized that appropriate assessment is essential to the forward progress of the national reform effort in mathematics education. As long as traditional tests of mechanical skills constitute the basis of public accountability, the message sent to teachers is clear and insistent: Teaching to the broader goal of mathematical power is not valued and will not be rewarded. Conversely, there is much evidence that the introduction of broad-spectrum performance assessment, based on specific standards, can lead to changes in the balance of learning activities in classrooms. Thus balanced assessment is an essential complement to other key tools now available for supporting reform, including the K-12 curriculum materials developed largely with NSF support.

The work in MARS is grounded in the belief that mathematics assessment should:

Among its many accomplishments, MARS produced 18 balanced assessment packages. A package contains 10-20 tasks, each accompanied by a task-specific scoring rubric and samples of student responses that have been scored and commented upon. The collection includes technology-based packages and portfolio support packages. Cuisenaire Dale Seymour is publishing two packages at each of grades 4, 8, 10, and 12, with the first of the packages available in spring 1998.

With continuing support from NSF, MARS is extending development work to undertake four strands of activity: assessment design and development services; professional development services; investigations of key issues; and dissemination.

Custom tailoring is an essential feature of MARS. This responsive approach is also proactive, since clients have limited experience in performance assessment.

Assessment Design and Development Services

Developing skills of local design teams. MARS works with clients who wish to develop the skills of local designers. The range of activities includes task design, scoring design, and managing and monitoring assessment. For three years, MARS has been working with the New York City Schools Division of Assessment and Accountability to (a) help with the further development of PAM, the district's performance assessment complement to standardized tests in grades 5 and 7, and (b) advance the skills of the PAM design team so as to broaden the range and variety of PAM tasks. In a series of workshops, MARS designers initiate a context for PAM designers to draft tasks, and then provide commentary and suggestions for revisions with the aim of optimizing the development process.

Developing customized local instruments. Even though high-quality assessment is at last becoming available, many clients have compelling reasons to develop their own assessment. MARS works with clients who wish to have custom-tailored instruments for student assessment or system monitoring. The clients are provided with a choice of materials, design to specifications, and tools to evaluate systems. Recently for the El Paso Collaborative for Academic Excellence (USI), MARS provided an instrument to systematically monitor progress in their efforts to reshape mathematics education at the middle grades. The test is designed to portray a more holistic picture of what the reform initiative is contributing to student learning.

Professional Development Services

Professional development is a key driver of systemic reform in mathematics education. MARS work has two aspects: professional development for performance assessment, and professional development through performance assessment.

Professional development for performance assessment. In this activity, issues related to the alignment of standards and standards-based curriculum with performance assessment come to the fore. For example, in work with several school districts in Michigan and Colorado, MARS conducted sessions in the context of revisions of state performance standards or curriculum frameworks, and changes in statewide assessments.

Professional development through performance assessment. A specific performance task, with samples of student work and scoring activities, provides a clear focus for teachers to reconsider what is meant by mathematical performance and what aspects are important. It can make explicit, and challenge, some of the values that underlie teachers' belief systems. It can be a particularly effective focus for teachers faced with implementing new, standards-based curricula. Assessment design and scoring exercises and materials can be used to deepen and broaden a teacher's own mathematical knowledge, an area of need for many teachers charged with enacting reform practices. Assessment tasks and student responses can also be sites for teacher analysis of student understanding. In professional development through assessment, curriculum, teaching, learning, and assessment are all closely linked.

Building local capacity. MARS establishes a structure of working with leadership teams from client districts and states to support their work with lead teachers and other professional staff in the local settings. Making a 'Teachers Teaching Teachers' cascade process work effectively is not easy. It requires materials that provide substantial and sturdy support to the competence and confidence of mentors who themselves may be quite inexperienced in this new area. The development of such materials requires thorough systematic development with rich feedback, including observations and interviews. MARS is currently building on the pilot work in this area. The goal is to develop a set of resources: A Guide for Professional Development in Performance Assessment.2

Sandra Wilcox leads the team of designers and researchers at Michigan State, Alan Schoenfeld leads the team at UC Berkeley, and Hugh Burkhardt leads the team at the Shell Centre, Nottingham.

The Mathematics Assessment Resource Service (MARS) is established at Michigan State University with partial funding from the National Science Foundation. MARS is the implementation phase of the earlier Balanced Assessment project (BA) which was funded by NSF to develop a comprehensive range of performance assessment tasks in mathematics.

MARS web site:


2 BA tasks form the core of several cases in Using Assessment to Reshape Teaching: A Casebook for Mathematics Teachers and Teacher Educators, Curriculum and Staff Development Specialists. Wilcox & Lanier (Eds.). Erlbaum Associates (in press). The development of cases was supported, in part, by NSF grant 9252881.


T-Shirt: A MARS Example

T-Shirt is from the BA middle grades collection. Teachers are asked to work the task themselves. This provides the opportunity for teachers to (a) explore the mathematics embedded in the task, often pushing at the boundaries of their own mathematical knowledge, (b) consider what students need to know and be able to do to successfully engage and respond to the task, and (c) examine how the task embodies what is valued as important mathematics. While some teachers will apply the conventional Cartesian coordinate system to the task, many others use less conventional systems for locating points on the design. In this discussion, teachers develop a set of "Core Elements of Performance" that describe the mathematical essence of the task. For T-Shirt, the list typically contains the following:

Next, teachers are asked to examine student responses to the task. Here teachers analyze how students reason about the problem, what they seem to understand and what they are struggling with and what counts as evidence. As with the teachers, student responses vary. While many students employ a Cartesian coordinate system (either one quadrant or four quadrants), many others use invented systems. Teachers consider what might account for the various ways in which students approached the task. To the left is one piece of student work that teachers find intriguing. It generally raises intense discussion about what is valued in this response and, more generally, about how to help students develop and appreciate more efficient, even elegant, ways of solving problems.1

Teachers are then asked to take the task back to their own classrooms to see how their students engage with the task. Their own students' work and observations of students as they worked on the task become the content for the next session. This is followed by a Scoring Workshop that gives teachers a new kind of experience in analyzing, interpreting, and scoring student responses to the task. The outcome of the process is a set of qualitative descriptions characterizing four levels of student performance on the task based on the core elements of performance. The activities of the Scoring Workshop are designed for use in a professional collaborative setting: teachers scoring together around a table, with discussion. Though this is a scoring activity, it tends to bring out central issues related to curriculum, teaching, learning, and assessment. At subsequent sessions various rubrics (e.g., analytic point scheme; holistic by category) are examined and the strengths and limitations of various schemes are discussed.


1 For some tasks, additional documentation including video recordings of small groups of students working on a task is available.


Performance Links in Science
Edys Quellmalz

SRI International is leading the development of a greatly needed, specialized type of electronic library (an on-line, standards-based, interactive resource bank of science performance assessments) and is studying models of effective use of these resources. Performance Assessment Links in Science (PALS) provides an on-line assessment resource library designed to serve two purposes and user groups:

  1. The professional development needs of classroom teachers, and

  2. The accountability requirements of state education agencies and specially funded programs.

There are two primary goals:

In partnership with the Council of Chief State School Officers (CCSSO), three states (Connecticut, Illinois, and Kentucky), and two assessment consortia (the State Collaborative on Assessment and Student Standards [SCASS] and the California Systemic Initiative Assessment Consortium [CSIAC]), SRI is building on a one-year NSF planning grant. The new grant supports expansion of standards-based science assessment resources to elementary, middle, and secondary levels as well as studies of their use. In the second and third years of the project, SRI embeds on-line collaborations on science assessment within its virtual professional development center, TAPPED IN. Organizations interested in becoming partners may examine the prototype at and forward their reactions and feedback to us.


How Do We Measure Up? Focus on recent findings from the TIMSS
William Schmidt

The Third International Mathematics and Science Study (TIMSS) is the most ambitious large-scale cross-national educational research and assessment study ever conducted. More than half a million students' scores in mathematics and science are compared across five continents and 41 countries. The TIMSS study examined the 1994-1995 school years. The information collected by TIMSS researchers is extraordinary.

Members of the TIMSS research team evaluated more than 1600 textbooks and curricular policy documents from 48 countries. They administered country-level and school-level surveys as well as teacher and student surveys. They also conducted observational studies, videotape studies, and case studies; administered tests requiring students to perform practical tasks in mathematics and science; and administered paper-and-pencil tests.


In the mid-1990s, what children appeared to have learned indicated that we are not at all positioned to reach the high expectations set for our nation by our state governors and the President. We are not likely to be "first in the world" by the end of this century in either science or mathematics.

The only exception is fourth-grade science. Our fourth graders performed quite well on the paper-and-pencil test in science, outperformed by only one country, and scored above the international average in mathematics. Yet US students four grades further into their schooling, in the eighth grade, had fallen substantially behind their international peers. These students performed below the international average in mathematics and just above the average on the written achievement tests in science.

The findings suggest that our children do not start out behind those of other nations in mathematics and science achievement, but that somewhere in the middle grades, they fall behind. These results further point out that middle grade education in this country is particularly troubled, as the promise of our fourth grade children (particularly in science) is dashed against the undemanding curriculum of the nation's middle schools.

Some lessons: What you teach is what you get

Preliminary studies conducted by the US TIMSS Research Center suggest that standards and curricula in the United States compare unfavorably with those in the highest achieving countries in the TIMSS. Some of the key findings suggest that the following characteristics of US curricula are part of the explanation.

An unfocussed curriculum

One striking difference between textbooks and curriculum guides in the US and those in other countries is sheer size. Our textbooks are much larger and heavier than those of all other TIMSS countries. Fourth-grade textbooks in the US average 530 pages in mathematics and 397 pages in science. By comparison, textbooks intended for children of this age in other TIMSS countries average 170 pages in mathematics and 125 pages in science.
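
The page counts above imply that US fourth-grade textbooks are roughly three times as long as the average for other TIMSS countries. A quick check of that arithmetic, using only the figures quoted in this article and rounding to one decimal place:

```latex
\[
\frac{530}{170} \approx 3.1 \quad \text{(mathematics)}, \qquad
\frac{397}{125} \approx 3.2 \quad \text{(science)}
\]
```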

Also striking in the middle '90s is how our textbooks differ from most others in the extreme number of topics that they cover. US textbooks cover far more topics in Grades 4 and 8 than 75 percent of the nations participating in TIMSS. The extreme breadth of topics is presented in these textbooks at the expense of depth of coverage.

A static concept of basics

In the US it would appear that a common implicit definition of "basics" in education is content and skills that "are so important that they bear repeating - and repeating, and repeating". Arithmetic, for example, is a set of content and skills visited and revisited in US classrooms year after year. Even in Grade 8, when most high-achieving TIMSS countries concentrate their curricular focus on algebra and geometry, arithmetic is a major part of schooling in this country. Other TIMSS nations act as if far more mathematics and science topics are basic. In these countries basic content is so important that, when it is introduced, the curriculum focuses on it. When basics first enter the curriculum, they receive concentrated attention so that they can be mastered and children can be prepared to learn a new set of basics in following grades. Focused curricula are the motor of a dynamic definition of basics. Among the highest achieving countries, each new grade sees new basics receiving concentrated attention to prepare for mastering more complicated topics yet to come. Most TIMSS countries introduce fifteen topics with intense curricular focus between the fourth and eighth grades. The highest achieving TIMSS countries introduce an average of twenty topics in this way. TIMSS studies of curricula, textbooks, and teachers' instructional practices show that the common view of educational basics is different in the US.

At Grade 4, the definition of basic content in the US does not differ substantially from that in high-achieving countries. However, in our country, the same elementary topics that form the core content at Grade 4 appear repeatedly in higher grades. What new content does enter the curriculum rarely does so with the in-depth examination and large amount of instructional time that characterize other countries. In fact, on average we introduce only one topic with this type of focused instructional attention between fourth and eighth grade in either mathematics or science.

The lack of instructional focus on those topics that are newly introduced at each grade may help explain the drop in student achievement levels in the US between Grade 4 and 8.

Undemanding standards

As suggested above, the consequence of lack of focus and coherence, and the static approach to defining what is basic, is that US curricula are undemanding compared to other countries, especially during the middle grades. Materials intended for our mathematics and science students mention a staggering array of topics, most of which are introduced in the elementary grades.

TIMSS Repeat

The National Science Foundation intends to support a TIMSS Repeat (TIMSS-R) in 1999 to assess any detectable change in achievement represented by the cohort of students tested in the fourth grade TIMSS study as they enter the eighth grade. The findings of the study will be interesting, especially in light of the anticipated impact reform activities may have had in recent years in mathematics and science education.



Dr. Margaret Cozzens Named Vice Chancellor for Academic and Student Affairs at the University of Colorado at Denver

After serving as Division Director for almost six years, Dr. Margaret Cozzens will be leaving her position at the end of June to become Vice Chancellor for Academic and Student Affairs at the University of Colorado at Denver. Dr. Cozzens has done an outstanding job of leading the Division and the K-12 science, mathematics, and technology education communities, and her presence will be greatly missed.

Dr. Cozzens promises the K-12 community that she will not be far away, however. Recognizing that the key to success in improving education is forming lasting partnerships at all levels, Dr. Cozzens will continue to promote communication and collaboration among the undergraduate, graduate and K-12 communities. Please join us in wishing Dr. Cozzens the best of luck in her new role, and thank her for her continued commitment and dedication to education.

The 1997 Presidential Awards for Excellence in Mathematics and Science Teaching Program

The 1997 Presidential Award for Excellence in Mathematics and Science Teaching (PAEMST) winners were honored in Washington, D.C. this June. A total of 214 teachers received the nation's highest commendation for K-12 math and science teachers this year. The Award recognizes a combination of sustained and exemplary work both in, and outside of, the classroom. Each award includes a grant of $7,500 from the National Science Foundation (NSF) to the recipient's school. Awardees also receive an expense-paid trip to Washington, D.C. to attend seminars and engage in professional discussions with their peers and with national legislators and education policymakers. Each awardee also receives a selection of gifts from private-sector contributors to the program. For more information about the PAEMST Program, please contact Dr. Janice Earle, Senior Program Director, at (703) 306-0422 or visit the PAEMST homepage at

Please visit the Division's homepage at or contact Dr. Janice Earle at (703) 306-1620 for additional information on ESIE assessment activities.

A Policymaker's Guide to Standards-Led Assessment, from the Education Commission of the States (ECS), describes how standards-led assessments differ from more traditional tests in that they are more closely linked to curriculum and incorporate pre-established learning performance goals.

The Guide may be obtained by contacting ECS at; by phone at (303) 299-3692; or by writing to ECS, 707 17th Street, Suite 2700, Denver, CO, 80202-3427.


The Foundation provides awards for research and education in the sciences and engineering. The awardee is wholly responsible for the conduct of such research and preparation of the results for publication. The Foundation, therefore, does not assume responsibility for the research findings or their interpretation.

The Foundation welcomes proposals from all qualified scientists and engineers and strongly encourages women, minorities, and persons with disabilities to compete fully in any of the research and education related programs described here. In accordance with federal statutes, regulations, and NSF policies, no person on grounds of race, color, age, sex, national origin, or disability shall be excluded from participation in, be denied the benefits of, or be subject to discrimination under any program or activity receiving financial assistance from the National Science Foundation. Facilitation Awards for Scientists and Engineers with Disabilities (FASED) provide funding for special assistance or equipment to enable persons with disabilities (investigators and other staff, including student research assistants) to work on NSF projects. See the program announcement or contact the program coordinator at (703) 306-1636. The National Science Foundation has TDD (Telephonic Device for the Deaf) capability, which enables individuals with hearing impairment to communicate with the Foundation about NSF programs, employment, or general information. To access NSF TDD dial (703) 306-0090; for FIRS, 1-800-877-8339

NSF 98-136