|
OR/MS Today - February 2009

Statistical Software Survey

A Long Way from Flip Charts

Biennial survey of statistical software highlights a growing list of capabilities.

By James J. Swain

All elections are historic, but this one was a doozy! The nomination and election of Barack Obama as president was dramatic, exciting and filled with surprises. Prior to the Iowa Caucus, for instance, Hillary Clinton was thought to be a shoo-in for the Democratic nomination, and her campaign expected that John Edwards would be her chief rival. Prior to his victory in New Hampshire, John McCain's campaign for the Republican nomination was declared dead. Over the course of the election, voting patterns changed, the distribution of the red and blue states was altered, and an unexpected tidal wave of new, exhilarated younger voters emerged, determined to make their voices heard. With so much uncertainty, it was difficult to understand what was going on. Yet with so much at stake, we needed to do just that.

There was no shortage of statistics offered to explain the surprising events. Throughout the complex story, we were inundated with polling results as the commentators attempted to make sense of what was happening and to predict what it might mean. Enlivening the entire process was the dynamic "Magic Wall" at CNN, presided over by John King. As the Washington Post [1] reported a year ago, King "began poking, touching and waving at the screen like an over-caffeinated traffic cop," setting in motion "a series of zooming maps and flying pie charts," that he could call forth and as quickly replace with results down to the precinct level and just as easily return to "what-if" scenarios involving delegate counts. Rarely has a news show attempted to portray so much data and so many relations among factors such as region, age, political affiliation or socio-economic status.
We have come a long way from the flip charts of Ross Perot or the dry erase board that Tim Russert used in 2000 to explain the results in Florida on election night. As the Magic Wall suggests, graphics provide a powerful way of presenting data, and the dynamic exploration of various graphics can be a powerful method of uncovering underlying relations within the data.

Statistical software makes it possible to examine data in multiple ways, to perform comparisons and to look for multivariate relations. Since plotting is limited to two- and three-dimensional projections of data, several methods have evolved to assist in forming a coherent view of the complete picture. In the matrix plot, for instance, variables are listed by row and column and the pairwise scatter plots occur in each position (sometimes with the univariate histograms in the diagonal positions). Hence, the scatter plots in each row represent the joint distribution of that variable with the variables represented by the columns.

Of course, these essentially marginal views cannot provide the entire story. For instance, correlations may exist within groups of variables that are not observable in the pairwise scatter plots. In these cases, additional insight can be obtained through a transformation of coordinates based upon principal components or factor analysis. Such techniques are often used when the underlying variation consists of groups of related variables, and these underlying factors are often fewer in number than the variables that are directly observed and plotted. In another example, Wainer [3] illustrates graphically how rotation of the random data from the RANDU random number generator can highlight the limitations of the generator, where the points fall on only 15 planes in 3-space.

Interactive graphical methods can also be used to explore relations within data.
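
The RANDU example mentioned above can also be checked numerically rather than by rotating a plot. The sketch below is an illustration only and is not taken from Wainer [3]; it assumes the commonly cited definition of RANDU (multiplier 65539, modulus 2^31, odd seed), which the article itself does not state. Under that assumption every three consecutive values satisfy 9*x[k] - 6*x[k+1] + x[k+2] = j * 2^31 for some integer j, and each distinct j corresponds to one plane.

```python
# Assumed RANDU parameters (not given in the article): x_next = 65539 * x mod 2**31
MODULUS = 2**31
MULTIPLIER = 65539


def randu(seed, n):
    """Return n successive RANDU integers, starting from an odd seed."""
    values = []
    x = seed
    for _ in range(n):
        x = (MULTIPLIER * x) % MODULUS
        values.append(x)
    return values


def plane_indices(xs):
    """Return the set of plane indices j hit by consecutive triples.

    For each triple (a, b, c), 9*a - 6*b + c is an exact multiple of the
    modulus; the multiple j identifies the plane the triple lies on.
    """
    indices = set()
    for a, b, c in zip(xs, xs[1:], xs[2:]):
        total = 9 * a - 6 * b + c
        if total % MODULUS != 0:
            raise ValueError("triple does not satisfy the RANDU relation")
        indices.add(total // MODULUS)
    return indices


xs = randu(seed=1, n=10000)
planes = plane_indices(xs)
print("distinct planes observed:", len(planes))
```

Because each value lies strictly between 0 and 2^31, the index j can only take values from -5 to 9, which is where the limit of 15 planes comes from.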
Brushing is a method in which a point or a group of points on a given plot can be highlighted in linked displays or simply used to provide access to their location within the data for detailed examination. Another form of interactive graphics is obtained through slicing [3]. In this case a variable is designated for slicing, and the data is divided into sets above and below the slice value of the variable. As the slice point is varied, the characteristics of each set are highlighted in the linked displays. In Wainer's illustration, the value of the slice cuts the set of residuals to reveal that the positive residuals come from a limited set of sources of data, in this case, a particular industrial sector.

Automated data collection by satellites such as the Hubble Space Telescope has helped to change the amount of data available to cosmologists. Freeman, Richards et al. [2] note that cosmology was a data-starved field until the advent of new collection devices. In the last few decades this has greatly changed, and they point to examples such as the 200 million objects now available in the Sloan Digital Sky Survey. Statistical analysis is being combined with astronomy to produce the interdisciplinary field of astrostatistics. They illustrate how the analysis of the increased amount of data is providing greater resolution in their models and model parameters, as well as demonstrating how underlying relations within the data can be revealed. For example, in one case nonlinear methods are used to reveal a one-dimensional manifold that explains the relative distance between different objects better than the Euclidean distance does.

Whether in astronomy or elections, one key to understanding is to identify groupings of similar items within the data. Classification methods seek to find relatively homogeneous subgroups that preserve the overall variability within the population while reducing the number of distinct groupings.
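
As a small illustration of the slicing idea described earlier in this section, the sketch below divides a set of records at a slice value and summarizes each side. The records, field names and sector labels are invented for the example and do not come from Wainer [3]; in an interactive package the same split would drive the highlighting in the linked displays.

```python
from collections import Counter

# Hypothetical data: each record has a source sector and a model residual.
records = [
    {"sector": "mining", "residual": 1.8},
    {"sector": "mining", "residual": 2.1},
    {"sector": "mining", "residual": 0.9},
    {"sector": "retail", "residual": -0.4},
    {"sector": "retail", "residual": -1.1},
    {"sector": "services", "residual": -0.7},
    {"sector": "services", "residual": -0.2},
    {"sector": "utilities", "residual": -0.5},
]


def slice_by(rows, variable, cut):
    """Split rows into those below the slice value and those at or above it."""
    below = [r for r in rows if r[variable] < cut]
    above = [r for r in rows if r[variable] >= cut]
    return below, above


def sector_counts(rows):
    """Count how many rows come from each sector."""
    return Counter(r["sector"] for r in rows)


# Slice the residuals at zero and compare where each side comes from.
below, above = slice_by(records, "residual", 0.0)
print("positive residuals by sector:", dict(sector_counts(above)))
print("negative residuals by sector:", dict(sector_counts(below)))
```

In this made-up data set all of the positive residuals come from a single sector, which is the kind of pattern that varying the slice point is meant to expose.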
Cluster analysis attempts to identify groupings of the most similar objects. Within the latest election this classification was indirectly the basis of several anecdotal groupings including "Joe Six-pack" and "Hockey Moms" who join the "Soccer Moms" of earlier elections. Such classification schemes are heavily used in marketing and politics. In simulation modeling, clustering can be used as a basis for aggregation, allowing the range of inventory items, for instance, to be represented by a smaller group of aggregated, representative items with little loss in validity and a great gain in simplicity and efficiency.

Finally, statistical methods are often used to detect the faint signals of interest within otherwise noisy data. In quality control, for instance, random variations within specified limits usually represent normal, random variation, whereas systematic trends signal an assignable cause to be investigated. The problem is similar to those faced in homeland security and credit fraud detection. In both cases it is desired to isolate the indications of a threat or an unauthorized transaction from among the myriad of private messages or normal financial transactions. Statistical methodology can play a role in a variety of similar areas such as environmental monitoring and disease prevalence, which also have potential implications in homeland defense. The American Statistical Association has interest groups in defense and security and is expanding the visibility of these problems to member statisticians.

Statistical add-ins for use with spreadsheets remain common. The spreadsheet is the primary computational tool in a wide variety of settings, familiar and accessible to all. Many procedures of data summarization, estimation, inference, basic graphics and even regression modeling can be added to spreadsheets in this way. An example is the Unistat Add-in for Excel.
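
The quality control distinction drawn above, between random variation within limits and a systematic trend, can be sketched in a few lines. This is a minimal illustration with invented measurements, not a procedure from any package in the survey. It assumes three-sigma limits estimated from a baseline sample and one commonly used run rule (eight consecutive points on the same side of the center line); other conventions exist.

```python
from statistics import mean, stdev


def control_limits(baseline, k=3.0):
    """Return (lower, center, upper) limits from an in-control baseline."""
    center = mean(baseline)
    sigma = stdev(baseline)
    return center - k * sigma, center, center + k * sigma


def out_of_limits(values, lower, upper):
    """Indices of points that fall outside the control limits."""
    return [i for i, v in enumerate(values) if v < lower or v > upper]


def run_signals(values, center, run_length=8):
    """Indices at which a run of run_length or more points on one side
    of the center line is in progress (a sign of a systematic shift)."""
    signals = []
    run = 0
    last_side = 0
    for i, v in enumerate(values):
        side = (v > center) - (v < center)  # +1 above, -1 below, 0 on the line
        if side != 0 and side == last_side:
            run += 1
        else:
            run = 1 if side != 0 else 0
        last_side = side
        if run >= run_length:
            signals.append(i)
    return signals


# Hypothetical measurements, for illustration only.
baseline = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0, 9.7, 10.3, 10.1, 9.9]
new_data = [10.1, 10.2, 10.1, 10.3, 10.2, 10.4, 10.3, 10.2, 10.4, 11.0]

lower, center, upper = control_limits(baseline)
print("points outside limits:", out_of_limits(new_data, lower, upper))
print("run-rule signals at:", run_signals(new_data, center))
```

In this example only the last point breaches the limits, but the run rule flags the upward drift earlier, even though each of those points is individually within the limits.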
The functionality of products for use with spreadsheets continues to grow, including risk analysis and Monte Carlo sampling.

Dedicated general and special-purpose statistical software generally offers a wider variety and greater depth of analysis than is available in the add-in software. For many specialized techniques such as forecasting, design of experiments and so forth, a statistical package is appropriate. Moreover, new procedures are likely to become available first in the statistical software and only later be added to the add-in software. In general, statistical software plays a distinct role on the analyst's desktop, and provided that data can be freely exchanged among applications, each part of an analysis can be performed with the most appropriate (or convenient) software tool.

An important feature of statistical programs is the importation of data from as many sources as possible, to eliminate the need for data entry when data is already available from another source. Most programs have the ability to read from spreadsheets and selected data storage formats. Also highly visible in this survey is the growth of data warehousing and "data mining" capabilities, programs and training. Data mining tools attempt to integrate and analyze data from a variety of sources (and purposes) to look for relations that could not be found in the individual data sets.

Within the survey we observe several specialized products, such as STAT::FIT, which are more narrowly focused on distribution fitting than on general statistics, but are of particular use to developers of stochastic models and simulations.
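
To give a sense of what distribution fitting involves for a simulation modeler, the sketch below fits an exponential distribution to a sample by maximum likelihood and measures the fit with the Kolmogorov-Smirnov distance. It is a generic textbook approach written for this illustration and makes no claim about how STAT::FIT or any other surveyed product works. The data are simulated with an arbitrary seed and an assumed true rate of 2.0.

```python
import math
import random


def fit_exponential(samples):
    """Maximum likelihood estimate of the exponential rate (1 / sample mean)."""
    return len(samples) / sum(samples)


def ks_statistic(samples, cdf):
    """Largest gap between the empirical CDF and a candidate CDF."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # Compare against the empirical CDF just after and just before x.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


# Simulated "observed" data, for illustration only.
rng = random.Random(2009)
samples = [rng.expovariate(2.0) for _ in range(5000)]

rate_hat = fit_exponential(samples)
distance = ks_statistic(samples, lambda x: 1.0 - math.exp(-rate_hat * x))
print("estimated rate:", round(rate_hat, 3))
print("KS distance:", round(distance, 4))
```

A small distance suggests the fitted distribution is a reasonable input model; a dedicated fitting tool would typically repeat this comparison across many candidate distributions and rank them.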
OR/MS Today copyright © 2009 by the Institute for Operations Research and the Management Sciences. All rights reserved. |