Concepts
The main concepts are those of a grid of cells, called a sheet, with either raw data, called values, or formulas in the cells. Formulas say how to mechanically compute new values from existing values. Values are generally numbers, but can also be pure text, dates, months, etc. Extensions of these concepts include logical spreadsheets. Various tools for programming sheets, visualizing data, remotely connecting sheets, displaying cells' dependencies, etc. are commonly provided.
Cells
A "cell" can be thought of as a box for holding data. A single cell is usually referenced by its column and row (A2 would represent the cell containing the value 10 in the example table below). Usually rows, representing the dependent variables, are referenced in decimal notation starting from 1, while columns representing the independent variables use 26-adicbijective numeration using the letters A-Z as numerals. Its physical size can usually be tailored to its content by dragging its height or width at box intersections (or for entire columns or rows by dragging the column- or row-headers).
A | B | C | D | |
---|---|---|---|---|
01 | value1 | value2 | added | multiplied |
02 | 10 | 20 | 30 | 200 |
An array of cells is called a sheet or worksheet. It is analogous to an array of variables in a conventional computer program(although certain unchanging values, once entered, could be considered, by the same analogy, constants). In most implementations, many worksheets may be located within a single spreadsheet. A worksheet is simply a subset of the spreadsheet divided for the sake of clarity. Functionally, the spreadsheet operates as a whole and all cells operate asglobal variables within the spreadsheet (each variable having 'read' access only except its own containing cell).
A cell may contain a value or a formula, or it may simply be left empty. By convention, formulas usually begin with = sign.
Values
A value can be entered from the computer keyboard by directly typing into the cell itself. Alternatively, a value can be based on a formula (see below), which might perform a calculation, display the current date or time, or retrieve external data such as a stock quote or a database value.
Automatic recalculation
A standard of spreadsheets since the 1980s, this optional feature eliminates the need to manually request the spreadsheet program to recalculate values (nowadays typically the default option unless specifically 'switched off' for large spreadsheets, usually to improve performance). Some earlier spreadsheets required a manual request to recalculate, since recalculation of large or complex spreadsheets often reduced data entry speed. Many modern spreadsheets still retain this option.
Real-time update
This feature refers to updating a cell's contents periodically with a value from an external source—such as a cell in a "remote" spreadsheet. For shared, Web-based spreadsheets, it applies to "immediately" updating cells another user has updated. All dependent cells must be updated also.
Locked cell
Once entered, selected cells (or the entire spreadsheet) can optionally be "locked" to prevent accidental overwriting. Typically this would apply to cells containing formulas but might be applicable to cells containing "constants" such as a kilogram/pounds conversion factor (2.20462262 to eight decimal places). Even though individual cells are marked as locked, the spreadsheet data are not protected until the feature is activated in the file preferences.
Data format
A cell or range can optionally be defined to specify how the value is displayed. The default display format is usually set by its initial content if not specifically previously set, so that for example "31/12/2007" or "31 Dec 2007" would default to the cell format of date. Similarly adding a % sign after a numeric value would tag the cell as a percentage cell format. The cell contents are not changed by this format, only the displayed value.
Some cell formats such as "numeric" or "currency" can also specify the number of decimal places.
This can allow invalid operations (such as doing multiplication on a cell containing a date), resulting in illogical results without an appropriate warning.
Cell formatting
Depending on the capability of the spreadsheet application, each cell (like its counterpart the "style" in a word processor) can be separately formatted using the attributes of either the content (point size, color, bold or italic) or the cell (border thickness, background shading, color). To aid the readability of a spreadsheet, cell formatting may be conditionally applied to data; for example, a negative number may be displayed in red.
A cell's formatting does not typically affect its content and depending on how cells are referenced or copied to other worksheets or applications, the formatting may not be carried with the content.
Named cells
In most implementations, a cell, or group of cells in a column or row, can be "named" enabling the user to refer to those cells by a name rather than by a grid reference. Names must be unique within the spreadsheet, but when using multiple sheets in a spreadsheet file, an identically named cell range on each sheet can be used if it is distinguished by adding the sheet name. One reason for this usage is for creating or running macros that repeat a command across many sheets. Another reason is that formulas with named variables are readily checked against the algebra they are intended to implement (they resemble Fortran expressions). Use of named variables and named functions also makes the spreadsheet structure more transparent.
Cell reference
In place of a named cell, an alternative approach is to use a cell (or grid) reference. Most cell references indicate another cell in the same spreadsheet, but a cell reference can also refer to a cell in a different sheet within the same spreadsheet, or (depending on the implementation) to a cell in another spreadsheet entirely, or to a value from a remote application.
A typical cell reference in "A1" style consists of one or two case-insensitive letters to identify the column (if there are up to 256 columns: A–Z and AA–IV) followed by a row number (e.g., in the range 1–65536). Either part can be relative (it changes when the formula it is in is moved or copied), or absolute (indicated with $ in front of the part concerned of the cell reference). The alternative "R1C1" reference style consists of the letter R, the row number, the letter C, and the column number; relative row or column numbers are indicated by enclosing the number in square brackets. Most current spreadsheets use the A1 style, some providing the R1C1 style as a compatibility option.
When the computer calculates a formula in one cell to update the displayed value of that cell, cell reference(s) in that cell, naming some other cell(s), cause the computer to fetch the value of the named cell(s).
A cell on the same "sheet" is usually addressed as:
=A1
A cell on a different sheet of the same spreadsheet is usually addressed as:
=SHEET2!A1 (that is; the first cell in sheet 2 of same spreadsheet).
Some spreadsheet implementations allow a cell references to another spreadsheet (not the current open and active file) on the same computer or a local network. It may also refer to a cell in another open and active spreadsheet on the same computer or network that is defined as shareable. These references contain the complete filename, such as:
='C:\Documents and Settings\Username\My spreadsheets\[main sheet]Sheet1!A1
In a spreadsheet, references to cells automatically update when new rows or columns are inserted or deleted. Care must be taken, however, when adding a row immediately before a set of column totals to ensure that the totals reflect the additional rows values—which they often do not.
A circular reference occurs when the formula in one cell refers—directly, or indirectly through a chain of cell references—to another cell that refers back to the first cell. Many common errors cause circular references. However, some valid techniques use circular references. These techniques, after many spreadsheet recalculations, (usually) converge on the correct values for those cells.
Cell ranges
Likewise, instead of using a named range of cells, a range reference can be used. Reference to a range of cells is typically of the form (A1:A6), which specifies all the cells in the range A1 through to A6. A formula such as "=SUM(A1:A6)" would add all the cells specified and put the result in the cell containing the formula itself.
Sheets
In the earliest spreadsheets, cells were a simple two-dimensional grid. Over time, the model has expanded to include a third dimension, and in some cases a series of named grids, called sheets. The most advanced examples allow inversion and rotation operations which can slice and project the data set in various ways.
Formulas
A formula identifies the calculation needed to place the result in the cell it is contained within. A cell containing a formula therefore has two display components; the formula itself and the resulting value. The formula is normally only shown when the cell is selected by "clicking" the mouse over a particular cell; otherwise it contains the result of the calculation.
A formula assigns values to a cell or range of cells, and typically has the format:
=expression |
where the expression consists of:
- values, such as
2
,9.14
or6.67E-11
; - references to other cells, such as, e.g.,
A1
for a single cell orB1:B3
for a range; - arithmetic operators, such as
+
,-
,*
,/
, and others; - relational operators, such as
>=
,<
, and others; and, - functions, such as
SUM()
,TAN()
, and many others.
When a cell contains a formula, it often contains references to other cells. Such a cell reference is a type of variable. Its value is the value of the referenced cell or some derivation of it. If that cell in turn references other cells, the value depends on the values of those. References can be relative (e.g.,
A1
, or B1:B3
), absolute (e.g., $A$1
, or $B$1:$B$3
) or mixed row– or column-wise absolute/relative (e.g., $A1
is column-wise absolute and A$1
is row-wise absolute).
The available options for valid formulas depends on the particular spreadsheet implementation but, in general, most arithmetic operations and quite complex nested conditional operations can be performed by most of today's commercial spreadsheets. Modern implementations also offer functions to access custom-build functions, remote data, and applications.
A formula may contain a condition (or nested conditions)—with or without an actual calculation—and is sometimes used purely to identify and highlight errors. In the example below, it is assumed the sum of a column of percentages (A1 through A6) is tested for validity and an explicit message put into the adjacent right-hand cell.
- =IF(SUM(A1:A6) > 100, "More than 100%", SUM(A1:A6))
Further examples:
- =IF(AND(A1<>"",B1<>""),A1/B1,"") means that if both cells A1 and B1 are not <> empty "", then divide A1 by B1 and display, other do not display anything.
- =IF(AND(A1<>"",B1<>""),IF(B1<>0,A1/B1,"Division by zero"),"") means that if cells A1 and B1 are not empty, and B1 is not zero, then divide A1 by B1, if B1 is zero, then display "Division by zero, and do not display anything if either A1 and B1 are empty.
- =IF(OR(A1<>"",B1<>""),"Either A1 or B1 show text","") means to display the text if either cells A1 or B1 are not empty.
The best way to build up conditional statements is step by step composing followed by trial and error testing and refining code.
A spreadsheet does not, in fact, have to contain any formulas at all, in which case it could be considered merely a collection of data arranged in rows and columns (a database) like a calendar, timetable or simple list. Because of its ease of use, formatting and hyperlinking capabilities, many spreadsheets are used solely for this purpose.
Functions
Spreadsheets usually contain a number of suppliedfunctions, such as arithmetic operations (for example, summations, averages and so forth), trigonometric functions, statistical functions, and so forth. In addition there is often a provision for user-defined functions. In Microsoft Excel these functions are defined using Visual Basic for Applications in the supplied Visual Basic editor, and such functions are automatically accessible on the worksheet. In addition, programs can be written that pull information from the worksheet, perform some calculations, and report the results back to the worksheet. In the figure, the name sq is user-assigned, and function sq is introduced using the Visual Basiceditor supplied with Excel. Name Manager displays the spreadsheet definitions of named variables x & y.
Subroutines
Functions themselves cannot write into the worksheet, but simply return their evaluation. However, in Microsoft Excel, subroutines can write values or text found within the subroutine directly to the spreadsheet. The figure shows the Visual Basic code for a subroutine that reads each member of the named column variable x, calculates its square, and writes this value into the corresponding element of named column variable y. The y column contains no formula because its values are calculated in the subroutine, not on the spreadsheet, and simply are written in.
Remote spreadsheet
Whenever a reference is made to a cell or group of cells that are not located within the current physical spreadsheet file, it is considered as accessing a "remote" spreadsheet. The contents of the referenced cell may be accessed either on first reference with a manual update or more recently in the case of web based spreadsheets, as a near real time value with a specified automatic refresh interval.
Charts
Many spreadsheet applications permit charts, graphs or histograms to be generated from specified groups of cells that are dynamically re-built as cell contents change. The generated graphic component can either be embedded within the current sheet or added as a separate object.
Multi-dimensional spreadsheets
In the late 1980s and early 1990s, first Javelin Software and later Lotus Improv appeared and unlike models in a conventional spreadsheet, they utilized models built on objects called variables, not on data in cells of a report. These multi-dimensional spreadsheets enabled viewing data and algorithms in various self-documenting ways, including simultaneous multiple synchronized views. For example, users of Javelin could move through the connections between variables on a diagram while seeing the logical roots and branches of each variable. This is an example of what is perhaps its primary contribution of the earlier Javelin—the concept of traceability of a user's logic or model structure through its twelve views. A complex model can be dissected and understood by others who had no role in its creation, and this remains unique even today. Javelin was used primarily for financial modeling, but was also used to build instructional models in college chemistry courses, to model the world's economies, and by the military in the early Star Wars project. It is still in use by institutions for which model integrity is mission critical.
In these programs, a time series, or any variable, was an object in itself, not a collection of cells that happen to appear in a row or column. Variables could have many attributes, including complete awareness of their connections to all other variables, data references, and text and image notes. Calculations were performed on these objects, as opposed to a range of cells, so adding two time series automatically aligns them in calendar time, or in a user-defined time frame. Data were independent of worksheets—variables, and therefore data, could not be destroyed by deleting a row, column or entire worksheet. For instance, January's costs are subtracted from January's revenues, regardless of where or whether either appears in a worksheet. This permits actions later used in pivot tables, except that flexible manipulation of report tables was but one of many capabilities supported by variables. Moreover, if costs were entered by week and revenues by month, Javelin's program could allocate or interpolate as appropriate. This object design enabled variables and whole models to reference each other with user-defined variable names, and to perform multidimensional analysis and massive, but easily editable consolidations.
Trapeze, a spreadsheet on the Mac, went further and explicitly supported not just table columns, but also matrix operators.
Logical spreadsheets
Spreadsheets that have a formula language based upon logical expressions, rather than arithmetic expressions are known as logical spreadsheets. Such spreadsheets can be used to reason deductively about their cell values.
Programming issues
Just as the early programming languages were designed to generate spreadsheet printouts, programming techniques themselves have evolved to process tables (also known as spreadsheets or matrices) of data more efficiently in the computer itself.
End-user development
Spreadsheets are a popular End-user development tool. EUD denotes activities or techniques in which people who are not professional developers create automated behavior and complex data objects without significant knowledge of a programming language. Many people find it easier to perform calculations in spreadsheets than by writing the equivalent sequential program. This is due to several traits of spreadsheets.
- They use spatial relationships to define program relationships. Humans have highly developed intuitions about spaces, and of dependencies between items. Sequential programming usually requires typing line after line of text, which must be read slowly and carefully to be understood and changed.
- They are forgiving, allowing partial results and functions to work. One or more parts of a program can work correctly, even if other parts are unfinished or broken. This makes writing and debugging programs easier, and faster. Sequential programming usually needs every program line and character to be correct for a program to run. One error usually stops the whole program and prevents any result.
- Modern spreadsheets allow for secondary notation. The program can be annotated with colors, typefaces, lines, etc. to provide visual cues about the meaning of elements in the program.
- Extensions that allow users to create new functions can provide the capabilities of a functional language.
- Extensions that allow users to build and apply models from the domain of machine learning.
Spreadsheet programs
A "spreadsheet program" is designed to perform general computation tasks using spatial relationships rather than time as the primary organizing principle.
It is often convenient to think of a spreadsheet as a mathematical graph, where the nodes are spreadsheet cells, and the edges are references to other cells specified in formulas. This is often called the dependency graph of the spreadsheet. References between cells can take advantage of spatial concepts such as relative position and absolute position, as well as named locations, to make the spreadsheet formulas easier to understand and manage.
Spreadsheets usually attempt to automatically update cells when the cells they depend on change. The earliest spreadsheets used simple tactics like evaluating cells in a particular order, but modern spreadsheets calculate following a minimal recomputation order from the dependency graph. Later spreadsheets also include a limited ability to propagate values in reverse, altering source values so that a particular answer is reached in a certain cell. Since spreadsheet cells formulas are not generally invertible, though, this technique is of somewhat limited value.
Many of the concepts common to sequential programming models have analogues in the spreadsheet world. For example, the sequential model of the indexed loop is usually represented as a table of cells, with similar formulas (normally differing only in which cells they reference).
Spreadsheets have evolved to use scripting programming languages like VBA as a tool for extensibility beyond what the spreadsheet language makes easy.
Shortcomings
While spreadsheets represented a major step forward in quantitative modeling, they have deficiencies. Their shortcomings include the perceived unfriendliness of alpha-numeric cell addresses.
- Research by ClusterSeven has shown huge discrepancies in the way financial institutions and corporate entities understand, manage and police their often vast estates of spreadsheets and unstructured financial data (including comma separated variable (CSV) files and Microsoft Access Databases). One study in early 2011 of nearly 1,500 people in the UK found that 57% of spreadsheet users have never received formal training on the spreadsheet package they use. 72% said that no internal department checks their spreadsheets for accuracy. Only 13% said that Internal Audit reviews their spreadsheets, while a mere 1% receive checks from their risk department.
- Spreadsheets have significant reliability problems. Research studies estimate that roughly 94% of spreadsheets deployed in the field contain errors, and 5.2% of cells in unaudited spreadsheets contain errors.
-
- Despite the high error risks often associated with spreadsheet authorship and use, specific steps can be taken to significantly enhance control and reliability by structurally reducing the likelihood of error occurrence at their source.
- The practical expressiveness of spreadsheets can be limited unless their modern features are used. Several factors contribute to this limitation. Implementing a complex model on a cell-at-a-time basis requires tedious attention to detail. Authors have difficulty remembering the meanings of hundreds or thousands of cell addresses that appear in formulas.
-
- These drawbacks are mitigated by the use of named variables for cell designations, and employing variables in formulas rather than cell locations and cell-by-cell manipulations. Graphs can be used to show instantly how results are changed by changes in parameter values. In fact, the spreadsheet can be made invisible except for a transparent user interface that requests pertinent input from the user, displays results requested by the user, creates reports, and has built-in error traps to prompt correct input.
- Similarly, formulas expressed in terms of cell addresses are hard to keep straight and hard to audit. Research shows that spreadsheet auditors who check numerical results and cell formulas find no more errors than auditors who only check numerical results. That is another reason to use named variables and formulas employing named variables.
- The alteration of a dimension demands major surgery. When rows (or columns) are added to or deleted from a table, one has to adjust the size of many downstream tables that depend on the table being changed. In the process, it is often necessary to move other cells around to make room for the new columns or rows, and to adjust graph data sources. In large spreadsheets, this can be extremely time consuming.
- Adding or removing a dimension is so difficult, one generally has to start over. The spreadsheet as a paradigm really forces one to decide on dimensionality right of the beginning of one's spreadsheet creation, even though it is often most natural to make these choices after one's spreadsheet model has matured. The desire to add and remove dimensions also arises in parametric and sensitivity analyses.
-
- Multi-dimensional spreadsheets and tools such as Analytica avoid this important pitfall by generalizing the 2-D paradigm of the classical spreadsheet to a multi-dimensional representation.
- Collaboration in authoring spreadsheet formulas can be difficult when such collaboration occurs at the level of cells and cell addresses.
-
- However, like programming languages, spreadsheets are capable of using aggregate cells with similar meaning and indexed variables with names that indicate meaning. Some spreadsheets have good collaboration features, and it is inadvisable to author at the level of cells and cell formulas to avoid obstacles to collaboration, where many people cooperate on data entry and many people use the same spreadsheet. In collaborative authoring, it is advisable to use the range-protection feature of spreadsheets that prevents the contents of specific parts of a worksheet from being inadvertently altered.
Other problems associated with spreadsheets include:
- Some sources advocate the use of specialized software instead of spreadsheets for some applications (budgeting, statistics)
- Many spreadsheet software products, such as Microsoft Excel (versions prior to 2007) and OpenOffice.org Calc(versions prior to 2008), have a capacity limit of 65,536 rows by 256 columns (216 and 28 respectively). This can present a problem for people using very large datasets, and may result in lost data.
- Lack of auditing and revision control. This makes it difficult to determine who changed what and when. This can cause problems with regulatory compliance. Lack of revision control greatly increases the risk of errors due the inability to track, isolate and test changes made to a document.
- Lack of security. Spreadsheets lack controls on who can see and modify particular data. This, combined with the lack of auditing above, can make it easy for someone to commit fraud.
- Because they are loosely structured, it is easy for someone to introduce an error, either accidentally or intentionally, by entering information in the wrong place or expressing dependencies among cells (such as in a formula) incorrectly.
- The results of a formula (example "=A1*B1") applies only to a single cell (that is, the cell the formula is actually located in—in this case perhaps C1), even though it can "extract" data from many other cells, and even real time dates and actual times. This means that to cause a similar calculation on an array of cells, an almost identical formula (but residing in its own "output" cell) must be repeated for each row of the "input" array. This differs from a "formula" in a conventional computer program, which typically makes one calculation that it applies to all the input in turn. With current spreadsheets, this forced repetition of near identical formulas can have detrimental consequences from a quality assurance standpoint and is often the cause of many spreadsheet errors. Some spreadsheets have array formulas to address this issue.
- Trying to manage the sheer volume of spreadsheets that may exist in an organization without proper security, audit trails, unintentional introduction of errors, and other items listed above can become overwhelming.
While there are built-in and third-party tools for desktop spreadsheet applications that address some of these shortcomings, awareness and use of these is generally low. A good example of this is that 55% of Capital market professionals "don't know" how their spreadsheets are audited; only 6% invest in a third-party solution
Spreadsheet risk
Further information: Financial modeling § Accounting
Spreadsheet risk is the risk associated with deriving a materially incorrect value from a spreadsheet application that will be utilised in making a related (usually numerically based) decision. Examples include the valuation of an asset, the determination of financial accounts, the calculation of medicinal doses or the size of load-bearing beam for structural engineering. The risk may arise from inputting erroneous or fraudulent data values, from mistakes (or incorrect changes) within the logic of the spreadsheet or the omission of relevant updates (e.g., out of date exchange rates). Some single-instance errors have exceeded US$1 billion. Because spreadsheet risk is principally linked to the actions (or inaction) of individuals it is defined as a sub-category of operational risk.
In the report into the 2012 JPMorgan Chase trading loss, a lack of control over spreadsheets used for critical financial functions was cited as a factor in the trading losses of more than six billion dollars which were reported as a result ofderivatives trading gone bad.
Despite this, research carried out by ClusterSeven revealed that around half (48%) of c-level executives and senior managers at firms reporting annual revenues over £50m said there were either no usage controls at all or poorly applied manual processes over the use of spreadsheets at the firms.
In 2013 Thomas Herndon, a graduate student of economics at the University of Massachusetts Amherst found major coding flaws in the spreadsheet used by the economists Carmen Reinhart and Kenneth Rogoff in a very influential 2010 journal article. The Reinhart and Rogoff article was widely used as justification to drive 2010–13 European austerity