|
|
||||||||
a Dep. of Geography, Univ. of Wisconsin, Madison, WI 53706
b Dep. of Geography and Geology, Univ. of Wisconsin, Whitewater, WI 53190
c State Key Lab. of Resources and Environmental Information System, Inst. of Geographical Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing, China and Dep. of Geography, Univ. of Wisconsin, Madison, WI 53706
* Corresponding author (weidong6616{at}yahoo.com)
| ABSTRACT |
|---|
|
|
|---|
Abbreviations: CCDF, conditional cumulative distribution function CMC, coupled Markov chain JCPD, joint conditional probability distribution PV, occurrence probability vector PV-realizations, visualized realizations from the calculated PVs TMC, triplex Markov chain TPM, transition probability matrix TP-realizations, simulated realizations using the TMC model through conditional transition probabilities
| INTRODUCTION |
|---|
|
|
|---|
There are several problems hindering spatial uncertainty modeling: (1) It is difficult to mathematically calculate the JCPD of a random variable at all unknown locations in a study area of even moderate size. So far we have not yet found any existing geostatistical method that realizes this goal. The normal way for spatial uncertainty modeling is through generating a set of alternative realizations and then approximately estimating the JCPD (represented as a series of probability maps) from a number of realizations (Zhang and Goodchild, 2002; Zhang and Li, 2005). Thus, the accuracy of probability maps is largely dependent on the number of realizations used. (2) Many random field models have difficulties generating a sufficiently large number of realizations within acceptable computation time and computer storage (Dubrule and Damsleth, 2001), particularly when the number of classes or thresholds is large. With the ongoing development of computer techniques, this problem has been relaxed in recent years. For example, the sequential indicator simulation is an efficient variogram-based simulation method; in recent years it has been used for spatial uncertainty modeling of land-cover classes with a small number of classes (Kyriakidis and Dungan, 2001; Zhang and Goodchild, 2002). But for some heavily iterative methods that may need a long computation time (days or even months) to generate a realization, such as the recently emerged Bayesian-Markov random field method (using simulated annealing) for sparse data modeling (Norberg et al., 2002), computation load is still a big concern. (3) How to effectively incorporate the complex spatial variation of random variables into a simulation method is still a difficult issue. Given the same conditioning dataset, a simulation method that can incorporate more information of spatial heterogeneity of the targeted random variable will generate more realistic realizations of the unknown truth, and thus more effectively reflect the spatial pattern of the targeted random variable and decrease the spatial uncertainty. Variograms provide widely accepted measures of spatial continuity, but conventional variogram-based methods are not capable of reflecting the complex interdependence of multiple classes and reproduce complex large-scale and long-range features (Ortiz and Deutsch, 2004; Liu and Journel, 2004). The major reasons may be that class interdependences (including cross correlations) are normally ignored in variogram-based simulation algorithms because of the awkwardness in cokriging a number of classes (Goovaerts, 1996, p. 911912) and auto-variograms are too limiting in capturing complex heterogeneity of real patterns of categorical geographical variables (Caers and Zhang, 2004). In the 1990s and thereafter, large effort has been devoted to this issue in geostatistics, and significant progress has been made in recent several years. Major work has mainly focused on two general approaches: (i) incorporating multiple-point statistics into indicator simulation from various data sources, such as training images (Guardiano and Srivastava, 1993; Caers and Zhang, 2004), blasthole data (Ortiz and Deutsch, 2004), and structured paths (Liu and Journel, 2004); and (ii) using Markov chains in multi-dimensions to generate conditional realizations (Luo, 1996; Elfeki and Dekking, 2001; Li et al., 2004; Zhang and Li, 2005) or using transition probabilities to replace variograms in indicator simulation (Carle and Fogg, 1996).
Markov chain models have been used in soil science for characterizing spatial distribution of soil classes and soil layers. For one-dimensional applications, see Li et al. (1997)(1999). For two-dimensional applications, see Li et al. (2004) and Wu et al. (2004). The triplex Markov chain (TMC) model proposed by Li et al. (2004) for conditional simulation of soil classes in two-dimensions uses the method of coupled Markov chains (CMC) (Elfeki and Dekking, 2001) in its calculation of conditional transition probabilities. Only four nearest known neighbors along the orthogonal directions (i.e., x and y axes) are considered in determining the conditional transition probability at each point to be estimated and the conditional transition probability is explicitly calculated by directly conditioning to known data. Therefore the method is highly efficient. Recently, Wu et al. (2004) proposed an efficient Markov mesh model for reconstruction of binary images of heterogeneous soil structures. Through scanning the real image for simultaneous local parameter estimation, the model can well capture the spatial patterns and reproduce the variogram. The Markov mesh models (Abend et al., 1965; Qian and Titterington, 1991; Gray et al., 1994) represent a special subclass of Markov random fields (Besag, 1986), also outside conventional geostatistics. The Markov mesh models are cliqued-based and have been widely used for image processing. Differing from conventional Markov random field methods that use iterative updating approaches, Markov mesh models can be used to conduct efficient simulation by a one-pass way and are particularly used for image structure analysis (mainly for binary images) through unconditional simulation. Directly using Markov mesh models for conditional simulation on sampled data seems infeasible. Norberg et al. (2002) recently used Markov random fields for the geostatistical modeling purpose by using simulated annealing for iterative updating.
One typical feature of multinomial categorical variables in soil science such as soil layers and soil types is that they normally exhibit strong interdependence between multinomial classes. This interdependence includes strong cross correlations, juxtaposition relationships, and directional asymmetry in spatial occurrence of multinomial classes (Li et al., 1997; Li et al., 2004). Although many random field models can be used to simulate categorical variables (Chiles and Delfiner, 1999), Markov chain-based conditional simulation methods can better incorporate these special features because of the special characteristics of transition probabilities. For example, if Class A frequently occurs as a neighbor of Class B and seldom occurs as a neighbor of Class C, this juxtaposition relationship can be reflected in Markov transition probabilities and therefore respected in realizations. Similarly, if Classes A, B, C often occur as a sequence of ABC along a direction (e.g., west-to-east), this asymmetry also can be reflected in Markov transition probabilities along that direction and thus also respected in realizations. But such behaviors may be difficult to be captured with other spatial measures (Zhang and Li, 2005). The second typical feature of categorical soil variables is that the number of classes may be very large. For example, there may be dozens of soil series occurring in a watershed stretching over dozens of square kilometers (USDA, 1962). To deal with a large number of soil classes with due consideration of cross correlations, approaches based on iterative simulation methods (e.g., simulated annealing) or solving large cokriging equation systems may be unpractical in both computation time and numerical stability (Goovaerts, 1996). Unlike many conventional geostatistical methods that describe spatial correlations of classes by indicator covariance, Markov chain methods use transition probabilities of classes for the same purpose. In a Markov transition probability matrix (TPM), the diagonal elements (auto-transitions) represent autocorrelations of individual classes and the off-diagonal elements (cross-transitions) represent cross correlations between different classes. Because TPMs are normally estimated unidirectionally (e.g., from north to south), directional asymmetries can be included naturally. Thus, class interdependence is implicitly incorporated in Markov chain models by cross-transition probabilities. Since simulation by Markov chain models does not involve complex computation, they have the advantage in dealing with a large number of classes for higher resolution simulation. This capability is especially important for simulation of soil classes, where normally many classes (or types) may be involved (Li et al., 2004).
In this paper, we proposed an innovative probability vector approach to directly calculating the JCPD of multinomial classes through the TMC model as an alternative to the conventional "brute force" method that estimates the same from a number of realizations. We represent the JCPD of multinomial classes as a set of PVs. The objective is to find a more accurate and efficient way to represent the spatial uncertainties that arise in mapping soil classes. We hypothesize that calculated PVs represent the PVs estimated from an infinite number of realizations and thus the visualized realizations from calculated PVs represent the spatial variation of multinomial classes. A simplified soil map is used as a case study for providing conditioning data and for evaluating this hypothesis. The probability vector approach proposed here is also applicable to conditional simulation with other existing Markov chain models based on single pass algorithms and explicit conditional transition probability expressions.
| MATERIALS AND METHODS |
|---|
|
|
|---|
The Markov chain conditional simulation methods such as the TMC model (Li et al., 2004) follow the same simulation technique as sequential simulation algorithms except for using smaller (and changeable) neighborhoods of conditioning data and a fixed sequence (path). Although probability maps can be estimated from multiple realizations in the TMC model (Zhang and Li, 2004; Zhang and Li, 2005), the accuracy of estimated occurrence probabilities depends on the number of realizations. When the simulation area is large and the number of involving classes is large, estimating accurate PVs by generating a large number of realizations is computationally burdensome. The facts that the TMC model (a) is a single-pass simulation method, (b) has an explicit conditional transition probability expression, and (c) conditions the estimate of each unknown point to a few (four) nearest known points on fixed axis directions, suggests another way. In particular, it may be possible to calculate the JCPD (i.e., PVs) from the Markov transition probabilities by conditioning to both observed data and previously estimated locations, rather than approximately estimate them from a large number of simulated realizations. Thus, single realizations may be directly obtained from the calculated PVs by Monte Carlo sampling. Below we explain how to make this calculation.
For a categorical variable, a JCPD of all unknown points in a study area can be expressed as
![]() | [1] |
By using the Bayes' Theorem (i.e., the definition of conditional probability), Expression [1] can be factored as
![]() | [2] |
To solve Eq. [2], our task is to solve every one-point conditional probability distribution on the right-hand side, for example, the conditional probability distribution of the ith visited point
![]() | [3] |
It is difficult to directly solve Eq. [3] in sequential simulation algorithms without all of the (n + i 1) points being known. For example, kriging in sequential simulation algorithms deals only with single indicator values of all of the (n + i 1) points, not their conditional probability distributions (i.e., vectors of conditional probability values). That means we have to estimate the CCDF of one unknown point first, allocate a value to the point by Monte Carlo sampling, then estimate the next unknown point. Thus, by following a (random or fixed) visiting sequence, sequential simulation algorithms can generate a set of alternative realizations to represent the JCPD (i.e., spatial uncertainty). However, with the simple neighborhood structure and the simple explicit solution of conditional transition probability in the TMC model, Eq. [3] can be directly solved as long as the one-point conditional probability distributions of the (n + i 1) points are known.
Note that in practical use, the n in Eq. [3] need not be all the observed data and also i need not be all previously simulated values in the whole study area, because the data closest to the location being estimated tend to screen the influence of distant data. In the practice of sequential simulation algorithms, only the original data and those previously simulated values closest to the location ui are retained. Including all observed data and previously simulated data is not only unnecessary but also impossible from a practical standpoint.
The TMC Model
The TMC model uses a simulation algorithm similar to sequential simulation algorithms in kriging geostatistics, but rather than estimating CCDFs, it estimates the conditional transition probabilities. Unlike the sequential simulation algorithms, which need to define a search neighborhood to limit the number of conditioning data, the TMC model only conditions on four nearest known locations along the axes (Li et al., 2004). The conditional probability distribution of the ith unknown point can be simplified as
![]() | [4] |
|
A TMC can be explained as two extended CMCs in the opposite directionsthe CMC ZL and the CMC ZR. The two CMCs proceed alternately in opposite directions. The alternate paths in the TMC model are a necessity, not only for avoiding directional artifacts (i.e., the directional trend effect along the simulation direction; see demonstrations in Gray et al. (1994)), but also for effectively imposing influences of known data (simulated or observed) at both (left and right) sides on the estimate of the current unknown location. Therefore, the TMC model is composed of a conditional transition probability pairs
, which represent the left-to-right and right-to-left CMCs, respectively. Here, l and m represent the specific states of the two adjacent predecessors, k represents the state of the current pixel, q and o represent the specific states of the two future known pixels. For more detailed explanation of the expressions of the conditional transition probability pairs, see Li et al. (2004).
Occurrence Probability Vectors
The probability vector approach calculates the JCPD by following the visiting sequence (simulation path) and using the conditional transition probability expressions in the TMC model. The JCPD is represented as a set of PVs, and each PV actually represents a one-point conditional probability distribution in Eq. [3]. The calculation of the PV of each unknown point is conditioned to the observed data and previously estimated valuesnot values of classes, but the calculated PVs of classes at those locations.
A PV consists of a set of probability values representing the likelihood that each class occurs at a particular point (i.e., grid cell or pixel). Mark and Csillag (1989) and Goodchild et al. (1992) discussed the feasibility of using probability vectors to represent the transition between two classes if there were cartographic (or locational) errors. If each class is represented as a probability distribution, categories (or classes) have transition zones varying gradually between the maximum and minimum class likelihood with 0.5 at the location of class boundary. Zhang and Li (2005) estimated PVs from a large number of realizations to represent the spatial uncertainty of multinomial land-cover classes and demonstrated the transition zones between classes. The PVs consist of the occurrence likelihoods of multiple classes that possibly occur in a study area. They can be hardened into an area class map (i.e., the prediction map) by choosing a standard such as the maximum occurrence probabilities and assigning corresponding class values (or labels). Such PVs may be used to describe the locational uncertainty of multiple classes arising in the mapping process of hand-delineated area class maps. The PV approach developed here provides more accurate PVs in a more efficient way.
The PV for a pixel or grid cell (i, j) can be expressed as
![]() | [5] |
Using the TMC model, PVs for every grid cell can be estimated. If a cell is located on the survey lines (i.e., an observed point), its PV is already known with its certain element pk(i,j) being 1 if class k occurs in the cell and other elements being 0, that is,
![]() | [6] |
![]() | [7] |
![]() | [8] |
Considering that the TMC model is composed of a conditional probability pair and the two CMCs proceed in opposite directions, we have
![]() | [9] |
![]() | [10] |
The PVs for the study area hold all information about the JCPD of all classes in the area. Corresponding to this probability vector approach, we refer to the realization generation algorithm through conditional transition probabilities of the TMC model (Li et al., 2004) as the simulation approach. The calculation of PVs is once for all for a dataset, and the time needed for this calculation is equivalent to that used for generating one realization using the simulation approach.
Visualizing the Probability Vectors
From the calculated PVs of all cells in a study area, we can get the following information: (1) a series of occurrence probability maps of individual classes, which represent where and with how much certainty (or uncertainty) a class will occur in the study area; (2) the maximum occurrence probability map, that is, the map of greatest occurrence probabilities among occurrence probabilities of all classes at every location, which represents how much certainty (or uncertainty) exists with each point in the prediction map; (3) the prediction map based on the maximum occurrence probabilities, which represents the optimal prediction; and (4) single realizations, each of which represents one possible configuration of soil classes in the study area based on the survey data. The generation of such realizations is accomplished by Monte Carlo simulation based on the PVs. We refer to this as a "visualization" process. This visualization process can use any path, fixed or random, because it does not involve calculation of conditional transition probabilities that depend on predecessor cells. Therefore, the predictive mapping process using this probability vector approach actually consists of the calculation of PVs and the visualization of the PVs. The visualization of PVs is very quick (normally within seconds) because no complex computation is needed.
In the following sections, for clarity we will refer the realizations visualized from calculated PVs as PV-realizations, and the realizations generated by the simulation approach of the TMC model through conditional transition probabilities as TP-realizations.
Case Study
A simple case study is used to verify the probability vector approach. We calculated the PVs of soil classes on a small area of 4 x 1.7 km2. The soils in the area were classified into 7 soil classes (or types). The specific soil types are themselves of no particular interests in this study because our method does not involve any physical processes. They are just used here to show that spatial heterogeneity of soil types or classes can be characterized (Li et al., 2004). The study area is discretized into an 80 x 34 grid with a cell size of 50 x 50 m. Survey lines are distributed in the study area (i.e., the map) with an interval of about 500 m. Survey line data may be acquired by observing soil class boundary changes along a line. It is not necessary to observe every point along a line within class parcels (polygons) since their labels are the same within a parcel. Although the model itself does not limit the shape of survey lines, here we mainly use regular survey lines for the convenience of parameter estimation in the demonstration of the probability vector approach. One single-step TPM would be enough for a simulation without considering anisotropies and asymmetry. Here in our case study we used three one-step TPMs, all of which were estimated from these regular survey lines (Table 1). We first used the probability vector approach to calculate the PVs of all grid cells and visualized them. For the purpose of a comparison, we then used the simulation approach to generate some realizations and estimated probability maps from these simulated TP-realizations. The original soil map is used to represent the truth (unknown in a real world application) against which we can evaluate our results.
|
|
| RESULTS AND DISCUSSION |
|---|
|
|
|---|
|
|
|
For a large area or high-resolution simulation, even when using efficient random field models, generating a large number (e.g., 1001000) of realizations is time-consuming. But if the number of simulated realizations is small, the probability maps estimated from a few realizations may be very inaccurate. Figure 4 indicates that the results estimated from 10 TP-realizations deviated significantly from the results estimated from 100 TP-realizations. Therefore the uncertainty information estimated from a small number of simulated realizations may not be reliable.
Such uncertainty information is crucial for users to understand the possible distribution of soils in their study areas, and the positional uncertainty of soil classes existing in the predicted soil map. More importantly, these uncertainty data may serve as direct input data of risk assessment and decision-making models so that decision makers can make more reasonable decisions with the awareness of spatial uncertainties existing in their reference mapsusually hand-delineated maps or model-interpolated maps (Zhang and Goodchild, 2002).
A prediction map visualized from PVs based on maximum occurrence probabilities represents the optimal prediction. From Fig. 5 , it can be seen that the prediction maps from the probability vector approach were similar to those obtained from the TMC simulation approach. This is not surprising, because both PVs (the calculated PVs and the PVs estimated from 100 TP-realizations) had similar values.
But looking at the PV-realizations and the TP-realizations (Fig. 5), surprisingly we found differences. The TP-realizations had bigger patches and clear boundaries, and they closely resembled the prediction map; however, the soil class parcels in the PV-realizations were obviously more fragmentary, particularly at the boundary zones between classes. The reason for this discrepancy may be related with probabilities used for determining the state of a cell in the realization generation process. In the simulation approach, the cumulative conditional transition probability function was used, where the maximum conditional transition probability to a preferential state for the current cell was more prominent and other states had little chance to occur. But in the probability vector approach, the CCDF was used, where the maximum occurrence probability of the preferential state at the current cell might not be so prominent and other states also had some chance to be drawn in Monte Carlo sampling. Although the PV-realizations did not have abrupt boundaries, this should not be interpreted as a defect. Rather, it might reflect a situation where the transition between different soil classes was not so abrupt in the field, and thus there should be some interlacing of adjacent classes. For example, in the transition zones from grassland to forest, the two land cover types may be interlaced with small patches, and this should also be true for the corresponding soil types.
Figure 6 gives only a subset of omnidirectional (cross) semivariograms from the original soil map, the first PV-realization (Fig. 5c) and the first TP-realization (Fig. 5f). It can be seen that both kinds of realizations approximately reproduced the (cross) variograms. The small deviations between observed and simulated results were expected for single realizations. That is, variograms from multiple realizations normally should fluctuate around the observed results (Goovaerts, 1997). Of course, if conditioning data were more dense, such discrepancies should largely decrease. From cross variograms in Fig. 6, it also can be seen that the spatial cross-correlation between soil classes was very complex; this kind of spatial dependence is difficult to model using conventional cross variogram models.
|
REMARKS
The purpose of geostatistical simulation, for our understanding, is to predict the unknown from the known with as little uncertainty as possible, and at the same time to reflect the inevitable spatial uncertainty. A good simulation conditioned to an observed dataset should reflect the uncertainty contained in the dataset as much as possible. That means that while spatial uncertainty is inevitable because of the limited survey data, it would be desirable for realizations conditioned on that limited data to imitate the "real" map (assuming we had it in model testing) as much as possible, since structure-imitating is also one simulation purpose (Koltermann and Gorelick, 1996). Conventional spatial continuity measure such as indicator autovariograms only measures a part of spatial variation information (i.e., autocorrelations) contained in sampled data or the target variables (Goovaerts, 2002). For example, various strongly different types of heterogeneities may produce similar autovariograms (see Caers and Zhang, 2004). Therefore, approximately reproducing indicator autovariograms is necessary but may not be sufficient for a successful simulation. The efforts in geostatistics in recent years, whether incorporating multi-point statistics from training images (or dense datasets) into indicator simulation approaches (e.g., Ortiz and Deutsch, 2004; Caers and Zhang, 2004; Liu and Journel, 2004), or using Markov chains (i.e., auto/cross transition probabilities) to incorporate interclass dependences into simulation (Elfeki and Dekking, 2001; Li et al., 2004; Zhang and Li, 2005), all had the same objectiveto incorporate as much available spatial variation information into a random field model as possible. Therefore, to rigorously test a method (not an application), it may be helpful to compare simulated realizations with the "real" one so that it can be visually seen that whether realizations mimic the real spatial patterns. The apparent similarity between conditional realizations from Markov chain methods and the original image should be mainly attributed to the special characteristics of Markov transition probabilities in accounting for the interdependence of multinomial classes.
Spatial uncertainty (also called locational or positional uncertainty) is a concept relative to spatial certainty represented by observed (or sampled) data. It may not be feasible to analyze spatial uncertainty using geostatistical simulation if we have no observed data for a study area, or if the simulation method cannot condition on the observed data. Unconditionally simulated results using global parameters are completely uncertain; that is, the estimated occurrence probabilities are identical at all locations. Because the available data is limited, we obviously can only guess at classes for unobserved locations. However, the limited observed data can provide us some clue of what they may be (i.e., their occurrence probabilities) and the most likely values (at least for unobserved locations close enough to the observed data that spatial correlations contain a significant signal). An optimally interpolated map only tells us which class has the highest likelihood of occurring at an unobserved location (i.e., the interpolated value). But it is much more useful to know the probabilities of all classes at unobserved locations. Unlike unconditional methods, the conditional Markov chain simulation method provides this critically important extra dimension.
| CONCLUSIONS |
|---|
|
|
|---|
By visualizing the PVs, spatial uncertainty maps including occurrence probability maps of single classes, single realizations, the maximum occurrence probability map, and the prediction map based on maximum occurrence probabilities can all be acquired. These data may provide important input information for decision-making and risk assessment in natural resource evaluation and environmental conservation. A test study showed that the PVs estimated from simulated realizations gradually approached the calculated PVs with increasing the number of simulated realizations. When the number of realizations for estimating the PVs was small, the estimated PVs were not reliable. Individual realizations could be visualized from the calculated PVs quickly by Monte Carlo sampling. The visualized PV-realizations showed different characteristics from the simulated ones (i.e., TP-realizations). However, indicator (cross) variograms calculated from these two kinds of realizations showed that they were similar and both of them could approximately reproduce the complex spatial dependence relationships in the reference soil map. The PV-realizations also mimicked the reference soil map in spatial patterns. These mean that the visualized realizations from the calculated PVs could effectively reflect the spatial variation of soil classes.
Although the proposed method uses survey line data, it is possible to apply it to other kinds of data. With a more intensively developed software tool and a parameter estimation strategy, point data, line data, patch data, or even mixture of them might be used for conditional simulations. However, for representing the spatial continuity, line data may be more advantageous, and such data are not difficult to acquire in field survey for area class soil mapping. (As we had mentioned, survey line data might be acquired by just recording class boundary changes along a line in field survey as a special kind of purposeive sampling).
We had shown that the special features exhibited by the probability vector approach computed from a TMC were promising. Such an approach is also applicable to other Markov chain conditional simulation methods that used a one-pass simulation algorithm with explicit conditional transition probability expressions. This paper focused on demonstrating the estimation of PVs and their characteristics. Further studies are necessary to address related issues such as parameter estimation from other kinds of data and incorporation of secondary information, and to further the ability of Markov chain approaches to effectively modeling multinomial categorical variables.
| ACKNOWLEDGMENTS |
|---|
Received for publication July 29, 2004.
| REFERENCES |
|---|
|
|
|---|
This article has been cited by other articles:
![]() |
W. Li Transiograms for Characterizing Spatial Variability of Soil Classes Soil Sci. Soc. Am. J., May 16, 2007; 71(3): 881 - 893. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
| The SCI Journals | Agronomy Journal | Crop Science | |||
| Vadose Zone Journal | Journal of Plant Registrations | ||||
| Journal of Natural Resources and Life Sciences Education |
Journal of Environmental Quality |
||||