Robust Statistics - Maple Help

All Products Maple MapleSim

Home : Support : Online Help : Statistics and Data Analysis : Statistics Package : Example Worksheets : Robust Statistics

Robust measures of central tendency and dispersion

Robust statistics seek to describe data sets that suffer from noisy measurements. In particular, they should remain meaningful when a fraction of the data is changed dramatically.

>	with(Statistics):

Robust Measures of Dispersion

A measure of dispersion, also known as a measure of scale, is a statistic of a data set that describes the variability or spread of that data set. Two well-known examples are the standard deviation and the interquartile range. Two more measures of dispersion are called $Sn$ and $Qn$ , originally proposed by Rousseeuw and Croux [1].

Let us investigate how measures of dispersion behave when noise is added to a data set. Specifically, we will have an original data set $X$ of, say, $n$ data points, and a perturbed data set $Y$ where a certain fraction $r n$ of the data points are changed dramatically. We investigate at what value of $r$ the values become meaningless.

>	X := Sample(Normal(0, 1), 1000);

(1)

>	StandardDeviation(X);

$0.990176940818334$

(2)

>	Y := copy(X);

(3)

>	Y[1] := 10^100:

>	StandardDeviation(Y);

$3.16227766016838 \times 10^{98}$

(4)

For the standard deviation, we see that changing only one data point can massively change the standard deviation. In other words, there is no positive fraction $r$ of the data points that we can change while keeping the standard deviation bounded. We say that the breakdown point of the standard deviation is 0.

For the interquartile range, the process is different. Changing a single data point doesn't make the interquartile range of $Y$ change very much; in fact, we can change up to a quarter of the data points while staying within an order of magnitude from the interquartile range of $X$ . As soon as we have changed 250 out of the 1000 data points, though, the interquartile range also goes through the roof.

>	InterquartileRange(X);

$1.36849073417322$

(5)

>	Y := copy(X);

(6)

>	Y[1 .. 249] := 10^100:

>	InterquartileRange(Y);

$3.30395809676134$

(7)

>	Y[250] := 10^100:

>	InterquartileRange(Y);

$5.83333333333371 \times 10^{99}$

(8)

This suggests that the breakdown point of the interquartile range is $\frac{1}{4}$ : changing strictly fewer than $\frac{1}{4}$ of the points cannot make the interquartile range unbounded. This is indeed correct. We say that the interquartile range is more robust than the standard deviation.

The breakdown point for any statistic can never be more than $\frac{1}{2}$ : if we change over half of the data points in the set, then there's no way to decide what the "correct" data is, and what the "changed" data is. So are there dispersion statistics that reach this maximal breakdown point?

Yes, there are. A relatively well-known one is the median absolute deviation from the median, available in Maple as MedianDeviation. As the name says, it is obtained by computing the absolute difference between every data point and the median of the data set, and taking the median of these values.

>	MedianDeviation(X);

$0.686396253277771$

(9)

>	Y := copy(X):

>	Y[1 .. 499] := 10^100:

>	MedianDeviation(Y);

$5.37556591483202$

(10)

>	Y[500] := 10^100:

>	MedianDeviation(Y);

$5.00000000000000 \times 10^{99}$

(11)

The median absolute deviation from the median is a very useful robust estimator, but it also has some disadvantages, explained in the paper [1] by Rousseeuw and Croux. One of their objections is that it doesn't deal with asymmetric distributions very well, and another is that, while it is very robust against extreme changes in some points, it needs relatively many data points to "converge" to the proper value for a distribution in the absence of disturbance. In the statistics literature, this is phrased as saying that the median absolute deviation from the median is not very efficient. These authors propose two alternative statistics that also have a breakdown point of $\frac{1}{2}$ but higher efficiency, called $Sn$ and $Qn$ . Maple has an implementation of both of these, called RousseeuwCrouxSn and RousseeuwCrouxQn.

>	RousseeuwCrouxSn(X);

$0.836306393200841$

(12)

>	RousseeuwCrouxQn(X);

$0.453501081152612$

(13)

>	Y := copy(X):

>	Y[1 .. 499] := 10^100:

>	RousseeuwCrouxSn(Y);

$5.64926532313226$

(14)

>	RousseeuwCrouxQn(Y);

$0.0137805612041167$

(15)

>	Y[500] := 10^100:

>	RousseeuwCrouxSn(Y);

$1.00000000000000 \times 10^{100}$

(16)

>	RousseeuwCrouxQn(Y);

$0.00711405767220685$

(17)

The $Qn$ estimator requires a different pattern to break:

>	Y := copy(X):

>	Y[1..499] := Vector(499, i -> i * 10^97):

>	RousseeuwCrouxQn(Y);

$5.64926532313226$

(18)

>	Y[500] := 500 * 10^97:

>	RousseeuwCrouxQn(Y);

$1.00000000000000 \times 10^{97}$

(19)

We will show how all of these statistics deviate from their true value for beta-distributed data samples at sample sizes from 10 to 10000 and with fractions between $0$ and $\frac{1}{2}$ of the data replaced by the value $5$ . In particular, given the sample size and the fraction $r$ , we replace the highest $100 r$ percent of the data by $5$ , then divide value obtained for the changed sample by the true value for the distribution, thus obtaining a number that should be $1$ for an ideal statistic. We then repeat this $100$ times, and take the average squared difference from $1$ . This is the number shown in the plot below for each of the five measures of dispersion discussed above.

>	functions := [StandardDeviation, InterquartileRange, MedianDeviation, RousseeuwCrouxSn, RousseeuwCrouxQn]:

>	nf := numelems(functions):

>	X := Sample(BetaDistribution(0.9, 1.7), 10^6):

>	true_values := map(f -> f(X), functions);

$true_values ≔ [0.250714910329766, 0.398535818906638, 0.191972959574996, 0.229873030382496, 0.107739262732785]$

(20)

>	sample_sizes := [10, 30, 100, 300, 1000, 3000, 10000]:

>	nss := numelems(sample_sizes):

>	results := Array(1 .. nf, 1 .. nss, 0 .. 10, 1 .. 100);

(21)

for k to 100 do
    X := Sample(BetaDistribution(0.9, 1.7), max(sample_sizes));
    for i to nss do
        Y := X[1 .. sample_sizes[i]];
        sort[inplace](Y, `>`):
        for j from 0 to 10 do
            Y[1 .. ceil(j * sample_sizes[i] / 20)] := 5;
            for f to nf do
                results[f, i, j, k] := functions[f](Y) / true_values[f];
            end do;
        end do;
    end do:
end do:

>	rr := Array(1 .. nf, 1 .. nss, 0 .. 10):

>	for i to nss do for j from 0 to 10 do for f to nf do rr[f,i,j] := sqrt(Moment(results[f, i, j], 2, origin = 1)); end do: end do: end do:

plots:-display(plots:-surfdata~([seq(convert(rr[i], Matrix), i=1 .. nf)], 1 .. nss, 0 .. 0.5,
                                color =~ [red, green, blue, yellow, purple], transparency = 0.2),
               axis[1]=[tickmarks=[seq(i = sample_sizes[i], i = 1 .. nss)]], axis[3]=[mode=log],
               view=[DEFAULT,DEFAULT, min(rr) .. 10], orientation=[116, -68, 177],
               labels=[`Sample sizes`, r, `Standard deviation`],
               labeldirections=[horizontal, horizontal, vertical]);

The colors are red for the standard deviation, green for the interquartile range, blue for the median absolute deviation from the median, yellow for Rousseeuw and Croux' $Sn$ , and purple for $Qn$ . Lower numbers are shown higher in the graph, and are better. We see that in the case where there is no noise ( $r = 0$ ), the standard deviation has the lowest distortion. However, as soon as there is any distortion, it is immediately too inaccurate to be useful for any purpose. For $r < 0.25$ , the interquartile range (green) does reasonably well, but greater values of $r$ make it, too, unusable. For larger values, the median absolute deviation from the median (blue), $Sn$ (yellow), and $Qn$ (purple) all do reasonably well.

Another interesting experiment is to see how these measures of dispersion distinguish two Cauchy distributions with different scale parameters. We can see that the values in $X2$ (plotted in green, below) are just a little further spread out than those in $X1$ (plotted in red). Indeed, one could obtain a sample of the distribution underlying $X2$ by multiplying a sample from the distribution underlying $X1$ by $1.1$ . It would be nice if measures of dispersion reflect this fact. However, the Cauchy distribution naturally has many outliers, and indeed the standard deviation of the distribution is undefined.

>	X1 := Sample(Cauchy(0, 1.0), 10^5):

>	X2 := Sample(Cauchy(0, 1.1), 10^5):

>	plots:-display(KernelDensityPlot~([X1, X2], left=-12, right=12, color =~ [red, green]));

>	for i to nf do f1 := functions[i](X1); f2 := functions[i](X2); print(convert(functions[i], 'string'), f1, f2, f2/f1); end do:

$StandardDeviation, 450.792317322618, 369.344598687811, 0.819323188295339$

$InterquartileRange, 1.97963706221884, 2.20170072370362, 1.11217392608112$

$MedianDeviation, 0.989686310280030, 1.10085271595101, 1.11232488973150$

$RousseeuwCrouxSn, 1.40300629472880, 1.55680729448674, 1.10962245881275$

$RousseeuwCrouxQn, 0.822432294182147, 0.912138503220176, 1.10907427842098$

(22)

We see that all measures of dispersion with a breakpoint greater than $0$ , that is, all of them except for the standard deviation, reproduce this ratio of $1.1$ fairly closely.

Robust measures of central tendency

A measure of central tendency is a statistic that identifies a central value in a sample or distribution. Well-known examples are the Mean, the Median, and the Mode. Another measure of central tendency was invented by Hodges and Lehmann (see [2]) and independently by Sen (see [3]); it is often called the Hodges-Lehmann estimator.

We can study the breakdown point of these quantities as we did with the measures of dispersion. For the mean, the breakdown point is $0$ .

>	X := Sample(Normal(0, 1), 1000);

(23)

Mean(X);

$−0.0139943685387067$

(24)

>	Y := copy(X);

(25)

>	Y[1] := 10^100:

Mean(Y);

$1.00000000000000 \times 10^{97}$

(26)

The mode is a little tricky to handle for a continuous probability distribution given by a sample. The median is clearer; its breakdown point is $\frac{1}{2}$ .

>	Median(X);

$−0.0168102021680020$

(27)

>	Y[1..499] := 10^100:

>	Median(Y);

$2.99849087645744$

(28)

>	Y[500] := 10^100:

>	Median(Y);

$5.00000000000000 \times 10^{99}$

(29)

The Hodges-Lehmann estimator has a breakdown point of $1 - \frac{\sqrt{2}}{2}$ or about $0.29$ .

>	HodgesLehmann(X);

$−0.0174989845223721$

(30)

>	Y := copy(X):

>	Y[1..292] := 10^100:

>	HodgesLehmann(Y);

$2.01242526431790$

(31)

>	Y[293] := 10^100:

>	HodgesLehmann(Y);

$5.00000000000000 \times 10^{99}$

(32)

The advantage of the Hodges-Lehmann estimator is that it converges to its limit value more quickly than the median does (at least for distributions that are symmetric about the median); that is, for relatively small sample sizes, the Hodges-Lehmann estimator has greater accuracy. We proceed as in the previous section.

>	functions := [Mean, Median, HodgesLehmann];

$functions ≔ [Mean, Median, HodgesLehmann]$

(33)

>	nf := numelems(functions):

>	X := Sample(BetaDistribution(0.9, 1.7), 10^6):

>	true_values := map(f -> f(X), functions);

$true_values ≔ [0.346193421872074, 0.302664238430270, 0.334408091829171]$

(34)

>	sample_sizes := [10, 30, 100, 300, 1000, 3000, 10000]:

>	nss := numelems(sample_sizes):

>	results := Array(1 .. nf, 1 .. nss, 0 .. 10, 1 .. 100);

(35)

>	rr := Array(1 .. nf, 1 .. nss, 0 .. 10):

>	for i to nss do for j from 0 to 10 do for f to nf do rr[f,i,j] := sqrt(Moment(results[f, i, j], 2, origin = 1)); end do: end do: end do:

plots:-display(plots:-surfdata~([seq(convert(rr[i], Matrix), i=1 .. nf)], 1 .. nss, 0 .. 0.5,
                                color =~ [red, green, blue], transparency = 0.2),
               axis[1]=[tickmarks=[seq(i = sample_sizes[i], i = 1 .. nss)]], axis[3]=[mode=log],
               view=[DEFAULT,DEFAULT, min(rr) .. 10], orientation=[116, -68, 177],
               labels=[`Sample sizes`, r, `Standard deviation`],
               labeldirections=[horizontal, horizontal, vertical]);

We see that the mean (in red) performs best when $r = 0$ , but miserably otherwise. The Hodges-Lehmann estimator behaves very well for $r < 0.29$ . Beyond that only the median does well.

We can also reproduce the experiment with the Cauchy distribution. We now vary the location parameter between the two samples; the values in $X2$ (plotted in green, below) are just a little further to the right, that is, greater, than those in $X1$ (plotted in red). In this case, one could obtain a sample of the distribution underlying $X2$ by adding $0.1$ to a sample from the distribution underlying $X1$ . It would be nice if measures of central tendency reflect this fact. However, the Cauchy distribution does not have a mean.

>	X1 := Sample(Cauchy(0.0, 1), 10^5):

>	X2 := Sample(Cauchy(0.1, 1), 10^5):

>	plots:-display(KernelDensityPlot~([X1, X2], left=-12, right=12, color =~ [red, green]));

>	for i to nf do f1 := functions[i](X1); f2 := functions[i](X2); print(convert(functions[i], 'string'), f1, f2, f2-f1); end do:

$Mean, 3.77709097557069, −0.714889535849446, −4.49198051142014$

$Median, 0.00112967419851748, 0.107246525011298, 0.106116850812780$

$HodgesLehmann, 0.000686323260547918, 0.101012603763531, 0.100326280502983$

(36)

Again, we see that the two measures of central tendency with breakpoint greater than $0$ (that is, the median and the Hodges-Lehmann estimator) reproduce this difference of $0.1$ correctly, whereas the mean (with breakpoint $0$ ) does not.

	See Also
	HodgesLehmann, InterquartileRange, Mean, Median, MedianDeviation, Mode, RousseeuwCrouxQn, RousseeuwCrouxSn, StandardDeviation, Statistics

References

[1] Rousseeuw, Peter J., and Croux, Christophe. Alternatives to the Median Absolute Deviation. Journal of the American Statistical Association 88(424), 1993, pp.1273-1283.

[2] Hodges, Joseph L., and Lehmann, Erich L. Estimation of location based on ranks. Annals of Mathematical Statistics 34(2), 1963, pp.598–611.

[3] Sen, Pranab K. On the estimation of relative potency in dilution(-direct) assays by distribution-free methods. Biometrics 19(4), 1963, pp.532–552.

Download Help Document

Maple

Maple 附加模块

Math Success Platform

提高存留率

Maple Flow

MapleSim

项目服务

Maple T.A. and Möbius

教育

行业

汽车与航空航天

机器人

机器设计和工业自动化

其他

应用领域

产品价格

购买

院系/全校正版授权

Maplesoft精英维护升级计划

支持

产品培训

产品在线帮助

研讨会与活动

出版

资源中心

示例和应用

社区

关于 Maplesoft

媒体中心

用户社区

联系我们

Online Help

All Products Maple MapleSim

Maple

数学软件

Maple 附加模块

Math Success Platform

提高存留率

Maple Flow

Engineering calculations & documentation

MapleSim

多学科系统级建模仿真

项目服务

Maple T.A. and Möbius

教育

行业

汽车与航空航天

机器人

机器设计和工业自动化

其他

应用领域

产品价格

购买

院系/全校正版授权

Maplesoft精英维护升级计划

支持

产品培训

产品在线帮助

研讨会与活动

出版

资源中心

示例和应用

社区

关于 Maplesoft

媒体中心

用户社区

联系我们

Online Help

All Products Maple MapleSim