I want to emphasize that much of the statistics you've learned is related to minimization, since minimization is the basis of much later statistical analysis.

Consider if we want to find the average Age of people in our NY PUMS data. First load it in.

# rm(list = ls(all = TRUE)) # clear workspace
# setwd("C:\\Users\\Kevin\\Documents\\CCNY\\data for classes\\R_lecture1")
load("pums_NY.RData")
Age <- dat_pums_NY$Age

We can think of finding a central value for Age: what single number would be closest to all 196,314 ages in the data? Evidently it takes a bit of thinking to decide how to measure "closest." Thinking about the whole 200k Age values is too much for my small brain, so first create a mini version with just the first 5 observations: 43, 45, 33, 57, 52.

What single number is close to these values? When you're evaluating "close" you're probably thinking of distance, so something like \(Age - guess\). But we might not care about the direction of the miss, so maybe something like \(|Age - guess|\) or \(\sqrt{(Age - guess)^2}\).

Suppose we label our guess with the Greek letter \(\theta\), theta.

Then we can write the distance as \(\sqrt{ (43 - \theta)^2 + (45 - \theta)^2 + (33 - \theta)^2 + (57 - \theta)^2 +(52 - \theta)^2}\)

This gets boring to write out each Age, so maybe use some notation, \[\sqrt{ (Age_1 - \theta)^2 + (Age_2 - \theta)^2 + (Age_3 - \theta)^2 + (Age_4 - \theta)^2 + (Age_5 - \theta)^2 }\]

or even more economically, \[\sqrt{\sum_{i=1}^n {(Age_i - \theta)^2}}\]

You can imagine trying out values of \(\theta\) to minimize that sum. And now that we have that notation we can use all 196,314 data points not just the first five.
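For instance, with the five-observation mini version you can compute that distance for a couple of guesses directly (a quick sketch; `dist_from` is just a name I made up here):

```r
mini_ages <- c(43, 45, 33, 57, 52)                 # the first 5 observations
dist_from <- function(theta) sqrt(sum((mini_ages - theta)^2))
dist_from(40)   # distance of the guess 40 from the five ages
dist_from(46)   # a guess nearer the center gives a smaller distance
```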

Next a couple of minor details: calculating the square root is unnecessary, since whatever \(\theta\) makes the sum of squares smallest will also make the square root of the sum of squares smallest. (Square root is a monotonic function, if you recall your calculus.)

And although I had noted above that you could use the absolute value or the square, I choose to use the square. Why? Tradition. Again from calculus, recall that if we're going to be minimizing then you'll have to take the derivative. And if you're going to take the derivative then the absolute value function takes a bit of effort, since the value of its derivative is undefined at zero. So historically statisticians, particularly those limited to the technology of chalk not silicon, used the square.

How can we have the computer find the closest value of \(\theta\)? Fortunately a lot of people have put in a lot of effort to figure out smart ways. But we can start with some dumb ways.

But first some details of R. We will create a function that takes our guess value and returns the sum of squared distances of every Age value from it. Start with a guess of 25.

theta_guess <- 25
f_tobeminimized <- function(theta_guess) sum((Age - theta_guess)^2)

This creates a new function within this R session, so you can type things like `f_tobeminimized(30)` at the console.

One option is to just try some numbers, say \(\theta\) equal to 25, 30, 35, …

| \(\theta\) | \(f(\theta)\)         |
|-----------:|-----------------------|
| 25         | `f_tobeminimized(25)` |
| 30         | `f_tobeminimized(30)` |
| 35         | `f_tobeminimized(35)` |
| 40         | `f_tobeminimized(40)` |
| 45         | `f_tobeminimized(45)` |
# identical to: theta_guess <- c(25, 30, 35, 40, 45)
theta_guess <- seq(25, 45, by = 5)
f_vals <- matrix(data = NA, nrow = length(theta_guess), ncol = 1)
for (indx in seq_along(theta_guess)) {
  f_vals[indx] <- f_tobeminimized(theta_guess[indx])
}
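Once `f_vals` is filled in, `which.min()` picks out the grid value with the smallest sum of squares. Restated here with the five-observation mini sample standing in for the full data, so the sketch runs on its own:

```r
Age <- c(43, 45, 33, 57, 52)   # stand-in for the loaded Age vector
f_tobeminimized <- function(theta_guess) sum((Age - theta_guess)^2)
theta_guess <- seq(25, 45, by = 5)
f_vals <- sapply(theta_guess, f_tobeminimized)
theta_guess[which.min(f_vals)]  # best grid point: 45, nearest to the mean of 46
```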

This is called a Grid Search. You could imagine getting successively finer and finer grids until you found the minimum value. Of course regular grid values are probably not the smartest choice, but you might be surprised how many minimization algorithms are just making a smart choice about the next step.
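That refinement idea can be sketched directly: after each pass, narrow the window to one grid step on either side of the best point. (A sketch only; the five-observation mini sample again stands in for the full Age vector.)

```r
Age <- c(43, 45, 33, 57, 52)                  # stand-in for the full data
f_sq <- function(theta) sum((Age - theta)^2)
lo <- 0; hi <- 100                            # initial search window
for (pass in 1:4) {
  grid <- seq(lo, hi, length.out = 21)        # 21 evenly spaced guesses
  best <- grid[which.min(sapply(grid, f_sq))]
  step <- (hi - lo) / 20                      # current grid spacing
  lo <- best - step                           # zoom in around the best guess
  hi <- best + step
}
best  # converges toward mean(Age)
```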

R has a variety of smart, fancier search procedures. One good general-purpose one is nlm(). In this particular case it's using a bazooka to kill a fly, but you might find it useful later.

nlm_guess <- nlm(f_tobeminimized, 5)
nlm_guess$estimate
## [1] 40.58241
mean(Age)
## [1] 40.58243

In this case, the value of \(\theta\) that minimizes the sum of squared deviations and the mean are identical (nearly so; you could make the minimization routine run harder, by tightening its tolerance, to bring them even closer), which was the whole point of this exercise.
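Why does the sum of squares lead to the mean? Setting the derivative of the objective to zero makes it explicit: \[\frac{d}{d\theta}\sum_{i=1}^n (Age_i - \theta)^2 = -2\sum_{i=1}^n (Age_i - \theta) = 0,\] which rearranges to \(n\theta = \sum_{i=1}^n Age_i\), so \(\theta = \frac{1}{n}\sum_{i=1}^n Age_i\), exactly the mean.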

Can you guess what value of \(\theta\) would minimize the sum of absolute-value deviations?

falt_tobeminimized <- function(theta_guess) sum(abs(Age - theta_guess))
nlm2_guess <- nlm(falt_tobeminimized, 5)
nlm2_guess$estimate
## [1] 41
median(Age)
## [1] 41

What would be minimized to get the mode?
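One possible answer, sketched with a small hypothetical sample (the data and names here are made up for illustration): under 0-1 loss, counting the observations the guess fails to hit exactly, the minimizer is the most frequent value, i.e. the mode.

```r
ages <- c(43, 45, 33, 57, 52, 45, 45, 33)          # hypothetical sample
f_miss <- function(theta) sum(ages != theta)        # 0-1 loss: count the misses
candidates <- sort(unique(ages))
candidates[which.min(sapply(candidates, f_miss))]   # 45, the most frequent value
```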

Different measures of closeness imply different estimates. A measure that squares the errors puts more weight on wide misses, compared to the absolute value. You could have misses enter asymmetrically (perhaps being too high is half as bad as being too low). As you do more stats, you'll keep seeing minimization of errors.
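A sketch of that asymmetric case on the five-observation mini sample, with hypothetical weights: an under-guess costs twice as much per year as an over-guess, so the minimizer shifts above the median.

```r
ages <- c(43, 45, 33, 57, 52)
f_asym <- function(theta) {
  sum(ifelse(theta < ages,
             2 * (ages - theta),    # guessed too low: full cost
             1 * (theta - ages)))   # guessed too high: half as bad
}
grid <- seq(30, 60, by = 0.5)
grid[which.min(sapply(grid, f_asym))]  # 52, above the median of 45
```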