Okay, so I’m still trying to figure out how to work blogging into my schedule. I have bits and pieces of time here and there, but I often find myself reading other blogs instead of writing my own during these times. This is, I think, largely due to when these bits and pieces of time occur. For example, I usually read a bit prior to buckling down in the mornings, but I also like to use this time to eat a snack, which makes blogging difficult. I also tend to read on-line materials in the evening while the kids are getting ready for bed, at which time I often find it difficult to muster the energy to blog.

I’m also trying to figure out what, exactly, to blog about. Josh and I have discussed co-blogging about language, and I like the idea, but given how irregularly I blog on my own, I’m hesitant to commit to upkeep on a co-blog. Then again, perhaps committing to such an enterprise would be just the kick in the pants I need to blog as regularly as I (tell myself I) want to.

I think that blogging about politics is not really my game. Although there is a regular supply of political material to respond to, it’s hard to justify spending the time it would take to write about issues as thoughtfully as I feel they deserve. Josh does a much better job with this kind of blogging than I do. He has a much vaster store of historical and political knowledge, which (along with the practice afforded by regular blogging) allows him to write thoroughly and thoughtfully about news items as they occur. Given my relatively limited ability to blog about politics, and given how busy I am writing papers for publication, getting ready for a conference presentation, and (as of today) taking a course in probability theory, I just can’t commit much time to this kind of blogging.

So, instead, I will blog about that which I am working on anyway. I’ve always intended my blog to be about research, mostly on mathematical models of perception and decision making, but also on a variety of issues that arise in conjunction with this. Hence, today’s post on factorials.

I employ multidimensional signal detection theory in the study of auditory perception. Briefly, this means I collect and analyze identification-confusion data in tasks in which each stimulus has one of two levels on each of two dimensions (e.g., purple or red, square or rectangle), and each combination of levels-on-dimensions (i.e., each stimulus specification) has a unique response. The general method extends to more levels and more dimensions, but for a variety of reasons, I stick with two-by-two (and lower) structures.

I like to analyze my data by fitting (and comparing) models. I take a given subject’s data and try to find the set of bivariate normal densities and decision bounds (more on this in another post) that most closely ‘predicts’ the observed counts of identifications and confusions. Each trial in one of these experiments consists of stimulus presentation and response execution. Because the response set is the same across trials, the data (i.e., the counts of the four responses) are distributed as multinomial random variables. Here’s where factorials come into the picture.

Pretty much any fit statistic involves a likelihood function. The multinomial likelihood function is proportional to the product of the parameters (the predicted response probabilities) raised to the appropriate powers (i.e., the counts of the responses). So, for presentations of a red-square stimulus, the data are the number of times each response was given, and the multinomial likelihood is the product of the predicted probability of each response raised to the number of times that response was actually made (I would like to have this written out mathematically, but I can’t figure out right now how to get the sub- and super-scripts working).
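For concreteness, here’s that calculation sketched in Python, with entirely made-up probabilities and counts (not from any actual experiment):

```python
import math

# Hypothetical model-predicted probabilities for the four responses to a
# red-square stimulus, and hypothetical observed counts of those responses.
p = [0.7, 0.1, 0.1, 0.1]
n = [140, 20, 25, 15]

# Unnormalized multinomial likelihood: each predicted probability raised to
# the number of times that response was actually made, all multiplied together.
likelihood = math.prod(pi ** ni for pi, ni in zip(p, n))

# The same quantity in log form, which is what gets used in practice.
log_likelihood = sum(ni * math.log(pi) for pi, ni in zip(p, n))
```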

In order to make it a properly normalized likelihood function, you have to multiply this product by a ratio of factorials: the factorial of the total number of responses divided by the product of the factorials of the individual response counts. Now, for a variety of reasons (again, more another time), I collect a *lot* of responses in these experiments. So many, in fact, that I can’t calculate the requisite factorials. If I were content to use regular, old-fashioned likelihood ratio model testing, this wouldn’t matter: two likelihoods for the same data set have the same normalizing constant, so it cancels from the ratio, and there is no need to calculate the factorials.
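To see the problem concretely, here’s a Python sketch of the normalizing constant (the multinomial coefficient), again with hypothetical counts. Python’s integers happen to be arbitrary-precision, but in fixed-precision floating point even 200! is hopeless, since it has 375 digits. One standard workaround (not necessarily the one I’ll end up using) is to work with log-factorials via `lgamma`:

```python
import math

counts = [140, 20, 25, 15]   # hypothetical response counts
N = sum(counts)              # 200 trials in total

# The normalizing constant: N! divided by the product of the individual
# count factorials. Exact here thanks to Python's big integers, but far
# too large to pass through ordinary floating point.
coef = math.factorial(N)
for c in counts:
    coef //= math.factorial(c)

# The log-space alternative, using lgamma(x + 1) = log(x!), so no
# intermediate value ever leaves a comfortable floating-point range.
log_coef = math.lgamma(N + 1) - sum(math.lgamma(c + 1) for c in counts)
```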

I’m not content to use regular, old-fashioned likelihood ratio tests, though. Instead, I use the assuredly fancy-pants fit statistic known as the Bayesian Information Criterion (BIC), defined as −2·log(*L*) + *k*·log(*N*), where log is the natural logarithm, *L* is the likelihood, *k* is the number of free parameters in the model, and *N* is the sample size. The basic idea behind the BIC is that it measures both fit (the first term) and model ‘complexity’ (the second term). The better your model fits, the lower the negative log likelihood, and so the lower the BIC; but the more parameters you need to get that fit (and the larger your sample size), the higher the BIC.
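As a sketch (in Python, with invented numbers), the BIC and its fit-versus-complexity trade-off look like this:

```python
import math

def bic(log_likelihood, k, n):
    # BIC = -2 * log(L) + k * log(N); lower values are better.
    return -2.0 * log_likelihood + k * math.log(n)

# Hypothetical comparison: a 5-parameter model fits a bit better than a
# 3-parameter one, but pays a larger complexity penalty for doing so.
bic_full = bic(-188.1, k=5, n=200)      # about 402.7
bic_reduced = bic(-190.4, k=3, n=200)   # about 396.7

# Here the penalty outweighs the improvement in fit, so the simpler
# model wins (i.e., has the lower BIC).
```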

The BIC makes use of a rather crude measure of complexity (hence the scare quotes), but it relates directly to some other handy tools. For example, minus half the difference between two BIC values approximates the log of the Bayes factor, which is a pleasantly intuitive (rare in statistics) measure of the relative goodness of fit of two models – the Bayes factor essentially tells you how much more belief-worthy one model is relative to another. Of course, you immediately encounter the same old issue of ‘how big (or small) is big (or small) enough?’ to warrant a strong conclusion, but that seems inescapable.
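In code, that relationship is a one-liner – a Python sketch, with the caveat that exp(−ΔBIC/2) is only an approximation to the Bayes factor, not the exact thing:

```python
import math

def approx_bayes_factor(bic_a, bic_b):
    # exp(-(BIC_a - BIC_b) / 2) approximates the Bayes factor in favor of
    # model A over model B; equivalently, -0.5 * (BIC_a - BIC_b) is an
    # approximate log Bayes factor.
    return math.exp(-0.5 * (bic_a - bic_b))

# With hypothetical BIC values of 396.7 and 402.7, the first model is
# favored by a factor of exp(3), i.e., roughly 20 to 1.
bf = approx_bayes_factor(396.7, 402.7)
```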

The point, finally, is that while the normalizing constant is the same within a given subject’s data (and all my analyses are at the individual subject level), it only multiplies the likelihood, leaving the complexity term alone. Thus, though it is perhaps unlikely, it is possible that leaving out the normalizing constant and its many factorials could lead me to the wrong conclusions. Here’s a contrived example: Suppose (unnormalized) log *L* = −2000 for one model and log *L* = −2400 for another, and the complexity terms for the two models are 300 and 200, respectively. Plugging these into the BIC formula gives 4300 and 5000, leading to preference for the first model. Now suppose that the normalizing constant (for both) is 1/20. Including this in calculating the (proper) likelihood values leads to BIC values of 500 and 440, respectively, leading to preference for the second model. Oops.

So, the end result is that I will spend a good chunk of time today figuring out a way to calculate the ratio of factorials so that I can normalize my likelihoods appropriately. The basic idea (which shouldn’t take long to execute – less time, perhaps, than it has taken me to blog about it) is to create vectors with the elements that are to be multiplied in the factorials for the numerator and denominator and divide them element-wise prior to taking the product. Which is to say, I am going to violate the order of operations handed down by Moses as he descended from Mt. Sinai. I’ll also do this for the log likelihood, only with adding and subtracting. I’ll post again with updates (and I’ll get to the posts I promised last time at some point, too). I don’t guess this will change the results of my analyses, but I don’t know yet for sure.
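The vector trick described above might look something like the following in Python (hypothetical counts again). In a language without big integers, interleaving the divisions this way keeps the intermediate products modest even though the factorials themselves are astronomical:

```python
import math

def multinomial_coef(counts):
    # Numerator factors: 1, 2, ..., N (the factors of N!).
    # Denominator factors: 1..c for each count c; exactly N factors in
    # total, since the counts sum to N. Divide element-wise *before*
    # multiplying, in defiance of the usual order of operations.
    N = sum(counts)
    numer = range(1, N + 1)
    denom = [f for c in counts for f in range(1, c + 1)]
    return math.prod(a / b for a, b in zip(numer, denom))

def log_multinomial_coef(counts):
    # The same idea for the log likelihood: add the numerator's
    # log-factors and subtract the denominator's, term by term.
    N = sum(counts)
    numer = range(1, N + 1)
    denom = (f for c in counts for f in range(1, c + 1))
    return sum(math.log(a) - math.log(b) for a, b in zip(numer, denom))
```

Because the counts sum to the total number of responses, the numerator and denominator vectors have the same length, so the element-wise pairing works out exactly.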

## 2 Comments

This post isn’t terribly clear. Perhaps I’m not reading it closely enough – but faithfully subbing -2000 for L and 300 for the second term (i.e. k*log(N)), I get something closer to 900,000 than 500 – so I’m doing something wrong – or else you’ve left out a crucial detail in the explanation.

I get the basic idea, however – that the normalizing constant only applies to the first term and hence can in theory overrule the correction for the complexity of the model in some cases.

It sounds like an algorithms problem, actually, in that you’re worried primarily about whether one factor in your equation is “growing” faster than another as a function of something else (which in this case seems to be as a function of the number of responses you get). And I guess for the factorial factor what worries you most is whether the denominator of the normalization factor seems likely to swamp the complexity correction. It seems like there should be some generalizations about factorials written down somewhere (probably in algorithms books, actually, since everything in complexity theory seems to involve factorials) that would answer the question.

Assuming, of course, that I actually understood the post.

I actually wasn’t thinking about this correctly. I will change the end of this post fairly extensively in short order (i.e., within the next week or so).

## One Trackback/Pingback

[…] Source-Filter « Factorials! […]