Note that the sum over $\xi $ in the partition function runs over all possible $\xi $ , not just the $\xi $ that have been observed. This makes exact computation of the partition function intractable, and we must approximate it. Following a sampling-based learning technique, we conclude [link] :
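To see why the sum over all $\xi $ is the bottleneck, consider a toy pairwise model small enough to enumerate exactly. The node count, state count, and scoring function below are illustrative assumptions, not the paper's model; the point is only that the number of terms in $Z$ grows exponentially.

```python
import itertools
import math

def exact_log_Z(n, k, score):
    """Enumerate every joint configuration; feasible only for tiny n and k."""
    total = 0.0
    for xi in itertools.product(range(k), repeat=n):
        total += math.exp(score(xi))
    return math.log(total)

# Hypothetical score: reward adjacent nodes (in a chain) for agreeing.
agree = lambda xi: sum(1.0 for a, b in zip(xi, xi[1:]) if a == b)

print(exact_log_Z(4, 3, agree))  # 3**4 = 81 configurations: fine
# With hundreds of counties at 60 states each, 60**n terms: hopeless,
# hence the sampling-based approximation used in the text.
```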
Where ${\theta}^{0}$ is some set of parameters from which $T$ samples are drawn (20 in our case). Since $\text{ln}Z\left({\theta}^{0}\right)$ is a constant, we can leave it out of the objective function and solve the MLE problem via gradient ascent.
Where $s$ is some small step size. On each iteration we update ${\theta}^{0}$ to be ${\theta}^{t-2}$ , because the partition function approximation is only reasonable in a neighborhood of ${\theta}^{0}$ [link] . It follows that the $\xi $ 's indexed by $t$ are drawn from a model with parameters ${\theta}^{t-2}$ , while the $\xi $ 's indexed by $j$ still represent the historical data.
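A single ascent step under this scheme can be sketched as follows. The feature representation `f`, the function name `grad_step`, and the sufficient-statistics form of the gradient are assumptions consistent with the exponential-family setup above, not code from the paper; the sampled approximation $\ln Z(\theta) \approx \ln Z(\theta^{0}) + \ln\bigl(\tfrac{1}{T}\sum_t \exp\bigl((\theta-\theta^{0})\cdot f(\xi_t)\bigr)\bigr)$ yields a gradient that is the data feature mean minus a softmax-weighted sample feature mean.

```python
import numpy as np

def grad_step(theta, theta0, f_data, f_samples, s):
    """One gradient-ascent step on the sampled MLE objective.

    f_data:    (N, d) feature vectors of the N historical observations xi_j
    f_samples: (T, d) feature vectors of T samples xi_t drawn from theta0
    Gradient:  mean_j f(xi_j) - sum_t w_t f(xi_t),
    where w_t are softmax weights over (theta - theta0) . f(xi_t).
    """
    logits = f_samples @ (theta - theta0)
    w = np.exp(logits - logits.max())   # stabilized softmax weights
    w /= w.sum()
    return theta + s * (f_data.mean(axis=0) - w @ f_samples)
```

As in the text, each outer iteration would reset `theta0` to the current parameters and redraw the $T = 20$ samples before calling `grad_step` again, keeping the approximation inside its valid neighborhood.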
Due to the small number of historical observations (13) and the large number of possible combinations for any edge ( $60\text{ states}\times 60\text{ states}=3,600$ combinations), we need a more compact way to learn the relationships between counties. To that end, we look not at the absolute voting percentages of counties but rather at the difference in voting percentage between each pair of neighboring counties. This method has the added bonus of circumventing the problem of overall swings that affect every county. Unfortunately, there are still 119 possible differences ( $-59,-58,\dots ,0,\dots ,58,59$ ) and only 13 elections from which to estimate the frequency of each difference. Therefore, we place each difference into a cluster, e.g. $[-9,-6]$ . We use 11 clusters in total, and since the differences between counties are fairly consistent across years, the 13 observations should suffice to approximate the marginal probabilities for each edge. These approximation techniques do not affect the way we solve the problem via gradient ascent; however, once gradient ascent has finished, we must convert our compact $\theta $ into standard long form (as displayed in Section 2.1).
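The clustering step can be sketched as follows. The text gives one example cluster, $[-9,-6]$, but not the full set of boundaries, so the edges below are assumptions; only the shape of the computation (bin 13 paired differences into 11 clusters, then normalize counts into a marginal) follows the text.

```python
import numpy as np

# Hypothetical cluster boundaries: 10 inner edges partition the range
# of differences (-59..59) into 11 clusters. The real boundaries are
# not given in the text.
INNER_EDGES = np.array([-20, -13, -9, -5, -2, 2, 5, 9, 13, 20])

def edge_marginal(votes_a, votes_b):
    """Approximate one edge's cluster marginal from paired observations."""
    diffs = np.asarray(votes_a) - np.asarray(votes_b)   # each in -59..59
    idx = np.digitize(diffs, INNER_EDGES)               # cluster index 0..10
    counts = np.bincount(idx, minlength=11)
    return counts / counts.sum()
```

With only 13 elections per edge, many of the 11 clusters will have zero counts; the text's premise is that differences are stable across years, so the observed clusters dominate the true marginal.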
Our approximation techniques in the learning process create a complication when we attempt to predict the 2012 election. Since the entire model is built on relative differences, any outcome for a particular county is equally likely as long as the rest of the model shifts with it. To avoid implausibly low or high results, we must fix some subset of the counties as an anchor for the model. To do this, we utilize linear regression techniques (as discussed in the next section). Once the model is partially filled in, we solve the binary program stated above with our learned $\theta $ (in standard long form) via Gurobi Optimizer.
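The effect of anchoring can be illustrated on a toy graph. The paper solves a binary program with Gurobi Optimizer; the brute-force enumeration below is not that solver and only scales to a handful of nodes, but it shows how pinning a subset of counties removes the overall-shift ambiguity: the free counties settle relative to the anchors. The function name and potentials are hypothetical.

```python
import itertools

def map_with_anchors(nodes, edges, logpot, n_states, fixed):
    """Brute-force MAP assignment with some nodes pinned to known values.

    nodes: node names; edges: (u, v) pairs; logpot(u, v, xu, xv) scores
    an edge; fixed: dict of anchored nodes (e.g. regression predictions).
    """
    free = [v for v in nodes if v not in fixed]
    best, best_score = None, float("-inf")
    for combo in itertools.product(range(n_states), repeat=len(free)):
        x = dict(fixed, **dict(zip(free, combo)))
        score = sum(logpot(u, v, x[u], x[v]) for u, v in edges)
        if score > best_score:
            best, best_score = x, score
    return best
```

Without the `fixed` argument, every uniform shift of an agreement-favoring potential would score the same, which is exactly the degeneracy the anchoring step resolves.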
Multivariate linear regression is commonly used in the social sciences to predict future outcomes from known data. It provides us with both a point of comparison and a starting point for our Markov Random Field model. Our model will have Incumbent Party Vote % as the dependent variable; that is, if a Democratic president is currently in office, then we will be predicting the vote percentages earned by this year's Democratic candidate.
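A minimal version of this regression can be sketched with ordinary least squares. The specific predictors per county are discussed in the next section, so the columns of `X` here are placeholders; only the fit-with-intercept/predict structure is shown.

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares with an intercept: minimize ||[1 X] b - y||.

    X: (n_counties, n_predictors) design matrix
    y: (n_counties,) incumbent-party vote percentages
    """
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict(beta, X):
    """Predicted vote percentages for new rows of predictors."""
    return np.column_stack([np.ones(len(X)), X]) @ beta
```

The predictions for the anchored subset of counties would then be fixed in the Markov random field before solving the binary program.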