Workshop: Week 6

PRELIMINARIES:
==============

I have kept this worksheet shorter to allow students to spend time on
their projects.  The last two questions are optional (though IMHO fun).

===============================================================

Question 1: BIM with full knowledge
===================================

In Equation 9 of Lecture 10, we estimated the weight of term $t$ as

    w_t = \log \frac{p_t}{1 - p_t} + \log \frac{1 - u_t}{u_t}

We also said how to estimate values for $p_t$ and $u_t$ (respectively,
the proportion of relevant and irrelevant documents that contain
term $t$).  Assume we have the following values:

    $N$      Number of documents
    $R$      Number of relevant documents
    $f_t$    Number of documents that term $t$ appears in
    $r_t$    Number of relevant documents that term $t$ appears in

Derive a full formula for $w_t$.  What happens to your formula if $t$
appears in every relevant document?

NOTE: for this question, you can show your derivation in either of
these forms:

1. As LaTeX code (see above).  Please place it in a standalone file
   "q1.tex" that compiles with "pdflatex q1.tex".

2. Hand-written on paper (neatly please!), then photographed and
   compressed.  (If you can't get the photo under 100k, please just
   write the final formula in text, and email me the image.)

Question 2: Calculating f_{d,t} distributions
=============================================

Write a program that takes four arguments: a token data file, a lower
and an upper bound on collection frequency, and a maximum fdt value.

Then, for every term in the file that has a collection frequency (i.e.
total number of occurrences, not number of documents it occurs in)
between the lower and upper bounds (inclusive), calculate the number of
documents in which that term appears with each fdt value in
[0, maximum].  Keep a separate count for fdt values greater than the
maximum.

Then print out the fdt counts for the following settings:

    lyrl_tokens_30k.dat 1000 1040 15

that is, for the terms in the LYRL30k dataset that have a collection
frequency between 1000 and 1040.
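One way to picture the counting step is the minimal sketch below.  It
assumes each document is a dictionary mapping a term to its
within-document frequency; that layout, the toy documents, and the
function name fdt_histogram are all hypothetical and only illustrate
the bucketing, with one bucket per fdt value from 0 up to the maximum
and a final bucket for anything larger.  It does not perform the
collection-frequency filtering.

```python
def fdt_histogram(docs, term, max_fdt):
    """Count documents by the term's within-document frequency.

    Buckets 0..max_fdt hold exact frequencies; the last bucket
    holds every frequency greater than max_fdt.
    """
    counts = [0] * (max_fdt + 2)
    for bow in docs:
        fdt = bow.get(term, 0)            # absent term means fdt = 0
        counts[min(fdt, max_fdt + 1)] += 1
    return counts


# Toy collection (hypothetical data, not LYRL30k).
docs = [{"fair": 1, "trade": 2}, {"trade": 1}, {"fair": 5}, {}]
hist = fdt_histogram(docs, "fair", 3)
collection_freq = sum(bow.get("fair", 0) for bow in docs)
print("fair", collection_freq, hist)
```

With a maximum of 15, the list has 17 entries: one for each value from
0 to 15, plus the overflow bucket.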
An example line of output is:

    fair 1019 [30480, 626, 93, 31, 15, 6, 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]

In answering this question, you may find the file:

    http://www.williamwebber.com/comp90042/wksh/wk06/code/lyrl_terms.py

helpful.  It contains a function, lyrl_to_term_bow, which parses the
LYRL 30k data into a list of BOW dictionaries, one per document.

Question 3: Expected One-Poisson distribution
=============================================

Assume a term $t$ has a collection frequency of 1020, and that the
collection size is 31254 (the number of docs in LYRL30k).  What is the
parameter $\lambda$ in the one-Poisson model for that term?

The code:

    from scipy.stats import poisson
    poisson.pmf(0, lam)

gives the probability that a Poisson process with parameter $\lambda$
(held here in the variable lam, since "lambda" is a reserved word in
Python) will result in 0 observations in the unit interval of time.
Using this function, calculate the number of docs in which we expect
$t$ to have an $f_{d,t}$ of 0, 1, 2, 3, 4, ...

Which of the terms in Question 2 have a distribution that appears
roughly to follow the expected one-Poisson distribution?  (You may use
1020 as the $c_t$ for all these terms, rather than their actual $c_t$.)

Question 4: Formalization of Question 3 (OPTIONAL)
==================================================

A more formal way of making the visual judgment of fit is as follows.

The cumulative distribution function (CDF) of a random variable gives
the proportion of observations that are expected to be at or below a
given value.  So, for instance:

    >>> poisson.cdf(2, 0.5)
    0.9856

says that for a Poisson RV with $\lambda = 0.5$, 98.56% of the
observations are expected to be in the range {0, 1, 2}.  One minus the
CDF will give the proportion of observations expected to be above the
specified value.

Find the value $f_{d,t}$ such that we expect the term $t$ (as defined
in Question 3) to have less than a 1% chance of occurring with that
$f_{d,t}$ or higher within the collection (given the collection size).
What is that $f_{d,t}$?
If we observe that a term $t$ has that $f_{d,t}$, we can say that we
are 99% confident that $t$ deviates from the One-Poisson model.  For
which of the terms in Question 2 are we _not_ 99% confident that they
diverge from the One-Poisson model?

Question 5: Extension of Question 4 (VERY OPTIONAL)
===================================================

The survival function is (1 - CDF(x)); that is, it gives us the
proportion of observations we expect to be above ("survive beyond")
$x$.  The inverse survival function takes a proportion $p$ and (for a
discrete RV like the Poisson) gives the smallest value $x$ for which
1 - CDF(x) <= p, so that the chance of observing $x + 1$ or more is at
most $p$.  The inverse survival function for the Poisson distribution
is provided by

    poisson.isf(p, lam)

where lam is the value of $\lambda$.  Use this to examine _all_ the
terms in the LYRL30k collection.  For which of these terms are we
_not_ 99% confident that they violate the One-Poisson model?  Would
you describe these as "non-content" terms (given the nature of the
LYRL30k collection)?
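
As a small, hedged illustration of how scipy's survival and inverse
survival functions behave for a discrete distribution, the sketch below
reuses the $\lambda = 0.5$ example from Question 4.  The variable names
lam, p, tail_above_2 and x are illustrative only, and the values are
not those needed for the LYRL30k terms.

```python
from scipy.stats import poisson

lam = 0.5   # example rate from Question 4 (illustrative only)
p = 0.01    # tail proportion corresponding to 99% confidence

# sf(x) equals 1 - cdf(x), i.e. the probability of a value strictly above x.
tail_above_2 = poisson.sf(2, lam)

# isf(p) returns the smallest x whose survival function is at most p.
x = int(poisson.isf(p, lam))

print(round(tail_above_2, 4))
print(x, poisson.sf(x, lam) <= p, poisson.sf(x - 1, lam) > p)
```

Checking sf at both x and x - 1, as in the last line, is a quick way to
confirm which side of the threshold a returned value sits on.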