NovelNewsNabber
=== Using Artificial-Immunology to Recognize Novelty in News Headlines ===
Jeremy Avnet and Aaron Clauset %%% Topics in Complex Adaptive Systems, Stephanie Forrest %%% University of New Mexico (Spring 2003)
The Novel News Nabber is meant to discover novel news headlines that appear on a given blog. There are a number of topics we are casually interested in but don't have time to read about on a regular basis. With limited time and only a casual interest, we are mostly interested in news items that are quite novel and that separate themselves from the normal stream of news that generally appears on the blog. The Novel News Nabber develops a sense of self for a given news stream and alerts its user to deviations from this norm before subsequently redefining its sense of self.
A finished paper on our work can be found at ?????.
The streams we worked with are documented on NovelNewsNabberDataStreams.
Below, remains of discussion sessions, outlines, and goals documented during the course of the project can be found.
==== References ====
- Kleinberg did some algorithmic work on finding word bursts in temporal text streams. See the blurb on NiftyNews.
==== Notes and Thoughts from Various Discussions ====
- Weight the detector matches by frequency. This can be accomplished by looking at the location of the detector in the cache. Alternatively, this data can be stored in the hash (+1 for each hit).
- Look Ahead seems flexible enough to adapt to random permutations of headlines, while Ngrams is not. We observed that the convergence-to-self looks very similar whether the headlines were un-permuted or permuted for Look Ahead.
- Perhaps a two-part training scheme would be useful. The first part would sample the distribution of ngrams in the data stream to determine how large to make the cache (e.g. 1/e * (# of unique detectors) or ln(# of unique detectors)). The second part would be devoted to filling the cache and characterizing a sense of 'self'.
- The frequency of detectors in a data stream (which depends on 'n') seems like a really important quantity to sample - it seems reasonable that the cache size should be set such that it holds the set of detectors which are most frequently used. We empirically observed that the frequency distribution of ngrams follows an exponential.
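As a rough sketch of the two ideas above - a hash-backed cache that stores a hit count per detector, plus the cache-sizing heuristics from the two-part training scheme. All names here (`DetectorCache`, `suggest_cache_size`) are our own for illustration, not the project's actual code:

```python
import math
from collections import Counter

class DetectorCache:
    """Detector cache storing a hit count per detector ('+1 for each
    hit'), so matches can be weighted by frequency. When the cache is
    full, the least frequently hit detector is evicted."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.hits = Counter()

    def match(self, detector):
        """Return True if the detector was already cached; record the
        hit either way."""
        seen = detector in self.hits
        if not seen and len(self.hits) >= self.max_size:
            coldest = min(self.hits, key=self.hits.get)  # evict coldest
            del self.hits[coldest]
        self.hits[detector] += 1
        return seen

def suggest_cache_size(sample_detectors, rule="1/e"):
    """Phase 1 of the two-part training scheme: derive a cache size
    from the number of unique detectors in a sample of the stream,
    using either the 1/e or the ln heuristic mentioned above."""
    unique = len(set(sample_detectors))
    if rule == "1/e":
        return max(1, round(unique / math.e))
    return max(1, round(math.log(unique)))
```

The eviction policy (drop the least-hit detector) is one plausible reading of "weight the detector matches by frequency"; the real system might instead use cache position.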
==== Goals ====
- We'd like to be able to say that a news source like the New York Times has a larger repertoire (i.e. is more novel on a more regular basis; requires a larger cache size, or whatever) than FoxNews. This requires comparing the performance of the system on several different news feeds and drawing some conclusions about the characteristic language used in headlines.
- We'd like to be able to detect how frequently, and to what degree, a news feed changes the type of news which it is reporting (as measured by the type of language in its headlines). Particularly with a news source like the NYT during the GW2, there was a high degree of correlation between headlines.
- Conjecture: when a news story breaks, there tends to be a cluster of articles with similar (linguistic) headlines. To detect when these first occur is to detect novelty; to detect and ignore them at a later time is to ignore the lack of novelty.
==== Characterizing Normal and Abnormal ====
- Because we're interested in characterizing the normal behavior of a stream of news headlines, 'abnormal' is not something we can define quantitatively in a robust or easy manner. Rather, novelty can be calculated using statistical measures, but ultimately remains to be set by the user/human. We've created a parameterized definition of novelty which we feel provides some degree of control over how sensitive the system is to novel (as defined by our statistical measure) headlines. If our system is truly able to judge novelty against the previous corpus of headlines it has seen, then the parameterization is valid and useful.
==== Why Use Character-based Detectors ====
- We agreed to stay with characters since character strings (as opposed to strings of words) can cross word boundaries and capture some information that word strings would not. At larger values of 'n', ngrams become very similar to word-grams.
- We can also filter the input data stream for certain elements to exclude, and thus perhaps reduce some of the redundancy in the language that may obscure semantic meaning. For instance, we could filter out of the stream all grammar cruft like articles (a, an, the), leaving only important things like noun forms and verb forms. We already remove whitespace, but leave punctuation. Ideally, we want to focus on the semantic content of a headline, rather than the stylistic syntax.
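A minimal sketch of this detector extraction, assuming the filtering described above (the function name and the tiny stopword list are our illustration, not the project's actual code):

```python
def char_ngrams(headline, n, stopwords=frozenset({"a", "an", "the"})):
    """Generate character n-gram detectors from a headline.
    Filtering: drop article stopwords and strip whitespace (but keep
    punctuation), so that n-grams can cross word boundaries."""
    kept = [w for w in headline.lower().split() if w not in stopwords]
    text = "".join(kept)  # whitespace removed, punctuation retained
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```

Because the words are concatenated before slicing, a 3-gram like `"rbe"` from "The war begins" spans the boundary between "war" and "begins" - the information word-grams would miss.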
==== Definition of Novelty ====
- Novelty is related to how much overlap a new stimulus (headline) has with the current definition of 'self' in the space of detectors. For instance, in the immune system, a novel pathogen whose sphere of stimulation overlaps substantially with the space covered by the detectors will not be recognized as novel (in the human immune system, recognition generates a response; in our case, novelty is good, so lack of novelty stimulates no response). We want stimuli which are distinct from our current definition of 'self' to generate a response (novelty detection), so we define the following types of novelty according to how strict we are about matching 'self':
- loose: if a headline produces a single novel detector, then the entire headline is novel
- proportional: if a headline is composed of some fraction X or more of novel detectors, then it is novel
- strict: only if a headline produces all novel detectors is it novel
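The three matching rules could be sketched as follows (the function name, and X defaulting to 0.5, are illustrative assumptions):

```python
def headline_is_novel(detectors, self_set, mode="proportional", x=0.5):
    """Classify a headline as novel under the three matching rules.
    `detectors` are the headline's detectors; `self_set` is the
    current definition of 'self'."""
    novel = [d for d in detectors if d not in self_set]
    if mode == "loose":    # a single novel detector suffices
        return len(novel) >= 1
    if mode == "strict":   # every detector must be novel
        return len(novel) == len(detectors)
    # proportional: at least fraction x of the detectors are novel
    return len(novel) / len(detectors) >= x
```

Note that loose and strict are just the proportional rule with X driven to its extremes, which is why a single sensitivity parameter gives the user the control described above.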
==== Set of Parameters ====
- cache size
- detector hit threshold (novelty sensitivity)
- filter style/level (not used)
- size of detector (n)
- flag: ngram or look-ahead
- training file/url
- testing file/url
==== Set of Experiments ====
- How quickly does the definition of 'self' converge during the training phase?
  - measure the number of new cache entries as a function of time (headline index)
  - finite cache size
- How large a cache is needed to fully characterize 'self'? How does the detector space grow as a function of time?
  - measure the cumulative number of unique detectors as a function of time (headline index)
  - unbounded cache size
  - conjecture: the number of detectors needed to fully characterize a data stream grows in a particular manner over time and approaches an asymptotic region where the frequency of needing new detectors to recognize a headline becomes very low. A cache size set to that of the asymptotic region will have a very broad sense of self, while one set to the high-slope region will have a more narrow sense of self.
- How novel is a new headline? How can we detect novelty in an online fashion?
  - define novelty as a change in 'self': dS/dt = (# new cache entries) / (# of unique detectors)
  - measure dS/dt on a per-headline basis
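Both measurements above can be taken in one pass with an unbounded cache; a sketch (the function name is ours):

```python
def novelty_curves(headlines, make_detectors):
    """For each headline, record (a) the cumulative number of unique
    detectors seen so far (unbounded cache), and (b) the per-headline
    novelty dS/dt = (# new cache entries) / (# of unique detectors)."""
    cache = set()
    growth, rates = [], []
    for h in headlines:
        dets = set(make_detectors(h))
        new = dets - cache        # detectors this headline adds to 'self'
        cache |= dets
        growth.append(len(cache))
        rates.append(len(new) / len(cache))
    return growth, rates
```

The `growth` curve is the one whose asymptotic region should suggest a cache size; `rates` is the dS/dt series measured per headline.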
==== Set of Graphs ====
- all data, convergence in training for both a) ngrams and b) look ahead
- all data, coverage in detector space for both a) ngrams and b) look ahead
- test data, novelty rates in test for both a) ngrams and b) look ahead
- ranked plot of detector frequency in the data stream
- various values of 'n' as well (makes for lots of graphs)
==== Data Set Statistics ====
{|
|-
| 'blog' || '# headlines' || '# chars' || 'type'
|-
| accordian guy || 44 || 34 || train
|-
| arstechnica || 423 || 58 || test/train
|-
| boing boing (4) || 1440 || 50-60 || test/train
|-
| daedalus || 22 || 31 || train
|-
| defensetech || 297 || 50 || train
|-
| jesus || 56 || 48 || train
|-
| joho || 493 || 52 || train
|-
| nickyee || 108 || 35 || train
|-
| pushby || 614 || 54 || train
|-
| sgtstryker || 1233 || 55 || test/train
|-
| slashdot || 270 || 50 || test/train
|-
| randoms(control) || - || - || -
|}
update from Tuesday 4.22
==== Reflecting on Recent Results ====
- n-grams in this problem space don't really work. We've shown that conclusively by showing that n-grams just characterize the English language, rather than the normal behavior of a stream. This indicates that we need to capture more relevant information in order to distinguish headlines at the semantic level, rather than the letter-frequency level.
==== Using Word-Grams ====
- Using n=1 word-grams seems to be better: it characterizes the frequency of word usage in a data stream and seems to capture more of the semantic content. Filtering out cruft words doesn't seem to do much aside from shifting the overall novelty down a bit (because we're effectively shrinking the cache by devoting small pieces of it exclusively to tracking these common words). Filtering out case and punctuation doesn't seem to matter much - thus, we should do it, because it will shrink the detector space and increase speed without impacting performance.
- Only looking at word frequencies (n=1 ngrams) does pretty well, but to truly capture semantic content, we need to look at correlations between word occurrences. Using all possible pairs of headline words (like n=2 look-ahead) for a supplementary set of detectors seems like a reasonable way to try to capture more of the semantic content of the sentences. This set is explicitly used to penalize novelty (maybe).
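The supplementary pair-detector set could be generated like this (a sketch; the function name is ours, and we assume unordered pairs over the distinct words of a headline):

```python
from itertools import combinations

def word_pair_detectors(headline):
    """Supplementary detector set: all unordered pairs of distinct
    words in a headline (the 'n=2 look-ahead' idea), intended to
    capture correlations between word occurrences."""
    words = set(headline.lower().split())
    # sort each pair so ("troops", "enter") and ("enter", "troops")
    # map to the same detector string
    return [" ".join(sorted(p)) for p in combinations(words, 2)]
```

A headline of k distinct words yields k*(k-1)/2 pair detectors, so for typical headline lengths this set stays manageable.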
==== Thinking about the Novelty Function ====
- Do we want a really sharp threshold for novelty? The new weighted novelty metric (just averaging) smoothes out the novelty landscape in a way that causes headlines that are only slightly novel to be flagged as entirely novel. This makes our hit-rate for novelty go up a lot, and is due to the fact that detectors which haven't been seen for a while are given some non-zero novelty rating. A sharp threshold might better distinguish truly novel things from slightly novel things. Similarly, a high exponent for the weighting function will tend to give a reasonable semi-novel status only to detectors that are extremely old. We could also employ a novelty that increases exponentially as a function of cache index - this might give the novelty function the appropriate 'sharpness'.
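One way the exponent sharpens the weighting can be seen in a toy version (our own formulation, assuming detector age is measured by rank in the cache):

```python
def detector_novelty(rank, cache_size, exponent=1.0):
    """Age-weighted novelty of a detector given its rank in the cache
    (0 = most recently hit, cache_size - 1 = oldest). Raising the
    normalized age to a high exponent keeps novelty near zero except
    for very old detectors, giving the metric a sharper threshold."""
    return (rank / (cache_size - 1)) ** exponent
```

With exponent 1 the landscape is the smooth averaging described above; with a high exponent, only detectors near the cold end of the cache contribute meaningful novelty.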
==== To Do List ====
- weighted novelty metric (done)
- play with the exponent for the novelty metric (a high exponent seems promising)
- look at what's actually in the cache (what words are really being picked up?)
- add refresh-counter to detector set to track how many times a detector has been seen
- track the index of a 'hit' in the cache. That way we can get an idea of whether we're hitting all over the place in the cache (suggesting the cache is too small) or concentrating most (yy% or more) of the hits in the top xx% of the cache.
- truncate the data into short training sequences of ~100 headlines
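The hit-index tracking item above amounts to a simple concentration measure; a sketch (function name and the 10% default are our illustration):

```python
def hit_concentration(hit_indices, cache_size, top_frac=0.1):
    """Fraction of cache hits that land in the top `top_frac` of the
    cache. A high value means hits concentrate among the hottest
    detectors; a low value (hits all over the place) suggests the
    cache is too small."""
    cutoff = cache_size * top_frac
    in_top = sum(1 for i in hit_indices if i < cutoff)
    return in_top / len(hit_indices)
```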
==== Data to Take ====
- To evaluate novelty detection: if we conjecture that two data streams are distinct, then the performance on an online file which concatenates several distinct data streams will show whether the algorithm can correctly pick out the transitions between data streams.
- We could construct synthetic streams with fairly homogeneous topics, but with truly distinct headlines peppered in (i.e. grep all the Iraq headlines into a single file, but intersperse them with a few headlines on something esoteric, like Linux distributions).
- How long does it take to find a sense of self on a homogeneous data stream (all headlines on Iraq)? (It is hard to really understand what's going on with a noisy real-world data stream.)
- We want a data stream with very specific content (a sewing/knitting blog) that we have only a general interest in, but would like to know when something new develops (novelty). We would like to be able to set NNN onto that stream and be alerted when something new comes up.
=== Experiments ===
==== Finding a sense of self in a homogeneous data stream ====
- training: use either of the _hom.txt files
- data to track: novelty metric during training
- parameters: try a variety of cache sizes (50,100,200,400,750,1000)
==== Distinguishing really novel stuff; training on a homogeneous stream (50 headlines) with distinctly different headlines interspersed (knitting headlines after the 50th headline, at every 5th location from 50 to EOF) ====
- training file: sgtstryker_knit.txt
- no test file - just look at performance over training
- data to track: novelty metric, num of unique ngrams needed
- parameters: smaller cache, perhaps around 200-300
==== Finding novelty in a data stream (knitting) ====
- training file: knitting.txt (~90 headlines)
- data to track: novelty metric
- no real idea what this might do; we just have to see what the algorithm picks out as being novel
==== Distinguishing real data streams (which we conjecture to be distinct) ====
- training file: whatever, try training on one of the training files
- testing file: use streamsthree_test.txt
- data to track: novelty metric in both training and test phases, num of unique ngrams needed
- parameters: not sure. maybe cache size of 300-400
Last Edit: Thu, 24 Apr 2003 01:46:52 -0700 Revisions: 51