NovelNewsNabberDataStreams
====NovelNewsNabber : Archived Data and Blog RDF/RSS Feeds
==
Some descriptions about RDF and RSS:
Sites with an RDF/RSS feed often keep it at one of: index.rdf, index.rss, index.xml, rss.xml
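A minimal sketch of turning that list of conventional filenames into URLs to try, assuming only that the site's base URL is known. The name `candidate_feed_urls` is hypothetical, and the function only builds the URLs; it does not fetch them or check that a feed exists.

```python
from urllib.parse import urljoin

# Conventional feed filenames from the note above.
FEED_FILENAMES = ["index.rdf", "index.rss", "index.xml", "rss.xml"]


def candidate_feed_urls(base_url):
    """Return the URLs where a site's feed is conventionally kept."""
    # A trailing slash makes urljoin treat the last path segment as a directory.
    if not base_url.endswith("/"):
        base_url += "/"
    return [urljoin(base_url, name) for name in FEED_FILENAMES]
```

Each URL would then be requested in turn until one returns a parseable feed.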
Some data streams:
- Accordian Guy Blog (RSS 0.91)
- ArsTechnica (RSS 0.94)
- Boing Boing: A Directory of Wonderful Things (RSS 0.91)
- Daedalus Project - MMORPG Research Blog (RSS 0.91)
- Defense Tech Blog (RDF/RSS 1.0)
- JOHO Blog - culture, tech, world, stuff (RDF)
- Journey With Jesus (RDF)
- Knitting Musings (No feed?)
- Nick Yee's Blog (RSS 0.91)
- Pushby: Queue (RDF)
- SgtStryker - pro-war, political (RDF 1.0)
- Slashdot (RDF)
We parsed some HTML archives to get some training data:
- Accordian Guy Blog
- ArsTechnica
- Boing Boing: A Directory of Wonderful Things
- Daedalus Project
- Defense Tech Blog
- JOHO Blog
- Journey With Jesus
- Knitting Musings
- Nick Yee's Blog
- Pushby: Queue
- SgtStryker
- Slashdot
====Parse info (for sucking headlines outta HTML):
==
- Accordian Guy ('h1')
- ArsTechnica (style="text-decoration: none"><font color="#FFFFFF">xxx</a>)
- Daedalus Project (.php">xxx</a>)
- JOHO (xxx)
- Journey With Jesus ('h4')
- Knitting Musings (h3>xxx)
- Nick Yee (xxx)
- SgtStryker (.php">xxx</a>)
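A rough sketch of how patterns like these can be applied, covering only the tag-based entries ('h1', 'h4', h3) and the .php">xxx</a> link pattern. This is not the original parsing script; the names `TagTextCollector`, `extract_headlines` and `extract_php_link_headlines` are made up for illustration, and real archive pages would likely need per-site tweaks.

```python
import re
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text found inside every occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0        # > 0 while inside the target tag
        self.parts = []       # text fragments of the current headline
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Collapse runs of whitespace left over from the markup.
                text = " ".join("".join(self.parts).split())
                if text:
                    self.headlines.append(text)
                self.parts = []

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def extract_headlines(html, tag):
    """Return the text of every <tag>...</tag> element, e.g. tag='h1'."""
    parser = TagTextCollector(tag)
    parser.feed(html)
    parser.close()
    return parser.headlines


# Matches the text of links whose href ends in .php, as in .php">xxx</a>
PHP_LINK_RE = re.compile(r'\.php">(.*?)</a>', re.DOTALL | re.IGNORECASE)


def extract_php_link_headlines(html):
    """Return the link text of every anchor pointing at a .php page."""
    return [" ".join(m.split()) for m in PHP_LINK_RE.findall(html)]
```

For example, `extract_headlines(page, "h4")` would be the call for a Journey With Jesus archive page held in a string `page`.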
=== Files Generated for Training and Testing
=
====file_rand.txt
==
- randomized versions of most files
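The page does not say how the randomized versions were produced. One plausible reading is a plain shuffle of the headline lines, sketched here with a hypothetical helper `randomize_lines`; the optional `seed` is an addition so a run can be repeated.

```python
import random


def randomize_lines(lines, seed=None):
    """Return a shuffled copy of the headline lines; the input is left untouched."""
    rng = random.Random(seed)
    shuffled = list(lines)
    rng.shuffle(shuffled)
    return shuffled
```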
====homogeneous data streams (~100 headlines per)
==
- sgtstryker_hom.txt
- defensetech_hom.txt
- both of these streams were filtered from the originals using the keywords (terror||osama||iraq||saddam||marine||war||rifle||bomb||air force||army)
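A small sketch of that keyword filter. The keyword list is the one given above; the matching rule is an assumption (case-insensitive substring match, so "war" also hits "software" or "warning"), and `filter_headlines` is a hypothetical name.

```python
import re

KEYWORDS = ["terror", "osama", "iraq", "saddam", "marine",
            "war", "rifle", "bomb", "air force", "army"]

# Assumption: case-insensitive substring match on any one keyword.
KEYWORD_RE = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)


def filter_headlines(headlines):
    """Keep only the headlines that contain at least one keyword."""
    return [h for h in headlines if KEYWORD_RE.search(h)]
```

Matching on whole words instead (wrapping each keyword in `\b`) would drop false hits like "software", at the cost of also missing plurals such as "marines".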
====training files
==
- joho_100.txt
- slashdot_100.txt
- boingboing_100.txt
- these files are just a random selection of 100 headlines taken from the corresponding original files
- johorand100.txt
- slashdotrand100.txt
- boingboingrand100.txt
- same deal, but from the randomized versions of the files
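A sketch of drawing such a 100-headline subset. Whether the original selection kept the headlines in file order is not stated; this version assumes it did, and `sample_headlines` is a hypothetical name.

```python
import random


def sample_headlines(headlines, k=100, seed=None):
    """Pick k headlines at random, returned in their original order."""
    rng = random.Random(seed)
    k = min(k, len(headlines))
    # Sample positions rather than values so duplicate headlines are handled
    # and the chosen ones can be put back in file order.
    chosen = sorted(rng.sample(range(len(headlines)), k))
    return [headlines[i] for i in chosen]
```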
====knitting blog file
==
- knitting.txt
- knitting_20.txt
====homogeneous stream + very novel blips
==
- sgtstryker_knit.txt
- took the sgtstryker_hom.txt stream and, starting at line 50, inserted a headline from knitting_20.txt at every fifth headline. there were 4 knitting_20.txt headlines left over after this process; they were concatenated at the end of the file.
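The insertion step can be sketched as below. "Every fifth headline" is ambiguous, so this assumes a knitting headline goes in front of lines 50, 55, 60, ... of the homogeneous stream (1-indexed); counting by output line instead would give a slightly different file. `insert_blips` is a hypothetical name.

```python
def insert_blips(base, blips, start=50, step=5):
    """Insert one blip before base lines start, start+step, ... (1-indexed)."""
    out = []
    remaining = list(blips)
    for line_no, headline in enumerate(base, start=1):
        if remaining and line_no >= start and (line_no - start) % step == 0:
            out.append(remaining.pop(0))
        out.append(headline)
    # Blips that did not fit are concatenated at the end of the stream.
    out.extend(remaining)
    return out
```

With a short base stream not every blip fits, which is how leftovers end up appended at the end.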
====distinct streams
==
- streamsthree_test.txt
- streamstwo_test.txt
- streamsfour_test.txt
the first file is the three training files (_100.txt) concatenated; the second file is the two homogeneous files (_hom.txt) concatenated; the third file is the three training files (_100.txt) concatenated, but with a rand100.txt file in third position
Last Edit: Thu, 24 Apr 2003 01:14:49 -0700 Revisions: 3