NovelNewsNabberDataStreams

From brainsik

=== NovelNewsNabber: Archived Data and Blog RDF/RSS Feeds ===

Some descriptions about RDF and RSS:

Sites with an RDF/RSS feed often keep it at one of: index.rdf, index.rss, index.xml, rss.xml
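As a rough illustration of the filename convention above, a small helper can list the URLs worth probing for a feed. This is only a sketch: `candidate_feed_urls` and the example site URL are hypothetical names, and nothing is fetched.

```python
from urllib.parse import urljoin

# Filenames taken from the list above.
COMMON_FEED_NAMES = ["index.rdf", "index.rss", "index.xml", "rss.xml"]


def candidate_feed_urls(site_url):
    """Return the URLs where a site's RDF/RSS feed is likely to live."""
    # Ensure a trailing slash so urljoin appends the filename instead of
    # replacing the last path segment.
    base = site_url if site_url.endswith("/") else site_url + "/"
    return [urljoin(base, name) for name in COMMON_FEED_NAMES]


if __name__ == "__main__":
    # "http://example.com/blog" is a placeholder, not one of the project's sources.
    for url in candidate_feed_urls("http://example.com/blog"):
        print(url)
```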

Some data streams:

We parsed some HTML archives to get some training data:

=== Parse info (for sucking headlines outta HTML): ===

  • Accordion Guy ('h1')
  • ArsTechnica (style="text-decoration: none"><font color="#FFFFFF">xxx</a>)
  • Daedalus Project (.php">xxx</a>)
  • JOHO (

    xxx

    )
  • Journey With Jesus ('h4')
  • Knitting Musings (<h3>xxx)
  • Nick Yee (

    xxx

    )
  • SgtStryker (.php">xxx</a>)
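The markers above (where xxx stands for the headline text) could be turned into regular expressions roughly as follows. This is a sketch under assumptions, not the scripts actually used: the `HEADLINE_PATTERNS` table and `extract_headlines` are hypothetical names, only two of the sites are shown, and real archive pages may need more careful handling than a regex provides.

```python
import re
from html import unescape

# Hypothetical pattern table: site name -> regex whose single capture group
# is the headline text (the "xxx" in the list above).
HEADLINE_PATTERNS = {
    "Journey With Jesus": re.compile(r"<h4[^>]*>(.*?)</h4>", re.I | re.S),
    "Daedalus Project": re.compile(r'\.php">(.*?)</a>', re.I | re.S),
}


def extract_headlines(html, pattern):
    """Return the cleaned headline strings that `pattern` finds in `html`."""
    headlines = []
    for raw in pattern.findall(html):
        text = re.sub(r"<[^>]+>", "", raw)        # drop any nested tags
        text = unescape(" ".join(text.split()))   # collapse whitespace, decode entities
        if text:
            headlines.append(text)
    return headlines
```

Writing one extracted headline per line gives a plain-text file in the same shape as the training files described below.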

== Files Generated for Training and Testing ==

=== file_rand.txt ===

  • randomized versions of most files

=== homogeneous data streams (~100 headlines per) ===

  • sgtstryker_hom.txt
  • defensetech_hom.txt
  • both of these streams were filtered from the originals using the keywords (terror||osama||iraq||saddam||marine||war||rifle||bomb||air force||army)
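A minimal sketch of that keyword filter, assuming a case-insensitive substring match (the original matching rule is not recorded here). The function names are hypothetical. Note that plain substring matching means "war" also matches words such as "software", so the real filter may have been stricter.

```python
# Keywords copied from the list above.
KEYWORDS = [
    "terror", "osama", "iraq", "saddam", "marine",
    "war", "rifle", "bomb", "air force", "army",
]


def is_on_topic(headline, keywords=KEYWORDS):
    """True if any keyword occurs in the headline (case-insensitive substring)."""
    lowered = headline.lower()
    return any(keyword in lowered for keyword in keywords)


def filter_headlines(headlines, keywords=KEYWORDS):
    """Keep only the headlines that mention at least one keyword."""
    return [h for h in headlines if is_on_topic(h, keywords)]
```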

=== training files ===

  • joho_100.txt
  • slashdot_100.txt
  • boingboing_100.txt
  • these files are just a random selection of 100 headlines taken from the corresponding original files
  • joho_rand_100.txt
  • slashdot_rand_100.txt
  • boingboing_rand_100.txt
  • same deal, but from the randomized versions of the files
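Drawing the 100-headline selections could look like the sketch below. It assumes one headline per list entry and keeps the selected headlines in their original relative order; whether the real selection preserved order is not stated. `sample_headlines` and its `seed` parameter are hypothetical.

```python
import random


def sample_headlines(headlines, n=100, seed=None):
    """Pick n headlines at random, keeping their original relative order."""
    if len(headlines) <= n:
        return list(headlines)
    rng = random.Random(seed)  # a seed makes the selection repeatable
    chosen = sorted(rng.sample(range(len(headlines)), n))
    return [headlines[i] for i in chosen]
```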

=== knitting blog file ===

  • knitting.txt
  • knitting_20.txt

=== homogeneous stream + very novel blips ===

  • sgtstryker_knit.txt
  • took the sgtstryker_hom.txt stream and, starting at line 50, inserted a headline from knitting_20.txt at every fifth headline. There were 4 knitting_20.txt headlines left over after this process; they were appended to the end of the file.
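The interleaving step can be sketched as below. The wording "at every fifth headline" is ambiguous, so this is one reading only: a blip goes in before stream headline number `start`, and then before every `every`-th stream headline after that. `insert_blips` is a hypothetical name, and the sketch makes no claim to reproduce the exact leftover count.

```python
def insert_blips(stream, blips, start=50, every=5):
    """Interleave `blips` into `stream`, appending any unused blips at the end."""
    out = []
    pending = list(blips)
    for number, headline in enumerate(stream, start=1):
        # Insert a blip before headline `start`, `start + every`, and so on.
        if pending and number >= start and (number - start) % every == 0:
            out.append(pending.pop(0))
        out.append(headline)
    out.extend(pending)  # leftover blips go at the end of the file
    return out
```

With a 60-headline stream and four blips, for example, three blips land before headlines 50, 55, and 60, and the fourth is appended after the last headline.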

=== distinct streams ===

  • streamsthree_test.txt
  • streamstwo_test.txt
  • streamsfour_test.txt

In the order listed above:

  • the first file is the three training files (_100.txt) concatenated
  • the second file is the two homogeneous files (_hom.txt) concatenated
  • the third file is the three training files (_100.txt) concatenated, but with one of the randomized 100-headline files in third position
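Building these test streams amounts to list concatenation. In the sketch below, the helper names are hypothetical and the order of the source streams within each file is an assumption, since the notes do not record it.

```python
def concatenate(*streams):
    """Join several headline lists end to end."""
    combined = []
    for stream in streams:
        combined.extend(stream)
    return combined


def with_extra_stream(streams, extra, position=3):
    """Concatenate `streams` with `extra` placed at the 1-based `position`."""
    ordered = list(streams)
    ordered.insert(position - 1, extra)
    return concatenate(*ordered)
```

For the four-stream file, `streams` would hold the three 100-headline training lists and `extra` the randomized one.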


Last Edit: Thu, 24 Apr 2003 01:14:49 -0700
Revisions: 3