Hello Website 3 Ruby tutorial

Continuing with the Ruby tutorial. The goal is to download a news feed and extract the news titles into a text file.

Here is the link I’ll test first: http://www.lemonde.fr/rss/tag/international.xml

This needs just a slight modification to the code in the previous tutorial. The web address goes into a variable:

newsrsslemonde="http://www.lemonde.fr/rss/tag/international.xml"

require "open-uri"                         # Required library
f = URI.open(newsrsslemonde)  # opening the address in the newsrsslemonde variable (Ruby 3.0 and later need URI.open; plain open no longer fetches URLs)
newsrss = f.read                           # Read it as one big string
f.close                                    # Don't forget to close!
puts newsrss

Now we can reuse the other code block from the previous tutorial, slightly modified, to extract the contents of the title tag:

titleStartIndex = newsrss.index('<title>')             # position where the first <title> begins
titleStartIndex = titleStartIndex + '<title>'.length   # skip past the opening tag
titleEndIndex = newsrss.index('</title>')              # position where the first </title> begins
titleLength = titleEndIndex - titleStartIndex
newsrss[titleStartIndex, titleLength]
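
The index arithmetic is easier to follow on a short string. Here is a small sketch that uses a made-up feed fragment (the sample string and its titles are invented for illustration, they are not from the Le Monde feed) and the same steps as above:

```ruby
# Hypothetical feed fragment, invented for illustration only.
sample = "<channel><title>Sample channel</title><item><title>First story</title></item></channel>"

open_tag  = '<title>'
close_tag = '</title>'

# index returns the position where the first match begins,
# so adding the tag length moves us to the first character of the title.
start_index = sample.index(open_tag) + open_tag.length

# Look for the closing tag, starting the search at start_index.
end_index = sample.index(close_tag, start_index)

first_title = sample[start_index, end_index - start_index]
puts first_title
```

Only the first title comes out, because index stops at the first match it finds.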

This works here as well. But I want to collect more than just the first title tag. Let's add some regular expression magic to the mix:

titles=newsrss.scan(/<title>(.*?)<\/title>/)
titles.length

What happened here? Here are some more examples:
http://www.ruby-doc.org/core/classes/String.html#M000812

The scan method goes through the newsrss string, which contains the RSS feed XML, and collects all matches of the regular expression into an array we call ‘titles’. The regular expression /<title>(.*?)<\/title>/ looks for every substring that has an opening and a closing tag, <title></title>, and collects the contents matched inside the parentheses: (.*?).

Here is one reference on regular expressions:

http://www.ruby-doc.org/docs/ProgrammingRuby/html/language.html#UK

I quote: “

. (period)
Appearing outside brackets, matches any character except a newline.
*
Matches zero or more occurrences of re.
?
Matches zero or one occurrence of re. The *, +, and {m,n} modifiers are greedy by default. Append a question mark to make them minimal. ”

(.*?) means that we are looking for any character except a newline, occurring zero or more times. In other words we say: give me anything there is between the <title> and </title> tags. But why do we need the question mark? Without it the regex would be greedy and match as much as possible: everything up to the last closing tag on the same line, so several titles could be swallowed into a single match. Quoting again: “*, +, and {m,n} modifiers are greedy by default. Append a question mark to make them minimal.” This way we get only what comes before the first occurrence of the closing tag ‘</title>’.

One more thing. If you examine the closing tag in the regular expression, it reads <\/title> instead of </title>. We need to “escape” the forward slash / because it would otherwise mark the end of the regex pattern.

Finally, we need the parentheses because we only want the contents inside the title tags. We could also try /<title>.*?<\/title>/, but that would collect the titles with the tags as well.
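
The difference between the greedy and the minimal form shows up clearly on one line with two titles. This is only a sketch on a made-up string (the line variable and its contents are invented for illustration); the %r{...} form on the last line is an alternative regex literal in Ruby that avoids having to escape the slash:

```ruby
# Hypothetical one-line feed fragment, invented for illustration only.
line = "<title>One</title><title>Two</title>"

greedy    = line.scan(/<title>(.*)<\/title>/)    # .* runs on to the last </title> on the line
lazy      = line.scan(/<title>(.*?)<\/title>/)   # .*? stops at the first </title>
with_tags = line.scan(/<title>.*?<\/title>/)     # no parentheses: whole matches, tags included
no_escape = line.scan(%r{<title>(.*?)</title>})  # same pattern as lazy, written without \/

p greedy
p lazy
p with_tags
p no_escape
```

The greedy version returns a single match that contains both titles and the tags between them, while the minimal version returns the two titles separately.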

titles.length gives the length of the array that now contains the titles. Here is what I get:

irb(main):086:0> titles.length
=> 27

Now we can look at the items in the array:

i=0
titles[i]

This line is convenient for browsing in irb:

i+=1;puts i;titles[i]

Just hit the up arrow key and Enter to get the next item.
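
If you would rather see every item with its index in one go, each_with_index is one option. A minimal sketch, using a stand-in array (the two titles are invented for illustration) instead of the real titles array:

```ruby
# Stand-in for the titles collected from the feed; invented for illustration.
sample_titles = ["First story", "Second story"]

lines = []
sample_titles.each_with_index do |title, i|
  lines << "#{i}: #{title}"   # the index starts at 0, like titles[0]
end

puts lines
```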

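There is something that needs to be fixed: the structure of the array has an unwanted nesting artefact. Because the regular expression contains a group in parentheses, scan returns each match as a small array of its groups, so every title is wrapped in an array of its own. For example, here is one news title: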

irb(main):093:0> i+=1;puts i;titles[i]
2
=> ["Vers des \303\251lections l\303\251gislatives anticip\303\251es en Australie"]

It is shown between square brackets [ ].

If we examine which class it is by calling titles[i].class, we get Array:

irb(main):096:0> titles[i].class
=> Array

But I want to have strings! Here is the fix:

titlesflat = titles.flatten

Testing the class:

titlesflat[i].class

The result:

irb(main):102:0> titlesflat[i].class
=> String

Which is what we want.
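
To see where the nesting comes from, it helps to try a pattern with two groups. The following is a sketch on a made-up string (the sample fragment and the two-group pattern are my own illustration, not part of the feed code above):

```ruby
# Hypothetical feed fragment, invented for illustration only.
sample = "<title>One</title><pubDate>Mon</pubDate><title>Two</title>"

# Two groups: each match becomes an array with two entries (tag name, contents).
pairs = sample.scan(/<(title|pubDate)>(.*?)<\/\1>/)
p pairs

# One group: each match is still an array, just with a single entry.
nested = sample.scan(/<title>(.*?)<\/title>/)
p nested

# flatten removes the nesting; map(&:first) gives the same result here.
flat   = nested.flatten
firsts = nested.map(&:first)
p flat
p firsts
```

With one group per match, flatten is safe. With several groups it would mix the groups together into one long list, so in that case it is better to keep the pairs as they are.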
Let’s try to loop through all titles:

for title in titlesflat
  puts title
end

Scroll up in the command prompt to find the start of the output:

irb(main):107:0> for title in titlesflat
irb(main):108:1> puts title
irb(main):109:1> end
International - LeMonde.fr
International - LeMonde.fr

Let’s remove one redundant item while observing the length of the array:

titlesflat.length
titlesflat.shift #remove the first in the array
titlesflat.length

The first item in the array is removed and the length should drop by one.
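
Note that shift changes the array itself and returns the item it removed. If you want to keep the original array untouched, drop(1) returns a new array without the first item. A small sketch with a stand-in array (the contents are invented for illustration):

```ruby
# Stand-in for titlesflat; invented for illustration.
sample_titles = ["Feed name", "Feed name", "First story", "Second story"]

removed = sample_titles.shift     # removes the first item and returns it
puts removed
puts sample_titles.length         # one less than before

rest = sample_titles.drop(1)      # new array without the first item
puts rest.length
puts sample_titles.length         # unchanged by drop
```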

Now let’s get publish date(s):

pubdates=newsrss.scan(/<pubDate>(.*?)<\/pubDate>/)
pubdatesf=pubdates.flatten #flatten again
puts pubdatesf[0]

pubdatesf[0] gives the first item in the list, which is the latest pubdate.
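
If you want to check that the first date really is the latest one instead of assuming it, the dates can be parsed and compared. This sketch assumes the feed uses the usual RSS date format (RFC 822 style, such as "Mon, 01 Jan 2024 10:00:00 +0100"); the two date strings below are invented for illustration:

```ruby
require "time"   # adds Time.rfc2822 to the Time class

# Stand-in for pubdatesf; invented for illustration.
sample_dates = [
  "Mon, 01 Jan 2024 10:00:00 +0100",
  "Tue, 02 Jan 2024 09:30:00 +0100"
]

# Turn each string into a Time object so they can be compared.
parsed = sample_dates.map { |d| Time.rfc2822(d) }
latest = parsed.max

puts latest
```

Time.rfc2822 raises an ArgumentError if a string is not in the expected format, so a feed with a different date format would need another parser.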

And now we are ready to save the titles to a txt file (the comments explain the steps):

savep = 'D:/newstitles/' # define the directory for your file to save (it must already exist)
txtdocname = "lemonderssnewstitles.txt" # define the name of the file
filename = savep + txtdocname # compose the full path for the file
puts filename # print the full path to console
aFile = File.new(filename, "w") # Open file for writing
aFile.write(pubdatesf[0] + "\n") # latest pubdate to the first line in the file
for title in titlesflat # loop through titles
  puts title # output to the console
  aFile.write(title + "\n") # writing the title to the file with newline
end # close the for loop
aFile.close # close the file
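
A variation worth knowing is the block form of File.open, which closes the file automatically when the block ends, even if something goes wrong in between. The sketch below writes to a temporary directory so it can be run anywhere; the date line, the titles and the file name are stand-ins invented for illustration:

```ruby
require "tmpdir"   # provides Dir.mktmpdir

# Stand-in data; in the tutorial these come from pubdatesf and titlesflat.
sample_pubdate = "sample pubdate line"
sample_titles  = ["First story", "Second story"]

written = nil
Dir.mktmpdir do |dir|                          # temporary directory, removed after the block
  filename = File.join(dir, "newstitles.txt")  # File.join adds the separator for us
  File.open(filename, "w") do |file|           # file is closed when this block ends
    file.puts(sample_pubdate)                  # puts adds the newline
    sample_titles.each { |title| file.puts(title) }
  end
  written = File.read(filename)                # read it back to check the result
end

puts written
```

To write to a permanent location, replace the Dir.mktmpdir part with your own directory, as in the code above.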

That’s it for this tutorial.

About learnprogramruby

My name is Jukka Ylitalo. I live in the Helsinki metropolitan area in Finland. I'm a philosophy major, media artist and a travelling lecturer on "digital design". At the moment one of the things I want to do is to learn to program in Ruby.