Hello Website 3 ruby tutorial

Continuing with the Ruby tutorial. The goal is to download a news feed and extract some news titles into a text file.

Here is the link I’ll test first: http://www.lemonde.fr/rss/tag/international.xml

Only a slight modification to the code from the previous tutorial is needed. The web address goes into a variable:

newsrsslemonde="http://www.lemonde.fr/rss/tag/international.xml"

require "open-uri"                         # Required library
f = URI.open(newsrsslemonde)               # Open the address in the newsrsslemonde variable (Ruby older than 2.5: plain open)
newsrss = f.read                           # Read it as one big string
f.close                                    # Don't forget to close!
puts newsrss

Now we can reuse the other code block from the previous tutorial, with a slight modification, to extract the contents of the title tag:

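titleStartIndex = newsrss.index('<title>')              # where the opening tag starts
titleStartIndex = titleStartIndex + '</title>'.length   # skip ahead by the closing tag's length (8)
titleStartIndex -= 1                                    # the opening tag is one character shorter (7)
titleEndIndex = newsrss.index('</title>')               # where the closing tag starts
titleLength = titleEndIndex - titleStartIndex           # number of characters in between
newsrss[titleStartIndex, titleLength]                   # the contents of the first title tag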

This works here as well, but I want to collect more than just the first title tag. Time to add some regular expression magic to the mix:

titles=newsrss.scan(/<title>(.*?)<\/title>/)
titles.length

What happened here? Here are some more examples:
http://www.ruby-doc.org/core/classes/String.html#M000812

The scan method goes through the newsrss string, which contains the RSS feed XML, and collects all matches of the regular expression into an array we call ‘titles’. The regular expression /<title>(.*?)<\/title>/ looks for every substring that has an opening and a closing tag, <title></title>, and collects the contents matched inside the parentheses: (.*?).

Here is one reference on regular expressions:

http://www.ruby-doc.org/docs/ProgrammingRuby/html/language.html#UK

I quote:

“. (period)
Appearing outside brackets, matches any character except a newline.
*
Matches zero or more occurrences of re.
?
Matches zero or one occurrence of re. The *, +, and {m,n} modifiers are greedy by default. Append a question mark to make them minimal.”

(.*?) means that we are looking for any character except a newline, occurring zero or more times. In other words we say: give me anything there is between the <title></title> tags.

But why do we need the question mark? Without it the regex would be greedy and would match everything up to the last closing tag on the same line, so several titles on one line would be swallowed as one. Quoting again: “*, +, and {m,n} modifiers are greedy by default. Append a question mark to make them minimal.” This way we get only what comes before the first occurrence of the closing tag ‘</title>’.

One more thing. If you examine the closing tag in the regular expression, it reads <\/title>. Why that instead of </title>? We need to “escape” the forward slash /, since it would otherwise mark the end of the regex pattern.

Finally, we need the parentheses because we only want the contents inside the title tags. We could also try /<title>.*?<\/title>/, but that would collect the titles with the tags as well.
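To see what the question mark and the parentheses change, here is a small sketch on a made-up one-line string. The sample string and the variable names are only for illustration; they are not taken from the feed.

```ruby
# Made-up sample: two title tags on one line (not real feed data).
sample = "<title>First</title><title>Second</title>"

# Greedy: .* runs on to the LAST closing tag on the line.
greedy = sample.scan(/<title>(.*)<\/title>/)

# Minimal: .*? stops at the FIRST closing tag, so each title is matched on its own.
minimal = sample.scan(/<title>(.*?)<\/title>/)

# No parentheses: scan returns the whole match, tags included.
with_tags = sample.scan(/<title>.*?<\/title>/)

p greedy     # [["First</title><title>Second"]]
p minimal    # [["First"], ["Second"]]
p with_tags  # ["<title>First</title>", "<title>Second</title>"]
```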

titles.length gives the length of the array that now contains the titles. Here is what I get:

irb(main):086:0> titles.length
=> 27

Now we can look into the items in the array:

i=0
titles[i]

This line is convenient for browsing in irb:

i+=1;puts i;titles[i]

Just hit the up arrow key and Enter to get the next item.
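If you would rather see every item at once instead of stepping through with the up arrow, something like the following might do. It is only a sketch: the titles array here is made up to have the same nested shape that scan returns.

```ruby
# Made-up stand-in for the array returned by scan (one sub-array per match).
titles = [["International - LeMonde.fr"], ["Example headline"]]

# Build one "index: item" line per entry, then print them all.
lines = titles.each_with_index.map { |title, i| "#{i}: #{title.inspect}" }
puts lines
```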

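There is something that needs to be fixed: the array has an unwanted level of nesting. Because the regular expression contains a group in parentheses, scan returns each match as a small array of its own. For example, here is one news title: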

irb(main):093:0> i+=1;puts i;titles[i]
2
=> ["Vers des \303\251lections l\303\251gislatives anticip\303\251es en Australie"]

It is shown between square brackets [ ].

If we examine which class it is by calling titles[i].class, we get Array:

irb(main):096:0> titles[i].class
=> Array

But I want to have strings! Here is the fix:

titlesflat = titles.flatten

testing the class:

titlesflat[i].class

resulting:

irb(main):102:0> titlesflat[i].class
=> String

Which is what we want.
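The same effect on a tiny made-up string, as a sketch (the sample text and names are mine): scan with one group gives an array of one-element arrays, and flatten turns it into a plain array of strings.

```ruby
# Made-up sample with two title tags (not real feed data).
sample = "<title>One</title>\n<title>Two</title>"

nested = sample.scan(/<title>(.*?)<\/title>/)  # one sub-array per match
flat   = nested.flatten                        # plain array of strings

p nested              # [["One"], ["Two"]]
p flat                # ["One", "Two"]
p nested.first.class  # Array
p flat.first.class    # String
```

Note that with more than one group in the pattern, flatten would mix the groups together in a single list, so this shortcut fits the one-group case only.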
Let’s try to loop through all titles:

for title in titlesflat
  puts title
end

Scroll up in the command prompt to find this:

irb(main):107:0> for title in titlesflat
irb(main):108:1> puts title
irb(main):109:1> end
International - LeMonde.fr
International - LeMonde.fr

Let’s remove one redundant item while observing the length of the array:

titlesflat.length
titlesflat.shift #remove the first in the array
titlesflat.length

The first item in the array is removed and the length should drop by one.
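A small sketch of what shift does, on a made-up array (the contents are only for illustration): it removes the first item from the array itself and hands that item back.

```ruby
# Made-up list that starts with the feed name twice, like the output above.
titles = ["International - LeMonde.fr", "International - LeMonde.fr", "Example headline"]

before  = titles.length  # 3
removed = titles.shift   # removes and returns the first item
after   = titles.length  # 2

puts "removed #{removed.inspect}, length went from #{before} to #{after}"
```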

Now let’s get publish date(s):

pubdates=newsrss.scan(/<pubDate>(.*?)<\/pubDate>/)
pubdatesf=pubdates.flatten #flatten again
puts pubdatesf[0]

pubdatesf[0] gives the first item in the list, which is the latest pubdate.
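The pubdate is still just a string. If you want a real time value out of it, the standard time library can parse the RFC 822 style dates that RSS uses. A sketch, assuming the feed follows that format; the date below is made up, not copied from the feed.

```ruby
require "time"  # adds Time.rfc2822 to the Time class

# Made-up example date in the format RSS feeds normally use.
sample_pubdate = "Mon, 02 Jun 2008 11:30:00 +0200"

parsed = Time.rfc2822(sample_pubdate)

puts parsed.year                               # 2008
puts parsed.getutc.strftime("%Y-%m-%d %H:%M")  # 2008-06-02 09:30
```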

And now we are ready to save the titles to a text file (the comments explain the steps):

savep = 'D:/newstitles/'                   # directory to save the file in (it must already exist)
txtdocname = "lemonderssnewstitles.txt"    # name of the file
filename = savep + txtdocname              # compose the full path for the file
puts filename                              # print the full path to the console
aFile = File.new(filename, "w")            # open the file for writing
aFile.write(pubdatesf[0] + "\n")           # latest pubdate on the first line of the file
for title in titlesflat                    # loop through the titles
  puts title                               # output to the console
  aFile.write(title + "\n")                # write the title to the file, followed by a newline
end                                        # end of the for loop
aFile.close                                # close the file
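Another way to write the file, as a sketch: File.open with a block closes the file by itself when the block ends, so the close cannot be forgotten. The titles, the date and the file name below are made up, and I write to the system temp directory here only so that the example runs anywhere.

```ruby
require "tmpdir"  # for Dir.tmpdir

# Made-up data standing in for pubdatesf[0] and titlesflat.
pubdate = "Mon, 02 Jun 2008 11:30:00 +0200"
titles  = ["Headline A", "Headline B"]

# Hypothetical file name inside the system temp directory.
filename = File.join(Dir.tmpdir, "newstitles_example.txt")

File.open(filename, "w") do |file|
  file.puts(pubdate)                        # puts adds the newline for us
  titles.each { |title| file.puts(title) }  # one title per line
end                                         # the file is closed here

puts File.read(filename)
```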

That’s it for this tutorial.

Posted in Programming, Programming tutorial, Ruby

Hello Website 2 and some Work with Strings Ruby tutorial

Continuing with the previous tutorial we are now here:

require "open-uri"                         # Required library
f = URI.open("http://www.juy.fi/kurssit/index.html")  # put a self-promoting web address here (Ruby older than 2.5: plain open)
webpage = f.read                           # Read it as one big string
f.close                                    # Don't forget to close!
puts webpage

The aim is to extract the title tag from the web page, which is now stored in a variable called ‘webpage’. (You may certainly try it with any web address you wish; I’ll continue with this slightly self-centered approach.)

Let’s first check that we are dealing with a string type:

webpage.class

Here is the result:

irb(main):006:0> webpage.class
=> String
irb(main):007:0>

String class it is. Take a look at what methods this class has to offer:

webpage.methods

This outputs a long list of methods. ‘find’ from the list seems promising.

Where do I go for reference?
Starting here:
http://www.ruby-lang.org/en/documentation/
That page lists an overwhelming number of links.

http://pine.fm/LearnToProgram/
looks promising as a general tutorial

But this must be what I’m looking for..
http://www.ruby-doc.org/core/

Frames again! 😦
Those with .c extensions look like something for C developers, so I’ll search the middle column.
Sure enough, there is a String section.
Let’s get out of the frames:
http://www.ruby-doc.org/core/classes/String.html

But where is the find method? Wherever it may be, let’s try the index method instead:

webpage.index('<title>')

We get a number! It reports the index of the position where our search string ‘<title>’ starts.
Now query the length of the string:

webpage.length

Let’s try to fetch a substring:

webpage[0,20]

This produces:

irb(main):135:0> webpage[0,20]
=> "<!DOCTYPE html PUBLI"
irb(main):136:0>
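Here are the same three tools (index, length and the [start, length] substring) on a short made-up page, so that the numbers are easy to check by hand. The page string is an assumption for illustration, not the real website.

```ruby
# Made-up miniature page (not the real website).
page = "<html><head><title>Courses</title></head></html>"

start   = page.index("<title>")  # position of the first match, counting from 0
missing = page.index("<h1>")     # nil when the search string is not found

p start           # 12
p missing         # nil
p page.length     # 48
p page[0, 6]      # "<html>"  (start at 0, take 6 characters)
p page[start, 7]  # "<title>" (start at 12, take 7 characters)
```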

Next I’ll nest these things together:

webpage[webpage.index('<title>'),webpage.index('</title>')]

Not exactly the contents of the title here: the second number in the brackets is a length, not an end position. Another approach should do better:

titleStartIndex = webpage.index('<title>')
titleEndIndex = webpage.index('</title>')
titleLength = titleEndIndex- titleStartIndex
webpage[titleStartIndex,titleLength]

We create three variables to hold numerical values indicating text positions. titleStartIndex holds the position where the opening tag begins in webpage. titleEndIndex holds the position where the closing tag begins. Then we subtract one from the other to get titleLength. And finally we print out the requested sequence with webpage[titleStartIndex,titleLength].
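A sketch of why the length has to be calculated, using a made-up miniature page (the page string and variable names are assumptions of mine). In string[a, b] the second number is a count of characters; if you prefer to think in start and end positions, a range does the same job.

```ruby
# Made-up miniature page (not the real website).
page = "<html><head><title>Courses</title></head></html>"

start_index = page.index("<title>")   # 12
end_index   = page.index("</title>")  # 26

wrong     = page[start_index, end_index]                # 26 characters from position 12
by_length = page[start_index, end_index - start_index]  # start plus calculated length
by_range  = page[start_index...end_index]               # start up to, not including, end

p wrong      # "<title>Courses</title></he"
p by_length  # "<title>Courses"
p by_range   # "<title>Courses"
```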

But it’s not quite there: the result still begins with the opening tag. We need to shift the start to the end of the title tag and try again:

titleStartIndex = titleStartIndex + '</title>'.length
titleLength = titleEndIndex- titleStartIndex
webpage[titleStartIndex,titleLength]

Off by one step, because ‘</title>’ is one character longer than ‘<title>’. Here’s the cure:

titleStartIndex = webpage.index('<title>')
titleStartIndex = titleStartIndex + '</title>'.length
titleStartIndex -= 1
titleEndIndex = webpage.index('</title>')
titleLength = titleEndIndex- titleStartIndex
webpage[titleStartIndex,titleLength]
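The minus one can be avoided by adding the length of the opening tag itself. Here is a sketch of that variant on a made-up miniature page, together with a one-line regular expression alternative. The page string and the variable names are my own assumptions, not part of the steps above.

```ruby
# Made-up miniature page (not the real website).
page = "<html><head><title>Courses</title></head></html>"

open_tag  = "<title>"
close_tag = "</title>"

# Start right after the opening tag, so no correction is needed afterwards.
start_index = page.index(open_tag) + open_tag.length
# Look for the closing tag only from that point onwards.
end_index = page.index(close_tag, start_index)

title = page[start_index...end_index]

# Alternative: string[regexp, 1] returns the text matched by the first group.
title_by_regex = page[/<title>(.*?)<\/title>/m, 1]

p title           # "Courses"
p title_by_regex  # "Courses"
```

Both variants assume the page really has a title tag. If it does not, index returns nil and the addition fails, while the regular expression version simply gives nil.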

And that’s it. Next step is going to be working with the actual news site and news titles.

Posted in Programming, Programming tutorial, Ruby

Hello web site! with Ruby

This is my first tutorial on Ruby programming. I want to have a simple web document reader that gathers news titles from the web and saves them to a text file.

So the plan is a Ruby-based client that downloads content from the web, extracts the relevant parts and saves them to a text file. This involves more than can be crammed into one tutorial, so this tutorial covers only the downloading part.

I assume you have Ruby installed; if not, consult here: http://www.ruby-lang.org/en/downloads/

Once it is installed, go to the command prompt.

If you are on Windows XP, open Start > Run… and type ‘cmd’ to open it.

[Screenshot: Run dialog]

When the command prompt opens, type ‘irb’ to enter ‘Interactive Ruby’:

[Screenshot: Windows cmd prompt]

..and we are ready to go!

Type: puts "hello web!"

irb(main):001:0> puts "hello web!"
hello web!
=> nil
irb(main):002:0>

So that’s our obligatory first test. Irb (Interactive Ruby) does what is expected. The ‘puts’ command tells Ruby to print that string of letters. As Ruby in Twenty Minutes tells us, nil “is Ruby’s absolutely-positively-nothing value.” Never mind what that means; I want irb to print out the contents of a website.

I’ll go to http://stdlib.rubyonrails.org/ and look for http:

net/http > Net::HTTP

Here is the link for the frame: http://stdlib.rubyonrails.org/libdoc/net/http/rdoc/classes/Net/HTTP.html

Let’s try the first example and put something existing into the URL placeholders:

require 'net/http'
Net::HTTP.get_print 'www.juy.fi', '/kurssit/index.html'

Copy and paste (or type) that into irb and it should print the contents of that HTML document to the console. If so, success!
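The get_print call above takes the host and the path as two separate strings. If you start from a full address instead, the standard uri library can split it for you. A small sketch (the variable names are mine, and the last line is commented out because it needs a network connection):

```ruby
require "uri"  # standard library for taking addresses apart

address = "http://www.juy.fi/kurssit/index.html"
uri = URI.parse(address)

puts uri.host  # www.juy.fi
puts uri.path  # /kurssit/index.html

# With net/http loaded, get_print also accepts the URI object directly:
# Net::HTTP.get_print(uri)
```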

By the way, copying and pasting code into irb does not seem to be too user-friendly in the Windows command prompt. Usually you can find a paste command in the right-click context menu, but here it only seems to work on the title bar. Right-click the title bar like this:

[Screenshot: paste in cmd prompt]

That’s not too user-friendly but still more convenient than typing everything by hand. (Thinking of making an Autohotkey script to simulate Linux shell functionality to remedy this.. speaking of which, it is here.)

Another thing to enhance the coding pleasure would be to adjust some command prompt settings. I think it comes up with too small a font size. Go to the properties (bottom line in the context menu in the image above) and make your preferred changes. Well.. I guess there is (or should be) a nicer way to play with irb in Windows..?

These other Net::HTTP documentation examples look promising, but for some reason I have some trouble with them.. I’m looking for a really simple snippet, so I resort to the book The Ruby Programming Language by D. Flanagan and Y. Matsumoto. Here are 5 lines that work:

require "open-uri"                         # Required library
f = URI.open("http://www.juy.fi/kurssit/index.html")  # put a self-promoting web address here (Ruby older than 2.5: plain open)
webpage = f.read                           # Read it as one big string
f.close                                    # Don't forget to close!
puts webpage

If you get a bunch of HTML code running on your screen you can cheer Hurray! That is what is supposed to happen..

The next step is to do something with this variable called webpage that contains the HTML code. For example, how do I extract the title tag? I suppose there are a number of ways to do it in Ruby. Python has Beautiful Soup and others like it; I’m sure Ruby must have something like that too. If this website is valid XHTML I should be able to parse it with some XML parsing tools too. But I will take the rudimentary string processing option.
So strings and Ruby.. That will be the subject of the next post..

Posted in Programming, Programming tutorial, Ruby

Keyboard shortcut with Autohotkey for pasting stuff in the Windows command prompt

This is a very crude fix. Get Autohotkey, install it and run a script that might be something like the following one. I don’t recommend running an identical script, since this one is based on mouse positions I got out of AutoScriptWriter on my screen and might depend somewhat on the window positions. (You might need to adjust the mouse coordinates to your own window positions. Run AutoScriptWriter to get your mouse sequence.)

Here is what I got while running irb:

It works with Ctrl+Shift+V

^+V::

WinWait, C:\WINDOWS\system32\cmd.exe - irb,
IfWinNotActive, C:\WINDOWS\system32\cmd.exe - irb, , WinActivate, C:\WINDOWS\system32\cmd.exe - irb,
WinWaitActive, C:\WINDOWS\system32\cmd.exe - irb, 

MouseClick, right,  360,  16
Sleep, 100
MouseClick, left,  399,  154
Sleep, 100
MouseClick, left,  505,  187
Sleep, 100

return

Better ways to do this? (switching to Linux I should say..)

Posted in Autohotkey, Programming, Ruby

The idea of this blog

The aim of this little blog is to document my learning path with Ruby programming and write a bunch of tutorials for anyone wanting to learn along.

My approach is going to be very practical and project-oriented.

My conviction is that anyone can learn to program.

I started to program with Max in 1996. After that I learned some JavaScript and ActionScript, and it did not stop there.. I’ve been bothering my head with some C++, Java, PHP, Linux bash and lately Python.
So I figure that now is the time for Ruby. One selling point for Ruby, they say, is that it makes a programmer happy. Sounds appealing.. Another is of course Rails. But I’m going to start with stuff that I used to use Python for: web scraping and file administration.

Here we go.

Posted in Programming, Ruby