Continuing from the previous tutorial, we are now here:
require "open-uri" # Required library
# On Ruby 3.0 and later URI.open is needed; older versions accepted plain open()
f = URI.open("http://www.juy.fi/kurssit/index.html") # put a self-promoting web address here
webpage = f.read # Read it as one big string
f.close # Don't forget to close!
puts webpage
The aim is to extract the title tag from the web page, which is now stored in a variable called ‘webpage’ (you may certainly try it with any web address you wish; I’ll continue with this slightly self-centered approach).
Let’s first check that we are dealing with a string type:
Here is the result:
irb(main):006:0> webpage.class
=> String
irb(main):007:0>
String class it is. Take a look at what methods this class has to offer:
That outputs a long list of methods. ‘find’ from the list seems promising:
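To narrow a long method list down, one option is to ask only for the methods String defines itself and then filter them by name. This is a minimal sketch, not the command used above; the filter pattern is just an illustration, and the exact list depends on the Ruby version (on current versions String may not offer ‘find’ at all).

```ruby
# Methods defined directly on String, leaving out those inherited from Object
own_methods = String.instance_methods(false).sort
puts own_methods.length

# Filter the list by name to spot likely candidates for searching text
# (the pattern is only an example)
candidates = own_methods.grep(/index|find|scan/)
puts candidates.inspect

# Ask a concrete string whether it understands a given method
puts "abc".respond_to?(:index)
puts "abc".respond_to?(:find)
```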
Where do I go for reference?
That page lists an overwhelming number of links.
This one looks promising as a general tutorial:
But this must be what I’m looking for:
Frames again! 😦
Those with .c extensions look like something for C developers, so I’ll search the middle column:
Sure enough, there is a String section.
Let’s get out of the frames:
But where is the find method? Wherever it may be, let’s try the index method instead:
We get a number! It reports the index position of our search string, ‘<title>’.
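Here is a small sketch of how index behaves, using a hand-written string instead of the downloaded page. The variable name sample and its contents are made up for illustration.

```ruby
# Hypothetical sample page; the real one comes from the download above
sample = "<html><head><title>Courses</title></head><body></body></html>"

# index returns the zero-based offset of the first match
position = sample.index('<title>')
puts position          # 12, because "<html><head>" takes up offsets 0..11

# When the search string is not present, index returns nil
missing = sample.index('<h1>')
puts missing.inspect   # nil
```

Because a failed search gives nil rather than a number, it is worth checking the result before doing arithmetic with it.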
Now query the length of the string:
Let’s try to fetch a substring:
irb(main):135:0> webpage[0,20]
=> "<!DOCTYPE html PUBLI"
irb(main):136:0>
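The two-argument form string[start, length] reads as "begin at this offset and take this many characters". A short sketch on a made-up sample string (not the real page) shows the length query and a few slices side by side:

```ruby
# Hypothetical sample page, used only for illustration
sample = "<html><head><title>Courses</title></head><body></body></html>"

puts sample.length      # total number of characters in the string

first_six = sample[0, 6]     # start at offset 0, take 6 characters
puts first_six               # <html>

open_tag = sample[12, 7]     # start at offset 12, take 7 characters
puts open_tag                # <title>

same_tag = sample[12..18]    # range form: offsets 12 through 18 inclusive
puts same_tag                # <title>
```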
Next I’ll nest these things together:
Not exactly the contents of the title here. Another approach should do better:
titleStartIndex = webpage.index('<title>')
titleEndIndex = webpage.index('</title>')
titleLength = titleEndIndex - titleStartIndex
webpage[titleStartIndex,titleLength]
We create three variables to hold numerical values indicating text positions. titleStartIndex holds the position where the string '<title>' begins in webpage. titleEndIndex indicates the beginning of the closing tag. Subtracting the first from the second gives titleLength. Finally, webpage[titleStartIndex,titleLength] returns the requested sequence.
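Tracing the same arithmetic on a small hand-written string makes it easier to see what the slice contains. The sample string is made up, and the variable names use Ruby's usual snake_case instead of the names above.

```ruby
# Hypothetical sample page, used only for illustration
sample = "<html><head><title>Courses</title></head><body></body></html>"

title_start  = sample.index('<title>')    # 12: offset of the '<' in <title>
title_end    = sample.index('</title>')   # 26: offset of the '<' in </title>
title_length = title_end - title_start    # 14

first_attempt = sample[title_start, title_length]
puts first_attempt                        # <title>Courses
```

The slice begins at the opening tag itself, so the tag comes along with the text.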
But it’s not quite there: the result still starts with the opening tag. We need to shift the start to the end of the title tag and try again:
titleStartIndex = titleStartIndex + '</title>'.length
titleLength = titleEndIndex - titleStartIndex
webpage[titleStartIndex,titleLength]
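That is off by one step: '</title>' is one character longer than '<title>', so the start index overshoots by one. Here’s the cure:
titleStartIndex = webpage.index('<title>')
titleStartIndex = titleStartIndex + '</title>'.length
titleStartIndex -= 1 # '</title>' is one character longer than '<title>'
titleEndIndex = webpage.index('</title>')
titleLength = titleEndIndex - titleStartIndex
webpage[titleStartIndex,titleLength]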
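The same steps can also be wrapped in a small method that measures the opening tag directly, so no correction by one is needed. This is only a sketch under a few assumptions: extract_title is a made-up name, the tags are expected in lowercase and without attributes, and the sample string stands in for the real page.

```ruby
# Returns the text between <title> and </title>, or nil if either tag is missing.
# Assumes lowercase tags without attributes.
def extract_title(html)
  open_tag  = '<title>'
  close_tag = '</title>'

  start = html.index(open_tag)
  return nil if start.nil?

  # Skip past the opening tag so the slice begins at the title text
  content_start = start + open_tag.length

  # Search for the closing tag only after the opening tag
  finish = html.index(close_tag, content_start)
  return nil if finish.nil?

  # Three-dot range excludes the offset of the closing tag itself
  html[content_start...finish].strip
end

# Hypothetical sample page, used only for illustration
sample = "<html><head><title>Courses</title></head><body></body></html>"
puts extract_title(sample)                      # Courses
puts extract_title("<p>no title</p>").inspect   # nil
```

Returning nil for a page without a title keeps the arithmetic from failing on a missing index.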
And that’s it. The next step is to work with the actual news site and news titles.