PostgreSQL and Full Text Searches

I recently bought the PostgreSQL peepcast and decided to make an app to use PostgreSQL’s amazing full text search. I don’t want to give away too much of the cast since it’s only twelve dollars, but I wanted to share something that they did not, which is how to search based on user input that will not have an operator between words.

So first off the objective is to take a few simple queries such as: “radioactive spider” and “blue not green” and  transform them into the proper format for a full text search. That would be “radioactive & spider” and “blue & ! green” respectively.

In a future post I will cover “radioactive spider not blue or green” which should generate “radioactive & spider & ! (green | blue)”

So first off, in order to use PostgreSQL, you have to have PostgreSQL. Next up, lets add some index to our blog’s Post model. Assuming that you want to search Post.title and Post.body, migration code (via pastebin).

Now this is where I deviate from the screencast. Inside my Post model I want to add a search method that can take a search query from a user and parse it so PostgreSQL doesn’t complain that it’s formatted improperly and give the user a 500. Here is the code I wrote (via pastebin).

So first we duplicate the query object, this is because arguments are frozen and we need to modify their query to be PostgreSQL friendly. First we escape any operators they add, this is because it can easily cause errors and the chance of the user really wanting “Sparta!” to throw an error is unlikely. Next we apply our own search operators by replace any “or”s with the proper pipe bar operator. Then we replace any “not” or “and not” with “& !”. Then we replace any remaining “and”s with the proper ampersand.

But we’re not done yet! All words have to have an operator between them. Search engines such as Google will add an “or” operator but place a higher value on items that match the most terms. The code that I wrote uses the ampersand instead, because for my application that’s what works best, so be sure to try customizing that out and seeing what returns the best results for you. Now the regular expression is a bit cryptic, but it’s saying any space that doesn’t have a operator before and after it. Take a look at it on Rubular to see what I mean.

So now the query is sanitized and for the most part pretty sane (because we currently don’t handle “this and not this or that”). So lets send it off to our PostgreSQL database! We use sanitize_sql_array, an ActiveRecord method to escape any SQL injection code and pass it into PostgreSQL’s “to_tsquery” function. By specifying the language we get some nifty features like “Sparta’s” will give the same result as “sparta” and other language tricks for matching a word’s other tenses. 

The where clause might be a tad bit cryptic, but search_vector is the name of the field where we’re storing our indexed data and the @@ specifies that we’re doing a fulltext search on the field. We then order it by ts_rank_cd, this should put more relevant results at the top of the return list. For example if you have 100 similar records, it will return those 100 then the extra ones, or if you have more terms match in one record it will move that record to the top of the results. You should really look into that feature as well since there are several ways to customize it and even another method called ts_rank.

How to Make Your Rails App Faster

I see a lot of coders that make their code as efficient as possible. But you have to remember that it’s not only your code, it’s also the Rails core code. It’s also the Ruby core code. It’s also the database. It’s also the filesystem, etc. And I think a lot of people fail to realize this and waste their time making their code more efficient when it would be more efficient of their time to just change their database or interpretor.

If you didn’t check out the YouTube videos of RailsConf 2011, you’re missing out! You should check out Arron’s “Double Dream Hands: So Intense” which is funny, informative, and memorable. Then checkout Dr. Nick’s talk on Rubinius. After watching them I switched around to Rubinius and PostgreSQL for production and see much better response times. Rubinius is a work in progress and some things will give segment faults (some = very very very small amount of things), so make sure to run your test suites before making a public deploy and change. 

Didn’t like this post? Want to know about a specific topic? Hit the “Ask me a question” button!

Clearing instance variables (useful for testing)

Sorry that it’s been awhile since I’ve posted. I’m working on creating my own freelancing company, so hit me up if you want to get a quote mohammad { et } ReliableRabbit.com

So, you built a user system and you’re writing functional tests. But you run into a problem, your login test help seems to not be working correctly. You’re current code is:

def login(item)
  session[:user_id] = item.try(:id) || item
end

What we hope is happening is that this sets the session key user_id to item.id or item (so both login(@user) or login(@user.id) work). But if we debug our application we will see that session[:user_id] is nil. That’s because tests don’t handle the session hash the same way as a controller does. So lets make it so it sets our session correctly.

def login(item)
    @request.session[“#{item.class.to_s.downcase}_id”] = item.id
end

Now our session[:user_id] is set correctly! But as try to test our super user after our normal user, we get an error! Once again we use debugging to look into this, and it shows that the current_user is the normal user. What!? Not what we expected! This is because current_user is defined as:

@current_user ||= User.where([‘id = ?’, session[:user_id]]).first if session[:user_id].present?

so the @current_user variable is still cached and not dumped. Instead of modifying the login method to delete the @current_user variable we’re going to make a logout method that will do that.

def logout(item = nil)
  key = “#{item.class.to_s.downcase}_id”
  if item.present?
    @request.session.delete(key) if @request.session[key] == (item.try(:id) || item)
  else
    @request.session.delete(key)
  end
  @controller.send(:remove_instance_variable, :@current_user) if @controller.instance_variable_defined?(:@current_user)
  @controller.send(:remove_instance_variable, :@current_ability) if @controller.instance_variable_defined?(:@current_ability)
end

This method takes an optional item or id of the object and will only logout if it’s the same or it just logsout whomever is logged in. But you can also see that it sends a private method “remove_instance_variable”. I’m also unsetting @current_ability because this is what cancan uses to cache permissions. Now in our login method we simply add “logout” to the top of it and vwala! It works as expected and all your tests should be passing now!

Quick Post on Pagination

Pagination, it’s commonly used to reduce the amount of data a server has to load at a time, although some sites use it to increase ad revenue. But it’s something that every web developer ought to know. I recently ran across some people that were struggling with the proper way to do this. So I thought I would share how I do it. Which is a way I came up with on my own, which I assume is how it’s done. (But remember assuming makes an ass out of you and me).

So first we’re going to have our raw query

$query = “SELECT * FROM users”;

Now we want to paginate it at 50 users per page, so lets set a variable and redo the query using that variable

$per_page = 50;
$query = “SELECT * FROM users LIMIT $per_page”;

Now we need to also be able to change the pages, via a page variable and render the offset

$page = 1;
$per_page = 50;
$offset = (($page-1)*$per_page);
$query = “SELECT * FROM users LIMIT $per_page OFFSET $offset”;

But now how can we reliably tell if there is a next page? By checking if there is one extra row.

$page = 1;
$per_page = 50;
$offset = (($page-1)*$per_page);
$query = “SELECT * FROM users LIMIT “.($per_page+1).” OFFSET $offset”;

now you can check if there is a next page via counting the rows and seeing if it is greater then $per_page and there will always be a previous page if the current page ($page) is greater then one.

For ActiveRecord (rails) you can scope it or combine it

users = User.limit(per_page).offset(offset+1).all
users = User.all(:limit => per_page, :offset => offset + 1)

Exception Notifier will let you know about your bug pronto

Read this post without the annoying “fixes” to my quotes @ http://bit.ly/fK24EB

Using Exception Notifier and assuming that you have ActionMailer configured correctly and it can send mail, you will receive an e-mail for ever “We’re sorry, but something went wrong” — with a stack trace (example mail below). You’ll want to make sure that whatever client you use to check those e-mails supports threading. Because if you have a hundred users and each one is having a problem logging in, you’re going to get over a hundred e-mails telling you about the error. But that’s good, because one time a co-worker had made an error where whenever an odd number was used in the application it would give an error (he modified the Integer class). Without even looking at code, we were able to realize that was the problem and save time digging through the stack trace to see the root of the problem.

Add the following to your Rails 3 bundle. Then run `bundle install`.

gem ‘exception_notification_rails3’, :require => ‘exception_notifier’

Now open your production settings (or whatever environment you want) in “config/environments/production.rb”. Within the config block add the following and modify it to your settings.

config.middleware.use ‘::ExceptionNotifier’,   :email_prefix => ‘[PA] ‘,   :sender_address => ‘“Errors” <error@project-army.isscary.com>’,   :exception_recipients => %w{someone@somewhere.com aspace.is.a.new@email}

Example output:

Sensitive data will be marked as: —REMOVED BY AUTHOR—
Following output is brought to you by: http://IsScary.Com

From: Errors <errors@project-army.isscary.com>
Subject: [PA] battlefield#index (ActionView::Template::Error) “undefined method `covert_points’ for nil:NilClass”
Body: http://pastebin.com/CqvfF0F2

What is: a background job

Continuing with the discussion of threading from: http://mohammad.el-abid.com/post/3363641377/what-is-threading

At some point of time you might find that your application starts to “lag” a bit while doing complex things. This means it’s time for background jobs. Lets go ahead and take YouTube for example. I submit a video, it encodes my video, and then my video can be seen by millions of people.

Step one: Submitting the video.

I simply use the form to upload my video, my browser sends the data, which depending on my upload speed and file size, can take awhile. But once it is done, the server will continue on with it’s normal process of handling the data and/or sending back HTML to me.

Step two: Encoding the video.

Encoding a video can take a few seconds to a few hours. If this was done within the upload script the browser would probably give up and let the connection die (meaning the application would probably stop encoding the video), or you would have to keep your browser open for a really long time, or the server might shutdown the thread for running too long. 

So instead, the server will save your raw video somewhere and save it to a jobs database and some computer (perhaps the same one) will pick up that request and process the video. There are plenty of benefits here. For example, you control how many jobs run, it’s not based on how many connections you have, you can choose how much time and processing power and ram to dedicate, etc. You just gain a lot of control.

But now that the data is saved and will be handled else where, the server thread can now tell us our video has been uploaded and will take awhile to appear.

Step three: Showing the video

Now that our video has been picked from the background queue it’s time for some processing. The background job will probably load up the video, FFMPEG it to FLV format and then store it on their file system and mark it as completed. Should FFMPEG fail, the server crash, etc. The data is still there in the database and will be picked up by another background worker. 

Author’s note: This article was a bit rushed, please leave a comment if anything was not understandable.

Good code is its own best documentation. — Steve McConnell (via wtfcode)

What is: Threading

Threading is a very fundamental technique, however I want to go over it before I write the next article which relies on basic threading knowledge. So let’s dive in.

Threads are vital to computer processing. Each core in your computer can only handle one thread at a time. So if you have a dual core, two threads at a time. So what is a thread then? A thread is a program or a part of a program that is running. So your core can only think about one thing at a time, very much like our own brain.

So how come you can run windows (which is made out of many threads) and a ton of applications at once? The cores are constantly rotating on all the threads. Giving higher priority threads a longer run time and lower threads a lower run time and sometimes a the other threads get to ‘skip the line’; but with today’s computing power that is generally not an issue.

So why do you need to know this? Well, let’s take an example from JavaScript here. If I write a script that goes: while(true){}; the ENTIRE browser will generally lock up, thankfully many modern browsers will prevent a lock up, but that’s beside the point. Why is it the the JavaScript will cause a lockup? JavaScript runs in ONE thread. But see, if JavaScript supported threading, we would make a new thread and run our intensive stuff in there so the browser doesn’t lock up.

Threading is used all over the place. As I’ve mentioned before, the GUI should be in it’s own thread so users don’t think the application has crashed because it locks up. But you can also use it for converting a video faster. Create a few threads and ration out the length to convert, for example we make two threads and tell one to do the first half an the other one to do the second half. When they are done, we combine the files together and replace the headers, and the video will have probably converted faster, depending on the system and settings.

I encourage you to look at your languages threading libraries and try a few demo applications if you have not already. Next we will be talking about ‘background jobs’ and why any successful website should use them.

What is: Memcached

Just working on going over some basic stuff, so here is something really basic that most new web developers overlook, memcached. If you have any questions on this, hit up the comments. If you have any questions that you would like answered in a post, send @the_empty a tweet.

—-

Memcached is a Memory cache. You run one or more memcached servers which will run in ram, saving nothing to a database or hard disk. You can think of it as a hash. You save a value to a key and it will stay there for as long as it is relevant and used. A key has an expire time and a last used time. When the cache maxes out (that you configure), it will take the last used item and remove it to make space for the new key.

So where should you use it? You can use it as a web cache replacement because it will be faster and more efficent. It does not save to a file, it saves to RAM which should have a faster I/O and more efficent because you can run it on diffrent computers and network them all together. You shouldn’t use memcached for things like verification numbers, if the memcached server crashes, that data is gone and the the user will have to request another verification number which can be annoying.

Why should you write tests?

Today I’m working on one of my rails games, I was looking forward to adding new functionality, but, I ran my tests and got some errors. Now you may ask yourself, why should you write tests? They get outdated quickly and they can take extra time and resources that you don’t want to spend.

There are many great reasons to use TDD (test driven development), but I think the best one is that you then know all the features and when you write a test for a feature and it returns an error, you know something is wrong. But, it’s more then that too. For example I was writing tests for my game and found that a “game record” (this game resets, so a user will have many records) was not accessible to non-logged in users. But, this was something I wanted, but overlooked because I had a before_filter to check if there was a logged in user.

Lets say you’re on a cheap-cheap-cheap-cheap budget and don’t want to pay for a TDD, should you make your programmers write tests before deployment. YES YES YES! Tests will pretend to be a user and browse your site searching for errors. So you will save money because you will not need to hire a tester or loose money because there was an error in production that prevented users from using your site in some way.

Not only do tests provide you with the sanity check, but they also go over your views and other data and make sure that everything there is fine. So even if your controller code works fine, but the view doesn’t, it will raise an error and spit out all the data needed to fix it. Tests are a huge pain in the ass sometimes, but once you get comfortable with them, you’ll be happy to have the security of knowing there is a very-very low chance of error during production.