In the previous entry, I wrote about parsing the StackOverflow XML dump files.
This entry is about setting up SQLite full text search for the StackOverflow posts data. In Smalltalk, set up the database schema:
| schema db |
schema := 'create table posts (Id integer primary key, ...)'.
db := SQLiteConnection fileNamed: 'so.db'.
db open.
db executeQuery: schema.
db close.
The key SAX parsing callback method becomes, in essence, as follows:
startElement: aQualifiedName attributes: aDict
| stmt |
"db is the instance variable with the database connection."
stmt := db prepare: 'insert into posts values (?,...,?)'.
aQualifiedName = 'row' ifTrue: [
stmt at: 1 putInteger: (aDict at: 'Id') asInteger.
...
stmt at: 20 putInteger: (aDict at: 'FavoriteCount' ifAbsent: [-1]) asInteger.
stmt step. stmt clearBindings. stmt reset ].
stmt finalize.
On my computer, running through the 6+ million posts took about an hour. The resultant SQLite database file is about 6.5GB in size. Here are the comparative file sizes:
-rw-r--r-- 1 pierce staff 7569879502 Sep 7 2011 posts.xml
-rw-r--r-- 1 pierce staff 6819280896 Nov 24 20:31 so.db
Next, let's set up full text search on this database:
% sqlite3 so.db
sqlite> create virtual table sposts using fts4 (id, body, title);
sqlite> insert into sposts select id, body, title from posts;
sqlite> create virtual table stags using fts4 (id, tags);
sqlite> insert into stags select id, tags from posts;
sqlite>
I didn't time how long the INSERTs took. Setting up full text search boosted the database's file size:
-rw-r--r-- 1 pierce staff 6819280896 Nov 24 20:31 so.db (before FTS)
-rw-r--r-- 1 pierce staff 16491300864 Nov 25 21:34 so.db (after FTS)
Now search the database using the sqlite3 shell tool:
% sqlite3 so.db
sqlite> select count(*) from stags where tags match 'smalltalk';
332
sqlite> select max(id) from stags where tags match 'smalltalk';
7260027
sqlite> select body, tags from posts where id = 7260027;
<p>what's the best server side programming language for web development
if we keep very good performance and a lot of database interaction of
the script with server in mind. i am already working on a project with
php, i'm fearing that my application might have a lot of traffic, i
just want to use only the server side language which is best in terms
of performance, and please also guide me any major step i should take
to prevent my application from crashing when it will face a huge
traffic, what major step i might need to take to stop myself crying
after launching my applicaition, i heard that twitter use ruby, how's
ruby's performance? my concern is only and only performance and
database interaction, thanks in advance.</p>
|<php><python><ruby-on-rails><ruby><smalltalk>
sqlite>
Tags: full text search, SQLite, StackOverflow