Music and Audio Retrieval Tools

How to Install MaART on Trustix Linux

Introduction
Motivation
Installation Overview
Create a User
Setup the Perl Modules Required by UBH
Setup Usenet Binary Harvester
Manage the files fetched by UBH
Add Files to the mididb Index
Automate the Process
Install the web interface
Install the CGI Script
Maintenance
Issues, Improvements and the Future

Introduction

This HOW-TO explains the steps taken to install an auto-retrieving MIDI file index with web-based access (and possibly a web-service). The target system is Trustix Secure Linux 3.0, which is a security-conscious distribution and intended for server use. It was originally based on Red Hat without the graphical features, so it is likely commands given here will translate well to Red Hat.

Motivation

One of the goals of the Music and Audio Retrieval Tools project is to produce a musical equivalent of web search engines such as Google and Yahoo!. Indeed there is an increasing trend for the major search engines, such as Yahoo!, to support some form of search for music (typically identifying files by their metadata). Indexing the web requires each page to be visited (e.g. using a web crawler / spider / robot): an active task which requires large amounts of bandwidth and computing power to achieve an effective level of coverage of music on the web.

An alternative to downloading each page on the web, and one which is enough to prove the MaART technology, is to fetch the music files which are exchanged on the Usenet (a.k.a. newsgroups). The groups can be polled on a daily basis and any new files fetched, cleaned, stored and indexed. This is less "hit and miss" than trawling the web for appropriate files and so requires less bandwidth.

Installation Overview

The installation described here will create a new user (called maart) which will run an update script on a daily basis (using cron). The script will use Usenet Binary Harvester (UBH) to fetch all MIDI and karaoke files new to the alt.binaries.sounds.midi newsgroups. Each file will be processed to ensure it is valid and is not already in the collection (based on the content of the file, rather than the name), then added to the collection and indexed. The collection will be searchable, based on melody, from a web interface.

Note that this install was performed by someone who was not an expert Linux user: There is no guarantee that the methods used are the best / most secure. This also goes some way to explaining the hand-holding nature of some of the text - the installation was a learning experience for the author and this document is a way to capture the knowledge for future reference. Please use the MaART support mechanisms to report any errors / omissions / improvements to this text.

Create a User

Creating a new user is not required for the software to work, but it does allow the permissions to be specific to maart (e.g. not being able to 'su' to root). Create a user 'maart', which will run the cron jobs and host the web pages. Log in as (or 'su' to) root and type:

    useradd -m maart

Then set a password (the following will prompt for one, set to "maart" in this case):

    passwd maart

Then ensure you can log in as this user.

Setup the Perl Modules Required by UBH

Trustix v3.0 supports CPAN, as most modern Linux installations will. Log in as root and invoke the CPAN installation of the modules required:

    perl -MCPAN -e 'install Net::NNTP'
    perl -MCPAN -e 'install News::Newsrc'
    perl -MCPAN -e 'install MIME::Parser'
    perl -MCPAN -e 'install MIME::Base64'
    perl -MCPAN -e 'install String::CRC32'

The following might have been asked for when installing the above:

    perl -MCPAN -e 'install Set::IntSpan'
    perl -MCPAN -e 'install IO::Stringy'

Pete has some useful perl notes. In particular, I had a problem with "Can't locate News/Newsrc.pm in @INC" when running ubh and found that the directories in /usr/lib/perl5/site_perl/5.8.7 were not readable or executable by anyone other than root. So, in the absence of the correct solution, I typed (as root and from the site_perl directory):

    find . -type d -exec chmod ugo+rx {} \;
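Once the modules are installed, a quick check (run as the maart user rather than root) shows whether Perl can actually load them. This is only a sketch: the function name check_perl_modules is made up for this example, and it relies on nothing more than perl's standard -M and -e switches.

```shell
#!/bin/sh
# Report whether each Perl module named on the command line can be loaded.
# The return status is the number of modules that failed to load.
check_perl_modules() {
    missing=0
    for module in "$@"; do
        if perl -M"$module" -e 1 2>/dev/null; then
            echo "ok      $module"
        else
            echo "MISSING $module"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# The modules installed above for ubh.
check_perl_modules Net::NNTP News::Newsrc MIME::Parser MIME::Base64 String::CRC32 \
    || echo "Some modules could not be loaded"
```

Any line marked MISSING points to either a module that is not installed or a permissions problem like the one described above.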

Setup Usenet Binary Harvester

A full package containing v2.5 of UBH can be found at http://ubh.sourceforge.net/download.html. The archive should be extracted to the home directory of maart and the extracted directory renamed to ubh. It is recommended that the latest revision of the 2.6b1 build (rev 1.79 at the time of writing) be fetched and saved as ubh. Note that development of version 3.0 has started, but it uses mysql for cache files, which adds to the complexity.

Alternatively, an unofficial 2.6b1 package is available as either a zip or tarball from the maart website. Extract the ubh-2_6b1.tar.gz archive in the home directory of maart and rename the directory:

    wget http://maart.sourceforge.net/resources/ubh-2_6b1.tar.gz
    tar xvzf ubh-2_6b1.tar.gz
    mv  ubh-2_6b1 ubh

In the ubh directory, create a text file called ".newsrc" listing the newsgroups to fetch files from. Below is a starting point for midi files:

~/ubh/.newsrc

alt.binaries.sounds.midi:
alt.binaries.sounds.midi.alternative:
alt.binaries.sounds.midi.beatles:
alt.binaries.sounds.midi.blues:
alt.binaries.sounds.midi.classical:
alt.binaries.sounds.midi.country:
alt.binaries.sounds.midi.ethnic:
alt.binaries.sounds.midi.games:
alt.binaries.sounds.midi.jazz:
alt.binaries.sounds.midi.originals:
alt.binaries.sounds.midi.pop:
alt.binaries.sounds.midi.rap:
alt.binaries.sounds.midi.rock:
alt.binaries.sounds.midi.soul:
alt.music.midi:

The Usenet Binary Harvester requires a configuration file to operate. It defaults to looking for a file called .ubhrc in the maart home directory. To make the file more visible, and easier to maintain, a ubhrc file (without the dot) is created in the ~/ubh directory. The file used for maart is shown below. See the ubh README file for full details of the available options.

~/ubh/ubhrc

NNTPSERVER = news.btinternet.com
NEWSRCNAME = .newsrc

# Define both of these only if your server requires them.
# You must define BOTH of these.
# ACCOUNT = fred
# PASSWORD = flint+stone

DATADIR = data



# Define the extensions of the files we want to fetch
MULTI_EXT = (?i)mid|zip|kar|midi
SINGLE_EXT = (?i)mid|zip|kar|midi

# Skip any files with the same name.
# We will probably pick them up next time, and letting ubh create
# unique filenames makes recovering the original name difficult
OPT_O = skip

# Mark all articles ubh inspects as read (which cleans up the .newsrc).
# Probably don't want this option if another program uses news too.
OPT_z = 1

Create the data and temporary / cache directories, then run ubh from the top-level ubh directory:

    cd ~/ubh
    mkdir data
    mkdir temp
    ./ubh -c ubhrc

Manage the files fetched by UBH

The files downloaded might be duplicates or renamed versions of files in the collection, or broken. The MaART collector utility can be used to create a collection of files. Because of the flexibility the midi file format allows in the representation of note on/off sequences, some files might be duplicates with slightly different note encoding. To address this, the collector normalises each midi file before searching for duplicates. If the file is in the collection under a different name then a note is made of the alternative name.
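To illustrate the idea of matching on content rather than name, the sketch below groups byte-identical files using the standard cksum utility. This is not how the collector works (it normalises the MIDI data first, so it also catches files that differ only in their encoding); the function name duplicate_files is hypothetical.

```shell
#!/bin/sh
# Print the paths of files under directory $1 whose content matches
# that of at least one other file (same CRC checksum and same size).
duplicate_files() {
    find "$1" -type f -exec cksum {} + | sort | awk '
        {
            key = $1 " " $2                      # checksum and size
            name = $0
            sub(/^[0-9]+ [0-9]+ /, "", name)     # strip them to leave the path
        }
        key == prev { if (!shown) print prevname; print name; shown = 1 }
        key != prev { shown = 0 }
        { prev = key; prevname = name }'
}

# Small demonstration with throwaway files.
demo=$(mktemp -d)
printf 'tune'  > "$demo/a.mid"
printf 'tune'  > "$demo/renamed.mid"
printf 'other' > "$demo/b.mid"
duplicate_files "$demo"      # lists a.mid and renamed.mid, but not b.mid
rm -rf "$demo"
```

A checksum match is only a strong hint, so a careful version would confirm each pair with cmp before treating the files as duplicates.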

Download the MaART source code from the SourceForge web site, extract it and compile the collector:

    cd ~maart
    tar xvzf maart-src-*.tar.gz
    cd maart
    make collector

Next, set up the collection. Create a directory for the collection (somewhere with space, which home directories often lack):

    mkdir -p /data/midi/collection

Create a configuration file for the collector (as the default is to leave the original files) and place it in the /data/midi/collection/ directory:

/data/midi/collection/collector.cfg

indexes.main = collection/index.idx
indexes.alt = collection/altnames.idx
indexes.reject = collection/reject_files.idx
indexes.pitch = collection/midi/contours.idx

paths.root = collection
paths.broken = collection/broken
paths.uninteresting = collection/short
paths.leave_originals = false

Uninteresting.min_length = 25

Run the collector with the list of downloaded files (ensuring we ignore directories and zip files):

    cd /data/midi/collection
    find ~maart/ubh/data/ -type f | grep -iv '\.zip$' > /tmp/fetchedfiles
    ~maart/maart/collector/collector -l /tmp/fetchedfiles
    rm /tmp/fetchedfiles

We can then list the files added to the collection (ignoring directories and index files):

    find collection -mmin -720 -type f | grep -v '\.idx$' > /tmp/newfiles

Note: The option "-mmin -720" was added to find during testing. This finds files modified in the last 720 minutes (12 hours), as the collector was not (initially) configured to delete the incoming files - only newly downloaded files require processing.
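The effect of the age filter can be tried out safely on a scratch directory. The following is a sketch only; the function name recent_files and the demonstration files are invented for the example, and -mmin is assumed to be available (it is in GNU find).

```shell
#!/bin/sh
# List files under $1 modified within the last $2 minutes,
# leaving out any whose name ends in the extension $3.
recent_files() {
    find "$1" -mmin "-$2" -type f | grep -v "\\.$3\$" | sort
}

demo=$(mktemp -d)
touch "$demo/new.mid" "$demo/new.idx"
touch -t 200001010000 "$demo/old.mid"    # pretend this file is years old
recent_files "$demo" 720 idx             # prints only the path of new.mid
rm -rf "$demo"
```

The old file is skipped because of its modification time and the index file because of its extension, which is the same selection the commands above make on the real collection.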

Add Files to the mididb Index

Compile the midi database software:

    cd ~maart/maart
    make mididb

Create the database and add the collected files:

    mkdir /data/midi/collection/index
    cd /data/midi/collection/index
    ~maart/maart/mididb/mididb -x create
    cd ..
    ~maart/maart/mididb/mididb -f index/sbh addlist /tmp/newfiles

Check the database works by searching for one of the files collected:

    ~maart/maart/mididb/mididb -x -f index/sbh findmidi `head -n1 /tmp/newfiles`

Automate the Process

Having confirmed that the steps involved in adding files work, we can make a script which can be run on a daily basis. The script, created as ~maart/autoadd.sh (shown below), calls ubh, the collector and mididb, using the find command to identify new files. A summary (a list of the new files) is written to standard output and a fuller log file is saved to /tmp. To keep things tidy, approximately seven log files are retained.

Once the script has been written, it is a good idea to run it - remembering to make it executable (chmod u+x autoadd.sh). Once the script is confirmed as working, a cron job can be added to regularly fetch files from the newsgroups. The following command can be used to edit (or create) the crontab file for the current user (maart):

    crontab -e

Add the following line to run the script at 4:15 AM each day:

    15 4 * * * ~maart/autoadd.sh

~maart/autoadd.sh

#!/bin/sh

# This script invokes ubh to download files from newsgroups, adds them to
# a collection and indexes them with mididb.

DATE=`date -I`
LOGDIR=/tmp
LOGBASE=autoadd
LOGFILE=${LOGDIR}/${LOGBASE}-${DATE}.log
echo Logging to $LOGFILE

# Fetch the files from newsgroups
cd ~maart/ubh
./ubh -c ubhrc >> $LOGFILE

# Add the files to a collection
cd /data/midi/collection
find ~maart/ubh/data/ -mmin -720 -type f | grep -iv '\.zip$' > /tmp/fetchedfiles
echo Fetched `wc -l /tmp/fetchedfiles` | tee -a $LOGFILE
~maart/maart/collector/collector -l /tmp/fetchedfiles >> $LOGFILE
#rm /tmp/fetchedfiles

# Note how many new files we have
find collection -mmin -720 -type f | grep -v '\.idx$' > /tmp/newfiles
echo New files found `wc -l /tmp/newfiles` | tee -a $LOGFILE

# Add the files to the index
~maart/maart/mididb/mididb -x -f index/sbh addlist /tmp/newfiles | tee -a $LOGFILE
#rm /tmp/newfiles

# Delete old log files
echo Deleting the following old log files:
find ${LOGDIR}/${LOGBASE}-*.log -mtime +7 -print -exec rm -f '{}' \;

# Let the user know we finished properly
echo Finished autoadd

Install the web interface

Before starting this, it is worth checking that httpd is running ("/sbin/service httpd status") and starting it if it is not ("/sbin/service httpd start"). If you want the server to start on system startup, use an initscript utility, e.g. "/sbin/chkconfig httpd on". You will most likely need to be root to do this.

At least on Trustix, the user directories have to be enabled before a user can publish web pages. A Directory tag was added to the httpd.conf file, based on an example in there. The entry below only enables the maart home directory, also enabling cgi scripts to be run from there (the ExecCGI and AddHandler options):

    <Directory /home/users/maart/public_html>
        AllowOverride FileInfo AuthConfig Limit
        Options MultiViews Indexes SymLinksIfOwnerMatch IncludesNoExec ExecCGI
        AddHandler cgi-script .cgi
        <Limit GET POST OPTIONS PROPFIND>
            Order allow,deny
            Allow from all
        </Limit>
        <LimitExcept GET POST OPTIONS PROPFIND>
            Order deny,allow
            Deny from all
        </LimitExcept>
    </Directory>

Note: The location of your httpd.conf is set when httpd is compiled. I found mine by looking at the output of "httpd -V" for HTTPD_ROOT and SERVER_CONFIG_FILE. On this system, the path was /etc/httpd/conf/httpd.conf. For more information on enabling CGI, there is a howto on the Apache web site.
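The two values can be combined by hand, or with a small filter such as the sketch below. It assumes the usual "httpd -V" output format, with lines such as -D HTTPD_ROOT="/etc/httpd"; the function name httpd_conf_path is invented for this example.

```shell
#!/bin/sh
# Read "httpd -V" output on standard input and print the full path
# of the configuration file that httpd was compiled to use.
httpd_conf_path() {
    awk -F'"' '
        /HTTPD_ROOT=/         { root = $2 }
        /SERVER_CONFIG_FILE=/ { conf = $2 }
        END {
            if (conf ~ /^\//) print conf        # already an absolute path
            else print root "/" conf            # relative to the server root
        }'
}

# Only query httpd if it is actually on the PATH.
if command -v httpd >/dev/null 2>&1; then
    httpd -V | httpd_conf_path
fi
```

If httpd is not on the PATH of the current user, the same pipeline can be run with the full path to the binary (often under /usr/sbin).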

Create a public_html directory in the maart home directory (as maart). Note that permissions need setting for httpd to access it:

    cd ~
    mkdir public_html
    chmod go+rx public_html

Extract the template search pages in the public_html directory and change the permissions so they can be accessed:

    cd public_html
    tar xzvf maart_search.tar.gz
    find . -exec chmod go+r {} \;
    find . -type d -exec chmod go+x {} \;

Now is a good time to check progress by pointing a browser to http://server/~maart/search

Install the CGI Script

Compile the CGI script

Build the version of the midi database which supports musical queries submitted through CGI and copy it to the public_html directory (with a .cgi file extension):

    cd ~/maart
    make cgi
    cd ../public_html
    cp ~/maart/mididb/cgi/mdbcgi search/cgi-bin/mdbcgi.cgi

The mdbcgi program requires a configuration file to identify the database, enable certain search features and set the look and feel of the search results pages. An example configuration file is shown below. Note that this allows the user to query the time indexes of search terms in a particular piece and to display text strings in a piece. These features require access to the MIDI files and can take a significant amount of CPU time (compared to the original search), so you might want to disable them on highly-loaded servers.

~/public_html/search/cgi-bin/mdbcgi.cfg

html.head = head.inc
html.tail = tail.inc

# Set the default distance limit
options.d = 0
# Set the default near match limit
options.n = 10000
# Set the default file limit
options.r = 100

config.db = /data/midi/collection/index/sbh
config.midibasepath = /data/midi/collection/
config.midibaseurl = http://server/~maart/search/cgi-bin/redir.cgi?ID=
config.scripturl = http://server/~maart/search/cgi-bin/mdbcgi.cgi

allow.timeindex = 1
allow.tempomap = 0
allow.showcontours = 0
allow.displaytext = 1

# Non-zero if the user should be able to indicate the correct file
allow.feedback = 0

pictures.playgif = http://server/~maart/search/images/play.gif
pictures.indexgif = http://server/~maart/search/images/hrglass.gif
pictures.textgif = http://server/~maart/search/images/text.gif
pictures.tempomapgif = http://server/~maart/search/images/clock.gif
pictures.contourgif = http://server/~maart/search/images/contour.gif
pictures.correctgif = http://server/~maart/search/images/tick.gif
pictures.incorrectgif = http://server/~maart/search/images/cross.gif

The midibaseurl setting is used to identify the root URL of the MIDI files (i.e. as it would appear to external users). This can be left blank but, if specified, the mdbcgi program will insert a play icon into the search results. The MIDI files on this example installation do not form part of the web site. The web server could be configured to expose the files, but a redirection script (redir.cgi), written in perl, was chosen instead. The script is fairly basic at the moment but could be expanded to limit access to files, e.g. implement user quotas, allow access only from certain IP addresses, or limit access to copyright-cleared files. The redir.cgi script should be copied to the cgi-bin directory and modified for your installation.

Maintenance

Following the setup given in this document, the cron job will email output to the maart user. Whilst this is only a summary of the actions, namely the latest additions to the collection, the output will build up over time. Because of this, the mail for the maart user should be checked periodically. Alternatively, the cron job could be modified not to email the output.
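As an illustration of the second option, the crontab entry can discard the script's output, which stops cron from sending mail (cron only mails when a job writes something). This is a sketch of one way to do it; nothing is lost, as the script already keeps its own fuller log in /tmp.

```
# crontab entry for maart (edit with: crontab -e)
# Discard standard output and standard error so that cron sends no mail.
15 4 * * * ~maart/autoadd.sh > /dev/null 2>&1
```

The drawback is that failures also become silent, so the log files in /tmp are then the only place to look when something goes wrong.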

Issues, Improvements and the Future

The installation, as detailed here, implements a melody search of songs in a database of MIDI files that will automatically grow over time, using Usenet as a source of files. While this is a fully working system, there are a couple of issues which don't affect the functionality but are admittedly a little untidy. These could be addressed with a little more time than was available for this article. There are also some obvious improvements which could be implemented relatively simply, and some others which are best described as future work.

Broken files are reported by the collector. This should be fixed, maybe by deleting / moving broken files before adding to the index. Note that the reason for keeping them at all is for development purposes, as the file might not be broken after all and instead not be understood by the MIDI parser due to a bug.

The current traffic of the midi groups includes a lot of files with names including the string "[met tekst]", believed to be Dutch for "with text" - i.e. a karaoke file. Not all of these are recognised as karaoke files, which needs to be looked at. Once addressed, the text should be removed from the filename by the update scripts.
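Removing the tag from a name is a one-line job for sed. The sketch below is an assumption about how the update scripts might do it (the function name strip_met_tekst is invented) and it only handles the exact lower-case spelling quoted above.

```shell
#!/bin/sh
# Print the filename given as $1 with the "[met tekst]" tag,
# and any spaces directly before it, removed.
strip_met_tekst() {
    printf '%s\n' "$1" | sed 's/ *\[met tekst\]//'
}

strip_met_tekst 'Yesterday [met tekst].kar'    # prints: Yesterday.kar
```

Names without the tag pass through unchanged, so the function could be applied to every fetched file.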

Zip files are currently fetched but not added. Each zip file downloaded could be extracted to a temporary directory (e.g. /tmp/midi-zip/) from which the midi and karaoke files could be added to the collection before deleting the directory (i.e. rm -rf /tmp/midi-zip).
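A possible shape for the zip handling is sketched below. It assumes the unzip utility is installed; the function names, and the idea of unpacking each archive into its own numbered subdirectory (to avoid name clashes between archives), are choices made for this example rather than part of MaART.

```shell
#!/bin/sh
# List MIDI and karaoke files below directory $1 (case-insensitive match).
midi_files_in() {
    find "$1" -type f | grep -Ei '\.(mid|midi|kar)$' | sort
}

# Unpack each zip file named after the work directory ($1) and print
# the music files found. The caller removes the work directory afterwards.
unpack_zips() {
    workdir=$1
    shift
    n=0
    for zipfile in "$@"; do
        n=$((n + 1))
        mkdir -p "$workdir/$n"
        unzip -q -o "$zipfile" -d "$workdir/$n" \
            || echo "Could not unpack $zipfile" >&2
    done
    midi_files_in "$workdir"
}

# Intended use inside the update script (not run here):
#   workdir=$(mktemp -d /tmp/midi-zip.XXXXXX)
#   unpack_zips "$workdir" ~maart/ubh/data/*.zip > /tmp/zipfiles
#   ~maart/maart/collector/collector -l /tmp/zipfiles
#   rm -rf "$workdir"
```

Using mktemp rather than a fixed /tmp/midi-zip directory avoids trouble if a previous run was interrupted and left files behind.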

A nice improvement might be to generate an HTML list of the latest additions. If the web site is not intended to be a download site for MIDI files, the list of additions would need obfuscating in some manner, to prevent a user writing a script that uses the additions to 'leech' the new files. Naturally, if a midi download/library web site is the desired end goal, automatically generating HTML index pages would be a desirable improvement.
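As a starting point, the list of new files written by the update script could be turned into an HTML fragment along the following lines. This is a sketch: the function name additions_html is invented, and it deliberately prints only the file names (not their paths or any links) so the output says nothing about where the files are stored.

```shell
#!/bin/sh
# Read file paths on standard input (one per line) and print an HTML
# list of the file names, escaping the characters special to HTML.
additions_html() {
    echo '<ul>'
    while IFS= read -r path; do
        name=${path##*/}                       # drop the directory part
        escaped=$(printf '%s\n' "$name" |
            sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g')
        printf '  <li>%s</li>\n' "$escaped"
    done
    echo '</ul>'
}

# Intended use at the end of the update script (not run here):
#   additions_html < /tmp/newfiles > ~maart/public_html/search/latest.inc
```

The output file name latest.inc is hypothetical; any page in the search templates could include the fragment.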

A web service [1] [2] has already been implemented in MaART (using gSOAP). Adding this to the installation would allow the database to be queried by external programs more appropriately than submitting CGI requests and parsing the HTML returned.

Adding a text search engine would complement the musical search. There are a number of open source text search engines which implement fast and efficient indexing, e.g. Lucene (written in Java but ported to a number of other languages).

As well as identifying duplicate files, the collector also notes alternative names for files. These names might be translations into different languages or, where an artist is mentioned, can be due to different artists performing cover versions of songs (where an existing MIDI file has simply been renamed). Showing the alternative names in the search results would be helpful. Implementing this would then make management of the names a higher priority (e.g. selecting best names, disallowing others).

Adding a musical summary to the search results would make the list of files much easier to interpret. MaART already implements a midi file summarisation algorithm. Creating this summary at the time the file is added to the index is feasible. The summary should probably not be stored as a midi file but in a format suitable for display in the results list. This could either be a static graphic or something more forward thinking, such as a text format (e.g. abc) which could be rendered into a stave display on the user's machine using a Java applet.

To give feedback on this article, or to submit text implementing an improvement, please use the forums on the MaART project page or email me: beeka at users dot sourceforge dot net.