0
votes

I am creating my own mp3 player. For the mp3 player I need a song searching facility same as VLC Media Player and Rhythmbox and other media players, in which one can search for a song by giving artist/track/album name.

As an example consider these are 4 songs with their respective meta data

Track                     Artist                            Album

Dear Agony                Breaking Benjamin                 Test Name
Radioactive               Imagine Dragons                   Billboard
Feel Good Drag            Anberlin                          Random
Khamaj                    Fuzon                             Tere Liye

Now suppose I give search query: ag then result should be this:

Dear Agony                Breaking Benjamin                 Test Name
Radioactive               Imagine Dragons                   Billboard
Feel Good Drag            Anberlin                          Random

because first three song have some occurence of ag in the meta data however the fourth track doesn't have any hence it shouldn't be listed.

All the mp3 files will have all these data present in them, and I know how to extract this data from them. The real challenge is which data structure to use and how to use that data structure to implement this.

Especially if user's song playlist is very big then a efficient retrieval of results is required. Please suggest some data structures which I can implement to achieve this. By the way I am using Python

3
You can take SQL database like Firebird or SQLite, make a table for the songs having all the columns as fields from ID3/UD3 specifications, and adding three more columns: row ID (unique number, primary key), song file name and total text string ( ALL_TAGS COMPUTED BY TAG_ALBUM || ' ' || TAG_BAND || ' ' || TAG_TITLE || ... ) and then do like SELECT ID FROM SONGS WHERA ALL_TAGS LIKE '%ag%'` Would be a linear search reading from HDD, so not very fast (about 100 MBytes/sec for raw speed). But simple. And in memory everythign would be fast...Arioch 'The
en.wikipedia.org/wiki/Inverted_index is the basis of all FTS engines, but you would have to make a compromise what is the minimum reasonable length. Many forums disable text search unless you key in at least FOUR letters, so i think 2- or 3-letters indexes become way too large to be efficiently maintainedArioch 'The
@Arioch'The: I would like to know how I can use Inverted Index for searchingkhirod
Most FTS engines are open-source - read and learn - both data structures and algorithms operating them. Some basis information can be ggogled in the net. OF course latest know-hows by top-rank engines like Google, Yandex, etc - are probably unpublished or patent-covered. But you do not need that much.Arioch 'The

3 Answers

2
votes

If you break the list of words of all data (track name, band, album) you could have a hashtable indexed by the words with a linked list as the value containing all the tracks that have that occurrence.

For the searching you could have a B+ tree to index the words to get to the keys of the hashtable (more or less like a word processor does for the autocorrection).

Cheers

2
votes

The easiest way to do this, of course, is to sequentially search the records, doing a string.contains check on each of the fields for each of the records. Provided all of the metadata is in memory, this turns out to be not terribly inefficient even if you have tens of thousands of songs. Remember, this doesn't have to be blindingly fast. Users will probably be willing to wait a few hundred milliseconds for their results.

It really depends on how you construct your user interface. For example, if the user types "a", you have to go through the entire list of songs to find those that contain that letter anywhere in the metadata. If the user then types "g", you don't have to go through the entire list of songs. You only have to look at the list that you already have: those that start with "a". Considering that the most frequent bigram in English ("th") occurs in about 2.5% of words, by the time the user has typed two or three characters you're working with lists so small (at most a few hundred items) that a naive sequential search is plenty fast enough.

If you want to do it faster, you have to build a trie and insert every n-gram. The accepting states contain a list of the records that contain that n-gram. It takes a while to build, and the resulting data structure is pretty big because of all the references at the accepting states. Even optimized, there is one reference for each letter of a particular word. For example, the word "Agony" ends up having five references. Updating the trie when you add or remove songs isn't especially difficult.

You can do much the same thing with a dictionary or hash map, with the n-gram being the key, but it's much more difficult then to combine references. Using a dictionary, the word "Agony" would end up storing a reference in the map for "a", "ag", "ago", ... "o", "on", "ony", "n", "ny", "y". So rather than length references for each word, you end up with (length^2 - length)/2 references per word.

I once used a hybrid approach. I build a tree of bigrams with references so that I could do the initial lookup very quickly on the first two letters. Then I'd sequentially search those results. So if the user typed "ago", I'd go to the trie and find every item that had "ag" in the metadata. Then I'd sequentially search those items for "ago". Because the second list was usually relatively small, this was surprisingly fast. And building a trie of bigrams didn't take up a huge amount of space.

My suggestion would be to build the sequential search first. Then if it's too slow, implement the hybrid approach above.

1
votes

This might be a bit of an overkill, but you can look into sorl: link