Skip to main content

Searching an ArrayList

4 replies [Last post]
zman771
Offline
Joined: 2008-01-23
Points: 0

I have an ArrayList of strings that I need to search. The search has to be a 'like' search and not an exact match search. e.g. if the arraylist is ["google", "groups", "java"] and I search "o" it should return ["google", "groups"] (or index 0 and 1).
I looked at a binary search but as far as I can tell it will only return exact matches.

Reply viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.
swv
Offline
Joined: 2007-05-28
Points: 0

The thing is, it appears to me from your question that you'd have a long way to go from where you are to where you need to be in terms of things to learn ...that's not to say you can't do it, because you can, it's just to say that there is no simple solution because the problem itself isn't simple.

You want to develop a sophisticated notion of similarity in words. Then you want to use that notion to compare words and sentences. That involves you immediately in things like stemming words, stop words, distance measures, tries of all sorts including suffix tries, and lots of other stuff. For instance, here's a page, drawn at random, which is dedicated to the notion of the similarity between two strings:

http://www.dcs.shef.ac.uk/~sam/stringmetrics.html

It's not that what you want is unreasonable or arcane, it's that what you want is ...interesting.....it's basically a part of what goes by the name of "information retrieval".

Here's a link from IBM
http://www.ibm.com/developerworks/java/library/j-text-searching.html

Remember this, a lot of what's good about Google is this just this subject, and they sort of set the standard.. people expect that level of performance with respect to this topic, and anything else just seems broken.

If you just have an assignment or a deadline to meet, then the best thing you can do is get a notion in your head about the ideaspace you seem to have landed in so as to better understand what it is you're not going to be implementing and what sort of performance you're not going to have... just what to expect from yourself and any code or data structures you might find out there.

On the other hand if you're scratching an itch then you've found an interesting topic that you can learn a lot about and perhaps even contribute to since the field is still wide open. Here's a good book to get you started:

http://www.amazon.com/Modern-Information-Retrieval-Ricardo-Baeza-Yates/d...

swv
Offline
Joined: 2007-05-28
Points: 0

What you're asking for is not a direct function of ArrayList, but more like a program. Implementing "like" or similarity searches against text is a subject unto itself, which starts with defining very precisely what you mean by "close" (yes, that's a bit of an oxymoronic statement) and ends with runtime analysis of your chosen data structure .

That said, if you were able to define similar in some good enough and simple way, you could implement an appropriate compare() function for a Comparable class and essentially rate the goodness (closeness) of each word.

There's always Lucene, the open source library.

But in general what you're describing isn't a feature of any class, it's a field of study- the stuff a PhD thesis is made of.

zman771
Offline
Joined: 2008-01-23
Points: 0

I only mentioned arraylist as a starting point - i know it's not what i really need.
I agree with your Comparable class, but i didn't think that was the fastest way to do it.

are there better datatypes other ways of doing this that's faster than a simple arraylist implementation?

thanks

tarbo
Offline
Joined: 2006-12-18
Points: 0

If you search for exact substrings (as in, no regular expressions), you can use [url=http://en.wikipedia.org/wiki/Suffix_tree]suffix trees[/url]. To my knowledge, Java SE has no suffix tree implementation, but they're pretty straightforward to build. In this case, you would add all your words to the suffix tree, and you can quickly search all your words in a single go. You'll need to pay attention to store [i]all[/i] your entries, though, and not to overwrite them in any way in the tree (which is sometimes done, cfr. compression or "any" matching).

If you do require regular expressions, you have two options. Either you go the easy way and just attempt to match each element in your collection against the regular expression, or you go the hard way and extend suffix trees to support regular expressions (which is certainly feasible but not entirely straightforward).

I think I have an implementation of suffix trees lying around, but it's not the variety you would need -- it's a compression tree.

Hope this helps,
Jonathan