
Can such an operation be tuned?

12 replies
ixanos
Offline
Joined: 2008-12-06
Points: 0

I have an inverted index stored in a RandomAccessFile. Whenever a search takes place, the procedure is as follows:

1) seek to a point in the file
2) read a list of numbers
3) one of these numbers points to the next list of numbers somewhere else in the file
4) seek to that next point and continue the same cycle until you reach the start of the file
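
In code, the loop looks roughly like this (simplified; the field names, the exact record layout, and the -1 end-of-chain marker are only illustrative, the real format is more involved):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

public class PostingChainReader {

    // Walks the chain for one term, following the pointer stored at the
    // start of each record, until the pointer is -1 (start of the chain).
    static List<long[]> readChain(RandomAccessFile raf, long startOffset) throws IOException {
        List<long[]> lists = new ArrayList<long[]>();
        long offset = startOffset;
        while (offset != -1) {
            raf.seek(offset);                 // random seek: the expensive part
            long previous = raf.readLong();   // where the next record in the chain lives
            int count = raf.readInt();        // how many entries this list holds (illustrative)
            long[] postings = new long[count];
            for (int i = 0; i < count; i++) {
                postings[i] = raf.readLong(); // the real data is differentially encoded
            }
            lists.add(postings);
            offset = previous;
        }
        return lists;
    }
}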

Now, if all these lists were consecutive, one could read 1-2 MB of data and do some sort of buffering. But since the data is spread across the RandomAccessFile, is there a way to optimize the read access?

NOTE: The file is too big to be loaded as a whole.

Any ideas are very much appreciated.

ixanos
Offline
Joined: 2008-12-06
Points: 0

Firstly, let me thank you all for your constructive feedback.
In summary, from this thread I will employ the following two suggestions:

1) SSD drives (hardware tuning)
2) Optimization

As mentioned, the part that could use some tuning is the one that reads randomly from the RandomAccessFile. Therefore, if we can minimize the number of hard-disk head movements needed to actually read the data, we can increase the speed of our queries. This is achieved by an optimization algorithm that processes the whole file off-line and consolidates all lists, so that only a single read is needed.
Whenever we notice a deterioration in query performance, we can simply re-run the optimization algorithm.
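
A rough sketch of the consolidation step (the record layout and the -1 end-of-chain marker are simplified here; our real format is differentially encoded):

import java.io.IOException;
import java.io.RandomAccessFile;

public class Consolidator {

    // Walks one chain in the old file and rewrites its lists back to back
    // in a new file, so a later query needs one seek plus one sequential read.
    // Returns the offset of the consolidated record in the output file.
    static long consolidateChain(RandomAccessFile in, RandomAccessFile out, long tail) throws IOException {
        long newOffset = out.length();
        out.seek(newOffset);
        long offset = tail;
        while (offset != -1) {
            in.seek(offset);
            long previous = in.readLong();    // pointer to the next record in the chain
            int count = in.readInt();
            out.writeInt(count);              // copy the list sequentially into the new file
            for (int i = 0; i < count; i++) {
                out.writeLong(in.readLong());
            }
            offset = previous;
        }
        return newOffset;                     // the term dictionary is then pointed at this offset
    }
}

After processing every chain this way, the old file can be replaced by the new one.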

tarbo
Offline
Joined: 2006-12-18
Points: 0

You could try memory mapping the file. For large files, I believe this is about the fastest non-sequential access you'll have. Memory mapping pulls blocks into memory on-demand, and pushes them back out when opportune. Further, since the mapped buffer is direct, you may save a data copy (or two), depending on your OS.
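
A minimal sketch of what I mean (the file name and field layout are placeholders; note that a single MappedByteBuffer is limited to 2 GB, so a larger file has to be mapped as several regions):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedIndexExample {
    public static void main(String[] args) throws IOException {
        RandomAccessFile raf = new RandomAccessFile("index.dat", "r"); // placeholder file name
        FileChannel channel = raf.getChannel();

        // Map a region of the file; the OS pages blocks in on demand
        // and pushes them back out when it needs the memory.
        long size = Math.min(channel.size(), Integer.MAX_VALUE);
        MappedByteBuffer map = channel.map(FileChannel.MapMode.READ_ONLY, 0, size);

        map.position(0);                  // position() replaces seek()
        long previous = map.getLong();    // read fields exactly as with RandomAccessFile
        int count = map.getInt();
        System.out.println("previous=" + previous + ", count=" + count);

        channel.close();
        raf.close();
    }
}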

Do you have a more precise definition of your data format? Can you make minor adjustments to this file format? In your four steps, you mention that "one of the above numbers" points to the next list. Is this pointer at the start or the end of the list? Do you know beforehand where it will be? Are the other numbers relevant?

If the upper part of your tree is less often modified than its lower branches, you could cache the top X layers in memory. You can potentially save a lot of repeated reads with this. You could even try to cache often-walked paths, but that's going to take a lot more effort.
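
For instance, a small LRU cache keyed on file offset (just a sketch; the value type V would be whatever decoded record you use):

import java.util.LinkedHashMap;
import java.util.Map;

// A small LRU cache of decoded records, keyed by file offset.
// Nodes near the top of the tree will tend to stay in it.
public class RecordCache<V> extends LinkedHashMap<Long, V> {
    private final int maxEntries;

    public RecordCache(int maxEntries) {
        super(16, 0.75f, true);           // access-order, so it behaves as an LRU
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<Long, V> eldest) {
        return size() > maxEntries;       // evict the least recently used record
    }
}

Before seeking, look the offset up in the cache; only go to disk on a miss.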

If the number of reads vastly outweighs the number of writes, you could consider building an entire index with your needed data and saving it in a separate file. Each time the data file changes, you need to rebuild (part of) your index, so you should probably only do this if you have more than a 10:1 read-to-write ratio; if you can do a smart, partial rebuild, you can get away with a lower threshold.

Hope some of these ideas are useful to you,
Jonathan

ixanos
Offline
Joined: 2008-12-06
Points: 0

Thanks for your answers.
The data structure is roughly like this:

long,long,int,long[] (differentially encoded)

where the first long points to the location in the file where the next list of numbers is stored. That next list obeys the very same structure. The first number always points to the next record, and the chain goes up to the start of the file, where this long becomes -1. It's a bottom-up approach.

Unfortunately, there is absolutely no reason to say that the last, middle, or top part of the file is used more or less frequently.

Tuning the whole thing with faster hard drives is a good solution and I will keep it in mind. Then again, I would first like to consider all the software and architectural options I may have.

Furthermore, the number of writes is not significant. The writes correspond to the indexing procedure, which is already tuned. It is now the search mechanism that I would like to tune further, and this mechanism only uses reads.

The idea of building a smart index on the side is also something I will consider, though I would rather avoid this approach: I am currently using three other indices (for other purposes), and the whole collection of files and indices is maintained via a complicated optimization algorithm. So, if I were to introduce a new index, I would have to re-program the optimization algorithms, and that is something I can only do if I really have time, as they are extremely complex.

peter__lawrey
Offline
Joined: 2005-11-01
Points: 0

If you are differentially encoding your values, are you also using variable-length numbers, e.g. 0-127 as one byte, larger values as two bytes, etc.?
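
That is, something like this (a sketch of the usual 7-bits-per-byte scheme; the method names are just examples):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class VarLong {
    // Writes a non-negative value in 7-bit groups: 0-127 takes one byte,
    // 128-16383 two bytes, and so on. The high bit means "more bytes follow".
    public static void writeVarLong(DataOutput out, long value) throws IOException {
        while ((value & ~0x7FL) != 0) {
            out.writeByte((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.writeByte((int) value);
    }

    public static long readVarLong(DataInput in) throws IOException {
        long value = 0;
        int shift = 0;
        byte b;
        do {
            b = in.readByte();
            value |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }
}

(RandomAccessFile implements both DataInput and DataOutput, so something like this would plug straight in.)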

ixanos
Offline
Joined: 2008-12-06
Points: 0

Only the long[] is differentially encoded. The reference to the next list is the very first long, and it is not encoded.
The exact format is:

long
int
int
long
long[]
int
long
long[]
int
long
long[]
....

peter__lawrey
Offline
Joined: 2005-11-01
Points: 0

Okay, but what is the point of differentially encoding the numbers unless you are using this to make the value-length (in bytes) smaller?

ixanos
Offline
Joined: 2008-12-06
Points: 0

This is not the issue in this thread. What I am asking is whether there is a smart way to traverse a RandomAccessFile pseudo-randomly. Is there perhaps another construct that is more appropriate? Or is there some sort of smart buffering that I could employ?

peter__lawrey
Offline
Joined: 2005-11-01
Points: 0

I think it is very much the issue. If you compact your data by, say, 2-5 times, then you make more efficient use of your disk caching and improve performance. (I assumed it was performance you wanted to tune.) Compress the data enough and it may even fit in memory, or at least more of your records will fit in memory. (I am referring to the OS disk cache; it doesn't have to be the Java heap space.)

The only smart way to traverse a file randomly is to make the access less random and more localised (so that more of your accesses will already be in memory). You could, for example, move the index/header data in the file, i.e. all the non-long[] data, to the start of the file; this may or may not improve locality.

ixanos
Offline
Joined: 2008-12-06
Points: 0

I understand, Peter.

That's why differential encoding was used in the first place. As a matter of fact, we even play with bits: if, for example, two consecutive differences are small enough, we try to squeeze them into a single byte. That's why I told you that my question refers only to how to access the file randomly in an efficient manner.
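
To illustrate the bit-level idea (this is only a sketch, not our actual encoder): when two consecutive differences both fit in 4 bits, they can share one byte.

// Sketch only: pack two small deltas (each 0-15) into a single byte.
// How the decoder knows a byte holds two values is left out here; in
// practice that needs a marker bit or similar in the surrounding format.
public class DeltaPacking {

    static byte packPair(int deltaA, int deltaB) {
        if (deltaA < 0 || deltaA > 15 || deltaB < 0 || deltaB > 15) {
            throw new IllegalArgumentException("each delta must fit in 4 bits");
        }
        return (byte) ((deltaA << 4) | deltaB);
    }

    static int[] unpackPair(byte packed) {
        int deltaA = (packed >> 4) & 0x0F;
        int deltaB = packed & 0x0F;
        return new int[] { deltaA, deltaB };
    }
}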

The long[] data cannot be placed at the top of the file, because this would mean that every time a new file is parsed I would have to re-arrange the file. That would cost too much time and is unacceptable. The way it works is that this big file receives posting lists from parsed text files. A token can occur several times in a single document and, furthermore, several times in hundreds of other documents. Every time the same token is found in a new file, it is linked to the last known list referring to this token; therefore the linkage of posting lists in this file is pseudo-random.
Of course, we could leave a fairly large chunk of space empty at the beginning of the file and fill it up as we start parsing, but that would be a different architecture, which we haven't considered.

Furthermore, I don't want to depend on a solution that tries to fit the whole file in memory, as this file may grow without bound (theoretically at least). It acts as a vault for the data parsed from the indexed files.

Finally, the search engine is very fast for an individual user. It can return hundreds of documents in less than a second, and all the usual query types are supported:
TermQueries : blabla
SuffixQueries: *blabla
PrefixQueries: blabla*
PhraseQueries: "blabla1 blabla2"
WildcardQueries: bl*bla

http://www.ixanos.com

We are now just trying to squeeze as many milliseconds as possible out of it.

Any revolutionary ideas are very welcome and may well end up incorporated into our algorithms.

tarbo
Offline
Joined: 2006-12-18
Points: 0

If you want revolutionary ideas, you have to be open to them. Revolutionary ideas, by definition, require a view outside the box one imposes by following a single thought process.

There are three main factors that determine how fast your algorithm is:
A) the cost of the repositioning (seek time);
B) the cost of reading the data;
C) the number of times you have to load new data.

If your data is truly random --and it rarely is-- then you cannot optimise your algorithm and you have no choice but to resort to upgrading your hardware. If you can find a recurring pattern, however, you have a chance at optimisation.

What Peter advocates could make a palpable difference. File caches can be fairly big, and if you can squeeze even 20% off the size of a record, your OS can keep 25% more records in cache. Especially near the top of your tree, this saving [i]will[/i] squeeze out milliseconds.

> Unfortunately, there is absolutely no reason to say
> that the last, middle, or top part of the file is used
> more or less frequently.

Your file, no, but the implicit tree structure it contains does. Your data structure is a singly linked rootwards tree. Given a pseudo-random distribution and a heck of a large file, chances are high that a significant number of your nodes will be on an often-walked path to the root (since you'll probabilistically approach a bell curve). The closer to the root a node is, the higher the chance of it being on a hot path. Of course, this is something you'll want to verify with sampling.

In the short run, and since your reads probably vastly outnumber your writes, you can use a simple caching algorithm (storing on a hash index or similar). Remember: even if the hit percentage is low, every hit you get is a seek & read you save. (And memory is a lot easier to provide concurrent access to than hard drives.)
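
As a sketch (the key and value types are placeholders, and an eviction policy is left out):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// A read-mostly cache keyed on record offset. Even a modest hit rate
// saves a seek per hit, and a ConcurrentHashMap handles concurrent
// readers far more gracefully than a disk head does.
public class PostingCache {
    private final ConcurrentMap<Long, long[]> cache = new ConcurrentHashMap<Long, long[]>();

    public long[] get(long offset) {
        return cache.get(offset);      // null means "go to disk"
    }

    public void put(long offset, long[] postings) {
        cache.put(offset, postings);   // bound the size or add eviction in real code
    }
}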

Cheers,
Jonathan

ixanos
Offline
Joined: 2008-12-06
Points: 0

Hi tarbo

Thanks for your long email. All thoughts are very much appreciated.

Some remarks:

You suggested that a significant improvement could be achieved by storing parts of the top of the file in memory. I assume this refers to the case where the file actually has an appropriate header, or is, say, a B-tree, and you suggest storing the root node and some children down to a certain level.

Then again, our file is neither a B-tree nor does it contain a header at the top. But I am currently thinking of reserving space at the top of the file and loading it into memory during application start-up. I just have not worked out yet what the contents of this header should be.

The data in this file has to be seen as random, because it contains the posting lists of words in the sequence they are read from any given file. The files are parsed by the user, and that sequence is user-defined (for us, random). The file has to be seen as a stack where you never pop() but only push(). Given a file like that, the only thing I can tell is that the file at the top was parsed first and the ones that follow were parsed later.

The file cache is not a bad idea. I could consider, every time a record is stored in this file, also storing its key numbers in a separate index (leaving the long[] out). Then at application start-up I could load these indices. This would speed up all queries except PhraseQueries, as those still need all the offsets and therefore all the numbers. But it is still a good option to consider.

In essence, what needs tuning is the number of reads. I need to prepare some sort of distribution of the number of reads per word and then take things from there. If I can minimize these reads, the whole process should be fairly well tuned.
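
A simple way I am thinking of gathering that distribution (a sketch only; the word strings are just whatever key we search on) is to count the seeks performed per word during normal operation:

import java.util.HashMap;
import java.util.Map;

// Sketch: count how many seeks each word costs, so we can see the
// distribution of reads per word before deciding where to optimize.
public class SeekStats {
    private final Map<String, Integer> seeksPerWord = new HashMap<String, Integer>();

    public void recordSeek(String word) {
        Integer n = seeksPerWord.get(word);
        seeksPerWord.put(word, n == null ? 1 : n + 1);
    }

    public Map<String, Integer> snapshot() {
        return new HashMap<String, Integer>(seeksPerWord);
    }
}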

Any changes I decide to make that result in an improvement I will publish in this thread.

peter__lawrey
Offline
Joined: 2005-11-01
Points: 0

If the data doesn't fit in memory, the critical performance factor will be the behaviour of the disk drive. It's worth noting that a typical hard drive can only do about 120 random reads per second. This means that tuning may come down to minimising the number of random accesses you perform (even if it means consuming more CPU or other resources).

The simplest way to improve this is to buy a fast SSD, which can perform up to 10,000 random accesses per second (because it has no moving parts).