Search A Word In A File In Java
A simple technique that could well be considerably faster than indexOf is to use a Scanner with the method findWithinHorizon. If you use a constructor that takes a File object, Scanner will internally create a FileChannel to read the file. And for matching it will end up using a Boyer-Moore algorithm for efficient string searching.
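A minimal sketch of that approach might look like the following; the file name "haystack.txt" and the search term "needle" are placeholders, a horizon of 0 tells the Scanner to search the whole input, and the try-with-resources needs Java 7+ (on older JDKs you would close the Scanner in a finally block):

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;
import java.util.regex.Pattern;

public class FindWordWithScanner {
    public static void main(String[] args) throws FileNotFoundException {
        String target = "needle";              // placeholder search term
        File file = new File("haystack.txt");  // placeholder file name

        try (Scanner scanner = new Scanner(file)) {
            // Pattern.quote makes regex metacharacters in the target literal;
            // a horizon of 0 means "search the whole input".
            String hit = scanner.findWithinHorizon(Pattern.quote(target), 0);
            System.out.println(hit != null ? "Found \"" + target + "\"" : "Not found");
        }
    }
}
```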
If you don't have access to JDK 5 but you can use JDK 1.4, then you can use java.util.regex and FileChannel yourself, but it will be more complex. Reading the file in chunks (e.g. into a CharBuffer) is a good idea, but you need to consider that the target string may be split between reads. If you know the maximum size of the target string (and if it's not a prohibitively large number), you can handle this by copying length-1 chars from the end of the last read to the beginning of the buffer for the next read. You'll have to be careful to get this right, though. Simply reading the file line by line and searching each line individually (preferably using java.util.regex) may be acceptable, if you're certain that the target string does not contain any line separators. It will still most likely be slower than the other techniques suggested so far, because it takes extra time to transfer each and every character into a char[] and then into a String object.
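Here is a sketch of that chunk-plus-overlap idea. For brevity it is written against a modern JDK and decodes with a plain Reader rather than driving FileChannel and a CharsetDecoder by hand; the buffer size, file name and target are placeholders, and the point is only the carry-over of target.length() - 1 chars between reads:

```java
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.nio.CharBuffer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ChunkedSearch {

    // Returns true if 'target' occurs anywhere in the stream read from 'reader'.
    // The buffer keeps target.length() - 1 trailing chars between reads so a
    // match that straddles a chunk boundary is not missed.
    static boolean contains(Reader reader, String target) throws IOException {
        Pattern pattern = Pattern.compile(Pattern.quote(target));
        int overlap = Math.max(target.length() - 1, 0);
        char[] buf = new char[64 * 1024 + overlap];
        int kept = 0;                          // chars carried over from the last chunk
        int read;
        while ((read = reader.read(buf, kept, buf.length - kept)) != -1) {
            int total = kept + read;
            Matcher m = pattern.matcher(CharBuffer.wrap(buf, 0, total));
            if (m.find()) {
                return true;
            }
            // Carry the last 'overlap' chars to the front of the buffer.
            kept = Math.min(overlap, total);
            System.arraycopy(buf, total - kept, buf, 0, kept);
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        // "haystack.txt" and "needle" are placeholders for illustration.
        try (Reader reader = new FileReader("haystack.txt")) {
            System.out.println(contains(reader, "needle"));
        }
    }
}
```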
The nice thing about NIO classes is they allow most of that stuff to happen outside the JVM, and thanks to Boyer-Moore you don't really need to look at all the characters anyway. But reading line by line does have the advantage of being a common idiom that most people understand. If you can't use a Scanner and findWithinHorizon, then reading line by line will probably be simpler than anything else I've suggested. And simplicity is almost always a virtue in programming.
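A line-by-line version, assuming the target never spans a line separator, could be as simple as this sketch (file name and search term are again placeholders):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Pattern;

public class LineSearch {
    public static void main(String[] args) throws IOException {
        Pattern pattern = Pattern.compile(Pattern.quote("needle")); // placeholder target
        try (BufferedReader in = new BufferedReader(new FileReader("haystack.txt"))) {
            String line;
            int lineNo = 0;
            while ((line = in.readLine()) != null) {
                lineNo++;
                if (pattern.matcher(line).find()) {
                    System.out.println("Found on line " + lineNo + ": " + line);
                }
            }
        }
    }
}
```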
Search For A Word In A File Command Line
I understand the horizon parameter to be a limit on the amount of data to search: public String findWithinHorizon(Pattern pattern, int horizon) attempts to find the next occurrence of the specified pattern. A scanner will never search more than horizon code points beyond its current position. If horizon is 0, then the horizon is ignored and this method continues to search through the input looking for the specified pattern without bound. It looks like your code should work.
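For instance, a bounded search might look like this sketch (the horizon value, file name and search term are arbitrary placeholders):

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;
import java.util.regex.Pattern;

public class HorizonDemo {
    public static void main(String[] args) throws FileNotFoundException {
        File file = new File("haystack.txt");   // placeholder file name
        try (Scanner scanner = new Scanner(file)) {
            // Only the next 1,000,000 code points from the scanner's current
            // position are examined; passing 0 would search the whole input.
            String hit = scanner.findWithinHorizon(Pattern.quote("needle"), 1000000);
            System.out.println(hit != null ? "found within horizon" : "not found within horizon");
        }
    }
}
```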
Hi all, I tried a mini benchmarking exercise comparing the performance of searching a 1 GB text file using the method described here and also using String.indexOf. The trial was done with about 15 different strings as the search parameter. Using regex and NIO: 120 to 210 seconds. Using indexOf: 70 to 100 seconds. As shown above, the results are not very encouraging, and I find that in many cases indexOf performed at least 25% faster.
Another thing I noticed with this approach is that the time taken varies widely depending on the string being searched for, whereas it was more or less consistent with the indexOf method. Hence I doubt that using regex patterns is the best way to search for plain strings in a large input. Please let me know your comments.

I'd like to see the exact code you're using at this point.
Is it the same code that was throwing OutOfMemoryError, or has it been further modified? Boyer-Moore definitely will be more advantageous the longer the target string is, as Stan said. It sounds like the strings you're looking for are short enough that it doesn't offer a great advantage.
Also, I guess Scanner is such a generalized class, designed to be able to do so many different things, that it's not necessarily optimized to be as fast as possible at all of them. If you want to continue optimizing this stuff, it would probably be worthwhile to start using a profiler on the code to see what's really taking up most of the time. Guesswork has taken us this far, but real data would be preferable.
Lucene is the index-and-search Java library I hear most about, but I've never used it. MS SQL Server (I've used its full-text indexes) and probably other RDBMSs ship with full-text indexing and search. There are probably simpler solutions, but it totally depends on what your constraints are. Is mirroring the directory in a RAM disk and doing a grep an acceptable solution? How would we know? That deep an analysis is beyond the scope of Code Review SE.
As it stands, your current approach might be too slow for your goal of 'about 4-5 seconds'. Depending on your actual use case, using an index might indeed be a good idea. This is similar to how internet search engines do it. To create the index:

- create a Map<String, List<File>> holding the association of search terms to files
- go through all your files, and for each word in each file, add that file to the list corresponding to that word in your index map
- you might want to skip common words, such as 'a', 'and', 'the', etc., or you could even apply a stemmer to drastically reduce the variability in words (a sketch of such an index follows below)

Once you've created the index (which could take quite some time), you just have to look up the search word in that index to get the list of files that contain the word (or a linguistic variant of it, if you used a stemmer). As said before, the applicability of this approach depends heavily on your actual use case. If the files contain genetic sequences and you are searching for a certain pattern, this probably won't help much. Even if you are searching for some complex phrase this might not work, as you'd have to add each possible (sub)phrase to the index. But if you are looking for individual words in ordinary text files (or HTML or the like) this might work.
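A minimal sketch of such an inverted index, assuming UTF-8 plain-text files and simple split-on-non-letters tokenization. The class name SimpleIndex is made up, it stores a Set<Path> per word rather than a List<File> to avoid duplicates, and it does no stop-word removal or stemming:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class SimpleIndex {

    // word (lower-cased) -> files that contain it
    private final Map<String, Set<Path>> index = new HashMap<>();

    // Index every regular file under 'dir', splitting each line on anything
    // that is not a letter or digit. Assumes UTF-8 text files.
    public void addDirectory(Path dir) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(dir)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        for (Path file : files) {
            for (String line : Files.readAllLines(file)) {
                for (String word : line.toLowerCase().split("[^\\p{L}\\p{Nd}]+")) {
                    if (!word.isEmpty()) {
                        index.computeIfAbsent(word, k -> new HashSet<>()).add(file);
                    }
                }
            }
        }
    }

    // Returns the files that contain the given word (empty set if none do).
    public Set<Path> filesContaining(String word) {
        return index.getOrDefault(word.toLowerCase(), Collections.emptySet());
    }
}
```

Building the index over the whole directory is the expensive part; each subsequent single-word lookup is then just a map access.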
Update: Since you seem to indeed be searching for complex phrases, you could try the following:

- create the index, as described above, optionally using a stemmer
- search for each word in your phrase (or the stemmed version thereof) in your index
- for each file that, according to your index, contains all the words, do a full-text search for the original phrase (see the sketch after this list)
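A hedged sketch of that lookup, building on the hypothetical SimpleIndex class from the earlier sketch; the tokenization mirrors the index, and the final check is a plain case-sensitive substring test:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

public class PhraseSearch {

    // Uses the index only to narrow down candidate files, then verifies the
    // exact phrase with a full-text check on each remaining candidate.
    public static Set<Path> search(SimpleIndex index, String phrase) throws IOException {
        // Tokenize the phrase the same way the index tokenizes file contents.
        String[] words = phrase.toLowerCase().trim().split("[^\\p{L}\\p{Nd}]+");

        // Intersect the file sets of all words in the phrase.
        Set<Path> candidates = new HashSet<>(index.filesContaining(words[0]));
        for (int i = 1; i < words.length && !candidates.isEmpty(); i++) {
            candidates.retainAll(index.filesContaining(words[i]));
        }

        // Full-text check: only files that contain every word are re-read.
        Set<Path> hits = new HashSet<>();
        for (Path file : candidates) {
            String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            if (text.contains(phrase)) {   // exact, case-sensitive phrase match
                hits.add(file);
            }
        }
        return hits;
    }
}
```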
Finally, if this still does not cut it, you could also create indexes for. If you are hitting a bottleneck in terms of sheer hardware, you may move to a distributed model using something like MapReduce, which conceptually works just like a distributed awk / grep. Enter big words like Big Data, which really is trading compute time on one machine (an expensive one, most of the time) for less compute time on more machines (commodity hardware, most of the time) by following divide-and-conquer principles.

I wouldn't take this route unless you have identified that your solution is going to hit those limits. But consider this:

- what happens if your files grow over time?
- what happens if you have more files over time?

Distributed systems scale horizontally (by adding machines), whereas a single machine does not scale well vertically (single-machine specs cost $$$).