
How to parse HTML pages in Java?

10 replies [Last post]
mw
Offline
Joined: 2003-06-12

I want to parse an HTML page. For example, I want to get the data from a table in the HTML and put it into a JTable or an array. How can I do this by myself?

Please give me an example.

Thank you all.

mouned
Offline
Joined: 2009-12-30

I changed the code a little so that it only prints the tables:

[code]
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.HTMLEditorKit.Parser;

public class skaner2 {
    final Parser parser;

    public static void main(String[] args) throws IOException {
        skaner2 scanner = new skaner2();
        final Map<Tag, Integer> map = scanner.scanHierarchy(new File("exemple2.html"/*args[0]*/));
        // sort keys by values, most frequent first
        List<Tag> list = new ArrayList<Tag>(map.keySet());
        Collections.sort(list, new Comparator<Tag>() {
            public int compare(Tag key1, Tag key2) {
                return map.get(key2) - map.get(key1);
            }
        });
        /*for (Tag t : list) {
            System.out.format("%-10s %s\n", t, map.get(t));
        }*/
    }

    public skaner2() {
        parser = (new ScannerHTMLEditorKit()).getParser();
    }

    public Map<Tag, Integer> scanHierarchy(File file) throws IOException {
        Map<Tag, Integer> map = new HashMap<Tag, Integer>();
        scanHierarchyImpl(file, map);
        return map;
    }

    private void scanHierarchyImpl(File file, Map<Tag, Integer> map) throws IOException {
        if (file.isDirectory()) {
            for (File f : file.listFiles()) {
                scanHierarchyImpl(f, map);
            }
        } else if (file.isFile()) {
            String name = file.getName();
            if (name.endsWith(".html") || name.endsWith(".htm")) {
                scan(file, map);
            }
        }
    }

    public void scan(File file, Map<Tag, Integer> map) throws IOException {
        Reader in = new BufferedReader(new FileReader(file));
        try {
            scan(in, map);
        } finally {
            in.close(); // do not leak the file handle
        }
    }

    public Map<Tag, Integer> scan(File file) throws IOException {
        Map<Tag, Integer> map = new HashMap<Tag, Integer>();
        scan(file, map);
        return map;
    }

    public void scan(Reader in, Map<Tag, Integer> map) throws IOException {
        // true = ignore charset declarations, so no ChangedCharSetException is thrown
        parser.parse(in, new ScannerParserCallback(map), true);
    }

    public Map<Tag, Integer> scan(Reader in) throws IOException {
        Map<Tag, Integer> map = new HashMap<Tag, Integer>();
        scan(in, map);
        return map;
    }

    //we need this class only to get the default html parser
    //the returned parser creates a new parser on every parse call
    //so one parser is enough
    private static class ScannerHTMLEditorKit extends HTMLEditorKit {
        @Override
        public Parser getParser() {
            return super.getParser();
        }
    }

    private static class ScannerParserCallback extends HTMLEditorKit.ParserCallback {
        final Map<Tag, Integer> map;

        ScannerParserCallback(Map<Tag, Integer> map) {
            this.map = map;
        }

        @Override
        public void handleStartTag(Tag t, MutableAttributeSet a, int pos) {
            if (t == Tag.TABLE) {
                // blank lines before each table
                System.out.print("\n\n");
            }
            if (t == Tag.TD) {
                // cell separator
                System.out.print("|");
            }
            if (t == Tag.TR) {
                // each row starts on a new line
                System.out.println();
            }
            //handleSimpleTag(t, a, pos);
        }

        @Override
        public void handleSimpleTag(Tag t, MutableAttributeSet a, int pos) {
            Integer integer = map.get(t);
            int counter = (integer != null) ? integer.intValue() : 0;
            map.put(t, ++counter);
        }

        @Override
        public void handleText(char[] data, int pos) {
            System.out.print(data);
            System.out.print(" ");
        }

        @Override
        public void handleEndTag(Tag t, int pos) {
            if (t == Tag.TABLE) {
                // blank lines after each table
                System.out.println("\n\n");
            }
        }
    }
}
[/code]

But I have not managed to retrieve each HTML table as a matrix that I can then use or convert to XML or a JTable.

Each HTML table should become one matrix, and the number of rows and columns of the table gives the size of the matrix associated with it.

Please help me; I have been working on this for 25 days and have not found a solution yet.

Thank you for taking the trouble to read my message.
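
For illustration, here is a minimal sketch of one way to collect each table as a matrix instead of printing it. It uses the same Swing parser as the code above, through the public javax.swing.text.html.parser.ParserDelegator class. The names TableExtractor, parse and toMatrix are made up for this example, and it assumes the tables are not nested and that only the text of each td/th cell matters.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects every table as a list of rows, each row a list of cell texts.
class TableExtractor extends HTMLEditorKit.ParserCallback {
    private final List<List<List<String>>> tables = new ArrayList<>();
    private List<List<String>> table; // table currently being read, or null
    private List<String> row;         // row currently being read, or null
    private StringBuilder cell;       // cell currently being read, or null

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.TABLE) {
            table = new ArrayList<>(); // assumption: tables are not nested
        } else if (t == HTML.Tag.TR && table != null) {
            row = new ArrayList<>();
        } else if ((t == HTML.Tag.TD || t == HTML.Tag.TH) && row != null) {
            cell = new StringBuilder();
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        if (cell != null) {           // text outside a cell is ignored
            if (cell.length() > 0) {
                cell.append(' ');
            }
            cell.append(data);
        }
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        if ((t == HTML.Tag.TD || t == HTML.Tag.TH) && cell != null && row != null) {
            row.add(cell.toString().trim());
            cell = null;
        } else if (t == HTML.Tag.TR && row != null && table != null) {
            table.add(row);
            row = null;
        } else if (t == HTML.Tag.TABLE && table != null) {
            tables.add(table);
            table = null;
        }
    }

    // Parses already-decoded HTML text and returns all tables found in it.
    static List<List<List<String>>> parse(String html) {
        TableExtractor extractor = new TableExtractor();
        try {
            // true = ignore charset declarations; the input is already a String
            new ParserDelegator().parse(new StringReader(html), extractor, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return extractor.tables;
    }

    // Converts one table to a String[][]; rows may have different lengths.
    static String[][] toMatrix(List<List<String>> table) {
        String[][] matrix = new String[table.size()][];
        for (int r = 0; r < matrix.length; r++) {
            matrix[r] = table.get(r).toArray(new String[0]);
        }
        return matrix;
    }

    public static void main(String[] args) {
        String html = "<html><body><table>"
                + "<tr><td>0.1</td><td>0.2</td><td>0.3</td></tr>"
                + "<tr><td>0.4</td><td>0.5</td><td>0.6</td></tr>"
                + "</table></body></html>";
        for (List<List<String>> t : parse(html)) {
            System.out.println(Arrays.deepToString(toMatrix(t)));
        }
    }
}
```

A matrix produced this way could then be handed to new DefaultTableModel(matrix, columnNames) to back a JTable, where columnNames is an array of headers you supply yourself.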

idk
Offline
Joined: 2005-01-12

Hi,

>OK, I changed this code (not much) and I would like to print all the text from the HTML to the console. Here is the code:

Your code:
[code]
public void scan(Reader in, Map<Tag, Integer> map) throws IOException {
    parser.parse(in, new ScannerParserCallback(map), false);
}
[/code]

My code:
[code]
public void scan(Reader in, Map<Tag, Integer> map) throws IOException {
    parser.parse(in, new ScannerParserCallback(map), true);
}
[/code]

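The last argument of HTMLEditorKit.Parser.parse is
[b]boolean ignoreCharSet[/b].

It indicates whether the parser should ignore charset declarations in the document.
When it is false and the page declares a charset, the parser reports this by throwing ChangedCharSetException, which the caller is expected to handle.
(See javax.swing.text.JEditorPane.read(InputStream in, Document doc) for the details.)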

Thanks,
Igor
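
To make the charset handling concrete, here is a minimal sketch, not the JDK's own implementation, of catching ChangedCharSetException and parsing again with the declared charset. The class name CharsetAwareParser and its methods are hypothetical. It assumes the whole page is available as a byte array and that ISO-8859-1 is an acceptable first guess for decoding.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

import javax.swing.text.ChangedCharSetException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: finds the charset a page declares, then parses with it.
class CharsetAwareParser {

    // Pulls a charset name out of the exception, or returns null if there is none.
    static String charsetName(ChangedCharSetException e) {
        String spec = e.getCharSetSpec();
        if (spec == null) {
            return null;
        }
        if (e.keyEqualsCharSet()) {
            return spec.trim(); // spec is the bare charset name
        }
        // otherwise spec looks like "text/html; charset=UTF-8"
        int i = spec.toLowerCase(Locale.ROOT).indexOf("charset=");
        if (i < 0) {
            return null;
        }
        String name = spec.substring(i + "charset=".length());
        int end = name.indexOf(';');
        if (end >= 0) {
            name = name.substring(0, end);
        }
        return name.replace("\"", "").trim();
    }

    // First pass: parse with ignoreCharSet = false and a do-nothing callback,
    // only to see whether the page declares a charset.
    static Charset detectCharset(byte[] html, Charset fallback) {
        Reader in = new InputStreamReader(
                new ByteArrayInputStream(html), StandardCharsets.ISO_8859_1);
        try {
            new ParserDelegator().parse(in, new HTMLEditorKit.ParserCallback(), false);
        } catch (ChangedCharSetException e) {
            String name = charsetName(e);
            try {
                if (name != null) {
                    return Charset.forName(name);
                }
            } catch (IllegalArgumentException unknownCharset) {
                // malformed or unsupported name: use the fallback below
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return fallback;
    }

    // Second pass: decode with the detected charset and ignore the declaration.
    static void parse(byte[] html, HTMLEditorKit.ParserCallback callback) {
        Charset charset = detectCharset(html, StandardCharsets.ISO_8859_1);
        Reader in = new InputStreamReader(new ByteArrayInputStream(html), charset);
        try {
            new ParserDelegator().parse(in, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] page = ("<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=UTF-8\"></head><body>x</body></html>")
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(detectCharset(page, StandardCharsets.ISO_8859_1));
    }
}
```

Parsing twice keeps the real callback from seeing the start of the document twice. If the encoding of the page does not matter, passing true as in the code above remains the simpler choice.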

ashish260381
Offline
Joined: 2006-02-14

Hi Igor,

Thanks, my problem is solved now.

Thanks

maming2000
Offline
Joined: 2006-08-23

I use JSF 1.1 and I like the dataTable, but it only offers columns and has no colspan attribute.
What can I do when I want to lay out a table the way HTML 4.01 allows?
For example:

A
B C

How can I do something like this with JSF?

mw
Offline
Joined: 2003-06-12

Sorry, but I downloaded jnetkits and wiseuploadi from this page, and I don't see any HTML parser there.

Message was edited by: mw

jetbrains
Offline
Joined: 2005-12-15

There is an HTML parser in the Lucene sample, with source code.

java email verify
http://www.wisesoft.biz/

mw
Offline
Joined: 2003-06-12

OK, I changed this code (not much) and I would like to print all the text from the HTML to the console. Here is the code:

[code]
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.HTMLEditorKit.Parser;

public class skaner {
    final Parser parser;

    public static void main(String[] args) throws IOException {
        skaner scanner = new skaner();
        final Map<Tag, Integer> map = scanner.scanHierarchy(new File("g:/bankier.html"/*args[0]*/));
        // sort keys by values, most frequent first
        List<Tag> list = new ArrayList<Tag>(map.keySet());
        Collections.sort(list, new Comparator<Tag>() {
            public int compare(Tag key1, Tag key2) {
                return map.get(key2) - map.get(key1);
            }
        });
        for (Tag t : list) {
            System.out.format("%-10s %s\n", t, map.get(t));
        }
    }

    public skaner() {
        parser = (new ScannerHTMLEditorKit()).getParser();
    }

    public Map<Tag, Integer> scanHierarchy(File file) throws IOException {
        Map<Tag, Integer> map = new HashMap<Tag, Integer>();
        scanHierarchyImpl(file, map);
        return map;
    }

    private void scanHierarchyImpl(File file, Map<Tag, Integer> map) throws IOException {
        if (file.isDirectory()) {
            for (File f : file.listFiles()) {
                scanHierarchyImpl(f, map);
            }
        } else if (file.isFile()) {
            String name = file.getName();
            if (name.endsWith(".html") || name.endsWith(".htm")) {
                scan(file, map);
            }
        }
    }

    public void scan(File file, Map<Tag, Integer> map) throws IOException {
        Reader in = new BufferedReader(new FileReader(file));
        try {
            scan(in, map);
        } finally {
            in.close(); // do not leak the file handle
        }
    }

    public Map<Tag, Integer> scan(File file) throws IOException {
        Map<Tag, Integer> map = new HashMap<Tag, Integer>();
        scan(file, map);
        return map;
    }

    public void scan(Reader in, Map<Tag, Integer> map) throws IOException {
        parser.parse(in, new ScannerParserCallback(map), false);
    }

    public Map<Tag, Integer> scan(Reader in) throws IOException {
        Map<Tag, Integer> map = new HashMap<Tag, Integer>();
        scan(in, map);
        return map;
    }

    //we need this class only to get the default html parser
    //the returned parser creates a new parser on every parse call
    //so one parser is enough
    private static class ScannerHTMLEditorKit extends HTMLEditorKit {
        @Override
        public Parser getParser() {
            return super.getParser();
        }
    }

    private static class ScannerParserCallback extends HTMLEditorKit.ParserCallback {
        final Map<Tag, Integer> map;

        ScannerParserCallback(Map<Tag, Integer> map) {
            this.map = map;
        }

        @Override
        public void handleStartTag(Tag t, MutableAttributeSet a, int pos) {
            System.out.println("hhhhhhh" + t.toString());
            if (t == Tag.TABLE) {
                System.out.println("TABLE BEGINS");
            }
            if (t == Tag.TD) {
                System.out.println("we have a new column");
            }
            handleSimpleTag(t, a, pos);
        }

        @Override
        public void handleSimpleTag(Tag t, MutableAttributeSet a, int pos) {
            Integer integer = map.get(t);
            int counter = (integer != null) ? integer.intValue() : 0;
            map.put(t, ++counter);
        }

        @Override
        public void handleText(char[] data, int pos) {
            System.out.println(data);
        }

        // ParserCallback.handleEndTag takes (Tag, int); a version with an extra
        // MutableAttributeSet parameter is never called by the parser.
        // The end tag is reported as "table", not "/table".
        @Override
        public void handleEndTag(Tag t, int pos) {
            if (t == Tag.TABLE) {
                System.out.println("TABLE ENDS");
            }
            System.out.println("aaaasssssssssssssss");
        }
    }
}
[/code]

When I use the page from http://gielda.onet.pl/a,p,notowania.html (of course I downloaded the whole page) instead of "g:/bankier.html", it doesn't work:
"Exception in thread "main" javax.swing.text.ChangedCharSetException" (this is the problem).

For g:/bankier.html it works really well. Here is the code of the page bankier.html:

0.1 0.2 0.3
0.4 0.5 0.6


Please help me ...
regards
mw

ashish260381
Offline
Joined: 2006-02-14

Hi,

The code is very good, no doubt, and thanks for it,
but there is a problem: the program throws an exception when our HTML file has a tag.
Please help me out.

Thanks

idk
Offline
Joined: 2005-01-12

Hi,

You can use the Parser from javax.swing.text.html (HTMLEditorKit.Parser).

This is something I wrote to find out which HTML tags are used and how often:
[code]
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.HTMLEditorKit.Parser;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;

/**
 * A simple html scanner.
 * Helps to find what html tags were used and how often.
 * Example usage:
 * <pre>
 * HTMLScanner scanner = new HTMLScanner();
 * //scans all the html files under _file_
 * Map&lt;Tag, Integer&gt; map = scanner.scanHierarchy(_file_);
 * </pre>
 *
 * This class has a {@code main} method and could be used as standalone program.
 * <pre>
 * java HTMLScanner _path_
 * </pre>
 * Scans all the html files under _path_ and prints tags frequency table to stdout
 * sorted by frequency.
 *
 * @author Igor.Kushnirskiy@sun.com
 */
public class HTMLScanner {
    final Parser parser;

    public static void main(String[] args) throws IOException {
        HTMLScanner scanner = new HTMLScanner();
        final Map<Tag, Integer> map = scanner.scanHierarchy(new File(args[0]));
        // sort keys by values, most frequent first
        List<Tag> list = new ArrayList<Tag>(map.keySet());
        Collections.sort(list, new Comparator<Tag>() {
            public int compare(Tag key1, Tag key2) {
                return map.get(key2) - map.get(key1);
            }
        });
        for (Tag t : list) {
            System.out.format("%-10s %s\n", t, map.get(t));
        }
    }

    public HTMLScanner() {
        parser = (new ScannerHTMLEditorKit()).getParser();
    }

    public Map<Tag, Integer> scanHierarchy(File file) throws IOException {
        Map<Tag, Integer> map = new HashMap<Tag, Integer>();
        scanHierarchyImpl(file, map);
        return map;
    }

    private void scanHierarchyImpl(File file, Map<Tag, Integer> map) throws IOException {
        if (file.isDirectory()) {
            for (File f : file.listFiles()) {
                scanHierarchyImpl(f, map);
            }
        } else if (file.isFile()) {
            String name = file.getName();
            if (name.endsWith(".html") || name.endsWith(".htm")) {
                scan(file, map);
            }
        }
    }

    public void scan(File file, Map<Tag, Integer> map) throws IOException {
        Reader in = new FileReader(file);
        try {
            scan(in, map);
        } finally {
            in.close(); // do not leak the file handle
        }
    }

    public Map<Tag, Integer> scan(File file) throws IOException {
        Map<Tag, Integer> map = new HashMap<Tag, Integer>();
        scan(file, map);
        return map;
    }

    public void scan(Reader in, Map<Tag, Integer> map) throws IOException {
        // true = ignore charset declarations in the document
        parser.parse(in, new ScannerParserCallback(map), true);
    }

    public Map<Tag, Integer> scan(Reader in) throws IOException {
        Map<Tag, Integer> map = new HashMap<Tag, Integer>();
        scan(in, map);
        return map;
    }

    //we need this class only to get the default html parser
    //the returned parser creates a new parser on every parse call
    //so one parser is enough
    private static class ScannerHTMLEditorKit extends HTMLEditorKit {
        @Override
        public Parser getParser() {
            return super.getParser();
        }
    }

    private static class ScannerParserCallback extends ParserCallback {
        final Map<Tag, Integer> map;

        ScannerParserCallback(Map<Tag, Integer> map) {
            this.map = map;
        }

        @Override
        public void handleStartTag(Tag t, MutableAttributeSet a, int pos) {
            handleSimpleTag(t, a, pos);
        }

        // counts one occurrence of the tag
        @Override
        public void handleSimpleTag(Tag t, MutableAttributeSet a, int pos) {
            Integer integer = map.get(t);
            int counter = (integer != null) ? integer.intValue() : 0;
            map.put(t, ++counter);
        }
    }
}
[/code]

Thanks,
Igor

leouser
Offline
Joined: 2005-12-12

Some links that may help:
http://java-source.net/open-source/html-parsers
http://java.sun.com/products/jfc/tsc/articles/bookmarks/

Or maybe even try reading it in via DOM or SAX and getting your data that way. This might not work if your HTML is not well-formed. Maybe some simple munging of the data could make it parseable.

leouser
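
To illustrate the DOM route, here is a minimal sketch that only works when the page is well-formed XML (for example XHTML). It uses the JDK's javax.xml.parsers API. The class name DomTableReader and the method firstTable are made up for this example, and it assumes the first table contains no nested tables, because getElementsByTagName also returns rows and cells of nested ones.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: reads the first table of a well-formed document into a matrix.
class DomTableReader {

    static String[][] firstTable(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            NodeList tables = doc.getElementsByTagName("table");
            if (tables.getLength() == 0) {
                return new String[0][];
            }
            Element table = (Element) tables.item(0);
            NodeList rows = table.getElementsByTagName("tr");
            String[][] matrix = new String[rows.getLength()][];
            for (int r = 0; r < rows.getLength(); r++) {
                NodeList cells = ((Element) rows.item(r)).getElementsByTagName("td");
                matrix[r] = new String[cells.getLength()];
                for (int c = 0; c < cells.getLength(); c++) {
                    matrix[r][c] = cells.item(c).getTextContent().trim();
                }
            }
            return matrix;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // typical cause: the HTML is not well-formed XML
            throw new IllegalArgumentException("Input could not be parsed as XML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><table>"
                + "<tr><td>0.1</td><td>0.2</td><td>0.3</td></tr>"
                + "<tr><td>0.4</td><td>0.5</td><td>0.6</td></tr>"
                + "</table></body></html>";
        System.out.println(Arrays.deepToString(firstTable(xhtml)));
    }
}
```

Ordinary HTML with unclosed tags such as br or meta makes the XML parser fail, which is the well-formedness caveat mentioned above.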