Class AbstractDictionary
java.lang.Object
org.apache.lucene.analysis.cn.smart.hhmm.AbstractDictionary
Direct Known Subclasses:
BigramDictionary, WordDictionary
SmartChineseAnalyzer abstract dictionary implementation.
Contains methods for dealing with GB2312 encoding.
-
Field Summary
Fields
| Modifier and Type | Field | Description |
| --- | --- | --- |
| static final int | CHAR_NUM_IN_FILE | Dictionary data contains 6768 Chinese characters with frequency statistics. |
| static final int | GB2312_CHAR_NUM | Last Chinese Character in GB2312 (87 * 94). |
| static final int | GB2312_FIRST_CHAR | First Chinese Character in GB2312 (15 * 94). Characters in GB2312 are arranged in a grid of 94 * 94; rows 0-14 are unassigned or punctuation. |
-
Constructor Summary
Constructors
AbstractDictionary()
-
Method Summary
| Modifier and Type | Method | Description |
| --- | --- | --- |
| String | getCCByGB2312Id(int ccid) | Transcode from GB2312 ID to Unicode |
| short | getGB2312Id(char ch) | Transcode from Unicode to GB2312 |
| long | hash1(char c) | 32-bit FNV Hash Function |
| long | hash1(char[] carray) | 32-bit FNV Hash Function |
| int | hash2(char c) | djb2 hash algorithm (k=33), first reported by Dan Bernstein many years ago in comp.lang.c. |
| int | hash2(char[] carray) | djb2 hash algorithm (k=33), first reported by Dan Bernstein many years ago in comp.lang.c. |
-
Field Details
-
GB2312_FIRST_CHAR
public static final int GB2312_FIRST_CHAR
First Chinese Character in GB2312 (15 * 94). Characters in GB2312 are arranged in a grid of 94 * 94; rows 0-14 are unassigned or punctuation.
-
GB2312_CHAR_NUM
public static final int GB2312_CHAR_NUM
Last Chinese Character in GB2312 (87 * 94). Characters in GB2312 are arranged in a grid of 94 * 94; rows 88-94 are unassigned.
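The arithmetic behind these two constants can be sketched as follows. This is an illustration only: the class and method names are hypothetical, and treating 87 * 94 as an exclusive upper bound is an assumption, not something this page states.

```java
/** Sketch of GB2312 grid arithmetic. All names here are illustrative, not part of Lucene. */
final class Gb2312GridSketch {
    static final int GRID_WIDTH = 94;              // cells per row in the 94 * 94 grid
    static final int FIRST_CHAR = 15 * GRID_WIDTH; // 1410, mirrors GB2312_FIRST_CHAR
    static final int CHAR_NUM = 87 * GRID_WIDTH;   // 8178, mirrors GB2312_CHAR_NUM

    /** Flattens a zero-based (row, column) cell into a single id. */
    static int toId(int row, int col) {
        return row * GRID_WIDTH + col;
    }

    /** Assumption: ids in [FIRST_CHAR, CHAR_NUM) fall in the Chinese character rows. */
    static boolean isChineseId(int id) {
        return id >= FIRST_CHAR && id < CHAR_NUM;
    }

    public static void main(String[] args) {
        System.out.println(toId(15, 0));                // 1410
        System.out.println(isChineseId(toId(14, 93))); // false: row 14 is below the first Chinese row
    }
}
```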
-
CHAR_NUM_IN_FILE
public static final int CHAR_NUM_IN_FILE
Dictionary data contains 6768 Chinese characters with frequency statistics.
-
-
Constructor Details
-
AbstractDictionary
AbstractDictionary()
-
-
Method Details
-
getCCByGB2312Id
public String getCCByGB2312Id(int ccid)
Transcode from GB2312 ID to Unicode. GB2312 is divided into a 94 * 94 grid, containing 7445 characters consisting of 6763 Chinese characters and 682 symbols. Some regions are unassigned (reserved).
Parameters:
ccid - GB2312 id
Returns:
unicode String
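One way such a transcoding could work is to split the id into a grid row and column, map both to EUC-CN bytes by adding 0xA1, and decode with the JDK's GB2312 charset. The sketch below is a standalone illustration with hypothetical names, not necessarily how this class implements the method. It assumes the runtime ships the GB2312 charset (part of the `jdk.charsets` module).

```java
import java.nio.charset.Charset;

/** Sketch: GB2312 grid id to Unicode. Names are illustrative, not part of Lucene. */
final class Gb2312DecodeSketch {
    // Assumption: the GB2312 charset is available in this runtime.
    private static final Charset GB2312 = Charset.forName("GB2312");

    static String toUnicode(int ccid) {
        int row = ccid / 94; // zero-based row in the 94 * 94 grid
        int col = ccid % 94; // zero-based column
        // EUC-CN stores each coordinate as a byte offset from 0xA1.
        byte[] bytes = { (byte) (row + 0xA1), (byte) (col + 0xA1) };
        return new String(bytes, GB2312);
    }

    public static void main(String[] args) {
        // Id 1410 (15 * 94) is the first Chinese character, U+554A.
        System.out.println(toUnicode(1410).equals("\u554A")); // true
    }
}
```

Ids that fall in unassigned cells have no mapping, so the decoder substitutes a replacement character rather than throwing.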
-
getGB2312Id
public short getGB2312Id(char ch)
Transcode from Unicode to GB2312
Parameters:
ch - input character in Unicode, or character in Basic Latin range.
Returns:
position in GB2312
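The reverse direction can be sketched by encoding the character with the JDK's GB2312 charset and converting the two resulting bytes back into a grid position. The names are hypothetical, and returning -1 for anything that does not encode to exactly two bytes (Basic Latin, unmappable characters) is an assumption of this sketch, not documented behaviour of the method.

```java
import java.nio.charset.Charset;

/** Sketch: Unicode character to GB2312 grid id. Names are illustrative, not part of Lucene. */
final class Gb2312EncodeSketch {
    // Assumption: the GB2312 charset is available in this runtime.
    private static final Charset GB2312 = Charset.forName("GB2312");

    static int toGb2312Id(char ch) {
        byte[] bytes = String.valueOf(ch).getBytes(GB2312);
        if (bytes.length != 2) {
            return -1; // sketch-only convention for single-byte or unmappable input
        }
        int row = (bytes[0] & 0xFF) - 0xA1; // undo the EUC-CN byte offset
        int col = (bytes[1] & 0xFF) - 0xA1;
        return row * 94 + col;
    }

    public static void main(String[] args) {
        System.out.println(toGb2312Id('\u554A')); // 1410
        System.out.println(toGb2312Id('A'));      // -1
    }
}
```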
-
hash1
public long hash1(char c)
32-bit FNV Hash Function
Parameters:
c - input character
Returns:
hashcode
-
hash1
public long hash1(char[] carray)
32-bit FNV Hash Function
Parameters:
carray - character array
Returns:
hashcode
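For orientation, a textbook 32-bit FNV-1a hash over a char array might look like the sketch below. It is not taken from this class and may differ from what hash1 actually computes: the choice of the FNV-1a variant, the byte order (low byte, then high byte of each char) and all names are assumptions.

```java
/** Sketch of textbook 32-bit FNV-1a. Names and byte order are illustrative assumptions. */
final class Fnv1aSketch {
    static final long OFFSET_BASIS = 0x811C9DC5L; // standard 32-bit FNV offset basis
    static final long PRIME = 0x01000193L;        // standard 32-bit FNV prime
    static final long MASK = 0xFFFFFFFFL;         // keeps the running value within 32 bits

    static long hash(char[] chars) {
        long h = OFFSET_BASIS;
        for (char c : chars) {
            h = ((h ^ (c & 0xFF)) * PRIME) & MASK; // mix in the low byte
            h = ((h ^ (c >>> 8)) * PRIME) & MASK;  // mix in the high byte
        }
        return h;
    }

    static long hash(char c) {
        return hash(new char[] { c });
    }

    public static void main(String[] args) {
        // With no input the result is the offset basis itself.
        System.out.println(Long.toHexString(hash(new char[0]))); // 811c9dc5
    }
}
```

Returning a long while masking to 32 bits keeps the value non-negative, which is convenient when the hash is later reduced modulo a table size.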
-
hash2
public int hash2(char c)
djb2 hash algorithm. This algorithm (k=33) was first reported by Dan Bernstein many years ago in comp.lang.c. Another version of this algorithm (now favored by Bernstein) uses XOR: hash(i) = hash(i - 1) * 33 ^ str[i]. The magic of the number 33 (why it works better than many other constants, prime or not) has never been adequately explained.
Parameters:
c - character
Returns:
hashcode
-
hash2
public int hash2(char[] carray)
djb2 hash algorithm. This algorithm (k=33) was first reported by Dan Bernstein many years ago in comp.lang.c. Another version of this algorithm (now favored by Bernstein) uses XOR: hash(i) = hash(i - 1) * 33 ^ str[i]. The magic of the number 33 (why it works better than many other constants, prime or not) has never been adequately explained.
Parameters:
carray - character array
Returns:
hashcode
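The two djb2 variants described above can be sketched as follows. This is the textbook form, with the customary starting value 5381 and whole chars as input units; both of those choices and all names are assumptions, and the actual hash2 implementation may process its input differently.

```java
/** Sketch of the textbook djb2 hash and its XOR variant. Names are illustrative. */
final class Djb2Sketch {
    /** Additive form: hash(i) = hash(i - 1) * 33 + str[i]. */
    static int djb2(char[] chars) {
        int h = 5381; // customary djb2 seed (assumption for this sketch)
        for (char c : chars) {
            h = h * 33 + c;
        }
        return h;
    }

    /** XOR form: hash(i) = hash(i - 1) * 33 ^ str[i]. */
    static int djb2Xor(char[] chars) {
        int h = 5381;
        for (char c : chars) {
            h = (h * 33) ^ c; // multiplication binds tighter than ^, as in the formula
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(djb2(new char[] { 'a' }));    // 177670 = 5381 * 33 + 97
        System.out.println(djb2Xor(new char[] { 'a' })); // 177604 = (5381 * 33) ^ 97
    }
}
```

Integer overflow is left to wrap silently, which is the usual behaviour for this family of hashes.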
-