GalateaTalk
in Japanese : http://ja.nishimotz.com/galateatalk
README
The original README file is in Japanese:
https://github.com/nishimotz/jagtalk/blob/master/README.gtalk
To write the selected internal data (see the slots below) to a file:
set Log = filename
If the file exists, append mode is used.
To output using stderr:
set Log = CONSOLE
To disable output:
set Log = NO
Slots are as follows:
Log.conf : configuration of ssm.conf
Log.text : input text
Log.arrangedText : arranged input text
Log.chasen : analysis result of chasen
Log.tag : tag list (CONTEXT and SPELL are not included)
Log.phoneme : phoneme information
Log.mora : mora information
Log.morph : morphological analysis information
Log.aphrase : accent phrase information
Log.breath : breath group (breath paragraph) information
Log.sentence : sentence information
The default value is NO (output is disabled).
To enable log output for the 'chasen' slot:
set Log.chasen = YES
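The settings above can be generated rather than typed by hand. The following is a minimal sketch, assuming the commands are later fed to gtalk on standard input; `gtalk_log_cmds` is a hypothetical helper name, not part of GalateaTalk.

```shell
# Print gtalk commands that select a log destination and enable some slots.
# gtalk_log_cmds is a hypothetical helper, not part of GalateaTalk.
#   $1      : destination (a file name, CONSOLE, or NO)
#   $2 ...  : slot names such as chasen, phoneme, mora
gtalk_log_cmds() {
    dest=$1
    shift
    printf 'set Log = %s\n' "$dest"
    for slot in "$@"; do
        printf 'set Log.%s = YES\n' "$slot"
    done
}

# Example: log the chasen result and the phoneme information to gtalk.log.
gtalk_log_cmds gtalk.log chasen phoneme
```

The example call prints one `set Log = gtalk.log` line followed by one `set Log.<slot> = YES` line per slot. Whether gtalk requires the destination to be set before the slots is not stated here, so the ordering is only a guess.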
text2wav
Mac OS X support
since 2010-11-14
using Mac OS X 10.6.5 (64bit).
macports
- download and install: MacPorts-1.9.2-10.6-SnowLeopard.dmg
chasen
http://sourceforge.jp/projects/chasen-legacy/
The binary version of UniDic is compatible with the 32-bit binary of chasen.
The MacPorts version of chasen is a 64-bit binary.
In Terminal:
$ sudo mkdir -p /opt/local/bin/portslocation/dports/chasen
$ cd /opt/local/bin/portslocation/dports/chasen
$ sudo port install chasen
If darts and nkf are not yet installed, they are fetched and installed as well.
Due to historical reasons, the default encoding of ChaSen is set to EUC-JP. If you'd like to handle text files written in UTF-8 or Shift_JIS, you may use -r and -i options.
UTF-8) chasen -r /opt/local/etc/chasenrc-UTF-8 -i w <input>
Shift_JIS) chasen -r /opt/local/etc/chasenrc-Shift_JIS -i s <input>
$ file /opt/local/bin/chasen
/opt/local/bin/chasen: Mach-O 64-bit executable x86_64
$ echo "123" | /opt/local/bin/chasen | nkf -w
1 イチ 1 名詞-数
2 ニ 2 名詞-数
3 サン 3 名詞-数
EOS
nkf -w converts the output (EUC-JP) to the Terminal default encoding (UTF-8).
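The encoding options quoted above can be wrapped in a small lookup. This is only a sketch: `chasen_opts` is a hypothetical helper, and the chasenrc paths are the MacPorts locations from the message above.

```shell
# Map an input encoding to the chasen options quoted in the message above.
# chasen_opts is a hypothetical helper; the paths assume a MacPorts install.
chasen_opts() {
    rcdir=/opt/local/etc
    case "$1" in
        EUC-JP)    echo "" ;;   # default encoding, no extra options
        UTF-8)     echo "-r $rcdir/chasenrc-UTF-8 -i w" ;;
        Shift_JIS) echo "-r $rcdir/chasenrc-Shift_JIS -i s" ;;
        *)         echo "unsupported encoding: $1" >&2
                   return 1 ;;
    esac
}

# Intended use (leave the substitution unquoted so the options are split):
#   echo "123" | chasen $(chasen_opts UTF-8)
chasen_opts UTF-8
```

With the UTF-8 options chasen reads and writes UTF-8 directly, so the `nkf -w` step should not be needed in that case.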
- at this time, ipadic-2.7.0 is used with chasen.
- if you want to remove chasen: sudo port -f uninstall chasen
chaone + unidic
- http://www.tokuteicorpus.jp/dist/ (Japanese pages, user registration required)
- download 1: chaone-1.3.3.tar.gz
- download 2: unidic-chasen1312src.tar.gz (use the source package; the binary version is for 32-bit chasen)
gtalk + speakers
- download 1: gtalk-090225.tar.gz (or clone jagtalk from github.com)
- download 2: speakers-060820.tar.gz
uncompress and compile
$ cd
$ cd code
$ pwd
/Users/nishimotz/code
$ tar xvfz ~/Downloads/unidic-chasen1312src.tar.gz
$ tar xvfz ~/Downloads/chaone-1.3.3.tar.gz
$ tar xvfz ~/Downloads/gtalk-090225.tar.gz.gz
$ tar xvfz ~/Downloads/speakers-060820.tar.gz.gz
Xcode (gcc) must be installed.
$ gcc -v
Using built-in specs.
Target: i686-apple-darwin10
Configured with: /var/tmp/gcc/gcc-5659~1/src/configure --disable-checking --enable-werror --prefix=/usr --mandir=/share/man --enable-languages=c,objc,c++,obj-c++ --program-transform-name=/^[cg][^.-]*$/s/$/-4.2/ --with-slibdir=/usr/lib --build=i686-apple-darwin10 --program-prefix=i686-apple-darwin10- --host=x86_64-apple-darwin10 --target=i686-apple-darwin10 --with-gxx-include-dir=/include/c++/4.2.1
Thread model: posix
gcc version 4.2.1 (Apple Inc. build 5659)
building unidic for x64
It seems easier to use the default (UTF-8) version of unidic than to build an EUC-JP version.
$ cd unidic-chasen1312src
$ ./configure
$ make
/opt/local/lib/chasen/makemat -i w
parsing grammar.cha
parsing cforms.cha
parsing ctypes.cha
parsing connect.cha
table size: 9767
lines: ......................
modify chasenrc:
;(GRAMMAR ./dic)
(GRAMMAR .)
or make symbolic link:
$ ln -s . dic
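The chasenrc change described above can also be scripted. This sketch assumes the original line reads exactly `(GRAMMAR ./dic)` with no extra whitespace; `fix_grammar_path` is a hypothetical helper name.

```shell
# Read a chasenrc on stdin, comment out "(GRAMMAR ./dic)" and add
# "(GRAMMAR .)" right after it. All other lines pass through unchanged.
# fix_grammar_path is a hypothetical helper; it assumes the GRAMMAR line
# matches exactly, with no leading or trailing whitespace.
fix_grammar_path() {
    awk '
        $0 == "(GRAMMAR ./dic)" { print ";" $0; print "(GRAMMAR .)"; next }
        { print }
    '
}

# Intended use:
#   fix_grammar_path < chasenrc > chasenrc.new && mv chasenrc.new chasenrc
```

Writing to a temporary file and renaming avoids relying on in-place editing, which behaves differently between the BSD sed shipped with Mac OS X and GNU sed.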
test chasen using unidic:
$ echo "123" | chasen -r chasenrc
1 イッ 名詞-数詞 lForm="イチ" lemma="一" orthBase="1" pronBase="イッ" kanaBase="イッ" formBase="イチ" goshu="漢" iConType="N1" fType="チ促" fForm="促音形" aType="2" aConType="C3"
2 ニ 名詞-数詞 lForm="ニ" lemma="二" orthBase="2" pronBase="ニ" kanaBase="ニ" formBase="ニ" goshu="漢" fType="イ長添" fForm="基本形" aType="1" aConType="C3"
3 サン 名詞-数詞 lForm="サン" lemma="三" orthBase="3" pronBase="サン" kanaBase="サン" formBase="サン" goshu="漢" iConType="N3" aType="0" aConType="C3"
EOS
rename the directory:
$ cd ..
$ mv unidic-chasen1312src unidic-chasen1312_utf8-x64
building chaone
$ cd chaone-1.3.3
$ sh configure
$ make
$ sudo port install libxml
$ sudo port install libxml2
$ sudo port install libxslt
still errors:
In file included from chaone.c:12:
/usr/include/libxslt/transform.h:15:27: error: libxml/parser.h: No such file or directory
/usr/include/libxslt/transform.h:16:26: error: libxml/xmlIO.h: No such file or directory
$ sh configure
(omitted)
configure: WARNING: "xml2-config is not found"
$ make
to avoid the errors:
$ cd /usr/include/
$ sudo ln -s libxml2/libxml .
$ sh configure
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... ./install-sh -c -d
checking for gawk... no
checking for mawk... no
checking for nawk... no
checking for awk... awk
checking whether make sets $(MAKE)... yes
checking for gcc... gcc
checking for C compiler default output file name... a.out
checking whether the C compiler works... yes
checking whether we are cross compiling... no
checking for suffix of executables...
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking for style of include used by make... GNU
checking dependency style of gcc... gcc3
checking for
xmlCleanupParser,
xlFreeDoc,
xmlLoadExtDtdDefaultValue,
xmlFree,
xmlParseMemory,
xmlStrcat,
xmlStrdup,
xmlSubstituteEntitiesDefault in -lxml2... yes
checking for
xsltApplyStylesheet,
xsltCleanupGlobals,
xsltFreeStylesheet,
xsltParseStylesheetFile,
xsltSaveResultToFile in -lxslt... yes
checking for
exsltRegisterAll in -lexslt... yes
checking how to run the C preprocessor... gcc -E
checking for grep that handles long lines and -e... /usr/bin/grep
checking for egrep... /usr/bin/grep -E
checking for ANSI C header files... rm: conftest.dSYM: is a directory
rm: conftest.dSYM: is a directory
yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking for stdlib.h... (cached) yes
checking for string.h... (cached) yes
checking libxslt/transform.h usability... yes
checking libxslt/transform.h presence... yes
checking for libxslt/transform.h... yes
checking libxslt/xsltutils.h usability... yes
checking libxslt/xsltutils.h presence... yes
checking for libxslt/xsltutils.h... yes
checking libexslt/exslt.h usability... yes
checking libexslt/exslt.h presence... yes
checking for libexslt/exslt.h... yes
checking for an ANSI C-conforming const... yes
checking for stdlib.h... (cached) yes
checking for GNU libc compatible malloc... yes
configure: creating ./config.status
config.status: creating Makefile
config.status: executing depfiles commands
program runs, but fails to read data:
$ ./chaone
I/O warning : failed to load external entity "/usr/local/chaone/chaone.xsl"
error xsltParseStylesheetFile : cannot parse /usr/local/chaone/chaone.xsl
Segmentation fault
copy the files to /usr/local manually ("sudo make install" does not seem to work):
$ sudo mkdir /usr/local/chaone
$ sudo cp *.xml *.xsl /usr/local/chaone/
$ sudo cp chaone /usr/local/bin/
now /usr/local/bin/chaone works.
$ chaone -h
Usage: chaone [options] [file]
[file] input file name. if none is specified, stdin is used
output to stdout
[options]
--encoding {ISO-2022-JP|EUC-JP|Shift_JIS|UTF-8}: set I/O encoding
--mode {prep|chunker|phonetic|accent|postp|pc|pcp|pcpa|gtalk}: set standalone mode
--debug : debug output to stderr in UTF-8
building gtalk
see jagtalk
Mac build (32bit, euc-jp, without ports)
since 2011-10-08
- MacOSX 10.6.8
http://chasen.org/~taku/software/darts/
$ tar xvfz darts-0.32.tar.gz
$ cd darts-0.32
$ CFLAGS='-arch i386' ./configure
$ make
$ make check
$ sudo make install
http://sourceforge.jp/projects/chasen-legacy/
$ tar xvfz chasen-2.4.4.tar.gz
$ cd chasen-2.4.4
$ make distclean
$ CFLAGS='-arch i386 -m32' CXXFLAGS='-arch i386 -m32' LDFLAGS='-arch i386' ./configure; make
$ sudo make install
$ file /usr/local/bin/chasen
/usr/local/bin/chasen: Mach-O executable i386
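Because the 32-bit versus 64-bit distinction decides which UniDic package can be used, the `file` check above can be turned into a small classifier. This is a sketch that only recognizes the two output forms shown on this page; `macho_bits` is a hypothetical helper, and universal binaries are not handled.

```shell
# Classify one line of file(1) output for a Mach-O binary.
# macho_bits is a hypothetical helper; it only knows the two forms that
# appear on this page:
#   "...: Mach-O 64-bit executable x86_64"  -> 64
#   "...: Mach-O executable i386"           -> 32
macho_bits() {
    case "$1" in
        *"Mach-O 64-bit"*) echo 64 ;;
        *"Mach-O"*)        echo 32 ;;
        *)                 echo unknown ;;
    esac
}

# Intended use:
#   macho_bits "$(file /usr/local/bin/chasen)"
macho_bits '/usr/local/bin/chasen: Mach-O executable i386'
```

The 64-bit pattern is tested first because the plain `Mach-O` pattern would also match it.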
chaone (from tokuteicorpus site or galateatalk sourceforge.jp site)
- to use the system libraries for XML, chaone was built as a 64-bit binary.
$ tar xvfz chaone-1.3.3.tar.gz
$ cd chaone-1.3.3
$ CFLAGS='-I/usr/include/libxslt -I/usr/include/libxml2' CPPFLAGS=$CFLAGS sh configure
$ make
$ chmod 755 install-sh
$ sudo make install
$ file /usr/local/chaone/chaone
/usr/local/chaone/chaone: Mach-O 64-bit executable x86_64
The installer seems to forget to copy one file:
$ sudo cp ap_pos_rule.xml /usr/local/chaone/
prepare speakers and unidic-chasen:
$ ls ~/work/galatea/speakers-060820/
female01 male01
$ ls ~/work/galatea/unidic-chasen1312_eucj/
ChangeLog   chadic.lex       grammar.cha  table.cha
cforms.cha  chasenrc         license.txt
chadic.da   chasenrc_chaone  manual.pdf
chadic.dat  ctypes.cha       matrix.cha
copy and build jagtalk:
$ git clone https://nishimotz@github.com/nishimotz/jagtalk.git
$ cd jagtalk
$ make -f Makefile.MACOSX
check the files below (modify them if necessary):
$ cat test-jagtalk-macosx.sh
cat 00-testcmd | ./jagtalk -C jagtalk-macosx.conf
$ cat 00-testcmd
set Text = 123
set SaveWAV = _out.wav
set Run = EXIT
$ cat jagtalk-macosx.conf
# configuratiuon file for gtalk (GalateaTalk)
# macosx: http://en.nishimotz.com/galateatalk
CHASEN: /usr/local/bin/chasen
CHAONE: /usr/local/chaone/chaone -s gtalk --encoding EUC-JP
CHASEN-RC: ./chasenrc-euc-jp-macosx
# default for numbers and alphabets
NUMBER: DECIMAL
ALPHABET: WORD
DATE: YMD
TIME: hms
# dictionary
DICTIONARY: ./gtalk-eucjp.dic
# automatic play of synthesized speech
AUTO-PLAY: NO
# time delay [msec] for autuomatic play
AUTO-PLAY-DELAY: 250
# file of phoneme list
PHONEME-LIST: mono.lst
# parameter files for each speaker
SPEAKER-ID: female01
GENDER: female
DUR-TREE-FILE: ../../galatea/speakers-060820/female01/tree-dur.inf
PIT-TREE-FILE: ../../galatea/speakers-060820/female01/tree-lf0.inf
MCEP-TREE-FILE: ../../galatea/speakers-060820/female01/tree-mcep.inf
DUR-MODEL-FILE: ../../galatea/speakers-060820/female01/duration.pdf
PIT-MODEL-FILE: ../../galatea/speakers-060820/female01/lf0.pdf
MCEP-MODEL-FILE: ../../galatea/speakers-060820/female01/mcep.pdf
chasenrc-euc-jp-macosx contains EUC-JP characters.
$ cat chasenrc-euc-jp-macosx
;;
;; chasenrc for unidic / chaOne
;;
(GRAMMAR /Users/nishimotz/work/galatea/unidic-chasen1312_eucj)
(DADIC chadic)
(UNKNOWN_POS (名詞 普通名詞 一般))
(OUTPUT_FORMAT "<W1 orth=\"%m\" kana=\"%?U/%m/%y0/\" pron=\"%?U/%m/%a0/\" pos=\"%U(%P-)\"%?T/ cType=\"%T \"//%?F/ cForm=\"%F \"//%?I/ %i0//>%m</W1>\n")
(OUTPUT_COMPOUND "SEG")
(BOS_STRING "<S>\n")
(EOS_STRING "</S>\n")
(DEF_CONN_COST 10000)
(POS_COST
((*) 1)
((UNKNOWN) 30000)
)
(CONN_WEIGHT 1)
(MORPH_WEIGHT 1)
(COST_WIDTH 0)
(ANNOTATION
(("<" ">") "%m\n")
(("\"") "<cha:W1 orth=\""\" kana=\""\" pron=\""\" pos=\"%U(%P-)\"%?T/ cType=\"%T \"//%?F/ cForm=\"%F \"//%?I/ %i//>%m</cha:W1>\n")
)
Run the test script, then open _out.wav in QuickTime Player. You will hear 'hyaku ni juu san' (123 in Japanese).
$ sh test-jagtalk-macosx.sh
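The command file used by the test script can be produced for any text and output name. This is a minimal sketch built on the three commands shown in 00-testcmd; `jagtalk_wav_cmds` is a hypothetical helper, not part of jagtalk.

```shell
# Print the jagtalk commands that synthesize one text into a WAV file and
# then exit, in the same form as 00-testcmd.
# jagtalk_wav_cmds is a hypothetical helper, not part of jagtalk.
#   $1 : text to synthesize
#   $2 : output WAV file name
jagtalk_wav_cmds() {
    printf 'set Text = %s\n' "$1"
    printf 'set SaveWAV = %s\n' "$2"
    printf 'set Run = EXIT\n'
}

# Intended use, following test-jagtalk-macosx.sh:
#   jagtalk_wav_cmds 123 _out.wav | ./jagtalk -C jagtalk-macosx.conf
jagtalk_wav_cmds 123 _out.wav
```

Since this configuration runs chasen and chaone in EUC-JP, Japanese text would presumably have to be passed in EUC-JP as well, for example by piping the output through `nkf -e` first; that part is an assumption and is not tested here.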
jagtalk now uses Mac audio device:
$ sh run-jagtalk-macosx.sh
$ cat 00-testcmd-speaker
set Speak.syncinterval = 500
set Text = 123456789
set Speak = NOW
