GalateaTalk
in Japanese : http://ja.nishimotz.com/galateatalk
README
The original README file is in Japanese:
https://github.com/nishimotz/jagtalk/blob/master/README.gtalk
To write the internal data selected below to a file:
set Log = filename
If the file exists, append mode is used.
To write to stderr:
set Log = CONSOLE
To disable output:
set Log = NO
Slots are as follows:
Log.conf : configurations of ssm.conf
Log.text : input text
Log.arrangedText : arranged input text
Log.chasen : analysis result of chasen
Log.tag : tag lists (CONTEXT and SPELL are not included)
Log.phoneme : phoneme information
Log.mora : mora information
Log.morph : morphological analysis information
Log.aphrase : accent phrase information
Log.breath : breath paragraph information
Log.sentence : sentence information
The default value is NO (output is disabled).
To enable log output for the 'chasen' slot:
set Log.chasen = YES
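As an illustration, the logging commands above could be collected in a command file and piped to the synthesizer, in the same way as the 00-testcmd file shown later on this page. This is only a sketch: the file names log-testcmd and gtalk.log are hypothetical, and only the `set` commands themselves come from the README.

```shell
# Write a hypothetical command file that logs the chasen slot to a file.
# "log-testcmd" and "gtalk.log" are illustrative names, not part of gtalk.
cat > log-testcmd <<'EOF'
set Log = gtalk.log
set Log.chasen = YES
set Text = 123
set Run = EXIT
EOF

# The file can then be fed to the synthesizer on stdin, for example:
#   cat log-testcmd | ./jagtalk -C jagtalk-macosx.conf
```

Because append mode is used when the log file already exists, repeated runs should accumulate in the same file.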
text2wav
Mac OS X support
since 2010-11-14
using Mac OS X 10.6.5 (64bit).
macports
- download and install: MacPorts-1.9.2-10.6-SnowLeopard.dmg
chasen
http://sourceforge.jp/projects/chasen-legacy/
The binary version of Unidic is compatible with the 32-bit binary of chasen.
The MacPorts version of chasen is a 64-bit binary.
Using terminal:
$ sudo mkdir -p /opt/local/bin/portslocation/dports/chasen
$ cd /opt/local/bin/portslocation/dports/chasen
$ sudo port install chasen
If they are not already installed, darts and nkf are also fetched and installed.
Due to historical reasons, the default encoding of ChaSen is set to EUC-JP. If you'd like to handle text files written in UTF-8 or Shift_JIS, you may use -r and -i options.
UTF-8) chasen -r /opt/local/etc/chasenrc-UTF-8 -i w <input>
Shift_JIS) chasen -r /opt/local/etc/chasenrc-Shift_JIS -i s <input>
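The option choice above can be wrapped in a small helper, sketched here under the assumption that the rc files sit at the MacPorts paths quoted in the note. The function name chasen_opts is hypothetical.

```shell
# Print the chasen options for a given input encoding.
# The rc file paths are the MacPorts locations quoted in the note above;
# chasen_opts itself is a hypothetical helper name.
chasen_opts() {
  case "$1" in
    UTF-8)     echo "-r /opt/local/etc/chasenrc-UTF-8 -i w" ;;
    Shift_JIS) echo "-r /opt/local/etc/chasenrc-Shift_JIS -i s" ;;
    EUC-JP)    echo "" ;;  # default encoding, no extra options needed
    *)         echo "unsupported encoding: $1" >&2; return 1 ;;
  esac
}

# Example (requires chasen to be installed; input.txt is a placeholder):
#   chasen $(chasen_opts UTF-8) input.txt
```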
$ file /opt/local/bin/chasen
/opt/local/bin/chasen: Mach-O 64-bit executable x86_64
$ echo "123" | /opt/local/bin/chasen | nkf -w
1 イチ 1 名詞-数
2 ニ 2 名詞-数
3 サン 3 名詞-数
EOS
nkf -w converts output (EUC-JP) to Terminal default (UTF-8).
- at this time, ipadic-2.7.0 is used with chasen.
- if you want to remove chasen: sudo port -f uninstall chasen
chaone + unidic
- http://www.tokuteicorpus.jp/dist/ (Japanese pages, user registration required)
- download 1: chaone-1.3.3.tar.gz
- download 2: unidic-chasen1312src.tar.gz (use source. binary version is for 32bit chasen)
gtalk + speakers
- download 1: gtalk-090225.tar.gz (or clone jagtalk from github.com)
- download 2: speakers-060820.tar.gz
uncompress and compile
$ cd
$ cd code
$ pwd
/Users/nishimotz/code
$ tar xvfz ~/Downloads/unidic-chasen1312src.tar.gz
$ tar xvfz ~/Downloads/chaone-1.3.3.tar.gz
$ tar xvfz ~/Downloads/gtalk-090225.tar.gz.gz
$ tar xvfz ~/Downloads/speakers-060820.tar.gz.gz
Xcode (gcc) must be installed.
$ gcc -v
Using built-in specs.
Target: i686-apple-darwin10
Configured with: /var/tmp/gcc/gcc-5659~1/src/configure --disable-checking --enable-werror --prefix=/usr --mandir=/share/man --enable-languages=c,objc,c++,obj-c++ --program-transform-name=/^[cg][^.-]*$/s/$/-4.2/ --with-slibdir=/usr/lib --build=i686-apple-darwin10 --program-prefix=i686-apple-darwin10- --host=x86_64-apple-darwin10 --target=i686-apple-darwin10 --with-gxx-include-dir=/include/c++/4.2.1
Thread model: posix
gcc version 4.2.1 (Apple Inc. build 5659)
building unidic for x64
It seems easier to use the default (UTF-8) version of unidic than to build an EUC-JP version.
$ cd unidic-chasen1312src
$ ./configure
$ make
/opt/local/lib/chasen/makemat -i w
parsing grammar.cha
parsing cforms.cha
parsing ctypes.cha
parsing connect.cha
table size: 9767
lines: ......................
modify chasenrc:
;(GRAMMAR ./dic)
(GRAMMAR .)
or make symbolic link:
$ ln -s . dic
test chasen using unidic:
$ echo "123" | chasen -r chasenrc
1 イッ 名詞-数詞 lForm="イチ" lemma="一" orthBase="1" pronBase="イッ" kanaBase="イッ" formBase="イチ" goshu="漢" iConType="N1" fType="チ促" fForm="促音形" aType="2" aConType="C3"
2 ニ 名詞-数詞 lForm="ニ" lemma="二" orthBase="2" pronBase="ニ" kanaBase="ニ" formBase="ニ" goshu="漢" fType="イ長添" fForm="基本形" aType="1" aConType="C3"
3 サン 名詞-数詞 lForm="サン" lemma="三" orthBase="3" pronBase="サン" kanaBase="サン" formBase="サン" goshu="漢" iConType="N3" aType="0" aConType="C3"
EOS
rename the directory:
$ cd ..
$ mv unidic-chasen1312src unidic-chasen1312_utf8-x64
building chaone
$ cd chaone-1.3.3
$ sh configure
$ make
$ sudo port install libxml
$ sudo port install libxml2
$ sudo port install libxslt
still errors:
In file included from chaone.c:12:
/usr/include/libxslt/transform.h:15:27: error: libxml/parser.h: No such file or directory
/usr/include/libxslt/transform.h:16:26: error: libxml/xmlIO.h: No such file or directory
$ sh configure
(omitted)
configure: WARNING: "xml2-config is not found"
$ make
to avoid the errors:
$ cd /usr/include/
$ sudo ln -s libxml2/libxml .
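The symbolic link works because the compiler can then resolve libxml/parser.h under /usr/include. A less invasive variant, sketched below, is to locate the header and pass its directory as a -I flag, which is what the 32-bit build later on this page does with CFLAGS. The helper name find_libxml_include is hypothetical, and the candidate directories in the usage comment are assumptions about where the headers might live on a given machine.

```shell
# Print a -I flag for the first directory that contains libxml/parser.h.
# find_libxml_include is a hypothetical helper; candidate directories are
# passed as arguments and checked in order.
find_libxml_include() {
  for dir in "$@"; do
    if [ -f "$dir/libxml/parser.h" ]; then
      echo "-I$dir"
      return 0
    fi
  done
  return 1  # header not found in any candidate directory
}

# Example (the directories are assumptions, adjust to the local system):
#   CFLAGS="$(find_libxml_include /usr/include/libxml2 /opt/local/include/libxml2)" sh configure
```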
$ sh configure
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... ./install-sh -c -d
checking for gawk... no
checking for mawk... no
checking for nawk... no
checking for awk... awk
checking whether make sets $(MAKE)... yes
checking for gcc... gcc
checking for C compiler default output file name... a.out
checking whether the C compiler works... yes
checking whether we are cross compiling... no
checking for suffix of executables...
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking for style of include used by make... GNU
checking dependency style of gcc... gcc3
checking for xmlCleanupParser, xlFreeDoc, xmlLoadExtDtdDefaultValue, xmlFree, xmlParseMemory, xmlStrcat, xmlStrdup, xmlSubstituteEntitiesDefault in -lxml2... yes
checking for xsltApplyStylesheet, xsltCleanupGlobals, xsltFreeStylesheet, xsltParseStylesheetFile, xsltSaveResultToFile in -lxslt... yes
checking for exsltRegisterAll in -lexslt... yes
checking how to run the C preprocessor... gcc -E
checking for grep that handles long lines and -e... /usr/bin/grep
checking for egrep... /usr/bin/grep -E
checking for ANSI C header files... rm: conftest.dSYM: is a directory
rm: conftest.dSYM: is a directory
yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking for stdlib.h... (cached) yes
checking for string.h... (cached) yes
checking libxslt/transform.h usability... yes
checking libxslt/transform.h presence... yes
checking for libxslt/transform.h... yes
checking libxslt/xsltutils.h usability... yes
checking libxslt/xsltutils.h presence... yes
checking for libxslt/xsltutils.h... yes
checking libexslt/exslt.h usability... yes
checking libexslt/exslt.h presence... yes
checking for libexslt/exslt.h... yes
checking for an ANSI C-conforming const... yes
checking for stdlib.h... (cached) yes
checking for GNU libc compatible malloc... yes
configure: creating ./config.status
config.status: creating Makefile
config.status: executing depfiles commands
program runs, but fails to read data:
$ ./chaone
I/O warning : failed to load external entity "/usr/local/chaone/chaone.xsl"
error
xsltParseStylesheetFile : cannot parse /usr/local/chaone/chaone.xsl
Segmentation fault
copy to /usr/local (“sudo make install” does not work??):
$ sudo mkdir /usr/local/chaone
$ sudo cp *.xml *.xsl /usr/local/chaone/
$ sudo cp chaone /usr/local/bin/
now /usr/local/bin/chaone works.
$ chaone -h
Usage: chaone [options] [file]
  [file]
    input file name. if none is specified, stdin is used
    output to stdout
  [options]
    --encoding {ISO-2022-JP|EUC-JP|Shift_JIS|UTF-8}: set I/O encoding
    --mode {prep|chunker|phonetic|accent|postp|pc|pcp|pcpa|gtalk}: set standalone mode
    --debug : debug output to stderr in UTF-8
building gtalk
see jagtalk
Mac build (32bit, euc-jp, without ports)
since 2011-10-08
- MacOSX 10.6.8
http://chasen.org/~taku/software/darts/
$ tar xvfz darts-0.32.tar.gz
$ cd darts-0.32
$ CFLAGS='-arch i386' ./configure
$ make
$ make check
$ sudo make install
http://sourceforge.jp/projects/chasen-legacy/
$ tar xvfz chasen-2.4.4.tar.gz
$ cd chasen-2.4.4
$ make distclean
$ CFLAGS='-arch i386 -m32' CXXFLAGS='-arch i386 -m32' LDFLAGS='-arch i386' ./configure; make
$ sudo make install
$ file /usr/local/bin/chasen
/usr/local/bin/chasen: Mach-O executable i386
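Since the binary unidic dictionary is meant for a 32-bit chasen, it can help to check a binary's architecture in a script rather than by eye. The sketch below classifies one line of `file` output. The helper name macho_bits is hypothetical, and the patterns are based only on the two `file` outputs quoted on this page.

```shell
# Classify a line of `file` output as 32-bit or 64-bit.
# macho_bits is a hypothetical helper; it only inspects the text it is given.
macho_bits() {
  case "$1" in
    *x86_64*|*64-bit*) echo 64 ;;
    *i386*)            echo 32 ;;
    *)                 echo unknown ;;
  esac
}

# Example (requires the file command and an installed chasen):
#   macho_bits "$(file /usr/local/bin/chasen)"
```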
chaone (from tokuteicorpus site or galateatalk sourceforge.jp site)
- to use the system libraries for XML, chaone was built as a 64-bit binary.
$ tar xvfz chaone-1.3.3.tar.gz
$ cd chaone-1.3.3
$ CFLAGS='-I/usr/include/libxslt -I/usr/include/libxml2' CPPFLAGS=$CFLAGS sh configure
$ make
$ chmod 755 install-sh
$ sudo make install
$ file /usr/local/chaone/chaone
/usr/local/chaone/chaone: Mach-O 64-bit executable x86_64
The installer seems to forget to copy one file:
$ sudo cp ap_pos_rule.xml /usr/local/chaone/
prepare speakers and unidic-chasen:
$ ls ~/work/galatea/speakers-060820/
female01 male01
$ ls ~/work/galatea/unidic-chasen1312_eucj/
ChangeLog       chadic.lex       grammar.cha   table.cha
cforms.cha      chasenrc         license.txt
chadic.da       chasenrc_chaone  manual.pdf
chadic.dat      ctypes.cha       matrix.cha
copy and build jagtalk:
$ git clone https://nishimotz@github.com/nishimotz/jagtalk.git
$ cd jagtalk
$ make -f Makefile.MACOSX
check the files below (modify them if necessary):
$ cat test-jagtalk-macosx.sh
cat 00-testcmd | ./jagtalk -C jagtalk-macosx.conf
$ cat 00-testcmd
set Text = 123
set SaveWAV = _out.wav
set Run = EXIT
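The command file has a fixed three-line shape, so a similar file for other input can be generated. The sketch below prints such a command sequence for a given text and output file name. The function name make_testcmd is hypothetical. Japanese text would presumably have to be encoded in EUC-JP to match the configuration used here, which is an assumption this sketch does not check.

```shell
# Print a gtalk command sequence that synthesizes TEXT into WAVFILE and exits.
# make_testcmd is a hypothetical helper; the three commands mirror 00-testcmd.
make_testcmd() {
  printf 'set Text = %s\n' "$1"
  printf 'set SaveWAV = %s\n' "$2"
  printf 'set Run = EXIT\n'
}

# Example (requires a built jagtalk in the current directory):
#   make_testcmd 456 _out456.wav | ./jagtalk -C jagtalk-macosx.conf
```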
$ cat jagtalk-macosx.conf
# configuratiuon file for gtalk (GalateaTalk)
# macosx: http://en.nishimotz.com/galateatalk
CHASEN: /usr/local/bin/chasen
CHAONE: /usr/local/chaone/chaone -s gtalk --encoding EUC-JP
CHASEN-RC: ./chasenrc-euc-jp-macosx
# default for numbers and alphabets
NUMBER: DECIMAL
ALPHABET: WORD
DATE: YMD
TIME: hms
# dictionary
DICTIONARY: ./gtalk-eucjp.dic
# automatic play of synthesized speech
AUTO-PLAY: NO
# time delay [msec] for autuomatic play
AUTO-PLAY-DELAY: 250
# file of phoneme list
PHONEME-LIST: mono.lst
# parameter files for each speaker
SPEAKER-ID: female01
GENDER: female
DUR-TREE-FILE: ../../galatea/speakers-060820/female01/tree-dur.inf
PIT-TREE-FILE: ../../galatea/speakers-060820/female01/tree-lf0.inf
MCEP-TREE-FILE: ../../galatea/speakers-060820/female01/tree-mcep.inf
DUR-MODEL-FILE: ../../galatea/speakers-060820/female01/duration.pdf
PIT-MODEL-FILE: ../../galatea/speakers-060820/female01/lf0.pdf
MCEP-MODEL-FILE: ../../galatea/speakers-060820/female01/mcep.pdf
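The speaker entries use relative paths, so they can point at missing files when the directory layout differs. The sketch below lists every *-FILE entry whose file does not exist. It assumes one "KEY: value" entry per line and resolves relative paths against the current directory. The function name check_conf_paths is hypothetical.

```shell
# Report *-FILE entries in a gtalk configuration whose files do not exist.
# check_conf_paths is a hypothetical helper; it assumes "KEY: value" lines
# and resolves relative paths against the current directory.
check_conf_paths() {
  rc=0
  while IFS= read -r line; do
    case "$line" in
      \#*) ;;                  # skip comment lines
      *-FILE:*)
        path=${line#*: }       # strip the "KEY: " prefix
        if [ ! -f "$path" ]; then
          echo "missing: $path"
          rc=1
        fi
        ;;
    esac
  done < "$1"
  return $rc
}

# Example, run from the jagtalk directory:
#   check_conf_paths jagtalk-macosx.conf
```

The function returns a non-zero status when at least one file is missing, so it can be used as a guard before running the test script.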
chasenrc-euc-jp-macosx contains EUC-JP characters.
$ cat chasenrc-euc-jp-macosx
;;
;; chasenrc for unidic / chaOne
;;
(GRAMMAR /Users/nishimotz/work/galatea/unidic-chasen1312_eucj)
(DADIC chadic)
(UNKNOWN_POS (名詞 普通名詞 一般))
(OUTPUT_FORMAT "<W1 orth=\"%m\" kana=\"%?U/%m/%y0/\" pron=\"%?U/%m/%a0/\" pos=\"%U(%P-)\"%?T/ cType=\"%T \"//%?F/ cForm=\"%F \"//%?I/ %i0//>%m</W1>\n")
(OUTPUT_COMPOUND "SEG")
(BOS_STRING "<S>\n")
(EOS_STRING "</S>\n")
(DEF_CONN_COST 10000)
(POS_COST
  ((*) 1)
  ((UNKNOWN) 30000)
)
(CONN_WEIGHT 1)
(MORPH_WEIGHT 1)
(COST_WIDTH 0)
(ANNOTATION (("<" ">") "%m\n")
  (("\"") "<cha:W1 orth=\""\" kana=\""\" pron=\""\" pos=\"%U(%P-)\"%?T/ cType=\"%T \"//%?F/ cForm=\"%F \"//%?I/ %i//>%m</cha:W1>\n")
)
Run the test script, then open _out.wav in QuickTime Player. You will hear 'hyaku ni juu san' (123 in Japanese).
$ sh test-jagtalk-macosx.sh
jagtalk now uses Mac audio device:
$ sh run-jagtalk-macosx.sh
$ cat 00-testcmd-speaker
set Speak.syncinterval = 500
set Text = 123456789
set Speak = NOW