GalateaTalk

README

The original README file is written in Japanese:

https://github.com/nishimotz/jagtalk/blob/master/README.gtalk

To write the selected internal data (see the slots below) to a file:

    set Log = filename 

If the file already exists, output is appended to it.

To output using stderr:

    set Log = CONSOLE

To disable output:

    set Log = NO

Slots are as follows:

    Log.conf : configuration from ssm.conf
    Log.text : input text
    Log.arrangedText : arranged input text
    Log.chasen : analysis result of chasen
    Log.tag : tag list (CONTEXT and SPELL are not included)
    Log.phoneme : phoneme information
    Log.mora : mora information
    Log.morph : morphological analysis information
    Log.aphrase : accent phrase information
    Log.breath : breath paragraph information
    Log.sentence : sentence information

The default value is NO (output is disabled).

To enable log output for the 'chasen' slot:

    set Log.chasen = YES
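
The logging commands above can be generated by a small helper. This is a minimal shell sketch, not part of GalateaTalk: the function name gtalk_log_commands is hypothetical, and it assumes that the Log target may be set before the individual slots are enabled.

```shell
# Print gtalk commands that send the given Log slots to one log target.
# TARGET is a file name, CONSOLE, or NO, as described above.
# Usage: gtalk_log_commands TARGET SLOT...
# Example: gtalk_log_commands gtalk.log chasen phoneme
gtalk_log_commands() {
    target=$1
    shift
    printf 'set Log = %s\n' "$target"
    for slot in "$@"; do
        printf 'set Log.%s = YES\n' "$slot"
    done
}
```

The printed lines can be placed at the top of a command file before the text to synthesize.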

text2wav

Mac OS X support

since 2010-11-14

using Mac OS X 10.6.5 (64bit).

macports

chasen

http://sourceforge.jp/projects/chasen-legacy/

The binary distribution of Unidic is built for the 32-bit chasen binary.

The MacPorts version of chasen is a 64-bit binary, so Unidic has to be built from source for it.

Using terminal:

$ sudo mkdir -p /opt/local/bin/portslocation/dports/chasen
$ cd /opt/local/bin/portslocation/dports/chasen

$ sudo port install chasen  

If they are not already installed, darts and nkf are fetched and installed as well.

Due to historical reasons, the default encoding of ChaSen is set to EUC-JP.
If you'd like to handle text files written in UTF-8 or Shift_JIS, you may use -r and -i options.

  UTF-8)     chasen -r /opt/local/etc/chasenrc-UTF-8 -i w <input>
  Shift_JIS) chasen -r /opt/local/etc/chasenrc-Shift_JIS -i s <input>

$ file /opt/local/bin/chasen
/opt/local/bin/chasen: Mach-O 64-bit executable x86_64
$ echo "123" | /opt/local/bin/chasen | nkf -w
1	イチ	1	名詞-数		
2	ニ	2	名詞-数		
3	サン	3	名詞-数		
EOS

nkf -w converts chasen's EUC-JP output to UTF-8, the Terminal default.
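
The encoding options quoted in the MacPorts note can be wrapped in a helper so that the right chasenrc and -i flag are picked together. This is only a sketch: the function name chasen_encoding_opts is hypothetical, and the paths are the MacPorts locations shown above, which may differ on other installations.

```shell
# Print the chasen options for a given input encoding.
# EUC-JP is the chasen default, so it needs no extra options.
# Usage (word splitting of the result is intended):
#   chasen $(chasen_encoding_opts UTF-8) input.txt
chasen_encoding_opts() {
    case $1 in
        UTF-8)     printf '%s\n' '-r /opt/local/etc/chasenrc-UTF-8 -i w' ;;
        Shift_JIS) printf '%s\n' '-r /opt/local/etc/chasenrc-Shift_JIS -i s' ;;
        EUC-JP)    printf '\n' ;;
        *)         echo "unsupported encoding: $1" >&2; return 1 ;;
    esac
}
```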

  • At this time, chasen uses ipadic-2.7.0.
  • To remove chasen: sudo port -f uninstall chasen

chaone + unidic

  • http://www.tokuteicorpus.jp/dist/ (Japanese pages, user registration required)
  • download 1: chaone-1.3.3.tar.gz
  • download 2: unidic-chasen1312src.tar.gz (use the source package; the binary version is for 32-bit chasen)

gtalk + speakers

uncompress and compile

$ cd
$ cd code
$ pwd
/Users/nishimotz/code
$ tar xvfz ~/Downloads/unidic-chasen1312src.tar.gz 
$ tar xvfz ~/Downloads/chaone-1.3.3.tar.gz 
$ tar xvfz ~/Downloads/gtalk-090225.tar.gz.gz 
$ tar xvfz ~/Downloads/speakers-060820.tar.gz.gz 

Xcode (gcc) must be installed.

$ gcc -v
Using built-in specs.
Target: i686-apple-darwin10
Configured with: /var/tmp/gcc/gcc-5659~1/src/configure --disable-checking --enable-werror --prefix=/usr --mandir=/share/man --enable-languages=c,objc,c++,obj-c++ --program-transform-name=/^[cg][^.-]*$/s/$/-4.2/ --with-slibdir=/usr/lib --build=i686-apple-darwin10 --program-prefix=i686-apple-darwin10- --host=x86_64-apple-darwin10 --target=i686-apple-darwin10 --with-gxx-include-dir=/include/c++/4.2.1
Thread model: posix
gcc version 4.2.1 (Apple Inc. build 5659)

building unidic for x64

It seems easier to use the default UTF-8 build of unidic than to make an EUC-JP version.

$ cd unidic-chasen1312src
$ ./configure
$ make
/opt/local/lib/chasen/makemat -i w
parsing grammar.cha
parsing cforms.cha
parsing ctypes.cha
parsing connect.cha
table size: 9767
lines: ......................

modify chasenrc so that GRAMMAR points to the current directory:

;(GRAMMAR ./dic)
(GRAMMAR .)

or make symbolic link:

$ ln -s . dic
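
The chasenrc edit can also be scripted. The sketch below rewrites the active GRAMMAR line with sed and prints the result, leaving the original file untouched. The function name set_chasenrc_grammar is hypothetical, and the directory argument is assumed not to contain the characters | or &, which are special in the sed expression.

```shell
# Rewrite the active GRAMMAR entry of a chasenrc file to point at DIR.
# Lines commented out with ';' are left alone. The result goes to stdout.
# Usage: set_chasenrc_grammar CHASENRC DIR > chasenrc.new
set_chasenrc_grammar() {
    sed "s|^(GRAMMAR .*)|(GRAMMAR $2)|" "$1"
}
```

After checking the output, move chasenrc.new over chasenrc, or pass it to chasen directly with -r.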

test chasen using unidic:

$ echo "123" | chasen -r chasenrc 
1       イッ    名詞-数詞                       lForm="イチ" lemma="一" orthBase="1" pronBase="イッ" kanaBase="イッ" formBase="イチ" goshu="漢" iConType="N1" fType="チ促" fForm="促音形" aType="2" aConType="C3"
2       ニ      名詞-数詞                       lForm="ニ" lemma="二" orthBase="2" pronBase="ニ" kanaBase="ニ" formBase="ニ" goshu="漢" fType="イ長添" fForm="基本形" aType="1" aConType="C3"
3       サン    名詞-数詞                       lForm="サン" lemma="三" orthBase="3" pronBase="サン" kanaBase="サン" formBase="サン" goshu="漢" iConType="N3" aType="0" aConType="C3"
EOS

rename the directory:

$ cd ..
$ mv unidic-chasen1312src unidic-chasen1312_utf8-x64

building chaone

$ cd chaone-1.3.3
$ sh configure 
$ make
$ sudo port install libxml
$ sudo port install libxml2
$ sudo port install libxslt

still errors:

In file included from chaone.c:12:
/usr/include/libxslt/transform.h:15:27: error: libxml/parser.h: No such file or directory
/usr/include/libxslt/transform.h:16:26: error: libxml/xmlIO.h: No such file or directory
$ sh configure 
(omitted)
configure: WARNING: "xml2-config is not found"
$ make

to avoid the errors:

$ cd /usr/include/
$ sudo ln -s libxml2/libxml .
$ sh configure
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... ./install-sh -c -d
checking for gawk... no
checking for mawk... no
checking for nawk... no
checking for awk... awk
checking whether make sets $(MAKE)... yes
checking for gcc... gcc
checking for C compiler default output file name... a.out
checking whether the C compiler works... yes
checking whether we are cross compiling... no
checking for suffix of executables... 
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking for style of include used by make... GNU
checking dependency style of gcc... gcc3
checking for
    xmlCleanupParser,
    xlFreeDoc,
    xmlLoadExtDtdDefaultValue,
    xmlFree,
    xmlParseMemory,
    xmlStrcat,
    xmlStrdup,
    xmlSubstituteEntitiesDefault in -lxml2... yes
checking for
    xsltApplyStylesheet,
    xsltCleanupGlobals,
    xsltFreeStylesheet,
    xsltParseStylesheetFile,
    xsltSaveResultToFile in -lxslt... yes
checking for
    exsltRegisterAll in -lexslt... yes
checking how to run the C preprocessor... gcc -E
checking for grep that handles long lines and -e... /usr/bin/grep
checking for egrep... /usr/bin/grep -E
checking for ANSI C header files... rm: conftest.dSYM: is a directory
rm: conftest.dSYM: is a directory
yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking for stdlib.h... (cached) yes
checking for string.h... (cached) yes
checking libxslt/transform.h usability... yes
checking libxslt/transform.h presence... yes
checking for libxslt/transform.h... yes
checking libxslt/xsltutils.h usability... yes
checking libxslt/xsltutils.h presence... yes
checking for libxslt/xsltutils.h... yes
checking libexslt/exslt.h usability... yes
checking libexslt/exslt.h presence... yes
checking for libexslt/exslt.h... yes
checking for an ANSI C-conforming const... yes
checking for stdlib.h... (cached) yes
checking for GNU libc compatible malloc... yes
configure: creating ./config.status
config.status: creating Makefile
config.status: executing depfiles commands

program runs, but fails to read data:

$ ./chaone 
I/O warning : failed to load external entity "/usr/local/chaone/chaone.xsl"
error
xsltParseStylesheetFile : cannot parse /usr/local/chaone/chaone.xsl
Segmentation fault

copy the files to /usr/local by hand ("sudo make install" did not seem to work):

$ sudo mkdir /usr/local/chaone
$ sudo cp *.xml *.xsl /usr/local/chaone/
$ sudo cp chaone /usr/local/bin/   

now /usr/local/bin/chaone works.

$ chaone -h
Usage: chaone [options] [file]
[file]	input file name. if none is specified, stdin is used
	output to stdout
[options]
	--encoding {ISO-2022-JP|EUC-JP|Shift_JIS|UTF-8}: set I/O encoding
	--mode {prep|chunker|phonetic|accent|postp|pc|pcp|pcpa|gtalk}: set standalone mode
	--debug : debug output to stderr in UTF-8

building gtalk

see jagtalk

Mac build (32bit, euc-jp, without ports)

since 2011-10-08

  • MacOSX 10.6.8

http://chasen.org/~taku/software/darts/

$ tar xvfz darts-0.32.tar.gz 
$ cd darts-0.32
$ CFLAGS='-arch i386' ./configure 
$ make
$ make check
$ sudo make install

http://sourceforge.jp/projects/chasen-legacy/

$ tar xvfz chasen-2.4.4.tar.gz
$ cd chasen-2.4.4
$ make distclean
$ CFLAGS='-arch i386 -m32' CXXFLAGS='-arch i386 -m32' LDFLAGS='-arch i386' ./configure; make
$ sudo make install
$ file /usr/local/bin/chasen
/usr/local/bin/chasen: Mach-O executable i386
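
Because the choice of dictionary depends on whether chasen is a 32-bit or 64-bit binary, it can help to classify the output of file in a script. This is a rough sketch with a hypothetical function name. It only looks for the strings i386 and x86_64 that appear in the outputs shown in this README, and it does not handle universal binaries that contain both.

```shell
# Classify one line of file(1) output as i386, x86_64, or unknown.
# Usage: arch_from_file_output "$(file /usr/local/bin/chasen)"
arch_from_file_output() {
    case $1 in
        *x86_64*) echo x86_64 ;;
        *i386*)   echo i386 ;;
        *)        echo unknown ;;
    esac
}
```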

chaone (from tokuteicorpus site or galateatalk sourceforge.jp site)

  • To use the system XML libraries, chaone was built as a 64-bit binary.

$ tar xvfz chaone-1.3.3.tar.gz
$ cd chaone-1.3.3
$ CFLAGS='-I/usr/include/libxslt -I/usr/include/libxml2' CPPFLAGS=$CFLAGS sh configure
$ make
$ chmod 755 install-sh
$ sudo make install
$ file /usr/local/chaone/chaone
/usr/local/chaone/chaone: Mach-O 64-bit executable x86_64

The installer seems to forget to copy one file:

$ sudo cp ap_pos_rule.xml /usr/local/chaone/
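
Since chaone fails at run time when a data file is missing, a quick check of the install directory can save some debugging. The following sketch uses a hypothetical function name, and the list of files to check is up to the caller. The two names in the usage line are only the ones mentioned in this README, not a complete list of what chaone needs.

```shell
# Report which of the given files are missing from DIR.
# Returns 0 only if every file is present.
# Usage: check_chaone_data /usr/local/chaone chaone.xsl ap_pos_rule.xml
check_chaone_data() {
    dir=$1
    shift
    missing=0
    for f in "$@"; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f"
            missing=1
        fi
    done
    return "$missing"
}
```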

prepare speakers and unidic-chasen:

$ ls ~/work/galatea/speakers-060820/
female01	male01
$ ls ~/work/galatea/unidic-chasen1312_eucj/
ChangeLog	chadic.lex	grammar.cha	table.cha
cforms.cha	chasenrc	license.txt
chadic.da	chasenrc_chaone	manual.pdf
chadic.dat	ctypes.cha	matrix.cha

copy and build jagtalk:

$ git clone https://nishimotz@github.com/nishimotz/jagtalk.git
$ cd jagtalk
$ make -f Makefile.MACOSX

check the files below (modify them if necessary):

$ cat test-jagtalk-macosx.sh 
cat 00-testcmd | ./jagtalk -C jagtalk-macosx.conf
$ cat 00-testcmd 
set Text = 123
set SaveWAV = _out.wav
set Run = EXIT
$ cat jagtalk-macosx.conf
# configuration file for gtalk (GalateaTalk)
# macosx: http://en.nishimotz.com/galateatalk

CHASEN: /usr/local/bin/chasen
CHAONE: /usr/local/chaone/chaone -s gtalk --encoding EUC-JP
CHASEN-RC: ./chasenrc-euc-jp-macosx

# default for numbers and alphabets
NUMBER: DECIMAL
ALPHABET: WORD
DATE: YMD
TIME: hms

# dictionary
DICTIONARY: ./gtalk-eucjp.dic

# automatic play of synthesized speech
AUTO-PLAY: NO

# time delay [msec] for automatic play
AUTO-PLAY-DELAY: 250

# file of phoneme list
PHONEME-LIST: mono.lst

# parameter files for each speaker
SPEAKER-ID: female01
GENDER: female
DUR-TREE-FILE:   ../../galatea/speakers-060820/female01/tree-dur.inf
PIT-TREE-FILE:   ../../galatea/speakers-060820/female01/tree-lf0.inf
MCEP-TREE-FILE:  ../../galatea/speakers-060820/female01/tree-mcep.inf
DUR-MODEL-FILE:  ../../galatea/speakers-060820/female01/duration.pdf
PIT-MODEL-FILE:  ../../galatea/speakers-060820/female01/lf0.pdf
MCEP-MODEL-FILE: ../../galatea/speakers-060820/female01/mcep.pdf
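
The speaker files are referenced by relative paths, so a wrong working directory is an easy mistake. The sketch below lists the *-FILE entries of a configuration whose paths do not exist. The function name is hypothetical, and it assumes that the paths are resolved from the directory in which jagtalk is started, which this README does not state explicitly. Other file entries such as DICTIONARY and PHONEME-LIST are not checked.

```shell
# List the *-FILE entries of a gtalk configuration whose paths do not exist.
# Paths are tested relative to the current directory (an assumption).
# Returns 0 only if every listed file is present.
# Usage: missing_model_files jagtalk-macosx.conf
missing_model_files() {
    rc=0
    while read -r key value; do
        case $key in
            *-FILE:)
                if [ ! -f "$value" ]; then
                    echo "$key $value"
                    rc=1
                fi ;;
        esac
    done < "$1"
    return "$rc"
}
```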

chasenrc-euc-jp-macosx contains EUC-JP characters.

$ cat chasenrc-euc-jp-macosx
;;
;;  chasenrc for unidic / chaOne
;;
(GRAMMAR /Users/nishimotz/work/galatea/unidic-chasen1312_eucj)
(DADIC chadic)

(UNKNOWN_POS (名詞 普通名詞 一般))

(OUTPUT_FORMAT "<W1 orth=\"%m\" kana=\"%?U/%m/%y0/\" pron=\"%?U/%m/%a0/\" pos=\"%U(%P-)\"%?T/ cType=\"%T \"//%?F/ cForm=\"%F \"//%?I/ %i0//>%m</W1>\n")

(OUTPUT_COMPOUND "SEG")

(BOS_STRING "<S>\n")      
(EOS_STRING "</S>\n")

(DEF_CONN_COST 10000)
(POS_COST
	((*)       1)
	((UNKNOWN) 30000)
)

(CONN_WEIGHT 1)
(MORPH_WEIGHT 1)
(COST_WIDTH  0)

(ANNOTATION
	(("<" ">") "%m\n")
	(("\"") "<cha:W1 orth=\"&#x22;\" kana=\"&#x22;\" pron=\"&#x22;\" pos=\"%U(%P-)\"%?T/ cType=\"%T \"//%?F/ cForm=\"%F \"//%?I/ %i//>%m</cha:W1>\n")
)

Run the test script, then open _out.wav in QuickTime Player. You should hear 'hyaku ni juu san' (123 in Japanese).

$ sh test-jagtalk-macosx.sh
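
To synthesize other text without editing 00-testcmd, the same three commands can be generated on the fly. This is a minimal sketch with a hypothetical function name. With the configuration shown above, non-ASCII text would presumably have to be converted to EUC-JP first (for example with nkf -e), which is an assumption based on the --encoding EUC-JP setting.

```shell
# Print a gtalk command sequence that synthesizes TEXT into WAVFILE and exits.
# Usage:
#   gtalk_wav_commands 123 _out.wav | ./jagtalk -C jagtalk-macosx.conf
gtalk_wav_commands() {
    printf 'set Text = %s\n' "$1"
    printf 'set SaveWAV = %s\n' "$2"
    printf 'set Run = EXIT\n'
}
```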

jagtalk now uses the Mac audio device:

$ sh run-jagtalk-macosx.sh
$ cat 00-testcmd-speaker 
set Speak.syncinterval = 500
set Text = 123456789
set Speak = NOW
galateatalk.txt · Last modified: 2011/10/11 10:40 by nishimotz