A Comparison of Synthetic and Human Speech: An Evaluation by English As A Foreign Language Students in A Public Costa Rican University
Revista Comunicación. Año 44, vol. 32, núm. 2, julio-diciembre 2023 (pp. 41-58)
…of EFL students regarding human and artificial-intelligence voices. It also explores students' opinions about listening instruction. This research was carried out from April to September 2022 and included 36 EFL students enrolled in a Bachelor's degree in English or in English Teaching at a public Costa Rican university. A quantitative survey design was used. The researcher collected the responses through a survey designed to gather students' perceptions of computer-generated voices, human voices, and listening instruction. The data were analyzed quantitatively using descriptive statistics. The data analysis indicates that: 1) students find human voices more appealing than AI-generated voices; 2) students consider female voices more appealing than male voices when they are computer-generated; 3) AI-generated voices share some characteristics that students find more appealing; and 4) the current policies and materials for listening instruction in the language program should be reexamined. Consistent with the literature reviewed, these results show that although TTS voices do not draw students' attention as much as human voices do, part of the population finds computer-generated voices interesting. The analysis also suggests that part of the student population cannot fully distinguish between human and computer-generated voices; therefore, their use may be appropriate in some contexts. Finally, the results confirm that listening-instruction policies and materials should be revised to improve students' language-acquisition processes.
from being exposed to materials that are more contextualized to their needs, more adequate to their level, and more appropriate to their interests. Students can also be exposed to AI speakers of various accents, ages, or genders if needed. Therefore, having this variety and adaptation may enrich the class dynamics in the ESL class.

This paper is divided into five distinct sections. The introduction describes the importance and potential benefits of including TTS audios in the context of English as a second language or foreign language (ESL/EFL). The literature review presents the most relevant concepts of artificial intelligence (AI) and natural language processing (NLP). It also introduces some core concepts related to the human voice, TTS theory, and listening instruction. The methods section describes the participants, materials, methodology, procedure, and data collection and interpretation steps. The results section offers a statistical analysis of the data collected. Finally, the discussion summarizes some possible limitations and proposes the main results of the study and their implications in the field of language teaching, particularly listening instruction.

Aims

The article aims to compare students' perceptions of synthetic and human speech. This article also focuses on the perceived differences or similarities between male and female voices.

REVIEW OF THE LITERATURE

The role of listening instruction has been extensively studied in recent years. However, to the best of the researcher's knowledge, studies comparing AI-generated audio and human-created audio in ESL settings are scarce. This literature review summarizes some of the main concepts related to this study. It does not intend to be comprehensive but to provide an overview of the central aspects of speech synthesis and language instruction, particularly the listening skill.

Artificial Intelligence and Natural Language Processing

Artificial intelligence is the simulation of human intelligence in machines designed to think and act like humans. It also involves creating machines that can learn from data, make predictions, make decisions, and perform tasks that would typically require human intelligence, such as visual perception, speech recognition, decision-making, and translating (Abbott, 2020; Arora, 2022; Cameron, 2019; Jeste et al., 2020; Kent, 2022). There are various types of AI. Narrow or weak AI is designed to perform a single task (Gulson et al., 2022; Kindersley, 2023), while general or strong AI can perform any intellectual task that a human can (Kindersley, 2023; Mitchell, 2019). Artificial intelligence is used in various applications, such as self-driving cars, virtual personal assistants, and biometric authentication methods. In addition, AI research aims to create systems capable of performing tasks that typically require human intelligence, such as understanding natural language, recognizing images, playing games, and solving complex problems (Jeste et al., 2020).

Natural Language Processing (NLP) is a subset of artificial intelligence that focuses on the interaction of computers and humans through natural language (Kochmar, 2022; McRoy, 2021). It involves the development of models and algorithms that can analyze, comprehend, and produce human language. Natural Language Processing is used in various applications, such as sentiment analysis, online searches, predictive text, and machine translation (Adamopoulou & Moussiades, 2020; Luo et al., 2022). It has also been instrumental in advancing virtual assistants and chatbots. In addition, NLP techniques are based on a combination of computer science, linguistic theory, and machine learning (Raaijmakers, 2022). The goal of NLP is to create systems that can accurately compute and analyze large amounts of data and use this information to perform specific tasks. Overall, NLP is a rapidly growing field that has the potential to revolutionize how computers and humans interact and has a wide range of practical applications, including customer service, marketing, healthcare, etc.

The Human Voice

The human voice is the sound produced by the vibration of the vocal folds in the larynx. Sound waves are produced by the vibration of the vocal folds, which travel through the oral and nasal cavities to produce speech or singing; these waves interact with our articulators (tongue, jaw, teeth, etc.) to produce specific sounds
(Calais-Germain & Germain, 2016). The human voice is a powerful and unique tool for communication and self-expression. It can convey a wide range of emotions and has been used throughout history for interacting, storytelling, singing, and other forms of artistic expression (Karpf, 2006).

The study of the human voice, including its production and perception, is known as voice science or phonetics (Akmajian et al., 2017). Voice science deals with the sound and quality of the voice, which in turn is influenced by several factors, including age, gender, physical attributes, and emotional state. This field of study is essential for understanding how the voice works and developing techniques for improving vocal health and performance, helping people with trouble speaking, or developing techniques and strategies to help students learn a new language.

In addition to its role in communication and expression, the human voice also plays an essential role in identity and socialization, as it is frequently used to convey personal and cultural information (Norton & Toohey, 2011). Thus, the human voice is a complex and unique aspect of human physiology and behavior and continues to be studied by scientists and artists alike. It provides essential information such as gender, personality, accent, race, and emotion, among other aspects (Nass & Brave, 2005). However, the role of the voice as an instrument that carries a message has been frequently overlooked (Karpf, 2006). Listening exercises frequently focus more on the quality of the audio in general than on the characteristics of the voice, and research about the role of the human voice in learning remains scarce (Craig & Schroeder, 2019).

Assistive and text-to-speech technology

Assistive technology refers to tools, devices, or software created to help people with disabilities perform tasks that they would otherwise be unable to perform or may complete with difficulty (Emiliani & Association for the Advancement of Assistive Technology in Europe, 2009). These technologies can aid people with several disabilities, including physical, sensory, and cognitive impairments (Bouck, 2017; Cook, 2019; Dell et al., 2017; Green, 2018). Examples of hardware assistive technology include adaptive computer hardware, such as large print keyboards, mouse devices, screen magnifiers, and adapted joysticks for individuals with mobility or dexterity issues. Examples of software assistive technology include screen readers and TTS software for individuals who are blind or have low vision.

Although screen readers and TTS systems are similar, they also have some differences. A screen reader is a type of assistive technology that reads out loud the text on a computer screen. It is primarily designed to help individuals who are blind or have low vision access the information and functions of a computer (Evans & Blenkhorn, 2008). In this case, the program reads what is already on the screen. Text-to-speech is an advanced technology that converts written text into speech. One of its main goals is to be very similar or even indistinguishable from the human voice (Dutoit, 1997). It uses natural language processing and speech synthesis to generate human-like speech from input text (Taylor, 2009). The output speech can be played back using speakers or headphones or stored as an audio file. For instance, unlike screen readers, a user can deliberately enter text to be read. This first user can modify the text and how it will be presented to the end user. Text-to-speech technology is commonly used for accessibility purposes, for individuals with visual impairments, and for various applications in fields such as education, entertainment, and business (Narayanan & Alwan, 2005).

Text-to-speech technology breaks down written text into words and phrases and then uses a computer-generated voice to read them aloud. The process of TTS typically involves the following stages: text analysis (the text is analyzed and processed to determine pronunciation, rhythm, and stress patterns), voice synthesis (a computer-generated voice is created by concatenating or piecing together segments of pre-recorded speech), and speech production (the processed text is combined with the generated voice to produce spoken language) (Hersh et al., 2008; Holmes & Holmes, 2001).

Several studies have found potential benefits from using TTS in language classes or have found no significant differences between using a synthetic or human voice (Bione et al., 2017; Cardoso et al., 2015; Craig & Schroeder, 2019; Hillaire et al., 2019; Kang et al., 2008) and the possibility of more interactive models where people can keep conversations in real time with machines (Kumar et al., 2023). In addition, TTS systems may use various techniques to improve the quality and naturalness of the generated speech, such as adjusting the rhythm and intonation to match that of a human speaker or adding natural-sounding pauses and inflections. Since this technology is rather new, it is constantly evolving and improving (Chen et al., 2023; Wang et al., 2023). Therefore, the accuracy and quality of TTS systems can vary widely, depending on factors such as the complexity of the text, the quality of the voice synthesis, and the sophistication of the TTS algorithms used.

Teaching Listening

Teaching listening skills involves providing students with opportunities to practice and develop their ability to understand spoken language. It also includes strategies such as providing opportunities for authentic listening, using varied listening materials, and incorporating interactive activities (Brace et al., 2006; Ur, 2012). Teaching listening skills also requires dedication and a focus on the process and the outcome. Thus, regular practice and ongoing feedback are essential for helping students improve their listening abilities.

Listening has historically been viewed as a receptive skill (Field, 2011; Harmer, 2013). To understand a listening passage, the listener uses their linguistic abilities and schemata. In this regard, one of the main difficulties for ESL students is making sense of the sound system of English, especially if they are learning it as adults (Field, 2011; Nation & Newton, 2009). On the other hand, the topics used in language classes should consider students' schemata. A schema is a cognitive framework or mental model that helps us organize and understand information (Brown & Lee, 2015; Harmer, 2013). Schemata can refer to general concepts or mental structures about the world or specific knowledge structures about a particular topic or situation. For example, we may have a schema about what a typical car looks like, which helps us understand and categorize new information about cars we encounter in real-life situations or through written, pictorial, or audio messages. Therefore, the learner's linguistic proficiency and schemata are crucial when decoding the message.

In addition, some other aspects may constrain students' listening comprehension. Some of these limitations are related to language features or contextual characteristics of the message. For example, the speed of delivery (Brown & Lee, 2015; Ur, 2012) or the speakers' accent, especially if no adequate training has been previously provided (Charpentier-Jiménez, 2019; Derwing & Munro, 2015; Field, 2011; Harmer, 2007), may limit students' processing time and frustrate their attempts to decode the message. Additionally, the type of vocabulary (Hadfield & Hadfield, 2008; Watkins, 2010) and the level of formality (Hadfield & Hadfield, 2008) could slow down students' ability to comprehend the message. On the one hand, the words, expressions, or grammar used could be too specific, elaborate, or technical for students to understand. On the other hand, language could be too colloquial and culturally bound, making understanding the message more challenging.

Another aspect to consider is the message and its characteristics. For example, audio input should present students with authentic input while considering various task types and audio formats (Brown & Lee, 2015; Burgess & Head, 2005; Celce-Murcia et al., 2010). Content is another aspect professors should examine (Harmer, 2007). These aspects make finding voice recordings more difficult for professors. Despite the myriad possibilities the Internet brings, audio recordings do not always adapt to students' levels, the desired task, the appropriate accent, or the content under study. Moreover, professors should consider aspects like length, audio recording quality, or any other aspect that interferes with the message, such as background noise (Watkins, 2010), since the audio input should provide students with an appropriate model to imitate (Patel & Jain, 2008).

Finally, the advancement of AI and text-to-speech systems has proven effective in improving language learning. A study by Al-Jarf (2022) highlighted notable improvements in decoding skills, reading fluency, and pronunciation accuracy when using these tools, although there was no significant enhancement in vocabulary knowledge. Additionally, the integration of AI-driven techniques in ELT has been instrumental in boosting motivation and fostering heightened learner engagement. As highlighted by Anis (2023), learners experience heightened involvement due to the effects of adaptive instruction, intelligent tutoring systems, and personalized learning applications. These innovative
approaches not only stimulate motivation but also encourage active participation in language-related activities. Furthermore, as Moybeka et al. (2023) emphasize, text-to-speech applications serve as pivotal tools in dismantling language barriers, leading to a more inclusive and equitable approach to English language education. TTS also offers a unique advantage, assisting students in refining their listening and reading proficiencies (Hartono et al., 2023). Text-to-speech tools could be a valuable asset in acquainting students with a diverse range of accents, further enriching their auditory experience and understanding of the language (Fitria, 2023).

This literature review presents some main concepts related to using TTS in ESL classes. The researcher must grant that some concepts related to TTS systems or listening instruction have been purposely left aside as they do not directly relate to the objective of this study. However, this omission does not limit or impair the findings of this paper.

METHODS

Participants

This study includes Costa Rican university students enrolled in their second language course. The researcher visited the students' oral classes to invite them to participate. The participants were selected because they were currently enrolled in an oral course in their second academic year. Their proficiency level corresponds to B1-B2. Thirty-six participants were willing to participate; however, they did not receive monetary compensation for their participation. All participants speak Spanish as their first language.

Materials

The materials include written consent, a listening script (see Appendix 1), four different audios (see Appendix 3), the software, the necessary equipment for the listening part, and an electronic survey (see Appendix 2) to collect participants' answers. The written consent was sent to participants electronically before their participation, and a checkbox labeled "agree to terms and conditions" was included to certify voluntary participation in the study. The listening script used was Comma Gets a Cure, a diagnostic passage for dialect and accent that can be used freely without special permission. At no point in the study did students have access to the script. The four different audios included this same passage. To record the audio, the software Speechelo was used. Speechelo is a paid, AI-enabled TTS and voiceover application that turns text into human-sounding voiceovers (BlasterOnline, 2023). It can also create audio in 23 languages. It was chosen because of its quality and the number of audios it has available. Two of the audios were read by a male and a female human, both native American English speakers. Students were not informed that some audios could be computerized, as this could have biased their perception. The other two audios were read in American English by a male and a female AI voice using Speechelo. All audios were encoded in MP3 format. Participants listened to the audio using noise-canceling, over-the-ear headphones, the Bose QC35 Series II, which guarantees optimal listening conditions. These headphones were wirelessly connected to a different audio system to avoid any interference with students' answers. Finally, the survey was divided into four sections: a) demographic information, b) participants' perceptions of voice recordings in English classes, c) the evaluation of the AI or human audio, and d) an optional open-ended question. The survey used two question formats: forced-choice and open-ended questions. Except for the open-ended question, items included Likert scales for all sections. For example, some items asked the participants to rate the audio quality in their English classes. These items were placed on a 5-point Likert scale that ranged from 1 (Very poor) to 5 (Very good). This format, or a similar one, was also used for other questions.

The last part of the survey contained one optional, open-ended question. This question invited participants to add any other comments they believed were relevant to the study. The total time to complete the survey was estimated at 10 to 15 minutes.

PROCEDURE

This study used a quantitative survey design. First, the researcher selected an appropriate text to create the audio recordings. The text was selected because it is copyright-free and normally used in language analysis. The researcher then pilot-tested several female and male AI voices with ten participants from the same affiliation
as the target population. This stage aimed to extract the two voices that sounded more human-like. The AI voices chosen (Mathew and Grace) were fed the proposed text. These two voices were chosen from a list of 17 voices offered by the software. Although Speechelo allows the user to add breathing and pauses, among other changes, the audios were not modified in any way. The human voices were professional voiceover actors. The speakers also read the same text, and their voices were in no way altered.

After preparing the materials, the researcher created the survey. The survey included sections about participants' demographic information, their perception of audio quality in English classes, a list of ten descriptors to evaluate the four voiceovers, and an open-ended question. To explore participants' perceptions of audio recordings, the list of ten descriptors was extracted from a list of 17. This list was compiled by the researcher, considering the most common characteristics associated with vocal features (Memon, 2020; Paz et al., 2022). Some items from the initial list were discarded since they did not fit the study's scope (i.e., background noise, length, and volume, among others). By default, some of these features were either objectively the same in all audios or could be adjusted by the participants. All students had access to a sample survey before their appointment.

Finally, during the data-gathering stage, participants were summoned to a vacant office with a silent environment. Students were able to choose their appointments at their own convenience. The researcher provided written and oral instructions to all participants. Participants used noise-canceling, over-the-ear headphones to minimize any background noise during this stage. Although participants could listen to the audio more than once, no student asked to listen again. Participants' answers were collected through an anonymous electronic survey that was partially completed while listening to the audio. Other sections did not require the audio to be completed.

Data Processing and Analysis

The original data set in Excel format (xls) was subjected to computational analysis using the Statistical Package for the Social Sciences (SPSS), Version 26. The data was derived from participants' survey answers. The analysis included descriptive statistics, where percentages, nominal data, and the standard deviation, among other basic statistics, were computed to compare participants' opinions about the audios and their listening training.

ANALYSIS OF THE RESULTS

The following summary of the results presents the main findings of the study in four distinct sections. The first section includes the participants' demographic information. The second section compares the four voiceovers based on participants' ratings. The third section summarizes the main features under analysis and their ratings. Finally, the fourth section describes participants' general perceptions of the audios used during English classes and the type of listening instruction they received.

Demographic Information

Of the 36 study participants, 27 (75.00%) were female, and eight (22.22%) were male. One participant (2.78%) chose to be identified as non-binary. Overall, 33 participants (91.67%) reported being between the ages of 18 and 24, while two (5.56%) were between 25 and 34. Only one participant (2.78%) was between 35 and 44. All participants are native Spanish speakers and study English as a foreign language. Regarding studies, the study participants are enrolled in the BA in English (n = 29, 80.56%) or English Teaching (n = 6, 16.67%). Only one participant reported studying both majors (n = 1, 2.78%). The two majors share the same core language courses, including oral courses. All of the participants are currently in their second or third year.

Participants' ratings of voiceovers

The following analysis examines participants' voiceover preferences. Table 1 shows that participants' voiceover ratings can be analyzed from two perspectives. First, participants had a slight preference for female voices. Although the difference was almost non-existent when comparing human voices, the female AI was more than six points above the male AI. Second, participants showed a marked preference for human voices. Even though all maximum grades were at or above 90, the minimums for AI voices were below 55, while human voices exceeded the 70 threshold. The
standard deviation also shows that, when evaluating human voices, ratings tend to be more uniform. However, ratings are more spread when evaluating AI, indicating that, although AI voices ranked lower than human voices, they appealed to part of the population. This was especially evident when overlapping those results with the mean and maximum grades.

Table 1. Summary of participants' perceptions of each voiceover: mean and standard deviation

Participants' voiceover rating per criteria

To analyze participants' perceptions of each audio, ten criteria were chosen. As previously stated, some criteria from the initial list of 17 were discarded. While listening to each audio, participants used a five-point Likert scale to rate one of the audios, which were evenly and randomly assigned to them. Table 2 presents a summary of the main findings of this section.

Table 2. Summary of participants' ratings of each criterion: means of raw data and percentage
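The descriptive statistics reported in Tables 1 and 2 (means, standard deviations, and percentage scores) were produced in SPSS, but the same measures are straightforward to compute directly. The following is a minimal Python sketch using hypothetical Likert ratings, not the study's actual data; the variable names and the conversion of the mean to a percentage of the maximum scale point are illustrative assumptions.

```python
from statistics import mean, stdev

# Hypothetical 5-point Likert ratings (1 = Very poor ... 5 = Very good)
# for one criterion of one voiceover; NOT the study's actual data.
ratings = [4, 5, 3, 4, 4, 2, 5, 4, 3, 4]

m = mean(ratings)     # arithmetic mean of the raw scores
sd = stdev(ratings)   # sample standard deviation
pct = (m / 5) * 100   # mean expressed as a percentage of the scale maximum

print(f"mean = {m:.2f}, SD = {sd:.2f}, percentage = {pct:.2f}%")
# prints: mean = 3.80, SD = 0.92, percentage = 76.00%
```

A larger standard deviation for a given voiceover, as in the AI ratings discussed above, indicates that participants' opinions of that voice were more spread out around the mean.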
According to Table 2, some criteria show more contrast, while others are more similar. In terms of similar characteristics, speed (paused – fluent) (SD = 2.13) and vocal variety (unfriendly – friendly) (SD = 2.87) are three or fewer points apart from each other. In both cases, participants perceive the friendliness and fluency of the voice as good and very good, respectively. On the other hand, some criteria were different. For example, according to the participants' answers, intonation (monotonous – varied) (SD = 18.68) and vocal variety (strained – natural) (SD = 12.17) are characteristics that show great variation, favoring human voices. Another characteristic worth mentioning is voice quality (harsh – pleasant) (SD = 11.90). In this last case, the variation occurred mainly because of the perceived harshness of the male AI voice.

In addition, some criteria had higher or lower marks overall. For example, voice quality (unclear – clear) (93.33%) and speed (slow – fluent) (93.33%) were the two highest-ranked criteria for AI voices. In the case of human voices, voice quality (unclear – clear) (94.44%) and general audio quality (unintelligible – clear) (94.44%) were the highest. This shows that overall voice quality (unclear – clear) was the characteristic that appealed most to participants. On the other hand, AI voices ranked the lowest in intonation (monotonous – varied) (53.33%) and vocal variety (does not convey emotion – conveys emotion) (58.89%), while human voices ranked the lowest in speed (unvaried – varied) (77.78%) and vocal variety (does not convey emotion – conveys emotion) (73.33%). On average, vocal variety (does not convey emotion – conveys emotion) was the characteristic participants found most unappealing.

Participants' perceptions of audio use and instruction

Participants' perceptions of audio quality and instruction were analyzed based on six survey questions. First, the survey considered listening instruction. Eight participants (22.22%) ranked listening instruction as very good, while 17 (47.22%) ranked it as good. Eleven participants (30.56%) mentioned that listening instruction was acceptable. No participant classified listening instruction as poor or very poor. In addition, 17 (47.22%) participants considered listening instruction important and very important. Only two participants (5.56%) mentioned that listening instruction was moderately important. No participant believed that listening instruction was slightly important or not important. These results show that participants recognize the importance of listening instruction in ESL settings; however, a significant number of participants consider that listening instruction needs improvement.

Additionally, the survey requested that participants evaluate the audio quality and quantity during language classes. Concerning quality, the results were varied. Six participants (16.67%) rated the audio quality as very good, and ten (27.78%) labeled it as good. The majority of the participants (n = 11; 30.56%) considered the audios acceptable, while nine (25.00%) mentioned that the audios were poor in quality. No participant ranked them as very poor. In regard to the audio quantity, eight participants (22.22%) believed it was very good. The same number of participants (n = 14; 38.89%) ranked the audio quantity as good or acceptable. No participant chose poor or very poor for this section. These results demonstrate that the use of audio should be revised, especially in terms of quality.

Finally, participants were asked about the challenges they faced when listening to the audios. The first question asked participants if the overall audio quality (background noise or music, static, etc.) had ever increased the difficulty level of an audio exercise in language classes at the university. Most participants answered affirmatively (n = 28; 77.78%). Only two participants (5.56%) mentioned that overall audio quality had not been an issue. Six participants (16.67%) did not remember any event where overall audio quality had been an issue. The second question asked participants whether the speaker's voice (accent, volume, speed, etc.) had ever increased the difficulty level of an audio exercise in language classes at the university. Most participants answered affirmatively (n = 29; 80.56%). Five participants (13.89%) answered that this had never been an issue. Only two participants (5.56%) claimed not to remember any instance where the voice quality hindered their understanding. The findings demonstrate that participants do not always consider that audios are appropriate for ESL settings.

The strengths and weaknesses of each audio were a recurring theme in the responses to the optional, open-ended question. The following examples summarize participants' opinions concerning the audio.

Example 3. It sounds kind of robotic sometimes, but it's acceptable enough. (Participant 6, Female AI voice)

Example 7. The audio is clear but the voice is too robotic and does not sound natural. (Participant 14, Male AI voice)

Example 13. It's very pleasant, however, it's (sic) feels rushed and even though it certainly has emotion, it's not necessary (sic) to be overlly (sic) happy nor too excited. It's very fluent yet it feels like some air is neccesary (sic) in order to continue the reading. Pretty good though. (Participant 20, Female human voice)

Example 19. Speed was a bit quick for a short story, maybe a little bit of excitement would be good. (Participant 33, Male human voice)

In general, participants also commented on the quality of the headphones used. According to the participants' comments, they were very pleased with the equipment. The equipment used during listening instruction was beyond the scope of this study; however, it should be

…ing one specific piece of software. Other software may include voices that are more appealing to students or have a more remarkable resemblance to human voices. On the part of human voices, the researcher used professional voiceover experts. They have the necessary equipment and record in a professional studio. Although this was done with the intention of replicating the null environment of AI voices, not all audios used in ESL classes share similar characteristics. Finally, AI audios can be manipulated. In the present study, the audios were not manipulated to standardize procedures. However, using the source software or a third-party application, modified audio may improve AI audios. These modifications may also alter the results from one study to the next. Therefore, the results included in this study cannot be generalized but should serve as a base for future research.

Researchers should replicate this study in other ESL settings or other types of TTS software. For example, not all institutions or professors may have access to the same software. Although TTS free software exists, its quality and number of available voices may not compare to paid software. In addition, future research should consider other types of environments in which noise and background noise may play a part in regular
considered in future research on listening instruction or listening instruction. Further research should also deter-
AI voiceovers. mine whether other characteristics or criteria may trig-
This investigation shed light on how students perceive ger other results.
human and AI voices. It also discussed the different
criteria used to rank voices in ESL environments. Fi- CONCLUSIONS
nally, it described students’ perceptions of audio use
Although this study’s findings are not generalizable
and instruction.
beyond the study sample, several conclusions can be
drawn from the analysis of the results. First, AI voices
DISCUSSION are not yet at the same level as human voices. In gen-
Limitations and Future Directions eral, human voices are preferred over human voices;
however, this does not imply that AI voices should not
This study has three main limitations worth mentioning. be used. Some students did not notice that they were
First, the number and type of criteria used are limited. listening to a non-human voice; even human voices,
Only ten criteria were used, and other options could recorded by experts and with professional equipment,
have been considered for this study. However, due to were criticized in some aspects. In addition, AI voices
time constraints, the ten most relevant choices, accord- cannot be used in all scenarios and contexts. For ex-
ing to the researchers’ pilot test, were included. Other or ample, AI voices are limited since they cannot create
more criteria could trigger different results. Second, the role-plays, dialogues, or other interactive communica-
researcher used four types of audio. The selection was tive instances without a lot of human intervention, at
based on the results of the pilot test for AI voices us- least not with the type of TTS software used. As AI
50
Revista Comunicación. Año 44, vol. 32, núm. 2, julio-diciembre 2023 (pp. 41-58)
voices are not as appealing as human voices, they can still be used to generate instructions for listening exercises, provide audio support for readings (especially for visually impaired students), give people who have lost their voices for medical reasons the ability to communicate orally, or create introductions or summaries of listening exercises. Finally, AI voices may be modified to play a more pedagogical role by providing extra audio input or audio prompts for students to discuss various topics.

Second, AI voices do not fall behind in all criteria. This information may be useful for two populations. On the one hand, people who program TTS applications may seek to adjust, to the best of current technological capabilities, those characteristics that mark AI voices as non-human. On the other hand, language professors and material developers may take advantage of this information and include AI or human voices according to their specific needs. For example, instead of having a human record specific audios for beginners, a professor may decide to use AI voices when students' main challenge is speed, a feature that is easily adjusted in a computer-generated environment. Conversely, audio that requires enthusiasm, emotion, or varied intonation may be better suited to human voices. In addition, AI voices may be useful where resources and exposure to real-life language are limited. Although the Internet is an excellent source of audio input, finding suitable audios for students' specific needs (accent, speed, topic, duration, vocabulary or grammar level, etc.) may be time-consuming or virtually impossible, even before considering that some audios may be subject to copyright laws.

The results of this study call for a revision of the program's listening instruction. Although students recognize the importance of listening instruction, they perceive some weaknesses in the instruction they receive. In particular, a relevant group of students considers that the number of audios, their quality, and the general quality of instruction are areas that need improvement. The results do not indicate that these areas need to be completely restructured; however, they point to a systematic revision of current policies and materials to provide students with better and more substantial exposure to auditory input. Current policies and materials should also be examined to guarantee that students are exposed to audios according to their level and needs. When students perceive that audios pose additional challenges created by static, unnecessary background noise or music, volume, or accent, among other factors, they may develop negative feelings towards listening exercises. Nevertheless, this does not mean that students should not be challenged. Students may face real-life situations where some of these added difficulties are present; however, institutions should develop clear guidelines to provide students with materials appropriate for their level, age, or other conditions.

It is important to remember that users of TTS software, including Speechelo, can adjust pitch level, breathing, speed, and emphasis to make voices sound more natural. Although Speechelo was created with video creators in mind, its use may provide opportunities to improve students' language learning capabilities. The author suggests that other language programs replicate this study to examine other possible uses of TTS software or to test its possible improvement in the coming years.

BIBLIOGRAPHICAL REFERENCES

Abbott, R. (2020). The Reasonable Robot: Artificial Intelligence and the Law (1st ed.). Cambridge University Press. https://doi.org/10.1017/9781108631761

Adamopoulou, E., & Moussiades, L. (2020). An Overview of Chatbot Technology. In I. Maglogiannis, L. Iliadis, & E. Pimenidis (Eds.), Artificial Intelligence Applications and Innovations (Vol. 584, pp. 373–383). Springer International Publishing. https://doi.org/10.1007/978-3-030-49186-4_31

Akmajian, A., Farmer, A. K., Bickmore, L., Demers, R. A., & Harnish, R. M. (Eds.). (2017). Linguistics: an introduction to language and communication (7th ed.). The MIT Press.

Al-Jarf, R. (2022). Text-to-speech software for promoting EFL freshman students' decoding skills and pronunciation accuracy. Journal of Computer Science and Technology Studies, 4(2), 19–30.

Anis, M. (2023). Leveraging Artificial Intelligence for Inclusive English Language Teaching: Strategies and Implications for Learner Diversity. Journal of Multidisciplinary Educational Research, 12(6). http://ijmer.in.doi./2023/12.06.89

Arora, V. (2022). Artificial intelligence in schools: a guide for teachers, administrators, and technology leaders. Routledge.

Bione, T., Grimshaw, J., & Cardoso, W. (2017). An evaluation of TTS as a pedagogical tool for pronunciation instruction: the 'foreign' language context. In K. Borthwick, L. Bradley, & S. Thouësny (Eds.), CALL in a climate of change: adapting to turbulent global conditions – short papers from EUROCALL 2017 (pp. 56–61). Research-publishing.net. https://doi.org/10.14705/rpnet.2017.eurocall2017.689

BlasterOnline. (2023). Speechelo [Computer software]. Romania. https://app.blasteronline.com/speechelo/

Bouck, E. C. (2017). Assistive technology. Sage Publications.

Brace, J., Brockhoff, V., Sparkes, N., & Tuckey, J. (2006). Speaking and listening map of development: addressing current literacy challenges (2nd ed.). Rigby-Harcourt Education.

Brown, H. D., & Lee, H. (2015). Teaching by principles: an interactive approach to language pedagogy (4th ed.). Pearson Education.

Burgess, S., & Head, K. (2005). How to teach for exams. Longman.

Calais-Germain, B., & Germain, F. (2016). Anatomy of voice: how to enhance and project your best voice (1st U.S. ed.). Healing Arts Press.

Cameron, R. M. (2019). A.I. - 101: a primer on using artificial intelligence in education. Publisher not identified.

Cardoso, W., Smith, G., & Garcia Fuentes, C. (2015). Evaluating text-to-speech synthesizers. Critical CALL – Proceedings of the 2015 EUROCALL Conference, Padova, Italy, 108–113. https://doi.org/10.14705/rpnet.2015.000318

Celce-Murcia, M., Brinton, D., & Goodwin, J. M. (2010). Teaching pronunciation: a course book and reference guide (2nd ed.). Cambridge University Press.

Charpentier-Jiménez, W. (2019). University students' perception of exposure to various English accents and their production. Actualidades Investigativas En Educación, 19(2), 1–27. https://doi.org/10.15517/aie.v19i2.36908

Chen, L. W., Watanabe, S., & Rudnicky, A. (2023). A vector quantized approach for text to speech synthesis on real-world spontaneous speech. arXiv preprint arXiv:2302.04215.

Cook, A. M. (2019). Assistive technologies: principles and practice (5th ed.). Elsevier.

Craig, S. D., & Schroeder, N. L. (2019). Text-to-Speech Software and Learning: Investigating the Relevancy of the Voice Effect. Journal of Educational Computing Research, 57(6), 1534–1548. https://doi.org/10.1177/0735633118802877

Dell, A. G., Newton, D. A., & Petroff, J. G. (2017). Assistive technology in the classroom: enhancing the school experiences of students with disabilities (3rd ed.). Pearson.

Derwing, T. M., & Munro, M. J. (2015). Pronunciation fundamentals: evidence-based perspectives for L2 teaching and research. John Benjamins Publishing Company.

Dutoit, T. (1997). An introduction to text-to-speech synthesis. Kluwer Academic Publishers.

Emiliani, P. L., & Association for the Advancement of Assistive Technology in Europe (Eds.). (2009). Assistive technology from adapted equipment to inclusive environments: AAATE 2009. Washington, DC: IOS Press.

Evans, G., & Blenkhorn, P. (2008). Screen Readers and Screen Magnifiers. In M. A. Hersh, M. A. Johnson, & D. Keating (Eds.), Assistive technology for visually impaired and blind people. Springer.
Field, J. (2011). Psycholinguistics. In J. Simpson (Ed.), The Routledge handbook of applied linguistics (1st ed.). Routledge.

Fitria, T. N. (2023). English Accent Variations of American English (AmE) and British English (BrE): An Implication in English Language Teaching. Sketch Journal: Journal of English Teaching, Literature and Linguistics, 3(1), 1–16.

Green, J. L. (2018). Assistive technology in special education: resources to support literacy, communication, and learning differences (3rd ed.). Prufrock Press, Inc.

Gulson, K. N., Sellar, S., & Webb, P. T. (2022). Algorithms of education: how datafication and artificial intelligence shape policy. University of Minnesota Press.

Hadfield, J., & Hadfield, C. (2008). Introduction to teaching English (1st publ.). Oxford University Press.

Harmer, J. (2007). How to teach English (New ed., 6th impr.). Pearson/Longman.

Harmer, J. (2013). The practice of English language teaching: with DVD (4th ed., 8th impr.). Pearson Education.

Hartono, W. J., Nurfitri, N., Ridwan, R., Kase, E. B., Lake, F., & Zebua, R. S. Y. (2023). Artificial Intelligence (AI) Solutions in English Language Teaching: Teachers-Students Perceptions and Experiences. Journal on Education, 6(1), 1452–1461.

Hersh, M. A., Johnson, M. A., Keating, D., & Hoffmann, R. (Eds.). (2008). Speech, Text and Braille Conversion Technology. In Assistive technology for visually impaired and blind people. Springer.

Hillaire, G., Iniesto, F., & Rienties, B. (2019). Humanising Text-to-Speech Through Emotional Expression in Online Courses. Journal of Interactive Media in Education, 2019(1), 12. https://doi.org/10.5334/jime.519

Holmes, J. N., & Holmes, W. (2001). Speech synthesis and recognition (2nd ed.). Taylor & Francis.

Honorof, D., McCullough, J., & Somerville, B. (n.d.). Comma Gets A Cure | IDEA: International Dialects of English Archive. https://www.dialectsarchive.com/comma-gets-a-cure

Jeste, D. V., Graham, S. A., Nguyen, T. T., Depp, C. A., Lee, E. E., & Kim, H.-C. (2020). Beyond artificial intelligence: exploring artificial wisdom. International Psychogeriatrics, 32(8), 993–1001. https://doi.org/10.1017/S1041610220000927

Kang, M., Kashiwagi, H., Treviranus, J., & Kaburagi, M. (2008). Synthetic speech in foreign language learning: an evaluation by learners. International Journal of Speech Technology, 11(2), 97–106. https://doi.org/10.1007/s10772-009-9039-3

Karpf, A. (2006). The human voice: how this extraordinary instrument reveals essential clues about who we are (1st U.S. ed.). Bloomsbury Publishing.

Kent, D. (2022). Artificial intelligence in education: fundamentals for educators. Kotesol DDC.

Kindersley, D. (2023). Simply Artificial Intelligence. DK Publishing.

King, M. R., & chatGPT. (2023). A Conversation on Artificial Intelligence, Chatbots, and Plagiarism in Higher Education. Cellular and Molecular Bioengineering, 16(1), 1–2. https://doi.org/10.1007/s12195-022-00754-8

Kochmar, E. (2022). Getting started with Natural Language Processing. Manning Publications.

Kumar, Y., Koul, A., & Singh, C. (2023). A deep learning approaches in text-to-speech system: a systematic review and recent research perspective. Multimedia Tools and Applications, 82, 15171–15197. https://doi.org/10.1007/s11042-022-13943-4

Luo, B., Lau, R. Y. K., Li, C., & Si, Y. (2022). A critical review of state-of-the-art chatbot designs and applications. WIREs Data Mining and Knowledge Discovery, 12(1). https://doi.org/10.1002/widm.1434

McRoy, S. (2021). Principles of natural language processing. Susan McRoy.
Memon, S. A. (2020). Acoustic Correlates of the Voice Qualifiers: A Survey (arXiv:2010.15869). arXiv. https://doi.org/10.48550/arXiv.2010.15869

Mitchell, M. (2019). Artificial intelligence: a guide for thinking humans. Farrar, Straus and Giroux.

Moybeka, A. M., Syariatin, N., Tatipang, D. P., Mushthoza, D. A., Dewi, N. P. J. L., & Tineh, S. (2023). Artificial Intelligence and English Classroom: The Implications of AI Toward EFL Students' Motivation. Edumaspul: Jurnal Pendidikan, 7(2), 2444–2454.

Narayanan, S. S., & Alwan, A. (Eds.). (2005). Text to speech synthesis: new paradigms and advances. Prentice Hall Professional Technical Reference.

Nass, C. I., & Brave, S. (2005). Wired for speech: how voice activates and advances the human-computer relationship. MIT Press.

Nation, I. S. P., & Newton, J. (2009). Teaching ESL/EFL listening and speaking. Routledge.

Norton, B., & Toohey, K. (2011). Identity, language learning, and social change. Language Teaching, 44(4), 412–446. https://doi.org/10.1017/S0261444811000309

Patel, M. F., & Jain, P. M. (2008). English language teaching: (methods, tools & techniques). Sunrise Publishers & Distributors.

Paz, K. E. D. S., Almeida, A. A., Behlau, M., & Lopes, L. W. (2022). Descritores de qualidade vocal soprosa, rugosa e saudável no senso comum. Audiology - Communication Research, 27, e2602. https://doi.org/10.1590/2317-6431-2021-2602

Raaijmakers, S. (2022). Deep learning for natural language processing. Manning Publications Co.

Taylor, P. A. (2009). Text-to-speech synthesis. Cambridge University Press.

Ur, P. (2012). A course in English language teaching (2nd ed.). Cambridge University Press.

Wang, C., Chen, S., Wu, Y., Zhang, Z., Zhou, L., Liu, S., & Wei, F. (2023). Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111.

Watkins, P. (2010). Learning to teach English: a practical introduction for new teachers (Reprinted). Delta Publishing.
APPENDIX 1
APPENDIX 2
Survey
The purpose of this survey is to determine current practices in pronunciation instruction and the use of audio recordings and their perceived quality.

This survey should take no more than 10 minutes of your time. All answers are anonymous. Your participation in this brief survey is greatly appreciated.
Best regards,
I. Demographic Information
The University of Costa Rica does not discriminate on the basis of sexual orientation, gender identity or expression, age, or national origin. In order to track the reach and effectiveness of our learning experiences and ensure we consider the needs of all, please consider the following questions:
II. Answer the following questions taking into account any exposure you have had to the use of audio recordings during your major.
5. On a scale of 1 to 5, with 1 being poor and 5 being excellent, how would you rate listening instruction in the major?
1-2-3-4-5
6. On a scale of 1 to 5, with 1 being not important and 5 being very important, how important is listening instruction to you?
1-2-3-4-5
7. Has overall audio quality (background noise or music, static, etc.) ever increased the difficulty level of an audio exercise in language classes at the university?
Yes
No
I don’t remember.
8. Has the speaker's voice (accent, volume, speed, etc.) ever increased the difficulty level of an audio exercise in language classes at the university?
Yes
No
I don’t remember.
9. On a scale of 1 to 5, with 1 being poor and 5 being excellent, how would you rate the quantity of listening exercises in your oral courses?
1-2-3-4-5
10. On a scale of 1 to 5, with 1 being poor and 5 being excellent, how would you rate the quality of audios in your oral courses?
1-2-3-4-5
III. On a scale of 1 to 5, how would you rate this audio? Use the words at both ends to guide your answer.
15. How would you define the audio quality of this recording?
Unintelligible or Poor 1-2-3-4-5 Clear or Excellent
APPENDIX 3