International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 07 Issue: 04 | Apr 2020 www.irjet.net p-ISSN: 2395-0072
VIRTUAL VOICE ASSISTANT
Ravikumar N R1, Prateek C2, Sathvik Bhandar3, Rahul Kumar4, Mayura D Tapkire (Assistant Professor)5
1-5Department of Information Science and Engineering, National Institute of Engineering, Mysuru, India.
---------------------------------------------------------------------***----------------------------------------------------------------------
Abstract - In the 21st century, almost everything is automated, including the things we use daily: washing machines, dishwashers, refrigerators, bus doors, air-conditioning systems, and appliances controlled with a single click. In this fast-moving world, the present study proposes the concept of a voice-controlled device that recognizes a person's voice, processes the request, and assigns the time and date of an appointment based on that request, with details such as the name of the person, the date, the time, and other related information. We need to develop devices with built-in voice recognition that can recognize the voice even in crowded surroundings, with the voice as the only form of interaction between the device and the human. The device captures audio through its microphone, processes the query made by the human, and replies with the appropriate result. For example, if you ask the device to change the wallpaper of your personal computer, it will download a wallpaper from a website and change it. It can also guide you through the traffic between a source and a destination and automatically suggest routes with less traffic and shorter travel times.

Key words: automated, dishwashers, recognizes, crowded, website, destination.

1. INTRODUCTION

Voice recognition technology is evolving rapidly and is expected to become the default input method not only for smartphones but also for cars and other home appliances such as TVs and refrigerators. Owing to the characteristics of voice input, including the implicit verbalization of commands, privacy and acceptability issues may affect the usage and adoption of voice-based interfaces.

Several researchers have been interested in the recognition of human activities in recent years. In this project, we propose a voice recognition system that recognizes human activities through a deep learning algorithm. Voice is essentially a mode of communication that lets users communicate with each other. Voice recognition, also known as Automatic Speech Recognition (ASR), identifies spoken words and phrases and translates them into a machine-readable format. The system takes the user's input in the form of voice or text, processes it, and returns feedback in different ways, such as the action to be performed or the search result for the end user. Hence, there is also the additional challenge of distinguishing spoken words from noise in the audio. Modern mobile technology has been very useful to consumers, as it provides access to various apps and resources from anywhere in the world. Some of the most widely used mobile operating systems are Android, iOS, Windows and BlackBerry, and each provides different services to the user. Most systems allow the user to train the software to understand their voice so that it can translate speech to text more precisely. Google has created "Google Assistant" and Apple has created "Siri" to respond to users' voice commands in an effective manner. A mobile voice assistant helps users communicate with AI in a very meaningful way. By using natural language processing and machine learning, Amazon's Echo has enhanced the engagement of individuals with AI technology.

Speech recognition is an alternative to keyboard typing: simply put, you talk to the machine and your words appear on the screen. It has been developed to provide a simple way to write on a computer and can support people with a range of disabilities. It is helpful for users with hand disabilities who often find it difficult, painful or impossible to type. Voice-recognition applications can also support people with spelling problems, including those with dyslexia, since correctly recognized words are almost always spelled correctly. Scientists have used text generated online by people to train voice assistants to listen and respond to our requests in a more natural and meaningful way. Voice assistants can decipher questions that have been phrased in a variety of different ways and interpret what the user most likely wants.

2. LITERATURE REVIEW

This research is part of a larger project concerning virtual voice assistants, informed by theories of human-machine interaction. Speech recognition has a brief history with numerous waves of innovation. Voice recognition for dictation, search and voice commands has become a vital feature of personal devices such as wearables and smartphones. This system was developed as a human-like application that performs the necessary language processing, sends messages, and uses built-in applications by processing the commands given by the user to the system. Smartphones were the quickest to adopt such assistants, followed by other wearable devices, so many vendors have introduced built-in virtual voice assistants, underlining the importance of adopting and applying multiple smart technologies.
This system has some basic features, most importantly mailing and, secondly, a calendar, where the user has the privilege of sending mail and creating the required events by giving voice commands. For instance, with artificial intelligence we are able to turn off the lights without an explicit instruction from the user. Almost everyone has some knowledge of trending voice assistants such as Cortana for Windows and Siri for Apple users. These virtual voice assistants are not as intelligent as Iron Man's Jarvis from the superhero movies, but the intended behaviour is very similar: you ask a question and within a fraction of a second you get an answer. You simply give a command and get a result.

Here are some notable features of the virtual voice assistant:

Open any website in the browser: If users need to open a website, they just have to say "open nameofwebsite.com" or "open website.org", for example "open xhsj.com" or "hey, please open zzz.com".

Play a song in the VLC media player: Ask the voice assistant to play a desired song in the VLC media player. For instance, the user says "can you please play me a song", and the bot replies "what song shall I play, Sir/Madam?". The assistant then fetches the requested track from YouTube to the local drive and streams it in the VLC media player; when the user plays a new song, the previously downloaded music is automatically deleted.

Scan the headlines: Ask the voice assistant to read out the daily headlines from a news application, where the user has the privilege of selecting topics of his or her own choice.

Send email: If the user's command contains the word "email", the voice assistant asks for a recipient. If the user answers "abc", the assistant searches the phone's contact library for that person's details and then opens an email with the recipient's name filled in.

Tell the current time: Users can ask the voice assistant for the current time. For instance, to "what's the time right now" the assistant reports the current time for your time zone, e.g. "the current time is 1.14 p.m.".

Report the weather and temperature anywhere in the world: The voice assistant can report the weather for the day and can also give the minimum and maximum temperature of any city across the world. The user just has to give a command such as "what is the current weather in Mysore" or "tell me the current weather in India", and the results arrive within a fraction of a second.

Answer your questions: Ask the voice assistant for interesting facts or the latest facts, have it solve basic mathematical problems, or simply ask it for a joke.
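As an illustration of how such features could be wired together, the following Python sketch maps a recognized command string to a handler for two of the features above (opening a website and telling the time). It uses only the standard library; the keyword matching and the handler names are our own illustrative assumptions, not part of the original system.

```python
import webbrowser
from datetime import datetime

def open_website(command: str) -> str:
    # Assume the site name is the last word of the command, e.g. "open zzz.com".
    site = command.split()[-1]
    url = site if site.startswith("http") else "https://" + site
    webbrowser.open(url)  # launch the default browser
    return "Opening " + url

def tell_time(command: str) -> str:
    # Report the local time in the "1.14 p.m." style used above.
    return "The current time is " + datetime.now().strftime("%I.%M %p").lstrip("0")

# Very simple keyword-based intent matching; a real assistant would use NLP.
HANDLERS = {
    "open": open_website,
    "time": tell_time,
}

def handle(command: str) -> str:
    for keyword, handler in HANDLERS.items():
        if keyword in command.lower():
            return handler(command)
    return "Sorry, I did not understand that command."

if __name__ == "__main__":
    print(handle("what's the time right now"))
    print(handle("open zzz.com"))
```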
3. METHODOLOGY

3.1 General Structure

Considering the overall research, the voice application will be used in three ways: firstly, to command the computer; secondly, to input information into the computer; and finally, to communicate with other people. In this section we discuss the general components of a voice application. As seen in Figure 1, the application is divided into four parts: the front-end interface, the end users, the voice recognition system, and the dictionary and text file database. Each part is explained as follows.

Front-End Interface

Through the front-end interface the user has direct access to the system; it communicates with users by providing input and output through graphics and an icon-based menu. It receives the user's voice input, passes it to the voice recognition system for detection, and usually returns voice feedback to the user after the commands have been completed by the other functions of the system.

FIGURE 1: General Structure of the Voice Application.

End Users

End users are simply the users of the device. They use the application on their personal devices, such as mobile phones and laptops, for voice communication and feedback.

Voice Recognition System

This is the heart of the voice application system: it has the ability to understand the voice input given by the user, make the application work efficiently, and generate voice feedback for the user. It is an important component because it acts as the user's gateway for using his or her voice as an input device. In a nutshell, to clearly understand the user's voice commands and to give feedback from the system, the voice recognition system contains all the processing by which the application turns speech signals into text data and extracts some form of meaning from the speech.
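As a concrete illustration of this component, the sketch below converts microphone input to text in Python using the third-party SpeechRecognition and PyAudio packages with Google's free web recognizer. This is only one possible realization, not necessarily the one used in this system.

```python
# pip install SpeechRecognition pyaudio
import speech_recognition as sr

def recognise_once() -> str:
    """Capture one utterance from the default microphone and return it as text."""
    recogniser = sr.Recognizer()
    with sr.Microphone() as source:
        recogniser.adjust_for_ambient_noise(source)  # compensate for background noise
        print("Listening...")
        audio = recogniser.listen(source)
    try:
        # Google's free web speech API; offline engines such as Sphinx could be used instead.
        return recogniser.recognize_google(audio)
    except sr.UnknownValueError:
        return ""  # speech was unintelligible
    except sr.RequestError:
        return ""  # recognition service unreachable

if __name__ == "__main__":
    print("You said:", recognise_once())
```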
Dictionary and Text File Database

Depending on the type of device and the requirements of the user, the application needs to support a few specific input types or provide particular voice feedback. For the language side, the application can give additional explanations to the users or provide functions based on the files in the database. In addition, the system needs an extra text file database to be installed so that the application can be extended and updated in different cases.
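To make the role of this component concrete, the sketch below loads a small command dictionary from a JSON text file and looks up a spoken phrase in it. The file name and its structure are purely illustrative assumptions, not the database format described in the paper.

```python
import json
from pathlib import Path

# Hypothetical dictionary file mapping trigger phrases to action names, e.g.
# {"open website": "open_browser", "current time": "tell_time"}
DICTIONARY_FILE = Path("commands.json")

def load_dictionary():
    """Load the phrase-to-action dictionary; return an empty one if the file is missing."""
    if DICTIONARY_FILE.exists():
        return json.loads(DICTIONARY_FILE.read_text(encoding="utf-8"))
    return {}

def lookup(command_text, dictionary):
    """Return the action registered for the first trigger phrase found in the text."""
    for phrase, action in dictionary.items():
        if phrase in command_text.lower():
            return action
    return None

if __name__ == "__main__":
    actions = load_dictionary()
    print(lookup("what is the current time", actions))
```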
3.2 System Architecture

The overall design consists of the following phases:

1) Collection of data in speech format.

2) Analysis of the voice and conversion to text.

3) Storage and processing of the data.

4) Speech generation from the processed text output.

FIGURE 2: System Architecture of the Voice-Controlled Personal Assistant.

The data collected in speech form is stored and used as input for the next phase of the process. In the next phase, the voice input is processed continuously and converted into text using speech-to-text (STT). In the third phase, the converted text is analysed by a Python script, which processes it and identifies the action to be taken for the command. In the last phase, after the action to be taken has been identified, the output is produced by text-to-speech (TTS) conversion.
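The four phases can be sketched as a single Python loop. The example below combines the SpeechRecognition package for the STT phase with the pyttsx3 package for offline TTS; the keyword matching in the middle stands in for the Python script that identifies the action and is our own simplification.

```python
# pip install SpeechRecognition pyaudio pyttsx3
from datetime import datetime

import pyttsx3
import speech_recognition as sr

recogniser = sr.Recognizer()
speaker = pyttsx3.init()  # offline text-to-speech engine

def listen() -> str:
    """Phases 1 and 2: capture speech and convert it to text."""
    with sr.Microphone() as source:
        audio = recogniser.listen(source)
    try:
        return recogniser.recognize_google(audio).lower()
    except (sr.UnknownValueError, sr.RequestError):
        return ""

def decide(text: str) -> str:
    """Phase 3: identify the action for the command (simplified keyword matching)."""
    if "time" in text:
        return "The current time is " + datetime.now().strftime("%I.%M %p")
    if "hello" in text:
        return "Hello, how can I help you?"
    return "Sorry, I cannot do that yet."

def speak(text: str) -> None:
    """Phase 4: generate speech from the text output."""
    speaker.say(text)
    speaker.runAndWait()

if __name__ == "__main__":
    speak(decide(listen()))
```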
3.3 Data Flow Sequence

FIGURE 3: Data Flow Sequence.

3.3.1 Initialize Device: Device initialization performs whatever steps are necessary to get the system into a working state; the unit is set in motion by calling its name. The process is specific to every device, and there are no magic values that would initialize any device you come across.

3.3.2 Service Manager: It analyses the command and matches it against the Web service adapter and the cloud server. The semantic description is used to match the command to the Web service adapter.

3.3.3 Task Manager: The speech-to-text and text-to-speech transfers are done by the task manager. It provides a semantic description of what was spoken, which is passed to the service manager.

3.3.4 Device: For speech output, a natural language generation (NLG) component and a text-to-speech (TTS) component are used.

3.3.5 Firebase Cloud Server: Firebase is a Backend-as-a-Service (BaaS) that hosts a plethora of APIs for performing certain tasks. The command passed in by the Web service adapter is analysed and matched, and the corresponding Python script is run.

3.3.6 Web Service Adapter: It calls the correct Firebase Cloud Service adapter based on the command.

3.3.7 Execute Command: Run the respective Python script once a match has been found for the given order.
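A minimal sketch of the last two steps is given below: a recognized command is matched against a table of registered actions and the corresponding Python script is executed as a child process. The script names and the matching rule are illustrative assumptions; the actual adapter described above dispatches through Firebase Cloud services.

```python
import subprocess
import sys

# Hypothetical mapping from command keywords to the Python scripts that implement them.
SCRIPTS = {
    "wallpaper": "change_wallpaper.py",
    "email": "send_email.py",
    "weather": "report_weather.py",
}

def execute_command(command_text: str) -> None:
    """Find the script registered for the command and run it as a separate process."""
    for keyword, script in SCRIPTS.items():
        if keyword in command_text.lower():
            # Pass the full command on so the script can extract its own arguments.
            subprocess.run([sys.executable, script, command_text], check=False)
            return
    print("No script registered for:", command_text)

if __name__ == "__main__":
    execute_command("what is the weather in Mysore")
```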
4. APPLICATIONS

Voice-enabled devices provide a wide variety of services, ranging from simple commands such as giving the weather of a place, general information from Wikipedia and movie ratings from IMDb, to setting an alarm or a reminder, creating a to-do list and adding items to the shopping list so that we do not forget them when we go shopping. The device can also read books to the user or play music from a streaming service, depending on the device provider or the user's preference, and it can play videos from YouTube or other streaming services. A recent study reports that voice assistants are also being used to assist public interactions with the government, and that their use in call centres reduces the workload on human operators by about 30%.
5. LIMITATIONS

Devices that use the human voice for interaction take single commands, usually single phrases, as input. When commands become ambiguous, the resulting actions can be misinterpreted by the device. There is only one-way communication between the user and the device, because the device cannot talk back to ask for clarification. The applications on the device cannot report the state of a process, i.e. whether it is ongoing or completed. There are many cases where only specific tasks are allowed to be carried out by voice-enabled devices; for example, a stove top cannot, and should not, be turned on when there is no one in the kitchen or the house. The devices cannot integrate context data. They do not log any history of the queries made, but they can be trained to learn the user's behaviour and usage statistics and give recommendations to the user according to the time, the place, or other calculated parameters.

6. CONCLUSION

Voice-controlled devices use natural language processing to process the language spoken by the human, understand the query, process it, and respond with the result. This understanding means that artificial intelligence needs to be integrated with the device so that it can work in a smart way, control IoT applications and devices, and respond to queries by searching the web for results and processing them. The system is designed to minimize human effort and to control the device with the human voice alone. The device can also be designed to interact with other intelligent voice-controlled devices and IoT applications, fetch the weather report of a city from the Internet, send an email to a client, add events to the calendar, and so on. The accuracy of such devices can be increased by using machine learning, categorizing queries into particular result sets and using them for further queries; their accuracy has improved rapidly over the last decade. The devices can also be designed to accept commands in two languages and respond in the language in which the user made the query, and they can be designed to help visually impaired people.