Raspberry Pi Continuous Speech to Text

You are giving a speech, get stuck midway, and don't know what to say next. No worries: here is an artificially intelligent system that helps you out. Predictive Speech with Raspberry Pi and deep learning uses a Raspberry Pi and a microphone to record your speech. Its core is an LSTM (Long Short-Term Memory) model, trained either on the speaker's own previous speeches (behavioral cloning) or on a sample speech dataset, and deployed to IBM Watson Cloud. Finally, a pygame window displays the speaker's words as well as the predicted ones.

Here is the schematic representation of the workflow of this project. We will go step-by-step and dive into each topic separately.

Content

  • Audio Recognition
    • Raspberry Pi
    • Microphone
    • Python Code
  • Speech To Text
  • Artificial Intelligence
    • Dataset
    • Pre-processing
    • The Model
    • Training
    • Prediction
  • Deployment
    • IBM-Watson
    • ML model
    • Prediction
  • Pygame
    • Speech Output
    • Predict Output

Audio Recognition

1. Raspberry Pi

Raspberry Pi is a small single-board computer. It most commonly runs Raspbian (now called Raspberry Pi OS), but it supports many other operating systems as well. It is used in a wide range of hardware projects, from home automation to self-driving car prototypes. We will use it to record audio continuously from the microphone.

2. Microphone

In this tutorial, a USB microphone will be used, but feel free to explore other options. The advantage of a USB microphone is that Raspbian detects it easily. To get started, we first need to install the PyAudio library and then the SpeechRecognition library.

The SpeechRecognition library relies on PyAudio for microphone access, so PyAudio must be installed first. PyAudio reads the microphone and lets us record audio directly from Python code.

pip install pyaudio
pip install SpeechRecognition

The SpeechRecognition library captures the audio and translates it to text using the Google speech API. It also has helpful functions that adjust for ambient noise and record audio in a separate background thread. These features are very useful in audio processing. It is recommended to go through its documentation to learn more.

3. Python Code

Recording the audio happens in a loop that acts as the main control of the program, so all other functions will be called from inside this loop.

STEP-1 : Create a new python file and import the following libraries.

import speech_recognition as sr          

STEP-2 : The pyaudio library need not be imported. Now declare the recognizer and microphone objects:

# speech recognizer initialization
r = sr.Recognizer()
m = sr.Microphone()

STEP-3 : The code for the main loop to record audio continuously:

start = True  # loop control flag; set it to False to stop recording

while start:
    try:
        with m as source:
            # Samples the ambient noise (1 second by default) to calibrate the recognizer
            r.adjust_for_ambient_noise(source)

            print("start speaking")
            audio = r.listen(source)

    except sr.RequestError as e:
        print("Could not request results; {0}".format(e))

    except sr.UnknownValueError:
        print("unknown error occurred")
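  • r.adjust_for_ambient_noise(source) samples the background noise for one second by default and calibrates the recognizer. Because it sits inside the loop, it runs once before every recording.
  • sr.UnknownValueError, the second exception, is raised when the recognizer cannot make out any speech in the recording.
  • The advantage of using this library is that it starts recording when you speak and stops when you are done.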

Speech To Text

Each time the speaker finishes a sentence, it is recorded in an audio object, which is then converted to text using the Google speech-to-text API. The conversion is fast and generally accurate. The library's default API key is used, so a personal key is not necessary. There are other APIs as well, such as Bing and IBM; try them and see for yourself. The speech-to-text conversion is itself an extensively trained machine learning model and need not be understood in detail for this project.

After listening to the audio through the microphone, in the same function, we pass it through the API and print the output.

Mytext = r.recognize_google(audio)
Mytext = Mytext.lower()

print(Mytext)

Artificial Intelligence

Today the world is heading towards a point where IoT and AI work together as one discipline, and combined they can achieve a great deal. To bring that idea into this project, we will add an AI element to it. To help the speaker, the system predicts what they might say next and displays it in a dialogue box. Creating and training an ML model requires a dataset, an algorithm, and a clearly defined output. We will go through all of these in detail as we code.

1. Dataset

Each AI system has to learn from something initially. We use raw data containing predetermined inputs and the desired outputs, and the model passes over this data repeatedly to improve its predictions. In our project, we can use either the speaker's previous speeches or simply a set of sample speeches. The dataset is supplied in text format.

# the dataset can be any text file containing strings
filename = "wonderland.txt"

raw_text = open(filename, 'r', encoding='utf-8').read()

raw_text = raw_text.lower()

2. Pre-processing

Each model requires the input data to be in a certain format. Our aim is to use the speaker's last hundred characters to predict the next character they might say. We then take the last ninety-nine characters plus the predicted one and predict the next character again. We repeat this until we have predicted a hundred new characters that the speaker might say. So we need a sequence as input and a single character as output. Start by importing the NumPy library.

STEP-1: Define the variables of our dataset

# find number of distinct characters our dataset contains
chars = sorted(list(set(raw_text)))

# give each character a particular number and vice versa
char_to_int = dict((c, i) for i, c in enumerate(chars))
int_to_char = dict((i, c) for i, c in enumerate(chars))
  • Since our model understands numbers and not characters, the char_to_int mapping is necessary.
  • The output of our model will also be a number, and the int_to_char mapping is needed to convert it back to a character.
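To make the two mappings concrete, here is a minimal sketch on a toy string. The string "hello world" is only a stand-in for raw_text and is an assumption made for illustration.

```python
# Toy text standing in for raw_text (an assumption for illustration only).
raw_text = "hello world"

# Sorted list of the distinct characters in the text.
chars = sorted(set(raw_text))

# Character -> integer and integer -> character lookups.
char_to_int = {c: i for i, c in enumerate(chars)}
int_to_char = {i: c for i, c in enumerate(chars)}

# Encode the text to integers, then decode it back to a string.
encoded = [char_to_int[c] for c in raw_text]
decoded = "".join(int_to_char[i] for i in encoded)

print(chars)
print(encoded)
print(decoded)
```

Decoding the encoded list gives back the original string, which is the property the prediction step later relies on.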

STEP-2 : Define the size of the dataset, both in total characters and in vocabulary (distinct characters).

n_chars = len(raw_text)
n_vocab = len(chars)

STEP-3 : Convert the string input into sequences of integers, each of length 100. Convert each output into a single integer that represents the character that comes right after the input sequence.

seq_length = 100
dataX = []
dataY = []

for i in range(0, n_chars - seq_length, 1):
    seq_in = raw_text[i:i + seq_length]
    seq_out = raw_text[i + seq_length]
    # The input data
    dataX.append([char_to_int[char] for char in seq_in])
    # The output data
    dataY.append(char_to_int[seq_out])

n_patterns = len(dataX)
print("total patterns :", n_patterns)
  • If the dataset contains 101 characters then the loop will run 1 time, if it contains 110 characters then it will run 10 times.
  • The length of the input data is the total number of patterns formed; in the second example above it will be 10.
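The sliding window is easier to follow on a tiny example. The sketch below wraps the same loop in a hypothetical helper called make_patterns (not part of the original code) and runs it on a nine-character string with a window of four.

```python
def make_patterns(text, seq_length, char_to_int):
    """Slide a window over text; each window predicts the character after it."""
    data_x, data_y = [], []
    for i in range(len(text) - seq_length):
        seq_in = text[i:i + seq_length]
        seq_out = text[i + seq_length]
        data_x.append([char_to_int[c] for c in seq_in])
        data_y.append(char_to_int[seq_out])
    return data_x, data_y


# Toy data (assumption): 9 characters and a window of 4 give 9 - 4 = 5 patterns.
toy_text = "abcabcabc"
toy_map = {c: i for i, c in enumerate(sorted(set(toy_text)))}
toy_x, toy_y = make_patterns(toy_text, 4, toy_map)

print(len(toy_x))          # number of patterns
print(toy_x[0], toy_y[0])  # first window "abca" and the character after it, "b"
```

The number of patterns is always the text length minus the window length, which matches the 101 and 110 character examples above.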

Import a keras package to perform further processing:

from keras.utils import np_utils          

STEP-4: Reshape the input and then normalize it. Convert the output to categorical data. So if there are 10 distinct characters, a single output will not be an integer but an array of length 10 filled with zeros, except for a one at the index that matches the integer.

X = np.reshape(dataX, (n_patterns, seq_length, 1))
# Normalization
X = X / float(n_vocab)

y = np_utils.to_categorical(dataY)
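As a rough illustration of what this step produces for a single sample, here is a plain-Python sketch. The helpers one_hot and normalize are hypothetical stand-ins written for this example; the tutorial itself uses NumPy and the Keras to_categorical utility.

```python
def one_hot(index, n_classes):
    """Return a list of zeros with a single 1.0 at the given index."""
    row = [0.0] * n_classes
    row[index] = 1.0
    return row


def normalize(sequence, n_vocab):
    """Scale integer codes into [0, 1) and give each timestep one feature."""
    return [[value / float(n_vocab)] for value in sequence]


# Toy vocabulary of 4 characters (assumption for illustration).
n_vocab_demo = 4
target = one_hot(2, n_vocab_demo)
inputs = normalize([0, 1, 3], n_vocab_demo)

print(target)  # [0.0, 0.0, 1.0, 0.0]
print(inputs)  # [[0.0], [0.25], [0.75]]
```

Each input timestep becomes a one-element list, which corresponds to the trailing 1 in the (n_patterns, seq_length, 1) shape.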

Final output after pre-processing :

3. The Model

We will use deep learning to create our model, with LSTM (Long Short-Term Memory) layers as its building blocks. LSTM is used because it is well suited to sequential data, where each prediction depends on what came before. Each cell in an LSTM has three gates that decide which information to keep, update, or pass on. Refer to this YouTube video if you want to learn more about LSTMs and recurrent networks: https://www.youtube.com/watch?v=8HyCNIVRbSU

The Keras library and its Sequential API are used to build our neural network. Start by importing all the required libraries:

from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM
from keras.callbacks import ModelCheckpoint
  • Dense is a fully connected layer.
  • Dropout randomly sets a fraction of the units to 0 during training, which helps prevent overfitting.

The Sequential model:

model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(256))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
  • Since there are many classes we use categorical cross-entropy as loss function.
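To see what the softmax output and the categorical cross-entropy loss compute for one sample, here is a small plain-Python sketch. The three scores and the one-hot target are made-up values for illustration; Keras performs the same calculation internally on whole batches.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(y_true, y_pred):
    """Loss for one sample: minus the log of the probability of the true class."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)


# Made-up scores for a 3-character vocabulary; the true character is class 0.
probs = softmax([2.0, 1.0, 0.1])
loss = categorical_crossentropy([1.0, 0.0, 0.0], probs)

print(probs)
print(loss)
```

The loss shrinks towards zero as the probability assigned to the correct character approaches 1, which is what training tries to achieve.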

4. Training

The number of epochs needed depends on the dataset. For now, we will train for 50 epochs with a batch size of 64. We will also use a callback that reports the loss and saves the model's weights at the end of an epoch.

STEP-1 : Write the code for the callback that saves the model at the end of every epoch in which the loss has improved.

filepath = "weights/weights-improvement-{epoch:02d}-{loss:.4f}-bigger.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
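The placeholders in the file path are ordinary Python format fields that get filled in with the epoch number and the loss. The sketch below shows the resulting file name and creates the target folder beforehand; whether the folder must already exist may depend on the Keras version, so creating it is a precaution assumed here. The helper name prepare_checkpoint_dir is hypothetical, and the demo runs inside a temporary directory.

```python
import os
import tempfile


def prepare_checkpoint_dir(filepath_template):
    """Create the folder part of a checkpoint path if it does not exist yet."""
    folder = os.path.dirname(filepath_template)
    if folder:
        os.makedirs(folder, exist_ok=True)
    return folder


template = "weights/weights-improvement-{epoch:02d}-{loss:.4f}-bigger.hdf5"

# How the placeholders expand for epoch 3 with a loss of 1.9871.
example_name = template.format(epoch=3, loss=1.9871)
print(example_name)  # weights/weights-improvement-03-1.9871-bigger.hdf5

# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    folder = prepare_checkpoint_dir(os.path.join(tmp, template))
    folder_created = os.path.isdir(folder)
print(folder_created)
```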

STEP-2: Fit the model with the specified parameters. The training time will vary with your system, so if your machine is not powerful enough, consider using Google Colab or another cloud computing service.

model.fit(X, y, epochs = 50, batch_size=64, callbacks=callbacks_list)          

5. Prediction

For now, we will run predictions locally to test the model, and if it is good enough we will upload it to the cloud. For prediction, either create a new file and load all the variables or simply use the same file. Create a new function, predict, which takes the input text and the dataset variables and returns the predicted text.

def predict(text_in, char_to_int, int_to_char, model, n_vocab):
    text1 = []
    out = ''
    # converts the string input to numbers
    for i in text_in:
        text1.append(char_to_int[i])

    # each loop predicts one character therefore 100
    for i in range(100):
        text = np.reshape(text1, (1, len(text1), 1))
        # Normalize the input
        text = text / float(n_vocab)
        predict = model.predict(text, verbose=0)
        index = np.argmax(predict)
        result = int_to_char[index]
        out += result
        text1.append(index)
        text1 = text1[1:len(text1)]

    return out
  • Since the softmax activation gives a probability for every character, np.argmax picks the index with the highest probability, which is our predicted character.
  • Each output integer is converted back to its character and added to the output string.
  • The last two lines of the loop append the predicted index and drop the oldest one, so the prediction becomes part of the next input to the model.
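The window update can be tried out without a trained model. In the sketch below, fake_predict is a hypothetical stand-in for model.predict that always favours the next letter of a three-letter vocabulary; the function generate and all values are assumptions for illustration only.

```python
def generate(seed, char_to_int, int_to_char, predict_fn, n_steps):
    """Predict n_steps characters, feeding each prediction back into the window."""
    window = [char_to_int[c] for c in seed]
    out = ""
    for _ in range(n_steps):
        probs = predict_fn(window)
        # Index of the highest probability (what np.argmax does).
        index = max(range(len(probs)), key=probs.__getitem__)
        out += int_to_char[index]
        # Drop the oldest character and append the prediction.
        window = window[1:] + [index]
    return out


# Hypothetical stand-in for a trained model: favours the letter after the last one.
vocab = "abc"
c2i = {c: i for i, c in enumerate(vocab)}
i2c = {i: c for i, c in enumerate(vocab)}


def fake_predict(window):
    nxt = (window[-1] + 1) % len(vocab)
    return [0.8 if i == nxt else 0.1 for i in range(len(vocab))]


generated = generate("ab", c2i, i2c, fake_predict, 5)
print(generated)  # cabca
```

The window keeps a constant length throughout, which matters because the model was trained on fixed-length sequences.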

If the prediction output seems reasonable for your dataset, proceed to deployment. Otherwise, consider training the model for more epochs, or vary the complexity and hyper-parameters of the model by trial and error.

Deployment

An IoT project like this one benefits from cloud computing. A Raspberry Pi has limited memory and processing power, so running the LSTM model on it would be slow and could exhaust its resources. Even with more capable embedded systems, offloading the model to the cloud is recommended. Now, all we need to do is create cloud credentials, deploy our model file there, and request predictions from it.

1. IBM-Watson

In this tutorial, we will use the IBM Watson cloud platform to deploy our model. Google Cloud and DigitalOcean are popular alternatives, and if you are comfortable with another platform, feel free to use it instead. To get started with deployment, we first need to create an IBM Cloud account.

STEP-1 : Create an IBM Cloud account.

STEP-2 : Catalog -> Services -> AI -> Machine Learning

STEP-3 : New credential, and then copy the credential to your local system.

Install the Watson library on your local system.

pip install watson_machine_learning_client          

Create a new Python file solely for deploying the model, import the library, and paste in the credentials:

from watson_machine_learning_client import WatsonMachineLearningAPIClient

wml_credentials = {
    "url": "xxxxxxxxxxxxxxxxxx",
    "apikey": "xxxxxxxxxxxxxxxxxxxxxxx",
    "instance_id": "xxxxxxxxxxxxxxxxxxxxxx"
}

# A client with the credentials
client = WatsonMachineLearningAPIClient(wml_credentials)

2. ML model

Compress the model before deployment. Import the required libraries for compression in the same file.

from contextlib import suppress
import os

Now we compress the file using the os module. The output file must be a "model.tgz" file.

# compressing model
filename = 'weights/model.h5'
# Compression to .tgz format
tar_filename = 'weights/model' + '.tgz'
cmdstring = 'tar -zcvf ' + tar_filename + ' ' + filename
os.system(cmdstring)
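As an alternative sketch, the same archive can be built with Python's standard tarfile module, which avoids assembling a shell command and also works where the tar command is unavailable. The helper name compress_model is hypothetical, the demo uses a placeholder file in a temporary directory, and storing the file at the root of the archive (rather than under weights/) is an assumption of this example.

```python
import os
import tarfile
import tempfile


def compress_model(model_path, tar_path):
    """Write model_path into a gzip-compressed tar archive at tar_path."""
    with tarfile.open(tar_path, "w:gz") as archive:
        # arcname stores only the file name, not the folders leading to it.
        archive.add(model_path, arcname=os.path.basename(model_path))
    return tar_path


# Demo with a placeholder file standing in for the real weights/model.h5.
with tempfile.TemporaryDirectory() as tmp:
    dummy_model = os.path.join(tmp, "model.h5")
    with open(dummy_model, "wb") as handle:
        handle.write(b"placeholder")

    archive_path = compress_model(dummy_model, os.path.join(tmp, "model.tgz"))
    with tarfile.open(archive_path, "r:gz") as archive:
        member_names = archive.getnames()

print(member_names)  # ['model.h5']
```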

After compression, we need to define the properties of the model that we are uploading like the libraries and their versions. Check the versions used while training the model. Also, IBM supports only particular versions and the list is updated frequently, so do check it out. As of today, these specifications are up-to-date.

model_props = {
    client.repository.ModelMetaNames.NAME: 'Speech - compressed keras model',
    client.repository.ModelMetaNames.FRAMEWORK_NAME: 'tensorflow',
    client.repository.ModelMetaNames.FRAMEWORK_VERSION: '1.15.0',
    client.repository.ModelMetaNames.RUNTIME_NAME: 'python',
    client.repository.ModelMetaNames.RUNTIME_VERSION: '3.6',
    client.repository.ModelMetaNames.FRAMEWORK_LIBRARIES: [{'name': 'keras', 'version': '2.2.5'}]
}

Finally, deploy the model and get the required details to predict the output from it.

# Stores the compressed model in the Watson Machine Learning repository
published_model_details = client.repository.store_model(model=tar_filename, meta_props=model_props)

# gets and prints the unique id of the model
model_uid = client.repository.get_model_uid(published_model_details)
print(model_uid)

# Deploys it as a web service
deployment = client.deployments.create(model_uid, 'Keras Speech recognition through compressed file.')

3. Prediction

For prediction using the cloud, some changes need to be made to the local prediction code written earlier.

STEP-1 : Instead of loading the model, declare a client using the credentials as done before. Then go to the deployment in your cloud account and copy the scoring endpoint.

import pickle
from watson_machine_learning_client import WatsonMachineLearningAPIClient

def init():
    wml_credentials = {
        "url": "xxxxxxxxxxxxxxxxxx",
        "apikey": "xxxxxxxxxxxxxxxxxxxxxxx",
        "username": "xxxxxxxxxxxxxxxxxxxx",
        "password": "xxxxxxxxxxxxxxxxxxxxxxx",
        "instance_id": "xxxxxxxxxxxxxxxxxxxxxx"}

    # IBM deployment initialization
    client = WatsonMachineLearningAPIClient(wml_credentials)

    # dataset variables saved earlier with pickle
    char_to_int = pickle.load(open("variables/c2i.pkl", "rb"))
    int_to_char = pickle.load(open("variables/i2c.pkl", "rb"))
    n_vocab = pickle.load(open("variables/n_vocab.pkl", "rb"))

    scoring_endpoint = 'https://eu-gb.ml.cloud.ibm.com/xxxxxxxxxx'

    return client, char_to_int, int_to_char, n_vocab, scoring_endpoint
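The pickle files loaded in init() have to be written once, after pre-processing, in the training script. The tutorial does not show that step, so the following is only a sketch of how it could be done; the helper names save_variables and load_variables are hypothetical, and the demo writes to a temporary directory instead of the real "variables" folder.

```python
import os
import pickle
import tempfile

FILES = ("c2i.pkl", "i2c.pkl", "n_vocab.pkl")


def save_variables(folder, char_to_int, int_to_char, n_vocab):
    """Pickle the two lookup tables and the vocabulary size into folder."""
    os.makedirs(folder, exist_ok=True)
    for name, value in zip(FILES, (char_to_int, int_to_char, n_vocab)):
        with open(os.path.join(folder, name), "wb") as handle:
            pickle.dump(value, handle)


def load_variables(folder):
    """Read the three pickled values back in the same order."""
    values = []
    for name in FILES:
        with open(os.path.join(folder, name), "rb") as handle:
            values.append(pickle.load(handle))
    return tuple(values)


# Round trip on toy values (assumption) inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    folder = os.path.join(tmp, "variables")
    save_variables(folder, {"a": 0, "b": 1}, {0: "a", 1: "b"}, 2)
    restored = load_variables(folder)

print(restored)
```

The mappings used for prediction must be exactly the ones used during training, otherwise the integers sent to the model will refer to different characters.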

STEP-2 : Instead of predicting with the local model, use the client and the endpoint to predict.

scoring_payload = {'values': text.tolist()}
predict = client.deployments.score(scoring_endpoint, scoring_payload)
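Before scoring, the recognized speech has to be turned into the same normalized, fixed-length input the model was trained on. The sketch below builds such a payload in plain Python. The helper build_payload is hypothetical, and silently dropping characters that are missing from the training vocabulary is an assumption of this example (the original predict function would raise a KeyError on them).

```python
def build_payload(transcript, char_to_int, n_vocab, seq_length=100):
    """Return a scoring payload for the last seq_length known characters.

    Returns None while the transcript is still shorter than seq_length.
    """
    known = [c for c in transcript.lower() if c in char_to_int]
    if len(known) < seq_length:
        return None
    window = known[-seq_length:]
    # Shape (1, seq_length, 1): one sample, one feature per timestep.
    values = [[[char_to_int[c] / float(n_vocab)] for c in window]]
    return {"values": values}


# Toy vocabulary and a window of 3 characters (assumptions for illustration).
demo_map = {"a": 0, "b": 1}
payload = build_payload("xxAbBa!", demo_map, n_vocab=2, seq_length=3)
too_short = build_payload("ab", demo_map, n_vocab=2, seq_length=3)

print(payload)    # {'values': [[[0.5], [0.5], [0.0]]]}
print(too_short)  # None
```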

Pygame

The output of this project is displayed as text in a GUI window. It continuously shows the recognized speech as well as the prediction output. We will use pygame as the GUI library because its display is flexible and, unlike Tkinter, it does not tie the program to its own main loop. The text output from the Google speech API is fed to pygame; then, after crossing 100 words, the prediction output is also fed to a different function that displays the dialogue box. Install pygame and import it in a new file.

STEP-1 : Importing:

import pygame

STEP-2 : Initialize basic pygame variables in a new function. The parameters can be varied as per choice.

def show_init():
    pygame.init()

    screen_size = (1500, 600)
    screen = pygame.display.set_mode(screen_size)

    pygame.display.set_caption('Show speech')

    return screen

This should display a blank pygame window titled 'Show speech'.

1. Speech Output

The speech output is shown as scrolling text. We will now define a new function to display it. First, we pass the text to the function and assign it a font, then get the rectangle around the rendered text to measure its width and height. Using those, we can determine where to place the text. The final step is to loop over the text one character at a time to create the scrolling effect.

def paint(screen, txt, posx, posy):
    font = pygame.font.Font('freesansbold.ttf', 18)
    screen.fill((0, 0, 0), rect=[posx, posy, (font.size(txt[0])[0]*100), 20])
    clock = pygame.time.Clock()
    green = (0, 255, 0)

    text = font.render(txt, True, green)

    # create a rectangular object for the text surface object
    textRect = text.get_rect()

    if((posx + textRect[2]) <= 800):
        textRect = textRect.move(posx, posy)
    else:
        posy += textRect[3]
        posx = 0
        textRect = textRect.move(posx, posy)

    pygame.display.update()

    for i in range(len(txt)):
        text2 = font.render(txt[i], True, green)
        screen.blit(text2, (posx + (font.size(txt[:i])[0]), posy))
        pygame.display.update()
        clock.tick(12)

    posx += textRect[2]

    return pygame.display.get_surface(), posx, posy
  • The same pygame screen initialized earlier is updated on every call to the function.
  • The if/else block places the new text right after the previous text, or at the start of the next line when it would not fit within the width limit.
  • The for loop draws the text one character at a time to create the scrolling effect. A clock is required to control the scrolling speed.

In the author's test run, the recognized text was accurate. The only mistake was interpreting "pygame" as "tiger", and for common words even this kind of error is rare.
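The placement rule can be checked without opening a window. The sketch below isolates it in a hypothetical helper, place_text, which takes the measured width and height of the rendered text; the 800-pixel limit is the one used in paint, and the sample numbers are made up.

```python
def place_text(posx, posy, text_width, text_height, max_width=800):
    """Return where to draw the text and where the next text should start."""
    if posx + text_width <= max_width:
        draw_at = (posx, posy)            # fits on the current line
    else:
        draw_at = (0, posy + text_height)  # wrap to the start of the next line
    next_start = (draw_at[0] + text_width, draw_at[1])
    return draw_at, next_start


# A 50-pixel-wide text that fits, and one that has to wrap.
fits = place_text(700, 0, 50, 20)
wraps = place_text(780, 0, 50, 20)

print(fits)   # ((700, 0), (750, 0))
print(wraps)  # ((0, 20), (50, 20))
```

Keeping the arithmetic separate from the drawing code makes it easy to change the width limit to match the actual window size.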

2. Predict Output

The prediction is a 100-character output derived from the last 100 characters spoken by the speaker. It is shown in a dialogue box after the last words in the same window. The basic visual properties can be modified easily. Create a new function and copy the code above into it, since only slight changes need to be made.

def pred_show(screen, txt, posx, posy):
    green = (0, 255, 0)
    blue = (0, 0, 128)

    font = pygame.font.Font('freesansbold.ttf', 18)
    text = font.render(txt, True, green, blue)

    # create a rectangular object for the text surface object
    textRect = text.get_rect()

    if((posx + textRect[2]) <= 800):
        # debug output
        print(str(posx+textRect[2]) + "ii")
        textRect = textRect.move(posx, posy)
    else:
        # debug output
        print(str(textRect[3] + posy) + "jj")
        textRect = textRect.move(0, 300)
        #posx += textRect[2]

    pygame.display.update()

    # draw the prediction at the current position
    screen.blit(text, (posx, posy))
    pygame.display.update()

    return pygame.display.get_surface()
  • As can be seen, it is very similar to the earlier function, since the two do almost the same job.
  • The values of the x and y positions are not modified in this function.
  • The scroll effect is not added.
  • The background color of the text is blue to display it as a dialogue box.

The prediction may not be accurate because of the limitations of the model, but it can be improved.

With this we come to the end of the tutorial. I hope you learned something new about Raspberry Pi, audio processing, and artificial intelligence. If you run into any doubts or errors, go through the respective documentation or comment below. The entire working code and original project directory of this project can be found in this GitHub repository: https://github.com/Shaashwat05/Predictive_speech

If you like projects like these, do comment and more tutorials will come your way. If you need similar ideas, visit this link: https://dzone.com/articles/artificial-intelligence-in-iot-4-examples-how-to-m-1

Source: https://iot4beginners.com/predictive-speech-with-raspberry-pi-and-deep-learning/