Commit 16747c95 authored by Elias's avatar Elias

pre-processing data. Cleaning documents

parent abd16b33
# Projet_IAS_NPL
# English proficiency prediction NLP
<b>Description : </b>
IAS module project at ENIB (SP9 - 2021)
The idea of the project is to predict someone's English proficiency from a text input.
We used The NICT JLE Corpus, available here :
The source of the corpus data is the transcripts of the audio-recorded speech samples of 1,281 participants (1.2 million words, 300 hours in total) in an English oral proficiency interview test. Each participant received an SST (Standard Speaking Test) score between 1 (low proficiency) and 9 (high proficiency) based on this test.
<b>Tasks : </b>
- Pre-process the dataset: extract the participants' transcripts (everything inside `<B></B>` tags). Inside each participant's transcript, remove all other tags and keep only English words.
- Process the dataset: extract features with the Bag of Words (BoW) technique
- Train a classifier to predict the SST score
- Compute the accuracy of your system (the number of participants classified correctly) and plot the confusion matrix.
- Try to improve your system (for example, try GloVe embeddings instead of BoW).
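The downstream steps (BoW features, classifier, accuracy, confusion matrix) can be sketched as follows; this is a minimal illustration assuming scikit-learn and toy stand-in data, not the real corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy stand-ins for the cleaned transcripts and their SST scores (illustrative only)
texts  = ["i like play soccer", "i enjoy playing football very much",
          "er i go school", "yesterday i went to the museum with my friends"]
scores = [3, 7, 3, 7]

# Bag of Words: each transcript becomes a vector of word counts
X = CountVectorizer().fit_transform(texts)

# Train a classifier to predict the SST score
clf = LogisticRegression(max_iter=1000).fit(X, scores)

# Evaluate (a real run would use a held-out test split, not the training data)
pred = clf.predict(X)
acc = accuracy_score(scores, pred)
cm = confusion_matrix(scores, pred)
print("Accuracy:", acc)
print(cm)
```

On the real corpus the same pipeline applies, with one row per participant transcript and the nine SST levels as classes.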
<b>Supervisor :</b>
Olivier Augereau
<b>Authors :</b>
%% Cell type:code id: tags:
#Libraries used
import re #Regular expression
import os #Operating system
%% Cell type:code id: tags:
#We load our dataset from Drive
#Dataset: NICT_JLE_4.1
from google.colab import drive
drive.mount('/content/gdrive')
%% Output
Mounted at /content/gdrive
%% Cell type:code id: tags:
def read_text_file(file_path):
    with open(file_path, mode='r', encoding="utf8", errors='ignore') as f:
        content = f.readlines()
    return content

def write_text_file(file_path, fileProcessed, i):
    # Write the cleaned lines to Output<i>.txt, one cleaned utterance per line
    with open(file_path + '/Output' + str(i) + '.txt', 'w') as f_out:
        f_out.write('\n'.join(fileProcessed))
def preProcessingData(content):
    """
    This function will return a list with the text cleaned.
    Each element corresponds to a line from the files in "LearnerOriginal".
    """
    file_clean = []
    for j, i in enumerate(content):
        if '<SST_level>' in i:
            output_Score = i.strip('</SST_level> \n')
            #--- Debug ---
            print("Output Score: ", output_Score)
        if "<B>" in i:  # that is, all the lines where the candidate spoke
            # We change the name to be readable
            text_clean = i
            # In the following list we declare all unwanted tags.
            # <OL></OL> : Overlapping speech
            # <Laughter></Laughter> : Laughter
            # <F> : Filler/Filled pause
            # <R></R> : Repetition
            # <nvs> : Non-verbal sound
            # <JP></JP> : Japanese word
            # <H pn="X"></H> : Learner's personal information
            # <SC></SC> : Self-correction
            # <SC?></SC?> : Unclear self-correction
            # <.></.> : Short pause (2 - 3 seconds)
            # <..></..> : Long pause (more than 4 seconds)
            # <?></?> : Unclear passage
            # Regex patterns for the tags above; paired tags are removed
            # together with their contents
            listUndesiredChars = [r'<OL>.*?</OL>', r'<Laughter>.*?</Laughter>',
                                  r'<JP>.*?</JP>', r'<H .*?</H>',
                                  r'<SC\??>.*?</SC\??>', r'<\.{1,2}></\.{1,2}>',
                                  r'<\?></\?>', r'<nvs>']
            for tag in listUndesiredChars:
                text_clean = re.split(tag, text_clean)
                text_clean = ''.join(text_clean)
            # All the remaining tags (e.g. <B>, <F>, <R>) are stripped,
            # keeping their contents
            text_clean = ''.join(re.split(r'<.+?>', text_clean))
            # We remove the characters that are not letters and make everything lowercase
            pattern = r'[.?",-]'
            replacement = ''
            result = re.sub(pattern, replacement, text_clean).lower()
            # We eliminate multiple spaces
            result = re.sub(r'\s+', ' ', result)
            file_clean.append(result)
    return file_clean
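As a sanity check, the tag-stripping approach can be exercised on a single made-up line; the sample below is illustrative, not taken from the corpus:

```python
import re

# Hypothetical line in NICT JLE-style markup (illustrative sample)
sample = '<B><F>er</F> I like <R>to to</R> play soccer <.></.> very much</B>'

# Drop every SGML-style tag, then normalize punctuation, case, and spaces
text = ''.join(re.split(r'<.+?>', sample))
text = re.sub(r'[.?",-]', '', text).lower()
text = re.sub(r'\s+', ' ', text).strip()
print(text)  # → 'er i like to to play soccer very much'
```

Note that stripping the bare tags keeps filler and repetition words ("er", "to to"); removing them as well would require the paired-tag patterns used above.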
if __name__ == "__main__":
    # Paths to: READ and WRITE
    path_input = '/content/gdrive/MyDrive/NICT_JLE_4.1/LearnerOriginal/'
    path_output = r'/content/gdrive/MyDrive/NICT_JLE_4.1/Output/'
    # Change the directory
    os.chdir(path_input)
    # Counter to name all output files
    i = 0
    # Iterate through all files
    for file in os.listdir():
        # Check whether the file is in text format or not
        if file.endswith(".txt"):
            file_path = f"{path_input}/{file}"
            # We read the files to treat
            content = read_text_file(file_path)
            # The function that cleans the files
            file_clean = preProcessingData(content)
            # We write out the files
            write_text_file(path_output, file_clean, i)
            # Counter update
            i += 1