Commit 16747c95 authored by Elias

pre-processing data. Cleaning documents

parent abd16b33
# Projet_IAS_NPL
# English proficiency prediction NLP
<b>Description : </b>
IAS module project at ENIB (SP9 - 2021)
The idea of the project is to predict someone's English proficiency based on a text input.
We used The NICT JLE Corpus, available here: https://alaginrc.nict.go.jp/nict_jle/index_E.html
The source of the corpus data is the transcripts of the audio-recorded speech samples of 1,281 participants (1.2 million words, 300 hours in total) in an English oral proficiency interview test. Each participant received an SST (Standard Speaking Test) score between 1 (low proficiency) and 9 (high proficiency) based on this test.
<b>Tasks : </b>
- Pre-process the dataset: extract the participant transcript (all `<B></B>` tags). Inside the participant transcript, you can remove all other tags and extract only English words.
- Process the dataset: extract features with the Bag of Words (BoW) technique (see the sketch after this list)
- Train a classifier to predict the SST score
- Compute the accuracy of your system (the number of participants classified correctly) and plot the confusion matrix.
- Try to improve your system (for example you can try to use GloVe instead of BoW).
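A minimal sketch of the BoW + classification + evaluation steps, assuming the transcripts have already been cleaned into one string per participant. The names `texts` and `sst_scores` are hypothetical placeholders, and the scikit-learn logistic-regression classifier is just one possible choice, not the project's final implementation:
```
# Minimal sketch: Bag of Words features, a classifier and a confusion matrix.
# `texts` (one cleaned transcript per participant) and `sst_scores` (the
# matching SST levels) are hypothetical placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

def train_and_evaluate(texts, sst_scores):
    # Bag of Words: one column per word, word counts per transcript
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(texts)

    X_train, X_test, y_train, y_test = train_test_split(
        X, sst_scores, test_size=0.2, random_state=42)

    # Any multi-class classifier works here; logistic regression is one option
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)

    # Accuracy = proportion of participants assigned the correct SST score
    y_pred = clf.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, y_pred))

    # Confusion matrix: rows are true SST scores, columns are predicted ones
    ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred)).plot()
    plt.show()
```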
<b>Supervisor :</b>
Olivier Augereau
<b>Authors :</b>
CORREA, Elias
GASSIBE, Franco
%% Cell type:markdown id: tags:
<a href="https://colab.research.google.com/github/Eliascc5/English_proficiency_prediction_NLP/blob/main/preProcessing_NPL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
%% Cell type:code id: tags:
```
#Libraries used
import re #Regular expression
import os #Operating system
```
%% Cell type:code id: tags:
```
#We load our dataset from Drive
#Dataset: NICT_JLE_4.1
#Reference: https://alaginrc.nict.go.jp/nict_jle/index_E.html
from google.colab import drive
drive.mount("/content/gdrive")
#----------------------------------
```
%% Output
Mounted at /content/gdrive
%% Cell type:code id: tags:
```
def read_text_file(file_path):
    with open(file_path, mode='r', encoding="utf8", errors='ignore') as f:
        content = f.readlines()
    return content


def write_text_file(file_path, fileProccesed, i):
    with open(file_path + '/Output' + str(i) + '.txt', 'w') as f_out:
        f_out.write('\n'.join(fileProccesed[i]))


def preProcessingData(content):
    '''
    This function returns a list with the cleaned text.
    Each element corresponds to a line from the files in "LearnerOriginal".
    '''
    file_clean = []

    for j, i in enumerate(content):

        if '<SST_level>' in i:
            # Keep only the score between the <SST_level></SST_level> tags
            output_Score = i.strip('</SST_level> \n')
            file_clean.append(output_Score)

            # --- Debug ---
            print("Output Score: ", output_Score)
            print("---------------------------------")

        if "<B>" in i:  # that is, all the lines where the candidate spoke

            # We change the name to be readable
            lines = i

            # In the following list we declare all unwanted tags (tag and content removed).
            listUndesiredChars = [r'<F>.+?</F>', r'<R>.+?</R>', r'<OL>.+?</OL>',
                                  r'<laughter>.+?</laughter>', r'<nvs>.+?</nvs>',
                                  r'<CO>.+?</CO>', r'<H.+?</H>']
            # TAGS:
            # <OL></OL> : Overlapping speech
            # <laughter></laughter> : Laughter
            # <F></F> : Filler/Filled pause
            # <R></R> : Repetition
            # <nvs></nvs> : Non-verbal sound
            # <JP></JP> : Japanese word
            # <H pn="X"></H> : Learner's personal information
            # -----------------------------------
            # <SC></SC> : Self-correction
            # <SC?></SC?> : Unclear self-correction
            # <.></.> : Short pause (2 - 3 seconds)
            # <..></..> : Long pause (more than 4 seconds)
            # <?></?> : Unclear passage

            # Remove each unwanted tag together with its content
            text_clean = lines
            for tag in listUndesiredChars:
                text_clean = ''.join(re.split(tag, text_clean))

            # All the remaining symbols to strip out
            pattern = r'[.?"\-,]'
            replacement = ''

            # Remove all remaining tags (keeping their content)
            text_clean = ''.join(re.split(r'<.+?>', text_clean))

            # We remove the remaining punctuation and make everything lowercase
            result = re.sub(pattern, replacement, text_clean).lower()

            # We eliminate multiple spaces
            result = re.sub(r'\s+', ' ', result)

            file_clean.append(result)

    return file_clean


if __name__ == "__main__":

    # Paths to: READ and WRITE
    path_input = '/content/gdrive/MyDrive/NICT_JLE_4.1/LearnerOriginal/'
    path_output = r'/content/gdrive/MyDrive/NICT_JLE_4.1/Output/'

    # Change the directory
    os.chdir(path_input)

    # Counter to name the output files
    i = 0
    fileProccesed = []

    # Iterate through all files
    for file in os.listdir():
        # Check whether the file is in text format or not
        if file.endswith(".txt"):
            file_path = f"{path_input}{file}"

            # We read the files to treat
            content = read_text_file(file_path)

            # The function that cleans the files
            fileProccesed.append(preProcessingData(content))

            # We write the output files
            write_text_file(path_output, fileProccesed, i)

            # Counter update
            i += 1
```
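%% Cell type:markdown id: tags:
As a quick sanity check, the cleaning function can be run on a single made-up transcript line (the sample below is hypothetical, not taken from the corpus):
%% Cell type:code id: tags:
```
# Hypothetical sample line: a filler tag, a short pause tag and some punctuation
sample = ['<B><F>er</F> I usually go to <.></.> the library on weekends.</B>\n']
# Tags and punctuation removed, text lowercased, multiple spaces collapsed
print(preProcessingData(sample))
```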