Operating Systems Notes 4: Building a Word Dictionary from the Brown Corpus; Process Monitoring and Management

2024-11-11 02:03

OS Programme Lecture #4

1. BASH Programming (on a Unix system): Read one million words from text files

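A more complex script

It extracts words and their usage frequencies from the Brown Corpus, the first machine-readable corpus.

The script automatically iterates over every file in the brown folder and collects each word together with the number of times it occurs.

The program strips symbols such as ', [, ] and $, and keeps the word counts in a hashmap data structure (also called a "dictionary").

Once all data files have been read, the script prints the word frequencies in a final for loop.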
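The counting idea can be tried on its own before touching the corpus. The following is a minimal sketch, assuming bash 4 or later (needed for declare -A); the sample sentence and the name counts are made up for illustration and are not part of the course script.

```shell
#!/bin/bash
# Count word occurrences in a short, made-up sample string.
declare -A counts
sample="the cat saw the dog and the bird"

for word in $sample; do
    # ${counts[$word]+_} expands to "_" only if the key already exists.
    if [ "${counts[$word]+_}" ]; then
        counts[$word]=$(( ${counts[$word]} + 1 ))
    else
        counts[$word]=1
    fi
done

echo "the: ${counts[the]}"
echo "distinct words: ${#counts[@]}"
```

The keys of the associative array are the words and the values are their counts, so ${#counts[@]} gives the number of distinct words rather than the number of words read.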

Run man sed to look up what the sed commands mean and try to understand them.
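To see what the sed substitutions do, it can help to run them on a single line first. This sketch assumes the corpus files use the word/tag layout (each token followed by a slash and a part-of-speech tag); the sample line and the variable names tagged, stripped and clean are invented for illustration.

```shell
#!/bin/bash
# A made-up line in word/tag style.
tagged='The/at price/nn was/bedz $5/cd ./.'

# \([^ ]*\) captures everything up to the last slash of a token,
# /[^ ]* matches the tag, and \1 keeps only the captured word.
stripped=$(printf '%s\n' "$tagged" | sed 's_\([^ ]*\)/[^ ]*_\1_g')

# \$ matches a literal dollar sign; a bare $ would mean "end of line".
clean=$(printf '%s\n' "$stripped" | sed 's/\$//g')

echo "$stripped"
echo "$clean"
```

The underscore after s is just an alternative delimiter, chosen so that the slash inside the pattern does not need escaping.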

Create WordFrequencies.sh with the following code:

#!/bin/bash
# Requires bash 4 or later for associative arrays (declare -A).
declare -A hashmap

for file in brown/*[0-9]; do
    echo "Reading $file"
    # Strip the part-of-speech tag: "word/tag" becomes "word".
    sed 's_\([^ ]*\)/[^ ]*_\1_g' "$file" > t1.txt
    # Remove ', `, [, ] and $ one symbol at a time.
    sed "s/'//g" t1.txt > t2.txt
    sed 's/`//g' t2.txt > t3.txt
    sed 's/\[//g' t3.txt > t4.txt
    sed 's/]//g' t4.txt > t5.txt
    sed 's/\$//g' t5.txt > t6.txt

    while read -r line; do
        if [ ${#line} -gt 0 ]; then
            #echo $line
            for word in $line; do
                if [ ${#word} -gt 0 ]; then
                    #echo ${word}
                    # ${hashmap[$word]+_} is non-empty only if the key exists.
                    if [ "${hashmap[$word]+_}" ]; then
                        hashmap[$word]=$(( ${hashmap[$word]} + 1 ))
                    else
                        hashmap[$word]=1
                    fi
                fi
            done
        fi
    done < "t6.txt"
done

# Print every word followed by its count.
for i in "${!hashmap[@]}"; do
    echo "$i ${hashmap[$i]}"
done

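Run it! (Be patient: the script takes a long time to run.)

Comment out some of the lines in the code above and see how the output changes, to deepen your understanding.

Then try the following code to answer the homework questions. Reference code: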

#!/bin/bash
# Requires bash 4 or later for associative arrays (declare -A).
declare -A hashmap

for file in brown/*[0-9]; do
    echo "Reading $file"
    # Strip the part-of-speech tag, then remove ', `, [, ] and $.
    sed 's_\([^ ]*\)/[^ ]*_\1_g' "$file" > t1.txt
    sed "s/'//g" t1.txt > t2.txt
    sed 's/`//g' t2.txt > t3.txt
    sed 's/\[//g' t3.txt > t4.txt
    sed 's/]//g' t4.txt > t5.txt
    sed 's/\$//g' t5.txt > t6.txt

    while read -r line; do
        if [ ${#line} -gt 0 ]; then
            #echo $line
            for word in $line; do
                if [ ${#word} -gt 0 ]; then
                    #echo ${word}
                    if [ "${hashmap[$word]+_}" ]; then
                        hashmap[$word]=$(( ${hashmap[$word]} + 1 ))
                    else
                        hashmap[$word]=1
                    fi
                fi
            done
        fi
    done < "t6.txt"
    #break    # uncomment to stop after the first file while testing
done

numWords=0
topWord=""
topFreq=0
sumFreq=0

for i in "${!hashmap[@]}"; do
    echo "$i ${hashmap[$i]}"
    numWords=$((numWords + 1))
    # Track the word with the highest count seen so far.
    if [ "$topFreq" -lt "${hashmap[$i]}" ]; then
        topWord=$i
        topFreq=${hashmap[$i]}
    fi
    sumFreq=$((sumFreq + ${hashmap[$i]}))
done

# bc -l gives a floating-point average; bash arithmetic is integer-only.
avgFreq=$(echo "$sumFreq/$numWords" | bc -l)

echo "What is the total number of words? Answer=$numWords"
echo "What is the most frequent word? Answer=$topWord"
echo "What is the number of hits of the most frequent word? Answer=$topFreq"
echo "Average word frequency=$avgFreq"
echo "Does the memory used grow as your script reads more data, and why? Answer=Yes, because the variable 'hashmap' grows with more data."
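As a cross-check on topWord and topFreq, the word/count pairs can also be sorted by count. This is only a sketch of that alternative, not part of the reference code: the three counts are made-up stand-ins for the real hashmap contents, and top3 and top_word are illustrative names.

```shell
#!/bin/bash
# Made-up counts standing in for the real hashmap contents.
declare -A hashmap=( [the]=5 [of]=3 [jury]=1 )

# Print "word count" pairs, sort numerically on field 2 in reverse
# order, and keep the three largest.
top3=$(for w in "${!hashmap[@]}"; do
    echo "$w ${hashmap[$w]}"
done | sort -k2,2nr | head -n 3)

echo "$top3"

# The first field of the first line is the most frequent word.
top_word=${top3%% *}
echo "most frequent: $top_word"
```

With the real script, the same sort and head pipeline could be attached to the final for loop instead of tracking the maximum by hand.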

2. Process Management:

(1) BASH - Process execution

First, we write a script loop.sh that loops forever. Reference code:

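#!/bin/bash
num=1
while true; do
    # Print each number together with its square.
    square=$((num * num))
    echo "$num $square"
    num=$((num + 1))
done
# Never reached: the loop above only ends when the process is killed.
echo "Program terminated ..."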

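Ctrl+C terminates the script.

Run ps aux in the first console.

Open a new terminal and run ps aux | grep bash.

Go back to the first console and run loop.sh.

Switch to the new console, run ps aux | grep bash, then run ps aux | awk '$8 == "R+"', and compare the results.

Go back to the first console and kill the loop.sh process.

Switch to the new console and run ps aux | grep bash and ps aux | awk '$8 == "R+"' again.

Compare the results!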

The ps aux command is a tool to monitor processes running on your Linux system.

A process is associated with any program running on your system, and is used to manage and monitor a program’s memory usage, processor time, and I/O resources.
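In ps aux output the eighth column is STAT, the process state, which is why the awk filter above tests $8; R means running or runnable, and the + suffix marks a process in the foreground process group. The sketch below queries the state of a single process directly. It assumes a ps that supports the -o and -p options, and uses sleep as a stand-in for a long-running job.

```shell
#!/bin/bash
# Start a background job to inspect (sleep stands in for loop.sh).
sleep 30 &
pid=$!

# Ask ps for just the STAT column of that one process.
# The trailing "=" suppresses the header line.
state=$(ps -o stat= -p "$pid")
echo "PID $pid is in state: $state"

# Clean up the background job.
kill "$pid"
wait "$pid" 2>/dev/null || true
```

A sleeping job usually reports a state other than R, whereas loop.sh, which computes continuously, should appear as R+ while it runs in the foreground.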

(2) BASH - process termination with the kill command

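Use kill to terminate a running process:

Run the infinite-loop script loop.sh in the first console.

Switch to the second console, find the process that is running loop.sh, and note its PID.

Now terminate the process with kill: in the second console, run kill PID (replacing PID with the number you noted).

Go back to the first console and check whether the script has been terminated.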
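The same steps can be rehearsed inside one script. This is a minimal sketch in which sleep stands in for loop.sh and the variables status and alive are illustrative. By default kill sends SIGTERM (signal 15), and the shell reports a process ended by a signal as exit status 128 plus the signal number.

```shell
#!/bin/bash
# Start a long-running background job and remember its PID.
sleep 300 &
pid=$!

# With no signal option, kill sends SIGTERM (signal 15).
kill "$pid"

# wait returns the job's exit status: 128 + 15 for SIGTERM.
status=0
wait "$pid" || status=$?
echo "exit status: $status"

# kill -0 sends no signal; it only checks whether the PID still exists.
if kill -0 "$pid" 2>/dev/null; then alive=yes; else alive=no; fi
echo "still running: $alive"
```

Bash may also print a "Terminated" notice for the job on standard error, which is the same message the first console shows when loop.sh is killed from the second one.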

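3. Homework:

(1) Get familiar with the WordFrequencies.sh script and, based on running the reference code above, try to answer the following questions:

- Does the memory used grow as your script reads more data? (Open the system monitor and watch the memory usage.)

- What is the total number of words?

- What is the most frequent word?

- What is the number of hits of the most frequent word?

- What is the average word frequency?

(2) Run the WordFrequencies.sh script from (1). While it is running, open another terminal to monitor the process, then go back to the console running the script and kill the process. Record the procedure and screenshots of the ps aux | grep bash output in your lab report.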
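Besides the system monitor, memory growth can be observed from inside a script. The sketch below is an assumption-laden illustration, not part of the assignment code: it fills an associative array with 20000 made-up keys and reads the shell's own resident set size (RSS, reported by ps in kilobytes) before and after.

```shell
#!/bin/bash
declare -A hashmap

# $$ is the PID of this shell; rss= prints the resident set size
# without a header, and tr removes the padding spaces.
rss_before=$(ps -o rss= -p $$ | tr -d ' ')

# Insert 20000 made-up keys to imitate reading more and more words.
for ((i = 0; i < 20000; i++)); do
    hashmap["word$i"]=1
done

rss_after=$(ps -o rss= -p $$ | tr -d ' ')

echo "entries: ${#hashmap[@]}"
echo "RSS before: $rss_before, after: $rss_after"
```

The exact numbers depend on the system and bash version, so compare the two values rather than expecting a particular figure.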
