Adhere to the highest levels of academic integrity. Cite any sources that you have referred to. Also, state anyone who you may have discussed your approach with. Use the "Whiteboard approach" discussed in lecture at the beginning of the semester.
You are expected to use the provided data structures and methods only. You are forbidden from using your own methods or data structures that go against the spirit of the assignment and cannot be tested by our tools. E.g., adding your own my_strlen
helper method is OK (as long as they do not modify the headers). It is not OK to replace vector_char
with your own data structure or methods, knowingly bypassing our methods, or using strtok
.
Our internal grading tools will simply mark failed executions as 0. Note that the grade distributed within the assignment scripts is tentative only. If you contravene the rules we cannot guarantee that your code will run against our internal testing.
In many cases below you will see a reference to $REPO
. $REPO
refers to the folder you have cloned the repo to locally.
# Example $REPO. github username is <student>
git clone git@github.com/CMPT-295-SFU/assignment-1-student
# $REPO is the folder assignment-1-student.
WARNING: WE CANNOT GRADE YOU UNTIL YOU HAVE COMPLETED LAB 0!
malloc
, realloc
and free
? Watch Week 2 videocgdb
?If you cannot answer these questions, revise lab 1 and read:
gcc -I./include
what does the -I
flag does. Short answer: It adds additional folders for compiler to search for headers. Answermake -n
does? It's a dry run that shows what commands are being run by make during the build process.NOT GRADED
Prereq self assesment deadline: Try to complete within 1st three days
Expected Completion Time: 1 hour
.In this module we will implement 3 functions in the file Arraylist/arraylist.c
:
void arraylist_add(arraylist *a, void *x)
void arraylist_free(arraylist *a)
void arraylist_insert(arraylist *a, unsigned int index, void *x)
YOU CANNOT MODIFY ANY OTHER FILES OR FUNCTIONS.
arraylist_add
(arraylist* a, void* x)
The elements of an arraylist are stored consecutively in a dynamically allocated array. This function takes a pointer to an arraylist and a value, and appends that value to the end of the arraylist. It should also increment the length of the arraylist to include the new element.
If this is a conceptual drawing of what the arguments to the function look like at the beginning of a call to arraylist_add:
{: width="1000px"}
Then this is a conceptual drawing of what the arguments should look like at the end of the function arraylist_add:
{: width="1000px"}.
If the arraylist is currently at capacity when arraylist_add
is called, you should create a new dynamic array that is twice the size of the old one and copy over all the current elements in the arraylist. You may find the standard library function realloc
useful for doing this.
arraylist_insert
(arraylist* a, unsigned int index, void* x)
Hint: Use memmove(). This function should take advantage of your arraylist_add
to insert an element at an arbitrary location in your arraylist. You do not have to worry about attempting to insert to locations beyond the end of the arraylist (doing so is undefined behavior).
arraylist_free
(arraylist* a)
The arraylist_free
simply has to free all the resources used by an arraylist. Note that the arraylist struct contains a pointer!
$ cd $REPO
$ cd Arraylist
$ make -n # prints the steps for building the executable
gcc -c -o obj/arraylist.o arraylist.c -ggdb -I./include
gcc -c -o obj/main.o main.c -ggdb -I./include
gcc -o arraylist.bin obj/arraylist.o obj/main.o -ggdb -I./include -lm
$ make # Builds the executable
$ ./arraylist.bin --verbose
# If success
Test Test 0:
Case Add:1-5: Inserts numbers 1 to 5.
main.c:90: Check !strcmp(vec->output, output)... ok
SUCCESS: All conditions have passed.
Test Test 1:
Case Add:1-3,Insert:100:0-3: Inserts numbers 1 to 5. followed up insertion of 100 at positions 0-3.
main.c:127: Check !strcmp(vec->output, output)... ok
SUCCESS: All conditions have passed.
Test Test 2:
Case Add:1-5,Insert:100:0-6,Free: Inserts numbers 1 to 5. followed up insertion of 100 at positions 0-3. Finally free the arraylist.
main.c:60: Check !strcmp(vec->output, output)... ok
SUCCESS: All conditions have passed.
Summary:
Count of all unit tests: 3
Count of run unit tests: 3
Count of failed unit tests: 0
Count of skipped unit tests: 0
# If test case fails, you will see the expected output.
# Test 1 and 2 WILL LEAK MEMORY. TEST 3 should not.
This assignment will help you understand
You will write a C code that takes in an input string and:
Alphanumeric character: An alphanumeric character (includes any of a-z, A-Z, or 0-9). Any character that is not a alphabet or digit should be ignored and is not considered part of the word.
Read Hint
Word A word is a sequence of alphanumeric characters separated by non-alphanumeric character.
e.g.,
Input string: Lorem Lorem i dolor sit amet,
Step 1: Words:
# Notice that we ignored the space and punctuations.
Lorem
Lorem
i
dolor
sit
amet
# Step 2: Dictionary of words (only unify)
1. Lorem
2. i
3. dolor
4. sit
5. amet
# Step 3: Word counts
Lorem:2
i:1
dolor:1
sit:1
amet:1
All command files below here refer to word-count
folder
cd $REPO
cd word-count
{: .table-striped .table-bordered }
Filename | Description |
---|---|
dictionary.c | TODO: main(). Implement dictionary |
tokenize.c | TODO: main(). Implement tokenizer |
linecount.c | TODO: main(). Implement line count |
include | header files |
obj | where compiled .o files are stored |
txt | Input test files. |
out | folder for holding output from tests |
reference | Reference output for each step |
str_cmp.c | TODO: Implement your own strcmp |
vector_char.c | A dynamically resizable character array |
vector_string.c | TODO: A list of words |
table_string.c | TODO: A table of words |
unit_test_vector_char.c | A simple program illustrating vector chars |
unit_test_vector_string.c | A simple program to test your vector string implementation |
strcmp|strlen|isalnum|isalpha|isdigit|strdup|strcat|strncpy|strncmp|strtok
. You can read about them here. You will have to write your own implementations if/when required.TODO:
blocks in the code. You can modify or add files prefixed with unit_test.WARNING: DO NOT PROCEED WITHOUT UNDERSTANDING THE FOLLOWING
'a'
and 'A'
? What is the ascii table? Ascii Table Lecture 1 Slidesheader_as_type
vs. header_as_pointer
? what is the size (in bytes) of entry_t
? What is the problem with line 24 (head_p.variable->value = string;
)? Run in python tutor here. WARNING: DO NOT PROCEED WITHOUT UNDERSTANDING THIS EXAMPLE AND HOW TO FIX THE UNINITIALIZED POINTER
struct entry_t {
char *value;
char *next;
};
struct header_as_type {
struct entry_t variable;
};
struct header_as_pointer {
struct entry_t* variable;
};
void main() {
struct header_as_type head_t;
struct header_as_pointer head_p;
char string[] = "Hello World";
head_t.variable.value = string;
printf("%p",head_t.variable.value);
// INCORRECT EXECUTION.
// UNCOMMENT Line below if you want correct execution
// head_p.variable = (struct entry_t*)malloc(sizeof(struct entry_t));
head_p.variable->value = string;
printf("%p",head_p.variable->value);
}
Compare two character arrays. Return 0
if they match, 1
otherwise.
s1
and s2
are null terminated strings.WARNING:
strings could potentially be of different lengths e.g., "apple" does not match "apple " (which includes the extra space). Value to be returned will be 1.strlen
, strncmp
, and strcmp
.WARNING: Our my_str_cmp is slightly different. When strings mismatch, we simply return 1
If you do not fill any code and simply try running, there will be a segmentation fault and/or compiler errors.
Filename | Description |
---|---|
str_cmp.c | TODO: Implement string compare |
test_str_cmp.c | YOU CANNOT MODIFY THIS FILE . Tester for str_cmp (will not work with gdb) |
unit_test_str_cmp.c | TODO: Write your own unit test case for gdbing |
# Build unit test case
gcc -I./include/ unit_test_strcmp.c str_cmp.c -o unit_test_strcmp.bin
make test_strcmp.bin
./test_strcmp.bin --verbose
Test Test:
test_strcmp.c:24: Check result == test_vectors[i].expected... failed
String s1: a
String s2: b
test_strcmp.c:24: Check result == test_vectors[i].expected... failed
String s1: apple
String s2: appl@
test_strcmp.c:24: Check result == test_vectors[i].expected... failed
String s1: apple and oranges
String s2: apple and oranges
test_strcmp.c:24: Check result == test_vectors[i].expected... failed
String s1: 123456
String s2: 123456
test_strcmp.c:24: Check result == test_vectors[i].expected... failed
String s1: 342342;
String s2: afasdfsafd;
FAILED: 5 conditions have failed.
Summary:
Count of all unit tests: 1
Count of run unit tests: 1
Count of failed unit tests: 1
Count of skipped unit tests: 0
You should now see the fruits of your labor! If any of these cases fail, the tester will display the produced and expected values. Use this feedback to check results constructed by your approach.
A word tokenizer cuts an input string into individual words. For our purposes, a word is defined as a contiguous sequence of alphabets or numbers (from 'a'
to 'z'
, 'A'
to 'Z'
, and '0'
to '9'
).
A useful data structure for tokenizing is a character vector. Since words in the input string can vary in length, a vector is useful as it can dynamically grow. C lacks a dynamic vector and only supports fixed-size arrays. We have provided vector_char
, a dynamic character array with helper functions. You do not need to know how vector_char
works (however it may help if you do). You do need to know how to use the provided api.
Watch youtube video on vector char
To assist you in implementing the word tokenizer, we have provided the following helper files:
vector_char.c
: Implementation of the vector_char
data structure (DO NOT MODIFY
)unit_test_vector_char.c
: Example usage of the vector_char
data structureYou can compile and run the unit_test_vector_char.c
file to see how the vector_char
works.
Filename | Description |
---|---|
vector_char.c |
Implementation of vector_char . DO NOT MODIFY |
unit_test_vector_char.c |
READ : Illustrates how to use a vector_char |
make unit_test_vector_char.bin
./unit_test_vector_char.bin ∞
Actual:HELLO WORLD
Lower: hello world
Length (without \0 termination character): 11%
If you can answer the following questions, you can proceed.
vector_char_allocate()
allocate? space
for string OR space
for pointer to string?unit_test_vector_char.c line 10: vector_char_t *header
? stack or heap?header->data
allocated? stack or heap?tokenize.c
is where main
is located. To make it easier, we have already provided the basic code for reading the input file into a null terminated character array called source
.
Follow the instructions in tokenize.c
.
./tokenize.bin txt/small.txt
# source contains contents of txt/small.txt as character array
Filename | Description |
---|---|
tokenize.c |
TODO : Modify listing to iterate over string and tokenize |
refererence/.tokenize |
Reference outputs. Your output should match that |
If things work correctly then the output of tokenize.bin
should be as shown below.
IMPORTANT
: See below for grading
make tokenize.bin
# Test below shows expected output.
./tokenize.bin txt/small.txt
Lorem
Lorem
i
dolor
sit
amet
# Validate
./tokenize.bin txt/small.txt > out/small.txt.tokenize; diff out/small.txt.tokenize reference/small.txt.tokenize
If you use vscode
, it has a more powerful and visual diff
:
$ code --diff out/small.txt.tokenize reference/small.txt.tokenize
# Source is the array of characters that needs to be tokenized into words
# This is only meant to be a guide and may not be fully accurate or correct
for ch in source[]:
word = new word()
if ch is alphabetical
word.add(ch)
else
if word is not empty:
print word
word = ""
WARNING 1: You cannot use strtok.
WARNING 2: You cannot assume the max word length is known.
WARNING 3: The data ptr may change as you call vector_char_add and other functions
Use vector_char
. It is required and there to help you. Further its required in subsequent steps.
Now that we know how to tokenize, let's build a dictionary. A dictionary is a collection of all the unify words
. e.g.,
1. Lorem
2. i
3. dolor
4. sit
5. amet
Pseudocode
for each word (w words)
// Each word
lookup word in vector string
if word is present, continue
else
insert at the end of vector string
end
entries = array of all dictionary keys (words) in the vector string
for each entry
print index + ". " + word
end
To help you with this task we want you to implement a data structure called vector_string
, a dynamic linked list in which each entry points to a dictionary word.
Filename | Description |
---|---|
include/vector_string.h |
READ Contains the top-level structs describing the vector_string |
vector_string.c |
TODO: Implement the TODO blocks |
refererence/\*.dictionary |
Reference outputs. Your output should match that |
dictionary.c |
TODO : Implement main() |
You are expected to implement the following five functions:
vector_string *vector_string_allocate();
Creates a vector string on the heapbool vector_string_find(vector_string *vs, char *key);
Searches the vector string pointed to by vs
and returns true
if entry found, false
otherwise.void vector_string_insert(vector_string *vs, char *key);
Inserts a string into the vector string at the tail.void vector_string_deallocate(vector_string *vs);
Removes all entries and deallocate vector string.void vector_string_print(vector_string *vs);
Prints all value in each entry of the vector string, one line at a time.WARNING: print format has to exactly match for tests to pass
make dictionary.bin
# Validate
./dictionary.bin txt/small.txt > out/small.txt.dictionary; diff out/small.txt.dictionary reference/small.txt.dictionary
Track each line in which a word occurs. Provide a list of all dictionary words in a document along with the exact line number in which they occur.
$ cat word-count/reference/input.txt.linecount
Lorem: 1
amet: 1 4 6 8
consectetur: 1 7
adipiscing: 1
felis: 3
tempor: 4
....
....
If you open txt/input.txt in a text editor
and search for Sed, you will see it occurs in line 1 and 7
while adipiscing occurs only in line 1
\n
.realloc
. (https://www.tutorialspoint.com/c_standard_library/c_function_realloc.htm)A key limitation of the vector_string
is that it uses a single linked list to keep track of the words in the document. Hence searching through it is expensive. Think of the checking if a word exists? In the worst-case how many entries in the list would we have to check?. To make things faster we will be building a new data structure with new methods: table_string
. Figure below illustrates a table_string
. Unlike a vector_string
, a table_string
maintains multiple lists (you can refer to these as buckets). Each word is hashed or mapped to a unify bucket. The function djb2_word_to_bucket(char* word, int buckets):table_string.c
takes in a string and returns the bucket corresponding to the word. It returns a value between [0,buckets-1]
. The word should be inserted only in that bucket. Each bucket is simply a linked list of entries. This ensures that words are naturally partitioned between buckets. The linked list in each bucket will not be as long and this makes it faster and easier to search for a particular word.
Filename | Description |
---|---|
include/table_string.h |
READ Contains the top-level structs describing the table_string |
table_string.c |
TODO: Implement the TODO blocks |
unit_test_table_string.c |
READ . Contains unit test and framework for illustrating how to use table string |
linecount.c |
TODO: main() for tracking which lines a word occurs in |
These are the four functions that you are expected to implement for table_string
:
table_string *table_string_allocate();
Creates a table string on the heap. The reference outputs were generated assuming a table_string
with 4 buckets.void table_string_insert_or_add(table_string *vs, char *key, int line);
Inserts the word pointed to by key
into the table string at the tail (if it does not exist
), otherwise adds new line entry. Refer to vs_entry_t
in include/vector_string.h
. Both table_string
and vector_string
use nearly identical entries (table_string
includes extra array for tracking the lines in which each word occurs.)void table_string_deallocate(table_string *vs);
Removes all entries and deallocate table stringvoid table_string_print(table_string *vs);
Prints all values in each entry of the table string. See above for format WARNING: print format has to exactly match for tests to pass. See table_string.c
WARNING: It is very important that you insert at the tail in each bucket
If the order of words in the bucket or the bucket in which you insert a word
is wrong then you will not match the reference output
# Make sure you initialize the table_string with 4 buckets in linecount.c
# The reference output was generated assuming 4 buckets
# in the table string.
# If you do not set buckets to 4, then each word will be fine. But the order of the words could be different leading to mismatches against the reference
cd $REPO/word-count
make linecount.bin
# Validate.
./linecount.bin ./txt/small.txt > ./out/small.txt.linecount; diff ./out/small.txt.linecount ./reference/small.txt.linecount
If you open txt/input.txt and txt/small2.txt in a text editor
and search for Lorem, you will see it occurs in line 1 in input.xt and twice in small2.txt in line 1
WARNING: print format has to exactly match for tests to pass. See reference
We will now proceed to find the duplicates from two files. You will the common word along with the count of the number of times the word occurs within each file.
# Make sure you initialize the table_strings with 8 buckets in duplicate.c
# The reference output was generated assuming 8 buckets
# in the table string.
# If you do not set buckets to 8, then each word will be fine. But the order of the words could be different leading to mismatches against the reference
$ cat reference/input.small2.ccount [55%]
Lorem:3
amet:5
dolor:3
sit:5
i:2
We are going to build a useful application on existing data structures. Table strings make it really easy to search since they partition the words across buckets. We are going to build a common word count detector. Provided two files, we will search for common words between the two files i.e., print words in file 1 that common with file 2 along with count. The steps are as follows:
table1
and table2
).table1
, for each word check table string table2
.dedup_entry
.dedup_entry
has count for tracking the number of times the word occurs in each file.Filename | Description |
---|---|
include/dedup.h |
READ Contains the top-level structs describing the dedup entry for tracking common word counts |
dedup.c |
TODO: Implement the ccount() function. Print the common word count of both table strings |
duplicate.c |
TODO: main() for tracking for invoking table_string and common word count values |
There is only one function that you are expected to implement for common word counts:
int ccount(table_string *t1, table_string *t2)
:
t1
and t2
are the table strings that you will use to track the words in the two files.# Make sure you initialize the table_string with 8 buckets in linecount.c
# The reference output was generated assuming 8 buckets
# in the table string.
cd $REPO/word-count
# Even though binary name says duplicate. Note that you
# are expected to implement common word count finding algorithm.
make duplicate.bin
# To test make sure you are in the parent folder
cd $REPO
./word-count/duplicate.bin ./word-count/txt/input.txt ./word-count/txt/small2.txt > ./word-count/out/input.small2.ccount
diff ./word-count/out/input.small2.ccount ./word-count/reference/input.small2.ccount
# To match reference output, you have to run from the $REPO folder.
Congrats! You are now ready to receive your tentative grade.
$ cd $REPO # $REPO refers to the top of your cloned repo.
$ source ./scripts/localci.sh # run from the repo folder
# Check if SUCCESS OR FAILED was dumped by scripts
# There will be a _Grade.json (this is your tentative grade). The file is in json format.
# The mark component in each module indicates your score for that portion.
# LOG.md contains the log of your runs. This will give you more information. See the FAILED header.
# To check if you are using these banned functions.
$ cd $REPO/word-count
$ egrep "strcmp|strlen|isalnum|isalpha|isdigit|strdup|strcat|strncpy|strncmp|strtok" table_string.c vector_string.c dictionary.c tokenize.c linecount.c duplicate.c -n
# You can ignore line. word-count/table_string.c:21: for (i = 0; i < strlen(word); i++). This is code provided by us. You can also ignore any usage in unit_tests. We do not grade unit_tests.
# The above will return any lines where you are using those banned functions.
# Note that if you are writing your own my_strlen and use it anywhere in your code it will match again. You will have to rename it to something like my_str_len
This grade is only tentative. Based on additional test cases in our evaluation, you could score less points. If valgrind reports any memory errors other than leaks e.g., writes to unallocated memory etc then your score will be zeroed out for that portion
Module | Total |
---|---|
arraylist | 5 |
strcmp | 5 |
tokenize | 20 (+5 points if no memory leaks) |
dictionary | 20 (+5 points if no memory leaks) |
linecount | 20 (+5 points if no memory leaks) |
ccount | 40 (+5 points if no memory leaks) |
In this assignment, you are also expected to manage your memory and prevent memory leaks from happening. A good tool for checking whether your program has memory leak is Valgrind
. You can learn how to use it here. A short example is below. We will deduct points if you fail to free all memory before the program exits.
$ valgrind --leak-check=yes ./executable
Valgrind
run memcheck
to find uninitialized memory locations and references$ valgrind --tool=memcheck ./executable
Lorem^?
. Check if you inserted null termination character.TA Help
We strictly adhere to a 'No Print No Help' policy. Please review the policy on assignment page for more details. To ensure the TA meetings are productive, we recommend bringing the following information:
data
pointer and struct
pointer of vector_char
.vector_string
(including head
, tail
, and next
).table_string
.Please ensure that you have thoroughly reviewed your code and attempted to debug the issue before seeking assistance.