Adventures of Writing an OS Kernel from Scratch on a Cortex M3 Board

This term I'm taking SE350 (formerly known as ECE354) a.k.a. Operating Systems a.k.a. one of the heaviest workload courses in my program. For this course, we have a cumulative lab with 4 deliverables. At the end of this term, we'll have a fully working OS kernel for the Keil MCB1700 Cortex-M3 Board with the following features:

Part 1
- Basic multiprogramming environment
- 5 Priority queues
- Simple memory management
Part 2
- Message-based IPC (Interprocess Communication)
- Basic timing Services and a Wall Clock
- System console interrupt-driven I/O
Part 3
- Stress tests
Part 4
- Documentation

For this lab, I'll be working with my classmates Fasih A., Josh K., and John Z. With any luck, we should be able to get a decent amount of sleep and not kill each other by the end of this term. Because we have to write a 30-ish page report at the end documenting all of our work and what we've learned, I thought it would be a good idea to start a journal about our (mis)adventures starting from the beginning.

Part 1: The adventure begins...

Jan 11

Made a new private GitHub repo with the default C .gitignore file. Good enough...

Jan 18

Upper year: Oh by the way, you'll be writing your OS in C89 instead of C99 or C++. You know, the one where you have to declare all of your variables at the top of your methods?
Me: Fuck, so it's going to be like VB6?

Jan 19

Josh is trying to run an ARM simulator inside a Windows VM on his Macbook (Yo dawg, I hear you like simulators so we put an ARM simulator inside your Windows simulator so you can simulate your OS while simulating Windows OS).
Josh: How the fuck do we create a linked list for the OS's memory allocator if our OS doesn't have a memory allocator yet?
Oh, we don't actually need dynamic memory; we can just stick the pointer for next* inside the blocks that we allocate! Wait, maybe we should use a doubly linked list to make free() easier and faster. Since we're given a pointer the block that we want to free, we can just modify the next* and prev* pointers and do it in O(1) time instead of O(n) time.
Man, we just can't escape the tradeoff between performance and memory in computers.
Why is our memory allocator returning blocks of 0x800 instead of 0x100? What kind of compiler just assumes pointer arithmetic (pointer + constant) is in bit mode and that the constant should be multiplied by 8 just in case?

Jan 20

Apparently our memory allocator is returnning blocks of 0x800 because of pointer arithmetic. Since our doubly linked list has two pointers (8 bytes) and we're adding a constant to a node pointer, the compiler automatically multiplies it by the size of the object.
We should zero out our memory during initialization stage since hardware memory is not guaranteed to be all zero everytime. Ok, we'll use XOR since that's the fastest way to zero out memory.
Josh: Stephen, if you want the function that zeros out memory in assembly then you write it. Oh wait, then I have to code review assembly... Nevermind, we'll do it in C.

Jan 21

Implemented a linked list for our scheduler's queue. At least our memory allocator is working today.
Well it compiled and works on the simulator, git add commit done!

Jan 22

Huh, we can't flash the board's memory.
Fuck, this was the broken board? Time to switch computers...
Fuck, invalid memory access. Dammit Josh, you're zeroing out the memory wrong and now we're getting hard faults!
- Fixed it by adding extra padding to the end of the for loop so that it ends 8 bytes before the actual end of DRAM.
- This is why I said we should align the start of heap with 0x100's but noooo... nobody listens to Stephen.
Fuck, missing algorithm 0x0000000H to 0x00007FFH. What does that even mean?!?!
OMG IT WORKS! NOBODY TOUCH ANYTHING. COMMIT EVERYTHING!

Jan 23

To avoid confusing each other when we say "doing OS" (this lab) vs. "doing OS" (taking notes from the lectures), we have officially dubbed our project "OS WOW".

OS WOW

Jan 24

We went to the OS lab after dinner. The lab is empty at 8:30 p.m. on a Friday night. Well at least it'll be quite amusing when we see people in here pulling all nighters next week.
We need some music in this lab to calm ourselves.
Our memory allocator needs to be changed since we now need more metadata in the block header like the owner's pid.
- Also it turns out that we were returnning the head of the block instead of where the user memory area begins. Need to fix that...
- Josh: We need to make the data pointer to a void**!
- Me: No we need to make that a void*!
- Josh: Oh wait nevermind, I need to get some sleep.
- Okay, let's just allocate a fixed size header with enough room for future headers. For now let's #define the header size to be 0x10 and then take the difference between 0x100, our block size, to get the actual user usable memory size.
We've just found out that the heap size needs to be fixed between the OS image and stack. This means that out of our total 0x8000 bytes available SRAM, we can probably allocate 0x3000 bytes for the heap. Therefore, we will have a total of 48 blocks to distribute between the OS and user processes. Oh well, we can reize the blocks later by change the #define BLOCK_SIZE. We'll just keep it aligned to 0x100 for now since it'll make debugging a lot easier.
To help with debugging, we'll initialize all the memory first with a constant non-zero value since 0 is actually a valid memory address on this board. This way, we can easily tell if our code did or didn't write to memory properly.
- It's too bad that we write in big endian but the memory inspector window displays the 4 bytes as little endian so 0xDEADBEEF becomes EF BE AD DE.
- 0xDEADBEEF, 0xDEADBEEF everywhere...
Death threat counter: 5

Jan 25

Okay so the user sets the process table (which is currently allocated as a global static variable) and then the kernel uses that table to create the process control blocks (PCB) for the queues? I thought all of the tables lives in the kernel space? Well I guess this makes more sense because otherwise how can the user tell the kernel where are all of its processes? Let's try to implement the scheduler now.
The lab manual says that we should have a memory table but where should we put it? Can we just put the pid inside the memory block? It'll definitely be easier but the manual doesn't explicity say that we need to allocate a new table.
- Regardless of where we put it, we need some way to know who's the original owner of the block. When release_memory_block(void*) is called, we need to check the block's owner's pid with the current process's pid. Otherwise, greedy users can just modify the owner pid of every block in the heap to itself, call free on all of them, and then request every block for itself.
- Actually, let's create a new memory table in kernel space so that it'll be harder for the user to modify the blocks' owner's pid. Yay security by obscurity.
Oh god there's so many compiler errors...
- What do you mean you can't find symbol reference to proc_table? It's right here!
- Let's git stash and incrementally add back our changes one by one.
- Okay so the error is coming from this header where we defined proc_table. Did we extern it properly? How does extern even work in C89?
  - Stephen: In C++, it just tells the compiler that it exists somewhere and it's the linker's problem. All extern does is to tell the compiler to not mangle the name.
  - WHY CAN'T IT FIND THE FUCKING SYMBOL!
  - An hour later... John: Wait, did we even include user_proc.c in our project files? Fuck. Josh kicks a chair.
Fuck, why is the printf now hard faulting? Their starter code must be broken. Let's just use putc. You know what? let's just remove all the printf and use breakpoints intead.
- Later... Josh: Oh I know why our printf isn't working; we're calling it before we call init_printf(). Josh puts the chair back then kicks it down again
Our scheduler works but it gets a hard fault the second time it dispatches the null process. Dammit our linked list implementation has bugs in the push() and pop() methods. Wait it's still hard faulting. Sigh it's going to be a long night.
- I think our stack is getting corrupted by an infinite recursion since the end of k_release_processor() just calls the function pointer to next null process which then calls k_release_processor() again. It'll never return and we'll overflow our stack!
- What if we just take that function pointer call out? Wow it worked. Oh I see, the context swicher modified already modified our program counter and stack pointer so we were actually calling the next process twice.
OMG THE SCHEDULER WORKS! git commit everything and let's go home, it's already past midnight.
Death threat counter: 7

Jan 26

Okay let's split up to two groups today.
- Josh and Fasih will implement changing priorities.
  - Added a new method for our linked list structure to allow us to linearly search and delete from the current priority's queue and then push to the end of the new priority's queue.
  - Also added preemption check so that if the new priority is greater than the current process's priority, we need to preempt the current process by calling release_processor.
- John and I will implement the memory table.
  - We want to implement some kind of protection for our memory manager so that a greedy user can't just release the memory blocks of other users. To do so, we need to update request_memory_block() so that it stores the requester's pid with the return block.
  - The easiest solution would be to just store the pid inside the memory block's header but it would be extremely easy for a malicious user to calculate the offset to the pid field for everybody else's blocks.
  - Nah, we'll just create a memory table before the start of the heap so that we can obscure the acutal location of owner pid. Again, security by obscurity.
  - The main problem now is that when the kernel requests memory off the heap, our global current_pcb variable is still referring to the process that made the system call. Note: We need to refactor our code so that our kernel gets its own heap when we have the time. It's a pretty bad idea for the kernel and user to share the same heap.
  - We got around this by creating k_request_memory() and k_release_memory(). These two simply wrap around the actual request_memory() and release_memory() calls. Before the call, it overrides the global current_pcb variable with a "dummy kernel pcb" and then after the call, it restores current_pcb to its original value.
Oh god so many merge conflicts...
- Okay finally fixed all of these conflicts so let's now test our user processes. Good, the basic tests passed. Now let's try an edge case: get process A to request all of the memory blocks, get another process B to try to request memory and get blocked, get A to release all of its memory, and see if B gets unblocked.
- Wait, why isn't it running B after A?
- Let's add a breakpoint in get_next_process(). Okay, let's inspect the queues... Why the fuck is the blocked queue empty? Where did B disappear to then?!
- Dammit our switch_process() that does the context switch is wrong so process B never gets pushed to the blocked queue. It's going to be a long night...

Jan 27

Apparently we also need to implement preemption we we release memory (i.e. when a process releases memory and there's another process with higher priority that's currently blocked on memory, we need to preempt the current process). Well that's just a simple if statement in our release_memory() method. It can't be that bad, right?
Why the fuck are we hard faulting again?! Sigh, it's going to be another long night in the lab but at least there's more people in here now.
Okay, it's hardfaulting on this line during inialization where it's trying to push the new PCBs to the ready queue. Let's step through this for loop.
- Okay, push PCB[0] at 0x10000510.
- Okay, push PCB[1] at 0x10000520.
- Okay, push PCB[2] at 0x00005230. Wait, what?
Maybe our pcb_t* array is corrupt?
- Let's step through how it's allocating memory for pcb_t** and its elements.
- There! It's overriding the pointer to the 2nd pcb_t.
- Wait why is our start_of_heap variable only at 0x1000050A instead of 0x1000051C? The contents of the first pcb_t is overwriting our pointer to the 2nd pcb_t.
- What the fuck is this line doing start_of_heap += NUM_PROCESSES * sizeof(pcb_t*) ?!?!
  - Note: NUM_PROCESSES is a macro to 7
  - Why is it only adding 0xA (10) instead of 0x1C (28)?
  - Guess we're stepping through the assembly.
  - Wow... that line generated move r2, #4 and then adds r2, #6. It's literally adding 6 to 4. Great we just found our first compiler bug.
  - Fuck it, we'll just hardcode 0x1C and label it #WONTFIX.

Jan 28

Wait the lab manual says that we have to get our user processes to print out its test results. Fasih, you fix it. I'm tired...

Jan 29

All that's left to do is documentation. Let's just take turns writing comments for each of our methods in k_process.c
Alright, zip and submit!
Well, time to study for SE380 (Feedback Control) quiz tomorrow...

Feb 5

Demo day! Hurray we sniped the last slot so we all get to sleep in until 9 a.m.
Apparently we didn't implement a FIFO context switcher. Right now, we're pushing any memory blocked process to the front of the ready queue instead of the end of the ready queue when there's free memory available.
Even though our memory table was a nice feature, the TA said that we may encounter problems with message passing in part two since it's assuming that whoever requests memory will also free that memory. For message passing, the sender should only have to request a memory block, write its message to the block, and send it off with a system call. The recipient should be the one that frees the memory block.

Part 2

For part two, we have to implement message passing, keyboard interrupts, and timer interrupts. In part one, we simply used busy waiting for printf that waited until the THR (Transmission Holding Register) is empty before writing the next character to the uart (Universal Asynchronous Receiver/Transmitter). In part two, we need an interrupt routine that handles THRE interrupts (Transmission Holding Register Empty interrupts). After getting a THRE interrupt, our interrupt process then needs to get the next message in its pending messages queue, store it in a buffer, free the message's memory, and write the next chacacter in our buffer to the THR.

As for the timer interrupts, we just need to increment an internal counter everytime we get an interrupt and deliver any delayed messages that have expired. Finally, we also need to implement a digital clock on the terminal that updates every second. Overall, it doesn't look too bad and we have about a month before it's due. Unfortunately, we also have at least 3 midterms (depending on our electives) and 3 major assignments between part one and part two.

Mar 1

Well the deadline is only less than a week and we haven't even started yet. GG
Since we just got destoryed in our CS341 (Algorithms) midterm yesterday, let's just add the new sample code and let's just call it a day. We got a SE380 midterm tomorrow.

Mar 2

It turns out that the compiler bug from part one (where it was adding 6 instead of multiplying by 4) was because we're adding a macro value to an int. Apparently all we have to do was wrap that macro with an explicit cast to uint32_t to get the correct result. C is weird...
Stephen spent all day refactoring the kernel linked list implementation to not request/release memory because we already know at compile time how many nodes we need for the PCBs and memory blocks. All we have to do is simply pass the nodes around and change their next/prev pointers.
John worked on removing the memory table because doing a linear search to validate release_memory_block() calls are quite expensive and the kernel_pcb hack is going to bite us in the ass sooner or later.
Fasih and Josh worked on fixing bugs in our scheduler and refactoring so that we can actually understand that clusterfuck of nested if statements.

Mar 3

I doubt anyone is going to be working on OS today because we have a SE380 midterm tomorrow.

Mar 4

Josh gave up on last minute studying because the midterm's weight will get shifted to the final if we do better on the final. He spent the afternoon working on the message-blocked queues.
2 hours later... Fuck.
Let's just go cry ourselves to sleep tonight...

Mar 5

Stephen just spent the night LaTeXing his CO487 (Applied Cryptography) assignment.
Josh is still debugging send and receive messages while yelling at me to LaTeX faster.
John tried to copy and paste strncpy, strcmp, and strlen implementations from FreeBSD source while trying to teach the first years working in the lab how to implement a BST.
Fasih started working on timers.
Josh later blackmailed the rest of us that he won't drive us to get groceries until OS is finished. We could technically just take the bus but it's too cold to take the bus.

Mar 6

I spent the afternoon writing some basic user test processes to test the new message passing APIs that Josh implmented yesterday. Since Fasih isn't done with timers yet, we can't test delayed_send() yet.
Someone didnt set their git author properly so I just see a bunch of commits by unknown.
- It's probably John and Josh working on the keyboard interrupt routine and the KCD (keyboard code decoder).
- Since the interrupt routine needs to send/receive messages and request/release memory blocks, they also need to use our kernel functions. However, the problem is that the interrupt routine should not be preempted by the scheduler or be blocked by memory. I can't believe they got lazy and just copy and pasted the original kernel functions and just removed the calls to the scheduler. Better add this to the list of things to refactor afterwards.
- After debugging all night, Josh learnt that a compiler warning is actually what's breaking his code. Apparently, if you have a macro to a function that returns a void*, the star should be beside the function name instead of the type name. Otherwise, the compiler will say type qualifier on return type is meaningless.
In other news, the SE380 midterm class results are up. Wow, that class average and standard deviation is terrible! I'm scared to get our midterms back in class tomorrow...
Death threat counter: 9

Mar 7

John and Josh continued to work on the KCD and keyboard interrupt routine.
Fasih finally finished the timers so we worked on integration and testing. Great, we're hard faulting everytime an interrupt occurs. G fucking G.
Let's just disable interrupts everywhere to see if we can narrow down the source.
- It seems that we can't disable interrupts when making a system call from user mode otherwise we hard fault. We should probably be more careful in where we're disabling interrupts and make sure we re-enable it before calling k_release_processor.
- A classmate later suggested that we should try and see if printf can work alone without any variables passed in. Oh, we don't hardfault anymore! Why the fuck is printf broken when we have interrupts? Let's go to the author's "webpage":http://www.sparetimelabs.com/tinyprintf/tinyprintf.php and see if there's anything that we missed.
- Are you fucking kidding me? According to this StackOverflow question, it's probably because va_start() used by printf() isn't re-entrant safe on our platform (i.e. the behaviour is undefined if we get interrupted, have everything saved onto the stack, and then restore the system state from the stack). It would've been great if we learned this when we covered system interrupts instead of spending the past 10 hours debugging by trial and error.
Death threat counter: 10

Mar 8

Since everyone else is working on the CS349 (User Interfaces) assignment, I spent the day implenting the clock process (as a user-level process). It's surprisingly simple because all the necessary APIs have already been implemented. All I had to do was to just send a 1000 ms delayed message to itself in an infinite loop. Since it'll get blocked by receive_message after the delayed send, the clock process will only print once a second.
The console display doesn't seem to work properly. For some reason, it's double printing everything I type after a carriage return. There's probably something wrong with how the keyboard interrupt process handles its input buffer before passing it to the KCD.

Mar 9

I had to do some research on how the board actually prints and how interrupts gets triggered to fix the double printing.
Also, it seems that the '\0' character is actually important at the end of input to our THR otherwise the THRE interrupt never gets triggered again. Even though it prints out a space on the simulator, it doesn't actually show up on Putty when we open a serial port to our board.
Okay this a2i function included with the printf file doesn't work as expected so we can't manually set the clock time
- Do we really have to write a complete atoi implementation?
- Oh wait, we can just hard code the value since we only need to parse 2 digits at a time
- Wait why is hour 99 being displayed as 0? Dammit Stephen, it's supposed to be mod 100 not mod 99.
Now that we can finally start the clock, why isn't our clock printing more than once?
We need to swap the global current_pcb variable with the message's sender's PCB in the timer interrupt routine because we could get interrupted while the current_pcb variable is not the message's original sender. Thus when we tried to forward the message via send_message, it will overwrite the message's sender_pid with the global current_pcb.

Mar 12

Demo day! Hurray, we sniped the last slot again so we all get to sleep in till 9 a.m. again.
So far so good, all of the API that we implmented worked as expected.
Wait a minute... shouldn't the wall clock reset to 00:00:00 after 23:59:59? Fuck, why did I make it mod 100 and how did it make it past everyone during code reviews?
Oh wow, they're making us check for error cases even though it wasn't specified in the lab manual. We should talk to the lab instructor later because this was not in the specifications.
For the last part of our demo, we were given a couple of object files that contained their private test processes and we had to link with them to show the TA that the output is correct.
- Hmm, for some reason we're getting linking errors with undefined symbols everywhere...
- Okay so we're not linking properly because we didn't use their specified names for the process table. I don't recall this being in the lab manual either...
- Fuck, it hard faulted for all three test cases. GG marks.

Part 3

Mar 17

First things first: fix the hard fault.
- We may not have the original source code but we can still read the disassembly in the object files. Thankfully, we're dealing with RISC assembly so it's simple for me to read.
- Okay it seems to be hard faulting once their process calls release_processor(). Our scheduler is hard faulting?
- It seems that we hard fault as soon as we try to go from handler mode (in the scheduler) to thread mode (the new process), our OS just hard faults. This is happening when we try to start the null process... Wait no, it's happening when we try to start the Wall Clock process. The current_pcb PID field isn't even being set so it defaulted to 0 (i.e. the null process).
  - Fuck, we were initializing the Wall Clock's process table entry in our own test process file. When we switched to their object file, Wall Clock was never initialized in the process table. No shit it's going to hard fault, the stack is empty so BL 0xFFFF FFF9 will set the PC to 0!
  - Well at least I finally figured out what that weird branch instruction from the sample code does. Apparently, it's a special instruction from the board that moves the stack pointer up 8 words and writes those 8 values to R0-R3, R11, LR, PC, and xPSR. At the very least, our stack needs to contain the correct PC value (i.e. the start of the process) at SP + 0x18. Since this was never set, we ended up branching to address 0x0 (which is actually a valid address on this board) thus causing us to hard fault.
I later implemented backspace for our console IO because I'm tired of restarting an input on a new line every time I mistype something.
Got atoi() working as well. I'm on fire tonight! It may not be as efficient as the cstdlib implementation but it's good enough for our purposes. This sure beats hardcoding input_buffer[4] - '0' to extract the digits, character by character.

Mar 18

Let's add some error checking to our inputs even though it wasn't specified in the lab manual, just sayin'. Weird, we're hard faulting when we try to enter an invalid input. Oh! Our error message buffer is overflowing our stack because I declared a char array of size 80. Solution: double the stack size of every process! Yay, it worked!
If there's one thing I learned this term, it's that buffer and stack overflows are real and are usually hard to deal with without virtual memory. Thankfully we still have enough unused memory that we can easily double the stack size.
All that's left for this deliverable is to implement the stress tests and another keyboard command which Fasih and John wanted to do. I guess I'll try to do something extra to get bonus marks...
Got the LCD display to work!

OS WOW on the LCD Display

Mar 22

Fasih and John finished the remaining processes without any problems. It's amazing how easy development gets becomes once all the base APIs are bullet proof.
Time to do code review and run astyle before we zip this up

Mar 26

Last demo day! Yay, we sniped the last slot 3rd time in a row so we get to sleep in late again.
Well there wasn't that much to implement for this deliverable so it's actually a very short demo. The TA just asked us a couple of questions about the expected behaviour of our system when we toggle the priorities of our stress tests such as deadlocks and starvation.
We finally went through an entire demo session without fucking up!

Part 4

Mar 29

We split up all of the requried parts of the documentation based on how much work we did in the previous parts. Josh will be merging everything together and proofreading it because he was too busy with his elective's assignments when we were doing part 3.
Fasih went to the lab to write the benchmark processes and collect the raw data.
Josh spent the day working on his Feedback prelab so I guess he'll do his share of the documentation tomorrow.
John and I spent the day at home writing our share of the documentation in LaTeX.
- On a side note, I really love the bytefield package for drawing the memory maps!
- Well, we're done our share of the work. Can't believe it's finally over... and just in time for CS341 and CO489 assignments that's due next Friday! Fuck it, I'm playing Diablo 3 tonight, I deserve a break!