Sunday, September 12, 2010

Programming Crash Course

For those crazy enough to want to do something as absurd as programming, here's a crash course.
Sometimes when coding while really sleepy, my brain goes demented and I begin to think that every function in the program is a character and they all begin to chatter with each other.  What does that tell us about the basis of mind - or of reality?  Anyway, onward with the instructions..

The Basics

Programming has just a few basic constructs.  Most other things can then be understood from the framework of those constructs.

  • Variables.
    In math, if you say x=y, then x is always identical to y, no matter what "happens to" x or y.  In programming, x = y takes the current value of y and stores it in x.  x = 10 does the same thing.
    Some typical fundamental variable types are integers, floating-point numbers (a kind of decimal number), strings (text such as "abc"), arrays (a list of values of the same type), and lists or collections (a list of values of arbitrary types).  Lists/collections can contain other collections or even themselves.
  • Conditional statements.
    If x=10 then do this. 
  • Code blocks
    If x=10 then {do this, and this, and also this}
    Code blocks can be nested.
    If x=10 then {do this; if y=11 then {do that}}
  • For loops
    For each x from 0 to 100 {do this with x}
    For each x in this list {do this with x}
  • While loops
    While not x > 10 {do this}
  • Infinite loops
    repeat {x = x + 1; if x = 100 then stop looping}
    I've never seen a language that supports infinite loops directly semantically.  You always have to either use a while loop with a condition that's always true, or in C/C++ you can use a for loop with no parameters: "for(;;) {blah}"
  • Function calls.  A function is a type of code block.  Instead of being the target of a while or for loop or conditional, it stands on its own and is executed by name (possibly within a loop or conditional of some sort).
    Functions are similar to functions in math, though much more flexible.  In math, if you say f(x) = 12x, then f(10) is literally the same as 120, and f(f(10)) is literally the same as 1440.  The same is true in programming.  "y = 10; x = f(y)" is effectively the same as "x = 120".
    In programming, though, there are some differences:
    - a function can take multiple parameters: y = f(10, 20)
    - a parameter doesn't have to be numeric: y = f("hello")
    - a function may or may not return a result: f("hello")
    - a function may return multiple values (though technically it still returns one value which is a list or array or reference to one)
    - a function may have "side-effects", meaning that it can change the state of the program (a global variable, a file, the screen) independently of what value it returns, if any.  For example (in Python),
    b = 10
    def a():
      global b
      b = 20
    print(b)
    a()
    print(b)
    >>> prints 10, then 20
  • Grouping
    You've probably already deduced this from the above, but grouping is very important in programming. It's basically the nesting of expressions.  f(f(10)) does some grouping.  Here's another example. (2+3)*5.  Without grouping, that would be 2+(3*5) per the order of operations.
    In programming, "order of operations" takes on a new dimension, because there are many more syntactic structures to interleave than in plain algebra.  This statement in Python exercises 13 distinct levels of Python's operator precedence (the comparison operators in, is, !=, == and < all share a single level, and "**" is shown twice because it's right-associative):

    a = lambda: b or c and not d in e is f != g == h < i | j ^ k & l << m + n * o ** p ** -q

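
As a rough sketch of the constructs listed above, here they are in Python 3.  The names f, total and count are made up for illustration.

```python
# Assignment copies the current value; it is not a permanent equation.
y = 10
x = y        # x is now 10
y = 20       # x is still 10

# A function, as in the f(x) = 12x example.
def f(x):
    return 12 * x

# For loop: add up the numbers 0 through 100.
total = 0
for n in range(101):
    total = total + n

# "Infinite" loop: a while loop whose condition is always true,
# stopped from the inside.
count = 0
while True:
    count = count + 1
    if count == 100:
        break

print(x, f(f(10)), total, count)   # 10 1440 5050 100

# Grouping and precedence.
print(2 + 3 * 5, (2 + 3) * 5)      # 17 25
print(2 ** 3 ** 2, (2 ** 3) ** 2)  # 512 64, because ** groups right to left
```
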
Programming with Class

In object-oriented programming, we add another element: the class.

What if you want a variable to contain not just one number or string, but several items of data?  Why not just use several variables?  Well, consolidating them into one variable makes them easier to manipulate as a unit (passing to functions, returning from functions, assigning to another variable, etc.)  This ability alone is what a 'struct' in C provides (in C++, a struct is really just a class whose members are public by default).  A class goes a little further, and adds functions to such constructs. 

In C++ you'd declare an integer like this:
int blah;
If you had a struct called q, you could declare an instantiation of that struct like this:
q blah;
We could instantiate a second one,
q hello;
If q had a member variable called y, you would refer to it like this:
blah.y
If q were a class and had a function called z, you could call the function like this:
blah.z();

z would be fairly likely to modify internal variables within that blah object.  For example, it could change the value of blah.y but not of hello.y.   In this way classes become like little units of encapsulated intentionality in code.  For example in a game, a house object and all the things you might do to the house could all be encapsulated into a house class.  An instantiated instance of a class (for example, blah or hello in the above examples) is called an "object", or sometimes an "instance".

Classes work in a hierarchical fashion; one class can "inherit" another.  So if class 'blah' has a method called "hi" and class "hello" has a method called "bob", and class "hello" inherits class "blah", then an instance of class "hello" will contain both methods "hi" and "bob".  If both "blah" and "hello" define a method called "hi", then "hello"'s method will override "blah"'s method.  A class that's inherited is called a superclass.  Some languages allow inheriting multiple classes (on the same tier of inheritance), some don't.  

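#include <iostream>
using namespace std;
class blah {
public:
  void hi() {cout << "hi";} };
class hello : public blah {
public:
  void bob() {cout << "hi, from bob";} };
int main() {
  hello bob_the_builder;
  bob_the_builder.hi();     // inherited from blah
  bob_the_builder.bob(); }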

In this case, blah is a superclass of hello.  Superclasses generally contain more generalized functions and variables than their inheriting classes (or functions that must be overridden), which is why they're superclasses waiting to be inherited by multiple subclasses.   
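
For comparison, here is a minimal sketch of the same inheritance idea in Python 3, including the overriding case described above.  The LoudHello class is hypothetical, added only to show an override.

```python
class Blah:
    def hi(self):
        return "hi"

class Hello(Blah):            # Hello inherits from Blah
    def bob(self):
        return "hi, from bob"

class LoudHello(Hello):       # overrides the inherited hi()
    def hi(self):
        return "HI"

bob_the_builder = Hello()
print(bob_the_builder.hi())    # hi  (found in the superclass)
print(bob_the_builder.bob())   # hi, from bob
print(LoudHello().hi())        # HI  (the subclass's version wins)
```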

Changing the behavior of a class can change the behavior of a whole program in every place that employs that class.  But the same goes for any normal function.  Classes aren't strictly necessary for programming; some people program only in C, which doesn't have classes (C++ was originally called "C with Classes.")  Classes were just a revolution in programming ideology that seems to make a lot of things easier to work with, conceptually and also in some other ways such as with collaborative environments.

Sometimes a class will want to allocate memory (for example, for a new array) when an instance is created, and then deallocate the memory (free it for the system) when the instance is destroyed.  In some languages you don't need this because the "garbage collector" automatically detects when information goes out of scope, i.e.  can't be referenced and hence used anymore, and deallocates it.  In C++ you do need to do this (for arrays and such things, not for normal variables), and the function that is automatically executed when a class is instantiated is called the "constructor".  The function that is automatically executed when the object is deleted is called the "destructor".  In Python they're called __init__ and __del__ respectively, but __del__ is much less commonly needed (because Python is a garbage-collecting language).

Classes can have special types of members, such as static members (shared by the whole class rather than belonging to one instance), private members (usable by the class's own functions but not by code outside the class), and constants (which can't be modified once set).  In C++ a class can also declare "friends": specific other classes or functions that are granted access to its private members.

Namespaces

Notice that if we didn't have classes to put "hi" and "bob" in, they'd have to be functions within the main namespace.  Polluting the main namespace with too many functions from various sources and unrelated purposes gets really messy.  

There are other ways to have namespaces than classes, though, depending on the language.  C++ has a "namespace" keyword for exactly this purpose, as does C#.  In Python, every module and package is a namespace of its own.

For example, in Python:
import xml.dom.minidom
d = xml.dom.minidom.parseString("<greeting>hello</greeting>")
print(d.documentElement.firstChild.data)

Boilerplate Code

In some languages you just start typing instructions, and when you run the code the instructions are executed.  Other languages require certain standard code just to make a working program.  In C and C++, for example, the entry point to the program is the main() function.  If you don't define a function called "main", the program won't even build.  So to say "hello world" in C++ you need at least this:

#include <iostream>
int main() {
  std::cout << "hello"; }

In Java you need something even worse than a function: you need a class with a function in it in order to make any working program.

<<

You may be distraught to notice that << seems to represent some sort of flow of information not falling under any of the above categories; i.e., why didn't we just say cout("hello")?  Well, C++ actually uses a trick to accomplish this.  << is an operator just like any other, and in C++ operators can be redefined for your own types.  Just as + hypothetically calls a function called "add", "<<" in this context (in other contexts << does something different in C++, namely a bit shift) calls a function, operator<<, with cout and the parameter.  cout is not itself a function but an object representing the output stream, and after printing the parameter, operator<< returns that same stream.  That way we can chain things together like "cout << a << b << c;" 
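
The chaining trick can be imitated in Python 3, which lets a class define what << means through a method named __lshift__.  This is only a toy sketch: the Out class is hypothetical and collects text in a list rather than printing it.

```python
class Out:
    """Toy stand-in for cout; collects text instead of printing it."""
    def __init__(self):
        self.written = []

    def __lshift__(self, value):     # called for: out << value
        self.written.append(str(value))
        return self                  # returning the stream allows chaining

out = Out()
out << "a" << 1 << "c"               # same as (((out << "a") << 1) << "c")
print("".join(out.written))          # a1c
```

Because each << hands back the same object, the next << in the chain has something to operate on.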

High-level and low-level languages

C++ is a low-to-medium-level language, low enough to be a systems programming language.  This means it's tedious to program in, but executes fast and (potentially) produces tight code.   

Python is a scripting language which is about as high-level as they get.  Your average Python program contains about 6.5 *times* fewer lines than your average C++ program that does roughly the same thing, and it's more readable and more flexible.  On the downside, Python runs approximately 10 to 100 times slower than C (but speed is not always crucial, because most apps aren't CPU-bound but I/O or user-input-bound). Higher-level languages are generally more brief, more readable, more flexible, and easier to debug.

Assembly is the lowest-level programming language, as its keywords are merely mnemonics for the CPU op-codes themselves.  Needless to say, programming in assembly takes a lot of time.  A good assembly programmer can, in favorable cases, produce code that runs up to 10 times faster than the corresponding C program, though with modern optimizing compilers the gain is usually far smaller.

There are various distinct technologies by which programs are compiled and/or executed.
  • Natively compiled (low- to medium- level)
    The program is converted into machine code ahead of time.  This is typical of relatively low-level programming languages, though not exclusive to them.  Examples are D, Go, Delphi, Lisp, C++, C, Visual Basic 5 and 6, C--, HLA and assembler.
  • Just-in-time compiled (medium- to high-level)
    The program is compiled to bytecode, which is a sort of compacted intermediate code, that the program's run-time environment converts into machine code on-the-fly.  This gives the code a level of dynamism that isn't normally afforded by machine-compiled languages, while still retaining a lot of speed.  Examples are Java, JavaScript, ActionScript, C#, VB.net, and any other .net language.
    (If you program in C++ in Visual Studio, it's compiled to native code by default; only a C++/CLI project, i.e. one built with the /clr option, is compiled to .net.  As a .net program it should have about the same speed as C#, so you might as well use C# because it's much more flexible, elegant, and easy to debug, and it supports some more modern programming constructs.)
  • Interpreted from bytecode (medium- to high-level)
    The interpreter converts the program to bytecode, but it does not perform any just-in-time native compilation; it simply executes the bytecode.  Examples are Python, Ruby, Lua, PHP, Perl, QuickBasic, and Visual Basic 4.0 and lower.  
  • Interpreted from source code (I don't even know why someone would do this.)
    The interpreter reads the code directly and executes it.  I don't know of any examples.  Probably extremely simple embedded languages, possibly without any functions or loop constructs (thus making them not Turing-complete).

Scripting languages (like Python, Ruby, Lua, Perl, and PHP) are high-level languages, intended to facilitate expedient programming. Technically they're supposed to script already-existing applications, but Perl, PHP, Python, and Ruby aren't generally used for this.  Lua is more commonly used for this, as an embedded part of the application.

In most of the natively-compiled languages, an integer is just an integer, a float is just a float, a function is just code, etc.  The compiler essentially references them using pointers to memory locations and knows at compile-time what kind of value you're referencing and thus how to deal with it.  In higher-level languages, more types of things tend to be objects, or, referring more to the internal workings of the interpreter, "boxed." In C, for example, a string is just some characters followed by a null character.  In Python, it contains a header field that specifies its length, member functions, something indicating what type of object it is, etc.  Even classes are objects in Python, which means you can pass them around as parameters, modify them during run-time, etc.; you generally can't in natively compiled languages.
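
To make "even classes are objects" concrete, here is a small Python 3 sketch.  Greeter and make_one are hypothetical names chosen for the illustration.

```python
class Greeter:
    def greet(self):
        return "hello"

def make_one(cls):
    # The class itself arrives as an ordinary argument.
    return cls()

g = make_one(Greeter)
print(type(Greeter).__name__)   # type  (the class is itself an object)
print(g.greet())                # hello

# Modify the class at run-time; existing instances see the change.
Greeter.greet = lambda self: "goodbye"
print(g.greet())                # goodbye
```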

In C++, you #include a file that declares a function that does a certain kind of operation, then call that function on the appropriate object. Example: 
#include <cctype>
#include <iostream>
int main() {
  char c[] = "HELLO";
  for (int i = 0; c[i] != '\0'; i++)
    c[i] = std::tolower(c[i]);   // tolower converts one character at a time
  std::cout << c; }

In Python:
c = "HELLO"
print(c.lower())

Notice that in Python the top-level namespace isn't polluted by all the functions of the ctype library like tolower.  Instead, lower is a member function (a "method") of the type that's appropriate to execute that function, namely strings.  This is a fundamental characteristic of object-oriented programming, especially of the higher-level variety.  PHP is more like C/C++ in this respect, though.

Static typing, dynamic typing, and duck typing

In static typing, the compiler knows at compile-time the type of each variable/object.  That's why in C++ we declare a variable something like: 
int x;
Then the compiler knows that x will -always- be an integer (32 bits wide on most current platforms) within the lexical scope in which it's defined.  Static typing is almost a necessity for natively compiled languages.   
Dynamic typing works more like this:

x = 10
print(x)
x = "hello"
print(x)

That is, the same *name* can be reassigned to objects of differing types.  This is where its internal boxing becomes useful.  Static typing is also considered a way to keep programs under control, i.e. to help reduce hard-to-find bugs.  It's a debatable topic.  Python's way of dealing with typing is sometimes cleverly dubbed "duck typing."  What it means is basically this.  

def a(s):
  return ''.join((x[0].upper() + x[1:].lower() for x in s.split()))

If you pass it "hello there my name is Bob", it'll return "HelloThereMyNameIsBob".
If you pass it 16, it will give you an error because 16 is an integer type and integer types don't have a member called "split".  If it does have a member called "split" and it returns the wrong type of object, that type of object probably won't have members called "upper" and "lower".  If it does and they don't return strings (or something that behaves suitably like strings), ''.join will choke on it. So the result is still that you get an error, you just get it at run-time instead of compile-time.  So the importance isn't so much on the fact that s "is" a string, but that it acts like one.

Duck typing is an allusion to two things simultaneously: 1. it "ducks" (as in dodges) typing, i.e., getting the same effect of code security without having to worry about it explicitly, and 2. it's "duck" typing, as in, if it walks like a duck and quacks like a duck, it must be a duck.  That is, our 's' parameter could have been any other type of object that happens to have a 'split' function that returns objects that have 'lower' and 'upper' functions that return objects that have whatever functions ''.join requires of them.
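
Here is a sketch of that last point in Python 3.  The function a is the one defined above; the Sentence class is hypothetical, invented only to show an object that is not a string but quacks enough like one.

```python
def a(s):
    return ''.join(x[0].upper() + x[1:].lower() for x in s.split())

class Sentence:
    """Not a string, but it has a split() that returns strings."""
    def __init__(self, words):
        self.words = words
    def split(self):
        return list(self.words)

print(a("hello there my name is Bob"))      # HelloThereMyNameIsBob
print(a(Sentence(["quack", "QUACK"])))      # QuackQuack

try:
    a(16)                                   # ints have no split()
except AttributeError as err:
    print("run-time error:", err)
```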

Dynamic typing is not to be confused with "loose typing."  Loose typing is when a programming language silently converts values between types to make an operation go through, so that adding the string "1" to the number 1 quietly produces some answer instead of an error; personally I think it hides more bugs than it saves keystrokes.  The opposite of loose typing isn't static typing, it's "strict typing."  Loose typing is also known as "weak typing", and its opposite in that context is called "strong typing."  Python has strong dynamic types.  C++ has (fairly) strong static types.  PHP has weak dynamic types.  
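
A quick Python 3 sketch of "strong but dynamic": a name can be rebound to any type, yet mixing types in one operation is refused rather than silently converted.  (In PHP, by contrast, "1" + 1 evaluates to 2.)

```python
x = 10
x = "hello"            # dynamic: the name x can be rebound to any type

try:
    result = "1" + 1   # strong: no silent str/int conversion
except TypeError:
    result = "TypeError"
print(result)          # TypeError

# Conversions have to be asked for explicitly.
print(int("1") + 1)    # 2
print("1" + str(1))    # 11
```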

C++ Pointers

Unlike most programming languages, in C and C++ you can work more-or-less directly with memory locations.  (This is a necessary requirement for any systems-level programming language, though.)  

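#include <iostream>
using namespace std;
int main() {
  int p = 16;
  int q = 17;
  int nums[2] = {16, 17};

  int *p_pointer;
  p_pointer = &p; 
  cout << *p_pointer;    // prints 16, the contents of p

  p_pointer = nums;      // point at the first element of an array instead
  p_pointer++;           // step to the next int in the array
  cout << *p_pointer; }  // prints 17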

In C and C++, the ++ operator increments an integer by one.  With pointers it actually increments the address by however many bytes it takes to reach the next value of the pointed-to type (an int takes 4 bytes on most computers).  That only makes sense inside an array, where the next int really does come right after the current one; stepping a pointer past a lone variable like p and then dereferencing it is undefined behavior, even if q happens to sit next to p in memory.

"int *p_pointer;" tells the compiler that p_pointer is a pointer, and specifically a pointer to an int.  That means that, without some sort of trick like casting, you can't assign anything to p_pointer other than the address of an int.  It works kind of like the type system in that respect (in fact, it is part of the type system).

&p refers to the address of p.  The compiler associates the name "p" with the address of an int (the place where it stores 16 in binary).  But when you say something like, "cout << p;", or "int y = p;", it wouldn't do you much good to print out p's memory address or store it in y, so you're normally talking about the contents at the location the compiler associates with p.   &p retrieves the actual address. 

"p_pointer = p;"  would have been rejected by the compiler, because p isn't a pointer type and p_pointer is.
"q = &p;" wouldn't have worked either, because &p is a pointer type and q isn't.  
Similarly "*p_pointer = &q;" would have been rejected because *p_pointer dereferences the pointer; it refers to the content pointed at by p_pointer, and thus *p_pointer doesn't refer to a pointer type.
These rules exist strictly for safety reasons (so your program is less likely to crash), not functional reasons.

'*' and '&' are sort of complementary in a way, as you may have noticed, but it would be misleading to think they're exact opposites.  For example, you can do:

int p = 90;
int *q_p;
q_p = &p;
int **q_p_p;
q_p_p = &q_p;
int ***q_p_p_p;
q_p_p_p = &q_p_p;
***q_p_p_p = 120;

(in which case p will be changed to 120)

But that doesn't mean you can do:

q_p = &&&q_p_p_p;

& only works once: the result of &x is a temporary value rather than a named variable, so it has no address of its own to take.  You can't follow memory locations backward; what if more than one thing points to that memory location?  And why would we store a backreference anyway?  p, q_p, q_p_p, and q_p_p_p are all just single names with a single memory location and single type; the number of *'s associated with them only determines what can be done with them semantically, i.e. the type (or level) of pointer they are, for purposes of programming safety and clear expression. 

I understand machine language, have taken C++ under Kip Irvine (a bestselling author), read 3-4 C++ books, spent many hours of time in C++ chat, and made a few C++ programs that use pointers, and it took me over 10 years to finally figure out what I just told you; so consider yourself lucky. >:P

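& can also occur in a variable's definition, as in:
int& a = b;  // a reference must be bound to an existing variable (here an int b) when it is defined
or
int f(int& a, int& b) {blah..} //defines a function where the caller implicitly passes the addresses of a and b instead of their contents. 

&-defined types are called "references", and they have strange properties, sort of like a pointer but with severe restrictions on use.  They can basically act as another name for the dereferenced variable (i.e. the value pointed to by the address) but cannot be stored in an array, be made to refer to a different variable later, or have their own address taken (taking the address of a reference gives you the address of the variable it refers to).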

In:

void f(int& a, int& b) {
  a = 10; b = a*100; }
int x = 20, y = 200;
f(x, y);

the function returns no value, but changes x and y directly in the calling scope (x becomes 10 and y becomes 1000).  Notice that f(20, 200) wouldn't have worked here; it would be a compile error.  You can't bind a non-const reference to a literal value.

In:

void f(int a, int b) {
  a = 10; b = a*100; }
int x = 20, y = 200;
f(x, y); 

x and y aren't changed, because the *values* or contents, not the addresses, of x and y are passed to the function.  The function's own a and b variables are local to that function; they change the *copy* of the contents that was passed to the function via the stack.  If the function had called its own internal variables x and y, it still wouldn't change the outer x and y; different lexical scope, so different variables.

"int *a" means the same thing as "int* a", "int * a" and "int*a".  Same with "int& a", "int &a", etc.  It's just a philosophical issue where to put the spacing.  And believe me, knowing how programmers are with issues of style, I'm sure there's a religious war about that going on right now as we speak, somewhere.

A multidimensional array might use multiple dereferencing.   For example, you might see some functions define a parameter as char**.  char* is C's way of doing strings; a char is a single character, e.g. 'a'.  A string is a series of characters, so char* defines a pointer to the first character.  In "hello", the pointer would point to the h.  Internally "hello" is stored as "hello\0" ('\0' notates a null character, or ASCII 0), so functions know where to stop reading the string in memory.  

So if you need an array of strings of this kind (as opposed to instances of a string class), you need an array of arrays.  Each element of the array points to an array, which means each element is a pointer, and the array as a whole is reached through a pointer to a pointer.  Hence an array of strings may be defined as char**.  

It's similar for any other kind of multidimensional array.  An array of arrays of arrays of ints might be defined as "int ***b;"  But if all the dimensions of your array are of known constant size (say, 12, 32 and 90), you might as well define the array this way: "int b[12][32][90];".  Then your array is pre-allocated and you can refer to the 5th element of the 4th array of the 3rd array of arrays as "b[2][3][4]" and the compiler does the math.  (Indexing starts at zero which is why my ordinals are one higher than my indices.)  The math, in this case, is that the element sits 2*32*90+3*90+4 ints past the start of the array.  "b" by itself, used in an expression, converts to a pointer to the first item of the array (here, the first array of arrays), while &b is a pointer to the whole array; both hold the same starting address.  
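
The index arithmetic can be written out as a short Python 3 sketch.  flat_index and DIMS are hypothetical names; the sizes are the 12, 32 and 90 from the example above.

```python
DIMS = (12, 32, 90)          # sizes from the b[12][32][90] example

def flat_index(i, j, k, dims=DIMS):
    # The first dimension's size (dims[0]) is never needed.
    _, d1, d2 = dims
    return i * d1 * d2 + j * d2 + k

print(flat_index(2, 3, 4))   # 6034
print(2*32*90 + 3*90 + 4)    # 6034, the same arithmetic written out
```

Note that the size of the first dimension never enters the calculation.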

When not statically defining your arrays (i.e., when not using brackets in your definition), if you want them to refer to any meaningful content you must allocate that memory yourself.  When using *'s, the compiler won't automatically allocate that space for you and put something in it.  In C this is what you'd use malloc() for.  In C++ you can also use the "new" operator, although they had to make several arbitrary design decisions in how "new" (and its corresponding operator, "delete") work, and not everybody's happy with them.   If you're creating instances of a class this way, however (not just primitive types like integers), you'll need the "new" operator, because malloc() only hands you raw memory and never runs the constructor.

When you instantiate an object in C++ (and almost any other OO language), you do it by calling the class as if it's a function.  If you pass parameters, the parameters must match the type of parameters accepted by the class's constructor function.  Example:

class hi {
public:
  int c;
  char* d;
  hi(int a, int b) {
    c = a; 
    d = new char[b]; }
  ~hi() {
    delete[] d; }    // memory from new[] must be released with delete[]
  void hello() {
    c = 1;  } };
int main(int argc, char** argv) {
  hi one(10, 20); }

In C++ a constructor is identified by having the same name as the class (ugh) and has no return type, and a destructor is the same except the name starts with "~" (ugh).

Notice this time we included two parameters to main().  argc is the number of arguments passed on the command-line, and argv (a char**) is an array of strings (the arguments).   Note that "int main(int argc, char* argv[])" would have functionally the same effect.   In a multidimensional array parameter, the *first* dimension doesn't have to have a fixed size; that's because, if you notice, the math for indexing a single element doesn't require the size of the first dimension.  a[10][20][4] is found in the same place regardless.  Same with b[4] in a one-dimensional array: it's found four positions after b (AKA b[0]).

When instantiating a class, different languages have different philosophies on what to do with superclass' constructors.  In Python you have to call the superclass constructor explicitly (normally within the subclass' constructor).  Python also has a different philosophy on how to refer to the object itself within a class method.  

  For example:

class a:
  def __init__(self):
    self.b = 1
  def say(self, s):
    print(s)
c = a()
print(c.b)
c.say("hello")

Notice that in both functions, "self" is declared explicitly as the first parameter.  That's the particular instance of the class currently being processed.  I could have called it rumplestiltskin if I wanted to; it's just the first parameter passed.   But style is an issue here; that is, if you *don't* call it "self", the collective Python community will beat you to death with several crowbars simultaneously and then have a group martini.

In C++, if you need to refer to the instance directly within the class, you use the "this" keyword, which is a pointer to the instance.

In both Python and C++ you can access class members directly through the class rather than through instantiated objects.  In Python it's classname.member; you'd just better pass the instance as the first parameter explicitly if you're calling a function.  (If you call the function as an instance method, Python automatically passes the object as the first parameter.)  In C++ it's classname::member, which works directly only for static members, since an ordinary member function still needs an instance to operate on.
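
A small Python 3 sketch of the two calling styles.  The class A here is a hypothetical variation on the earlier example, returning a string so the two results are easy to compare.

```python
class A:
    def __init__(self):
        self.b = 1
    def say(self, s):
        return "%s (b=%d)" % (s, self.b)

c = A()
print(c.say("hello"))       # hello (b=1)   instance passed automatically
print(A.say(c, "hello"))    # hello (b=1)   instance passed by hand
```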

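Another difference between Python and C++ is that Python can have functions defined within functions as well as classes defined within classes.  C++ allows classes within classes, but not named functions within functions (C++11 lambdas fill most of that gap).  D allows both.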
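
A brief Python 3 sketch of both kinds of nesting.  make_counter, Outer and Inner are hypothetical names.

```python
def make_counter():
    count = 0
    def increment():          # a function defined inside a function
        nonlocal count        # refers to make_counter's own variable
        count += 1
        return count
    return increment

class Outer:
    class Inner:              # a class defined inside a class
        label = "inner"

counter = make_counter()
counter()
print(counter())              # 2
print(Outer.Inner.label)      # inner
```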

Aside from "::" and "." to deal with in C++, you also have "->".  "->" behaves much like "." when referencing a member of an instance, but it applies when what you have is a pointer to the instance rather than the instance itself.  For example:

class a {
public:
  int c; };
int main() {
  a b;
  b.c = 5;
  int q = b.c; }

Here "." is the right choice, because there is exactly one "b" defined by name.  That means the compiler knows where "b" lives, and that means it knows where "b.c" lives (all instances of a class have the same size and the members are kept in the same relative places, just like with structs).

But what if we did...

class a {
public:
  int c; };
int main() {
  a *b = new a[100];
  a *r = &b[10];
  int q = r.c; }

That won't work.  r is a pointer, not an instance, so it has no member c; the instance it points to was only located at run-time.  Instead, we do:

class a {
public:
  int c; };
int main() {
  a *b = new a[100]();
  a *r = &b[10];
  int q = r->c;
  delete[] b; }

In that case, it finds c just by taking the address currently stored in r and adding c's offset.  (r->c is shorthand for (*r).c; the parentheses are needed because "." binds tighter than "*".)

So you use "->" in place of "." wherever you use a *pointer to an instance*.  Even if c were a function and you were calling r->c() you'd still use "->".

If I had an estranged pointer p that for some reason I never bothered to tell the compiler the real type of (say it's a void*), and I wanted to call a function called "g" defined in class a on the instance pointed to by that pointer, I could do it like this: 
((a*)p)->g() 
That casts p as a pointer to an instance of a, then calls g() on it.   Needless to say, if p isn't *actually* a pointer to an instance of a, you're gonna get fubar.  Casting is basically used to tell the compiler to treat something of another type or of void type as if it's the type you want it to be.  But it has a bit of an ambiguous role.

If you have two unrelated custom classes a and b, and you cast a pointer to a b instance as a pointer to an a instance, it's just going to blindly act as if it points to an instance of a.  On the other hand, if you do something like cast a float to an int, as in "float a = 3.2; int b = (int)a;", it will actually convert the value (b becomes 3); it won't just treat the binary data as if it were an int (which wouldn't have very meaningful results).

So if you *actually* wanted to reinterpret the bits of a float as an int, one classic trick is a "union" construct.  A union is a crazy-ass construct that's sort of like a struct, except that every single member is stored *in the exact same location*.  The size of an instance of a union, therefore, is (at least) the size of the largest member type in the union's definition. (You can store a char and a Uint32 in the same place, but not if the place only has room for 1 byte..)  Be aware that writing one member and reading another is sanctioned in C but is technically undefined behavior in C++, where copying the bytes with memcpy is the safe route.
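
The difference between converting a value and reinterpreting its bits can be seen in a Python 3 sketch using the standard struct module.  Python has no unions, so this swaps in a pack/unpack round trip; float_bits is a hypothetical helper name.

```python
import struct

def float_bits(value):
    # Pack as a 32-bit float, then read the same 4 bytes back as an unsigned int.
    return struct.unpack("<I", struct.pack("<f", value))[0]

print(int(3.2))               # 3, a conversion: the value is preserved (truncated)
print(hex(float_bits(1.0)))   # 0x3f800000, a reinterpretation: the bits are preserved
```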

Function Overloading

int a(int b) {return b*2;}
int a(float b) {return b*2;}

I have two separate functions by the same name, and they take different types of parameters.  The compiler will remember all of these functions, and know which one to select by which types of parameters you *pass* it.  If I pass a a float, it runs the second function.  If I pass a an int, it runs the first function. 

Python doesn't have function overloading, per se, but you don't need it as much:

def a(b):
  return b*2

works on both ints and floats.  And when you do need it, you can do it explicitly within one function:

def a(b):
  if type(b) is int:
    raise ValueError("SORRY I DON'T MULTIPLY INTS YOU SCUM")
  elif type(b) is float:
    return b*2
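
If you'd rather not write the type checks by hand, Python 3's standard library offers functools.singledispatch, which picks a version of a function based on the type of the first argument.  This is a sketch of an alternative, not something the examples above rely on; the return strings are made up for illustration.

```python
from functools import singledispatch

@singledispatch
def a(b):
    # Fallback for any type without a registered version.
    raise TypeError("no version of a() for %s" % type(b).__name__)

@a.register(int)
def _(b):
    return "int version: %d" % (b * 2)

@a.register(float)
def _(b):
    return "float version: %.1f" % (b * 2)

print(a(3))      # int version: 6
print(a(1.5))    # float version: 3.0
```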

In C++, it's kind of a shame, though, to have to repeat code for every possible type or combination of types.    Function overloading doesn't always (usually doesn't) use the exact same instructions but just with different types applied.  But when it *does*, like in my example, there's a better way to do it.

Templates

If I make a into a *template*, then I can specify the code (return b*2) without having to specify specifically what type b is.  C++ is not a dynamic language, which means that during the *compilation* stage it must know what the type of b is, or it won't know how to accurately manipulate it.  So how is it that you could write one function that accounts for both an int being passed and a float being passed?  The answer is that, when you write a template, the compiler actually implicitly compiles a version of the function for each possible  combination of types of parameters you could pass it.   

Of course that doesn't make much sense, because that could imply hundreds of versions easily.  What it actually does is include the combinations of parameters that you *actually use* in your program.  Because it's not a dynamic language, it can tell semantically what each of the ways you're ever gonna call the function are during compilation.  That's the same way it knows which overloaded function to call in a given place anyway.  

Main is Obsolete

Using main, argc and argv is actually out-dated, unless you just want to make a text/console program.  But you'll still see it as the primary method of teaching C++.  It's out-dated because in a GUI environment (like a normal Windows app), main() isn't really the main function (a Win32 program starts at WinMain instead).  The application needs a message loop, and event handlers that run only when the corresponding event arrives.
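
To give a feel for what a message loop does, here is a toy sketch in Python 3.  It is not how any real GUI toolkit spells it: the event names, the handlers and the pre-filled queue are all hypothetical stand-ins for messages the operating system would deliver.

```python
from collections import deque

log = []

def on_click(data):
    log.append("clicked at %s" % (data,))

def on_key(data):
    log.append("key %s pressed" % data)

handlers = {"click": on_click, "key": on_key}

# Stand-in for the messages the OS would deliver.
messages = deque([("click", (10, 20)), ("key", "q"), ("quit", None)])

while True:                      # the message loop
    kind, data = messages.popleft()
    if kind == "quit":
        break
    handlers[kind](data)         # run only the handler for this event

print(log)
```

The program's own code never decides what happens next; it just waits for a message and hands it to whichever handler was registered for it.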