Introduction to the Tools
Every new discipline has tools that you need to be familiar with to get started. Those tools for data science in Python are Anaconda and IPython Notebooks. This section introduces you to those tools.
Anaconda
Much of the scientific Python stack requires complicated, compiled code to run. Normal installation methods require you to compile tools like Numpy and pandas on your computer before you can use them. Anaconda is a distribution of Python that comes with all of these tools ready to use.
Anaconda ships with a tool call conda
for managing your local
packages and updating your code. We're going to create an environment to do
all of our work in:
$ conda create --name nicar anaconda python=2
... output from conda ...
$ source activate nicar
(nicar) $
If you're on Windows, you don't have to include the source
.
Now you have an environment that includes the entire Anaconda stack inside it.
There's a lot more that conda can do, including exporting environments for you to share with others, building packages, and more. Rather than dive into those, we'll leave that as an exercise for the student. A great place to start is conda.io.
Quick Note on pip
You can install most data science packages from Python using conda install
.
You might come across a new package or something not specifically related to
data science that you can't install using conda
. When that happens, you can
rely on the standard Python installer: pip
. It's automatically available in
any conda environment that has Python installed. If you somehow end up with an
environment that doesn't have it installed, you can install it via conda
install pip
.
Also, if you don't want to try the Anaconda distribution, you can still work with iPython, pandas and bokeh. You can install each of them via pip in your own virtualenv:
pip install ipython pandas bokeh
This will handle all of the dependencies for pandas, but you will need to have a compiler installed. On a Mac, you can install the Command Lines Tools for XCode by downloading it at Apple's developer site.
IPython Notebooks
A key companion to any scientist is their notebook. It's the place where they work out their ideas, flesh out theories, and plan experiments. Their killer feature is that they can be shared and read by other scientists.
Like their analog counterparts, IPython notebooks (often referred to as ipynb
files) allow you to capture your thoughts in one place including code that can
be executed again and the output of them.
Let's get started. The first thing we need to do is create a directory where
we can store all of our work. I'm going to call my directory notebooks
in my
home directory. You can name it whatever you'd like, but if you use a different
name, make sure to adjust the commands appropriately.
(nicar) $ mkdir notebooks
(nicar) $ cd notebooks
(nicar) ~/notebooks/ $
Now that you have that directory created, it's time to create your first notebook. From the command line, run this command:
(nicar) ~notebooks/ $ ipython notebook
# TODO include output
Your default web browser should launch and load up http://localhost:8888/tree
.
It should look like this:
TODO Screen Cap
Let's create a new one. Click on the New Notebook
button in the upper right.
TODO Screen Cap
Notebook Basics
This new interface is the place where all of your work is going to take place. Each notebook is made up a collection of cells. You can run each cell individually or in combination with other cells. Let's start with the basics and have it say hello world:
print("Hello, World!")
Click the play button in the toolbar that cell will run and display the result.
TODO Screen Cap
Notice that the empty brackets over to the left now have a 1
in them.
That's to show you how many commands have been run in this notebook.
Once you create a cell and run it, you can go edit it, then re-run it. Click
on the print
statement again and edit it to simply assign a variable:
who = "World"
Click play again and notice that the 1
changes to a 2
and the output went
away. The output disappeared because there is no output when you assign a
variable. Go back and edit the cell again to include the variable who
on a
line by itself, then re-run it.
who = "World"
who
This shows two things:
- Variables by themselves are outputted if they're the last item in the cell
- Each cell can be multiple lines!
Now let's go down to the cell below our first one and add another print
statement like this:
print("Hello, {}!".format(who))
This time when you click play on this cell, you notice that it pulls the
variable who
from the first cell and uses it inside the print
function.
Variables are shared throughout a notebook the same way they are when you write
a script into a single file.
Other Types of Cells
You're not limited to just Python in the cells. Through the magic of iPython, you can use bash commands like ls
to list the contents of a directory or cd ..
to navigate to another folder.
Or you can keep track of notes using
Markdown. To do that, select the current empty cell below your code, then
select the dropdown from the toolbar that says Code
and switch that to
Markdown
. There you can add whatever you like:
# Notes!
This is plain Markdown text. You can place [links][1], *emphasis*,
and so on. When you render this block by hitting the play button it
will render as full Markdown.
[1]: http://github.com/
Now, you're ready to start exploring data with pandas.