
I am working on classifying mammography scans with a TensorFlow ConvNet. The scans are classified into five classes:

  • Normal
  • Benign Calcification
  • Malignant Calcification
  • Benign Mass
  • Malignant Mass

I was unsure of how I wanted to classify the scans, so I created the model in such a way that it would work for any combination of classes. I initially started training with binary classification - normal or abnormal - with the goal of expanding the number of classes once I had a model that made decent predictions on the binary case.

For the binary prediction I used precision, recall and a PR curve as metrics. When I expanded to multiple classes those metrics obviously no longer worked as-is. As far as precision and recall go, I don't really care what type of abnormality the scan shows - I just care that it is abnormal at all. And I wanted to have the same metrics to compare across all my models, so I had to figure out a way to compute precision and recall for all versions of the model.

The solution I came to was to "squash" my multi-class labels and predictions down into binary labels and predictions and feed those into the p/r metrics. I set up the classes so that 0 was always normal, so I can do the squashing as follows:

# 0 is always the "normal" class, so anything greater than 0 is abnormal
zero = tf.constant(0, dtype=tf.int64)
collapsed_predictions = tf.greater(predictions, zero)
collapsed_labels = tf.greater(y, zero)

collapsed_predictions and collapsed_labels will then contain True if the prediction or label is NOT 0, and False if it is. Then I can feed these into my precision and recall metrics:

recall, rec_op = tf.metrics.recall(labels=collapsed_labels, predictions=collapsed_predictions)
precision, prec_op = tf.metrics.precision(labels=collapsed_labels, predictions=collapsed_predictions)

I also created a PR curve metric to see how the thresholds would affect the predictions. First I convert the logits to probabilities via a softmax and then feed those into pr_curve_streaming_op as the predictions. In order to make this work with multi-class classification I squash the probabilities down to the probability that the item is NOT normal. Since my labels are created such that normal is always 0, the probability that a scan is not normal is just 1 minus the probability that it is:

# pr_curve_streaming_op comes from the TensorBoard summary module
from tensorboard import summary as summary_lib

probabilities = tf.nn.softmax(logits, name="probabilities")
_, update_op = summary_lib.pr_curve_streaming_op(name='pr_curve',
                                                 predictions=(1 - probabilities[:, 0]),
                                                 labels=collapsed_labels,
                                                 updates_collections=tf.GraphKeys.UPDATE_OPS,
                                                 num_thresholds=20)
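
For reference, here is a minimal sketch of how these streaming metrics might be evaluated in a session. The session object, the number of evaluation batches and the log directory are placeholders, not part of my actual code, and it assumes the PR curve is the only summary picked up by tf.summary.merge_all():

# Streaming metrics use local variables, so those need to be initialized first
sess.run(tf.local_variables_initializer())

# Run the update ops over however many evaluation batches you want to accumulate
for _ in range(num_eval_batches):
    sess.run([rec_op, prec_op, update_op])

# Fetch the accumulated precision and recall
rec_val, prec_val = sess.run([recall, precision])
print("recall: %f, precision: %f" % (rec_val, prec_val))

# The PR curve is emitted as a summary, so it can be written out for TensorBoard
merged = tf.summary.merge_all()
writer = tf.summary.FileWriter("/tmp/logs")
writer.add_summary(sess.run(merged), global_step=0)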

 

Labels: python, machine_learning, tensorflow

I decided to try a Google Cloud GPU instance as well as EC2. Once I had my quotas set properly and was able to start the instance, it took me all day to get TensorFlow running with GPU support. The instructions Google provides are for CUDA 8.0, but the latest version of TensorFlow requires CUDA 9.0.

To get everything running, follow these steps:

  1. curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_9.0.176-1_amd64.deb
  2. sudo dpkg -i cuda-repo-ubuntu1604_9.0.176-1_amd64.deb
  3. sudo apt-get update
  4. sudo apt-get install cuda-9-0
  5. sudo nvidia-smi -pm 1

These are the steps from Google's instructions, with the repo for CUDA 9.0 substituted.

Then I had to install cuDNN, which isn't mentioned at all in Google's instructions. I downloaded libcudnn7_7.0.4.31-1+cuda9.0_amd64.deb from the NVIDIA cuDNN site and uploaded it to the instance with scp. Then I installed it with:

sudo dpkg -i libcudnn7_7.0.4.31-1+cuda9.0_amd64.deb

Then you need to set the environment variables:

echo 'export CUDA_HOME=/usr/local/cuda' >> ~/.bashrc
echo 'export PATH=$PATH:$CUDA_HOME/bin' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=$CUDA_HOME/lib64' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc

And finally install TensorFlow:

sudo apt-get install python-dev python-pip libcupti-dev
sudo pip install tensorflow-gpu

I used pip3 and python3, but the rest is the same. 
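
Once TensorFlow is installed, a quick sanity check (not part of Google's instructions) is to list the devices TensorFlow can see from a Python shell and confirm the GPU shows up:

# List the devices TensorFlow can see; a /device:GPU:0 entry should appear
# alongside the CPU if CUDA and cuDNN are set up correctly
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())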

Update: I thought it was working fine, but I was still getting errors about locating libcupti.so.9.0. That was fixed by making symlinks as described here.

I ran these commands and now it seems to be working...

  1. # Put symlinks in /usr/local/cuda
  2. sudo mkdir /usr/local/cuda
  3. cd /usr/local/cuda
  4. sudo ln -s /usr/lib/x86_64-linux-gnu/ lib64
  5. sudo ln -s /usr/include/ include
  6. sudo ln -s /usr/bin/ bin
  7. sudo ln -s /usr/lib/x86_64-linux-gnu/ nvvm
  8. sudo mkdir -p extras/CUPTI
  9. cd extras/CUPTI
  10. sudo ln -s /usr/lib/x86_64-linux-gnu/ lib64
  11. sudo ln -s /usr/include/ include

Another update: TensorFlow requires version 7.0.4 of cuDNN; I had originally downloaded 7.1.2. The commands above have been updated accordingly.

Final update: I set up another instance and followed this process, and it almost worked. I needed to export another path, which I have added above. The original commands to export the paths were temporary and had to be repeated every time the instance was booted, so I changed them to echo the exports into .bashrc so they are set automatically.

Labels: coding, machine_learning, tensorflow, google_cloud

To resolve the problems I was having yesterday, I ended up paying for an Amazon EC2 instance with the Deep Learning Ubuntu AMI. The instance type is p2.xlarge, which costs $0.90/hour but seems to be well worth it so far. In the last ten minutes, the relatively small model I've been training on Google Cloud has gotten through 60 steps. In contrast, on the EC2 instance the much larger model, training on the same data, has gone through 375 steps, where each epoch is 687 steps.

I did have some trouble accessing TensorBoard on the EC2 instance, but was able to get it running by following the tutorial. I also got Jupyter Notebook running and accessible from the outside world, again by following the tutorial, although I had to comment out the lines about the SSL certificates in the Jupyter config file in order to be able to connect. I decided not to use Jupyter Notebook, but it's nice to have it as an option.

Since this is just a project I am working on for myself, I'd prefer not to have to pay for the compute, but $0.90 per hour is manageable, and well worth it for the 10x increase in training speed.

Labels: machine_learning, tensorflow, google_cloud, ec2

Google CoLab and Google Cloud

Friday, March 23, 2018

While it was amazing for running smaller models, apparently CoLab has its limitations. I'm working on a ConvNet that takes 299x299 images as input, and trying to train it on Google CoLab kept crashing the runtime with no error messages provided. The training data totalled about 2.3 GB, and I guess CoLab just couldn't handle it for whatever reason.

I tried training on my laptop, but I estimated it would take about 6 hours per epoch, which is ridiculous, so then I tried to use Google Cloud's free trial to set up an instance with GPUs. Unfortunately the free trial no longer supports the ability to add GPUs, so that didn't work. I did set up an instance without GPUs, which is training faster than my laptop right now, but not that much faster. My current estimate is about 2 hours per epoch.

My plan is to let this train overnight and see how it goes. If it is too slow I may try to use Google's TPUs, which are ostensibly optimized for TensorFlow. However, they are very expensive at $6/hr. Amazon EC2 instances with GPUs are about the same price, which doesn't leave me many options.

Labels: python, machine_learning, tensorflow, google, google_cloud

I am currently working with a dataset that is far too large to store in memory, so I am using tfrecords and queues to feed the data in. This works great, except that I was not able to evaluate the model on the validation dataset every epoch like I usually do.

After spending quite a bit of time trying to figure out ways around this, none of which worked, I found an easy solution that does work.

# Read single examples from the tfrecords file and batch them up
image, label = read_and_decode_single_example([train_path])
X_def, y_def = tf.train.shuffle_batch([image, label], batch_size=8, capacity=2000, min_after_dequeue=1000)

# Placeholders that default to the shuffled batches when nothing is fed
X = tf.placeholder_with_default(X_def, shape=[None, 299, 299, 1])
y = tf.placeholder_with_default(y_def, shape=[None])

I have a function that reads the data in from the tfrecords file (read_and_decode_single_example()). I then create the default features and labels using shuffle_batch. Finally I create X and y as placeholders with defaults, with the shuffled batches as the defaults.
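
For context, read_and_decode_single_example() is roughly along the lines of the sketch below - the feature names ('image' and 'label') and the raw-bytes encoding are assumptions about the tfrecords format, so adjust them to match however the file was actually written:

def read_and_decode_single_example(filenames):
    # Queue up the tfrecords files and read serialized examples one at a time
    filename_queue = tf.train.string_input_producer(filenames)
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)

    # Assumed feature spec: raw image bytes under 'image', int64 class under 'label'
    features = tf.parse_single_example(
        serialized_example,
        features={
            'image': tf.FixedLenFeature([], tf.string),
            'label': tf.FixedLenFeature([], tf.int64),
        })

    # Decode and reshape to the 299x299x1 input the model expects
    image = tf.decode_raw(features['image'], tf.uint8)
    image = tf.cast(tf.reshape(image, [299, 299, 1]), tf.float32)
    label = features['label']
    return image, label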

Then when I am training I don't pass a feed_dict, and the model defaults to using the data from the tfrecords file. When it is time to evaluate, I pass the validation data in via the feed_dict and it uses that.
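
As a rough sketch of that pattern (the training op, accuracy op, step counts and in-memory validation arrays here are placeholders for the real ones):

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Start the queue runners that feed the shuffle_batch pipeline
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)

    for step in range(num_steps):
        # No feed_dict: X and y fall back to the shuffled batches from the tfrecords queue
        sess.run(train_op)

        if step % steps_per_epoch == 0:
            # Feed the in-memory validation data to override the defaults
            val_acc = sess.run(accuracy, feed_dict={X: X_val, y: y_val})
            print("step %d, validation accuracy %f" % (step, val_acc))

    coord.request_stop()
    coord.join(threads)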

This is not a great solution - it is kind of ugly and it does require loading the validation data into memory - but it works and is simple. I had also tried using tf.cond() to switch between reading the data from a train.tfrecords file and a test.tfrecords file, but was unable to get that to work.

The research I did indicates that the preferred way to handle this is to use different sessions, or different graphs with weight sharing, but that just seems ridiculous to me. It shouldn't be that complicated.

Labels: python, data_science, machine_learning, tensorflow