What I Don’t Know – Pickled ML

Starting this holiday season, I want to take some time every day to focus on a broader set of topics than just machine learning. While ML is extremely valuable, working on it day after day has given me tunnel vision. I think it’s important to remind myself that there’s more out there in the world of technology.

This post is going to be an aspirational one. I started by listing a ton of things I’m embarrassed to know nothing about. Then, in the spirit of self-improvement, I came up with a list of project ideas for each topic. I find that I learn best by doing, so I hope these projects will help me master areas where I have little or no prior experience. And who knows, maybe others will benefit from this list as well!

Networking

My knowledge of networking is severely lacking. First of all, I have no idea how the OS network stack works on either Linux or macOS. Also, I’ve never configured a complex network (e.g. for a datacenter), so I have a limited understanding of how routing works.

I am so ignorant when it comes to networking that I often struggle to formulate questions about it. Hopefully, my questions and project ideas actually make sense and turn out to be feasible.

Questions:

What APIs does your OS provide to intercept or manipulate network traffic?
What actually happens when you connect to a WiFi network?
What is a network interface? How is traffic routed through network interfaces?
How do VPNs work internally (both on the server and on the client)?
How flexible are Linux networking primitives?
How does iptables work on Linux? What does it actually do?
How do NATs deal with different kinds of traffic (e.g. ICMP)?
How does DNS work? How do DNS records propagate? How do custom nameservers (e.g. with NS records) work? How does something like iodine work?
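To make the NAT question a bit more concrete, here is a minimal sketch of the bookkeeping a port-translating NAT might do. The class name ToyNat, the starting ID of 40000, and the addresses are all hypothetical. It assumes the common approach of tracking ICMP echo traffic by its query identifier in place of a port number, and it only models the lookup table, not real packet rewriting.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# (protocol, internal IP, internal port or ICMP echo identifier)
FlowKey = Tuple[str, str, int]


@dataclass
class ToyNat:
    """Hypothetical translation table; it does not forward any packets."""

    public_ip: str
    next_id: int = 40000  # arbitrary starting point for public ports/IDs
    outbound: Dict[FlowKey, int] = field(default_factory=dict)
    inbound: Dict[Tuple[str, int], Tuple[str, int]] = field(default_factory=dict)

    def translate_out(self, proto: str, src_ip: str, src_id: int) -> Tuple[str, int]:
        """Map an internal flow to (public IP, public ID), allocating if new."""
        key = (proto, src_ip, src_id)
        if key not in self.outbound:
            public_id = self.next_id
            self.next_id += 1
            self.outbound[key] = public_id
            self.inbound[(proto, public_id)] = (src_ip, src_id)
        return self.public_ip, self.outbound[key]

    def translate_in(self, proto: str, public_id: int) -> Optional[Tuple[str, int]]:
        """Find the internal host for a reply, or None if no flow matches."""
        return self.inbound.get((proto, public_id))


nat = ToyNat("203.0.113.5")
udp_flow = nat.translate_out("udp", "10.0.0.2", 5353)
# For ICMP echo, the "port" slot holds the echo identifier instead.
icmp_flow = nat.translate_out("icmp", "10.0.0.3", 7)
```

A reply that matches no entry comes back as None, which is roughly why unsolicited inbound traffic does not reach hosts behind a NAT.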

Projects:

Re-implement something like Little Snitch.
Try using the Berkeley Packet Filter (or some other API) to make a live bandwidth monitor.
Implement a packet-level WiFi client that gets all the way up to being able to make DNS queries. I started this with gofi and wifistack, but never finished.
Implement a program that exposes a fake LAN with a fake web server on some fake IP address. This will involve writing your own network stack.
Implement a user-space NAT that exposes a fake “gateway” through a tunnel interface.
Re-implement iodine in a way that parallelizes packet transmission to be faster on satellite internet connections (like on an airplane).
Re-implement something like ifconfig using system calls.
Connect your computer to both Ethernet and WiFi, and try to write a program that parallelizes an HTTP download over both interfaces simultaneously.
Write a script to DoS a VPN server by allocating a ton of IP addresses.
Write a simple VPN-like protocol and make a server/client for it.
Try to set something up where you can create a new Docker container and assign it its own IP address from a VPN.
Try to implement a simple firewall program that hooks into the same level of the network stack as iptables.
Implement a fake DNS server and set up your router to use it. The DNS server could forward most requests to a real DNS server, but provide fake addresses for specific domains of your choosing. This would be fun for silly pranks, or for logging domains people visit.
Try to bypass WiFi paywalls at hotels, airports, etc.
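As a starting point for the fake DNS server idea, here is a rough sketch of the wire format of a DNS query and the lookup a fake server might do on it. The OVERRIDES table and every function name are made up for illustration. It assumes a plain single-question query with no name compression, and it leaves out the UDP socket and the forwarding to a real resolver.

```python
import struct
from typing import Optional

# Hypothetical table of domains to answer with fake addresses.
OVERRIDES = {"example.com": "10.0.0.99"}


def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        if not 0 < len(raw) <= 63:
            raise ValueError("each label must be 1 to 63 bytes long")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_query(name: str, query_id: int = 0x1234, qtype: int = 1) -> bytes:
    """Build a one-question query (qtype 1 is an A record, class 1 is IN)."""
    # Header fields: ID, flags (0x0100 sets "recursion desired"),
    # then the question, answer, authority, and additional counts.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    return header + encode_name(name) + struct.pack("!HH", qtype, 1)


def parse_question_name(packet: bytes) -> str:
    """Read the first question name, which starts after the 12-byte header."""
    labels = []
    pos = 12
    while packet[pos] != 0:
        length = packet[pos]
        labels.append(packet[pos + 1:pos + 1 + length].decode("ascii"))
        pos += 1 + length
    return ".".join(labels)


def lookup_override(packet: bytes) -> Optional[str]:
    """Return the fake address for a query, or None to forward it upstream."""
    return OVERRIDES.get(parse_question_name(packet).lower())


query = build_query("Example.com")
fake_address = lookup_override(query)
```

A real server would still need to build a response packet with the same ID and send it back over UDP port 53.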

Cryptocurrency

Cryptocurrencies are extremely popular right now. So, as a tech nerd, I feel kind of lame knowing nothing about them. Maybe I should fix that!

Questions:

How do cryptocurrencies actually work?
What kinds of network protocols do cryptocurrencies use?
What does it mean that Ethereum is a distributed virtual machine?
What computations are actually involved in mining cryptocurrencies?

Projects:

Implement a toy cryptocurrency.
Write a script that, without using high-level APIs, transfers some cryptocurrency (e.g. Bitcoin) from one wallet to another.
Write a small program (e.g. “Hello World”) that runs on the Ethereum VM. I honestly don’t even know if this is possible.
Try writing a Bitcoin mining program from scratch.
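For a feel of what mining computes, here is a toy proof-of-work loop. It borrows two details from Bitcoin: the double SHA-256 hash and a 32-bit little-endian nonce. Everything else is simplified and hypothetical. The header here is an arbitrary byte string rather than a real 80-byte block header, and the target is given as a count of leading zero bits rather than Bitcoin's compact target encoding.

```python
import hashlib
import struct
from typing import Optional, Tuple


def double_sha256(data: bytes) -> bytes:
    """Hash the data with SHA-256 twice."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def mine(header_prefix: bytes, difficulty_bits: int,
         max_nonce: int = 2 ** 32) -> Optional[Tuple[int, bytes]]:
    """Search for a nonce whose hash, read as an integer, is below the target.

    This toy reads the digest as a big-endian integer for simplicity.
    """
    target = 1 << (256 - difficulty_bits)
    for nonce in range(max_nonce):
        digest = double_sha256(header_prefix + struct.pack("<I", nonce))
        if int.from_bytes(digest, "big") < target:
            return nonce, digest
    return None  # nonce space exhausted without a solution


# 12 bits of difficulty needs about 4096 attempts on average.
result = mine(b"toy block header", 12)
```

Each extra bit of difficulty doubles the expected number of attempts, which is the whole point of the scheme: finding a nonce is expensive, while checking one takes a single hash.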

Source Control

Before OpenAI, I worked mostly on solitary projects. As a result, I only used a small subset of the features offered by source control tools like Git. I never had to deal with complex merge conflicts, rebases, etc.

Questions:

What are some complicated use-cases for git rebase?
How do code reviews typically work on large open source projects?
What protocol does git use for remotes? Is a GitHub repository just a .git directory on a server, or is there more to it than that?
What are some common/useful git commands besides git push, git pull, git add, git commit, git merge, git remote, git checkout, git branch, and git rebase? Also, what are some unusual/useful flags for the aforementioned commands?
How do you actually set up an editor to work with git add -p?
How does git store data internally?

Projects:

Try to write a script that uses sockets/HTTPS/SSH to push code to a GitHub repo. Don’t use any local git commands or APIs.
On your own repos, intentionally get yourself into source control messes that you have to figure your way out of.
Submit more pull requests to big open source projects.
Read a ton of git man pages.
Write a program from scratch that converts tarballs to .git directories. The .git directory would represent a repository with a single commit that adds all the files from the tarball.
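As a first step toward the tarball converter, here is a small sketch of how git names and stores a single loose object. It assumes the default SHA-1 object format and ignores packfiles. The helper names are made up. A full converter would also need tree and commit objects plus refs.

```python
import hashlib
import zlib
from typing import Tuple


def git_object(kind: str, body: bytes) -> Tuple[str, bytes]:
    """Return (object ID, compressed bytes) for a loose git object.

    The stored form is a header like "blob 12", a zero byte, and the body.
    The object ID is the SHA-1 of that whole byte string.
    """
    store = f"{kind} {len(body)}".encode("ascii") + b"\x00" + body
    return hashlib.sha1(store).hexdigest(), zlib.compress(store)


def object_path(object_id: str) -> str:
    """Loose objects live under a directory named for the first two hex digits."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


blob_id, compressed = git_object("blob", b"hello world\n")
path = object_path(blob_id)
```

The ID computed here should match what `git hash-object` reports for a file with the same contents, which makes it easy to check the sketch against a real repository.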

Machine Learning

Even though I work in ML, I sometimes forget to keep tabs on the field as a whole. Here’s some stuff that I feel I should brush up on:

Questions:

How do SOTA object detection systems work?
How do OCR systems deal with variable-length strings in images?
How do neural style transfer and CycleGAN actually work?

Projects:

Make a program that puts boxes around people’s noses in movies.
Make a captcha cracker.
Make a screenshot-to-text system (easy to generate training data!).
Try to make something that takes MNIST digits and makes them look like SVHN digits.
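On the variable-length string question, one widely used approach is CTC (connectionist temporal classification): the network emits a score for every character at every image column, plus a special "blank" symbol, and the decoder collapses the result. Here is a minimal greedy decoder under that assumption. The function name and the convention that index 0 is the blank are choices made for this sketch, and the scores would normally come from a trained model.

```python
from typing import List, Sequence


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]], alphabet: str) -> str:
    """Decode per-frame scores into a string.

    Index 0 is assumed to be the blank symbol, and index i maps to
    alphabet[i - 1]. Repeated symbols are merged, then blanks are dropped.
    """
    out: List[str] = []
    prev = None
    for scores in frame_scores:
        best = max(range(len(scores)), key=lambda i: scores[i])
        if best != prev and best != 0:
            out.append(alphabet[best - 1])
        prev = best
    return "".join(out)


# Six frames over the alphabet "ab": a, a, blank, a, b, b.
frames = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.1, 0.7],
]
decoded = ctc_greedy_decode(frames, "ab")
```

The blank is what lets the output contain a doubled letter: the two runs of "a" above stay separate only because a blank frame sits between them.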

Phones

For something so ubiquitous, phones are still a mystery to me. I’m not sure how easy it is to learn about phones as a user hacking away in his apartment, but I can always try!

Questions:

What does the SMS protocol actually look like? How is the data transmitted? Why do different carriers seem to have different character limits?
How does modern telephony work? How are calls routed?
Why is it so easy to spoof a phone number, and how does one do it?
How are Android apps packaged, and how easy is it to reverse engineer them?

Projects:

Try reverse engineering SMS apps on your phone. Figure out what APIs deal with SMS messages, and how messages get from the UI all the way to the cellular antenna. Try to encode and send an SMS message from as low a level as possible.
Get a job at Verizon/AT&T/Sprint/T-Mobile. This is probably not worth it, but telephony is one of those topics that seem pretty hard to learn about from the outside.
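One piece of SMS that can be explored without any carrier access is the text encoding. The sketch below packs characters into the 7-bit form used by the GSM default alphabet, which is where the familiar 160-character figure comes from: a message carries up to 140 bytes of user data, and 140 * 8 / 7 = 160. The function names are made up, and to avoid reproducing the full alphabet table, the sketch only accepts letters, digits, and spaces, whose GSM codes match ASCII.

```python
import string
from typing import List

# Characters whose GSM default alphabet code equals their ASCII code.
# The real alphabet has many more entries (and some differ from ASCII).
_SAME_AS_ASCII = set(string.ascii_letters + string.digits + " ")

MAX_USER_DATA_BYTES = 140


def to_septets(text: str) -> List[int]:
    """Convert text to 7-bit codes, rejecting anything outside the safe subset."""
    for ch in text:
        if ch not in _SAME_AS_ASCII:
            raise ValueError(f"character {ch!r} is not handled by this sketch")
    return [ord(ch) for ch in text]


def pack_septets(septets: List[int]) -> bytes:
    """Pack 7-bit values into bytes, least significant bits first."""
    acc = 0
    nbits = 0
    out = bytearray()
    for value in septets:
        acc |= (value & 0x7F) << nbits
        nbits += 7
        while nbits >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        out.append(acc & 0xFF)  # leftover bits, padded with zeros
    return bytes(out)


packed = pack_septets(to_septets("hello"))
max_7bit_chars = MAX_USER_DATA_BYTES * 8 // 7
max_ucs2_chars = MAX_USER_DATA_BYTES // 2
```

Messages with characters outside the 7-bit alphabet are typically sent as 16-bit UCS-2 instead, which drops the limit to 70 characters. That may account for some of the differences in limits that show up in practice.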

Misc. Tools

I don’t take full advantage of many of the applications I use. I could probably get a decent productivity boost by simply learning more about these tools.

Questions:

How do you use Vim macros?
What useful keyboard shortcuts does Atom provide?
How do custom go get URLs work?
How are man pages formatted, and how can I write a good man page?

Projects:

Use a Vim macro to un-indent a chunk of lines by one space.
For a day, don’t let yourself use the mouse at all in your text editor. Look up keyboard shortcuts all you want.
For a day, use your editor for all development (including running your code). This is feasible with, e.g., Hydrogen.
Make a go get service that allows for semantic versioning wildcards (e.g. go get myservice.org/github.com/unixpickle/somelib/0.1.x).
Write man pages for some existing open source projects. I bet everybody will thank you.
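For the go get service idea, two pieces can be sketched without writing a server. When go get fetches a custom import path, it requests the URL with ?go-get=1 and looks for a go-import meta tag naming the import prefix, the version control system, and the repository URL. The wildcard half is the part specific to this project. Both function names below are hypothetical, and the sketch assumes tags look like v0.1.3 or 0.1.3.

```python
import re
from typing import Iterable, Optional


def go_import_meta(import_prefix: str, repo_url: str) -> str:
    """Build the meta tag that go get reads from a ?go-get=1 response."""
    return f'<meta name="go-import" content="{import_prefix} git {repo_url}">'


def resolve_wildcard(pattern: str, tags: Iterable[str]) -> Optional[str]:
    """Pick the highest tag matching a pattern such as "0.1.x"."""
    parts = pattern.split(".")
    if len(parts) != 3 or not all(p == "x" or p.isdigit() for p in parts):
        raise ValueError("pattern must look like 0.1.x")
    best = None
    for tag in tags:
        match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", tag)
        if not match:
            continue  # skip tags that are not plain versions
        version = tuple(int(group) for group in match.groups())
        if all(p == "x" or int(p) == v for p, v in zip(parts, version)):
            # Compare as integer tuples so that 0.1.12 beats 0.1.3.
            if best is None or version > best[0]:
                best = (version, tag)
    return best[1] if best else None


chosen = resolve_wildcard("0.1.x", ["v0.1.0", "v0.1.12", "v0.1.3", "v0.2.0", "junk"])
meta = go_import_meta(
    "myservice.org/github.com/unixpickle/somelib/0.1.x",
    "https://github.com/unixpickle/somelib",
)
```

The harder part, which this leaves out, is making the service hand back the code at the chosen tag rather than the default branch.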

2 thoughts on “What I Don’t Know”

Jerry says:

July 25, 2019 at 1:38 am

This is amazing! I wonder in your experience, how long does it take/do you anticipate it will take to finish one project?

I’m asking because I sometimes get discouraged by underestimating the time for projects I want to do. Do you have tips for time estimation?

unixpickle says:

July 25, 2019 at 12:45 pm

Time estimation is really hard. I usually try to focus on projects I enjoy, that way I don’t mind if the project takes longer than expected. Usually, if a project runs longer than you expected, that means you are learning a lot (even more than you thought you would), so that’s another positive to the experience.
