Command line magic: Extracting links with awk, grep, and tr

A quick and efficient way to pull unique URLs using just a few command line tools.

I teach a few university classes, and recently a student emailed me to ask if I could send her the links from our “weekly readings” page from the Spring semester class she was in. I maintain a web page in our online learning system that provides all of the readings for each week of the semester. My student forgot to make a copy of these links during the semester, and now that the semester is over, she can no longer access the “weekly readings” page.

I was happy to provide the links. The only problem was that there are a lot of links on that page, and not all of them are weekly readings. Some of the links go to other parts of the online learning system. I suppose I could have just copied and pasted the full page into an email and left it to my student to figure out which links she needed, but I’m nice, so I wanted to share just the links to the weekly readings.

Not doing it by hand

One way to do this would be to go through the web page, right-click on each link, then copy and paste that link into an email. However, some links were repeated during the semester; we might revisit a topic in a later week, so the “weekly readings” page included the same link in both weeks. If I were to do this by hand, I’d also need to check whether a link was repeated and omit any duplicates.

That’s a lot of work. I’m nice, but I’m also lazy. I don’t want to spend a ton of time copying and pasting links for someone else. Instead, I used the Linux command line to do the work for me.

Split the file into words

First, I saved a copy of the web page to my Linux system. This is actually pretty easy for me, because the online learning system lets instructors create and edit pages. I opened the page in the editor, switched to “code” view, then selected all of the text so I could copy and paste it into a text editor. That let me save a local copy of the page contents in a file called readings.

The online learning system saves each paragraph as a single line in the file, and the file had over 400 lines in it:

$ wc readings
  406  3753 45520 readings

That’s a lot of data, and that’s okay; I already knew I was going to split the content into words to make it easier to process. There are a ton of words (wc counted over 3,750 of them), but a list of words is pretty easy to work with.

To split the paragraphs into words, I used the tr command, which translates one character into another. You can also use tr to translate a whole class of characters into another character, which is what I did. I converted any white space character into a newline character:

$ tr '[:space:]' '\n' < readings | wc -l
9084

You might be surprised that this produces many more lines than there are words. But looking at the output, tr also converts things like runs of multiple tabs into separate newline characters, so the output has a bunch of blank lines.
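
As a side note, I didn’t need to clean up those blank lines for this task, but tr can also squeeze repeats of the output character with its -s option, so runs of white space collapse into a single newline:

$ tr -s '[:space:]' '\n' < readings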

Identify the URLs

The <a> tag in HTML creates a link to another page or online resource. You also need to provide an attribute that indicates the destination, such as <a href="https://allthingsopen.org/articles"> to link to the All Things Open articles. You can also add other attributes here, including a class attribute, which can be used to apply styles to the content, or a rel attribute, which defines the link’s relationship to the destination. The online course system adds several of these attributes, so links actually look like this:

<a class="inline_disabled" href="..." target="_blank" rel="noopener">Understanding the Audience</a>

When tr splits this into words, the <a> tag becomes several lines, because each space becomes a newline:

<a
class="inline_disabled"
href="..."
target="_blank"
rel="noopener">Understanding
the
Audience</a>

To provide a list of URLs to my student, I was only interested in the lines that start with href=. The grep command prints only the lines that match a specific pattern, which can be a simple string or a regular expression. The only regular expression character I needed was ^ to indicate the start of a line, so I could match every line that starts with href=.
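
For example, an intermediate step like this would print only the lines that start with href= (I ended up folding the same pattern into awk instead):

$ tr '[:space:]' '\n' < readings | grep '^href='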

I also needed to process the matching lines, and I like to do things in one command if I can. In this case, I needed awk to print only the part of the href= line that was the URL. All awk statements are pattern-action statements. The typical way to specify a pattern-action is to provide a regular expression between slashes, and one or more actions between curly braces.

The URL is always between double quotes, and I used the quotes as a field separator. Usually, awk assumes white space as the field separator, but you can change this with the -F option, such as -F\" to set the field separator to a double quote.

Assuming the line href="https://allthingsopen.org/articles" with the quote as the field separator, the first field is href=, the second field is https://allthingsopen.org/articles, and the third field is empty. To print the second field in awk, use the print command with $2 to indicate the second field.
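
To see how that works in isolation, here is a quick test using that sample line; the output is just the URL:

$ echo 'href="https://allthingsopen.org/articles"' | awk -F\" '{print $2}'
https://allthingsopen.org/articles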

Adding awk to my command line gave me a list of URLs, in the order they appeared in the original HTML file. Counting them with wc showed 68 links:

$ tr '[:space:]' '\n' < readings | awk -F\" '/^href=/ {print $2}' | wc -l
68
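
The wc -l at the end only counts the matching lines; drop it to see the URLs themselves:

$ tr '[:space:]' '\n' < readings | awk -F\" '/^href=/ {print $2}'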

Only print unique URLs

That’s already a pretty short list of links, but some of these URLs might be repeated in separate weeks. I know that was the case for at least a few links on my “weekly readings” page.

For this last step, I sent the output through the sort command, which sorted the list of URLs. From there, I could use one final command: the uniq command removes duplicate entries in a sorted list, leaving only the unique lines.

$ tr '[:space:]' '\n' < readings | awk -F\" '/^href=/ {print $2}' | sort | uniq | wc -l
65

The final list of URLs was just 65 links. As I said, there are a lot of links in the “weekly readings” page.
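
As a side note, sort can also discard duplicates by itself: sort -u produces the same output as sort | uniq, so this slightly shorter pipeline gives the same 65 links:

$ tr '[:space:]' '\n' < readings | awk -F\" '/^href=/ {print $2}' | sort -u | wc -l
65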

Speed up your work at the command line

This saved me a ton of time. Imagine if I had tried to do this by hand, looking at each link, then copying and pasting it into an email. It might take a moment to do that for the first few links, but the further I got into the list, the longer it would take to decide whether I had already copied a link, and to paste it into the email in sorted order.

By the end, let’s say it takes ten seconds to find a link, copy it, and paste it in sorted order into an email. For 65 links, that’s easy math: 650 seconds, or more than ten minutes to make a list of URLs.

Building the pipeline by typing these commands and experimenting with the results took only a few minutes, after which I was able to share the 65 links with my student.

About the Author

Jim Hall is an open source software advocate and developer, best known for usability testing in GNOME and as the founder + project coordinator of FreeDOS. At work, Jim is CEO of Hallmentum, an IT executive consulting company that provides hands-on IT Leadership training, workshops, and coaching.
