Write a Linux command line to traverse a directory of files and summarize the number of instances of each file extension found (e.g., .docx, .jpg, .pdf, etc.). Fold all file extensions to lower case-- ".JPG" and ".jpg" should be reported together. Files that have no extension should be reported as "other". List the extensions found in descending numeric order by the number of occurrences of each extension.

#Linux #DFIR #CommandLine #Trivia

in reply to Hal Pomeranz

Are you expecting a one-liner?

Because this almost requires hash tables, or associative arrays, which aren't that simple to do in bash or any shell scripting language.

I'd much rather use Python or Node.js for this task.

I'm sure an awk-ninja could do it, but that isn't strictly shell (see the sketch below).

paste.centos.org/view/025605f5
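
As for that awk route, here's what a sketch might look like -- hedged and untested: it assumes GNU or BSD awk, and lumps dotfiles and extensionless names under "other":

find . -type f | awk '{
    f = tolower($0)
    sub(/.*\//, "", f)            # keep only the file name
    if (f ~ /[^.]\.[^.]+$/)       # the name has a real extension
        sub(/.*\./, "", f)        # strip through the last dot
    else
        f = "other"
    count[f]++                    # awk arrays are exactly those hash tables
}
END { for (e in count) print count[e], e }' | sort -rn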

in reply to Hal Pomeranz

Well, the famous geirha on IRC did it for me.

shopt -s nullglob dotglob globstar ; declare -A ext=() ; for file in **/*.* ; do (( ext[".${file##*.}"]++ )) ; done ; for e in "${!ext[@]}" ; do printf '%3d %s\n' "${ext[$e]}" "$e" ; done | sort -rn | head

I could never do that myself.
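
For readability, the same one-liner split across lines:

shopt -s nullglob dotglob globstar
declare -A ext=()
for file in **/*.*; do
    (( ext[".${file##*.}"]++ ))              # tally each extension in an associative array
done
for e in "${!ext[@]}"; do
    printf '%3d %s\n' "${ext[$e]}" "$e"
done | sort -rn | head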

in reply to Hal Pomeranz

@apgarcia Well, I wouldn't try running that in a dir with a git repo. And it doesn't account for files that have no file extension at all.
in reply to Stefan Midjich ꙮ҄

@stemid @apgarcia It also doesn't limit itself to regular files-- directory names, etc. would also be counted. But it's close. Final answer tomorrow!
in reply to Hal Pomeranz

solution

find . | while read -r f; do echo "${f##*.}"; done | sort | uniq -c | sort -rn

in reply to Hal Pomeranz

find . -type f -exec basename {} \; | grep '\.' | tr A-Z a-z | awk -F "." '{print $NF}' | sort | uniq -c | sort -n ; printf ' ' ; find . -type f -exec basename {} \; | grep -cv '\.' | sed 's/$/ other/'
Let's not talk about the difference between jpg and jpeg, and think about how terrible that looks.
in reply to Hal Pomeranz

one line

find "$count_dir" -type f -printf "%f\n" | while read -r file; do ext=${file##*.}; [[ "$file" == "$ext" ]] && echo other || echo "$ext"; done | sort -f | uniq -ci | sort -nr

in reply to Hal Pomeranz

find . -printf '%f\n' | sed -E -e 's/^[^.]+$/other/' -e 's/.*\.//' | sort -f | uniq -ci | sort -nr
in reply to Hal Pomeranz

find . -type f | perl -nle 'print m{[^/](\.[^/.]+)\z} ? lc $1 : "other"' | sort | uniq -c | sort -nr

The lc (lowercase) in the Perl code could be replaced by sort -f | uniq -ic, but that's both longer and harder to understand, I think.
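
For comparison, that longer variant would be something like:

find . -type f | perl -nle 'print m{[^/](\.[^/.]+)\z} ? $1 : "other"' | sort -f | uniq -ic | sort -nr

(uniq -ic prints whichever spelling sorts first in each group, so mixed-case extensions can surface as ".JPG" in the output.)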

Remove -type f to include directories in the report (most of them will show up as "other", I expect).

The regex in the Perl code could be simplified slightly at the cost of -printf '%f\n' in find, but that's longer.
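
For instance (assuming GNU find for -printf), something along these lines:

find . -type f -printf '%f\n' | perl -nle 'print m{.(\.[^.]+)\z} ? lc $1 : "other"' | sort | uniq -c | sort -nr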

A version without perl:

find . -type f | sed 's!^.*[^/]\(\.[^/.]\+\)$!\1!; t; c other' | sort -f | uniq -ic | sort -nr

But I think the perl version is actually more readable.
in reply to Hal Pomeranz

I took yesterday's Linux DFIR command line trivia from one of our old Command Line Kung Fu blog postings (Episode 99). It's an interesting challenge and useful for quickly summarizing types of files in a directory by simply collecting and counting the file extensions.

I also liked it because we get to use the "${var##pattern}" expansion, similar to my solution from a couple of days ago using "${var%%pattern}". So there's some nice symmetry.
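
A quick side-by-side of the two expansions:

f="archive.tar.gz"
echo "${f##*.}"    # gz          -- "##" deletes the longest prefix matching "*.", leaving the final extension
echo "${f%%.*}"    # archive     -- "%%" deletes the longest suffix matching ".*", leaving the bare name
echo "${f%.*}"     # archive.tar -- a single "%" deletes only the shortest matching suffix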

If you read the original blog posting, my first solution used a funky "sed" substitution to whittle down to the file extensions. It was actually loyal reader and friend of the blog Jeff Haemer who suggested this much cleaner version (and thanks to @barubary, who reminded me to quote my variables):

find Documents -type f |
while read f; do
echo "${f##*.}";
done |
sed 's/.*\/.*/other/' | tr A-Z a-z |
sort | uniq -c | sort -n

"find" gets us a list of files in the directory and then we feed that into a loop that uses "${f##*.}" to remove everything up until the final "." in the file path.

Some file names are not going to have a "." and so the original file path will be unchanged. The "sed" expression after the loop marks these files as being in category "other". Finally we shift everything to lowercase so "GIF" and "gif" are recognized as the same type. Then our usual command line histogram idiom rounds things out.
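
To see why the "sed" catches them, trace a dotless path through the expansion:

f="Documents/notes/README"                  # no "." anywhere in the path
echo "${f##*.}"                             # no match, prints: Documents/notes/README
echo "${f##*.}" | sed 's/.*\/.*/other/'     # the surviving "/" triggers the rewrite: other

Even when a directory name contains a dot (say "Documents/v1.2/README"), the leftover "2/README" still contains a "/" and gets folded into "other".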

We could add some other fix-ups to make things nicer:

find Documents -type f |
while read f; do
echo "${f##*.}";
done |
sed 's/.*\/.*/other/' | tr A-Z a-z |
sed -r 's/^jpg$/jpeg/; s/(~|,v)$//' |
sort | uniq -c | sort -n

I've dropped in another "sed" expression before we start counting things. We make "jpg" and "jpeg" count the same and remove some trailing file extensions for backup files to reduce clutter in the output. This just shows that you can arbitrarily tweak the output to suit your needs.
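
For example, folding "htm" into "html" as well (a hypothetical tweak) would just be one more substitution in that same sed:

sed -r 's/^jpg$/jpeg/; s/^htm$/html/; s/(~|,v)$//'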

Props to @apgarcia for getting very close to the final solution on this one!

#Linux #DFIR #CommandLine #Trivia

in reply to Hal Pomeranz

OK, this one is broken again. Most obviously, it lists the result in ascending numeric order, not descending. 😉

Names without extension are indistinguishable from files named foo.other in the output.

However, it also mishandles

  • names that end with spaces (spaces are removed implicitly at two places in the pipeline)
  • names that end with backslash (those get silently skipped)
  • extensions that contain shell wildcards (e.g. a file like foo.* will produce funny output)

At minimum you should change the read loop to:

while read -r; do
    echo "${REPLY##*.}"
done

That fixes the issues with spaces, backslashes, and wildcard characters.

However, since all you're doing is removing characters up to the last ., you could replace the whole loop by just sed 's/.*\.//' (or sed 's/.*\././' if you want to distinguish between .other and other).
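
Putting that together (and with "sort -rn" for the requested descending order), the loop-free version would be something like:

find Documents -type f |
sed 's/.*\.//' |
sed 's/.*\/.*/other/' | tr A-Z a-z |
sort | uniq -c | sort -rn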

in reply to Füsilier Breitlinger

@barubary Sigh. It's like you've become my shell programming conscience. "Forgive me, Father, for I have committed the following bash sins..."

My original solution in the blog used sed, but I liked the loop version for readability.

But I am going to sneak back and edit my answer to quote the variable in the loop. Thank you.

in reply to Hal Pomeranz

I came across a really cool alternative to the 'sort | uniq -c | sort -rn' histogram pattern:

github.com/red-data-tools/YouP…

see screenshot...
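
If I'm reading the YouPlot README right, the tail of the pipeline could then become something like this (assuming its "uplot count" subcommand, which tallies occurrences itself):

find . -type f | sed 's/.*\.//' | sed 's/.*\/.*/other/' | tr A-Z a-z | uplot count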

in reply to Hal Pomeranz

@apgarcia @barubary

When I run your solution on a directory with around 19,000 files, it takes around 5 seconds, while mine takes around 1.5 seconds.

I use the case-insensitive options for sort and uniq instead of converting all extensions to lowercase, since it was only required that "abc" and "ABC" be counted as the same extension, not that both be converted to "abc".

My solution also misses the point about files that actually have "other" as their extension.

Our results were the same.