Write a Linux command line to traverse a directory of files and summarize the number of instances of each file extension found (e.g., .docx, .jpg, .pdf, etc.). Fold all file extensions to lower case-- ".JPG" and ".jpg" should be reported together. Files that have no extension should be reported as "other". List the extensions found in descending numeric order by the number of occurrences of each extension.

#Linux #DFIR #CommandLine #Trivia

in reply to Hal Pomeranz

Are you expecting a one-liner?

Because this almost requires hash tables, or associative arrays, which aren't that simple to do in bash or any shell scripting language.

I'd much rather use Python or Node.js for this task.

I'm sure an awk-ninja could do it, but that isn't strictly shell (see the sketch below).

paste.centos.org/view/025605f5
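
As for that awk route, here's what a sketch might look like -- hedged and untested: it assumes GNU or BSD awk, and lumps dotfiles and extensionless names under "other":

find . -type f | awk '{
    f = tolower($0)
    sub(/.*\//, "", f)            # keep only the file name
    if (f ~ /[^.]\.[^.]+$/)       # the name has a real extension
        sub(/.*\./, "", f)        # strip through the last dot
    else
        f = "other"
    count[f]++                    # awk arrays are exactly those hash tables
}
END { for (e in count) print count[e], e }' | sort -rn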

in reply to Hal Pomeranz

Well, the famous geirha on IRC did it for me.

shopt -s nullglob dotglob globstar ; declare -A ext=() ; for file in **/*.* ; do (( ext[".${file##*.}"]++ )) ; done ; for e in "${!ext[@]}" ; do printf '%3d %s\n' "${ext[$e]}" "$e" ; done | sort -rn | head

I could never do that myself.
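
For readability, the same one-liner split across lines:

shopt -s nullglob dotglob globstar
declare -A ext=()
for file in **/*.*; do
    (( ext[".${file##*.}"]++ ))              # tally each extension in an associative array
done
for e in "${!ext[@]}"; do
    printf '%3d %s\n' "${ext[$e]}" "$e"
done | sort -rn | head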

in reply to Hal Pomeranz

@apgarcia Well, I wouldn't try running that in a dir with a git repo. And it doesn't account for files that have no file extension at all.
in reply to Stefan Midjich ꙮ҄

@stemid @apgarcia It also doesn't limit itself to regular files-- directory names, etc. would also be counted. But it's close. Final answer tomorrow!
in reply to Hal Pomeranz

solution

find . | while read -r f; do echo "${f##*.}"; done | sort | uniq -c | sort -rn

in reply to Hal Pomeranz

find . -type f -exec basename {} \; | grep '\.' | tr A-Z a-z | awk -F "." '{print $NF}' | sort | uniq -c | sort -n ; printf ' ' ; find . -type f -exec basename {} \; | grep -cv '\.' | sed 's/$/ other/'
Let's not talk about the difference between jpg and jpeg, and think about how terrible that looks.
in reply to Hal Pomeranz

one line

find "$count_dir" -type f -printf "%f\n" | while read -r file; do ext=${file##*.}; [[ "$file" == "$ext" ]] && echo other || echo "$ext"; done | sort -f | uniq -ci | sort -nr

in reply to Hal Pomeranz

find . -printf '%f\n' | sed -E -e 's/^[^.]+$/other/' -e 's/.*\.//' | sort -f | uniq -ci | sort -nr
in reply to Hal Pomeranz

find . -type f | perl -nle 'print m{[^/](\.[^/.]+)\z} ? lc $1 : "other"' | sort | uniq -c | sort -nr

The lc (lowercase) in the Perl code could be replaced by sort -f | uniq -ic, but that's both longer and harder to understand, I think.
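
For comparison, that longer variant would be something like:

find . -type f | perl -nle 'print m{[^/](\.[^/.]+)\z} ? $1 : "other"' | sort -f | uniq -ic | sort -nr

(uniq -ic prints whichever spelling sorts first in each group, so mixed-case extensions can surface as ".JPG" in the output.)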

Remove -type f to include directories in the report (most of them will show up as "other", I expect).

The regex in the Perl code could be simplified slightly at the cost of -printf '%f\n' in find, but that's longer.
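
For instance (assuming GNU find for -printf), something along these lines:

find . -type f -printf '%f\n' | perl -nle 'print m{.(\.[^.]+)\z} ? lc $1 : "other"' | sort | uniq -c | sort -nr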

A version without perl:

find . -type f | sed 's!^.*[^/]\(\.[^/.]\+\)$!\1!; t; c other' | sort -f | uniq -ic | sort -nr

But I think the perl version is actually more readable.
in reply to Hal Pomeranz

I took yesterday's Linux DFIR command line trivia from one of our old Command Line Kung Fu blog postings (Episode 99). It's an interesting challenge and useful for quickly summarizing types of files in a directory by simply collecting and counting the file extensions.

I also liked it because we get to use the "${var##pattern}" expansion, similar to my solution from a couple of days ago using "${var%%pattern}". So there's some nice symmetry.
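
A quick side-by-side of the two expansions:

f="archive.tar.gz"
echo "${f##*.}"    # gz          -- "##" deletes the longest prefix matching "*.", leaving the final extension
echo "${f%%.*}"    # archive     -- "%%" deletes the longest suffix matching ".*", leaving the bare name
echo "${f%.*}"     # archive.tar -- a single "%" deletes only the shortest matching suffix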

If you read the original blog posting, my first solution used a funky "sed" substitution to whittle down to the file extensions. It was actually loyal reader and friend of the blog Jeff Haemer who suggested this much cleaner version (and thanks to @barubary, who reminded me to quote my variables):

find Documents -type f |
while read f; do
echo "${f##*.}";
done |
sed 's/.*\/.*/other/' | tr A-Z a-z |
sort | uniq -c | sort -n

"find" gets us a list of files in the directory and then we feed that into a loop that uses "${f##*.}" to remove everything up until the final "." in the file path.

Some file names are not going to have a "." and so the original file path will be unchanged. The "sed" expression after the loop marks these files as being in category "other". Finally we shift everything to lowercase so "GIF" and "gif" are recognized as the same type. Then our usual command line histogram idiom rounds things out.
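
To see why the "sed" catches them, trace a dotless path through the expansion:

f="Documents/notes/README"                  # no "." anywhere in the path
echo "${f##*.}"                             # no match, prints: Documents/notes/README
echo "${f##*.}" | sed 's/.*\/.*/other/'     # the surviving "/" triggers the rewrite: other

Even when a directory name contains a dot (say "Documents/v1.2/README"), the leftover "2/README" still contains a "/" and gets folded into "other".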

We could add some other fix-ups to make things nicer:

find Documents -type f |
while read f; do
echo "${f##*.}";
done |
sed 's/.*\/.*/other/' | tr A-Z a-z |
sed -r 's/^jpg$/jpeg/; s/(~|,v)$//' |
sort | uniq -c | sort -n

I've dropped in another "sed" expression before we start counting things. We make "jpg" and "jpeg" count the same and remove some trailing file extensions for backup files to reduce clutter in the output. This just shows that you can arbitrarily tweak the output to suit your needs.
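
For example, folding "htm" into "html" as well (a hypothetical tweak) would just be one more substitution in that same sed:

sed -r 's/^jpg$/jpeg/; s/^htm$/html/; s/(~|,v)$//'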

Props to @apgarcia for getting very close to the final solution on this one!

#Linux #DFIR #CommandLine #Trivia

in reply to Hal Pomeranz

OK, this one is broken again. Most obviously, it lists the result in ascending numeric order, not descending. 😉

Names without extension are indistinguishable from files named foo.other in the output.

However, it also mishandles

  • names that end with spaces (spaces are removed implicitly at two places in the pipeline)
  • names that end with backslash (those get silently skipped)
  • extensions that contain shell wildcards (e.g. a file like foo.* will produce funny output)

At minimum you should change the read loop to:

while read -r; do
    echo "${REPLY##*.}"
done

That fixes the issues with spaces, backslashes, and wildcard characters.

However, since all you're doing is removing characters up to the last ., you could replace the whole loop by just sed 's/.*\.//' (or sed 's/.*\././' if you want to distinguish between .other and other).
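
Putting that together (and with "sort -rn" for the requested descending order), the loop-free version would be something like:

find Documents -type f |
sed 's/.*\.//' |
sed 's/.*\/.*/other/' | tr A-Z a-z |
sort | uniq -c | sort -rn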

in reply to Füsilier Breitlinger

@barubary Sigh. It's like you've become my shell programming conscience. "Forgive me, Father, for I have committed the following bash sins..."

My original solution in the blog used sed, but I liked the loop version for readability.

But I am going to sneak back and edit my answer to quote the variable in the loop. Thank you.

in reply to Hal Pomeranz

I came across a really cool alternative to the 'sort | uniq -c | sort -rn' histogram pattern:

github.com/red-data-tools/YouP…

see screenshot...
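
If I'm reading the YouPlot README right, the tail of the pipeline could then become something like this (assuming its "uplot count" subcommand, which tallies occurrences itself):

find . -type f | sed 's/.*\.//' | sed 's/.*\/.*/other/' | tr A-Z a-z | uplot count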

in reply to Hal Pomeranz

@apgarcia @barubary

When I run your solution on a directory with around 19,000 files, it takes around 5 seconds, while mine takes around 1.5 seconds.

I use the case-insensitive options for sort and uniq instead of converting all extensions to lowercase, since it was only required that "abc" and "ABC" be counted as the same extension, not that both be converted to "abc".

My solution also misses the point about files that actually have "other" as their extension.

Our results were the same.