Skåne Sjælland Linux User Group - http://www.sslug.dk
 

Re: [LOCALE] Web scanning



> There is now a new version at:
> 
>    http://hugin.ldraw.org/temp/gafl_tekst_fra_websted-20010313.tar.gz
> 
> Jacob

It works fine.
I tested it on the Ministry of the Environment's website, www.mst.dk, post-processed
the result a little, and put it at:

http://192.38.108.132/bop/environliste1.txt

The processing was done roughly as follows:

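# Extract the text from the website
./gafl www.mst.dk da - 1 > result

# Tokenize
./newtok < result > result.tok

# Lowercase, sort, and remove duplicates
tr '[:upper:]' '[:lower:]' < result.tok | sort | uniq > environ

# Keep lines found only in environ (skip the diff header) that contain a letter
diff -u parole environ | sed '1,2d' | grep '^+' | cut -c2- | grep '[[:alpha:]]' > environliste1.txt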
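
The last step can be tried on toy data to see what each stage removes. This is only a sketch: the two word lists below are made up, and the temporary directory stands in for the real parole and environ files. It assumes both lists are sorted the same way, which is why the locale is pinned.

```shell
# Toy run of the "new words" step; the word lists here are made up.
export LC_ALL=C                      # fixed byte-wise ordering
workdir=$(mktemp -d)

# Sorted stand-ins for the parole reference list and the scanned word list.
printf '%s\n' af og vand > "$workdir/parole"
printf '%s\n' '(' affald og vand > "$workdir/environ"

# diff -u marks lines present only in environ with a leading "+".
# sed drops the two-line "---"/"+++" file header, cut strips the marker,
# and the last grep keeps only lines containing at least one letter.
new_words=$(diff -u "$workdir/parole" "$workdir/environ" \
  | sed '1,2d' | grep '^+' | cut -c2- | grep '[[:alpha:]]')

printf '%s\n' "$new_words"
```

Here the stray "(" token is dropped by the final grep and "og" and "vand" are dropped because they are already in the reference list, so only "affald" remains.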


Comments:

1. The tokenization is not quite optimal; a few remnants of assorted characters and
parentheses are still left.

2. The word list is not very interesting as an "environment corpus", because it contains
too many common words (our parole corpus is not big enough), but also a lot of
"public sector" words...

3. The way I do it above prevents me, for now, from filtering out
low-frequency words, since I throw away the frequency information with uniq.
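
One possible way around point 3 is to count duplicates with uniq -c instead of discarding them, filter on the count, and only then compare against the reference list. The following is a minimal sketch, not the method used for the list above: the sample tokens, the MIN_FREQ threshold, and the environ.freq and environliste.txt file names are all hypothetical, and comm is swapped in for the diff pipeline. The sample is ASCII-only; case-folding Æ, Ø and Å depends on locale and encoding and is not handled here.

```shell
# Sketch: keep per-word counts so rare words can be dropped before
# comparing against the reference list. Sample data and MIN_FREQ are made up.
export LC_ALL=C            # byte-wise, reproducible sort order
MIN_FREQ=2                 # hypothetical threshold
workdir=$(mktemp -d)

# Stand-ins for result.tok (one token per line) and the sorted parole list.
printf '%s\n' Vand vand og og og affald Affald luft > "$workdir/result.tok"
printf '%s\n' og > "$workdir/parole"

# Lowercase, then count duplicates instead of discarding them.
tr '[:upper:]' '[:lower:]' < "$workdir/result.tok" \
  | sort | uniq -c > "$workdir/environ.freq"

# Keep words seen at least MIN_FREQ times, then drop those already in parole.
awk -v min="$MIN_FREQ" '$1 >= min { print $2 }' "$workdir/environ.freq" \
  | comm -13 "$workdir/parole" - > "$workdir/environliste.txt"

cat "$workdir/environliste.txt"
```

With this toy input, "luft" occurs once and is filtered out by the threshold, "og" is removed because it is in the reference list, and "affald" and "vand" remain. The counts stay available in environ.freq, so the threshold can be changed without rerunning the earlier steps.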

/Bo


 

 
 
Questions about the web-pages to <www_admin>. Last modified 2005-08-10, 20:52 CEST